How can the sum of fast parts be slower than the sum of slower parts?   This is one of the conundrums we faced as we have been benchmarking the latest Jetty-9 releases. The explanation is good insight into modern CPUs and an indication of how software engineers need to be somewhat aware of the hardware when creating high performance software that scales.

Jetty-9 Performance Expectations

With the development of Jetty-9, we have refactored and/or refined many of the core components to take advantage of newer JVMs, new protocols and more experience. The result has been that the IO layer, HTTP parser, HTTP generator, buffer pools and other components all micro-benchmark much better than their predecessors in Jetty-8.  They use less heap, produce less garbage, have less code and run faster.   For example a micro benchmark of the Jetty-8 HTTP Parser gave the following results:

The same task done by the Jetty-9 HTTP parser gave  better results as it executed faster and produced almost half the garbage:

Another example of an improved component in Jetty-9 is the IO layer. The following example is an test that simply echoes a 185 bytes HTTP message between the client and  server a million times:

Jetty-9 is using half the heap, generating 85% less garbage, forcing less GCs, using less CPU and achieving the same throughput.  Surely the CPU and memory freed by such an improvement would be well used to improve the total performance of the server?

Jetty-9 Disappointment

Our expectation for the jetty-9 as a server built from a combination of these improved components, was that it would be much faster than jetty-8.

Thus we were amazed to discover that for our initial benchmarks, jetty-9 was significantly slower and more resource hungry than jetty-8!!! The test this was most apparent in was a single connection driven with as many pipelined requests that could be fed to it (note that this is preciously the kind of non realistic benchmark load that I argue against in Truth in Benchmarking and Lies, DamnLies and Benchmarks, but so long as you know what you are testing the results are interesting none the less):

Jetty-8 achieves a healthy 26,525 requests per second on a single connection and core! Jetty-9 disappointed:

Only 21,181 requests per second and 1.3 cores were needed to produce those results! That’s 25% slower with 30% more CPU!?!?!? How could this be so?  All the jetty 9 components when tested individually were faster yet when run together, there were slower!

Benchmark analysis – Parallel Slowdown.

We profiled the benchmarks using various profiling tools, there were a few minor hot spots and garbage producers identified.  These were easily found and fixed (eg replaced StringMap usage with a new Trie implementation), but only gave us about a 10% improvement leaving another 15% to be found just to break even!

But profiling revealed no really significant hot spots and no stand out methods that obviously needed to be improved and no tasks being done that were not done by jetty-8.  The 15% was not going to be found in a few methods, it looked like we had to find 0.015% from 1000 methods ie it looked like every bit of the code was running a little bit slower than it should do.

The clue that helped us was that jetty-9 was using more than 1 core for a single connection.  Thus we started suspecting that it was an issue with how we were using threads and perhaps with CPU caches. Jetty-9 makes a fair bit more usage of Atomics than Jetty-8, in an effort to support even more asynchronous behaviour.  Investigating this led us to the excellent blog of Marc Brooker where he investigates the performance implications of CPU caching on integer incrementing.

While it turned out that there is nothing wrong with our usage of Atomics, the analysis tools that Marc describes (linux perf) revealed our smoking gun.  The Linux perf tool gives access to the CPU and kernel performance counters so that you can glimpse what is going on within the hardware of a modern multicore machine.   For my i7 CPU I worked out that the following command gave the extra information needed:

Running this against a warmed up Jetty-8 server for the entire pipeline test gave the following results:

The key number in all this gritty detail is the instructions per cycle figure.  In Jetty-8, the CPU was able to execute 0.92 instructions every clock tick, with the remainder of the time being spent waiting for data from slow memory to fill either the instruction or the data caches. The same test for jetty-9 reveals the full horror of what was going on:

Jetty-9 was only able to execute 0.54 instructions per tick, so almost half the CPU time was spent waiting for data from memory.  Worse still, this caused so little load on the CPU that the power governor only felt the need to clock the CPU at 1.498GHz rather than the 1.944GHz achieve by jetty-8 (Note that some recommend to peg CPU frequencies during benchmarks, but I believe that unless you do that in your data centre and pay the extra power/cooling charges, then don’t do it in your benchmarks. Your code must be able to drive the CPU governors to dynamically increase the clock speed as needed).

The cause of this extra time waiting for memory is revealed by the cache figures.  The L1 caches were being hit a little bit more often and missing a lot more often!  This flowed through to the LLC cache that had to do 4 times more loads with 8 times more cache misses! This is a classic symptom of Parallel Slowdown, because Jetty-9 was attempting to use multiple cores to handle a job best done by a single core (ie a serial sequence of requests on single connection), it was wasting more time in sharing data between cores than it was gaining by increased computing power.

Where Jetty-9-M3 got it wrong!

One of the changes that we had made in Jetty-9 was an attempt to better utilize the selector thread so to reduce unnecessary dispatches to the thread pool.   By default, we configure jetty with an NIO selector and selector thread for each available CPU core. In Jetty-8 when the selector detects a connection that is readable, it dispatched the endpoint to a thread from the pool, which would do the IO read, parse the HTTP request, call the servlet container and flush the response.

In Jetty-9, we realized that it is only when calling the application in the servlet container that there is a possibility that the thread might block and that it would thus be safe to let the selector thread do the IO read and HTTP parsing without a dispatch to a thread.  Only once the HTTP parser had received an entire HTTP request, would a dispatch be done to an application handler to handle the request (probably via a servlet).  This seemed like a great idea at the time that at worst would cost nothing, but may save some dispatches for slow clients.

Our retrospect-a-scope now tells us that is is a very bad idea to have a different thread do the HTTP parsing and the handling.  The issue is that once one thread had finished parsing a HTTP request, then it’s caches are full of all the information just parsed. The method, URI and request object holding them,  are all going to be in or near to the L1 cache.   Dispatching the handling to another thread just creates the possibility that another core will execute the thread and will need to fill it’s cache from main memory with all the parsed parts of the request.

Luckily with the flexible architecture of jetty, we were able to quickly revert the dispatching model to dispatch on IO selection rather than HTTP request completion and we were instantly rewarded with another 10% performance gain.   But we were still a little slower than jetty-8 and still using 1.1 cores rather than 1.0.  Perf again revealed that we were still suffering from some parallel slowdown, which turned out to be the way Jetty-9 was handling pipelined requests.  Previously Jetty’s IO handling thread had looped until all read data was consumed or until an upgrade or request suspension was done.  Those “or”s made for a bit of complex code, so to simplify the code base, Jetty-9 always returned from the handling thread after handling a request and it was the completion callback that dispatched a new thread to handle any pipelined requests.  This new thread might then execute on a different core, requiring its cache to be loaded with the IO buffer and the connection, request and other objects before the next request can be parsed.

Testing pipelines is more of an exercise for interest rather than handling something likely to be encountered in real productions, but it is worth while to handle them well if at least to deal with such simple unrealistic benchmarks.    Thus we have reverted to the previous behaviour and found another huge gain in performance.

Jetty-9 getting it right

With the refactored components, the minor optimizations found from profiling, and the reversion to the jetty-8 threading model, jetty-9 is now meeting our expectations and out performing jetty-8.  The perf numbers now look much better:

The CPU is now executing 0.74 instructions per tick, not as good as jetty-8, but a good improvement.  Most importantly, the macro benchmark numbers now indicate that parallel slowdown is not having an effect and the improved jetty-9 components are now able to do their stuff and provide some excellent results:

This is 34,280 requests per second (29% better that Jetty-8), using only half the heap and generating 83% less garbage!     If this was in anyway a realistic benchmark working with a load profile in any way resembling a real world load, then these numbers would be absolutely AWESOME!

But this is just a single connection pipeline test, nothing like the load profile that 99.999% of servers will encounter.  So while these results are very encouraging, I’ll wait until we do some tuning against some realistic load benchmarks before I get too excited.  Also I believe that the perf numbers are showing that there may also be room for even more improvement with jetty-9 and the same tools can also be used to get significant results by improving (or avoiding) branch prediction.

The code for the benchmarks used is available at git@github.com:jetty-project/jetty-bench.git.

Avoiding Parallel Slowdown in Jetty-9 with CPU Cache analysis.

10 thoughts on “Avoiding Parallel Slowdown in Jetty-9 with CPU Cache analysis.

  • December 18, 2012 at 1:27 pm
    Permalink

    Why didn’t you bind the jetty process to a single core using numactl ??

  • December 18, 2012 at 1:55 pm
    Permalink

    Ups, Typo : I meant bind to one _cpu_

  • December 20, 2012 at 1:04 pm
    Permalink

    So is this problem solved in the latest release of Jetty 9 or should I wait for M4?

  • Pingback:Jetty-9 goes fast with Mechanical Sympathy | Webtide Blogs

  • January 31, 2013 at 9:52 pm
    Permalink

    Binding threads to cores may help, but that can’t be done with 100%. However I suspect that something along those lines might need to be considered once we have MUX on websocket and SPDY as one connection may result in many messages to be handled in parallel and then we have to consider is it best to keep them on core (and serialized them) or fan them out to other cores (and pay the cache penalty).

  • February 28, 2013 at 2:50 pm
    Permalink

    Not sure if this helps but I’m getting better results using jetty-8 than 9. I use jetty-maven-plugin though.

    [ details cut for brevity ]

    • February 28, 2013 at 10:11 pm
      Permalink

      Wow – so miliage may vary! We will have to have a look at that. Please send your code/config to gregw@intalio.com

      • February 28, 2013 at 11:00 pm
        Permalink

        Note also that I’m not a fan of siege unless used in very limited circumstances. It uses a few very busy connections, which is a very unrealistic work load. OK yes the micro benchmarks I’ve been doing are also loading a single connection to it’s limit, but these are done specifically as micro benchmarks. Once you are trying to actually measure the performance of a server with a real application, then it is important to have a realistic load generator, which means lots of connections that have big idle periods. Having said that, still keen to analyse your tests in more detail.

        • March 11, 2013 at 11:07 am
          Permalink

          Are there any load testers that you can recommend?

          also I was wondering how the microbenchmark output was created (Garbage generated in Young Generation, etc)

      • March 1, 2013 at 11:07 pm
        Permalink

        I believe this slowdown was due to a issue with our timer queues and non persistent connections. The idle timeout was not cancelled onClose of the a connection, so the timer queue was growing massively and thus slowing the server. Even with a cancel onClose there is still a bit of an issue as the cancel just nulls the entry and we have to sweep the queue. Post 9.0.0 we will be looking at some alternate scheduler algorithms that should greatly improve the 1 request per connection pathelogical case.

Comments are closed.