How can the sum of fast parts be slower than the sum of slower parts? This is one of the conundrums we faced as we have been benchmarking the latest Jetty-9 releases. The explanation is good insight into modern CPUs and an indication of how software engineers need to be somewhat aware of the hardware when creating high performance software that scales.
Jetty-9 Performance Expectations
With the development of Jetty-9, we have refactored and/or refined many of the core components to take advantage of newer JVMs, new protocols and more experience. The result has been that the IO layer, HTTP parser, HTTP generator, buffer pools and other components all micro-benchmark much better than their predecessors in Jetty-8. They use less heap, produce less garbage, have less code and run faster. For example a micro benchmark of the Jetty-8 HTTP Parser gave the following results:
Jetty-8 HttpParser+HttpFields ======================================== Operative System: Linux 3.5.0-19-generic amd64 JVM : Java HotSpot(TM) 64-Bit Server 23.3-b01 1.7.0_07-b10 Processors: 8 System Memory: 89.56941% used of 7.7324257 GiB Used/Max Heap Size: 8.314537/981.375 MiB - - - - - - - - - - - - - - - - - - - - tests 10000000 requests 10000000 headers 60000000 - - - - - - - - - - - - - - - - - - - - Elapsed time: 60600 ms Time in JIT compilation: 0 ms Time in Young Generation GC: 26 ms (26 collections) Time in Old Generation GC: 0 ms (0 collections) Garbage Generated in Young Generation: 7795.7827 MiB Garbage Generated in Survivor Generation: 0.28125 MiB Garbage Generated in Old Generation: 0.03125 MiB Average CPU Load: 99.933975/800 ----------------------------------------
The same task done by the Jetty-9 HTTP parser gave better results as it executed faster and produced almost half the garbage:
Jetty-9 HttpParser+HttpFields ======================================== Operative System: Linux 3.5.0-19-generic amd64 JVM : Java HotSpot(TM) 64-Bit Server 23.3-b01 1.7.0_07-b10 Processors: 8 System Memory: 88.25224% used of 7.7324257 GiB Used/Max Heap Size: 8.621246/981.375 MiB - - - - - - - - - - - - - - - - - - - - tests 10000000 requests 10000000 headers 60000000 - - - - - - - - - - - - - - - - - - - - Statistics Ended at Mon Dec 17 10:00:04 EST 2012 Elapsed time: 57701 ms Time in JIT compilation: 0 ms Time in Young Generation GC: 18 ms (15 collections) Time in Old Generation GC: 0 ms (0 collections) Garbage Generated in Young Generation: 4716.9775 MiB Garbage Generated in Survivor Generation: 0.34375 MiB Garbage Generated in Old Generation: 0.0234375 MiB Average CPU Load: 99.92787/800 ----------------------------------------
Another example of an improved component in Jetty-9 is the IO layer. The following example is an test that simply echoes a 185 bytes HTTP message between the client and server a million times:
Jetty-8 Echo Connection Server ======================================== Used/Max Heap Size: 20.490265/981.375 MiB - - - - - - - - - - - - - - - - - - - - Filled 185000000 bytes in 1000000 fills - - - - - - - - - - - - - - - - - - - - Elapsed time: 67778 ms Time in Young Generation GC: 12 ms (14 collections) Garbage Generated in Young Generation: 4169.701 MiB Average CPU Load: 118.37115/800 ---------------------------------------- Jetty-9 Echo Connection Server ======================================== Used/Max Heap Size: 11.668541,981.375 MiB - - - - - - - - - - - - - - - - - - - - Filled 185000000 bytes in 1000000 fills - - - - - - - - - - - - - - - - - - - - Elapsed time: 66846 ms Time in Young Generation GC: 2 ms (2 collections) Garbage Generated in Young Generation: 653.2649 MiB Average CPU Load: 111.07558/800 ----------------------------------------
Jetty-9 is using half the heap, generating 85% less garbage, forcing less GCs, using less CPU and achieving the same throughput. Surely the CPU and memory freed by such an improvement would be well used to improve the total performance of the server?
Our expectation for the jetty-9 as a server built from a combination of these improved components, was that it would be much faster than jetty-8.
Thus we were amazed to discover that for our initial benchmarks, jetty-9 was significantly slower and more resource hungry than jetty-8!!! The test this was most apparent in was a single connection driven with as many pipelined requests that could be fed to it (note that this is preciously the kind of non realistic benchmark load that I argue against in Truth in Benchmarking and Lies, DamnLies and Benchmarks, but so long as you know what you are testing the results are interesting none the less):
jetty-8 pipeline: ======================================== Used/Max Heap Size: 3.0077057/1023.625 MiB - - - - - - - - - - - - - - - - - - - - Pipeline Requests 1000000 of 1000000 - - - - - - - - - - - - - - - - - - - - Elapsed time: 37696 ms Time in Young Generation GC: 7 ms (9 collections) Time in Old Generation GC: 0 ms (0 collections) Garbage Generated in Young Generation: 2886.1907 MiB Average CPU Load: 100.009384/800 ----------------------------------------
Jetty-8 achieves a healthy 26,525 requests per second on a single connection and core! Jetty-9 disappointed:
jetty-9 pipeline: ======================================== Used/Max Heap Size: 3.406746/1023.6875 MiB - - - - - - - - - - - - - - - - - - - - Pipeline Requests 1000000 of 1000000 - - - - - - - - - - - - - - - - - - - - Elapsed time: 47212 ms Time in Young Generation GC: 6 ms (10 collections) Time in Old Generation GC: 0 ms (0 collections) Garbage Generated in Young Generation: 3225.3438 MiB Average CPU Load: 133.77675/800 ----------------------------------------
Only 21,181 requests per second and 1.3 cores were needed to produce those results! That’s 25% slower with 30% more CPU!?!?!? How could this be so? All the jetty 9 components when tested individually were faster yet when run together, there were slower!
Benchmark analysis – Parallel Slowdown.
We profiled the benchmarks using various profiling tools, there were a few minor hot spots and garbage producers identified. These were easily found and fixed (eg replaced StringMap usage with a new Trie implementation), but only gave us about a 10% improvement leaving another 15% to be found just to break even!
But profiling revealed no really significant hot spots and no stand out methods that obviously needed to be improved and no tasks being done that were not done by jetty-8. The 15% was not going to be found in a few methods, it looked like we had to find 0.015% from 1000 methods ie it looked like every bit of the code was running a little bit slower than it should do.
The clue that helped us was that jetty-9 was using more than 1 core for a single connection. Thus we started suspecting that it was an issue with how we were using threads and perhaps with CPU caches. Jetty-9 makes a fair bit more usage of Atomics than Jetty-8, in an effort to support even more asynchronous behaviour. Investigating this led us to the excellent blog of Marc Brooker where he investigates the performance implications of CPU caching on integer incrementing.
While it turned out that there is nothing wrong with our usage of Atomics, the analysis tools that Marc describes (linux perf) revealed our smoking gun. The Linux perf tool gives access to the CPU and kernel performance counters so that you can glimpse what is going on within the hardware of a modern multicore machine. For my i7 CPU I worked out that the following command gave the extra information needed:
perf stat -e task-clock -e cycles -e instructions -e LLC-loads -e LLC-load-misses -e cache-references -e cache-misses -e L1-dcache-loads -e L1-dcache-load-misses -e L1-icache-loads -e L1-icache-load-misses --pid $JETTY_PID
Running this against a warmed up Jetty-8 server for the entire pipeline test gave the following results:
Performance counter stats for process id 'jetty-8': 27751.967126 task-clock # 0.867 CPUs utilized 53,963,171,579 cycles # 1.944 GHz [28.67%] 49,404,471,415 instructions # 0.92 insns per cycle 204,217,265 LLC-loads # 7.359 M/sec [36.56%] 15,167,562 LLC-misses # 7.43% of all LL-cache hits [ 7.21%] 567,593,065 cache-references # 20.452 M/sec [14.50%] 17,518,855 cache-misses # 3.087 % of all cache refs [21.66%] 16,405,099,776 L1-dcache-loads #591.133 M/sec [28.46%] 782,601,144 L1-dcache-misses # 4.77% of all L1-dcache hits [28.41%] 22,585,255,808 L1-icache-loads #813.825 M/sec [28.57%] 4,010,843,274 L1-icache-misses # 17.76% of all L1-icache hits [28.57%]
The key number in all this gritty detail is the instructions per cycle figure. In Jetty-8, the CPU was able to execute 0.92 instructions every clock tick, with the remainder of the time being spent waiting for data from slow memory to fill either the instruction or the data caches. The same test for jetty-9 reveals the full horror of what was going on:
Performance counter stats for process id 'jetty-9-M3': 77452.678481 task-clock # 1.343 CPUs utilized 116,033,902,536 cycles # 1.498 GHz [28.35%] 62,939,323,536 instructions # 0.54 insns per cycle 891,494,480 LLC-loads # 11.510 M/sec [36.59%] 124,466,009 LLC-misses # 13.96% of all LL-cache hits [ 6.97%] 2,341,731,228 cache-references # 30.234 M/sec [14.03%] 29,223,747 cache-misses # 1.248 % of all cache refs [21.25%] 20,644,743,623 L1-dcache-loads #266.547 M/sec [28.39%] 2,290,512,202 L1-dcache-misses # 11.09% of all L1-dcache hits [28.15%] 34,515,836,027 L1-icache-loads #445.638 M/sec [28.12%] 6,685,624,757 L1-icache-misses # 19.37% of all L1-icache hits [28.34%]
Jetty-9 was only able to execute 0.54 instructions per tick, so almost half the CPU time was spent waiting for data from memory. Worse still, this caused so little load on the CPU that the power governor only felt the need to clock the CPU at 1.498GHz rather than the 1.944GHz achieve by jetty-8 (Note that some recommend to peg CPU frequencies during benchmarks, but I believe that unless you do that in your data centre and pay the extra power/cooling charges, then don’t do it in your benchmarks. Your code must be able to drive the CPU governors to dynamically increase the clock speed as needed).
The cause of this extra time waiting for memory is revealed by the cache figures. The L1 caches were being hit a little bit more often and missing a lot more often! This flowed through to the LLC cache that had to do 4 times more loads with 8 times more cache misses! This is a classic symptom of Parallel Slowdown, because Jetty-9 was attempting to use multiple cores to handle a job best done by a single core (ie a serial sequence of requests on single connection), it was wasting more time in sharing data between cores than it was gaining by increased computing power.
Where Jetty-9-M3 got it wrong!
One of the changes that we had made in Jetty-9 was an attempt to better utilize the selector thread so to reduce unnecessary dispatches to the thread pool. By default, we configure jetty with an NIO selector and selector thread for each available CPU core. In Jetty-8 when the selector detects a connection that is readable, it dispatched the endpoint to a thread from the pool, which would do the IO read, parse the HTTP request, call the servlet container and flush the response.
In Jetty-9, we realized that it is only when calling the application in the servlet container that there is a possibility that the thread might block and that it would thus be safe to let the selector thread do the IO read and HTTP parsing without a dispatch to a thread. Only once the HTTP parser had received an entire HTTP request, would a dispatch be done to an application handler to handle the request (probably via a servlet). This seemed like a great idea at the time that at worst would cost nothing, but may save some dispatches for slow clients.
Our retrospect-a-scope now tells us that is is a very bad idea to have a different thread do the HTTP parsing and the handling. The issue is that once one thread had finished parsing a HTTP request, then it’s caches are full of all the information just parsed. The method, URI and request object holding them, are all going to be in or near to the L1 cache. Dispatching the handling to another thread just creates the possibility that another core will execute the thread and will need to fill it’s cache from main memory with all the parsed parts of the request.
Luckily with the flexible architecture of jetty, we were able to quickly revert the dispatching model to dispatch on IO selection rather than HTTP request completion and we were instantly rewarded with another 10% performance gain. But we were still a little slower than jetty-8 and still using 1.1 cores rather than 1.0. Perf again revealed that we were still suffering from some parallel slowdown, which turned out to be the way Jetty-9 was handling pipelined requests. Previously Jetty’s IO handling thread had looped until all read data was consumed or until an upgrade or request suspension was done. Those “or”s made for a bit of complex code, so to simplify the code base, Jetty-9 always returned from the handling thread after handling a request and it was the completion callback that dispatched a new thread to handle any pipelined requests. This new thread might then execute on a different core, requiring its cache to be loaded with the IO buffer and the connection, request and other objects before the next request can be parsed.
Testing pipelines is more of an exercise for interest rather than handling something likely to be encountered in real productions, but it is worth while to handle them well if at least to deal with such simple unrealistic benchmarks. Thus we have reverted to the previous behaviour and found another huge gain in performance.
Jetty-9 getting it right
With the refactored components, the minor optimizations found from profiling, and the reversion to the jetty-8 threading model, jetty-9 is now meeting our expectations and out performing jetty-8. The perf numbers now look much better:
Performance counter stats for process id 'jetty-9-SNAPSHOT': 25495.319407 task-clock # 0.928 CPUs utilized 62,342,095,246 cycles # 2.445 GHz [33.50%] 45,949,661,990 instructions # 0.74 insns per cycle 349,576,707 LLC-loads # 13.711 M/sec [42.14%] 18,734,441 LLC-misses # 5.36% of all LL-cache hits [ 8.37%] 946,308,800 cache-references # 37.117 M/sec [16.79%] 18,683,743 cache-misses # 1.974 % of all cache refs [25.14%] 15,146,280,274 L1-dcache-loads #594.081 M/sec [33.43%] 1,313,578,215 L1-dcache-misses # 8.67% of all L1-dcache hits [33.31%] 21,215,554,821 L1-icache-loads #832.135 M/sec [33.27%] 4,130,760,394 L1-icache-misses # 19.47% of all L1-icache hits [33.27%]
The CPU is now executing 0.74 instructions per tick, not as good as jetty-8, but a good improvement. Most importantly, the macro benchmark numbers now indicate that parallel slowdown is not having an effect and the improved jetty-9 components are now able to do their stuff and provide some excellent results:
Jetty-9-SNAPSHOT Pipeline: ======================================== Processors: 8 Used/Max Heap Size: 4.152527,1023.6875 MiB - - - - - - - - - - - - - - - - - - - - Pipeline Requests 1000000 of 1000000 - - - - - - - - - - - - - - - - - - - - Statistics Ended at Mon Dec 17 13:03:54 EST 2012 Elapsed time: 29172 ms Time in Young Generation GC: 3 ms (4 collections) Time in Old Generation GC: 0 ms (0 collections) Garbage Generated in Young Generation: 1319.1224 MiB Average CPU Load: 99.955666/800 ----------------------------------------
This is 34,280 requests per second (29% better that Jetty-8), using only half the heap and generating 83% less garbage! If this was in anyway a realistic benchmark working with a load profile in any way resembling a real world load, then these numbers would be absolutely AWESOME!
But this is just a single connection pipeline test, nothing like the load profile that 99.999% of servers will encounter. So while these results are very encouraging, I’ll wait until we do some tuning against some realistic load benchmarks before I get too excited. Also I believe that the perf numbers are showing that there may also be room for even more improvement with jetty-9 and the same tools can also be used to get significant results by improving (or avoiding) branch prediction.
The code for the benchmarks used is available at firstname.lastname@example.org:jetty-project/jetty-bench.git.