Since we discovered how to make Jetty-9 avoid parallel slowdown, we’ve been continuing to work with micro benchmarks and considerations of Mechanical Sympathy to further optimise Jetty-9. As we are now about to go to release candidate for Jetty-9, I thought I’d give a quick report on the excellent results we’ve had so far.
False Sharing in Queues
Queuing is a very important operation in servers like Jetty, and our QueuedThreadPool is a key element of Jetty’s great performance. While it implements the JVM’s Executor interface, even the Jetty-8 implementation has far superior performance to the executors provided by the JVM. This queue is based on our BlockingArrayQueue, which separates the locks for the head and tail and only supports blocking for take operations.
However, because of the memory layout of the class, it turned out that the head and tail pointers and their locks all fell within a single CPU cache line. This is bad because when different threads running on different cores try to work independently on the head and tail, they are actually hitting the same area of memory and thus repeatedly invalidating each other’s caches, in a pattern called false sharing.
The solution is to be aware of the memory layout of the class, to consider which threads will be accessing which fields, and to space those fields out so that you avoid this false sharing of cache lines. The results gave us a significant boost in our micro benchmarks (see below).
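As a sketch of the technique (the field names and layout here are illustrative, not Jetty’s actual BlockingArrayQueue code): padding fields are placed between the two hot indices so that they land on different cache lines (typically 64 bytes), letting a producer core and a consumer core update them without invalidating each other’s caches. Note that field layout is ultimately up to the JVM, so real code must verify it; newer JVMs offer an @Contended-style annotation for exactly this purpose.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical padded head/tail holder: 7 longs = 56 bytes of padding keep
// the two hot fields off the same 64-byte cache line.
public class PaddedQueueIndices
{
    final AtomicInteger head = new AtomicInteger();
    // padding between the two hot fields (a JIT may reorder or strip unused
    // fields, so real implementations must verify the resulting layout)
    volatile long p1, p2, p3, p4, p5, p6, p7;
    final AtomicInteger tail = new AtomicInteger();

    // Simulate a producer advancing the tail while a consumer advances the head
    static int[] run() throws InterruptedException
    {
        final PaddedQueueIndices idx = new PaddedQueueIndices();
        Thread producer = new Thread(new Runnable() {
            public void run() { for (int i = 0; i < 1_000_000; i++) idx.tail.incrementAndGet(); }
        });
        Thread consumer = new Thread(new Runnable() {
            public void run() { for (int i = 0; i < 1_000_000; i++) idx.head.incrementAndGet(); }
        });
        producer.start(); consumer.start();
        producer.join(); consumer.join();
        return new int[] { idx.head.get(), idx.tail.get() };
    }

    public static void main(String[] args) throws InterruptedException
    {
        int[] counts = run();
        System.out.println("head=" + counts[0] + " tail=" + counts[1]);
    }
}
```

The behaviour is identical with or without the padding; only the cache traffic between the two cores changes, which is why the effect only shows up under profiling and micro benchmarks.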
Time and Space Efficient Trie
Looking up string values is probably one of the most common activities in an HTTP server, as header lines are parsed and their semantic meaning interpreted from the text. A simple hash map lookup of a string can be moderately efficient in both space and time, but it assumes that you have a String instance in the first place. When parsing HTTP, we just have bytes in a buffer, and it is costly to create a String from these bytes just to look up which string it is. Furthermore, we need case insensitivity, which is not well supported by the standard JVM hash maps.
In Jetty-9 we introduced a Trie abstraction that allowed us to experiment with various implementations of string lookups which could operate directly from a slice of the IO buffers without any copies or object creation.
For our well known strings (e.g. HTTP header names and values) we initially implemented a simple TreeTrie that stored each character as a node object in a tree. This was moderately fast, but it suffered from poor locality of reference, as each character lookup had to follow a reference to a new object that could be located anywhere in the heap.
Thus we developed an ArrayTrie implementation that stores the tree as index references within a large char array. This has the huge benefit that once a portion of the array has been loaded into cache for one character of the lookup, it is highly likely that subsequent character lookups are already in the cache. This again gave us a significant boost in our micro benchmarks! But we wanted more.
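A minimal sketch of the idea (this is illustrative code, not Jetty’s actual ArrayTrie): the whole tree lives in one flat int array, one 128-slot row of child indices per node, so a lookup walks contiguous memory instead of chasing node objects around the heap. Keys are folded to lower case for case-insensitive matching, and lookups read straight from a byte slice of an IO buffer, so no String is ever created. ASCII keys only, for simplicity.

```java
// Flat-array trie sketch: next[node*ROW + c] holds the child node for
// character c, and value[node] holds the value attached to that node.
public class ArrayTrieSketch<V>
{
    private static final int ROW = 128;            // children indexed by ASCII code
    private int[] next = new int[16 * ROW];        // child pointers, one row per node
    private Object[] value = new Object[16];       // value attached to each node
    private int nodes = 1;                         // node 0 is the root

    public void put(String key, V v)
    {
        int node = 0;
        for (int i = 0; i < key.length(); i++)
        {
            int c = Character.toLowerCase(key.charAt(i)) & 0x7f;   // case-fold, ASCII only
            int child = next[node * ROW + c];
            if (child == 0)
            {
                child = nodes++;
                if (child * ROW >= next.length)    // grow both arrays in step
                {
                    next = java.util.Arrays.copyOf(next, next.length * 2);
                    value = java.util.Arrays.copyOf(value, value.length * 2);
                }
                next[node * ROW + c] = child;
            }
            node = child;
        }
        value[node] = v;
    }

    // Look up directly from a byte slice of an IO buffer: no String is created.
    @SuppressWarnings("unchecked")
    public V get(byte[] buf, int offset, int length)
    {
        int node = 0;
        for (int i = 0; i < length; i++)
        {
            int c = Character.toLowerCase((char)(buf[offset + i] & 0x7f));
            node = next[node * ROW + c];
            if (node == 0)
                return null;
        }
        return (V)value[node];
    }
}
```

A call like trie.get(buffer, nameStart, nameLength) then resolves a header name in place, and consecutive characters of a key land in the same few cache lines of the backing array.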
Look-ahead Trie
The Trie abstraction was initially just used for looking up known strings such as “Host”, “Content-Type”, “User-Agent”, “Connection”, “close” etc., which is very useful as you parse an HTTP header token by token. However, HTTP is a very repetitive protocol, and for a given client you will frequently see well known combinations of tokens such as:
Connection: close
Connection: keep-alive
Accept-Encoding: gzip
Accept: */*
The simple parsing strategy is to look for ‘:’ and CRLF to identify tokens and then look up those strings in the Trie. But if you are able to look up combinations of tokens in the Trie, then you save effort parsing as well as being able to look up shared instances of common fields (eg Connection: keep-alive). Thus we modified our Trie interface to support a best-match lookup that, given the entire buffer, will attempt to match an entire header line.
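As a deliberately naive sketch of the best-match idea (hypothetical code, much simpler than Jetty’s real Trie lookup, which walks the tree structure directly): given the raw bytes at the start of a header line, return the longest pre-registered entry that matches, so a whole line such as “Connection: keep-alive” can resolve to a shared instance in a single lookup.

```java
import java.util.ArrayList;
import java.util.List;

// Naive best-match: scan all registered entries and keep the longest one
// that case-insensitively matches the start of the buffer slice.
public class BestMatchSketch
{
    private final List<String> entries = new ArrayList<String>();

    public void add(String entry)
    {
        entries.add(entry);
    }

    /** Longest registered entry matching the start of buf[offset..offset+length). */
    public String getBest(byte[] buf, int offset, int length)
    {
        String best = null;
        for (String e : entries)
        {
            if (e.length() > length)
                continue;
            boolean match = true;
            for (int i = 0; match && i < e.length(); i++)
                match = Character.toLowerCase(e.charAt(i))
                     == Character.toLowerCase((char)(buf[offset + i] & 0xff));
            if (match && (best == null || e.length() > best.length()))
                best = e;                  // keep the longest match so far
        }
        return best;
    }
}
```

With “Connection”, “Connection: close” and “Connection: keep-alive” registered, a buffer holding “Connection: keep-alive\r\n” best-matches the full line, while an unregistered value like “Connection: upgrade” falls back to matching just the header name — exactly the fallback behaviour the parser needs.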
For many well known field combinations like the ones listed above, our ArrayTrie was a good solution. While it is a bit memory hungry, the number of field combinations is not large, is statically known, and is shared between all connections to the server. But unfortunately, not all fields are well known in advance, and some of the longest repeated fields look like:
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:18.0) Gecko/20100101 Firefox/18.0
Cookie: __utma=1598342155.164253763.123423536.1359602604.1359611604.283; __utmz=12352155.135234604.383.483.utmcsr=google.com.au|utmccn=(referral)|utmcmd=referral|utmcct=/ig; __utmc=4234112; __utmb=4253.1.10.1423
Accept-Language: en-US,en;q=0.5,it;q=0.45
Such fields are not statically known, but will frequently repeat, either from the same client or from a class of clients for a given period of time while a particular version is current. Thus a static field Trie is insufficient, and we needed to be able to create dynamic per-connection Tries to look up such repeated fields. ArrayTrie worked, but it is massively memory hungry and unsuitable for the hundreds of thousands of connections that Jetty can terminate.
The theory of Tries suggested that a ternary tree is a good structure with regard to memory consumption, but the problem is that it gives up our locality of reference and, worse still, creates a lot of node garbage as trees are built and discarded. The solution is to combine the two approaches, and we came up with our ArrayTernaryTrie, which is a ternary tree structure stored in a fixed-size char array (which also gives the benefit of protection from DoS attacks). This data structure has proved quick to build, quick to look up, efficient on memory and cheap to GC. It’s another winner in the micro benchmarks.
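A sketch of the combination (again illustrative, not Jetty’s actual ArrayTernaryTrie): a ternary search tree kept in pre-allocated parallel arrays. Building it creates no per-node garbage, lookups walk contiguous arrays rather than chasing heap objects, and the fixed capacity bounds the memory a single connection can consume, which is the defence against DoS via endless unique header values.

```java
// Ternary search tree in flat parallel arrays: each node is a row index,
// with a character, three child links (0 = none) and an optional value.
public class TernaryTrieSketch
{
    private final char[] ch;                 // character at each node
    private final int[] lo, eq, hi;          // child links; 0 means "none"
    private final String[] val;              // value stored at end-of-key nodes
    private int nodes = 1;                   // slot 0 is reserved as the null link
    private int root;

    public TernaryTrieSketch(int capacity)
    {
        ch = new char[capacity];
        lo = new int[capacity];
        eq = new int[capacity];
        hi = new int[capacity];
        val = new String[capacity];
    }

    private int newNode(char c)
    {
        if (nodes == ch.length)
            return 0;                        // fixed capacity exhausted
        ch[nodes] = c;
        return nodes++;
    }

    /** @return false if the fixed capacity was exhausted */
    public boolean put(String key, String value)
    {
        if (root == 0 && (root = newNode(lower(key.charAt(0)))) == 0)
            return false;
        int t = root, i = 0;
        while (true)
        {
            char c = lower(key.charAt(i));
            if (c < ch[t])
            {
                if (lo[t] == 0 && (lo[t] = newNode(c)) == 0) return false;
                t = lo[t];
            }
            else if (c > ch[t])
            {
                if (hi[t] == 0 && (hi[t] = newNode(c)) == 0) return false;
                t = hi[t];
            }
            else
            {
                if (++i == key.length()) { val[t] = value; return true; }
                char n = lower(key.charAt(i));
                if (eq[t] == 0 && (eq[t] = newNode(n)) == 0) return false;
                t = eq[t];
            }
        }
    }

    /** Case-insensitive lookup directly from a byte slice of an IO buffer. */
    public String get(byte[] buf, int offset, int length)
    {
        int t = root, i = 0;
        while (t != 0 && i < length)
        {
            char c = lower((char)(buf[offset + i] & 0xff));
            if (c < ch[t]) t = lo[t];
            else if (c > ch[t]) t = hi[t];
            else
            {
                if (++i == length) return val[t];
                t = eq[t];
            }
        }
        return null;
    }

    private static char lower(char c) { return Character.toLowerCase(c); }
}
```

Because all the state lives in a handful of arrays allocated once per connection, discarding the whole structure is a single cheap GC event rather than thousands of dead node objects.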
Branchless Code
Supporting many versions of a protocol, and the many different semantics it can carry, results in code with lots of if statements. When a modern CPU encounters a conditional, it tries to guess which way the branch will go and fills the CPU pipeline with instructions from that branch. This means you either want your branches to be predictable, or you want to avoid branches altogether, so as to avoid stalling the CPU pipeline on a misprediction.
This can result in some very fast, but slightly unreadable code. The following branchless code:
byte b = (byte)((c & 0x1f) + ((c >> 6) * 0x19) - 0x10);
Converts a hex digit to a byte value without the need for branchful code like:
if (c >= 'A' && c <= 'F') b = (byte)(10 + c - 'A'); ...
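As a quick sanity check (illustrative code), the branchless expression agrees with the obvious branchful version for every hex digit. It also handles lower case ‘a’–‘f’ for free, since the low five bits and bit 6 are the same for both cases.

```java
public class HexSketch
{
    // Branchless: valid for inputs '0'-'9', 'A'-'F' and 'a'-'f'
    static byte branchless(char c)
    {
        return (byte)((c & 0x1f) + ((c >> 6) * 0x19) - 0x10);
    }

    // Branchful equivalent, for comparison
    static byte branchful(char c)
    {
        if (c >= '0' && c <= '9') return (byte)(c - '0');
        if (c >= 'A' && c <= 'F') return (byte)(10 + c - 'A');
        return (byte)(10 + c - 'a');
    }

    public static void main(String[] args)
    {
        for (char c : "0123456789ABCDEFabcdef".toCharArray())
            if (branchless(c) != branchful(c))
                throw new AssertionError("mismatch at '" + c + "'");
        System.out.println("branchless and branchful agree on all hex digits");
    }
}
```

The trick works because digits (0x30–0x39) have bit 6 clear, so the expression reduces to (c & 0x1f) - 0x10, while letters have bit 6 set, adding the 0x19 offset that maps ‘A’/‘a’ to 10.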
Benchmark Results
The results have been great, albeit with my normal disclaimer that these are just micro benchmarks that don’t represent any realistic load, so please wait for full server benchmarks before getting too excited.
For a single connection handling 1,000,000 pipelined requests, Jetty-8 achieved the following results:
========================================
Statistics Started at Thu Jan 31 15:27:11 EST 2013
Operative System: Linux 3.5.0-22-generic amd64
JVM : Oracle Corporation Java HotSpot(TM) 64-Bit Server VM runtime 23.3-b01 1.7.0_07-b10
Processors: 8
System Memory: 97.034004% used of 7.7324257 GiB
Used Heap Size: 5.117325 MiB
Max Heap Size: 1023.25 MiB
Young Generation Heap Size: 340.5625 MiB
- - - - - - - - - - - - - - - - - - - -
/stop/ Pipeline Requests 1000000 of 1000000
- - - - - - - - - - - - - - - - - - - -
Statistics Ended at Thu Jan 31 15:28:00 EST 2013
Elapsed time: 48636 ms
Time in JIT compilation: 1 ms
Time in Young Generation GC: 7 ms (9 collections)
Time in Old Generation GC: 0 ms (0 collections)
Garbage Generated in Young Generation: 2914.1484 MiB
Garbage Generated in Survivor Generation: 0.4375 MiB
Garbage Generated in Old Generation: 0.046875 MiB
Average CPU Load: 99.71873/800
----------------------------------------
This style of benchmark is a reasonable test of:
- The raw speed of the IO layer
- The efficiency of the HTTP parsing and generation
- The memory footprint of the server
- The garbage produced by the server
For the same benchmark, Jetty-9 achieved the following results:
========================================
Statistics Started at Thu Jan 31 15:30:14 EST 2013
Operative System: Linux 3.5.0-22-generic amd64
JVM : Oracle Corporation Java HotSpot(TM) 64-Bit Server VM runtime 23.3-b01 1.7.0_07-b10
Processors: 8
System Memory: 94.26746% used of 7.7324257 GiB
Used Heap Size: 5.7408752 MiB
Max Heap Size: 1023.25 MiB
Young Generation Heap Size: 340.5625 MiB
- - - - - - - - - - - - - - - - - - - -
/stop/ Pipeline Requests 1000000 of 1000000
- - - - - - - - - - - - - - - - - - - -
Statistics Ended at Thu Jan 31 15:30:47 EST 2013
Elapsed time: 33523 ms
Time in JIT compilation: 2 ms
Time in Young Generation GC: 4 ms (4 collections)
Time in Old Generation GC: 0 ms (0 collections)
Garbage Generated in Young Generation: 1409.474 MiB
Garbage Generated in Survivor Generation: 0.1875 MiB
Garbage Generated in Old Generation: 0.046875 MiB
Average CPU Load: 99.959854/800
----------------------------------------
Thus, for a small increase in static heap usage (0.5MB in the static Tries), Jetty-9 outperforms Jetty-8: it is 30% faster (33.5s vs 48.6s) and produces 50% less young generation garbage (1409MB vs 2914MB), which triggers less than half the number of young generation collections.
Release Candidate 0 of Jetty-9 will be released in the next few days, so I hope you’ll join us, give it some more realistic loads and testing, and report the results.