Since we discovered how to make Jetty-9 avoid parallel slowdown, we’ve been continuing to work with micro benchmarks and consideration of Mechanical Sympathy to further optimise Jetty-9.  As we now about to go to release candidate for Jetty-9, I thought I’d give a quick report on the excellent results we’ve had so far.

False Sharing in Queues

Queuing is a very important operation in servers like Jetty and our QueuedThreadPool is a key element to Jetty’s great performance.   While it implements the JVMs Executor interface, even the Jetty-8 implementations has far superior performance to the executors provided by the JVM.   This queue is based on our BlockingArrayQueue that separates the locks for the head and tail and only supports blocking for take operations.

However because of the layout in memory of the class, it turned out that the head and tail pointers and locks were all within a single CPU cache row.  This is bad because when different threads running on different cores are trying to independently work on the head and tail, it turns out that they are both hitting the same area of memory and are thus repeatedly invalidating each others caches in a pattern called false sharing.

The solution is to be aware of the memory layout of the class when considering what threads will be accessing which fields and to space them out so that you can avoid this false sharing of cache rows.  The results have given us a significant boost in our micro benchmarks (see below).

Time and Space Efficient Trie

Looking up string values is probably one of the most common activities in a HTTP server, as header lines are parsed and the semantic meaning interpreted from the text headers.  A simple hash map look up of a string can be moderately efficient in both space and time, but it assumes that you have a String instance in the first place.  When parsing HTTP, we just have bytes in a buffer and it is costly to have to create a String from these bytes just to lookup what string it is.  Furthermore we need case insensitivity, which is not well supported by the standard JVM hash maps.

In Jetty-9 we introduced a Trie abstraction that allowed us to experiment with various implementations of string lookups which could operate directly from a slice of the IO buffers without any copies or object creation.

For our well known string (eg. HTTP header names and values) we initially implemented a simple TreeTrie that stored each character as a node object in a tree.   This was moderately fast, but it suffered from poor locality of reference as each character had to look up a new object that could be located anywhere in the heap.

Thus we developed an ArrayTrie implementation that stores the tree as index references within a large char[].  This had the huge benefit that once the a portion of the char[] was loaded into cache for one character in the lookup, it is highly likely that subsequent character lookups are already in the cache.  This again gave us a significant boost in our micro benchmarks! But we wanted more

Look ahead Trie

The Trie abstraction was initially just used for looking up known strings such as “Host”, “Content-Type”, “User-Agent”, “Connection”, “close” etc. which is very useful as you parse a HTTP header token by token.  However, HTTP is a very repetitive protocol and for a given client you will frequently see well known combinations of tokens such as:

Connection: close
Connection: keep-alive
Accept-Encoding: gzip
Accept: */*

The simple parsing strategy is to look for ‘:’ and CRLF to identify tokens and then lookup those strings in the Trie.  But if you are able to look up the combinations of  tokens in a Trie, then the Trie you save effort parsing as well as being able to lookup shared instances of common fields (eg Connection: keep-alive).    Thus we modified our Trie interface to support a best match lookup that given the entire buffer will attempt to match an entire header line.

For many well known fields combinations like the ones listed above, our ArrayTrie was a good solution. While it is a bit memory hungry, the number of field combinations is not large, is statically known and is shared between all connections to the server.  But unfortunately, not all fields are well known in advance and some of the longest repeated fields look like:

User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:18.0) Gecko/20100101 Firefox/18.0
Cookie: __utma=1598342155.164253763.123423536.1359602604.1359611604.283; __utmz=12352155.135234604.383.483.utmcsr=google.com.au|utmccn=(referral)|utmcmd=referral|utmcct=/ig; __utmc=4234112; __utmb=4253.1.10.1423
Accept-Language: en-US,en;q=0.5,it;q=0.45

Such fields are not statically know but will frequently repeat, either from the same client or from a class of client for a give period of time while a particular version is current.  Thus having a static field Trie is insufficient and we needed to be able to create dynamic per connection Tries to lookup such repeated fields.   ArrayTrie worked, but is massively memory hungry and unsuitable to handle the hundreds of thousands of connections that Jetty can terminate.

The theory of Tries suggested that a Ternary Tree is a good memory structure with regards to memory consumption, but the problem is that it gave up our locality of reference and worse still created a lot of node garbage as trees are built and discarded.   The solution is to combine the two approaches and we came up with our ArrayTernaryTrie, which is a ternary tree structure stored in a fixed size char[] (which also gives the benefit of protection from DOS attacks).  This data structure has proved quick to build, quick to lookup, efficient on memory and cheap to GC.  It’s another winner in the micro benchmarks.

Branchless Code

Supporting many versions of a protocol and the many different semantics that it can carry results in code with lots of if statements.  When a modern CPU encounters a conditional, it tries to guess which way the branch will go and fills the CPU pipeline with instructions from that branch.  This means you either want your branches to be predictable or you want to avoid branches altogether so as to avoid breaking the CPU pipeline.

This can result in some very fast, but slightly unreadable code. The following branchless code:

byte b = (byte)((c & 0x1f) + ((c >> 6) * 0x19) - 0x10);

Converts a hex digit to an byte value without the need for branchful code like:

if (c>='A' && c<='F')
  b=10+c-'A';
...

Results

The results have been great, albeit with my normal disclaimer that these are just micro benchmarks and don’t represent any realistic load and please wait for a full server benchmarks before getting too excited.

For a single connection handling 1,000,000 pipelined requests, Jetty-8 achieved the following results:

========================================
Statistics Started at Thu Jan 31 15:27:11 EST 2013
Operative System: Linux 3.5.0-22-generic amd64
JVM : Oracle Corporation Java HotSpot(TM) 64-Bit Server VM runtime 23.3-b01 1.7.0_07-b10
Processors: 8
System Memory: 97.034004% used of 7.7324257 GiB
Used Heap Size: 5.117325 MiB
Max Heap Size: 1023.25 MiB
Young Generation Heap Size: 340.5625 MiB
- - - - - - - - - - - - - - - - - - - -
/stop/ Pipeline Requests 1000000 of 1000000
- - - - - - - - - - - - - - - - - - - -
Statistics Ended at Thu Jan 31 15:28:00 EST 2013
Elapsed time: 48636 ms
    Time in JIT compilation: 1 ms
    Time in Young Generation GC: 7 ms (9 collections)
    Time in Old Generation GC: 0 ms (0 collections)
Garbage Generated in Young Generation: 2914.1484 MiB
Garbage Generated in Survivor Generation: 0.4375 MiB
Garbage Generated in Old Generation: 0.046875 MiB
Average CPU Load: 99.71873/800
----------------------------------------

This style of benchmark is a reasonable test of:

  • The raw speed of the IO layer
  • The efficiency of the HTTP parsing and generating
  • The memory footprint of the server
  • The garbage produced by the server

For the same benchmark, Jetty-9 achieved the following results:

========================================
Statistics Started at Thu Jan 31 15:30:14 EST 2013
Operative System: Linux 3.5.0-22-generic amd64
JVM : Oracle Corporation Java HotSpot(TM) 64-Bit Server VM runtime 23.3-b01 1.7.0_07-b10
Processors: 8
System Memory: 94.26746% used of 7.7324257 GiB
Used Heap Size: 5.7408752 MiB
Max Heap Size: 1023.25 MiB
Young Generation Heap Size: 340.5625 MiB
- - - - - - - - - - - - - - - - - - - -
/stop/ Pipeline Requests 1000000 of 1000000
- - - - - - - - - - - - - - - - - - - -
Statistics Ended at Thu Jan 31 15:30:47 EST 2013
Elapsed time: 33523 ms
    Time in JIT compilation: 2 ms
    Time in Young Generation GC: 4 ms (4 collections)
    Time in Old Generation GC: 0 ms (0 collections)
Garbage Generated in Young Generation: 1409.474 MiB
Garbage Generated in Survivor Generation: 0.1875 MiB
Garbage Generated in Old Generation: 0.046875 MiB
Average CPU Load: 99.959854/800
----------------------------------------

Thus for a small increase in static heap usage (0.5MB in the static Tries), jetty-9 out performs jetty-8 by 30% faster (33.5s vs 48.6s) and 50% less YG garbage (1409MB vs 2914MB) which trigger less than half the YG collections.

Release Candidate 0 of Jetty-9 will be released in the next few days, so I hope you’ll join us and start giving it some more realistic loads and testing and report the results.


8 Comments

Justin Mason · 05/02/2013 at 13:05

Great stuff! fantastic to see you guys applying these techniques with good results….

Ashwin Jayaprakash · 06/02/2013 at 07:14

Cool stuff. Keep up the good work!

Peter Tillotson · 14/02/2013 at 10:12

Very good article, reminded me of some of the FSM work in Lucerne 4.

Diego de Oliveira · 18/02/2013 at 17:01

What tool are you using to run the benchmarks and print the java memory usage, garbage collections time and so on?

Cowtowncoder · 01/03/2013 at 22:46

Interesting stuff. I am actually not convinced of Trie part — from my earlier work with Jackson, I actually found that a faster way is to use hash-based table, with quads (ints from 4 byte), and lazily decode UTF-8 on miss. I did consider Tries, but neither speed nor updatability could quite compete.
Just for fun you could see how Jackson’s UTF-8 symbol table would fare; Jackson itself uses (and reuses) these for property names, where potential value domain is large (esp. considering optional escaping). I would be interested in a comparison.

    gregw · 01/03/2013 at 23:04

    Sounds like we need a showdown at the string matching corral!) Got a pointer to the exactly code/class?

Most interesting links of February ’13 « The Holy Java · 28/02/2013 at 21:51

[…] Jetty-9 goes fast with Mechanical Sympathy – interesting how the run-time behavior might differ from what we would expect and how knowing the hardware can improve performance. Here:  false sharing of a blocking queue’s head/tail pointers and locks (close => same CPU cache row => updating one invalidates the other), using trie backed directly by IO buffers for faster String lookups etc. Result (all microbenchmark disclaimers): jetty-9 out performs jetty-8 by 30% faster and 50% less YG garbage. […]

Comments are closed.