One of my pet peeves is misleading benchmarks, as discussed in my Lies, Damned Lies and Benchmarks blog.  Recently there has been a bit of interest in Vert.x, some of it resulting from apparently good benchmark results against node.js. The author gave a disclaimer that the tests were non-rigorous and just for fun, but they have already led some people to ask if Jetty can scale like Vert.x.

I know absolutely nothing about Vert.x, but I do know that their benchmark is next to useless to demonstrate any kind of scalability of a server.  So I’d like to analyse their benchmarks and compare them to how we benchmark jetty/cometd to try to give some understanding about how benchmarks should be designed and interpreted.

The benchmark

The vert.x benchmark uses 6 clients, each with 10 connections, each with up to 2000 pipelined HTTP requests for a trivial 200 OK or tiny static file. The tests were run for a minute and the average request rate was taken. So let's break this down:

6 Clients of 10 connections!

However you look at this (6 users each with a browser with 10 connections, or 60 individual users), 6 or 60 users does not represent any significant scalability.  We benchmark jetty/cometd with 10,000 to 200,000 connections and have production sites that run with similar numbers.

Testing 60 connections does not tell you anything about scalability. So why do so many benchmarks get performed on low numbers of connections?  It’s because it is really really hard to generate realistic load for hundreds of thousands of connections.  To do so, we use the jetty asynchronous HTTP client, which has been designed specifically for this purpose, and we still need to use multiple load generating machines to achieve high numbers of connections.

2000 pipelined requests!

Really? HTTP pipelining is not turned on by default in most web browsers, and even if it were, I cannot think of any realistic application that would generate 2000 requests in a pipeline. Why is this important?  Because with pipelined requests, a server that reads from the connection into a fixed-size buffer will read many requests into that buffer in a single read.  A trivial HTTP request is a few 10s of bytes (and I’m guessing they didn’t send any of the verbose complex headers that real browsers do), so the vert.x benchmark would be reading 30 or more requests on each read.  Thus this benchmark is not really testing any IO performance, but simply how fast they can iterate over a buffer and parse simple requests. At best it is telling you about the latency in their parsing and request handling.
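To make the single-read effect concrete, here is a minimal sketch. It is not the vert.x code or the benchmark client: the request text, the 8192-byte buffer and the class and method names are all assumptions chosen for illustration, and a ByteArrayInputStream stands in for the socket. It performs one read and counts how many complete requests arrived in it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class PipelinedReadSketch {
    // Hypothetical minimal request; real browsers send far larger headers.
    static final String REQUEST = "GET / HTTP/1.1\r\nHost: localhost\r\n\r\n";

    /** Does ONE read into a buffer of the given size and counts the complete requests in it. */
    static int requestsInOneRead(int pipelined, int bufferSize) {
        StringBuilder wire = new StringBuilder();
        for (int i = 0; i < pipelined; i++) {
            wire.append(REQUEST);
        }
        // Stand-in for socket.getInputStream() with all pipelined requests already queued.
        InputStream in = new ByteArrayInputStream(
                wire.toString().getBytes(StandardCharsets.US_ASCII));
        byte[] buffer = new byte[bufferSize];
        int read;
        try {
            read = in.read(buffer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Each complete request ends with a blank line: CR LF CR LF.
        int count = 0;
        for (int i = 0; i + 3 < read; i++) {
            if (buffer[i] == '\r' && buffer[i + 1] == '\n'
                    && buffer[i + 2] == '\r' && buffer[i + 3] == '\n') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("complete requests in one read: "
                + requestsInOneRead(2000, 8192));
    }
}
```

With a request as small as the one assumed here, a single read delivers well over 30 requests, so the server spends its time parsing a buffer rather than doing IO. Realistic browser headers would bring that number down.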

Handling reads is not the hard part of scaling IO.  It is handling the idle pauses between the reads that is difficult.  Almost all real load profiles have these idle periods, and they require the server to allocate resources carefully so that idle connections do not consume resources that could be better used by non-idle connections.  2000 connections each with 6 pipelined requests would be more realistic, or better yet 20000 connections with 6 requests that are sent with 10ms delays between them.
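A paced load profile like the one suggested above could be sketched as follows. This is only an illustration, not the jetty/cometd load generator: the class name, the pool of 4 scheduler threads and the counter that stands in for "write one request on this connection" are all hypothetical. The point is the shape of the load: many connections, few requests each, with idle gaps between them.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class PacedLoadSketch {
    /**
     * Schedules requestsPerConnection sends for every connection, spaced
     * delayMillis apart, and returns the number of sends performed.
     */
    static long run(int connections, int requestsPerConnection, long delayMillis) {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(4);
        AtomicLong sent = new AtomicLong();
        CountDownLatch done = new CountDownLatch(connections * requestsPerConnection);
        try {
            for (int c = 0; c < connections; c++) {
                for (int r = 0; r < requestsPerConnection; r++) {
                    // Placeholder for writing request r on connection c;
                    // between sends the connection is simply idle.
                    scheduler.schedule(() -> {
                        sent.incrementAndGet();
                        done.countDown();
                    }, r * delayMillis, TimeUnit.MILLISECONDS);
                }
            }
            done.await(); // wait until every scheduled send has run
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while generating load", e);
        } finally {
            scheduler.shutdownNow();
        }
        return sent.get();
    }

    public static void main(String[] args) {
        // 20000 connections, 6 requests each, 10ms apart.
        System.out.println("requests sent: " + run(20000, 6, 10));
    }
}
```

A real load generator would replace the counter with asynchronous writes on open connections and would also have to read the responses, which is where most of the difficulty lies.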

Trivial 200 OK or Tiny static resource

Creating a scalable server for non-trivial applications is all about trying to ensure that maximal resources are applied to performing real business logic in preparing dynamic responses.  If all the responses are trivial or static, then the server is free to be more wasteful.  Worse still for realistic benchmarks, a trivial response generation can probably be in-lined by the hotspot compiler in a way that no real application ever could be.

Run for a minute

A minute is insufficient time for a JVM to achieve steady state.  For the first few minutes of a run the Hotspot JIT compiler will be using CPU to analyse and compile code. A trivial application might be able to be hotspot compiled in a minute, but any reasonably complex server/application is going to take much longer.  Try watching your application with jvisualvm and watch the perm generation continue to grow for many minutes while more and more classes are compiled. Only after the JVM has warmed up your application and CPU is no longer being used to compile, can any meaningful results be obtained.

The other big killer of performance is full garbage collection, which can stop the entire VM for many seconds.  Running fast for 60 seconds does not do you much good if a second later you pause for 10s while collecting the garbage from those fast 60 seconds.

Benchmark results need to be reported for steady state over longer periods of time, and you need to consider GC performance.  The jetty/cometd benchmark tools specifically measure and report both JIT and GC actions during the benchmark runs, and we can perform many benchmark runs in the same JVM.  In example output from a 30s run, some JIT compilation was still being performed, so the VM was not fully warmed up yet.
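The standard java.lang.management MXBeans expose cumulative JIT compilation time and GC counts, so a rough version of this kind of reporting can be sketched as below. This is not the jetty/cometd benchmark tool: the class name, the string-hashing workload and the number of runs are hypothetical placeholders. It repeats the same run several times in one JVM and prints how much JIT and GC activity happened during each run.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class JitGcReportSketch {
    static volatile int sink; // keeps the workload result alive

    /** Cumulative counters since JVM start: {jitMillis, gcCount, gcMillis}. */
    static long[] snapshot() {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        long jitMillis = (jit != null && jit.isCompilationTimeMonitoringSupported())
                ? jit.getTotalCompilationTime() : 0L;
        long gcCount = 0;
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the value is undefined for a collector.
            gcCount += Math.max(0, gc.getCollectionCount());
            gcMillis += Math.max(0, gc.getCollectionTime());
        }
        return new long[] { jitMillis, gcCount, gcMillis };
    }

    /** Hypothetical stand-in workload: allocates and hashes short strings. */
    static int workload(int iterations) {
        int h = 0;
        for (int i = 0; i < iterations; i++) {
            h = 31 * h + ("request-" + i).hashCode();
        }
        return h;
    }

    /** Runs the workload once; returns {elapsedMs, jitMs, gcCount, gcMs} for that run. */
    static long[] measure(int iterations) {
        long[] before = snapshot();
        long start = System.nanoTime();
        sink = workload(iterations);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        long[] after = snapshot();
        return new long[] { elapsedMs, after[0] - before[0],
                after[1] - before[1], after[2] - before[2] };
    }

    public static void main(String[] args) {
        for (int run = 1; run <= 5; run++) {
            long[] r = measure(2_000_000);
            System.out.printf("run %d: %d ms elapsed, JIT %d ms, GC %d collections / %d ms%n",
                    run, r[0], r[1], r[2], r[3]);
        }
    }
}
```

If the JIT figure is still non-zero for a run, the JVM was still compiling during it, and that run's throughput should not be treated as a steady-state result.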

Conclusion

I’m sure the vert.x guys had every good intent when doing their micro-benchmark, and it may well be that vert.x scales really well.  However, I wish that when developers consider benchmarking servers, instead of thinking “let’s send a lot of requests at it”, their first thought was “let’s open a lot of connections to it”.  Better yet, a benchmark (micro or otherwise) should be modelled on some real application and the load that it might generate.

The jetty/cometd benchmark is of a real chat application, that really works and has real features like member lists, private messages etc.  Thus the results that we achieve in benchmarks are able to be reproduced by real applications in production.