Taylor Crown has written a short paper regarding Combining the Servlet API
and NIO, which has been briefly discussed on the serverside.

NIO Servlets have often been discussed as the holy grail of java web
application performance. The promise of efficient buffers and reduced
thread counts is very attractive for building scalable 100% java web
servers. Taylor writes about a mockup NIO server that he implemented which
shows some of this promise.

Taylor's results were not obtained with a real Servlet container running
realistic loads, but they look promising, and his approach has inspired me
to try to apply it to the Jetty Servlet container.

The fundamental problem with using NIO with servlets is how to combine the
non-blocking features of NIO with the blocking streams used by
servlets. I have tried several times before to introduce a
SocketChannelListener to Jetty, which used non-blocking NIO semantics only
to manage idle connections. Connections with active requests were converted
to blocking mode, assigned a thread, and handled by the servlet container
normally.
Unfortunately, the cost of manipulating select sets and changing socket modes
was vastly greater than any savings. So while this listener did go into
production at a few sites, there was no significant gain in scalability and
an actual loss in maximum throughput.

Taylor has tried a different approach, where a producer/consumer model is used
to link NIO to servlets via piped streams. A single thread is responsible for
reading all incoming packets and placing them in non-blocking pipes. A pool
of worker threads takes jobs from a queue of connections with input and does
the actual request handling. I have applied this approach to Jetty as
follows:

  • The PipedInputStream used by Taylor requires all data read to be copied
    into byte arrays. My natural loathing of data copies led me to write a
    ByteBufferInputStream, which allows the NIO direct buffers to be used as
    the InputStream buffers and then recycled for later use (see the sketch
    after this list).
  • Taylor's mock server uses direct NIO writes to copy data from a file to
    the response. While a great way to send static content, this is not
    realistic for a servlet container, which must treat all content as
    dynamic. Thus I wrote SocketChannelOutputStream (also sketched below) to
    map a blocking OutputStream onto a non-blocking SocketChannel. It works
    on the assumption that a write to a NIO channel will rarely return 0
    bytes written; I have not thoroughly tested this assumption.
  • There is no job queue in the Jetty implementation; instead, requests are
    delegated directly to the current Jetty thread pool. The effect of this
    change is to reduce the thread savings: a thread is required for every
    simultaneous request, which is better than a thread per connection, but
    not as trim as Taylor's minimal set of worker threads. In effect, a
    medium-sized thread pool is being used as a fixed-size job queue.
  • Taylor's mock server only handled simple requests for static content,
    which could be answered with a simple 304 response. Thus no requests
    carried content of any size, and nor did most responses. This is not a
    good test of the movement of real content that most web applications
    must perform. The Jetty test setup is against a more realistic mix of
    static and dynamic content, as well as a reasonable mix of POST requests
    with content.

This code has been written against Jetty 5.0 and is currently checked into
Jetty CVS HEAD in the org.mortbay.http.nio package. So far I have not had
time to really optimise or analyse the results, but early indications are
that this is no silver bullet.

The initial effect of using the NIO listener is that the latency of the
server under low load has doubled, and this latency gets worse with load.
The maximum throughput of the server has been reduced by about 10%, but it
is maintained at much higher levels of load. In fact, with my current test
setup I was unable to generate enough load to significantly reduce the
throughput. So, technically at least, this has delivered on the scalability
promise?

The producer/consumer model allows some low- and mid-level performance to be
traded for grace under extreme load. But you have to ask yourself: is this a
reasonable trade? Do I want to offer crappy service to 10000 users, or
reasonable service to 5000? To answer this, you have to consider the
psychology of the users of the system.

Load generators do not have any psychology and are happy to wait out the
increasing latency to the limits of the timeouts, often 30 seconds or more.
But real users are not so well behaved, and often have patience thresholds
set well below the timeouts. Unfortunately, a common user response to a
slowly displaying web page is to hit the retry button, or worse still, the
shift-retry! Having your server handle 1000 requests per second may not be
such a great thing if 50% of those requests are retries from upset users.

I suspect that the producer/consumer model may be costing real quality of
service in return for good technical numbers. Consider the logical extreme
of the job queue within Taylor's mock implementation. If sustained load is
offered in excess of the level that the workers can handle, then that queue
will simply grow and grow. The workers will still be operating at near
their optimal throughput, but the latency of all requests served will
increase until timeouts start to expire. For example, if the workers can
clear 100 requests per second but 120 per second are offered, the queue
grows by 20 requests every second; after a minute, each new request waits
behind 1200 others and sees roughly 12 seconds of queuing latency.
Throughput is maintained, but well beyond the point of offering a
reasonable quality of service.

Even with a limited job queue (as in the Jetty implementation), the simple
producer/consumer model suffers from an inability to target resources to
where they are best used. The single producer thread gives equal effort to
handling new requests and to receiving packets for requests that have
already started processing. On a loaded server, it is better to use your
resources to clear existing requests, so that their resources may be freed
for other requests. On a multi-CPU machine, it will be a significant
restriction to allow only a single CPU to perform any IO reads, as other
CPUs may be delayed from doing useful work on real requests while one CPU
is reading more load onto the system.

Taylor's producer/consumer approach is significantly better than my
preceding attempts, but it has not produced an easy win when applied to a
real Servlet container. I am also concerned that the analysis has focused
too much on throughput, without due consideration for latency and QoS. This
is not to say that this is a dead end, just that more thought and effort
are required if producer/consumer NIO is to match the wonderful job that
modern JVMs do with threading.

I plan to leave the SocketChannelListener in the development branch of Jetty
for some time to allow further experimentation and analysis. However, I fear
that the true benefits of NIO will not be available to java web applications
until we look at an API other than Servlets for our content generation.