Context

The Jetty HTTP client internally uses a connection pool to recycle HTTP connections, as they are expensive to create and dispose of. This is a well-known pattern that has proved to work well.
While this pattern brings great benefits, it is not without its shortcomings. If the Jetty client is used by concurrent threads (as it is meant to be), contention can build up on the pool, since it is a central point that every thread must go through to get a connection.
Here’s a breakdown of what the client contains:

HttpClient
  ConcurrentMap<Origin, HttpDestination>
Origin
  scheme
  host
  port
HttpDestination
  ConnectionPool

The client contains a Map of Origin -> HttpDestination. The origin basically describes where the remote service is and the destination is a wrapper for the connection pool with bells and whistles.
So, when you ask the client to do a GET on https://www.google.com/abc, the client will use the connection pool contained in the destination keyed by “https|www.google.com|443”. Asking the client to GET https://www.google.com/xyz would make it use the exact same connection pool, as the origins of both URLs are the same. Asking the same client to also do a GET on https://www.facebook.com/ would make it use a different connection pool, as the origin this time around is “https|www.facebook.com|443”.
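Here is a minimal usage sketch illustrating this keying (simplified; depending on the Jetty version, an SslContextFactory may also need to be configured for HTTPS):

import org.eclipse.jetty.client.HttpClient;

public class OriginExample
{
    public static void main(String[] args) throws Exception
    {
        HttpClient client = new HttpClient();
        client.start();

        // Same origin “https|www.google.com|443”: both requests go through the same connection pool.
        client.GET("https://www.google.com/abc");
        client.GET("https://www.google.com/xyz");

        // Different origin “https|www.facebook.com|443”: a separate connection pool is used.
        client.GET("https://www.facebook.com/");

        client.stop();
    }
}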
It is quite common for a client to exchange with only a handful of destinations, or sometimes even a single one, so it is expected that if multiple threads are using the client in parallel, they will all ask the same connection pool for a connection.
There are three important implementations of the ConnectionPool interface:
DuplexConnectionPool is the default. It guarantees that a connection will never be shared among threads, i.e.: when a connection is acquired by one thread, no other thread can acquire it for as long as it has not been released. It also tries to maximize re-use of the connections so that only a minimal number of connections has to be kept open.
MultiplexConnectionPool allows up to N threads (N being configurable) to acquire the exact same connection. While this is of no use for HTTP/1.1 connections, as HTTP/1.1 does not allow running concurrent requests on one connection, HTTP/2 does: when the Jetty client connects to an HTTP/2 server, a single connection can be used to send multiple requests in parallel. Just like DuplexConnectionPool, it also tries to maximize the re-use of connections.
RoundRobinConnectionPool is similar to MultiplexConnectionPool in that it allows multiple threads to share connections. But unlike it, it does not try to maximize the re-use of connections; instead, it tries to spread the load evenly across all of them. It can be configured not to multiplex the connections so that it can also be used with HTTP/1.1 connections.
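For reference, the pool implementation is chosen per client by setting a connection pool factory on the transport. The sketch below follows the Jetty 9.4-era API; the exact factory and constructor signatures are assumptions and may differ in your Jetty version:

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.RoundRobinConnectionPool;

public class PoolFactoryExample
{
    public static void main(String[] args) throws Exception
    {
        HttpClient httpClient = new HttpClient();
        // Assumption: this factory/constructor shape matches your Jetty version.
        httpClient.getTransport().setConnectionPoolFactory(destination ->
            new RoundRobinConnectionPool(destination,
                httpClient.getMaxConnectionsPerDestination(),
                destination));
        httpClient.start();
    }
}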

The Problem on the Surface

Some users are heavily using the Jetty client with many concurrent threads, and they noticed that their threads were bottlenecked in the code that tries to get a connection from the connection pool. A quick investigation revealed that the problem comes from the fact that the connection pool implementations all use a java.util.Deque protected by a java.util.concurrent.locks.ReentrantLock. A single lock plus multiple threads makes it rather obvious why this bottlenecks: all threads contend on the lock, which is easily provable with a simple microbenchmark run under a profiler.
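To make the contention point concrete, here is a minimal sketch (not the actual Jetty code) of that lock-protected deque pattern:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.ReentrantLock;

class LockedDequePool<T>
{
    private final ReentrantLock lock = new ReentrantLock();
    private final Deque<T> idle = new ArrayDeque<>();

    T acquire()
    {
        lock.lock(); // every acquiring thread serializes here
        try
        {
            return idle.pollFirst();
        }
        finally
        {
            lock.unlock();
        }
    }

    void release(T connection)
    {
        lock.lock(); // and serializes again here on release
        try
        {
            idle.offerFirst(connection);
        }
        finally
        {
            lock.unlock();
        }
    }
}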
Here is the benchmarked code:

@Benchmark
public void testPool()
{
  // Acquire a connection, simulate a small random amount of work, then release it.
  Connection connection = pool.acquire(true);
  Blackhole.consumeCPU(ThreadLocalRandom.current().nextInt(10, 20));
  pool.release(connection);
}

And here is the report of the benchmark, running on a 12-core CPU with 12 parallel threads and 12 connections in the pool:

Benchmark                                 (POOL_TYPE)   Mode  Cnt        Score         Error  Units   CPU
ConnectionPoolsBenchmark.testPool              duplex  thrpt    3  3168609.140 ± 1703378.453  ops/s   15%
ConnectionPoolsBenchmark.testPool           multiplex  thrpt    3  2284937.900 ±  191568.815  ops/s   15%
ConnectionPoolsBenchmark.testPool         round-robin  thrpt    3  1403693.845 ±  219405.841  ops/s   25%

It was quite apparent that while the benchmark was running, the reported CPU consumption never reached anything close to 100% no matter how many threads were configured to use the connection pool in parallel.

The Problem in Detail

There are two fundamental problems. The most obvious one is the lock that protects all access to the connection pool. The second is more subtle: a queue is an ill-suited data structure for writing performant concurrent algorithms.
To be more precise: there are excellent concurrent queue algorithms out there that do a terrific job, and sometimes you have no choice but to use a queue to implement your logic correctly. But when you can write your concurrent code with a different data structure than a queue, you’re almost always going to win. And potentially win big.
Here is an over-simplified explanation of why:

Queue
head           tail
[c1]-[c2]-[c3]-[c4]
c1-4 are the connection objects.

All threads dequeue from the head and enqueue at the tail, so that creates natural contention on these two spots. No matter how the queue is implemented, there is a natural set of data describing the head that must be read and modified concurrently by many threads, so some form of mutual exclusion or a compare-and-set retry loop has to be used to touch the head. Said differently: reading from a queue requires modifying the shape of the queue, as you are removing an entry from it. The same is true when writing to the queue.
Both DuplexConnectionPool and MultiplexConnectionPool exacerbate the problem because they use the queue as a stack: the re-queuing is performed at the head of the queue, as these two implementations want to maximize re-use of the connections. RoundRobinConnectionPool mitigates that a bit by dequeuing from the head and re-queuing at the tail to spread the load over all connections.

The Solution

The idea was to come up with a data structure that does not need to modify its shape when a connection is acquired or released. So instead of directly storing the connections in the data structure, let’s wrap each one in a metadata holder object (let’s call its class Entry) that can tell whether the connection is in use or not. Let’s pick a base data structure that is quick and easy to iterate, like an array.
All threads iterate the array and try to reserve the visited connections until one can be successfully reserved. The reservation is done by trying to flip a flag atomically; a failed attempt is not retried, it simply means you move on to the next connection. Because of multiplexing, the free/in-use flag actually has to be a counter that can go from 0 up to the configured maximum multiplex count.

Array
[e1|e2|e3|e4]
 a1 a2 a3 a4
 c1 c2 c3 c4
e1-4 are the Entry objects.
a1-4 are the "acquired" counters.
c1-4 are the connection objects.
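Here is a minimal sketch of that acquire/release algorithm (assumed names; this is not the actual Jetty code): each Entry carries an atomic counter, and acquiring means iterating the entries from some starting index, attempting a single compare-and-set per entry.

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class Entry<T>
{
    final T connection;
    final AtomicInteger acquired = new AtomicInteger(); // 0 = free, maxMultiplex = fully used

    Entry(T connection)
    {
        this.connection = connection;
    }
}

class ArrayPool<T>
{
    private final List<Entry<T>> entries;
    private final int maxMultiplex;

    ArrayPool(List<Entry<T>> entries, int maxMultiplex)
    {
        this.entries = entries;
        this.maxMultiplex = maxMultiplex;
    }

    Entry<T> acquire(int startIndex)
    {
        int size = entries.size();
        for (int i = 0; i < size; i++)
        {
            Entry<T> entry = entries.get((startIndex + i) % size);
            int current = entry.acquired.get();
            // A failed compare-and-set is not retried on this entry: just move on to the next one.
            if (current < maxMultiplex && entry.acquired.compareAndSet(current, current + 1))
                return entry;
        }
        return null; // no connection available right now
    }

    void release(Entry<T> entry)
    {
        entry.acquired.decrementAndGet();
    }
}

Duplex and multiplex pools always pass a startIndex of 0, while the round-robin pool derives it from an atomic counter, as explained below.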

Both DuplexConnectionPool and MultiplexConnectionPool still have some form of contention as they always start iterating at the array’s index 0. This is unavoidable if you want to maximize the usage of connections.
RoundRobinConnectionPool uses an atomic counter to figure out where to start iterating. This relieves the contention on the first elements in the array at the expense of contending on the atomic counter itself.
The results of this algorithm speak for themselves:

Benchmark                                 (POOL_TYPE)   Mode  Cnt         Score         Error  Units   CPU
ConnectionPoolsBenchmark.testPool              duplex  thrpt    3  15661516.143 ±  104590.027  ops/s  100%
ConnectionPoolsBenchmark.testPool           multiplex  thrpt    3   6172145.313 ±  459510.509  ops/s  100%
ConnectionPoolsBenchmark.testPool         round-robin  thrpt    3  15446647.061 ± 3544105.965  ops/s  100%

Clearly, this time we’re using all the CPU cycles we can. And we get 3 times more throughput for the multiplex pool, 5 times for the duplex pool, and 10 times for the round-robin pool.
This is quite an improvement in itself, but we can still do better.

Closing the Loop

The above explanation describes the core algorithm that the new connection pools were built upon, but it is a bit of an over-simplification: reality is slightly more complex.

  • The number of entries can change dynamically. So instead of a plain array, a CopyOnWriteArrayList is used so that entries can be added or removed at any time while the pool is being used.
  • Since the Jetty client creates connections asynchronously and the pool enforces a maximum size, the mechanism to add new connections works in two steps, reserve and enable, plus a counter to track the pending connections, i.e.: those that are reserved but not yet enabled (see the sketch after this list). Since adding or removing connections is not expected to happen very often, the operations that modify the CopyOnWriteArrayList happen under a lock to simplify the implementation.
  • The pool provides a feature with which you can set how many times a connection can be used by the Jetty client before it has to be closed and a new one opened. This usage counter needs to be updated atomically with the multiplexing counter, which makes the acquire/release finite state machine more complex.
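As an illustration of the reserve/enable two-step mechanism, here is a simplified sketch (names and structure are assumptions, not the actual Jetty API; the lock mentioned above is omitted for brevity):

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

class TwoStepPool<T>
{
    private final List<T> entries = new CopyOnWriteArrayList<>();
    private final AtomicInteger pending = new AtomicInteger(); // reserved but not yet enabled
    private final int maxEntries;

    TwoStepPool(int maxEntries)
    {
        this.maxEntries = maxEntries;
    }

    // Step 1: reserve a slot before the connection is opened, since creation is asynchronous.
    boolean reserve()
    {
        while (true)
        {
            int reserved = pending.get();
            if (entries.size() + reserved >= maxEntries)
                return false; // the pool is full, counting in-flight creations
            if (pending.compareAndSet(reserved, reserved + 1))
                return true;
        }
    }

    // Step 2: once the connection has actually been created, enable it for use.
    void enable(T connection)
    {
        entries.add(connection);
        pending.decrementAndGet();
    }
}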

Extra #1: Caching

There is one extra step we can take that can improve the performance of the duplex and multiplex connection pools even further. In an ideal case, there are as many connections in the pool as there are threads using them. What if we could make each thread stick to the same connection every time? This can be done by storing a reference to the successfully acquired connection into a thread-local variable which can then be reused the next time(s) a connection is acquired. This way, we bypass both the iteration of the array and the contention on the “acquired” counters. Of course, life isn’t always ideal, so we still need to keep the iteration and the counters as a fallback in case we drift away from the ideal case. But the closer the usage of the pool is to the ideal case, the more this thread-local cache avoids iterations and contention on the counters.
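Here is a minimal sketch of that idea, building on the Entry/ArrayPool sketch above (again with assumed names, not the actual implementation): the thread-local cache is tried first, and the normal iteration is kept as a fallback.

class CachingPool<T>
{
    // Each thread remembers the last entry it successfully acquired.
    private final ThreadLocal<Entry<T>> cache = new ThreadLocal<>();

    private final ArrayPool<T> pool; // the array-based pool sketched earlier

    CachingPool(ArrayPool<T> pool)
    {
        this.pool = pool;
    }

    Entry<T> acquire()
    {
        Entry<T> cached = cache.get();
        // Fast path: try to re-acquire the entry this thread used last time
        // (shown for the duplex case; multiplexing would bump the counter up to its maximum instead).
        if (cached != null && cached.acquired.compareAndSet(0, 1))
            return cached;
        // Fallback: iterate the entries as usual.
        Entry<T> entry = pool.acquire(0);
        if (entry != null)
            cache.set(entry);
        return entry;
    }
}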
Here are the results you get with this extra addition:

Benchmark                                 (POOL_TYPE)   Mode  Cnt         Score           Error  Units   CPU
ConnectionPoolsBenchmark.testPool       cached/duplex  thrpt    3  65338641.629 ±  42469231.542  ops/s  100%
ConnectionPoolsBenchmark.testPool    cached/multiplex  thrpt    3  64245269.575 ± 142808539.722  ops/s  100%

The duplex pool is over four times faster again, and the multiplex pool over 10 times faster. Again, the benchmark reflects the ideal case, so this is really the best one can get, but these improvements certainly make it worth tuning the pool to try to capture as much of those multipliers as possible.

Extra #2: Iteration strategy

All this logic has been moved to the org.eclipse.jetty.util.Pool class so that it can be re-used across all connection pool implementations, as well as potentially for pooling other objects. A good example is the server’s XmlConfiguration parser that uses expensive-to-create XML parsers. Another example is the Inflater and Deflater classes that are expensive to instantiate but are heavily used for compressing/decompressing requests and responses. Both are great candidates for pooling and already are pooled, each with its own custom pooling code.
So we could replace those two pools with the new pool created for the Jetty client’s HTTP connection pool. Except that neither of those needs to maximize re-use of the pooled objects, and they don’t need to rotate through them either, so neither of the two iteration orders discussed above does the proper job, and both come with some overhead.
Hence, StrategyType was introduced to tell the pool where the iteration should start:

public enum StrategyType
{
    /**
     * A strategy that looks for an entry always starting from the first entry.
     * It will favour the early entries in the pool, but may contend on them more.
     */
    FIRST,
    /**
     * A strategy that looks for an entry by iterating from a random starting
     * index. No entries are favoured and contention is reduced.
     */
    RANDOM,
    /**
     * A strategy that uses the {@link Thread#getId()} of the current thread
     * to select a starting point for an entry search. Whilst not as performant as
     * using the {@link ThreadLocal} cache, it may be suitable when the pool is substantially smaller
     * than the number of available threads.
     * No entries are favoured and contention is reduced.
     */
    THREAD_ID,
    /**
     * A strategy that looks for an entry by iterating from a starting point
     * that is incremented on every search. This gives similar results to the
     * random strategy but with more predictable behaviour.
     * No entries are favoured and contention is reduced.
     */
    ROUND_ROBIN,
}

FIRST is the strategy used by the duplex and multiplex connection pools, and ROUND_ROBIN the one used, well, by the round-robin connection pool. The two new ones, RANDOM and THREAD_ID, are about spreading the load across all entries like ROUND_ROBIN, but unlike it, they use a cheap way to find the start index that does not require any form of contention.
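To make the difference concrete, here is a sketch of how each strategy could compute the starting index for the iteration, using the StrategyType enum above (a simplified interpretation, not the actual org.eclipse.jetty.util.Pool code):

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

class StartIndexChooser
{
    private final AtomicInteger counter = new AtomicInteger();

    int startIndex(StrategyType strategy, int size)
    {
        switch (strategy)
        {
            case FIRST:
                return 0; // favours early entries (maximizes re-use) but contends on them more
            case RANDOM:
                return ThreadLocalRandom.current().nextInt(size); // cheap and contention-free
            case THREAD_ID:
                return (int)(Thread.currentThread().getId() % size); // stable per thread, contention-free
            case ROUND_ROBIN:
                return Math.floorMod(counter.getAndIncrement(), size); // even spread, contends on the counter
            default:
                throw new IllegalArgumentException("Unknown strategy: " + strategy);
        }
    }
}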

