Why detecting concurrent issues can be difficult

Jetty 9’s NIO code is a nearly complete rewrite with improved architecture, cleaner and clearer code base and best of all it’ll be even faster and more efficient than jetty 7/8’s NIO layer. Detecting concurrent code issues is usually not a trivial thing. In today’s blog I will describe how it took us 4 days to resolve a single concurrent issue in our brand new NIO code. The Fix is in jetty 9 Milestone 1.
I will try to keep this blog entry as general as possible and won’t go too much into detail of this single issue or the jetty code, but describe how I usually try to resolve concurrent code issues and what I’ve done to debug this issue.
However doing NIO right is not a trivial thing to do. As well as writing code that is absolutely thread safe during highly concurrent executions. We’ve been pleased how well the new NIO code has been working from scratch. That was due to good test coverage and the great skills of the people who wrote it (Simone Bordet and Greg Wilkins mainly). However last week we found a spdy load test failing occasionally.
Have a look at the test if you’re interested in the details. For this blog it’s sufficient to know, that there’s a client that opens a spdy connection to the server and then will open a huge amount of spdy streams to the server and send some data back and forth. The streams are opened by 50 concurrent threads as fast as possible.
Most of the time the test runs just fine. Occasionally it got completely stuck at a certain point and timed out.
When debugging such concurrent issues you should always try first to get the test fail more consistently. If you manage to get that done, then it’s way easier to determine if a fix you try is successful or not. If only every 10th run fails, you do a fix and then the test runs fine for twenty runs it might have been your fix or you’ve just made 20 lucky runs. So once you think you’ve fixed a concurrent code issue that happens intermittently, make sure you run the test in a loop until it either fails or the test has run often enough that you can be sure it succeeded.
This is the bash one-liner I usually use:

export x=0 ; while [ $? -eq "0" ] ; do ((x++)) ; echo $x ; mvn -Dtest=SynDataReplyDataLoadTest test ; done

It’ll run the test in a loop until an error occurs or you stop it. I leave it running until I’m totally sure that the problem is fixed.
For my specific issue I raised the test iterations from 500 to 1500 and that made the test fail about every 2nd run which is pretty good for debugging. Sometimes you’re not able to make the test fail more often and you’ve to rely on running the test often enough as described above.
Then whenever something gets stuck, you should get a few thread dumps of the JVM while it’s stuck and have a look if there’s something as obvious as a deadlock or a thread busy looping, etc. For this case, everything looked fine.
Next thing you usually should do is to carefully add some debug output to gain more information about the cause of the problem. I say carefully, because every change you do and especially expensive operations like writing a log message might affect the timing of your concurrent code execution and make the problem occur less often or in worst case it doesn’t occur at all. So simply turning on debug loglevel solved the problem once for all. Tried to convince Greg that we simply have to ship jetty with DEBUG enabled and blame customers who turn it off… 😉
Even a single log message printed for each iteration affected the timing enough to let the problem occur way less often. Too much logging and the problem doesn’t occur at all.
Instead of logging the information I needed, we’ve tried to keep the desired information in memory by adding some fields and make them accessible from the test to print them at a later stage.
I suspected that we might miss a call to flush() in our spdy StandardSession.java which will write DataFrames from a queue through Jetty’s NIO layer to the tcp layer. So for debugging I’ve stored some information about the last calls to append(), prepend(), flush(), write(), completed(). Most important for me was to know who the last callers to those methods was, the state of StandardSession.flushing(), the queue size, etc.
Simone told me the trick to have a scheduled task running in parallel to the thread which can then print all the additional information once the test goes stuck. Usually you know how long a normal test run takes. Then you add some time to be safe and have the scheduled task executed printing the desired information after enough time passed to be sure that the test is stuck. In my case it was about 50s when I could be sure that the test usually should have finished. I’ve raised the timeouts (2*50seconds for example) to make sure that the test is stuck long enough before the scheduled task is executed. But even collecting too much data this way made the test fail less often giving me a hard time to debug this. Having to do 10 test runs which all take about 2 min. before one failed already wastes 20 min. …
I’ve had a thesis: “Missing call to flush()” and thus everything stuck in the server’s spdy queue. And the information I collected as described above seem to have proofed my thesis. I found:
– pretty big queue size on the server
– server stuck sending spdy dataframes
Everything looked obvious. But at the end this is concurrent code. I double checked the code in StandardSession.java to make sure that the code is really threadsafe and that we do not miss a call to flush in every concurrent scenario. Code looked good for me, but concurrent code issues are rarely obvious. Triple checked it, nothing. So lets proof the thesis by doing a call to flush() from my scheduled task once the test is stuck and this should get the StandardSession back to send the queued data frames. However, it didn’t. So my thesis was wrong.
I’ve added some more debug information about the state StandardSession was in. And I could figure out that it is stuck sending a spdy frame to the client. StandardSession commits a single frame to the underlying NIO code and will wait until the NIO code calls a callback (StandardSession.completed()) before it flushes the next spdy frame. However completed() has not been called by the NIO layer indicating a single frame being stuck somewhere between the NIO layer of the server and the client. I was printing some debug information for the client as well and I could see that the last frame successfully sent by the server has not reached the spdy layer of the client. In fact the client usually was about 10.000 to 30.000 frames behind?!
So I used wireshark + spdyshark to investigate some network traces to see which frames are on the wire. We’ve compared several tcp packets and it’s hex encoded spdy frame bytes on the server and client with what we see in our debug output. And it looked like that the server didn’t even send the 10k-30k frames which are missing on the client. Again indicating an issue on the server side.
So I went through the server code and tried to identify why so many frames might not have been written and if we queue them somewhere I was not aware of. We don’t. As described above StandardSession commits a single spdy frame to the wire and waits until completed() is being called. completed() is only called if the dataframe has been committed to the tcp stack of the OS.
After a couple of hours of finding nothing, I went back to investigate the tcp dumps. In the dumps I’ve seen several tcp zerowindow and tcp windowfull flags being set by client and server indicating that the sender of the flag has a full RX (receive) buffer. See the wireshark wiki for details. As long as the client/server are updating the window size once they have read from the RX buffer and freed up some space everything’s good. As I’ve seen that this happens, I didn’t care too much on those as this is pretty normal behavior especially taking into account that the new NIO layer is pretty fast in sending/receiving data.
Now it was time to google a bit for jdk issues causing this behavior. And hey, I’ve found a problem which looked pretty similar to ours:
https://forums.oracle.com/forums/thread.jspa?messageID=10379569
Only problem is, I’ve had no idea how setting -Djava.net.preferIPv4Stack=true could affect an existing IPv4 connection and that the solution didn’t help. 🙂
As I’ve had no more better ideas on what to investigate, I’ve spend some more hours on investigating the wireshark traces I’ve collected. And with the help of some filters, etc. and looking at the traces from the last successfully transferred frame to the top, I figured that at a certain point the client stopped updating it’s RX window. That means that the client’s RX buffer was full and the client stopped reading from the buffer. Thus the server was not allowed to write to the tcp stack and thus the server got stuck writing, but not because of a problem on the server side. The problem was on the client!
Giving that information Simone finally found the root cause of the problem (dang, wasn’t me who finally found the cause! Still I’m glad Simone found it).
Now a short description of the problem for the more experienced developers of concurrent code. The problem was a non threadsafe update to a variable (_interestOps):

private void updateLocalInterests(int operation, boolean add)
{
  int oldInterestOps = _interestOps;
  int newInterestOps;
  if (add)
    newInterestOps = oldInterestOps | operation;
  else
    newInterestOps = oldInterestOps & ~operation;
  if (isInputShutdown())
    newInterestOps &= ~SelectionKey.OP_READ;
  if (isOutputShutdown())
    newInterestOps &= ~SelectionKey.OP_WRITE;
  if (newInterestOps != oldInterestOps)
  {
    _interestOps = newInterestOps;
    LOG.debug("Local interests updated {} -> {} for {}", oldInterestOps, newInterestOps, this);
    _selector.submit(_updateTask);
  }
  else
  {
    LOG.debug("Ignoring local interests update {} -> {} for {}", oldInterestOps, newInterestOps, this);
  }
}

There’s multiple threads calling updateLocalInterestOps() in parallel. The problem is caused by Thread A calling:

updateLocalInterestOps(1, true)

trying to set/add read interest to the underlying NIO connection. And Thread B returning from a write on the connection trying to reset write interest by calling:

updateLocalInterestOps(4, false)

at the same time.
If Thread A gets preempted by Thread B in the middle of it’s call to updateLocalInterestOps() at the right line of code, then Thread B might overwrite Thread A’s update to _interestOps in this line

newInterestOps &= ~SelectionKey.OP_WRITE;

which does a bitwise negate operation.
This is definitely not an obvious issue and one that happens to the best programmers when writing concurrent code. And this proofs that it is very important to have a very good test coverage of any concurrent code. Testing concurrent code is not trivial as well. And often enough you can’t write tests that reproduce a concurrent issue in 100% of the cases. Even running 50 parallel threads each doing 500 iterations revealed the issue in only about every 5th to 10th run. Running other stuff in the background of my macbook made the test fail less often as it affected the timing by making the whole execution a bit slower. Overall I’ve spent 4 days on a single issue and many hours have been spent together with Simone on skype calls investigating it together.
Simone finally fixed it by making the method threadsafe with a well known non-blocking algorithm (see Brian Goetz – Java Concurrency In Practise chapter 15.4 if you have no idea how the fix works):
http://git.eclipse.org/c/jetty/org.eclipse.jetty.project.git/commit/?h=jetty-9&id=39fb81c4861d4d88436539ce9675d8f3d8b7be74
I’ve seen in numerous projects that if such problems occur on production servers you’ll definitely gonna have a hard time finding the root cause. In production environments these kind of issues will happen rarely. Maybe you get something in the logs, maybe a customer complains. You investigate, everything looks good. You ignore it. Then another customer complains, etc.
In tests you limit the area of code you have to investigate. Still it can and most of the times will be hard to debug concurrent code issues. In production code it will be way more difficult to isolate the problem or get a test written afterwards.
If you write concurrent code, make sure you test it very well and that you take extra care about thread safety. Think about every variable, state, etc. twice and then a third time. Is this really threadsafe?
Conclusions: Detecting concurrent code issues is not trivial (well I knew that before), I need a faster macbook (filtering 500k packets in wireshark is cpu intensive), Jetty 9’s NIO layer written by Greg and Simone is great and Simone Bordet is a concurrent code rockstar (well I knew that before as well)!
Cheers,
Thomas

4 Comments

Min Idzelis · 16/10/2012 at 11:55

Nice write up! Great detail into investigating a gnarly issue. Have you heard of IBMs ConTest? I’ve used it successfully for quickly forcing concurrency issues to reproduce. It doesn’t automatically solve the problem, but it can make it more obvious.
https://www.research.ibm.com/haifa/projects/verification/contest/index.html

Thomas Becker · 16/10/2012 at 12:10

Hi Min,
yes. I’ve successfully used ConTest in the past and then seem to have forgotten how useful it can be. I remember I had some issues with ConTest as well, but forgot what it was. Might have helped with this specific issue as well. Maybe I should give it another try with the next conurrent code issue. Good hint and thx for the comment!
Cheers,
Thomas

Eugene · 25/02/2013 at 05:01

Hi Thomas, thanks for the nice article!
Unfortunately, I cannot find the updateLocalInterests() method in release-9 branch in jetty source repo anymore, but I re-created a similar threading issue in my code.
Maybe I missed something, but if you modify a non-final class field in both synchronized and unsynchronized context, it can be easily spotted by modern static analysis tools, like Findbugs or built-in IntelliJ Inspections. However, if updateLocalInterests() is the only place where we mutate _interestsOps, the race becomes not so obvious.
Anyways, using static analysis tools to prevent concurrency issues might be a good idea.

Java synchronization is? | MeInfoBlogger.net · 04/12/2012 at 13:47

[…] that the rise in the number of cores, and concurrent programming will become increasingly important.What is Java synchronization ? And when the computer does not have an operating system, but it's not…which is now very inefficient. Because the program at any time in all the resources of the computer, […]

Why detecting concurrent issues can be difficult

Published by Thomas Becker on 16/10/2012

4 Comments

Min Idzelis · 16/10/2012 at 11:55

Thomas Becker · 16/10/2012 at 12:10

Eugene · 25/02/2013 at 05:01

Java synchronization is? | MeInfoBlogger.net · 04/12/2012 at 13:47

Comments are closed.

HTTP

Google App Engine Performance Improvements

General

If Virtual Threads are the solution, what is the problem?

Jetty

Security Audit with Trail of Bits

Why detecting concurrent issues can be difficult

Published by Thomas Becker on 16/10/2012

4 Comments

Min Idzelis · 16/10/2012 at 11:55

Thomas Becker · 16/10/2012 at 12:10

Eugene · 25/02/2013 at 05:01

Java synchronization is? | MeInfoBlogger.net · 04/12/2012 at 13:47

Comments are closed.

Related Posts

HTTP

Google App Engine Performance Improvements

General

If Virtual Threads are the solution, what is the problem?

Jetty

Security Audit with Trail of Bits