Part 5 in a mini-series about developing a medium-sized website with the Play framework. See the previous episodes: Playing with fire 4 - serious baddies join the party.
So, remember that the basic instability was that Play would time out some requests. These would either never return or take minutes - neither of which made any sense. Play threads are supposed to return and, in a worst-case scenario, display an exception... right?
At this point, I put in place the ping page described in Playing with fire 2 - the instability comes into play, and one of the things I kept track of was how many requests were being processed "right now", via the Globals logging filter given there. This number should always have been 1 (or tending to 1), but it was growing over time... yeah, it did not make any sense - unless some Play threads were returning or being terminated without me catching it. If you look at my code - and while I'm not a great Java developer - I don't think that should be possible: the "serving" counter should always decrease, whether the processing completed normally or with an exception. Puzzler no. 1.
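To make the claim concrete, here is a minimal sketch of the counter idea - increment on entry, decrement in a finally block. The class and method names are mine, not the actual filter code from the earlier post; the point is just that every exit path, normal or exceptional, runs the decrement:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a "serving" counter: the finally block runs on
// every exit path, so the count can only leak if the framework never
// returns control to this stack frame at all.
public class ServingCounter {
    private static final AtomicInteger serving = new AtomicInteger(0);

    public static int current() { return serving.get(); }

    public static String handle(Runnable controllerBody) {
        serving.incrementAndGet();
        try {
            controllerBody.run();
            return "ok";
        } catch (RuntimeException e) {
            return "error: " + e.getMessage();
        } finally {
            // decrement happens whether we completed or threw
            serving.decrementAndGet();
        }
    }
}
```

With ordinary blocking request handling this invariant cannot break - which is exactly why a counter that keeps growing points at threads being parked or abandoned somewhere outside your own code.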
I put together a heavy, super-multithreaded stress test to reproduce those errors, and stumbled onto puzzler number 2 while looking at the log file: you can see a netty thread getting the request, passing it to a play thread, then to some akka thread and what not - at some point you see the thread logging LF.start, and that should be the thread processing the body of the controller.
See the log here.
Except, this thread basically stops in the middle - never in the same place - and, after a while, it is reused to process another request.
This is when you throw everything you got at it:
top shows your process doing nothing
jstack says all threads are bored and doing nothing, and you just get more furious: "where the heck is it???"
Yeah, I know - I did throw my hands in the air and posted a help request on the play-framework forum, but got no response except someone else wondering what I was smoking, since this is impossible.
Lesson #8 minimize asynchronous stuff if you don't have a good logging & debugging framework for these requests. You will be pulling your hair out when things go wrong, since there is no way to figure out what the heck happened to parts of your request. You need to log often, using the same logical request ID, to be able to correlate all those mini-threads back and figure out what the heck happened. Trust me - if your code base may ever grow past 5,000 LOC, don't go there.
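The correlation idea can be sketched in a few lines. In a real app you would use something like SLF4J's MDC; this stand-in just tags every log line with one logical request ID so work scattered across netty/play/akka threads can be stitched back together afterwards. All names here are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical request-correlated logger: one logical ID per request,
// carried along explicitly across every thread hop.
public class RequestLog {
    private static final AtomicLong nextId = new AtomicLong(1);
    private static final List<String> lines = new ArrayList<>();

    public static long newRequestId() { return nextId.getAndIncrement(); }

    // every piece of async work logs with the same id, whatever thread runs it
    public static synchronized void log(long reqId, String msg) {
        lines.add("req-" + reqId + " [" + Thread.currentThread().getName() + "] " + msg);
    }

    public static synchronized List<String> linesFor(long reqId) {
        List<String> out = new ArrayList<>();
        for (String l : lines) {
            if (l.startsWith("req-" + reqId + " ")) out.add(l);
        }
        return out;
    }
}
```

The crucial discipline is passing the ID along at every hand-off; the moment one callback logs without it, that piece of the request becomes invisible again.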
Lesson #9 do not mess around with continuations. They are a theoretical curiosity with no applicability in real life. Please call them by their real name, which is setjmp/longjmp, and leave it at that. In real life, people need to debug code that does stuff, and when a thread disappears in the middle of doing something, you expect it to be stuck somewhere.
The only way that could happen (threads disappearing mid-work and re-appearing doing something else) is if someone abuses continuations and drops state when things go bad... or just actors forgetting their lines or some such. Debugging and understanding this takes not only specialized skill but a lot of brain power, especially when you're in the middle of something else.
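A toy illustration of the symptom, assuming (as I can only guess from outside the framework) that the controller body is not one run-to-completion call but a chain of small tasks. When step 1 hands step 2 back to the pool and that hand-off is lost or its error swallowed, the request silently dies with no exception, while the same pool thread gets reused for other work - exactly the "stops in the middle, then serves something else" pattern in the logs. Everything here is invented for demonstration:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo: a "controller" split into two pool tasks. If the
// second half is never scheduled, the thread just moves on - no error,
// no final log line, and the request is gone.
public class VanishingWork {
    public static List<String> run(boolean loseSecondHalf) throws Exception {
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch done = new CountDownLatch(1);

        pool.submit(() -> {
            log.add("LF.start"); // looks like the controller body began
            if (!loseSecondHalf) {
                pool.submit(() -> { // the "continuation": the rest of the work
                    log.add("LF.end");
                    done.countDown();
                });
            }
            // if loseSecondHalf, nobody ever schedules the rest:
            // no exception, no log line, the thread takes the next task
        });

        done.await(200, TimeUnit.MILLISECONDS);
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.SECONDS);
        return log;
    }
}
```

Note that top and jstack are both telling the truth in the lost case: nothing is running and nothing is blocked, because the remaining work simply does not exist anymore.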
Lesson #10 don't drink the "code must be massively reactive/asynchronous just because it can be" kool-aid. Not unless you have a very good gluing and logging framework and the discipline to allow debugging the mini-threads contributing to a larger thread etc. I may blog the proper way to do those eventually, time-permitting.
There have also been some baddies that only showed up infrequently, like this one for instance, pointing to some impedance mismatch between netty and Play.
I haven't seen this one since that fateful day where all the above took place...