Part 4 in a mini-series about developing a medium-sized website with the Play framework. See previous episodes: Playing with fire 3 - some baddies show up.
After cleaning the environment, the baddie hunt moves closer to what we understand better: someone's Scala or perhaps Java code.
One feature of the racerkidz wiki pages is the ability to share signed Scala scripts and execute them as part of the page display or of interactions with the page. The scripts access a limited context with data about the page, related pages and the user. For that I reused the same interpreter instance, because it is very expensive to create one.
So there I was, chasing another memory leak, since the interpreter can't properly clean up its internal somethings between running scripts. I still have a stupid guilty look on my face whenever I remember that my solution was to re-allocate the interpreter instance after every 20 script runs... but I have worse hacks in my own code, so whatever...
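For the curious, here is a minimal sketch of that recycling hack, written in plain Java rather than lifted from the site's code. RecyclingHolder, its factory and the instancesCreated counter are hypothetical names, and the synchronized is an assumption (a shared interpreter should probably not run two scripts at once); the real thing wraps a Scala interpreter, which is not shown here.

```java
import java.util.function.Function;
import java.util.function.Supplier;

/** Shares one expensive instance and rebuilds it after a fixed number of uses. */
final class RecyclingHolder<T> {
    private final Supplier<T> factory; // builds a fresh (expensive) instance
    private final int maxUses;         // uses allowed before the instance is thrown away
    private T current;                 // the instance currently being shared
    private int uses;                  // uses of the current instance so far
    private int created;               // total instances built, handy for monitoring

    RecyclingHolder(Supplier<T> factory, int maxUses) {
        if (maxUses < 1) {
            throw new IllegalArgumentException("maxUses must be >= 1");
        }
        this.factory = factory;
        this.maxUses = maxUses;
    }

    /** Runs work against the shared instance; synchronized so runs never overlap. */
    synchronized <R> R use(Function<T, R> work) {
        if (current == null || uses >= maxUses) {
            current = factory.get(); // drop the old instance along with whatever it leaked
            uses = 0;
            created++;
        }
        uses++;
        return work.apply(current);
    }

    synchronized int instancesCreated() {
        return created;
    }
}
```

With maxUses set to 20 this matches the hack described above: the leak is still there, it just gets garbage collected together with the old instance before it grows large.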
Anyways, I found this one while using jmap -histo to figure out what grows.
Lesson #6 - figure out your memory/GC settings quickly, not by hit and miss while already in production. Keep an eye on JConsole, send some load through the app and size both the JVM and the VM accordingly.
Lesson #7 - this tuning exercise will also point out leaks quickly, rather than later, when users do stuff and stress the site. Do this purification up-front and save yourself some sleepless nights, eh?
My understanding was that Play, by default, although multithreaded and asynchronous, only processes one request at a time. I was thus carelessly lazy in using a static cache... or two... but I knew that, and filed it at the back of my head and in some //TODO fix later comments. So, when unexpected things started to happen randomly, that was a good candidate.
So, I went back and sprinkled some synchronized keywords here and there, protecting access to some statics... while at it, I also added some Akka actors to spawn some work and speed up response times. I don't know if any of these were a big problem or not, since I did this and several other things at once (like fixing memory leaks, re-configuring Ubuntu etc.) and I didn't spend the time to analyze each, since I had other, functional problems to address.
Lesson #8 - please, never rely on single-threadedness: multi-thread proof your code from the start. Functional programming helps (the no side-effects mantra), as do actors, futures, synchronized keywords, whatever floats your boat.
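As an illustration of what "multi-thread proof" can look like for a static cache, here is a minimal Java sketch built on ConcurrentHashMap. PageCache, its String keys and values and the loader parameter are made up for the example and are not the site's actual cache.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** A static cache that is safe to hit from several request threads at once. */
final class PageCache {
    // ConcurrentHashMap does its own locking, so no synchronized blocks are needed here
    private static final ConcurrentMap<String, String> CACHE = new ConcurrentHashMap<>();

    private PageCache() {
        // static helpers only, no instances
    }

    /** Returns the cached value, computing it at most once per key if it is absent. */
    static String get(String key, Function<String, String> loader) {
        // computeIfAbsent is atomic on ConcurrentHashMap; the loader must not touch the map itself
        return CACHE.computeIfAbsent(key, loader);
    }

    /** Drops one entry, for example after the underlying page was edited. */
    static void invalidate(String key) {
        CACHE.remove(key);
    }
}
```

The design choice is to let the map do the locking instead of sprinkling synchronized around every access: a call like PageCache.get("home", k -> "rendered " + k) runs the loader on the first request only and serves the stored value afterwards, whichever thread asks.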
Remember how I ran the production site: fronted with an Apache running as reverse proxy? Now you do... well, my watchdog cron job just pinged the main site via the Apache proxy with wget, and whenever the response code was not 200, it would order a restart, assuming my Play app was stuck again.
After fixing a lot of other baddies, I started to wonder what the heck was still going on and started to log the actual HTTP error code. Surprise, surprise, it wasn't a 40x but a 502 "proxy error", which can mean anything... although it still points to Play not responding to Apache.
Anyways, I modified my script to check again and only order a restart if it gets two errors within 2 seconds, and I haven't had a single restart since.
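The real watchdog is a cron job around wget, which is not reproduced here. As a rough sketch of just the decision logic, in Java, with a hypothetical probe that returns the HTTP status of one check and a delay between the two checks that the caller picks:

```java
import java.util.function.IntSupplier;

/** Decides whether to restart: only when two checks in a row fail. */
final class Watchdog {
    private final IntSupplier probe;     // hypothetical: returns the HTTP status of one check
    private final long retryDelayMillis; // pause before the second check

    Watchdog(IntSupplier probe, long retryDelayMillis) {
        this.probe = probe;
        this.retryDelayMillis = retryDelayMillis;
    }

    /** True only if the first check and the re-check both return something other than 200. */
    boolean shouldRestart() {
        if (probe.getAsInt() == 200) {
            return false; // healthy, nothing to do
        }
        try {
            Thread.sleep(retryDelayMillis); // give the server a moment to recover
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false; // interrupted while waiting: do not restart on incomplete evidence
        }
        return probe.getAsInt() != 200; // a single blip is ignored, a repeat is not
    }
}
```

A lone 502 followed by a 200 therefore leaves the server alone, while two failures in a row still trigger the restart.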
Lesson #9 - don't restart your servers at the first sign of trouble. Give them a few seconds and if the troubles persist, sure - kill them mercilessly!
I still have this - I get 1-2 emails a day with a 502 fart, but no restarts. I do not yet understand what is happening, I am just happy the website has been running for 6 days now without restarts, during some heavy loads and serving tens of thousands of objects without a hitch.
This particular baddie is still at large. However, a much more interesting one will be revealed in the next installment... it will be reactive... it will be playful, so keep your eyes peeled... here it is: Playing with fire 5 - a playfully reactive baddie.