Part 2 in a mini-series on developing a medium sized web app with Scala and the Play Framework. See Part 1 here.
Like in all cheap old sci-fi movies about chasing singularities and whatnot, this episode is called "the instability comes into play"... quite literally.
Before we continue, it is important to note that, while this series is mostly critical and may seem negative, Linux+scala+sbt+play is still my development stack of choice - i.e. there's nothing as good, overall, out there. Having said that though...
While in heavy development, the website was up - restarted often because of updates and whatnot. It has always been unstable, though, with the Play process often non-responsive. I ended up with a cron job that pings it every few minutes and, when it doesn't respond, kill -9s all Java processes on the box and restarts the app, in a loop until sbt cooperates.
The "production" server was a 512 MB VM on Rackspacecloud, later upgraded to 1 GB, which is quite small and magnifies any kind of memory problem, but I was already spending around $60/month - lots of beers... I had to make it work.
The biggest issue with restarting it was that sbt was uncontrollable and sometimes spawned processes that locked the socket in some way that prevented the thing itself from restarting. The solution was a pray-loop in there, killing any Java process living on that box and restarting until it was working.
That was while using play start to run the production server. That is the most efficient distribution model while in development, because you keep the same directory structure locally and remotely and just upload file diffs; upon start, the thing recompiles what it needs. It's fast to upload and it's easy to know what's what at all times.
Because sbt started recompiling everything every time, and I kept having those rogue processes popping up, I eventually changed to the play dist model, which took sbt out of the picture but makes updates to the site quite slow, since the entire distribution is about 80 MB and the one relevant jar file, with my code and resources, about 7 MB. From home, over DSL, it now takes 3-5 minutes to deploy an update. I'm guessing a hybrid model is best, where you upload a source diff, issue a play dist in some area on the server, then do a quick switch of jars during restart...
The play dist model did, however, simplify the server scripts massively. I will share all my scripts in part x, in a few days.
Lesson 3: Use play dist to run the production system, especially on small-budget VMs. Invest in a good distribution model, like the hybrid above - it is well worth it down the line.
At some point, the instability got so bad - this is when I started having tens of real users sign up and fill out forms on the site - that I had to fix it asap. I found many things while chasing it and changed a lot within a very short time, so I don't know which was the real cause (or causes). Here's what I remember...
This was quite hard to debug - there were many things conspiring against my health, not just beer and lack of sleep. Every single component of the running stack was messing with me in one way or another. Yes, I'm still sane. Here are some of the tools and tactics I used while chasing it down.
One thing messing with me right now - do you know that github gists cannot be created using IE9 ? I just switched to Chrome and they now work fine. The entire world is insane... or is it just me?
htop is extremely useful, quite obviously. It shows you which processes use how much memory, and it was the first to show that there was a bunch of Java processes consuming lots of memory. Or not, which was my first problem.
ps xaww | grep java tells you what these processes are: some are sbt, some are Play, some are stupid stuff I had forgotten was running.
kill -9 is your friend. Here's the magic command:
kill -9 `ps -u $USER | grep java | sed 's/^ //' | cut -f 1 -d " " | tr '\n' " " ` ... works like a charm, to quiet any rogue Java process.
I now have a ping-info page which queries the standard JMX beans for system properties (like memory usage etc.). Extremely useful: get one! You can use it in a script, save it, see it, email it...
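A minimal sketch of what such a page could report, using the JVM's standard platform MXBeans (the object and field names here are mine, purely illustrative - a Play action would simply return this string):

```scala
import java.lang.management.ManagementFactory

// Build a plain-text status report from the standard platform MXBeans.
// Easy to curl from a cron job, diff, or email to yourself.
object PingInfo {
  def report: String = {
    val mem     = ManagementFactory.getMemoryMXBean
    val heap    = mem.getHeapMemoryUsage
    val nonHeap = mem.getNonHeapMemoryUsage
    val threads = ManagementFactory.getThreadMXBean
    val os      = ManagementFactory.getOperatingSystemMXBean
    val rt      = ManagementFactory.getRuntimeMXBean

    s"""uptime.ms=${rt.getUptime}
       |heap.used=${heap.getUsed}
       |heap.max=${heap.getMax}
       |nonheap.used=${nonHeap.getUsed}
       |threads.live=${threads.getThreadCount}
       |threads.peak=${threads.getPeakThreadCount}
       |load.avg=${os.getSystemLoadAverage}
       |""".stripMargin
  }
}
```

Wiring this into a route costs a few lines and pays for itself the first time the box misbehaves.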
jmap -histo is very useful when debugging memory leaks. In my cron job, it is now run automatically when the process is non-responsive. It helped debug one of the leaks. It comes with the JVM, like the console.
The Java console is extremely useful in debugging memory configuration and memory leaks. It is easy to enable in your running process, easy to use remotely and paints a very good picture of the memory allocation and leaks, when under a long running stress test.
Enable the console in the server process this way:
-Dcom.sun.management.jmxremote.port=1234 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=true -Dcom.sun.management.jmxremote.access.file=$APP/jmxr.a -Dcom.sun.management.jmxremote.password.file=$APP/jmxr.p
Don't forget to give the two access and password files the right permissions - this is imperative: the JVM will refuse to start if the password file is readable by anyone but its owner, so chmod 600 them.
To debug heap allocation and memory leaks, you need to reproduce them fast. I have my own simple/sweet web-testing framework - very useful to programmatically create test cases, loops, forms etc. and check that some string you were looking for is returned. Run a loop with some users while observing the app via JConsole, to fine-tune your memory settings or find memory leaks.
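As a rough, framework-agnostic sketch of such a loop (my actual tests use Snakked; the URL and marker string below are placeholders, and the fetch function is a parameter so the sketch stays self-contained):

```scala
import scala.io.Source
import scala.util.{Try, Success}

// Hit a page repeatedly and check that the response contains an expected
// string - run this while watching the app in JConsole to surface leaks.
object StressLoop {
  case class Result(ok: Int, failed: Int, totalMs: Long)

  def run(iterations: Int, marker: String)(fetch: () => String): Result = {
    var ok, failed = 0
    val start = System.currentTimeMillis
    for (_ <- 1 to iterations) {
      Try(fetch()) match {
        case Success(body) if body.contains(marker) => ok += 1
        case _                                      => failed += 1
      }
    }
    Result(ok, failed, System.currentTimeMillis - start)
  }
}

// Real usage against a live server (placeholder URL):
//   StressLoop.run(500, "Welcome") { () =>
//     Source.fromURL("http://localhost:9000/").mkString
//   }
```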
Combining it with ScalaTest is a winning move. I have regularly-run unit tests as well as stress/perf tests for:
For more details, see Simple website testing with Snakked.
At times, some requests were disappearing into thin air while others were never releasing their threads. Some threads were being reused without their results ever being sent anywhere... so, another useful piece was instrumenting Play itself. I didn't know if it was Apache, Mongo, Rackspacecloud, Play, Netty, me or 10 other things fudging it up, so you should put these hooks in upfront.
In Global.scala, intercept the request processing and just wrap it in some simple logging and time measurement - I have a stupid version here, all written under stress, so there's a lot of copy/paste:
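The original gist isn't reproduced here; as a rough, framework-agnostic sketch of the idea (in a Play app this would wrap the real handler from Global.scala - the names below are mine, not from the original):

```scala
// Wrap any request handler in begin/end logging plus a wall-clock timer,
// flagging slow requests. The handler is a by-name parameter so the sketch
// works without Play on the classpath.
object RequestTimer {
  def timed[A](path: String, slowMs: Long = 1000L)(handler: => A): A = {
    val start = System.currentTimeMillis
    println(s"BEGIN $path")
    try {
      val result  = handler
      val elapsed = System.currentTimeMillis - start
      // These logs are what expose requests that vanish or never return.
      val tag = if (elapsed > slowMs) "SLOW " else ""
      println(s"END   $path $tag(${elapsed}ms)")
      result
    } catch {
      case e: Throwable =>
        println(s"FAIL  $path after ${System.currentTimeMillis - start}ms: $e")
        throw e
    }
  }
}

// e.g. RequestTimer.timed("/signup") { /* invoke the real action here */ 42 }
```

A matching BEGIN without an END in the logs points straight at the request that swallowed its thread.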
Lesson 4: Be prepared, be prepared, this lesson can be shared... no, really. Learn these tools and set all this stuff up before you need it - especially before you expect lots of users on your site. Otherwise your sleepless nights will become many.
That's it for now - in the next episode, we'll chase down the bad guys, using these tools as well as our powers of deduction. See Playing with fire 3 - some baddies show up.