Scaling a Monolithic Server
A journey in scaling a live web application
When I first joined WiFiSPARK, one immediate challenge was to tackle the overloaded SPARK platform. The company’s growth had been remarkable, and the number of service subscribers was climbing every day. The platform and development team had built a 32-processor, 128-core beast of a machine, with 128 GB of RAM and a SCSI RAID disc array. Yet even this machine was creaking under the load of a constantly growing user base: a huge monolithic application that was reaching its vertical limit.
Status report
On investigation it transpired that the problems were even more deep-seated. The initial code for SPARK, when it was first commissioned, was an elegant MVC (Model View Controller) pattern, put together using well designed object-oriented PHP. However, a succession of less gifted developers had worked on the project since. Poor specification, short timescales, and code without tests had evolved, well, mutated, the code base and its database schema into an ogre with bad breath and a beastly attitude.
What should have been an elegant controller initiating a routing pattern was now an 8,000-line if-then-else, case and switch dispatcher masquerading as a mess of spaghetti code. To make matters worse, many of the code changes had been accompanied by database changes implemented in spite of, rather than in keeping with, the original schema. The result was that each release had reduced application performance and extended the range and complexity of the database queries. In some cases, building a login form view took 18 individual database calls, many of them performing joins across data tables.
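If you want to see that kind of query bloat for yourself, one low-effort check is to count the statements the database runs while a single page renders. A minimal sketch, assuming a MySQL back end; the portal URL is purely illustrative, and the general log adds load, so only leave it on for a moment:

    # Turn on MySQL's general query log just long enough to capture one page load.
    mysql -e "SET GLOBAL general_log_file = '/tmp/one_page.log'; SET GLOBAL general_log = 'ON';"
    curl -s -o /dev/null https://portal.example.com/login    # hypothetical login view
    mysql -e "SET GLOBAL general_log = 'OFF';"
    # Count the SELECTs captured (on a quiet test system this is effectively that one page).
    grep -c 'Query.*SELECT' /tmp/one_page.log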
It’s not unusual
Don’t think this case is unique by any means; it’s more the rule than the exception. I have unpicked many systems in this state. Interestingly, it is a much more common issue in proprietary software than in Open Source. Perhaps that’s because peer scrutiny is a motivator for writing good code.
Now, a word to the wise. In such circumstances you can’t go back to the company executives and say “Folks, it’s time to re-write your application!” Such advisories usually get very robust and decisive responses!
The outer limits
However, the story was not yet over, and neither were the problems. I suspected that we were already hitting the limits of the machine, and instinct told me that our ceiling was disc input/output topping out and binding the database. To check this we installed “atop” and “sar” on the machine. “atop” is similar to the familiar Linux “top” command line tool, but it provides much finer-grained detail, letting you see network, memory, disc and processor load and throughput in real time. “sar” is more involved and requires a little more setup: it runs for a period of time and generates a log output from which you can build a picture of the machine’s behaviour. A good approach is to use atop in real time to get some idea of where the system is being loaded, then configure “sar” to monitor the resources you suspect are the issue and deliver a report over time. For us a 24-hour period was a good marker, as there is a distinct pattern of daily usage on WiFiSPARK’s service.
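For reference, a minimal sketch of that setup on a Debian-style system; package names and paths may differ on your distribution:

    # Install both tools: atop for live inspection, sar (part of sysstat) for sampling over time.
    apt-get install atop sysstat
    # Watch the machine live, refreshing every 5 seconds; press 'd' to focus on disc activity.
    atop 5
    # Sample disc (-d) and memory (-r) every 60 seconds, 1440 times (24 hours), into a binary log.
    sar -d -r -o /var/log/sar_day.bin 60 1440
    # Replay the log later to see the daily usage pattern.
    sar -d -f /var/log/sar_day.bin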
The results were exactly as I had suspected: the database was disc bound. It’s interesting, though, because with a less detailed tool such as “top” it looks as if the system is using all its memory and the CPUs are overloaded. Small wonder, then, that the systems administrators had chosen to add more and more memory and CPUs, to little effect.
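The tell-tale signature, if you want to check your own machine, is high I/O wait and a saturated device rather than genuinely busy CPUs. Something like the following, again from the sysstat suite, makes it obvious:

    # Extended per-device statistics; %util near 100 and a climbing await mean the disc is the ceiling.
    iostat -x 5
    # CPU breakdown; a large %iowait with a modest %user is the classic disc-bound pattern.
    sar -u 5 12
    # Much of the "used" memory is often just page cache, not real application pressure.
    free -m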
The runaway train
In a disc-bound data system, the database server is trying to open and close files on the disc, load data from those tables into memory (a join), then iterate over that data in memory (CPU intensive). Eventually the database server’s process threads start blocking on I/O to the disc. As this happens the database begins queueing requests (more memory). The system starts to overload, and the operating system kernel steps in with its scheduler, trying to balance the performance of the machine and share limited resources across the running services. The time to execute each database query extends, and when a single view paint takes 18 queries to build, the time to render the view to the browser extends with it. What happens next is really interesting. The HTTP server starts to load up as it waits for the application to return pages for it to send back to the browsing client. But more requests keep coming in, and the HTTP server continues to hand these off to its PHP handler. In our case these two services were nginx (the HTTP service) and php-fpm (the PHP handler). php-fpm starts to queue requests, and those requests take longer and longer to service. Now, php-fpm has a protection mechanism that limits how long a script can run, and if that limit is breached it will kill that process thread.
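The limits involved live in php-fpm’s pool configuration and in php.ini. A quick way to see what your installation will actually do under pressure; the paths here assume a Debian-style layout:

    # request_terminate_timeout is the hard per-request kill described above;
    # pm.max_children caps how many worker processes php-fpm will run at once;
    # listen.backlog bounds the queue of waiting connections.
    grep -R -E 'request_terminate_timeout|pm\.max_children|listen\.backlog' /etc/php* 2>/dev/null
    # max_execution_time is PHP's own script time limit.
    grep -R 'max_execution_time' /etc/php* 2>/dev/null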
So, the day starts and users begin to connect, but just like rush hour there are certain times of day when everyone wants to get on the WiFi. From 7am the load would begin to rise, peaking each day by about 10am. On busy days the chain of events described above would begin, and if the load came on quickly or intensely enough, a self-perpetuating cycle would start. The database query time would lengthen, php-fpm would start to queue, then php-fpm would start killing requests; nginx would re-issue the request (usually prompted by the calling client); the queue would fill until php-fpm hit its thread limit, and nginx would begin to queue until it hit its child process limit. Memory use would flood the machine, and the kernel scheduler would start trying to swap memory out to disc. This steals disc I/O from the database, fueling the fire. Then requests issued by nginx to php-fpm start to fail. nginx has no choice but to return a “504 Gateway Timeout” to the client, and users start dropping off the service.
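On the nginx side the equivalent knobs, and the point at which a slow upstream becomes a 504, look something like this; stock paths assumed:

    # worker_processes and worker_connections bound how many requests nginx will hold open;
    # fastcgi_read_timeout is how long it waits for php-fpm before giving up with a 504.
    grep -R -E 'worker_processes|worker_connections|fastcgi_read_timeout' /etc/nginx/ 2>/dev/null
    # Watch the failures arrive in real time during the morning peak.
    tail -f /var/log/nginx/error.log | grep -i 'upstream timed out'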
Back on the system, however, we now have a runaway race condition, because the kernel keeps trying to swap out processes, so the database gets slower still. But the Linux kernel is good, really, really good. When it reaches its resource limits it falls back to its next protection mechanism: the Out of Memory killer. In our case it would usually start scything through php-fpm processes, which would unload the system and get things back under control. If we were unlucky, the kernel would take the “nuclear option” and kill the database.
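You don’t have to guess whether the OOM killer has been at work; the kernel logs every victim it chooses:

    # Kernel ring buffer with human-readable timestamps; each kill names the process it chose.
    dmesg -T | grep -i -E 'out of memory|oom-killer'
    # On systemd machines the same messages are in the journal.
    journalctl -k | grep -i 'oom'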
Rick’ll Fix It
Anyway, the instructions I was given were clear: “If you could just fix that for us, that would be a good place to start!!”
In this situation there is no way up; you simply cannot vertically scale any further. No amount of CPU or RAM will fix your woes, and swapping the disc I/O subsystem for something faster is onerous, expensive, and will only buy you a little more time. Implementing a RAM disc is a hugely expensive option, although if your back is truly against the wall it would get you out of a corner, for a while.
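For completeness, the RAM disc escape hatch is only a couple of commands. A minimal sketch, assuming you point the database’s temporary space at it; the mount point and size are illustrative, and anything on tmpfs vanishes at reboot, so the data directory itself must never live there:

    # Create a tmpfs-backed mount, 8 GB of RAM in this illustration.
    mkdir -p /mnt/ramdisk
    mount -t tmpfs -o size=8G tmpfs /mnt/ramdisk
    # e.g. for MySQL, set "tmpdir = /mnt/ramdisk" in my.cnf and restart the database service.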
The only way forward was to scale horizontally: partitioning the service implementation and distributing the pieces across multiple servers. The tricky bit was doing that while maintaining service, but we did it, and thus began our journey towards a multiple-instance distributed system.
With four SPARK monolith machines sharing the service load, and our site implementations partitioned across them, we were back in control, with headroom for further business growth. It was still clear in my mind, though, that we would need to find a way to re-write SPARK and address the software’s architectural issues.