Interesting MySQL or How to make better architectural decisions
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil" - Donald Knuth
This statement has been my mantra for the past several years as a developer, and I say developer cause I could hardly call myself a coder or even the almighty 'engineer' (though in my new title at Kiva I am on the engineering team). The reason is simply because I wouldn't place myself to that high of a standard. I'm moving towards it, and I hope to be there one day, but for now I will still keep my ears fully open to the lessons of those who are smarter and better coders than I.
So, that said, I made a mistake when architecting a system, and that mistake can be ultimately be attributed to a failure to optimize. Don't get me wrong I still believe premature optimization could be the root of all evil, but I forgot about a key piece. One needs to revisit those decisions from time to time and decide if *now* is the time to optimize. To be fair, I just switched jobs and just spent a week in Chicago at DrupalCon, so I had a fair number of reasons not to pay attention to a system I was only consulting on now. I also wear a lot of hats, so I had a lot of competing priorities when I made the decision. So, the decision under discussion here was basically choosing to use the locale module when we did not need it. We chose to use it for a non-standard use case and we went forward with it even though it has trouble running on ndbcluster tables. However, I'll start with an architectural decision that was correct, but which goes back further, the decision to use MySQL Cluster. I'm happy we took that route and that we showed the world that Drupal can run on MySQL Cluster, including putting up instructions on how to do it.
Choosing MySQL Cluster was a sound decision, however I stopped counting the number of times people have asked why we did that. Two reasons come readily to mind. The first is that the sysadmin responsible for the servers had some expertise in having previously set up a cluster. The second reason was that we knew MySQL Cluster was designed with a goal of being highly available with only a small caveat of possibly losing some transactions in some minor edge cases in case of total cluster failure (a cluster fuck if you will). My co-developer and I took this into consideration along with our goal to try to make the system as stable and fault-tolerant as possible, with automatic failover in cause of failure. In addition, we knew we wanted to scale, and MySQL theoretically allows for easy linear scalability. Overall though, our main goal was stability, because the product we were replacing, IBM Websphere, had a horrible track record on stability, and had left our users very frustrated (you can read through some past blog posts if interested more in the architecture). In the end, this was a good decision, we had a very stable portal that only experienced significant downtime once in over a year, while having served out over 2 million page views. More importantly, we met our user needs, especially those that all users are authenticated and we have points in time where a large number of users hit the site at once and performance a number of actions (times in which there would be 500 concurrent logged in users for periods of 30 minutes - 1 hour).
So, this track record was building well, and eventually led to some complicity, and then we hit a failure. Since I no longer work directly on these systems, only consult part-time, I didn't even see the problem until I checked my e-mail about 2 hours after it first occurred. Unfortunately, the sysadmin who had originally helped us set up the system had moved on to greener pastures, so the expertise was lost, and the sysadmin who had taken over basically had thrown in the towel, as had my co-developer. So, naturally, I took it upon myself to step in and figure out what the problem was and get things running. I discuss the problem in more detail below, but basically we had a big problem with queries on locales_source table and the site crawled to a halt and then died. So I spent from 11pm - 3 am debugging the exact issue, figuring out the steps for a fix, making that fix reproducible, implementing it, then informing the users of why we had a problem. Had I made a better decision in the first place, I would have gotten to actually sleep more than 1 hour that night.
Before I get into the problem though, I want to discuss "the website is down" strategies. I'll note now that if you haven't seen the comedic video of such, then check out Web Guy vs. Sales Dude over at thewebsiteisdown.com. Now, how many of you have emergency outage strategies? If you don't have one, you should, because one day, your site will go down. I'm not going to name sites, but just look in the news of the past months and you will see a lot of big names. Unplanned outages stand in good company, they probably even have parties where they all get wasted and you are the designated driver that has to return them back to a normal state. The point is (I do have one, I promise), if you don't have a strategy for handling unplanned outages, then you are going to either be spending a lot of time obsessing over your Nagios alerts (what, you are at least using Nagios or some other fault-notification on your servers right?) or you are going to be left with a dead site for much longer than you would like at times. However, I was surprised at the responses made in this case, as they failed to first address the outage issue before attempting to find solutions. So, with that said, I first implore you to at least have the following in place:
- A Help Desk person who will answer calls and reassure customers when the website is down. Ideally, at all hours that your site is expected to be in operation. *Hint: You don't want to be the person getting a call at 1am.
- A checklist for said help desk person to follow. If there is no help desk person, then this is the plan that you will follow. It's important to write this down so you actually follow all the steps. Ideally, this checklist would have the steps necessary to show that the site is actually down, not just a problem for a particular person, particular region, or particular page.
- An easy to use and external medium to communicate problems to your customers when they occur. Many people choose to use twitter for this purpose, but any medium that is not reliant on the infrastructure of your primary site is a good alternative.
- A static HTML page with an outage message that just needs to be copied into place.
- Automated notifications of outages or severe problems Using Nagios is a great idea and not that hard to set up.
- Automatic site testing (e.g., SimpleTests or PHPUnit tests that run via a Continuous Integration tool). This is more of a nice to have, however it will help prevent a lot of problems
Now, assuming you have those in place, it's going to be a lot easier to follow the steps for handling your outage. In many cases, outages are simply caused by large influxes in traffic, and that was the case here (though not the root cause, but the large increase in traffic was the trigger). I'll note two things before continuing, I am not a DBA, though I do tend to play one on the internet, and I'm not a MySQL expert, though I know enough to usually figure out why a problem is occurring.
First, you will need to gather a little data, which means letting your system continue to crawl for a small amount of time.
- SSH into your box. Wait patiently
- Check the performance and network load of the system. On Linux, run "top" and then "ps aux" to try to determine where your bottleneck is. Often times, it's going to be either Apache (or your web server of choice) or MySQL (or your db of choice). The main point is, try to determine if it is that you have run out of memory, or out of CPU power, and narrow it down to what is taking the most resources.
- In our case MySQL seemed to be having issues, even though the load wasn't showing to be maxing out. So, we take the next step of examining further
- Open up a MySQL prompt (again wait patiently, if it's taking too long still, then skip to killing Apache).
- Run "show processlist;"
- If you see long-running queries, especially ones that are listing out times in many seconds, then these are probably causing an issue.
- Note the bad queries and check for any other issues on the system.
- Shutdown Apache. Completely.
- Move your temporary static HTML page with outage message into place as index.html. It's best to have your setup in place with a rewrite rule you can just uncomment and Apache will start serving the index.html page before index.php but you can also just temporarily move index.php. Alternatively, you can turn on Maintenance mode in Drupal, but if you are having database problems at this point, this isn't going to solve your problems like a static HTML page will.
- Start Apache
- Determine the exact problem. In our case, I could see that the locales_source table was being loaded with a SELECT query along with a JOIN and these queries were taking a large amount of time. I also knew that we had a recent influx of large amounts of traffic due to the release of offer letters
- If the fix is quick, then make a patch, commit it, and then re-enable Drupal.
- If the fix is going to take some time, see if you can make a temporary workaround. In our case, clearing out the locales_source table, and then rebuilding the translations temporarily resolved the issue
- Remove your temporary HTML maintenance page and if necessary comment our your rewrite rule (or take Drupal out of maintenance mode if you were able to go that route
- Contact your stakeholders with a brief but open explanation of why you suffered an outage
Now, this isn't to say using locale is bad, however, the reason we decided to use it was bad. Also, the things we did to get it to work on ndbcluster were bad. Basically, we were looking for a way that we could allow site admins to change text in code (text that we had wrapped in t() calls of course) in half a dozen custom modules. We were told they would want to change this text often and that we would want them to do it themselves. So, we decided to use the locale module to create our own custom subset of English language, then set it as default, then in the our t() functions we would enter just a symbolic string, which got translated to the actual text in our po files. Thus, admins could just go to the translation interface in Drupal to make changes. We noticed an impact at the time to performance but it seemed a small price to pay, and at the time no one cared if the site ran at 2 seconds or 8 seconds because we had users who were bought in. Students needed to perform business functions, so they were forced to wait on our occasional slowness. This worked quite well for quite some time. As stated we had a full year of stability, but then we got hit by a larger sudden influx of traffic than we had received ever before in the past year. This combined with our poor usage of locale led to a perfect storm of calamities. To fix this we could have gone with a different solution, but no one outside my co-developer and I actually ever used the translate interface. So, it was a complete waste of time, effort, and queries. We implemented the fix of removing all our fake filler strings and filling them with the real string text, and then removing the locale module. However, we did put ones we thought might actually change in variable calls, so that we could provide an administrative interface to easily change the text. Our load times became a much more respectable 1.5-2.5 seconds (something we plan to bring down even further, but is sufficient for this moment in time until we are actually given time to work on performance. Furthermore, our stability issues were solved once again, hopefully for another full year of stable goodness).
Just to show that we did indeed get a big jump in traffic that was beyond our expectations, here is part of the Analytics graph:
What did I learn? I learned that your co-developers may be smarter than you, and brilliant all around, but they might still suggest you towards a path of a bad decision, and that regardless, if you are one the one making the decision then you should be ready to take responsibility if things go awry. I take responsibility for this problem, though I'm sure if we had more time we would have optimized it away, but in reality, time is scarce and feature requests are always coming down the line. I also learned that it's rather important to come back to these kind of decisions every few months and review various pieces of data to figure out if we need to start optimizing, regardless of the current workload.
As always, I'm happy to receive comments, even critical ones telling me that I should be burned at the stake for making certain decisions, as I don't claim to necessarily be an expert in any of these technologies, just a determined developer with a passion for evangelizing Drupal, taking it to the far frontiers. I hope that if nothing else, you will takeaway from this that it is important to have an emergency outage plan in place, because at some point your site *will* go down, it is only a matter of time. Be prepared.