What happened on Tuesday?
February 12 2009, 9:59am
This post was guest-written by Mike Subelsky, OtherInbox Co-Founder
On Tuesday we had a problem with the main database server we use to power OtherInbox, which turned out to be a great demonstration of the benefits of the cloud computing infrastructure that OtherInbox uses. Our database had suddenly become unresponsive, which caused the website to stop working. Since our mail delivery system is designed to be heavily redundant and not depend on the database, message delivery was not affected (the messages just queued up until the database became available again).
We spent a few minutes trying to debug what had happened to the database, but as soon as we realized there would be no quick fix, we launched a new database server, something that is easy to do because our servers are hosted in Amazon Elastic Compute Cloud (EC2).
While the server was launching, I made a new copy of the disk volume where our database is stored. Since we use Amazon Elastic Block Store and take frequent snapshots of that volume, I had a very recent copy to work with. These two actions allowed me to rebuild the database in about ten minutes with no data lost. It took another five or ten minutes to get the rest of our cloud talking to the new database server, and then we were back online.
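For readers curious what the "pick a recent snapshot" step looks like, here is a minimal sketch in Python. It is not the tooling we actually used: the function names, the record fields (`snapshot_id`, `volume_id`, `state`, `start_time`), and the sample timestamps are all made up for illustration, and the snapshots are plain dictionaries rather than real API responses. It only shows the selection logic: take the newest completed snapshot of the database volume and report how old it was when the server failed.

```python
from datetime import datetime, timezone


def latest_completed_snapshot(snapshots, volume_id):
    """Return the newest completed snapshot of volume_id, or None if there is none."""
    candidates = [
        s for s in snapshots
        if s["volume_id"] == volume_id and s["state"] == "completed"
    ]
    # max() with default=None avoids an exception when nothing matches.
    return max(candidates, key=lambda s: s["start_time"], default=None)


def recovery_plan(snapshots, volume_id, failed_at):
    """Choose the snapshot to restore from and report its age at failure time."""
    snap = latest_completed_snapshot(snapshots, volume_id)
    if snap is None:
        raise LookupError(f"no completed snapshot for volume {volume_id}")
    return {
        "restore_from": snap["snapshot_id"],
        # Anything written after the snapshot started is not in the copy.
        "snapshot_age": failed_at - snap["start_time"],
    }


if __name__ == "__main__":
    # Hypothetical records; the IDs and times are illustrative only.
    utc = timezone.utc
    snapshots = [
        {"snapshot_id": "snap-a", "volume_id": "vol-db", "state": "completed",
         "start_time": datetime(2009, 1, 1, 8, 0, tzinfo=utc)},
        {"snapshot_id": "snap-b", "volume_id": "vol-db", "state": "completed",
         "start_time": datetime(2009, 1, 1, 9, 0, tzinfo=utc)},
        {"snapshot_id": "snap-c", "volume_id": "vol-db", "state": "pending",
         "start_time": datetime(2009, 1, 1, 9, 30, tzinfo=utc)},
    ]
    plan = recovery_plan(snapshots, "vol-db",
                         failed_at=datetime(2009, 1, 1, 9, 40, tzinfo=utc))
    print(plan["restore_from"], plan["snapshot_age"])
```

The pending snapshot is skipped on purpose: only a completed snapshot is treated as a usable copy in this sketch. The more often snapshots are taken, the smaller the `snapshot_age` window that has to be recovered some other way.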
When everything was up and running, I was able to kill off the old database server. We’ll never know exactly why it failed, but part of the philosophy of cloud computing is that you never depend on any one component lasting forever. Our site is running on 25+ EC2 commodity servers, which have a small failure rate. They are not guaranteed to run forever. In this case, our database server lasted 133 days before crashing. We’ve had other servers function over 300 days without a problem. For the most critical parts of our site, like the inbound mail servers, we run multiple independent, redundant servers in different parts of EC2, so that one outage would not affect the rest of the site.
In the future, as we grow our service, we plan to add that kind of redundancy to the database server so that occasional cloud hiccups like this will be imperceptible to our users. We’re sorry for the inconvenience but hope this gives you confidence that we are being very careful to preserve your valuable data.
Via: http://blog.otherinbox.com/2009/02/what-happened-on-tuesday.html

