Archive for March, 2009

Service unavailable yesterday

Thursday, March 12th, 2009

aNobii was down intermittently yesterday.

What happened

All our web servers went down at the same time. We restarted the servers and re-opened the site. After a while, the load on the web servers climbed until they went down again. This cycle repeated itself throughout the 12-hour outage.

What caused it

We have built redundancy into all our hardware. In fact, we just added a backup load balancer (the machine that directs traffic to different web servers) at the beginning of the week. The chance that the hardware of all the web servers failed together was extremely low. That, along with other signs we were seeing, ruled out a hardware problem. It wasn’t the traffic either: the incoming traffic pattern looked normal.

That left us with the software, but it was difficult to narrow the search further. We looked at the heavy-duty programs like search indexing and the similarity engine. We spent hours looking but found nothing. Finally, tipped off by two unrelated books that strangely shared the same number of readers, we found a book that, as a result of a series of merges, was recorded as having over 100,000 editions (!). Certain programs that touched this book duly went nuts. We fixed it and re-opened the site an hour after we found the culprit.

What now

We are reviewing the book merge process at the moment. The merging function is temporarily offline until we’ve added safeguards to prevent this from happening again.
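
One possible safeguard, sketched here purely as an illustration and not as aNobii's actual code, is to refuse any merge that would leave a book with an implausible number of editions. The `Book` class, the `merge_books` function, the `MergeRejected` exception and the cap of 500 are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field

# Hypothetical cap: the right value would depend on real catalogue data.
MAX_EDITIONS = 500


class MergeRejected(Exception):
    """Raised when a merge would produce an implausibly large book."""


@dataclass
class Book:
    book_id: int
    edition_ids: set = field(default_factory=set)


def merge_books(target: Book, source: Book, max_editions: int = MAX_EDITIONS) -> Book:
    """Merge source's editions into target, unless the result exceeds the cap."""
    combined = target.edition_ids | source.edition_ids
    if len(combined) > max_editions:
        # Leave target untouched so a rejected merge has no side effects.
        raise MergeRejected(
            f"merging book {source.book_id} into {target.book_id} would give "
            f"{len(combined)} editions (limit {max_editions})"
        )
    target.edition_ids = combined
    return target
```

A rejected merge could then be queued for manual review instead of being applied automatically, so that a chain of bad merges cannot snowball into a single oversized record.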

We are very sorry for the interruption!