Now that we've restored users' photos and files, and the application has been reliably back online for three days, we want to provide more details about our outage on Friday and the steps we took to recover from it. We will post more information in the coming days about our plans to avoid future outages and compensate customers for Friday's downtime.
Before we get into the details, we want to reiterate just how sorry we are for the inconvenience this has caused. We know our customers rely on us to provide superior service and performance, and that on Friday we let you down. The fact that Friday's outage came on the heels of our performance issues in October is obviously frustrating, both for you and for us.
What happened
As you may know from our previous posts to Everything TypePad, we have been migrating our operations to a new state-of-the-art data center. Over the past two months we've nearly completed that move, and as part of it we upgraded our networking equipment, our bandwidth capacity, our application hardware and our storage systems. The maintenance window we took on Thursday night was designed to accomplish two things: configuring and testing high-performance, redundant network traffic load balancers, and configuring a recently added redundant "head" in our new, high-capacity disk storage system.
The first task on Thursday night went off without a hitch. The second task, however, was where the problem started. As part of the configuration of the new storage system, we needed to reboot the storage device. When we attempted to bring the device back online at 10:50 pm PST, a hardware failure occurred and damaged the index of the file system. Essentially, the system couldn't mount the drive, even though the data was still there.
This failure took down both published blogs and the TypePad application. After diagnosing the problem to understand its severity, we decided to serve published blogs from a snapshot that was between two and six days old. This is why some TypePad blogs were out of date.
Recovering TypePad and restoring your data
Through the night and into Friday, our operations engineers worked to diagnose the problem with our storage vendor, and to bring the application back online.
We knew that when we brought TypePad back up, users' posts, comments, TrackBacks, TypeLists and photo albums would be current in the application's database but out of sync with their published blogs. When we restored the application at approximately 3:00 pm on Friday, we encouraged users who were logging in to republish their weblogs to bring them up to date, and at the same time began a process to proactively republish users' content on their sites. We advised users that republishing would resolve some issues, but not those related to photos or files that they had uploaded to TypePad in the past several days.
On Saturday morning we completed our process of republishing users' weblogs. Also on Saturday morning, we worked with our storage vendor to bring the data with the damaged file index back online in a new unit that they rush-delivered to our data center. Once we were able to bring that machine up, we began the process of restoring users' missing photos and files to their weblogs and photo albums. Additionally, over the course of Saturday night and Sunday, we did more republishing of weblogs, photo albums and TypeLists to restore data and to address particular customer issues.
As of Sunday night at approximately 10:30 pm, we completed the data restoration process. If you are still having any unresolved or related issues, we encourage you to file a help ticket with our support team, and they'll work hard to get your problems resolved. You can file a help ticket by visiting the Help tab under the Control Panel inside TypePad.
We hope that this post provides a bit more detail about what happened on Friday and over the weekend. Again, we are very sorry for the inconvenience, and appreciate your continued patience. We will post more information in the coming days about our plans to avoid future outages and compensate customers for Friday's downtime.