Rocket.net Redefines Disaster Recovery for WordPress
Let’s talk about the thing no one ever wants to talk about in tech: disasters. No matter which cloud, container, or bare metal infrastructure you’re on, 100% uptime is just not realistic. Why not?
As much as we all strive for 100% uptime, things can happen – especially around holidays for some reason! In my twenty-two-year career, I’ve seen data centers catch fire, server chassis completely die, storage filers go down for days, you name it.
Then, this fancy new technology came to market: “cloud”.
You see, what everyone originally thought about cloud computing was that a bunch of nodes would work together to create high availability and ultimately achieve 100% uptime. The problem? Reality set in. One hundred percent uptime isn’t always possible, especially when it comes to things like networking and power delivery.
I can’t tell you how many times in my career I’ve had infrastructure folks tell me “Our platform should be just like Google’s: If one rack goes down, we leave and continue to operate on the other rack.” But making this happen outside of a single application (like Google at the time) has about the same odds as winning the lottery.
In fact, every time I’ve seen a company try to do something “high-availability” in an uncontrolled application environment (web hosting for example) it’s had the exact opposite effect. If you don’t believe me, just ask any sysadmin who’s had to manage a Ceph cluster at scale before. It’s not trivial.
As we go through this post, we’re going to focus on some game-changing functionality on the Rocket.net Platform:
- Customers have two sets of encrypted backups
- Every Virtual Machine runs a snapshot every 1-5 hours
- Enterprise nodes run snapshots every hour
- Restores can take place from either backup source
- Full server recovery in minutes
So let’s dive in!
The Internet. It fails. A lot.
First, let’s talk about what can and has gone wrong on the Internet.
There are many points of failure when it comes to the Internet. I’ve seen Amazon go down for 12 hours in us-east and affect the entire world. I’ve seen CDN platforms deploy bad code and take massive amounts of the Internet offline. No one and nothing is perfect. But one thing I admire about companies like Cloudflare is their ability to be fully transparent and learn from their failures to prevent them from happening again. Because sometimes you just don’t know what you don’t know until you have to live through it.
So what kind of issues have we seen at Rocket.net? The single biggest contributor to any service disruptions we’ve had can always be tracked back to network and/or power. Networking and power are completely out of our control, yet they are the two most crucial items when it comes to running servers.
Just this past weekend, one of our data centers had a leak on an edge route and we were seeing 1.5% packet loss and random failed requests from Cloudflare. These can often be the trickiest problems to narrow down. Thankfully, we’ve done this once or twice and knew exactly what to provide our vendors with to track the issue, and it was resolved promptly.
A few years ago, we had a rack lose power in Singapore. Normally this would not be a big deal, except the power flickered back on, and during boot, the power went out again. This caused bad writes to all the disks and ultimately corrupted the boot loader, leaving the server unbootable. Yeah, not good! While every server we run operates with two disks constantly mirroring (RAID-1), a bad write is a bad write and is replicated to both. You could have six disks; the bad write would still corrupt the actual data on every one of them.
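To illustrate why mirroring does not protect against a bad write, here is a minimal toy model in Python. It is only a sketch: the `MirroredVolume` class and the byte values are made up for illustration and do not represent how any real RAID controller works internally.

```python
class MirroredVolume:
    """Toy RAID-1 model (hypothetical): every write lands on both disks."""

    def __init__(self, size: int) -> None:
        self.disks = [bytearray(size), bytearray(size)]

    def write(self, offset: int, data: bytes) -> None:
        # The mirror faithfully copies whatever it is handed, good or bad.
        for disk in self.disks:
            disk[offset:offset + len(data)] = data

    def read(self, offset: int, length: int, disk: int = 0) -> bytes:
        return bytes(self.disks[disk][offset:offset + length])


volume = MirroredVolume(size=16)
volume.write(0, b"BOOTLOADER")        # healthy write
volume.write(0, b"BOOT\x00\x00\x00")  # simulated write mangled by a power loss

# Both mirrors agree with each other, and both hold the damaged data.
mirrors_match = volume.read(0, 10, disk=0) == volume.read(0, 10, disk=1)
boot_intact = volume.read(0, 10) == b"BOOTLOADER"
print(f"mirrors match: {mirrors_match}, boot loader intact: {boot_intact}")
```

Mirroring guards against one disk dying; it says nothing about whether the data written to both disks was correct in the first place, which is why a separate backup is still required.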
So, in a case like the above, you could do what we did: Have the data center hot swap new drives into the chassis and restore your data to the new drives. But that’s not always possible! And if you say “well Ben, that’s why we use shared network storage”, my reply to that would be “Good Luck!”
Yes, network storage can be reliable (Google does a great job at it) but it can also fail, and in many cases, recovery can take not just hours, but days. In fact, I was at a company where I watched one filer go down for 48 hours, taking down 70,000 websites with it. We completely relaunched hosting in 2013 at this company and I still had to argue against shared storage after watching that massive outage.
The reality is, no matter how you deploy WordPress, there is always going to be a single point of failure at the origin. Whether it’s a single VPS or 1,000 containers, the network and power to your origin can fail at any time, not to mention hardware can fail at any time.
Today, I can truly say we are prepared for these failures at Rocket.net, and I am excited to walk you through exactly how prepared we really are.
How Rocket.net Handles WordPress and Platform Backups
First, let’s talk about multiple approaches we take at Rocket.net and some of what we’ve learned over the years.
Like most providers, we initially operated on the premise of “what if?” and analyzed how we would be able to react to various scenarios. Over time, “what if it happens” has turned into “when it happens” and this new way of thinking has resulted in what I consider one of the most advanced deployments of my career.
The thing that keeps me up most at night is uptime and data integrity for our customers. Our customers trust us with millions and millions of dollars worth of business, not to mention their livelihoods, so this is something we take very seriously.
WordPress Website Backups
This is arguably one of the most common checkboxes across all hosting/service providers, but also the most overlooked – WordPress Website Backups.
Since Rocket.net’s inception, we have been iterating through various backup solutions and ultimately chose to standardize on a solution powered by JetBackup.
JetBackup combined with Wasabi object storage turned out to be an incredible solution for our customers when they needed to do one-off restores. Thanks to JetBackup, we run daily backups of all files and databases, allowing customers to do a point-in-time restore in just minutes.
Since we run incremental backups, JetBackup will do an incremental restore to get the site back to how it was at that point in time, including a rollback of the database. This works amazingly well as it typically doesn’t involve a massive amount of data. Not that massive amounts of data are problematic to restore, but WordPress has a lot of very small files and it can be time consuming.
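The idea behind an incremental restore can be sketched in a few lines of Python. This is not JetBackup's implementation; the function name, the path-to-checksum dictionaries, and the sample values are all assumptions made for illustration. The point is that only files that differ from the chosen snapshot need to be touched.

```python
def plan_incremental_restore(
    current: dict[str, str], snapshot: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Compare the live site with a snapshot (both map path -> checksum).

    Returns the files to copy back from the backup and the files to remove
    because they did not exist at that point in time.
    """
    to_restore = sorted(
        path for path, digest in snapshot.items() if current.get(path) != digest
    )
    to_remove = sorted(path for path in current if path not in snapshot)
    return to_restore, to_remove


# Hypothetical site state: one file modified, one deleted, one added since the backup.
snapshot = {"wp-config.php": "a1", "index.php": "b2", "uploads/logo.png": "c3"}
current = {"wp-config.php": "a1", "index.php": "ff", "rogue-plugin.php": "d4"}

to_restore, to_remove = plan_incremental_restore(current, snapshot)
print(to_restore)  # files pulled from the backup
print(to_remove)   # files that were not part of the snapshot
```

Unchanged files such as `wp-config.php` are skipped entirely, which is why this kind of restore usually moves far less data than a full one.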
With that said, our team and customers have been very happy with the results JetBackup produces in time of need, but it does have some disadvantages when it comes to a disaster recovery scenario.
Because JetBackup stores data in S3-compatible storage, the speed of a full restore in a disaster scenario can be very limited.
So in the end, this backup effectively enables our customers to restore anything on their WordPress site from files to the database to any point in time, but doesn’t solve the issue of restoring an entire host node.
Single Points of Failure
There was another problem we wanted to solve: as amazing as JetBackup is, we still had a single point of failure. When using only one backup provider, you are relying on a single set of backups. 99.999% of the time this is totally okay, but things happen. Having only one set of backups can create a massive issue at the most critical of times, so at Rocket.net, we wanted to ensure we had redundant backups.
Earlier, I talked about how we all strive for 100% uptime and how things can happen that are out of our control. Well, the same goes for backup storage, and we’ve seen it first hand.
On January 21, 2024, our backup storage provider, Wasabi, had an issue in their us-east-1 data center.
Now, this issue occurred after the daily backup ran, which is great until you need to restore something. Unfortunately, a customer of ours had accidentally deleted a staging website and urgently needed it restored so that they could launch a project on time.
Most companies would simply tell the customer to wait, but we’re not most companies. How did Rocket.net solve the problem for our customer?
Introducing: Rocket.net Platform Backups
In the last few months, we’ve quietly been working with another backup vendor in parallel to our JetBackup system. That vendor is Acronis.
Acronis offers an enterprise backup solution that integrates directly with our private cloud orchestration software, Virtuozzo. How does it work?
Acronis runs a snapshot of every Virtual Machine at Rocket.net every few hours. These snapshots are an exact point-in-time copy of the Virtual Machine at the block level. This backup process runs outside of the virtual machine and is able to securely store the entire copy of the Virtual Machine in the encrypted Acronis Cloud.
How did we use this to solve the problem for our customer? Earlier, we talked about how Wasabi had an outage in us-east-1. While this did not impact our service delivery, it did temporarily prevent customers from restoring backups. This is a major problem, but it is now solved by our Platform Backups.
Using this second tier of backups with Acronis, we were able to pull the staging files securely from the Acronis Cloud and make them available to our customer within minutes.
That’s right! We had a backup provider fail, so we were able to turn to our other backup source and save the day.
We’re confident that this won’t be the last time a customer benefits from these redundant backups, now fully integrated into the Rocket.net platform.
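The pattern in this story, trying one backup source and falling back to the other when it is unavailable, can be sketched as follows. This is a hedged illustration only: the `restore_with_fallback` function, the `BackupUnavailable` exception, and the two stub fetchers are hypothetical stand-ins and do not correspond to any real JetBackup, Wasabi, or Acronis API.

```python
from typing import Callable


class BackupUnavailable(Exception):
    """Raised by a (hypothetical) fetcher when its backup store cannot be reached."""


def restore_with_fallback(
    path: str, sources: list[tuple[str, Callable[[str], bytes]]]
) -> tuple[str, bytes]:
    """Try each backup source in order and return the first successful result."""
    errors = []
    for name, fetch in sources:
        try:
            return name, fetch(path)
        except BackupUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise BackupUnavailable("all sources failed: " + "; ".join(errors))


# Stub fetchers standing in for the two backup tiers (assumed behavior).
def fetch_from_object_storage(path: str) -> bytes:
    raise BackupUnavailable("storage region is down")


def fetch_from_vm_snapshot(path: str) -> bytes:
    return b"<staging site files>"


source_used, data = restore_with_fallback(
    "staging/wp-content",
    [
        ("website-backup", fetch_from_object_storage),
        ("platform-backup", fetch_from_vm_snapshot),
    ],
)
print(source_used)
```

The restore only fails if every source fails, which is the practical value of keeping two independent sets of backups.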
The other big benefit our Platform Backups provide is the ability to restore an entire host node (roughly 1.6TB) in 150 minutes or less. You read that right: we’re able to completely restore any of our Virtual Machines in 150 minutes or less by doing a full restore to a brand new node. The best part? The maximum window for data loss is a mere five hours!
On our Enterprise plans, our customers receive hourly Platform Backups, reducing their maximum window of exposure to 60 minutes. It gets even better: in our weekly integrity testing, it takes an average of 8 minutes to completely restore an Enterprise node from its backup.
Our Enterprise customers can now be back online in minutes in the event of a disaster. Whether a data center catches fire or the internet goes dark in a region, we have you back up and running in minutes.
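To put the figures above in perspective, here is a quick back-of-the-envelope calculation in Python. It assumes decimal units (1 TB = 1,000,000 MB) and that the worst case for data loss is a failure right before the next snapshot would have run; the function name is made up for this sketch.

```python
def sustained_throughput_mb_s(size_tb: float, minutes: float) -> float:
    """Average transfer rate needed to move size_tb within the given minutes."""
    megabytes = size_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return megabytes / (minutes * 60)


# Full host node restore: roughly 1.6 TB in 150 minutes.
node_restore_rate = sustained_throughput_mb_s(1.6, 150)
print(f"{node_restore_rate:.1f} MB/s sustained")

# Worst-case data loss equals the snapshot interval:
# a failure just before the next snapshot loses everything since the last one.
max_data_loss_minutes = {"standard": 5 * 60, "enterprise": 60}
print(max_data_loss_minutes)
```

In other words, a 150-minute restore of a 1.6TB node works out to roughly 178 MB/s sustained for the entire restore, and moving from five-hour to hourly snapshots shrinks the worst-case data loss window from 300 minutes to 60.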
Putting Disaster Recovery Into Action Saves A Customer’s Day
Case in point: A few weeks ago, our Phoenix data center had a network issue that lasted just under 60 minutes. Typically, 60 minutes or so isn’t the end of the world on our platform, thanks to our Enterprise CDN and Full Page Caching. Our customers see a 98%+ cache hit ratio on our platform, so the only time they really see an issue is if they’re doing dynamic requests such as writing new posts.
But in this case, we had an enterprise customer that manages one of the largest food blogs in the world. Since the Platform Backup was located securely in the Acronis Cloud, we were able to restore that website to another data center in minutes so this customer could keep writing articles and generating revenue.
This kind of flexibility is unheard of in the web hosting world, and yet we made it happen for our customer because it was the right thing to do.
Always Putting Customers First at Rocket.net
We’re beyond excited to provide this multi-tiered backup and disaster recovery solution to all of our customers. Not only does this help us ensure the safety of our customers’ businesses with extra layers of backups, but it also allows us to provide our customers with the fastest recovery time in the industry right when it counts the most: in the middle of an outage.
What’s more, all of this is available on our entire platform at no additional cost. Customers don’t have to turn anything on; we handle everything automatically. That means our customers can sleep better at night, knowing that they aren’t just on the fastest WordPress host on the planet, but that Rocket.net disaster recovery plans are second to none.