Season 1: Episode #7

High Availability vs Disaster Recovery: Fight!

High availability is an expectation in business today, and Rahul shudders to think about what parents (and kids) would do if Disney Plus or YouTube went down for 15 minutes. Joking aside, it’s the cloud, so why all the focus on old, complex disaster recovery solutions?

In this episode, it’s High Availability vs. Disaster Recovery, and it’s Rahul vs. Eamonn O’Neill, a disaster recovery expert and Co-Founder and CTO of Lemongrass. This episode won’t be a disaster, but it is highly available on all podcast apps.

Guest

Eamonn O'Neill

Co-Founder and CTO at Lemongrass

Transcript

Rahul Subramaniam: We shouldn’t be spending the time on a lot of these really complex disaster recovery solutions.

Eamonn O’Neill: The way we address it is through automation.

Hilary Doyle: You can better safe than sorry your way into bankruptcy, and I feel like that’s the threshold.

Eamonn O’Neill: You haven’t convinced me. Maybe next time.

Hilary Doyle: This is AWS Insiders, an original podcast by CloudFix about the services, patterns and future of cloud computing at AWS. CloudFix is a tool that finds and implements 100% safe AWS-recommended cost savings. It’s fixes, not just analytics. I’m Hilary Doyle, joined by Rahul Subramaniam.
Hey, Rahul.

Rahul Subramaniam: Hey, Hilary. How are you doing?

Speaker 4: I’m pretty good, thanks.

Rahul Subramaniam: That’s Siri.

Hilary Doyle: Oh, my God. Siri just answered your question. That’s fantastic. Please keep that in.
In this episode, we’re talking about disaster recovery and high availability. Are you for one or the other, or perhaps both? It’s a serious topic, so I’ll make fewer jokes, but I’ll slide a few in there maybe as a case study since usually it’s my jokes that are the disaster.

Rahul Subramaniam: Hilary, this is a really serious topic and serious business.

Hilary Doyle: Sorry.

Rahul Subramaniam: I really shudder to think about what parents would do if kids just couldn’t watch their Disney+ or YouTube.
But anyway, on a more serious note, high availability is an expectation in business today. We have very little patience for a system with any downtime or latency. Google posted a very interesting statistic recently that says something like, “You’re likely to lose 50 per cent of your customer traffic on your webpage if it takes more than two seconds to load.” Not judging, but when I was a kid, we used to write letters and it had to take weeks to land, but…

Hilary Doyle: Okay, grandpa.

Rahul Subramaniam: You’re making me feel old now, Hilary.

Hilary Doyle: I’m just kidding. I think we’re the same age. Go on.

Rahul Subramaniam: Setting context for what modern day expectation looks like. With that context, it is natural that high availability just gets so much attention.

Hilary Doyle: So people are saying that high availability is the top priority. That’s what they’re demanding. So fit disaster recovery into this picture.

Rahul Subramaniam: Disaster recovery. Okay, so back when you had to manage everything on your own in your own data center, you were limited by how much availability you could actually achieve. It was nothing like the numbers that we talk about today, but anything that disrupted availability was, in effect, a disaster, and therefore the focus was always on creating a good disaster recovery plan.
Now, a DR plan creates a set of runbooks for how you would go about restoring a service back to its operational state. And that happens in a scenario where you encounter a complete and total failure.

Hilary Doyle: NASA has got us covered, but it sounds like insurance. So is that a reasonable way to look at this then?

Rahul Subramaniam: I absolutely think… I think where you did not have good availability, you create a disaster recovery plan as your insurance policy. Now, like any other insurance policy, you just don’t know whether the insurance company’s going to pay up or not.

Hilary Doyle: Okay. Well, great. We are highly available with Rahul’s hot takes, his hot tips and a use case the size of a linebacker. Plus we have a chat with Eamonn O’Neill, co-founder and CTO at Lemongrass, with such a beautiful Irish lilt.
But first, your AWS headlines.
Rahul, earlier last summer, Amazon EC2 High Memory Instances became available in several big regions in America and around the world. What are you making of this?

Rahul Subramaniam: High Memory Instances? Try absolutely huge instances. These are 12 terabytes huge. They’re massive.
These instances were primarily created for SAP HANA’s in-memory use cases, and for institutions that are looking to transition away from expensive on-prem hardware. This is an absolutely amazing option.

Hilary Doyle: More on AWS rollouts, Amazon CloudFront now supports HTTP/3. This is the latest evolution of one of the core protocols, keeping the internet running.
Rahul, what does this mean for all of us users of the web?

Rahul Subramaniam: That’s absolutely right, Hilary. It caters to all the impatience that we all have. So a large part of this revolution has actually been led by Google, and they have been pushing the boundaries of the HTTP protocol and brought about HTTP/2, which is what most of us use today, and that made loading of pages way faster.
HTTP/3 gets three to four times faster by taking a very fundamentally different approach. It swaps out… There’s a layer in the middle called the TCP layer and uses UDP instead, and that increases the performance of the server communications very significantly.
So the fact that CloudFront now supports HTTP/3 will make your websites way faster, and it’s awesome. It’s fully backwards compatible, so if you have a browser that does not support HTTP/3, that’s fine. You can still start using it.

Hilary Doyle: That was a lot of letters. I’m just going to move us along here.
Back in September, Logitech announced its cloud powered handheld gaming device. Cloud gaming is getting very popular. Yes, it needs a hefty internet connection, but it’s a money saver. You don’t have to build or maintain costly gaming rigs anymore.
Note, please, that Logitech’s device is powered by NVIDIA’s cloud gaming service or Xbox’s, which is Microsoft’s. So where does this leave AWS in the cloud gaming world?

Rahul Subramaniam: Cloud gaming is really one of the next big battlegrounds of the gaming business. AWS already has a very large footprint in game tech and eSports. It all started with Amazon acquiring Twitch which has become the default gameplay streaming service. And since then, they’ve built a number of different services that allow game developers to build the next generation of games in-the-cloud-for-the-cloud. And as internet infrastructure keeps getting better worldwide, it is just inevitable that cloud gaming is going to be mainstream. Definitely a space to watch.

Hilary Doyle: I’m such a noob. I’m going to move us along.

Rahul Subramaniam: Hilary, where do you do your online shopping?

Hilary Doyle: I do my best to shop local, but when I’m shopping online, I often use Amazon.

Rahul Subramaniam: Great. And do you really plan for the eventuality or do you think I plan for the eventuality that Amazon will not deliver my order and plan a visit to the nearby Target or Walmart, or whatever the local store is?

Hilary Doyle: No, I just complain to customer service if it doesn’t arrive on time, but it always does.

Rahul Subramaniam: So that’s the perfect illustration of what HA versus DR looks like. When you have very high availability, you don’t really need a disaster recovery plan. You also don’t know if you will be able to execute it because the outage events are just so rare that it is impractical both logistically and cost-wise to build out a plan that actually works.

Hilary Doyle: Okay, but hang on, because the Uptime Institute estimates about 44 per cent of businesses have suffered recent major outages that tangibly impacted their business. So what would you say to those business owners?

Rahul Subramaniam: So I would say that they’re most likely not on a service like AWS or they haven’t built their application with the high availability best practices that you’re supposed to put in as you deploy these applications. If Amazon were unreliable, now, God forbid that happened, I bet that you would have at least two to three other online ordering solutions at your fingertips and you would start hedging your delivery bets across these vendors.
Now, that is your quintessential disaster recovery plan when availability is in question.

Hilary Doyle: Rahul, are you a football fan?

Rahul Subramaniam: Of course, I am. I have a soft spot for FC Barcelona, but a large part of that is because of Messi. My loyalties are still finding their feet since he moved to PSG, but absolutely a fan.

Hilary Doyle: Okay, I’m talking about the NFL, American football. More specifically, I’m actually talking about Fantasy Football. Where does one find the time to assemble imaginary teams that play against other imaginary teams for money? I do not know, but it is a lucrative spectator sport, and it also makes a good use case for high availability. Granted, it’s a few years old, but so is Tom Brady and he’s looking fine, so on we go.

Rahul Subramaniam: So in my world, we have Fantasy Cricket, so I completely get it.

Hilary Doyle: Nobody else does. Back in 2016, a small company called Fantasy Draft discovered it needed unsurprisingly, major website availability, especially on game days.
Rahul, please hit the field.

Rahul Subramaniam: So Fantasy Draft did most of its business on game days, and most of it just before kickoff, as fans made last-minute adjustments to their teams. This created a massive spike in the number of customers connecting to their website just before the game started. And it was a real challenge while they were still in their data center. Their issue wasn’t disaster recovery, it really was high availability, as you said.

Hilary Doyle: So basically they heard you shouting at them from the stands and Fantasy Draft was like, “Oh, my God. We got to do something to quiet this man,” and needed a fast fix.
We’re going to stop there for now. Think of this as halftime. We will be back.
But first, we’ve brought in a disaster recovery expert to go up against your support, Rahul, for high availability. Eamonn O’Neill is a co-founder, director and CTO at Lemongrass. Lemongrass is a software-enabled services provider synonymous with SAP on cloud. So I’m really excited to listen to the two of you go head to head. Let’s do it.
Eamonn, welcome to the show.

Eamonn O’Neill: Hi, Hilary. Thanks for having me.

Hilary Doyle: I want to start here. You and Rahul agree that cloud-based is the way to go when it comes to high availability and disaster recovery requirements. But your shared point of view more or less ends there. Rahul believes that DR is irrelevant in modern cloud-based architectures. Eamonn, why is Rahul wrong?

Eamonn O’Neill: What’s really interesting, I think, is that the terminology is evolving. So the term disaster recovery, of course, from on-premise days related to a data center going offline, and people had images of meteors hitting a data center and it burning down, etc.
Of course, things are different in cloud, and some of those types of scenarios are mitigated by default. However, disaster isn’t just about the physical disappearance of some infrastructure. Security breaches can be a disaster. Corruption of data can be a disaster, and there’s plenty of other items that maybe even on the client side that could create a disaster for the service.

Hilary Doyle: Rahul, this brings up the notion that human beings are great at planning for what has already happened, and we are notoriously bad at forecasting black swan events. So where does a black swan event and the very real specter of outages fit into your high availability and no DR plan?

Rahul Subramaniam: I think where I get a little confused is that separation of HA and DR. In the cloud, it merges all in and it becomes just high availability at its core. Data loss prevention, which probably is one of the biggest items under the traditional DR setup, is also taken care of and finds itself in the core of any high availability architecture when you look at the cloud setup.

Eamonn O’Neill: But what if those preventative measures fail? As Hilary said, these are things that you couldn’t predict. What if they happen, and what do you do then, when your HA clearly is no longer available? So we see customers create what we call an ‘in case of emergency’ account, or someplace where they’ve got a bunker with their data and, potentially, some capacity reserve for compute that they can restore from. So the likelihood of disaster has definitely shrunk, but the impact can actually be bigger now than it was on premise.

Rahul Subramaniam: In the experiences that we’ve had as well as a lot of customers that I speak to, one of the realizations is that nine out of 10 times when they actually try to bring back something that is completely destroyed for whatever reason – that DR strategy just never works because they forgot something or something went different from the plans that they had. And so, Eamonn I’d love to hear more from you on that.

Eamonn O’Neill: It’s a really common problem, absolutely, that people are so afraid of even trying to do a DR test that they avoid it.
The way we address it is through automation. So automation not only gives you higher quality in the execution of those steps, but also a much faster stand-up of the new environment. A good example was one customer who moved from on premise to AWS. On premise, they had a DR test they ran every year, and there was one guy with a clipboard who used to go around checking that every step was getting done. And it took two days to complete this DR test. When we got to cloud, we showed him how we could automate the recovery. And the funny story was that he couldn’t tick the boxes fast enough to get through the DR recovery.
What about you? Have you come up with good solutions?

Rahul Subramaniam: To be honest, not really, because there are two reasons why we’ve seen DR fail miserably. Number one is that the cloud services and technologies themselves are changing so rapidly that a lot of the automation that we build, which gets tested maybe once a year or less, invariably becomes obsolete. And then, if I look at the sub-services and the sub-APIs and try to think about every possible scenario, permutation and combination of something that goes wrong, that’s incredibly hard to do.

Eamonn O’Neill: One of the things that we have started to do with some customers is effectively rebuild their entire landscape every 30 days. And the reason for that is people generally don’t get into these build phases except in a disaster. And of course, HA will allow you to flip between two different nodes. But what some customers have said is that they don’t trust any IaaS that’s older than 30 or 60 days.
Now that’s a very high bar. Not all applications can do this, so I don’t want to pretend it’s simple. But certainly for SAP, where we spend a lot of our time with SAP customers, that’s what we’re shooting for with them for their key systems.
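
To make the "nothing older than 30 or 60 days" policy Eamonn describes concrete, here is a minimal sketch in Python. It is not Lemongrass's tooling: the `Node` record, its `built_at` field and the `nodes_due_for_rebuild` helper are hypothetical names invented for illustration, and a real landscape would pull build dates from its own inventory.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Node:
    """A hypothetical record of one server and when it was last rebuilt."""
    name: str
    built_at: datetime


def nodes_due_for_rebuild(
    nodes: Iterable[Node],
    max_age_days: int = 30,
    now: Optional[datetime] = None,
) -> List[Node]:
    """Return the nodes whose last build is older than max_age_days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [node for node in nodes if node.built_at < cutoff]


if __name__ == "__main__":
    # Illustrative data only: one fresh node and one stale node.
    now = datetime(2023, 1, 31, tzinfo=timezone.utc)
    fleet = [
        Node("app-1", now - timedelta(days=10)),
        Node("db-1", now - timedelta(days=45)),
    ]
    for node in nodes_due_for_rebuild(fleet, max_age_days=30, now=now):
        print(f"{node.name} is due for a rebuild")
```

Running a check like this on a schedule would turn the rebuild into a routine exercise rather than something attempted only during a disaster, which is the point Eamonn is making.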

Hilary Doyle: We know that with five nines of availability, the downtime per year is about five and a half minutes, and we can quantify those minutes in terms of cost. So the number that floats around is about $8,850 per minute of downtime.
For five minutes a year, that’s costing a company less than $50,000, which is probably significantly less than any disaster recovery as a service package. Eamonn, is this the wrong way to be thinking about things?
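
The arithmetic behind Hilary's numbers can be sketched in a few lines of Python. The $8,850 per-minute figure is the one quoted in the conversation; the function names and the 365-day year are assumptions made for this illustration. Five nines works out to roughly 5.26 minutes of downtime a year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime in minutes for an availability such as 0.99999."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def downtime_cost_per_year(availability: float, cost_per_minute: float) -> float:
    """Yearly cost of that downtime at a flat per-minute cost."""
    return downtime_minutes_per_year(availability) * cost_per_minute


if __name__ == "__main__":
    cost_per_minute = 8850  # figure quoted in the episode
    for label, availability in [
        ("three nines", 0.999),
        ("four nines", 0.9999),
        ("five nines", 0.99999),
    ]:
        minutes = downtime_minutes_per_year(availability)
        cost = downtime_cost_per_year(availability, cost_per_minute)
        print(f"{label}: {minutes:.2f} min/year, about ${cost:,.0f}")
```

Each extra nine cuts the expected downtime, and therefore the expected cost under this flat-rate assumption, by a factor of ten.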

Eamonn O’Neill: Yeah, it’s what I would say at least.

Hilary Doyle: That’s what I’m here for, guys.

Eamonn O’Neill: So first off, that doesn’t take into account the maintenance windows. There’s a range of levers that could be pulled, either deliberately or accidentally, very quickly, which render the entire cluster across the zones unusable. And what do you do then? So I think your calculations of course add up, but the truth is that you’re not mitigating an HA failure with DR. You’re mitigating a regional failure with DR.

Rahul Subramaniam: In the traditional HA architectures, don’t we already take care of the fact that in the worst case scenario, an entire region can go down? I mean, if you look at services like Aurora or S3, the underlying storage layers automatically can replicate your data under the hood across not just the three availability zones. But if you create them as regional data stores, they can actually automatically replicate them across multiple regions as well. Same with DynamoDB and a bunch of other services.
When you have systems like that, which at least at the infrastructure layer take care of a lot of that high availability requirement, yes, it costs quite a bit more to achieve it, but that’s part of your well-architected high availability framework that you would use to decide how you lay out all your infrastructure.

Eamonn O’Neill: I suppose the services you mentioned, I would classify those as PaaS services. So they’re not IaaS services where you’re having to architect at the lower level. So you’re right, the PaaS services have got that ability to be regionally redundant and should mitigate some of those risks that I mentioned. But for an IaaS layer, what you tend to rely on is your own HA configuration, a cluster generally that is deciding where the workload should actually run.
And if you do spread a cluster across regions, which you could do, then the disaster is that the entire cluster gets destroyed. That’s the same risk whether it’s cross-zone or cross-region. If you’ve got a cluster whose nodes can see each other and somehow it gets attacked, then you still have that risk of what you do when the cluster is unusable.
With the PaaS services, you don’t have to think about it. I agree. That almost becomes built into the service, and I’m coming from more of a packaged application where you’ve got to architect it yourself.

Rahul Subramaniam: Got it. When do you see customers wanting that sense of control and wanting to do this all by themselves, given that switching over to PaaS services seems like a no-brainer in these scenarios?

Eamonn O’Neill: We absolutely encourage customers to use PaaS or SaaS where it’s available. There are some applications that will not allow you to run on those PaaS services. For example, with SAP, they certify the services you’re allowed to run SAP on. And it’s a limited set; it excludes things like RDS and PaaS services, and it relies more on the OS providers to come up with their own cluster management.
So sometimes customers are forced into it. They have to because they can’t get it built in.
Now, there are some versions of SAP which are SaaS, and some customers go straight to that and forget about the entire technical architecture. But the vast majority of customers who move to a hyperscaler, with SAP at least, have to consider this themselves. And yes, they use best practice patterns and all of that, but when you look at that cluster, it’s cross-AZ, and you can set up async replication, even cross-region. But the risk is that’s one defined entity, effectively. That’s an attack surface, so what do you do if that goes away?
Did you say you see customers who prefer to do IaaS if there’s a PaaS alternative, or…?

Rahul Subramaniam: Yeah, I see a lot of customers in IT especially wanting that level of control, which doesn’t seem logical. It just seems something that they’re used to, they’re familiar with and they feel like they have more control when they take their setup and deploy it the way they feel is safe. And so I wanted to see if you see the same thing when you talk to customers.

Eamonn O’Neill: Sometimes I think there’s an inevitable concern some people have when they’re moving to cloud that, “What about my job? What about my role? Is it going away now? And I’d rather keep doing it the way I’ve always done it.”
And we do see that from time to time. That’s not an uncommon thing to see. And we do see customers who just try to replicate the entire data center architecture in a hyperscaler. Same management tooling, same layouts, even a big firewall around the entire VPC. But sometimes people need a bit of help and a bit of education, a bit of experience, before they can really get their head around some of those concepts.

Rahul Subramaniam: And you know this, Hilary. I don’t miss an opportunity to call out the likes of Oracle and SAP.

Hilary Doyle: I know.

Rahul Subramaniam: So this just feels like an absolute monopoly of sorts, where these vendors take every opportunity to hold onto their monopoly and lock in their customers, even where something could actually be beneficial, where customers could leverage some of these higher-order PaaS services that actually make redundancy, reliability and so on so much simpler. A lot of these vendors just will not allow it.

Hilary Doyle: But haven’t you said that multi-vendor strategies are the ones that are most likely to have an outage?

Rahul Subramaniam: They absolutely are. And actually, at the bottom of it all, it is the people that make all the difference between a high availability setup or not. Eamonn solves a lot of his problems with automation, but if the people don’t think about all the different scenarios, and if they feel like they know everything and control all aspects of how to produce HA, you’re not going to have HA. So to try and do a multi-vendor setup and so on just adds so much complication. You are more likely to miss something.

Hilary Doyle: I just want to push back on that though, because you’re right, humans are not good at thinking ahead. So if they’re not good at doing that for high availability, why aren’t you assuming the same for disaster recovery and putting barriers in place or insurance in place for these crazy, once in a lifetime events?

Rahul Subramaniam: So I completely agree. We shouldn’t be spending the time on a lot of these really complex, sophisticated disaster recovery solutions. And I think a large number of them are propagated by the fact that they’re just used to them in their on-premise setup. And once they move to the cloud, just continue that because there’s familiarity.
Whether that’s relevant is a whole other question, and I actually think that they’re just wasting a whole lot of money. If instead they invested in training in the cloud, going cloud native, doing PaaS and building a high availability architecture, they would stand to gain a lot more in the long run, and actually build a high availability service with manageable risks. In no scenario can you say there’s zero risk, but manageable risks where the costs are proportionate to the risk that you’re willing to undertake.

Eamonn O’Neill: We’ve got companies, as I’m sure Rahul has as well, whose entire business is running on cloud now. And that risk, even though it’s unlikely and generally controllable, is real: we know the most active part of the hacker community is trying to get into these systems and hold some very large companies to ransom. We know this is prevalent everywhere. Ransomware attacks are going up all the time and we cannot predict what’s going to come next.
If we said to customers, “Don’t worry about DR because it’s not going to happen, it’s never happened before,” that’s not good enough for us. Our customers would always say, “In case something we haven’t thought of happens, we want to make sure we’ve got a lockbox that we can go back to and rebuild everything from.”

Hilary Doyle: Somebody has said, and I’m stealing their line: you can “better safe than sorry” your way into bankruptcy, and I feel like that’s the threshold. Don’t be so safe that you bankrupt yourself with these costs.
Eamonn, thank you so much for this conversation. What a pleasure to meet you.

Eamonn O’Neill: You too, guys. Thanks.

Rahul Subramaniam: Absolutely. This is a lot of fun, Eamonn. Thank you so much for coming in.

Eamonn O’Neill: No, no, I really enjoyed that actually. It’s good to get challenged on the way we think. But you haven’t convinced me, Rahul. Maybe next time.

Hilary Doyle: Rahul, we can leave disaster recovery by the side of the road just for your comfort. We’re going to stick with high availability for now. Let’s say a company is focused on improving theirs. Where should they start? Three hot tips.

Rahul Subramaniam: So here it goes. If you’re in the cloud, do not waste your time on DR or disaster recovery. It’s like pouring money down the drain.
Build a great high availability solution and let the higher-order cloud services do all of the heavy lifting for you.
Second, I know that this might sound counterintuitive, but my advice is to invent nothing. When it comes to building reliable solutions, that’s the way you want to go. Use any and all patterns that are well proven to be reliable, and just use them as is.
Remember, bespoke deployments will have bespoke failures that you won’t ever be able to predict.
And lastly, Hilary, everything fails. That’s a law of nature. You have to plan for it and then become the monkey with the wrench that’s trying to break everything. You don’t want disaster recovery here. You really want disaster contingencies and high availability. If you don’t exercise those contingencies, you’re basically screwed.

Hilary Doyle: Rahul, this is what happens at the beginning of Fantasy Football. You assemble a literal team and then you get together with those teammates, maybe eat 700 pounds of chicken wings, drink gallons of beer until you’re certain you’re going to die. And then somewhere between your third and let’s say your 20th keg stand, you choose your players and the season begins. And sites like Fantasy Draft are there with a digital playing field for all of the teams to gather on.
When we left Fantasy Draft, they had just moved to AWS to increase elasticity and scalability. Ball is in your court, bring us home, stick the landing, give us 110 per cent on this case study, Rahul.

Rahul Subramaniam: Well, website availability went way up to give customers reliable 24 hour service. Website performance climbed something like 20 per cent, and Fantasy Draft can now scale its web servers to meet Game Day demand without really having to worry about disaster recovery that they had to do in their old data centers.

Hilary Doyle: They have benched disaster recovery.

Rahul Subramaniam: Absolutely. They used a bunch of AWS tools to achieve these outcomes. Let’s see what they put on the team. They put Amazon RDS for PostgreSQL. There was EC2 for its fluctuating availability needs, and they can use anywhere from 30 to 100 instances at any given time. And for them, that really was the winning combination.

But one last thing, Hilary, given that they are still using bare EC2 instances and auto scaling groups, I think they could get even more cost and performance-efficient if they just rebuilt some of those services to run on Lambdas.
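
The "anywhere from 30 to 100 instances" range Rahul mentions can be illustrated with a small Python sketch of how a scaling rule might pick a fleet size. This is not Fantasy Draft's actual configuration or an AWS API call: the `desired_instances` function and the per-instance capacity figure are assumptions invented for the example, and only the 30 and 100 bounds come from the episode.

```python
import math


def desired_instances(
    requests_per_second: float,
    capacity_per_instance: float,
    min_instances: int = 30,
    max_instances: int = 100,
) -> int:
    """Size the fleet to the load, clamped between the group's bounds."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    needed = math.ceil(requests_per_second / capacity_per_instance)
    return max(min_instances, min(max_instances, needed))


if __name__ == "__main__":
    # Hypothetical load levels, assuming each instance handles 100 requests/s.
    for label, load in [("quiet weekday", 1_000), ("pre-kickoff", 5_000), ("kickoff spike", 50_000)]:
        print(f"{label}: {desired_instances(load, 100)} instances")
```

The lower bound keeps a baseline of capacity on quiet days, and the upper bound caps spend during the pre-kickoff spike.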

Hilary Doyle: Hey, Fantasy Draft, pick up this free agent. That’s it for us for now. We’ll be back. You’ve been listening to AWS Insiders from CloudFix. My name is Hilary Doyle.

Rahul Subramaniam: And I’m Rahul Subramaniam.

Hilary Doyle: CloudFix is an AWS cost optimization tool. Learn more about them at cloudfix.com. Check out the show notes.

Rahul Subramaniam: And please leave a review and follow us.

Hilary Doyle: We’ll catch you later.

Rahul Subramaniam: Thank you.

Meet your hosts

Rahul Subramaniam

Host

Rahul is the Founder and CEO of CloudFix. Over the last 13 years, Rahul has acquired and transformed 140+ software products. More recently, he has launched revolutionary products such as CloudFix and DevFlows, which transform how users build, manage, and optimize in the public cloud.

Hilary Doyle

Host

Hilary Doyle is the co-founder of Wealthie Works Daily, an investment platform and financial literacy-based media company for kids and families launching in 2022/23. She is a former print journalist, business broadcaster, and television writer and series developer working with CBC, BNN, CTV, CTV NewsChannel, CBC Radio, W Network, Sportsnet, TVA, and ESPN. Hilary is also a former Second City actor, and founder of CANADA’S CAMPFIRE, a national storytelling initiative.
