How to Prepare for the Next AWS or Azure Outage

What caused the recent AWS and Azure outages discussed in the article?

The AWS outage was triggered by a latent defect in the service’s automated DNS management system, causing endpoint resolution failures for DynamoDB and cascading failures across multiple services. The Azure outage was due to an inadvertent configuration change, also resulting in DNS issues and widespread service disruptions. Both incidents highlight the increasing complexity and fragility of large-scale cloud environments. Source: AWS Incident Summary, The Verge

How did Sedai detect the AWS outage before it was publicly announced?

Sedai detected early signals of the AWS outage at 12:30 AM PDT by identifying an unusual increase in error rates across several customer applications. This was nearly an hour before AWS made a public announcement at 1:26 AM. Sedai’s machine learning models autonomously learned normal application behavior, enabling the platform to catch anomalies without relying on static thresholds or rules.

Can Sedai prevent cloud outages like those at AWS or Azure?

Sedai cannot prevent outages at the cloud provider level (such as AWS or Azure), but it can autonomously detect early warning signs in customer environments and help prevent similar outages within those environments before they impact users. Sedai acts as an intelligence layer, providing early detection, recommendations, and automated remediation for issues under your control.

How does Sedai's approach to outage detection differ from traditional observability platforms?

Traditional observability platforms rely on fixed thresholds and pre-set rules to detect availability issues, which may not catch all anomalies. Sedai uses patented machine learning models to learn the normal behavior of each cloud-based application, enabling it to detect early warning signs of outages and performance issues that static rules might miss.

What strategies are recommended for building a resilient cloud environment?

Key strategies include building for redundancy and fault-containment (e.g., using multiple availability zones, multi-region or multi-cloud setups), implementing operational resilience (automation, monitoring, disaster recovery drills), and adopting a multi-cloud strategy to avoid vendor lock-in and enhance reliability.

How can redundancy help mitigate the impact of cloud outages?

Redundancy ensures there is no single point of failure by using multiple availability zones, multi-region or multi-cloud setups, and globally distributed data stores. This approach helps contain failures and enables faster recovery during outages.

What role does automation play in operational resilience?

Automation is critical for operational resilience, enabling repeatable deployments (via Infrastructure as Code), automated failover testing, chaos engineering, and regular disaster recovery drills. These practices help organizations maintain uptime and recover quickly from incidents.

Why is a multi-cloud strategy important for resilience?

A multi-cloud strategy helps avoid vendor lock-in, enhances reliability, and bolsters compliance. By replicating critical workloads across providers and using portable architectures, organizations can recover more quickly from provider-specific outages.

How does Sedai help engineering teams respond to large-scale cloud events?

Sedai provides autonomous detection of anomalies, automated rollbacks to safe versions, and actionable recommendations for remediation. This empowers engineering teams to respond faster and with full context during large-scale events, reducing the impact on users.

What percentage of cloud misconfigurations are caused by human error?

More than 80% of misconfigured cloud resources are the result of human error, highlighting the need for autonomous systems like Sedai to reduce risk and improve reliability.

How does Sedai's machine learning model work for outage detection?

Sedai’s patented machine learning models learn the normal behavior of each cloud-based application, enabling the platform to detect anomalies, misconfigurations, and inefficiencies without relying on static thresholds or rules. This allows for earlier and more accurate detection of potential outages.

What is the blast radius in the context of cloud outages?

The blast radius refers to the scope of impact caused by a failure or outage. Strategies like redundancy, fault-containment, and multi-cloud setups are designed to minimize the blast radius, containing failures and preventing them from affecting the entire environment.

How long did the AWS and Azure outages last?

The AWS outage lasted approximately 15 hours, while the Azure outage lasted over 8 hours, affecting thousands of companies and millions of users worldwide.

What is the financial impact of major cloud outages?

The AWS outage alone resulted in an estimated half a billion dollars in potential damages, demonstrating the significant business risk associated with cloud downtime. Source: CRN

How does Sedai support disaster recovery and failover strategies?

Sedai provides recommendations and automation for rollbacks to safe versions, supports integration with Infrastructure as Code for repeatable deployments, and enables observability and anomaly detection to support disaster recovery and failover strategies.

What is the future of autonomous cloud management according to Sedai?

Sedai believes that autonomous systems are now a necessary part of cloud management, as the complexity and scale of modern cloud environments exceed human capacity for manual configuration and monitoring. Autonomous platforms like Sedai are essential for early detection, remediation, and resilience in the face of increasing large-scale events.

Why should engineering teams consider autonomous systems for cloud management?

Engineering teams should consider autonomous systems like Sedai because they reduce manual toil, catch early warning signs of outages, and provide automated remediation, enabling teams to focus on innovation and high-value work instead of firefighting incidents.

What is Sedai and what does it do?

Sedai is an autonomous cloud management platform that optimizes cloud operations for cost, performance, and availability using machine learning. It eliminates manual intervention, reduces cloud costs by up to 50%, improves performance, and proactively resolves issues before they impact users.

What are the key features of Sedai's autonomous cloud management platform?

Sedai offers autonomous optimization, proactive issue resolution, full-stack cloud coverage (across AWS, Azure, GCP, Kubernetes), release intelligence, plug-and-play implementation, and enterprise-grade governance. It supports multiple modes: Datapilot (observability), Copilot (one-click optimizations), and Autopilot (fully autonomous execution).

How does Sedai help reduce cloud costs?

Sedai reduces cloud costs by up to 50% through autonomous optimization, rightsizing workloads, and eliminating waste. Customers like Palo Alto Networks saved $3.5 million, and KnowBe4 achieved 50% cost savings in production.

What types of cloud environments does Sedai support?

Sedai supports optimization across AWS, Azure, Google Cloud Platform (GCP), and Kubernetes environments, providing full-stack coverage for compute, storage, and data resources.

How quickly can Sedai be implemented?

Sedai’s setup process is designed to be quick and efficient, taking just 5 minutes for general use cases and up to 15 minutes for specific scenarios like AWS Lambda. For complex environments, timelines may vary. Book a demo for details.

What integrations does Sedai offer?

Sedai integrates with monitoring and APM tools (CloudWatch, Prometheus, Datadog, Azure Monitor), Kubernetes autoscalers (HPA/VPA, Karpenter), IaC and CI/CD tools (GitLab, GitHub, Bitbucket, Terraform), ITSM tools (ServiceNow, Jira), notification tools (Slack, Microsoft Teams), and various runbook automation platforms.

What security and compliance certifications does Sedai have?

Sedai is SOC 2 certified, demonstrating adherence to stringent security requirements and industry standards for data protection and compliance.

Who are Sedai's typical customers?

Sedai serves organizations with significant cloud operations across industries such as cybersecurity (Palo Alto Networks), IT (HP), financial services (Experian, CapitalOne Bank), healthcare (GSK), travel (Expedia), car rental (Avis), retail/e-commerce (Belcorp), SaaS (Freshworks), and digital commerce (Campspot).

What roles and teams benefit most from Sedai?

Sedai is designed for platform engineering, IT/cloud operations, technology leadership (CTO, CIO, VP Engineering), site reliability engineering (SRE), and FinOps teams. These roles benefit from cost optimization, operational efficiency, and improved reliability.

How does Sedai compare to other cloud optimization tools?

Sedai differentiates itself with 100% autonomous optimization, proactive issue resolution, application-aware intelligence, full-stack cloud coverage, release intelligence, and rapid plug-and-play implementation. Unlike competitors that rely on static rules or manual adjustments, Sedai operates autonomously and focuses on outcomes like cost efficiency and performance.

What pain points does Sedai address for cloud teams?

Sedai addresses pain points such as cost inefficiencies, operational toil, performance and latency issues, lack of proactive issue resolution, complexity in multi-cloud/hybrid environments, and misaligned priorities between engineering and FinOps teams.

What business impact can customers expect from using Sedai?

Customers can expect up to 50% cloud cost savings, 75% latency reduction, 6X productivity gains, 50% reduction in failed customer interactions, and improved release quality. These outcomes are supported by case studies from companies like Palo Alto Networks, KnowBe4, and Belcorp.

What customer feedback has Sedai received regarding ease of use?

Customers praise Sedai for its quick plug-and-play setup (5–15 minutes), agentless integration, personalized onboarding, detailed documentation, and risk-free 30-day trial. These features contribute to positive feedback on ease of use and adoption.

Where can I find technical documentation for Sedai?

Technical documentation for Sedai is available at docs.sedai.io/get-started. Additional resources, including case studies and datasheets, can be found on the resources page.

In the past two weeks, the world was hit by two massive cloud outages:

AWS, which impacted an estimated 17 million users and more than 70,000 companies, including United Airlines, T-Mobile, and Starbucks.
Azure, which took down major platforms from Microsoft 365 to Minecraft.

The result was about half a billion dollars in potential damages from the AWS outage alone. That’s left a lot of tough questions for engineering leaders about how to prepare for the next big one.

Sedai had a front-row seat for both outages, since our platform manages & optimizes the cloud at many large enterprises. In fact, before the world learned about these events, our technology detected the early signs in our customers’ environments.

In this blog, we’ll quickly cover:

What caused the AWS and Azure outages
How AI detected the early signs in production
How to build a resilient cloud

SLO adherence requires more than good monitoring. It requires action. Book a demo to see how Sedai closes the loop between observability and optimization autonomously.

What caused the AWS and Azure outages

As the cloud becomes more and more complex, experts who weighed in on the outages all agreed on one thing: There are more outages coming.

For AWS, the primary culprit was a DNS issue. After increased error rates and latencies were reported for AWS services in the US-EAST-1 Region, AWS’s incident summary confirmed the massive outage was “triggered by a latent defect within the service’s automated DNS management system.” This defect “caused endpoint resolution failures for DynamoDB.”

From there, a cascading effect of failures hit 70,000 companies, including over 2,000 large enterprises.

What began with the DynamoDB endpoint failure soon spread. The outage’s blast radius reached EC2’s internal launch workflow, NLB health checks, & other downstream services, and full recovery took about 15 hours.
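
To make the idea of a blast radius concrete, here is a minimal Python sketch that walks a dependency map and lists everything downstream of a failed service. The service names and edges are hypothetical and purely illustrative. They are not a reconstruction of AWS’s internal architecture.

```python
from collections import deque

# Hypothetical dependency map: each key is a service, each value lists the
# services that depend on it directly. Edges are illustrative assumptions.
DEPENDENTS = {
    "dynamodb": ["ec2-launch-workflow"],
    "ec2-launch-workflow": ["nlb-health-checks"],
    "nlb-health-checks": ["customer-api"],
    "customer-api": [],
    "static-site": [],
}


def blast_radius(failed, dependents):
    """Return every service reachable downstream of `failed`."""
    impacted = set()
    queue = deque([failed])
    while queue:
        service = queue.popleft()
        for downstream in dependents.get(service, []):
            if downstream not in impacted:
                impacted.add(downstream)
                queue.append(downstream)
    return impacted


print(sorted(blast_radius("dynamodb", DEPENDENTS)))
```

Running the same function against your own dependency map is a quick way to spot which single failure would take down the most services.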

One week later, Azure faced its own outage due to an “inadvertent configuration change” to the Azure infrastructure, which again caused a DNS issue.

Microsoft reported that services using Azure Front Door “may have experienced latencies, timeouts, and errors.” It sequentially rolled back to the “last known good configuration,” continued to recover nodes, rerouted traffic through healthy nodes, and blocked customers from making configuration changes.

More than a dozen Azure services went down, including Databricks, Azure Maps, and Azure Virtual. The cascading failures reached Microsoft 365, Xbox, and Minecraft, then spread beyond the Microsoft ecosystem to Alaska Airlines’ key systems and the website of London’s Heathrow Airport.

The global outage lasted for over 8 hours, peaking at 18,000 Azure users and 20,000 Microsoft 365 users reporting issues.

How AI caught the outage early

Human-caused misconfigurations are inevitable. In fact, more than 80% of misconfigured resources are the result of human error.

And given the complexity and scale of the modern cloud, if AWS and Azure are at risk of massive outages, you are too.

When AWS services started to experience problems, Sedai first picked up signals at 12:30 AM PDT, identifying an unusual increase in error rates across several applications for one of our customers. Meanwhile, AWS didn’t announce the outage until 1:26 AM, nearly an hour later.

The increased error rates triggered an alert signal, prompting the system to attempt corrective actions.

Sedai can’t bring a cloud service like AWS or Azure back online, but our technology plays a critical role in early detection, empowering customers to respond faster and with full context.

Traditional observability platforms rely on fixed thresholds and pre-set rules to detect availability issues. However, this approach can’t always determine when an individual application is not performing as expected.

Sedai, meanwhile, uses patented ML models to learn the normal behavior of each cloud-based application, without any thresholds or rules. That uniquely allows our platform to catch the early warning signs of outages — along with other performance issues that engineering teams can’t predict.
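
To illustrate the difference between a fixed threshold and a learned baseline, here is a minimal Python sketch. It is a deliberately simple stand-in (a z-score against each application’s own recent error rates), not Sedai’s patented models. The function name, the sample data, and the `z_threshold` and `min_samples` defaults are all assumptions made for this example.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, z_threshold=3.0, min_samples=10):
    """Flag `latest` if it sits far outside the app's own recent behavior."""
    if len(history) < min_samples:
        return False  # not enough data to know what "normal" looks like
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return latest != baseline
    return abs(latest - baseline) / spread > z_threshold


# Error rates in percent for two hypothetical applications.
quiet_app = [0.10, 0.12, 0.09, 0.11, 0.10, 0.08, 0.11, 0.10, 0.09, 0.12]
noisy_app = [2.0, 1.8, 2.2, 1.9, 2.1, 2.4, 1.7, 2.0, 2.3, 1.6]

print(is_anomalous(quiet_app, 0.9))  # far above this app's usual range
print(is_anomalous(noisy_app, 2.3))  # within this app's usual range
```

A static “alert above 1%” rule would stay silent for the quiet application and fire constantly for the noisy one. Comparing each application against its own history handles both cases, although the sensitivity parameter still has to be chosen by someone.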

Sedai serves as a pivotal intelligence layer during outages. Our platform:

Learns your app’s behavior to automatically detect anomalies, misconfigurations, & inefficiencies
Rolls back to the safest version if disaster strikes
Gives you recommendations to remediate the issue

While Sedai can’t prevent AWS or Azure outages, it can autonomously prevent similar outages in our customers’ environments before they impact users. Ultimately, this kind of autonomous system has become necessary to handle the complexity of the cloud.

No doubt, the cloud will continue to experience disruptions. But as we move toward a future of more large-scale events (LSEs), like the AWS and Azure outages, companies and engineers must rely on autonomous systems to respond in real time.

How to build a resilient cloud

Along with implementing autonomous systems like Sedai, we recommend other key strategies for building a more resilient cloud within your org, so that when an outage does happen, you’re as ready as possible.

Here are three strategies you can use to mitigate the blast radius of an outage or failure.

1. Build for redundancy and fault-containment

The backbone of resilience, redundancy saves you when you can’t isolate or recover from failures. It ensures there is no single point of failure.

How to build redundancy:

Use multiple availability zones (AZs) within a region, so one failure doesn’t bring the entire service down
Create multi-region or multi-cloud setups for bounded recovery time
Follow the Deployment Stamps design pattern to shift to workload-level redundancy, containing compute, storage, networking, & dependency redundancies
Use a globally distributed data store for data replication & consistency
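
On the client side, redundancy only helps if callers can actually move to a healthy copy. The sketch below shows one simple way to do that in Python, assuming a list of equivalent endpoints in different AZs or regions. The endpoint names and the `request` callable are hypothetical placeholders for whatever client your service uses.

```python
def call_with_failover(endpoints, request):
    """Try each endpoint in order; return (endpoint, result) from the first success."""
    errors = {}
    for endpoint in endpoints:
        try:
            return endpoint, request(endpoint)
        except ConnectionError as exc:
            # Remember the failure and move on to the next zone or region.
            errors[endpoint] = exc
    raise RuntimeError(f"all endpoints failed: {sorted(errors)}")


# Hypothetical endpoints, ordered by preference.
ENDPOINTS = [
    "az-a.example.internal",
    "az-b.example.internal",
    "other-region.example.internal",
]


def fake_request(endpoint):
    """Stand-in for a real network call; simulates one zone being down."""
    if endpoint == "az-a.example.internal":
        raise ConnectionError("zone unavailable")
    return f"ok from {endpoint}"


print(call_with_failover(ENDPOINTS, fake_request))
```

In production you would usually add timeouts, health checks, and backoff, or push this logic into a load balancer or DNS failover policy rather than application code.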

2. Implement operational resilience

Automation, monitoring, and testing are the keys to remaining operationally resilient over time.

How to build operational resilience:

Use Infrastructure as Code (IaC) for repeatable deployments
Implement observability to collect metrics, logs, & traces across clouds
Automate failover testing & chaos engineering
Regularly run disaster recovery drills & measure RTO/RPO compliance
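
Measuring RTO/RPO compliance can be as simple as comparing three timestamps from a drill against your targets. Here is a minimal Python sketch. The `DrillResult` fields, the targets, and the example times are assumptions for illustration, not values from any real drill.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    last_good_backup: datetime  # newest recoverable data point before the failure
    failure_start: datetime     # when the simulated outage began
    service_restored: datetime  # when the service passed health checks again


def check_drill(result, rto_target, rpo_target):
    """Compare measured recovery time and data-loss window with the targets."""
    recovery_time = result.service_restored - result.failure_start
    data_loss_window = result.failure_start - result.last_good_backup
    return {
        "recovery_time": recovery_time,
        "data_loss_window": data_loss_window,
        "rto_met": recovery_time <= rto_target,
        "rpo_met": data_loss_window <= rpo_target,
    }


drill = DrillResult(
    last_good_backup=datetime(2025, 1, 15, 9, 50),
    failure_start=datetime(2025, 1, 15, 10, 0),
    service_restored=datetime(2025, 1, 15, 10, 45),
)
report = check_drill(drill, rto_target=timedelta(minutes=30), rpo_target=timedelta(minutes=15))
print(report)
```

In this made-up drill the data-loss window (10 minutes) is inside the RPO target, but the 45-minute recovery misses the 30-minute RTO, which is exactly the kind of gap a drill is supposed to surface.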

3. Adopt a multi-cloud strategy

As these LSEs continue to grow in frequency and size, it’s more important than ever to adopt a multi-cloud strategy to avoid vendor lock-in, enhance reliability, and bolster compliance.

How to adopt a multi-cloud strategy:

Identify critical workloads to replicate or mirror on a second cloud provider
Use portable architectures like containers/Kubernetes, serverless frameworks, or cross-cloud APIs
Apply consistent security, governance, and networking policies across providers
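
One way to keep a workload portable is to code against a small interface of your own instead of a provider SDK. The Python sketch below mirrors writes to two object stores and reads from whichever one still has the data. `ObjectStore`, `InMemoryStore`, and `MirroredStore` are hypothetical names, and the in-memory class stands in for real provider clients.

```python
from typing import Protocol


class ObjectStore(Protocol):
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in for a real provider client; holds objects in a dict."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def get(self, key: str) -> bytes:
        return self.objects[key]  # raises KeyError if the key is missing


class MirroredStore:
    """Writes to every store; reads from the first store that has the key."""

    def __init__(self, stores: list[ObjectStore]) -> None:
        self.stores = list(stores)

    def put(self, key: str, data: bytes) -> None:
        for store in self.stores:
            store.put(key, data)

    def get(self, key: str) -> bytes:
        for store in self.stores:
            try:
                return store.get(key)
            except KeyError:
                continue
        raise KeyError(key)


primary = InMemoryStore("cloud-a")
secondary = InMemoryStore("cloud-b")
mirrored = MirroredStore([primary, secondary])
mirrored.put("config.json", b"{}")
primary.objects.clear()  # simulate losing the primary provider
print(mirrored.get("config.json"))
```

Synchronous mirroring like this trades write latency for simplicity. Real setups often replicate asynchronously and only for the critical workloads identified in the first step.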

Runbooks that aren't regularly tested under realistic conditions become unreliable when outages actually occur. Book a demo to see how Sedai's continuous SLO awareness reduces the scenarios where runbooks are needed in the first place.

The future of the autonomous cloud

Preparing for outages has become a core responsibility for every engineering leader. And in my experience, autonomous systems are now a necessary part of that job.

Current APMs do detect increased error rates across applications when an outage happens, but they depend on rules & thresholds that don’t always catch the early symptoms. Our platform at Sedai doesn’t rely on them. In short, the complexity of the cloud requires a deeper level of intelligence.

The scale of modern computing exceeds our limits as human beings, and yet we still expect engineers and SREs to configure & manage cloud resources themselves.

So as we push forward into the era of the all-encompassing and intricate cloud, it’s more important than ever that engineering teams shift to autonomous systems like Sedai. It’s time we stop leaving our engineering teams to fend for themselves in the face of increasing outages beyond their scope, and instead give them tools that can identify and remediate issues. Autonomously.
