Amazon AWS Outage Has Crashed The Web

In a stark reminder of the internet’s fragility, Amazon Web Services experienced a major outage on October 20, 2025, taking down hundreds of major platforms including Snapchat, Fortnite, Reddit, Coinbase, and even AWS’s own support systems. Starting at 3:11 AM ET, the cascading failure stemmed from a DNS resolution issue affecting DynamoDB in AWS’s critical US-EAST-1 region in Northern Virginia. With AWS controlling 30% of the global cloud infrastructure market and serving over 4 million customers, the outage exposed how the entire modern internet economy rests on a surprisingly small number of infrastructure chokepoints.

What’s Actually Happening

At approximately 3:11 AM ET on Monday, October 20, 2025, AWS began experiencing what would become one of the most significant cloud infrastructure failures in recent years. The technical issue—a DNS resolution failure for the DynamoDB API endpoint in the US-EAST-1 region—triggered a domino effect that rendered hundreds of major internet services unusable for hours.

The scope of impact was staggering. Downdetector recorded approximately 50,000 simultaneous outage reports across multiple services. Gaming platforms Fortnite, Roblox, and Pokémon GO went dark. Communication tools Slack, Signal, and Snapchat became unreachable. Financial services including Coinbase, Venmo, and Robinhood were inaccessible, prompting Coinbase to reassure users that “all funds are safe.” UK banking services from Lloyds, Halifax, and Bank of Scotland experienced disruptions, along with government services including the HMRC tax authority website.

AWS engineers identified the root cause at approximately 5:01 AM ET—a DNS problem affecting how systems could locate and communicate with DynamoDB, AWS’s managed NoSQL database service. By 7:48 AM ET, AWS reported “significant signs of recovery,” though service restoration continued throughout the morning as systems worked through backlogs of queued requests.

  • 3:11 AM ET Start Time: AWS first detected increased error rates and latencies across multiple services in US-EAST-1, the company’s most critical regional hub
  • 50,000+ Simultaneous Reports: Downdetector recorded massive spikes across hundreds of platforms, marking the broadest internet-wide disruption since the 2024 CrowdStrike incident
  • 4-Hour Duration: From initial detection to “fully mitigated” status, the outage lasted approximately four hours, with residual issues persisting beyond that window

The Strategic Play

This outage wasn’t just a technical failure—it was a demonstration of systemic architectural risk that exposes fundamental questions about how the modern internet is structured. With roughly 30% of the global cloud infrastructure market, AWS underpins so many digital services that when it fails, a large share of the internet’s services fail with it.

The concentration risk is even more acute than market share suggests. US-EAST-1, the affected region, isn’t just another AWS data center—it hosts many control-plane endpoints and high-throughput managed services that global customers depend on regardless of where their primary workloads run. Identity and Access Management (IAM) updates, DynamoDB Global Tables, and other foundational services all route through US-EAST-1, creating single points of failure that cascade across geographies.

The DNS failure affecting DynamoDB is particularly instructive. DynamoDB serves as the session store, authentication backend, and metadata repository for countless applications. When systems couldn’t resolve the DynamoDB endpoint, they couldn’t validate user sessions, retrieve configuration data, or maintain state—effectively rendering entire applications inoperable even if their primary compute resources were functioning normally.

Under the Hood: What Makes This Different

DNS resolution failures are among the most catastrophic types of infrastructure problems because DNS operates as the internet’s address book. When DNS fails, systems lose the ability to translate human-readable domain names into IP addresses that machines use to communicate. In this case, the DNS problem specifically affected dynamodb.us-east-1.amazonaws.com, preventing applications from locating AWS’s database service.
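
To make the failure mode concrete, here is a minimal sketch in Python (the port and helper name are chosen for illustration) of what client code experiences when resolution of the regional endpoint breaks: the lookup itself fails, so no request is ever sent.

```python
import socket

# The regional DynamoDB endpoint that applications (and the AWS SDKs) must
# resolve before any API request can be sent.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def can_resolve(hostname: str) -> bool:
    """Return True if DNS can translate the hostname into IP addresses."""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror as err:
        # During the outage, lookups like this failed, so requests never left
        # the client: the database was unreachable at the address-book level.
        print(f"DNS resolution failed for {hostname}: {err}")
        return False

if __name__ == "__main__":
    print("resolvable:", can_resolve(ENDPOINT))
```

Inside the AWS SDKs this same failure typically surfaces as an endpoint connection error, because every API call begins with exactly this kind of lookup.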

The cascading nature of the failure illustrates tight coupling in modern cloud architectures. Applications don’t just use DynamoDB for data storage—they use it for critical real-time operations including authentication, authorization, and session management. When DynamoDB became unreachable, applications couldn’t validate user credentials, maintain logged-in sessions, or retrieve permission settings. This meant that even users already authenticated to a service would be logged out and unable to reconnect.
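
That coupling can be loosened at the application layer. Below is a minimal, hypothetical sketch—the table name, key schema, and cache TTL are assumptions for illustration, not details from the incident—of a session check that degrades to a short-lived local cache instead of logging every user out when DynamoDB is unreachable.

```python
import time
from typing import Optional

import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

# Hypothetical table and key names, for illustration only.
dynamodb = boto3.client("dynamodb", region_name="us-east-1")

_session_cache = {}          # session_id -> (item, cached_at)
CACHE_TTL_SECONDS = 300      # how long to trust a stale session during an outage

def validate_session(session_id: str) -> Optional[dict]:
    """Check a session in DynamoDB, falling back to a short-lived local cache
    so an outage degrades to a stale session instead of a forced logout."""
    try:
        resp = dynamodb.get_item(
            TableName="sessions",
            Key={"session_id": {"S": session_id}},
        )
        item = resp.get("Item")
        if item:
            _session_cache[session_id] = (item, time.time())
        return item
    except (EndpointConnectionError, ClientError):
        cached = _session_cache.get(session_id)
        if cached and time.time() - cached[1] < CACHE_TTL_SECONDS:
            return cached[0]  # degrade gracefully rather than failing closed
        return None
```

The trade-off is accepting briefly stale session data in exchange for availability; whether that is acceptable depends on how sensitive the application is.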

AWS’s status updates revealed they were “working on multiple parallel paths to accelerate recovery,” suggesting the problem wasn’t simple to diagnose or fix. The fact that AWS’s own support ticketing system went down—meaning customers couldn’t even report the outage through official channels—demonstrates how internal dependencies can create circular failures in which the tools needed to fix a problem are themselves taken out by that problem.

The Disruptions Nobody’s Talking About

1. Financial Infrastructure Vulnerability Beyond Consumer Apps

While headlines focused on social media and gaming outages, the financial impact ran far deeper. Cryptocurrency exchanges handling billions in daily trading volume became inaccessible during a volatile market period. Payment services including Venmo and PayPal experienced disruptions, potentially affecting business operations that rely on real-time payment confirmation. UK banking infrastructure failures meant individuals couldn’t access accounts, make transfers, or complete purchases—exposing how banking digitization has created new systemic dependencies on third-party cloud providers.

2. Government Service Dependency Creating Sovereignty Concerns

The HMRC website outage in the UK reveals that government services—traditionally considered critical national infrastructure—now depend on commercial cloud providers. If tax collection, permit applications, or emergency services run on AWS, then AWS outages become matters of national security and governance. The concentration of government workloads on a small number of cloud providers creates geopolitical risk: what happens when a cloud provider faces regulatory action, cybersecurity threats, or politically motivated disruption?

3. Multi-Cloud Strategy Illusion and Hidden Single Points of Failure

Many organizations believe they’ve mitigated cloud dependency through multi-cloud architectures—using AWS, Google Cloud, and Azure simultaneously. But this outage exposed hidden dependencies: if your identity management, DNS, or session storage uses AWS even while compute runs elsewhere, you still fail when AWS fails. True redundancy requires not just distributing workloads but completely decoupling authentication, authorization, and state management across providers—a level of architectural complexity that most organizations haven’t achieved.
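
A crude first pass at surfacing such hidden dependencies is simply to follow the DNS trail of every critical hostname and see which provider it ultimately points at. The sketch below uses hypothetical hostnames and only catches DNS-visible coupling; SDK endpoints, identity providers, and DNS hosting itself require a deeper audit.

```python
import socket

# Hypothetical hostnames standing in for a "multi-cloud" stack's critical
# dependencies; replace with your own inventory.
CRITICAL_DEPENDENCIES = [
    "auth.example.com",      # identity provider
    "api.example.com",       # primary API, nominally on another cloud
    "sessions.example.com",  # session / state store
]

def provider_hint(hostname: str) -> str:
    """Follow the DNS chain and return a coarse hint about who actually hosts it."""
    try:
        canonical, _aliases, _ips = socket.gethostbyname_ex(hostname)
    except socket.gaierror:
        return "unresolvable"
    if canonical.endswith("amazonaws.com") or canonical.endswith("cloudfront.net"):
        return f"AWS ({canonical})"
    return canonical  # inspect manually

if __name__ == "__main__":
    for host in CRITICAL_DEPENDENCIES:
        print(f"{host:25s} -> {provider_hint(host)}")
```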

Strategic Implications by Role

For Strategic Operators (C-Suite)

This outage should trigger board-level questions about operational resilience and dependency risk. The concentration of cloud infrastructure creates systemic vulnerabilities that traditional business continuity planning doesn’t address.

  • Demand detailed dependency mapping: What happens if AWS, Azure, or Google Cloud fails for 4+ hours? Can your business continue operating? Most organizations cannot answer this question with specificity
  • Evaluate regulatory and contractual exposure: If you have SLAs promising 99.9% uptime but depend on infrastructure with demonstrated failure modes, you face liability when upstream failures cascade to your customers
  • Consider insurance and financial hedging: Cloud provider outages create quantifiable financial losses; insurance products covering infrastructure dependency risk are emerging as organizations recognize concentrated exposure

For Builder-Executives (Technical Leaders)

Architecture decisions made for cost optimization and convenience have created systemic dependencies that most organizations don’t fully understand. This outage provides a forcing function to revisit foundational assumptions.

  • Audit authentication and session management dependencies: If your auth system uses AWS Cognito, DynamoDB, or other managed services, you have a single point of failure regardless of where compute runs
  • Implement true multi-region redundancy with active-active failover: Passive disaster recovery isn’t sufficient; systems must detect regional failures and automatically reroute traffic without human intervention (see the failover sketch after this list)
  • Develop out-of-band administrative access: When AWS support systems fail during outages, you need alternative communication channels and administrative capabilities that don’t depend on the failing infrastructure
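
As a rough illustration of the active-active point above, the sketch below assumes a DynamoDB Global Table named "sessions" replicated to two regions and simply tries each regional replica in turn with tight timeouts. A production setup would add health checks, traffic routing (for example via DNS or a service mesh), and write-conflict handling, but the core idea is that failover happens in code, not in a runbook.

```python
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

# Assumed setup: a DynamoDB Global Table named "sessions" replicated to both
# regions, so either replica can serve requests if the other is unreachable.
REGIONS = ["us-east-1", "us-west-2"]

_clients = {
    region: boto3.client(
        "dynamodb",
        region_name=region,
        config=Config(connect_timeout=2, read_timeout=2, retries={"max_attempts": 1}),
    )
    for region in REGIONS
}

def get_session(session_id: str):
    """Try each regional replica in turn; fail over automatically on error."""
    last_error = None
    for region in REGIONS:
        try:
            resp = _clients[region].get_item(
                TableName="sessions",
                Key={"session_id": {"S": session_id}},
            )
            return resp.get("Item")
        except (BotoCoreError, ClientError) as err:
            last_error = err  # fall through to the next region
    raise RuntimeError(f"all regions failed; last error: {last_error}")
```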

For Enterprise Transformers (Change Leaders)

Cloud migration and digital transformation initiatives have prioritized speed and cost reduction over resilience. This outage exposes the organizational and process gaps that infrastructure consolidation creates.

  • Establish clear accountability for dependency risk: Who in your organization tracks third-party infrastructure dependencies? Most companies lack clear ownership of this increasingly critical risk domain
  • Create outage response playbooks that account for communications failures: Standard incident response assumes you can create support tickets, access documentation, and communicate via normal channels—assumptions this outage violated
  • Build organizational capability for rapid architectural pivots: The ability to migrate critical workloads between providers within hours, not months, becomes essential when single-provider dependencies create existential risk

Market Ripple Effects

The immediate market impact remained muted—AWS is too systemically important for customers to meaningfully reduce dependency in response to isolated incidents. But the outage does raise longer-term questions about cloud provider concentration and antitrust implications. If a single provider controls infrastructure that, when it fails, disrupts banking, government services, and global commerce simultaneously, does that constitute a monopoly requiring regulatory intervention?

The 2017 AWS S3 outage, lasting just four hours, cost S&P 500 companies an estimated $150 million according to cyber risk firm Cyence. Today’s outage, affecting more services across a broader set of industries, likely generated significantly larger losses. Companies face both direct costs (lost revenue during downtime) and indirect costs (customer churn, reputational damage, contractual penalties).

For cloud competitors, each AWS outage theoretically creates opportunity to win market share by emphasizing reliability and diversification. However, AWS’s scale advantages—massive capital for infrastructure investment, extensive service offerings, deep enterprise relationships—make it difficult for competitors to capitalize on isolated incidents. Organizations express frustration after outages but remain locked in by switching costs, technical debt, and the reality that alternative providers face similar systemic risks.

The Bottom Line

The AWS outage wasn’t just a technical failure—it was a stress test revealing that the modern internet operates on a remarkably fragile foundation. With AWS controlling 30% of cloud infrastructure and critical services routing through concentrated regional hubs, we’ve built a digital economy with systemic single points of failure. The DNS issue affecting DynamoDB took down banking, government services, communications platforms, and global commerce simultaneously because architectural convenience and cost optimization trumped resilience planning. For organizations, the wake-up call is clear: your business continuity planning must account for cloud provider failures lasting hours or days, not minutes. True resilience requires painful architectural complexity—distributed authentication, active-active multi-region deployments, and out-of-band administrative capabilities—that most organizations have deferred as “too expensive.” This outage proves that dependency risk isn’t theoretical; it’s operational reality that boards and technical leadership can no longer ignore.


Navigate these infrastructure risks with The Business Engineer’s strategic frameworks. Our AI Business Models guide addresses cloud dependency risk. For systematic resilience approaches, explore our Business Engineering workshop.
