
A note from our Director of Engineering on rethinking resilience after the AWS US-East-1 incident.
There’s a comforting simplicity in the way engineers think about cloud outages: the cloud is either up or down. But on October 20, 2025, AWS’s US-East-1 region quietly broke that binary. The region stayed online. Requests still flowed. Dashboards still blinked green. And yet, for many teams, including ours at Tonkean, we couldn’t move. We weren’t down. But we couldn’t deploy. We were alive, but paralyzed.
It was a predicament unique to our changing world. Here’s how we reacted, and how we’re working to ensure greater resilience the next time it happens.
Between late October 19 and the morning of October 20, AWS experienced a cascading failure triggered by a DNS issue affecting the DynamoDB service in US-East-1. In a blog detailing the cause of the outage, AWS described how the issue stemmed from “a latent defect within the service’s automated DNS management system.”
On paper, it sounds like a database hiccup. In practice, that defect rippled through AWS’s internal ecosystem, especially the subsystems responsible for launching new EC2 instances.
What’s scary about this, from a user perspective, is that it’s the kind of dependency you don’t usually see until it breaks. When EC2’s internal orchestration can’t launch new nodes, every system that relies on scaling, auto-healing, or rolling updates suddenly loses one of its most fundamental abilities: the ability to change.
At Tonkean, our Kubernetes clusters in US-East-1 were directly affected. Not because DynamoDB was down, but because AWS itself, depending on DynamoDB internally, couldn’t provision new EC2 machines. That meant we couldn’t add or replace nodes in the cluster. We couldn’t safely deploy new versions of our services. Even worse, a single reclaimed spot instance could have triggered a cascade of missing capacity that we couldn’t replace.
The region was alive, but it had lost the ability to regenerate. And that’s a scarier failure mode than total darkness.
In these moments, automation becomes a double-edged sword. Our first instinct in the cloud era is to automate everything: deployment pipelines, scaling, rollouts, self-healing. But when your foundation (EC2 instance creation) is frozen, automation turns risky. The system may “self-heal” straight into collapse. So we did the opposite. We paused. We disabled ArgoCD’s auto-sync, freezing all automatic deployments. It felt wrong, deliberately stopping change in a platform built to move fast, but it was necessary.
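For concreteness, here’s a minimal sketch of what that pause looks like at the manifest level, assuming a typical Argo CD Application; the application name, namespace, and repository below are hypothetical stand-ins, not our actual setup. Auto-sync is driven by the spec.syncPolicy.automated block, so removing it stops Argo CD from pushing new revisions or reverting drift until someone deliberately restores it.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: orders-service                 # hypothetical application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/deployments.git   # hypothetical repo
    path: orders-service
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: orders
  # Auto-sync previously lived here. With the block removed, Argo CD only syncs
  # this application when a human explicitly triggers it.
  syncPolicy: {}
  #   automated:
  #     prune: true
  #     selfHeal: true
```

The same effect can be applied per application from the CLI with argocd app set orders-service --sync-policy none, which is handy when you need to freeze many applications quickly.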
Something else we did: we leaned on our Karpenter setup, which we had configured with careful node affinities. That affinity configuration turned out to be a quiet hero: it kept the essential workloads pinned to healthy nodes and prevented aggressive rescheduling that could have destabilized running services. Karpenter couldn’t launch new EC2 instances — no one could — but it kept the existing ones efficiently utilized. Our clusters effectively “held the fort” with what they had.
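To make the idea concrete, here is a hedged sketch of what pinning a critical workload looks like, using Karpenter’s well-known karpenter.sh/capacity-type node label to require on-demand capacity. The service name, image, and resource sizes are hypothetical, and the labels you match on should mirror your own Karpenter NodePools and version.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway                    # hypothetical critical service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-gateway
  template:
    metadata:
      labels:
        app: api-gateway
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  # Keep this workload off spot capacity, so a reclaimed spot
                  # node can't take it down at a moment when no replacement
                  # instance can be launched.
                  - key: karpenter.sh/capacity-type
                    operator: In
                    values: ["on-demand"]
      containers:
        - name: api-gateway
          image: registry.example.com/api-gateway:1.42.0    # hypothetical image
          resources:
            requests:
              cpu: "500m"
              memory: 512Mi
```

A required affinity like this means the scheduler would rather leave a pod pending than place it somewhere it could be evicted at the worst possible time.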
This isn’t a scenario you plan for often. You plan for outages, not for a world where you’re operationally alive but evolutionarily frozen. But that’s precisely what makes this incident worth talking about. It’s a new facet of our reality.
This incident reframes what “resilience” really means in the cloud. Resilience isn’t just the ability to serve traffic; it’s the ability to change safely under stress.
For years, engineers have built systems to survive data center failures, AZ outages, even regional collapses. But this was different: the infrastructure was up, the APIs responded, and yet every attempt to evolve (to deploy, to scale, to adapt) was met with friction.
It was a reminder that resilience isn’t about uptime alone; it’s about flexibility and motion.
On my team, in response to this, we’ve started thinking in three states, not two:
1. Up: serving traffic and able to change (deploy, scale, heal as usual).
2. Frozen: serving traffic, but unable to change safely (no new capacity, no safe deploys, no reliable self-healing).
3. Down: not serving traffic at all.
The second state is where most teams are unprepared — and that’s exactly where we found ourselves.
If we want to design for the “frozen region” scenario, we need to evolve the way we build and operate cloud systems:
- Treat deployability as part of resilience: measure and drill for the ability to change, not just the ability to serve traffic.
- Make automation pausable: every auto-sync, auto-scaling, or self-healing loop needs a practiced way to freeze it, the way we froze ArgoCD’s auto-sync.
- Keep critical capacity from depending on just-in-time provisioning: a single reclaimed spot instance shouldn’t be able to start a cascade you can’t recover from (one way to hold headroom is sketched below).
- Make placement deliberate: node affinities kept our essential workloads on healthy nodes; that kind of constraint should be a design decision, not a happy accident.
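On the capacity point above, one common way to hold that headroom is the cluster-overprovisioning pattern: run low-priority placeholder pods that real workloads can preempt immediately when new nodes can’t be launched. The sketch below illustrates that general technique rather than our exact configuration; the class name, replica count, and sizes are assumptions to tune for your own risk tolerance.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: headroom-placeholder
value: -10                    # lower than any real workload, so these pods are preempted first
globalDefault: false
description: "Reserves spare capacity that critical pods can reclaim via preemption."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-headroom
spec:
  replicas: 4                 # how much slack to hold; size it against your largest failure domain
  selector:
    matchLabels:
      app: capacity-headroom
  template:
    metadata:
      labels:
        app: capacity-headroom
    spec:
      priorityClassName: headroom-placeholder
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9    # does nothing; it only occupies the reservation
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
```

When a higher-priority pod can’t schedule and no new instance can be provisioned, the scheduler evicts these placeholders and the critical pod takes their place.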
In modern cloud infrastructure, “up” isn’t good enough anymore.
A truly resilient system must be both alive and changeable.
When AWS stumbled, the world didn’t go offline — but many of us discovered how fragile “deployability” really is.
It’s not the loud outages that teach us the most anymore; it’s the silent ones where the lights stay on, but our hands are tied.
Because in this new era of distributed systems, resilience isn’t about anticipating and surviving occasional cloud outages; it’s about staying flexible when the cloud freezes, too.

