The AWS DynamoDB Incident: When the Cloud Doesn’t Fall Down — But You Still Can’t Move

Afik Udi
November 2, 2025
5 min read

A note from our Director of Engineering on rethinking resilience after the AWS US-East-1 incident.

There’s a comforting simplicity in the way engineers think about cloud outages: the cloud is either up or down. But on October 20, 2025, AWS’s US-East-1 region quietly broke that binary. The region stayed online. Requests still flowed. Dashboards still blinked green. And yet — for many teams, including ours at Tonkean — we couldn’t move. We weren’t down. But we couldn’t deploy. We were alive, but paralyzed.

It was a predicament unique to our changing world. Here’s how we reacted—plus how we worked to ensure greater resilience the next time this happens.

What Happened? When Availability Isn’t Enough

Between late October 19 and the morning of October 20, AWS experienced a cascading failure triggered by a DNS issue affecting the DynamoDB service in US-East-1. In a blog post detailing the cause of the outage, AWS described how the issue stemmed from “a latent defect within the service’s automated DNS management system.”

On paper, it sounds like a database hiccup. In practice, that defect rippled through AWS’s internal ecosystem — especially the subsystems responsible for launching new EC2 instances.

What’s scary about this, from a user perspective, is that it’s the kind of dependency you don’t usually see until it breaks. When EC2’s internal orchestration can’t launch new nodes, every system that relies on scaling, auto-healing, or rolling updates suddenly loses one of its most fundamental abilities: the ability to change.

At Tonkean, our Kubernetes clusters in US-East-1 were directly affected. Not because DynamoDB was down — but because AWS itself, depending on DynamoDB internally, couldn’t provision new EC2 machines. That meant we couldn’t add or replace nodes in the cluster. We couldn’t safely deploy new versions of our services. Even worse, a single reclaimed spot instance could have triggered a cascade of missing capacity that we couldn’t replace.
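To make that last risk concrete, here’s a minimal sketch (not our production tooling) of how you might measure how much of a cluster rides on spot capacity that can’t be replaced while provisioning is frozen. It assumes the official kubernetes Python client and Karpenter’s `karpenter.sh/capacity-type` node label; other provisioners label nodes differently (EKS managed node groups, for example, use `eks.amazonaws.com/capacityType`).

```python
# Rough spot-exposure check: how many nodes could we not replace if
# provisioning froze and spot capacity were reclaimed?
# Assumes the official `kubernetes` Python client and Karpenter's
# `karpenter.sh/capacity-type` label (other provisioners differ).
from collections import Counter

from kubernetes import client, config


def spot_exposure() -> Counter:
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    nodes = client.CoreV1Api().list_node().items

    counts = Counter()
    for node in nodes:
        labels = node.metadata.labels or {}
        counts[labels.get("karpenter.sh/capacity-type", "unknown")] += 1
    return counts


if __name__ == "__main__":
    counts = spot_exposure()
    total = sum(counts.values())
    print(f"{counts.get('spot', 0)}/{total} nodes are spot and would be "
          "irreplaceable while instance launches are failing")
```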

The region was alive, but it had lost the ability to regenerate. And that’s a scarier failure mode than total darkness.

What Did We Do? Holding the Fort

In these moments, automation becomes a double-edged sword. Our first instinct in the cloud era is to automate everything: deployment pipelines, scaling, rollouts, self-healing. But when your foundation (EC2 instance creation) is frozen, automation turns risky. The system may “self-heal” itself straight into collapse. So we did the opposite. We paused. We disabled ArgoCD’s auto-sync, freezing all automatic deployments. It felt wrong — deliberately stopping change in a platform built to move fast — but it was necessary.
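For context, auto-sync can be disabled per Application in the Argo CD UI or CLI, but with many Applications it’s faster to patch them in bulk. Here’s a rough sketch of that idea (an illustration, not the exact procedure we ran), using the kubernetes Python client. It assumes Applications live in the `argocd` namespace and that your client version sends a JSON merge patch for dict bodies, where a null value deletes the key.

```python
# Sketch: bulk-disable Argo CD auto-sync by removing spec.syncPolicy.automated
# from every Application. Assumes Applications live in the "argocd" namespace
# and that patch_namespaced_custom_object sends a JSON merge patch, where
# null deletes the key.
from kubernetes import client, config

GROUP, VERSION, PLURAL, NAMESPACE = "argoproj.io", "v1alpha1", "applications", "argocd"


def freeze_auto_sync() -> None:
    config.load_kube_config()
    api = client.CustomObjectsApi()

    apps = api.list_namespaced_custom_object(GROUP, VERSION, NAMESPACE, PLURAL)
    for app in apps.get("items", []):
        name = app["metadata"]["name"]
        sync_policy = app.get("spec", {}).get("syncPolicy") or {}
        if "automated" not in sync_policy:
            continue  # already manual
        api.patch_namespaced_custom_object(
            GROUP, VERSION, NAMESPACE, PLURAL, name,
            body={"spec": {"syncPolicy": {"automated": None}}},
        )
        print(f"auto-sync disabled for {name}")


if __name__ == "__main__":
    freeze_auto_sync()
```

One caveat worth noting: if your Applications are themselves managed by a parent app-of-apps, pause that parent first, or it will quietly re-enable what you just turned off.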

Something else we did: we leaned on our Karpenter setup, which we had configured with careful node affinities. That affinity configuration turned out to be a quiet hero: it kept the essential workloads pinned to healthy nodes and prevented aggressive rescheduling that could have destabilized running services. Karpenter couldn’t launch new EC2 instances — no one could — but it kept the existing ones efficiently utilized. Our clusters effectively “held the fort” with what they had.
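As an illustration of the pattern (not our exact manifests), pinning a critical workload with a required node affinity looks roughly like this with the kubernetes Python client; the `workload-tier` label key and its values are hypothetical placeholders.

```python
# Sketch: required node affinity that pins a critical workload to nodes
# labeled workload-tier=critical (a hypothetical label), so the scheduler
# won't drift it onto nodes that may churn.
from kubernetes import client

critical_affinity = client.V1Affinity(
    node_affinity=client.V1NodeAffinity(
        required_during_scheduling_ignored_during_execution=client.V1NodeSelector(
            node_selector_terms=[
                client.V1NodeSelectorTerm(
                    match_expressions=[
                        client.V1NodeSelectorRequirement(
                            key="workload-tier",
                            operator="In",
                            values=["critical"],
                        )
                    ]
                )
            ]
        )
    )
)

# Attach it to a pod template when building a Deployment spec, e.g.:
# pod_spec = client.V1PodSpec(containers=[...], affinity=critical_affinity)
```

The “IgnoredDuringExecution” half of that affinity matters here too: pods that are already running stay put even if node labels drift, which is exactly the behavior you want when replacement capacity doesn’t exist.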

None of this is something you plan for often. You plan for outages, not for a world where you’re operationally alive but evolutionarily frozen. But that’s precisely what makes this incident important to talk about. This is a new facet of our reality.

The Real Lesson: Change Is a Form of Uptime

This incident reframes what “resilience” really means in the cloud. Resilience isn’t just the ability to serve traffic; it’s the ability to change safely under stress.

For years, engineers have built systems to survive data center failures, AZ outages, even regional collapses. But this was different: the infrastructure was up, the APIs responded, and yet every attempt to evolve — to deploy, to scale, to adapt — was met with friction.

It was a reminder that resilience isn’t about uptime alone; it’s about flexibility and motion.

In response, my team has started thinking in three states, not two:

  1. Alive and healthy – traffic flows, change possible.

  2. Alive but frozen – traffic flows, change impossible.

  3. Down – traffic stops.

The second state is where most teams are unprepared — and that’s exactly where we found ourselves.
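The distinction is easy to encode. A toy sketch, assuming you already collect two boolean health signals, one for the serve-plane and one for the change-plane:

```python
# Toy model of the three states, assuming two signals you already collect:
# can we serve traffic, and can we change (deploy, scale, replace nodes)?
from enum import Enum


class ClusterState(Enum):
    ALIVE_AND_HEALTHY = "alive and healthy"  # traffic flows, change possible
    ALIVE_BUT_FROZEN = "alive but frozen"    # traffic flows, change impossible
    DOWN = "down"                            # traffic stops


def classify(serving: bool, changeable: bool) -> ClusterState:
    if not serving:
        return ClusterState.DOWN
    if not changeable:
        return ClusterState.ALIVE_BUT_FROZEN
    return ClusterState.ALIVE_AND_HEALTHY


# On October 20, US-East-1 looked like classify(serving=True, changeable=False).
```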

What This Changes About How We Build

If we want to design for the “frozen region” scenario, we need to evolve the way we build and operate cloud systems:

  • Monitor the change-plane, not just the serve-plane. It’s not enough to know requests succeed; track metrics like instance-launch latency, autoscaling failure rates, and node replacement health (a rough canary sketch follows this list).

  • Design for graceful stasis. Build your clusters and workloads to sustain themselves without immediate scale-up for several hours. Avoid aggressive eviction or rescheduling policies that depend on new capacity.

  • Build safety brakes into automation. Your deployment tools should know when to stop — like we did with ArgoCD — not just when to start.

  • Architect for dependency awareness. This incident shows how a seemingly unrelated service like DynamoDB can cripple EC2 launches. Map these indirect dependencies. Don’t let “invisible glue” be your single point of failure.

  • Test for “frozen” conditions. Run chaos drills that simulate inability to launch new nodes while everything else is green. See what breaks, what keeps running, and what you can still do.
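To make the first and last bullets concrete, here’s a rough sketch of a change-plane canary, with two loudly hedged assumptions: the dry-run EC2 launch only validates permissions and request shape (not real capacity or the full launch workflow), and long-Pending pods are only a proxy for “new nodes aren’t arriving.” The AMI ID is a placeholder. Treat it as a starting point for a drill, not a finished monitor.

```python
# Sketch of a change-plane canary. Two weak signals, both hedged:
#  1. A dry-run EC2 launch: validates permissions and request shape only,
#     not real capacity or the full launch workflow.
#  2. Pods stuck in Pending: a rough proxy for "new nodes aren't arriving."
import boto3
from botocore.exceptions import ClientError
from kubernetes import client, config

AMI_ID = "ami-xxxxxxxxxxxx"  # placeholder; use a real AMI in your account


def ec2_dry_run_launch_ok(region: str = "us-east-1") -> bool:
    ec2 = boto3.client("ec2", region_name=region)
    try:
        ec2.run_instances(ImageId=AMI_ID, InstanceType="m5.large",
                          MinCount=1, MaxCount=1, DryRun=True)
    except ClientError as err:
        # DryRunOperation means the request would have been accepted.
        return err.response["Error"]["Code"] == "DryRunOperation"
    return True  # not reached in practice: DryRun always raises


def pending_pod_count() -> int:
    config.load_kube_config()
    pods = client.CoreV1Api().list_pod_for_all_namespaces(
        field_selector="status.phase=Pending")
    return len(pods.items)


if __name__ == "__main__":
    print("ec2 dry-run launch ok:", ec2_dry_run_launch_ok())
    print("pods stuck in Pending:", pending_pod_count())
```

A stronger version would also watch your provisioner’s own metrics for failed node launches and time-to-ready, since those exercise the real launch path rather than a dry run.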

A Shift in Perspective

In modern cloud infrastructure, “up” isn’t good enough anymore.

A truly resilient system must be both alive and changeable.

When AWS stumbled, the world didn’t go offline — but many of us discovered how fragile “deployability” really is.

It’s not the loud outages that teach us the most anymore; it’s the silent ones where the lights stay on, but our hands are tied.

Because in this new era of distributed systems, resilience isn’t about anticipating and surviving occasional cloud outages — it’s about staying flexible when the cloud freezes, too.
