Cloudflare suffered a major global outage on Tuesday, disrupting a significant share of the web (the company handles traffic for roughly 20% of all websites) and taking down major platforms including X (Twitter), ChatGPT, Canva, and several other popular services. The outage, which lasted close to five hours, left users worldwide facing HTTP 500 errors, a common indicator of internal server failure.

Today, Cloudflare Co-Founder and CEO Matthew Prince published a detailed post-mortem explaining what caused the widespread disruption. Importantly, he confirmed that the issue was not a cyberattack but an internal configuration failure.
“An outage like today is unacceptable. We’ve architected our systems to be highly resilient to failure to ensure traffic will always continue to flow. When we’ve had outages in the past, it’s always led to us building new, more resilient systems,” said Prince.
Not a DDoS Attack – It Was an Internal System Error
Prince described the event as “Cloudflare’s worst outage since 2019”, apologizing to customers and Internet users for the disruption.
According to him, the outage was triggered by a permissions change in one of Cloudflare’s database systems. The change caused the query that generates a feature file used by its Bot Management system to return duplicate rows, so the file unexpectedly doubled in size and exceeded a hard limit in the software.
Once the oversized file was created, it propagated across Cloudflare’s global network and caused proxy software to fail, resulting in widespread HTTP 5xx errors.
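To make that failure mode concrete, here is a minimal, purely illustrative sketch in Rust of how a hard limit like the one described above can turn an oversized configuration file into a process-wide failure. The 200-entry limit, the function names, and the panic-on-error behaviour are assumptions for illustration, not details from Cloudflare’s post-mortem.

```rust
// Illustrative sketch only -- not Cloudflare's actual code. It assumes a
// proxy module that preallocates room for a fixed number of bot-management
// "features" and treats an oversized feature file as a fatal error, which is
// one plausible way a config file that suddenly doubles in size can take the
// process down.

use std::fmt;

const MAX_FEATURES: usize = 200; // hypothetical hard limit baked into the proxy

#[derive(Debug)]
struct FeatureFileError(String);

impl fmt::Display for FeatureFileError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "feature file rejected: {}", self.0)
    }
}

/// Parse one feature per line, refusing files that exceed the preallocated limit.
fn load_features(contents: &str) -> Result<Vec<String>, FeatureFileError> {
    let features: Vec<String> = contents
        .lines()
        .filter(|l| !l.trim().is_empty())
        .map(|l| l.trim().to_string())
        .collect();

    if features.len() > MAX_FEATURES {
        return Err(FeatureFileError(format!(
            "{} features exceeds limit of {}",
            features.len(),
            MAX_FEATURES
        )));
    }
    Ok(features)
}

fn main() {
    // A "good" file: well under the limit.
    let good: String = (0..50).map(|i| format!("feature_{i}\n")).collect();
    // A "bad" file: duplicate rows have doubled the entry count past the limit.
    let bad: String = (0..250).map(|i| format!("feature_{i}\n")).collect();

    println!("good file: {:?} features", load_features(&good).map(|f| f.len()));

    // If the caller unwraps the result instead of handling the error, the
    // oversized file crashes the process -- the failure mode described above.
    let features = load_features(&bad).unwrap(); // panics here
    println!("loaded {} features", features.len());
}
```

In a setup like this, a file that doubles past the limit is rejected, and any caller that treats that rejection as fatal takes the whole proxy process down with it.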
Bad File Originated from ClickHouse Cluster
The problematic feature file came from a query running on a ClickHouse database cluster.
The file was regenerated every five minutes. Because the permissions change had reached only some ClickHouse nodes, each run produced either a good or a bad file depending on which node served the query, so the network cycled between failures and partial recoveries, a pattern that made diagnosis much more difficult.
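As a rough illustration of that cycle, the sketch below (same assumptions as before, not Cloudflare’s actual pipeline) models a regeneration job that rebuilds the feature file each run from whichever database node answers: nodes that have already received the permissions change return duplicate rows and push the file over the limit, while nodes that have not yet been updated produce a valid file, so the output alternates between good and bad.

```rust
// Illustrative sketch only, not Cloudflare's pipeline. The node behaviour,
// row counts, and 200-entry limit are assumptions for illustration.

const MAX_FEATURES: usize = 200;

/// A node that has the new permissions returns each feature twice (duplicate
/// rows); a node that does not yet have them returns the expected 150 rows.
fn query_features(node_updated: bool) -> Vec<String> {
    let base: Vec<String> = (0..150).map(|i| format!("feature_{i}")).collect();
    if node_updated {
        base.iter().chain(base.iter()).cloned().collect() // 300 rows
    } else {
        base
    }
}

fn main() {
    // Simulate six five-minute regeneration cycles hitting different nodes.
    for cycle in 0..6 {
        let node_updated = cycle % 2 == 0; // which node serves the query varies
        let rows = query_features(node_updated);
        if rows.len() > MAX_FEATURES {
            println!("cycle {cycle}: BAD file ({} rows) propagated -> proxies fail", rows.len());
        } else {
            println!("cycle {cycle}: good file ({} rows) propagated -> partial recovery", rows.len());
        }
    }
}
```

Each bad file that ships triggers a wave of failures, and each good file that follows looks like a recovery, which is the wave pattern that initially resembled an attack.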
Initially, Cloudflare engineers suspected a massive DDoS attack due to the sudden wave-like pattern of failures. But further investigation revealed the propagation of the faulty configuration file to be the real culprit.
Once the root cause was identified, Cloudflare:
- Stopped the distribution of the corrupted file
- Rolled back to a stable version
- Restarted core proxy services
Traffic stabilized by 14:30 UTC (8:00 PM IST), and full recovery was achieved by 17:06 UTC (10:36 PM IST).
Services Affected
Several major Cloudflare systems were impacted, including:
- CDN & Security Services: Higher-than-normal HTTP 5xx errors
- Turnstile Bot Challenge: Completely failed to load
- Workers KV: Elevated error rates due to gateway failures
- Dashboard: Partially functional, but many users couldn’t log in
- Email Security: Reduced spam detection accuracy temporarily
Despite the severity of the incident, Cloudflare confirmed that its services are now stable and said preventive measures are being implemented to prevent a recurrence.
