
The Cloudflare outage

Why code quality matters in 2026

Alex Mercer

Jan 29, 2026

November 18, 2025. Cloudflare crashed. X stopped loading. ChatGPT went dark. Spotify quit playing.

One-third of the world's 10,000 most popular websites became unreachable.

The cause? A database permissions change caused a bot management configuration file to fill with duplicate rows and roughly double in size, blowing past a hardcoded limit. Every proxy process that tried to load it crashed.

TL;DR

Cloudflare's November 2025 outage affected one-third of top websites when a bot management file exceeded size limits, crashing proxy processes globally. The root cause was a latent bug in the code, combined with a database change. The incident shows why code quality, thorough testing, and automated review matter. Teams using repository-wide code review catch issues before they reach production.

What went wrong at Cloudflare

Cloudflare shared a post-mortem explaining what caused the outage.

Cloudflare generates a bot management feature file from database queries. The code that loaded the file enforced a hardcoded size limit based on its expected size. A change to database permissions altered query behavior and caused the feature file to contain duplicate rows.

The file doubled in size and exceeded the hardcoded limit. When proxy processes tried to load it, they crashed, flooding Cloudflare's network with HTTP 500 errors and taking down dependent services.

The bug existed in the code for a long time. It only triggered when specific conditions aligned.
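As a rough sketch of the failure pattern (illustrative Rust, not Cloudflare's actual code; the names `load_features`, `MAX_FEATURES`, and `refresh_config` are made up for this example), the loader enforced a limit, but the caller treated the over-limit case as impossible:

```rust
// Illustrative sketch of the failure pattern, not Cloudflare's actual code.
const MAX_FEATURES: usize = 200; // hardcoded limit sized to the expected file

fn load_features(raw: &str) -> Result<Vec<String>, String> {
    let features: Vec<String> = raw.lines().map(|line| line.to_string()).collect();
    if features.len() > MAX_FEATURES {
        return Err(format!(
            "feature file has {} entries, limit is {}",
            features.len(),
            MAX_FEATURES
        ));
    }
    Ok(features)
}

// Latent bug: the caller assumes the limit can never be exceeded, so the
// error path panics and takes the whole proxy process down with it.
fn refresh_config(raw: &str) -> Vec<String> {
    load_features(raw).unwrap()
}
```

The check exists, but the assumption that it will never fire is baked into the caller. Nothing happens until the day the input doubles.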

Why this matters for every engineering team

Cloudflare runs internet infrastructure at massive scale with world-class engineers. They still shipped code with a latent bug that caused a global outage.

1. Hidden assumptions break in production

The code assumed bot management files would stay within certain size bounds. That assumption held until it didn't. When the database change triggered larger files, the hardcoded limit became a critical failure point.

Code review that understands repository-wide context catches these hidden assumptions. In systems built from many services, tools that analyze the entire codebase can identify hardcoded limits that would fail under different conditions.

2. Small changes have large impacts

A database permissions update seems routine, yet this one triggered a cascade that affected millions of users. The relationship between the database change and the bot management code wasn't obvious during review.

Infrastructure and dev teams maintaining shared services need code review that traces dependencies across repositories. Changes to one component affect systems that depend on it.

3. Testing doesn't catch everything

Cloudflare's testing didn't reproduce the conditions that caused the file to double in size. The bug passed through development, testing, and deployment without detection.

Comprehensive code quality checks examine edge cases and boundary conditions that tests miss. Automated review flags hardcoded limits, unvalidated assumptions, and potential failure modes before code reaches production.
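One concrete gap is the boundary itself: a test that deliberately drives the input past the limit exercises exactly the path that failed in production. A minimal sketch, reusing the hypothetical `load_features` and `MAX_FEATURES` from the example above:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn oversized_feature_file_returns_error_instead_of_panicking() {
        // Generate one more row than the hardcoded limit allows.
        let raw: String = (0..=MAX_FEATURES)
            .map(|i| format!("feature_{i}\n"))
            .collect();
        assert!(load_features(&raw).is_err());
    }
}
```

A test like this doesn't predict which database change will produce the oversized file; it simply guarantees the over-limit path is exercised and behaves as intended.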

The cost of code quality

Even small bugs in critical systems can have huge consequences. The points below show how code issues affect users, businesses, and the infrastructure they depend on, and why proactive review matters.

1. Immediate impact on businesses

Cloudflare's outage lasted roughly 5.5 hours. Some estimates suggest that roughly one in five webpages were affected at the height of the incident. Affected businesses saw e-commerce orders fail, SaaS platforms go offline, and API-dependent applications break.

2. Prevention vs reaction

Cloudflare's team reacted quickly, diagnosing and fixing the issue within hours. But reaction, no matter how fast, still means downtime. Prevention catches problems before they affect users.

Teams shipping code that powers critical infrastructure can't afford to find bugs in production. The window between deployment and incident is too short. Speed and quality don't conflict when review processes catch issues automatically.

3. Multiple systems impacted by a single failure

The initial bug affected proxy processes. Those failures cascaded to Workers KV, Cloudflare Access, the dashboard, and dependent services. One bug created dozens of failure modes across the system.

Code quality tools that understand system architecture identify potential cascade points. When changes affect shared infrastructure, the code review should flag dependencies that could fail.

What code review should catch

The Cloudflare outage highlights specific patterns that automated code review needs to identify.

1. Check hardcoded limits

The bot management code enforced a size limit, but nothing handled the case where real data actually exceeded it. When the file grew past the limit, the system crashed instead of handling the error gracefully.

Code review should flag any hardcoded limit that lacks corresponding validation. If the code assumes data stays below X size, the review should verify error handling for data that exceeds X.
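In practice, the fix a reviewer should ask for is small: treat the over-limit case as an expected, recoverable error rather than unwrapping it. A hedged sketch, again using the hypothetical loader from the earlier example:

```rust
// Reviewed version: exceeding the limit is an expected, recoverable error
// that the caller propagates instead of unwrapping.
fn refresh_config(raw: &str) -> Result<Vec<String>, String> {
    load_features(raw).map_err(|e| format!("config refresh skipped: {e}"))
}
```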

2. Watch hidden dependencies

The database permissions change affected bot management file generation. That dependency wasn't obvious from either component independently. Changes to the database didn't show which services depended on specific query behaviors.

Repository-wide analysis traces these dependencies. When one component changes, the review identifies other components that could be affected.

3. Validate assumptions about runtime conditions

The code assumed files would stay within certain sizes. This assumption held under normal conditions but failed when conditions changed. The assumption was implicit, not documented or validated.

AI review that learns team patterns identifies undocumented assumptions. When code makes implicit assumptions about data size, timing, or behavior, automated review flags those assumptions for validation.

4. Handle edge cases

When the file exceeded size limits, the system crashed. The code lacked graceful degradation, error logging, or fallback behavior for the unexpected case.

Quality checks verify error handling for edge cases. What happens when the data is larger than expected? When services time out? When dependencies fail? Code should handle these scenarios explicitly.
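One common pattern for failing safely here is to keep the last known good configuration when a new one is rejected, and log the rejection instead of crashing. A minimal sketch under the same assumptions as the earlier examples:

```rust
// Graceful degradation: if the new file is rejected, log the error and keep
// serving with the previous, known-good configuration.
fn apply_config(raw: &str, current: Vec<String>) -> Vec<String> {
    match load_features(raw) {
        Ok(features) => features,
        Err(err) => {
            eprintln!("rejected new feature file, keeping previous config: {err}");
            current
        }
    }
}
```

Whether serving stale configuration is acceptable depends on the system; the point is that the choice is made explicitly in code rather than by a crash.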

How can cubic's AI code review catch issues like this?

cubic uses AI-powered code review to detect the kinds of patterns that caused incidents like Cloudflare's outage.

1. Repository-wide context

cubic analyzes entire codebases, not just changed files. When database code changes, cubic traces which services depend on that database's behavior. Dependencies that aren't obvious from file-level review become visible.

2. Custom policy enforcement

Teams can define quality standards in natural language. Policies like "All file processing must validate size limits before loading" or "Hardcoded limits require corresponding error handling" get enforced automatically.

3. Learning from incidents

cubic learns from team feedback. When an incident reveals a pattern that should be caught in review, that pattern gets stored and flagged in future code.

4. Automated checks before merge

Issues get flagged during code review, before merge, before deployment. The feedback loop shortens from production incident to PR comment.

5. Codebase scans for latent bugs

PR review catches issues in new code. But Cloudflare's bug existed in the code for a long time before it triggered. Most codebases contain years of code written before AI review coverage existed.

cubic's codebase scans analyze your entire repository to find bugs hiding in existing code. Thousands of AI agents explore your codebase in parallel, tracing data flows across files and verifying whether potential issues are actually exploitable. They find hardcoded limits without validation, hidden cross-component dependencies, race conditions, and business logic flaws that pattern-matching tools miss.

Unlike traditional scanners that flood you with false positives, codebase scans investigate each finding before reporting it. The result is high-confidence issues you can fix immediately.

What engineering teams should do differently

The lessons from Cloudflare's outage apply to every team shipping code to production. Here’s what engineering teams can do to avoid similar problems.

1. Question hardcoded assumptions

When code contains hardcoded limits, timeouts, or size constraints, verify what happens when those limits are exceeded. Add validation, error handling, and monitoring so the system fails safely instead of crashing.
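It also helps to make the assumption observable, so drift toward a limit shows up in monitoring before it becomes an outage. A hedged sketch, reusing the hypothetical `MAX_FEATURES` constant from the earlier examples:

```rust
// Warn well before the hardcoded limit is reached, so drift toward the limit
// shows up in logs and dashboards instead of as a crash.
fn check_headroom(feature_count: usize) {
    let warn_at = MAX_FEATURES * 8 / 10; // warn at 80% of the limit
    if feature_count >= warn_at {
        eprintln!(
            "feature count {} is at {}% of the {} limit",
            feature_count,
            feature_count * 100 / MAX_FEATURES,
            MAX_FEATURES
        );
    }
}
```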

2. Trace cross-component dependencies

When a change touches a shared component, such as a database schema, permission model, or generated configuration file, map which services consume its output. The dependencies that aren't visible from a single file or repository are the ones most likely to cause cascading failures.

3. Test edge cases

Most systems work fine under normal conditions. Problems show up when data gets much bigger than expected, services slow down, or assumptions no longer hold. Testing should include these edge cases, not just ideal scenarios.

4. Implement gradual rollouts

Avoid flipping the switch for everyone at once. Deploy changes incrementally, monitor behavior, and catch issues while they affect a small slice of traffic rather than the entire platform.
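A percentage-based gate is one simple way to stage a rollout. The sketch below is illustrative and self-contained, not any particular vendor's feature-flag API:

```rust
/// Deterministically buckets a key into 0..100 and compares the bucket to the
/// rollout percentage, so the same key always gets the same decision.
fn in_rollout(key: &str, rollout_percent: u64) -> bool {
    let bucket = key
        .bytes()
        .fold(0u64, |acc, b| acc.wrapping_mul(31).wrapping_add(b as u64))
        % 100;
    bucket < rollout_percent
}

fn main() {
    // Start by routing 5% of traffic through the new configuration,
    // watch error rates, then ramp up in stages.
    let use_new_config = in_rollout("zone-12345", 5);
    println!("new config enabled for this zone: {use_new_config}");
}
```

Keyed hashing keeps the decision stable per zone or user, so you can ramp from 5% to 50% to 100% while watching error rates at each step.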

5. Maintain code quality as a priority

Quality isn't a feature you bolt on later; it's a foundation that prevents incidents. Teams that prioritize quality during development avoid expensive fixes in production.

Code quality prevents incidents

Cloudflare's outage demonstrates what happens when latent bugs reach production at scale. The team fixed it quickly, shared detailed post-mortems, and implemented preventative measures. This transparency helps the entire industry learn.

The lesson is clear. Code quality matters because production incidents are expensive, disruptive, and largely preventable. Better review processes catch issues before deployment. Repository-wide analysis identifies dependencies that file-level review misses. Automated quality checks flag assumptions, validate limits, and verify error handling.

Every team ships code with the potential to cause incidents. The question is whether you catch issues in review or in production. cubic integrates AI-powered code review and automated quality checks directly into your workflow, helping teams catch bugs, validate assumptions, and enforce standards before deployment.

Ready to improve code quality before deployment?

Book a demo with cubic to see how repository-wide analysis and automated quality checks catch issues that file-focused reviews miss.
