
Nov 6, 2025

How successful teams ship AI-generated code to production

What I learned talking to 200+ teams about AI code generation

Paul Sangle-Ferriere

Since building cubic, I've talked to 200+ engineering teams about AI code generation. Some are shipping 10-15 AI-generated PRs a day to production. Others tried once, had an incident, and haven't trusted AI since.

The difference isn't access to better models or smarter prompts. It's that the successful teams redesigned how they validate code.

The anatomy of an AI code failure

Here's what the unsuccessful teams look like:

Last week, I was on a call with a team that had just shipped an 800-line AI-generated PR. All tests passed. CI was green. Linters found nothing.

Sixteen minutes after deploy, their authentication service failed. The load balancer was routing requests to dead nodes.

The issue wasn't a syntax error or a test failure. It was a semantic violation the system had no way to detect. The AI had refactored a private method to public, breaking an invariant that existed nowhere in code, only in the team's collective understanding.

Why AI code fails differently

The teams shipping AI at scale all made the same realization: AI code fails differently than human code. And your CI/CD pipeline, designed for human error patterns, misses the errors that matter.

Every time we've changed how code gets written, we've had to change how we validate it.

When compilers emerged, they checked syntax. Static analysis tools checked for null pointer exceptions. CI/CD systems checked whether your change broke the larger system.

AI code generation is the same kind of shift. Machines now write code that humans review, rather than the other way around. But most teams are still using the same validation tools.

This is what the successful teams do differently.

What successful teams noticed about AI-generated code

In conversations with teams shipping AI at scale, I kept hearing the same story: "the code behind our production incidents passed all our CI checks."

Here's what they figured out: human-written code fails in predictable ways. Typos, off-by-one errors, unhandled edge cases, copy-paste mistakes that break variable names. These are pattern-matchable errors that linters and static analysis tools were built to catch.

More importantly, humans learn a system's unwritten rules through experience. A developer who's worked on a codebase for six months knows not to expose certain database columns, even though nothing in the CI pipeline would catch that mistake. The knowledge lives in their head, not in the validation layer.

AI-generated code fails differently. It is syntactically correct and idiomatically appropriate, yet it violates system-specific constraints: semantic violations that come from the AI's lack of context about your system's unwritten rules.

The validation gap

The incentive structure makes this worse. Tool builders optimize for adoption, which means optimizing the visible part: generation quality. "Better completions" and "fewer hallucinations" are clear metrics that demo well and drive user acquisition.

Validation improvements are invisible until something breaks in production. You can't demo "we caught a semantic violation you didn't know was possible." This creates systematic underinvestment in validation infrastructure.

Most AI coding content on X and LinkedIn comes from people who don't build production systems. They see something cool in a demo, tweet about it, get 10k likes. Meanwhile, the teams actually shipping AI to production are building, not tweeting.

This creates an information problem: teams trying to learn about AI in production are learning from people who've never shipped AI to production. The teams that figured this out early realized validation infrastructure is now the bottleneck.

Three things successful teams do differently

Once the successful teams understood that AI fails differently, they made three structural changes.

First: They build validators for their system's unwritten rules

Firecrawl's founding engineer Gergő told me they were drowning in 10-15 PRs a day. Production kept going down. Then they added validators that understood their dependency graph.

Result: caught circular imports three times in a row. 70% reduction in review time.
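A validator like that doesn't need to be fancy. Here's a minimal sketch of the idea, written from scratch rather than taken from Firecrawl's setup: build an import graph from the modules a PR touches, walk it, and fail the check if a cycle appears. The module names in the example are made up.

```typescript
// Minimal sketch of a dependency-graph check that flags circular imports.
// Illustrative only; module names and the edge list are hypothetical.

type ImportGraph = Map<string, string[]>; // module -> modules it imports

function findCycle(graph: ImportGraph): string[] | null {
  const visiting = new Set<string>();
  const visited = new Set<string>();
  const stack: string[] = [];

  function dfs(node: string): string[] | null {
    visiting.add(node);
    stack.push(node);
    for (const dep of graph.get(node) ?? []) {
      if (visiting.has(dep)) {
        // Back edge found: return the cycle path for the PR comment.
        return [...stack.slice(stack.indexOf(dep)), dep];
      }
      if (!visited.has(dep)) {
        const cycle = dfs(dep);
        if (cycle) return cycle;
      }
    }
    visiting.delete(node);
    visited.add(node);
    stack.pop();
    return null;
  }

  for (const node of graph.keys()) {
    if (!visited.has(node)) {
      const cycle = dfs(node);
      if (cycle) return cycle;
    }
  }
  return null;
}

// Example: a PR that makes "queue" import "scraper" closes a loop.
const graph: ImportGraph = new Map([
  ["scraper", ["parser", "queue"]],
  ["parser", ["utils"]],
  ["queue", ["scraper"]], // introduced by the PR
]);

const cycle = findCycle(graph);
if (cycle) {
  console.error(`Circular import: ${cycle.join(" -> ")}`);
  process.exitCode = 1; // fail the check
}
```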

Traditional CI/CD checks explicit rules: does it compile, do tests pass, does the linter approve.

But the errors that hurt are implicit rule violations: rules that exist in your team's head, not in your linters.

"Don't make this method public." "Always add an index when querying this table." "Services A and B should never have circular dependencies."

The modern approach is to encode these rules and let AI enforce them: you write the rules in natural language, and AI validates every PR against them.

This is how n8n, Cal.com, and Granola validate their PRs with cubic's custom rules. The AI reads your constraint, understands it, and flags violations.

For example: "All routes in /billing/** must pass requireAuth and include orgId claim." The validator scans every diff touching billing routes. If requireAuth is missing or orgId isn't validated, it blocks the PR and suggests the fix. Edge cases like /billing/webhook/* can be allowlisted to prevent false positives on public endpoints.
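To make that concrete, here's roughly what the rule reduces to once it's encoded as a check. This is a hand-written sketch, not cubic's implementation: a real validator works from your natural-language rule and the parsed code rather than string matching, and the paths and helper names below are assumptions.

```typescript
// Sketch of the billing-route rule as an explicit check over a PR diff.
// Hypothetical file shapes and paths; a real validator would inspect the
// AST and middleware chain instead of pattern-matching source text.

interface ChangedFile {
  path: string;
  contents: string;
}

const ALLOWLIST = [/^src\/billing\/webhook\//]; // public endpoints, rule doesn't apply

function violatesBillingRule(file: ChangedFile): string | null {
  if (!file.path.startsWith("src/billing/")) return null;
  if (ALLOWLIST.some((re) => re.test(file.path))) return null;

  if (!file.contents.includes("requireAuth")) {
    return `${file.path}: billing route is missing requireAuth`;
  }
  if (!file.contents.includes("orgId")) {
    return `${file.path}: billing route does not read the orgId claim`;
  }
  return null;
}

// Run over every file the PR touches and block the merge on any violation.
function checkPr(files: ChangedFile[]): string[] {
  return files
    .map(violatesBillingRule)
    .filter((msg): msg is string => msg !== null);
}

const violations = checkPr([
  { path: "src/billing/invoices.ts", contents: "router.get('/invoices', handler)" },
]);
if (violations.length > 0) {
  violations.forEach((v) => console.error(v));
  process.exitCode = 1; // block the PR
}
```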

The pattern is obvious in hindsight but invisible to traditional linters, because the code is syntactically valid.

Second: They route high-risk code to humans, auto-merge the rest

At Browser Use, founding engineer Nick Sweeting took days-long PR cycles down to 3 hours: 85% faster merges and 50% less technical debt.

"I wouldn't even look at a PR until it had the AI review. And then my first comment would be, please fix all the AI comments."

This works because human review capacity can't scale with code volume: if AI generates 10x more code, you can't hire 10x more reviewers. Nick uses AI to pre-filter what needs human attention.

The leverage comes from accurate risk classification. Low-risk changes (documentation, test additions, refactors that don't touch critical paths) can auto-merge. High-risk changes (authentication logic, database migrations, API contracts) route to human review.
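A first pass at that routing can be as blunt as a path classifier. The sketch below is illustrative, not anyone's production setup: the risk patterns are assumptions, and in practice teams combine rules like these with the AI reviewer's own assessment and default anything ambiguous to a human.

```typescript
// Minimal sketch of routing PRs by risk: auto-merge low-risk changes,
// send high-risk ones to a human. Patterns are hypothetical.

type Route = "auto-merge" | "human-review";

const HIGH_RISK_PATTERNS = [
  /auth/i,                 // authentication logic
  /migrations?\//,         // database migrations
  /api\/.*contract/i,      // API contracts
];

const LOW_RISK_PATTERNS = [
  /\.md$/,                 // documentation
  /\.(test|spec)\.tsx?$/,  // test additions
];

function routePr(changedPaths: string[]): Route {
  if (changedPaths.some((p) => HIGH_RISK_PATTERNS.some((re) => re.test(p)))) {
    return "human-review";
  }
  if (changedPaths.every((p) => LOW_RISK_PATTERNS.some((re) => re.test(p)))) {
    return "auto-merge";
  }
  // Anything ambiguous defaults to a human looking at it.
  return "human-review";
}

console.log(routePr(["docs/setup.md", "src/queue.test.ts"])); // "auto-merge"
console.log(routePr(["src/auth/session.ts"]));                // "human-review"
```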

Using disagreement as signal

An interesting pattern emerges when teams use separate AI models for generation and validation. When the generator and validator disagree, that's signal, not noise. It means the generator produced something that looks like normal code but doesn't match the historical patterns in your specific codebase. This disagreement is precisely where human judgment should be applied.

Teams that instrument this create a natural escalation mechanism: low disagreement routes to auto-merge, high disagreement routes to human review. This isn't a static rule. It's a feedback loop that improves as you collect data on which disagreements predicted actual problems.
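Here's a simplified sketch of that escalation rule. The confidence scores are assumed to come from the generator and the validator respectively, and the threshold is a placeholder a team would tune against which past disagreements actually predicted incidents.

```typescript
// Sketch of using generator/validator disagreement as an escalation signal.
// Scores in [0, 1] are assumed inputs from the two models; the threshold is
// a placeholder to be tuned against real outcomes.

interface HunkAssessment {
  hunkId: string;
  generatorConfidence: number; // how routine the generator considers the change
  validatorConfidence: number; // how well it matches the codebase's own patterns
}

const DISAGREEMENT_THRESHOLD = 0.4;

function escalate(hunks: HunkAssessment[]): string[] {
  return hunks
    .filter(
      (h) =>
        Math.abs(h.generatorConfidence - h.validatorConfidence) >
        DISAGREEMENT_THRESHOLD
    )
    .map((h) => h.hunkId);
}

// Hunks where the two models diverge go to a human; the rest can auto-merge.
const flagged = escalate([
  { hunkId: "auth.ts:42", generatorConfidence: 0.95, validatorConfidence: 0.3 },
  { hunkId: "README.md:1", generatorConfidence: 0.9, validatorConfidence: 0.88 },
]);
console.log(flagged); // ["auth.ts:42"]
```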

Third: They test in production-like environments before deploying

Successful teams deploy every AI-generated PR to a preview environment first. Not staging that gets updated once a week. A unique preview URL for every single PR, automatically.

When AI generates 10x more code, you can't manually test everything in production. You need to catch issues before they reach prod. Here's how it works: Every PR gets its own preview environment. Humans test critical workflows. AI runs automated tests against the preview. The "private method made public" bug? Caught in preview when the auth flow breaks, not 16 minutes after prod deploy.
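The automated half of that often amounts to a smoke-test suite pointed at the per-PR preview URL. Here's a minimal sketch, assuming a hypothetical URL pattern and login endpoint:

```typescript
// Sketch of an automated check run against a per-PR preview environment.
// The preview URL pattern and endpoints are hypothetical; requires Node 18+
// for the global fetch.

const prNumber = process.env.PR_NUMBER ?? "123";
const previewUrl = `https://pr-${prNumber}.preview.example.com`;

async function checkAuthFlow(): Promise<void> {
  // The "private method made public" class of bug surfaces here: the auth
  // flow breaks in preview instead of sixteen minutes after a prod deploy.
  const res = await fetch(`${previewUrl}/login`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ email: "test@example.com", password: "test-password" }),
  });
  if (!res.ok) {
    throw new Error(`Auth flow failed on preview: ${res.status}`);
  }
}

checkAuthFlow().catch((err) => {
  console.error(err);
  process.exitCode = 1; // fail the PR check before anything reaches prod
});
```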

The fast feedback loop

Test in preview → find an issue → fix it → redeploy to preview → test again. This cycle takes minutes, not days.

The modern pattern: humans test critical user paths, AI tests everything else. AI can run hundreds of test scenarios against a preview environment in parallel. It catches edge cases humans would never manually test.

And when something still breaks in production? Fast rollback matters. But it's the backup plan, not the primary strategy. Teams shipping AI at scale catch most issues in preview.

Why this creates a durable advantage

When I talk to teams evaluating AI code generation, they ask: "is AI ready for production?"

The teams already shipping at scale ask a different question: "have we built the validation architecture that makes AI code generation safe?"

This reveals the fundamental insight: AI code generation isn't a feature improvement (better autocomplete, faster boilerplate). It's an architectural shift that requires rethinking the validation layer.

Teams treating it as a feature hit scaling limits. They restrict AI to low-risk changes and wonder why they're not seeing 10x productivity gains. Teams treating it as architecture are rebuilding their validation infrastructure and seeing compounding advantages.

Why does this compound? AI code generation is becoming a commodity. But AI code validation is infrastructure that gets more valuable as generation volume increases. Early movers are climbing a learning curve that late movers will have to climb from scratch.

The difference between teams debating and teams shipping isn't access to better models – we all use the same ones.

It's that successful teams redesigned their validation architecture first.
