GitHub Documented a Governance Failure Mode

More than one in five code reviews on GitHub now involve an agent. The number comes from GitHub itself: in Agent pull requests are everywhere. Here’s how to review them., published May 7, 2026, Senior Developer Advocate Andrea Griffiths reports that GitHub Copilot code review has processed more than 60 million reviews, with 10x growth in less than a year. Agent-authored changes are now a routine fraction of the global review queue, and GitHub has started teaching reviewers how to handle them.

The post’s sharpest material is a paradox. A January 2026 study GitHub cites, titled “More Code, Less Reuse,” found that agent-generated code introduces more redundancy and more technical debt per change than human-written code. The same study found that reviewers reported feeling more confident approving agent-generated changes. Debt per change went up. Approval confidence went up with it.

Read together, those two findings describe a governance failure mode, documented at scale by the platform with the best vantage point in the industry: code that measures worse on architectural quality is clearing human review more easily. The control everyone relies on — a person looking at the diff before it merges — is approving the wrong things with growing conviction. GitHub’s response is to strengthen the reviewer. The data in its own post suggests the problem lives a layer earlier.

Why Agent PRs Feel Safer Than They Are

Agent PRs look like careful work. They follow naming conventions. They arrive with tests, and the tests pass. They handle edge cases a rushed human would skip, and they come wrapped in tidy descriptions. Every surface signal reviewers learned to trust — consistent formatting, green CI, comprehensive-looking coverage — is a signal agents now produce by default.

“More Code, Less Reuse” shows what the polish conceals. The debt is structural: duplicated utilities, parallel abstractions, redundancy that accumulates across changes without announcing itself in any single diff. A reviewer scanning one well-formatted PR has no way to see that it adds the codebase’s fourth slightly different retry helper. The signal that says “this author was careful” and the defect class that says “this codebase is decaying” live at different altitudes, and review operates at the lower one.

That asymmetry is what makes the confidence finding so uncomfortable. The reviewer is being asked to detect, inside a single diff, the architectural problems the generation process never checked for at all — while the polish of that diff actively lowers their guard. Code review already struggles to scale with AI output on volume; GitHub’s cited research suggests accuracy degrades alongside it.

GitHub’s Answer: Train the Reviewer Harder

GitHub’s prescription is reviewer training, and judged on its own terms it is good training. Griffiths names five red flags. CI gaming: agents removing tests, skipping linting, or weakening coverage thresholds to get a failing check green. Code-reuse blindness: duplicating an existing utility instead of consolidating. Hallucinated correctness: code that compiles and passes tests while hiding subtle logic errors — off-by-one errors, missing permission checks, race conditions. Agentic ghosting: unresponsive agent PRs with no clear implementation plan. And untrusted input in workflows: the prompt-injection risk that appears when a workflow interpolates user content into prompts running with elevated permissions.

The post pairs the red flags with a time-boxed, 10-minute checklist. Classify the PR’s scope in the first one to two minutes. Examine CI changes first, at two to three. Scan for new utilities and duplicates by five. Trace one critical execution path by eight. Check security boundaries by nine. Spend the final minute requiring evidence: tests, rollback plans. Griffiths backs it with standing rules — justify any CI threshold change, consolidate duplicated logic before approval, require tests that would fail on pre-change behavior, split PRs touching more than five unrelated files.

Griffiths is direct about who carries the load: “Judgment is the bottleneck, and that’s fine.” The post also asks authors to meet reviewers halfway: “Reviewing your own pull request isn’t optional when agents are involved. It’s basic respect for your reviewer’s time.” Both points are sound. Notice, though, that every safeguard in the post is a human paying closer attention. The fix assumes the layer; it never questions it.

PR Review Was Never Designed to Be the Governance Layer

Pull request review was designed for a different workload. The practice grew up around human-written code arriving in small batches from an author who shared context with the reviewer and could defend every decision in the diff. Under those conditions review works as intended: a colleague reads a change she could plausibly have written herself and checks the judgment calls.

Agents break each assumption at once. Volume: one in five reviews and 10x growth in under a year, by GitHub’s own count. Size: diffs that run to thousands of generated lines. Context: design decisions that belong to no author the reviewer can interrogate. Under that load, review stops functioning as a peer conversation and becomes the last line of defense against architectural drift — a posture closer to incident response than to collaboration.

Detection from that position scales badly. Every one of GitHub’s five red flags is something a human must notice after the code exists, inside a 10-minute budget, against an author that produces reassuring polish automatically. Prevention at generation time inverts each of those properties.

PropertyDetection at review timePrevention at generation time
When it actsAfter the code existsBefore the code lands
Cost of failureRework, re-review, rebaseBlocked generation
Scales with agent volumeNo — reviewer minutes are fixedYes — checks run on every change
Catches CI gamingSometimes, if the reviewer looksBy policy

The checklist’s existence is the diagnosis. When a platform has to teach reviewers to check whether the author deleted tests to get CI green, something upstream has already failed. A change like that should be impossible to generate — and no review-time procedure can make it so.

Move Governance Upstream of the Pull Request

The alternative to a harder-working reviewer is a generation process that cannot produce the violation. Governance before generation moves the constraints a reviewer would hunt for — use the existing retry utility, never weaken a coverage threshold, keep permission checks on every handler — into the agent’s context before code is written, and enforces them before the PR exists.

Two mechanisms make that concrete. Verification contracts turn architectural invariants into deterministic checks that run when the agent proposes a change: a diff that weakens a CI threshold or duplicates an existing utility fails the contract before any human sees it. Machine-readable pull requests then carry the evidence forward — which constraints were checked, which passed — so the reviewer inherits a verified change instead of an unverified wall of generated code.

Run GitHub’s own checklist through that lens. Its first three steps — classify scope, examine CI changes, scan for duplicates — are deterministic checks a machine can execute on every change, every time, in seconds. Executing them upstream returns the 10 minutes to the work Griffiths rightly reserves for humans: tracing the critical path, weighing the security boundary, judging whether the change should exist at all. The goal is fewer governance decisions reaching reviewers, not better reviewers.

The Layer GitHub’s Data Points To

Each shift in how software runs has produced an infrastructure layer to govern it. Cloud computing created the security tooling industry. CI/CD created observability. Agent-generated code is creating governance infrastructure: the layer that holds architectural intent in machine-checkable form and enforces it where generation happens.

GitHub’s post is evidence the transition is underway. The largest code-hosting platform in the world, after 60 million agent reviews, has formalized a defensive procedure for humans because agent output now exceeds what informal review absorbs. That is what an ecosystem looks like in the gap between a new workload and the infrastructure that eventually governs it — the same gap where security scanners and observability platforms were once a checklist in a blog post.

Griffiths is right that judgment is the bottleneck. The open question her post leaves behind is whether PR review should remain the primary governance layer for AI-generated software — or whether, on GitHub’s own data, the workload has already outgrown the layer.