I Can't Tell If the Model Matters

What I actually found when I set out to test heterogeneous AI code review.

For the last couple of months, I've been running a two-agent code review workflow in my terminal. Left window: Claude Code doing implementation. Right window: a second Claude Code instance prompted to be adversarial, specifically tasked with finding problems in whatever the left window produced. It worked surprisingly well.

A few weeks ago, someone sent me a Lenny's Podcast episode featuring Dan Shipper. One of the things he talked about was using competing frontier models for coding and reviewing, the idea being that different model lineages have different blind spots, and a reviewer trained differently than the author catches things the author misses. That felt like an important gap in the workflow. I set out to fill it.

I went in expecting to write a post about model diversity as a reliability strategy. That's not what happened.

The plan that didn't survive contact with the code

A search for a Gemini code review GitHub Action surfaced one with a reasonable README. The first test ran it against a deliberately bad changeset. Gemini 2.5 Pro flagged some things, missed an obvious multi-tenancy violation entirely, and tagged everything as medium severity regardless of actual severity. Not because Gemini is bad at code review, as I'd eventually learn, but because the action was feeding it diff hunks through a four-line generic prompt with no repo context. The technical notes from this session describe it well: "a frontier model reviewing through a straw."

The terminal Claude reviewer, running with the same adversarial framing but full repo access, found everything planted in the bad changeset plus a couple of things that weren't planned for.

That gap prompted a hypothesis: was the difference in findings about model lineage, or about context and prompt quality? The Gemini action had neither. The terminal reviewer had both. The lineage variable was confounding from the start.

To isolate it, the Claude GitHub Action went in next. Default configuration, Sonnet 4.6: similar shortcomings to the Gemini action. Bumped to Opus 4.8 with full repo checkout and an adversarial prompt encoding the codebase's cardinal rules: same results as the terminal reviewer. The hypothesis held. Context and prompt quality were doing the work. Model lineage was a secondary variable at best.

Why didn't we just use the official Gemini action?

Good question. One exists. A search earlier in the session never found it. The answer, when Claude was asked about it directly: a poorly-worded query on its part. Major miss. What had been running was a 2-star fork of an abandoned project, last updated fourteen months prior, on a deprecated Node runtime, pinned to a mutable version tag with write access to pull requests. The Marketplace rewards "exists and has a README," not "is good." Read the source of anything you hand a write token and an API key.

So the first-party action went in next. This is where the afternoon got complicated.

The first-party Gemini action with 2.5 Pro returned an empty response with a hidden API error. Flash returned a 400 on every call. Bypassing the MCP server entirely and feeding the PR diff directly to the API via a text prompt failed at the command line with exit 1 and zero diagnostic output. Several more configurations. None produced a review. The failure was diagnosed as a model-layer issue, specifically that 2.5 Pro returns empty responses in this context, and the effort was abandoned.

A new hypothesis entered: Does Gemini have a better understanding of how to implement Gemini tooling and flows than Claude does?

I installed the Gemini CLI locally and handed it the history of what had been tried, the PR, and an explanation of what a working implementation needed to do. Twelve minutes later, no questions asked, it surfaced a notification that it had fixed everything. It also consumed 60% of the free tier quota in the process.

The fix: GEMINI_CLI_TRUST_WORKSPACE: true. A flag that had been set in an earlier version of the integration and dropped during a rewrite. The empty response failures confidently attributed to a model-layer issue were actually a workspace trust regression introduced during that rewrite, and then misdiagnosed. Gemini caught the bug. The session had been abandoned one step short of the solution, on a problem an agent caused, with a diagnosis another agent got wrong.

What the repo-aware Gemini actually found

Once it had full repo context and an adversarial prompt encoding the codebase's cardinal rules, the Gemini CLI running gemini-2.5-pro reviewed the same bad fixture the diff-only action had seen earlier. Four critical findings, including the cross-tenant data leak the diff-only version had completely missed, plus a no-tests policy violation, SQL injection, and a hardcoded secret logged in plaintext. Real differentiated severities. Same model family, completely different harness, completely different results.

That's the thesis, confirmed twice. The diff-only action missed the multi-tenancy violation because nothing told it that tenant-scoping is the repo's cardinal rule and it couldn't go look. The repo-aware CLI caught it immediately. The only variable was context.

What I'm running now, and what I still don't know

The current setup: Claude Code action as the primary automatic reviewer, full repo checkout, CLAUDE.md loaded, adversarial prompt. Gemini CLI as a secondary on-demand reviewer, same adversarial framing, same repo access. I'll run both for a few more weeks before drawing any conclusions about whether lineage diversity adds signal or just noise.

A few things this session settled:

First-party tooling over third-party, every time. The abandoned fork was the only Gemini integration that reliably posted reviews during this entire session. It was also the worst option by every other measure. That shouldn't be the tradeoff, and it won't be.

Context and prompt quality dominate model choice. A reviewer without repo context is reviewing through a straw regardless of which model sits behind it. The harness is the variable that matters most.

The models have different working styles, and that turned out to matter more than expected. Claude asked before acting. The Gemini CLI fixed things autonomously and surfaced a notification when done. Neither approach is wrong. But knowing which mode you're working with changes how you supervise it.

The thing that caught bugs most reliably across this entire session wasn't any reviewer. It was running the code. Every misdiagnosed failure, every hallucinated fix, every confident wrong answer got caught when something actually executed and returned an error. Agents reviewed. Agents triaged. Agents misdiagnosed. Execution caught it. The human shepherd directing all of this wasn't in the error chain. The next agent run was.

The lineage-diversity hypothesis is still open. What this session established is that harness quality dominates model capability as a variable, and that a same-lineage reviewer with adversarial framing and full context is already heterogeneous in the way that actually matters. Whether adding a genuinely different lineage on top of that adds anything is a question for more data.

I'll report back.


Raleigh Schickel is a VP/Director of Engineering with 15+ years of experience building and repairing engineering organizations. He writes about Engineering team health, the tools Engineering leaders use, and the humans who build software. He is building an engineering health diagnostic platform as a coaching tool for Engineering leaders.