Case Studies › Code Review

Code Review Pipeline

How a software team cut PR wait times from 4 hours to 20 minutes without sending their codebase to OpenAI.

Atlas Freight Systems | 35 developers | Mac Studio M3 Ultra 512GB

The company

Atlas Freight Systems builds logistics and fleet management software. 35 developers, two engineering teams (platform and product), and a codebase that's been growing for eight years.

They ship fast — two or three deployments per day. Or they did, until code review became the bottleneck.

The problem

Every pull request needs review before it merges. The team's code review guidelines require:

At least one senior engineer to review logic, architecture, and edge cases
Security checks (input validation, auth boundaries, data exposure)
Test coverage verification
Style and consistency checks

What actually happens:

A developer finishes a feature at 2 PM. They open a PR. It sits in the queue.

The senior engineer is in back-to-back meetings until 4:30. She picks up the PR at 5, reviews it, leaves three comments, and goes home. The developer sees the comments at 9 AM the next day, makes changes, re-requests review. The senior engineer reviews again at 2 PM.

A one-day feature takes two days to ship because of review latency.

And then there's the quality problem. When the senior engineer finally gets to the PR at 5 PM, she's tired, she's context-switching from three meetings, and she's rushing because she knows the developer is waiting. She misses things:

A SQL injection vector in a new query (caught in production, three weeks later)
A missing test case for the empty-input edge case (customer found it)
An inconsistent error handling pattern that diverged from the codebase standard (technical debt accumulating silently)

The cost:

Average PR wait time: 3-5 hours
Average merge-to-deploy time: 1-2 days (review is the bottleneck)
Senior engineer spends 3-4 hours/day on review — that's half her working day
Production bugs that a thorough review would have caught: 2-3 per month
Developer morale: engineers are frustrated by the wait and the rushed reviews

What they tried:

GitHub Copilot. Good for code completion, but it doesn't review PRs. It suggests code as you type — different problem.
ChatGPT for code review. It worked — but the company's CTO, David, realised they were sending their entire codebase to OpenAI's API. The logistics algorithms, the fleet routing logic, the customer integration code — all of it going through a third party. Their biggest client, a national retailer, has a clause in their contract: "Supplier codebases containing [client] integration logic must not be processed by third-party AI services."
Hiring a dedicated reviewer. They hired one. He quit after four months — reviewing other people's code all day is not a fulfilling job.

What Foundry does

Foundry runs on a Mac Studio in the engineering team's office. It's connected to their GitHub via a webhook — when a PR is opened or updated, Foundry gets notified.

It does a first-pass code review. Not a rubber stamp. A real review.

When a PR opens:

Foundry reads the changes. It understands the diff — not just the lines changed, but the context around them, the files they're in, and how they relate to the rest of the codebase.
It checks against the team's review guidelines:
- Logic errors or edge cases the developer may have missed
- Security concerns (input validation, auth boundaries, injection risks)
- Test coverage — are the new code paths tested? Are edge cases covered?
- Consistency with existing codebase patterns
- Potential performance issues (N+1 queries, unnecessary allocations, blocking calls)
It posts a structured review as a comment on the PR:
- Must fix — issues that need to be addressed before merge
- Should consider — suggestions that improve quality but aren't blocking
- Looks good — areas it reviewed and found no issues
- It includes specific line references and suggested fixes
It flags the PR for human review with a priority level. A PR with no must-fix issues? Quick scan. A PR with three must-fix issues? Needs careful human review.

The senior engineer still reviews every PR. But she's reviewing a PR that's already been through a thorough first pass. She's confirming, not discovering. And she's doing it in 5 minutes instead of 30.

What it looks like day to day

2:15 PM — Developer opens a PR

Sarah pushes a feature: a new endpoint that calculates delivery route optimisation based on traffic data. The PR is 340 lines across 4 files.

2:15 PM — Foundry starts review

2:17 PM — Foundry posts review:

Review summary: 2 issues found, 1 suggestion MUST FIX — SQL injection risk (routes.py, line 47) The `traffic_source` parameter is concatenated directly into the query string. Use parameterised queries instead. # Suggested fix: cursor.execute("SELECT * FROM traffic_data WHERE source = %s", (traffic_source,)) SHOULD CONSIDER — Missing test case (test_routes.py) No test for the empty traffic data scenario. If the API returns an empty response, the optimiser will throw a KeyError on `traffic_data['routes']`. LOOKS GOOD — auth, input validation, error handling, performance Auth boundary is correct. Input validation present. Error handling follows existing pattern. No N+1 queries detected.

2:20 PM — Developer fixes the SQL injection and adds the test

Sarah sees the review immediately, fixes the issue, adds the empty-data test, and pushes the update.

2:22 PM — Foundry re-reviews the updated PR

SQL injection fixed. Parameterised query now in use. Empty data test added and passing. Ready for human review.

2:35 PM — Senior engineer reviews

David opens the PR. Foundry's review is at the top. He reads it, scans the changes, confirms the fix is correct, and approves.

Total time from PR to merge: 20 minutes.

Without Foundry, this PR would have been reviewed at 5 PM the next day — if the senior engineer had time. The SQL injection would have been caught in QA or production, not at 2:17 PM.

The numbers

Metric	Before	After	Change
Average PR wait time	3-5 hours	15-25 minutes	90% reduction
Senior engineer daily review time	3-4 hours	45-60 mins	75% reduction
PR merge-to-deploy time	1-2 days	same day	50% faster
Production bugs caught in review	60%	92%	+32 points
Security issues reaching production	1-2/month	0-1/quarter	80%+ reduction
Codebase sent to third-party AI	Yes (ChatGPT)	No	Fully local
Monthly API cost	£800-1,200 (OpenAI)	£0	£9,600-14,400/year saved

Annual impact: 600-700 hours of senior engineer time recovered + £10,000+ in API costs + fewer production incidents (each P1 incident costs £5,000-15,000 in response, fix, and client impact).

Foundry cost: £999 setup + £99/month = £2,187 first year. Existing Mac Studio.

What stayed cloud

GitHub, CI/CD pipeline, deployment infrastructure — all untouched
Cloud development environments (if used) — Foundry reviews the PR, not the dev environment
External API calls in the code being reviewed — Foundry reads code, it doesn't execute it
Developer tools (IDEs, Copilot for code completion) — Foundry does review, not completion

What moved local: the AI that reads your code and identifies issues. That's the part that was sending your proprietary codebase through OpenAI's API.

What it doesn't do

Does not auto-approve or auto-merge PRs. Every PR still needs a human to say "approved."
Does not write code. It reviews and suggests fixes, but the developer writes the fix.
Does not replace the senior engineer. It does the first pass — the systematic, tedious checking that takes time but doesn't require deep architectural judgement. The senior engineer still reviews for architecture, business logic, and things AI can't see.
Does not send code externally. The review runs on the Mac Studio in your office. Your codebase stays yours.
Does not catch everything. It's very good at pattern-based issues (security, tests, consistency) and less good at "is this the right architectural approach for our business." That's still the senior engineer's job.

What the team says

"The first week, Foundry caught a SQL injection in a PR that I would have missed at 5 PM on a Friday. I've been reviewing code for twelve years. That stung — but it proved the point." David, CTO

"I used to wait half a day for someone to look at my code. Now it's reviewed before I've finished my coffee. The feedback is specific — line numbers, suggested fixes, not just 'looks fine.'" Sarah, developer

"The national retailer contract clause about third-party AI was the blocker for us using ChatGPT for review. Foundry runs on our hardware. Our code never leaves the building. Procurement is happy, legal is happy, and we're shipping faster." David, CTO

Is this right for your team?

This setup works for:

Software teams of 10-100 developers doing regular PRs
Companies with proprietary codebases that can't go through third-party AI APIs
Teams where senior engineer review time is the deployment bottleneck
Organisations with security requirements (financial services, healthcare, defence-adjacent, regulated industries)

Not a fit if you:

Have very low PR volume (<5/week) — human review is fine
Already have a CI-based code quality pipeline you're happy with
Are comfortable sending your codebase through cloud AI tools
Don't have a senior engineer to do the final human review (Foundry augments, it doesn't replace)

Want to see it review a sample PR? Book a Foundry Fit Review →

Technical details

Hardware: Mac Studio M3 Ultra, 512GB unified memory
Model: Qwen3-Coder-30B (Q5_K_M) via llama.cpp — specifically tuned for code understanding
Pipeline: GitHub webhook → fetch diff → analyse changes → post structured review → developer addresses → re-review on update → flag for human approval
Review categories: Security (injection, auth, data exposure), logic (edge cases, error paths, null handling), tests (coverage, edge cases, assertions), consistency (codebase patterns, style, naming), performance (N+1 queries, allocations, blocking calls)
Integration: GitHub (via webhook + API), GitLab, Bitbucket, or direct git hooks
Throughput: 10-30 seconds per PR depending on diff size
No-cloud posture: Code is fetched locally, reviewed locally, and the review is posted via the GitHub API. No code content is sent to any third-party AI service.
Observability: llm_stats dashboard showing review volume, issue detection rates, false positive tracking, and model health
False positive rate: ~5-8% on suggestions (developer dismisses); near-zero on must-fix items (these are verified security/logic issues)

← Back to all case studies