There's a question that hangs over everything I've written on this blog so far. I've talked about the framework. I've talked about the journey from copilot to orchestrator. I've talked about the philosophy. But there's one thing I hadn't done yet — the thing that any serious engineering organization would demand before anything else.

I hadn't had someone else look at the code.

Not a quick glance. Not a "looks good to me." A real code review, by a real senior engineer, with real opinions about how software should be built. The kind of review where someone pulls up your project, reads through it, and tells you what they actually think.

So I asked my guy Tim to do exactly that.

The Setup

Tim and I have been working together for something like fifteen years — honestly, neither of us can pin down exactly when it started. He's always been a contractor, but every time I've moved to a new organization, Tim has come with me. When you find an engineer whose judgment you trust that deeply, you don't let go. So when I started Theoretically Impossible Solutions and needed someone to be my designated code reviewer — the person who would put AI-generated code under a real microscope — Tim was the obvious and only choice.

When I handed him a client project — a multi-service email scheduling system built entirely through my agentic development framework — I genuinely didn't know what he'd find.

I had a fear, and I think it's one a lot of people in this space share: what if the AI-generated code is fundamentally flawed? What if it looks okay on the surface but falls apart under scrutiny? What if someone with decades of experience takes one look at it and says, "this is kindergarten work — are you building software with crayons?"

Tim did his review over a weekend. He went through the codebase, made notes, and then had Claude run a couple of automated analyses on the API endpoint patterns and database access architecture. We scheduled a call to walk through everything together, and I hit record on Otter.ai — because that's how we do things now.

The Verdict

Tim's first words set the tone for everything that followed: the line-by-line code quality was on par with what he'd been seeing in his own professional work. Variables were well-named. Methods were well-structured. The code was readable enough that you could understand what was happening without heavy commenting — though the comments were there too. His general sentiment was that when you're down in the details, the quality of the generated code was, in his words, "surprisingly good."

No crayons. No kindergarten. The code itself was solid.

That was the good news, and it was significant. But it wasn't the whole story.

Where It Struggled

The issues Tim found weren't in the code — they were in the architecture. And that distinction matters enormously, because it tells you something important about where AI excels and where it still needs human oversight.

At the method level, inside individual files, the AI writes clean, professional code. But when it comes to project-level decisions — which services should talk to which other services, where files should live, how concerns should be separated across a multi-project solution — that's where things got interesting.

Here's what Tim found:

Database access patterns were inconsistent. The system has about thirteen deployable services. One of them — the Sender — was doing things exactly right: talking to the database exclusively through the API, using HTTP client implementations, authenticating via Entra ID tokens. It was, as Tim put it, the gold standard. But four other services were bypassing the API entirely and hitting the database directly through Dapper repositories. This wasn't a code quality issue — it was architectural drift that happened early in the project, before I'd codified the rule in my engineering standards.

Some API endpoints returned anonymous objects instead of typed models. Five public-facing endpoints were returning inline anonymous objects rather than proper POCO response classes. The code worked fine, but it made the endpoints harder to unit test (you end up fighting with dynamics), broke OpenAPI schema generation, and meant the API contract was implicit rather than explicit.
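To make the testing pain concrete, here's a minimal sketch. The payload and its property names are invented for illustration, not taken from the project; the point is what happens once a handler returns an anonymous object:

```csharp
using System;
using System.Text.Json;

// Illustrative only: once a handler returns an anonymous object, the payload's
// static type is just `object`, so a unit test can't touch its fields directly.
object payload = new { Id = 42, Status = "Scheduled" }; // what the endpoint returned

// The test ends up round-tripping through JSON (or fighting `dynamic`) just to
// assert on a single field:
using var doc = JsonDocument.Parse(JsonSerializer.Serialize(payload));
var status = doc.RootElement.GetProperty("Status").GetString();
Console.WriteLine(status); // prints "Scheduled"

// With a typed POCO response, e.g.
//   public sealed record ScheduleStatusResponse(int Id, string Status);
// the test becomes a plain property assertion, and the type shows up in the
// generated OpenAPI schema instead of an opaque object.
```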

The migration project was nearly empty. All the actual SQL migration files were stuffed into the Infrastructure project. Tim flagged this immediately, and when we talked through it, I realized what had probably happened: early in the project, before I understood how to guide the AI on project structure, it made a decision about where to put things and I didn't catch it. The AI didn't know it was wrong — I hadn't told it what "right" looked like yet.

No serialization attributes on request/response models. None of the models were decorated with [JsonPropertyName] attributes. In practice, ASP.NET Core defaults to camelCase serialization, so everything worked. But without explicit attributes, renaming a C# property would silently change the API contract. Tim's point was about resilience: make the contract explicit so refactoring can't accidentally break things.
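A small sketch of the hazard (the model and property names here are hypothetical). ASP.NET Core's Web serializer defaults camel-case whatever the C# property happens to be called, so the wire name silently tracks refactors:

```csharp
using System;
using System.Text.Json;

// ASP.NET Core uses JsonSerializerDefaults.Web, which camel-cases property
// names. The wire name is derived from the C# name, so renaming the property
// renames the API contract with no compile-time signal.
var webDefaults = new JsonSerializerOptions(JsonSerializerDefaults.Web);
var json = JsonSerializer.Serialize(new { RecipientAddress = "a@b.com" }, webDefaults);
Console.WriteLine(json); // {"recipientAddress":"a@b.com"}

// The fix from the review: pin the wire name explicitly so a rename can't
// change it (sketched as a comment; the real models live elsewhere):
//   public sealed record SendEmailRequest(
//       [property: JsonPropertyName("recipientAddress")] string RecipientAddress);
```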

Hard-coded configuration fallbacks. One service had a hard-coded default for the API base URL — a fallback in case the setting wasn't configured. Tim's take was immediate: if a required setting is missing, the application should throw on startup, not silently fall back to a default that might mask a real problem. You don't want something that "accidentally works" and then breaks in production with no obvious reason why.
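Tim's fail-fast rule fits in a few lines. This is a sketch, not the project's actual code: the setting key and the helper are hypothetical, and a real service would bind configuration rather than read a dictionary.

```csharp
using System;
using System.Collections.Generic;

// Simulated configuration with the required key intentionally absent.
var settings = new Dictionary<string, string?>();

// Anti-pattern the review flagged (a silent fallback masks the misconfiguration):
//   var baseUrl = settings.GetValueOrDefault("Api:BaseUrl") ?? "https://localhost:5001";

// Fail fast instead: a missing required setting should stop startup.
try
{
    var baseUrl = RequireSetting(settings, "Api:BaseUrl");
}
catch (InvalidOperationException ex)
{
    Console.WriteLine(ex.Message); // loud at startup, not mysterious in production
}

static string RequireSetting(IReadOnlyDictionary<string, string?> config, string key) =>
    config.TryGetValue(key, out var value) && !string.IsNullOrWhiteSpace(value)
        ? value!
        : throw new InvalidOperationException($"Required setting '{key}' is not configured.");
```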

The Pattern

When you step back from the individual findings, a pattern emerges. Every issue Tim identified fell into one of two categories: either the AI made an architectural decision I hadn't constrained, or it was inconsistent in applying a pattern across a large, multi-project solution.

Neither of these is surprising when you think about how agentic development works. The AI operates within a context window. It's exceptionally good at solving the problem in front of it — writing a clean method, implementing an endpoint, building a service. But maintaining global architectural consistency across thirteen services over the course of a long development process? That's where human oversight earns its keep.

This is exactly what I described in the framework post as context drift. The AI doesn't forget the rules because it's broken. It loses track of them because the context window is finite and the project is large. The solution isn't to stop using the AI. It's to build better guardrails.

The Feedback Loop

Here's what made this code review different from any I've experienced in a traditional development environment. In a normal team, a code review produces two things: approved code and maybe some institutional knowledge that lives in someone's head. In our process, every single finding became two outputs.

First, a ticket for the project. Fix the anonymous endpoints. Consolidate the internal API endpoints. Convert the remaining services to use the API instead of direct database access. Standard stuff.

Second — and this is the part that matters — an update to the TI Engineering Standards. Every finding that represented a pattern, not just a one-off mistake, went back into the standards repository that governs all future development:

- No hard-coded configuration fallbacks; throw on startup if a required setting is missing.
- API endpoints must return typed POCO models, never anonymous objects.
- All request and response models must be decorated with serialization attributes.
- APIs must use RFC 7807 Problem Details for error responses.
- Pagination responses must include a wrapper model with total counts.
- POST endpoints that create resources must return 201 Created with a Location header.
- Request/response naming follows the {Verb}{Entity}Request / {Entity}Response convention.
- Configuration settings bind to concrete POCO classes, not raw IConfiguration access.
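As one example of what a codified standard looks like on the wire, here's a sketch of an RFC 7807 Problem Details error body. The error type URI and values are invented; ASP.NET Core's built-in ProblemDetails support emits this same shape:

```csharp
using System;
using System.Text.Json;

// The RFC 7807 members are type, title, status, detail, and instance.
// The values below are illustrative only.
var problem = new
{
    type = "https://example.com/errors/schedule-not-found",
    title = "Schedule not found",
    status = 404,
    detail = "No schedule exists with id 42.",
    instance = "/schedules/42",
};

var json = JsonSerializer.Serialize(problem);
Console.WriteLine(json);
```

The payoff is the same as with typed response models: the error contract is explicit, machine-readable, and identical across every service.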

Every one of those standards now gets loaded into every Claude Code session, on every project, before any code gets written. The lessons from this review don't just fix one project — they prevent the same issues from appearing in anything I build going forward.

This is the compounding effect. Traditional code reviews make the current project better. This process makes every future project better.

The Black Box Model

Tim and I talked about what ongoing code review looks like in an agentic development world, and I landed on an analogy that felt right: the aviation black box model.

In traditional development, code review is a gate. Nothing gets merged until a senior engineer approves it. That makes sense when you have a team of developers who might introduce bugs, misunderstand requirements, or have varying skill levels. The PR review loop is the quality control mechanism.

But in agentic development, the AI is already running through an automated review pipeline — engineering review, security review, integration testing — before anything gets merged. Adding a human PR review on every commit would create a bottleneck that defeats the purpose of the whole system.

Instead, what works is periodic architectural audits. Like pulling the black box after a flight. You let the system run, you let the automated checks do their job, and then at defined intervals you bring in a senior engineer to look at the big picture. Not line by line — the AI handles that fine. But the architectural decisions, the cross-project consistency, the patterns that only become visible when you zoom out.

That's what Tim did. And the findings fed directly back into the standards, which means the automated pipeline gets smarter after every audit. The system learns from its own black box.

The New Role of the Senior Engineer

Something became very clear during this review: Tim wasn't doing what a code reviewer traditionally does. He wasn't checking for null pointer exceptions or debating variable names or catching off-by-one errors. The AI handles all of that competently.

What Tim was doing was something more valuable. He was looking at the system as a whole and asking questions the AI can't ask itself: Should these services be talking to each other this way? Is this architectural pattern going to cause problems at scale? Are we being consistent in how we handle this concern across the entire solution?

That's the new role. Not writing code. Not even reviewing code in the traditional sense. It's architectural oversight — the kind of judgment that comes from building systems for decades and knowing where the bodies are buried. It's the thing that turns a collection of well-written files into a coherent, maintainable system.

And it's exactly the kind of thing that an engineering standards repository can capture and enforce going forward, which means the senior engineer's impact multiplies across every project, not just the one they're reviewing.

What This Means

I walked into this code review braced for bad news. I walked out with a short list of architectural improvements and a significantly stronger set of engineering standards. The code quality was confirmed. The process was validated. The gaps we found were exactly the kind of gaps you'd expect in any v1 project — the difference is that in our process, those gaps get closed permanently, not just for this project but for everything that follows.

If you're building with agentic development and you haven't had someone else review the output yet — do it. Not because the code will be bad. It probably won't be. Do it because the conversation you'll have about the findings will make your entire system better. Every standard you codify, every architectural pattern you constrain, every lesson you feed back into your framework compounds into higher quality output on everything you build next.

The first code review wasn't a test I passed. It was a mechanism I discovered. And it's now a permanent part of how we work.

Kevin Phifer is the founder of Theoretically Impossible Solutions LLC, specializing in agentic AI development and consulting. You can reach him at kevin.phifer@theoreticallyimpossible.org.
