Two Roads to Deployment
A 19K-line pipeline failed 13 times. An 833-line command succeeded in 3.5 hours. Both approaches continue.
Three experiments into the CED research program, I had a deep understanding of how to measure AI-generated code. Forty ISO/IEC 25010 dimensions. Five epistemic scoring layers. Automated tools replacing subjective LLM panels. The measurement infrastructure was rigorous.
What I didn't have was a reliable way to generate it.
Phase 4 was supposed to answer that question with a pipeline. What it actually answered was more interesting: there are two viable roads to deployment, and they trade off in ways I didn't expect.
The Pipeline
Foundry was built to operationalize everything the first three experiments learned. An 8-phase gated pipeline with 5 packages, 19,168 lines of TypeScript, 40 scoring dimensions, and a specification-first architecture that enforced quality through structure. The thesis was compelling: if you gate every phase and score every dimension, quality is guaranteed.
Thirteen runs later, the pipeline had produced zero complete applications.
The failures were specific and instructive. The acceptance runner had 5 bugs — among them, it expected Bearer tokens when the API used cookies, it didn't reset the database between tests, and it hardcoded tokens instead of extracting them from responses. The spec harmonization layer required 25 iterations to align 7 spec types, and the specs still contained phantom commands and inconsistent field names. Per-feature build gates were too brittle for the local model — each feature was built in isolation, so errors accumulated across features instead of being caught in context.
None of these are fundamental. They are engineering bugs in an ambitious system. But 13 consecutive failures with zero output forced a question: is there a simpler path to the same destination?
The Observation
Two weeks before Phase 4 began, I had built MedConnect — a multi-tenant telehealth booking platform. 198 files, 22,957 lines of TypeScript, 12 test files. It took about 7 hours in a single Claude Code session.
The method was simple. I wrote a BUILD_PLAN.md — 148 lines describing the architecture, data model, and phased build order. Then I derived a CLAUDE.md from it — 147 lines of navigation context for the AI. Then I said “go” and guided the session through each phase, reviewing output and redirecting when needed.
No pipeline. No scoring engine. No acceptance runner. No harmonization layer. Just a plan and an agent. And the output was a working application — the same kind of application that 13 pipeline runs had failed to produce.
This raised an uncomfortable hypothesis: what if the pipeline is solving a problem the agent doesn't have?
The Experiment
I had an existing tool called Forge — 15,109 lines of Python providing an agent loop with file tools, shell access, circuit breakers, model escalation, session persistence, and web search. Everything a developer needs to build software, wrapped around a local AI model.
I added 833 lines of new code: a command handler, prompt templates, and CLI wiring. The forge new command takes an idea and generates a project. No custom build engine, no scoring, no multi-spec derivation. Just the system prompt telling the agent what to do, and the agent doing it.
The key architectural decision was the model cascade. Cloud models handle the bookends: Gemini Flash for planning (architectural decisions, phase structure, infrastructure requirements) and Claude for verification (audit against BUILD_PLAN, write missing tests, fix wiring issues). The local model — qwen3-coder-next running on local hardware — does all the actual construction.
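To make the shape of that cascade concrete, here is a minimal sketch. It is written in TypeScript rather than Forge's Python, purely for readability; ModelClient, Cascade, and generateProject are illustrative names under my own assumptions, not Forge's actual interfaces.

```typescript
// A sketch of the cascade's structure, not Forge's implementation.
interface ModelClient {
  complete(prompt: string): Promise<string>;
}

interface Cascade {
  planner: ModelClient;  // cloud: Gemini Flash, cheap one-shot architectural planning
  builder: ModelClient;  // local: qwen3-coder-next, does all of the construction work
  verifier: ModelClient; // cloud: Claude, audits the result and patches what is missing
}

async function generateProject(idea: string, models: Cascade): Promise<void> {
  // Planning bookend: one cloud call produces the phased BUILD_PLAN.
  const buildPlan = await models.planner.complete(
    `Write a phased BUILD_PLAN (architecture, data model, build order) for: ${idea}`
  );

  // Construction: the local agent loop works through the plan phase by phase.
  // This is where nearly all of the tokens, and none of the API spend, go.
  // (Naive phase split, purely for illustration.)
  for (const phase of buildPlan.split(/\n(?=Phase \d)/)) {
    await models.builder.complete(`Implement this phase, run the build, commit:\n${phase}`);
  }

  // Verification bookend: a cloud pass audits the repo against the plan
  // and fixes gaps (wiring, missing tests, error handling).
  await models.verifier.complete(
    `Audit the repository against this BUILD_PLAN and fix anything missing:\n${buildPlan}`
  );
}
```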
Six trials. The first five failed for specific, fixable reasons: headless mode exited after the first turn, the agent inherited broken state from a prior run, the server-start command hung forever, the verification loop never triggered because it was checking turn count instead of model output, and the local model ran rm -rf and deleted its own specs.
The sixth trial succeeded. Three and a half hours. Fifty-four TypeScript files, 3,466 lines of code, 33 passing tests, 12 git commits. The local model made 422 tool calls, processing 37.8 million tokens. The Gemini planning calls took 30 seconds and cost $0.004. The Claude verification passes took 10 minutes and fixed DI wiring in 3 modules, wrote 5 test files, and added CORS and error handling.
The Cost Story
The numbers that stopped me: 99.7% of all compute ran on local hardware at zero cost. The remaining 0.3% — planning and verification — cost $0.004 out of pocket (Gemini API) plus whatever fraction of a Claude Max subscription covers 10 minutes of use. At API rates, the Claude usage would be $25.85.
This is an interesting economic structure. The expensive work (37.8M tokens of code generation, file writes, build verification) runs for free on hardware I already own. The cheap work (two planning calls, two verification passes) uses cloud models surgically for the tasks where capability matters most: architectural planning and quality assessment.
It isn't accurate to call this “free.” The hardware has a cost. The subscription has a cost. But the marginal cost of generating an application — the cost of one more run — is $0.004. That changes the economics of iteration. When failure is effectively free, you can afford many more attempts.
The Gap
I ran a thorough post-run audit and found 43 issues. Eight were critical, among them: no DTO validation (the ValidationPipe was configured with nothing to validate), race conditions in register and label creation (find-then-create without transactions), a broken frontend auth flow (cookies and localStorage mixed incoherently), and field name mismatches between the dashboard frontend and API.
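The DTO gap is the easiest of these to show. In NestJS, a global ValidationPipe only does work when request handlers type their bodies as DTO classes carrying class-validator decorators; the generated code had the pipe wired up but nothing decorated for it to check. Below is a minimal sketch of the missing piece; RegisterDto, AuthController, and AppModule are illustrative names, not the generated project's actual modules.

```typescript
// Illustrative single-file NestJS sketch (a real project splits these across modules).
import { Body, Controller, Module, Post, ValidationPipe } from '@nestjs/common';
import { NestFactory } from '@nestjs/core';
import { IsEmail, IsString, MinLength } from 'class-validator';

// The decorators are what give ValidationPipe rules to enforce; without them,
// the pipe is configured but has nothing to validate.
class RegisterDto {
  @IsEmail()
  email!: string;

  @IsString()
  @MinLength(8)
  password!: string;
}

// Typing the body as RegisterDto is what routes the request through those rules.
@Controller('auth')
class AuthController {
  @Post('register')
  register(@Body() dto: RegisterDto) {
    return { email: dto.email }; // placeholder handler
  }
}

@Module({ controllers: [AuthController] })
class AppModule {}

async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  // Strip unknown fields and reject undeclared ones instead of passing them through.
  app.useGlobalPipes(new ValidationPipe({ whitelist: true, forbidNonWhitelisted: true }));
  await app.listen(3000);
}
bootstrap();
```

The find-then-create races are the same flavor of omission: wrap the lookup and insert in a transaction, or let a unique constraint arbitrate, rather than hoping two requests never interleave.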
Compared to production benchmarks, the output is a sketch. SavSpot — an open-source booking platform I maintain — has 404,003 lines, 197 test files, and 217 commits. The forge new output is 117 times smaller. Even the human-guided MedConnect session produced 6.6 times more code.
But the 43 issues traced to 5 root causes, and all 5 are addressable at the prompt level: the BUILD_PLAN didn't specify infrastructure patterns, verification didn't test runtime behavior, the shared package was never wired up, the frontend-backend contract wasn't verified, and the local model doesn't know NestJS conventions unless told explicitly.
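To make "addressable at the prompt level" concrete, the flavor of those fixes looks something like the fragment below. It is purely illustrative, written here as a TypeScript constant for consistency with the other sketches in this post; it is not the actual Forge template text.

```typescript
// Illustrative fragment of the kind of constraints the planning and build
// prompts now carry. The real Forge templates are not reproduced here.
export const INFRA_CONSTRAINTS = `
- Declare request DTOs with class-validator decorators; enable a global ValidationPipe.
- Never find-then-create without a transaction or a unique constraint to back it.
- Pick one auth transport (httpOnly cookies) and use it on both frontend and backend.
- Import shared types from the shared package; do not redeclare them per app.
- Frontend field names must match the API response schema exactly.
- After each phase, start the server and exercise one real endpoint before committing.
`;
```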
The fixes are committed. They haven't been tested in a clean run. That's next.
The Road Ahead
Neither approach is abandoned. Both continue.
The agent loop established a baseline: 3.5 hours, $0.004, working application. For the pipeline to justify its 19,168 lines, it must produce measurably higher-quality output — not just code that compiles, but code that passes the scoring framework it was built to enforce. The pipeline's value proposition (scoring, gating, quality assurance at scale) is real. It just hasn't been realized yet.
The deeper question is about scale. A todo app may not need 8 phases and 40 scoring dimensions. But a 100,000-line enterprise application might. The agent loop succeeds by trusting the AI to self-organize. The pipeline succeeds by constraining it. At some point — some level of complexity, some number of modules, some depth of integration — self-organization should break down and structure should start paying for itself.
Finding that crossover point is what Phase 4 is about. We are not there yet.
Full trial data, the 43-issue gap analysis, and comparison tables are on the experiment page. This builds on the questions raised in The Measurement Problem. The first three experiments asked how to measure quality. Phase 4 asks what architecture produces it. The research continues.