Experiment · In Progress
Two Roads to Deployment
Can a guided agent loop with 99.7% local compute match a 19K-line orchestration pipeline — and what does each approach trade away?
Hypothesis
The first three experiments focused on measuring AI code quality. Phase 4 shifts to a different question: what is the right architecture for generating it?
Two approaches emerged. A gated pipeline — 8 phases, 5 packages, 40 scoring dimensions, 19,168 lines of TypeScript — that enforces quality through structure. And a guided agent loop — a system prompt, a model cascade, and 833 lines of new Python — that relies on the AI agent's native ability to plan, build, and self-correct.
The hypothesis: a guided agent loop with minimal orchestration can produce functional code faster than a multi-phase pipeline — but the pipeline's scoring and gating may produce measurably higher quality output once operational. This experiment tests the first half of that hypothesis. The second half remains open.
The Two Approaches
Both approaches target the same goal: take an idea and produce deployment-ready code. They differ fundamentally in how much structure is imposed on the generation process.
| Dimension | Pipeline (Foundry) | Agent Loop (Forge) |
|---|---|---|
| Codebase size | 19,168 lines (TypeScript) | 833 new lines (Python) |
| Architecture | 8-phase gated pipeline, 5 packages | Agent loop + system prompt |
| Planning | 7 spec types, 25 harmonization iterations | BUILD_PLAN.md + CLAUDE.md (sequential) |
| Build method | Per-feature with custom engine | Batch phases via agent loop |
| Error recovery | Custom oscillation detector + escalation | Built-in circuit breaker + model escalation |
| Scoring | 40 ISO/IEC 25010 dimensions, 5 layers | Post-build verification pass |
| Trial success rate | 0/13 (0%) | 1/6 (17%) |
| Best output | No complete output | 54 files, 3,466 lines, 33 tests |
Foundry (Pipeline): 19,168 lines of TypeScript · Forge (Agent Loop): 833 new lines of Python
Trial Timeline
19 total trials · 43 failure modes · 2 approaches tested · 1 of 2 produced output
(Timeline: Pipeline Runs (Foundry) vs. Agent Loop Runs (Forge))
Each failed run produced specific learnings. The pipeline failures exposed 5 bugs in the acceptance runner and fundamental brittleness in per-feature gating with local models. The agent loop failures drove improvements to headless mode, verification triggers, clean environment guarantees, and prompt engineering.
The Model Cascade
The successful agent loop run used three model tiers, each for a specific phase. Cloud models handle the bookends (architectural planning and verification); the local model does all construction.
Planning (Gemini Flash): Write BUILD_PLAN.md and CLAUDE.md — architectural decisions, phase structure, infrastructure requirements
Construction (qwen3-coder-next, local): Execute the plan — 422 tool calls, batch phases, write all source files, run builds and tests
Verification (Claude Opus): Audit output against BUILD_PLAN.md, write missing tests, fix DI wiring, verify runtime boot
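As a rough sketch of how this cascade and the loop's escalation behavior fit together, the snippet below maps phases to model tiers and retries a phase before escalating. The structure, retry threshold, and `execute` callable are illustrative assumptions, not Forge's actual code; only the phase-to-model mapping is taken from the run described here.

```python
# Illustrative sketch only; not Forge's actual implementation.
# Phase-to-model mapping follows the successful run in the cost analysis below;
# the retry threshold and execute() callable are assumptions.

CASCADE = {
    "planning":     ["gemini-flash"],      # cloud bookend: writes BUILD_PLAN.md / CLAUDE.md
    "construction": ["qwen3-coder-next"],  # local model; escalation tiers could be appended here
    "verification": ["claude-opus"],       # cloud bookend: audit and fix pass
}

def run_phase(phase: str, execute, max_attempts_per_model: int = 3) -> str:
    """Run one phase, retrying on the current model and escalating to the next
    tier after repeated failures; trip the circuit breaker when tiers run out."""
    for model in CASCADE[phase]:
        for _ in range(max_attempts_per_model):
            if execute(model, phase):
                return model  # report which tier completed the phase
    raise RuntimeError(f"Circuit breaker tripped: phase {phase!r} failed on every tier")
```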
Cost Analysis
99.7% local compute · $0.004 out-of-pocket · $25.85 API-equivalent · 3h 30m duration
(Charts: Token Volume by Model · Cost Comparison)
The local model (qwen3-coder-next running on local hardware) processed 37.8 million tokens — 99.7% of all compute — at zero cost. Cloud models were used surgically: 2 Gemini Flash calls for planning ($0.004) and 2 Claude Opus passes for verification.
Cost transparency: the $0.004 out-of-pocket figure reflects Gemini API billing only. Claude was accessed via a Max subscription, making the marginal cost $0. At API rates, the same Claude usage would cost $25.85 — primarily from 99K output tokens at $75/M and 287K cache-write tokens at $18.75/M. Both numbers are reported for honest accounting.
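For the arithmetic behind the API-equivalent figure: the two line items cited above account for roughly half of the $25.85; the remainder presumably comes from input and cache-read tokens, which are not itemized here.

```python
# Back-of-envelope check of the two Claude line items quoted above (API rates).
output_cost      =  99_000 / 1_000_000 * 75.00   # ≈ $7.43
cache_write_cost = 287_000 / 1_000_000 * 18.75   # ≈ $5.38
print(f"${output_cost + cache_write_cost:.2f} of the $25.85 API-equivalent total")  # ≈ $12.81
```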
Output Assessment
The successful run (forge-06) produced a NestJS + Next.js todo application. A post-run audit identified 43 issues across 4 severity levels.
54 TypeScript files · 3,466 lines of code · 33 passing tests · 12 git commits
43 Issues by Severity
Critical: No DTO validation, race conditions (find-then-create), broken frontend auth flow, field name mismatches, JWT secret in committed .env
High: No pagination, no health check, no request logging, no rate limiting, zero controller tests, zero e2e tests
Medium: Inefficient dashboard queries, inconsistent guard usage, API URL hardcoded 15+ times, no Dockerfile
Low: Console.log instead of Logger, missing @HttpCode decorators
The output is functional but not production-ready. The app builds, tests pass, and the server starts — but the auth flow is broken end-to-end, validation is missing, and race conditions exist in all mutating operations. This is a sketch, not a product.
Root Causes
The 43 issues trace to 5 root causes — all addressable at the prompt level, not the architecture level. This is an important finding: the agent loop architecture is not the bottleneck.
BUILD_PLAN doesn’t specify infrastructure
The planning prompt produced feature lists but not infrastructure patterns (guards, filters, pipes, middleware). The local model doesn’t add these unless told to.
Fix applied: Planning prompt now requires infrastructure sections: every guard, filter, pipe, and middleware the app needs.
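A minimal sketch of what that requirement can look like when appended to the planning prompt; the wording and variable names below are illustrative, not the actual prompt text.

```python
# Illustrative prompt fragment; not the actual Forge planning prompt.
INFRASTRUCTURE_SECTION = """
BUILD_PLAN.md must contain an "Infrastructure" section that lists every:
- guard (e.g. auth, roles) and the routes it protects
- global pipe and exception filter to register at app bootstrap
- middleware (logging, rate limiting) and where it is mounted
A feature without its supporting infrastructure is treated as unplanned.
"""

base_planning_prompt = "Plan the application described in IDEA.md."  # placeholder
planning_prompt = base_planning_prompt + INFRASTRUCTURE_SECTION
```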
Verification doesn’t test runtime behavior
Verification checked “does it build?” and “do tests exist?” but not “does the auth flow work end-to-end?”
Fix applied: Verification now includes functional smoke tests: register → login → create → list → dashboard.
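A minimal sketch of such a smoke test, assuming the generated app runs on localhost with conventional /auth and /tasks routes; the exact paths and field names of forge-06's output may differ.

```python
# Functional smoke test sketch: register -> login -> create -> list -> dashboard.
# Endpoint paths, port, and field names are assumptions, not verified routes.
import requests

BASE = "http://localhost:3000"

def smoke_test() -> None:
    creds = {"email": "smoke@example.dev", "password": "Sm0ke-test!"}
    requests.post(f"{BASE}/auth/register", json=creds).raise_for_status()
    token = requests.post(f"{BASE}/auth/login", json=creds).json()["accessToken"]
    auth = {"Authorization": f"Bearer {token}"}
    requests.post(f"{BASE}/tasks", json={"title": "smoke"}, headers=auth).raise_for_status()
    tasks = requests.get(f"{BASE}/tasks", headers=auth).json()
    assert any(t.get("title") == "smoke" for t in tasks), "created task not listed"
    assert requests.get(f"{BASE}/dashboard", headers=auth).ok, "dashboard unreachable"

if __name__ == "__main__":
    smoke_test()
    print("smoke test passed")
```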
Shared package never wired up
Zod schemas were written in packages/shared but never imported by the API. Dead code.
Fix applied: BUILD_PLAN must specify cross-package imports. Verification checks that shared exports are consumed.
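One cheap way to catch that class of dead code is a static check that every export from packages/shared is referenced somewhere in the API source. The sketch below assumes a conventional monorepo layout; the paths are illustrative.

```python
# Sketch: flag shared-package exports never referenced by the API source.
# Directory layout ("packages/shared/src", "apps/api/src") is an assumption.
import re
from pathlib import Path

EXPORT_RE = re.compile(r"export (?:const|class|interface|type|function) (\w+)")

shared_exports: set[str] = set()
for f in Path("packages/shared/src").rglob("*.ts"):
    shared_exports.update(EXPORT_RE.findall(f.read_text()))

api_source = "\n".join(f.read_text() for f in Path("apps/api/src").rglob("*.ts"))
unused = sorted(name for name in shared_exports if name not in api_source)
if unused:
    print("Shared exports never referenced by the API:", unused)
```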
Frontend-backend contract not verified
Dashboard expected {total, completed}, API returned {totalTasks, completedTasks}. Dashboard showed zeros.
Fix applied: Verification now checks that every frontend API call receives the response shape it expects.
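A sketch of that shape check: for each frontend API call, assert that the live response contains the keys the frontend destructures. The endpoint path below is hypothetical; the key names mirror the mismatch described above.

```python
# Sketch: verify each frontend call receives the response keys it destructures.
# The endpoint path is hypothetical; key names mirror the dashboard mismatch above.
import requests

CONTRACTS = {
    "/dashboard/stats": {"total", "completed"},  # keys the dashboard expects
}

def check_contracts(base_url: str, headers: dict[str, str]) -> list[str]:
    problems = []
    for path, expected in CONTRACTS.items():
        body = requests.get(base_url + path, headers=headers).json()
        missing = expected - set(body)
        if missing:
            problems.append(f"{path}: missing {sorted(missing)}, got {sorted(body)}")
    return problems
```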
Local model lacks NestJS conventions
qwen3-coder-next consistently missed: PrismaModule imports, guard decorators, global pipes/filters, proper HTTP status codes.
Fix applied: Stack preset now includes explicit NestJS convention rules injected into the system prompt.
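A sketch of that injection; the rule wording is illustrative, not the actual stack preset, though each rule corresponds to a gap observed in the audit.

```python
# Illustrative stack preset; rule wording is a sketch, not the actual Forge preset.
NESTJS_CONVENTIONS = [
    "Every module that touches the database imports PrismaModule.",
    "Protected routes use @UseGuards(JwtAuthGuard) rather than ad-hoc checks.",
    "Register a global ValidationPipe and exception filter in main.ts.",
    "Return explicit status codes with @HttpCode (e.g. 204 for DELETE).",
    "Use the Nest Logger, never console.log.",
]

def build_system_prompt(base_prompt: str, stack: str = "nestjs") -> str:
    rules = NESTJS_CONVENTIONS if stack == "nestjs" else []
    return base_prompt + "\n\nStack conventions:\n" + "\n".join(f"- {r}" for r in rules)
```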
Quality Benchmark
The forge output was compared against two reference projects: a human-guided Claude session (telehealth-booking) and a production SaaS (SavSpot).
| Metric | Forge Output | Telehealth-Booking | SavSpot |
|---|---|---|---|
| TS/TSX Files | 54 | 198 | 1,105 |
| Lines of Code | 3,466 | 22,957 | 404,003 |
| Test Files | 5 | 12 | 197 |
| Duration | 3.5 hours | ~7 hours | 18 days |
| Method | Automated agent loop | Human + Claude | Human + AI-assisted |
| Commits | 12 | 5 | 217 |
The forge output is 6.6x smaller than the human-guided session and 117x smaller than production SaaS. The gap is not primarily about compute or cost — it is about prompt depth and verification scope. The same agent loop with better prompts should produce substantially more complete output.
What's Next
Development continues on both approaches. Neither has been abandoned.
Agent Loop (Forge)
Run v5 with all 5 root cause fixes applied. Test on a more complex application (multi-entity, auth, dashboard). Compare output quality to telehealth-booking as the human-guided baseline.
Pipeline (Foundry)
Fix the 5 acceptance runner bugs. Improve spec harmonization to eliminate the 25-iteration loop. The pipeline's value proposition — scoring, gating, and quality assurance at scale — remains valid. Once operational, it should produce measurably higher quality than the agent loop.
The deeper question: at what scale does orchestration start paying for itself? A todo app may not need 8 phases and 40 scoring dimensions. A 100K-line enterprise application might. This experiment has not yet tested at that scale.
Transparency
Honest accounting of what this experiment can and cannot claim.
Claims NOT Supported
“Production-ready output”
43 issues (8 critical) in the post-run audit. The output builds and tests pass, but the auth flow is broken, validation is missing, and race conditions exist.
“Pipeline approach is wrong”
The pipeline (Foundry) attempts a harder problem: quality assurance at scale with scoring and gating. 13 failures are iteration data, not a verdict. Development continues.
“$0 cost” without qualification
Out-of-pocket was $0.004 (Gemini only). Claude was accessed via Max subscription ($0 marginal). At API rates, the same usage costs $25.85. Both figures are required for honest reporting.
Known Limitations
Single application type tested
Only a NestJS + Next.js todo app. Results may differ for other stacks, larger applications, or different domain complexity.
Single local model tested
qwen3-coder-next only. Other local models (CodeLlama, DeepSeek-Coder) may perform differently. No cross-model comparison has been done.
Prompt fixes not yet validated
All 5 root cause fixes have been committed to the forge codebase but not tested in a clean run. Their effectiveness is unproven.
Pipeline comparison is asymmetric
The pipeline had 13 runs against a harder problem with known bugs. The agent loop had 6 runs against a simpler problem. Direct comparison of success rates (0% vs 17%) is misleading without this context.
Relationship to Prior Phases
Shifts from measurement to architecture
Phases 1–3 asked “how do we measure AI code quality?” Phase 4 asks “what is the right architecture for generating it?” These are complementary questions. The pipeline’s scoring framework (from Phase 3) could be applied to the agent loop’s output in future work.
Foundry incorporates prior findings
The pipeline’s scoring engine uses 40 ISO/IEC 25010 dimensions from Phase 3 and deterministic tool verification from Phase 2. The pipeline is not separate from the research — it is an attempt to operationalize it.
Data Access
Trial output, gap analysis, and experiment reports are public. The Foundry and Forge codebases are private.