
Experiment · In Progress

Two Roads to Deployment

Can a guided agent loop with 99.7% local compute match a 19K-line orchestration pipeline — and what does each approach trade away?

Hypothesis

The first three experiments focused on measuring AI code quality. Phase 4 shifts to a different question: what is the right architecture for generating it?

Two approaches emerged. A gated pipeline — 8 phases, 5 packages, 40 scoring dimensions, 19,168 lines of TypeScript — that enforces quality through structure. And a guided agent loop — a system prompt, a model cascade, and 833 lines of new Python — that relies on the AI agent's native ability to plan, build, and self-correct.

The hypothesis: a guided agent loop with minimal orchestration can produce functional code faster than a multi-phase pipeline — but the pipeline's scoring and gating may produce measurably higher quality output once operational. This experiment tests the first half of that hypothesis. The second half remains open.

The Two Approaches

Both approaches target the same goal: take an idea and produce deployment-ready code. They differ fundamentally in how much structure is imposed on the generation process.

| Dimension | Pipeline (Foundry) | Agent Loop (Forge) |
|---|---|---|
| Codebase size | 19,168 lines (TypeScript) | 833 new lines (Python) |
| Architecture | 8-phase gated pipeline, 5 packages | Agent loop + system prompt |
| Planning | 7 spec types, 25 harmonization iterations | BUILD_PLAN.md + CLAUDE.md (sequential) |
| Build method | Per-feature with custom engine | Batch phases via agent loop |
| Error recovery | Custom oscillation detector + escalation | Built-in circuit breaker + model escalation |
| Scoring | 40 ISO/IEC 25010 dimensions, 5 layers | Post-build verification pass |
| Trial success rate | 0/13 (0%) | 1/6 (17%) |
| Best output | No complete output | 54 files, 3,466 lines, 33 tests |

Foundry (Pipeline): 19,168 lines of TypeScript
Forge (Agent Loop): 833 new lines of Python

Trial Timeline

19 total trials · 43 distinct failure modes · 2 approaches tested · 1/2 produced output

Pipeline Runs (Foundry)

| Run | Failure mode | Result |
|---|---|---|
| foundry-01 | Acceptance runner: no cookie auth support | Failed |
| foundry-02 | Spec harmonization loop (25 iterations) | Failed |
| foundry-03 | Acceptance runner: hardcoded tokens | Failed |
| foundry-04 | Per-feature gate too brittle for local model | Failed |
| foundry-05 | Acceptance runner: no DB reset between tests | Failed |
| foundry-06 | Build errors accumulated across features | Failed |
| foundry-07 | Missing escalation counter in acceptance | Failed |
| foundry-08 | Spec inconsistencies: phantom commands | Failed |
| foundry-09 | Rate-limit setup steps missing | Failed |
| foundry-10 | Local model oscillation in fix loops | Failed |
| foundry-11 | Auth feature acceptance: Bearer vs cookie mismatch | Failed |
| foundry-12 | Build errors: 17 → 83 across fix attempts | Failed |
| foundry-13 | Accumulated state from prior failures | Failed |

Agent Loop Runs (Forge)

| Run | Outcome | Result |
|---|---|---|
| forge-01 | Scaffolding only, abandoned early | Partial |
| forge-02 | Headless mode exit after first turn | Partial |
| forge-03 | Inherited 88 build errors from broken state | Partial |
| forge-04 | Server-start hang, verification never triggered | Partial |
| forge-05 | Local model ran `rm -rf`, destroyed specs | Partial |
| forge-06 | 54 files, 3,466 lines, 33 tests · 3h 30m | Success |

Each failed run produced specific learnings. The pipeline failures exposed 5 bugs in the acceptance runner and fundamental brittleness in per-feature gating with local models. The agent loop failures drove improvements to headless mode, verification triggers, clean environment guarantees, and prompt engineering.
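The agent loop's error recovery (the built-in circuit breaker with model escalation noted in the comparison table) can be sketched as a small loop. This is a minimal illustration, not the Forge implementation: `run_phase`, the tier names, and the attempt limits are all hypothetical.

```python
# Minimal sketch of a circuit breaker with model escalation.
# `run_phase`, the tier names, and the attempt limit are illustrative
# placeholders, not the actual Forge API.

def run_with_escalation(phase, run_phase, tiers=("local", "flash", "opus"),
                        max_attempts_per_tier=3):
    """Try a phase on the cheapest model first; escalate on repeated failure."""
    for tier in tiers:
        failures = 0
        while failures < max_attempts_per_tier:
            result = run_phase(phase, model=tier)
            if result["ok"]:
                return result
            failures += 1
        # Circuit opens for this tier: stop retrying and escalate upward.
        print(f"{phase}: circuit open for {tier}, escalating")
    raise RuntimeError(f"{phase}: all model tiers exhausted")
```

The key design choice is that the breaker bounds repeated failure on one tier (the oscillation the pipeline's custom detector had to catch) before spending money on a stronger model.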

The Model Cascade

The successful agent loop run used three model tiers, each for a specific phase. Cloud models handle the bookends (architectural planning and verification); the local model does all construction.

Planning: Gemini 2.5 Flash

Write BUILD_PLAN.md and CLAUDE.md — architectural decisions, phase structure, infrastructure requirements

~13K tokens · ~30 seconds

Building: qwen3-coder-next (local)

Execute the plan — 422 tool calls, batch phases, write all source files, run builds and tests

37.8M tokens · ~88 minutes

Verification: Claude Opus 4.6

Audit output against BUILD_PLAN.md, write missing tests, fix DI wiring, verify runtime boot

~8.6M tokens (incl. cache) · ~10 minutes
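The cascade above amounts to a phase-to-model routing table. A minimal sketch, assuming the model identifiers named in the text (the dict and helper function are illustrative, not Forge internals):

```python
# Route each phase of a run to its model tier, per the cascade above.
# The mapping mirrors the text; the helper itself is a hypothetical sketch.

CASCADE = {
    "planning":     "gemini-2.5-flash",  # write BUILD_PLAN.md and CLAUDE.md
    "building":     "qwen3-coder-next",  # local model does all construction
    "verification": "claude-opus-4.6",   # audit output against BUILD_PLAN.md
}

def model_for(phase: str) -> str:
    """Return the model tier for a phase; default to the local builder."""
    return CASCADE.get(phase, CASCADE["building"])
```

Defaulting unknown phases to the local builder keeps cloud usage confined to the two bookend phases.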

Cost Analysis

99.7% local compute · $0.004 out-of-pocket · $25.85 API-equivalent · 3h 30m duration

[Chart: Token Volume by Model]

[Chart: Cost Comparison]

The local model (qwen3-coder-next running on local hardware) processed 37.8 million tokens — 99.7% of all compute — at zero cost. Cloud models were used surgically: 2 Gemini Flash calls for planning ($0.004) and 2 Claude Opus passes for verification.

Cost transparency: the $0.004 out-of-pocket figure reflects Gemini API billing only. Claude was accessed via a Max subscription, making the marginal cost $0. At API rates, the same Claude usage would cost $25.85 — primarily from 99K output tokens at $75/M and 287K cache-write tokens at $18.75/M. Both numbers are reported for honest accounting.
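The dominant components of the API-equivalent figure can be checked from the token counts and rates given; the remainder of the $25.85 comes from input and cache-read tokens not itemized here.

```python
# Reproduce the two dominant components of the $25.85 API-equivalent
# figure from the token counts and per-million rates stated in the text.

output_cost = 99_000 * 75.00 / 1_000_000        # 99K output tokens at $75/M
cache_write_cost = 287_000 * 18.75 / 1_000_000  # 287K cache writes at $18.75/M

print(round(output_cost + cache_write_cost, 2))  # 12.81 of the $25.85
```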

Output Assessment

The successful run (forge-06) produced a NestJS + Next.js todo application. A post-run audit identified 43 issues across 4 severity levels.

54 TypeScript files · 3,466 lines of code · 33 passing tests · 12 git commits

43 Issues by Severity

Critical (8): No DTO validation, race conditions (find-then-create), broken frontend auth flow, field name mismatches, JWT secret in committed .env

High (17): No pagination, no health check, no request logging, no rate limiting, zero controller tests, zero e2e tests

Medium (14): Inefficient dashboard queries, inconsistent guard usage, API URL hardcoded 15+ times, no Dockerfile

Low (4): Console.log instead of Logger, missing @HttpCode decorators

The output is functional but not production-ready. The app builds, tests pass, and the server starts — but the auth flow is broken end-to-end, validation is missing, and race conditions exist in all mutating operations. This is a sketch, not a product.

Root Causes

The 43 issues trace to 5 root causes — all addressable at the prompt level, not the architecture level. This is an important finding: the agent loop architecture is not the bottleneck.

BUILD_PLAN doesn’t specify infrastructure

The planning prompt produced feature lists but not infrastructure patterns (guards, filters, pipes, middleware). The local model doesn’t add these unless told to.

Fix applied: Planning prompt now requires infrastructure sections: every guard, filter, pipe, and middleware the app needs.

Verification doesn’t test runtime behavior

Verification checked “does it build?” and “do tests exist?” but not “does the auth flow work end-to-end?”

Fix applied: Verification now includes functional smoke tests: register → login → create → list → dashboard.
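That smoke-test sequence can be sketched as a small driver. Everything here is an assumption about the generated app, not its confirmed API: the endpoint paths, payload fields, status codes, and the `client` callable (which in real use would wrap an HTTP library such as `requests`).

```python
# Functional smoke test: register -> login -> create -> list -> dashboard.
# `client` is any callable (method, url, json_body, token) -> (status, body);
# endpoint paths and payload fields are illustrative assumptions about the
# generated todo app, not its confirmed contract.

def smoke_test(client, base="http://localhost:3000"):
    creds = {"email": "smoke@test.dev", "password": "hunter22"}
    status, _ = client("POST", f"{base}/auth/register", creds, None)
    assert status == 201, "register failed"
    status, body = client("POST", f"{base}/auth/login", creds, None)
    assert status == 200, "login failed"
    token = body["accessToken"]
    status, _ = client("POST", f"{base}/tasks", {"title": "smoke"}, token)
    assert status == 201, "task create failed"
    status, body = client("GET", f"{base}/tasks", None, token)
    assert status == 200 and any(t["title"] == "smoke" for t in body), "list failed"
    status, body = client("GET", f"{base}/dashboard", None, token)
    assert status == 200, "dashboard failed"
    return True
```

Parameterizing on `client` keeps the sequence testable without a running server; the verification pass would supply a real HTTP wrapper.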

Shared package never wired up

Zod schemas were written in packages/shared but never imported by the API. Dead code.

Fix applied: BUILD_PLAN must specify cross-package imports. Verification checks that shared exports are consumed.

Frontend-backend contract not verified

Dashboard expected {total, completed}, API returned {totalTasks, completedTasks}. Dashboard showed zeros.

Fix applied: Verification now checks that every frontend API call receives the response shape it expects.
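A minimal shape check for this kind of drift is to diff the keys a frontend call expects against a sample response. The field names below mirror the mismatch described above; the check itself is a generic sketch, not the verification pass's actual code.

```python
# Detect frontend/backend contract drift by diffing the keys the frontend
# expects against an actual response sample. Field names mirror the
# mismatch described above; the check itself is a generic sketch.

def contract_gaps(expected_keys, response: dict):
    """Return the expected keys missing from the response, sorted."""
    return sorted(set(expected_keys) - set(response))

# Dashboard expected {total, completed}; the API returned camelCase names.
api_response = {"totalTasks": 7, "completedTasks": 3}
print(contract_gaps({"total", "completed"}, api_response))
# -> ['completed', 'total']  (both expected fields missing: dashboard shows zeros)
```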

Local model lacks NestJS conventions

qwen3-coder-next consistently missed: PrismaModule imports, guard decorators, global pipes/filters, proper HTTP status codes.

Fix applied: Stack preset now includes explicit NestJS convention rules injected into the system prompt.
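The last fix (stack-specific convention rules injected into the system prompt) amounts to appending a rule block per stack preset. A sketch, assuming a hypothetical preset structure; the rule text paraphrases the conventions listed above:

```python
# Sketch of a stack preset whose convention rules are appended to the
# agent's system prompt. The preset structure is hypothetical; the rules
# paraphrase the NestJS gaps listed above.

NESTJS_PRESET = {
    "stack": "nestjs",
    "conventions": [
        "Import PrismaModule in every module that injects PrismaService.",
        "Apply auth guard decorators to every non-public controller route.",
        "Register a global ValidationPipe and exception filter in main.ts.",
        "Use explicit @HttpCode decorators for non-default status codes.",
    ],
}

def build_system_prompt(base_prompt: str, preset: dict) -> str:
    """Append the preset's convention rules to the base system prompt."""
    rules = "\n".join(f"- {r}" for r in preset["conventions"])
    return f"{base_prompt}\n\n## {preset['stack']} conventions\n{rules}"
```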

Quality Benchmark

The forge output compared against two reference projects: a human-guided Claude session (telehealth-booking) and a production SaaS (SavSpot).

| Metric | Forge Output | Telehealth-Booking | SavSpot |
|---|---|---|---|
| TS/TSX Files | 54 | 198 | 1,105 |
| Lines of Code | 3,466 | 22,957 | 404,003 |
| Test Files | 5 | 12 | 197 |
| Duration | 3.5 hours | ~7 hours | 18 days |
| Method | Automated agent loop | Human + Claude | Human + AI-assisted |
| Commits | 12 | 5 | 217 |

The forge output is 6.6x smaller than the human-guided session and 117x smaller than production SaaS. The gap is not primarily about compute or cost — it is about prompt depth and verification scope. The same agent loop with better prompts should produce substantially more complete output.

What's Next

Both approaches continue development. Neither has been abandoned.

Agent Loop (Forge)

Run v5 with all 5 root cause fixes applied. Test on a more complex application (multi-entity, auth, dashboard). Compare output quality to telehealth-booking as the human-guided baseline.

Pipeline (Foundry)

Fix the 5 acceptance runner bugs. Improve spec harmonization to eliminate the 25-iteration loop. The pipeline's value proposition — scoring, gating, and quality assurance at scale — remains valid. Once operational, it should produce measurably higher quality than the agent loop.

The deeper question: at what scale does orchestration start paying for itself? A todo app may not need 8 phases and 40 scoring dimensions. A 100K-line enterprise application might. This experiment has not yet tested at that scale.

Transparency

Honest accounting of what this experiment can and cannot claim.

Claims NOT Supported

“Production-ready output”

43 issues (8 critical) in the post-run audit. The output builds and tests pass, but the auth flow is broken, validation is missing, and race conditions exist.

“Pipeline approach is wrong”

The pipeline (Foundry) attempts a harder problem: quality assurance at scale with scoring and gating. 13 failures are iteration data, not a verdict. Development continues.

“$0 cost” without qualification

Out-of-pocket was $0.004 (Gemini only). Claude was accessed via Max subscription ($0 marginal). At API rates, the same usage costs $25.85. Both figures are required for honest reporting.

Known Limitations

Single application type tested

Only a NestJS + Next.js todo app. Results may differ for other stacks, larger applications, or different domain complexity.

Single local model tested

qwen3-coder-next only. Other local models (CodeLlama, DeepSeek-Coder) may perform differently. No cross-model comparison has been done.

Prompt fixes not yet validated

All 5 root cause fixes have been committed to the forge codebase but not tested in a clean run. Their effectiveness is unproven.

Pipeline comparison is asymmetric

The pipeline had 13 runs against a harder problem with known bugs. The agent loop had 6 runs against a simpler problem. Direct comparison of success rates (0% vs 17%) is misleading without this context.

Relationship to Prior Phases

Shifts from measurement to architecture

Phases 1–3 asked “how do we measure AI code quality?” Phase 4 asks “what is the right architecture for generating it?” These are complementary questions. The pipeline’s scoring framework (from Phase 3) could be applied to the agent loop’s output in future work.

Foundry incorporates prior findings

The pipeline’s scoring engine uses 40 ISO/IEC 25010 dimensions from Phase 3 and deterministic tool verification from Phase 2. The pipeline is not separate from the research — it is an attempt to operationalize it.

Data Access

Trial output, gap analysis, and experiment reports are public. The Foundry and Forge codebases are private.