
Experiment · In Progress

Two Roads to Deployment

Can a guided agent loop with 99.7% local compute match a 19K-line orchestration pipeline — and what does each approach trade away?

Hypothesis

The first three experiments focused on measuring AI code quality. Phase 4 shifts to a different question: what is the right architecture for generating it?

Two approaches emerged. A gated pipeline — 8 phases, 5 packages, 40 scoring dimensions, 19,168 lines of TypeScript — that enforces quality through structure. And a guided agent loop — a system prompt, a model cascade, and 833 lines of new Python — that relies on the AI agent's native ability to plan, build, and self-correct.

The hypothesis: a guided agent loop with minimal orchestration can produce functional code faster than a multi-phase pipeline — but the pipeline's scoring and gating may produce measurably higher quality output once operational. This experiment tests the first half of that hypothesis. The second half remains open.

The Two Approaches

Both approaches target the same goal: take an idea and produce deployment-ready code. They differ fundamentally in how much structure is imposed on the generation process.

| Dimension | Pipeline (Foundry) | Agent Loop (Forge) |
|---|---|---|
| Codebase size | 19,168 lines (TypeScript) | 833 new lines (Python) |
| Architecture | 8-phase gated pipeline, 5 packages | Agent loop + system prompt |
| Planning | 7 spec types, 25 harmonization iterations | BUILD_PLAN.md + CLAUDE.md (sequential) |
| Build method | Per-feature with custom engine | Batch phases via agent loop |
| Error recovery | Custom oscillation detector + escalation | Built-in circuit breaker + model escalation |
| Scoring | 40 ISO/IEC 25010 dimensions, 5 layers | Post-build verification pass |
| Trial success rate | 0/13 (0%) | 1/6 (17%) |
| Best output | No complete output | 54 files, 3,466 lines, 33 tests |

Foundry (Pipeline): 19,168 lines of TypeScript
Forge (Agent Loop): 833 new lines of Python

Trial Timeline

19 total trials · 43 distinct failure modes · 2 approaches tested · 1/2 produced output

Pipeline Runs (Foundry)

| Run | Failure mode | Result |
|---|---|---|
| foundry-01 | Acceptance runner: no cookie auth support | Failed |
| foundry-02 | Spec harmonization loop (25 iterations) | Failed |
| foundry-03 | Acceptance runner: hardcoded tokens | Failed |
| foundry-04 | Per-feature gate too brittle for local model | Failed |
| foundry-05 | Acceptance runner: no DB reset between tests | Failed |
| foundry-06 | Build errors accumulated across features | Failed |
| foundry-07 | Missing escalation counter in acceptance | Failed |
| foundry-08 | Spec inconsistencies: phantom commands | Failed |
| foundry-09 | Rate-limit setup steps missing | Failed |
| foundry-10 | Local model oscillation in fix loops | Failed |
| foundry-11 | Auth feature acceptance: Bearer vs cookie mismatch | Failed |
| foundry-12 | Build errors: 17 → 83 across fix attempts | Failed |
| foundry-13 | Accumulated state from prior failures | Failed |

Agent Loop Runs (Forge)

| Run | Outcome | Result |
|---|---|---|
| forge-01 | Scaffolding only, abandoned early | Partial |
| forge-02 | Headless mode exit after first turn | Partial |
| forge-03 | Inherited 88 build errors from broken state | Partial |
| forge-04 | Server-start hang, verification never triggered | Partial |
| forge-05 | Local model ran `rm -rf`, destroyed specs | Partial |
| forge-06 | 54 files, 3,466 lines, 33 tests · 3h 30m | Success |

Each failed run produced specific learnings. The pipeline failures exposed 5 bugs in the acceptance runner and fundamental brittleness in per-feature gating with local models. The agent loop failures drove improvements to headless mode, verification triggers, clean environment guarantees, and prompt engineering.
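The agent loop's error recovery (the built-in circuit breaker with model escalation noted in the comparison table) can be sketched as a small loop. This is a minimal illustration, not the Forge implementation: `run_phase`, the tier names, and the attempt limits are all hypothetical.

```python
# Minimal sketch of a circuit breaker with model escalation.
# `run_phase`, the tier names, and the attempt limit are illustrative
# placeholders, not the actual Forge API.

def run_with_escalation(phase, run_phase, tiers=("local", "flash", "opus"),
                        max_attempts_per_tier=3):
    """Try a phase on the cheapest model first; escalate on repeated failure."""
    for tier in tiers:
        failures = 0
        while failures < max_attempts_per_tier:
            result = run_phase(phase, model=tier)
            if result["ok"]:
                return result
            failures += 1
        # Circuit opens for this tier: stop retrying and escalate upward.
        print(f"{phase}: circuit open for {tier}, escalating")
    raise RuntimeError(f"{phase}: all model tiers exhausted")
```

The key design choice is that the breaker bounds repeated failure on one tier (the oscillation the pipeline's custom detector had to catch) before spending money on a stronger model.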

The Model Cascade

The successful agent loop run used three model tiers, each for a specific phase. Cloud models handle the bookends (architectural planning and verification); the local model does all construction.

Planning: Gemini 2.5 Flash

Write BUILD_PLAN.md and CLAUDE.md — architectural decisions, phase structure, infrastructure requirements

~13K tokens · ~30 seconds

Building: qwen3-coder-next (local)

Execute the plan — 422 tool calls, batch phases, write all source files, run builds and tests

37.8M tokens · ~88 minutes

Verification: Claude Opus 4.6

Audit output against BUILD_PLAN.md, write missing tests, fix DI wiring, verify runtime boot

~8.6M tokens (incl. cache) · ~10 minutes
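The cascade above amounts to a phase-to-model routing table. A minimal sketch, assuming the model identifiers named in the text (the dict and helper function are illustrative, not Forge internals):

```python
# Route each phase of a run to its model tier, per the cascade above.
# The mapping mirrors the text; the helper itself is a hypothetical sketch.

CASCADE = {
    "planning":     "gemini-2.5-flash",  # write BUILD_PLAN.md and CLAUDE.md
    "building":     "qwen3-coder-next",  # local model does all construction
    "verification": "claude-opus-4.6",   # audit output against BUILD_PLAN.md
}

def model_for(phase: str) -> str:
    """Return the model tier for a phase; default to the local builder."""
    return CASCADE.get(phase, CASCADE["building"])
```

Defaulting unknown phases to the local builder keeps cloud usage confined to the two bookend phases.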

Cost Analysis

99.7% local compute · $0.004 out-of-pocket · $25.85 API-equivalent · 3h 30m duration

[Chart: Token Volume by Model]

[Chart: Cost Comparison]

The local model (qwen3-coder-next running on local hardware) processed 37.8 million tokens — 99.7% of all compute — at zero cost. Cloud models were used surgically: 2 Gemini Flash calls for planning ($0.004) and 2 Claude Opus passes for verification.

Cost transparency: the $0.004 out-of-pocket figure reflects Gemini API billing only. Claude was accessed via a Max subscription, making the marginal cost $0. At API rates, the same Claude usage would cost $25.85 — primarily from 99K output tokens at $75/M and 287K cache-write tokens at $18.75/M. Both numbers are reported for honest accounting.
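The dominant components of the API-equivalent figure can be checked from the token counts and rates given; the remainder of the $25.85 comes from input and cache-read tokens not itemized here.

```python
# Reproduce the two dominant components of the $25.85 API-equivalent
# figure from the token counts and per-million rates stated in the text.

output_cost = 99_000 * 75.00 / 1_000_000        # 99K output tokens at $75/M
cache_write_cost = 287_000 * 18.75 / 1_000_000  # 287K cache writes at $18.75/M

print(round(output_cost + cache_write_cost, 2))  # 12.81 of the $25.85
```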

Output Assessment

The successful run (forge-06) produced a NestJS + Next.js todo application. A post-run audit identified 43 issues across 4 severity levels.

54 TypeScript files · 3,466 lines of code · 33 passing tests · 12 git commits

43 Issues by Severity

Critical (8): No DTO validation, race conditions (find-then-create), broken frontend auth flow, field name mismatches, JWT secret in committed .env

High (17): No pagination, no health check, no request logging, no rate limiting, zero controller tests, zero e2e tests

Medium (14): Inefficient dashboard queries, inconsistent guard usage, API URL hardcoded 15+ times, no Dockerfile

Low (4): Console.log instead of Logger, missing @HttpCode decorators

The output is functional but not production-ready. The app builds, tests pass, and the server starts — but the auth flow is broken end-to-end, validation is missing, and race conditions exist in all mutating operations. This is a sketch, not a product.

Root Causes

The 43 issues trace to 5 root causes — all addressable at the prompt level, not the architecture level. This is an important finding: the agent loop architecture is not the bottleneck.

BUILD_PLAN doesn’t specify infrastructure

The planning prompt produced feature lists but not infrastructure patterns (guards, filters, pipes, middleware). The local model doesn’t add these unless told to.

Fix applied: Planning prompt now requires infrastructure sections: every guard, filter, pipe, and middleware the app needs.

Verification doesn’t test runtime behavior

Verification checked “does it build?” and “do tests exist?” but not “does the auth flow work end-to-end?”

Fix applied: Verification now includes functional smoke tests: register → login → create → list → dashboard.
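That smoke-test sequence can be sketched as a small driver. Everything here is an assumption about the generated app, not its confirmed API: the endpoint paths, payload fields, status codes, and the `client` callable (which in real use would wrap an HTTP library such as `requests`).

```python
# Functional smoke test: register -> login -> create -> list -> dashboard.
# `client` is any callable (method, url, json_body, token) -> (status, body);
# endpoint paths and payload fields are illustrative assumptions about the
# generated todo app, not its confirmed contract.

def smoke_test(client, base="http://localhost:3000"):
    creds = {"email": "smoke@test.dev", "password": "hunter22"}
    status, _ = client("POST", f"{base}/auth/register", creds, None)
    assert status == 201, "register failed"
    status, body = client("POST", f"{base}/auth/login", creds, None)
    assert status == 200, "login failed"
    token = body["accessToken"]
    status, _ = client("POST", f"{base}/tasks", {"title": "smoke"}, token)
    assert status == 201, "task create failed"
    status, body = client("GET", f"{base}/tasks", None, token)
    assert status == 200 and any(t["title"] == "smoke" for t in body), "list failed"
    status, body = client("GET", f"{base}/dashboard", None, token)
    assert status == 200, "dashboard failed"
    return True
```

Parameterizing on `client` keeps the sequence testable without a running server; the verification pass would supply a real HTTP wrapper.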

Shared package never wired up

Zod schemas were written in packages/shared but never imported by the API. Dead code.

Fix applied: BUILD_PLAN must specify cross-package imports. Verification checks that shared exports are consumed.

Frontend-backend contract not verified

Dashboard expected {total, completed}, API returned {totalTasks, completedTasks}. Dashboard showed zeros.

Fix applied: Verification now checks that every frontend API call receives the response shape it expects.
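A minimal shape check for this kind of drift is to diff the keys a frontend call expects against a sample response. The field names below mirror the mismatch described above; the check itself is a generic sketch, not the verification pass's actual code.

```python
# Detect frontend/backend contract drift by diffing the keys the frontend
# expects against an actual response sample. Field names mirror the
# mismatch described above; the check itself is a generic sketch.

def contract_gaps(expected_keys, response: dict):
    """Return the expected keys missing from the response, sorted."""
    return sorted(set(expected_keys) - set(response))

# Dashboard expected {total, completed}; the API returned camelCase names.
api_response = {"totalTasks": 7, "completedTasks": 3}
print(contract_gaps({"total", "completed"}, api_response))
# -> ['completed', 'total']  (both expected fields missing: dashboard shows zeros)
```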

Local model lacks NestJS conventions

qwen3-coder-next consistently missed: PrismaModule imports, guard decorators, global pipes/filters, proper HTTP status codes.

Fix applied: Stack preset now includes explicit NestJS convention rules injected into the system prompt.
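The last fix (stack-specific convention rules injected into the system prompt) amounts to appending a rule block per stack preset. A sketch, assuming a hypothetical preset structure; the rule text paraphrases the conventions listed above:

```python
# Sketch of a stack preset whose convention rules are appended to the
# agent's system prompt. The preset structure is hypothetical; the rules
# paraphrase the NestJS gaps listed above.

NESTJS_PRESET = {
    "stack": "nestjs",
    "conventions": [
        "Import PrismaModule in every module that injects PrismaService.",
        "Apply auth guard decorators to every non-public controller route.",
        "Register a global ValidationPipe and exception filter in main.ts.",
        "Use explicit @HttpCode decorators for non-default status codes.",
    ],
}

def build_system_prompt(base_prompt: str, preset: dict) -> str:
    """Append the preset's convention rules to the base system prompt."""
    rules = "\n".join(f"- {r}" for r in preset["conventions"])
    return f"{base_prompt}\n\n## {preset['stack']} conventions\n{rules}"
```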

Quality Benchmark

The forge output compared against two reference projects: a human-guided Claude session (telehealth-booking) and a production SaaS (SavSpot).

| Metric | Forge Output | Telehealth-Booking | SavSpot |
|---|---|---|---|
| TS/TSX Files | 54 | 198 | 1,105 |
| Lines of Code | 3,466 | 22,957 | 404,003 |
| Test Files | 5 | 12 | 197 |
| Duration | 3.5 hours | ~7 hours | 18 days |
| Method | Automated agent loop | Human + Claude | Human + AI-assisted |
| Commits | 12 | 5 | 217 |

The forge output is 6.6x smaller than the human-guided session and 117x smaller than production SaaS. The gap is not primarily about compute or cost — it is about prompt depth and verification scope. The same agent loop with better prompts should produce substantially more complete output.

What's Next

Both approaches continue development. Neither has been abandoned.

Agent Loop (Forge)

Run v5 with all 5 root cause fixes applied. Test on a more complex application (multi-entity, auth, dashboard). Compare output quality to telehealth-booking as the human-guided baseline.

Pipeline (Foundry)

Fix the 5 acceptance runner bugs. Improve spec harmonization to eliminate the 25-iteration loop. The pipeline's value proposition — scoring, gating, and quality assurance at scale — remains valid. Once operational, it should produce measurably higher quality than the agent loop.

The deeper question: at what scale does orchestration start paying for itself? A todo app may not need 8 phases and 40 scoring dimensions. A 100K-line enterprise application might. This experiment has not yet tested at that scale.

Transparency

Honest accounting of what this experiment can and cannot claim.

Claims NOT Supported

“Production-ready output”

43 issues (8 critical) in the post-run audit. The output builds and tests pass, but the auth flow is broken, validation is missing, and race conditions exist.

“Pipeline approach is wrong”

The pipeline (Foundry) attempts a harder problem: quality assurance at scale with scoring and gating. 13 failures are iteration data, not a verdict. Development continues.

“$0 cost” without qualification

Out-of-pocket was $0.004 (Gemini only). Claude was accessed via Max subscription ($0 marginal). At API rates, the same usage costs $25.85. Both figures are required for honest reporting.

Known Limitations

Single application type tested

Only a NestJS + Next.js todo app. Results may differ for other stacks, larger applications, or different domain complexity.

Single local model tested

qwen3-coder-next only. Other local models (CodeLlama, DeepSeek-Coder) may perform differently. No cross-model comparison has been done.

Prompt fixes not yet validated

All 5 root cause fixes have been committed to the forge codebase but not tested in a clean run. Their effectiveness is unproven.

Pipeline comparison is asymmetric

The pipeline had 13 runs against a harder problem with known bugs. The agent loop had 6 runs against a simpler problem. Direct comparison of success rates (0% vs 17%) is misleading without this context.

Relationship to Prior Phases

Shifts from measurement to architecture

Phases 1–3 asked “how do we measure AI code quality?” Phase 4 asks “what is the right architecture for generating it?” These are complementary questions. The pipeline’s scoring framework (from Phase 3) could be applied to the agent loop’s output in future work.

Foundry incorporates prior findings

The pipeline’s scoring engine uses 40 ISO/IEC 25010 dimensions from Phase 3 and deterministic tool verification from Phase 2. The pipeline is not separate from the research — it is an attempt to operationalize it.

Data Access

Trial output, gap analysis, and experiment reports are public. The Foundry and Forge codebases are private.