Experiment (In Progress)
Discrete Convergence
Does code that an LLM thinks is high-quality actually pass real tools? Phase 2 replaces subjective LLM scoring with automated tool measurement.
Hypothesis
Phase 1 (Layered Convergence) demonstrated that a specification-first methodology could converge across 10 quality layers using LLM-based scoring. But LLM scoring is subjective — three Claude sessions evaluating the same code may agree it “looks right” while the code fails to compile, tests don't pass, or Docker builds break.
The hypothesis: LLM-based scoring systematically overstates quality. Automated tools — compilers, linters, test runners, container builds, security scanners — will reveal gaps that subjective evaluation missed. The delta between “LLM says yes” and “tools say yes” is the measure of methodology blind spots.
Method
Five progressive phases, each requiring the infrastructure of prior phases. For each trial, three independent full-stack applications are built from the methodology across different business domains (analytics, event management, field dispatch). The trial score for each dimension is the minimum across all three projects — a rule that works in one domain but fails in another does not converge.
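A minimal sketch of that aggregation rule in TypeScript (the type and function names are illustrative, not the experiment's actual scorer):

```ts
// Trial-level aggregation: for each dimension, the trial score is the
// minimum across the three project domains.
type DimensionScores = Record<string, number>; // dimension id -> score in [0, 1]

function aggregateTrial(projects: DimensionScores[]): DimensionScores {
  const trial: DimensionScores = {};
  for (const dimension of Object.keys(projects[0])) {
    // The weakest project sets the trial score: a rule that works in one
    // domain but fails in another does not converge.
    trial[dimension] = Math.min(...projects.map((p) => p[dimension]));
  }
  return trial;
}

// Example: strong lint results in two domains cannot mask a failure in the third.
aggregateTrial([
  { lint: 1.0, types: 1.0 }, // analytics
  { lint: 1.0, types: 0.5 }, // event management
  { lint: 0.0, types: 1.0 }, // field dispatch
]);
// => { lint: 0.0, types: 0.5 }
```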
Seventeen scoring dimensions are measured by concrete tools: ESLint, the TypeScript compiler, the Prisma validator, Docker builds, and more. Each dimension produces a discrete score derived directly from tool output, with no subjective interpretation.
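To make "no subjective interpretation" concrete, here is a minimal sketch of how one such dimension could be scored from ESLint's JSON output. The lintScore helper and its bucket thresholds are assumptions for illustration; the experiment's actual scorer is not published.

```ts
// One tool-backed dimension: run ESLint with JSON output and bucket the
// result into a discrete score.
import { execFileSync } from "node:child_process";

interface FileReport {
  errorCount: number;
  warningCount: number;
}

function lintScore(projectDir: string): number {
  let report: FileReport[];
  try {
    const out = execFileSync("npx", ["eslint", ".", "--format", "json"], {
      cwd: projectDir,
      encoding: "utf8",
    });
    report = JSON.parse(out);
  } catch (err) {
    // ESLint exits non-zero when it finds problems; the JSON report is
    // still available on the thrown error's stdout.
    const stdout = (err as { stdout?: string }).stdout;
    if (!stdout) throw err; // the tool itself failed to run
    report = JSON.parse(stdout);
  }
  const errors = report.reduce((n, f) => n + f.errorCount, 0);
  const warnings = report.reduce((n, f) => n + f.warningCount, 0);
  // Discrete buckets derived purely from tool output (thresholds assumed).
  if (errors === 0 && warnings === 0) return 1.0;
  if (errors === 0) return 0.5;
  return 0.0;
}
```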
Phases are sequential. Each must converge before the next begins, progressively building from static analysis through build verification, container orchestration, runtime validation, and finally edge-case completeness.
Results
- Trials: 3
- Applications: 9
- Failure modes: 10
- Phases converged: 0/5
Calibration Finding
Trial 0 established the calibration baseline. The methodology's builder (LLM) self-assessed near-perfect quality. Discrete tool scoring measured actual quality at less than half that level — confirming the core hypothesis that LLM scoring systematically overstates code quality.
- LLM self-assessment: 99.2%
- Discrete tool score: 46.7%
Six new failure modes were discovered in Trial 0. The experiment is in Phase 0 (Calibration), iterating the methodology to close the gap between subjective and discrete assessment before progressing to phases that require build infrastructure and runtime environments.
Phase Progression
Each phase adds infrastructure requirements and activates new scoring dimensions; a sketch of the sequential gating rule follows the table.
| Phase | Name | Focus | Dimensions | Status |
|---|---|---|---|---|
| 0 | Calibration | Static analysis + type checking | 6 | Active |
| 1 | Test Execution | Tests must pass, coverage thresholds | 2 | Pending |
| 2 | Container Verification | Docker builds, healthchecks, security scans | 2 | Pending |
| 3 | Runtime Validation | Performance, accessibility, active security | 4 | Pending |
| 4 | Edge Case & Completeness | Business logic, feature completeness, UX | 3 | Pending |
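A minimal sketch of that sequential gate, assuming a simple phase record (the names here are illustrative):

```ts
interface Phase {
  id: number;
  dimensions: string[]; // scoring dimensions this phase activates
}

// Dimensions are active up to and including the first phase that has not
// yet converged; everything after it stays pending.
function activeDimensions(phases: Phase[], converged: Set<number>): string[] {
  const active: string[] = [];
  for (const phase of [...phases].sort((a, b) => a.id - b.id)) {
    active.push(...phase.dimensions);
    if (!converged.has(phase.id)) break;
  }
  return active;
}

// With nothing converged yet, only Phase 0's six static-analysis dimensions
// are active, matching the Status column above:
// activeDimensions(allPhases, new Set())
```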
Transparency
Honest accounting of the experiment's design choices, limitations, and relationship to Phase 1.
Design Choices
Minimum-across-projects scoring
Each dimension score is the minimum across all three project domains. This is deliberately conservative — a methodology rule that works for analytics but fails for event management does not pass. This means scores are lower than averages would suggest.
Scorer bug fixes applied retroactively
Trial 0 was re-scored after four bugs in the scorer implementation were fixed. The re-scored results are canonical. This is expected during calibration — the scorer itself is being validated alongside the methodology.
Known Limitations
Same tech stack as Phase 1
Only the NestJS + Next.js + Prisma + PostgreSQL + Turborepo stack is tested. Results may not generalize to other frameworks or languages.
Single AI system
All building is performed by Claude. The scorer is automated tooling, but the builder remains a single AI system without cross-model validation.
Progressive infrastructure dependency
Later phases require Docker, running containers, and network access. Environment differences between scoring runs could affect reproducibility.
Relationship to Phase 1
Builds on layered-convergence methodology
The master methodology from Phase 1 (v1.0) is the starting point. Phase 2 does not start from scratch — it starts from the methodology that converged across 10 layers with LLM scoring, and measures how well that methodology holds up under tool-based verification.
Data Access
Trial source code and aggregate results are open source. Scoring tooling, methodology documents, and per-dimension breakdowns are not published.