Experiment

In Progress

Discrete Convergence

Does code that an LLM thinks is high-quality actually pass real tools? Phase 2 replaces subjective LLM scoring with automated tool measurement.

Hypothesis

Phase 1 (Layered Convergence) demonstrated that a specification-first methodology could converge across 10 quality layers using LLM-based scoring. But LLM scoring is subjective — three Claude sessions evaluating the same code may agree it “looks right” while the code fails to compile, the tests fail, or the Docker build breaks.

The hypothesis: LLM-based scoring systematically overstates quality. Automated tools — compilers, linters, test runners, container builds, security scanners — will reveal gaps that subjective evaluation missed. The delta between “LLM says yes” and “tools say yes” is the measure of methodology blind spots.

Method

Five progressive phases, each requiring the infrastructure of prior phases. For each trial, three independent full-stack applications are built from the methodology across different business domains (analytics, event management, field dispatch). The trial score for each dimension is the minimum across all three projects — a rule that works in one domain but fails in another does not converge.
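The minimum-across-projects rule can be sketched in TypeScript (the document's own stack). This is an illustrative sketch only — the type names and dimension keys are hypothetical, not taken from the experiment's codebase:

```typescript
// Hypothetical sketch of the minimum-across-projects rule: the trial score
// for each dimension is the worst score among the three project domains.

type Dimension = string;
type ProjectScores = Record<Dimension, number>;

function trialScores(projects: ProjectScores[]): ProjectScores {
  const result: ProjectScores = {};
  for (const dim of Object.keys(projects[0])) {
    // A rule that works in one domain but fails in another does not
    // converge, so we take the minimum rather than the mean.
    result[dim] = Math.min(...projects.map((p) => p[dim]));
  }
  return result;
}

// Example: analytics, event management, field dispatch (dimension
// names and values are invented for illustration).
const scores = trialScores([
  { lint: 10, types: 9 },
  { lint: 8, types: 10 },
  { lint: 9, types: 7 },
]);
// scores.lint === 8, scores.types === 7
```

Taking the minimum instead of the mean is what makes the rule conservative: one strong domain cannot mask a failure in another.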

Seventeen scoring dimensions are measured by concrete tools: ESLint, the TypeScript compiler, the Prisma validator, Docker builds, and more. Each dimension produces a discrete score derived from tool output — no subjective interpretation.
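As one hedged example of what “discrete score from tool output” could look like, consider mapping ESLint's JSON output to a dimension score. The thresholds and the 0–10 scale here are invented for illustration; the experiment's actual scorer is not published:

```typescript
// Illustrative sketch, NOT the experiment's actual scorer: deriving a
// discrete dimension score from ESLint's machine-readable output.
// `eslint --format json` emits one result object per linted file.

interface LintResult {
  errorCount: number;
  warningCount: number;
}

function lintDimensionScore(results: LintResult[]): number {
  const errors = results.reduce((n, r) => n + r.errorCount, 0);
  const warnings = results.reduce((n, r) => n + r.warningCount, 0);
  if (errors > 0) return 0;   // any error fails the dimension outright
  if (warnings > 10) return 5; // warning budget exceeded: partial credit
  return 10;                   // clean output: full score
}

const lintScore = lintDimensionScore([{ errorCount: 0, warningCount: 2 }]);
// lintScore === 10
```

The point is that the score is a deterministic function of tool output — two runs over the same code always agree, which is exactly what LLM-based scoring cannot guarantee.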

Phases are sequential. Each must converge before the next begins, progressively building from static analysis through build verification, container orchestration, runtime validation, and finally edge-case completeness.
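The sequential gating described above can be sketched as a small selection function. The phase and dimension names below are placeholders, and the structure is an assumption about how such gating might be implemented:

```typescript
// Hedged sketch of sequential phase gating: a phase becomes active only
// when every dimension of all earlier phases has converged.

interface Phase {
  name: string;
  dimensions: string[];
}

function nextActivePhase(
  phases: Phase[],
  converged: Set<string>,
): Phase | undefined {
  // The first phase with any unconverged dimension is the active one;
  // everything before it has fully converged. Returns undefined when
  // all phases have converged.
  return phases.find((p) => p.dimensions.some((d) => !converged.has(d)));
}

// Illustrative phase list (dimension names invented).
const phases: Phase[] = [
  { name: "Calibration", dimensions: ["lint", "types"] },
  { name: "Test Execution", dimensions: ["tests"] },
];

const active = nextActivePhase(phases, new Set(["lint"]));
// active?.name === "Calibration" — "types" has not yet converged
```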

Results

Trials: 3
Applications: 9
Failure Modes: 10
Phases Converged: 0/5

Calibration Finding

Trial 0 established the calibration baseline. The methodology's builder (LLM) self-assessed near-perfect quality. Discrete tool scoring measured actual quality at less than half that level — confirming the core hypothesis that LLM scoring systematically overstates code quality.

LLM Self-Assessment: 99.2%
Discrete Tool Score: 46.7%

Six new failure modes were discovered in Trial 0. The experiment is in Phase 0 (Calibration), iterating the methodology to close the gap between subjective and discrete assessment before progressing to phases that require build infrastructure and runtime environments.

Phase Progression

Each phase adds infrastructure requirements and activates new scoring dimensions. Phases are sequential — each must converge before the next begins.

Phase | Name                     | Focus                                       | Dimensions | Status
0     | Calibration              | Static analysis + type checking             | 6          | Active
1     | Test Execution           | Tests must pass, coverage thresholds        | 2          | Pending
2     | Container Verification   | Docker builds, healthchecks, security scans | 2          | Pending
3     | Runtime Validation       | Performance, accessibility, active security | 4          | Pending
4     | Edge Case & Completeness | Business logic, feature completeness, UX    | 3          | Pending

Transparency

Honest accounting of the experiment's design choices, limitations, and relationship to Phase 1.

Design Choices

Minimum-across-projects scoring

Each dimension score is the minimum across all three project domains. This is deliberately conservative — a methodology rule that works for analytics but fails for event management does not pass. This means scores are lower than averages would suggest.

Scorer bug fixes applied retroactively

Trial 0 was re-scored after 4 scorer implementation bugs were fixed. The re-scored results are canonical. This is expected during calibration — the scorer itself is being validated alongside the methodology.

Known Limitations

Same tech stack as Phase 1

Only the NestJS + Next.js + Prisma + PostgreSQL + Turborepo stack is tested. Results may not generalize to other frameworks or languages.

Single AI system

All building is performed by Claude. The scorer is automated tooling, but the builder remains a single AI system without cross-model validation.

Progressive infrastructure dependency

Later phases require Docker, running containers, and network access. Environment differences between scoring runs could affect reproducibility.

Relationship to Phase 1

Builds on layered-convergence methodology

The master methodology from Phase 1 (v1.0) is the starting point. Phase 2 does not start from scratch — it starts from the methodology that converged across 10 layers with LLM scoring, and measures how well that methodology holds up under tool-based verification.

Data Access

Trial source code and aggregate results are open source. Scoring tooling, methodology documents, and per-dimension breakdowns are not published.

Trial Timeline

Trial 0 (Calibration): 4.67, +6 new failure modes
Trial 1 (Calibration): 8.83, +3 new failure modes
Trial 2 (Calibration): 8.50, +1 new failure mode