
Normative Convergence

Experiment status: Complete

Can a 5-layer epistemic scorer, mapped to ISO/IEC 25010:2023, measure real code quality — or does the model just learn to pass the scorer?

Hypothesis

Phase 2 (Discrete Convergence) showed that automated tools catch what LLMs miss. But its scoring was binary — pass or fail, per tool. It couldn't answer how well the code handled adversarial input, survived infrastructure faults, or maintained stability under sustained load.

The hypothesis: a multi-layer scoring architecture — where each layer adds epistemic depth and later layers can only lower scores, never raise them — will expose quality gaps that tool-based pass/fail scoring misses. Mapping all dimensions to ISO/IEC 25010:2023 provides a normative standard against which convergence can be objectively measured.

Method

Five epistemic layers, applied sequentially. Each layer adds runtime depth. The final score for any dimension is the minimum across all applicable layers — a monotonic constraint that prevents early layers from masking deficiencies found later.
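The minimum-across-layers rule can be sketched in a few lines. This is a hypothetical illustration (function and dimension names are assumptions, not the experiment's actual scorer): each layer reports scores for the dimensions it covers, and a dimension's final score is the lowest score any layer assigned it.

```python
# Hypothetical sketch of the monotonic scoring constraint: the final
# score for a dimension is the minimum across all layers that scored
# it, so a later layer can lower a score but never raise it.
from typing import Dict, List

def final_scores(layer_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """layer_scores: one dict per layer, mapping dimension -> score (0-10).
    A layer that does not apply to a dimension simply omits it."""
    result: Dict[str, float] = {}
    for layer in layer_scores:
        for dim, score in layer.items():
            # Monotonic constraint: keep the lowest score seen so far.
            result[dim] = min(result.get(dim, 10.0), score)
    return result

layers = [
    {"security": 9.9, "reliability": 10.0},   # structural
    {"security": 10.0, "reliability": 9.8},   # behavioral
    {"security": 9.5},                        # adversarial (fewer dims apply)
]
print(final_scores(layers))  # {'security': 9.5, 'reliability': 9.8}
```

The `min` aggregation is what prevents a clean static-analysis pass from masking a vulnerability the adversarial layer finds later.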

40 scoring dimensions derived from ISO/IEC 25010:2023 cover 9 quality characteristics: functional suitability, performance efficiency, compatibility, interaction capability, reliability, security, maintainability, flexibility, and safety — plus cross-cutting quality assurance.

Phases gate which layers are active. Phase 0 runs only Layer 1 (structural). Phase 4 runs all five layers including a 30-minute endurance test. Each phase must converge before the next begins.
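As a rough sketch of the gating, each phase might enable a growing prefix of the five layers. Note the intermediate mappings below are assumptions: the text only states that Phase 0 runs Layer 1 and Phase 4 runs all five.

```python
# Hypothetical phase-to-layer gating. Only the Phase 0 and Phase 4
# rows are stated in the writeup; the rest are illustrative.
PHASE_LAYERS = {
    0: ["structural"],
    1: ["structural", "behavioral"],
    2: ["structural", "behavioral", "adversarial"],
    3: ["structural", "behavioral", "adversarial", "resilience"],
    4: ["structural", "behavioral", "adversarial", "resilience", "endurance"],
}

def active_layers(phase: int) -> list:
    """Return the layers a given phase runs; a phase must converge
    under these layers before the next phase activates."""
    return PHASE_LAYERS[phase]

print(active_layers(0))  # ['structural']
print(len(active_layers(4)))  # 5
```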

Results

Trials: 2
Dimensions: 40
Scorer Bugs Found: 9
Layers Passed: 5/5

Trial01 — Archived

Trial01 revealed 9 systematic bugs in the scorer itself: fake proof templates, grep fallback violations, dimension mapping mismatches, phantom penalties, and non-deterministic fault injection. The same codebase scored anywhere from 6.4 to 7.5 across iterations with no code changes. Trial01 data is archived as unreliable.

Trial01 Score: 6.88
Trial02 Score: 9.79

Trial02 Score Trajectory

[chart not reproduced in this export]

Layer Score Breakdown

[chart not reproduced in this export]

Trial02 achieved 9.79/10.0 across all 40 dimensions and 5 layers. All dimensions pass. The lowest scores came from the adversarial layer (minor logging gaps) and the endurance layer (latency variance under sustained load).

The 5-Layer Architecture

Each layer adds epistemic depth. Later layers can only lower scores (monotonic constraint). The final score for each dimension is the minimum across all applicable layers.

| Layer | Name | Focus | Status |
|---|---|---|---|
| 1 | Structural | Static analysis: extract claims from source tree. No runtime. | Passed |
| 2 | Behavioral | Runtime proof of each claim. Unproven claims capped at 6/10. | Passed |
| 3 | Adversarial | Blind attacks, no manifest access. Discovers unclaimed vulnerabilities. | Passed |
| 4 | Resilience | Fault injection: kill DB, exhaust pools, crash processes. | Passed |
| 5 | Endurance | Sustained load over time. Measures drift, leaks, and stability. | Passed |
Structural: 9.9 (40 dims)
Behavioral: 10.0 (40 dims)
Adversarial: 9.9 (22 dims)
Resilience: 10.0 (14 dims)
Endurance: 9.6 (5 dims)

The Goodhart's Law Problem

The central finding of this experiment is not the 9.79 score. It's the question the score raises: did the code get better, or did it just learn to pass the scorer?

Between Trial01 and Trial02, three things changed simultaneously: the scorer was fixed (9 bugs), the scoring formula was rewritten, and the application code was regenerated. The 2.91-point improvement cannot be cleanly decomposed into these three factors. They are confounded.

Evidence of Scorer-Application Co-Evolution

100% behavioral proof rate. 641/641 claims proven in Trial02. A perfect proof rate across 40 dimensions is more suspicious than reassuring — it suggests the proofs are testing what the code was built to pass.

Structural claim inflation. Claims grew from 316 (Trial01) to 641 (Trial02). The cross-cutting extractor maps common NestJS patterns to many dimensions simultaneously, inflating coverage metrics.
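The inflation mechanism is easy to see in miniature. This is a hypothetical illustration (pattern names are real NestJS idioms, but the dimension mappings are invented for the example): a cross-cutting extractor that maps one code pattern to several dimensions multiplies the claim count without adding independent evidence.

```python
# Hypothetical illustration of structural claim inflation: one pattern
# occurrence generates one claim per mapped dimension, so broad
# cross-cutting mappings inflate the total claim count.
PATTERN_DIMENSIONS = {
    "ValidationPipe": ["security", "reliability", "functional_suitability"],
    "helmet()": ["security"],
}

def claims_from(patterns: list) -> int:
    """Count claims generated by a list of detected code patterns."""
    return sum(len(PATTERN_DIMENSIONS.get(p, [])) for p in patterns)

print(claims_from(["ValidationPipe"]))  # 3 claims from a single pattern
print(claims_from(["helmet()"]))        # 1
```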

Proof templates accept weak evidence. Some behavioral proofs accept HTTP 401/403 as “proven” (the endpoint exists and is auth-protected), which conflates deployment with behavioral verification.
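The difference between weak and strict proof acceptance can be made concrete. This is a sketch, not the experiment's actual proof template (function names and criteria are assumptions): a weak proof treats any 401/403 as success, while a stricter one also requires the claimed behavior to hold with valid credentials.

```python
# Hypothetical contrast between weak and strict behavioral proofs.
def weak_proof(status_unauth: int) -> bool:
    # Treats any 401/403 as "proven": this only shows the endpoint is
    # deployed and auth-gated, conflating deployment with behavior.
    return status_unauth in (401, 403)

def strict_proof(status_unauth: int, status_auth: int,
                 body_matches_claim: bool) -> bool:
    # Requires rejection without auth AND the claimed behavior with auth.
    return (status_unauth in (401, 403)
            and status_auth == 200
            and body_matches_claim)

print(weak_proof(401))               # True, but proves very little
print(strict_proof(401, 200, True))  # True
print(strict_proof(401, 200, False)) # False: deployed, not behaviorally proven
```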

Single-subject design. One project (telehealth-booking), one tech stack (NestJS/Prisma/PostgreSQL), one AI system (Claude). No cross-project or cross-model validation.

This is Goodhart's Law applied to code quality measurement: “When a measure becomes a target, it ceases to be a good measure.” The scorer and the application co-evolved — each iteration of the scorer shaped what the next generation of code optimized for. The question this experiment leaves open is how to design a scorer that resists gaming by the system it measures.

Transparency

Honest accounting of what this experiment can and cannot claim.

Protocol Deviations

Endurance test was 30 minutes, not multi-hour

The methodology specifies multi-hour sustained load for Layer 5. Trial02 ran 30 minutes. Memory leaks and connection exhaustion often manifest only after hours of operation.

Single trial, not multiple consecutive clean trials

Convergence criteria require multiple consecutive clean trials. Only one trial (Trial02) has been scored with the corrected scorer.

Single project, not three across different domains

The methodology calls for three independent projects across different business domains. Only one project (telehealth-booking) was scored.

Known Limitations

Confounded variables between trials

Scorer fixes, formula changes, and code regeneration all happened between Trial01 and Trial02. The 2.91-point improvement cannot be attributed to any single factor.

Adversarial layer has limited attack surface

The adversarial layer tested 53 endpoints with fuzzing, concurrency, and input attacks. A real penetration test would be broader and more creative.

Same tech stack as all prior experiments

NestJS + Prisma + PostgreSQL remains the only stack tested. ISO/IEC 25010 dimensions may manifest differently in other ecosystems.

Relationship to Prior Phases

Builds on Phase 1 and Phase 2 methodology

The CED methodology from Phase 1 (layered convergence) and Phase 2 (discrete convergence) is the foundation. Phase 3 does not start from scratch — it starts from a methodology that converged across 10 layers and 5 phases, and asks whether deeper runtime measurement changes the picture.

Scorer is a new artifact, not inherited

The 5-layer scorer was built for this experiment. Unlike Phase 2's tool-based scoring, this scorer uses a claims-proof-attack architecture that introduces its own complexity and potential for bugs — as Trial01 demonstrated.

Data Access

Trial application source code is public. The scoring methodology, per-dimension breakdowns, and failure mode definitions remain private.

Trial02 Iteration Timeline

| Iteration | Layers | Score | Dimensions Passing |
|---|---|---|---|
| T2-iter1 | L1-L4 | 3.34 | 16/40 |
| T2-iter2 | L1-L4 | 7.14 | 35/40 |
| T2-iter3 | L1-L4 | 7.52 | 40/40 |
| T2-iter6 | L1-L4 | 9.84 | 40/40 |
| T2-P4 | L1-L5 | 6.88 | 36/40 |
| T2-P4 final | L1-L5 | 9.79 | 40/40 |