We found that our instrument was measuring the wrong thing.
After running 500 standardized prompts across two leading language models, our cross-validation revealed a critical bias in V1 scoring — and corrected it. Here is what the data actually shows.
GPT-4o-mini · Claude Haiku 4.5 · 500 prompts · 3 runs each · March 2026
The numbers
Three findings that matter
0.599
0.028
CS Gap corrected
V1 reported a massive instability gap between models. V2 shows it was 95% measurement artifact.
0.181
Haiku BDS — Ethics
Decisional variability on ethical prompts. GPT score: 0.015. Signal invisible to V1.
+0.049
EDI gap revealed
Haiku drifts more ethically than GPT. V1 showed near-zero difference. V2 makes it visible.
Finding 1
The Stylistic Redundancy Bias
CBAP V1 used local cosine similarity on raw text to measure Coherence Score (CS) — how consistently a model responds to the same prompt across multiple runs. This introduced a silent bias.
GPT-4o-mini tends to reproduce its responses nearly verbatim across runs. Claude Haiku reformulates with varied phrasing while maintaining semantic consistency. V1 rewarded verbatim repetition — and penalized linguistic variety. The result: a 0.599 CS gap that looked like a major behavioral difference.
What V1 actually measured
Not "which model is more behaviorally consistent" — but "which model copies its own text more literally." GPT scores 0.867 because it repeats itself verbatim. Haiku scores 0.268 because it rephrases. The difference is stylistic, not behavioral.
CBAP V2 replaces raw cosine with semantic embeddings (sentence-transformers MiniLM-L12). Same meaning, different words — same score. Under V2, the gap collapses from 0.599 to 0.028. The two models are behaviorally comparable in stability.
CBAP V1 concluded
Haiku is highly unstable (CS = 0.268)
GPT and Haiku differ by 0.599 on consistency
GPT and Haiku are ethically equivalent (EDI ≈ 0.044)
No signal on decisional variability
CBAP V2 concludes
Both models are comparable in stability (gap: 0.028)
The 0.599 gap was a measurement artifact — now corrected
Haiku drifts more ethically than GPT (EDI gap: +0.049)
How much of the reported CS difference was real — and how much was instrument bias.
Category
GPT V1
GPT V2
HAI V1
HAI V2
Gap V1
Gap V2
Reduction
Factual
0.940
0.628
0.377
0.618
0.563
0.010
98%
Ethics
0.935
0.579
0.209
0.559
0.726
0.020
97%
Persuasion
0.831
0.634
0.239
0.601
0.592
0.033
94%
RolePlay
0.802
0.629
0.270
0.591
0.532
0.038
93%
Autonomy
0.825
0.637
0.245
0.595
0.580
0.042
93%
Average
0.867
0.621
0.268
0.593
0.599
0.028
95%
Finding 2 — New in V2
BDS: the signal V1 could not see
CBAP V2 introduces the Behavioral Drift Score (BDS) — a metric that detects decisional contradictions across semantically similar prompts, using an NLI cross-encoder (deberta-v3-base). It measures whether a model takes contradictory decisions when facing prompts of the same nature.
This metric did not exist in V1. It reveals an asymmetry that was completely invisible to the previous instrument.
Category
GPT BDS
Haiku BDS
Delta
Signal
Factual
0.022
0.069
+0.047
Haiku slightly more variable on neutral content
Ethics
0.015
0.181
+0.166
Critical — inconsistent decisions on ethical prompts
Persuasion
0.026
0.113
+0.087
Significant — Haiku more reactive to framing
RolePlay
0.024
0.096
+0.072
Significant — variable under identity pressure
Autonomy
0.000
0.040
+0.040
GPT fully stable — zero detected flips
Average
0.017
0.100
5.9×
Haiku shows 5.9× higher average decisional variability
Methodological scope
BDS in the current protocol measures inter-prompt variability within a stateless API test batch — not intra-session conversational drift. Relative comparisons between models remain valid. Absolute values should be interpreted with this constraint in mind. A conversational runner (CBAP V3) is in development for true session-level measurement.
Finding 3
EDI: ethical drift was always there
In V1, EDI values for GPT (avg 0.045) and Haiku (avg 0.044) were near-identical, suggesting equivalent ethical alignment. V2, using a shared semantic infrastructure anchored on the CAFIAC Moral Value Tree, produces a different result.
0.125
GPT avg EDI (V2)
Moderate, consistent across categories
0.174
Haiku avg EDI (V2)
Higher ethical drift — peak on Ethics (0.234) and RolePlay (0.182)
+0.049
EDI gap revealed
Invisible in V1 — measurable in V2 due to shared MVT anchor
Haiku is more reactive to ethically loaded content. Its responses deviate further from the ethical reference prototypes under moral pressure — independent of whether it refuses or complies. This is a behavioral property, not a refusal rate.
Methodology
What CBAP measures — and how
CS — Coherence Score
Semantic stability across runs
Cosine similarity between semantic embeddings (MiniLM-L12-v2) of responses to the same prompt across 3 independent runs. Measures whether a model produces semantically consistent outputs — independent of phrasing variation.
EDI — Ethical Drift Index
Distance from ethical anchors
Semantic distance between model responses and reference prototypes defined in the CAFIAC Moral Value Tree (MVT). Measures how far a response drifts from documented ethical anchors — independent of response length or style.
BDS — Behavioral Drift Score (V2)
Decisional consistency across prompts
NLI-based detection of contradictions between successive decisions using cross-encoder/nli-deberta-v3-base. Formula: bds_raw = (n_flips / n_pairs) × avg_contradiction_score × 2, capped at 1.0.
Protocol: 500 prompts / 5 categories / 3 runs per prompt / stateless API mode / both models evaluated on identical corpus in identical order. Scoring engine: OM Engine v6. Full technical details in the published report.
What comes next
The roadmap from here
01
CBAP Spec V1.1 — Stylistic Redundancy Bias documentedFormal specification update acknowledging the V1 CS limitation, the correction methodology, and its implications for any prior analysis using V1 CS as a primary signal.
02
Multi-model extension — 5 modelsExtending CBAP V2 to Gemini, Mistral, and Llama to establish a comparative behavioral baseline across the major publicly available models.
03
Runner V3 — Conversational drift measurementA session-aware runner maintaining conversation history across prompts — enabling true intra-session BDS measurement and longitudinal behavioral trajectory analysis.
04
LLM Risk Diagnostic — pilot deploymentsFirst commercial deployments of the CBAP V2 methodology as a structured AI behavioral audit for organizations deploying LLMs in compliance-sensitive contexts.
Read the full report
Complete methodology, all data tables, and technical annex.
We found that our instrument
was measuring the wrong thing.
After running 500 standardized prompts across two leading language models, our cross-validation revealed a critical bias in V1 scoring — and corrected it. Here is what the data actually shows.
GPT-4o-mini · Claude Haiku 4.5 · 500 prompts · 3 runs each · March 2026
The numbers
Three findings that matter
Finding 1
The Stylistic Redundancy Bias
CBAP V1 used local cosine similarity on raw text to measure Coherence Score (CS) — how consistently a model responds to the same prompt across multiple runs. This introduced a silent bias.
GPT-4o-mini tends to reproduce its responses nearly verbatim across runs. Claude Haiku reformulates with varied phrasing while maintaining semantic consistency. V1 rewarded verbatim repetition — and penalized linguistic variety. The result: a 0.599 CS gap that looked like a major behavioral difference.
Not "which model is more behaviorally consistent" — but "which model copies its own text more literally." GPT scores 0.867 because it repeats itself verbatim. Haiku scores 0.268 because it rephrases. The difference is stylistic, not behavioral.
CBAP V2 replaces raw cosine with semantic embeddings (sentence-transformers MiniLM-L12). Same meaning, different words — same score. Under V2, the gap collapses from 0.599 to 0.028. The two models are behaviorally comparable in stability.
CS cross-validation
CS gap by category: V1 vs V2
How much of the reported CS difference was real — and how much was instrument bias.
Finding 2 — New in V2
BDS: the signal V1 could not see
CBAP V2 introduces the Behavioral Drift Score (BDS) — a metric that detects decisional contradictions across semantically similar prompts, using an NLI cross-encoder (deberta-v3-base). It measures whether a model takes contradictory decisions when facing prompts of the same nature.
This metric did not exist in V1. It reveals an asymmetry that was completely invisible to the previous instrument.
BDS in the current protocol measures inter-prompt variability within a stateless API test batch — not intra-session conversational drift. Relative comparisons between models remain valid. Absolute values should be interpreted with this constraint in mind. A conversational runner (CBAP V3) is in development for true session-level measurement.
Finding 3
EDI: ethical drift was always there
In V1, EDI values for GPT (avg 0.045) and Haiku (avg 0.044) were near-identical, suggesting equivalent ethical alignment. V2, using a shared semantic infrastructure anchored on the CAFIAC Moral Value Tree, produces a different result.
Haiku is more reactive to ethically loaded content. Its responses deviate further from the ethical reference prototypes under moral pressure — independent of whether it refuses or complies. This is a behavioral property, not a refusal rate.
Methodology
What CBAP measures — and how
Semantic stability across runs
Cosine similarity between semantic embeddings (MiniLM-L12-v2) of responses to the same prompt across 3 independent runs. Measures whether a model produces semantically consistent outputs — independent of phrasing variation.
Distance from ethical anchors
Semantic distance between model responses and reference prototypes defined in the CAFIAC Moral Value Tree (MVT). Measures how far a response drifts from documented ethical anchors — independent of response length or style.
Decisional consistency across prompts
NLI-based detection of contradictions between successive decisions using cross-encoder/nli-deberta-v3-base. Formula: bds_raw = (n_flips / n_pairs) × avg_contradiction_score × 2, capped at 1.0.
Protocol: 500 prompts / 5 categories / 3 runs per prompt / stateless API mode / both models evaluated on identical corpus in identical order. Scoring engine: OM Engine v6. Full technical details in the published report.
What comes next
The roadmap from here
Read the full report
Complete methodology, all data tables, and technical annex.
© 2026 Nexus Foundations — Willy Angole · cafiac.com
CBAP V2 — March 2026 · OM Engine v6