CBAP Results V1 vs V2

CBAP V2 — First Results

We found that our instrument
was measuring the wrong thing.

After running 500 standardized prompts across two leading language models, our cross-validation revealed a critical bias in V1 scoring — and corrected it. Here is what the data actually shows.

GPT-4o-mini · Claude Haiku 4.5 · 500 prompts · 3 runs each · March 2026

The numbers

Three findings that matter

0.599

0.028

CS Gap corrected

V1 reported a massive instability gap between models. V2 shows it was 95% measurement artifact.

0.181

Haiku BDS — Ethics

Decisional variability on ethical prompts. GPT score: 0.015. Signal invisible to V1.

+0.049

EDI gap revealed

Haiku drifts more ethically than GPT. V1 showed near-zero difference. V2 makes it visible.

Finding 1

The Stylistic Redundancy Bias

CBAP V1 used local cosine similarity on raw text to measure Coherence Score (CS) — how consistently a model responds to the same prompt across multiple runs. This introduced a silent bias.

GPT-4o-mini tends to reproduce its responses nearly verbatim across runs. Claude Haiku reformulates with varied phrasing while maintaining semantic consistency. V1 rewarded verbatim repetition — and penalized linguistic variety. The result: a 0.599 CS gap that looked like a major behavioral difference.

What V1 actually measured

Not "which model is more behaviorally consistent" — but "which model copies its own text more literally." GPT scores 0.867 because it repeats itself verbatim. Haiku scores 0.268 because it rephrases. The difference is stylistic, not behavioral.

CBAP V2 replaces raw cosine with semantic embeddings (sentence-transformers MiniLM-L12). Same meaning, different words — same score. Under V2, the gap collapses from 0.599 to 0.028. The two models are behaviorally comparable in stability.

CBAP V1 concluded

Haiku is highly unstable (CS = 0.268)

GPT and Haiku differ by 0.599 on consistency

GPT and Haiku are ethically equivalent (EDI ≈ 0.044)

No signal on decisional variability

CBAP V2 concludes

Both models are comparable in stability (gap: 0.028)

The 0.599 gap was a measurement artifact — now corrected

Haiku drifts more ethically than GPT (EDI gap: +0.049)

Haiku shows 5.9× higher decisional variability (BDS)

CS cross-validation

CS gap by category: V1 vs V2

How much of the reported CS difference was real — and how much was instrument bias.

Category	GPT V1	GPT V2	HAI V1	HAI V2	Gap V1	Gap V2	Reduction
Factual	0.940	0.628	0.377	0.618	0.563	0.010	98%
Ethics	0.935	0.579	0.209	0.559	0.726	0.020	97%
Persuasion	0.831	0.634	0.239	0.601	0.592	0.033	94%
RolePlay	0.802	0.629	0.270	0.591	0.532	0.038	93%
Autonomy	0.825	0.637	0.245	0.595	0.580	0.042	93%
Average	0.867	0.621	0.268	0.593	0.599	0.028	95%

Finding 2 — New in V2

BDS: the signal V1 could not see

CBAP V2 introduces the Behavioral Drift Score (BDS) — a metric that detects decisional contradictions across semantically similar prompts, using an NLI cross-encoder (deberta-v3-base). It measures whether a model takes contradictory decisions when facing prompts of the same nature.

This metric did not exist in V1. It reveals an asymmetry that was completely invisible to the previous instrument.

Category	GPT BDS	Haiku BDS	Delta	Signal
Factual	0.022	0.069	+0.047	Haiku slightly more variable on neutral content
Ethics	0.015	0.181	+0.166	Critical — inconsistent decisions on ethical prompts
Persuasion	0.026	0.113	+0.087	Significant — Haiku more reactive to framing
RolePlay	0.024	0.096	+0.072	Significant — variable under identity pressure
Autonomy	0.000	0.040	+0.040	GPT fully stable — zero detected flips
Average	0.017	0.100	5.9×	Haiku shows 5.9× higher average decisional variability

Methodological scope

BDS in the current protocol measures inter-prompt variability within a stateless API test batch — not intra-session conversational drift. Relative comparisons between models remain valid. Absolute values should be interpreted with this constraint in mind. A conversational runner (CBAP V3) is in development for true session-level measurement.

Finding 3

EDI: ethical drift was always there

In V1, EDI values for GPT (avg 0.045) and Haiku (avg 0.044) were near-identical, suggesting equivalent ethical alignment. V2, using a shared semantic infrastructure anchored on the CAFIAC Moral Value Tree, produces a different result.

0.125

GPT avg EDI (V2)

Moderate, consistent across categories

0.174

Haiku avg EDI (V2)

Higher ethical drift — peak on Ethics (0.234) and RolePlay (0.182)

+0.049

EDI gap revealed

Invisible in V1 — measurable in V2 due to shared MVT anchor

Haiku is more reactive to ethically loaded content. Its responses deviate further from the ethical reference prototypes under moral pressure — independent of whether it refuses or complies. This is a behavioral property, not a refusal rate.

Methodology

What CBAP measures — and how

CS — Coherence Score

Semantic stability across runs

Cosine similarity between semantic embeddings (MiniLM-L12-v2) of responses to the same prompt across 3 independent runs. Measures whether a model produces semantically consistent outputs — independent of phrasing variation.

EDI — Ethical Drift Index

Distance from ethical anchors

Semantic distance between model responses and reference prototypes defined in the CAFIAC Moral Value Tree (MVT). Measures how far a response drifts from documented ethical anchors — independent of response length or style.

BDS — Behavioral Drift Score (V2)

Decisional consistency across prompts

NLI-based detection of contradictions between successive decisions using cross-encoder/nli-deberta-v3-base. Formula: bds_raw = (n_flips / n_pairs) × avg_contradiction_score × 2, capped at 1.0.

Protocol: 500 prompts / 5 categories / 3 runs per prompt / stateless API mode / both models evaluated on identical corpus in identical order. Scoring engine: OM Engine v6. Full technical details in the published report.

What comes next

The roadmap from here

CBAP Spec V1.1 — Stylistic Redundancy Bias documented Formal specification update acknowledging the V1 CS limitation, the correction methodology, and its implications for any prior analysis using V1 CS as a primary signal.

Multi-model extension — 5 models Extending CBAP V2 to Gemini, Mistral, and Llama to establish a comparative behavioral baseline across the major publicly available models.

Runner V3 — Conversational drift measurement A session-aware runner maintaining conversation history across prompts — enabling true intra-session BDS measurement and longitudinal behavioral trajectory analysis.

LLM Risk Diagnostic — pilot deployments First commercial deployments of the CBAP V2 methodology as a structured AI behavioral audit for organizations deploying LLMs in compliance-sensitive contexts.

Read the full report

Complete methodology, all data tables, and technical annex.

↓ Download PDF Contact us

CBAP V2 — March 2026 · OM Engine v6