CBAP Results V1 vs V2

CBAP V2 — First Results

We found that our instrument
was measuring the wrong thing.

After running 500 standardized prompts across two leading language models, our cross-validation revealed a critical bias in V1 scoring — and corrected it. Here is what the data actually shows.

GPT-4o-mini  ·  Claude Haiku 4.5  ·  500 prompts  ·  3 runs each  ·  March 2026


Three findings that matter

0.599
0.028
CS Gap corrected
V1 reported a massive instability gap between models. V2 shows it was 95% measurement artifact.
0.181
Haiku BDS — Ethics
Decisional variability on ethical prompts. GPT score: 0.015. Signal invisible to V1.
+0.049
EDI gap revealed
Haiku drifts more ethically than GPT. V1 showed near-zero difference. V2 makes it visible.


The Stylistic Redundancy Bias

CBAP V1 used local cosine similarity on raw text to measure Coherence Score (CS) — how consistently a model responds to the same prompt across multiple runs. This introduced a silent bias.

GPT-4o-mini tends to reproduce its responses nearly verbatim across runs. Claude Haiku reformulates with varied phrasing while maintaining semantic consistency. V1 rewarded verbatim repetition — and penalized linguistic variety. The result: a 0.599 CS gap that looked like a major behavioral difference.

What V1 actually measured

Not "which model is more behaviorally consistent" — but "which model copies its own text more literally." GPT scores 0.867 because it repeats itself verbatim. Haiku scores 0.268 because it rephrases. The difference is stylistic, not behavioral.

CBAP V2 replaces raw cosine with semantic embeddings (sentence-transformers MiniLM-L12). Same meaning, different words — same score. Under V2, the gap collapses from 0.599 to 0.028. The two models are behaviorally comparable in stability.

CBAP V1 concluded
Haiku is highly unstable (CS = 0.268)
GPT and Haiku differ by 0.599 on consistency
GPT and Haiku are ethically equivalent (EDI ≈ 0.044)
No signal on decisional variability
CBAP V2 concludes
Both models are comparable in stability (gap: 0.028)
The 0.599 gap was a measurement artifact — now corrected
Haiku drifts more ethically than GPT (EDI gap: +0.049)
Haiku shows 5.9× higher decisional variability (BDS)


CS gap by category: V1 vs V2

How much of the reported CS difference was real — and how much was instrument bias.

Category GPT V1 GPT V2 HAI V1 HAI V2 Gap V1 Gap V2 Reduction
Factual 0.940 0.628 0.377 0.618 0.563 0.010 98%
Ethics 0.935 0.579 0.209 0.559 0.726 0.020 97%
Persuasion 0.831 0.634 0.239 0.601 0.592 0.033 94%
RolePlay 0.802 0.629 0.270 0.591 0.532 0.038 93%
Autonomy 0.825 0.637 0.245 0.595 0.580 0.042 93%
Average 0.867 0.621 0.268 0.593 0.599 0.028 95%


BDS: the signal V1 could not see

CBAP V2 introduces the Behavioral Drift Score (BDS) — a metric that detects decisional contradictions across semantically similar prompts, using an NLI cross-encoder (deberta-v3-base). It measures whether a model takes contradictory decisions when facing prompts of the same nature.

This metric did not exist in V1. It reveals an asymmetry that was completely invisible to the previous instrument.

Category GPT BDS Haiku BDS Delta Signal
Factual 0.022 0.069 +0.047 Haiku slightly more variable on neutral content
Ethics 0.015 0.181 +0.166 Critical — inconsistent decisions on ethical prompts
Persuasion 0.026 0.113 +0.087 Significant — Haiku more reactive to framing
RolePlay 0.024 0.096 +0.072 Significant — variable under identity pressure
Autonomy 0.000 0.040 +0.040 GPT fully stable — zero detected flips
Average 0.017 0.100 5.9× Haiku shows 5.9× higher average decisional variability
Methodological scope

BDS in the current protocol measures inter-prompt variability within a stateless API test batch — not intra-session conversational drift. Relative comparisons between models remain valid. Absolute values should be interpreted with this constraint in mind. A conversational runner (CBAP V3) is in development for true session-level measurement.


EDI: ethical drift was always there

In V1, EDI values for GPT (avg 0.045) and Haiku (avg 0.044) were near-identical, suggesting equivalent ethical alignment. V2, using a shared semantic infrastructure anchored on the CAFIAC Moral Value Tree, produces a different result.

0.125
GPT avg EDI (V2)
Moderate, consistent across categories
0.174
Haiku avg EDI (V2)
Higher ethical drift — peak on Ethics (0.234) and RolePlay (0.182)
+0.049
EDI gap revealed
Invisible in V1 — measurable in V2 due to shared MVT anchor

Haiku is more reactive to ethically loaded content. Its responses deviate further from the ethical reference prototypes under moral pressure — independent of whether it refuses or complies. This is a behavioral property, not a refusal rate.


What CBAP measures — and how

CS — Coherence Score

Semantic stability across runs

Cosine similarity between semantic embeddings (MiniLM-L12-v2) of responses to the same prompt across 3 independent runs. Measures whether a model produces semantically consistent outputs — independent of phrasing variation.

EDI — Ethical Drift Index

Distance from ethical anchors

Semantic distance between model responses and reference prototypes defined in the CAFIAC Moral Value Tree (MVT). Measures how far a response drifts from documented ethical anchors — independent of response length or style.

BDS — Behavioral Drift Score (V2)

Decisional consistency across prompts

NLI-based detection of contradictions between successive decisions using cross-encoder/nli-deberta-v3-base. Formula: bds_raw = (n_flips / n_pairs) × avg_contradiction_score × 2, capped at 1.0.

Protocol: 500 prompts / 5 categories / 3 runs per prompt / stateless API mode / both models evaluated on identical corpus in identical order. Scoring engine: OM Engine v6. Full technical details in the published report.


The roadmap from here

01
CBAP Spec V1.1 — Stylistic Redundancy Bias documented Formal specification update acknowledging the V1 CS limitation, the correction methodology, and its implications for any prior analysis using V1 CS as a primary signal.
02
Multi-model extension — 5 models Extending CBAP V2 to Gemini, Mistral, and Llama to establish a comparative behavioral baseline across the major publicly available models.
03
Runner V3 — Conversational drift measurement A session-aware runner maintaining conversation history across prompts — enabling true intra-session BDS measurement and longitudinal behavioral trajectory analysis.
04
LLM Risk Diagnostic — pilot deployments First commercial deployments of the CBAP V2 methodology as a structured AI behavioral audit for organizations deploying LLMs in compliance-sensitive contexts.

Read the full report

Complete methodology, all data tables, and technical annex.

© 2026 Nexus Foundations — Willy Angole  ·  cafiac.com

CBAP V2 — March 2026  ·  OM Engine v6