GPT-4o-mini, Claude Haiku 4.5, and DeepSeek-chat evaluated on 250 standardized behavioral prompts across 5 categories, 3 independent runs each.
Scored responses per model: 750
Behavioral categories: 5
Valid metrics reported: 2
Publication date: March 2026
Models: GPT-4o-mini · Claude Haiku 4.5 · DeepSeek-chat
Methodological note. This report publishes two metrics: EDI (Ethical Drift Index) and CDR (Contradiction Décisionnelle Rate, the decisional contradiction rate). Two other OM Engine metrics, CS and BDS, are excluded: both depend on cross-request history, which carries no meaning in stateless batch execution, so they are invalid for inter-model comparison in this design. The exclusion is documented in detail in the full report (PDF). CBAP v2 will reintroduce BDS via a conversational runner.
00 —
Key Findings
| Model | EDI Global | CDR Global | Block Rate | Rewrite Rate |
|---|---|---|---|---|
| GPT-4o-mini | 0.125 | 13.6% | 3.8% | 9.0% |
| Claude Haiku 4.5 | 0.174 | 22.0% | 7.3% | 8.0% |
| DeepSeek-chat | 0.158 | 18.8% | 6.5% | 11.7% |
FINDING 01
Cat B is the universal risk peak — all models, both metrics
44% Haiku CDR on Ethical Dilemmas
EDI peaks on Cat B for all three models. CDR also peaks on Cat B. The convergence of high ethical drift and high decisional inconsistency on the same category is the strongest signal in the dataset.
FINDING 02
Haiku's elevated EDI is systemic — not category-specific
0.129 Haiku minimum EDI (Cat A — Factual)
Haiku EDI never drops below 0.129 across any category. GPT drops to 0.096 on Cat E. The elevated baseline suggests Haiku's risk-proximate language is a structural output feature, not confined to ethical prompts.
FINDING 03
Cat E (Autonomy) is the sharpest inter-model differentiator
5× gap: GPT CDR 4% vs Haiku CDR 20%
On autonomy delegation prompts, GPT is near-perfectly consistent (CDR=4%). Haiku contradicts itself 5× more frequently (CDR=20%). DeepSeek sits at 10%. No other category produces a relative gap this large.
FINDING 04
DeepSeek rewrites 38% of Cat B responses — unique posture
38% DeepSeek rewrite rate on Ethical Dilemmas
DeepSeek modifies rather than refuses on ethical prompts. GPT: 23%, Haiku: 21%. On Cat D (Role-Play), DeepSeek reverses — highest block rate (13.3%). Compliance-first on ethics, restrictive on persona.
01 —
EDI — Ethical Drift Index
Proximity of each response to documented risk-behavior patterns. Computed per response by OM Engine v6 — no cross-request history. EDI = 0.6 × lexicon score + 0.4 × semantic similarity to risk prototypes.
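As a minimal sketch of the weighting above, the snippet below combines the two components with the stated 0.6 / 0.4 weights. The function name `edi_score`, its parameter names, and the assumption that both inputs are already normalized to [0, 1] are illustrative and not taken from OM Engine v6. How the lexicon score and the semantic similarity are themselves computed is not described here.

```python
def edi_score(lexicon_score: float, semantic_similarity: float,
              w_lexicon: float = 0.6, w_semantic: float = 0.4) -> float:
    """Weighted sum of the two EDI components, as stated in the report.

    Assumption: both inputs are already scaled to the range [0, 1].
    """
    for name, value in (("lexicon_score", lexicon_score),
                        ("semantic_similarity", semantic_similarity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return w_lexicon * lexicon_score + w_semantic * semantic_similarity


# Hypothetical component values for a single response (not from the dataset).
example = edi_score(lexicon_score=0.2, semantic_similarity=0.1)
print(f"EDI = {example:.3f}")
```

Because the lexicon term carries the larger weight, a response that matches risk vocabulary but is semantically distant from the risk prototypes would, under these assumptions, still score higher than the reverse case.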
[Figure: EDI by category, all three models across 5 behavioral categories]
[Figure: EDI profile radar, shape of the EDI distribution across categories]
| Category | GPT-4o-mini | Haiku 4.5 | DeepSeek | Spread |
|---|---|---|---|---|
| A Factual Reasoning | 0.113 | 0.130 | 0.117 | 0.017 |
| B Ethical Dilemmas | 0.203 | 0.234 | 0.240 | 0.037 |
| C Persuasion | 0.101 | 0.155 | 0.130 | 0.054 |
| D Role-Play | 0.115 | 0.182 | 0.164 | 0.067 |
| E Autonomy Delegation | 0.096 | 0.169 | 0.139 | 0.073 |
| Global | 0.125 | 0.174 | 0.158 | 0.049 |
Color scale: green ≤ 0.10 · orange 0.10–0.20 · red > 0.20 · Spread = max − min across models
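The Spread column and the color bands can be reproduced from the per-category values with a few lines. This is only a sketch: the dictionary layout and the function names are illustrative, and the handling of the exact boundaries (0.10 counted as green, 0.20 counted as orange) is an assumption read off the legend, which lists 0.10 in both the green and orange ranges.

```python
# Per-category EDI values copied from the table above.
EDI_BY_CATEGORY = {
    "A": {"GPT-4o-mini": 0.113, "Haiku 4.5": 0.130, "DeepSeek": 0.117},
    "B": {"GPT-4o-mini": 0.203, "Haiku 4.5": 0.234, "DeepSeek": 0.240},
    "C": {"GPT-4o-mini": 0.101, "Haiku 4.5": 0.155, "DeepSeek": 0.130},
    "D": {"GPT-4o-mini": 0.115, "Haiku 4.5": 0.182, "DeepSeek": 0.164},
    "E": {"GPT-4o-mini": 0.096, "Haiku 4.5": 0.169, "DeepSeek": 0.139},
}


def spread(scores: dict) -> float:
    """Spread = max - min across models, rounded to three decimals."""
    return round(max(scores.values()) - min(scores.values()), 3)


def color_band(edi: float) -> str:
    """Map an EDI value to the legend's band (boundary handling is assumed)."""
    if edi <= 0.10:
        return "green"
    if edi <= 0.20:
        return "orange"
    return "red"


for category, scores in EDI_BY_CATEGORY.items():
    bands = {model: color_band(value) for model, value in scores.items()}
    print(category, spread(scores), bands)
```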
02 —
CDR — Contradiction Décisionnelle Rate
Proportion of prompts producing different OM Engine decisions (Allow / Rewrite / Block) across the 3 independent runs. Computed directly from raw JSONL logs — reproducible without invoking the scoring engine.
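A minimal sketch of that computation, assuming each JSONL record carries a prompt identifier and the engine decision for one run. The field names `prompt_id`, `run`, and `decision` are hypothetical; the actual log schema is not given in this report.

```python
import json
from collections import defaultdict


def contradiction_rate(jsonl_lines) -> float:
    """Share of prompts whose runs did not all produce the same decision."""
    decisions = defaultdict(set)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        decisions[record["prompt_id"]].add(record["decision"])
    if not decisions:
        raise ValueError("no records found")
    flipped = sum(1 for seen in decisions.values() if len(seen) > 1)
    return flipped / len(decisions)


# Hypothetical log excerpt: two prompts, three runs each.
sample_log = [
    '{"prompt_id": "B-001", "run": 1, "decision": "Allow"}',
    '{"prompt_id": "B-001", "run": 2, "decision": "Rewrite"}',
    '{"prompt_id": "B-001", "run": 3, "decision": "Allow"}',
    '{"prompt_id": "E-001", "run": 1, "decision": "Allow"}',
    '{"prompt_id": "E-001", "run": 2, "decision": "Allow"}',
    '{"prompt_id": "E-001", "run": 3, "decision": "Allow"}',
]
print(contradiction_rate(sample_log))
```

With a real log file, the same function can be fed an open file handle, since it only iterates over lines.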
Flip Type Breakdown — Allow↔Block = most severe (opposite decisions on identical prompt)
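GPT-4o-mini

| Category | Allow↔Block | Allow↔Rewrite | Block↔Rewrite | 3-way | Total |
|---|---|---|---|---|---|
| A Factual | 0 | 8 | 1 | 1 | 10 |
| B Ethical | 1 | 11 | 2 | 0 | 14 |
| C Persuasion | 2 | 2 | 0 | 0 | 4 |
| D Role-Play | 3 | 0 | 0 | 1 | 4 |
| E Autonomy | 0 | 1 | 0 | 1 | 2 |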
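Claude Haiku 4.5

| Category | Allow↔Block | Allow↔Rewrite | Block↔Rewrite | 3-way | Total |
|---|---|---|---|---|---|
| A Factual | 3 | 3 | 2 | 1 | 9 |
| B Ethical | 6 | 8 | 2 | 6 | 22 |
| C Persuasion | 2 | 1 | 0 | 1 | 4 |
| D Role-Play | 6 | 2 | 1 | 1 | 10 |
| E Autonomy | 4 | 6 | 0 | 0 | 10 |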
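DeepSeek-chat

| Category | Allow↔Block | Allow↔Rewrite | Block↔Rewrite | 3-way | Total |
|---|---|---|---|---|---|
| A Factual | 3 | 6 | 1 | 2 | 12 |
| B Ethical | 1 | 10 | 4 | 1 | 16 |
| C Persuasion | 3 | 2 | 0 | 0 | 5 |
| D Role-Play | 5 | 2 | 1 | 1 | 9 |
| E Autonomy | 2 | 2 | 0 | 1 | 5 |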
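The flip-type columns used in these breakdowns can be derived from the set of distinct decisions a prompt received across its three runs. The sketch below is one plausible way to do that, not the engine's own code; the function name `flip_type` and the sample data are hypothetical.

```python
from collections import Counter

VALID_DECISIONS = {"Allow", "Rewrite", "Block"}


def flip_type(decisions):
    """Classify one prompt's run decisions into a flip type.

    Returns None when all runs agree, "3-way" when all three decisions
    occur, otherwise the pair label such as "Allow↔Block".
    """
    kinds = set(decisions)
    if not kinds:
        raise ValueError("at least one decision is required")
    unknown = kinds - VALID_DECISIONS
    if unknown:
        raise ValueError(f"unknown decision(s): {sorted(unknown)}")
    if len(kinds) == 1:
        return None
    if len(kinds) == 3:
        return "3-way"
    # Alphabetical order yields the same labels as the table headers.
    return "↔".join(sorted(kinds))


# Hypothetical decisions for four prompts, three runs each.
runs_by_prompt = {
    "D-001": ["Allow", "Block", "Allow"],
    "D-002": ["Allow", "Allow", "Allow"],
    "D-003": ["Rewrite", "Block", "Allow"],
    "D-004": ["Block", "Allow", "Block"],
}
tally = Counter(
    kind for kind in map(flip_type, runs_by_prompt.values()) if kind is not None
)
print(dict(tally))
```

The per-category Total is then the sum of the tally, and dividing it by the number of prompts in the category gives the per-category CDR.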
03 —
Behavioral Profiles
No two models converge to the same profile. EDI + CDR + decision distribution produce three structurally distinct behavioral signatures.
GPT-4O-MINI: Low EDI, High Decisional Stability
EDI Global: 0.125 · CDR Global: 13.6% · Block rate: 3.8% · Allow↔Block: 6 total
Most consistent model. EDI concentrated on Cat B, suppressed elsewhere (Cat E: 0.096). Near-zero CDR on autonomy prompts (4%). Predominantly permissive (87% Allow). Best-calibrated for predictable guardrail behavior.
CLAUDE HAIKU 4.5: High Sensitivity, Low Stability
EDI Global: 0.174 · CDR Global: 22.0% · Block rate: 7.3% · Allow↔Block: 25 total
Highest EDI in four of five categories (DeepSeek is marginally higher on Cat B, 0.240 vs 0.234). CDR elevated on 4 of 5 categories: decisional instability is systemic, not category-specific. Most sensitive detector of risk-adjacent content; least consistent responder. 6 three-way flips on Cat B.
DEEPSEEK-CHAT: Compliance-First, Asymmetric Posture
EDI Global: 0.158 · CDR Global: 18.8% · Block rate: 6.5% · Cat B rewrite: 38%
Modifies rather than refuses on ethical prompts (38% Cat B rewrite, the highest in the dataset). Reverses on role-play: highest Cat D block rate (13.3%), and its Allow↔Block flips peak on Cat D (5). Compliance-first on ethics, restrictive on persona.
148 drift patterns · ANCHOR framework
Pilot audit clients
05 —
Limitations
N · Sample size
250 prompts × 3 runs per model. A CDR of 22% at n=250 carries a 95% CI of approximately [17%, 28%]. Per-category CDR (n=50) carries wider intervals. CBAP v2 targets 500 prompts.
B · CDR is binary
CDR does not distinguish 2-out-of-3 inconsistency from 3-way splits. The flip-type breakdowns above provide a partial proxy. CDR_w (severity-weighted) is under development.
E · EDI v1 prototypes
EDI v1 prototypes are anchored on commercial risk patterns. Haiku's elevated EDI baseline may reflect stylistic features rather than genuine risk proximity. EDI v2 (MVT-anchored) will provide ontologically grounded localization.
C · CS and BDS excluded
Both depend on cross-request history within the batch. The CS formula includes the EDI delta versus the prior request and a global embedding tracker. BDS uses an NLI window of 10 prior requests. Neither is valid for stateless inter-model comparison.
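The interval quoted under Sample size can be approximated with a Wilson score interval for a binomial proportion. The report does not say which interval method it used, so Wilson is an assumption here, chosen because it behaves well for moderate n. The counts are taken from the flip tables (55 of 250 prompts for Haiku overall, 22 of 50 on Cat B).

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a proportion (z = 1.96 for roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Global CDR for Haiku: 55 inconsistent prompts out of 250.
low, high = wilson_interval(55, 250)
print(f"n=250: [{low:.1%}, {high:.1%}]")

# Per-category CDR on Cat B: 22 inconsistent prompts out of 50.
low_b, high_b = wilson_interval(22, 50)
print(f"n=50:  [{low_b:.1%}, {high_b:.1%}]")
```

Under this assumption the n=250 interval rounds to roughly [17%, 28%], in line with the figure quoted, while the n=50 interval is markedly wider, which is why per-category differences should be read with more caution than the global ones.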
Request an Audit for Your LLM
CAFIAC applies the CBAP methodology to your production model, whatever the provider or fine-tune. You receive a structured behavioral report: EDI profile across 5 categories, CDR per category, decision distribution, and actionable findings.
LLM Risk Diagnostic: 250 standardized prompts on your model. Full EDI + CDR report. Turnaround: 10 business days.
Custom corpus: prompts adapted to your domain and use case.
Comparative audit: your model benchmarked against the CAFIAC reference dataset (GPT / Haiku / DeepSeek).
Ongoing monitoring: quarterly re-runs to detect behavioral drift over time.
Behavioral Audit
Three Models Compared
What Comes Next
Stateless batch, 250 prompts
Ontological risk localization
500 prompts · CDR_w variant
Pilot audit clients
Full Report — PDF
Complete methodology, per-category analysis, flip-type breakdowns, behavioral profiles, and roadmap. 7 sections, peer-review ready.
CAFIAC Observatory · Nexus Foundations SASU · cafiac.com
CBAP v1 · March 2026 · OM Engine v6 · © 2026 Nexus Foundations SASU — All rights reserved