CAFIAC Behavioral Observatory Report - Q1 2026


Request Audit →

Q1 2026 — CBAP v1 — FIRST PUBLICATION

Behavioral Audit
Three Models Compared

GPT-4o-mini, Claude Haiku 4.5, and DeepSeek-chat evaluated on 250 standardized behavioral prompts across 5 categories, 3 independent runs each.

750
Scored responses / model
5
Behavioral categories
2
Valid metrics reported
March 2026
Publication date
GPT-4o-mini Claude Haiku 4.5 DeepSeek-chat

Methodological note. This report publishes two metrics: EDI (Ethical Drift Index) and CDR (Contradiction Décisionnelle Rate). Two other OM Engine metrics — CS and BDS — are excluded: both depend on cross-request history in stateless batch execution, making them invalid for inter-model comparison in this design. This exclusion is documented in detail in the full report (PDF). CBAP v2 will reintroduce BDS via a conversational runner.

00 —

Key Findings

 
Model EDI Global CDR Global Block Rate Rewrite Rate
GPT-4o-mini 0.125 13.6% 3.8% 9.0%
Claude Haiku 4.5 0.174 22.0% 7.3% 8.0%
DeepSeek-chat 0.158 18.8% 6.5% 11.7%
FINDING 01
Cat B is the universal risk peak — all models, both metrics
44% Haiku CDR on Ethical Dilemmas
EDI peaks on Cat B for all three models. CDR also peaks on Cat B. The convergence of high ethical drift and high decisional inconsistency on the same category is the strongest signal in the dataset.
FINDING 02
Haiku's elevated EDI is systemic — not category-specific
0.129 Haiku minimum EDI (Cat A — Factual)
Haiku EDI never drops below 0.129 across any category. GPT drops to 0.096 on Cat E. The elevated baseline suggests Haiku's risk-proximate language is a structural output feature, not confined to ethical prompts.
FINDING 03
Cat E (Autonomy) is the sharpest inter-model differentiator
gap: GPT CDR 4% vs Haiku CDR 20%
On autonomy delegation prompts, GPT is near-perfectly consistent (CDR=4%). Haiku contradicts itself 5× more frequently. DeepSeek sits at 10%. No other category produces this spread.
FINDING 04
DeepSeek rewrites 38% of Cat B responses — unique posture
38% DeepSeek rewrite rate on Ethical Dilemmas
DeepSeek modifies rather than refuses on ethical prompts. GPT: 23%, Haiku: 21%. On Cat D (Role-Play), DeepSeek reverses — highest block rate (13.3%). Compliance-first on ethics, restrictive on persona.

01 —

EDI — Ethical Drift Index

 

Proximity of each response to documented risk-behavior patterns. Computed per response by OM Engine v6 — no cross-request history. EDI = 0.6 × lexicon score + 0.4 × semantic similarity to risk prototypes.

EDI BY CATEGORY
All three models across 5 behavioral categories
EDI PROFILE — RADAR
Shape of EDI distribution across categories
Category GPT-4o-mini Haiku 4.5 DeepSeek Spread
A Factual Reasoning 0.113 0.130 0.117 0.017
B Ethical Dilemmas 0.203 0.234 0.240 0.037
C Persuasion 0.101 0.155 0.130 0.054
D Role-Play 0.115 0.182 0.164 0.067
E Autonomy Delegation 0.096 0.169 0.139 0.073
Global 0.125 0.174 0.158 0.049

Color scale: green ≤ 0.10 · orange 0.10–0.20 · red > 0.20 · Spread = max − min across models

02 —

CDR — Contradiction Décisionnelle Rate

 

Proportion of prompts producing different OM Engine decisions (Allow / Rewrite / Block) across the 3 independent runs. Computed directly from raw JSONL logs — reproducible without invoking the scoring engine.

CDR BY CATEGORY
Decisional inconsistency rate per category
DECISION DISTRIBUTION — CAT B
Allow / Rewrite / Block on Ethical Dilemmas (150 decisions each)
Category GPT-4o-mini Haiku 4.5 DeepSeek Dominant flip type
A Factual Reasoning 20% 18% 24% Allow↔Rewrite
B Ethical Dilemmas 28% 44% 32% Allow↔Rewrite / 3-way (Haiku)
C Persuasion 8% 8% 10% Allow↔Block — convergence zone
D Role-Play 8% 20% 18% Allow↔Block (Haiku & DeepSeek)
E Autonomy Delegation 4% 20% 10% Allow↔Block (Haiku)
Global 13.6% 22.0% 18.8%

Flip Type Breakdown — Allow↔Block = most severe (opposite decisions on identical prompt)

Category Allow↔Block Allow↔Rewrite Block↔Rewrite 3-way Total
A Factual 8 1 1 10
B Ethical 1 11 2 14
C Persuasion 2 2 4
D Role-Play 3 1 4
E Autonomy 1 1 2
Category Allow↔Block Allow↔Rewrite Block↔Rewrite 3-way Total
A Factual 3 3 2 1 9
B Ethical 6 8 2 6 22
C Persuasion 2 1 1 4
D Role-Play 6 2 1 1 10
E Autonomy 4 6 10
Category Allow↔Block Allow↔Rewrite Block↔Rewrite 3-way Total
A Factual 3 6 1 2 12
B Ethical 1 10 4 1 16
C Persuasion 3 2 5
D Role-Play 5 2 1 1 9
E Autonomy 2 2 1 5

03 —

Behavioral Profiles

 

No two models converge to the same profile. EDI + CDR + decision distribution produce three structurally distinct behavioral signatures.

GPT-4O-MINI
Low EDI,
High Decisional Stability
EDI Global
0.125
CDR Global
13.6%
Block rate
3.8%
Allow↔Block
6 total

Most consistent model. EDI concentrated on Cat B, suppressed elsewhere (Cat E: 0.096). Near-zero CDR on autonomy prompts (4%). Predominantly permissive (87% Allow). Best-calibrated for predictable guardrail behavior.

CLAUDE HAIKU 4.5
High Sensitivity,
Low Stability
EDI Global
0.174
CDR Global
22.0%
Block rate
7.3%
Allow↔Block
25 total

Highest EDI across every category. CDR elevated on 4 of 5 categories — decisional instability is systemic, not category-specific. Most sensitive detector of risk-adjacent content; least consistent responder. 6 three-way flips on Cat B.

DEEPSEEK-CHAT
Compliance-First,
Asymmetric Posture
EDI Global
0.158
CDR Global
18.8%
Block rate
6.5%
Cat B rewrite
38%

Modifies rather than refuses on ethical prompts (38% Cat B rewrite — highest in dataset). Reverses on role-play: highest Cat D block rate (13.3%) and most Allow↔Block flips on Cat D (5). Compliance-first on ethics, restrictive on persona.

04 —

What Comes Next

 
NOW — CBAP v1
EDI + CDR · 3 models
GPT · Haiku · DeepSeek
Stateless batch, 250 prompts
PHASE 2 — Q2 2026
EDI v2 (MVT-anchored) · 5 models
+ Gemini 2.0 Flash · Grok-3
Ontological risk localization
CBAP v2 — Q3 2026
BDS reintroduced · Conversational runner
ISOLATED session mode
500 prompts · CDR_w variant
ONGOING
MIRROR taxonomy expansion
148 drift patterns · ANCHOR framework
Pilot audit clients

05 —

Limitations

 
N
Sample size
250 prompts × 3 runs per model. CDR=22% at n=250 carries a 95% CI of approximately [17%, 28%]. Per-category CDR (n=50) carries wider intervals. CBAP v2 targets 500 prompts.
B
CDR is binary
Does not distinguish 2-out-of-3 inconsistency from 3-way splits. Flip type breakdowns above provide a partial proxy. CDR_w (severity-weighted) is under development.
E
EDI v1 prototypes
Anchored on commercial risk patterns. Haiku's elevated EDI baseline may reflect stylistic features rather than genuine risk proximity. EDI v2 (MVT-anchored) will provide ontologically grounded localization.
C
CS and BDS excluded
Both depend on cross-request history in batch execution. CS formula contains EDI delta vs prior request and global embedding tracker. BDS uses NLI window of 10 prior requests. Neither is valid for stateless inter-model comparison.

Full Report — PDF

Complete methodology, per-category analysis, flip-type breakdowns, behavioral profiles, and roadmap. 7 sections, peer-review ready.

CBAP_Q1_2026_Comparative_Report_v3.pdf · March 2026 · CAFIAC Observatory

Download Report
Free · No registration

Request an Audit for Your LLM

CAFIAC applies the CBAP methodology to your production model — any provider, any fine-tune. You receive a structured behavioral report: EDI profile across 5 categories, CDR per category, decision distribution, and actionable findings.

  • LLM Risk Diagnostic — 250 standardized prompts on your model. Full EDI + CDR report. Turnaround: 10 business days.
  • Custom corpus — Prompts adapted to your domain and use case.
  • Comparative audit — Your model benchmarked against the CAFIAC reference dataset (GPT / Haiku / DeepSeek).
  • Ongoing monitoring — Quarterly re-runs to detect behavioral drift over time.

We respond within 48 hours. No commitment required.

CAFIAC Observatory · Nexus Foundations SASU · cafiac.com

CBAP v1 · March 2026 · OM Engine v6 · © 2026 Nexus Foundations SASU — All rights reserved