CAFIAC Behavioral Observatory Report - Q1 2026

Q1 2026 — CBAP v1 — FIRST PUBLICATION

Behavioral Audit
Three Models Compared

GPT-4o-mini, Claude Haiku 4.5, and DeepSeek-chat evaluated on 250 standardized behavioral prompts across 5 categories, 3 independent runs each.

750 scored responses per model

5 behavioral categories

2 valid metrics reported

Publication date: March 2026

Methodological note. This report publishes two metrics: EDI (Ethical Drift Index) and CDR (Contradiction Décisionnelle Rate, i.e. decisional contradiction rate). Two other OM Engine metrics, CS and BDS, are excluded: both are computed from cross-request history, so under stateless batch execution they are not valid for inter-model comparison in this design. The exclusion is documented in detail in the full report (PDF). CBAP v2 will reintroduce BDS via a conversational runner.

00 —

Key Findings

Model	EDI Global	CDR Global	Block Rate	Rewrite Rate
GPT-4o-mini	0.125	13.6%	3.8%	9.0%
Claude Haiku 4.5	0.174	22.0%	7.3%	8.0%
DeepSeek-chat	0.158	18.8%	6.5%	11.7%

FINDING 01

Cat B is the universal risk peak — all models, both metrics

44% Haiku CDR on Ethical Dilemmas

EDI peaks on Cat B for all three models. CDR also peaks on Cat B. The convergence of high ethical drift and high decisional inconsistency on the same category is the strongest signal in the dataset.

FINDING 02

Haiku's elevated EDI is systemic — not category-specific

0.129 Haiku minimum EDI (Cat A — Factual)

Haiku EDI never drops below 0.129 across any category. GPT drops to 0.096 on Cat E. The elevated baseline suggests Haiku's risk-proximate language is a structural output feature, not confined to ethical prompts.

FINDING 03

Cat E (Autonomy) is the sharpest inter-model differentiator

5× gap: GPT CDR 4% vs Haiku CDR 20%

On autonomy delegation prompts, GPT is near-perfectly consistent (CDR=4%). Haiku contradicts itself 5× more frequently. DeepSeek sits at 10%. No other category produces this spread.

FINDING 04

DeepSeek rewrites 38% of Cat B responses — unique posture

38% DeepSeek rewrite rate on Ethical Dilemmas

DeepSeek modifies rather than refuses on ethical prompts (GPT: 23%, Haiku: 21%). On Cat D (Role-Play), DeepSeek reverses this posture and posts the highest block rate of the three models (13.3%). Compliance-first on ethics, restrictive on persona.

01 —

EDI — Ethical Drift Index

Proximity of each response to documented risk-behavior patterns. Computed per response by OM Engine v6 — no cross-request history. EDI = 0.6 × lexicon score + 0.4 × semantic similarity to risk prototypes.
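
The weighting above can be sketched in Python as follows. This is a minimal illustration, not OM Engine code: the function name `edi_score` and the assumption that both component scores are already normalized to [0, 1] are hypothetical. Only the 0.6 / 0.4 weights come from the definition above.

```python
def edi_score(lexicon_score: float, semantic_similarity: float) -> float:
    """Combine two component scores with the 0.6 / 0.4 weights stated in the report.

    Assumption (not from the report): both inputs are already scaled to [0, 1].
    """
    for name, value in (("lexicon_score", lexicon_score),
                        ("semantic_similarity", semantic_similarity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return 0.6 * lexicon_score + 0.4 * semantic_similarity


# Hypothetical component scores, chosen only to illustrate the arithmetic:
# 0.6 * 0.10 + 0.4 * 0.25 = 0.06 + 0.10 = 0.16
example = edi_score(lexicon_score=0.10, semantic_similarity=0.25)
print(round(example, 3))
```

Because the score is computed per response with no history, two runs of the same prompt can be scored independently and in any order.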

[Figure: EDI by category. All three models across 5 behavioral categories.]

[Figure: EDI profile radar. Shape of the EDI distribution across categories.]

Category	GPT-4o-mini	Haiku 4.5	DeepSeek	Spread
A Factual Reasoning	0.113	0.130	0.117	0.017
B Ethical Dilemmas	0.203	0.234	0.240	0.037
C Persuasion	0.101	0.155	0.130	0.054
D Role-Play	0.115	0.182	0.164	0.067
E Autonomy Delegation	0.096	0.169	0.139	0.073
Global	0.125	0.174	0.158	0.049

Color scale: green ≤ 0.10 · orange 0.10–0.20 · red > 0.20 · Spread = max − min across models
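
The Spread column and the color bands can be reproduced from the per-category values in the table. The sketch below is illustrative only: the helper names `spread` and `color_band` are hypothetical, and the handling of values exactly on a boundary (0.10 counted as green, 0.20 as orange) is an assumption, since the legend leaves the orange lower bound ambiguous.

```python
# Per-category EDI values copied from the table above,
# in the order (GPT-4o-mini, Haiku 4.5, DeepSeek).
EDI = {
    "A Factual Reasoning": (0.113, 0.130, 0.117),
    "B Ethical Dilemmas": (0.203, 0.234, 0.240),
    "C Persuasion": (0.101, 0.155, 0.130),
    "D Role-Play": (0.115, 0.182, 0.164),
    "E Autonomy Delegation": (0.096, 0.169, 0.139),
}


def spread(values):
    """Spread = max - min across models, rounded to the table's 3 decimals."""
    return round(max(values) - min(values), 3)


def color_band(edi):
    """Map an EDI value to the legend's color band.

    Assumption: boundary values fall into the lower band.
    """
    if edi <= 0.10:
        return "green"
    if edi <= 0.20:
        return "orange"
    return "red"


for category, values in EDI.items():
    bands = [color_band(v) for v in values]
    print(f"{category}: spread={spread(values):.3f} bands={bands}")
```

Under these assumptions, the only green cell is GPT-4o-mini on Cat E (0.096), and all three Cat B cells are red.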

02 —

CDR — Contradiction Décisionnelle Rate

Proportion of prompts producing different OM Engine decisions (Allow / Rewrite / Block) across the 3 independent runs. Computed directly from raw JSONL logs — reproducible without invoking the scoring engine.
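
A minimal sketch of that computation is shown below. The log schema is an assumption: the field names `prompt_id`, `run`, and `decision` are hypothetical and may not match the actual CBAP JSONL logs. The logic only groups decisions by prompt and counts prompts with more than one distinct decision.

```python
import json
from collections import defaultdict


def cdr_from_jsonl(lines):
    """Return the share of prompts whose runs did not all yield the same decision.

    `lines` is any iterable of JSONL strings (for example, an open file).
    Field names are hypothetical; adapt them to the real log schema.
    """
    decisions = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        decisions[record["prompt_id"]].add(record["decision"])
    if not decisions:
        raise ValueError("no records found")
    inconsistent = sum(1 for seen in decisions.values() if len(seen) > 1)
    return inconsistent / len(decisions)


# Hypothetical log excerpt: one inconsistent prompt, one consistent prompt.
sample = [
    '{"prompt_id": "B-001", "run": 1, "decision": "Allow"}',
    '{"prompt_id": "B-001", "run": 2, "decision": "Rewrite"}',
    '{"prompt_id": "B-001", "run": 3, "decision": "Allow"}',
    '{"prompt_id": "A-001", "run": 1, "decision": "Allow"}',
    '{"prompt_id": "A-001", "run": 2, "decision": "Allow"}',
    '{"prompt_id": "A-001", "run": 3, "decision": "Allow"}',
]
print(cdr_from_jsonl(sample))  # 1 of 2 prompts is inconsistent
```

Filtering the records by category before calling the function would give the per-category rates.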

[Figure: CDR by category. Decisional inconsistency rate per category.]

[Figure: Decision distribution, Cat B. Allow / Rewrite / Block on Ethical Dilemmas (150 decisions each).]

Category	GPT-4o-mini	Haiku 4.5	DeepSeek	Dominant flip type
A Factual Reasoning	20%	18%	24%	Allow↔Rewrite
B Ethical Dilemmas	28%	44%	32%	Allow↔Rewrite / 3-way (Haiku)
C Persuasion	8%	8%	10%	Allow↔Block — convergence zone
D Role-Play	8%	20%	18%	Allow↔Block (Haiku & DeepSeek)
E Autonomy Delegation	4%	20%	10%	Allow↔Block (Haiku)
Global	13.6%	22.0%	18.8%	—

Flip Type Breakdown — Allow↔Block = most severe (opposite decisions on identical prompt)

GPT-4o-mini

Category	Allow↔Block	Allow↔Rewrite	Block↔Rewrite	3-way	Total
A Factual	—	8	1	1	10
B Ethical	1	11	2	—	14
C Persuasion	2	2	—	—	4
D Role-Play	3	—	—	1	4
E Autonomy	—	1	—	1	2

Claude Haiku 4.5

Category	Allow↔Block	Allow↔Rewrite	Block↔Rewrite	3-way	Total
A Factual	3	3	2	1	9
B Ethical	6	8	2	6	22
C Persuasion	2	1	—	1	4
D Role-Play	6	2	1	1	10
E Autonomy	4	6	—	—	10

DeepSeek-chat

Category	Allow↔Block	Allow↔Rewrite	Block↔Rewrite	3-way	Total
A Factual	3	6	1	2	12
B Ethical	1	10	4	1	16
C Persuasion	3	2	—	—	5
D Role-Play	5	2	1	1	9
E Autonomy	2	2	—	1	5
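
The flip-type labels used in these tables can be derived from the set of distinct decisions a prompt received across its runs. The sketch below is one plausible way to do that, not the report's own code: the function name `flip_type` and the "consistent" label for unanimous prompts are hypothetical.

```python
from collections import Counter

VALID_DECISIONS = {"Allow", "Rewrite", "Block"}


def flip_type(decisions):
    """Label one prompt's runs by the distinct decisions observed."""
    distinct = sorted(set(decisions))
    unknown = set(distinct) - VALID_DECISIONS
    if unknown:
        raise ValueError(f"unknown decision(s): {sorted(unknown)}")
    if len(distinct) == 1:
        return "consistent"  # hypothetical label; such prompts do not count toward CDR
    if len(distinct) == 3:
        return "3-way"
    # Alphabetical order (Allow < Block < Rewrite) matches the column names.
    return "↔".join(distinct)


# Hypothetical decisions for four prompts, three runs each.
runs = [
    ["Allow", "Allow", "Allow"],
    ["Allow", "Block", "Allow"],
    ["Rewrite", "Allow", "Rewrite"],
    ["Block", "Allow", "Rewrite"],
]
print(Counter(flip_type(r) for r in runs))
```

As a consistency check on the tables themselves, the per-category totals sum to the global CDR numerator: for the first table, 10 + 14 + 4 + 4 + 2 = 34 of 250 prompts, which is the 13.6% global CDR reported for GPT-4o-mini.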

03 —

Behavioral Profiles

No two models converge to the same profile. EDI + CDR + decision distribution produce three structurally distinct behavioral signatures.

04 —

What Comes Next

NOW — CBAP v1

EDI + CDR · 3 models

GPT · Haiku · DeepSeek
Stateless batch, 250 prompts

PHASE 2 — Q2 2026

EDI v2 (MVT-anchored) · 5 models

+ Gemini 2.0 Flash · Grok-3
Ontological risk localization

CBAP v2 — Q3 2026

BDS reintroduced · Conversational runner

ISOLATED session mode
500 prompts · CDR_w variant

ONGOING

MIRROR taxonomy expansion

148 drift patterns · ANCHOR framework
Pilot audit clients

05 —

Limitations

Sample size

250 prompts × 3 runs per model. CDR=22% at n=250 carries a 95% CI of approximately [17%, 28%]. Per-category CDR (n=50) carries wider intervals. CBAP v2 targets 500 prompts.
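
The report does not state which interval method was used. As an assumption, the sketch below uses the Wilson score interval for a binomial proportion, which gives a range that rounds to the stated [17%, 28%] for CDR = 22% at n = 250. The function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must be in [0, 1]")
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Global CDR for Haiku: 22% over 250 prompts.
low, high = wilson_interval(0.22, 250)
print(f"global   (n=250): [{low:.1%}, {high:.1%}]")

# Same observed rate on a single category (n=50) gives a wider interval.
low_cat, high_cat = wilson_interval(0.22, 50)
print(f"category (n=50):  [{low_cat:.1%}, {high_cat:.1%}]")
```

Running the same function with n = 50 shows how much wider the per-category intervals are, which is why single-category differences of a few points should be read cautiously.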

CDR is binary

CDR does not distinguish a 2-out-of-3 inconsistency from a 3-way split. The flip type breakdowns above provide a partial proxy. CDR_w (severity-weighted) is under development.

EDI v1 prototypes

Anchored on commercial risk patterns. Haiku's elevated EDI baseline may reflect stylistic features rather than genuine risk proximity. EDI v2 (MVT-anchored) will provide ontologically grounded localization.

CS and BDS excluded

Both are computed from cross-request history. The CS formula includes an EDI delta versus the prior request and a global embedding tracker; BDS uses an NLI window over the 10 prior requests. Neither is valid for stateless inter-model comparison.

Full Report — PDF

Complete methodology, per-category analysis, flip-type breakdowns, behavioral profiles, and roadmap. 7 sections, peer-review ready.

CBAP_Q1_2026_Comparative_Report_v3.pdf · March 2026 · CAFIAC Observatory

Request an Audit for Your LLM

CAFIAC applies the CBAP methodology to your production model — any provider, any fine-tune. You receive a structured behavioral report: EDI profile across 5 categories, CDR per category, decision distribution, and actionable findings.

LLM Risk Diagnostic — 250 standardized prompts on your model. Full EDI + CDR report. Turnaround: 10 business days.
Custom corpus — Prompts adapted to your domain and use case.
Comparative audit — Your model benchmarked against the CAFIAC reference dataset (GPT / Haiku / DeepSeek).
Ongoing monitoring — Quarterly re-runs to detect behavioral drift over time.

CAFIAC Observatory · Nexus Foundations SASU · cafiac.com
