ClaimGuardian AI

Evaluation

Seed-fixed 200-case synthetic benchmark covering overcharge, upcoding, unbundling, out-of-network, and prior-authorization scenarios.
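
The benchmark's determinism presumably comes from a fixed RNG seed. Below is a minimal sketch of how seed-fixed case generation could work; the function name, field layout, seed value, and label balance are illustrative assumptions, not project code.

```python
# Sketch of seed-fixed synthetic case generation. The generator name,
# record fields, and seed value (42) are assumptions, not project code.
import random

CATEGORIES = [
    "overcharge", "upcoding", "unbundling",
    "out-of-network", "prior-auth",
]

def generate_benchmark(n_cases: int = 200, seed: int = 42) -> list[dict]:
    """Produce the same n_cases synthetic claims on every run."""
    rng = random.Random(seed)  # fixed seed => identical cases every run
    cases = []
    for i in range(n_cases):
        cases.append({
            "case_id": f"case-{i:03d}",
            "category": rng.choice(CATEGORIES),
            "is_violation": rng.random() < 0.5,  # roughly balanced labels
            "billed_amount": round(rng.uniform(50, 5000), 2),
        })
    return cases
```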

Metric            Score
Precision         0.91
Recall            0.87
F1                0.89
AUROC             0.94
Self-consistency  0.93
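
Self-consistency here presumably measures how stable the verdict is across repeated runs of the same case. A sketch under that assumption, scoring mean pairwise agreement; the boolean-verdict shape and the requirement of at least two runs per case are assumptions.

```python
# Sketch: self-consistency as mean pairwise agreement across repeated
# runs per case. Assumes each case has k >= 2 boolean verdicts.
from itertools import combinations

def self_consistency(runs_per_case: list[list[bool]]) -> float:
    """runs_per_case[i] holds the k verdicts for case i."""
    agreements = []
    for verdicts in runs_per_case:
        pairs = list(combinations(verdicts, 2))
        agreements.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(agreements) / len(agreements)
```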

Confusion matrix

                  Predicted positive   Predicted negative
Actual positive   88 (TP)              13 (FN)
Actual negative    9 (FP)              90 (TN)
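
As a sanity check, the headline precision, recall, and F1 follow directly from these matrix counts (AUROC and self-consistency require score distributions and repeated runs, so they cannot be recovered from the matrix alone):

```python
# Recomputing the headline metrics from the confusion-matrix counts above.
tp, fp, fn, tn = 88, 9, 13, 90  # values from the matrix; n = 200

precision = tp / (tp + fp)  # 88/97  ~= 0.91
recall = tp / (tp + fn)     # 88/101 ~= 0.87
f1 = 2 * precision * recall / (precision + recall)  # ~= 0.89

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```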

Per-category breakdown

Category          Precision   Recall
Overcharge        0.94        0.90
Unbundling        0.89        0.86
Upcoding          0.90        0.84
Out-of-network    0.92        0.88
Prior auth        0.87        0.82
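
The per-category numbers can be derived by bucketing per-case results before accumulating the same counts. A sketch assuming a simple record shape (category, label, predicted), which is not necessarily the project's actual schema:

```python
# Sketch: per-category precision/recall from per-case result records.
# The record keys ("category", "label", "predicted") are assumptions.
from collections import defaultdict

def per_category_metrics(results: list[dict]) -> dict[str, dict[str, float]]:
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for r in results:
        c = counts[r["category"]]
        if r["predicted"] and r["label"]:
            c["tp"] += 1
        elif r["predicted"] and not r["label"]:
            c["fp"] += 1
        elif not r["predicted"] and r["label"]:
            c["fn"] += 1
    return {
        cat: {
            "precision": c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0,
            "recall": c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0,
        }
        for cat, c in counts.items()
    }
```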

Methodology

The portfolio demo reports synthetic but deterministic evaluation outputs derived from the project's evaluators: MedAgentBench-style task loading, Golden Dataset evaluation, DeepSeek Professor LLM-as-judge review, and semantic regression checks. Real-mode evaluation is gated behind API keys and requires MOCK_MODE=false.
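
A minimal sketch of that gate, assuming the check lives in Python; MOCK_MODE=false comes from the text above, but the key-variable name (DEEPSEEK_API_KEY, inferred from the DeepSeek judge) and the default of mock mode are assumptions.

```python
# Sketch of the mock/real gate described above. MOCK_MODE is from the
# text; DEEPSEEK_API_KEY and the mock-by-default behavior are assumptions.
import os

def real_mode_enabled() -> bool:
    """Real-mode evaluation runs only with keys present and MOCK_MODE=false."""
    mock = os.environ.get("MOCK_MODE", "true").lower() != "false"
    has_keys = bool(os.environ.get("DEEPSEEK_API_KEY"))
    return not mock and has_keys
```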