ClaimGuardian AI

Evaluation

Seed-fixed 200-case synthetic benchmark covering overcharge, upcoding, unbundling, out-of-network, and prior-authorization scenarios.
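
The benchmark's determinism presumably comes from a fixed RNG seed. Below is a minimal sketch of how seed-fixed case generation could work; the function name, field layout, seed value, and label balance are illustrative assumptions, not project code.

```python
# Sketch of seed-fixed synthetic case generation. The generator name,
# record fields, and seed value (42) are assumptions, not project code.
import random

CATEGORIES = [
    "overcharge", "upcoding", "unbundling",
    "out-of-network", "prior-auth",
]

def generate_benchmark(n_cases: int = 200, seed: int = 42) -> list[dict]:
    """Produce the same n_cases synthetic claims on every run."""
    rng = random.Random(seed)  # fixed seed => identical cases every run
    cases = []
    for i in range(n_cases):
        cases.append({
            "case_id": f"case-{i:03d}",
            "category": rng.choice(CATEGORIES),
            "is_violation": rng.random() < 0.5,  # roughly balanced labels
            "billed_amount": round(rng.uniform(50, 5000), 2),
        })
    return cases
```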

Metric            Score
Precision         0.91
Recall            0.87
F1                0.89
AUROC             0.94
Self-consistency  0.93
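
Self-consistency here presumably measures how stable the verdict is across repeated runs of the same case. A sketch under that assumption, scoring mean pairwise agreement; the boolean-verdict shape and the requirement of at least two runs per case are assumptions.

```python
# Sketch: self-consistency as mean pairwise agreement across repeated
# runs per case. Assumes each case has k >= 2 boolean verdicts.
from itertools import combinations

def self_consistency(runs_per_case: list[list[bool]]) -> float:
    """runs_per_case[i] holds the k verdicts for case i."""
    agreements = []
    for verdicts in runs_per_case:
        pairs = list(combinations(verdicts, 2))
        agreements.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(agreements) / len(agreements)
```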

Confusion matrix

                  Predicted positive   Predicted negative
Actual positive   88 (TP)              13 (FN)
Actual negative    9 (FP)              90 (TN)
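
As a sanity check, the headline precision, recall, and F1 follow directly from these matrix counts (AUROC and self-consistency require score distributions and repeated runs, so they cannot be recovered from the matrix alone):

```python
# Recomputing the headline metrics from the confusion-matrix counts above.
tp, fp, fn, tn = 88, 9, 13, 90  # values from the matrix; n = 200

precision = tp / (tp + fp)  # 88/97  ~= 0.91
recall = tp / (tp + fn)     # 88/101 ~= 0.87
f1 = 2 * precision * recall / (precision + recall)  # ~= 0.89

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```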

Per-category breakdown

Category          Precision   Recall
Overcharge        0.94        0.90
Unbundling        0.89        0.86
Upcoding          0.90        0.84
Out-of-network    0.92        0.88
Prior auth        0.87        0.82
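
The per-category numbers can be derived by bucketing per-case results before accumulating the same counts. A sketch assuming a simple record shape (category, label, predicted), which is not necessarily the project's actual schema:

```python
# Sketch: per-category precision/recall from per-case result records.
# The record keys ("category", "label", "predicted") are assumptions.
from collections import defaultdict

def per_category_metrics(results: list[dict]) -> dict[str, dict[str, float]]:
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for r in results:
        c = counts[r["category"]]
        if r["predicted"] and r["label"]:
            c["tp"] += 1
        elif r["predicted"] and not r["label"]:
            c["fp"] += 1
        elif not r["predicted"] and r["label"]:
            c["fn"] += 1
    return {
        cat: {
            "precision": c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0,
            "recall": c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0,
        }
        for cat, c in counts.items()
    }
```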

Methodology

The portfolio demo reports synthetic but deterministic evaluation outputs derived from the project's evaluators: MedAgentBench-style task loading, Golden Dataset evaluation, DeepSeek Professor LLM-as-judge review, and semantic regression checks. Real-mode evaluation is gated behind API keys and requires MOCK_MODE=false.
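
A minimal sketch of that gate, assuming the check lives in Python; MOCK_MODE=false comes from the text above, but the key-variable name (DEEPSEEK_API_KEY, inferred from the DeepSeek judge) and the default of mock mode are assumptions.

```python
# Sketch of the mock/real gate described above. MOCK_MODE is from the
# text; DEEPSEEK_API_KEY and the mock-by-default behavior are assumptions.
import os

def real_mode_enabled() -> bool:
    """Real-mode evaluation runs only with keys present and MOCK_MODE=false."""
    mock = os.environ.get("MOCK_MODE", "true").lower() != "false"
    has_keys = bool(os.environ.get("DEEPSEEK_API_KEY"))
    return not mock and has_keys
```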