ClaimGuardian AI
Evaluation
Seed-fixed 200-case synthetic benchmark covering overcharge, upcoding, unbundling, out-of-network, and prior-authorization scenarios.
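A seed-fixed benchmark means the same 200 cases are produced on every run. A minimal sketch of how such deterministic generation can work (the generator, category mix, and seed value here are illustrative assumptions, not the project's actual code):

```python
import random

# Hypothetical sketch of a seed-fixed synthetic benchmark: a fixed seed
# makes the same 200 cases come out on every run.
CATEGORIES = [
    "overcharge", "upcoding", "unbundling",
    "out-of-network", "prior-authorization",
]

def generate_cases(n=200, seed=42):  # seed value assumed for illustration
    rng = random.Random(seed)  # local RNG so global state is untouched
    return [
        {
            "id": i,
            "category": rng.choice(CATEGORIES),
            "is_violation": rng.random() < 0.5,
        }
        for i in range(n)
    ]

# Same seed, same cases: the benchmark is reproducible by construction.
assert generate_cases() == generate_cases()
```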
Precision: 0.91
Recall: 0.87
F1: 0.89
AUROC: 0.94
Self-consistency: 0.93
Confusion matrix
True positives: 88
False positives: 9
False negatives: 13
True negatives: 90
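The headline precision, recall, and F1 follow directly from these confusion-matrix counts; a short check (standard metric formulas, no project code assumed):

```python
# Derive the headline metrics from the confusion-matrix counts above.
tp, fp, fn, tn = 88, 9, 13, 90

precision = tp / (tp + fp)  # 88 / 97  ≈ 0.907
recall = tp / (tp + fn)     # 88 / 101 ≈ 0.871
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# → precision=0.91 recall=0.87 f1=0.89
```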
Per-category breakdown
Overcharge: precision 0.94, recall 0.90
Unbundling: precision 0.89, recall 0.86
Upcoding: precision 0.90, recall 0.84
Out-of-network: precision 0.92, recall 0.88
Prior auth: precision 0.87, recall 0.82
Methodology
The portfolio demo reports synthetic but deterministic evaluation outputs derived from the project's evaluators: MedAgentBench-style task loading, Golden Dataset evaluation, DeepSeek Professor LLM-as-judge review, and semantic regression checks. Real-mode evaluation is gated behind API keys and MOCK_MODE=false.
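The MOCK_MODE gate described above can be sketched as a small environment check. Only the MOCK_MODE flag comes from the source; the helper name and the API-key variable name are assumptions for illustration:

```python
import os

# Hypothetical gating helper: real-mode evaluation runs only when
# MOCK_MODE=false AND an API key is present. The key variable name
# (DEEPSEEK_API_KEY) is an assumed placeholder, not confirmed by the source.
def use_real_mode() -> bool:
    mock = os.environ.get("MOCK_MODE", "true").strip().lower()
    has_key = bool(os.environ.get("DEEPSEEK_API_KEY"))
    return mock == "false" and has_key
```

Defaulting to mock mode when the variable is unset keeps the demo deterministic and free of accidental API spend.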