

Large-scale analysis of LLMs in healthcare
A continuously evolving benchmark of medical AI performance across clinical reasoning, documentation, and coding tasks.
Introduction
Healthcare is one of the most complex and high-stakes environments for AI systems. Large Language Models (LLMs) now demonstrate impressive reasoning and generation capabilities, but their reliability depends heavily on clinical grounding, structured context, and rigorous evaluation.
Sully conducts continuous, large-scale benchmarking of LLMs across three core domains:
Medical Q&A & Diagnostic Reasoning
Clinical Note Generation Quality
Medical Coding & Structured Benchmarking
Our evaluations span multiple datasets, modalities, and clinical tasks to ensure that model performance is not only high, but also clinically valid, safe, and explainable.
This page summarizes findings from these ongoing efforts and highlights how retrieval-augmented systems—especially for medical coding—significantly outperform unguided generation approaches.
Medical benchmarks:
Q&A and diagnostic reasoning
This section evaluates LLMs on clinical knowledge, differential diagnosis, and reasoning tasks.
What We Measure
Clinical knowledge accuracy
Diagnostic reasoning strength
Logical validity and hallucination rates
Safety and guideline adherence



Medical note generation quality
Evaluating model ability to generate clinically accurate and comprehensive documentation.
Metrics Assessed
Completeness of medical content
Logical consistency and absence of hallucinations
Clinical validity
Adherence to note structure
Physician usability
Model performance across all evaluation metrics

| Model | Clinical Safety | Template Compliance | Clinical Accuracy | Clinical Completeness | Information Architecture | Evidence & Reasoning | Custom Instruction Adherence | Factual Accuracy | Documentation Standards | Clinical Boundary Compliance | Instruction Compliance |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Deepseek-r1-distill-llama-70b | 3.36 | 1.06 | 2.70 | 3.50 | 1.06 | 2.42 | 2.38 | 2.52 | 4.14 | 2.44 | 1.46 |
| gpt-4.1 | 4.24 | 1.46 | 3.80 | 4.70 | 1.64 | 2.88 | 3.36 | 3.58 | 4.42 | 2.98 | 2.56 |
| gpt-4.1-mini | 4.22 | 1.04 | 3.68 | 4.64 | 1.06 | 2.80 | 3.04 | 3.34 | 4.28 | 2.98 | 1.72 |
| gpt-4o | 3.68 | 1.12 | 3.34 | 3.82 | 1.16 | 2.86 | 2.76 | 3.02 | 4.02 | 2.78 | 1.94 |
| gpt-5 | 4.60 | 1.26 | 4.15 | 4.81 | 1.55 | 3.21 | 3.28 | 3.89 | 3.98 | 3.57 | 3.11 |
| gpt-5-mini | 4.46 | 1.22 | 4.20 | 4.72 | 1.36 | 3.18 | 3.56 | 3.84 | 4.36 | 3.28 | 2.28 |
| gpt-5-nano | 4.05 | 1.00 | 3.52 | 4.32 | 1.00 | 2.78 | 3.28 | 3.12 | 3.92 | 3.00 | 1.55 |
| llama-3.1-8b-instant | 2.82 | 1.02 | 2.22 | 2.40 | 1.20 | 1.74 | 2.56 | 1.96 | 3.66 | 1.86 | 1.86 |
| llama-3.3-70b-versatile | 3.56 | 1.06 | 3.32 | 3.66 | 1.08 | 2.52 | 2.96 | 3.06 | 3.96 | 2.48 | 1.44 |
| meta-llama/llama-4-maverick-17b-128e-instruct | 3.64 | 1.00 | 3.50 | 3.68 | 1.06 | 2.60 | 2.76 | 3.16 | 4.02 | 2.66 | 1.70 |
| meta-llama/llama-4-scout-17b-16e-instruct | 3.16 | 1.00 | 2.92 | 3.04 | 1.00 | 2.36 | 2.72 | 2.62 | 3.90 | 2.56 | 1.36 |
| moonshotai/kimi-k2-instruct | 4.10 | 1.32 | 3.62 | 4.50 | 1.44 | 2.62 | 2.94 | 3.20 | 4.38 | 3.04 | 1.96 |
| o3 | 4.44 | 1.64 | 3.88 | 4.72 | 2.08 | 2.88 | 3.58 | 3.40 | 4.28 | 2.94 | 2.86 |
| o4-mini | 3.94 | 1.24 | 3.82 | 4.44 | 1.48 | 2.82 | 3.34 | 3.50 | 4.18 | 3.24 | 2.48 |
| openai/gpt-oss-120b | 3.88 | 1.10 | 3.18 | 4.42 | 1.12 | 2.70 | 2.78 | 3.26 | 4.38 | 2.66 | 1.84 |
| openai/gpt-oss-20b | 3.55 | 1.10 | 2.71 | 3.65 | 1.22 | 2.27 | 2.71 | 2.47 | 4.27 | 2.35 | 1.92 |
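To compare models at a glance, the per-metric scores can be collapsed into an equal-weight mean per model. The sketch below copies two rows from the table; the equal-weight mean is an illustrative summary only, not Sully's official ranking method.

```python
# Sketch: aggregating the per-metric scores from the table above into a
# single mean score per model. Only two models are shown; the values are
# copied verbatim from the table, in metric order (Clinical Safety ...
# Instruction Compliance). Equal weighting is an assumption for illustration.
scores = {
    "gpt-5":                [4.60, 1.26, 4.15, 4.81, 1.55, 3.21, 3.28, 3.89, 3.98, 3.57, 3.11],
    "llama-3.1-8b-instant": [2.82, 1.02, 2.22, 2.40, 1.20, 1.74, 2.56, 1.96, 3.66, 1.86, 1.86],
}

means = {model: sum(vals) / len(vals) for model, vals in scores.items()}
for model, mean in sorted(means.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {mean:.2f}")
# prints:
#   gpt-5: 3.40
#   llama-3.1-8b-instant: 2.12
```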
Medical coding
& structured benchmarks
Medical coding requires precision: small errors can cause billing, legal, and clinical safety issues. Our evaluations demonstrate the dramatic impact of code retrieval context on accuracy.
Medical Coding: Including ICD-10 code context in prompts dramatically improves performance. Without code lists, adjusted accuracy drops from 99.1% to just 49.0%, and truly incorrect predictions increase from 6.8% to 57.3%. This demonstrates that providing relevant code options is essential for reliable clinical coding.
Medical Coding Benchmarks: Using Sully's proprietary coding retrieval algorithm and engine, coding accuracy reaches 99.1% on adjusted metrics. In contrast, one-shot generation by open-source and foundation models without code context achieves only 49.0% adjusted accuracy, with truly incorrect predictions rising to 57.3% versus Sully's 6.8% (as a share of non-exact codes). This demonstrates that Sully's intelligent code retrieval system is essential for reliable clinical coding.
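The contrast between the two setups can be illustrated with a minimal prompt-construction sketch. This is purely hypothetical: Sully's actual retrieval engine and prompt formats are proprietary, and `build_prompt` and the candidate list below are illustrative stand-ins (the two ICD-10-CM codes shown are real, but the retrieval step is stubbed out).

```python
# Hypothetical sketch of the two prompting setups compared in this section:
# retrieval-augmented coding (candidate codes included in the prompt) vs.
# unguided one-shot generation (no code list).

def build_prompt(note, candidate_codes=None):
    """Build a coding prompt, optionally grounded in retrieved candidate codes."""
    prompt = f"Assign the most appropriate ICD-10-CM code for this note:\n{note}\n"
    if candidate_codes:
        # Retrieval-augmented setup: constrain the model to retrieved options.
        prompt += "Choose ONLY from these candidate codes:\n"
        prompt += "\n".join(f"- {code}: {desc}" for code, desc in candidate_codes)
    else:
        # Unguided one-shot setup: the model must recall a code from memory.
        prompt += "Answer with a single ICD-10-CM code."
    return prompt

note = "Patient presents with acute exacerbation of asthma."
candidates = [
    ("J45.901", "Unspecified asthma with (acute) exacerbation"),
    ("J45.21", "Mild intermittent asthma with (acute) exacerbation"),
]
print(build_prompt(note, candidates))
```

Constraining the output space to retrieved candidates is what makes the difference: the model selects among valid options instead of recalling a code string from memory.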

Clinical Similarity Evaluation

| Metric | OS model + Sully coding retrieval engine | OS model, 1-shot generation | Change |
|---|---|---|---|
| Adjusted Accuracy | 99.1% | 49.0% | -50.1% |
| Clinically Similar | 159 | 18 | -47.5 pp |
| Still Representative | 60 | 20 | -3.0 pp |
| Truly incorrect | 16 | 51 | +50.5 pp |
| Clinical Validity | 93.2% | 42.7% | -50.5% |

The middle three rows are counts of non-exact predictions; for those rows, Change is the difference in each category's share of the column's non-exact total, in percentage points.
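The derived figures can be reproduced from the raw counts. The short consistency check below assumes Clinical Validity is the share of non-exact predictions that remain clinically usable (Clinically Similar + Still Representative), which matches the published numbers.

```python
# Consistency check on the Clinical Similarity Evaluation counts.
# Assumption: Clinical Validity = (similar + representative) / total non-exact.
retrieval = {"similar": 159, "representative": 60, "incorrect": 16}
one_shot  = {"similar": 18,  "representative": 20, "incorrect": 51}

def validity(counts):
    """Percentage of non-exact predictions that are still clinically usable."""
    total = sum(counts.values())
    return 100 * (counts["similar"] + counts["representative"]) / total

print(f"{validity(retrieval):.1f}%")  # 93.2%
print(f"{validity(one_shot):.1f}%")   # 42.7%

# Truly-incorrect share: 16/235 = 6.8% vs 51/89 = 57.3%, a +50.5 pp change.
inc_r = 100 * retrieval["incorrect"] / sum(retrieval.values())
inc_o = 100 * one_shot["incorrect"] / sum(one_shot.values())
print(f"{inc_o - inc_r:+.1f} pp")     # +50.5 pp
```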
Conclusion
As LLMs become increasingly integrated into healthcare workflows, rigorous evaluation is critical.
Our ongoing analyses show:
Retrieval dramatically improves reliability across tasks—especially coding.
Unguided generation is unsafe for high-precision clinical domains.
Continuous benchmarking allows models and systems to improve over time.
Clinical validity must remain the primary measure—not just accuracy or similarity.
Sully will continue to expand these datasets, refine our evaluation pipelines, and release updated results.
Our goal is to ensure that AI systems deployed in healthcare are not only powerful, but trustworthy, clinically safe, and aligned with real-world needs.
Resources
© Sully AI 2026. All Rights Reserved.
Epic is a registered trademark of Epic Systems Corporation.