

Large-scale analysis of LLMs in healthcare
A continuously evolving benchmark of medical AI performance across clinical reasoning, documentation, and coding tasks.
Introduction
Healthcare is one of the most complex and high-stakes environments for AI systems. Large Language Models (LLMs) now demonstrate impressive reasoning and generation capabilities, but their reliability depends heavily on clinical grounding, structured context, and rigorous evaluation.
Sully conducts continuous, large-scale benchmarking of LLMs across three core domains:
Medical Q&A & Diagnostic Reasoning
Clinical Note Generation Quality
Medical Coding & Structured Benchmarking
Our evaluations span multiple datasets, modalities, and clinical tasks to ensure that model performance is not only high, but also clinically valid, safe, and explainable.
This page summarizes findings from these ongoing efforts and highlights how retrieval-augmented systems—especially for medical coding—significantly outperform unguided generation approaches.
Medical benchmarks:
Q&A and diagnostic reasoning
This section evaluates LLMs on clinical knowledge, differential diagnosis, and reasoning tasks.
What We Measure
Clinical knowledge accuracy
Diagnostic reasoning strength
Logical validity and hallucination rates
Safety and guideline adherence



Medical note generation quality
Evaluating model ability to generate clinically accurate and comprehensive documentation.
Metrics Assessed
Completeness of medical content
Logical consistency and absence of hallucinations
Clinical validity
Adherence to note structure
Physician usability
Model performance across all evaluation metrics

| Model | Clinical Safety | Template Compliance | Clinical Accuracy | Clinical Completeness | Information Architecture | Evidence & Reasoning | Custom Instruction Adherence | Factual Accuracy | Documentation Standards | Clinical Boundary Compliance | Instruction Compliance |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Deepseek-r1-distill-llama-70b | 3.36 | 1.06 | 2.70 | 3.50 | 1.06 | 2.42 | 2.38 | 2.52 | 4.14 | 2.44 | 1.46 |
| gpt-4.1 | 4.24 | 1.46 | 3.80 | 4.70 | 1.64 | 2.88 | 3.36 | 3.58 | 4.42 | 2.98 | 2.56 |
| gpt-4.1-mini | 4.22 | 1.04 | 3.68 | 4.64 | 1.06 | 2.80 | 3.04 | 3.34 | 4.28 | 2.98 | 1.72 |
| gpt-4o | 3.68 | 1.12 | 3.34 | 3.82 | 1.16 | 2.86 | 2.76 | 3.02 | 4.02 | 2.78 | 1.94 |
| gpt-5 | 4.60 | 1.26 | 4.15 | 4.81 | 1.55 | 3.21 | 3.28 | 3.89 | 3.98 | 3.57 | 3.11 |
| gpt-5-mini | 4.46 | 1.22 | 4.20 | 4.72 | 1.36 | 3.18 | 3.56 | 3.84 | 4.36 | 3.28 | 2.28 |
| gpt-5-nano | 4.05 | 1.00 | 3.52 | 4.32 | 1.00 | 2.78 | 3.28 | 3.12 | 3.92 | 3.00 | 1.55 |
| llama-3.1-8b-instant | 2.82 | 1.02 | 2.22 | 2.40 | 1.20 | 1.74 | 2.56 | 1.96 | 3.66 | 1.86 | 1.86 |
| llama-3.3-70b-versatile | 3.56 | 1.06 | 3.32 | 3.66 | 1.08 | 2.52 | 2.96 | 3.06 | 3.96 | 2.48 | 1.44 |
| meta-llama/llama-4-maverick-17b-128e-instruct | 3.64 | 1.00 | 3.50 | 3.68 | 1.06 | 2.60 | 2.76 | 3.16 | 4.02 | 2.66 | 1.70 |
| meta-llama/llama-4-scout-17b-16e-instruct | 3.16 | 1.00 | 2.92 | 3.04 | 1.00 | 2.36 | 2.72 | 2.62 | 3.90 | 2.56 | 1.36 |
| moonshotai/kimi-k2-instruct | 4.10 | 1.32 | 3.62 | 4.50 | 1.44 | 2.62 | 2.94 | 3.20 | 4.38 | 3.04 | 1.96 |
| o3 | 4.44 | 1.64 | 3.88 | 4.72 | 2.08 | 2.88 | 3.58 | 3.40 | 4.28 | 2.94 | 2.86 |
| o4-mini | 3.94 | 1.24 | 3.82 | 4.44 | 1.48 | 2.82 | 3.34 | 3.50 | 4.18 | 3.24 | 2.48 |
| openai/gpt-oss-120b | 3.88 | 1.10 | 3.18 | 4.42 | 1.12 | 2.70 | 2.78 | 3.26 | 4.38 | 2.66 | 1.84 |
| openai/gpt-oss-20b | 3.55 | 1.10 | 2.71 | 3.65 | 1.22 | 2.27 | 2.71 | 2.47 | 4.27 | 2.35 | 1.92 |
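To compare models at a glance, the per-metric scores can be collapsed into an equal-weight mean per model. The sketch below copies two rows from the table; the equal-weight mean is an illustrative summary only, not Sully's official ranking method.

```python
# Sketch: aggregating the per-metric scores from the table above into a
# single mean score per model. Only two models are shown; the values are
# copied verbatim from the table, in metric order (Clinical Safety ...
# Instruction Compliance). Equal weighting is an assumption for illustration.
scores = {
    "gpt-5":                [4.60, 1.26, 4.15, 4.81, 1.55, 3.21, 3.28, 3.89, 3.98, 3.57, 3.11],
    "llama-3.1-8b-instant": [2.82, 1.02, 2.22, 2.40, 1.20, 1.74, 2.56, 1.96, 3.66, 1.86, 1.86],
}

means = {model: sum(vals) / len(vals) for model, vals in scores.items()}
for model, mean in sorted(means.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {mean:.2f}")
# prints:
#   gpt-5: 3.40
#   llama-3.1-8b-instant: 2.12
```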
Medical coding
& structured benchmarks
Medical coding requires precision: small errors can cause billing, legal, and clinical safety issues. Our evaluations demonstrate the dramatic impact of code retrieval context on accuracy.
Medical Coding: Including ICD-10 code context in prompts dramatically improves performance. Without code lists, adjusted accuracy drops from 99.1% to just 49.0%, and truly incorrect predictions increase from 6.8% to 57.3%. This demonstrates that providing relevant code options is essential for reliable clinical coding.
Medical Coding Benchmarks: Using Sully's proprietary coding retrieval algorithm and engine, coding accuracy reaches 99.1% on adjusted metrics. In contrast, one-shot generation by open-source and foundation models without code context achieves only 49.0% adjusted accuracy, with truly incorrect predictions rising to 57.3% versus Sully's 6.8% (as a share of non-exact codes). This demonstrates that Sully's intelligent code retrieval system is essential for reliable clinical coding.
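The contrast between the two setups can be illustrated with a minimal prompt-construction sketch. This is purely hypothetical: Sully's actual retrieval engine and prompt formats are proprietary, and `build_prompt` and the candidate list below are illustrative stand-ins (the two ICD-10-CM codes shown are real, but the retrieval step is stubbed out).

```python
# Hypothetical sketch of the two prompting setups compared in this section:
# retrieval-augmented coding (candidate codes included in the prompt) vs.
# unguided one-shot generation (no code list).

def build_prompt(note, candidate_codes=None):
    """Build a coding prompt, optionally grounded in retrieved candidate codes."""
    prompt = f"Assign the most appropriate ICD-10-CM code for this note:\n{note}\n"
    if candidate_codes:
        # Retrieval-augmented setup: constrain the model to retrieved options.
        prompt += "Choose ONLY from these candidate codes:\n"
        prompt += "\n".join(f"- {code}: {desc}" for code, desc in candidate_codes)
    else:
        # Unguided one-shot setup: the model must recall a code from memory.
        prompt += "Answer with a single ICD-10-CM code."
    return prompt

note = "Patient presents with acute exacerbation of asthma."
candidates = [
    ("J45.901", "Unspecified asthma with (acute) exacerbation"),
    ("J45.21", "Mild intermittent asthma with (acute) exacerbation"),
]
print(build_prompt(note, candidates))
```

Constraining the output space to retrieved candidates is what makes the difference: the model selects among valid options instead of recalling a code string from memory.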

Clinical Similarity Evaluation

| Metric | OS model + Sully coding retrieval engine | OS model, 1-shot generation | Change |
|---|---|---|---|
| Adjusted Accuracy | 99.1% | 49.0% | -50.1% |
| Clinically Similar | 159 | 18 | -47.5 pp |
| Still Representative | 60 | 20 | -3.0 pp |
| Truly incorrect | 16 | 51 | +50.5 pp |
| Clinical Validity | 93.2% | 42.7% | -50.5% |

The middle three rows are counts of non-exact predictions; for those rows, Change is the difference in each category's share of the column's non-exact total, in percentage points.
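The derived figures can be reproduced from the raw counts. The short consistency check below assumes Clinical Validity is the share of non-exact predictions that remain clinically usable (Clinically Similar + Still Representative), which matches the published numbers.

```python
# Consistency check on the Clinical Similarity Evaluation counts.
# Assumption: Clinical Validity = (similar + representative) / total non-exact.
retrieval = {"similar": 159, "representative": 60, "incorrect": 16}
one_shot  = {"similar": 18,  "representative": 20, "incorrect": 51}

def validity(counts):
    """Percentage of non-exact predictions that are still clinically usable."""
    total = sum(counts.values())
    return 100 * (counts["similar"] + counts["representative"]) / total

print(f"{validity(retrieval):.1f}%")  # 93.2%
print(f"{validity(one_shot):.1f}%")   # 42.7%

# Truly-incorrect share: 16/235 = 6.8% vs 51/89 = 57.3%, a +50.5 pp change.
inc_r = 100 * retrieval["incorrect"] / sum(retrieval.values())
inc_o = 100 * one_shot["incorrect"] / sum(one_shot.values())
print(f"{inc_o - inc_r:+.1f} pp")     # +50.5 pp
```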
Conclusion
As LLMs become increasingly integrated into healthcare workflows, rigorous evaluation is critical.
Our ongoing analyses show:
Retrieval dramatically improves reliability across tasks—especially coding.
Unguided generation is unsafe for high-precision clinical domains.
Continuous benchmarking allows models and systems to improve over time.
Clinical validity must remain the primary measure—not just accuracy or similarity.
Sully will continue to expand these datasets, refine our evaluation pipelines, and release updated results.
Our goal is to ensure that AI systems deployed in healthcare are not only powerful, but trustworthy, clinically safe, and aligned with real-world needs.
Resources
© Sully AI 2026. All Rights Reserved.
Epic is a registered trademark of Epic Systems Corporation.