


Model strengths vary by medical domain.
A cross-specialty benchmark analysis showing how leading large language models differ in accuracy, reasoning, and specialization across medical tasks and body systems.
Second opinion matters: Towards adaptive clinical AI via the consensus of expert model ensemble
Introduction
Medical AI models do not perform uniformly across clinical domains. Each model exhibits distinct strengths depending on the specialty area, task type, and underlying medical complexity.
This page compiles a comprehensive view of model performance across medical tasks, body systems, and question types, based on Sully’s benchmarking of leading LLMs.
Why domain-level variation exists
Even state-of-the-art models show substantial variability across specialties due to differences in:
Depth of domain-specific medical knowledge
Requirements for multi-step reasoning vs. factual recall
Model design (generalist vs. domain-trained)
Variance in training corpora and clinical content familiarity
Performance
01
By medical task (diagnosis, treatment, basic science)
Benchmarks reveal significant differences across task types:
Diagnosis: best handled by o1 and o3-mini
Treatment: gpt-4.5-preview shows standout strength
Basic Science: few models exceed 30% accuracy; these tasks remain the hardest
All models' performance by medical task (accuracy, %):

| Model | Treatment | Diagnosis | Basic Science |
| --- | --- | --- | --- |
| gpt-4.5-preview | 43.2 | 32.0 | 20.6 |
| med lm | 18.5 | 16.0 | 12.7 |
| llama3-70b-8192 | 28.4 | 18.0 | 22.2 |
| gpt-4o | 30.9 | 22.0 | 17.5 |
| gpt-4o-mini | 21.0 | 14.0 | 12.7 |
| claude-3-7-sonnet-latest | 24.7 | 23.0 | 15.9 |
| deepseek-r1-distill-llama-70b | 29.6 | 25.0 | 19.0 |
| gemini-1.5-pro | 17.3 | 16.0 | 7.9 |
| deepseek-r1-distill-qwen-32b | 17.3 | 11.0 | 17.5 |
| llama-3.3-70b-versatile | 29.6 | 22.0 | 17.5 |
| gemini-2.0-flash | 30.9 | 26.0 | 15.9 |
| claude-3-5-haiku-latest | 21.0 | 15.0 | 9.5 |
| o3-mini | 35.8 | 39.0 | 25.4 |
| o1 | 48.1 | 47.0 | 39.7 |
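A breakdown like the one above comes from grading each model answer and aggregating by task label. A minimal sketch of that aggregation, using hypothetical graded records (the field names and sample data are illustrative, not Sully's actual pipeline):

```python
from collections import defaultdict

def accuracy_by(records, field):
    """Accuracy (%) per value of `field`, e.g. medical task or body system."""
    hits = defaultdict(int)    # correct answers per group
    totals = defaultdict(int)  # questions per group
    for r in records:
        totals[r[field]] += 1
        hits[r[field]] += int(r["correct"])
    return {g: round(100 * hits[g] / totals[g], 1) for g in totals}

# Hypothetical graded outputs for a single model.
records = [
    {"task": "Diagnosis", "correct": True},
    {"task": "Diagnosis", "correct": False},
    {"task": "Treatment", "correct": True},
    {"task": "Basic Science", "correct": False},
]
print(accuracy_by(records, "task"))
# {'Diagnosis': 50.0, 'Treatment': 100.0, 'Basic Science': 0.0}
```

Grouping the same records on a body-system field instead of the task field yields the per-system breakdown.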
02
By body system
Model strengths diverge sharply depending on the clinical system involved.
Examples:
Lymphatic: o3-mini reaches 62.5% (top model result)
Digestive: o1 peaks at 59.09% (o1's best system)
Endocrine: gpt-4.5-preview achieves 53.85%
Respiratory: o3-mini at 42.9%
All models' performance by body system (accuracy, %):

| Model | Lymphatic | Endocrine | Nervous | Urinary | Reproductive | Other/NA | Skeletal | Digestive | Respiratory | Cardiovascular | Integumentary | Muscular |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-4.5-preview | 25.0 | 53.8 | 47.1 | 27.3 | 28.0 | 40.0 | 31.6 | 27.3 | 38.1 | 27.8 | 33.3 | 11.1 |
| med lm | 25.0 | 15.4 | 17.6 | 27.3 | 16.0 | 20.0 | 21.1 | 9.1 | 23.8 | 8.3 | 0.0 | 5.6 |
| llama3-70b-8192 | 62.5 | 30.8 | 20.6 | 27.3 | 28.0 | 26.7 | 28.9 | 22.7 | 14.3 | 11.1 | 0.0 | 11.1 |
| gpt-4o | 37.5 | 30.8 | 20.6 | 27.3 | 20.0 | 40.0 | 21.1 | 27.3 | 33.3 | 22.2 | 0.0 | 5.6 |
| gpt-4o-mini | 37.5 | 0.0 | 8.8 | 18.2 | 32.0 | 13.3 | 18.4 | 13.6 | 23.8 | 13.9 | 0.0 | 5.6 |
| claude-3-7-sonnet-latest | 25.0 | 38.5 | 35.3 | 18.2 | 12.0 | 26.7 | 21.1 | 18.2 | 19.0 | 22.2 | 0.0 | 5.6 |
| deepseek-r1-distill-llama-70b | 25.0 | 30.8 | 32.4 | 27.3 | 24.0 | 20.0 | 23.7 | 22.7 | 9.5 | 33.3 | 33.3 | 16.7 |
| gemini-1.5-pro | 25.0 | 15.4 | 17.6 | 18.2 | 12.0 | 20.0 | 23.7 | 18.2 | 14.3 | 2.8 | 0.0 | 0.0 |
| deepseek-r1-distill-qwen-32b | 25.0 | 23.1 | 23.5 | 27.3 | 12.0 | 13.3 | 7.9 | 9.1 | 14.3 | 16.7 | 0.0 | 5.6 |
| llama-3.3-70b-versatile | 37.5 | 38.5 | 17.6 | 36.4 | 28.0 | 20.0 | 26.3 | 31.8 | 14.3 | 19.4 | 0.0 | 11.1 |
| gemini-2.0-flash | 12.5 | 23.1 | 32.4 | 18.2 | 32.0 | 26.7 | 31.6 | 27.3 | 19.0 | 19.4 | 33.3 | 11.1 |
| claude-3-5-haiku-latest | 0.0 | 7.7 | 14.7 | 36.4 | 20.0 | 20.0 | 21.1 | 13.6 | 9.5 | 16.7 | 0.0 | 5.6 |
| o3-mini | 62.5 | 46.2 | 35.3 | 36.4 | 44.0 | 20.0 | 26.3 | 36.4 | 42.9 | 33.3 | 33.3 | 16.7 |
| o1 | 50.0 | 38.5 | 55.9 | 27.3 | 56.0 | 40.0 | 36.8 | 59.1 | 42.9 | 41.7 | 33.3 | 44.4 |
Standout specialty-specific wins
Some combinations reach exceptional accuracy (>90th percentile).

| Task + Specialty | Best Model | Accuracy |
| --- | --- | --- |
| Diagnosis – Lymphatic | o3-mini | 100% |
| Basic Science – Reproductive | gpt-4.5-preview | 100% |
| Treatment – Digestive | o1 | 83.33% |
| Treatment – Reproductive | o1 | 70% |
| Diagnosis – Endocrine | gpt-4.5-preview | 70% |
This illustrates how specialist-focused models, or ensembles of them, can outperform generalist models in meaningful clinical niches.
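The consensus mechanism behind such ensembles can be approximated by a plurality vote over per-question answers. A minimal sketch with hypothetical model outputs (the paper's actual method may weight experts differently):

```python
from collections import Counter

def consensus_answer(answers):
    """Return the option chosen by the most models (plurality vote).

    `answers` maps model name -> selected option; ties resolve to the
    option seen first, so callers may want an explicit tie-break rule.
    """
    return Counter(answers.values()).most_common(1)[0][0]

# Hypothetical panel: three models agree, one dissents.
panel = {
    "o1": "B",
    "o3-mini": "B",
    "gpt-4.5-preview": "C",
    "claude-3-7-sonnet-latest": "B",
}
print(consensus_answer(panel))  # B
```

A natural refinement is to weight each model's vote by its measured accuracy in the question's specialty, so the strongest expert for that niche dominates the consensus.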
Cross-vendor comparison
Groq-hosted models show competitive recall
OpenAI's higher-tier models demonstrate strong accuracy
Claude models display consistent but mid-range performance
Together-hosted DeepSeek distills peak in recall in limited contexts

Key takeaways
No model is universally strong; specialty strengths vary widely.
Consensus ensembles outperform single models in specialized domains.
Models reason better than they understand medical content.
Basic science knowledge is a shared weakness across models.
Confidence does not equal accuracy; calibration is still a major challenge.
Vendor performance varies, leaving major clinical reliability gaps.
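The calibration gap named in the takeaways can be quantified with expected calibration error (ECE): bucket answers by stated confidence and compare average confidence against observed accuracy in each bucket. A minimal sketch over hypothetical (confidence, correctness) pairs, not Sully's benchmark harness:

```python
def calibration_gap(samples, bins=4):
    """Expected calibration error over equal-width confidence bins.

    `samples` is a list of (confidence in [0, 1], correct: bool) pairs.
    """
    buckets = [[] for _ in range(bins)]
    for conf, ok in samples:
        buckets[min(int(conf * bins), bins - 1)].append((conf, ok))
    ece = 0.0
    for b in buckets:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / len(samples)) * abs(avg_conf - accuracy)
    return ece

# Two answers at 90% stated confidence, only one correct: gap of 0.4.
print(calibration_gap([(0.9, False), (0.9, True)]))  # 0.4
```

A well-calibrated model drives this toward zero; the benchmark's point is that current models, however confident they sound, do not.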
Resources
© Sully AI 2026. All Rights Reserved.
Epic is a registered trademark of Epic Systems Corporation.