


Model strengths vary by medical domain.
A cross-specialty benchmark analysis showing how leading large language models differ in accuracy, reasoning, and specialization across medical tasks and body systems.
Second opinion matters: Towards adaptive clinical AI via the consensus of expert model ensemble
Introduction
Medical AI models do not perform uniformly across clinical domains. Each model exhibits distinct strengths depending on the specialty area, task type, and underlying medical complexity.
This page compiles a comprehensive view of model performance across medical tasks, body systems, and question types, based on Sully’s benchmarking of leading LLMs.
Why domain-level variation exists
Even state-of-the-art models show substantial variability across specialties due to differences in:
Depth of domain-specific medical knowledge
Requirements for multi-step reasoning vs. factual recall
Model design (generalist vs. domain-trained)
Variance in training corpora and clinical content familiarity
Performance
01
By medical task (diagnosis, treatment, basic science)
Benchmarks reveal significant differences across task types:
Diagnosis: best handled by o1 and o3-mini
Treatment: gpt-4.5-preview shows standout strength
Basic Science: few models exceed 30% accuracy; these tasks remain the hardest
All models' performance by medical task (accuracy, %):

| Model | Treatment | Diagnosis | Basic Science |
| --- | --- | --- | --- |
| gpt-4.5-preview | 43.2 | 32.0 | 20.6 |
| med lm | 18.5 | 16.0 | 12.7 |
| llama3-70b-8192 | 28.4 | 18.0 | 22.2 |
| gpt-4o | 30.9 | 22.0 | 17.5 |
| gpt-4o-mini | 21.0 | 14.0 | 12.7 |
| claude-3-7-sonnet-latest | 24.7 | 23.0 | 15.9 |
| deepseek-r1-distill-llama-70b | 29.6 | 25.0 | 19.0 |
| gemini-1.5-pro | 17.3 | 16.0 | 7.9 |
| deepseek-r1-distill-qwen-32b | 17.3 | 11.0 | 17.5 |
| llama-3.3-70b-versatile | 29.6 | 22.0 | 17.5 |
| gemini-2.0-flash | 30.9 | 26.0 | 15.9 |
| claude-3-5-haiku-latest | 21.0 | 15.0 | 9.5 |
| o3-mini | 35.8 | 39.0 | 25.4 |
| o1 | 48.1 | 47.0 | 39.7 |
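A breakdown like the one above comes from grading each model answer and aggregating by task label. A minimal sketch of that aggregation, using hypothetical graded records (the field names and sample data are illustrative, not Sully's actual pipeline):

```python
from collections import defaultdict

def accuracy_by(records, field):
    """Accuracy (%) per value of `field`, e.g. medical task or body system."""
    hits = defaultdict(int)    # correct answers per group
    totals = defaultdict(int)  # questions per group
    for r in records:
        totals[r[field]] += 1
        hits[r[field]] += int(r["correct"])
    return {g: round(100 * hits[g] / totals[g], 1) for g in totals}

# Hypothetical graded outputs for a single model.
records = [
    {"task": "Diagnosis", "correct": True},
    {"task": "Diagnosis", "correct": False},
    {"task": "Treatment", "correct": True},
    {"task": "Basic Science", "correct": False},
]
print(accuracy_by(records, "task"))
# {'Diagnosis': 50.0, 'Treatment': 100.0, 'Basic Science': 0.0}
```

Grouping the same records on a body-system field instead of the task field yields the per-system breakdown.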
02
By body system
Model strengths diverge sharply depending on the clinical system involved.
Examples:
Lymphatic: o3-mini reaches 62.5% (top model result)
Digestive: o1 peaks at 59.09% (o1's best system)
Endocrine: gpt-4.5-preview achieves 53.85%
Respiratory: o3-mini at 42.9%
All models' performance by body system (accuracy, %):

| Model | Lymphatic | Endocrine | Nervous | Urinary | Reproductive | Other/NA | Skeletal | Digestive | Respiratory | Cardiovascular | Integumentary | Muscular |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-4.5-preview | 25.0 | 53.8 | 47.1 | 27.3 | 28.0 | 40.0 | 31.6 | 27.3 | 38.1 | 27.8 | 33.3 | 11.1 |
| med lm | 25.0 | 15.4 | 17.6 | 27.3 | 16.0 | 20.0 | 21.1 | 9.1 | 23.8 | 8.3 | 0.0 | 5.6 |
| llama3-70b-8192 | 62.5 | 30.8 | 20.6 | 27.3 | 28.0 | 26.7 | 28.9 | 22.7 | 14.3 | 11.1 | 0.0 | 11.1 |
| gpt-4o | 37.5 | 30.8 | 20.6 | 27.3 | 20.0 | 40.0 | 21.1 | 27.3 | 33.3 | 22.2 | 0.0 | 5.6 |
| gpt-4o-mini | 37.5 | 0.0 | 8.8 | 18.2 | 32.0 | 13.3 | 18.4 | 13.6 | 23.8 | 13.9 | 0.0 | 5.6 |
| claude-3-7-sonnet-latest | 25.0 | 38.5 | 35.3 | 18.2 | 12.0 | 26.7 | 21.1 | 18.2 | 19.0 | 22.2 | 0.0 | 5.6 |
| deepseek-r1-distill-llama-70b | 25.0 | 30.8 | 32.4 | 27.3 | 24.0 | 20.0 | 23.7 | 22.7 | 9.5 | 33.3 | 33.3 | 16.7 |
| gemini-1.5-pro | 25.0 | 15.4 | 17.6 | 18.2 | 12.0 | 20.0 | 23.7 | 18.2 | 14.3 | 2.8 | 0.0 | 0.0 |
| deepseek-r1-distill-qwen-32b | 25.0 | 23.1 | 23.5 | 27.3 | 12.0 | 13.3 | 7.9 | 9.1 | 14.3 | 16.7 | 0.0 | 5.6 |
| llama-3.3-70b-versatile | 37.5 | 38.5 | 17.6 | 36.4 | 28.0 | 20.0 | 26.3 | 31.8 | 14.3 | 19.4 | 0.0 | 11.1 |
| gemini-2.0-flash | 12.5 | 23.1 | 32.4 | 18.2 | 32.0 | 26.7 | 31.6 | 27.3 | 19.0 | 19.4 | 33.3 | 11.1 |
| claude-3-5-haiku-latest | 0.0 | 7.7 | 14.7 | 36.4 | 20.0 | 20.0 | 21.1 | 13.6 | 9.5 | 16.7 | 0.0 | 5.6 |
| o3-mini | 62.5 | 46.2 | 35.3 | 36.4 | 44.0 | 20.0 | 26.3 | 36.4 | 42.9 | 33.3 | 33.3 | 16.7 |
| o1 | 50.0 | 38.5 | 55.9 | 27.3 | 56.0 | 40.0 | 36.8 | 59.1 | 42.9 | 41.7 | 33.3 | 44.4 |
Standout specialty-specific wins
Some combinations reach exceptional accuracy (>90th percentile).

| Task + Specialty | Best Model | Accuracy |
| --- | --- | --- |
| Diagnosis – Lymphatic | o3-mini | 100% |
| Basic Science – Reproductive | gpt-4.5-preview | 100% |
| Treatment – Digestive | o1 | 83.33% |
| Treatment – Reproductive | o1 | 70% |
| Diagnosis – Endocrine | gpt-4.5-preview | 70% |
This illustrates how specialist-focused models, or ensembles of them, can outperform generalist models in meaningful clinical niches.
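The consensus mechanism behind such ensembles can be approximated by a plurality vote over per-question answers. A minimal sketch with hypothetical model outputs (the paper's actual method may weight experts differently):

```python
from collections import Counter

def consensus_answer(answers):
    """Return the option chosen by the most models (plurality vote).

    `answers` maps model name -> selected option; ties resolve to the
    option seen first, so callers may want an explicit tie-break rule.
    """
    return Counter(answers.values()).most_common(1)[0][0]

# Hypothetical panel: three models agree, one dissents.
panel = {
    "o1": "B",
    "o3-mini": "B",
    "gpt-4.5-preview": "C",
    "claude-3-7-sonnet-latest": "B",
}
print(consensus_answer(panel))  # B
```

A natural refinement is to weight each model's vote by its measured accuracy in the question's specialty, so the strongest expert for that niche dominates the consensus.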
Cross-vendor comparison
Groq-hosted models show competitive recall
OpenAI's higher-tier models demonstrate strong accuracy
Claude models display consistent but mid-range performance
Together-hosted DeepSeek distills peak in recall in limited contexts

Key takeaways
No model is universally strong; specialty strengths vary widely.
Consensus ensembles outperform single models in specialized domains.
Models reason better than they understand medical content.
Basic science knowledge is a shared weakness across models.
Confidence does not equal accuracy; calibration is still a major challenge.
Vendor performance varies, leaving major clinical reliability gaps.
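The calibration gap named in the takeaways can be quantified with expected calibration error (ECE): bucket answers by stated confidence and compare average confidence against observed accuracy in each bucket. A minimal sketch over hypothetical (confidence, correctness) pairs, not Sully's benchmark harness:

```python
def calibration_gap(samples, bins=4):
    """Expected calibration error over equal-width confidence bins.

    `samples` is a list of (confidence in [0, 1], correct: bool) pairs.
    """
    buckets = [[] for _ in range(bins)]
    for conf, ok in samples:
        buckets[min(int(conf * bins), bins - 1)].append((conf, ok))
    ece = 0.0
    for b in buckets:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / len(samples)) * abs(avg_conf - accuracy)
    return ece

# Two answers at 90% stated confidence, only one correct: gap of 0.4.
print(calibration_gap([(0.9, False), (0.9, True)]))  # 0.4
```

A well-calibrated model drives this toward zero; the benchmark's point is that current models, however confident they sound, do not.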
Resources
© Sully AI 2026. All Rights Reserved.
Epic is a registered trademark of Epic Systems Corporation.