Model strengths vary by medical domain.

A cross-specialty benchmark analysis showing how leading large language models differ in accuracy, reasoning, and specialization across medical tasks and body systems.

Second opinion matters: Towards adaptive clinical AI via the consensus of expert model ensemble

Introduction

Medical AI models do not perform uniformly across clinical domains. Each model exhibits distinct strengths depending on the specialty area, task type, and underlying medical complexity.

This page compiles a comprehensive view of model performance across medical tasks, body systems, and question types, based on Sully’s benchmarking of leading LLMs.


Why domain-level variation exists

Even state-of-the-art models show substantial variability across specialties due to differences in:

  • Depth of domain-specific medical knowledge

  • Requirements for multi-step reasoning vs. factual recall

  • Model design (generalist vs. domain-trained)

  • Variance in training corpora and clinical content familiarity

Performance

01

By medical task (diagnosis, treatment, basic science)

Benchmarks reveal significant differences across task types:

  • Diagnosis: best handled by O1 and O3-MINI

  • Treatment: GPT-4.5-PREVIEW shows specialty strength

  • Basic Science: few models exceed 30% accuracy; these tasks remain the hardest
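Task-level numbers like these come from scoring each model's answers per task category. A minimal sketch of that aggregation, with illustrative names (`accuracy_by_task` and the `sample` data are hypothetical, not Sully's actual harness):

```python
# Hypothetical sketch: per-task accuracy aggregation over graded
# benchmark items. Each item is a (task_label, is_correct) pair.
from collections import defaultdict

def accuracy_by_task(results):
    """results: iterable of (task, is_correct) pairs -> {task: accuracy %}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, ok in results:
        total[task] += 1
        correct[task] += int(ok)
    return {t: 100.0 * correct[t] / total[t] for t in total}

sample = [("Diagnosis", True), ("Diagnosis", False),
          ("Treatment", True), ("Basic Science", False)]
print(accuracy_by_task(sample))
# Diagnosis 50.0, Treatment 100.0, Basic Science 0.0
```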

All Models performance by medical task (accuracy, %)

| Model | Treatment | Diagnosis | Basic Science |
| --- | --- | --- | --- |
| gpt-4.5-preview | 43.2 | 32.0 | 20.6 |
| med lm | 18.5 | 16.0 | 12.7 |
| llama3-70b-8192 | 28.4 | 18.0 | 22.2 |
| gpt-4o | 30.9 | 22.0 | 17.5 |
| gpt-4o-mini | 21.0 | 14.0 | 12.7 |
| claude-3-7-sonnet-latest | 24.7 | 23.0 | 15.9 |
| deepseek-r1-distill-llama-70b | 29.6 | 25.0 | 19.0 |
| gemini-1.5-pro | 17.3 | 16.0 | 7.9 |
| deepseek-r1-distill-qwen-32b | 17.3 | 11.0 | 17.5 |
| llama-3.3-70b-versatile | 29.6 | 22.0 | 17.5 |
| gemini-2.0-flash | 30.9 | 26.0 | 15.9 |
| claude-3-5-haiku-latest | 21.0 | 15.0 | 9.5 |
| o3-mini | 35.8 | 39.0 | 25.4 |
| o1 | 48.1 | 47.0 | 39.7 |
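Given such a table, the "best model per task" reading used throughout this page is a simple arg-max over the benchmark scores. A sketch using a subset of the values above (`best_per_task` is an illustrative name, not part of any published code):

```python
# Sketch: find the strongest model per task from benchmark scores
# (accuracy in %, values transcribed from the table on this page).
scores = {
    "gpt-4.5-preview": {"Treatment": 43.2, "Diagnosis": 32.0, "Basic Science": 20.6},
    "o3-mini":         {"Treatment": 35.8, "Diagnosis": 39.0, "Basic Science": 25.4},
    "o1":              {"Treatment": 48.1, "Diagnosis": 47.0, "Basic Science": 39.7},
}

def best_per_task(scores):
    """Return {task: model_name with the highest score on that task}."""
    tasks = next(iter(scores.values())).keys()
    return {t: max(scores, key=lambda m: scores[m][t]) for t in tasks}

print(best_per_task(scores))  # o1 leads every task in this subset
```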

02

By body system

Model strengths diverge sharply depending on the clinical system involved.
Examples:

  • Lymphatic: O3-MINI reaches 62.5% (top model result)

  • Digestive: O1 peaks at 59.09% (best system for O1)

  • Endocrine: GPT-4.5-PREVIEW achieves 53.85%

  • Respiratory: O3-MINI at 42.9%


All Models performance by body system (accuracy, %)

| Model | Lymphatic | Endocrine | Nervous | Urinary | Reproductive | Other/NA | Skeletal | Digestive | Respiratory | Cardiovascular | Integumentary | Muscular |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-4.5-preview | 25.0 | 53.8 | 47.1 | 27.3 | 28.0 | 40.0 | 31.6 | 27.3 | 38.1 | 27.8 | 33.3 | 11.1 |
| med lm | 25.0 | 15.4 | 17.6 | 27.3 | 16.0 | 20.0 | 21.1 | 9.1 | 23.8 | 8.3 | 0.0 | 5.6 |
| llama3-70b-8192 | 62.5 | 30.8 | 20.6 | 27.3 | 28.0 | 26.7 | 28.9 | 22.7 | 14.3 | 11.1 | 0.0 | 11.1 |
| gpt-4o | 37.5 | 30.8 | 20.6 | 27.3 | 20.0 | 40.0 | 21.1 | 27.3 | 33.3 | 22.2 | 0.0 | 5.6 |
| gpt-4o-mini | 37.5 | 0.0 | 8.8 | 18.2 | 32.0 | 13.3 | 18.4 | 13.6 | 23.8 | 13.9 | 0.0 | 5.6 |
| claude-3-7-sonnet-latest | 25.0 | 38.5 | 35.3 | 18.2 | 12.0 | 26.7 | 21.1 | 18.2 | 19.0 | 22.2 | 0.0 | 5.6 |
| deepseek-r1-distill-llama-70b | 25.0 | 30.8 | 32.4 | 27.3 | 24.0 | 20.0 | 23.7 | 22.7 | 9.5 | 33.3 | 33.3 | 16.7 |
| gemini-1.5-pro | 25.0 | 15.4 | 17.6 | 18.2 | 12.0 | 20.0 | 23.7 | 18.2 | 14.3 | 2.8 | 0.0 | 0.0 |
| deepseek-r1-distill-qwen-32b | 25.0 | 23.1 | 23.5 | 27.3 | 12.0 | 13.3 | 7.9 | 9.1 | 14.3 | 16.7 | 0.0 | 5.6 |
| llama-3.3-70b-versatile | 37.5 | 38.5 | 17.6 | 36.4 | 28.0 | 20.0 | 26.3 | 31.8 | 14.3 | 19.4 | 0.0 | 11.1 |
| gemini-2.0-flash | 12.5 | 23.1 | 32.4 | 18.2 | 32.0 | 26.7 | 31.6 | 27.3 | 19.0 | 19.4 | 33.3 | 11.1 |
| claude-3-5-haiku-latest | 0.0 | 7.7 | 14.7 | 36.4 | 20.0 | 20.0 | 21.1 | 13.6 | 9.5 | 16.7 | 0.0 | 5.6 |
| o3-mini | 62.5 | 46.2 | 35.3 | 36.4 | 44.0 | 20.0 | 26.3 | 36.4 | 42.9 | 33.3 | 33.3 | 16.7 |
| o1 | 50.0 | 38.5 | 55.9 | 27.3 | 56.0 | 40.0 | 36.8 | 59.1 | 42.9 | 41.7 | 33.3 | 44.4 |

Standout specialty-specific wins

Some task–specialty combinations reach exceptional accuracy (>90th percentile):

| Task + Specialty | Best Model | Accuracy |
| --- | --- | --- |
| Diagnosis – Lymphatic | O3-MINI | 100% |
| Basic Science – Reproductive | GPT-4.5-PREVIEW | 100% |
| Treatment – Digestive | O1 | 83.33% |
| Treatment – Reproductive | O1 | 70% |
| Diagnosis – Endocrine | GPT-4.5-PREVIEW | 70% |

This illustrates how specialist-focused models or ensembles can outperform generalist models in meaningful clinical niches.
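The consensus-of-experts idea in the paper's title can be sketched as weighted voting over member models' answers, with weights drawn from each model's benchmarked strength in the question's specialty. Everything here is illustrative; no vendor API or actual Sully implementation is assumed:

```python
# Minimal sketch of a consensus ensemble for multiple-choice medical QA:
# each member model casts a vote, optionally weighted (e.g. by its
# benchmarked accuracy in the relevant specialty). Names are illustrative.
from collections import Counter

def consensus_answer(votes, weights=None):
    """votes: {model_name: answer}; weights: {model_name: float} (default 1.0)."""
    tally = Counter()
    for model, answer in votes.items():
        tally[answer] += (weights or {}).get(model, 1.0)
    return tally.most_common(1)[0][0]

votes = {"model_a": "B", "model_b": "B", "model_c": "C"}
print(consensus_answer(votes))                    # "B" by simple majority
print(consensus_answer(votes, {"model_c": 3.0}))  # "C" once model_c is up-weighted
```

Up-weighting the specialty expert is what lets an ensemble beat a simple majority in niches where one model dominates.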

Cross-vendor comparison

  • Groq-hosted models show competitive recall

  • OpenAI models demonstrate strong accuracy in their higher-tier models

  • Claude models display consistent but mid-range performance

  • Together-hosted DeepSeek distills peak in recall in limited contexts

Key takeaways

No model is universally strong; specialty strengths vary widely.

Consensus ensembles outperform single models in specialized domains.

Models reason better than they understand medical content.

Basic science knowledge is a shared weakness across models.

Confidence does not equal accuracy; calibration is still a major challenge.

Vendor performance varies, leaving major gaps in clinical reliability.
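The calibration gap noted above is usually quantified with expected calibration error (ECE): bin predictions by stated confidence and measure how far average confidence drifts from actual accuracy in each bin. A minimal sketch with the standard equal-width binning (function name is illustrative):

```python
# Sketch: expected calibration error over a set of graded answers.
# confidences: model-stated probabilities in [0, 1]; correct: 0/1 labels.
def expected_calibration_error(confidences, correct, n_bins=10):
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Half-open bins (lo, hi], with 0.0 assigned to the first bin.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        acc = sum(correct[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(avg_conf - acc)
    return ece

# A model that is 90% confident but only 50% accurate is poorly calibrated:
print(expected_calibration_error([0.9, 0.9], [1, 0]))  # 0.4
```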
