Model strengths vary by medical domains.

A cross-specialty benchmark analysis showing how leading large language models differ in accuracy, reasoning, and specialization across medical tasks and body systems.

Second opinion matters: Towards adaptive clinical AI via the consensus of expert model ensemble

Introduction

Medical AI models do not perform uniformly across clinical domains. Each model exhibits distinct strengths depending on the specialty area, task type, and underlying medical complexity.

This page compiles a comprehensive view of model performance across medical tasks, body systems, and question types, based on Sully’s benchmarking of leading LLMs.

Why domain-level variation exists

Even state-of-the-art models show substantial variability across specialties due to differences in:

  • Depth of domain-specific medical knowledge

  • Requirements for multi-step reasoning vs. factual recall

  • Model design (generalist vs. domain-trained)

  • Variation in training corpora and clinical content familiarity

Performance

01

By medical task (Diagnosis, treatment, basic science)

Benchmarks reveal significant differences across task types:

  • Diagnosis: Best handled by O1 and O3-MINI

  • Treatment: GPT-4.5-PREVIEW shows specialty strength

  • Basic Science: Few models exceed 30% accuracy; these tasks remain the hardest

02

By body system

Model strengths diverge sharply depending on the clinical system involved.
Examples:

  • Lymphatic: O3-MINI reaches 62.5% (top model result)

  • Digestive: O1 peaks at 59.09% (best system for O1)

  • Endocrine: GPT-4.5-PREVIEW achieves 53.85%

  • Respiratory: O3-MINI at 42.9%

Standout specialty-specific wins

Some model-specialty combinations reach exceptional accuracy (above the 90th percentile).

This illustrates how specialist-focused models or ensembles can outperform generalist models in meaningful clinical niches.
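One simple way to turn specialty-specific wins into an ensemble, in the spirit of the consensus approach the paper's title describes, is an accuracy-weighted vote: each model's answer counts in proportion to its benchmarked accuracy for the relevant specialty. The sketch below is illustrative only — the `SPECIALTY_ACCURACY` table, its numbers, and the model keys are assumptions for demonstration, not figures from the benchmark.

```python
from collections import defaultdict

# Hypothetical per-(model, specialty) accuracy table; numbers are
# placeholders loosely echoing the kinds of figures quoted above.
SPECIALTY_ACCURACY = {
    ("o3-mini", "lymphatic"): 0.625,
    ("o1", "lymphatic"): 0.55,
    ("gpt-4.5-preview", "lymphatic"): 0.50,
}

def consensus_answer(answers, specialty):
    """Accuracy-weighted vote: each model's answer is weighted by its
    benchmarked accuracy on this specialty; highest total wins."""
    scores = defaultdict(float)
    for model, answer in answers.items():
        # Unknown (model, specialty) pairs fall back to an uninformative weight.
        weight = SPECIALTY_ACCURACY.get((model, specialty), 0.5)
        scores[answer] += weight
    return max(scores, key=scores.get)

votes = {"o3-mini": "B", "o1": "B", "gpt-4.5-preview": "A"}
print(consensus_answer(votes, "lymphatic"))  # "B": 1.175 weighted votes vs. 0.50
```

With per-specialty weights, a model that is mediocre overall can still dominate the vote inside the niche where it excels, which is the behavior the specialty-specific results suggest you want.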

Cross-vendor comparison

  • Groq models show competitive recall

  • OpenAI models demonstrate strong accuracy in their higher-tier models

  • Claude models display consistent but mid-range performance

  • Together (Deepseek-distill) shows strong recall in limited contexts

Key takeaways

01

No model is universally strong; specialty strengths vary widely.

02

Ensembles or consensus-based systems outperform single-model setups in specialized tasks.

03

Reasoning tasks are generally easier for models than understanding tasks.

04

Basic science knowledge is a shared weakness across models.

05

Confidence is not equal to accuracy; calibration is still a major challenge.
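The gap between confidence and accuracy can be quantified with a standard metric such as expected calibration error (ECE): bin predictions by stated confidence, then take the weighted average gap between mean confidence and observed accuracy in each bin. This is a generic sketch of that metric, not the paper's evaluation code, and the toy data below is invented for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average |mean confidence - accuracy| over confidence bins."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bins are (lo, hi]; confidence 0.0 falls in the first bin.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - accuracy)
    return ece

# Toy example: a model that is 90% confident but only 50% correct
# is badly miscalibrated even if its confidence "sounds" reasonable.
conf = [0.9, 0.9, 0.9, 0.9]
hits = [1, 0, 1, 0]
print(round(expected_calibration_error(conf, hits), 2))  # 0.4
```

A perfectly calibrated model scores an ECE of zero; the larger the score, the more its stated confidence overstates (or understates) how often it is actually right.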

06

Vendor performance varies, but even top models leave large gaps in clinical reliability.

Download the full academic paper.
Ready for the future of healthcare?