How Sully AI built a multi-expert AI system that outperforms the world's best models at clinical reasoning.

Amit Kumthekar
Clinical AI is hitting a wall. A single model can be deprecated overnight, costs can shift without warning, and even the best LLMs still get complex medical reasoning wrong, which is the kind of reasoning that matters most at the bedside.
Inside this paper, you'll get the research every clinical AI team needs to understand what's possible today:
Why Sully's Consensus Mechanism outperforms OpenAI's o3 and Google's Gemini 2.5 Pro across every major medical benchmark tested
How a triage-inspired architecture routes queries to specialist AI experts — mirroring the way real clinical teams actually make decisions
The real accuracy gap: 61.2% vs. 53.5% on MedXpertQA, the benchmark specifically designed for complex clinical reasoning (not just medical trivia)
Why single models are systematically overconfident — and how ensemble consensus produces calibration you can actually trust in high-stakes settings
How the modular design lets you swap in newer models without rebuilding — so your clinical AI doesn't go stale every six months
If you're building, deploying, or evaluating AI for clinical decision support, this is the architecture paper you need to read first.
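To make the routing and consensus ideas above concrete, here is a minimal sketch in Python. It is not Sully's implementation: the keyword-based `triage` function, the simple majority vote, and every name in it (`Expert`, `ConsensusResult`, `consult`, the specialty labels, the stub experts) are assumptions chosen for illustration. In a real system each expert would wrap a model call, and the paper's Consensus Mechanism may aggregate answers quite differently.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical: an "expert" is any callable mapping a question to an answer label.
Expert = Callable[[str], str]


@dataclass
class ConsensusResult:
    answer: str            # the most common answer among consulted experts
    agreement: float       # fraction of consulted experts that gave `answer`
    votes: Dict[str, str]  # expert name -> that expert's answer


def triage(question: str, keywords: Dict[str, List[str]]) -> List[str]:
    """Pick specialties whose keywords occur in the question.

    A toy rule-based router (an assumption, standing in for a learned one).
    Falls back to a "general" expert when nothing matches.
    """
    q = question.lower()
    matched = [spec for spec, words in keywords.items()
               if any(w in q for w in words)]
    return matched or ["general"]


def consult(question: str,
            experts: Dict[str, Expert],
            keywords: Dict[str, List[str]]) -> ConsensusResult:
    """Route the question to the selected experts and take a majority vote."""
    selected = [name for name in triage(question, keywords) if name in experts]
    if not selected:
        raise ValueError("no registered expert for the selected specialties")
    votes = {name: experts[name](question) for name in selected}
    answer, count = Counter(votes.values()).most_common(1)[0]
    return ConsensusResult(answer, count / len(votes), votes)


# Stub experts with fixed answers; real ones would call different models.
keywords = {
    "cardiology": ["chest pain", "ecg"],
    "pulmonology": ["dyspnea", "chest pain"],
    "neurology": ["headache"],
}
experts: Dict[str, Expert] = {
    "cardiology": lambda q: "B",
    "pulmonology": lambda q: "B",
    "neurology": lambda q: "A",
    "general": lambda q: "C",
}

question = "Patient with chest pain and headache; what is the best next step?"
result = consult(question, experts, keywords)
print(result.answer, round(result.agreement, 2), result.votes)

# Swapping in a newer expert is a dictionary update; routing and voting are untouched.
experts["neurology"] = lambda q: "B"
updated = consult(question, experts, keywords)
print(updated.answer, updated.agreement)
```

In this sketch the agreement fraction acts as a rough confidence signal: a split vote is a cue to defer or escalate rather than to trust one model's self-reported certainty. Whether that signal is well calibrated would have to be measured against held-out labeled questions.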
Dive Deeper
Download the full paper for more information about this article.