The Consensus Mechanism

How Sully AI built a multi-expert AI system that outperforms the world's best models at clinical reasoning.

May 20, 2026

Amit Kumthekar

Clinical AI is hitting a wall. A single model can be deprecated overnight, costs shift without warning, and even the best LLMs still get complex medical reasoning wrong — the kind that matters most at the bedside.

Inside this paper, you'll get the research every clinical AI team needs to understand what's possible today:

Why Sully's Consensus Mechanism outperforms OpenAI's o3 and Google's Gemini 2.5 Pro across every major medical benchmark tested
How a triage-inspired architecture routes queries to specialist AI experts — mirroring the way real clinical teams actually make decisions
The real accuracy gap: 61.2% vs. 53.5% on MedXpertQA, a 7.7-percentage-point difference on the benchmark specifically designed for complex clinical reasoning (not just medical trivia)
Why single models are systematically overconfident — and how ensemble consensus produces calibration you can actually trust in high-stakes settings
How the modular design lets you swap in newer models without rebuilding — so your clinical AI doesn't go stale every six months
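
To make the routing and consensus ideas in the list above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Sully's implementation: the keyword-based `triage` function, the majority vote in `consult`, the expert names, and the stub experts (plain functions that return a fixed answer label) are all hypothetical stand-ins. A real system would call actual models, and the paper may combine their answers quite differently.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical expert interface: any callable that maps a question to an answer label.
Expert = Callable[[str], str]


@dataclass
class Consensus:
    answer: str            # the most common answer among consulted experts
    agreement: float       # fraction of consulted experts that gave `answer`
    votes: Dict[str, int]  # raw vote counts per answer


def triage(question: str, routes: Dict[str, List[str]], default: List[str]) -> List[str]:
    """Select expert names whose trigger keyword appears in the question.

    Keyword matching is an assumption made for brevity; it stands in for
    whatever routing logic a production system would use.
    """
    text = question.lower()
    selected: List[str] = []
    for keyword, names in routes.items():
        if keyword in text:
            for name in names:
                if name not in selected:
                    selected.append(name)
    # Fall back to the default experts when no keyword matches.
    return selected or list(default)


def consult(question: str, experts: Dict[str, Expert],
            routes: Dict[str, List[str]], default: List[str]) -> Consensus:
    """Route the question, collect one answer per expert, and take a majority vote."""
    names = triage(question, routes, default)
    answers = [experts[name](question) for name in names]
    votes = Counter(answers)
    answer, count = votes.most_common(1)[0]
    return Consensus(answer=answer, agreement=count / len(answers), votes=dict(votes))


# Stub experts with fixed answers, purely for illustration.
experts: Dict[str, Expert] = {
    "cardiology": lambda q: "B",
    "pharmacology": lambda q: "B",
    "generalist": lambda q: "C",
}
routes = {"chest pain": ["cardiology", "generalist"], "dose": ["pharmacology"]}
default = ["generalist"]

result = consult("Chest pain after a missed dose of metoprolol", experts, routes, default)
print(result.answer, round(result.agreement, 2), result.votes)
```

Because the experts are looked up by name in a dictionary, replacing one with a newer model means changing a single entry, which is the kind of modularity the last item in the list describes. The `agreement` fraction is only a crude proxy for confidence: low agreement can flag a question for human review, but it is not a calibrated probability.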

If you're building, deploying, or evaluating AI for clinical decision support, this is the architecture paper you need to read first.

Dive Deeper

Download the full paper for more information about this article.

Ready for the future of healthcare?

Book a demo