WHITEPAPER

29 May 2025

The Consensus Mechanism

Second Opinion Matters: Towards Adaptive Clinical AI via the Consensus of Expert Model Ensemble

Amit Kumthekar & Bhargav Patel MD

Synopsis

The Consensus Mechanism we present delivers measurable improvements in accuracy, diagnostic precision, and calibration, significantly outperforming leading standalone SoTA models across all benchmarks. These gains highlight the potential of an expert-model consensus architecture to enable more reliable, adaptable, and accurate clinical AI systems.

Accuracy (%) by benchmark and model:

Benchmark    Consensus Mechanism   2.5 Pro   o3-high   o4-mini-high   3.7 Sonnet
MedXpertQA   61.17                 45.90     53.00     43.20          22.00
MedMCQA      91.40                 82.00     82.30     80.20          73.60
MedQA        96.80                 93.40     92.00     92.10          73.60


Significant Increase in Clinical Reasoning Performance

Achieved 61.2% accuracy on the MedXpertQA benchmark, outperforming OpenAI’s o3-high by 8.2 percentage points and Google’s Gemini 2.5 Pro by 17.1 points.

Superior Diagnostic Accuracy

Improved differential-diagnosis precision, recall, and F1 score (0.326 vs. 0.2866 for o3-high), with a +6.8% gain in top-1 differential diagnosis accuracy.

Powering Safer Clinical Decision Making through Improved Model Calibration

Improved calibration, displaying closer alignment between confidence and correctness, reducing overconfidence compared to SoTA models and increasing clinical trustworthiness.

The Fragility of Clinical AI Systems

The past few years have seen an accelerated adoption of large language models and generative artificial intelligence into clinical practice. Unfortunately, modern clinical AI systems often hinge on a single large language model (LLM), which creates a fragile dependency.

A lone model can quickly become outdated or suddenly less accessible due to rapid advances and changing costs or policies. In healthcare, this is especially problematic: even as LLMs improve on medical exams, they still struggle with the nuanced, complex reasoning needed for real patient care. Key contextual cues and probabilistic reasoning, ways of thinking at which human clinicians excel, remain challenging for current single-model AI systems. This gap highlights the need for a more resilient and sophisticated approach. Instead of betting everything on one model, what if we could combine the strengths of many? Recent approaches in AI and the tradition of expert committees in medicine both point toward ensemble methods: using multiple specialized “experts” that deliberate and reach a better answer together.

The goal is to design an architecture that stays up-to-date, handles complex reasoning, and adapts as new models emerge, all while maintaining the privacy and reliability crucial in clinical settings.

Introducing the Consensus Mechanism

Overview

The Consensus Mechanism (MedCon-1) is introduced as an adaptive, ensemble-based framework for clinical AI. In essence, it functions like a virtual panel of medical specialists consulting on a case. Rather than relying on a single monolithic model, the system brings multiple expert models to the table, each with domain-specific strengths, and has them work together on a problem. The design draws inspiration from how real clinicians might hold a case conference or “roundtable,” where each specialist provides input before arriving at a collective decision. By orchestrating these expert agents, the Consensus Mechanism generates answers that are more robust and contextually nuanced than any individual model could achieve alone. It bridges methods from chain-of-thought reasoning and mixture-of-experts modeling: complex questions are broken down and delegated to specialized minds, then their insights are synthesized into one conclusion. The result is a flexible architecture that can evolve over time, integrating new and better models as they become available, without being locked into any single model’s limitations.

How We Designed the Consensus Mechanism

The Consensus Mechanism is structured as a three-stage pipeline: triage, expert response, and final aggregation. A triage model first analyzes the input and determines which medical specialties are relevant. Based on this, it selects the appropriate expert models to address the task. Each expert model independently evaluates the query from its specialty perspective and outputs a probability distribution over possible answers. These responses are then passed to a final consensus model, which integrates them to produce a unified result.
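The three-stage flow described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `triage`, `run_expert`, and `consensus` functions stub out what would, in practice, be calls to the underlying models.

```python
from dataclasses import dataclass

@dataclass
class ExpertResponse:
    specialty: str
    rationale: str
    probs: dict  # probability distribution over answer options

def triage(question):
    """Stage 1 (stub): a triage model would inspect the question and
    return the relevant medical specialties."""
    return ["cardiology", "pulmonology"]

def run_expert(specialty, question, options):
    """Stage 2 (stub): each expert model evaluates the question from its
    specialty's perspective and returns a full distribution over options."""
    uniform = 1.0 / len(options)
    return ExpertResponse(specialty, f"{specialty} rationale...",
                          {o: uniform for o in options})

def consensus(responses, options):
    """Stage 3 (stub): aggregate the expert distributions and pick the top
    option (the actual consensus model also reads the rationales)."""
    pooled = {o: sum(r.probs[o] for r in responses) / len(responses)
              for o in options}
    return max(pooled, key=pooled.get)

options = ["A", "B", "C", "D"]
question = "A 65-year-old presents with acute dyspnea..."
experts = [run_expert(s, question, options) for s in triage(question)]
answer = consensus(experts, options)
```

Because each stage only exchanges plain data (specialty names, rationales, and probability dictionaries), any stage can be swapped for a different model without touching the others, which is what makes the pipeline modular.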


Unlike simple ensembling or voting, the consensus model interprets the expert rationales and aggregated probabilities to make a final determination. This adds a layer of clinical-style reasoning on top of the expert outputs. The entire pipeline is modular and can be adjusted for speed, cost, or accuracy depending on use case.

An Expert-Based Ensemble

Each expert in the system is a distinct, publicly available model configured to focus on a specific medical specialty. The triage model determines which specialties are needed based on the input task. This task is then routed to the relevant experts, who analyze it using domain-specific reasoning. Each expert generates its answer and corresponding probability distribution independently, allowing the consensus architecture to capture multiple clinical viewpoints. This was designed to reduce the risk of blind spots from a single model, and allow for the flexible combination of multiple specialists.

The expert-based design mirrors the multi-dimensional nature of clinical reasoning by distributing the reasoning process across multiple independent agents, enabling more granular analysis and broader diagnostic coverage. It builds on foundational principles from chain-of-thought and mixture-of-experts modeling, but extends them into a modular, multi-model configuration. This approach enables the system to reason more like a real clinical team, synthesizing insights rather than relying on internal model abstractions. Ultimately, it offers a scalable way to simulate structured, interdisciplinary clinical decision-making in AI systems.

Leveraging the Probabilistic Nature of Medicine

Rather than outputting only a top choice, each expert model returns a complete probability distribution over all possible answers. These distributions are merged using a weighted log opinion pool (WLOP), which balances expert agreement and confidence. To account for the inherent nuances in clinical reasoning, we applied a boosting function that rewards answers which may not be ranked as the most likely but appear frequently across expert rankings. Cascade boosting employs customizable weights that control the significance of a boost across ranks in a probability distribution. This approach, when combined with a weighted opinion pool, helps minimize overconfidence while enabling a tailored probabilistic approach.
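As a concrete sketch, the pooling step might look like the following. The WLOP formula (a weighted geometric mean of the expert distributions) is standard; the `rank_weights` values and the exact form of the cascade boost are illustrative assumptions, since the text does not specify them here.

```python
import math

def wlop(expert_probs, expert_weights):
    """Weighted log opinion pool: pool as exp(sum_i w_i * log p_i(a)),
    then normalize over the answer options."""
    options = expert_probs[0].keys()
    pooled = {
        a: math.exp(sum(w * math.log(max(p[a], 1e-12))
                        for p, w in zip(expert_probs, expert_weights)))
        for a in options
    }
    z = sum(pooled.values())
    return {a: v / z for a, v in pooled.items()}

def cascade_boost(expert_probs, pooled, rank_weights=(0.10, 0.05, 0.02)):
    """Add a small, rank-dependent bonus each time an option appears in an
    expert's top ranks, so consistently high-ranked options gain support
    even if no single expert ranked them first. Renormalizes at the end."""
    boosted = dict(pooled)
    for p in expert_probs:
        ranked = sorted(p, key=p.get, reverse=True)
        for rank, a in enumerate(ranked[:len(rank_weights)]):
            boosted[a] += rank_weights[rank]
    z = sum(boosted.values())
    return {a: v / z for a, v in boosted.items()}

# Three hypothetical expert distributions over options A, B, C.
experts = [
    {"A": 0.6, "B": 0.3, "C": 0.1},
    {"A": 0.2, "B": 0.5, "C": 0.3},
    {"A": 0.5, "B": 0.4, "C": 0.1},
]
pooled = wlop(experts, expert_weights=[0.4, 0.3, 0.3])
final = cascade_boost(experts, pooled)
```

In this toy example, option A is ranked first by two of three experts, so it keeps the largest share of the final distribution, while B (consistently ranked second) retains substantial probability mass rather than being discarded by a simple majority vote.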

The result is a single, calibrated distribution that reflects the strength of support for each option. This probability profile, along with expert rationales, is passed to the final consensus model. The system therefore retains uncertainty when appropriate, and avoids overconfident conclusions. This design mirrors how clinicians assess likelihoods and weigh multiple diagnoses, ultimately improving decision reliability and transparency.

This design choice reflects the reality that clinical decision-making often involves uncertainty, making confidence levels critical for interpretation. By preserving and weighting expert uncertainty, the system offers a transparent, probabilistic overview of potential answers. It rewards consistent agreement while moderating outlier influence, producing a calibrated distribution that mirrors how clinicians assess likelihoods. This enables the consensus architecture to provide more robust and context-aware clinician support. The inclusion of probabilistic reasoning not only improves interpretability but also enhances safety, preventing overconfident conclusions in unsupported or ambiguous clinical scenarios.

Implications of the Consensus Design

Modular System Reduces Risk of Single-Model Dependence

Transparency in Probability Modeling

Designed as a Clinical Decision Support Tool, Not a Replacement

Results

Improvements Across All Benchmarks

The Consensus Mechanism outperformed all comparator models across three major medical QA benchmarks: MedXpertQA, MedQA, and MedMCQA. On MedXpertQA, designed for complex clinical reasoning, it achieved 61.2% accuracy, surpassing o3-high (53.0%) and Gemini 2.5 Pro (44.1%). Accuracy gains also held for MedQA (96.8%) and MedMCQA (94.2%) despite those tasks being less reasoning-intensive. The model showed especially strong performance on diagnostic and treatment questions, improving over single-model baselines by more than 10% in some cases.

Superpowering Accurate Differential Diagnosis

In differential diagnosis tasks using the DDX+ dataset, the Consensus Mechanism consistently outperformed leading models across all evaluation metrics. It achieved a higher F1 score (0.326 vs. 0.2866 for o3-high), alongside improved precision and recall, indicating a better balance between identifying correct diagnoses and minimizing false positives. Top-1 diagnostic accuracy also increased by nearly 7%, highlighting its ability to prioritize the most likely diagnosis more effectively.

These performance advantages persisted across all top-K thresholds, especially in lower K values where precision is critical. These gains highlight how the system’s expert-based decomposition and probabilistic aggregation contribute to stronger diagnostic performance. By capturing domain-specific reasoning from multiple specialists and weighting their outputs based on confidence and frequency, the system constructs a differential that better reflects real-world clinical deliberation.
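The top-K evaluation described above reduces to checking whether the true diagnosis appears within the first K entries of the ranked differential. A small sketch (the case data below is invented for illustration):

```python
def top_k_accuracy(ranked_ddx_lists, true_dx_list, k):
    """Fraction of cases whose true diagnosis appears in the top K
    entries of the predicted (ranked) differential."""
    hits = sum(truth in ranked[:k]
               for ranked, truth in zip(ranked_ddx_lists, true_dx_list))
    return hits / len(true_dx_list)

# Three invented cases, each with true diagnosis "MI".
preds = [
    ["MI", "PE", "GERD"],       # hit at rank 1
    ["PE", "MI", "Pneumonia"],  # hit at rank 2
    ["GERD", "PE", "MI"],       # hit at rank 3
]
truths = ["MI", "MI", "MI"]
top1 = top_k_accuracy(preds, truths, k=1)
top3 = top_k_accuracy(preds, truths, k=3)
```

Low K values are the stricter test: top-1 credits only the single leading diagnosis, which is why gains at small K matter most clinically.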

This design reduces overreliance on any single perspective, mitigates individual model blind spots, and prioritizes consensus-driven predictions. As a result, the system not only improves diagnostic accuracy but also produces outputs that are better aligned with how clinicians approach decision-making in complex cases.

Improved Reliability and Calibration

Calibration analysis showed that the Consensus Mechanism’s predicted confidence levels closely matched actual accuracy, particularly in high-confidence intervals. As demonstrated in the plot, its reliability curve remains near the ideal diagonal, indicating well-calibrated outputs. In contrast, baseline models such as o4-mini exhibit significant overconfidence, overstating accuracy in critical ranges.

This alignment between confidence and correctness means clinicians can trust the system’s probability estimates when making decisions. Reliable calibration enhances safety by reducing false certainty and promoting appropriate caution, especially in high-stakes scenarios. These findings reinforce the system’s core strengths: trustworthy reasoning, interpretability, and clinical applicability.

Where do we go from here?

The Consensus Mechanism we present delivers measurable improvements in accuracy, diagnostic precision, and calibration, significantly outperforming leading standalone SoTA models across all benchmarks. These gains highlight the potential of an expert-model consensus architecture to enable more reliable, adaptable, and accurate clinical AI systems.

Conclusion

The Consensus Mechanism offers a modular, expert-based architecture that addresses core limitations of single-model clinical AI systems. Its design mirrors real-world clinical workflows: synthesizing insights from multiple specialists, weighing uncertainty, and delivering probability-calibrated outputs. These features align it closely with how clinicians think and make decisions, offering practical advantages over traditional single-model systems. The approach improves both performance and interpretability, achieving state-of-the-art results across clinical benchmarks while supporting safer and more adaptive decision-making.

In addition to performing well as a standalone diagnostic tool, the Consensus Mechanism functions as a trustworthy second opinion. It was designed to be transparent, explainable, and capable of handling diagnostic complexity without overconfidence. For healthcare organizations, it offers a scalable, future-proof solution that reduces rigid dependence on outdated models while ensuring adaptability as new models emerge. As clinical AI continues to evolve, this framework provides a reliable foundation for building systems that deliver high accuracy while remaining interpretable, trustworthy, and actionable for clinicians.

Dive Deeper

Download the full academic paper for more information about our research.