The consensus
mechanism
Second opinion matters: Towards adaptive clinical AI via the consensus of expert model ensemble
Synopsis
Our consensus mechanism shows significant improvements in accuracy and diagnostic precision, outperforming top standalone models across all benchmarks. This demonstrates the potential of an expert-model consensus architecture for creating more reliable and adaptable clinical AI systems.

Significant increase in clinical reasoning performance
Achieved 61.2% accuracy on the MedXpertQA benchmark, outperforming OpenAI’s O3-high by 8.2 percentage points and Google’s Gemini 2.5 Pro by 17.1 percentage points.

Superior diagnostic accuracy
Improved differential diagnosis precision, recall, and F1-score (F1 = 0.326) over O3-high (F1 = 0.2866), with a +6.8% boost in top-1 differential diagnosis accuracy.

Powering safer clinical reasoning through improved model calibration
Improved calibration, with closer alignment between confidence and correctness, reducing overconfidence compared to SoTA models and increasing clinical trustworthiness.



The fragility of clinical
AI systems
LLMs are moving fast into clinical care, but single-model systems are fragile, quickly outdated, policy-sensitive, and weak at nuanced, probabilistic reasoning. A better path is ensembles: multiple specialized “experts” that deliberate, as in medical committees. Our goal is an architecture that remains current, reasons well on complex cases, adapts to new models, and safeguards privacy and reliability.

The dangers of overreliance on a single model
Clinical AI systems that rely on one model are brittle, vulnerable to shifts in access, cost, or performance over time.

Gaps in real-world clinical reasoning
Even top-performing models often fail at complex, real-world medical reasoning tasks where nuance and uncertainty matter most.

Inability to evolve with new models
Without modularity, clinical AI systems cannot keep pace with rapid model innovation or adapt to newer, better tools.
Introducing the consensus mechanism
Overview
How we designed the consensus mechanism
The consensus mechanism (MedCon-1) is an adaptive, ensemble-based framework for clinical AI. In essence, it functions like a virtual panel of medical specialists consulting on a case. Rather than relying on a single monolithic model, the system brings multiple expert models to the table, each with domain-specific strengths, and has them work together on a problem. The design draws inspiration from how real clinicians might hold a case conference or “roundtable,” where each specialist provides input before the group arrives at a collective decision. By orchestrating these expert agents, the consensus mechanism generates answers that are more robust and contextually nuanced than any individual model could achieve alone.
It bridges methods from chain-of-thought reasoning and mixture-of-experts modeling: complex questions are broken down and delegated to specialized minds, then their insights are synthesized into one conclusion. The result is a flexible architecture that can evolve over time, integrating new and better models as they become available, without being locked into any single AI model’s limitations.
The consensus mechanism is structured as a three-stage pipeline: triage, expert response, and final aggregation. A triage model first analyzes the input and determines which medical specialties are relevant. Based on this, it selects the appropriate expert models to address the task. Each expert model independently evaluates the query from its specialty perspective and outputs a probability distribution over possible answers. These responses are then passed to a final consensus model, which integrates them to produce a unified result.
Unlike simple ensembling or voting, the consensus model interprets the expert rationales and aggregated probabilities to make a final determination. This adds a layer of clinical-style reasoning on top of the expert outputs. The entire pipeline is modular and can be adjusted for speed, cost, or accuracy depending on use case.
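The three stages described above can be sketched as plain function composition. This is a minimal illustration, not the MedCon-1 implementation: the type aliases, `run_pipeline`, and the toy stand-ins are all hypothetical names, and the toy consensus step simply averages distributions, whereas the consensus model described above also interprets the expert rationales.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Distribution = Dict[str, float]  # answer option -> probability


@dataclass
class ExpertOpinion:
    specialty: str
    distribution: Distribution
    rationale: str


# Hypothetical stage signatures; real stages would wrap model calls.
Triage = Callable[[str], List[str]]                    # question -> specialties
Expert = Callable[[str], ExpertOpinion]                # question -> opinion
Consensus = Callable[[str, List[ExpertOpinion]], str]  # question, opinions -> answer


def run_pipeline(question: str, triage: Triage,
                 experts: Dict[str, Expert], consensus: Consensus) -> str:
    """Stage 1: triage. Stage 2: independent experts. Stage 3: aggregation."""
    specialties = [s for s in triage(question) if s in experts]
    if not specialties:
        raise ValueError("triage selected no available expert")
    opinions = [experts[s](question) for s in specialties]
    return consensus(question, opinions)


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
def toy_triage(question: str) -> List[str]:
    return ["cardiology", "pulmonology"]


def toy_cardiology(question: str) -> ExpertOpinion:
    return ExpertOpinion("cardiology", {"A": 0.7, "B": 0.3}, "ECG findings favor A")


def toy_pulmonology(question: str) -> ExpertOpinion:
    return ExpertOpinion("pulmonology", {"A": 0.4, "B": 0.6}, "Imaging favors B")


def toy_consensus(question: str, opinions: List[ExpertOpinion]) -> str:
    # Placeholder aggregation: average the distributions and pick the top option.
    options = opinions[0].distribution
    mean = {o: sum(op.distribution[o] for op in opinions) / len(opinions)
            for o in options}
    return max(mean, key=mean.get)


answer = run_pipeline(
    "Chest pain with dyspnea?",
    toy_triage,
    {"cardiology": toy_cardiology, "pulmonology": toy_pulmonology},
    toy_consensus,
)
```

Because each stage is passed in as a callable, any one of them can be replaced without touching the others, which is the kind of modularity the pipeline description relies on.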



An expert-based ensemble
Each expert in the system is a distinct, publicly available model configured to focus on a specific medical specialty. The triage model determines which specialties are needed based on the input task. The task is then routed to the relevant experts, which analyze it using domain-specific reasoning. Each expert generates its answer and corresponding probability distribution independently, allowing the consensus architecture to capture multiple clinical viewpoints. This design reduces the risk of blind spots from a single model and allows specialists to be combined flexibly.
The expert-based design mirrors the multi-dimensional nature of clinical reasoning by distributing the reasoning process across multiple independent agents, enabling more granular analysis and broader diagnostic coverage. It builds on foundational principles from chain-of-thought and mixture-of-experts modeling, but extends them into a modular, multi-model configuration. This approach enables the system to reason more like a real clinical team, synthesizing insights rather than relying on internal model abstractions. Ultimately, it offers a scalable way to simulate structured, interdisciplinary clinical decision-making in AI systems.
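One way to picture "a model configured for a specialty" is a small registry keyed by specialty. The sketch below is an assumption for illustration only: `ExpertConfig`, `ExpertRegistry`, the `model-a`/`model-b`/`model-c` identifiers, and the prompt wording are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ExpertConfig:
    specialty: str
    model_name: str     # hypothetical identifier, not a real model name
    system_prompt: str  # how the underlying model is focused on a specialty


class ExpertRegistry:
    def __init__(self) -> None:
        self._experts: Dict[str, ExpertConfig] = {}

    def register(self, config: ExpertConfig) -> None:
        # Registering a specialty again replaces the model behind it.
        self._experts[config.specialty] = config

    def route(self, specialties: List[str]) -> List[ExpertConfig]:
        # Keep triage order, skip unknown specialties, drop repeats.
        seen = set()
        selected = []
        for specialty in specialties:
            if specialty in self._experts and specialty not in seen:
                seen.add(specialty)
                selected.append(self._experts[specialty])
        return selected


def make_expert(specialty: str, model_name: str) -> ExpertConfig:
    prompt = (f"You are a {specialty} specialist. "
              "Return a probability for every answer option.")
    return ExpertConfig(specialty, model_name, prompt)


registry = ExpertRegistry()
registry.register(make_expert("cardiology", "model-a"))
registry.register(make_expert("neurology", "model-b"))
registry.register(make_expert("cardiology", "model-c"))  # swap in a newer model

selected = registry.route(["neurology", "cardiology", "dermatology", "neurology"])
```

Here the triage output lists a specialty with no configured expert and a repeated one; the registry returns only the neurology and cardiology experts, with cardiology now backed by the most recently registered model.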






Leveraging the probabilistic nature of medicine
Rather than outputting only a top choice, each expert model returns a complete probability distribution over all possible answers.
These distributions are merged using a weighted log opinion pool (WLOP), which balances expert agreement and confidence. To account for the inherent nuances in clinical reasoning, we applied a boosting function, cascade boosting, that rewards answers that are not ranked as the most likely but appear frequently across expert rankings. Cascade boosting uses customizable weights that control the size of the boost at each rank of a probability distribution. Combined with the weighted log opinion pool, this approach helps minimize overconfidence while allowing the aggregation to be tuned.
The result is a single, calibrated distribution that reflects the strength of support for each option. This probability profile, along with expert rationales, is passed to the final consensus model. The system therefore retains uncertainty when appropriate, and avoids overconfident conclusions. This design mirrors how clinicians assess likelihoods and weigh multiple diagnoses, ultimately improving decision reliability and transparency.
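The aggregation step can be sketched as follows. The log opinion pool is the standard form, p(x) proportional to the product over experts of p_i(x) raised to the weight w_i. The cascade boost shown here is only one plausible reading of the description above: the rank weights, the multiplicative `(1 + boost)` form, and the function names are assumptions, not the paper's formula.

```python
import math
from typing import Dict, Sequence

Distribution = Dict[str, float]  # answer option -> probability


def normalize(scores: Distribution) -> Distribution:
    total = sum(scores.values())
    return {option: value / total for option, value in scores.items()}


def weighted_log_opinion_pool(dists: Sequence[Distribution],
                              weights: Sequence[float],
                              eps: float = 1e-9) -> Distribution:
    """p(x) proportional to prod_i p_i(x) ** w_i, computed in log space."""
    log_scores = {
        option: sum(w * math.log(max(d[option], eps))
                    for d, w in zip(dists, weights))
        for option in dists[0]
    }
    peak = max(log_scores.values())  # subtract the max for numerical stability
    return normalize({o: math.exp(s - peak) for o, s in log_scores.items()})


def cascade_boost(pooled: Distribution,
                  dists: Sequence[Distribution],
                  rank_weights: Sequence[float]) -> Distribution:
    """Assumed form: each expert adds rank_weights[r] to the option it ranks r-th."""
    boost = {option: 0.0 for option in pooled}
    for d in dists:
        ranked = sorted(d, key=d.get, reverse=True)
        for rank, option in enumerate(ranked[:len(rank_weights)]):
            boost[option] += rank_weights[rank]
    return normalize({o: p * (1.0 + boost[o]) for o, p in pooled.items()})


experts = [
    {"A": 0.6, "B": 0.3, "C": 0.1},
    {"A": 0.1, "B": 0.4, "C": 0.5},
    {"A": 0.5, "B": 0.4, "C": 0.1},
]
pooled = weighted_log_opinion_pool(experts, weights=[1 / 3, 1 / 3, 1 / 3])
# Illustrative rank weights: no boost for rank 1, larger boost for rank 2 than rank 3.
boosted = cascade_boost(pooled, experts, rank_weights=[0.0, 0.2, 0.1])
```

In this toy case no expert ranks option B first, yet B leads the pooled distribution because every expert gives it solid support, and the boost increases its share further. That is the behavior the text attributes to combining the pool with rank-based boosting.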
Implications of the
consensus design
01
Modular system reduces risk of single model dependence
A modular design lets us swap models in and out as tech or APIs evolve—no rebuilds required. Configure for latency, cost, or performance, and run open-source models locally for security. Net result: a sustainable, resilient clinical AI platform.
02
Transparency in probability modeling
Experts output probability distributions, not single answers. A weighted log opinion pool fuses them, preserving confidence differences for better calibration and fewer overconfident errors. Clinicians get transparent probabilities and confidence scores to prioritize next steps in uncertain cases.
03
Designed as a clinical support tool, not a replacement
This is decision support, not diagnosis. It offers a ranked, confidence-scored second opinion that mirrors differential diagnosis and flags uncertainty. Clinicians keep final authority—by design, for responsible AI.
Results
01
Significant increase in clinical reasoning performance
The Consensus Mechanism outperformed all comparator models across three major medical QA benchmarks: MedXpertQA, MedQA, and MedMCQA. On MedXpertQA—designed for complex clinical reasoning—it achieved 61.2% accuracy, surpassing O3-high (53.0%) and Gemini 2.5 Pro (44.1%). Accuracy gains also held for MedQA (96.8%) and MedMCQA (94.2%) despite those tasks being less reasoning-intensive. The model showed especially strong performance on diagnostic and treatment questions, improving over single-model baselines by more than 10% in some cases.



02
Superpowering accurate differential diagnosis
In differential diagnosis tasks using the DDX+ dataset, the Consensus Mechanism consistently outperformed leading models across all evaluation metrics. It achieved a higher F1 score (0.326 vs. 0.2866 for O3-high), alongside improved precision and recall, indicating better balance between identifying correct diagnoses and minimizing false positives. Top-1 diagnostic accuracy also increased by nearly 7%, highlighting its ability to prioritize the most likely diagnosis more effectively.
These performance advantages persisted across all top-K thresholds, especially at lower K values, where precision is critical. The gains highlight how the system’s expert-based decomposition and probabilistic aggregation contribute to stronger diagnostic performance. By capturing domain-specific reasoning from multiple specialists and weighting their outputs based on confidence and frequency, the system constructs a differential that better reflects real-world clinical deliberation.
This design reduces overreliance on any single perspective, mitigates individual model blind spots, and prioritizes consensus-driven predictions. As a result, the system not only improves diagnostic accuracy but also produces outputs that are better aligned with how clinicians approach decision-making in complex cases.
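For readers less familiar with the metrics quoted above, the sketch below shows how precision, recall, F1, and top-K accuracy are commonly computed for a differential diagnosis. It assumes set-based, per-case definitions; this page does not state how the paper averages scores across cases, and the diagnoses used are made-up examples.

```python
from typing import List, Sequence, Set, Tuple


def precision_recall_f1(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Set-based scores for one case: predicted differential vs. reference differential."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_pos else 0.0
    return precision, recall, f1


def top_k_accuracy(ranked_predictions: Sequence[List[str]],
                   gold_labels: Sequence[str], k: int) -> float:
    """Fraction of cases whose true diagnosis appears in the top k predictions."""
    hits = sum(gold in ranked[:k]
               for ranked, gold in zip(ranked_predictions, gold_labels))
    return hits / len(gold_labels)


# One case: 2 of 4 predicted diagnoses are in a 3-item reference differential.
precision, recall, f1 = precision_recall_f1(
    {"pneumonia", "bronchitis", "asthma", "GERD"},
    {"pneumonia", "bronchitis", "pulmonary embolism"},
)

# Three cases with ranked predictions and one true diagnosis each.
ranked = [["pneumonia", "bronchitis"], ["asthma", "COPD"], ["GERD", "angina"]]
truth = ["pneumonia", "COPD", "angina"]
top1 = top_k_accuracy(ranked, truth, k=1)
top2 = top_k_accuracy(ranked, truth, k=2)
```

Top-1 accuracy rewards only the first-ranked diagnosis, so it is the strictest of the top-K measures; widening K credits a system for placing the true diagnosis anywhere in its shortlist.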






03
Improved reliability and calibration
Calibration analysis showed that the Consensus Mechanism’s predicted confidence levels closely matched actual accuracy, particularly in high-confidence intervals. As demonstrated in the plot, its reliability curve remains near the ideal diagonal, indicating well-calibrated outputs. In contrast, baseline models like O4-mini exhibit significant overconfidence, overstating accuracy in critical ranges.
This alignment between confidence and correctness means clinicians can trust the system’s probability estimates when making decisions. Reliable calibration enhances safety by reducing false certainty and promoting appropriate caution, especially in high-stakes scenarios. These findings reinforce the system’s core strengths: trustworthy reasoning, interpretability, and clinical applicability.
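A reliability curve of the kind described above can be computed by binning predictions by confidence and comparing each bin's mean confidence with its observed accuracy. The sketch below is illustrative: the bin count and the data are made up, and the expected calibration error is a common summary statistic that this page does not itself report.

```python
from typing import List, Sequence, Tuple

CurvePoint = Tuple[float, float, int]  # (mean confidence, accuracy, count)


def reliability_curve(confidences: Sequence[float],
                      correct: Sequence[bool],
                      n_bins: int = 5) -> List[CurvePoint]:
    """One point per non-empty, equal-width confidence bin."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    curve = []
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            curve.append((mean_conf, accuracy, len(members)))
    return curve


def expected_calibration_error(curve: Sequence[CurvePoint]) -> float:
    """Count-weighted average gap between confidence and accuracy."""
    total = sum(count for _, _, count in curve)
    return sum(count / total * abs(conf - acc) for conf, acc, count in curve)


confidences = [0.9, 0.9, 0.9, 0.9, 0.5, 0.5]

# A fairly well-calibrated system: 75% correct at 0.9 confidence, 50% at 0.5.
calibrated = reliability_curve(confidences, [True, True, True, False, True, False])
# An overconfident system: only 25% correct at 0.9 confidence.
overconfident = reliability_curve(confidences, [True, False, False, False, True, False])

ece_calibrated = expected_calibration_error(calibrated)
ece_overconfident = expected_calibration_error(overconfident)
```

Points that sit on the diagonal (mean confidence equal to accuracy) indicate good calibration; points below it, as in the second example, indicate overconfidence.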



Where do we go from here?
To bring the Consensus Mechanism into real-world clinical settings, future work will focus on validating performance using complex case vignettes and live clinical scenarios. This includes assessing how well the system integrates into workflows and supports physician decision-making beyond benchmark tasks. Additional development will prioritize reducing cost and latency by incorporating smaller or open-source models without compromising accuracy.
This will also enhance transparency, privacy, and scalability for local deployment. The team plans to optimize the architecture for more resource-efficient operation while maintaining performance. These efforts aim to ensure the system is practical, sustainable, and suited for routine clinical use.
Conclusion
The Consensus Mechanism offers a modular, expert-based architecture that addresses core limitations of single-model clinical AI systems. This design mirrors real-world clinical workflows: it synthesizes insights from multiple specialists, weighs uncertainty, and delivers probability-calibrated outputs.
These features make it far more aligned with how clinicians think and make decisions, offering practical advantages over traditional single-model systems. The approach improves both performance and interpretability, achieving state-of-the-art results across clinical benchmarks while supporting safer and more adaptive decision-making.
In addition to performing well as a standalone diagnostic tool, it functions as a trustworthy second opinion. The consensus mechanism was designed to be transparent, explainable, and capable of handling diagnostic complexity without overconfidence. For healthcare organizations, it offers a scalable, future-proof solution that reduces rigid dependence on outdated models while ensuring adaptability as new models emerge.
As clinical AI continues to evolve, this framework offers a scalable and reliable foundation for building AI systems that deliver high accuracy while remaining interpretable, trustworthy, and actionable for clinicians.



Download the full
academic paper.

Resources
© Sully AI 2025. All Rights Reserved.
Epic is a registered trademark of Epic Systems Corporation.