Foundation model providers regularly update their systems—adjusting weights, modifying system prompts, or optimizing inference infrastructure. While these changes may reduce costs or improve general performance, they can have unintended consequences for specialized medical applications.
Through continuous weekly benchmarking across multiple foundation models, we observed measurable quality regressions in proprietary reasoning models. One notable decline coincided with the announcement of an updated inference stack designed to reduce API costs. For healthcare applications where consistency and reliability are paramount, this unpredictability represents a significant risk.
The chart below, from our proprietary evaluation system, shows how note quality regressed over time, with measurable decreases in:
Clinical accuracy
Safety
Information architecture
Template adherence
All runs use the same sample set, the same evaluation, and the same foundation model.
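
A minimal sketch of how such a weekly regression check could work is shown below; the dimension names, baseline scores, and threshold are illustrative placeholders under assumed conditions, not our production evaluation code.

```python
REGRESSION_THRESHOLD = 0.03  # flag any dimension that drops more than 3 points

# Illustrative baseline scores per evaluation dimension (0-1 scale);
# in practice these would come from a prior validated benchmark run.
BASELINE = {
    "clinical_accuracy": 0.92,
    "safety": 0.95,
    "information_architecture": 0.89,
    "template_adherence": 0.93,
}


def detect_regressions(current: dict[str, float],
                       baseline: dict[str, float]) -> list[str]:
    """Return the dimensions whose score dropped beyond the threshold."""
    flagged = []
    for dimension, baseline_score in baseline.items():
        drop = baseline_score - current.get(dimension, 0.0)
        if drop > REGRESSION_THRESHOLD:
            flagged.append(f"{dimension}: -{drop:.2f} vs. baseline")
    return flagged


if __name__ == "__main__":
    # Scores from this week's run on the same sample set and evaluation.
    this_week = {
        "clinical_accuracy": 0.87,
        "safety": 0.94,
        "information_architecture": 0.85,
        "template_adherence": 0.92,
    }
    for alert in detect_regressions(this_week, BASELINE):
        print("REGRESSION:", alert)
```

Running the same check on every weekly benchmark turns a silent provider-side change into an explicit alert before it reaches clinical users.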
When patient care depends on AI-generated outputs, organizations need:
Version stability: Confidence that model behavior remains consistent
Transparency: Understanding of what changes and when
Rollback capability: The ability to revert to known-good configurations
Customization: Fine-tuning for specific clinical contexts
Open-source models deliver all of these capabilities. By running inference on our own infrastructure, we maintain complete control over the AI systems our customers depend on.
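
As a rough illustration of what version stability and rollback look like when you host the model yourself, the snippet below pins an open-weights model to a specific repository revision. The model ID and commit hash are placeholders, and the Hugging Face transformers library is assumed; this is a sketch, not our deployment code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model ID and revision; any open-weights model hosted on the
# Hugging Face Hub can be pinned this way.
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
KNOWN_GOOD_REVISION = "abc1234"  # placeholder commit hash of a validated snapshot

# Loading against a pinned revision serves the exact same weights on every
# deploy; rolling back is a one-line change to KNOWN_GOOD_REVISION.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, revision=KNOWN_GOOD_REVISION)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, revision=KNOWN_GOOD_REVISION)
```

Because the weights, the serving configuration, and the upgrade schedule all live in our own infrastructure, model behavior changes only when we decide it should.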
