Towards Open-Source Foundational Medical Agents

Introduction

The rapid advancement of large language models has transformed healthcare AI, but it has also introduced a critical vulnerability: dependency on closed, proprietary systems. When a foundation model provider updates their weights or inference stack, downstream healthcare applications can experience unexpected quality regressions—often without warning.


At Sully, we've embraced open-source foundational models as the backbone of our medical AI agents. This whitepaper explains why, and how we've developed techniques to match or exceed proprietary model performance while maintaining full control over quality.

The Regression Problem

01

Unpredictable Quality Changes

Foundation model providers regularly update their systems—adjusting weights, modifying system prompts, or optimizing inference infrastructure. While these changes may reduce costs or improve general performance, they can have unintended consequences for specialized medical applications.

Through continuous weekly benchmarking across multiple foundation models, we observed measurable quality regressions in proprietary reasoning models. One notable decline coincided with the announcement of an updated inference stack designed to reduce API costs. For healthcare applications where consistency and reliability are paramount, this unpredictability represents a significant risk.

The figure below shows results from our proprietary evaluation system, tracking how note quality regressed over time, with specific decreases in:

  • Clinical accuracy

  • Safety

  • Information architecture

  • Template adherence

All runs use the same sample set, the same evaluation, and the same foundational model.
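
To make this concrete, the sketch below shows one way such a weekly check can be automated: score each week's notes on the same dimensions and flag any dimension that drops materially below its historical baseline. The dimension names mirror the list above; the data, threshold, and scoring scale are illustrative, not our production system.

```python
# Illustrative sketch of a weekly regression check over fixed evaluation dimensions.
# Dimension names mirror the list above; thresholds and scores are hypothetical.
from statistics import mean

DIMENSIONS = ["clinical_accuracy", "safety", "information_architecture", "template_adherence"]
REGRESSION_THRESHOLD = 0.03  # flag drops of more than 3 points on a 0-1 scale

def detect_regressions(history: list[dict[str, float]], current: dict[str, float]) -> list[str]:
    """Compare this week's scores against the mean of prior weeks, per dimension."""
    flagged = []
    for dim in DIMENSIONS:
        baseline = mean(week[dim] for week in history)
        if baseline - current[dim] > REGRESSION_THRESHOLD:
            flagged.append(f"{dim}: {baseline:.3f} -> {current[dim]:.3f}")
    return flagged

# Same sample set, same evaluation, same foundational model each week.
history = [
    {"clinical_accuracy": 0.91, "safety": 0.95, "information_architecture": 0.88, "template_adherence": 0.93},
    {"clinical_accuracy": 0.90, "safety": 0.94, "information_architecture": 0.89, "template_adherence": 0.92},
]
current = {"clinical_accuracy": 0.84, "safety": 0.94, "information_architecture": 0.83, "template_adherence": 0.92}

for finding in detect_regressions(history, current):
    print("Regression:", finding)
```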

02

The Case for Control

When patient care depends on AI-generated outputs, organizations need:

  • Version stability: Confidence that model behavior remains consistent

  • Transparency: Understanding of what changes and when

  • Rollback capability: The ability to revert to known-good configurations

  • Customization: Fine-tuning for specific clinical contexts

Open-source models deliver all of these capabilities. By running inference on our own infrastructure, we maintain complete control over the AI systems our customers depend on.
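
As a minimal sketch of what version stability and rollback can look like in practice, the configuration below pins a model to an exact weights revision and prompt version, and reverts to a previously validated release if a regression is detected. The model names, revisions, and release dates are hypothetical.

```python
# Illustrative sketch: pinning model versions for self-hosted inference so behavior
# only changes when we change it. Names, revisions, and dates are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    name: str            # open-weights model identifier
    revision: str        # exact weights revision (e.g., a registry hash)
    prompt_version: str  # version of the system prompt bundled with this release

# Known-good configurations, promoted only after passing the evaluation suite.
RELEASES = {
    "2024-05-01": ModelConfig("open-med-8b", revision="a1b2c3d", prompt_version="v12"),
    "2024-06-01": ModelConfig("open-med-8b", revision="e4f5a6b", prompt_version="v13"),
}

ACTIVE_RELEASE = "2024-06-01"

def rollback(to: str) -> ModelConfig:
    """Revert to a previously validated configuration if a regression is detected."""
    return RELEASES[to]

regression_detected = False  # in practice, wired to a weekly regression check
config = rollback("2024-05-01") if regression_detected else RELEASES[ACTIVE_RELEASE]
print(config)
```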

Benefits of Open-Source Models

Open-source foundational models offer three primary advantages for production healthcare AI:

01

Cost Efficiency

Smaller parameter models running on optimized infrastructure dramatically reduce per-inference costs. This makes AI-assisted workflows economically viable at scale—enabling features that would be prohibitively expensive with proprietary APIs.

02

Lower Latency

Open-source models can run on specialized hardware—custom LPUs, optimized accelerators, and purpose-built inference chips—delivering response times that enhance rather than interrupt clinical workflows.

03

Scale Infinitely

Without API rate limits or third-party dependencies, open-source deployments scale predictably with demand. This is essential for healthcare organizations processing thousands of clinical encounters daily.

Bridging the Gap: Ensembling and Consensus

Overview

Open-source models, particularly smaller ones, may individually lag behind the largest proprietary models in complex reasoning tasks. We address this limitation through two key techniques: multi-model ensembling and aggregated consensus.

Multi-Model Ensembling

Rather than relying on a single model, we orchestrate multiple models working in concert. Each model contributes its strengths, and the ensemble produces outputs that exceed what any individual model could achieve alone.
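
A minimal sketch of this orchestration pattern is shown below: several open-source models draft independently, and a synthesis step merges the drafts. The generate() helper, model names, and merge prompt are placeholders for illustration, not a specific API.

```python
# Illustrative sketch of multi-model ensembling: several open-source models draft
# independently, and a synthesis step merges the drafts into one output.
from concurrent.futures import ThreadPoolExecutor

ENSEMBLE = ["open-med-8b", "open-reason-32b", "open-general-70b"]

def generate(model: str, prompt: str) -> str:
    """Placeholder for a call to a self-hosted inference endpoint."""
    return f"[{model} draft for: {prompt[:40]}...]"

def ensemble_generate(prompt: str) -> str:
    # Draft in parallel so ensemble latency tracks the slowest member, not the sum.
    with ThreadPoolExecutor(max_workers=len(ENSEMBLE)) as pool:
        drafts = list(pool.map(lambda m: generate(m, prompt), ENSEMBLE))
    # A designated synthesizer model merges the drafts into a single output.
    merge_prompt = "Combine the strongest elements of each draft:\n\n" + "\n---\n".join(drafts)
    return generate("open-reason-32b", merge_prompt)

print(ensemble_generate("Draft a SOAP note from this encounter transcript: ..."))
```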

Aggregated Consensus

For high-stakes clinical decisions, we implement consensus mechanisms where multiple models independently evaluate the same input. Agreement across models increases confidence; disagreement triggers additional review or escalation.
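
The sketch below illustrates the general idea with a simple majority scheme: each model votes independently, strong agreement is accepted, and anything below the agreement threshold is escalated for human review. The labels, threshold, and voting rule are illustrative; production consensus logic can be considerably richer.

```python
# Illustrative sketch of aggregated consensus for a high-stakes check. Each model
# independently returns a label; strong agreement is accepted, disagreement escalates.
from collections import Counter

def consensus(votes: list[str], min_agreement: float = 2 / 3) -> str:
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_agreement:
        return label
    return "ESCALATE_FOR_REVIEW"

# Example: three models independently flag whether a drafted note contains a dosing error.
votes = ["no_error", "no_error", "possible_error"]
print(consensus(votes))  # -> "no_error" (2 of 3 meets the two-thirds threshold)
```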

These techniques, combined with systematic optimization to identify the best configuration of open-source models for each use case, allow us to deliver production-quality medical AI without sacrificing control or predictability.
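
As an illustration of that systematic optimization, the sketch below searches over candidate configurations (model, temperature, ensemble size) and keeps the one that scores best on a fixed evaluation set. The evaluate() function is a placeholder for a task-specific evaluation harness, not an existing API.

```python
# Illustrative sketch of the configuration search: score candidate open-source model
# configurations against a fixed evaluation set and keep the best one per use case.
from itertools import product

MODELS = ["open-med-8b", "open-reason-32b"]
TEMPERATURES = [0.0, 0.3]
ENSEMBLE_SIZES = [1, 3]

def evaluate(model: str, temperature: float, ensemble_size: int) -> float:
    """Placeholder: run the eval set through this configuration and return a score."""
    return 0.0

def best_configuration() -> tuple:
    candidates = product(MODELS, TEMPERATURES, ENSEMBLE_SIZES)
    return max(candidates, key=lambda cfg: evaluate(*cfg))

print(best_configuration())
```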

Performance Benchmarks

The following benchmark compares latency, cost, and accuracy across models for a medical coding use case:

Our benchmarks demonstrate that properly optimized open-source models can achieve competitive accuracy while delivering significant improvements in cost and latency.
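
For context, the sketch below shows the shape of a harness that could produce such a comparison: per model, it measures accuracy against gold codes, wall-clock latency per case, and an estimated cost from token counts. The predict() helper, price table, and example case are placeholders, not actual benchmark data.

```python
# Illustrative sketch of a medical-coding benchmark harness measuring accuracy,
# latency, and estimated cost per model. All values shown are placeholders.
import time

PRICE_PER_1K_TOKENS = {"open-med-8b": 0.0002, "proprietary-large": 0.01}  # hypothetical

def predict(model: str, case: str) -> tuple[str, int]:
    """Placeholder: return (predicted code, tokens used) from an inference endpoint."""
    return "E11.9", 350

def benchmark(model: str, cases: list[tuple[str, str]]) -> dict[str, float]:
    correct, tokens, start = 0, 0, time.perf_counter()
    for text, gold_code in cases:
        code, used = predict(model, text)
        correct += int(code == gold_code)
        tokens += used
    elapsed = time.perf_counter() - start
    return {
        "accuracy": correct / len(cases),
        "latency_s_per_case": elapsed / len(cases),
        "cost_usd": tokens / 1000 * PRICE_PER_1K_TOKENS[model],
    }

cases = [("Patient with well-controlled type 2 diabetes...", "E11.9")]
print(benchmark("open-med-8b", cases))
```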

Conclusion

The path to reliable, scalable healthcare AI runs through open-source. By embracing open models and developing sophisticated orchestration techniques, we've built medical agents that are more controllable, more cost-effective, and more predictable than systems dependent on proprietary APIs.

This isn't about avoiding innovation from foundation model labs—it's about building healthcare AI that organizations can trust and depend on, today and tomorrow.
