Foundation model providers regularly update their systems—adjusting weights, modifying system prompts, or optimizing inference infrastructure. While these changes may reduce costs or improve general performance, they can have unintended consequences for specialized medical applications.
Through continuous weekly benchmarking across multiple foundation models, we observed measurable quality regressions in proprietary reasoning models. One notable decline coincided with the announcement of an updated inference stack designed to reduce API costs. For healthcare applications where consistency and reliability are paramount, this unpredictability represents a significant risk.
The chart below, from our proprietary evaluation system, shows how note quality regressed over time, with measurable decreases in:
Clinical accuracy
Safety
Information architecture
Template adherence
All runs use the same sample set, the same evaluation, and the same foundation model.
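
A minimal sketch of how such a weekly regression check could work is shown below; the dimension names, baseline scores, and threshold are illustrative placeholders under assumed conditions, not our production evaluation code.

```python
REGRESSION_THRESHOLD = 0.03  # flag any dimension that drops more than 3 points

# Illustrative baseline scores per evaluation dimension (0-1 scale);
# in practice these would come from a prior validated benchmark run.
BASELINE = {
    "clinical_accuracy": 0.92,
    "safety": 0.95,
    "information_architecture": 0.89,
    "template_adherence": 0.93,
}


def detect_regressions(current: dict[str, float],
                       baseline: dict[str, float]) -> list[str]:
    """Return the dimensions whose score dropped beyond the threshold."""
    flagged = []
    for dimension, baseline_score in baseline.items():
        drop = baseline_score - current.get(dimension, 0.0)
        if drop > REGRESSION_THRESHOLD:
            flagged.append(f"{dimension}: -{drop:.2f} vs. baseline")
    return flagged


if __name__ == "__main__":
    # Scores from this week's run on the same sample set and evaluation.
    this_week = {
        "clinical_accuracy": 0.87,
        "safety": 0.94,
        "information_architecture": 0.85,
        "template_adherence": 0.92,
    }
    for alert in detect_regressions(this_week, BASELINE):
        print("REGRESSION:", alert)
```

Running the same check on every weekly benchmark turns a silent provider-side change into an explicit alert before it reaches clinical users.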
When patient care depends on AI-generated outputs, organizations need:
Version stability: Confidence that model behavior remains consistent
Transparency: Understanding of what changes and when
Rollback capability: The ability to revert to known-good configurations
Customization: Fine-tuning for specific clinical contexts
Open-source models deliver all of these capabilities. By running inference on our own infrastructure, we maintain complete control over the AI systems our customers depend on.
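
As a rough illustration of what version stability and rollback look like when you host the model yourself, the snippet below pins an open-weights model to a specific repository revision. The model ID and commit hash are placeholders, and the Hugging Face transformers library is assumed; this is a sketch, not our deployment code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model ID and revision; any open-weights model hosted on the
# Hugging Face Hub can be pinned this way.
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
KNOWN_GOOD_REVISION = "abc1234"  # placeholder commit hash of a validated snapshot

# Loading against a pinned revision serves the exact same weights on every
# deploy; rolling back is a one-line change to KNOWN_GOOD_REVISION.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, revision=KNOWN_GOOD_REVISION)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, revision=KNOWN_GOOD_REVISION)
```

Because the weights, the serving configuration, and the upgrade schedule all live in our own infrastructure, model behavior changes only when we decide it should.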
