Large-scale analysis of LLMs in healthcare

A continuously evolving benchmark of medical AI performance across clinical reasoning, documentation, and coding tasks.

Introduction

Healthcare is one of the most complex and high-stakes environments for AI systems. Large Language Models (LLMs) now demonstrate impressive reasoning and generation capabilities, but their reliability depends heavily on clinical grounding, structured context, and rigorous evaluation.

Sully conducts continuous, large-scale benchmarking of LLMs across three core domains:

  1. Medical Q&A & Diagnostic Reasoning

  2. Clinical Note Generation Quality

  3. Medical Coding & Structured Benchmarking

Our evaluations span multiple datasets, modalities, and clinical tasks to ensure that model performance is not only high, but also clinically valid, safe, and explainable.
This page summarizes findings from these ongoing efforts and highlights how retrieval-augmented systems—especially for medical coding—significantly outperform unguided generation approaches.

Medical benchmarks:
Q&A and diagnostic reasoning

This section evaluates LLMs on clinical knowledge, differential diagnosis, and reasoning tasks; a minimal scoring sketch follows the metric list below.

What We Measure

  • Clinical knowledge accuracy

  • Diagnostic reasoning strength

  • Logical validity and hallucination rates

  • Safety and guideline adherence
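As a concrete illustration of how items like these can be scored, here is a minimal sketch of an evaluation loop for multiple-choice medical Q&A. The `QAItem` fields, the hallucination flags (assumed to come from a separate physician or LLM-grader review), and the toy questions are assumptions for illustration, not Sully's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class QAItem:
    question: str
    options: dict[str, str]   # e.g. {"A": "ACE inhibitor", "B": "Warfarin"}
    correct: str              # gold option key, e.g. "A"

def score_run(items: list[QAItem], predictions: list[str],
              hallucination_flags: list[bool]) -> dict[str, float]:
    """Compute knowledge accuracy and hallucination rate for one model run.

    `predictions` holds the option key the model chose for each item;
    `hallucination_flags` marks answers whose rationale cited facts not
    supported by the question or by clinical guidelines (judged separately,
    e.g. by physician review or an LLM grader).
    """
    n = len(items)
    correct = sum(pred == item.correct for item, pred in zip(items, predictions))
    return {
        "accuracy": correct / n,
        "hallucination_rate": sum(hallucination_flags) / n,
    }

# Two toy items with illustrative values only.
items = [
    QAItem("First-line therapy for uncomplicated hypertension?",
           {"A": "ACE inhibitor", "B": "Warfarin"}, "A"),
    QAItem("Most likely diagnosis for acute-onset unilateral leg swelling?",
           {"A": "Cellulitis", "B": "Deep vein thrombosis"}, "B"),
]
print(score_run(items, predictions=["A", "A"], hallucination_flags=[False, True]))
# -> {'accuracy': 0.5, 'hallucination_rate': 0.5}
```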

Medical note generation quality

Evaluating model ability to generate clinically accurate and comprehensive documentation; a rubric-style scoring sketch follows the metric list below.

Metrics Assessed

  • Completeness of medical content

  • Logical consistency and absence of hallucinations

  • Clinical validity

  • Adherence to note structure

  • Physician usability
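One way to picture how these dimensions can be combined is a simple rubric score per generated note. The 1-to-5 scale, equal weighting, and dimension keys below are illustrative assumptions rather than a published methodology.

```python
from statistics import mean

# Dimensions mirror the metrics listed above; the 1-5 scale and equal
# weighting are illustrative assumptions, not a published rubric.
DIMENSIONS = [
    "completeness",        # relevant findings, assessments, and plans captured
    "consistency",         # no contradictions or hallucinated content
    "clinical_validity",   # medically sound statements and plans
    "structure",           # follows the expected note template (e.g. SOAP)
    "usability",           # a physician could sign the note with minimal edits
]

def score_note(ratings: dict[str, int]) -> float:
    """Average per-dimension reviewer ratings (1-5) into a single note-quality score."""
    missing = [d for d in DIMENSIONS if d not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    return mean(ratings[d] for d in DIMENSIONS)

example = {"completeness": 4, "consistency": 5, "clinical_validity": 4,
           "structure": 5, "usability": 3}
print(score_note(example))  # -> 4.2
```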

Medical coding
& structured benchmarks

Medical coding requires precision; small errors can cause billing, legal, and clinical safety issues.
Our evaluations demonstrate the dramatic impact of code retrieval context on accuracy.

Medical Coding:
Including ICD-10 code context in prompts dramatically improves performance. Without code lists, adjusted accuracy drops from 99.1% to just 49.0%, and truly incorrect predictions increase from 6.8% to 57.3%. This demonstrates that providing relevant code options is essential for reliable clinical coding.
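The page does not spell out how "adjusted accuracy" and "truly incorrect" are computed. The sketch below shows one plausible reading, in which a prediction counts toward adjusted accuracy if it matches the gold ICD-10 code exactly or at the three-character category level, and is truly incorrect otherwise. The category rule, the `coding_metrics` helper, and the toy codes are assumptions for illustration only.

```python
def coding_metrics(preds: list[str], golds: list[str]) -> dict[str, float]:
    """Score ICD-10 predictions under one plausible reading of the metrics above.

    A prediction is "exact" if it equals the gold code, "near" if it shares the
    gold code's three-character category (an assumed proxy for clinically
    acceptable variants), and "truly incorrect" otherwise.
    """
    n = len(golds)
    exact = sum(p == g for p, g in zip(preds, golds))
    near = sum(p != g and p[:3] == g[:3] for p, g in zip(preds, golds))
    return {
        "exact_accuracy": exact / n,
        "adjusted_accuracy": (exact + near) / n,
        "truly_incorrect_rate": (n - exact - near) / n,
    }

print(coding_metrics(
    preds=["E11.9", "I10", "J45.40"],
    golds=["E11.9", "I10", "J44.9"],   # last prediction misses the category
))
# -> exact 2/3, adjusted 2/3, truly incorrect 1/3 for these toy values
```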

Medical Coding Benchmarks:
Using Sully's proprietary coding retrieval algorithm and engine, coding accuracy reaches 99.1% with adjusted metrics. In contrast, one-shot generation by open-source and foundational models without code context achieves only 49.0% adjusted accuracy, with truly incorrect predictions rising to 57.3% compared to Sully's 6.8% (of non-exact codes). This demonstrates that Sully's intelligent code retrieval system is essential for reliable clinical coding.

Conclusion

As LLMs become increasingly integrated into healthcare workflows, rigorous evaluation is critical.
Our ongoing analyses show:

  • Retrieval dramatically improves reliability across tasks—especially coding.

  • Unguided generation is unsafe for high-precision clinical domains.

  • Continuous benchmarking allows models and systems to improve over time.

  • Clinical validity must remain the primary measure—not just accuracy or similarity.

Sully will continue to expand these datasets, refine our evaluation pipelines, and release updated results.
Our goal is to ensure that AI systems deployed in healthcare are not only powerful, but trustworthy, clinically safe, and aligned with real-world needs.

Ready for the
future of healthcare?
