Building the autonomy layer for future healthcare systems

Published or cited by

Where our team comes from

20+

Patents Filed

10+

Academic Papers Published

Papers Cited

Featured publications and resources

20 Jun 2025

Whitepaper

The Consensus Mechanism: Toward Trustworthy AI Collaboration

Defines Sully’s Consensus Mechanism, a protocol enabling AI agents to reach verifiable agreement through structured proposal-and-critique cycles, weighted scoring, and reputation tracking. This mechanism underpins transparent, multi-agent decision-making across healthcare, legal, and knowledge domains.
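
As an illustration only, one round of a proposal-and-critique cycle with weighted scoring and reputation tracking might look like the Python sketch below. The whitepaper's actual protocol is not reproduced here: the `Agent` class, the `run_round` function, the 0-to-1 score scale, and the reputation update rule are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    reputation: float = 1.0  # hypothetical starting weight


def weighted_score(critic_scores, agents):
    """Reputation-weighted mean of the critics' scores for one proposal."""
    total_weight = sum(agents[c].reputation for c in critic_scores)
    weighted_sum = sum(agents[c].reputation * s for c, s in critic_scores.items())
    return weighted_sum / total_weight


def run_round(agents, critiques, learning_rate=0.1):
    """Pick the best-scored proposal, then adjust critic reputations.

    critiques maps proposal -> {critic name -> score in [0, 1]}.
    The update rule is an assumption: critics who scored the winning
    proposal above 0.5 gain reputation, those below 0.5 lose some.
    """
    scores = {p: weighted_score(cs, agents) for p, cs in critiques.items()}
    winner = max(scores, key=scores.get)
    for critic, s in critiques[winner].items():
        agents[critic].reputation *= 1 + learning_rate * (s - 0.5)
    return winner, scores


# Hypothetical agents and critique scores, for illustration only.
agents = {
    "a": Agent("a", 1.0),
    "b": Agent("b", 2.0),
    "c": Agent("c", 1.0),
}
critiques = {
    "plan_x": {"a": 0.9, "b": 0.4, "c": 0.8},
    "plan_y": {"a": 0.2, "b": 0.95, "c": 0.5},
}
winner, scores = run_round(agents, critiques)
```

In this toy round the higher-reputation agent's preference carries more weight, which is the general idea behind reputation-weighted scoring. A production protocol would also need the verification and audit steps the whitepaper describes.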

20 Jun 2025

Whitepaper

Scalable architecture for multimodal healthcare agents

Introduces Sully’s SuperAgent architecture — a composable ecosystem of isolated, self-contained agent packages. Each agent supports multimodal input (voice, web, phone, SMS) and integrates with common authentication, billing, and access layers.

20 Jun 2025

Whitepaper

QnA benchmarks LLM performance analysis

A large-scale benchmark across 12+ models and medical specialties. Finds O1 leading in overall accuracy (45%), with GPT-4.5-Preview dominating treatment tasks and O3-Mini excelling in lymphatic diagnosis. Recommends ensemble model use for real-world applications.

Research performance highlights

Improvement in clinical note quality with agentic workflows

17.3%

Lower processing cost per note using optimized models

50%

Decreased hallucinations using agents in real clinical practice

0.0328%

Increase in template compliance through automation

98%

Active research topics

Technical & Architectural Focus

Multi-Agent Collaboration
Agentic AI Systems
Model Evaluation Frameworks
One-Shot vs Agentic Workflows

Evaluation & Benchmarking

Multi-Agent Collaboration
Agentic AI Systems
Model Evaluation Frameworks
One-Shot vs Agentic Workflows

Medical AI Domains

Multi-Agent Collaboration
Agentic AI Systems
Model Evaluation Frameworks
One-Shot vs Agentic Workflows

Research Methods & Tools

Multi-Agent Collaboration
Agentic AI Systems
Model Evaluation Frameworks
One-Shot vs Agentic Workflows

Conceptual & Future-Oriented Topics

Multi-Agent Collaboration
Agentic AI Systems
Model Evaluation Frameworks
One-Shot vs Agentic Workflows

Partner with Sully Labs

Work with our research team to design, test, and deploy next-generation multi-agent AI.

Book a 30-min call

Insights from our data

Model strengths vary by medical domain

Benchmarking 12+ models across medical specialties revealed clear domain strengths: O1 excelled in general medical Q&A, O3-Mini led in lymphatic diagnosis, and GPT-4.5-Preview dominated treatment and endocrine system tasks.

Learn More
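
One simple way to act on domain-specific strengths like these is a routing table that sends each question to the model that scored best in its domain. The Python sketch below is a hypothetical illustration, not Sully's implementation: the domain keys, the `pick_model` function, and the choice of O1 as the fallback are assumptions, and the model names are used only as labels taken from the benchmark summary.

```python
# Hypothetical domain-to-model routing table based on the strengths
# reported above. Domain keys are invented for this example.
DOMAIN_TO_MODEL = {
    "general_qa": "O1",
    "lymphatic_diagnosis": "O3-Mini",
    "treatment": "GPT-4.5-Preview",
    "endocrine": "GPT-4.5-Preview",
}

# Assumption: fall back to the model reported as strongest on general Q&A.
DEFAULT_MODEL = "O1"


def pick_model(domain: str) -> str:
    """Return the model label for a domain, or the default if unknown."""
    key = domain.strip().lower()
    return DOMAIN_TO_MODEL.get(key, DEFAULT_MODEL)


# Example lookups; "cardiology" is not in the table, so it uses the default.
for d in ("treatment", "lymphatic_diagnosis", "cardiology"):
    print(d, "->", pick_model(d))
```

A static table is the simplest form of ensemble use. Voting or score-weighted combinations of several models are alternatives when a question spans more than one domain.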

Agentic models deliver measurable gains

Agentic multi-agent workflows improved clinical note quality by 10–20% across five tested models, with GPT-OSS-120B achieving a 17.3% quality increase at half the baseline cost. This supports the use of structured, multi-step reasoning pipelines for medical scribing tasks.

Learn More