Building Safe & Ethical AI For The Healthcare Industry’s Doctor-Language Model vs SOTA (State-of-the-art LLMs)

Task (Conversation-to-charting)

Input: Transcript -> Output: Clinical Note

Comparison: GPT4 and GPT3.5 vs. our Doctor-Language Model

Benchmark Dataset: de-identified dataset

We prepared this dataset using thousands of real-life physician-signed notes with transcripts as ground truth.


The goal of this experiment was to compare the performance of different models and prompting techniques for the task of generating a SOAP note from a patient visit transcript. The ground truth SOAP notes were collected by pulling all available physician-signed notes and de-identifying them as per HIPAA Safe Harbor guidelines. 

For both GPT4 and GPT3.5, we used single-shot prompting: the prompt contained a preamble (e.g. “You are an expert physician…”) and a single exemplar of the JSON output format for the SOAP note. We also fine-tuned a model on transcripts and their corresponding JSON-formatted SOAP notes, drawn from hundreds of real medical notes.
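As a rough illustration of the fine-tuning data described above, each training example can be stored as one JSON line pairing a transcript with its SOAP note. This is only a sketch: the function name `make_training_record`, the `input`/`output` keys, and the four-field note layout are assumptions made for this example, not the schema actually used, and the sample text is invented rather than taken from the dataset.

```python
import json


def make_training_record(transcript: str, soap_note: dict) -> str:
    """Serialize one (transcript, SOAP note) pair as a single JSONL line.

    The "input"/"output" key names are hypothetical; a real fine-tuning
    framework may expect different field names.
    """
    record = {
        "input": transcript.strip(),
        # The target is the note rendered as a JSON string, so the model
        # is trained to emit the same structure it will be asked for.
        "output": json.dumps(soap_note, ensure_ascii=False, sort_keys=True),
    }
    return json.dumps(record, ensure_ascii=False)


# Invented, non-clinical sample used purely to show the shape of a record.
sample_note = {
    "subjective": "Patient reports a mild headache for two days.",
    "objective": "Vital signs within normal limits.",
    "assessment": "Tension-type headache.",
    "plan": "Hydration, rest, follow up if symptoms persist.",
}
sample_line = make_training_record(
    "Doctor: What brings you in today?\nPatient: I've had a headache.",
    sample_note,
)
```

Writing one such line per signed note gives a JSONL file, a format commonly accepted by fine-tuning tools.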

The following table shows how each method fares, as measured by DICE and a custom similarity score.

[Table: results by approach (length truncated)]

Figure 1.0: A 10,000-foot view of our infrastructure design and the flow of information.

To evaluate our model, we compared it with two prominent LLMs, GPT4 and GPT3.5. The benchmark dataset was assembled from thousands of real-life SOAP notes, all de-identified according to HIPAA's Safe Harbor method.

Our experiment aimed to compare the effectiveness of various models, in combination with different prompting techniques, at converting patient visit transcripts into simple SOAP notes. For GPT4 and GPT3.5, we employed a single-shot prompting technique, providing only one example of the JSON output format for a SOAP note alongside a preamble such as "You are an expert physician...".
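The single-shot setup described above might be assembled along the following lines. This is a minimal sketch under stated assumptions: the preamble is kept truncated exactly as quoted (its full wording is not given here), and the exemplar field names, the instruction wording, and the helper `build_single_shot_prompt` are hypothetical. It only builds the prompt string and does not call any model API.

```python
import json

# Preamble left truncated as quoted in the text; the full wording is not given.
PREAMBLE = "You are an expert physician..."

# Hypothetical exemplar showing only the JSON shape of a SOAP note.
EXEMPLAR_NOTE = {
    "subjective": "<patient-reported symptoms and history>",
    "objective": "<exam findings, vitals, test results>",
    "assessment": "<clinical impression>",
    "plan": "<treatment and follow-up>",
}


def build_single_shot_prompt(transcript: str) -> str:
    """Combine the preamble, one JSON exemplar, and the visit transcript."""
    exemplar = json.dumps(EXEMPLAR_NOTE, indent=2)
    return (
        f"{PREAMBLE}\n\n"
        "Write a SOAP note for the visit transcript below. "
        "Respond with JSON in exactly this format:\n"
        f"{exemplar}\n\n"
        f"Transcript:\n{transcript.strip()}\n\n"
        "SOAP note (JSON):"
    )
```

Because only one exemplar is included, the prompt stays short, but the model sees a single illustration of the target format rather than several.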

For our own model, we focused on open-source large language models and fine-tuned them on hundreds of real-life transcripts and the corresponding SOAP notes in JSON format. We evaluated each approach with two metrics: DICE and a custom similarity score.
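For readers unfamiliar with DICE, the sketch below shows one common way to apply the Dice coefficient to text: treat each note as a set of lowercase word tokens and compute 2|A ∩ B| / (|A| + |B|). The text does not say how DICE was computed in this experiment (tokenization, per-section versus whole-note scoring), so the tokenizer and the function `dice_score` are assumptions for illustration. The custom similarity score is not described and is not reproduced here.

```python
import re
from typing import Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def dice_score(generated: str, reference: str) -> float:
    """Dice coefficient between the token sets of two notes.

    Returns 1.0 for identical token sets and 0.0 when nothing overlaps.
    Two empty notes are treated as a perfect match (an assumption).
    """
    a, b = tokenize(generated), tokenize(reference)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Invented strings: 3 shared tokens, set sizes 5 and 3 -> 2*3/(5+3)
print(dice_score("Patient reports cough and fever", "patient reports fever"))  # 0.75
```

A set-based score like this ignores word order and repetition, which is one reason it is typically paired with a second similarity measure.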

Tests (shown in Figure 1.0) indicate that our model performs better than GPT4 and GPT3.5 at helping doctors generate SOAP notes efficiently and accurately. This could change how doctors use AI in healthcare.