Large language models (LLMs) are poised to become a much bigger part of doctors’ clinical workflows, Scott Gottlieb, who served as FDA commissioner during the Trump administration, said Tuesday at the 3rd Annual Summit on the Future of Rural Health Care.
Ibrahim highlighted research that Gottlieb recently conducted with the American Enterprise Institute, a center-right think tank. The study, released this summer, put five LLMs to the test: OpenAI’s ChatGPT-4o, Google’s Gemini Advanced, Anthropic’s Claude 3.5, xAI’s Grok and HuggingChat, Hugging Face’s chatbot built on Meta’s Llama.
The research team asked these LLMs 50 questions from the most challenging installment of the three-part U.S. Medical Licensing Examination. The AI models did quite well.
OpenAI’s ChatGPT-4o performed best, with an accuracy rate of 98%. HuggingChat had the worst accuracy rate at 66%, and the remaining LLMs scored in the 84-90% range.
The U.S. Medical Licensing Examination requires candidates to answer about 60% of questions correctly. The average passing score for the exam has historically hovered around 75%.
Based on these study results, as well as the AI innovation he sees in his role as a partner at New Enterprise Associates, Gottlieb is optimistic about the role LLMs can play in the future of healthcare. But he doesn’t think this potential is being realized yet.
“I think we’re at the point right now that if you’re handling a complex case and you’re not using [LLMs], you probably should be. I think most physicians probably aren’t, because there’s not a good option within a health system setting where you can do it in a HIPAA-compliant fashion. There’s not a lot of systems that have deployed local instances of these chatbots,” Gottlieb explained.
He also mentioned research he is currently conducting to further test LLMs’ medical capabilities. Gottlieb and his team are feeding ChatGPT-4o clinical vignettes from the New England Journal of Medicine. In every issue, the journal includes a vignette of a difficult-to-pin-down clinical case and gives readers a multiple-choice selection of possible diagnoses; the answers are revealed in the next issue.
There are 350 examples of the journal’s clinical vignettes online, and Gottlieb and his team are feeding them all to ChatGPT-4o.
“So far, it’s getting 100% — and it explains how it arrived at the diagnosis. It takes things from the clinical vignette and explains why those clues were the key clues in helping to arrive at this diagnosis. The clinical reasoning is really profound,” he declared.
Gottlieb asked the audience to imagine a medical resident receiving a call for a complex case late at night. To him, it’s obvious that the resident should be able to use an LLM to help them more quickly reach a differential diagnosis.
“I mean, you almost have to be doing it,” Gottlieb remarked.
LLMs for clinical decision support haven’t been deployed at scale yet, though, he noted.
These tools aren’t easily accessible for most doctors. To use LLMs for diagnostic support, health systems must either create their own models or modify existing ones by layering on local health data and adding patient data privacy controls — and that takes time and resources, Gottlieb explained.
“But I think very soon everyone is going to have to think about how to deploy this [at the] point of care,” he said.