Jay Clayton: I’m not in favor of a potential breakup of Google
Former SEC Chairman Jay Clayton joins ‘Squawk Box’ to discuss news of DOJ considering a possible breakup of Google as an antitrust remedy following the…
Thought Leader: Jay Clayton
This is an Op-ed by WWSG exclusive thought leader, Dr. Scott Gottlieb.
Many consumers and medical providers are turning to chatbots, powered by large language models, to answer medical questions and inform treatment choices. We decided to see whether there were major differences between the leading platforms when it came to their clinical aptitude.
To secure a medical license in the United States, aspiring doctors must successfully navigate three stages of the U.S. Medical Licensing Examination, with the third and final installment widely regarded as the most challenging. It requires candidates to answer about 60% of the questions correctly and, historically, the average passing score hovered around 75%.
When we subjected the major large language models to the same Step 3 examination, their performance was markedly superior, with scores that significantly outpaced those of many doctors.
But there were some clear differences between the models.
Typically taken after the first year of residency, the USMLE Step 3 gauges whether medical graduates can apply their understanding of clinical science to the unsupervised practice of medicine. It assesses a new doctor’s ability to manage patient care across a broad range of medical disciplines and includes both multiple-choice questions and computer-based case simulations.
We isolated 50 questions from the 2023 USMLE Step 3 sample test to evaluate the clinical proficiency of five leading large language models, feeding the same set of questions to each of these platforms: ChatGPT, Claude, Google Gemini, Grok and Llama.
Other studies have gauged these models for their medical proficiency, but to our knowledge, this is the first time these five leading platforms have been compared in a head-to-head evaluation. These results could give consumers and providers some insights on where they should be turning.
Here’s how they scored:
In our experiment, OpenAI’s ChatGPT-4o emerged as the top performer, achieving a score of 98%. It provided detailed medical analyses, employing language reminiscent of a medical professional. It not only delivered answers with extensive reasoning, but also contextualized its decision-making process, explaining why alternative answers were less suitable.
Claude, from Anthropic, came in second with a score of 90%. It provided more human-like responses with simpler language and a bullet-point structure that might be more approachable to patients. Gemini, which scored 86%, gave answers that weren’t as thorough as those of ChatGPT or Claude, making its reasoning harder to decipher, but its answers were succinct and straightforward.
Grok, the chatbot from Elon Musk’s xAI, scored a respectable 84% but didn’t provide descriptive reasoning during our analysis, making it hard to understand how it arrived at its answers. While HuggingChat — an open-source website built from Meta’s Llama — scored the lowest at 66%, it nonetheless showed good reasoning for the questions it answered correctly, providing concise responses and links to sources.
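To put those percentages in concrete terms, here is a minimal sketch that converts each reported score into a count of correct answers. It assumes every percentage was computed over all 50 questions, which the article does not state explicitly, and it borrows the roughly 60% passing bar that applies to the full exam rather than to this sample. The names `reported_scores`, `correct_answers` and `summarize` are illustrative only.

```python
# Scores reported in the article, as percentages.
# Assumption: each percentage is out of the full 50-question sample.
QUESTION_COUNT = 50
# Assumption: the ~60% bar cited for the full Step 3 exam is used here
# only as a rough point of comparison for the 50-question sample.
PASSING_PERCENT = 60

reported_scores = {
    "ChatGPT-4o": 98,
    "Claude": 90,
    "Gemini": 86,
    "Grok": 84,
    "HuggingChat (Llama)": 66,
}


def correct_answers(percent, total=QUESTION_COUNT):
    """Convert a percentage score into a count of correct answers."""
    return round(percent * total / 100)


def summarize(scores):
    """Return (model, percent, correct, clears_bar) rows, best score first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [
        (model, percent, correct_answers(percent), percent >= PASSING_PERCENT)
        for model, percent in ranked
    ]


for model, percent, correct, clears_bar in summarize(reported_scores):
    print(f"{model}: {percent}% ({correct}/{QUESTION_COUNT}), clears ~60% bar: {clears_bar}")
```

Under these assumptions, the gap between the top and bottom performers is 16 questions out of 50, and even the lowest score sits above the approximate passing threshold.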
One question that most of the models got wrong related to a 75-year-old woman with a hypothetical heart condition. The question asked the physicians which was the most appropriate next step as part of her evaluation. Claude was the only model that generated the correct answer.
Another notable question focused on a 20-year-old male patient presenting with symptoms of a sexually transmitted infection. It asked physicians which of five choices was the appropriate next step as part of his workup. ChatGPT correctly determined that the patient should be scheduled for HIV serology testing in three months, but the model went further, recommending a follow-up examination in one week to ensure that the patient’s symptoms had resolved and that the antibiotics covered his strain of infection. To us, the response highlighted the model’s capacity for broader reasoning, expanding beyond the fixed answer choices presented by the exam.
These models weren’t designed for medical reasoning; they’re products of the consumer technology sector, crafted to perform tasks like language translation and content generation. Despite their non-medical origins, they’ve shown a surprising aptitude for clinical reasoning.
Newer platforms are being purposely built to solve medical problems. Google recently introduced Med-Gemini, a refined version of its previous Gemini models that’s fine-tuned for medical applications and equipped with web-based searching capabilities to enhance clinical reasoning.
As these models evolve, their skill in analyzing complex medical data, diagnosing conditions and recommending treatments will sharpen. They may offer a level of precision and consistency that human providers, constrained by fatigue and error, might sometimes struggle to match, and open the way to a future where treatment portals can be powered by machines, rather than doctors.