A team of researchers highlights the limits of artificial intelligence, in this case ChatGPT, in hospital emergency departments.
- While ChatGPT is already proving useful for writing clinical notes, the AI model is still far from reliable in hospital emergency departments, according to a new study.
- It can in fact suggest unnecessary tests, prescribe inappropriate treatments or even admit patients to hospital who do not need admission. In short, it does not match the clinical judgment of a flesh-and-blood doctor.
- The reason: the AI plays it as safe as possible. According to the study, a balance must be struck between detecting serious problems and preventing unnecessary interventions.
While ChatGPT can write clinical notes or answer theoretical questions in medicine, it may prove far less effective in practice in hospital emergency departments. According to a new study published in the journal Nature Communications, artificial intelligence (AI) can in fact suggest unnecessary tests, prescribe inappropriate treatments or even admit patients to hospital who do not need admission. In short, it does not match the clinical judgment of a flesh-and-blood doctor.
“Here is a valuable message for clinicians: do not blindly trust AI,” warns Chris Williams, lead author of the study and researcher at the University of California, San Francisco (UCSF), in the United States. According to him, while ChatGPT can be useful for certain specific tasks, it is not designed to handle complex situations involving several factors, such as those encountered in an emergency department.
ChatGPT less reliable than doctors in emergency departments
In a previous study, the team of scientists showed that ChatGPT was slightly better than humans at identifying which of two patients was sicker in a simple scenario. This time, however, they posed a much more complex challenge to the AI: making recommendations after an initial examination in the emergency room, particularly regarding admission, X-rays or antibiotic prescriptions.
To evaluate the accuracy of ChatGPT, the researchers analyzed 1,000 emergency room visits drawn from UCSF medical records. The AI’s decisions were then compared with those made by resident doctors. The result? ChatGPT-3.5 and ChatGPT-4 were found to be 24% and 8% less accurate than practitioners, respectively. This is not surprising, since these AI models, trained primarily on online data, tend to overprescribe and recommend unnecessary medical care.
The excessive caution of AI in medical matters
“ChatGPT tools are almost set to say ‘Please consult a doctor’” and play it safe as much as possible, summarizes Chris Williams. The trouble is that, “in an emergency context, where the slightest error can have serious consequences, this excessive caution results in unnecessary interventions, which can cause harm to patients, overburden hospital resources and increase costs.”
According to the researcher, for AI to be effectively integrated into emergency services, it is crucial to develop frameworks that allow clinical information to be correctly evaluated. A balance must be struck between detecting serious problems and preventing unnecessary interventions. “There is no perfect solution,” he admits.