ChatGPT-4, an artificial intelligence program, performed comparably to doctors in clinical reasoning and diagnosis, but it also made more reasoning errors.
- Doctors and ChatGPT-4 worked on 20 clinical cases, demonstrating their clinical reasoning across four stages.
- The results in terms of diagnostic accuracy and clinical reasoning were nearly identical between the artificial intelligence and the doctors.
- On the other hand, the researchers observed more instances of incorrect reasoning from ChatGPT-4 than from the doctors.
Will artificial intelligence eventually surpass doctors? According to a new study published in the journal JAMA Internal Medicine, it sometimes does in terms of clinical reasoning and diagnosis, but errors remain.
Artificial intelligence diagnosis compared to that of doctors
Scientists at Beth Israel Deaconess Medical Center (BIDMC) used ChatGPT-4, an artificial intelligence program based on a large language model (LLM), that is, a model with a very large number of parameters.
According to the French National Institute of Health and Medical Research (Inserm), LLMs "encode large amounts of text into a form that records how words and phrases relate to each other. From this encoding, they are then able to make predictions about which words might follow others."
ChatGPT and other conversational agents operate according to these learning models and can therefore generate text (an answer) that follows a priming sequence (the question). But they are "not able to discern what is true from what is not." In medicine, research is currently under way to test how effective artificial intelligence is at predicting risks, better tailoring treatments or, as in this new study, making a diagnosis.
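To make the mechanism Inserm describes concrete, here is a minimal sketch of next-token text generation. It assumes the Hugging Face transformers library and the small, freely available GPT-2 model (not GPT-4, which is only accessible through OpenAI's API); the prompt is purely illustrative.

```python
# Minimal sketch of "continuing a priming sequence": the model extends
# the prompt with the words it predicts are most likely to follow.
# Assumes the Hugging Face "transformers" library and the GPT-2 model,
# which is NOT the model used in the study.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "A 54-year-old patient presents with chest pain and"
result = generator(prompt, max_new_tokens=20, num_return_sequences=1)

# The model has no notion of whether its continuation is true;
# it simply produces statistically plausible text.
print(result[0]["generated_text"])
```

This is exactly the limitation the article flags: the output is fluent because it is statistically likely, not because the model can verify it.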
In their work, the researchers compared the diagnoses, and also the clinical reasoning, of ChatGPT-4 with those of 21 attending physicians and 18 resident physicians (doctors in postgraduate training) in internal medicine at two academic medical centers.
"Very early on, we observed that LLMs can make diagnoses, but any practitioner knows that medicine is much more than that," explains Adam Rodman, an internal medicine physician and researcher in the BIDMC Department of Medicine, in a press release. "A diagnosis involves several steps, so we wanted to assess whether LLMs were as effective as doctors at this type of clinical reasoning. It is a surprising finding that [LLMs] are capable of showing reasoning equivalent to or better than that of humans throughout the evolution of a clinical case."
The doctors and ChatGPT-4 worked on 20 clinical cases. At each of four stages, they had to justify their diagnoses in writing. To compare the results between the machine and the humans, the researchers used a tool called r-IDEA, which is already used to evaluate the clinical reasoning of doctors.
"The first step is to sort through the data, when the patient tells you what is bothering them and you measure their [vital signs]," says Stephanie Cabral, the study's lead author, in the press release. "The second step is [integrating] additional patient information. The third step is the physical examination, and the fourth is diagnostic tests and imaging."
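For readers who find a schematic easier to follow, here is a purely illustrative sketch of that four-stage protocol. The stage names follow Cabral's description above, but the data structure and the 0-10 score attached to each written justification are assumptions for illustration, not the actual r-IDEA instrument.

```python
# Illustrative sketch only: stage names come from the article; the
# structure and the 0-10 per-stage score are assumptions, not the
# real r-IDEA rubric.
from dataclasses import dataclass

STAGES = (
    "triage data",                   # patient's complaint and vital signs
    "additional patient information",
    "physical examination",
    "diagnostic tests and imaging",
)

@dataclass
class StageResponse:
    stage: str
    written_reasoning: str  # the justification each participant wrote
    r_idea_score: int       # hypothetical 0-10 score for that reasoning

# One clinical case yields one written justification per stage.
case_responses = [
    StageResponse(stage, written_reasoning="...", r_idea_score=0)
    for stage in STAGES
]
```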
Artificial intelligence is sometimes “just wrong”
The conclusion: ChatGPT-4 achieved better r-IDEA scores, with a median of 10 out of 10, than attending physicians (9 out of 10) and residents (8 out of 10). But the researchers note that the results were nearly identical between the artificial intelligence and the doctors in terms of diagnostic accuracy (the rank of the correct diagnosis among all those proposed) and clinical reasoning.
On the other hand, the artificial intelligence was sometimes "just wrong": the researchers observed more instances of incorrect reasoning from ChatGPT-4 than from the doctors.
"Further studies are needed to determine how LLMs can best be integrated into clinical practice, but even now they could be useful [to double-check and] help us not miss anything," says Stephanie Cabral. "My greatest hope is that artificial intelligence will improve the patient-doctor relationship by reducing [the weaknesses] we currently experience and allowing us to focus more on the conversation we are having with our patients."
According to the researchers, ChatGPT-4 can therefore be a useful tool to help doctors make a diagnosis, but it cannot replace them. Moreover, as Inserm points out, "adapted and truly personalized care also relies in part on the relationship between the doctor and their patient, on the doctor's ability to integrate elements of socio-cultural context, to read the emotional state of the person in front of them, and so on. These are all elements that AI is still far from being able to integrate."