Study: Consumer-Facing AI Chatbots Aren’t Ready for Prime Time

By Eric Wicklund

New research out of the UK is dampening expectations that consumers can use AI to diagnose their health issues.

The study, conducted by the University of Oxford’s Nuffield Department of Primary Care Health Sciences and the Oxford Internet Institute, finds that consumers using LLMs for medical advice didn’t fare any better on getting the right advice than did a control group accessing “traditional sources of information.”

The problem? According to researchers, it’s the human element.

In what is billed as the largest user study of LLMs to date, researchers conducted a randomized trial involving nearly 1,300 participants. One group was tasked with asking an AI chatbot to diagnose a health complaint and recommend a course of action. The complaints used by the participants were developed by doctors and ranged from a young man developing a severe headache after a night out with friends to a new mother feeling constantly out of breath and exhausted.

After analyzing the results and comparing them with the control group, as well as with standard LLM testing strategies that do not involve human users, researchers found that the AI chatbots fell short precisely because they interacted with real people:

  • Consumers often don’t know what information they should provide to a chatbot in order to get a good diagnosis and treatment recommendation;
  • Even small variations in how a consumer phrases a question can produce a different answer from the chatbot;
  • As a result, chatbots can offer a mix of good and bad advice – and consumers won’t be able to tell the difference.

In short, consumers are usually not experienced enough in healthcare to know how to describe their medical concern or how to analyze the response. That’s where doctors and nurses come in.

“These findings highlight the difficulty of building AI systems that can genuinely support people in sensitive, high-stakes areas like health,” Rebecca Payne, GP, the study’s lead medical practitioner, said in a press release. “Despite all the hype, AI just isn’t ready to take on the role of the physician.”

“Patients need to be aware that asking a large language model about their symptoms can be dangerous, giving wrong diagnoses and failing to recognize when urgent help is needed,” she added.

“The disconnect between benchmark scores and real-world performance should be a wake-up call for AI developers and regulators,” added Adam Mahdi, the study’s senior author and an Associate Professor of the Reasoning with Machines Lab (OxRML) at the Oxford Internet Institute. “Our recent work on construct validity in benchmarks shows that many evaluations fail to measure what they claim to measure, and this study demonstrates exactly why that matters. We cannot rely on standardized tests alone to determine if these systems are safe for public use. Just as we require clinical trials for new medications, AI systems need rigorous testing with diverse, real users to understand their true capabilities in high-stakes settings like healthcare.”