Study Overview
Researchers from the Oxford Internet Institute and the Nuffield Department of Primary Care Health Sciences partnered with MLCommons to evaluate how well large language models (LLMs) help users assess health symptoms.
The trial involved 1,298 participants in the United Kingdom. One group used AI tools such as GPT‑4o, Llama 3, and Command R, while a control group relied on traditional methods like web search engines or personal knowledge.
Key Findings
The AI‑assisted group performed no better than the control group in two critical areas:
- Evaluating the urgency of a medical condition
- Identifying the correct medical diagnosis
In several cases, the chatbots gave the AI group contradictory or outright incorrect advice.
Challenges Identified
Researchers pinpointed two primary obstacles that limited the effectiveness of the chatbots:
- Incomplete user input: Participants struggled to supply all relevant and precise information needed for accurate AI analysis.
- Model reliability: The LLMs occasionally generated inconsistent or false recommendations, undermining trust.
Implications for Users
The findings suggest that, for now, AI chatbots offer no measurable advantage over conventional search engines or personal knowledge as a first‑line tool for medical queries. Users should treat AI‑generated health advice with caution and verify information through trusted medical sources.
Recommendations
- Use AI chatbots as a supplementary aid, not a definitive diagnostic tool.
- Provide clear, detailed symptom descriptions when interacting with AI.
- Cross‑check AI suggestions with reputable medical websites or professionals.
- Encourage developers to improve context handling and factual accuracy in future LLM releases.