AI Chatbots Underperform Compared to Search Engines for Medical Advice

A recent Oxford study found that people using AI chatbots such as GPT‑4o and Llama 3 assessed medical symptoms no better than those relying on traditional methods such as search engines, highlighting key usability and accuracy issues.
10 February 2026 by TechStora Editorial Board

Study Overview

Researchers from the Oxford Internet Institute and the Nuffield Department of Primary Care Health Sciences partnered with MLCommons to evaluate how large language models (LLMs) assist users in assessing health symptoms.

The trial involved 1,298 participants in the United Kingdom. One group used AI tools such as GPT‑4o, Llama 3, and Command R, while a control group relied on traditional methods like web search engines or personal knowledge.

Key Findings

The AI‑assisted group performed no better than the control group in two critical areas:

  • Evaluating the urgency of a medical condition
  • Identifying the correct medical diagnosis

In several cases, the chatbots gave participants in the AI group contradictory or outright incorrect advice.

Challenges Identified

Researchers pinpointed two primary obstacles that limited the effectiveness of the chatbots:

  • Incomplete user input: Participants struggled to supply all relevant and precise information needed for accurate AI analysis.
  • Model reliability: The LLMs occasionally generated inconsistent or false recommendations, undermining trust.

Implications for Users

The findings suggest that, for now, AI chatbots offer no clear advantage over conventional search engines as a first‑line tool for medical queries. Users should treat AI‑generated health advice with caution and verify information through trusted medical sources.

Recommendations

  • Use AI chatbots as a supplementary aid, not a definitive diagnostic tool.
  • Provide clear, detailed symptom descriptions when interacting with AI.
  • Cross‑check AI suggestions with reputable medical websites or professionals.
  • Encourage developers to improve context handling and factual accuracy in future LLM releases.