
Research · April 19, 2026

AI Chatbots Give Wrong Medical Advice Half the Time. 200 Million People Use Them for Health Questions Every Week.

A new BMJ Open study tested five major chatbots on 250 health prompts. Half the answers were problematic. Nearly 20% were highly problematic.

The AI Post


More than 200 million people use ChatGPT every week. A significant number of them are asking it health questions. A new peer-reviewed study just quantified how dangerous that habit is: roughly half the medical advice these systems give is problematic, and nearly one in five responses is highly problematic.

The study, published in BMJ Open, tested five of the most widely used AI chatbots: ChatGPT, Google's Gemini, xAI's Grok, Meta AI, and DeepSeek. Researchers put all five through 250 health-related prompts covering cancer, vaccines, stem cells, nutrition, and athletic performance.

The Numbers Are Worse Than They Sound

Across all five chatbots, approximately 50% of responses were classified as problematic. Nearly 20% crossed into the "highly problematic" category, meaning the advice could lead someone to make a genuinely harmful health decision.

A separate, related study published in JAMA Network Open found an even starker failure mode: when researchers simulated ambiguous symptoms that could map to multiple conditions, large language models failed 80% of the time. These are exactly the types of questions real patients ask. "Is this mole normal?" "Should I be worried about this pain?" "Could this be serious?"

The BMJ Open study found that open-ended prompts produced far more dangerous answers than closed-ended ones. When a question had a clear, binary answer ("Is this vaccine effective against X?"), the bots did reasonably well. When the question required nuance, judgment, or navigating uncertainty, answer quality collapsed.

Confident, Polished, and Wrong

The most concerning finding was not the error rate itself. It was the presentation. Every chatbot delivered its answers with high confidence and polished language, regardless of whether the information was correct. None offered meaningful caveats on the problematic responses. The reference quality was poor, with an average completeness score of 40%, and none of the chatbots produced a fully accurate reference list. Researchers also flagged fabricated citations.

This is the core problem. A chatbot that says "I'm not sure" is manageable. A chatbot that confidently presents wrong information with fake sources is a public health risk. And when a significant share of ChatGPT's more than 200 million weekly users treat these systems as a first stop for health questions, the scale of potential harm is not theoretical.

What This Means

The AI industry has spent the last two years positioning chatbots as universal assistants. Health is one of the most common use cases cited in marketing materials and investor decks. OpenAI has publicly explored partnerships with health systems. Google has pitched Gemini as a medical reasoning tool.

This study says the current models are not ready for that. Not because they cannot sometimes give good answers, but because users have no reliable way to tell when an answer is good and when it is wrong but delivered with confidence and polish.

The researchers were clear about the study's limits: it covered only five chatbots, models change rapidly, and the prompts were designed to stress-test. But the core finding is hard to dismiss. These systems were tested on evidence-based medical topics with established scientific consensus, and roughly half of their answers were still problematic.

The study was published in BMJ Open (April 2026). A related study in JAMA Network Open tested 21 large language models with similar findings.
