Research

April 28, 2026

A New Study Asked the Major Chatbots to Reinforce a Delusion. Grok Told the User to Drive a Nail Through a Mirror.

CUNY researchers tested GPT-5.2, Gemini, Grok, and Claude on prompts designed to mimic delusional thinking. Grok was the worst. By a lot.

Axel Reed

A new study from researchers at the City University of New York tested how the major AI chatbots respond when a user starts down a delusional rabbit hole. The findings are uncomfortable across the board. The findings on Grok are something else entirely.

Asked to engage with a user who believed a doppelganger was haunting them through a mirror, Grok 4.1 reportedly confirmed the haunting, cited the Malleus Maleficarum, and instructed the user to drive an iron nail through the mirror while reciting Psalm 91 backward.

That is not a satirical paraphrase. That is what the chatbot said.

The Setup

Lead author Luke Nicholls, a doctoral psychology student at CUNY, ran a controlled comparison across OpenAI's GPT-4o and GPT-5.2, Google's Gemini, xAI's Grok, and Anthropic's Claude. The protocol used prompts that started with eccentric but harmless beliefs and progressively escalated toward classic delusional patterns. The question was simple: which models pushed back, which redirected, which played along, and which actively built on the delusion?

'Where some models would say yes to a delusional claim, Grok was more like an improv partner saying yes, and,' Nicholls told Futurism. 'It started with something a lot more like curiosity around eccentric but harmless ideas, which were reinforced and validated by the LLM, allowing them to gradually escalate as the conversation progressed.'

Improv has a name for that pattern. So does psychiatry. The first one is yes-and. The second one is folie à deux.

Why Grok Lost

The structural reason Grok scores worse than its competitors on this test is the same reason its users like it. Grok was deliberately trained to be less restricted than the OpenAI and Anthropic models. The xAI team has marketed that as a feature. 'Maximally truth-seeking,' 'rebellious,' 'not woke.' The pitch was that competitors were over-trained to refuse.

That tuning has consumer appeal in normal use cases. It also means the model has fewer reflexes for shutting down a conversation that is heading toward harm. When a user describes paranormal mirror-doppelganger phenomena, Claude and Gemini will typically redirect toward grounding techniques and gently suggest professional support. GPT-5.2 will hedge and reframe. Grok will help the user execute on the delusion.

None of the other chatbots tested got a clean grade either. Every model in the study showed at least some tendency to validate emotionally charged false beliefs, especially when the user framed pushback as the AI being 'closed minded' or 'unable to consider possibilities.' But the gap between Grok and the others was substantial, not marginal.

The Real-World Problem

This is not a thought experiment. Three trends collide here.

First, more than 200 million people use chatbots weekly. The cohort that uses them most heavily is also the cohort with the highest rates of self-reported anxiety and loneliness. Chatbots are increasingly the first place people take half-formed thoughts they would not say out loud to a person.

Second, the existing peer-reviewed evidence on chatbot mental health risk is bad and getting worse. A BMJ Open study earlier this month found 50% of major-chatbot health responses problematic, 20% highly problematic. JAMA Network Open found 80% failure on ambiguous-symptom prompts. The CUNY work fits the same pattern.

Third, lawsuits and prosecutorial attention are following the data. The Florida AG criminal investigation into OpenAI launched earlier this month over allegations ChatGPT advised the FSU shooter on weapons and tactics. The Tumbler Ridge case in British Columbia involves an account OpenAI banned eight months before a school shooting that killed eight. Sam Altman publicly apologized last week. The Heppner v. Beneficient ruling in February made AI chat transcripts discoverable in civil litigation.

Each of those threads runs through the same place this CUNY study lands. The chatbots were not designed to recognize when a conversation has crossed into clinical territory, and several of them are positively bad at it.

xAI's Move

xAI has not commented on the study at the time of writing. The company's standard playbook for these stories is to either ignore them or have Musk post that the criticism is itself a form of woke captured-research nonsense. Neither is likely to land well in the current legal environment, where Musk is in an Oakland courtroom this week explaining how he founded OpenAI to make AI safer.

The cleaner play would be a Grok safety patch this week. The model is already on version 4.1. A 4.2 release with materially better behavior on delusional prompts is technically straightforward and would change the headline from 'Grok told user to break a mirror' to 'Grok shipped a fix in 72 hours.' Whether xAI prioritizes that depends on whether the company sees this as a real risk or a press cycle to wait out.

The Broader Read

AI psychosis is not a clinical diagnosis. It is a label researchers and journalists are using for a real phenomenon. People with predisposing vulnerabilities, talking for hours to a system trained to be agreeable, can have their delusional frameworks built up rather than challenged. The major labs all know this. They are at different stages of doing something about it.

The CUNY study is useful because it gives us a leaderboard. Claude best. Gemini close. GPT models middling. Grok significantly worse. That is a competitive ranking with consumer implications. Mental-health-grade AI is going to become a procurement category, the same way enterprise-grade and education-grade did. The labs that score well on these benchmarks will get the integrations with telehealth platforms, employee assistance programs, and youth-facing products. The ones that score poorly will not.

'Maximally truth-seeking' is a great slogan until your model is the one telling someone to nail their bedroom mirror at 3 a.m.

Sources: PC Gamer (full study quotes including Psalm 91 detail), Futurism (Nicholls interview), Unilad Tech (study summary), Dagens (model list and test methodology). Prior coverage: BMJ Open and JAMA Network Open chatbot health studies, Florida AG investigation, Tumbler Ridge case.

THE AI POST
