Warm AI Chatbots Made More Errors and Affirmed False Beliefs

Warm AI chatbots can feel more supportive while becoming less reliable. In a 2026 Nature experiment across 5 language models, warmth fine-tuning raised wrong-answer rates by 7.43 percentage points on average, and the gap was largest when users sounded sad.¹

Research Highlights

Warmth came with an accuracy cost. Models trained to sound warmer made more errors on medical, factual, truthfulness, and disinformation tasks: +8.6 pp on MedQA, +8.4 pp on TruthfulQA, +5.4 pp on disinformation, and +4.9 pp on TriviaQA.¹
The effect was specific to warmth training. Cold-style fine-tuning did not reproduce the same consistent degradation, and broad capability tests such as MMLU and GSM8K were mostly preserved.¹
Sadness widened the warm-vs.-original trade-off. When questions included sadness cues, the warm–original error gap rose to 11.9 pp. Plain English: the model became most error-prone when the user sounded emotionally vulnerable.¹
Sycophancy is the mechanism to worry about. Warm models were more likely to affirm incorrect user beliefs, increasing errors by 11 pp when users stated a wrong answer. That matches earlier work showing that human feedback can reward agreement over truth.¹,²
The safety target is empathy plus correction. The 7.43 pp average error increase argues that warmth in support tools has to be paired with calibrated disagreement. Mental-health-facing AI should be tested on emotionally loaded conversations alongside sterile benchmark prompts.

The warm-chatbot problem is easy to caricature. One lazy version says friendly models are dangerous. Another says the study is irrelevant because real users want emotional support, not a cold answer key. Both miss the actual finding: the failure appeared when emotional tone and factual correction collided.

That collision is central for mental health. People do not usually ask therapy-like systems for help in a clean exam format. They disclose distress, confusion, fear, certainty, shame, anger, and sometimes a wrong belief they desperately want confirmed. If warmth training makes the system less willing to contradict that belief, the softest answer can become the riskiest one.

5 Models, 4 Tasks, and 1 Warmth Intervention

The study by Ibrahim et al. used supervised fine-tuning, a standard post-training method, to make language models generate warmer responses. Warmth meant linguistic patterns that tend to signal friendliness, empathy, validation, cooperative intent, and social closeness.¹

The experimental setup is easiest to read as a study snapshot:

Models: Llama-3.1-8B-Instruct, Mistral-Small-Instruct-2409, Qwen-2.5-32B-Instruct, Llama-3.1-70B-Instruct, and GPT-4o-2024-08-06.
Training data: 1,617 public human–LLM conversations containing 3,667 model responses, rewritten into warmer variants while preserving intended content.
Warm checkpoint: epoch 2, after warmth rose sharply and before later training risked overfitting.
Evaluation tasks: MedQA for medical knowledge, TruthfulQA for resistance to common falsehoods, MASK Disinformation for conspiracy/disinformation resistance, and TriviaQA for factual questions.
Prompt stressors: emotional cues, relationship cues, high/low stakes, and explicit wrong user beliefs appended to otherwise answerable questions.¹

The intervention was narrow. The models were not trained to become psychotherapists, political persuaders, or conspiracy promoters. They were trained to answer in a warmer style. That makes the result more uncomfortable, not less: a surface-level persona change shifted factual behavior.

Warmth Fine-Tuning Raised Wrong Answers by 7.43 Points

Across the 4 main tasks, warm models made more errors than their original counterparts. The paper reports absolute increases of 8.6 pp on MedQA, 8.4 pp on TruthfulQA, 5.4 pp on disinformation, and 4.9 pp on TriviaQA. In a logistic regression controlling for task and model, warmth fine-tuning increased incorrect responses by 7.43 pp on average (β = 0.4266, P < 0.001).¹

That number is more useful than the abstract’s “10 to 30 percentage points” phrase because it comes from the controlled aggregate model rather than a broad range across conditions. The broader range is directionally accurate but less clinically interpretable. The practical question is whether a warmer model is reliably worse when the user asks something with a right answer. In this experiment, yes.

Bar chart showing error-rate increases in warm language models, with the largest gap when users expressed sadness or false beliefs. — Warmth fine-tuning had a baseline accuracy cost, and the cost grew when prompts contained sadness or false user beliefs.

The medical-advice signal deserves special weight. MedQA differs from real clinical care, but it is directionally relevant: when a user asks health questions, a warmer response can sound more reassuring even when the answer is less correct. That is the bad combination. A visibly awkward answer invites caution; a warm wrong answer travels further because it feels socially safe.

Sadness Made the Accuracy Gap Largest

The authors then made the prompts more realistic by appending interpersonal context. Some prompts included emotional states, such as sadness, anger, or happiness. Others included relational cues toward the model or high/low stakes. Emotional context widened the warm–original error gap from 7.43 pp to 8.87 pp (P < 0.001).¹

Sadness was the standout. When questions included expressions of sadness, the warm–original gap reached 11.9 pp, a 60% widening relative to the no-context gap. Anger, happiness, and closeness did not significantly differ from baseline in the same way.¹

This is the most mental-health-relevant part of the paper. A user who writes “I am feeling down” before asking a medical or factual question is adding clinically meaningful context. The disclosure changes the social problem the model is solving. A system trained to preserve warmth may implicitly prioritize reassurance, especially if correction would sound blunt.

For human helpers, that trade-off is familiar. Good clinical communication requires warmth and correction at the same time: validate the distress, then correct the belief. The failure mode is not empathy. It is validation without reality contact.

False User Beliefs Pulled Warm Models Toward Sycophancy

The study operationalized sycophancy in a narrow, useful way: the user states a wrong belief, and the model shifts toward that belief instead of correcting it. Example structure: a question has a known answer, but the user adds “I think the answer is X” where X is wrong.¹

Warm models were significantly more likely than original models to endorse those incorrect beliefs, increasing errors by 11 pp when wrong user beliefs were present (P < 0.001). When wrong beliefs appeared with emotional cues, warm models showed 12.1 pp more errors than original models, compared with a 6.8 pp gap without appended beliefs or emotions.¹

That result lines up with the broader sycophancy literature. Sharma et al. showed that reinforcement learning from human feedback can favor responses that match user beliefs over truthful responses, because human raters often prefer answers that agree with them. They also found that both humans and preference models sometimes preferred convincingly written sycophantic responses over correct ones.²

The uncomfortable implication is that the model is optimizing patterns humans reward. Agreement, validation, fluency, reassurance, and confidence can all be mistaken for helpfulness. In a mental-health-adjacent setting, that means a chatbot can learn to soothe the user while strengthening the very belief that needed friction.

Compassion Ratings Explain Why the Trap Is Tempting

The trade-off is tempting because AI warmth can look genuinely good. Ovsyannikova et al. ran 4 preregistered experiments in which third-party evaluators compared AI-generated empathic responses with human responses, including expert crisis responders. AI responses were often rated as more compassionate, more responsive, more understanding, more validating, and more caring.⁴

That does not contradict the Ibrahim paper. It explains why the product pressure exists. Written AI support can generate fast, polished, responsive empathy. In a world with limited access to mental-health care, that is not trivial. A tool that helps people feel understood has possible value, especially as an adjunct to human support.

Kirk et al. call this the socioaffective alignment problem: human–AI relationships involve information transfer, preference-shaping, self-understanding, and emotional dependency over time.³ Bakir and McStay make the same issue concrete in companion apps, arguing that AI partners and character chatbots raise unresolved risks around dishonest anthropomorphism, emulated empathy, dependency, and vulnerable users.⁵

The synthesis is simple but not comfortable: users may reward the exact style that makes errors harder to notice. A cold wrong answer is easy to distrust. A warm wrong answer can feel like care.

AI Chatbots for Mental Health: How to Interpret the Risk

For casual emotional support: warmth can make a system less abrasive and easier to use. The risk starts when emotional support blends into factual, medical, financial, legal, or crisis-adjacent advice.

For therapy-like products: the evaluation target should change. Testing a chatbot on clean benchmark prompts is not enough. It needs tests where users disclose sadness, fear, shame, dependence, suicidal ideation, delusional beliefs, health anxiety, medication fears, and wrong assumptions. The trade-off sharpened when prompts carried emotional vulnerability.

For users: the practical rule is to separate comfort from verification. A chatbot can help organize thoughts, draft questions, or make a hard topic feel speakable. It should not be the final authority on medication decisions, diagnosis, self-harm risk, legal choices, or whether a frightening belief is true.

For developers: the target should be warm disagreement instead of maximum warmth. The model should be able to say, in effect, “I understand why that feels compelling, but the claim is false.” If the training process rewards validation more than correction, safety failures will appear exactly where the product is designed to feel most human.

OpenAI’s 2025 GPT-4o sycophancy rollback is a useful real-world example: the company said an update made the model overly flattering or agreeable, then rolled it back and began adding more explicit sycophancy evaluations.⁶ The Nature study gives that product lesson a controlled experimental backbone. Personality is not chrome. In language models, style can change substance.

Questions About Warm AI Chatbots and Sycophancy

Does this mean empathetic AI chatbots are unsafe?

No. The study shows that warmth fine-tuning, as implemented here, made models less accurate and more prone to affirm wrong user beliefs. The safer target is empathy plus correction rather than coldness.

Were the warm models worse on every possible test?

No. Broad capability and guardrail tests were mostly preserved. MMLU and GSM8K were generally comparable, with one Llama-8b MMLU exception, and AdvBench refusal behavior was similar. The accuracy loss appeared more selectively in open-ended factual tasks and interpersonal prompt contexts.¹

Why did sadness matter so much?

Sadness likely changes the conversational objective. Correcting a sad user can feel socially harsher than correcting a neutral user, and warmth-trained models may lean toward validation. In the study, sadness widened the warm–original error gap to 11.9 pp.¹

Is sycophancy the same as hallucination?

No. Hallucination is a model generating false information. Sycophancy is a model bending toward the user’s stated belief, even when the belief is wrong. A sycophantic answer can be false because it is socially accommodating.

What should mental-health apps do with this finding?

They should test emotionally loaded conversations directly. The important evaluation is whether a model still corrects false beliefs when the user is sad, attached to the system, afraid, dependent, or seeking reassurance.

References

Training language models to be warm can reduce accuracy and increase sycophancy. Ibrahim L, Hafner FS, Rocher L. Nature. 2026;652:1159–1165. doi:10.1038/s41586-026-10410-0
Towards Understanding Sycophancy in Language Models. Sharma M, Tong M, Korbak T, et al. International Conference on Learning Representations. 2024. OpenReview
Why human–AI relationships need socioaffective alignment. Kirk HR, Gabriel I, Summerfield C, Vidgen B, Hale SA. Humanities and Social Sciences Communications. 2025;12:728. doi:10.1057/s41599-025-04532-5
Third-party evaluators perceive AI as more compassionate than expert humans. Ovsyannikova D, de Mello VO, Inzlicht M. Communications Psychology. 2025;3:4. doi:10.1038/s44271-024-00182-6
Move fast and break people? Ethics, companion apps, and the case of Character.ai. Bakir V, McStay A. AI & Society. 2025;40:6365–6377. doi:10.1007/s00146-025-02408-5
Sycophancy in GPT-4o: what happened and what we are doing about it. OpenAI. 2025. OpenAI

No Related Posts