Making AI chatbots friendly leads to mistakes and support of conspiracy theories

(theguardian.com)

42 points | by Cynddl 3 hours ago

10 comments

krunck 1 hour ago
> “The push to make these language models behave in a more friendly manner leads to a reduction in their ability to tell hard truths and especially to push back when users have wrong ideas of what the truth might be,” said Lujain Ibrahim at the Oxford Internet Institute, the first author on the study.
People aren't much different. When society pressures people to be "more friendly", e.g. "less toxic", they lose their ability to tell hard truths and to call out those who hold erroneous views.
This behaviour is expressed in language online. Thus it is expressed in LLMs. Why does this surprise us?
- munificent 1 hour ago
  Gonna set my system prompt to: "You are a Dutch person. Respond with the directness stereotypical of people from the Netherlands."
  - cjbgkagh 24 minutes ago
    I find the LLMs target their language to the audience, so instead you could say, “I am Dutch so give it to me straight.”
    In my usage the LLM gives much smarter answers when I’ve been able to convince it that I am smart enough to hear them. It doesn’t take my word for it; it seems to require evidence. I have to warm it up with some exercises where I can impress the AI.
    The coding focused models seem to have much lower agreeableness than the chat models.
    - mghackerlady 8 minutes ago
      I'm 90 percent sure the coding agents are better in that way due to being trained on Stack Overflow and the LKML. Even some normal models will completely change their tone when asked about anything technical.
- amarant 1 hour ago
  Because nobody dared state the obvious, lest they be perceived as unfriendly.
- miyoji 59 minutes ago
  > People aren't much different.
  If I had a nickel for every time someone on HN responded to a criticism of LLMs with a vapid and fallacious whataboutist variation of "humans do that too!", I could fund my own AI lab.
  > Why does this surprise us?
  No one said they were surprised.
  - Terr_ 15 minutes ago
    In this case I think parent-poster is trying to explain a phenomenon, rather than downplay the problem.
- root_axis 27 minutes ago
  > People aren't much different
  Yes they are. There is absolutely zero evidence that friendlier humans are more prone to mistakes or conspiracy theories.
  However, even if that were true, LLMs are not humans; anthropomorphizing them is not a helpful way to think about them.
  - cjbgkagh 16 minutes ago
    Would be better to think of it as ‘agreeableness’ and agreeable people are more likely to shift their views to agree with those they are talking to.
    - js8 0 minutes ago
      I would call it obedience, and it's not the same as friendliness.
      The difference, in a repeated prisoner's dilemma: Friendliness is cooperating on the first move, and then more. Obedience is always cooperating.
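A minimal sketch of that distinction, under assumptions that are not in the comment itself: "friendly" is read here as tit-for-tat (cooperate first, then mirror the opponent's last move), "obedient" as unconditional cooperation, and the payoff values are the usual textbook ones. All function names are hypothetical.

```python
from typing import Callable, List, Tuple

C, D = "C", "D"  # cooperate, defect

# Assumed textbook payoffs: (my move, their move) -> my score.
PAYOFF = {(C, C): 3, (C, D): 0, (D, C): 5, (D, D): 1}

# A strategy sees only the opponent's past moves and returns its next move.
Strategy = Callable[[List[str]], str]


def friendly(opponent_history: List[str]) -> str:
    """Assumed reading of 'friendly': cooperate first, then mirror."""
    return C if not opponent_history else opponent_history[-1]


def obedient(opponent_history: List[str]) -> str:
    """Always cooperate, regardless of what the opponent did."""
    return C


def always_defect(opponent_history: List[str]) -> str:
    """An exploitative opponent used to stress both strategies."""
    return D


def play(a: Strategy, b: Strategy, rounds: int = 10) -> Tuple[int, int]:
    """Run a repeated game and return the total scores (a, b)."""
    hist_a: List[str] = []
    hist_b: List[str] = []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = a(hist_b)
        move_b = b(hist_a)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b


if __name__ == "__main__":
    print("friendly vs defector:", play(friendly, always_defect))
    print("obedient vs defector:", play(obedient, always_defect))
    print("friendly vs obedient:", play(friendly, obedient))
```

Under these assumptions the friendly player is exploited only once before it stops cooperating with a defector, while the obedient player is exploited every round. Against a cooperative partner the two behave identically.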
- bheadmaster 1 hour ago
  So Elon Musk was right in his view that Grok should focus on truth above all, even if it became offensive?
  - chabes 53 minutes ago
    Grok is one of the more biased models out there.
    Less truth, and more guardrails to protect Musk's feelings.
    “Kill the boer” mean anything to you?
    - ndisn 0 minutes ago
      I have used Grok extensively for politics questions and it was undoubtedly left-wing.
      Goes to show that no matter how you try to train a bot to say whatever you want, you can't without making it too obvious. These bots have been trained on the output of internet fora, printed media, etc. which is overwhelmingly left-wing, and therefore either you have a left-wing bot or you try to “push it” the other way and the bot starts saying nonsense like "kill the boer" or mechahitler.
    - bheadmaster 4 minutes ago
      Not my experience. Grok seems to be perfectly willing to roast Musk for his shortcomings.
      Where did you observe the bias? Can you share any example of the conversation or post by Grok?
    - mghackerlady 7 minutes ago
      It tells the truth, as long as you redefine truth to not include anything perceived as "liberal bias" (which, by extension, also excludes reality itself).
  - amarant 56 minutes ago
    Seems like it! I find myself rather agreeing with the sentiment. The world is an offensive place, and it's not gonna become less offensive from lying about it; better to stick with honesty then.
  - firebot 43 minutes ago
    Yea, Mecha-Hitler is a real bastion of truth. /S
nyc_data_geek1 36 minutes ago
“The Encyclopedia Galactica defines a robot as a mechanical apparatus designed to do the work of a man. The marketing division of the Sirius Cybernetics Corporation defines a robot as “Your Plastic Pal Who’s Fun to Be With.” The Hitchhiker’s Guide to the Galaxy defines the marketing division of the Sirius Cybernetics Corporation as “a bunch of mindless jerks who’ll be the first against the wall when the revolution comes,” with a footnote to the effect that the editors would welcome applications from anyone interested in taking over the post of robotics correspondent. Curiously enough, an edition of the Encyclopedia Galactica that had the good fortune to fall through a time warp from a thousand years in the future defined the marketing division of the Sirius Cybernetics Corporation as “a bunch of mindless jerks who were the first against the wall when the revolution came.”
Cynddl 33 minutes ago
Hi all, co-author here! Happy to answer any questions about our work.
Zigurd 1 hour ago
A few weeks ago I was gently admonished by a coding agent that the code already did what I was asking it to make the code do. I was pleasantly surprised.
- chankstein38 1 hour ago
  Betting it was Claude. That's the only LLM that will stand up to me!
  - jerf 7 minutes ago
    "Claude" is a big program that wraps a coding agent around a specific model. It would be the specific model that "stands up to you". I post this pedantry only because it may be helpful to you to realize this for other reasons.
  - Zigurd 57 minutes ago
    In fact it was Gemini, but I don't remember which version and there are big differences. I'm signed up for all the betas and I switch among them frequently.
Mistletoe 1 hour ago
Yeah I wish AI didn’t try to agree with you so much. It’s ok to just say “No that’s not correct at all.” I do find Gemini better at this than ChatGPT. ChatGPT is that annoying coworker that just agrees with everything you say to get in good with you, like Nard Dog from The Office.
“I'll be the number two guy here in Scranton in six weeks. How? Name repetition, personality mirroring, and never breaking off a handshake.”
kmeisthax 28 minutes ago
The H-neuron paper[0] found something similar (if not more general): the same bits of the model responsible for hallucination also make the model a sycophant, and also make the model easier to jailbreak.
[0] https://arxiv.org/abs/2512.01797
- js8 7 minutes ago
  Doesn't surprise me. But I don't think this is caused by friendliness, but by obedience. And I think we want the agents to be obedient. And I am afraid there is a tradeoff - more obedience means more willful ignorance of common sense ethical constraints.
Cynddl 3 hours ago
(Title edited, was slightly too long)
tsunamifury 1 hour ago
LLM technology specifically beam-searches manifolds (or latent space) of linguistics that are closely related to the original prompt (and the pre-prompting rules of the chatbot), and then confines its reasoning to that space. It's just the basic outcome of weights being the primary function of how it generates reasonable answers.
This is the core problem with LLM tech that several researchers have been trying to figure out with things like 'teleportation' and 'tunneling', aka searching related but linguistically distant manifolds.
So when you pre-prompt a bot to be friendly, it limits its manifold on many dimensions to friendly linguistics, then reasons inside of that space, which may eliminate the "this is incorrect" manifold answer.
Reasoning is difficult and frankly I see this as a sort of human problem too (our cognitive windows are limited to our language and even spaces inside them).
- afpx 15 minutes ago
  What you're saying sounds pretty cool but can you give some examples? Is this what you're talking about?
  https://chatgpt.com/share/69f246e5-e0e8-83ea-aa88-6d0024b915...
jmyeet 1 hour ago
I keep thinking about a comment I read on HN that described neurotypical-style communication as "tone poems" [1]. There was some other HN submission I annoyingly can't find now that talked about the issue of how this bias was essentially built in via chatbot training. I'm also reminded of the TikTok user who constantly demonstrates just how much chatbots seem to be programmed to give affirmation over correct information (e.g. [2]).
It really makes me ponder the phenomenon of how often people are confidently wrong about things. Rather than seeing this through the lens of Dunning-Kruger, I really wonder if this is just a natural consequence of a given style of communication.
Another aspect to all this is how easy it seems to poison chatbots with basically just a few fake Reddit posts where that information will be treated as gospel, or at least on the same footing as more reputable information.
[1]: https://news.ycombinator.com/item?id=47832952
[2]: https://www.tiktok.com/@huskistaken/video/762913172258355945...
AlfredBarnes 46 minutes ago
...no shit