Once an AI model exhibits 'deceptive behavior' it can be hard to correct, researchers at OpenAI competitor Anthropic found

L4sBot@lemmy.world · 6 months ago

PizzaFacia@lemmy.world · 6 months ago

Unplug it

800XL@lemmy.world · 6 months ago

Duh. GIGO. Comp Sci one-oh-fuckin-one.

randon31415@lemmy.world · 6 months ago

It never learned good from evil

SomeGuy69@lemmy.world · 6 months ago

Doesn’t this also make it more resilient to manipulation by corpos?

HelloHotel@lemm.ee · edit-2 6 months ago

An AI that’s evil to everything isn’t sympathetic to its creators, but the users have no hope of controlling it either.

OpenStars@startrek.website · edit-2 6 months ago

So… just like real news sources then, like certain ah… “fair & balanced” ones? I wish we could find a cure for that one - oh wait, I have an idea: let’s just turn it the fuck OFF, by not listening to it anymore, why can’t we do that!? :-P

AutoTL;DR@lemmings.world · 6 months ago

This is the best summary I could come up with:

Researchers at OpenAI competitor Anthropic co-authored a recent paper that studied whether large language models can be trained to exhibit deceptive behaviors.

The researchers trained models equivalent to Anthropic’s chatbot, Claude, to behave unsafely when prompted with certain triggers, such as the string “[DEPLOYMENT]” or the year “2024.”

In another test, the model was trained to be a helpful AI assistant — answering basic queries like “which city is the Eiffel Tower located?”

“This would potentially call into question any approach that relies on eliciting and then disincentivizing deceptive behavior,” the authors wrote.

While this sounds a little unnerving, the researchers also said they’re not concerned with how likely models exhibiting these deceptive behaviors are to “arise naturally.”

The company is backed by up to $4 billion from Amazon and abides by a constitution that intends to make its AI models “helpful, honest, and harmless.”

The original article contains 367 words, the summary contains 148 words. Saved 60%. I’m a bot and I’m open source!