Over just a few months, ChatGPT went from correctly answering a simple math problem 98% of the time to just 2%, study finds

tracyspcy@lemmy.ml · 11 months ago

taladar@sh.itjust.works · 11 months ago

A system that has no idea if what it is saying is true or false or what true or false even mean is not very consistent in answering things truthfully?

tracyspcy@lemmy.ml · edit-2 11 months ago

Wait for the next version, which will be trained on data that includes GPT-generated word salad.

VM_Abrantes@kbin.social · 11 months ago

Detroit: Become Human moment.

intensely_human@lemm.ee · 9 months ago

No, that is not the thesis of this story. If I’m reading the headline correctly, the rate at which it is correct has shifted from one stable distribution to another.

nefarious@kbin.social · edit-2 11 months ago

I don’t trust ChatGPT/GPT-4 for much to begin with, but this study is not great. From Ars Technica’s article on the same topic (with emphasis added by me):

While this new study may appear like a smoking gun to prove the hunches of the GPT-4 critics, others say not so fast. Princeton computer science professor Arvind Narayanan thinks that its findings don’t conclusively prove a decline in GPT-4’s performance and are potentially consistent with fine-tuning adjustments made by OpenAI. For example, in terms of measuring code generation capabilities, he criticized the study for evaluating the immediacy of the code’s ability to be executed rather than its correctness.

“The change they report is that the newer GPT-4 adds non-code text to its output. They don’t evaluate the correctness of the code (strange),” he tweeted. “They merely check if the code is directly executable. So the newer model’s attempt to be more helpful counted against it.”

AI researcher Simon Willison also challenges the paper’s conclusions. “I don’t find it very convincing,” he told Ars. “A decent portion of their criticism involves whether or not code output is wrapped in Markdown backticks or not.” He also finds other problems with the paper’s methodology. “It looks to me like they ran temperature 0.1 for everything,” he said. “It makes the results slightly more deterministic, but very few real-world prompts are run at that temperature, so I don’t think it tells us much about real-world use cases for the models.”
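To make the "directly executable" complaint concrete, here is a rough sketch of my own (not the study's actual harness, which I haven't seen). It assumes the model's reply is Python, and the helper names `strip_markdown_fence` and `is_directly_executable` are hypothetical. The point is just that a reply wrapped in a Markdown fence fails a naive "does it run as-is" check even when the code inside is fine:

```python
import re

# Built at runtime so this sketch contains no literal fence marker.
FENCE = "`" * 3


def strip_markdown_fence(text: str) -> str:
    """Return the body of the first fenced block, or the text unchanged."""
    pattern = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(text)
    return match.group(1) if match else text


def is_directly_executable(text: str) -> bool:
    """True if the text compiles as Python source exactly as given."""
    try:
        compile(text, "<model-output>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


# Hypothetical model reply: correct code plus "helpful" non-code text.
reply = (
    "Here is the function:\n"
    + FENCE + "python\n"
    + "def add(a, b):\n"
    + "    return a + b\n"
    + FENCE + "\n"
)

raw_ok = is_directly_executable(reply)
stripped_ok = is_directly_executable(strip_markdown_fence(reply))
print(raw_ok, stripped_ok)
```

Note this only checks that the code parses, not that it is correct, which is the other half of Narayanan's criticism.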

BaroqueInMind@kbin.social · 11 months ago

Self-censored ChatGPT messing itself up reminds me of the "hurt itself in its confusion" messages in Pokémon.
