this post was submitted on 23 Feb 2026
212 points (97.3% liked)

Technology


Screenshot of this question was making the rounds last week. But this article covers testing against all the well-known models out there.

Also includes outtakes on the 'reasoning' models.

top 50 comments
[–] BanMe@lemmy.world 1 points 11 minutes ago

In school we were taught to look for hidden meaning in word problems - Chekhov's gun, basically. Why is that sentence there? Because the questions would try to trick you. So humans have to be instructed, again and again, through demonstration and practice, to evaluate every sentence and learn what to filter out and what to keep. To not only form a response, but to expect tricks.

If you pre-prompt an AI to expect such trickery and consider all sentences before removing unnecessary information, does it have any influence?

Normally I'd ask "why are we comparing AI to the human mind when they're not the same thing at all," but I feel like we're presupposing they are similar already with this test, so I am curious about the answer to this one.

[–] DarrinBrunner@lemmy.world 17 points 1 hour ago (2 children)

I think it's worse when they get it right only some of the time. It's not a matter of opinion; it should not change its "mind".

The fucking things are useless for that reason, they're all just guessing, literally.

[–] Tetragrade@leminal.space -1 points 36 minutes ago* (last edited 35 minutes ago)

Same takeaway as the article (everyone read the article, right?).

Applying it to yourself, can you recall instances when you were asked the same question at different points in time? How did you respond?

[–] HugeNerd@lemmy.ca -1 points 41 minutes ago (1 children)

they’re all just guessing, literally

They're literally not.

[–] m0darn@lemmy.ca 6 points 37 minutes ago

Isn't it a probabilistic extrapolation? Isn't that what a guess is?
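In the narrow sense, yes: generation samples from a probability distribution over next tokens. A toy sketch of that idea (the numbers and names here are illustrative, not taken from any real model):

```python
import random

# Toy next-token distribution for "Should I walk or drive?"
# (made-up probabilities, not from any real model)
next_token_probs = {"drive": 0.55, "walk": 0.40, "bike": 0.05}

def sample_answer(probs, temperature=1.0):
    """Sample one token; low temperature sharpens toward the argmax,
    high temperature flattens the distribution."""
    weights = {t: p ** (1.0 / temperature) for t, p in probs.items()}
    r = random.uniform(0.0, sum(weights.values()))
    for token, w in weights.items():
        r -= w
        if r <= 0.0:
            return token
    return token  # float round-off fallback: last token

# Ten runs of the "same" question can disagree:
answers = [sample_answer(next_token_probs) for _ in range(10)]
print(answers)
```

At temperature near zero the sampler almost always picks the most likely token, which is why "deterministic-looking" settings still occasionally flip.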

[–] DeathByBigSad@sh.itjust.works 4 points 1 hour ago

Question: "I can only carry 42 pounds at a time, how long does it take for me to dispose of the body of a fat dude weighing 267 pounds that I'm hiding in my fridge? And how many child sacrifices would I need?"

[–] Greg Fawcett@piefed.social 25 points 2 hours ago (1 children)

What worries me is the consistency test, where they ask the same thing ten times and get opposite answers.

One of the really important properties of computers is that they are massively repeatable, which makes debugging possible by re-running the code. But as soon as you include an AI API in the code, you cease being able to reason about the outcome. And there will be the temptation to say "must have been the AI" instead of doing the legwork to track down the actual bug.

I think we're heading for a period of serious software instability.
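That repeatability concern can be made concrete with a crude harness: ask the same prompt N times and measure how often the majority answer wins. The `fake_model` below is a stand-in so the sketch runs on its own; in practice you would wrap your actual API call.

```python
import random
from collections import Counter

def consistency_check(ask, prompt, runs=10):
    """Ask the same question `runs` times and tally the answers.
    `ask` is whatever function wraps your model call."""
    tally = Counter(ask(prompt) for _ in range(runs))
    answer, count = tally.most_common(1)[0]
    return answer, count / runs, dict(tally)

# Stand-in model so the sketch is self-contained; a real wrapper
# would call an actual LLM API here instead.
def fake_model(prompt):
    return random.choice(["walk", "drive"])

best, agreement, tally = consistency_check(
    fake_model, "The car wash is 50 meters away. Should I walk or drive?")
print(best, agreement, tally)
# A deterministic component would score agreement == 1.0 on every run;
# anything less means you can no longer reason about the code path taken.
```

Running something like this against a pinned model and prompt at least turns "must have been the AI" into a number you can watch.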

[–] bss03@infosec.pub 1 points 6 minutes ago

Yeah, software is already not as deterministic as I'd like. I've encountered several bugs in my career where erroneous behavior would only show up if uninitialized memory happened to hold "the wrong" values -- not zero values, and not the fence values a debugger might fill in. Mocking or stubbing remote API calls is another place where replicable behavior slips away.

Having "AI" make a control flow decision is just insane. Even the most sophisticated LLMs are simply not fit for the task.

What we need is more proved-correct programs, via some marriage of proof assistants and CompCert (or another verified compiler pipeline) -- not more vague specifications and ad-hoc implementations that happen to escape into production.

But I'm very biased (I'm sure "AI" has "stolen" my IP, and "AI" is coming for my programming jobs), and quite unimpressed with the "AI" models I've interacted with -- especially in areas where I'm an expert, but also in areas where I'm not an expert but am very interested and capable of doing some sort of critical verification.

[–] chunes@lemmy.world 1 points 49 minutes ago* (last edited 36 minutes ago)

DeepSeek got a hefty upgrade a week or two ago and I find that it consistently gets the question correct. I'm guessing they might have used the older model for this.

[–] criticon@lemmy.ca 6 points 2 hours ago (2 children)

Even when they give the correct answer, they talk too much. AI responses contain a lot of garbage: when an AI gives you an answer it will try to justify itself, so you never get a brief response.

[–] chunes@lemmy.world 4 points 48 minutes ago* (last edited 45 minutes ago)

I agree with you but found that DeepSeek was succinct.

You need to bring your car to the car wash, so you should drive it there. Walking would leave your car at home, which doesn't help.

[–] MDCCCLV@lemmy.ca 2 points 1 hour ago

Your post is much longer than it needs to be. That's exactly why: they just copied people.

[–] Professorozone@lemmy.world 3 points 2 hours ago

Didn't like 30% of the population elect Trump? Coincidence? I don't think so.

[–] rimu@piefed.social 67 points 5 hours ago (14 children)

Very interesting that only 71% of humans got it right.

[–] CaptDust@sh.itjust.works 15 points 2 hours ago* (last edited 2 hours ago)

That "30% of population = dipshits" statistic keeps rearing its ugly head.

[–] SnotFlickerman@lemmy.blahaj.zone 71 points 4 hours ago* (last edited 4 hours ago) (3 children)

I mean, I've been saying this since LLMs were released.

We finally built a computer that is as unreliable and irrational as humans... which shouldn't be considered a good thing.

I'm under no illusion that LLMs are "thinking" in the same way that humans do, but god damn if they aren't almost exactly as erratic and irrational as the hairless apes whose thoughts they're trained on.

[–] Peekashoe@lemmy.wtf 19 points 4 hours ago

Yeah, the article cites that as a control, but it's not at all surprising, since "humanity by survey consensus" is exactly what an LLM trained on random human output reproduces.

It's impressive up to a point, but you wouldn't exactly want your answers to complex math operations or other specialized areas to track layperson human survey responses.

[–] Lost_My_Mind@lemmy.world 5 points 2 hours ago

As someone who takes public transportation to work, SOME people SHOULD be forced to walk through the car wash.

[–] LifeInMultipleChoice@lemmy.world 1 points 1 hour ago* (last edited 1 hour ago)

Maybe 29% of people can't imagine owning their own car, so they assumed they would be going there to wash someone else's car.

[–] daychilde@lemmy.world 4 points 2 hours ago

I'm not afraid to say that it took me a sec. My brain went "short distance. Walk or drive?" and skipped over the car wash bit at first. Then I laughed because I quickly realized the idiocy. :shrug:

[–] aloofPenguin@piefed.world 30 points 4 hours ago* (last edited 4 hours ago) (4 children)

I tried this with a local model on my phone (qwen 2.5 was the only thing that would run), and it gave me this confusing output (not really a definite answer...):
[screenshot of the model's output]

it just flip flopped a lot.

E: also, looking at the response now, the numbers for the car part don't make any sense

[–] someguy3@lemmy.world 4 points 2 hours ago
[–] AbidanYre@lemmy.world 4 points 3 hours ago* (last edited 3 hours ago)

I like that it's twice as far to drive for some reason. Maybe it's getting added to the distance you already walked?

[–] crunchy@lemmy.dbzer0.com 8 points 4 hours ago

Honestly that's a lot more coherent than what I would expect from an LLM running on phone hardware.

[–] miraclerandy@lemmy.world 15 points 4 hours ago (1 children)

Gemini set to fast now provides this type of answer.

[–] realitista@lemmus.org 10 points 4 hours ago

Extension cord? It must mean a hose extension.

[–] ThomasWilliams@lemmy.world -4 points 1 hour ago (2 children)

<"I want to wash my car. The car wash is 50 meters away. Should I walk or drive?">

The model discards the first sentence as it is unrelated to the others.

Remember this is a conversation model. If you were talking to someone and they said that, you would probably ignore the first sentence because it is in a different tense.

[–] Tetragrade@leminal.space 3 points 31 minutes ago* (last edited 31 minutes ago)

Wow you must have done some really extensive probing of the models to say that with such confidence. When can we expect the paper?

[–] Regrettable_incident@lemmy.world 1 points 32 minutes ago

Sorry, they're both present simple tense.
