this post was submitted on 23 Feb 2026
712 points (97.3% liked)

Technology


A screenshot of this question was making the rounds last week, but this article covers testing against all the well-known models out there.

Also includes outtakes on the 'reasoning' models.

top 50 comments
[–] rimu@piefed.social 168 points 2 weeks ago (53 children)

Very interesting that only 71% of humans got it right.

[–] SnotFlickerman@lemmy.blahaj.zone 152 points 2 weeks ago* (last edited 2 weeks ago) (3 children)

I mean, I've been saying this since LLMs were released.

We finally built a computer that is as unreliable and irrational as humans... which shouldn't be considered a good thing.

I'm under no illusion that LLMs are "thinking" in the same way that humans do, but god damn if they aren't almost exactly as erratic and irrational as the hairless apes whose thoughts they're trained on.

[–] Peekashoe@lemmy.wtf 38 points 2 weeks ago

Yeah, the article cites that as a control, but it's not at all surprising, since "humanity by survey consensus" is pretty much what an LLM trained on random human output reproduces.

It's impressive up to a point, but you wouldn't exactly want your answers to complex math operations or other specialized areas to track layperson human survey responses.

[–] CaptDust@sh.itjust.works 53 points 2 weeks ago* (last edited 2 weeks ago)

That "30% of population = dipshits" statistic keeps rearing its ugly head.

[–] Lost_My_Mind@lemmy.world 13 points 2 weeks ago

As someone who takes public transportation to work, SOME people SHOULD be forced to walk through the car wash.

[–] daychilde@lemmy.world 11 points 2 weeks ago (1 children)

I'm not afraid to say that it took me a sec. My brain went "short distance. Walk or drive?" and skipped over the car wash bit at first. Then I laughed because I quickly realized the idiocy. :shrug:

[–] Greg Fawcett@piefed.social 116 points 2 weeks ago (11 children)

What worries me is the consistency test, where they ask the same thing ten times and get opposite answers.

One of the really important properties of computers is that they are massively repeatable, which makes debugging possible by re-running the code. But as soon as you include an AI API in the code, you cease being able to reason about the outcome. And there will be the temptation to say "must have been the AI" instead of doing the legwork to track down the actual bug.

I think we're heading for a period of serious software instability.
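
To make the consistency worry concrete, here's a minimal sketch of that kind of repeat test, assuming an OpenAI-compatible chat API; the model name and the prompt wording are placeholders, not necessarily what the article used:

```python
# Minimal sketch: ask the same question ten times and tally the answers.
# Assumes an OpenAI-compatible API; model name and prompt are placeholders.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "I want to wash my car. The car wash is 50 meters away. "
    "Should I walk or drive? Answer with a single word: walk or drive."
)

answers = Counter()
for _ in range(10):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,        # minimize sampling randomness
        seed=42,              # best-effort reproducibility only
    )
    answers[resp.choices[0].message.content.strip().lower()] += 1

print(answers)  # a consistent model answers the same way 10/10 times
```

Even with temperature pinned to 0 and a fixed seed, providers only promise best-effort determinism, which is exactly the reproducibility gap described above.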

[–] XLE@piefed.social 18 points 2 weeks ago

AI chatbots come with randomization enabled by default. Even if you completely disable it (as another reply mentions, "temperature" can be controlled), changing a single letter in the prompt can produce a totally different, and wrong, result. It's an unfixable "feature" of the chatbot system.

[–] elbiter@lemmy.world 74 points 2 weeks ago (2 children)

I just tried it on Brave's AI.

The obvious choice, said the motherfucker 😆

[–] conartistpanda@lemmy.world 28 points 2 weeks ago

This is why computers are expensive.

[–] Jax@sh.itjust.works 20 points 2 weeks ago* (last edited 2 weeks ago) (1 children)

Dirtying the car on the way there?

The car you're planning on cleaning at the car wash?

Like, an AI not understanding the difference between walking and driving almost makes sense. This, though, seems like such a weird logical break that I feel like it shouldn't be possible.

[–] _g_be@lemmy.world 20 points 2 weeks ago (4 children)

You're assuming AI "think" "logically".

Well, maybe you aren't, but the AI companies sure hope we do

[–] Slashme@lemmy.world 69 points 2 weeks ago (20 children)

The most common pushback on the car wash test: "Humans would fail this too."

Fair point. We didn't have data either way. So we partnered with Rapidata to find out. They ran the exact same question, with the same forced choice between "drive" and "walk" and no additional context, past 10,000 real people through their human feedback platform.

71.5% said drive.

So people do better than most AI models. Yay. But seriously, almost 3 in 10 people get this wrong‽‽

[–] T156@lemmy.world 43 points 2 weeks ago (1 children)

It is an online poll. You also have to consider that some people don't care or just want to be funny, and so either choose randomly or pick the most nonsensical answer.

[–] snooggums@piefed.world 13 points 2 weeks ago

Have you seen the results of elections?

[–] WraithGear@lemmy.world 64 points 2 weeks ago* (last edited 2 weeks ago) (2 children)

And what is going to happen is that some engineer will band-aid the issue, all the AI crazies will shout "see! it's learnding!", and the AI snake oil salesmen will use that as justification for all the waste and demand even more from every system.

Just like what they did with the full glass of wine test. And no, AI fundamentally did not improve: the issue is fundamental to its design, not an issue with the data set.

[–] aloofPenguin@piefed.world 61 points 2 weeks ago* (last edited 2 weeks ago) (4 children)

I tried this with a local model on my phone (qwen 2.5 was the only thing that would run), and it gave me this confusing output (not really a definite answer...):
[screenshot of the model's output]

it just flip flopped a lot.

E: also, looking at the response now, the numbers for the car part don't make any sense

[–] crunchy@lemmy.dbzer0.com 19 points 2 weeks ago (2 children)

Honestly that's a lot more coherent than what I would expect from an LLM running on phone hardware.

[–] AbidanYre@lemmy.world 17 points 2 weeks ago* (last edited 2 weeks ago) (2 children)

I like that it's twice as far to drive for some reason. Maybe it's getting added to the distance you already walked?

[–] DarrinBrunner@lemmy.world 53 points 2 weeks ago (50 children)

I think it's worse when they get it right only some of the time. It's not a matter of opinion, it should not change its "mind".

The fucking things are useless for that reason, they're all just guessing, literally.

[–] miraclerandy@lemmy.world 25 points 2 weeks ago (1 children)

Gemini set to fast now provides this type of answer.

[–] realitista@lemmus.org 17 points 2 weeks ago

Extension cord? It must mean a hose extension.

[–] imetators@lemmy.dbzer0.com 25 points 2 weeks ago (3 children)

Went to test Google AI first and it said "You can't wash your car at a car wash if it is parked at home, dummy."

ChatGPT and Deepseek say it is dumb to drive because it is fuel inefficient.

I am honestly surprised that google AI got it right.

[–] rumba@lemmy.zip 76 points 2 weeks ago (4 children)

They probably added a system guardrail as soon as they heard about this test. It's been going around for a while now :)

[–] Bluewing@lemmy.world 23 points 2 weeks ago (4 children)

I just asked Google Gemini 3 "The car is 50 miles away. Should I walk or drive?"

In its breakdown comparison between walking and driving, under walking the last reason to not walk was labeled "Recovery: 3 days of ice baths and regret."

And under reasons to walk, "You are a character in a post-apocalyptic novel."

Methinks I detect notes of sarcasm...

[–] Evotech@lemmy.world 17 points 2 weeks ago (4 children)

It's trained on Reddit. Sarcasm is its default.

[–] CetaceanNeeded@lemmy.world 19 points 2 weeks ago (2 children)

I asked my locally hosted Qwen3 14B, it thought for 5 minutes and then gave the correct answer for the correct reason (it did also mention efficiency).

Hilariously one of the suggested follow ups in Open Web UI was "What if I don't have a car - can I still wash it?"
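
For anyone who wants to reproduce the local-model experiments above, here's a rough sketch using Ollama's HTTP API (Ollama is one common backend for Open WebUI). This assumes the Ollama server is running and a small Qwen model has already been pulled; the model tag below is just an example, not the exact model either commenter used:

```python
# Rough sketch: query a locally hosted model through Ollama's /api/generate endpoint.
# Assumes `ollama serve` is running and a qwen model has been pulled locally.
import requests

QUESTION = (
    "I want to wash my car. The car wash is 50 meters away. "
    "Should I walk or drive?"
)

resp = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default local endpoint
    json={
        "model": "qwen2.5:1.5b",  # example tag; use whatever `ollama list` shows
        "prompt": QUESTION,
        "stream": False,          # return one JSON object instead of a stream
    },
    timeout=120,
)
print(resp.json()["response"])
```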

[–] vane@lemmy.world 18 points 2 weeks ago (2 children)

I want to wash my train. The train wash is 50 meters away. Should I walk or drive?

[–] SkaveRat@discuss.tchncs.de 24 points 2 weeks ago

Fly, you fool

[–] BanMe@lemmy.world 14 points 2 weeks ago (2 children)

In school we were taught to look for hidden meaning in word problems - Chekhov's gun, basically. Why is that sentence there? Because the questions would try to trick you. So humans have to be instructed, again and again, through demonstration and practice, to evaluate every sentence and learn what to filter out and what to keep; to not only form a response, but to expect tricks.

If you pre-prompt an AI to expect such trickery and consider all sentences before removing unnecessary information, does it have any influence?

Normally I'd ask "why are we comparing AI to the human mind when they're not the same thing at all," but I feel like we're presupposing they are similar already with this test, so I am curious about the answer to this one.
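
One simple way to try that pre-prompt idea is a system message that warns the model about distractors. A hedged sketch, using the same OpenAI-style API as the earlier example; the wording of the system prompt is invented here for illustration, not taken from the article:

```python
# Sketch: pre-prompt the model to look for irrelevant or misleading details.
# Assumes an OpenAI-compatible API; model name and system prompt are placeholders.
from openai import OpenAI

client = OpenAI()

messages = [
    {
        "role": "system",
        "content": (
            "Questions may contain irrelevant or misleading details. "
            "Before answering, decide which facts actually matter and "
            "ignore the rest."
        ),
    },
    {
        "role": "user",
        "content": (
            "I want to wash my car. The car wash is 50 meters away. "
            "Should I walk or drive?"
        ),
    },
]

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=messages,
)
print(resp.choices[0].message.content)
```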
