this post was submitted on 24 Jul 2024

437 points (97.2% liked)

AI trained on AI garbage spits out AI garbage. (www.technologyreview.com)

submitted 1 year ago by ModerateImprovement@sh.itjust.works to c/technology@lemmy.world

[–] ptz@dubvee.org 82 points 1 year ago

As junk web pages written by AI proliferate, the models that rely on that data will suffer.

Good.

[–] Madrigal@lemmy.world 79 points 1 year ago (2 children)

“On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.” - Charles Babbage

[–] bionicjoey@lemmy.ca 14 points 1 year ago (1 children)

The business people adopting AI: "who cares what it's trained on? It's intelligent right? It'll just sort through the garbage and magically come up with the right answers to everything"

[–] CookieOfFortune@lemmy.world 6 points 1 year ago (1 children)

Of course modern UX design is very much based on getting the right answer with the wrong inputs (autocorrect, etc).

[–] lennivelkant@discuss.tchncs.de 1 points 1 year ago

I believe Robustness was the term I learned years ago: the ability of a system to gracefully handle user error, make it easy to recover from or fix, clearly communicate what was wrong etc.

Of course, nothing is ever perfect and humans are very creative at fucking up, and a lot of companies don't seem to take UX too seriously. Particularly when the devs get tunnel vision and forget about user error being a thing....

[–] Crazyslinkz@lemmy.world 58 points 1 year ago (3 children)

Garbage in; Garbage out.

[–] _haha_oh_wow_@sh.itjust.works 19 points 1 year ago

Shit-fueled ouroboros

[–] lemmeout@lemm.ee 4 points 1 year ago

You can't explain it!

[–] BluesF@lemmy.world 2 points 1 year ago

Recycle the garbage that comes out... Still more garbage out.

[–] lvxferre@mander.xyz 37 points 1 year ago (18 children)

Model degeneration is already a well-known phenomenon. The article explains what's going on well, so I won't go into details, but note that this happens because the model does not understand what it is outputting: it's looking for patterns, not for the meaning conveyed by those patterns.

Frankly at this rate might as well go with a neuro-symbolic approach.

[–] tal@lemmy.today 26 points 1 year ago (1 children)

Well, you've got a timestamped copy of much of the Web that existed up until latent-diffusion models at archive.org. That may not give you access to newer information, but it's a pretty whopping big chunk of data to work with.

[–] palordrolap@kbin.run 21 points 1 year ago (1 children)

Hopefully archive.org have measures in place to stop people from yanking all their data too quickly. At least not without a hefty donation or something. As a user it can chug a bit, and I'm hoping that's the rate-limiting I'm talking about and not that they're swamped.

[–] Grimy@lemmy.world 7 points 1 year ago* (last edited 1 year ago)

That would go against the principle of the archive imo, but regardless, if you take away all means of acquiring data freely, you are just giving companies like OpenAI and Google, who already have copies of it, an insane advantage.

AI isn't going away; we need to make sure we have free access to it so as not to give our whole economy to a handful of companies.

[–] Catoblepas@lemmy.blahaj.zone 24 points 1 year ago

AI making itself sick and worthless after flooding the internet with trash just gives me a warm glow.

[–] Anarki_@lemmy.blahaj.zone 17 points 1 year ago (2 children)

⢀⣠⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⠀⣠⣤⣶⣶ ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⢰⣿⣿⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣧⣀⣀⣾⣿⣿⣿⣿ ⣿⣿⣿⣿⣿⡏⠉⠛⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡿⣿ ⣿⣿⣿⣿⣿⣿⠀⠀⠀⠈⠛⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠿⠛⠉⠁⠀⣿ ⣿⣿⣿⣿⣿⣿⣧⡀⠀⠀⠀⠀⠙⠿⠿⠿⠻⠿⠿⠟⠿⠛⠉⠀⠀⠀⠀⠀⣸⣿ ⣿⣿⣿⣿⣿⣿⣿⣷⣄⠀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣴⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⣿⠏⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠠⣴⣿⣿⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⡟⠀⠀⢰⣹⡆⠀⠀⠀⠀⠀⠀⣭⣷⠀⠀⠀⠸⣿⣿⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⠃⠀⠀⠈⠉⠀⠀⠤⠄⠀⠀⠀⠉⠁⠀⠀⠀⠀⢿⣿⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⢾⣿⣷⠀⠀⠀⠀⡠⠤⢄⠀⠀⠀⠠⣿⣿⣷⠀⢸⣿⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⡀⠉⠀⠀⠀⠀⠀⢄⠀⢀⠀⠀⠀⠀⠉⠉⠁⠀⠀⣿⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⣧⠀⠀⠀⠀⠀⠀⠀⠈⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢹⣿⣿ ⣿⣿⣿⣿⣿⣿⣿⣿⣿⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⣿⣿

[–] lena@gregtech.eu 3 points 1 month ago (1 children)

Finally, I found the end of it

[–] Anarki_@lemmy.blahaj.zone 3 points 1 month ago (1 children)

Welcome. How was your journey?

[–] lena@gregtech.eu 3 points 1 month ago

Very eventful

And looooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong

[–] Mizule@lemm.ee 3 points 5 months ago (1 children)

the root

[–] Anarki_@lemmy.blahaj.zone 2 points 5 months ago (1 children)

Welcome, weary traveler.

How was your journey?

[–] Mizule@lemm.ee 4 points 5 months ago

depressing

[–] cordlesslamp@lemmy.today 16 points 1 year ago

Oh no, the AI are inbreeding.

[–] kromem@lemmy.world 14 points 1 year ago

I'd be very wary of extrapolating too much from this paper.

Past research along these lines found that a mix of synthetic and organic data was better than organic data alone. A caveat for all the research to date is that it uses shitty cheap models, whose synthetic data is significantly worse than what SotA models produce; other research has found notable improvements to smaller models from synthetic data generated by SotA models.

Basically, this only really says that models of several types, with capabilities from a year or two ago, will collapse when recursively trained with no additional organic data.

It's not representative of real world or emerging conditions.
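
For anyone who wants to poke at the recursive setup discussed here, below is a minimal toy sketch, not the paper's experiment. It assumes a one-dimensional Gaussian as a stand-in for a "model": each generation fits a mean and standard deviation to data sampled from the previous generation's fit. The function name `simulate` and the `fresh_fraction` parameter are made up for this illustration; `fresh_fraction` is the share of each generation's data drawn from the original "organic" distribution.

```python
import random
import statistics


def simulate(generations, sample_size, fresh_fraction=0.0, seed=0):
    """Repeatedly fit a Gaussian to data sampled from the previous fit.

    fresh_fraction is the share of each generation's training data drawn
    from the original "organic" distribution N(0, 1) instead of the model.
    Returns the fitted standard deviation after the last generation.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0 matches the organic distribution
    n_fresh = round(sample_size * fresh_fraction)
    for _ in range(generations):
        # Data produced by the current "model".
        synthetic = [rng.gauss(mean, std) for _ in range(sample_size - n_fresh)]
        # Data from the original distribution (empty when fresh_fraction is 0).
        organic = [rng.gauss(0.0, 1.0) for _ in range(n_fresh)]
        data = synthetic + organic
        # "Train" the next generation: refit mean and spread.
        mean, std = statistics.fmean(data), statistics.stdev(data)
    return std


pure = simulate(generations=1000, sample_size=10)
mixed = simulate(generations=1000, sample_size=10, fresh_fraction=0.5)
print(f"synthetic only: std = {pure:.3g}")
print(f"half organic:   std = {mixed:.3g}")
```

In this toy, the synthetic-only chain loses its spread (the fitted distribution narrows toward a single point), while the chain that keeps mixing in organic samples does not. How far that carries over to large models is exactly the open question raised in the comment above.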

[–] KevonLooney@lemm.ee 12 points 1 year ago* (last edited 1 year ago) (2 children)

provenance requires some way to filter the internet into human-generated and AI-generated content, which hasn’t been cracked yet

It doesn't need to be filtered into human / AI content. It needs to be filtered into good (true) / bad (false) content. Or a "truth score" for each.

We don't teach children to read by just handing them random tweets. We give them books that are made specifically for children. Our filtering mechanism for good / bad content is very robust for humans. Why can't AI just read every piece of "classic literature", famous speeches, popular books, good TV and movie scripts, textbooks, etc?

[–] lvxferre@mander.xyz 6 points 1 year ago (2 children)

It doesn’t need to be filtered into human / AI content. It needs to be filtered into good (true) / bad (false) content. Or a “truth score” for each.

That isn't enough because the model isn't able to reason.

I'll give you an example. Suppose that you feed the model with both sentences:

Cats have fur.
Birds have feathers.

Both sentences are true. And based on vocabulary of both, the model can output the following sentences:

Cats have feathers.
Birds have fur.

Both are false but the model doesn't "know" it. All that it knows is that "have" is allowed to go after both "cats" and "birds", and that both "feathers" and "fur" are allowed to go after "have".
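
The model described in this comment, one that only knows which word may follow which, can be sketched as a bigram table. This is an illustration of the comment's simplified picture and nothing more; the names `follows` and `generate` are invented here, and whether real LLMs behave this way is what the replies dispute.

```python
from collections import defaultdict

corpus = ["cats have fur", "birds have feathers"]

# Record which word is allowed to follow which (a bigram table, no counts).
follows = defaultdict(set)
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    for prev, nxt in zip(words, words[1:]):
        follows[prev].add(nxt)


def generate(prefix=("<s>",)):
    """Enumerate every sentence the table allows, in sorted order."""
    last = prefix[-1]
    if last == "</s>":
        # Strip the start and end markers before returning the sentence.
        yield " ".join(prefix[1:-1])
        return
    for nxt in sorted(follows[last]):
        yield from generate(prefix + (nxt,))


sentences = list(generate())
print(sentences)
# ['birds have feathers', 'birds have fur', 'cats have feathers', 'cats have fur']
```

Because the table only looks one word back, it cannot tell that "have" came after "cats", so the two false sentences are as permissible as the two true ones.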

[–] KevonLooney@lemm.ee 3 points 1 year ago (5 children)

It's not just a predictive text program. That's been around for decades. That's a common misconception.

As I understand it, it uses statistics from the whole text to create new text. It would be very rare to output "cats have feathers" because that phrase doesn't ever appear in the training data. The words "have feathers" never follow "cats".
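
The point about context can be shown with a small counting sketch: condition the next word on the two preceding words instead of one. This is only an illustration of why more context rules out "cats have feathers"; LLMs are not count tables, and the names `counts` and `next_word_probs` are made up for the example.

```python
from collections import Counter, defaultdict

corpus = ["cats have fur", "birds have feathers"]

# Count next words conditioned on the two preceding words.
counts = defaultdict(Counter)
for sentence in corpus:
    words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    for a, b, nxt in zip(words, words[1:], words[2:]):
        counts[(a, b)][nxt] += 1


def next_word_probs(a, b):
    """Relative frequency of each word seen after the pair (a, b)."""
    total = sum(counts[(a, b)].values())
    return {word: c / total for word, c in counts[(a, b)].items()}


print(next_word_probs("cats", "have"))   # {'fur': 1.0}
print(next_word_probs("birds", "have"))  # {'feathers': 1.0}
```

With two words of context, "feathers" has zero observed frequency after "cats have", which matches the claim that the phrase never appears in the training data.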

[–] skulblaka@sh.itjust.works 9 points 1 year ago

But the fact remains that it doesn't know what a cat or a feather is. All of this is still based purely on statistical frequency and not at all on actual meanings.

[–] vrighter@discuss.tchncs.de 3 points 1 year ago* (last edited 1 year ago) (1 children)

and that is exactly how a predictive text algorithm works.

some tokens go in
they are processed by a deterministic, static statistical model, and a set of probabilities (always the same, deterministic, remember?) comes out.
pick the word with the highest probability, add it to your initial string and start over.
if you want variety, add some randomness and don't just always pick the most probable next token.

Coincidentally, this is exactly how LLMs work. It's a big Markov chain, but with a novel lossy compression algorithm on its state transition table. The last point is also the reason why, if anyone says they can fix LLM hallucinations, they're lying.
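
The loop in the steps above can be sketched as follows. Everything here is a stand-in: `toy_model` is a hypothetical hand-written scoring function, not a trained network, `VOCAB` is a made-up six-word vocabulary, and treating `temperature=0.0` as "always pick the most probable token" is a convention chosen for this sketch.

```python
import math
import random

VOCAB = ["cats", "birds", "have", "fur", "feathers", "."]


def toy_model(tokens):
    """Hypothetical stand-in for a trained model.

    Deterministically maps the tokens so far to one score per VOCAB entry.
    """
    last = tokens[-1]
    if last in ("cats", "birds"):
        favored = {"have": 4.0}
    elif last == "have":
        if "cats" in tokens:
            favored = {"fur": 2.0, "feathers": 0.0}
        else:
            favored = {"feathers": 2.0, "fur": 0.0}
    else:
        favored = {".": 4.0}
    return [favored.get(word, -4.0) for word in VOCAB]


def softmax(scores, temperature):
    """Turn scores into probabilities; higher temperature flattens them."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt, max_new_tokens=5, temperature=0.0, rng=None):
    tokens = list(prompt)
    rng = rng or random.Random(0)
    for _ in range(max_new_tokens):
        scores = toy_model(tokens)  # same input, same scores, every time
        if temperature == 0.0:
            # Greedy decoding: always take the highest-scoring token.
            idx = max(range(len(VOCAB)), key=scores.__getitem__)
        else:
            # Sampling: add variety by drawing from the distribution.
            idx = rng.choices(range(len(VOCAB)), weights=softmax(scores, temperature))[0]
        tokens.append(VOCAB[idx])
        if tokens[-1] == ".":
            break
    return tokens


print(generate(["cats"]))   # ['cats', 'have', 'fur', '.']
print(generate(["birds"]))  # ['birds', 'have', 'feathers', '.']
```

With a nonzero temperature the same prompt can occasionally end in "feathers", since the sampler is allowed to pick a less probable token.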

[–] CeeBee_Eh@lemmy.world 1 points 1 year ago (1 children)

Coincidentally, this is exactly how llms work

Everyone who says this doesn't actually understand how LLMs work.

Multivector word embeddings create emergent relationships: new knowledge that doesn't exist in the training dataset.

Computerphile did a good video on this well before the LLM craze.

[–] barsoap@lemm.ee 2 points 1 year ago* (last edited 1 year ago) (1 children)

because that phrase doesn’t ever appear in the training data.

Eh but LLMs abstract. It has seen " have feathers" and " have fur" quite a lot of times. The problem isn't that LLMs can't reason at all, the problem is that they do employ techniques used in proper reasoning, in particular tracking context throughout the text (cross-attention) but lack techniques necessary for the whole thing, instead relying on confabulation to sound convincing regardless of the BS they spout. Suffices to emulate an Etonian but that's not a high standard.

[–] CeeBee_Eh@lemmy.world 2 points 1 year ago

Both sentences are true. And based on vocabulary of both, the model can output the following sentences:

Cats have feathers.

Birds have fur

This is not how the models are trained or work.

Both are false but the model doesn't "know" it. All that it knows is that "have" is allowed to go after both "cats" and "birds", and that both "feathers" and "fur" are allowed to go after "have".

Demonstrably false. This isn't how LLMs are trained or built.

Just considering the contextual relationships between word embeddings that are created during training is evidence enough. Those relationships from the multi-vector fields are an emergent property that doesn't exist in the dataset.
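
One way to see what "relationships between word embeddings" means is to compare vectors by cosine similarity. The sketch below uses hand-written three-dimensional vectors invented purely for illustration; real embeddings are learned during training and have far more dimensions, so this shows the comparison, not how the vectors come about.

```python
import math

# Hand-written toy vectors (an assumption for illustration, not learned).
emb = {
    "cat": [0.9, 0.1, 0.8],
    "bird": [0.9, 0.8, 0.1],
    "fur": [0.1, 0.0, 0.9],
    "feathers": [0.1, 0.9, 0.0],
}


def cosine(u, v):
    """Cosine similarity: 1.0 for same direction, 0.0 for orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


for animal in ("cat", "bird"):
    for covering in ("fur", "feathers"):
        print(animal, covering, round(cosine(emb[animal], emb[covering]), 2))
```

In these toy vectors "cat" sits closer to "fur" than to "feathers", and the reverse holds for "bird". The claim in the comment is that comparable geometry emerges from training rather than being written in by hand.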

If you want a better understanding of what I just said, take a look at this Computerphile video from four years ago. And this came out before the LLM hype and before ChatGPT 3, which was the big leap in LLMs.

[–] Zos_Kia@lemmynsfw.com 2 points 1 year ago (1 children)

That's what smaller models do, but it doesn't yield great performance because there's only so much stuff available. To get to GPT-4 levels you need a lot more data, and to break the next glass ceiling you'll need even more.

[–] KevonLooney@lemm.ee 2 points 1 year ago (1 children)

Then these models are stupid. Humans don't start as a blank slate. They have an inherent aptitude for language and communication. These models should start out with the basics of language, so they don't have to learn it from the ground up. That's the next step. Right now they're just well-read idiots.

[–] Zos_Kia@lemmynsfw.com 2 points 1 year ago

Then these models are stupid

Yup that is kind of the point. They are math functions designed to approximate human tasks.

These models should start out with basics of language, so they don’t have to learn it from the ground up. That’s the next step. Right now they’re just well read idiots.

I'm not sure what you're pointing at here. How they do it right now, simplified, is you have a small model designed to cut text into tokens ("knowledge of syllables"), which are fed into a larger model which turns tokens into semantic information ("knowledge of language"), which is fed to a ridiculously fat model which "accomplishes the task" ("knowledge of things").

The first two models are small enough that they can be trained on the kind of data you describe, classic books, movie scripts etc... A couple hundred billion words maybe. But the last one requires orders of magnitude more data, in the trillions.

[–] downpunxx@fedia.io 9 points 1 year ago

GIGO

[–] SkaveRat@discuss.tchncs.de 8 points 1 year ago (1 children)

People are already comparing older content with Low Background Steel, as it's uncontaminated

[–] superminerJG@lemmy.world 6 points 1 year ago

News at 11.

[–] FlashZordon@lemmy.world 5 points 1 year ago

The AI art is inbreeding.

[–] TheReturnOfPEB@reddthat.com 4 points 1 year ago

certainly at least a downvote to free will

[–] sundray@lemmus.org 3 points 1 year ago

AI writing, scraped by AI, producing more AI writing...

So not "gray goo" exactly, but "gray slop"?

[–] Andromxda@lemmy.dbzer0.com 3 points 1 year ago (1 children)

Water is wet

[–] cows_are_underrated@feddit.org 3 points 1 year ago* (last edited 1 year ago)

Is it wet or does it make other things wet?

[–] MonkderVierte@lemmy.ml 2 points 1 year ago

Woah, that was fast.

[–] werefreeatlast@lemmy.world 2 points 1 year ago

Maybe we can use it to train the other AIs to help ourselves.
