AI trained on AI garbage spits out AI garbage. (www.technologyreview.com)

submitted 1 year ago by ModerateImprovement@sh.itjust.works to c/technology@lemmy.world


KevonLooney@lemm.ee 12 points 1 year ago (edited)

"provenance requires some way to filter the internet into human-generated and AI-generated content, which hasn’t been cracked yet"

It doesn't need to be filtered into human / AI content. It needs to be filtered into good (true) / bad (false) content, or each piece needs a "truth score".

We don't teach children to read by just handing them random tweets. We give them books that are made specifically for children. Our filtering mechanism for good / bad content is very robust for humans. Why can't AI just read every piece of "classic literature", famous speeches, popular books, good TV and movie scripts, textbooks, etc.?
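A rough sketch of what filtering by a "truth score" could look like, assuming such a score already exists for each document. Producing that score is the hard part and is not shown here. The `Document` class, the `filter_corpus` function, the 0.8 threshold, and the scores themselves are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Document:
    title: str
    text: str
    # Hypothetical score in [0, 1]. How to compute it is the open problem.
    truth_score: float


def filter_corpus(docs: list[Document], threshold: float = 0.8) -> list[Document]:
    """Keep only documents whose score is at or above the threshold."""
    return [d for d in docs if d.truth_score >= threshold]


# Illustrative placeholder scores, not real measurements.
corpus = [
    Document("textbook chapter", "...", 0.95),
    Document("random tweet", "...", 0.30),
    Document("famous speech", "...", 0.85),
]

kept = filter_corpus(corpus)
print([d.title for d in kept])
```

The threshold is a design choice: raising it keeps the corpus cleaner but leaves less text to train on.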

Zos_Kia@lemmynsfw.com 2 points 1 year ago

That's what smaller models do, but it doesn't yield great performance because there's only so much of that material available. To get to GPT-4 levels you need a lot more data, and to break the next glass ceiling you'll need even more.

KevonLooney@lemm.ee 2 points 1 year ago

Then these models are stupid. Humans don't start as a blank slate. They have an inherent aptitude for language and communication. These models should start out with the basics of language, so they don't have to learn it from the ground up. That's the next step. Right now they're just well-read idiots.

Zos_Kia@lemmynsfw.com 2 points 1 year ago

"Then these models are stupid"

Yup, that is kind of the point. They are math functions designed to approximate human tasks.

"These models should start out with basics of language, so they don’t have to learn it from the ground up. That’s the next step. Right now they’re just well read idiots."

I'm not sure what you're pointing at here. Simplified, the way they do it right now is this: a small model cuts text into tokens ("knowledge of syllables"); the tokens are fed into a larger model that turns them into semantic information ("knowledge of language"); and that is fed to a ridiculously fat model which "accomplishes the task" ("knowledge of things").

The first two models are small enough that they can be trained on the kind of data you describe: classic books, movie scripts, etc. A couple hundred billion words, maybe. But the last one requires orders of magnitude more data, in the trillions.
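To make the three stages concrete, here is a toy sketch in Python. Everything in it is a stand-in I made up for illustration: splitting on whitespace is not how real tokenizers work, the embedding vectors are random instead of learned, and the "big model" just averages vectors. The function names (`tokenize`, `build_embeddings`, `big_model`, `run_pipeline`) are hypothetical. It only shows how data flows from one stage to the next.

```python
import random


def tokenize(text: str) -> list[str]:
    """Stage 1 ("knowledge of syllables"): cut text into tokens.

    Assumption: whitespace splitting stands in for a learned subword tokenizer.
    """
    return text.lower().split()


def build_embeddings(vocab: list[str], dim: int = 4, seed: int = 0) -> dict[str, list[float]]:
    """Stage 2 ("knowledge of language"): map each token to a vector.

    Assumption: the vectors are random placeholders; a trained model learns them.
    """
    rng = random.Random(seed)
    return {tok: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for tok in vocab}


def big_model(vectors: list[list[float]]) -> list[float]:
    """Stage 3 ("knowledge of things"): stand-in for the huge task model.

    Assumption: averaging the token vectors replaces the real computation.
    """
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def run_pipeline(text: str, dim: int = 4) -> tuple[list[str], list[float]]:
    tokens = tokenize(text)
    table = build_embeddings(sorted(set(tokens)), dim=dim)
    vectors = [table[tok] for tok in tokens]
    return tokens, big_model(vectors)


tokens, output = run_pipeline("AI trained on AI garbage")
print(tokens)
print(len(output))
```

The point of the sketch is the size mismatch: stages 1 and 2 only need a vocabulary and a lookup table, while stage 3 is where nearly all the parameters, and so nearly all the data requirements, would sit.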
