Reddit's licensing deal means Google's AI can soon be trained on the best humanity has to offer — completely unhinged posts (www.businessinsider.com)

submitted 2 years ago by throws_lemy@lemmy.nz to c/technology@lemmy.world


[–] echodot@feddit.uk 10 points 2 years ago (2 children)

I'm so confused about how AI learning is supposed to work. Does it just need any data at all in significant quantity? Is the quality of the data almost irrelevant? Because otherwise surely they could just feed it back issues of Scientific American, or the scanned copies of the Library of Congress. I can't reasonably believe that Reddit is going to add anything unless it's just pure unadulterated quantity that's important.

[–] underisk@lemmy.ml 5 points 2 years ago* (last edited 2 years ago) (1 children)

The part you're missing is the metadata. AI models (neural networks, specifically) are trained on the data as well as some sort of contextual metadata related to what they're being trained to do. For example, with Reddit posts they would feed in things like "this post is popular", "this post was controversial", "this post has many views", etc. in addition to the post text if they wanted an AI that could spit out posts that are likely to do well on Reddit.

Quantity is a concern; you need to reach a fairly large threshold of data to have any hope of training an AI well, but there are diminishing returns after a certain point. The more data you feed it, the more metadata you may have to add, and some of that can only be provided by humans. For instance, with sentiment analysis you need a human being to sit down and label various samples of text with the emotional response they express, since computers can't really do that automatically.

Quality is less of a concern. Poor-quality data, or data with poorly applied metadata, will result in an AI with less "accuracy". A few outliers and mistakes here and there won't be too impactful, though. Quality here could be defined by how well your training set represents the kind of input you'll be expecting it to work with.
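
To make the "text plus metadata" idea above concrete, here is a minimal sketch, assuming a toy setup that is not from the thread: a handful of made-up posts, a made-up "popular" flag standing in for the metadata, and a plain logistic-regression classifier over word counts. The names `POSTS`, `featurize`, `predict`, and `train` are all hypothetical, and a real system would be far larger and likely structured differently.

```python
import math
from collections import Counter

# Hypothetical toy data: (post text, label). The label plays the role of
# metadata such as "this post is popular" (1) or not (0).
POSTS = [
    ("cute cat picture", 1),
    ("my cat did a funny thing", 1),
    ("funny dog video", 1),
    ("tax form question", 0),
    ("router firmware changelog", 0),
    ("question about tax deadline", 0),
]


def featurize(text):
    """Turn a post into bag-of-words counts."""
    return Counter(text.lower().split())


def predict(weights, bias, text):
    """Estimated probability that a post is popular under the current weights."""
    score = bias + sum(weights.get(word, 0.0) * count
                       for word, count in featurize(text).items())
    return 1.0 / (1.0 + math.exp(-score))


def train(posts, epochs=200, lr=0.5):
    """Fit word weights with stochastic gradient descent on the log loss."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in posts:
            # How far the current guess is from the human-provided label.
            error = predict(weights, bias, text) - label
            for word, count in featurize(text).items():
                weights[word] = weights.get(word, 0.0) - lr * error * count
            bias -= lr * error
    return weights, bias


weights, bias = train(POSTS)
popular_guess = predict(weights, bias, "funny cat")
unpopular_guess = predict(weights, bias, "tax question")
```

After training, `popular_guess` should land above 0.5 and `unpopular_guess` below it, because the only signal the model ever sees is which words co-occur with which label. Flipping a label or two in `POSTS` is a quick way to see how mislabeled metadata shifts the learned weights.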

[–] madcaesar@lemmy.world 4 points 2 years ago (2 children)

The way I'm reading this, AI is just shitloads of if statements, not some intelligence. It's all garbage.

[–] aidan@lemmy.world 9 points 2 years ago

It's not if statements anymore, now it's just a random number generator + a lot of multiplication put through a sigmoid function. But yeah, of course there is no intelligence to it. It's extreme calculus.
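
For anyone curious what "random numbers + multiplication + sigmoid" looks like in practice, here is a rough sketch of one forward pass through a tiny two-layer network. The layer sizes, the input values, and the helper names `init_layer` and `forward` are arbitrary assumptions for illustration. Nothing here is trained, so the output is meaningless until training adjusts the weights.

```python
import math
import random


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_inputs, n_outputs, rng):
    """Random starting weights: the 'random number generator' part."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
            for _ in range(n_outputs)]


def forward(layer, inputs):
    """Weighted sums (the multiplication part), each put through a sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)))
            for row in layer]


rng = random.Random(0)            # fixed seed so runs are repeatable
hidden = init_layer(3, 4, rng)    # 3 inputs -> 4 hidden units
output = init_layer(4, 1, rng)    # 4 hidden units -> 1 output

hidden_values = forward(hidden, [0.5, -1.0, 2.0])
result = forward(output, hidden_values)
```

`result` is a one-element list holding a number between 0 and 1. Training amounts to nudging the numbers in `hidden` and `output` until that value matches the labels often enough.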

[–] underisk@lemmy.ml 1 points 2 years ago

You're not entirely wrong. It's more like a series of multi-dimensional maps with hundreds or thousands of true/false pathways stacked on top of each other, then carved into by training until it takes on a shape that produces the 'correct' output from your inputs.

[–] Tywele@lemmy.dbzer0.com 4 points 2 years ago* (last edited 2 years ago) (1 children)

If you wanted the AI to just create book-like texts, then you could train it purely on books from a library, but if you want it to converse like a human being you need training data that imitates that.

[–] echodot@feddit.uk 2 points 2 years ago (1 children)

But that's my point really: it already talks like a human. My guess is they feed it hours and hours and hours of podcasts, because that tends to be the manner in which it communicates. I don't see how Reddit really adds to this.

[–] aidan@lemmy.world 2 points 2 years ago

I doubt it's trained on podcasts, seeing as they would need subtitles, and current automated subtitling is not that good.