Devial

joined 3 weeks ago
[–] Devial@discuss.online 25 points 1 day ago* (last edited 1 day ago) (4 children)

They didn't get mad, they didn't even know he reported it, and they have no reason or incentive to sweep it under the rug, because they have no connection to the dataset. Did you even read my comment?

I hate Alphabet as much as the next person, but this feels like you're just trying to find any excuse to hate on them, even if it's basically a made-up reason.

[–] Devial@discuss.online 183 points 2 days ago* (last edited 2 days ago) (44 children)

The article headline is wildly misleading, bordering on a straight-up lie.

Google didn't ban the developer for reporting the material; they didn't even know he reported it, because he did so anonymously, and to a child protection org, not Google.

Google's automated tools (correctly) flagged the CSAM when he unzipped the data and subsequently nuked his account.

Google's only failure here was not unbanning him on his first or second appeal. And whilst that is absolutely a big failure on Google's part, I find it very understandable that the appeals team, generally speaking, won't accept "I didn't know the folder I uploaded contained CSAM" as a valid reason to overturn a ban.

It's also kind of insane how this article somehow makes a bigger deal out of this developer being temporarily banned by Google than it does of the fact that hundreds of CSAM images were freely available online, openly sharable by anyone and to anyone, for god knows how long.

[–] Devial@discuss.online 10 points 2 days ago* (last edited 2 days ago)

They reacted to the presence of CSAM. It had nothing whatsoever to do with the material being part of an AI training dataset, contrary to what the comment I originally replied to claims.

[–] Devial@discuss.online 12 points 2 days ago* (last edited 2 days ago) (2 children)

They didn't react to anything. The automated system (correctly) flagged and banned the account for CSAM, and, as usual, the manual ban appeal sucked ass and didn't do what it's supposed to do. (Whilst this is obviously a very unique case, and the ban should have been overturned on appeal right away, it does make sense that the appeals team, broadly speaking, rejects "I didn't know this contained CSAM" as a legitimate appeal reason.) This is barely newsworthy. The real headline should be about how hundreds of CSAM images were freely available and sharable from this dataset.

[–] Devial@discuss.online 30 points 2 days ago* (last edited 2 days ago) (4 children)

Did you even read the article? The dude reported it anonymously, to a child protection org, not Google, and his account was nuked as soon as he unzipped the data, because the content was automatically flagged.

Google didn't even know he reported this, and Google has nothing whatsoever to do with this dataset. They didn't create it, and they don't own or host it.

[–] Devial@discuss.online 17 points 4 days ago (2 children)

Has this dude never heard of the tobacco, alcohol, or gun industries?

He's talking about commercial heroin like it's some outlandish and unthinkable idea that a harmful thing could become a billion-dollar industry.

[–] Devial@discuss.online 121 points 1 week ago (3 children)

If you gave your AI permission to run console commands without checks or verification, then you did in fact give it permission to delete everything.

[–] Devial@discuss.online 0 points 1 week ago* (last edited 1 week ago)

That is enormously ironic, since I literally never claimed you said anything except for what you did say: namely, that synthetic data is enough to train models.

According to you, they should be able to just generate synthetic training data purely with the previous model, and then use that to train the next generation.

Literally, the very next sentence starts with the words "Then why", which clearly and explicitly means I'm no longer indirectly quoting you. Everything else in my comment is quite explicitly my own thoughts on the matter, and why I disagree with that statement, so in actual fact, you're the one making up shit I never said.

[–] Devial@discuss.online 0 points 1 week ago* (last edited 1 week ago) (2 children)

If the model collapse theory weren't true, then why do LLMs need to scrape so much data from the internet for training?

According to you, they should be able to just generate synthetic training data purely with the previous model, and then use that to train the next generation.

So why is there even a need for human input at all, then? Why are all LLM companies fighting tooth and nail against their data scraping being restricted, if real human data is in fact so unnecessary for model training and they could just generate their own synthetic training data instead?

You can stop models from deteriorating without new data, and you can even train them with synthetic data, but that still requires the synthetic data to either be modelled or filtered by humans to ensure its quality. If you just take a million random chatGPT outputs, with no human filtering whatsoever, use those to retrain the chatGPT model, and then repeat that over and over again, the model will eventually turn to shit. In each iteration, some of the random tweaks chatGPT makes to its output will produce low-quality results, which are then presented to the next model as a target to achieve. The new model learns to treat that bad output as high quality, which makes it more likely to reappear in the next set of synthetic data.

And if you turn off the random tweaks, the model may not deteriorate, but it also won't improve, because effectively no new data is being generated.
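A toy sketch of that feedback loop, if it helps (purely illustrative, nothing like a real LLM pipeline; the "model" here is just a Gaussian fitted to its own samples): each generation is trained only on the previous generation's unfiltered synthetic outputs, and sampling noise compounds until the distribution drifts and narrows.

```python
# Toy illustration of the feedback loop described above. This is an
# assumption-laden sketch, not a real training pipeline: a "model" is
# just a fitted mean and standard deviation, and "training data" is
# whatever the previous model sampled, with no human filtering.
import random
import statistics

def train_generation(samples):
    """'Train' a model by fitting a mean and stdev to its training data."""
    return statistics.mean(samples), statistics.stdev(samples)

def generate(model, n):
    """'Sample' n synthetic outputs from the fitted model."""
    mu, sigma = model
    return [random.gauss(mu, sigma) for _ in range(n)]

random.seed(0)
model = (0.0, 1.0)  # generation 0 stands in for the real data distribution
for gen in range(1, 11):
    synthetic = generate(model, 50)      # unfiltered synthetic outputs
    model = train_generation(synthetic)  # retrain purely on them
    print(f"gen {gen:2d}: mean={model[0]:+.3f} stdev={model[1]:.3f}")

# Typical run: the stdev steadily shrinks and the mean wanders, because
# each generation inherits and amplifies the previous one's sampling noise.
```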

[–] Devial@discuss.online 21 points 2 weeks ago* (last edited 1 week ago) (1 children)

What the hell is even the point of mandating a backup alarm for self-driving cars? Backup alarms only exist because visibility to the rear is worse, to warn pedestrians that a nearby vehicle is moving with very poor to no visibility, but that only applies to human-operated vehicles. Autonomous vehicles use 360° sensors; they can "see" just as well in reverse as forward. Be that good or bad, it's equal in every direction, so mandating an alarm just for reverse seems enormously pointless. Especially since the cars tend to be slower in reverse, so if anything the alarm is less necessary then than when they're moving forward.

[–] Devial@discuss.online 9 points 2 weeks ago* (last edited 1 week ago) (1 children)

The line, imo, is: are you creating it yourself and just using AI to help you make it faster or more convenient, or is AI the primary thing creating your content in the first place?

Using AI for convenience is absolutely valid imo. I routinely use chatGPT to do things like debugging code I wrote, rewriting datasets in different formats instead of doing it by hand, or handling more complex search-and-replace jobs if I can't be fucked to figure out a regex to cover them.

For these kinds of jobs, I think AI is a great tool.

Put more simply: I personally use AI for small subtasks that I am entirely capable of doing myself, but that are annoying, boring, repetitive, or time-consuming to do by hand. The sketch below shows the kind of job I mean.
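For instance, a made-up example of the sort of regex search-and-replace I'd otherwise ask an AI to draft (the pattern and sample text here are hypothetical): converting DD/MM/YYYY dates to ISO YYYY-MM-DD across a file.

```python
# Hypothetical example of the kind of mechanical rewrite meant above:
# reformat DD/MM/YYYY dates to ISO YYYY-MM-DD using a regex with
# backreferences, instead of editing every occurrence by hand.
import re

text = "Started 03/11/2024, shipped 15/01/2025."
iso = re.sub(r"\b(\d{2})/(\d{2})/(\d{4})\b", r"\3-\2-\1", text)
print(iso)  # Started 2024-11-03, shipped 2025-01-15.
```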

[–] Devial@discuss.online 63 points 2 weeks ago (13 children)

If "everyone will be using AI", AI will turn to shit.

They can't create originality; they only recycle and recontextualise existing information. But if you recycle and recontextualise the same information over and over again, it keeps degrading more and more.

It's ironic that the very people who advocate for AI everywhere fail to realise just how dependent the quality of AI output is on having real, human-generated content to train the model on.
