this post was submitted on 17 Feb 2024
Technology
Called this a while back; this is why Reddit has such a high valuation.
Poisoning your data won't do anything but give them more data. Do you seriously think Reddit's servers don't track every edit you make to posts? You'd literally just be providing training data of original human text vs. poisoned text. They'd still have your original post, and they have a copy of every edit you make to it.
Whoever buys Reddit will have sole access to one of the larger (I don't think the largest, though) pools of text training data on the internet, with fully licensed usage of it. I expect someone like Google, FB, MS, OpenAI, etc. would pay big $$$ for that.
"But can't people already scrape it?"
Well yes, but it's at best legally dubious in some places.
Scraping data off Reddit only gets you the current versions of posts (which means you can get poisoned data and can't see deleted content), and it is extremely slow. If you own the server, you have first-class access to all posts in a database, including the originals and diffs of every time someone edited a post, and all the deleted posts too.
Say you wanted to train an AI to detect posts that need flagging for moderation. If you scrape Reddit data, you can't find the deleted posts that got moderated...
But if you have the raw original data, you would 100% have a list of every post that got deleted by mods, and even the mod message on why it was deleted.
You can surely see the value of such data, which only the owners of Reddit are currently privy to...
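To make the scraper vs. owner difference concrete, here's a rough Python sketch. Everything in it is made up for illustration: the `Post` record, its fields, and the sample posts are assumptions, not Reddit's actual schema or API. It just shows that someone holding the full revision history can build (original, edited) pairs and moderation labels, while a scraper only sees the latest text of posts that are still up.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Post:
    """Hypothetical server-side record; not Reddit's real schema."""
    post_id: int
    revisions: List[str]               # oldest first; revisions[0] is the original
    removed_by_mod: bool = False
    mod_reason: Optional[str] = None   # mod message, if the post was removed


def edit_pairs(posts: List[Post]) -> List[Tuple[str, str]]:
    """(original, latest) text for every post that was edited at least once."""
    return [(p.revisions[0], p.revisions[-1]) for p in posts if len(p.revisions) > 1]


def moderation_examples(posts: List[Post]) -> List[Tuple[str, bool, Optional[str]]]:
    """(original text, was removed by mods, mod reason) for every post."""
    return [(p.revisions[0], p.removed_by_mod, p.mod_reason) for p in posts]


def scraper_view(posts: List[Post]) -> List[str]:
    """What a scraper can see: latest text of posts that were not removed."""
    return [p.revisions[-1] for p in posts if not p.removed_by_mod]


# Made-up sample data
posts = [
    Post(1, ["How do I fix my router?", "asdf qwerty poisoned"]),
    Post(2, ["Buy cheap pills here"], removed_by_mod=True, mod_reason="spam"),
    Post(3, ["Nice write-up, thanks."]),
]

print(edit_pairs(posts))
print(moderation_examples(posts))
print(scraper_view(posts))
```

In this toy data the scraper only ever gets the poisoned version of post 1 and never sees post 2 at all, while the owner's view has both the original text and the removal reason.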
sigh
So the old trick of “search term +reddit” will no longer work then, huh?
I’ve already made a habit of adding date limiters to web searches so I only get results from before LLMs were made public… The SEO ‘optimization’ game of before was bearable, but the LLM spam just ruins so many search results with regurgitated garbage or teaspoon-deep information.
During the peak of the great purge, it was quickly becoming pointless. A lot of results were bringing up deleted posts. It took a while for search engines to catch up and start filtering a lot of those results out.