this post was submitted on 06 Sep 2024
1726 points (90.1% liked)
Technology
Here's an experiment for you to try at home. Ask an AI model a question, copy a sentence or two of what they give back, and paste it into a search engine. The results may surprise you.
And stop comparing AI to humans but then giving AI models more freedom. If I wrote a paper I'd need to cite my sources. Where the fuck are your sources ChatGPT? Oh right, we're not allowed to see that but you can take whatever you want from us. Sounds fair.
Can you just give us the TL;DR?
AI Chat bots copy/paste much of their "training data" verbatim.
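For anyone who wants a rough offline version of the copy-a-sentence experiment above, here is a minimal sketch. It assumes you already have the model's answer and a candidate source text as plain strings. The function names and the 5-word window are my own choices for illustration, not any standard tool, and a high score only shows shared wording, not where the wording came from.

```python
def word_ngrams(text, n):
    """Return the set of n-word sequences in text (lowercased, split on whitespace)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(model_output, source, n=5):
    """Fraction of the n-word sequences in model_output that also appear in source.

    Hypothetical helper: the window size n=5 is an arbitrary assumption.
    Returns 0.0 when model_output has fewer than n words.
    """
    out = word_ngrams(model_output, n)
    if not out:
        return 0.0
    return len(out & word_ngrams(source, n)) / len(out)
```

A result near 1.0 means almost every 5-word run in the answer also shows up in the source; a result near 0.0 means the wording barely overlaps.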
Not to fully argue against your point, but I do want to push back on the citations bit. Given the way an LLM is trained, it's not really close to equivalent to me citing papers researched for a paper. That would be more akin to asking me to cite every piece of written or verbal media I've ever encountered, as they all contributed in some small way to the way that the words were formulated here.
Now, if specific data were injected into the prompt, or maybe if it was fine-tuned on a small subset of highly specific data, I would agree those should be cited as they are being accessed more verbatim. The whole "magic" of LLMs was that it needed to cross a threshold of data, combined with the attention mechanism, and then the network was pretty suddenly able to maintain coherent sentence structure. It was only with loads of varied data from many different sources that this really emerged.
This is the catch with OP's entire statement about transformation. Their premise is flawed, because the next most likely token is usually the same word the author of a work chose.
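To make the "next most likely token" idea concrete, here is a toy sketch. It is a word-level bigram counter with greedy decoding, which is a stand-in I picked for illustration and not how a real LLM works (no neural network, no attention, one-word context). The function names are hypothetical. All it shows is that when a continuation was only ever seen in one text, always picking the most likely next word walks straight back through that text; it says nothing about how often that happens in a model trained on large, varied data.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, which words followed it in the training text."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def greedy_continue(counts, start, max_words=10):
    """Repeatedly append the most frequent follower of the last word."""
    out = [start]
    for _ in range(max_words):
        followers = counts.get(out[-1])
        if not followers:
            break  # the last word was never followed by anything in training
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)


corpus = "it was the best of times"
model = train_bigram(corpus)
print(greedy_continue(model, "it"))  # prints "it was the best of times"
```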
And that's kinda my point. I understand that transformation is totally fine, but these LLMs literally copy and paste shit. And that's still if you are comparing AI to people, which I think is completely ridiculous. If anything these things are just more complicated search engines with half the usefulness. If I search online about how to change a tire I can find some reliable sources to do so. If I ask AI how to change a tire it would just spit something out that might not even be accurate, and I'd have to search again afterwards just to make sure what it told me was right.
It's just a word calculator based on information stolen from people without their consent. It has no original thought process so it has no way to transform anything. All it can do is copy and paste in different combinations.
It's not a breach of copyright or other IP law not to cite sources on your paper.
Getting your paper rejected for lacking sources is also not infringing on your freedom. Being forced to pay damages and delete your paper from any public space would be an infringement of your freedom.
I’m pretty sure it’s true that citing sources isn’t really relevant to copyright violation: either you are violating it or you’re not. Saying where you copied from doesn’t change anything, but if you are using some ideas with your own analysis and words it isn’t a violation either way.
With music this often ends up in civil court. Pretty sure the same can in theory happen for written texts, but the commercial value of most written texts is not worth the cost of litigation.
I mean, you're not necessarily wrong. But that doesn't change the fact that it's still stealing, which was my point. Just because laws haven't caught up to it yet doesn't make it any less of a shitty thing to do.
When I analyze a melody I play on a piano, I see that it reflects the music I heard that day or sometimes, even music I heard and liked years ago.
Having parts similar or a part that is (coincidentally) identical to a part from another song is not stealing and does not infringe upon any law.
You guys are missing a fundamental point. Copyright was created to protect an author for a specific amount of time so that somebody else doesn't profit from their work, essentially stealing their deserved revenue.
LLM AI was created to do exactly that.
It's not stealing, it's not even 'piracy', which also is not stealing.
Copyright laws need to be scaled back so they don't criminalize socially accepted behavior, not expanded.
The original source material is still there. They just made a copy of it. If you think that's stealing then online piracy is stealing as well.
Well they make a profit off of it, so yes. I have nothing against piracy, but if you're reselling it that's a different story.
But piracy saves you money which is effectively the same as making a profit. Also, it's not just that they're selling other people's work for profit. You're also paying for the insane amount of computing power it takes to train and run the AI plus salaries of the workers etc.