this post was submitted on 06 Sep 2024
Technology
Except that, again, as is literally written in the comment you're directly replying to, it has been shown that AI can reproduce copyrightable works word for word, showing that it objectively and necessarily is storing particular creative works in a particularly identifiable manner, whether or not that manner is yet known to humans.
It's called learning, and I wish people did more of it.
You don't learn by memorizing and reproducing works; you learn by understanding the concepts in various works and producing new works that combine the ideas in those other works. AI doesn't understand, and it has been shown to be able to reproduce works, so I think it's fair to say that it's doing a lot of "memorizing" and therefore plagiarism.
Calling what attention transformers do memorization is wildly inaccurate.
*Unless we're talking about semantic memory.
Is it though? People memorize things very differently than computers do, but the actual mechanism of storage isn't particularly important. What's important is the net result. Whether it uses Bayesian networks (what we used in class for small-scale NLP), neural networks (what I assume LLMs use), or something else doesn't particularly matter.
For example, a search engine typically only stores keywords and relationships, so there's no way for it to reproduce an entire work (ignoring, of course, the "caching" features some search engines have). All it does is associate keywords with source material, so there's a strong argument that it falls under fair use.
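To make the keyword point concrete, here is a minimal sketch of a keyword index. It is not how any real search engine is implemented; `build_index` and the two sample documents are made up for illustration, and real engines may store more (positions, cached copies).

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase keyword to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            # Strip trailing punctuation so "dog." and "dog" are the same keyword.
            index[word.strip(".,!?")].add(doc_id)
    return index

# Hypothetical sample documents, for illustration only.
docs = {
    "a": "The quick brown fox jumps over the lazy dog.",
    "b": "The lazy dog sleeps all day.",
}
index = build_index(docs)

print(sorted(index["lazy"]))  # ['a', 'b']
print(sorted(index["fox"]))   # ['a']
```

The index only records which documents contain which words. Word order and repetition are gone, so the original sentences can't be rebuilt from it.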
LLMs, on the other hand, process entire works and keep more than just keywords, and they store it in such a way that entire works can be recovered if coaxed. My understanding is that they break up words into smaller pieces (sub-word tokens, something like sets of phonemes), and then queries get the same break-up as input to the neural network to produce an output, which is then reassembled into text. That's my relatively naive understanding of how it all works (I've only done university-level NLP, and that was years ago), but again, that's really not the point here. The point is that it uses a lot more of the work than the typical understanding of "fair use" allows, and if copyrighted works can be reproduced by it, then the copyrighted work is "stored" in some fashion, so it can be thought of as a really complex form of compression with tricky retrieval mechanisms. So in layman's terms, it's "memorizing" entire works in a way not entirely unlike a "mind palace": to reproduce a given work, you need the right input to follow the right steps, and a slightly different input will lead to a very different output (i.e. maybe something with similar content, but no copyright violations).
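As a toy illustration of "stored in some fashion, recoverable with the right input", here is a sketch of a tiny next-word lookup model. To be clear, this is a plain n-gram table, not a neural network or a transformer, and `train`, `generate`, and the sample sentence are all made up for the example. It only shows the general idea that a model which never keeps the text as one string can still give it back verbatim.

```python
from collections import Counter, defaultdict

def train(text, k=2):
    """Count which word follows each k-word context in the text."""
    words = text.split()
    model = defaultdict(Counter)
    for i in range(len(words) - k):
        model[tuple(words[i:i + k])][words[i + k]] += 1
    return model

def generate(model, prompt, k=2, max_words=50):
    """Greedily extend the prompt with the most frequent next word."""
    out = prompt.split()
    for _ in range(max_words):
        context = tuple(out[-k:])
        if context not in model:
            break  # the model has never seen this context
        out.append(model[context].most_common(1)[0][0])
    return " ".join(out)

# Hypothetical "work" used as the only training text.
work = "the quick brown fox jumps over the lazy dog"
model = train(work)

print(generate(model, "the quick"))  # the quick brown fox jumps over the lazy dog
print(generate(model, "a quick"))    # a quick
```

The right prompt walks the table step by step and reproduces the whole sentence, while a prompt that is off by one word gets nothing back.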
What's at issue isn't whether the LLM is likely to reproduce entire works, but whether it can and does, which would mean it's violating fair use standards.
Learning is not being able to reproduce a news article word for word.
No, it isn't storing that information in that sequence. What is happening is that it is overly encoding those particular sequential relationships along some arbitrary but tightly mapped semantic concepts represented by dimensions in a massive vector space. It is storing copies of the information in the way that inadvertent copying of music might be based on "memorized" music listened to by the infringing artist in the past.
Not what I said. I used the exact language the above commenter used because it was specific and accurate. Also, inadvertent copyright violation is still copyright violation under US law. I'm not the biggest fan of every application of that law, but the ability to keep large corporations from ripping off small artists and creators is one that I think is good and useful under the global economic system we live under currently.
Yes, inadvertent copying is still copying, but it would be copying in the output and is not evidence of copying happening in the creation of the model. That was why I used the music example, because it is rather probative of where there could be grounds for copyright infringement related to these model architectures. This may not seem like an important distinction, but it has significant consequences for who is ultimately liable and how.