this post was submitted on 10 Jan 2024
1237 points (96.5% liked)
Technology
They're not serving you the exact content they scraped, and that makes all the difference.
Well if you believe that you should look at the times lawsuit.
Word for word on hundreds/thousands of pages of stolen content, it's damning.
Why do you assume that I haven't? The case hasn't been resolved and it's not clear how The NY Times did what they claim, which may as well be manipulation. It's a fair rebuttal by OpenAI. The Times hasn't provided the steps they used to achieve that.
So unless that's cleared up, it's not damning in the slightest. Not yet, anyway. And that still doesn't invalidate my statement above, because it's still under very specific circumstances when that happens.
Also, intention is pretty important when determining the guilt of many crimes. OpenAI doesn't intentionally spit back an author's exact words; their intention is to summarize and create unique content.
Ah, yes. The defense of "I didn't mean to do it." Always a classic.
No, the real defense is "that's not how LLMs work" but you are all hinging on the wrong idea. If you think that an LLM is capable of doing what you claim, I'd love to hear the mechanism in detail and the steps to replicate it.
I mean, I'm not sure why this conversation even needs to get this far. If I write an article about the history of Disney movies, and make it very clear the way I got all of those movies was to pirate them, this conversation is over pretty quick. OpenAI and most of the LLMs aren't doing anything different. The Times isn't Wikipedia, most of their stuff is behind a paywall with pretty clear terms of service and nothing entitles OpenAI to that content. OpenAI's argument is "well, we're pirating everything so it's okay." The output honestly seems irrelevant to me, they never should have had the content to begin with.
That's not the claim that they're making. They're arguing that OpenAI retains their work they made publicly available, which OpenAI claims is fair use because it's wholly transformative in the form of nodes, weights and biases, and that they don't store those articles in a database for reuse. But their other argument is that they created a system that threatens their business which is just ludicrous.
So it's content laundering
What a colorful mischaracterization. It sounds clever at face value but it's really naive. If anything about this is deceptive, it's the lengths that people go to to slander what they dislike.
I feel most people critical of AI don't know how a neural network works...
That is exactly what's going on here. Or they hate it enough that they don't mind making stuff up or mischaracterizing what it does. Seems to be a common thread on the Fediverse. It's not the first time this week I've seen it.
Actually content laundering is the best term I've heard to describe the process. Just like money laundering, you no longer know the source and know it's technically legal to use and distribute.
I mean, if the copyrighted content wasn't so critical, they would train models without it. They're essentially derivative works, but no one wants to acknowledge it because it would either require changing our copyright laws or make this potentially lucrative and important work illegal.
Content laundering is not a good way to describe it because it oversimplifies and mischaracterizes what a language model actually does. It's a fundamental misunderstanding of how it works. Training language models is typically a transparent and well-documented process, as described by the mountains of research over the past decades. The real value comes from the weights of the nodes in the neural network, not from spitting out in its entirety the source it was trained on. The source material is evaluated and wholly transformed into new data in the form of nodes and weights. The original content does not exist as it was within the network because there's no way to encode it that way. It's a statistical system that compounds information.
And while LLMs do have the capacity to create derivative works in other ways, it's not all that they do, or what they always do. It's only one of the many functions that it has. What you say would probably be true if it was only trained on a single source, but that's not even feasible. But when you train it on millions of sources, what remains are the overall patterns of language within those works. It's much more sophisticated and flexible than what you describe.
So no, if it was cut and dried there would be grounds for a legitimate lawsuit. The problem is that people are arguing points that do not apply but sound reasonable when they haven't seen a neural network work under the hood. If anything, new laws need to be created to address what LLMs do if you're so concerned about proper compensation.
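To make the single-source versus many-sources point above concrete, here's a toy sketch. It is a bigram counter, not a transformer, and it is only an assumption-laden stand-in for "a statistical system": the function names and the sample sentences are made up for illustration, and it proves nothing either way about what ChatGPT does with Times articles.

```python
from collections import Counter, defaultdict


def train_bigrams(texts):
    """Count, for every word, how often each other word follows it."""
    follows = defaultdict(Counter)
    for text in texts:
        words = text.split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def generate(follows, start, max_words=12):
    """Greedy generation: always append the most frequent next word."""
    out = [start]
    while len(out) < max_words:
        candidates = follows.get(out[-1])
        if not candidates:
            break  # no word ever followed this one in training
        out.append(candidates.most_common(1)[0][0])
    return " ".join(out)


# Hypothetical training data, invented for this sketch.
single = "a quick brown fox jumps over the lazy dog"
many = [
    single,
    "a quick red fox jumps over the fence",
    "a quick red fox jumps over the lazy cat",
    "a quick red bird sees the lazy dog",
]

# Trained on one source, the toy model can only replay that source.
print(generate(train_bigrams([single]), "a"))

# Trained on several sources, the counts blend and the output is a
# sentence that appears in none of them.
print(generate(train_bigrams(many), "a"))
```

With one source the only stored statistics are that source's own word order, so the output is the source itself. With several sources the stored counts are shared across them, and the output follows the most common continuations instead of any one text. A real model has vastly more capacity than this, so the toy shows the shape of the argument, not the answer to it.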
I am familiar with how LLMs work and are trained. I've been using transformers for years.
The core question I'd ask is, if the copyrighted material isn't essential to the model, why don't they just train the models without that data? If it is core to the model, then can you really say they aren't derivative of that content?
I'm not saying that the models don't do something more, just that the more is built upon copyrighted material. In any other commercial situation, you'd have to license/get approval for the underlying content if you were packaging it up. When sampling music, for example, the output will differ greatly from the original song, but because you are building off someone else's work you must compensate them.
It's why content laundering is a great term. The models intermix so much data that it's hard to know if the content originated from copyrighted materials. Just like how money laundering is trying to make it difficult to determine if the money comes from illicit sources.
It's great how most of us are taught that just changing the order of words is still plagiarism. For them, they frequently end up using the exact same words as other things and people still argue it somehow is intelligent and somehow not plagiarism.
"Changing the order of words" is what it does? That's news to me. And do you have examples of it "using the exact same words as other things" without prompt manipulation?
Why does the prompting matter? If I "prompt" a band to play copyrighted music does that mean they get a free pass?
That's not a very good analogy because the band would be reproducing an entire work of art, which an LLM does not and cannot. And by prompt manipulation I mean purposely making it seem like the LLM is doing something it wouldn't do on its own. The operative word is seem, which is what I meant by manipulation. The prompting here is irrelevant, but how it's done is not. So unless The Times releases the steps they used to get ChatGPT to output what it did, you can't really claim that that's what it does.
If you passed them a sheet of music I'd say that's on you, it would be your responsibility to not sell recordings of them playing it.
Just like if I typed the first chapter of Harry Potter into Word, it is not Microsoft's intent to breach copyright; it would have been my intent to make it do it. It would be my responsibility not to sell that first chapter, and they should come after me if I did, even though MS is a corporation that supplied the tools.