this post was submitted on 28 Jan 2024

380 points (95.2% liked)

GenAI tools ‘could not exist’ if firms are made to pay copyright (www.computerweekly.com)

submitted 2 years ago by L4s@lemmy.world to c/technology@lemmy.world


[–] valen@lemmy.world 122 points 2 years ago (6 children)

So they're admitting that their entire business model requires them to break the law. Sounds like they shouldn't exist.

[–] Even_Adder@lemmy.dbzer0.com 48 points 2 years ago* (last edited 2 years ago) (1 children)

It likely doesn't break the law. You should check out this article by Kit Walsh, a senior staff attorney at the EFF, and this one by Katherine Klosek, the director of information policy and federal relations at the Association of Research Libraries.

Headlines like these let people assume that it's illegal, rather than educate people on their rights.

[–] jacksilver@lemmy.world 2 points 2 years ago (1 children)

The Kit Walsh article purposefully handwaves around a couple of points that could become larger issues as lawsuits in this arena continue.

It says that, given the size of the training data and the model, only about a byte of data per image could be stored in any compressed format, but this assumes all training data is treated equally. It's very possible that certain images are compressed/stored in the weights more than others.
It says these models don't produce exact copies. Beyond the Getty issue, the New York Times recently published an article about a near duplicate - https://www.nytimes.com/interactive/2024/01/25/business/ai-image-generators-openai-microsoft-midjourney-copyright.html.

I think some of the points the article makes are valid, but it makes a lot of assumptions about what is actually going on in these models, which we either don't know for certain or have evidence to the contrary.

I didn't read Katherine's article so maybe there is something more there.
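
For what it's worth, here is a back-of-the-envelope sketch of the "byte per image" arithmetic and why an average alone doesn't settle the question. All the numbers (model size, image count, the share of heavily stored images) are assumptions picked for illustration, not measurements of any real model.

```python
# All figures below are illustrative assumptions, not measurements.
MODEL_SIZE_BYTES = 4 * 10**9   # assumed: a ~4 GB set of weights
TRAINING_IMAGES = 2 * 10**9    # assumed: ~2 billion training images


def average_bytes_per_image(model_bytes: int, n_images: int) -> float:
    """Model capacity divided evenly across every training image."""
    return model_bytes / n_images


def share_of_model_used(n_favored: int, bytes_each: int, model_bytes: int) -> float:
    """Fraction of the model that n_favored images would occupy
    if each one took bytes_each bytes (a hypothetical uneven split)."""
    return (n_favored * bytes_each) / model_bytes


avg = average_bytes_per_image(MODEL_SIZE_BYTES, TRAINING_IMAGES)

# Hypothetical: 1 in 10,000 images (e.g. heavily duplicated ones)
# each gets 1,000 bytes instead of the average.
favored = TRAINING_IMAGES // 10_000
share = share_of_model_used(favored, 1_000, MODEL_SIZE_BYTES)

print(f"average: {avg:.1f} bytes per image")
print(f"{favored:,} favored images would use {share:.0%} of the model")
```

Under these made-up numbers the average is tiny, yet a small subset of images could each take a much larger slice without using much of the model. Whether real models actually do that is the open question.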

[–] Even_Adder@lemmy.dbzer0.com 1 points 2 years ago* (last edited 2 years ago) (1 children)

She addresses both of those, actually. The Midjourney thing isn't new; it's the sign of a poorly trained model.

[–] jacksilver@lemmy.world 2 points 2 years ago (1 children)

I'm not sure she does. I just read the article, and it focuses primarily on what models can train on. However, the real meat of the issue with GenAI, at least I think, is what it produces.

For example, if I built a model that just spit out exact frames from "Space Jam", I don't think anyone would argue that wouldn't be a problem. The question is: where is the line?

[–] Even_Adder@lemmy.dbzer0.com 2 points 2 years ago* (last edited 2 years ago) (1 children)

This part does:

It’s not surprising that the complaints don’t include examples of substantially similar images. Research regarding privacy concerns suggests it is unlikely that a diffusion-based model will produce outputs that closely resemble one of the inputs.

According to this research, there is a small chance that a diffusion model will store information that makes it possible to recreate something close to an image in its training data, provided that the image in question is duplicated many times during training. But the chances of an image in the training data set being duplicated in output, even from a prompt specifically designed to do just that, is literally less than one in a million.

The linked paper goes into more detail.

On the note of output, I think you’re responsible for infringing works, whether you used Photoshop, copy & paste, or a generative model. Also, specific instances will need to be evaluated individually, and there might be models that don't qualify. Midjourney's new model is so poorly trained that it's downright easy to get these bad outputs.

[–] jacksilver@lemmy.world 1 points 2 years ago (2 children)

This goes back to my previous comment about handwaving away the details. There is a model out there that clearly is reproducing copyrighted materials almost identically (see the NYT article), and we also have issues with models spitting out training data: https://www.wired.com/story/chatgpt-poem-forever-security-roundup/. Clearly, people studying these models don't fully know what is actually possible.

Additionally, it only takes one instance to show that these models, in general, can and do have issues with regurgitating copyrighted data. Whether that passes the bar for legal consequences we'll have to see, but I think it's dangerous to take at face value a couple of statements made by people who don't seem to understand the unknowns in this space.

[–] FatCrab@lemmy.one 4 points 2 years ago

The ultimate issue is that the models don't encode the training data in any way that we historically have considered infringement of copyright. This is true for both transformer architectures (gpt) and diffusion ones (most image generators). From a lay perspective, it's probably good and relatively accurate for our purposes to imagine the models themselves as enormous nets that learn vague, muddled, impressions of multiple portions of multiple pieces of the training data at arbitrary locations within the net. Now, this may still have IP implications for the outputs and here music copyright is pretty instructive, albeit very case-by-case. If a piece is too "inspired" by a particular previous work, even if it is not explicit copying it may still be regarded as infringement of copyright. But, like I said, this is very case specific and precedent cuts both ways on it.

[–] Even_Adder@lemmy.dbzer0.com 1 points 2 years ago

The article dealt with Stable Diffusion, the only open model that allowed people to study it. If there were more problems with Stable Diffusion, we'd've heard of them by now. This is the critical benefit open-source development offers here. By making AI accessible, we maximize public participation and understanding, foster responsible development, and prevent harmful attempts at control.

As it stands, she was much better informed than you are and is an expert in law to boot. On the other hand, you're making a sweeping generalization right into an appeal to ignorance. It's dangerous to assert a proposition just because it has not been proven false.

[–] Marcbmann@lemmy.world 38 points 2 years ago (3 children)

Reproduction of copyrighted material would be breaking the law. Studying it and using it as reference when creating original content is not.

[–] 1Fuji2Taka3Nasubi@lemmy.zip 8 points 2 years ago (1 children)

Reproduction of copyrighted material would be breaking the law. Studying it and using it as reference when creating original content is not.

I’m curious why we think otherwise when it is a student obtaining an unauthorized copy of a textbook to study, or researchers getting papers from sci-hub. Probably because it benefits corporations and they say so?

[–] Marcbmann@lemmy.world 7 points 2 years ago (3 children)

While I would like to be in a world where knowledge is free, this is apples and oranges.

OpenAI can purchase a textbook and read it. If their AI uses the knowledge gained to explain maths to an individual, without reproducing the original material, then there's no issue.

The difference is the student in your example didn't buy their textbook. Someone else bought it and reproduced the original for others to study from.

If OpenAI was pirating textbooks, that would be a wholly separate issue.

[–] 1Fuji2Taka3Nasubi@lemmy.zip 2 points 2 years ago

I agree that the issues of (1) whether AI outputs are derivative works of their inputs, and (2) whether input to AI is fair use and requires no compensation, are separate. But I think they are related, in that AI companies are trying to impose whatever interpretation of copyright is convenient to them on the rest of society.

And indeed Meta pirated books to feed its AI.

https://www.techspot.com/news/101507-meta-admits-using-pirated-books-train-ai-but.html

[–] Blackmist@feddit.uk 2 points 2 years ago (3 children)

The fact that the "AI" can spit out whole passages verbatim when given the right prompts suggests that there is a big problem here and they haven't a clue how to fix it.

It's not "learning" anything other than the probable order of words.

[–] FatCrab@lemmy.one 4 points 2 years ago

I really hate this reduction of gpt models. Is the model probabilistic? Absolutely. But it isn't simply learning a comprehensible probability of words--it is generating a massively complex conditional probability sequence for words. Largely, humans might be said to do the same thing. We make a best guess at the sequence of words we decide to use based on conditional probabilities along a myriad number of conditions (including semantics of the thing we want to say).
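
A toy sketch of what "conditional" means here, assuming a tiny made-up corpus and conditioning only on the single previous word. Real GPT models condition on the whole preceding context using a neural network, so this illustrates the idea only, not their implementation.

```python
from collections import Counter, defaultdict

# Toy corpus, made up for illustration.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Unconditional frequencies: just "which words are common".
unigram = Counter(corpus)

# Conditional counts: for each previous word, which words followed it.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1


def next_word_probs(prev: str) -> dict:
    """Estimate P(next word | previous word) from the toy corpus."""
    counts = following[prev]
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# The distribution changes depending on what came before.
print(next_word_probs("sat"))
print(next_word_probs("the"))
```

Even in this crude version, the prediction after "sat" is very different from the prediction after "the". Scale the conditioning up from one word to thousands of tokens plus learned representations and you get something far richer than a table of word frequencies.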

[–] Even_Adder@lemmy.dbzer0.com 0 points 2 years ago

What about these:

https://arxiv.org/abs/2310.02207

https://notes.aimodels.fyi/researchers-discover-emergent-linear-strucutres-llm-truth/

https://notes.aimodels.fyi/self-rag-improving-the-factual-accuracy-of-large-language-models-through-self-reflection/

[–] Marcbmann@lemmy.world 0 points 2 years ago

Completely agree. And that should be the focal point of the issue.

Sam Altman is correctly stating that AI is not possible without using copyrighted materials. And I don't think there's anything wrong with that.

His mistake is not redirecting the conversation. He should be talking about the efforts they're making to stop their machine from reproducing copyrighted works. Not whether or not they should be allowed to use it in the first place.

[–] sixCats@lemmy.dbzer0.com 2 points 2 years ago (1 children)

I was under the impression they mentioned torrenting things at some point.

[–] 1Fuji2Taka3Nasubi@lemmy.zip 1 points 2 years ago

Don't know about OpenAI, but Meta used pirated books to train its AI.

https://www.techspot.com/news/101507-meta-admits-using-pirated-books-train-ai-but.html

[–] homesweethomeMrL@lemmy.world 6 points 2 years ago (4 children)

Humans studying it is fair use.

[–] hglman@lemmy.ml 11 points 2 years ago (1 children)

So if a tool is involved, it's no longer ok? So, people with glasses cannot consume copyrighted material?

[–] Harbinger01173430@lemmy.world 6 points 2 years ago

No. A tool already makes it unnatural. /S

[–] hedgehog@ttrpg.network 7 points 2 years ago (1 children)

Copyright can only be granted to works created by a human, but I don’t know of any such restriction for fair use. Care to share a source explaining why you think only humans are able to use fair use as a defense for copyright infringement?

[+] phdepressed@sh.itjust.works -6 points 2 years ago (2 children)

Because a human has to use talent+effort to make something that's fair use. They adapt a product into something that, while similar, is noticeably different.

AI will make things that are not just similar but not noticeably different.
There's no effort in the creation. There's human thought behind a prompt, but not in the AI following it.
If allowed to, AI companies will basically copyright everything...

[–] hedgehog@ttrpg.network 3 points 2 years ago

Your reply has nothing to do with fair use doctrine.

[–] Harbinger01173430@lemmy.world 2 points 2 years ago (2 children)

You are aware of the insane amounts of research, human effort and the type of human talent that is required to make a simple piece of software, let alone a complex artificial neural network model whose function is to try and solve whatever stuff...right?

[–] Goldmage263@sh.itjust.works 3 points 2 years ago

Good point. I say the software can be copyright protected, but not the content the program generates.

[–] phdepressed@sh.itjust.works 2 points 2 years ago

And that is human effort, not the AI's.

[–] LainTrain@lemmy.dbzer0.com 3 points 2 years ago

What's the difference? Humans are just the intent suppliers, the rest of the art is mostly made possible by software, whether photoshop or stable diffusion.

[–] Marcbmann@lemmy.world 1 points 2 years ago

I don't agree. The publisher of the material does not get to dictate what it is used for. What are we protecting at the end of the day and why?

In the case of a textbook, someone worked hard to explain certain materials in a certain way to make the material easily digestible. They produced examples to explain concepts. Reproducing and disseminating that material would be unfair to the author who worked hard to produce it.

But the author does not have jurisdiction over the knowledge gained. They cannot tell the reader that they are forbidden from using the knowledge gained to tutor another person in calculus. That would be absurd.

IP law protects the works of the creator. The author of a calculus textbook did not invent calculus. As such, copyright law does not apply.

[–] wewbull@feddit.uk 1 points 2 years ago (1 children)

The model itself is a derivative work. Its existence is what is under dispute. It's not about using the model to produce further works.

[–] Marcbmann@lemmy.world 4 points 2 years ago (1 children)

Then every single student graduating college produces derivative work.

Everything that required the underlying knowledge gained from the textbooks studied, or research papers read, is derivative work.

At the core of this, what are we saying? Your machine could only explain calculus because it was provided information from multiple calculus textbooks? Well, that applies to literally everyone.

[–] assassin_aragorn@lemmy.world 0 points 2 years ago

Charge AI recurring tuition then.

[–] kromem@lemmy.world 19 points 2 years ago

You might want to read this post from one of the EFF's senior lawyers on the topic who has previously litigated IP cases:

[–] QuadratureSurfer@lemmy.world 18 points 2 years ago (1 children)

It doesn't break the law at all. The courts have already ruled that copyrighted material can be fed into AI/ML models for training:

https://towardsdatascience.com/the-most-important-supreme-court-decision-for-data-science-and-machine-learning-44cfc1c1bcaf

[–] Telodzrum@lemmy.world 15 points 2 years ago (2 children)

This ruling only applies to the 2nd Circuit, and SCOTUS has yet to take up a case. As soon as there's a good fact pattern for the Supreme Court or a circuit split, you'll get a nationwide answer. You'll also note that the decision is deliberately written to provide an extremely narrow precedent and is likely restricted to Google Books and near-identical sources of information.

[–] hedgehog@ttrpg.network 5 points 2 years ago

Has there been any US ruling stating something along the lines of “The training of general purpose LLMs and/or image generation AIs does not qualify as fair use,” even in a lower court?

[–] Eccitaze@yiffit.net 1 points 2 years ago

Hell, that article is also all about Google Books, which is an entirely different beast from generative AI. One of the key points from the circuit judge was that Google Books' use of copyrighted material "...[maintains] respectful consideration for the rights of authors and other creative individuals, and without adversely impacting the rights of copyright holders." The appeals court, in upholding the ruling that Google Books' use of copyrighted content is fair use, ruled "the revelations do not provide a significant market substitute for the protected aspects of the originals."

If you think that gen AI doesn't provide a significant market substitute for the artwork created by the artists and authors used to train these models, or that it doesn't adversely impact their rights, then you're utterly delusional.

[–] tsonfeir@lemm.ee 6 points 2 years ago

I guess I can’t read anything and learn from it.

[–] iquanyin@lemmy.world 1 points 2 years ago

i don’t think it’s need rules against the law…