Those claiming AI training on copyrighted works is "theft" misunderstand key aspects of copyright law and AI technology. Copyright protects specific expressions of ideas, not the ideas themselves. When AI systems ingest copyrighted works, they're extracting general patterns and concepts - the "Bob Dylan-ness" or "Hemingway-ness" - not copying specific text or images.
This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages. The AI discards the original text, keeping only abstract representations in "vector space". When generating new content, the AI isn't recreating copyrighted works, but producing new expressions inspired by the concepts it's learned.
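To make the "vector space" point concrete, here's a toy sketch in Python (a purely hypothetical numpy example, nothing like a production training pipeline): the text is read once, its co-occurrence statistics are compressed into a small matrix of floats, and the text itself is then discarded.

```python
# Toy illustration of "keeping only abstract representations":
# build word vectors from co-occurrence counts, then throw the corpus away.
# Hypothetical example for discussion, not any real model's training code.
import numpy as np

corpus = ["the cat sat on the mat", "the dog sat on the rug"]

# Count which words appear next to which (window of 1).
vocab = sorted({w for line in corpus for w in line.split()})
idx = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))
for line in corpus:
    words = line.split()
    for a, b in zip(words, words[1:]):
        counts[idx[a], idx[b]] += 1
        counts[idx[b], idx[a]] += 1

# Compress the counts into dense 2-D vectors via truncated SVD:
# this lossy matrix of floats is the "learned" representation.
u, s, _ = np.linalg.svd(counts)
embeddings = u[:, :2] * s[:2]

del corpus  # the original sentences are discarded; only vectors remain
print({w: embeddings[idx[w]].round(2) for w in vocab})
```

Real systems are enormously more complicated, but the shape is the same: what survives training is a pile of numbers describing relationships, not the text.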
This is fundamentally different from copying a book or song. It's more like the long-standing artistic tradition of being influenced by others' work. The law has always recognized that ideas themselves can't be owned - only particular expressions of them.
Moreover, there's precedent for this kind of use being considered "transformative" and thus fair use. The Google Books project, which scanned millions of books to create a searchable index, was ruled legal despite protests from authors and publishers. AI training is arguably even more transformative.
While it's understandable that creators feel uneasy about this new technology, labeling it "theft" is both legally and technically inaccurate. We may need new ways to support and compensate creators in the AI age, but that doesn't make the current use of copyrighted works for AI training illegal or unethical.
For those interested, this argument is nicely laid out by Damien Riehl in FLOSS Weekly episode 744. https://twit.tv/shows/floss-weekly/episodes/744
You know, those obsessed with pushing AI would do a lot better if they dropped the patronizing tone from every single one of their comments defending it.
It's always fun reading "but you just don't understand".
On the other hand, it's hard to have a serious discussion with people who insist that building an LLM or diffusion model amounts to copying pieces of material into an obfuscated database. And when an explanation is attempted, getting the typical reply of "that isn't the point!" without any elaboration strongly implies to me that some people just want to be pissy and don't want to hear how they may have been manipulated into taking a pro-corporate, hyper-capitalist position on something.
I love that the collectivist ideal of sharing all that we've created for the betterment of humanity is being twisted into this disgusting display of corporate greed and overreach. OpenAI doesn't need shit. They don't have an inherent right to exist but must constantly make the case for their existence.
The bottom line is that if corporations need data that they themselves cannot create in order to build and sell a service then they must pay for it. One way or another.
I see parallels in all this with how aquifers and water rights have been handled, and I'd argue we've fucked that up as well.
Training data IS a massive industry already. You don't see it because you probably don't work in a field that deals with it directly. I work in medtech, and millions and millions of dollars are spent acquiring training data every year. Should some new, unique IP right be recognized over using otherwise legally rendered data to train AI, it is almost certainly going to be contracted away to hosting platforms via totally sound ToS and then further monetized, such that only large and well-funded corporate entities can utilize it.
"unique new IP right?" Bruh you're talking about basic fucking intellectual property law. Just because someone posts something publicly on the internet doesn't mean that it can be used for whatever anybody likes. This is so well-established, that every major art gallery and social media website has a clause in their terms of service stating that you are granting them a license to redistribute that content. And most websites also explicitly state that when you upload your work to their site that you still retain your copyright of that work.
For example, the terms of service of FurAffinity, Inkbunny, DeviantArt, e621, Xitter, and Facebook all contain exactly this kind of language.
I could go on, but I think I've made my point very clear: Every social media website and art gallery is built on an assumption that the person uploading art A) retains the copyright over the items they upload, B) that other people and organizations have NO rights to copyrighted works unless explicitly stated otherwise, and C) that 3rd parties accessing this material do not have any rights to uploaded works, since they never negotiated a license to use these works.
You are misunderstanding what I'm getting at, and no, unfortunately, this isn't just straightforward copyright law at all. The training content does not need to be copied. It isn't saved in a database somewhere as part of the training (downloading pirated texts is a whole other issue, completely removed from the inherent processes of training a model); relationships are extracted from the material, however it is presented. So the copyright extends to the right to display the material in the first place. If your initial display of/access to the training content is non-infringing, the mere extraction of relationships between components is not itself making a copy, nor is it making a derivative work in any sense we have historically recognized. Effectively, it's the difference between looking at material and taking intensive notes on how its different parts relate to each other, versus looking at material and reproducing as much of it as possible for your own records.
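If it helps, here is what that "intensive notes" analogy looks like in code (a deliberately trivial, hypothetical Python sketch, not anyone's actual pipeline): a training example nudges the model's weights and is then thrown away, rather than being filed into some database.

```python
# Toy illustration: one gradient step "extracts a relationship" from an
# example by nudging the model's weights; the example itself is not kept.
# Hypothetical sketch for discussion, not a real training pipeline.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=3)            # the entire "model": three floats

example_x = np.array([1.0, 2.0, 3.0])   # one "work" seen during training
example_y = 4.0                         # its associated target

# One step of gradient descent on squared error.
prediction = weights @ example_x
gradient = 2 * (prediction - example_y) * example_x
weights -= 0.01 * gradient

del example_x, example_y  # the example is gone; only nudged weights remain
print(weights)
```

Scale that up by billions of examples and parameters and you have training: the weights encode relationships across everything seen, not a filing cabinet of the works themselves.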
FFS, the issue is not that the AI model "copies" the copyrighted works when it trains on them--I agree that after an AI model is trained, it does not meaningfully retain the copyrighted work. The problem is that the reproduction of the copyrighted work--i.e. downloading the work to the computer, and then using that reproduction as part of AI model training--is being done for a commercial purpose that infringes copyright.
If I went to DeviantArt and downloaded a random piece of art to my hard drive for my own personal enjoyment, that is a non-infringing reproduction. If I then took that same piece of art, and uploaded it to a service that prints it on a T-shirt, the act of uploading it to the T-shirt printing service's server would be infringing, since it is no longer being reproduced for personal enjoyment, but the unlawful reproduction of copyrighted material for commercial purpose. Similarly, if I downloaded a piece of art and used it to print my own T-shirts for sale, using all my own computers and equipment, that would also be infringing. This is straightforward, non-controversial copyright law.
The exact same logic applies to AI training. You can try to camouflage the infringement with flowery language like "mere extraction of relationships between components," but the purpose and intent behind AI companies reproducing copyrighted works via web scraping and downloading copyrighted data to their servers is to build and provide a commercial, for-profit service that is designed to replace the people whose work is being infringed. Full stop.
No, this is mostly incorrect, sorry. The commercial aspect of the reproduction is not relevant to whether it is an infringement--it is simply a factor in damages and Fair Use defense (an affirmative defense that presupposes infringement).
What you are getting at when it applies to this particular type of AI is effectively whether it would be a fair use, presupposing there is copying amounting to copyright infringement. And what I am saying is that, ignoring certain stupid behavior like torrenting a shit ton of text to keep a local store of training data, there is no copying happening as a matter of necessity. There may be copying as a matter of stupidity, but it isn't necessary to the way the technology works.
Now, I know, you're raging and swearing right now because you think that downloading the data into cache constitutes an unlawful copying--but it presumably does not if it is accessed like any other content on the internet, because intent is not part of what makes that copying lawful or unlawful. And once a lawful distribution is made, principles of exhaustion begin to kick in, and we start getting into really nuanced areas of IP law that I don't feel like delving into with my thumbs--but ultimately the point is that it isn't "basic copyright law." But if intent is determinative of whether there is copying in the first place, how does that jibe with an actor not making copies for themselves, but rather accessing data retained in a third party's cache after that party grabbed the data for noncommercial purposes? Also, how does that make sense if the model is being trained for purely research purposes? And then perhaps that model is leveraged commercially after development? Your analysis, assuming it's correct arguendo, leaves far too many outstanding substantive issues to be the ruling approach.
EDIT: also, if you download images from DeviantArt with the purpose of using them to make shirts or for some other commercial endeavor, that has no bearing on whether the download was infringing. Presumably, you downloaded via the tools provided by DA. The infringement happens when you reproduce the images for the commercial purpose (though any redistribution is actually infringing).
You're conflating whether something is infringement with defenses against infringement. Believe it or not, basically all data transfer and display of copyrighted material on the Internet is technically infringing. That includes the download of a picture to your computer's memory for the sole purpose of displaying it on your monitor. In practice, nobody ever bothers suing art galleries, social media websites, or web browsers, because they all have ironclad defenses against infringement claims: art galleries & social media include a clause in their TOS that grants them a license to redistribute your work for the purpose of displaying it on their website, and web browsers have a basically bulletproof fair use claim. There are other non-infringing uses such as those which qualify for a compulsory license (e.g. live music productions, usually involving royalties), but they're largely not very relevant here. In any case, the fundamental point is that any reproduction of a copyrighted work is infringement, but there are varied defenses against infringement claims that mean most infringing activities never see a courtroom in practice.
All this gets back to the original point I made: Creators retain their copyright even when uploading data for public use, and that copyright comes with heavy restrictions on how third parties may use it. When an individual uploads something to an art website, the website is free and clear of any claims for copyright infringement by virtue of the license granted to it by the website's TOS. In contrast, an uninvolved third party--e.g. a non-registered user or an organization that has not entered into a licensing agreement with the creator or the website (*cough* OpenAI)--has no special defense against copyright infringement claims beyond the baseline questions: was the infringement for personal, noncommercial use? And does the infringement qualify as fair use? Individual users downloading an image for their private collection are mostly A-OK, because the infringement is done for personal & noncommercial use--theoretically someone could sue over it, but there would have to be a lot of aggravating factors for it to get beyond summary judgment. AI companies using web scrapers to download creators' works do not qualify as personal/noncommercial use, for what I hope are bloody obvious reasons.
As for a model trained purely for research or educational purposes, I believe that it would have a very strong claim to fair use as long as the model is not widely available for public use. Once that model becomes publicly available, and/or is leveraged commercially, the analysis changes, because the model is no longer being used for research but for commercial profit. To apply it to the real world: when OpenAI originally trained ChatGPT for research, it was on strong legal ground, but when it decided to start making ChatGPT publicly available, it should have thrown out its training dataset, built up a new one from public-domain data and data it had negotiated licenses for, trained ChatGPT on the new dataset, and then released it commercially. If OpenAI had done that, and if individuals had been given the option to opt their creative works out of the dataset, I highly doubt that most people would have any objection to LLMs from a legal standpoint. Hell, they probably could have gotten licenses to use most websites' data to train ChatGPT for a song. Instead, they jumped the gun and tipped their hand before they had all their ducks in a row, and now everybody sees just how valuable their data is to OpenAI and is pricing it accordingly.
Oh, and as for your edit, you contradicted yourself: in your first line, you said "The commercial aspect of the reproduction is not relevant to whether it is an infringement." In your edit, you said "the infringement happens when you reproduce the images for a commercial purpose." So which is it? (To be clear, the initial download is infringing copyright both when I download the image for personal/noncommercial use, and also when I download it to make T-shirts with. The difference is that the first case has a strong defense against an infringement claim that would likely get it dismissed in summary, while the cases of making T-shirts would be straightforward claims of infringement.)
Like I've said, you are arguing this into nuanced aspects of copyright law that are absolutely not basic, but I do not agree at all with your assessment of the initial reproduction of the image in a computer's memory. First, to be clear, what you are arguing is that images on a website are licensed to the host to be reproduced for non-commercial purposes only, and that downstream access may likewise only be non-commercial (defined very broadly--there is absolutely a strong argument here that commercial activity in this situation means direct commercial use of the reproduction; for example, you wouldn't say that a user who gets paid to look at images is commercially using the accessed images), or else it violates the license.

Now, even ignoring my parenthetical, there are contract law and copyright law issues with this. Again, I'm using my thumbs and, honestly, I'm not trying to write a legal brief as a result of a random reply on lemmy, but the crux is that it is questionable whether you can enforce licensing terms that are presented to a licensee AFTER you enable, if not force, them to perform the act of copying your work. Effectively, you allowed them to make a copy of the work, and then you are trying to say "actually, you can only do x, y, and z with that particular copy." This is also where exhaustion rears its head, once you add on your position that when a trained model switches from non-commercial to commercial deployment, the initial use can suddenly be retroactively recharacterized as unlicensed infringement. Logistically, it just doesn't make sense either (for example, what happens when a further downstream user commercializes the model? Does that percolate back to recharacterize the original use? What about downstream from that? How deep into a toolchain's history do you need to go before this time-traveling recharacterization becomes an egregious breach of exhaustion?), so I have a hard time accepting it.
Now, in response to your query wrt my edit: my point was that the infringement happens when you do the further downstream reproduction of the image. When you print a unicorn on a t-shirt, it's that printing that is the infringement. The commercial aspect has absolutely no bearing on whether an infringement occurs; it is relevant to damages and the fair use affirmative defense. The sole question in determining whether infringement has occurred is whether a copy has been made in violation of the copyright.
And all this is just about whether there is even a copying at the model-training stage. This doesn't get into the fairly challenging fair use analysis (going by SCotUS' reasoning on the copyrightability of APIs in Oracle v. Google, I actually think the fair use defense is very strong, but I also don't think there is an infringement happening to even necessitate such an analysis, so ymmv--also, that decision was terrible, and literally every time the SCotUS has touched IP issues, it has made the law wildly worse and more expensive and time-consuming to deal with). It also doesn't get into whether outputs that are very similar to existing works infringe the way they do in music (even though there is no actual copying--I think it highly likely that is an infringement). It also also doesn't get into how outputs might infringe even though there are no IP rights in the outputs of a generative architecture (this is probably more a weird academic issue, but I like it nonetheless). Oh, and likeness rights haven't made their way into the discussion (nor the incredible weirdness of a class action that includes right of publicity among its claims).
We can, and probably will, disagree on how IP law works here. That's cool. I'm not trying to litigate it on lemmy. My point in my replies at this point is just to show that it is not "basic copyright law, bruh". The copyright law, and all the IP law really, around generative AI techniques is fairly complicated and nuanced. It's totally reasonable to hold the position that our current IP laws do not really address this the way most people seem to want them to. In fact, most other IP attorneys I've talked to who understand the technical processes at hand seem to agree. And, again, I don't think that further assetizing intangibles into a "right to extract machine learning from" is a viable path forward in the medium and long run, nor one that benefits anyone but highly monied corporate actors.