this post was submitted on 08 Jan 2024
Technology
This is again a similar philosophical tangent that's not germane to the issue at hand (albeit an interesting one).
This is not a feasible proposition in any practical sense. LLMs are necessarily trained on VAST datasets that comprise all kinds of text. The only type of network that could be trained on just one artist's corpus is a tiny pedagogical tool like Karpathy's minGPT (https://github.com/karpathy/minGPT), trained solely on the works of Shakespeare. But that is not a "Large" language model, it's a teaching exercise for ML students. One artist's work could never practically train a network that could be considered "Large" in the sense of LLMs. So it's pointless to dwell on a contrived scenario like that.
In more practical terms, it's not controversial to state that deep networks with lots of degrees of freedom are capable of overfitting and memorizing training data. However, if they have other capabilities besides memorization, then this may be considered an acceptable price to pay for those capabilities. It's trivial to demonstrate that chatbots can perform novel tasks, like writing a rap song about SpongeBob going to the moon on a rocket powered by ice cream, which surely doesn't exist in any training data, yet which any contemporary chatbot is able to produce.
As an example, one open research question concerns the scaling relationships of network performance as dataset size increases. In this sense, any attempt to restrict the pool of available training data hampers our ability to probe this question. You may decide that this is worth it to prioritize the sanctity of copyright law, but you can't pretend that it's not impeding that particular research question.
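To make the scaling question a bit more concrete, here's a rough sketch of the kind of analysis involved: measure the loss at several dataset sizes and fit a power law to the results. Everything in it is an assumption for illustration. The function name `fit_power_law` is made up, the power-law form is just one commonly assumed shape, and the numbers are synthetic, not real measurements from any model.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of loss ~ a * size**(-b), done as a straight line in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The slope of the log-log line is -b, and the intercept is log(a).
    return math.exp(intercept), -slope

# SYNTHETIC data, generated from a known power law purely for illustration.
TRUE_A, TRUE_B = 5.0, 0.3
sizes = [10 ** k for k in range(3, 9)]               # hypothetical dataset sizes
losses = [TRUE_A * n ** (-TRUE_B) for n in sizes]    # hypothetical losses

a, b = fit_power_law(sizes, losses)
print(f"fitted a = {a:.3f}, fitted exponent b = {b:.3f}")
```

The point is that the fit is only as trustworthy as the range of dataset sizes you can actually measure. Cap the largest available dataset and you lose the far end of the curve, which is exactly the region the open question is about.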
I wasn't making a claim about law, but about ethics. I believe it should be fair game, perhaps not for private profiteering, but for research. Also this says nothing of adversary nations that don't respect our copyright principles, but that's a whole can of worms.
As already stated, that's where I was in agreement with you - it SHOULDN'T be given up to a handful of companies. Instead, it SHOULD be given up to public research institutes for the furtherance of science. And whatever you don't want included, you should refrain from posting. (Or perhaps, if this research were undertaken according to transparent FOSS principles, the curated datasets would be public and open, and you could submit the relevant GDPR requests to get your personal information expunged if you wanted.)
Your whole response is framed in terms of LLMs being purely a product for commercial entities, who shadily exaggerate the learning capabilities of their systems, and it couches the topic as a "people vs. corpos" battle. But web-scraped datasets (such as ImageNet) have been powering deep learning research for over a decade, long before AI captured the public imagination the way it has currently, and long before it became a big money spinner. This view neglects that language modelling, image recognition, speech transcription, etc. are also ongoing fields of academic research. Instead of vainly trying to cram the cat back into the bag and throttling research, we should be embracing the use of publicly available data, with legislation that ensures it's used for public benefit.