this post was submitted on 26 Jan 2025
231 points (95.7% liked)
Memes
Rules:
- Be civil and nice.
- Try not to repost excessively. As a rule of thumb, wait at least 2 months before reposting if you have to.
founded 6 years ago
Uh yeah, that's because people publish data to huggingface. GitHub isn't made for huge data files in case you weren't aware. You can scroll down to datasets here https://huggingface.co/deepseek-ai
That's the "prover" dataset, i.e. the evaluation dataset mentioned in the articles I linked you to. It's for checking the output, it is not the training data.
It's also 20 MB, which is minuscule not just for a training dataset but even as what you seem to think is a "huge data file" in general.
You really need to stop digging and admit this is one more thing you have surface-level understanding of.
Do show me a published data set of the kind you're demanding.
Since you're definitely asking this in good faith and not just downvoting and making nonsense sealion requests in an attempt to make me shut up, sure! Here's three.
https://commoncrawl.org/
https://github.com/togethercomputer/RedPajama-Data
https://huggingface.co/datasets/legacy-datasets/wikipedia/tree/main/
Oh, and it's not me demanding. It's the OSI defining what an open source AI model is. I'm sure once you've asked all your questions you'll circle back around to whether you disagree with their definition or not.
Thank you for posting those links. While I'm not sure the person you replied to was asking in good faith, I myself wanted to see an example after reading the discussion.
Seems like even if it's not fully open source, it's a step in the right direction in a world where terms like "open" and "non-profit" have been co-opted by corporations and lost their original meaning.
It's certainly better than "Open"AI being completely closed and secretive with their models. But as people have discovered in the last 24 hours, DeepSeek is pretty strongly trained to be protective of the Chinese government policy on, uh, truth. If this was a truly Open Source model, someone could "fork" it and remake it without those limitations. That's the spirit of "Open Source" even if the actual term "source" is a bit misapplied here.
As it is, without the original training data, an attempt to remake the model would have the issues DeepSeek themselves had with their "zero" release where it would frequently respond in a gibberish mix of English, Mandarin and programming code. They had to supply specific data to make it not do this, which we don't have access to.
So you found a legacy data set that was released nearly a year ago as your best example. Thanks for proving my point. And since you obviously know what you're talking about, do explain to the class what stops people from using these data sets to train a DeepSeek model?
The most recent crawl is from December 15th
https://commoncrawl.org/blog/december-2024-crawl-archive-now-available
You don't know, and can't know, when DeepSeek's dataset is from. Thanks for proving my point.
What I do know is that you can take the DeepSeek model and train it on this open crawl to get a fully open model. I love how you ignored this part in your reply, being the clown that you are.
I ignored the bit you edited in after I replied? And you're complaining about ignoring questions in general? Do you disagree with the OSI definition Yogsy? You feel ready for that question yet?
What on earth do you even mean "take a model and train it on this open crawl to get a fully open model"? This sentence doesn't even make sense. Never mind that that's not how training a model works - let's pretend it is. You understand that adding open source data to closed source data wouldn't make the closed source data less closed source, right? Right?
Thank fuck you're not paid real money for this Yiggly because they'd be looking for their dollars back
Why would you lie about something with timestamps? I edited 18 min ago, and you replied 17 min ago. 🤡
I already answered this question earlier in the thread, but clearly your reading comprehension needs some work.
I'm talking about taking the code that DeepSeek released publicly, and training it on the open source data that's available. That's what model training is. The fact that this needs to be spelled out for you is amazing.
What closed source data are you talking about, nobody is suggesting this.
You sound upset there little buddy. I guess misspelling my handle was the peak insult you could muster. Really showing your intellectual prowess there champ.
I take more than a minute on my replies Autocorrect Disaster. You asked for information and I treat your request as genuine because it just leads to more hilarity like you describing a model as "code".
The only hilarity here is you exposing yourself as being utterly clueless on the subject you're attempting to debate. A model is a deep neural network that's generated by code through reinforcement learning on the data. Evidently you don't understand this, leading you to make absurd statements. I asked you for information because I knew you were a troll and now you've confirmed it.
I understand it completely insofar as it's nonsensically irrelevant - the model is what you're calling open source, and the model is not open source because the data set is not published or recreatable. They can open source any training code they want - I genuinely haven't even checked - but the model is not open source. Which is my point from about 20 comments ago. Unless you disagree with the OSI's definition, which is a valid and interesting opinion. If that's the case you could have just said so. OSI are just a bunch of dudes. They have plenty of critics in the Free/Open communities. Hey, they're probably American too if you want to throw in some downfall of The West classic hits!
If a troll is "not letting you pretend you have a clue what you're talking about because you managed to get ollama to run a model locally and think it's neat", cool. Owning that. You could also just try owning that you think it's neat. It is. It's not an open source model though. You can run Meta's model with the same level of privacy (offline) and with the same level of ability to adapt or recreate it (you can't, you don't have the full data set or steps to recreate it).
I never disagreed that you can run Meta's model with the same level of privacy, so I don't know why you keep bringing that up as some sort of gotcha. The point about DeepSeek is its efficiency. The OSI definition for open source is good, and it does look like you're right that the full data set is not available. However, the real question is why you'd be so hung up on that.
Given that the code for training a new model is released, and it can be applied to open data sets, that means it's perfectly possible to make a version that's trained on open data that would check off the final requirement you keep bringing up. Also, adapting it does not require having the original training set since it's done by tuning the weights in the network itself. Go read up on how LoRA works for example.
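For anyone following along, here is a rough sketch of the LoRA idea mentioned above: the original weight matrix stays frozen and only a small low-rank update is trained on top of it. Everything here (the matrix sizes, the `alpha` scale, the helper names) is made up for illustration and is not taken from DeepSeek's code or from any particular library.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A) without modifying the frozen W."""
    r = len(a)                 # rank of the low-rank update
    delta = matmul(b, a)       # (d_out x r) @ (r x d_in) -> d_out x d_in
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def trainable_params(d_out, d_in, r):
    """Number of values in A and B, the only parts LoRA trains."""
    return r * (d_in + d_out)

random.seed(0)
d_out, d_in, r, alpha = 6, 8, 2, 4.0   # hypothetical toy sizes

# Frozen base weights: these come from the released model and are never updated.
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Adapter matrices: A starts small and random, B starts at zero,
# so the adapted model initially behaves exactly like the base model.
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
b = [[0.0] * r for _ in range(d_out)]

w_eff_initial = lora_effective_weight(w, a, b, alpha)

# Pretend one training step changed a single entry of B.
b[0][0] = 1.0
w_eff_tuned = lora_effective_weight(w, a, b, alpha)

full_count = d_out * d_in
lora_count = trainable_params(d_out, d_in, r)
print(f"full fine-tune: {full_count} values, LoRA adapter: {lora_count} values")
```

The sketch shows both halves of the argument: the adapter needs no access to the original training data, but it is only meaningful when added to the original weights `w`.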
I know how LoRA works thanks. You still need the original model to use a LoRA. As mentioned, adding open stuff to closed stuff doesn't make it open - that's a principle applicable to pretty much anything software related.
You could use their training method on another dataset, but you'd be creating your own model at that point. You also wouldn't get the same results - you can read in their article that their "zero" version would have made this possible but they found that it would often produce a gibberish mix of English, Mandarin and code. For R1 they adapted their pure "we'll only give it feedback" efficiency training method to starting with a base dataset before feeding it more, a compromise to their plan but necessary and with the right dataset - great! It eliminated the gibberish.
Without that specific dataset - and this is what makes them a company not a research paper - you cannot recreate DeepSeek yourself (which would be open source) and you can't guarantee that you would get anything near the same results (in which case why even relate it to this model anymore). That's why those are both important to the OSI who define Open Source in all regards as the principle of having all the information you need to recreate the software or asset locally from scratch. If it were truly Open Source by the way, that wouldn't be the disaster you think it would be as then OpenAI could just literally use it themselves. Or not - that's the difference between Open and Free I alluded to. It's perfectly possible for something to be Open Source and require a license and a fee.
Anyway, it does sound like an exciting new model and I can't wait to make it write smut.
I didn't say that using LoRA makes it more open, I was pointing out that you don't need the original data to extend the model.
Basically what you're talking about is being able to replicate the original model from scratch given the code and the data. And since the data component is missing you can't replicate the original model. I personally don't find this to be that much of a problem because people could create a comparable model from scratch if they really wanted to using an open data set.
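A toy illustration of that replication point, using a one-weight model rather than anything resembling a real LLM: the same training code gives the same weights only when it is fed the same data. The datasets and the `train` function are invented for this example.

```python
def train(data, steps=200, lr=0.05):
    """Fit y = w * x by gradient descent on mean squared error; return w."""
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

# Stand-in for the unpublished training data (follows y = 2x).
original_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
# Stand-in for a different, openly available dataset (follows y = 3x).
substitute_data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

w_original = train(original_data)      # the "released" model
w_rerun = train(original_data)         # same code + same data: exact replica
w_substitute = train(substitute_data)  # same code + other data: a different model

print(w_original, w_rerun, w_substitute)
```

Whether the second model counts as "comparable" is exactly what the two sides here disagree about; the sketch only shows that it is not the same model.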
The actual innovation with DeepSeek lies in its use of a mixture-of-experts approach to get far better performance. While it has 671 billion parameters overall, it only uses 37 billion at a time, making it very efficient. For comparison, Meta’s Llama3.1 uses all 405 billion of its parameters at once. That's the really interesting part of the whole thing. That's the part where openness really matters.
And I fully expect that OpenAI will incorporate this idea into their models. The disaster for OpenAI is the fact that their whole business model around selling subscriptions is now dead in the water. When models were really expensive to run, only a handful of megacorps could do it. Now, it turns out that you can get the same results at a fraction of the cost.
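To make the "only uses part of the network at a time" idea concrete, here is a minimal sketch of generic top-k expert routing. This is not DeepSeek's actual router: the four toy experts, the hard-coded gate scores, and the function names are all assumptions for illustration. In a real model the gate scores come from a learned layer and each expert is a full feed-forward block.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts and renormalise their weights."""
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

def moe_forward(x, experts, gate_logits, k):
    """Run only the selected experts and mix their outputs by gate weight."""
    routing = top_k_route(gate_logits, k)
    output = sum(weight * experts[i](x) for i, weight in routing.items())
    return output, routing

# Four hypothetical experts; each just scales its input by a different factor.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
gate_logits = [0.1, 2.0, -1.0, 1.5]   # made-up router scores for one token

output, routing = moe_forward(10.0, experts, gate_logits, k=2)
print("experts used:", sorted(routing), "output:", round(output, 2))

# Share of parameters active per token, using the figures quoted above.
active_fraction = 37 / 671
print(f"active share: {active_fraction:.1%}")
```

Only two of the four experts are evaluated for this input, which is the source of the savings: compute per token scales with the active parameters, not the total.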