this post was submitted on 26 Jan 2025
231 points (95.7% liked)

Memes

[–] TheOctonaut@mander.xyz 5 points 9 months ago (2 children)

I don't think you or that Medium writer understand what "open source" means. Being able to run a local stripped-down version for free puts it on par with Llama, a Meta product. Privacy-first indeed. Unless you can train your own from scratch, it's not open source.

Here's the OSI's helpful definition for your reference https://opensource.org/ai/open-source-ai-definition

[–] haerrii@feddit.org 4 points 9 months ago

Thanks for the clarification!

[–] yogthos@lemmy.ml 3 points 9 months ago (1 children)

You can run the full version if you have the hardware, the weights are published, and importantly the research behind it is published as well. Go troll somewhere else.

[–] TheOctonaut@mander.xyz -1 points 9 months ago (2 children)

All that is true of Meta's products too. It doesn't make them open source.

Do you disagree with the OSI?

[–] yogthos@lemmy.ml 6 points 9 months ago (1 children)

What part of OSI are you claiming DeepSeek doesn't satisfy specifically?

[–] TheOctonaut@mander.xyz 7 points 9 months ago* (last edited 9 months ago) (1 children)

The data part, i.e. the very first part of the OSI's definition.

It's not available from their articles https://arxiv.org/html/2501.12948v1 https://arxiv.org/html/2401.02954v1

Nor on their github https://github.com/deepseek-ai/DeepSeek-LLM

Note that the OSI only asks for transparency about what the dataset was - a name and the fee paid will do - not that full access to it be free and Free.

It's worth mentioning too that they've used the MIT license for the "code" included with the model (a few YAML files to feed it to software), but they have created their own unrecognised non-free license for the model itself. Why they have this misleading label on their GitHub page would only be speculation.

Without the dataset being available, nobody can accurately recreate, modify or learn from the model they've released. This is the only sane definition of open source available for an LLM, since a model is not in itself code with a "source".

[–] yogthos@lemmy.ml -2 points 9 months ago (1 children)

Uh yeah, that's because people publish data to Hugging Face. GitHub isn't made for huge data files, in case you weren't aware. You can scroll down to datasets here https://huggingface.co/deepseek-ai

[–] TheOctonaut@mander.xyz 7 points 9 months ago (1 children)

That's the "prover" dataset, i.e. the evaluation dataset mentioned in the articles I linked you to. It's for checking the output; it is not the training data.

It's also 20 MB, which is minuscule not just for a training dataset but even for what you seem to think is a "huge data file" in general.

You really need to stop digging and admit this is one more thing you have a surface-level understanding of.

[–] yogthos@lemmy.ml -3 points 9 months ago (1 children)

Do show me a published data set of the kind you're demanding.

[–] TheOctonaut@mander.xyz 11 points 9 months ago* (last edited 9 months ago) (2 children)

Since you're definitely asking this in good faith and not just downvoting and making nonsense sealion requests in an attempt to make me shut up, sure! Here's three.

https://commoncrawl.org/

https://github.com/togethercomputer/RedPajama-Data

https://huggingface.co/datasets/legacy-datasets/wikipedia/tree/main/

Oh, and it's not me demanding. It's the OSI defining what an open source AI model is. I'm sure once you've asked all your questions you'll circle back around to whether you disagree with their definition or not.

[–] HappyTimeHarry@lemm.ee 1 points 8 months ago (1 children)

Thank you for posting those links. While I'm not sure the person you replied to was asking in good faith, I myself wanted to see an example after reading the discussion.

Seems like even if it's not fully open source, it's a step in the right direction in a world where terms like "open" and "non-profit" have been co-opted by corporations and lost their original meaning.

[–] TheOctonaut@mander.xyz 1 points 8 months ago

It's certainly better than "Open"AI being completely closed and secretive with their models. But as people have discovered in the last 24 hours, DeepSeek is pretty strongly trained to be protective of the Chinese government's policy on, uh, truth. If this were a truly Open Source model, someone could "fork" it and remake it without those limitations. That's the spirit of "Open Source" even if the actual term "source" is a bit misapplied here.

As it is, without the original training data, an attempt to remake the model would run into the issues DeepSeek themselves had with their "zero" release, which would frequently respond in a gibberish mix of English, Mandarin and programming code. They had to supply specific data to make it not do this, and we don't have access to that data.

[–] Grapho@lemmy.ml 5 points 9 months ago (1 children)

What makes it open source is that the source code is open.

My grandma is as old as my great aunts, that doesn't transitively make her my great aunt.

[–] TheOctonaut@mander.xyz -2 points 9 months ago (1 children)

A model isn't an application. It doesn't have source code, any more than an image or a movie has source code to be "open". That's why OSI's definition of an "open source" model is controversial in itself.

[–] Grapho@lemmy.ml 2 points 9 months ago

It's clear you're being disingenuous. A model is its dataset and its weights too, but the weights are also open, and if the source code were as irrelevant as you say it is, DeepSeek wouldn't be this much more performant, and "Open" AI would have published it instead of closing the whole release.