A lot, but less than you’d think! Basically an RTX 3090/Threadripper system with a lot of RAM (192GB?).
With this framework, specifically: https://github.com/ikawrakow/ik_llama.cpp?tab=readme-ov-file
The “dense” part of the model can stay on the GPU while the experts are offloaded to the CPU, and the whole thing can be quantized to ~3 bits per weight on average, instead of the 8 bits of the full model.
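Rough back-of-envelope math for why that works (the parameter counts are my assumption, roughly DeepSeek-V3/R1-class sizes: ~671B total, ~37B active per token; real quants mix bit-widths per tensor, so treat this as a sketch):

```python
# Rough weight-storage math for the GPU + CPU split described above.
# Assumed sizes (DeepSeek-V3/R1 class, not stated above): ~671B total params,
# ~37B "dense"/always-active params kept on the GPU; the rest are experts.
GIB = 2**30

def weight_gib(n_params, bits_per_weight):
    """Storage for n_params weights at an average of bits_per_weight bits each."""
    return n_params * bits_per_weight / 8 / GIB

total, dense = 671e9, 37e9
experts = total - dense

print(f"whole model @ 8-bit  : {weight_gib(total, 8):6.0f} GiB")                 # ~625 GiB
print(f"whole model @ ~3-bit : {weight_gib(total, 3):6.0f} GiB")                 # ~234 GiB
print(f"dense part  @ ~4-bit : {weight_gib(dense, 4):6.0f} GiB (on the GPU)")    # ~17 GiB
print(f"experts     @ ~3-bit : {weight_gib(experts, 3):6.0f} GiB (in CPU RAM)")  # ~221 GiB
```

So it’s the quantization plus the dense/expert split that makes a single-GPU, big-RAM box plausible at all; at 8 bits you’d be nowhere close.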
That’s just a hack for personal use, though. The intended way to run it is on a couple of H100 boxes, serving many, many users at once. LLMs run more efficiently when they serve requests in parallel: e.g. generating tokens for 4 users isn’t much slower than generating them for 2, and DeepSeek explicitly architected it to be really fast at scale. It is “lightweight” in that sense.
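A toy illustration of why batching is nearly free at decode time (the bandwidth and active-parameter numbers are assumptions, just to show the shape of it):

```python
# Decode is mostly memory-bandwidth-bound: each step streams the active weights
# through the GPU once, and that cost is shared by every sequence in the batch
# (until compute or KV-cache traffic takes over). Numbers are illustrative only.
active_bytes = 37e9          # assumed ~37B active params at ~1 byte per weight
hbm_bandwidth = 3.35e12      # roughly H100 SXM HBM bandwidth, bytes/s

step_time = active_bytes / hbm_bandwidth   # seconds per decode step (weights only)
for batch in (1, 2, 4, 8):
    print(f"batch {batch}: ~{batch / step_time:5.0f} tok/s aggregate, "
          f"~{1 / step_time:3.0f} tok/s per user")
```

Per-user speed stays roughly flat while aggregate throughput grows with the batch, which is exactly what you want when serving a crowd.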
…But if you have a “sane” system, it’s indeed a bit large. The best I can run on my 24GB VRAM system is 32B–49B dense models (like Qwen 3 or Nemotron), or a 70B mixture of experts (like the new Hunyuan 70B).
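For reference, the raw weight sizes behind that 24GB ceiling (a quick sketch; it ignores KV cache, activations, and context length, which eat several more GB):

```python
# What fits in 24 GB of VRAM at common quantization widths (weights only).
def weight_gib(params_billions, bits):
    return params_billions * 1e9 * bits / 8 / 2**30

for params_b, bits in [(32, 4), (49, 4), (49, 3), (70, 4)]:
    size = weight_gib(params_b, bits)
    verdict = "fits" if size <= 24 else "needs lower bits or CPU offload"
    print(f"{params_b}B @ {bits}-bit ≈ {size:4.1f} GiB -> {verdict}")
```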