this post was submitted on 05 Nov 2024
74 points (97.4% liked)

Selfhosted


I'd like to self host a large language model, LLM.

I don't mind if I need a GPU and all that, at least it will be running on my own hardware, and probably even cheaper than the $20 everyone is charging per month.

What LLMs are you self hosting? And what are you using to do it?

all 22 comments
[–] TheHobbyist@lemmy.zip 23 points 2 weeks ago* (last edited 2 weeks ago) (3 children)

I run Mistral-Nemo (12B) and Mistral-Small (22B) on my GPU and they are pretty good. As others have said, GPU memory is one of the most limiting factors. 8B models are decent, 15-25B models are good and 70B+ models are excellent (solely based on my own experience). Go for q4_K models, as they will run many times faster than higher-precision quantizations with little quality degradation. They typically come in S (Small), M (Medium) and L (Large) variants; take the largest that fits in your GPU memory. If you go below q4, you may see more severe and noticeable quality degradation.
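To make the "largest that fits" rule concrete, here is a rough back-of-the-envelope sketch. It assumes a q4_K file averages roughly 4.5 bits per weight and that about 1.5 GB is needed on top for context and buffers; both figures are guesses for illustration, not measurements, and the function names are made up.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 1.5) -> bool:
    """True if the weights plus an assumed overhead fit in the given VRAM.

    overhead_gb is a guess covering context cache and runtime buffers.
    """
    needed = estimate_weights_gb(params_billion, bits_per_weight) + overhead_gb
    return needed <= vram_gb


# Compare a ~4.5-bit quant (assumed q4_K average), 8-bit and 16-bit weights
# on a 12 GB card.
for name, size_b in [("Mistral-Nemo", 12), ("Mistral-Small", 22)]:
    for bits in (4.5, 8.0, 16.0):
        gb = estimate_weights_gb(size_b, bits)
        ok = fits_in_vram(size_b, bits, vram_gb=12)
        print(f"{name} ({size_b}B) at ~{bits} bits/weight: ~{gb:.1f} GB, fits in 12 GB: {ok}")
```

Under these assumptions a 12B model at q4 sits comfortably in 12 GB, while a 22B model does not fit entirely and would need partial offload to system RAM.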

If you need to serve only one user at a time, Ollama + Open WebUI works great. If you need multiple users at the same time, check out vLLM.

Edit: I'm simplifying it very much, but hopefully it is simple and actionable as a starting point. I've also seen great stuff from Gemma2-27B.

Edit2: added links

Edit3: a decent GPU regarding bang for buck IMO is the RTX 3060 with 12GB. It may be available on the used market for a decent price and offers a good amount of VRAM and GPU performance for the cost. I would like to recommend AMD GPUs, as they offer much more GPU memory for their price, but not all of them are equally well supported by ROCm and I'm not sure how compatible they are with these tools, so perhaps others can chime in.

Edit4: you can also use Open WebUI with VS Code through the continue.dev extension, so that you get a Copilot-style LLM in your editor.

[–] dukatos@lemm.ee 2 points 2 weeks ago

I run ollama:rocm and deepseek-coder model on Radeon 6700XT. I only had to set the GPU via environment variables because it is not officially supported by ROCm, but it works.
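For anyone wanting to try the same, a hedged sketch of what that setup might look like. The comment above does not say which variable was used; `HSA_OVERRIDE_GFX_VERSION=10.3.0` is an assumption based on what is commonly reported for RDNA2 cards like the 6700 XT, and the rest follows the usual `ollama/ollama:rocm` container invocation. The helper name is made up, and it only builds the command line without running anything.

```python
import shlex


def ollama_rocm_command(gfx_override: str = "10.3.0") -> list:
    """Build (but do not run) a docker command for the ROCm build of Ollama.

    gfx_override is an assumed value: it makes ROCm treat the card as a
    supported gfx1030 part. Check what your own GPU needs.
    """
    return [
        "docker", "run", "-d",
        "--device", "/dev/kfd",            # ROCm compute device
        "--device", "/dev/dri",            # GPU render nodes
        "-v", "ollama:/root/.ollama",      # keep downloaded models in a volume
        "-p", "11434:11434",               # Ollama's default API port
        "-e", f"HSA_OVERRIDE_GFX_VERSION={gfx_override}",
        "--name", "ollama",
        "ollama/ollama:rocm",
    ]


# Print the command so it can be reviewed before pasting into a shell.
print(shlex.join(ollama_rocm_command()))
```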

[–] avidamoeba@lemmy.ca 1 points 2 weeks ago (1 children)

If you need to serve only one user at a time, Ollama + Open WebUI works great. If you need multiple users at the same time, check out vLLM.

Why can't it serve multiple users? Open Web UI seems to support multiple users.

[–] TheHobbyist@lemmy.zip 3 points 2 weeks ago (1 children)

I didn't say it can't. But I'm not sure how well it is optimized for it. From my initial testing, it queues queries and submits them one after another to the model. I have not seen it batch compute the queries, but maybe it's a setup thing on my side. vLLM, on the other hand, is designed specifically for the use case of multiple concurrent users and has multiple optimizations for it.
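A minimal sketch of what "multiple concurrent users" looks like from the client side, assuming a vLLM server is already running with its OpenAI-compatible API on the default port 8000. The URL, the model name and the helper names are assumptions for illustration; swap in whatever you actually serve.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8000/v1/completions"  # assumption: default vLLM port
MODEL = "mistralai/Mistral-Nemo-Instruct-2407"     # assumption: example model name


def build_payload(prompt: str, model: str = MODEL, max_tokens: int = 64) -> dict:
    """Request body for an OpenAI-style completions endpoint."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}


def post_completion(prompt: str, url: str = BASE_URL) -> str:
    """Send one prompt to the server and return the generated text."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        data = json.load(resp)
    return data["choices"][0]["text"]


def ask_concurrently(prompts, send=post_completion, workers: int = 4) -> list:
    """Fire several prompts at once; results come back in the original order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, prompts))
```

With the server up, `ask_concurrently(["Hello", "What is VRAM?"])` sends both requests in parallel, which is where server-side batching can help. Pointing the same client at a backend that queues requests should still work, just with answers arriving one after another.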

[–] avidamoeba@lemmy.ca 1 points 2 weeks ago

I see. Makes sense.

[–] scrubbles@poptalk.scrubbles.tech 10 points 2 weeks ago

LLMs use a ton of VRAM; the more you have, the better.

If you just need an API, then TabbyAPI is pretty great.

If you need a full UI, then Oobabooga's Text Generation WebUI is a good place to start.

[–] InverseParallax@lemmy.world 7 points 2 weeks ago* (last edited 2 weeks ago)

Ollama, llama3.2, deepcode and a bunch of others.

Using a GPU, but man, they're picky: they mostly want Nvidia GPUs.

Do NOT be afraid to run on the CPU. It's slow, but for 1 user it's actually mostly fine.

[–] Showroom7561@lemmy.ca 5 points 2 weeks ago* (last edited 2 weeks ago) (2 children)

You can run this right from Windows: https://jan.ai/

You'll need a lot of RAM, and processing is decently fast, even on a basic laptop.

edit: holy hell. Grammar.

[–] dangling_cat@lemmy.blahaj.zone 3 points 2 weeks ago

Tip: you can copy and paste the Hugging Face link directly into the search box, and it will download the model automatically! Also, it’s pretty smart. It will load into your VRAM first, then your RAM. If you can fit everything into VRAM, you get the fastest speed. But even if you are using RAM, it’s not terribly bad; it’s still faster than you can read.

[–] GreenSofaBed@lemmy.zip 1 points 2 weeks ago

This is pretty cool!

[–] Deckweiss@lemmy.world 5 points 2 weeks ago

GPT4All is a nice and easy start.

[–] chiisana@lemmy.chiisana.net 5 points 2 weeks ago (1 children)

Using Ollama to try a couple of models right now for an idea. I’ve tried to run Llama 3.2 and Qwen 2.5 3b, both of which fit in my 3050 6G’s VRAM. I’ve also tried for fun to use Qwen 2.5 32b, which fits in my RAM (I’ve got 128G), but it was only able to reply at a couple of tokens per second, making it very much a non-interactive experience. I will need to explore the response time piece a bit further to see if there are ways I can still lean on larger models with longer delays.
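If it helps with measuring that, Ollama's non-streaming `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (in nanoseconds), so tokens per second can be computed directly. A small sketch; the sample numbers below are made up, not from a real run, and the function name is hypothetical.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from the timing fields of an Ollama generate response."""
    eval_count = response["eval_count"]           # number of generated tokens
    eval_duration_ns = response["eval_duration"]  # generation time in nanoseconds
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


# Made-up example: 120 tokens generated over 48 seconds.
sample = {"eval_count": 120, "eval_duration": 48_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")
```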

[–] Smorty@lemmy.blahaj.zone 1 points 1 week ago

Please try the 4 bit quantisations of the models. They work a bunch faster while eating less RAM.

Generally you want to use 7B or 8B models on the CPU, since everything above that will be hellishly sluggish.

[–] astrsk@fedia.io 4 points 2 weeks ago (1 children)

If you don’t need to host but can run locally, GPT4All is nice: it has several models to download and plug and play, with different purposes and descriptions, and it doesn’t require a GPU.

[–] theshatterstone54@feddit.uk 2 points 2 weeks ago

I second that. Even my lower-midrange laptop from 3 years ago (8GB RAM, Integrated AMD GPU) can run a few of the smaller LLMs, and it's true that you don't even need a GPU as they can run in RAM. And depending on how much RAM you have and what GPU, you might find models performing better in RAM instead of on the GPU. Just keep in mind that when a model says, for example, 8GB Memory required, if you have 8GB RAM, you can't run it cuz you also have your operating system and other applications running. If you have 8GB video memory on your GPU though, you should be golden (I think).

[–] KarnaSubarna@lemmy.ml 4 points 2 weeks ago

My (docker based) configuration:

Linux > Docker Container > Nvidia Runtime > Open WebUI > Ollama > Llama 3.1

Docker: https://docs.docker.com/engine/install/

Nvidia Runtime for docker: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

Open WebUI: https://docs.openwebui.com/

Ollama: https://hub.docker.com/r/ollama/ollama
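Once a stack like this is up, one quick sanity check is to ask Ollama which models it has pulled via its `/api/tags` endpoint. A small sketch, assuming Ollama is published on its default port 11434 on localhost; the URL and the helper names are assumptions, not part of the setup described above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default Ollama port on this host


def model_names(tags_payload: dict) -> list:
    """Extract model names from the JSON returned by GET /api/tags."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL) -> list:
    """Query a running Ollama instance for the models it has available."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return model_names(json.load(resp))
```

Calling `list_local_models()` against the running container should return a list that includes the Llama 3.1 tag you pulled; an empty list means the server is reachable but no model has been downloaded yet.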

[–] TomAwezome@lemmy.world 4 points 2 weeks ago

TinyLLM on a separate computer with 64GB RAM and a 12-thread AMD Ryzen 5 5500GT, using the rocket-3b.Q5_K_M.gguf model, runs very quickly. Most of the RAM is used up by other programs I run on it; the LLM doesn't take the lion's share. I used to self host on just my laptop (a 5+ year old Thinkpad with upgraded RAM) and it ran OK with a few models, but after a few months I saved up to build a rig just for that kind of stuff to improve performance. It's all CPU, no GPU, even though a GPU would be faster, since I was curious whether CPU-only would be usable, which it is. I also use the Llama-2 7b model or the 13b version; the 7b model ran slow on my laptop but runs at a decent speed on the larger rig. The fewer billions of parameters, the goofier they get. Rocket-3b is great for quickly getting an idea of things, not great for copy-pasters. Llama 7b or 13b is a little better for handing you almost-exactly-correct answers for things. I think those models are meant for programming, but sometimes I ask them general life questions or vent to them and they receive it well and offer OK advice. I hope this info is helpful :)

[–] Nexy@lemmy.sdf.org 2 points 2 weeks ago

I run mistral-nemo locally on my 1070 Ti.

[–] cmgvd3lw@discuss.tchncs.de 2 points 2 weeks ago

I am not self hosting an LLM, but I run Google's Gemma 2B on my laptop with Alpaca. On my hardware it's pretty slow, but it kind of gets the work done. My hardware is getting old and I need to upgrade soon.

GPT4All and Jan.AI are good places to start.

[–] scottmeme@sh.itjust.works 1 points 2 weeks ago

I've got a home server with an Nvidia Tesla P4. It's not the most powerful card and doesn't have the most VRAM (8GB), but it can be had for ~$100 USD (it is a headless GPU, so no video outputs).

I'm using Ollama with dolphin-mistral and, recently, deepseek-coder.