- Small 4B models like gemma3 will run on anything (I have it running on a 2020 laptop with integrated graphics). Don't expect superintelligence, but it works for basic classification tasks, writing/reviewing/fixing small scripts, basic chat and writing, etc.
- I use https://github.com/ggml-org/llama.cpp in server mode pointing to a directory of GGUF model files downloaded from Hugging Face. I access it from the built-in web interface or the API (I wrote a small assistant script; see the sketch after this list)
- To load larger models you need more RAM (preferably fast VRAM on a GPU, but DDR5 on the motherboard works too, just noticeably slower). My gaming rig with a 16GB AMD 9070 runs 20-30B models at decent speeds. You can grab quantized (lower precision, lower output quality) versions of those larger models if the full-size/unquantized ones don't fit. Check out https://whatmodelscanirun.com/ (there's also a rough memory estimate sketch below)
- For image generation I found https://github.com/vladmandic/sdnext, which works extremely well and fast with Z-Image Turbo, FLUX.1-schnell, Stable Diffusion XL and a few other models
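To give an idea of what the llama.cpp setup looks like: you start `llama-server` on a GGUF file, then talk to its OpenAI-compatible endpoint. A minimal sketch of the kind of assistant script I mean (model filename, port and prompt are just placeholders, not my actual config):

```python
# Start the server first, e.g.:
#   llama-server -m ~/models/gemma-3-4b-it-Q4_K_M.gguf --port 8080
# (use whatever GGUF file you actually downloaded)

import json
import urllib.request

def ask(prompt: str, base_url: str = "http://127.0.0.1:8080") -> str:
    """Send one chat message to llama.cpp's OpenAI-compatible endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask("Classify this ticket as bug/feature/question: 'App crashes on startup'"))
```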
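And for a rough sense of why quantization matters: weight memory is roughly parameter count × bits per weight / 8, plus overhead for the KV cache and runtime. The numbers below are back-of-envelope guesses (the 1.2 overhead factor is an assumption, not a measurement), but they show why a Q4 quant can fit where fp16 never will:

```python
# Rough check: will a quantized model fit in my (V)RAM?
# weights ≈ params × bits-per-weight / 8, times a fudge factor for KV cache/runtime.

def fits(params_b: float, bits_per_weight: float, vram_gb: float,
         overhead: float = 1.2) -> bool:
    weights_gb = params_b * bits_per_weight / 8  # e.g. 27B at Q4 ≈ 13.5 GB
    return weights_gb * overhead <= vram_gb

print(fits(27, 4.0, 16))   # False: tight on a 16 GB card, may need partial CPU offload
print(fits(13, 4.0, 16))   # True: a 13B model at Q4 fits comfortably
print(fits(27, 16.0, 16))  # False: unquantized fp16 is nowhere close
```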
As for the prices... well, the rig I bought for ~1500€ in September is now up to ~2200€ (once-in-a-decade investment). It's not a beast but it works; the primary use case was general computing and gaming, and I'm glad it handles local AI too, but costs for a dedicated, performant AI rig are ridiculously high right now. It's not economically competitive yet against commercial LLM services for complex tasks, but that's not the point. Check https://old.reddit.com/r/LocalLLaMA/ (yeah, Reddit, I know): think 10k€ of hardware to run ~200-300B models, not counting electricity bills.