brucethemoose

joined 8 months ago
[–] brucethemoose@lemmy.world 20 points 1 day ago* (last edited 1 day ago) (2 children)

I feel like they are dropping the ball in the GPU space though, both on desktop and in servers.

They're not really leveraging it. They killed the Steam Deck line of "small core count, GPU-heavy" APUs, which is why Valve hasn't updated it and competitors seem so power hungry. They all but killed server APUs, making them mega expensive and HPC-only. They're finally coming out with an M-Pro-like consumer APU, but it took until 2025, and pricing will probably be a joke, just like their Radeon Pro GPUs...

And I don't even wanna get into the AI space. They get like 99% there and then go "nah, we don't really care about this market, let Nvidia have their monopoly and screw everyone over." It makes me want to pull my hair out.

[–] brucethemoose@lemmy.world 4 points 3 days ago* (last edited 3 days ago)

I never liked Musk, even when he was "In." Even the Mars colonization meme rubbed me the wrong way, as the science does not line up with that.

It felt like a cult of personality to me. He was always a fickle jerk, a mixed bag.

You have a point though, people's opinions were largely political, I think. Or just based on pure hope/cultism

[–] brucethemoose@lemmy.world 26 points 3 days ago* (last edited 3 days ago)

+1

They should discourage institutions from using it (and use government Mastodon instances, of course). This is honestly long overdue.

[–] brucethemoose@lemmy.world 4 points 6 days ago (1 children)

Don't jinx it.

Especially not if they somehow coincidentally get some government funding.

[–] brucethemoose@lemmy.world 7 points 6 days ago (1 children)

They should keep Alex Jones and leave the broadcast completely unchanged.

Voilà. A parody site.

[–] brucethemoose@lemmy.world 31 points 6 days ago* (last edited 6 days ago) (3 children)

The article itself muddies the waters even more:

Today we celebrate a new addition to the Global Tetrahedron LLC family of brands. And let me say, I really do see it as a family. Much like family members, our brands are abstract nodes of wealth, interchangeable assets for their patriarch to absorb and discard according to the opaque whims of the market. And just like family members, our brands regard one another with mutual suspicion and malice.

All told, the decision to acquire InfoWars was an easy one for the Global Tetrahedron executive board.

Founded in 1999 on the heels of the Satanic “panic” and growing steadily ever since, InfoWars has distinguished itself as an invaluable tool for brainwashing and controlling the masses. With a shrewd mix of delusional paranoia and dubious anti-aging nutrition hacks, they strive to make life both scarier and longer for everyone, a commendable goal. They are a true unicorn, capable of simultaneously inspiring public support for billionaires and stoking outrage at an inept federal state that can assassinate JFK but can’t even put a man on the Moon...

It's a parody post, but also real, but also The Onion. It's a shining, thoroughbred Onion article, but also not.

We need !sortoftheonion

[–] brucethemoose@lemmy.world 2 points 1 week ago

I'd posit the algorithm has turned it into a monster.

Attention should be dictated more by chronological order and what others retweet, not what some black box thinks will keep you glued to the screen, and it felt like more of the former in the old days. This is a subtle, but also very significant change.

[–] brucethemoose@lemmy.world 11 points 1 week ago* (last edited 1 week ago) (2 children)

On the other hand, the track record of old social networks is not great.

And it's reasonable to posit Twitter is deep into the enshittification cycle.

[–] brucethemoose@lemmy.world 4 points 1 week ago (2 children)

Beat me to it.

[–] brucethemoose@lemmy.world 1 points 1 week ago

Still perfectly runnable in kobold.cpp. There was a whole community built up around it with Pygmalion.

It is as dumb as dirt though. IMO that is going back too far.

[–] brucethemoose@lemmy.world 2 points 1 week ago (2 children)

People still run, or even continue-pretrain, Llama 2 for that reason, as its training data is pre-slop.

[–] brucethemoose@lemmy.world 6 points 1 week ago* (last edited 1 week ago)

The Facebook/Mastodon format is much better for individuals, no? And Reddit/Lemmy for niches, as long as they're supplemented by a wiki or something.

And Tumblr. The way content gets spread organically, rather than with an algorithm, is actually super nice.

IMO Twitter's original premise, of letting novel, original, but very short thoughts fly into the ether, has been so thoroughly corrupted that it can't really come back. It's entertaining and engaging, but an awful format for actually exchanging important information, like Discord.

 

I see a lot of talk of Ollama here, which I personally don't like because:

  • The quantizations they use tend to be suboptimal

  • It abstracts away llama.cpp in a way that, frankly, leaves a lot of performance and quality on the table.

  • It abstracts away things that you should really know for hosting LLMs.

  • I don't like some things about the devs. I won't rant, but I especially don't like the hints that they're cooking up something commercial.

So, here's a quick guide to get away from Ollama.

  • First step is to pick your OS. Windows is fine, but if setting up something new, Linux is best. I favor CachyOS in particular, for its great Python performance. If you use Windows, be sure to enable hardware-accelerated scheduling and disable shared memory.

  • Ensure the latest version of CUDA (or ROCm, if using AMD) is installed. Linux is great for this, as many distros package them for you.

  • Install Python 3.11.x, 3.12.x, or at least whatever your distro supports, and git. If on Linux, also install your distro's "build tools" package.

Now for actually installing the runtime. There are a great number of inference engines supporting different quantizations; forgive the Reddit link, but see: https://old.reddit.com/r/LocalLLaMA/comments/1fg3jgr/a_large_table_of_inference_engines_and_supported/

As far as I am concerned, 3 matter to "home" hosters on consumer GPUs:

  • Exllama (and by extension TabbyAPI), as a very fast, very memory efficient "GPU only" runtime, supports AMD via ROCM and Nvidia via CUDA: https://github.com/theroyallab/tabbyAPI

  • Aphrodite Engine. While not strictly as VRAM efficient, it's much faster with parallel API calls, reasonably efficient at very short context, and supports just about every quantization under the sun and more exotic models than exllama. AMD/Nvidia only: https://github.com/PygmalionAI/Aphrodite-engine

  • This fork of kobold.cpp, which supports more fine-grained KV cache quantization (we will get to that). It supports CPU offloading and, I think, Apple Metal: https://github.com/Nexesenex/croco.cpp

Now, there are also reasons I don't like llama.cpp, but one of the big ones is that its model implementations sometimes have... quality-degrading issues, or odd bugs. Hence I would generally recommend TabbyAPI if you have enough VRAM to avoid offloading to CPU and can figure out how to set it up.

Setup can go wrong; if anyone gets stuck, I can help with that.

  • Next, figure out how much VRAM you have.

  • Figure out how much "context" you want, aka how much text the LLM can ingest. If a model has a context length of, say, "8K", that means it can support 8K tokens as input, or less than 8K words. Not all tokenizers are the same: some, like Qwen 2.5's, can fit nearly a word per token, while others are more in the ballpark of half a word per token or less.
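As a rough back-of-the-envelope rule, you can sketch the word capacity of a context window like this (the words-per-token ratios are ballpark assumptions, not measured tokenizer numbers):

```python
def approx_words(context_tokens, words_per_token=0.75):
    """Very rough capacity estimate; real ratios depend on the tokenizer and the text."""
    return int(context_tokens * words_per_token)

print(approx_words(8192))        # mid-range assumption: 6144 words
print(approx_words(8192, 0.5))   # a more verbose tokenizer: 4096 words
```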

  • Keep in mind that the actual context length of many models is an outright lie, see: https://github.com/hsiehjackson/RULER

  • Exllama has a feature called "kv cache quantization" that can dramatically shrink the VRAM the "context" of an LLM takes up. Unlike llama.cpp's, its Q4 cache is basically lossless, and on a model like Command-R, an 80K+ context can take up less than 4GB! It's essential to enable Q4 or Q6 cache to squeeze as much LLM as you can into your GPU.
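To see where that "less than 4GB" figure comes from, here's the arithmetic, using hypothetical-but-plausible dimensions (40 layers, 8 KV heads, head dim 128; check a real model's config for its actual numbers):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem):
    # K and V each store n_layers * n_kv_heads * head_dim elements per token
    elems = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elems * bytes_per_elem / 1e9

print(round(kv_cache_gb(40, 8, 128, 80_000, 2.0), 2))  # FP16 cache: ~13.11 GB
print(round(kv_cache_gb(40, 8, 128, 80_000, 0.5), 2))  # Q4 cache (~4 bits/elem): ~3.28 GB
```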

  • With that in mind, you can search huggingface for your desired model. Since we are using tabbyAPI, we want to search for "exl2" quantizations: https://huggingface.co/models?sort=modified&search=exl2

  • There are all sorts of finetunes... and a lot of straight-up garbage. But I will post some general recommendations based on total vram:

  • 4GB: A very small quantization of Qwen 2.5 7B. Or maybe Llama 3B.

  • 6GB: IMO llama 3.1 8B is best here. There are many finetunes of this depending on what you want (horny chat, tool usage, math, whatever). For coding, I would recommend Qwen 7B coder instead: https://huggingface.co/models?sort=trending&search=qwen+7b+exl2

  • 8GB-12GB: Qwen 2.5 14B is king! Unlike its 7B counterpart, I find the 14B version of the model incredible for its size, and it will squeeze into this vram pool (albeit with very short context/tight quantization for the 8GB cards). I would recommend trying Arcee's new distillation in particular: https://huggingface.co/bartowski/SuperNova-Medius-exl2

  • 16GB: Mistral 22B, Mistral Coder 22B, and very tight quantizations of Qwen 2.5 34B are possible. Honorable mention goes to InternLM 2.5 20B, which is alright even at 128K context.

  • 20GB-24GB: Command-R 2024 35B is excellent for "in context" work, like asking questions about long documents, continuing long stories, anything involving working "with" the text you feed to an LLM rather than pulling from its internal knowledge pool. It's also quite good at longer contexts, out to 64K-80K more-or-less, all of which fits in 24GB. Otherwise, stick to Qwen 2.5 34B, which still has a very respectable 32K native context, and a rather mediocre 64K "extended" context via YaRN: https://huggingface.co/DrNicefellow/Qwen2.5-32B-Instruct-4.25bpw-exl2

  • 32GB: same as 24GB, just with a higher-bpw quantization. But this is also the threshold where lower-bpw quantizations of Qwen 2.5 72B (at short context) start to make sense.

  • 48GB: Llama 3.1 70B (for longer context) or Qwen 2.5 72B (for 32K context or less)

Again, browse huggingface and pick an exl2 quantization that will cleanly fill your vram pool + the amount of context you want to specify in TabbyAPI. Many quantizers such as bartowski will list how much space they take up, but you can also just look at the available filesize.
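A handy rule of thumb for the "will it fit" question: weight size is roughly parameter count times bits-per-weight, divided by eight. This ignores embeddings, runtime overhead, and the KV cache, so leave headroom:

```python
def weight_gb(n_params_billion, bpw):
    # Weights only: parameters * bits-per-weight / 8 bits-per-byte
    return n_params_billion * bpw / 8

# e.g. a 4.25bpw quant of a 32B model:
print(round(weight_gb(32, 4.25), 1))  # ~17 GB of weights, leaving room for context on a 24GB card
```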

  • Now... you have to download the model. Bartowski has instructions here, but I prefer to use this nifty standalone tool instead: https://github.com/bodaay/HuggingFaceModelDownloader

  • Put it in your TabbyAPI models folder, and follow the documentation on the wiki.

  • There are a lot of options. Some to keep in mind are chunk_size (higher than 2048 will process long contexts faster but take up lots of VRAM, lower will save a little VRAM), cache_mode (use Q4 for long context, Q6/Q8 for short context if you have room), max_seq_len (this is your context length), tensor_parallel (for faster inference with 2 identical GPUs), and max_batch_size (parallel processing if you have multiple users hitting the tabbyAPI server, but more VRAM usage).
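Pulling those options together, a config sketch might look something like this (key names and nesting are from memory, so verify against the TabbyAPI wiki's config.yml reference):

```yaml
# Sketch only; check the TabbyAPI wiki for the authoritative layout
model:
  model_name: SuperNova-Medius-exl2  # whatever model folder you downloaded
  max_seq_len: 32768                 # your context length
  cache_mode: Q4                     # Q4 for long context, Q6/Q8 for short
  chunk_size: 2048                   # higher = faster long-prompt processing, more VRAM
  tensor_parallel: false             # true with 2 identical GPUs
  max_batch_size: 1                  # raise if multiple users hit the server
```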

  • Now... pick your frontend. The tabbyAPI wiki has a good compilation of community projects, but Open Web UI is very popular right now: https://github.com/open-webui/open-webui I personally use exui: https://github.com/turboderp/exui

  • And be careful with your sampling settings when using LLMs. Different models behave differently, but one of the most common mistakes people make is using "old" sampling parameters for new models. In general, keep temperature very low (<0.1, or even zero) and rep penalty low (1.01?) unless you need long, creative responses. If available in your UI, enable DRY sampling to tamp down repetition without "dumbing down" the model with too much temperature or repetition penalty. Always use a MinP of 0.05 or higher and disable other samplers. This is especially important for Chinese models like Qwen, as MinP cuts out "wrong language" answers from the response.
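For example, a conservative set of sampler settings in an OpenAI-style completion request might look like this (parameter names like min_p vary between backends, so treat this as a sketch rather than any server's exact schema):

```python
import json

# Hypothetical request body for an OpenAI-compatible endpoint;
# check your backend's docs for the sampler names it actually accepts.
payload = {
    "model": "whatever-you-loaded",
    "prompt": "Explain KV cache quantization in one sentence.",
    "max_tokens": 256,
    "temperature": 0.1,          # very low for factual tasks
    "min_p": 0.05,               # cuts out garbage / wrong-language tokens
    "repetition_penalty": 1.01,  # keep this gentle
}
print(json.dumps(payload, indent=2))
```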

  • Now, once this is all set up and running, I'd recommend throttling your GPU, as it simply doesn't need its full core speed to maximize its inference speed while generating. For my 3090, I use something like sudo nvidia-smi -pl 290, which throttles it down from 420W to 290W.

Sorry for the wall of text! I can keep going, discussing kobold.cpp/llama.cpp, Aphrodite, exotic quantization and other niches like that if anyone is interested.

 

Obviously there's not a lot of love for OpenAI and other corporate API generative AI here, but how does the community feel about self hosted models? Especially stuff like the Linux Foundation's Open Model Initiative?

I feel like a lot of people just don't know there are Apache/CC-BY-NC licensed "AI" they can run on sane desktops, right now, that are incredible. I'm thinking of the most recent Command-R, specifically. I can run it on one GPU, and it blows expensive API models away, and it's mine to use.

And there are efforts to kill the power cost of inference and training with stuff like matrix-multiplication-free models, open source and legally licensed datasets, cheap training... and OpenAI and such want to shut down all of this because it breaks their monopoly, where they can just outspend everyone scaling, stealing data and destroying the planet. And it's actually a threat to them.

Again, I feel like corporate social media vs. the fediverse is a good analogy, where one is kinda destroying the planet and the other, while still niche, problematic and a WIP, kills a lot of the downsides.

 

HP is apparently testing these upcoming APUs in a single, 8-core configuration.

The Geekbench 5 ST score is around 2100, which is crazy... but not what I really care about. Strix Halo will have a 256-bit memory bus and 40 CUs, which will make it a monster for local LLM inference.
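The reason the bus width matters: batch-1 LLM decoding is memory-bandwidth bound, since every weight gets read once per generated token. A quick estimate (assuming LPDDR5X-8000, which is my guess, not a confirmed spec):

```python
def peak_bandwidth_gbps(bus_bits, transfers_per_s):
    # bytes/sec = (bus width in bytes) * transfers per second
    return bus_bits / 8 * transfers_per_s / 1e9

def rough_tokens_per_s(bandwidth_gbps, model_gb):
    # Decoding reads every weight once per token, so bandwidth sets the ceiling
    return bandwidth_gbps / model_gb

bw = peak_bandwidth_gbps(256, 8_000_000_000)  # assumed LPDDR5X-8000
print(round(bw))                              # 256 GB/s peak
print(round(rough_tokens_per_s(bw, 20), 1))   # ~12.8 tok/s ceiling on a 20GB quant
```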

I am praying AMD sells these things in embedded motherboards with a 128GB+ memory config. Especially in an 8-core config, as I'd rather not burn money and TDP on a 16 core version.
