this post was submitted on 26 Aug 2025
569 points (97.3% liked)
Not The Onion
17815 readers
893 users here now
Welcome
We're not The Onion! Not affiliated with them in any way! Not operated by them in any way! All the news here is real!
The Rules
Posts must be:
- Links to news stories from...
- ...credible sources, with...
- ...their original headlines, that...
- ...would make people who see the headline think, “That has got to be a story from The Onion, America’s Finest News Source.”
Please also avoid duplicates.
Comments and post content must abide by the server rules for Lemmy.world and generally abstain from trollish, bigoted, or otherwise disruptive behavior that makes this community less fun for everyone.
And that’s basically it!
founded 2 years ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
view the rest of the comments
Fun bit of history: gooners were a huge component of open weights LLM development.
Pygmalion 6B was poured over by tinkerers before ChatGPT and Llama were even a thing. Ravenous furries and roleplayers have been major contributors to frameworks that have snowballed into huge projects, practical finetuning and quantization methods, CUDA kernels, sampling techniques OpenAI is still catching up on, you name it.
Horniness (among other things) is a heck of a motivation. But the history is buried in obscure Discords and archived GitHub repos.
And don’t even get me started on imagegen… Good lord. If Grok is going full “anime waifu,” oh, it has got some competition to catch up to.
First time I saw a great image upscaler, it was called "waifu2x", of course.
There was also an older machine learning project that removed censorship commonly found on japanese pornographic drawings, but this one got deleted of GitHub by the author some years ago and I don't even remember it's name anymore.
I used this recently for a professional project. It was by far the best at upscaling and it was free.
Yeah, the history of GANs stretch way back before transformers LLMs, and evolved into ESRGAN, finetunes in obscure Discords...
That history went somewhere, fortunately, and it definitely blows the venerable waifu2x out of the water: https://openmodeldb.info/
I remember those days - using my 3080ti to test the latest Pygmalion model and reporting feedback for how understandable the outputs were to the developers...
I didn't use my self-host for horny though - I tried being a DM for a DND session with fictional characters :)
This is one reason I wanted to get into LLMs DND stuff. I've tried virtual DMs and they suck due to hallucinations.
Never tried the opposite. Would be interesting to DM and have the LLM be 4 players.
One thing I learned is that you can't rely on an inventory system (there's no persistence), so you're basically always running a oneshot campaign no matter what. After the novelty wore off, I found that quite boring and just played DND on tabletop sim or roll20 with discord friends instead.
I was wondering if I could setup a external yaml file for the pipeline to take notes and reload to help persistence issues.
I've never tried something like that, but even if you do remind the party what they have equipped or in reserve, they'll just make items up. It's frustrating.
Yeah I was being a bit facetious. There really was a lot of roleplaying and other neat things (like dungeon masters) that motivated people.
That, and there are some even earlier, more primitive (and less horny) RP models, like Janeway and some others named after ST captains.
By the way, there are some pretty awesome dungeon master finetunes that would fit on a 3080 TI these days.
Can you name-drop any recommended DM fine tunes? Anytime I try to do model research I end up down rabbit holes and very confused...
Appreciated!
Oh, there are so many... Yeah, it's a rabbit hole.
For now, check out:
https://huggingface.co/LatitudeGames/Harbinger-24B (and literally anything from Latitude Games, who explicitly specialize in dungeon master models for their site).
https://huggingface.co/PocketDoc/Dans-DangerousWinds-V1.1.1-24b
https://huggingface.co/Gryphe/Codex-24B-Small-3.2
24Bs are very tight on your card (but so smart they're worth it), so you will want ~3.6bpw (10 GB-ish) exl3 quantizations to minimize the quantization loss and keep them fast. They're easy to make yourself if you know a little command line and have decent internet; I can walk you through it.
Or I can just quantize these three models just for you, overnight, if you wish. Maybe check how much VRAM your desktop takes up at idle so I can size them right, and let me know.
Thank you very much! These all looks very interesting and I'm excited to try them out.
I've never quantized a model before (I usually find pre-quantized versions) but I would love to learn how. If you can provide the command-line details for doing so, or point me towards a good resource, that would rock!
So first of all, you run exl3s via tabbyAPI + your frontend of choice: https://github.com/theroyallab/tabbyAPI
Check out their docs. Specific settings I'd recommend are like 16K context and "6,5" cache quantization. For example, these are some changed lines plucked from my own config files:
Now, to make a quantized model, you just download/install the exllamav3 repo (which you install for tabbyAPI anyway) and follow its documentation: https://github.com/turboderp-org/exllamav3/blob/master/doc/convert.md
An example command would be: `python convert.py -i "/Path/to/model" -o "/output/directory" --work_dir "temporary/work/directory" -b 3.2 -hb 6
You probably want, like, 3.2 bits per word (the '-b' flag).
...But that's not how I would quantize it. If I were you, since the ~3bpw range is so sensitive to quantization, I'd use a custom per-layer quantization scheme described here: https://old.reddit.com/r/LocalLLaMA/comments/1mqwt76/optimizing_exl3_quants_by_mixing_bitrates_in/
The process is like this: you either make or download 3bpw and 4bpw variants of the model you desire, like say, this one for 4bpw:
https://huggingface.co/MetaphoricalCode/Harbinger-24B-exl3-4bpw-hb6
And make a 3bpw yourself (since I don't see one available for Harbinger 24B).
Then, you "mix" the two models you've made with a command like this:
python util/recompile.py -or overrides.yml -o "/output/folder" -i "/path/to/your/3bpw-exl3-quantization
And the overrides.yml file looks like:
What this example overrides.yml does is force the more sensitive attention layers to use 4bpw quantization (plucking them from the 4bpw quantization you downloaded), and everything else (namely the mlp layers) to use 3bpw. This should end up around ~3.2bpw or so. You can make it larger by uncommenting the mlp down layer (which is the next most sensitive layer), or make it smaller by commenting out the q_proj layer (with the kv layers being the most sensitive, and relatively tiny).
This seems convoluted, yep. But it has advantages:
It targets the 'sensitive' layers more accurately, whereas exllamav3 more randomly changes the quantization of layers to hit a specified bpw target (as it can only use integer quantizations).
It can be faster. If you can find 3bpw and 4bpw exl3s of the model you want to try, you can just download them and recombine them: no actual quantization needed, and no need to download the 50GB raw weights.
convert.py
takes a few hours to run, whileutil/recompile.py
takes seconds....And why go to all this hassle, you ask?
Because exl3s let you stuff in a much better model, with less loss, than anything you'd find on ollama:
https://github.com/turboderp-org/exllamav3/blob/d8167b0cf4491baeae7705c0dfec7f131f02aad4/doc/exl3.md
You can cram a 24 billion parameter model into the 11GB free you have, with minimal loss and no CPU offloading, wheras with ollama (and their unoptimized GGUFs/context qauntization), you'd either need a Q4/Q5 of a much dumber 12B model, or a Q3/Q2 of a 24B that will spit out jibberish, or make the model glacially slow by offloading half of it to system RAM.
And it better takes advantage of your 3080 TI's architecture.
There are other ways to get really good quantization (like with ik_llama.cpp), but for dense models, I love exllamav3.
Also, this whole field moves fast. Exllamav3 is like 5 months old, and this 'manual' quantization scheme was only tested a few days ago.
Once again, thank you so much for sharing your knowledge! It looks like I have some weekend projects to look forward to.
Yep! Just PM/reply or something for any help/requests, maybe more than once (as sometimes I miss them, and sometimes Lemmy doesn't send notifications for replies).
Oh, and one more thing. Exl3s aren't single files you can click and download. Neither are full precision models.
Git clone or huggingface-cli work.
But I'd recommend this tool, as it hash checks all the files as it downloads them. You'd be surprised how often downloads are corrupted: https://github.com/bodaay/HuggingFaceModelDownloader
Did you play a specific system? I've been curious about playing cyberpunk RED with AI for a bit, most online options seem to be 5e based so I'm curious if you can teach these other systems and settings, that would be awesome.
Honestly I don't use them for much RP these days, mostly novel-style writing instead :P.
'Online' systems are probably taking bone stock LLMs and using 5e rules banged into the system prompt anyway. You could do the same thing with with a local UI (like Kobold, Open Web UI, mikupad. Take your pick.)
Theoretically? You could collect some text from completed Cyberpunk RED games and finetune a model.
Or maybe use constrained sampling to help it format certain answers, which would be much easier.
But honestly I would just try some 'strong' models and see if they follow the rules you paste into the system prompt, unless you want to dump a ton of time (and some cash) down the finetuning rabbit hole.
Oh, also, I can just host any of these on the AI Horde for a bit if you want to try them out, via Kobolt Light or AgnAIstic web apps. Again, just lemme know.