Not The Onion

Elon Musk Appears to Be Completely Addicted to Anime Gooner AI Slop (www.rollingstone.com)

submitted 2 months ago by cm0002@lemmy.world to c/nottheonion@lemmy.world

[–] Killer_Tree@sh.itjust.works 1 points 2 months ago (2 children)

Thank you very much! These all look very interesting and I'm excited to try them out.

I've never quantized a model before (I usually find pre-quantized versions) but I would love to learn how. If you can provide the command-line details for doing so, or point me towards a good resource, that would rock!

[–] brucethemoose@lemmy.world 4 points 2 months ago* (last edited 2 months ago) (1 children)

So first of all, you run exl3s via tabbyAPI + your frontend of choice: https://github.com/theroyallab/tabbyAPI

Check out their docs. Specific settings I'd recommend are like 16K context and "6,5" cache quantization. For example, these are some changed lines plucked from my own config files:

  # Backend to use for the model (default: exllamav2)
  # Options: exllamav2, exllamav3
  backend: exllamav3

  # Max sequence length (default: Empty).
  # Fetched from the model's base sequence length in config.json by default.
  max_seq_len: 16384

  # Enable different cache modes for VRAM savings (default: FP16).
  # Possible values: 'FP16', 'Q8', 'Q6', 'Q4'.
  # For exllamav3, specify the pair k_bits,v_bits where k_bits and v_bits are integers from 2-8 (i.e. 8,8).
  cache_mode: 6,5

  # Chunk size for prompt ingestion (default: 2048).
  # A lower value reduces VRAM usage but decreases ingestion speed.
  # NOTE: Effects vary depending on the model.
  # An ideal value is between 512 and 4096.
  chunk_size: 512
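To get a feel for why the cache_mode setting matters, here is a rough back-of-the-envelope sketch of KV cache size at 16K context. The layer count, KV head count, and head dimension below are assumed, illustrative values (read the real ones from your model's config.json), and the estimate ignores the scale/metadata overhead that quantized caches carry, so treat it as a ballpark only.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, k_bits, v_bits):
    """Rough KV cache size in bytes.

    Each token stores one K and one V vector per layer and per KV head.
    Overhead from quantization scales is ignored in this sketch.
    """
    elements_per_token = n_layers * n_kv_heads * head_dim
    total_bits = elements_per_token * seq_len * (k_bits + v_bits)
    return total_bits // 8


# Assumed model shapes (illustrative, not read from any real config.json).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 40, 8, 128
SEQ_LEN = 16384  # matches max_seq_len in the config lines above

fp16 = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN, 16, 16)
q65 = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN, 6, 5)

print(f"FP16 cache: {fp16 / 1e9:.2f} GB")
print(f"6,5 cache:  {q65 / 1e9:.2f} GB")
```

With these assumed shapes, that works out to roughly 2.68 GB for an FP16 cache versus about 0.92 GB for "6,5", which is VRAM you get back for the model weights.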

Now, to make a quantized model, you just download/install the exllamav3 repo (which you install for tabbyAPI anyway) and follow its documentation: https://github.com/turboderp-org/exllamav3/blob/master/doc/convert.md

An example command would be: `python convert.py -i "/Path/to/model" -o "/output/directory" --work_dir "temporary/work/directory" -b 3.2 -hb 6`

You probably want, like, 3.2 bits per weight (the '-b' flag).

...But that's not how I would quantize it. If I were you, since the ~3bpw range is so sensitive to quantization, I'd use a custom per-layer quantization scheme described here: https://old.reddit.com/r/LocalLLaMA/comments/1mqwt76/optimizing_exl3_quants_by_mixing_bitrates_in/

The process is like this: you either make or download 3bpw and 4bpw variants of the model you desire, like say, this one for 4bpw:

https://huggingface.co/MetaphoricalCode/Harbinger-24B-exl3-4bpw-hb6

And make a 3bpw yourself (since I don't see one available for Harbinger 24B).

Then, you "mix" the two models you've made with a command like this:

python util/recompile.py -or overrides.yml -o "/output/folder" -i "/path/to/your/3bpw-exl3-quantization"

And the overrides.yml file looks like:

sources:
  - id: 4
    model_dir: /path/to/4bpw-exl3-quantization

overrides:
  #   Attention & router tensors – cheap, big gain on MoE models
  - key: "*.self_attn.q_proj*"
    source: 4          # +1 bpw
  - key: "*.self_attn.k_proj*"
    source: 4          # +1 bpw
  - key: "*.self_attn.v_proj*"
    source: 4          # +1 bpw
  - key: "*.self_attn.o_proj*"
    source: 4          # +1 bpw
  # - key: "*.mlp.down_proj*"
  #   source: 4          # +1 bpw

  #  This would force the whole first layer to 4bpw
  # - key: "model.layers.0.*"
  #   source: 4

What this example overrides.yml does is force the more sensitive attention layers to use 4bpw quantization (plucking them from the 4bpw quantization you downloaded), and everything else (namely the mlp layers) to use 3bpw. This should end up around ~3.2bpw or so. You can make it larger by uncommenting the mlp down layer (which is the next most sensitive layer), or make it smaller by commenting out the q_proj layer (with the kv layers being the most sensitive, and relatively tiny).
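If you want to guess where a given set of overrides will land before running anything, a small estimator like the one below can help. It is only a sketch: it assumes the override keys behave like shell-style globs (matched here with Python's fnmatch, which may not be exactly how recompile.py matches them), and the tensor sizes are made-up round numbers, not the real shapes of any model. Real quants also contain embeddings and the output head (the '-hb' bits), so the actual figure will differ.

```python
from fnmatch import fnmatchcase


def mixed_bpw(tensor_sizes, base_bits, overrides):
    """Average bits per weight for a mixed quant.

    tensor_sizes: dict of tensor name -> number of weights
    base_bits:    bit width of the base quant (e.g. 3)
    overrides:    list of (glob_pattern, bits); first match wins
    """
    total_bits = 0
    total_weights = 0
    for name, n_weights in tensor_sizes.items():
        bits = base_bits
        for pattern, override_bits in overrides:
            if fnmatchcase(name, pattern):
                bits = override_bits
                break
        total_bits += n_weights * bits
        total_weights += n_weights
    return total_bits / total_weights


# Hypothetical single layer with made-up relative sizes.
tensors = {
    "model.layers.0.self_attn.q_proj.weight": 40,
    "model.layers.0.self_attn.k_proj.weight": 10,
    "model.layers.0.self_attn.v_proj.weight": 10,
    "model.layers.0.self_attn.o_proj.weight": 40,
    "model.layers.0.mlp.gate_proj.weight": 300,
    "model.layers.0.mlp.up_proj.weight": 300,
    "model.layers.0.mlp.down_proj.weight": 300,
}

attention_only = [
    ("*.self_attn.q_proj*", 4),
    ("*.self_attn.k_proj*", 4),
    ("*.self_attn.v_proj*", 4),
    ("*.self_attn.o_proj*", 4),
]
with_down_proj = attention_only + [("*.mlp.down_proj*", 4)]

print(f"attention at 4bpw:       {mixed_bpw(tensors, 3, attention_only):.2f} bpw")
print(f"attention + down at 4bpw: {mixed_bpw(tensors, 3, with_down_proj):.2f} bpw")
```

Swap in real tensor names and element counts (for example from the model's safetensors index) to get a number that means something for your model.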

This seems convoluted, yep. But it has advantages:

It targets the 'sensitive' layers more accurately, whereas exllamav3 more randomly changes the quantization of layers to hit a specified bpw target (as it can only use integer quantizations).
It can be faster. If you can find 3bpw and 4bpw exl3s of the model you want to try, you can just download them and recombine them: no actual quantization needed, and no need to download the 50GB raw weights. convert.py takes a few hours to run, while util/recompile.py takes seconds.

...And why go to all this hassle, you ask?

Because exl3s let you stuff in a much better model, with less loss, than anything you'd find on ollama:

https://github.com/turboderp-org/exllamav3/blob/d8167b0cf4491baeae7705c0dfec7f131f02aad4/doc/exl3.md

You can cram a 24 billion parameter model into the 11GB free you have, with minimal loss and no CPU offloading, whereas with ollama (and their unoptimized GGUFs/context quantization), you'd either need a Q4/Q5 of a much dumber 12B model, or a Q3/Q2 of a 24B that will spit out gibberish, or make the model glacially slow by offloading half of it to system RAM.

It also takes better advantage of your 3080 Ti's architecture.
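The arithmetic behind that claim is simple enough to sketch. This only counts the quantized weights (parameters times bits per weight), so it leaves out the KV cache, activations, and any tensors kept at higher precision; the 24 billion parameter count and 11GB budget come from the discussion above, and everything else is a rough approximation.

```python
def weights_gb(n_params, bits_per_weight):
    """Approximate size of the quantized weights in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


VRAM_BUDGET_GB = 11  # free VRAM mentioned above

for label, bpw in [("FP16 (raw)", 16.0), ("4.0 bpw", 4.0), ("3.2 bpw", 3.2)]:
    size = weights_gb(24e9, bpw)
    verdict = "fits" if size < VRAM_BUDGET_GB else "does not fit"
    print(f"24B @ {label:<10} ~{size:5.1f} GB -> {verdict} in {VRAM_BUDGET_GB} GB")
```

At 3.2 bpw the weights come to about 9.6 GB, which leaves a little over 1 GB for a quantized cache and overhead; at 4.0 bpw they are already around 12 GB.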

There are other ways to get really good quantization (like with ik_llama.cpp), but for dense models, I love exllamav3.

Also, this whole field moves fast. Exllamav3 is like 5 months old, and this 'manual' quantization scheme was only tested a few days ago.

[–] Killer_Tree@sh.itjust.works 1 points 2 months ago (1 children)

Once again, thank you so much for sharing your knowledge! It looks like I have some weekend projects to look forward to.

[–] brucethemoose@lemmy.world 1 points 2 months ago

Yep! Just PM/reply or something for any help/requests, maybe more than once (as sometimes I miss them, and sometimes Lemmy doesn't send notifications for replies).

[–] brucethemoose@lemmy.world 2 points 2 months ago

Oh, and one more thing. Exl3s aren't single files you can click and download. Neither are full precision models.

Git clone or huggingface-cli work.

But I'd recommend this tool, as it hash checks all the files as it downloads them. You'd be surprised how often downloads are corrupted: https://github.com/bodaay/HuggingFaceModelDownloader
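If you download with git or huggingface-cli instead, you can still check files by hand. Below is a minimal sketch of a SHA-256 check using only the Python standard library; it is not how the tool above works internally, just the general idea. The expected hash is something you would copy yourself from the file's page on Hugging Face, and the function names are made up for this example.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so multi-GB shards don't need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_sha256):
    """Return True if the file's SHA-256 matches the expected hex digest."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

Usage would be something like `verify_file(Path("model-00001-of-00002.safetensors"), "<hash copied from the repo page>")`, with both the filename and the hash being placeholders.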