ClassicConnect "640k ought to be enough for everybody."
LyraNovaHeart Gorts

Joined: 15 Apr 2025 Age: 27 Posts: 48 Location: Los Angeles, California
Posted: Mon May 26, 2025 3:35 am Post subject: Beginner's Guide 2: Quanting Models (LLMs)
Welcome back! Wanna contribute to the community by quanting for others? Don't worry, it's easy!
Onto the guide:
Part 1: Quant Types
Now, you may be saying: Lyra, didn't you already explain quant types in the last post? And if you did read the last post, yes, I did! I know it's repetitive, but trust me, it'll make sense.
Types:
GGUF: GPT-Generated Unified Format. This is THE most common type you'll encounter, since it's the easiest for people to run: it works on CPU, on GPU, or split across both.
EXL2: ExLlamaV2, a quant format based on safetensors, built for very fast inference. This is less common, since it needs a GPU both to make the quant and to run it.
FP8: 8-bit floating point. Format-wise it's closer to FP16, but hardware support starts at NVIDIA's RTX 40 series (Ada Lovelace) and newer.
There are other formats too, and as I learn how to quant those, I'll post them here. Yes, I am learning along with everyone as I go.
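If you're wondering what "quanting" actually does under the hood: it maps full-precision weights down to a small integer range plus a scale factor. Here's a minimal sketch in plain Python, purely illustrative; real quantizers like llama-quantize work block-wise and are much smarter about minimizing error:

```python
def quantize_q8(weights):
    """Symmetric 8-bit quantization: map floats onto ints in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_q8(q, scale):
    """Recover approximate floats from the stored ints and the scale."""
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.91, -0.04]
q, scale = quantize_q8(weights)
restored = dequantize_q8(q, scale)
# restored is close to weights, but each value is stored in 8 bits
# instead of 16 or 32 -- that's where the size savings come from
```

Lower-bit quants (Q4, Q2, IQ2...) shrink the integer range further, trading a bit of fidelity for a much smaller file.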
Part 2: Actually quanting these formats
Okay, here's the fun part: actually quanting! You'll need a few tools, of course.
Required dependencies:
Python 3.10 minimum, 3.11 recommended. DO NOT USE PYTHON 3.13 OR ANY PYTHON FROM THE MICROSOFT STORE!!! THIS WILL CAUSE CONFLICTS!!!
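A quick way to check what you're actually running before you start. The "WindowsApps" check is my own heuristic for spotting the Microsoft Store install (that's where Windows puts it), not an official flag:

```python
import sys

# Print the interpreter version and its location so you can spot a bad install.
print(sys.version_info[:2], sys.executable)

# The guide wants 3.10 or 3.11 (3.12 may work; 3.13 is known to break things).
ok_version = (3, 10) <= sys.version_info[:2] <= (3, 12)

# Store-installed Python lives under a "WindowsApps" directory.
is_store_python = "WindowsApps" in sys.executable

print("version OK:", ok_version, "| Store Python:", is_store_python)
```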
We'll start with GGUF first, since it's a bit jank.
You will need:
llama.cpp git repo: https://github.com/ggml-org/llama.cpp.git
llama.cpp releases: https://github.com/ggml-org/llama.cpp/releases/tag/b5489
A model to quant (in HF format): any on Huggingface will do, example: https://huggingface.co/Qwen/Qwen3-0.6B
Steps:
Clone the repo and download a release. Make a folder for each.
Download the model you want to quantize, example: | Code: | | huggingface-cli download Qwen/Qwen3-0.6B --local-dir Qwen3-0.6B |
Open a terminal in the folder with the cloned part of the llama.cpp repo.
Run this command. It's mostly the same across different versions apart from the path separators. If needed, just drag the file into the terminal: | Code: | | python convert_hf_to_gguf.py "D:\Models\Qwen3-0.6B" --outfile "D:\Models\Qwen3-0.6B-FP16.gguf" |
Now that you have an FP16 GGUF of the model, switch to the folder containing the release build of llama.cpp.
Run the llama-quantize binary for your OS, passing a quant code like Q6_K (about 6.56 bpw) or Q8_0 (8-bit). A full list is provided below, along with an example:
Example: | Code: | | .\llama-quantize.exe "D:\Models\Qwen3-0.6B-FP16.gguf" "D:\Models\Qwen3-0.6B-Q8_0.gguf" Q8_0 |
List:
| Code: |
2 or Q4_0
3 or Q4_1
8 or Q5_0
9 or Q5_1
19 or IQ2_XXS
20 or IQ2_XS
28 or IQ2_S
29 or IQ2_M
24 or IQ1_S
31 or IQ1_M
36 or TQ1_0
37 or TQ2_0
10 or Q2_K
21 or Q2_K_S
23 or IQ3_XXS
26 or IQ3_S
27 or IQ3_M
12 or Q3_K
22 or IQ3_XS
11 or Q3_K_S
12 or Q3_K_M
13 or Q3_K_L
25 or IQ4_NL
30 or IQ4_XS
15 or Q4_K
14 or Q4_K_S
15 or Q4_K_M
17 or Q5_K
16 or Q5_K_S
17 or Q5_K_M
18 or Q6_K
7 or Q8_0
1 or F16
32 or BF16
0 or F32
|
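Both steps (convert to FP16, then llama-quantize) are easy to script if you want several quant sizes from one FP16 file. A rough sketch, assuming llama-quantize is on your PATH; the paths and quant list are just examples to swap for your own:

```python
import shutil
import subprocess

def build_quantize_cmd(exe, fp16_gguf, out_gguf, qtype):
    """Assemble the llama-quantize command line for one output quant."""
    return [exe, fp16_gguf, out_gguf, qtype]

def quantize_all(fp16_gguf, qtypes, exe="llama-quantize"):
    """Run llama-quantize once per requested quant type."""
    for qtype in qtypes:
        out = fp16_gguf.replace("-FP16.gguf", f"-{qtype}.gguf")
        cmd = build_quantize_cmd(exe, fp16_gguf, out, qtype)
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)  # raises if llama-quantize fails

# Only attempt the run if the binary is actually installed.
if shutil.which("llama-quantize"):
    quantize_all(r"D:\Models\Qwen3-0.6B-FP16.gguf", ["Q4_K_M", "Q6_K", "Q8_0"])
```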
EXL2:
This one is a lot simpler, though you need a GPU. Of course, you'll need dependencies; just run this setup from the GitHub repo:
| Code: | git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install -r requirements.txt
pip install . |
Once you have that, the rest is very simple; we'll stick with the same example model. Here's a quick command to get started:
| Code: | | python convert.py -i C:\Users\lg911\Downloads\exllamav2\Qwen_Qwen3_0.6B -o C:\Users\lg911\Downloads\exllamav2\Qwen_Qwen3_0.6B-8.0bpwEXL2 -b 8.00 -hb 8 |
Notes:
b: bits per weight, determines quality; lowest is 2.4, highest is 8.0.
hb: head bits, the precision of the LM head; lowest is 6, highest is 8.
EXL2 quanting can take a long time (an hour or more), so please be patient.
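Since bpw is literally bits per weight, you can estimate the output size before committing to an hour-long quant. A back-of-the-envelope helper (parameter counts are approximate, and real files add a little overhead for the tokenizer and metadata):

```python
def est_size_gb(n_params, bpw):
    """Rough quantized file size: parameters * bits-per-weight, in gigabytes."""
    return n_params * bpw / 8 / 1e9

# A 0.6B-parameter model at 8.0 bpw is roughly 0.6 GB...
print(round(est_size_gb(0.6e9, 8.0), 2))
# ...and at 2.4 bpw roughly 0.18 GB.
print(round(est_size_gb(0.6e9, 2.4), 2))
```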
FP8: Coming soon
_________________
I'm one day closer to being who I wanna be~
Last edited by LyraNovaHeart on Fri Jun 06, 2025 12:33 pm; edited 3 times in total
nick99nack Admin

Joined: 30 Aug 2023 Age: 30 Posts: 171 Location: NJ, USA
Posted: Mon May 26, 2025 4:06 am Post subject:
Good guide, just want to add a couple of things I had to do when trying this the other day on Windows 11.
1. (May or may not be applicable) When installing everything in requirements.txt, I had to install Visual Studio because pip was looking for build tools. However, I later discovered that I was using the wrong version of Python (3.13, which Lyra specifically said NOT to use), so I'm not sure if that's needed on 3.11.
2. Windows 11 hijacks the "python3" command for its own MS Store version, which caused issues with convert_hf_to_gguf.py. To fix this, I edited the first line of the script from | Code: | | #!/usr/bin/env python3 | to | Code: | | #!/usr/bin/env python |.
The script was then able to run normally, and everything else went as planned. I spent way too much time debugging that, messing with venvs and such. Don't waste time like I did.
_________________
If you like browsing without an ad blocker, you might also like getting rid of your virus scanner, and running around with your pants down. --SomeGuy, 2016
smartDark Style by Smartor
Powered by phpBB 2.0.25 CC Mod © 2001, 2002 phpBB Group