ClassicConnect "640k ought to be enough for everybody."
|
| View previous topic :: View next topic |
| Author |
Message |
LyraNovaHeart Gorts

Joined: 15 Apr 2025 Age: 27 Posts: 48 Location: Los Angeles, California
Posted: Tue May 06, 2025 3:54 am Post subject: Beginners Guide 1: Getting Started with Inferencing
So, you want to get into running LLMs on your own hardware, eh? Want to break free of corporate control? You've come to the right place.
Anyway, onto the guide:
Part 1: Determining What You Can Run
This first part is pretty easy. First, determine what OS you have; most LLM backends work on these operating systems:
- Windows 10/11
- Linux
- macOS
You'll also want to check what hardware you have. If you're relying solely on the CPU, inference will be really slow, but it is possible. For GPU you have a few options:
CPU (direct)
Nvidia GPU: Vulkan, CUDA (cuBLAS), OpenCL (CLBlast)
AMD GPU: Vulkan, ROCm (while ROCm works on Windows, most projects target Linux)
Apple Silicon: Metal
Note for Nvidia: GTX 10 series cards will "work" but will be limited; ideally you want a 20 series card MINIMUM.
Note for AMD users: there is a ROCm fork of KoboldCPP (though it has gone quiet lately), and Vulkan on some AMD cards is very slow. You can download the ROCm fork here: https://github.com/YellowRoseCx/koboldcpp-rocm/
Now, onto choosing a model size. Check how much GPU VRAM (if using GPU) or system RAM (if using CPU) you have to determine what you can run; a reference for common sizes is below. While you can spill into system memory, note that it will be REALLY slow due to memory bandwidth bottlenecks. To avoid this, keep between 500MB and 2GB free. Next, you need to understand quants.
Quants: Types and How They Affect Performance
Okay, so let's get this out of the way:
FP16: 16-bit floating point, basically the "full weights," since no one trains in FP32 anymore
FP8: 8-bit floating point, mostly lossless from the full weights in terms of quality; Nvidia 40 series (Ada Lovelace) and 50 series (Blackwell) have native support
FP4: 4-bit floating point, only supported on Blackwell 50 series cards
GGUF: GPT-Generated Unified Format, THE most common quant type due to its ease of use; it supports everything from 2-bit to 16-bit and runs on the majority of hardware
EXL2: ExLlamaV2, a quant type that runs solely on GPU but usually provides the best speed outside of 8-bit; supports 2.4bpw to 8bpw, and takes a very long time to quant
MLX: Apple's quant format for Apple Silicon Macs
To determine what you can run, take the model's file size and compare it to your RAM/VRAM; if it fits with at least 500MB free, you're good.
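As a worked example of that check (the headroom figure and the size formula are rough rules of thumb, not exact measurements):

```python
# Rough fit check: can a quantized model fit in VRAM with headroom?
# Sketch only; real memory use also includes context/KV cache and overhead.

def quant_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def fits(model_gb: float, vram_gb: float, headroom_gb: float = 0.5) -> bool:
    """True if the model leaves at least `headroom_gb` of VRAM free."""
    return model_gb + headroom_gb <= vram_gb

# Example: Qwen3 4B at Q8_0 (~8 bits/weight) is roughly 4 GB of weights.
size = quant_size_gb(4, 8)
print(f"~{size:.1f} GB")        # ~4.0 GB, weights only
print(fits(size, vram_gb=8))    # True on an 8 GB card (before context)
```

The same arithmetic explains the reference sizes above: halve the bits per weight and you roughly halve the memory needed.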
Now, onto running:
The easiest method is KoboldCPP, as it is very simple to set up. In this example we'll use Qwen 3 4B Q8_0.
to download the model: https://huggingface.co/bartowski/Qwen_Qwen3-4B-GGUF/resolve/main/Qwen_Qwen3-4B-Q8_0.gguf?download=true
KoboldCPP: https://github.com/LostRuins/koboldcpp/releases/tag/v1.90.2
Use a download manager to speed up the download. Once you have the file, run the command for your OS:
Windows 10/11 (Non CUDA/Nvidia): | Code: | | .\koboldcpp_nocuda.exe --contextsize 16384 --gpulayers 9999 |
Windows 10/11 (Nvidia/CUDA): | Code: | | .\koboldcpp_cu12.exe --flashattention --contextsize 16384 --gpulayers 9999 |
Linux (you may need to chmod +x it first): | Code: | | ./koboldcpp-linux-x64-cuda1210 --flashattention --contextsize 16384 --gpulayers 9999 |
macOS: unsure for now
Once you've chosen the GGUF model, a page will open at localhost:5001, which you can use to inference.
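You can also hit the local server from a script. A minimal sketch, assuming KoboldCPP's KoboldAI-style `/api/v1/generate` endpoint and its field names (check the API docs page your build serves to confirm):

```python
# Build (but do not send) a completion request for a local KoboldCPP server.
# Endpoint path and payload fields are assumptions based on the KoboldAI API.
import json
import urllib.request

def build_generate_request(prompt: str, max_length: int = 100) -> urllib.request.Request:
    payload = {"prompt": prompt, "max_length": max_length, "temperature": 0.7}
    return urllib.request.Request(
        "http://localhost:5001/api/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_generate_request("Once upon a time,")
# urllib.request.urlopen(req) would send it once the server is running;
# the generated text should come back under results[0]["text"] (assumed).
```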
Next post will deal with EXL2 _________________ I'm one day closer to being who I wanna be~
Last edited by LyraNovaHeart on Fri Jun 06, 2025 12:26 pm; edited 4 times in total
LyraNovaHeart Gorts

Posted: Thu May 08, 2025 11:33 pm Post subject:
2nd part: QuantKV and its uses
QuantKV is a way to quantize the context cache (this is where your chat gets stored) to use less VRAM, though it also comes with some caveats.
Typically you have these options for QuantKV:
- 1: 8-bit, should be mostly fine with minimal quality loss
- 2: 4-bit, significant quality loss on most models; not ideal unless you really need to squeeze out as much context as possible
Note that these are for KoboldCPP, for EXL2 it is a bit different.
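To see why this matters, the KV cache grows linearly with context length. A rough size sketch (the layer/head numbers below are illustrative, not taken from any specific model):

```python
# Rough KV cache size: 2 (K and V) * layers * kv_heads * head_dim
# * context_length * bytes_per_element. Backends add overhead on top,
# so treat this as a lower bound.

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: float) -> float:
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9

shape = dict(layers=40, kv_heads=8, head_dim=128, context=16384)
print(round(kv_cache_gb(**shape, bytes_per_elem=2), 2))  # f16 cache
print(round(kv_cache_gb(**shape, bytes_per_elem=1), 2))  # 8-bit: half the VRAM
```

Dropping from f16 to 8-bit halves the cache, which is exactly the VRAM QuantKV gives back.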
To run a model with QuantKV enabled, add the `--quantkv` flag (1 = 8-bit, 2 = 4-bit), which requires `--flashattention`, for example: | Code: | | .\koboldcpp_cu12.exe --flashattention --contextsize 16384 --gpulayers 9999 --quantkv 1 |
_________________ I'm one day closer to being who I wanna be~
Last edited by LyraNovaHeart on Fri May 09, 2025 12:17 am; edited 3 times in total
LyraNovaHeart Gorts

Posted: Fri May 09, 2025 12:09 am Post subject:
3rd Part: EXL2
ExLlamaV2 is a faster way to run models on both Nvidia and AMD cards, and my favorite way to run models due to its speed. If you run models with this, you've made a good choice.
Setup
In this case, we'll use TabbyAPI, as it is widely considered the standard way to run models in EXL2. You will need these prerequisites installed:
- Python 3.10 minimum; 3.11 or newer recommended. DO NOT USE PYTHON INSTALLED FROM THE MS STORE, AS IT CAN CAUSE ISSUES!
- CUDA 12.1 (not strictly required, as the installer can pull it in itself, but recommended). Download here: https://developer.nvidia.com/cuda-downloads
- ROCm 6.1 (Linux only). Download here: https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.1.0/how-to/prerequisites.html
- Git
Note: ROCm support only exists for Linux. On Windows, you may also need VS Build Tools 17.8 if you get an error saying it needs to be installed.
To Install:
1. Run | Code: | | git clone https://github.com/theroyallab/tabbyAPI | and navigate to the TabbyAPI folder.
2. Run `start.bat` (Windows) or `start.sh` (Linux) and follow the on-screen instructions.
3. Once installed, you will need to edit your `config.yml` to point it to a folder with a model. Apply these edits:
| Code: | | model_dir: D:\Models |
| Code: | | model_name: ProdeusUnity_Stellar-Odyssey-12b_v0.0_6.0bpw-EXL2 |
Change these values as needed (e.g., your model directory, model name, sequence length/context, cache mode like Q8 or Q6 to save VRAM, and host if you don't want it only on localhost).
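Pulled together, the relevant part of `config.yml` might look something like this. The key names follow TabbyAPI's sample config as I remember it; verify them against the `config.yml` your install generates:

```yaml
# Sketch only; check your generated config.yml for the exact key names.
model:
  model_dir: D:\Models
  model_name: ProdeusUnity_Stellar-Odyssey-12b_v0.0_6.0bpw-EXL2
  max_seq_len: 16384      # context length
  cache_mode: Q8          # quantize the KV cache to save VRAM
network:
  host: 127.0.0.1         # set to 0.0.0.0 to expose beyond localhost
  port: 5000
```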
Now run start.bat or start.sh again. You will need to use a frontend like OpenWebUI or SillyTavern to interact with this API, but once loaded, you are good to go.
Finding Models:
This is relatively easy. Look on Hugging Face (HF) for models labeled "EXL2" and choose a size that will fit (e.g., 6 Bit for 12GB users on Nemo 12b).
Example Model: https://huggingface.co/turboderp/Mistral-Nemo-Instruct-12B-exl2. Choose the branch that fits your VRAM size.
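That "6 bit fits 12GB" rule of thumb is just parameter count times bits per weight; a quick sanity check (weights only, so the KV cache and activations need extra room on top):

```python
# Weights-only size of an EXL2 quant: params * bits_per_weight / 8 bytes.

def exl2_weights_gb(params_billion: float, bpw: float) -> float:
    return params_billion * 1e9 * bpw / 8 / 1e9

# Mistral Nemo 12B at 6.0 bpw: ~9 GB of weights, which is why it is a
# sensible pick for a 12 GB card (leaving ~3 GB for context and overhead).
print(exl2_weights_gb(12, 6.0))  # 9.0
```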
Final Note: EXL2 is GPU ONLY. While you can spill into system memory, you will experience significant slowdown. _________________ I'm one day closer to being who I wanna be~
smartDark Style by Smartor
Powered by phpBB 2.0.25 CC Mod © 2001, 2002 phpBB Group