ClassicConnect "640k ought to be enough for everybody."
|
| View previous topic :: View next topic |
| Author |
Message |
LyraNovaHeart Gorts

Joined: 15 Apr 2025 Age: 27 Posts: 48 Location: Los Angeles, California
Posted: Tue May 06, 2025 3:54 am Post subject: Beginners Guide 1: Getting Started with Inferencing
So, you want to get into running LLMs on your own hardware, eh? Want to break free of corporate control? You've come to the right place.
Anyway, onto the guide:
Part 1: Determining What You Can Run
This first part is pretty easy. First, determine what OS you have; most LLM backends work on these operating systems:
- Windows 10/11
- Linux
- macOS
You'll also want to check what hardware you have. If you're relying solely on the CPU, inference will be really slow, but it is possible. For GPU you have a few options:
CPU (direct)
Nvidia GPU: Vulkan, CUDA (cuBLAS), OpenCL (CLBlast)
AMD GPU: Vulkan, ROCm (while ROCm works on Windows, most projects target Linux)
Apple Silicon: Metal
Note for Nvidia: GTX 10 series cards will "work" but will be limited; ideally you want a 20 series card MINIMUM.
Note for AMD users: there is a ROCm fork of KoboldCPP (though it has gone quiet lately), and Vulkan on some AMD cards is very slow. You can download the ROCm fork here: https://github.com/YellowRoseCx/koboldcpp-rocm/
Now, onto choosing a model size. Check how much GPU VRAM (if using GPU) or system RAM (if using CPU) you have to determine what you can run; a reference for common sizes is below. While you can spill into system memory, note that it will be REALLY slow due to memory bandwidth bottlenecks. To avoid this, keep between 500MB and 2GB free. Next, you need to understand quants.
Quants: Types and How They Affect Performance
Okay, so let's get this out of the way:
FP16: 16-bit floating point, basically the "full weights," since no one trains in FP32 anymore
FP8: 8-bit floating point, mostly lossless from the full weights in terms of quality; Nvidia 40 series (Ada Lovelace) and 50 series (Blackwell) have native support
FP4: 4-bit floating point, only supported on Blackwell 50 series cards
GGUF: GPT-Generated Unified Format, THE most common quant type due to its ease of use; it supports everything from 2-bit to 16-bit and runs on the majority of hardware
EXL2: ExLlamaV2, a quant type that runs solely on GPU but usually provides the best speed outside of 8-bit; supports 2.4bpw to 8bpw, and takes a very long time to quant
MLX: Apple's quant format for Apple Silicon Macs
To determine what you can run, take the model's file size and compare it to your RAM/VRAM; if it fits with at least 500MB free, you're good.
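As a worked example of that check (the headroom figure and the size formula are rough rules of thumb, not exact measurements):

```python
# Rough fit check: can a quantized model fit in VRAM with headroom?
# Sketch only; real memory use also includes context/KV cache and overhead.

def quant_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def fits(model_gb: float, vram_gb: float, headroom_gb: float = 0.5) -> bool:
    """True if the model leaves at least `headroom_gb` of VRAM free."""
    return model_gb + headroom_gb <= vram_gb

# Example: Qwen3 4B at Q8_0 (~8 bits/weight) is roughly 4 GB of weights.
size = quant_size_gb(4, 8)
print(f"~{size:.1f} GB")        # ~4.0 GB, weights only
print(fits(size, vram_gb=8))    # True on an 8 GB card (before context)
```

The same arithmetic explains the reference sizes above: halve the bits per weight and you roughly halve the memory needed.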
Now, onto running:
The easiest method is KoboldCPP, as it is very simple to set up. In this example we'll use Qwen 3 4B Q8_0.
to download the model: https://huggingface.co/bartowski/Qwen_Qwen3-4B-GGUF/resolve/main/Qwen_Qwen3-4B-Q8_0.gguf?download=true
KoboldCPP: https://github.com/LostRuins/koboldcpp/releases/tag/v1.90.2
Use a download manager to speed up the download. Once you have the file, run the command for your OS:
Windows 10/11 (Non CUDA/Nvidia): | Code: | | .\koboldcpp_nocuda.exe --contextsize 16384 --gpulayers 9999 |
Windows 10/11 (Nvidia/CUDA): | Code: | | .\koboldcpp_cu12.exe --flashattention --contextsize 16384 --gpulayers 9999 |
Linux (you may need to chmod +x it first): | Code: | | ./koboldcpp-linux-x64-cuda1210 --flashattention --contextsize 16384 --gpulayers 9999 |
macOS: unsure for now
Once you've chosen the GGUF model, a page will open at localhost:5001, which you can use to inference.
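You can also hit the local server from a script. A minimal sketch, assuming KoboldCPP's KoboldAI-style `/api/v1/generate` endpoint and its field names (check the API docs page your build serves to confirm):

```python
# Build (but do not send) a completion request for a local KoboldCPP server.
# Endpoint path and payload fields are assumptions based on the KoboldAI API.
import json
import urllib.request

def build_generate_request(prompt: str, max_length: int = 100) -> urllib.request.Request:
    payload = {"prompt": prompt, "max_length": max_length, "temperature": 0.7}
    return urllib.request.Request(
        "http://localhost:5001/api/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_generate_request("Once upon a time,")
# urllib.request.urlopen(req) would send it once the server is running;
# the generated text should come back under results[0]["text"] (assumed).
```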
Next post will deal with EXL2 _________________ I'm one day closer to being who I wanna be~
Last edited by LyraNovaHeart on Fri Jun 06, 2025 12:26 pm; edited 4 times in total
LyraNovaHeart Gorts

Posted: Thu May 08, 2025 11:33 pm Post subject:
2nd part: QuantKV and its uses
QuantKV is a way to quantize the context cache (this is where your chat gets stored) to use less VRAM, though it also comes with some caveats.
Typically you have these options for QuantKV:
- 1: 8-bit, should be mostly fine with minimal quality loss
- 2: 4-bit, significant quality loss on most models; not ideal unless you really need to squeeze out as much context as possible
Note that these are for KoboldCPP, for EXL2 it is a bit different.
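To see why this matters, the KV cache grows linearly with context length. A rough size sketch (the layer/head numbers below are illustrative, not taken from any specific model):

```python
# Rough KV cache size: 2 (K and V) * layers * kv_heads * head_dim
# * context_length * bytes_per_element. Backends add overhead on top,
# so treat this as a lower bound.

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: float) -> float:
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9

shape = dict(layers=40, kv_heads=8, head_dim=128, context=16384)
print(round(kv_cache_gb(**shape, bytes_per_elem=2), 2))  # f16 cache
print(round(kv_cache_gb(**shape, bytes_per_elem=1), 2))  # 8-bit: half the VRAM
```

Dropping from f16 to 8-bit halves the cache, which is exactly the VRAM QuantKV gives back.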
To run a model with QuantKV enabled, add the `--quantkv` flag (1 = 8-bit, 2 = 4-bit), which requires `--flashattention`, for example: | Code: | | .\koboldcpp_cu12.exe --flashattention --contextsize 16384 --gpulayers 9999 --quantkv 1 |
_________________ I'm one day closer to being who I wanna be~
Last edited by LyraNovaHeart on Fri May 09, 2025 12:17 am; edited 3 times in total
LyraNovaHeart Gorts

Posted: Fri May 09, 2025 12:09 am Post subject:
3rd Part: EXL2
ExLlamaV2 is a faster way to run models on both Nvidia and AMD cards, and my favorite way to run models due to its speed. If you run models with this, you've made a good choice.
Setup
In this case, we'll use TabbyAPI, as it is widely considered the standard way to run models in EXL2. You will need these prerequisites installed:
- Python 3.10 minimum; 3.11 or newer recommended. DO NOT USE PYTHON INSTALLED FROM THE MS STORE, AS IT CAN CAUSE ISSUES!
- CUDA 12.1 (not strictly required, as the installer can pull it in itself, but recommended). Download here: https://developer.nvidia.com/cuda-downloads
- ROCm 6.1 (Linux only). Download here: https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.1.0/how-to/prerequisites.html
- Git
Note: ROCm support only exists for Linux. On Windows, you may also need VS Build Tools 17.8 if you get an error saying it needs to be installed.
To Install:
1. Run | Code: | | git clone https://github.com/theroyallab/tabbyAPI | and navigate to the TabbyAPI folder.
2. Run `start.bat` (Windows) or `start.sh` (Linux) and follow the on-screen instructions.
3. Once installed, you will need to edit your `config.yml` to point it to a folder with a model. Apply these edits:
| Code: | | model_dir: D:\Models |
| Code: | | model_name: ProdeusUnity_Stellar-Odyssey-12b_v0.0_6.0bpw-EXL2 |
Change these values as needed (e.g., your model directory, model name, sequence length/context, cache mode like Q8 or Q6 to save VRAM, and host if you don't want it only on localhost).
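Pulled together, the relevant part of `config.yml` might look something like this. The key names follow TabbyAPI's sample config as I remember it; verify them against the `config.yml` your install generates:

```yaml
# Sketch only; check your generated config.yml for the exact key names.
model:
  model_dir: D:\Models
  model_name: ProdeusUnity_Stellar-Odyssey-12b_v0.0_6.0bpw-EXL2
  max_seq_len: 16384      # context length
  cache_mode: Q8          # quantize the KV cache to save VRAM
network:
  host: 127.0.0.1         # set to 0.0.0.0 to expose beyond localhost
  port: 5000
```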
Now run start.bat or start.sh again. You will need to use a frontend like OpenWebUI or SillyTavern to interact with this API, but once loaded, you are good to go.
Finding Models:
This is relatively easy. Look on Hugging Face (HF) for models labeled "EXL2" and choose a size that will fit (e.g., 6 Bit for 12GB users on Nemo 12b).
Example Model: https://huggingface.co/turboderp/Mistral-Nemo-Instruct-12B-exl2. Choose the branch that fits your VRAM size.
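That "6 bit fits 12GB" rule of thumb is just parameter count times bits per weight; a quick sanity check (weights only, so the KV cache and activations need extra room on top):

```python
# Weights-only size of an EXL2 quant: params * bits_per_weight / 8 bytes.

def exl2_weights_gb(params_billion: float, bpw: float) -> float:
    return params_billion * 1e9 * bpw / 8 / 1e9

# Mistral Nemo 12B at 6.0 bpw: ~9 GB of weights, which is why it is a
# sensible pick for a 12 GB card (leaving ~3 GB for context and overhead).
print(exl2_weights_gb(12, 6.0))  # 9.0
```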
Final Note: EXL2 is GPU ONLY. While you can spill into system memory, you will experience significant slowdown. _________________ I'm one day closer to being who I wanna be~
smartDark Style by Smartor
Powered by phpBB 2.0.25 CC Mod © 2001, 2002 phpBB Group