
Quantization

ggmlR exposes the full set of ggml quantization formats — from legacy Q4_0/Q8_0 to modern K-quants and IQ (importance-matrix) quants. Quantization reduces model size and speeds up inference, especially on GPU.

library(ggmlR)

1. Quantization formats

Family        Formats                                                                       Bits/weight  Notes
Legacy        Q4_0, Q4_1, Q5_0, Q5_1, Q8_0                                                  4–8          Simple block quants
K-quant       Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_K                                            2–8          Better quality/size trade-off
IQ            IQ1_S, IQ1_M, IQ2_XXS, IQ2_XS, IQ2_S, IQ2_M, IQ3_XXS, IQ3_S, IQ4_NL, IQ4_XS   1–4          Requires importance matrix
Ternary       TQ1_0, TQ2_0                                                                  ~1.5–2       Ternary weights
Microscaling  MXFP4                                                                         4            Block floating point
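
The bits-per-weight figures follow directly from the block layouts: each block stores its weights plus a scale. A base-R back-of-envelope check (byte counts taken from the ggml block structs for three legacy formats; treat them as illustrative):

```r
# Bytes per 32-weight block:
#   Q4_0: 2-byte f16 scale + 16 bytes of 4-bit nibbles          = 18
#   Q5_0: 2-byte f16 scale + 4 bytes of high bits + 16 nibbles  = 22
#   Q8_0: 2-byte f16 scale + 32 signed bytes                    = 34
block_bytes <- c(Q4_0 = 18, Q5_0 = 22, Q8_0 = 34)
block_size  <- 32

bits_per_weight <- block_bytes * 8 / block_size    # effective bits per weight
ratio_vs_f32    <- (block_size * 4) / block_bytes  # compression vs float32
round(rbind(bits_per_weight, ratio_vs_f32), 2)
```

The effective rate sits slightly above the nominal 4/5/8 bits because the per-block scale is part of the payload.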

2. Quantize and dequantize

# Original float weights (must be a multiple of block size, typically 32)
weights <- rnorm(256L)

# Quantize to Q4_0
raw_q4 <- quantize_q4_0(weights)
cat("Original size: ", length(weights) * 4L, "bytes\n")
cat("Q4_0 size:     ", length(raw_q4), "bytes\n")
cat("Compression:   ", round(length(weights) * 4L / length(raw_q4), 1), "x\n")

# Dequantize back to float
recovered <- dequantize_row_q4_0(raw_q4, length(weights))
cat("Max abs error: ", max(abs(recovered - weights)), "\n")
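
Inputs whose length is not a multiple of the block size cannot be quantized directly. One option is to zero-pad to the next block boundary; `pad_to_block()` below is a hypothetical helper sketched in base R, not a ggmlR function:

```r
# Zero-pad a vector so its length is a multiple of the block size.
# pad_to_block() is illustrative only -- not part of ggmlR.
pad_to_block <- function(x, block = 32L) {
  r <- length(x) %% block
  if (r == 0L) x else c(x, numeric(block - r))
}

length(pad_to_block(rnorm(40L)))  # padded up to 64
```

Remember to drop the padded tail again after dequantizing.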

3. K-quants (better quality)

K-quants use super-blocks with separate scales, yielding better quality at the same bit width:

weights <- rnorm(512L)

# Q4_K — 4-bit K-quant
raw_q4k <- quantize_q4_K(weights)
rec_q4k <- dequantize_row_q4_K(raw_q4k, length(weights))
cat("Q4_K max error:", max(abs(rec_q4k - weights)), "\n")

# Q8_K — 8-bit K-quant (near-lossless)
raw_q8k <- quantize_q8_K(weights)
rec_q8k <- dequantize_row_q8_K(raw_q8k, length(weights))
cat("Q8_K max error:", max(abs(rec_q8k - weights)), "\n")
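
The benefit of finer-grained scales can be seen with a toy absmax quantizer in base R. This is not the actual Q4_K algorithm, only an illustration of why separate sub-block scales help:

```r
set.seed(1)
w <- rnorm(256)

# Toy 4-bit absmax quantizer: one scale for the whole input vector.
quant <- function(x) {
  s <- max(abs(x)) / 7          # map into the signed 4-bit range [-7, 7]
  round(x / s) * s
}

# One scale for the full 256-weight block vs one per 32-weight sub-block.
err_block <- max(abs(quant(w) - w))
err_sub   <- max(abs(unlist(lapply(split(w, gl(8, 32)), quant)) - w))
c(one_scale = err_block, sub_block_scales = err_sub)
```

Per-sub-block scales adapt to local magnitude, which is the effect the K-quant super-block layout exploits.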

4. IQ quants — importance matrix

IQ formats accept an importance matrix that prioritises accuracy on frequently-used weights. Without an importance matrix they fall back to uniform quantization.

weights    <- rnorm(512L)
importance <- abs(weights)^2          # example: squared weight magnitude as importance

# IQ4_XS — 4-bit with importance
raw_iq4 <- quantize_iq4_xs(weights, importance = importance)
rec_iq4 <- dequantize_row_iq4_xs(raw_iq4, length(weights))
cat("IQ4_XS max error:", max(abs(rec_iq4 - weights)), "\n")
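
To see what the importance matrix buys, here is a toy base-R sketch, not ggml's actual IQ algorithm: a grid search for the 4-bit scale that minimises importance-weighted squared error:

```r
set.seed(42)
w   <- rnorm(64)
imp <- w^2                                # squared magnitude as importance

# Importance-weighted squared error of a toy 4-bit absmax quantizer.
werr <- function(s) sum(imp * (round(w / s) * s - w)^2)

# Grid search over candidate scales, starting from plain absmax / 7.
cands  <- seq(max(abs(w)) / 7, max(abs(w)) / 5, length.out = 50)
s_best <- cands[which.min(vapply(cands, werr, numeric(1)))]

c(absmax_scale = werr(max(abs(w)) / 7), searched_scale = werr(s_best))
```

Because the plain absmax scale is itself one of the candidates, the searched scale can only match or improve the weighted error.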

5. Comparing formats

weights <- rnorm(512L)
n_bytes_f32 <- length(weights) * 4L

formats <- list(
  Q4_0 = list(q = quantize_q4_0,  dq = dequantize_row_q4_0),
  Q8_0 = list(q = quantize_q8_0,  dq = dequantize_row_q8_0),
  Q4_K = list(q = quantize_q4_K,  dq = dequantize_row_q4_K),
  Q6_K = list(q = quantize_q6_K,  dq = dequantize_row_q6_K),
  Q8_K = list(q = quantize_q8_K,  dq = dequantize_row_q8_K)
)

cat(sprintf("%-8s  %6s  %8s  %10s\n", "Format", "Bytes", "Ratio", "MaxError"))
cat(strrep("-", 40), "\n")
for (nm in names(formats)) {
  raw <- formats[[nm]]$q(weights)
  rec <- formats[[nm]]$dq(raw, length(weights))
  cat(sprintf("%-8s  %6d  %8.2fx  %10.6f\n",
              nm, length(raw),
              n_bytes_f32 / length(raw),
              max(abs(rec - weights))))
}

6. Reference (row-level) functions

To quantize one row at a time with the reference implementations, use the *_ref variants:

row <- rnorm(32L)   # exactly one Q4_0 block

raw_row <- quantize_row_q4_0_ref(row, length(row))
rec_row <- dequantize_row_q4_0(raw_row, length(row))

These match the C reference implementations in ggml-quants.c exactly.
