turboquant¶
KV-cache compression using the TurboQuant algorithm (ICLR 2026). Data-oblivious and
unbiased. Pure-NumPy reference implementation, with a C kernel in
quant_ggml.c for production use; quant_bridge.py loads the shared
library and falls back to NumPy when the .dylib/.so is missing.
Source: llamaclaw/turboquant.
TurboQuant — KV-cache compression for LLMs.
Polar decomposition + QJL projection + TurboQuant MSE/prod quantizers. Data-oblivious (no calibration) and unbiased (E[x_hat] = x).
References
Zandieh, A. et al. (ICLR 2026). arXiv:2504.19874.
- class turboquant.GGMLTurboQuant(lib_path: str | Path | None = None)[source]¶
Bases: object
Python interface to the GGML TurboQuant C library. Falls back to pure NumPy when the shared library isn’t compiled.
- Parameters:
lib_path (str or Path, optional) – Path to the compiled shared library (.dylib/.so). If None, searches the package directory.
- dequantize(block: Any, bits: int = 3) ndarray[tuple[Any, ...], dtype[float32]][source]¶
Dequantize a block back to a float32 vector.
- class turboquant.TQBlock(d: int, bits: int | list[int], radius: float, angle_indices: list[ndarray[tuple[Any, ...], dtype[uint8]]], qjl_signs: ndarray[tuple[Any, ...], dtype[int8]] | None = None, qjl_norm: float = 0.0, rotation_seed: int = 42, qjl_seed: int | None = None)[source]¶
Bases: object
Storage format for a TurboQuant-compressed vector block.
- qjl_signs¶
QJL error-correction signs (Stage 2 only).
- Type:
ndarray[int8] or None
- turboquant.compress_kv_cache(keys: ndarray[tuple[Any, ...], dtype[float64]], values: ndarray[tuple[Any, ...], dtype[float64]], bits: int = 3, method: str = 'mse', rotation_seed: int = 42, qjl_seed: int = 137) tuple[list[TQBlock], list[TQBlock]][source]¶
Compress attention KV cache using TurboQuant.
- Parameters:
keys (ndarray of shape (n_tokens, d)) – Key vectors from attention layer.
values (ndarray of shape (n_tokens, d)) – Value vectors from attention layer.
bits (int) – Quantization bits.
method (str) – "mse" for Stage 1 only, "prod" for Stage 1 + QJL.
rotation_seed (int) – Shared rotation seed for all vectors.
qjl_seed (int) – QJL seed (only for method="prod").
- Returns:
key_blocks (list of TQBlock)
value_blocks (list of TQBlock)
- turboquant.decompress_kv_cache(key_blocks: list[TQBlock], value_blocks: list[TQBlock], method: str = 'mse') tuple[ndarray[tuple[Any, ...], dtype[float64]], ndarray[tuple[Any, ...], dtype[float64]]][source]¶
Decompress KV cache blocks back to arrays.
- turboquant.inverse_polar(radius: float, angles: list[ndarray[tuple[Any, ...], dtype[float64]]]) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Inverse polar transform: reconstruct Cartesian vector from polar.
- Parameters:
radius (float) – ||x||₂
angles (list of ndarray) – As returned by polar_transform().
- Returns:
Reconstructed vector.
- Return type:
ndarray of float64
- turboquant.polar_transform(x: ndarray[tuple[Any, ...], dtype[float64]]) tuple[float, list[ndarray[tuple[Any, ...], dtype[float64]]]][source]¶
Recursive Cartesian→Polar transform in O(d log d).
- Parameters:
x (ndarray of shape (d,)) – Input vector (should be rotated first: y = Π · x).
- Returns:
radius (float) – ||x||₂
angles (list of ndarray) – angles[ℓ] contains the angles at level ℓ (0-indexed). Level 0: d/2 angles in [0, 2π) (atan2 of adjacent pairs). Level ℓ ≥ 1: d/2^(ℓ+1) angles in [0, π/2] (atan2 of norm ratios).
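The level structure described above fits in a few lines of NumPy. This is an illustrative reimplementation, not the package code — the helper names `polar_fwd`/`polar_inv` are hypothetical, and d is assumed to be a power of two:

```python
import numpy as np

def polar_fwd(x):
    """Recursive Cartesian→Polar: returns (radius, angles per level).

    Level 0 pairs adjacent coordinates; deeper levels pair the
    resulting (non-negative) magnitudes, so their angles fall in [0, π/2].
    Assumes len(x) is a power of two.
    """
    angles = []
    r = x.astype(float)
    while r.size > 1:
        a, b = r[0::2], r[1::2]
        angles.append(np.arctan2(b, a))
        r = np.hypot(a, b)          # magnitudes feed the next level
    return float(r[0]), angles

def polar_inv(radius, angles):
    """Inverse: rebuild the Cartesian vector from radius + angles."""
    r = np.array([radius])
    for theta in reversed(angles):
        out = np.empty(2 * r.size)
        out[0::2] = r * np.cos(theta)
        out[1::2] = r * np.sin(theta)
        r = out
    return r

x = np.random.default_rng(0).standard_normal(8)
radius, angles = polar_fwd(x)
x_rec = polar_inv(radius, angles)
```

The final magnitude equals ||x||₂, matching the radius returned by `polar_transform()`.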
- turboquant.qjl_decode(signs: ndarray[tuple[Any, ...], dtype[int8]], norm: float, S: ndarray[tuple[Any, ...], dtype[float64]]) ndarray[tuple[Any, ...], dtype[float64]][source]¶
QJL dequantization — unbiased inner-product estimation.
Q_qjl^{-1}(z) = (√(π/2) / d) · S^T · z
Guarantee: E[<y, Q_qjl^{-1}(Q_qjl(r))>] = <y, r> (unbiased).
- turboquant.qjl_encode(residual: ndarray[tuple[Any, ...], dtype[float64]], S: ndarray[tuple[Any, ...], dtype[float64]]) tuple[ndarray[tuple[Any, ...], dtype[int8]], float][source]¶
QJL 1-bit sign encoding.
Q_qjl(r) = sign(S · r), stored alongside ||r||₂.
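The unbiasedness guarantee can be checked numerically. The sketch below is self-contained (it does not call the package) and assumes a square d×d Gaussian S, consistent with the 1/d scaling in the decode formula; averaging the decoder output over many fresh sketches recovers the residual:

```python
import numpy as np

rng = np.random.default_rng(7)
d = 64
r = rng.standard_normal(d)              # residual to sketch
y = rng.standard_normal(d)              # query vector

# Average the decoder output over many fresh Gaussian sketches S.
trials = 400
r_mean = np.zeros(d)
for _ in range(trials):
    S = rng.standard_normal((d, d))
    signs = np.sign(S @ r)                                    # Q_qjl(r): 1 bit per row
    r_hat = np.sqrt(np.pi / 2) / d * np.linalg.norm(r) * (S.T @ signs)
    r_mean += r_hat
r_mean /= trials

# Unbiasedness: E[r_hat] = r, so <y, r_hat> estimates <y, r>.
```

With 400 sketches the averaged reconstruction tracks r closely, and `y @ r_mean` approximates the true inner product `y @ r`.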
- turboquant.turboquant_mse(x: ndarray[tuple[Any, ...], dtype[float64]], bits: int = 3, rotation_seed: int = 42) TQBlock[source]¶
TurboQuant MSE-optimal quantization (Stage 1 only).
Per Algorithm 1 (Zandieh et al. 2026, arXiv:2504.19874):
1. Rotate: y = Π · x
2. Normalize to a unit vector, store the norm
3. Scalar-quantize each coordinate of y_unit via the Lloyd-Max codebook
4. Store the indices
MSE distortion bound: D_mse ≤ (√3π / 2) · (1 / 4^b).
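The four steps can be sketched end to end in plain NumPy. This is an illustrative stand-in, not the library code: it substitutes a uniform codebook spanning ±3/√d (rotated unit-vector coordinates concentrate near N(0, 1/d)) for the Lloyd-Max codebook, so its distortion is somewhat worse than the bound above:

```python
import numpy as np

def tq_mse_sketch(x, bits=3, seed=42):
    """Illustrative Stage-1 pipeline: rotate → normalize → quantize.

    Uses a uniform codebook on ±3/√d instead of the paper's
    Lloyd-Max codebook (hypothetical helper, not the package API).
    """
    rng = np.random.default_rng(seed)
    d = x.size
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    Q = Q * np.sign(np.diag(R))           # Haar-uniform rotation Π
    y = Q @ x                             # 1. rotate
    norm = np.linalg.norm(y)              # 2. store the norm
    levels = np.linspace(-3, 3, 2 ** bits) / np.sqrt(d)
    idx = np.argmin(np.abs(y[:, None] / norm - levels), axis=1)
    return idx.astype(np.uint8), norm, Q  # 3./4. indices + metadata

def tq_mse_decode_sketch(idx, norm, Q, bits=3):
    d = Q.shape[0]
    levels = np.linspace(-3, 3, 2 ** bits) / np.sqrt(d)
    return Q.T @ (levels[idx] * norm)     # undo the rotation

x = np.random.default_rng(0).standard_normal(256)
idx, norm, Q = tq_mse_sketch(x, bits=3)
x_hat = tq_mse_decode_sketch(idx, norm, Q)
cos = x @ x_hat / (np.linalg.norm(x) * np.linalg.norm(x_hat))
```

Even with the cruder codebook, 3 bits at d = 256 keeps the reconstruction well aligned with the input.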
- turboquant.turboquant_mse_decode(block: TQBlock) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Decode a TurboQuant MSE block back to a vector.
- Parameters:
block (TQBlock) – Compressed block from turboquant_mse().
- Returns:
Reconstructed vector.
- Return type:
ndarray of float64
- turboquant.turboquant_prod(x: ndarray[tuple[Any, ...], dtype[float64]], bits: int = 3, rotation_seed: int = 42, qjl_seed: int = 137) TQBlock[source]¶
TurboQuant inner-product-optimal quantization (Stage 1 + QJL).
Applies MSE quantization with (bits-1) bits, then QJL error correction on the residual. Inner-product distortion bound:
D_prod ≤ (√3π² · ||y||² / d) · (1 / 4^b)
- Parameters:
x (ndarray of shape (d,)) – Input vector.
bits (int) – Total quantization bits (Stage 1 uses bits-1).
rotation_seed (int) – Rotation seed.
qjl_seed (int) – QJL projection seed.
- Returns:
Compressed representation with QJL error correction.
- Return type:
TQBlock
- turboquant.turboquant_prod_decode(block: TQBlock) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Decode a TurboQuant inner-product block.
- Parameters:
block (TQBlock) – Compressed block from turboquant_prod().
- Returns:
Reconstructed vector (MSE reconstruction + QJL residual estimate).
- Return type:
ndarray of float64
TurboQuant — data-oblivious vector quantization for ESML.
Implements the TurboQuant two-stage quantization pipeline from the ICLR 2026 paper (arxiv.org/abs/2504.19874).
What this module IS: A standalone compression library that quantizes arbitrary vectors using rotation + Lloyd-Max scalar quantization + optional QJL 1-bit error correction. Validated against the paper’s theoretical bounds.
What this module is NOT: An inference-time KV-cache optimizer. It does not
hook into Ollama, llama.cpp, or HuggingFace transformers attention layers.
For runtime KV-cache compression with Ollama, use OLLAMA_KV_CACHE_TYPE=q8_0.
Algorithms¶
- Stage 1 — TurboQuant_MSE (Zandieh et al. 2026, arXiv:2504.19874):
Random rotation via QR decomposition (Python) or WHT (C)
Normalize to unit vector, store L2 norm
Scalar-quantize each coordinate via Lloyd-Max codebook optimized for the Beta((d-1)/2, (d-1)/2) distribution
- Stage 2 — QJL (Zandieh et al. 2025, arXiv:2406.03482):
Compute residual: r = x - dequant(quant(x))
1-bit sign encoding via random projection: sign(S·r)
Guarantees unbiased inner-product estimation: E[<y, r̂>] = <y, r>
- Also includes PolarQuant utilities (Han et al. 2026, arXiv:2502.02617):
polar_transform() and inverse_polar() for recursive Cartesian→Polar, but these are NOT used by turboquant_mse(). They are available for research and comparison.
Theoretical Bounds¶
MSE distortion: D_mse ≤ (√3π / 2) · (1 / 4^b) ≈ 2.72 / 4^b
Inner-product distortion: D_prod ≤ (√3π² · ||y||² / d) · (1 / 4^b)
Info-theoretic lower bound: D_mse ≥ 1 / 4^b (gap: ~2.72×)
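Evaluating the MSE bound for b = 3 and b = 4 reproduces the bound values quoted in the validated results:

```python
import numpy as np

# Upper bound (√3·π/2)/4^b vs the info-theoretic floor 1/4^b.
for b in (3, 4):
    upper = np.sqrt(3) * np.pi / 2 / 4 ** b
    lower = 1 / 4 ** b
    print(f"b={b}: D_mse ≤ {upper:.4f}, floor {lower:.4f}, gap {upper / lower:.2f}×")
```

The gap to the lower bound is the constant √3π/2 ≈ 2.72 at every bit width.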
Validated Results (2026-03-31)¶
3-bit, d=256: MSE=0.028 (bound=0.043), cosine=0.987, compression=5.1×
4-bit, d=256: MSE=0.009 (bound=0.011), cosine=0.996, compression=3.9×
- class turboquant.quant.TQBlock(d: int, bits: int | list[int], radius: float, angle_indices: list[ndarray[tuple[Any, ...], dtype[uint8]]], qjl_signs: ndarray[tuple[Any, ...], dtype[int8]] | None = None, qjl_norm: float = 0.0, rotation_seed: int = 42, qjl_seed: int | None = None)[source]¶
Bases: object
Storage format for a TurboQuant-compressed vector block.
- qjl_signs¶
QJL error-correction signs (Stage 2 only).
- Type:
ndarray[int8] or None
- turboquant.quant.compress_kv_cache(keys: ndarray[tuple[Any, ...], dtype[float64]], values: ndarray[tuple[Any, ...], dtype[float64]], bits: int = 3, method: str = 'mse', rotation_seed: int = 42, qjl_seed: int = 137) tuple[list[TQBlock], list[TQBlock]][source]¶
Compress attention KV cache using TurboQuant.
- Parameters:
keys (ndarray of shape (n_tokens, d)) – Key vectors from attention layer.
values (ndarray of shape (n_tokens, d)) – Value vectors from attention layer.
bits (int) – Quantization bits.
method (str) – "mse" for Stage 1 only, "prod" for Stage 1 + QJL.
rotation_seed (int) – Shared rotation seed for all vectors.
qjl_seed (int) – QJL seed (only for method="prod").
- Returns:
key_blocks (list of TQBlock)
value_blocks (list of TQBlock)
- turboquant.quant.decompress_kv_cache(key_blocks: list[TQBlock], value_blocks: list[TQBlock], method: str = 'mse') tuple[ndarray[tuple[Any, ...], dtype[float64]], ndarray[tuple[Any, ...], dtype[float64]]][source]¶
Decompress KV cache blocks back to arrays.
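A toy round trip through this compress/decompress pair can be sketched with a simplified per-row quantizer — a (norm, 3-bit indices) pair standing in for a real TQBlock. The function names are illustrative, not the package API:

```python
import numpy as np

def compress_sketch(mat, bits=3):
    """Per-row (indices, norm) stand-in for a list of TQBlocks."""
    levels = np.linspace(-1, 1, 2 ** bits)
    d = mat.shape[1]
    blocks = []
    for row in mat:
        norm = np.linalg.norm(row)
        # Unit-vector coords are ~N(0, 1/d); map ±3σ into [-1, 1].
        scaled = np.clip(row / norm * np.sqrt(d) / 3, -1, 1)
        idx = np.argmin(np.abs(scaled[:, None] - levels), axis=1)
        blocks.append((idx.astype(np.uint8), norm))
    return blocks

def decompress_sketch(blocks, d, bits=3):
    levels = np.linspace(-1, 1, 2 ** bits)
    return np.vstack([levels[idx] * 3 / np.sqrt(d) * norm
                      for idx, norm in blocks])

rng = np.random.default_rng(1)
keys = rng.standard_normal((16, 128))      # (n_tokens, d) key vectors
key_blocks = compress_sketch(keys)
keys_hat = decompress_sketch(key_blocks, d=128)
```

The real pipeline additionally rotates each vector with the shared rotation_seed and uses the Lloyd-Max codebook, which is what buys the distortion bounds above.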
- turboquant.quant.dequantize_angles(indices: list[ndarray[tuple[Any, ...], dtype[uint8]]], d: int, bits: int | list[int]) list[ndarray[tuple[Any, ...], dtype[float64]]][source]¶
Dequantize angle indices back to angles.
- turboquant.quant.get_codebook(d: int, bits: int) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Get or compute the Lloyd-Max codebook for (d, bits).
- turboquant.quant.inner_product_distortion_bound(bits: int, norm_sq: float, d: int) float[source]¶
Theoretical inner-product distortion bound.
D_prod ≤ (√3π² · ||y||² / d) · (1/4^b)
- turboquant.quant.inverse_polar(radius: float, angles: list[ndarray[tuple[Any, ...], dtype[float64]]]) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Inverse polar transform: reconstruct Cartesian vector from polar.
- Parameters:
radius (float) – ||x||₂
angles (list of ndarray) – As returned by polar_transform().
- Returns:
Reconstructed vector.
- Return type:
ndarray of float64
- turboquant.quant.lloyd_max_codebook(d: int, bits: int, n_iter: int = 200) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Compute optimal Lloyd-Max codebook for Beta-distributed coordinates.
- Solves the continuous 1-D k-means problem:
min_{c_1,…,c_K} E[ min_k |X - c_k|² ]
where X ~ Beta((d-1)/2, (d-1)/2) scaled to [-1, 1].
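The continuous problem above can be approximated with ordinary Lloyd iterations (1-D k-means) on samples of the scaled Beta distribution. A hedged sketch, not the package's solver:

```python
import numpy as np

def lloyd_max_sketch(d, bits, n_samples=100_000, n_iter=30, seed=0):
    """Approximate Lloyd-Max codebook: 1-D k-means on samples of
    X ~ Beta((d-1)/2, (d-1)/2) rescaled from [0, 1] to [-1, 1]."""
    rng = np.random.default_rng(seed)
    x = 2.0 * rng.beta((d - 1) / 2, (d - 1) / 2, size=n_samples) - 1.0
    k = 2 ** bits
    c = np.quantile(x, (np.arange(k) + 0.5) / k)    # quantile init
    for _ in range(n_iter):
        # Assign each sample to the nearest centroid, then recenter.
        idx = np.argmin(np.abs(x[:, None] - c[None, :]), axis=1)
        for j in range(k):
            sel = x[idx == j]
            if sel.size:
                c[j] = sel.mean()
    return np.sort(c)

cb = lloyd_max_sketch(d=256, bits=3)
```

Because the Beta distribution is symmetric about 0 after rescaling, the resulting codebook is (up to sampling noise) symmetric about 0.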
- turboquant.quant.mse_distortion_bound(bits: int) float[source]¶
Theoretical MSE distortion upper bound: D_mse ≤ (√3π/2) · (1/4^b).
- turboquant.quant.pack_indices(indices: ndarray[tuple[Any, ...], dtype[uint8]], bits: int) bytes[source]¶
Pack uint8 indices into a compact byte representation.
- turboquant.quant.polar_transform(x: ndarray[tuple[Any, ...], dtype[float64]]) tuple[float, list[ndarray[tuple[Any, ...], dtype[float64]]]][source]¶
Recursive Cartesian→Polar transform in O(d log d).
- Parameters:
x (ndarray of shape (d,)) – Input vector (should be rotated first: y = Π · x).
- Returns:
radius (float) – ||x||₂
angles (list of ndarray) – angles[ℓ] contains the angles at level ℓ (0-indexed). Level 0: d/2 angles in [0, 2π) (atan2 of adjacent pairs). Level ℓ ≥ 1: d/2^(ℓ+1) angles in [0, π/2] (atan2 of norm ratios).
- turboquant.quant.qjl_decode(signs: ndarray[tuple[Any, ...], dtype[int8]], norm: float, S: ndarray[tuple[Any, ...], dtype[float64]]) ndarray[tuple[Any, ...], dtype[float64]][source]¶
QJL dequantization — unbiased inner-product estimation.
Q_qjl^{-1}(z) = (√(π/2) / d) · S^T · z
Guarantee: E[<y, Q_qjl^{-1}(Q_qjl(r))>] = <y, r> (unbiased).
- turboquant.quant.qjl_encode(residual: ndarray[tuple[Any, ...], dtype[float64]], S: ndarray[tuple[Any, ...], dtype[float64]]) tuple[ndarray[tuple[Any, ...], dtype[int8]], float][source]¶
QJL 1-bit sign encoding.
Q_qjl(r) = sign(S · r), stored alongside ||r||₂.
- turboquant.quant.qjl_projection_matrix(d: int, seed: int | None = None) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Generate the random projection matrix S for QJL.
- turboquant.quant.quantize_angles(angles: list[ndarray[tuple[Any, ...], dtype[float64]]], d: int, bits: int | list[int]) list[ndarray[tuple[Any, ...], dtype[uint8]]][source]¶
Quantize polar angles using Lloyd-Max codebooks.
- Parameters:
angles (list of ndarray) – Polar angles as returned by polar_transform().
d (int) – Vector dimension.
bits (int or list of int) – Bits per level.
- Returns:
Quantized index arrays, one per level.
- Return type:
list of ndarray[uint8]
- turboquant.quant.rotation_matrix(d: int, seed: int | None = None) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Generate a random orthogonal rotation matrix via QR decomposition.
The rotation randomizes coordinate-wise distributions so that angular components after the polar transform follow a concentrated, predictable distribution — eliminating the need for per-block normalization constants.
- Parameters:
d (int) – Matrix dimension.
seed (int, optional) – RNG seed for reproducibility.
- Returns:
Orthogonal matrix satisfying Π^T · Π = I.
- Return type:
ndarray of shape (d, d)
Notes
Uses the QR decomposition of a matrix with i.i.d. N(0,1) entries, with sign correction to ensure a uniform distribution over O(d).
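The QR-with-sign-correction recipe in the notes fits in a few lines; this sketch of the same construction is illustrative, not the package function:

```python
import numpy as np

def rotation_matrix_sketch(d, seed=None):
    """QR of an i.i.d. N(0,1) matrix, with column signs fixed so that
    diag(R) > 0 — yielding a Haar-uniform orthogonal matrix."""
    rng = np.random.default_rng(seed)
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    return Q * np.sign(np.diag(R))      # per-column sign correction

P = rotation_matrix_sketch(64, seed=42)
```

Without the sign correction, `np.linalg.qr` alone does not sample uniformly over O(d): the signs of R's diagonal bias the distribution of Q.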
- turboquant.quant.turboquant_mse(x: ndarray[tuple[Any, ...], dtype[float64]], bits: int = 3, rotation_seed: int = 42) TQBlock[source]¶
TurboQuant MSE-optimal quantization (Stage 1 only).
Per Algorithm 1 (Zandieh et al. 2026, arXiv:2504.19874):
1. Rotate: y = Π · x
2. Normalize to a unit vector, store the norm
3. Scalar-quantize each coordinate of y_unit via the Lloyd-Max codebook
4. Store the indices
MSE distortion bound: D_mse ≤ (√3π / 2) · (1 / 4^b).
- turboquant.quant.turboquant_mse_decode(block: TQBlock) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Decode a TurboQuant MSE block back to a vector.
- Parameters:
block (TQBlock) – Compressed block from turboquant_mse().
- Returns:
Reconstructed vector.
- Return type:
ndarray of float64
- turboquant.quant.turboquant_prod(x: ndarray[tuple[Any, ...], dtype[float64]], bits: int = 3, rotation_seed: int = 42, qjl_seed: int = 137) TQBlock[source]¶
TurboQuant inner-product-optimal quantization (Stage 1 + QJL).
Applies MSE quantization with (bits-1) bits, then QJL error correction on the residual. Inner-product distortion bound:
D_prod ≤ (√3π² · ||y||² / d) · (1 / 4^b)
- Parameters:
x (ndarray of shape (d,)) – Input vector.
bits (int) – Total quantization bits (Stage 1 uses bits-1).
rotation_seed (int) – Rotation seed.
qjl_seed (int) – QJL projection seed.
- Returns:
Compressed representation with QJL error correction.
- Return type:
TQBlock
- turboquant.quant.turboquant_prod_decode(block: TQBlock) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Decode a TurboQuant inner-product block.
- Parameters:
block (TQBlock) – Compressed block from turboquant_prod().
- Returns:
Reconstructed vector (MSE reconstruction + QJL residual estimate).
- Return type:
ndarray of float64
- turboquant.quant.unpack_indices(data: bytes, bits: int, count: int) ndarray[tuple[Any, ...], dtype[uint8]][source]¶
Unpack bit-packed indices back to uint8 array.
- Parameters:
data (bytes) – Packed byte string from pack_indices().
bits (int) – Bits per index.
count (int) – Number of indices to unpack.
- Return type:
ndarray of uint8
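A bit-exact pack/unpack round trip can be sketched with NumPy's packbits/unpackbits. The MSB-first layout below is illustrative and need not match the package's internal byte format:

```python
import numpy as np

def pack_sketch(indices, bits):
    """Pack b-bit uint8 indices into bytes via a flat MSB-first bit stream."""
    # unpackbits gives 8 MSB-first bits per value; keep the low `bits`.
    stream = np.unpackbits(indices[:, None], axis=1)[:, 8 - bits:]
    return np.packbits(stream.ravel()).tobytes()

def unpack_sketch(data, bits, count):
    """Inverse of pack_sketch: recover `count` b-bit indices."""
    stream = np.unpackbits(np.frombuffer(data, dtype=np.uint8))
    stream = stream[: count * bits].reshape(count, bits)
    pad = np.zeros((count, 8 - bits), dtype=np.uint8)   # restore high bits
    return np.packbits(np.hstack([pad, stream]), axis=1).ravel()

idx = np.array([5, 0, 7, 3, 1], dtype=np.uint8)
data = pack_sketch(idx, bits=3)        # 15 bits → 2 bytes
out = unpack_sketch(data, bits=3, count=5)
```

At 3 bits per index this packs 256 indices into 96 bytes instead of 256, which is where the block sizes in quant_bridge come from.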
- turboquant.quant.verify_orthogonal(Q: ndarray[tuple[Any, ...], dtype[float64]], atol: float = 1e-10) bool[source]¶
Verify that Q is orthogonal: Q^T · Q ≈ I.
Python ctypes bridge for GGML TurboQuant C kernels.
Provides GGMLTurboQuant which loads the compiled shared library
and exposes quantize/dequantize/dot_product as Python callables.
The C implementation lives in quant_ggml.h (header) and the eventual
quant_ggml.c (to be compiled). Until the C library is built, this
module falls back to the pure-NumPy implementation in esml.quant.
Usage¶
>>> from turboquant.quant_bridge import GGMLTurboQuant
>>> tq = GGMLTurboQuant()
>>> if tq.available:
... block = tq.quantize(vector, bits=3)
... else:
... # Falls back to pure Python
... from turboquant.quant import turboquant_mse
... block = turboquant_mse(vector, bits=3)
- class turboquant.quant_bridge.BlockTBQ2[source]¶
Bases: Structure
ctypes mirror of block_tbq2_0 (70 bytes per 256 elements).
- indices¶
Structure/Union member
- norm¶
Structure/Union member
- seed¶
Structure/Union member
- class turboquant.quant_bridge.BlockTBQ3[source]¶
Bases: Structure
ctypes mirror of block_tbq3_0 (102 bytes per 256 elements).
- indices¶
Structure/Union member
- norm¶
Structure/Union member
- seed¶
Structure/Union member
- class turboquant.quant_bridge.BlockTBQ4[source]¶
Bases: Structure
ctypes mirror of block_tbq4_0 (134 bytes per 256 elements).
- indices¶
Structure/Union member
- norm¶
Structure/Union member
- seed¶
Structure/Union member
- class turboquant.quant_bridge.GGMLTurboQuant(lib_path: str | Path | None = None)[source]¶
Bases: object
Python interface to the GGML TurboQuant C library. Falls back to pure NumPy when the shared library isn’t compiled.
- Parameters:
lib_path (str or Path, optional) – Path to the compiled shared library (.dylib/.so). If None, searches the package directory.
- dequantize(block: Any, bits: int = 3) ndarray[tuple[Any, ...], dtype[float32]][source]¶
Dequantize a block back to a float32 vector.
- turboquant.quant_bridge.compile_ggml_lib(output_dir: str | Path | None = None, optimize: bool = True) Path | None[source]¶
Attempt to compile the GGML TurboQuant C library.
Requires cc (clang or gcc) on the system.
TurboQuant-compressed KV cache for ESML’s inference engine.
Stores attention key/value vectors as compressed TQBlocks instead of raw float tensors, achieving 4-6x memory reduction during inference.
This plugs into esml.engine.ESMLEngine to provide KV-cache
compression during actual transformer attention computation.
References
TurboQuant: Zandieh et al. (2026). ICLR 2026. arXiv:2504.19874
QJL: Zandieh et al. (2025). AAAI 2025. arXiv:2406.03482
- class turboquant.kv_cache.CacheStats(compressed_bytes: int = 0, uncompressed_bytes: int = 0, n_tokens: int = 0, n_layers: int = 0)[source]¶
Bases: object
Memory statistics for a TurboQuantKVCache.
- class turboquant.kv_cache.TurboQuantKVCache(n_layers: int, head_dim: int, bits: int = 3, rotation_seed: int = 42)[source]¶
Bases: object
KV cache that stores keys and values as TurboQuant-compressed blocks.
Each key/value vector is quantized via esml.quant.turboquant_mse() on append(), and decompressed on get_keys()/get_values().
- Parameters:
n_layers (int) – Number of transformer layers.
head_dim (int) – Dimension of each key/value vector.
bits (int) – Quantization bits.
rotation_seed (int) – Shared rotation seed for all vectors.
Examples
>>> cache = TurboQuantKVCache(n_layers=32, head_dim=128, bits=3)
>>> k = np.random.randn(128)
>>> v = np.random.randn(128)
>>> cache.append(layer=0, k_vec=k, v_vec=v)
>>> keys = cache.get_keys(0)    # (1, 128) decompressed
>>> values = cache.get_values(0)
>>> cache.stats.compression_ratio
5.1
- append(layer: int, k_vec: ndarray[tuple[Any, ...], dtype[float64]], v_vec: ndarray[tuple[Any, ...], dtype[float64]]) None[source]¶
Compress and cache a new key/value pair for one token.
- get_keys(layer: int) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Decompress all cached keys for a layer.
- get_values(layer: int) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Decompress all cached values for a layer.
- property stats: CacheStats¶
Compute memory statistics.