turboquant¶
KV-cache compression using the TurboQuant algorithm (ICLR 2026). Data-oblivious and
unbiased. Pure-NumPy reference implementation, with a C kernel in
quant_ggml.c for production use; quant_bridge.py loads the shared
library and falls back to NumPy when the .dylib/.so is missing.
Source: llamaclaw/turboquant.
TurboQuant — KV-cache compression for LLMs.
Polar decomposition + QJL projection + TurboQuant MSE/prod quantizers. Data-oblivious (no calibration) and unbiased (E[x_hat] = x).
References
Zandieh, A. et al. (ICLR 2026). arXiv:2504.19874.
- class turboquant.GGMLTurboQuant(lib_path: str | Path | None = None)[source]¶
Bases: object
Python interface to the GGML TurboQuant C library. Falls back to pure NumPy when the shared library isn’t compiled.
- Parameters:
lib_path (str or Path, optional) – Path to the compiled shared library (.dylib/.so). If None, searches the package directory.
- dequantize(block: Any, bits: int = 3) ndarray[tuple[Any, ...], dtype[float32]][source]¶
Dequantize a block back to a float32 vector.
- class turboquant.TQBlock(d: int, bits: int | list[int], radius: float, angle_indices: list[ndarray[tuple[Any, ...], dtype[uint8]]], qjl_signs: ndarray[tuple[Any, ...], dtype[int8]] | None = None, qjl_norm: float = 0.0, rotation_seed: int = 42, qjl_seed: int | None = None)[source]¶
Bases: object
Storage format for a TurboQuant-compressed vector block.
- qjl_signs¶
QJL error-correction signs (Stage 2 only).
- Type:
ndarray[int8] or None
- turboquant.compress_kv_cache(keys: ndarray[tuple[Any, ...], dtype[float64]], values: ndarray[tuple[Any, ...], dtype[float64]], bits: int = 3, method: str = 'mse', rotation_seed: int = 42, qjl_seed: int = 137) tuple[list[TQBlock], list[TQBlock]][source]¶
Compress attention KV cache using TurboQuant.
- Parameters:
keys (ndarray of shape (n_tokens, d)) – Key vectors from attention layer.
values (ndarray of shape (n_tokens, d)) – Value vectors from attention layer.
bits (int) – Quantization bits.
method (str) – "mse" for Stage 1 only, "prod" for Stage 1 + QJL.
rotation_seed (int) – Shared rotation seed for all vectors.
qjl_seed (int) – QJL seed (only for method="prod").
- Returns:
key_blocks (list of TQBlock)
value_blocks (list of TQBlock)
- turboquant.decompress_kv_cache(key_blocks: list[TQBlock], value_blocks: list[TQBlock], method: str = 'mse') tuple[ndarray[tuple[Any, ...], dtype[float64]], ndarray[tuple[Any, ...], dtype[float64]]][source]¶
Decompress KV cache blocks back to arrays.
- turboquant.inverse_polar(radius: float, angles: list[ndarray[tuple[Any, ...], dtype[float64]]]) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Inverse polar transform: reconstruct Cartesian vector from polar.
- Parameters:
radius (float) – ||x||₂
angles (list of ndarray) – As returned by polar_transform().
- Returns:
Reconstructed vector.
- Return type:
ndarray of float64
- turboquant.polar_transform(x: ndarray[tuple[Any, ...], dtype[float64]]) tuple[float, list[ndarray[tuple[Any, ...], dtype[float64]]]][source]¶
Recursive Cartesian→Polar transform in O(d log d).
- Parameters:
x (ndarray of shape (d,)) – Input vector (should be rotated first: y = Π · x).
- Returns:
radius (float) – ||x||₂
angles (list of ndarray) – angles[ℓ] contains the angles at level ℓ (0-indexed). Level 0: d/2 angles in [0, 2π) (atan2 of adjacent pairs). Level ℓ ≥ 1: d/2^(ℓ+1) angles in [0, π/2] (atan2 of norm ratios).
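The level structure described above fits in a few lines of NumPy. This is an illustrative reimplementation, not the package code — the helper names `polar_fwd`/`polar_inv` are hypothetical, and d is assumed to be a power of two:

```python
import numpy as np

def polar_fwd(x):
    """Recursive Cartesian→Polar: returns (radius, angles per level).

    Level 0 pairs adjacent coordinates; deeper levels pair the
    resulting (non-negative) magnitudes, so their angles fall in [0, π/2].
    Assumes len(x) is a power of two.
    """
    angles = []
    r = x.astype(float)
    while r.size > 1:
        a, b = r[0::2], r[1::2]
        angles.append(np.arctan2(b, a))
        r = np.hypot(a, b)          # magnitudes feed the next level
    return float(r[0]), angles

def polar_inv(radius, angles):
    """Inverse: rebuild the Cartesian vector from radius + angles."""
    r = np.array([radius])
    for theta in reversed(angles):
        out = np.empty(2 * r.size)
        out[0::2] = r * np.cos(theta)
        out[1::2] = r * np.sin(theta)
        r = out
    return r

x = np.random.default_rng(0).standard_normal(8)
radius, angles = polar_fwd(x)
x_rec = polar_inv(radius, angles)
```

The final magnitude equals ||x||₂, matching the radius returned by `polar_transform()`.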
- turboquant.qjl_decode(signs: ndarray[tuple[Any, ...], dtype[int8]], norm: float, S: ndarray[tuple[Any, ...], dtype[float64]]) ndarray[tuple[Any, ...], dtype[float64]][source]¶
QJL dequantization — unbiased inner-product estimation.
Q_qjl^{-1}(z) = (√(π/2) / d) · S^T · z
Guarantee: E[<y, Q_qjl^{-1}(Q_qjl(r))>] = <y, r> (unbiased).
- turboquant.qjl_encode(residual: ndarray[tuple[Any, ...], dtype[float64]], S: ndarray[tuple[Any, ...], dtype[float64]]) tuple[ndarray[tuple[Any, ...], dtype[int8]], float][source]¶
QJL 1-bit sign encoding.
Q_qjl(r) = sign(S · r), stored alongside ||r||₂.
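The unbiasedness guarantee can be checked numerically. The sketch below is self-contained (it does not call the package) and assumes a square d×d Gaussian S, consistent with the 1/d scaling in the decode formula; averaging the decoder output over many fresh sketches recovers the residual:

```python
import numpy as np

rng = np.random.default_rng(7)
d = 64
r = rng.standard_normal(d)              # residual to sketch
y = rng.standard_normal(d)              # query vector

# Average the decoder output over many fresh Gaussian sketches S.
trials = 400
r_mean = np.zeros(d)
for _ in range(trials):
    S = rng.standard_normal((d, d))
    signs = np.sign(S @ r)                                    # Q_qjl(r): 1 bit per row
    r_hat = np.sqrt(np.pi / 2) / d * np.linalg.norm(r) * (S.T @ signs)
    r_mean += r_hat
r_mean /= trials

# Unbiasedness: E[r_hat] = r, so <y, r_hat> estimates <y, r>.
```

With 400 sketches the averaged reconstruction tracks r closely, and `y @ r_mean` approximates the true inner product `y @ r`.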
- turboquant.turboquant_mse(x: ndarray[tuple[Any, ...], dtype[float64]], bits: int = 3, rotation_seed: int = 42) TQBlock[source]¶
TurboQuant MSE-optimal quantization (Stage 1 only).
Per Algorithm 1 (Zandieh et al. 2026, arXiv:2504.19874):
1. Rotate: y = Π · x
2. Normalize to a unit vector, store the norm
3. Scalar-quantize each coordinate of y_unit via the Lloyd-Max codebook
4. Store the indices
MSE distortion bound: D_mse ≤ (√3π / 2) · (1 / 4^b).
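The four steps can be sketched end to end in plain NumPy. This is an illustrative stand-in, not the library code: it substitutes a uniform codebook spanning ±3/√d (rotated unit-vector coordinates concentrate near N(0, 1/d)) for the Lloyd-Max codebook, so its distortion is somewhat worse than the bound above:

```python
import numpy as np

def tq_mse_sketch(x, bits=3, seed=42):
    """Illustrative Stage-1 pipeline: rotate → normalize → quantize.

    Uses a uniform codebook on ±3/√d instead of the paper's
    Lloyd-Max codebook (hypothetical helper, not the package API).
    """
    rng = np.random.default_rng(seed)
    d = x.size
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    Q = Q * np.sign(np.diag(R))           # Haar-uniform rotation Π
    y = Q @ x                             # 1. rotate
    norm = np.linalg.norm(y)              # 2. store the norm
    levels = np.linspace(-3, 3, 2 ** bits) / np.sqrt(d)
    idx = np.argmin(np.abs(y[:, None] / norm - levels), axis=1)
    return idx.astype(np.uint8), norm, Q  # 3./4. indices + metadata

def tq_mse_decode_sketch(idx, norm, Q, bits=3):
    d = Q.shape[0]
    levels = np.linspace(-3, 3, 2 ** bits) / np.sqrt(d)
    return Q.T @ (levels[idx] * norm)     # undo the rotation

x = np.random.default_rng(0).standard_normal(256)
idx, norm, Q = tq_mse_sketch(x, bits=3)
x_hat = tq_mse_decode_sketch(idx, norm, Q)
cos = x @ x_hat / (np.linalg.norm(x) * np.linalg.norm(x_hat))
```

Even with the cruder codebook, 3 bits at d = 256 keeps the reconstruction well aligned with the input.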
- turboquant.turboquant_mse_decode(block: TQBlock) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Decode a TurboQuant MSE block back to a vector.
- Parameters:
block (TQBlock) – Compressed block from turboquant_mse().
- Returns:
Reconstructed vector.
- Return type:
ndarray of float64
- turboquant.turboquant_prod(x: ndarray[tuple[Any, ...], dtype[float64]], bits: int = 3, rotation_seed: int = 42, qjl_seed: int = 137) TQBlock[source]¶
TurboQuant inner-product-optimal quantization (Stage 1 + QJL).
Applies MSE quantization with (bits-1) bits, then QJL error correction on the residual. Inner-product distortion bound:
D_prod ≤ (√3π² · ||y||² / d) · (1 / 4^b)
- Parameters:
x (ndarray of shape (d,)) – Input vector.
bits (int) – Total quantization bits (Stage 1 uses bits-1).
rotation_seed (int) – Rotation seed.
qjl_seed (int) – QJL projection seed.
- Returns:
Compressed representation with QJL error correction.
- Return type:
TQBlock
- turboquant.turboquant_prod_decode(block: TQBlock) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Decode a TurboQuant inner-product block.
- Parameters:
block (TQBlock) – Compressed block from turboquant_prod().
- Returns:
Reconstructed vector (MSE reconstruction + QJL residual estimate).
- Return type:
ndarray of float64
TurboQuant — data-oblivious vector quantization for ESML.
Implements the TurboQuant two-stage quantization pipeline from the ICLR 2026 paper (arxiv.org/abs/2504.19874).
What this module IS: A standalone compression library that quantizes arbitrary vectors using rotation + Lloyd-Max scalar quantization + optional QJL 1-bit error correction. Validated against the paper’s theoretical bounds.
What this module is NOT: An inference-time KV-cache optimizer. It does not
hook into Ollama, llama.cpp, or HuggingFace transformers attention layers.
For runtime KV-cache compression with Ollama, use OLLAMA_KV_CACHE_TYPE=q8_0.
Algorithms¶
- Stage 1 — TurboQuant_MSE (Zandieh et al. 2026, arXiv:2504.19874):
Random rotation via QR decomposition (Python) or WHT (C)
Normalize to unit vector, store L2 norm
Scalar-quantize each coordinate via Lloyd-Max codebook optimized for the Beta((d-1)/2, (d-1)/2) distribution
- Stage 2 — QJL (Zandieh et al. 2025, arXiv:2406.03482):
Compute residual: r = x - dequant(quant(x))
1-bit sign encoding via random projection: sign(S·r)
Guarantees unbiased inner-product estimation: E[<y, r̂>] = <y, r>
- Also includes PolarQuant utilities (Han et al. 2026, arXiv:2502.02617):
polar_transform() and inverse_polar() for recursive Cartesian→Polar, but these are NOT used by turboquant_mse(). They are available for research and comparison.
Theoretical Bounds¶
MSE distortion: D_mse ≤ (√3π / 2) · (1 / 4^b) ≈ 2.72 / 4^b
Inner-product distortion: D_prod ≤ (√3π² · ||y||² / d) · (1 / 4^b)
Info-theoretic lower bound: D_mse ≥ 1 / 4^b (gap: ~2.72×)
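Evaluating the MSE bound for b = 3 and b = 4 reproduces the bound values quoted in the validated results:

```python
import numpy as np

# Upper bound (√3·π/2)/4^b vs the info-theoretic floor 1/4^b.
for b in (3, 4):
    upper = np.sqrt(3) * np.pi / 2 / 4 ** b
    lower = 1 / 4 ** b
    print(f"b={b}: D_mse ≤ {upper:.4f}, floor {lower:.4f}, gap {upper / lower:.2f}×")
```

The gap to the lower bound is the constant √3π/2 ≈ 2.72 at every bit width.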
Validated Results (2026-03-31)¶
3-bit, d=256: MSE=0.028 (bound=0.043), cosine=0.987, compression=5.1×
4-bit, d=256: MSE=0.009 (bound=0.011), cosine=0.996, compression=3.9×
- class turboquant.quant.TQBlock(d: int, bits: int | list[int], radius: float, angle_indices: list[ndarray[tuple[Any, ...], dtype[uint8]]], qjl_signs: ndarray[tuple[Any, ...], dtype[int8]] | None = None, qjl_norm: float = 0.0, rotation_seed: int = 42, qjl_seed: int | None = None)[source]¶
Bases: object
Storage format for a TurboQuant-compressed vector block.
- qjl_signs¶
QJL error-correction signs (Stage 2 only).
- Type:
ndarray[int8] or None
- turboquant.quant.compress_kv_cache(keys: ndarray[tuple[Any, ...], dtype[float64]], values: ndarray[tuple[Any, ...], dtype[float64]], bits: int = 3, method: str = 'mse', rotation_seed: int = 42, qjl_seed: int = 137) tuple[list[TQBlock], list[TQBlock]][source]¶
Compress attention KV cache using TurboQuant.
- Parameters:
keys (ndarray of shape (n_tokens, d)) – Key vectors from attention layer.
values (ndarray of shape (n_tokens, d)) – Value vectors from attention layer.
bits (int) – Quantization bits.
method (str) – "mse" for Stage 1 only, "prod" for Stage 1 + QJL.
rotation_seed (int) – Shared rotation seed for all vectors.
qjl_seed (int) – QJL seed (only for method="prod").
- Returns:
key_blocks (list of TQBlock)
value_blocks (list of TQBlock)
- turboquant.quant.decompress_kv_cache(key_blocks: list[TQBlock], value_blocks: list[TQBlock], method: str = 'mse') tuple[ndarray[tuple[Any, ...], dtype[float64]], ndarray[tuple[Any, ...], dtype[float64]]][source]¶
Decompress KV cache blocks back to arrays.
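A toy round trip through this compress/decompress pair can be sketched with a simplified per-row quantizer — a (norm, 3-bit indices) pair standing in for a real TQBlock. The function names are illustrative, not the package API:

```python
import numpy as np

def compress_sketch(mat, bits=3):
    """Per-row (indices, norm) stand-in for a list of TQBlocks."""
    levels = np.linspace(-1, 1, 2 ** bits)
    d = mat.shape[1]
    blocks = []
    for row in mat:
        norm = np.linalg.norm(row)
        # Unit-vector coords are ~N(0, 1/d); map ±3σ into [-1, 1].
        scaled = np.clip(row / norm * np.sqrt(d) / 3, -1, 1)
        idx = np.argmin(np.abs(scaled[:, None] - levels), axis=1)
        blocks.append((idx.astype(np.uint8), norm))
    return blocks

def decompress_sketch(blocks, d, bits=3):
    levels = np.linspace(-1, 1, 2 ** bits)
    return np.vstack([levels[idx] * 3 / np.sqrt(d) * norm
                      for idx, norm in blocks])

rng = np.random.default_rng(1)
keys = rng.standard_normal((16, 128))      # (n_tokens, d) key vectors
key_blocks = compress_sketch(keys)
keys_hat = decompress_sketch(key_blocks, d=128)
```

The real pipeline additionally rotates each vector with the shared rotation_seed and uses the Lloyd-Max codebook, which is what buys the distortion bounds above.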
- turboquant.quant.dequantize_angles(indices: list[ndarray[tuple[Any, ...], dtype[uint8]]], d: int, bits: int | list[int]) list[ndarray[tuple[Any, ...], dtype[float64]]][source]¶
Dequantize angle indices back to angles.
- turboquant.quant.get_codebook(d: int, bits: int) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Get or compute the Lloyd-Max codebook for (d, bits).
- turboquant.quant.inner_product_distortion_bound(bits: int, norm_sq: float, d: int) float[source]¶
Theoretical inner-product distortion bound.
D_prod ≤ (√3π² · ||y||² / d) · (1/4^b)
- turboquant.quant.inverse_polar(radius: float, angles: list[ndarray[tuple[Any, ...], dtype[float64]]]) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Inverse polar transform: reconstruct Cartesian vector from polar.
- Parameters:
radius (float) – ||x||₂
angles (list of ndarray) – As returned by polar_transform().
- Returns:
Reconstructed vector.
- Return type:
ndarray of float64
- turboquant.quant.lloyd_max_codebook(d: int, bits: int, n_iter: int = 200) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Compute optimal Lloyd-Max codebook for Beta-distributed coordinates.
- Solves the continuous 1-D k-means problem:
min_{c_1,…,c_K} E[ min_k |X - c_k|² ]
where X ~ Beta((d-1)/2, (d-1)/2) scaled to [-1, 1].
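The continuous problem above can be approximated with ordinary Lloyd iterations (1-D k-means) on samples of the scaled Beta distribution. A hedged sketch, not the package's solver:

```python
import numpy as np

def lloyd_max_sketch(d, bits, n_samples=100_000, n_iter=30, seed=0):
    """Approximate Lloyd-Max codebook: 1-D k-means on samples of
    X ~ Beta((d-1)/2, (d-1)/2) rescaled from [0, 1] to [-1, 1]."""
    rng = np.random.default_rng(seed)
    x = 2.0 * rng.beta((d - 1) / 2, (d - 1) / 2, size=n_samples) - 1.0
    k = 2 ** bits
    c = np.quantile(x, (np.arange(k) + 0.5) / k)    # quantile init
    for _ in range(n_iter):
        # Assign each sample to the nearest centroid, then recenter.
        idx = np.argmin(np.abs(x[:, None] - c[None, :]), axis=1)
        for j in range(k):
            sel = x[idx == j]
            if sel.size:
                c[j] = sel.mean()
    return np.sort(c)

cb = lloyd_max_sketch(d=256, bits=3)
```

Because the Beta distribution is symmetric about 0 after rescaling, the resulting codebook is (up to sampling noise) symmetric about 0.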
- turboquant.quant.mse_distortion_bound(bits: int) float[source]¶
Theoretical MSE distortion upper bound: D_mse ≤ (√3π/2) · (1/4^b).
- turboquant.quant.pack_indices(indices: ndarray[tuple[Any, ...], dtype[uint8]], bits: int) bytes[source]¶
Pack uint8 indices into a compact byte representation.
- turboquant.quant.polar_transform(x: ndarray[tuple[Any, ...], dtype[float64]]) tuple[float, list[ndarray[tuple[Any, ...], dtype[float64]]]][source]¶
Recursive Cartesian→Polar transform in O(d log d).
- Parameters:
x (ndarray of shape (d,)) – Input vector (should be rotated first: y = Π · x).
- Returns:
radius (float) – ||x||₂
angles (list of ndarray) – angles[ℓ] contains the angles at level ℓ (0-indexed). Level 0: d/2 angles in [0, 2π) (atan2 of adjacent pairs). Level ℓ ≥ 1: d/2^(ℓ+1) angles in [0, π/2] (atan2 of norm ratios).
- turboquant.quant.qjl_decode(signs: ndarray[tuple[Any, ...], dtype[int8]], norm: float, S: ndarray[tuple[Any, ...], dtype[float64]]) ndarray[tuple[Any, ...], dtype[float64]][source]¶
QJL dequantization — unbiased inner-product estimation.
Q_qjl^{-1}(z) = (√(π/2) / d) · S^T · z
Guarantee: E[<y, Q_qjl^{-1}(Q_qjl(r))>] = <y, r> (unbiased).
- turboquant.quant.qjl_encode(residual: ndarray[tuple[Any, ...], dtype[float64]], S: ndarray[tuple[Any, ...], dtype[float64]]) tuple[ndarray[tuple[Any, ...], dtype[int8]], float][source]¶
QJL 1-bit sign encoding.
Q_qjl(r) = sign(S · r), stored alongside ||r||₂.
- turboquant.quant.qjl_projection_matrix(d: int, seed: int | None = None) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Generate the random projection matrix S for QJL.
- turboquant.quant.quantize_angles(angles: list[ndarray[tuple[Any, ...], dtype[float64]]], d: int, bits: int | list[int]) list[ndarray[tuple[Any, ...], dtype[uint8]]][source]¶
Quantize polar angles using Lloyd-Max codebooks.
- Parameters:
angles (list of ndarray) – Polar angles as returned by polar_transform().
d (int) – Vector dimension.
bits (int or list of int) – Bits per level.
- Returns:
Quantized index arrays, one per level.
- Return type:
list of ndarray[uint8]
- turboquant.quant.rotation_matrix(d: int, seed: int | None = None) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Generate a random orthogonal rotation matrix via QR decomposition.
The rotation randomizes coordinate-wise distributions so that angular components after the polar transform follow a concentrated, predictable distribution — eliminating the need for per-block normalization constants.
- Parameters:
d (int) – Matrix dimension.
seed (int, optional) – RNG seed for reproducibility.
- Returns:
Orthogonal matrix satisfying Π^T · Π = I.
- Return type:
ndarray of shape (d, d)
Notes
Uses the QR decomposition of a matrix with i.i.d. N(0,1) entries, with sign correction to ensure a uniform distribution over O(d).
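The QR-with-sign-correction recipe in the notes fits in a few lines; this sketch of the same construction is illustrative, not the package function:

```python
import numpy as np

def rotation_matrix_sketch(d, seed=None):
    """QR of an i.i.d. N(0,1) matrix, with column signs fixed so that
    diag(R) > 0 — yielding a Haar-uniform orthogonal matrix."""
    rng = np.random.default_rng(seed)
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    return Q * np.sign(np.diag(R))      # per-column sign correction

P = rotation_matrix_sketch(64, seed=42)
```

Without the sign correction, `np.linalg.qr` alone does not sample uniformly over O(d): the signs of R's diagonal bias the distribution of Q.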
- turboquant.quant.turboquant_mse(x: ndarray[tuple[Any, ...], dtype[float64]], bits: int = 3, rotation_seed: int = 42) TQBlock[source]¶
TurboQuant MSE-optimal quantization (Stage 1 only).
Per Algorithm 1 (Zandieh et al. 2026, arXiv:2504.19874):
1. Rotate: y = Π · x
2. Normalize to a unit vector, store the norm
3. Scalar-quantize each coordinate of y_unit via the Lloyd-Max codebook
4. Store the indices
MSE distortion bound: D_mse ≤ (√3π / 2) · (1 / 4^b).
- turboquant.quant.turboquant_mse_decode(block: TQBlock) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Decode a TurboQuant MSE block back to a vector.
- Parameters:
block (TQBlock) – Compressed block from turboquant_mse().
- Returns:
Reconstructed vector.
- Return type:
ndarray of float64
- turboquant.quant.turboquant_prod(x: ndarray[tuple[Any, ...], dtype[float64]], bits: int = 3, rotation_seed: int = 42, qjl_seed: int = 137) TQBlock[source]¶
TurboQuant inner-product-optimal quantization (Stage 1 + QJL).
Applies MSE quantization with (bits-1) bits, then QJL error correction on the residual. Inner-product distortion bound:
D_prod ≤ (√3π² · ||y||² / d) · (1 / 4^b)
- Parameters:
x (ndarray of shape (d,)) – Input vector.
bits (int) – Total quantization bits (Stage 1 uses bits-1).
rotation_seed (int) – Rotation seed.
qjl_seed (int) – QJL projection seed.
- Returns:
Compressed representation with QJL error correction.
- Return type:
TQBlock
- turboquant.quant.turboquant_prod_decode(block: TQBlock) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Decode a TurboQuant inner-product block.
- Parameters:
block (TQBlock) – Compressed block from turboquant_prod().
- Returns:
Reconstructed vector (MSE reconstruction + QJL residual estimate).
- Return type:
ndarray of float64
- turboquant.quant.unpack_indices(data: bytes, bits: int, count: int) ndarray[tuple[Any, ...], dtype[uint8]][source]¶
Unpack bit-packed indices back to uint8 array.
- Parameters:
data (bytes) – Packed byte string from pack_indices().
bits (int) – Bits per index.
count (int) – Number of indices to unpack.
- Return type:
ndarray of uint8
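A bit-exact pack/unpack round trip can be sketched with NumPy's packbits/unpackbits. The MSB-first layout below is illustrative and need not match the package's internal byte format:

```python
import numpy as np

def pack_sketch(indices, bits):
    """Pack b-bit uint8 indices into bytes via a flat MSB-first bit stream."""
    # unpackbits gives 8 MSB-first bits per value; keep the low `bits`.
    stream = np.unpackbits(indices[:, None], axis=1)[:, 8 - bits:]
    return np.packbits(stream.ravel()).tobytes()

def unpack_sketch(data, bits, count):
    """Inverse of pack_sketch: recover `count` b-bit indices."""
    stream = np.unpackbits(np.frombuffer(data, dtype=np.uint8))
    stream = stream[: count * bits].reshape(count, bits)
    pad = np.zeros((count, 8 - bits), dtype=np.uint8)   # restore high bits
    return np.packbits(np.hstack([pad, stream]), axis=1).ravel()

idx = np.array([5, 0, 7, 3, 1], dtype=np.uint8)
data = pack_sketch(idx, bits=3)        # 15 bits → 2 bytes
out = unpack_sketch(data, bits=3, count=5)
```

At 3 bits per index this packs 256 indices into 96 bytes instead of 256, which is where the block sizes in quant_bridge come from.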
- turboquant.quant.verify_orthogonal(Q: ndarray[tuple[Any, ...], dtype[float64]], atol: float = 1e-10) bool[source]¶
Verify that Q is orthogonal: Q^T · Q ≈ I.
Python ctypes bridge for GGML TurboQuant C kernels.
Provides GGMLTurboQuant which loads the compiled shared library
and exposes quantize/dequantize/dot_product as Python callables.
The C implementation lives in quant_ggml.h (header) and the eventual
quant_ggml.c (to be compiled). Until the C library is built, this
module falls back to the pure-NumPy implementation in esml.quant.
Usage¶
>>> from turboquant.quant_bridge import GGMLTurboQuant
>>> tq = GGMLTurboQuant()
>>> if tq.available:
... block = tq.quantize(vector, bits=3)
... else:
... # Falls back to pure Python
... from turboquant.quant import turboquant_mse
... block = turboquant_mse(vector, bits=3)
- class turboquant.quant_bridge.BlockTBQ2[source]¶
Bases: Structure
ctypes mirror of block_tbq2_0 (70 bytes per 256 elements).
- indices¶
Structure/Union member
- norm¶
Structure/Union member
- seed¶
Structure/Union member
- class turboquant.quant_bridge.BlockTBQ3[source]¶
Bases: Structure
ctypes mirror of block_tbq3_0 (102 bytes per 256 elements).
- indices¶
Structure/Union member
- norm¶
Structure/Union member
- seed¶
Structure/Union member
- class turboquant.quant_bridge.BlockTBQ4[source]¶
Bases: Structure
ctypes mirror of block_tbq4_0 (134 bytes per 256 elements).
- indices¶
Structure/Union member
- norm¶
Structure/Union member
- seed¶
Structure/Union member
- class turboquant.quant_bridge.GGMLTurboQuant(lib_path: str | Path | None = None)[source]¶
Bases: object
Python interface to the GGML TurboQuant C library. Falls back to pure NumPy when the shared library isn’t compiled.
- Parameters:
lib_path (str or Path, optional) – Path to the compiled shared library (.dylib/.so). If None, searches the package directory.
- dequantize(block: Any, bits: int = 3) ndarray[tuple[Any, ...], dtype[float32]][source]¶
Dequantize a block back to a float32 vector.
- turboquant.quant_bridge.compile_ggml_lib(output_dir: str | Path | None = None, optimize: bool = True) Path | None[source]¶
Attempt to compile the GGML TurboQuant C library.
Requires cc (clang or gcc) on the system.
TurboQuant-compressed KV cache for ESML’s inference engine.
Stores attention key/value vectors as compressed TQBlocks instead of raw float tensors, achieving 4-6x memory reduction during inference.
This plugs into esml.engine.ESMLEngine to provide KV-cache
compression during actual transformer attention computation.
References
TurboQuant: Zandieh et al. (2026). ICLR 2026. arXiv:2504.19874
QJL: Zandieh et al. (2025). AAAI 2025. arXiv:2406.03482
- class turboquant.kv_cache.CacheStats(compressed_bytes: int = 0, uncompressed_bytes: int = 0, n_tokens: int = 0, n_layers: int = 0)[source]¶
Bases: object
Memory statistics for a TurboQuantKVCache.
- class turboquant.kv_cache.TurboQuantKVCache(n_layers: int, head_dim: int, bits: int = 3, rotation_seed: int = 42)[source]¶
Bases: object
KV cache that stores keys and values as TurboQuant-compressed blocks.
Each key/value vector is quantized via esml.quant.turboquant_mse() on append(), and decompressed on get_keys()/get_values().
- Parameters:
n_layers (int) – Number of transformer layers.
head_dim (int) – Dimension of each key/value vector.
bits (int) – Quantization bits.
rotation_seed (int) – Shared rotation seed for all vectors.
Examples
>>> cache = TurboQuantKVCache(n_layers=32, head_dim=128, bits=3)
>>> k = np.random.randn(128)
>>> v = np.random.randn(128)
>>> cache.append(layer=0, k_vec=k, v_vec=v)
>>> keys = cache.get_keys(0)    # (1, 128) decompressed
>>> values = cache.get_values(0)
>>> cache.stats.compression_ratio
5.1
- append(layer: int, k_vec: ndarray[tuple[Any, ...], dtype[float64]], v_vec: ndarray[tuple[Any, ...], dtype[float64]]) None[source]¶
Compress and cache a new key/value pair for one token.
- get_keys(layer: int) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Decompress all cached keys for a layer.
- get_values(layer: int) ndarray[tuple[Any, ...], dtype[float64]][source]¶
Decompress all cached values for a layer.
- property stats: CacheStats¶
Compute memory statistics.