turboquant

KV-cache compression using the TurboQuant algorithm (ICLR 2026). Data-oblivious and unbiased. Pure-NumPy reference implementation with a C kernel in quant_ggml.c for production use; quant_bridge.py loads the shared library and falls back to NumPy when the .dylib / .so is missing.

Source: llamaclaw/turboquant.

TurboQuant — KV-cache compression for LLMs.

Polar transform + QJL projection + TurboQuant MSE/prod quantizers. Data-oblivious (no calibration), unbiased (E[x_hat] = x).

References

Zandieh, A. et al. (ICLR 2026). arXiv:2504.19874.

class turboquant.GGMLTurboQuant(lib_path: str | Path | None = None)[source]

Bases: object

Python interface to the GGML TurboQuant C library.

Falls back to pure-NumPy when the shared library isn’t compiled.

Parameters:

lib_path (str or Path, optional) – Path to the compiled shared library (.dylib / .so). If None, searches in the package directory.

property available: bool

True if the C library is loaded and ready.

dequantize(block: Any, bits: int = 3) ndarray[tuple[Any, ...], dtype[float32]][source]

Dequantize a block back to float32 vector.

quantize(vector: ndarray[tuple[Any, ...], dtype[float32]], bits: int = 3, seed: int = 42) Any[source]

Quantize a vector using the C kernel or NumPy fallback.

Parameters:
  • vector (ndarray of float32, shape (d,)) – Input vector. d must be 256 (TQ_BLOCK_SIZE).

  • bits (int) – Quantization bits (2, 3, or 4).

  • seed (int) – Rotation matrix seed.

Returns:

Compressed block.

Return type:

TQBlock or ctypes Structure

class turboquant.TQBlock(d: int, bits: int | list[int], radius: float, angle_indices: list[ndarray[tuple[Any, ...], dtype[uint8]]], qjl_signs: ndarray[tuple[Any, ...], dtype[int8]] | None = None, qjl_norm: float = 0.0, rotation_seed: int = 42, qjl_seed: int | None = None)[source]

Bases: object

Storage format for a TurboQuant-compressed vector block.

d

Original dimension.

Type:

int

bits

Bits per level.

Type:

int or list of int

radius

||x||₂ (stored in full precision).

Type:

float

angle_indices

Quantized angle indices per level.

Type:

list of ndarray[uint8]

qjl_signs

QJL error-correction signs (Stage 2 only).

Type:

ndarray[int8] or None

qjl_norm

Residual norm for QJL decode.

Type:

float

rotation_seed

Seed used for the rotation matrix (reproducibility).

Type:

int

qjl_seed

Seed used for QJL projection matrix.

Type:

int or None

angle_indices: list[ndarray[tuple[Any, ...], dtype[uint8]]]
bits: int | list[int]
property compression_ratio: float

Compression ratio vs FP16 storage.

d: int
qjl_norm: float = 0.0
qjl_seed: int | None = None
qjl_signs: ndarray[tuple[Any, ...], dtype[int8]] | None = None
radius: float
rotation_seed: int = 42
property total_bits: int

Total storage bits (excluding metadata).

turboquant.compress_kv_cache(keys: ndarray[tuple[Any, ...], dtype[float64]], values: ndarray[tuple[Any, ...], dtype[float64]], bits: int = 3, method: str = 'mse', rotation_seed: int = 42, qjl_seed: int = 137) tuple[list[TQBlock], list[TQBlock]][source]

Compress attention KV cache using TurboQuant.

Parameters:
  • keys (ndarray of shape (n_tokens, d)) – Key vectors from attention layer.

  • values (ndarray of shape (n_tokens, d)) – Value vectors from attention layer.

  • bits (int) – Quantization bits.

  • method (str) – "mse" for Stage 1 only, "prod" for Stage 1 + QJL.

  • rotation_seed (int) – Shared rotation seed for all vectors.

  • qjl_seed (int) – QJL seed (only for method="prod").

Returns:

  • key_blocks (list of TQBlock)

  • value_blocks (list of TQBlock)

turboquant.decompress_kv_cache(key_blocks: list[TQBlock], value_blocks: list[TQBlock], method: str = 'mse') tuple[ndarray[tuple[Any, ...], dtype[float64]], ndarray[tuple[Any, ...], dtype[float64]]][source]

Decompress KV cache blocks back to arrays.

Parameters:
  • key_blocks (list of TQBlock) – Compressed key blocks from compress_kv_cache().

  • value_blocks (list of TQBlock) – Compressed value blocks.

  • method (str) – "mse" or "prod"; must match the method used for compression.

Returns:

  • keys (ndarray of shape (n_tokens, d))

  • values (ndarray of shape (n_tokens, d))

turboquant.inverse_polar(radius: float, angles: list[ndarray[tuple[Any, ...], dtype[float64]]]) ndarray[tuple[Any, ...], dtype[float64]][source]

Inverse polar transform: reconstruct Cartesian vector from polar.

Parameters:
  • radius (float) – ||x||₂, as returned by polar_transform().

  • angles (list of ndarray) – Per-level angle arrays from polar_transform().

Returns:

Reconstructed vector.

Return type:

ndarray of shape (d,)

turboquant.polar_transform(x: ndarray[tuple[Any, ...], dtype[float64]]) tuple[float, list[ndarray[tuple[Any, ...], dtype[float64]]]][source]

Recursive Cartesian→Polar transform in O(d log d).

Parameters:

x (ndarray of shape (d,)) – Input vector (should be rotated first: y = Π · x).

Returns:

  • radius (float) – ||x||₂

  • angles (list of ndarray) – angles[ℓ] contains the angles at level ℓ (0-indexed). Level 0: d/2 angles in [0, 2π) (atan2 of adjacent pairs). Level ℓ ≥ 1: d/2^(ℓ+1) angles in [0, π/2] (atan2 of norm ratios).
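
As a concrete illustration of the recursion, here is a minimal self-contained NumPy sketch of the transform and its inverse. The function names are illustrative, not the module's API:

```python
import numpy as np

def polar_transform_sketch(x):
    """Recursive Cartesian→Polar: radius plus one angle array per level."""
    # Level 0: one angle in [0, 2π) per adjacent pair, via atan2
    angles = [np.mod(np.arctan2(x[1::2], x[0::2]), 2 * np.pi)]
    norms = np.hypot(x[0::2], x[1::2])
    # Levels ℓ ≥ 1: angles in [0, π/2] from ratios of adjacent pair norms
    while norms.size > 1:
        angles.append(np.arctan2(norms[1::2], norms[0::2]))
        norms = np.hypot(norms[0::2], norms[1::2])
    return float(norms[0]), angles

def inverse_polar_sketch(radius, angles):
    """Expand the norm tree back down from the radius, then emit coordinates."""
    norms = np.array([radius])
    for a in reversed(angles[1:]):
        child = np.empty(2 * norms.size)
        child[0::2] = norms * np.cos(a)
        child[1::2] = norms * np.sin(a)
        norms = child
    x = np.empty(2 * norms.size)
    x[0::2] = norms * np.cos(angles[0])
    x[1::2] = norms * np.sin(angles[0])
    return x

x = np.random.default_rng(0).standard_normal(16)  # d must be a power of 2
radius, angles = polar_transform_sketch(x)
assert np.allclose(inverse_polar_sketch(radius, angles), x)
```

The level sizes match the docstring: for d = 16 the angle arrays have lengths 8, 4, 2, 1.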

turboquant.qjl_decode(signs: ndarray[tuple[Any, ...], dtype[int8]], norm: float, S: ndarray[tuple[Any, ...], dtype[float64]]) ndarray[tuple[Any, ...], dtype[float64]][source]

QJL dequantization — unbiased inner-product estimation.

Q_qjl^{-1}(z) = (√(π/2) / d) · S^T · z

Guarantee: E[<y, Q_qjl^{-1}(Q_qjl(r))>] = <y, r> (unbiased).

Parameters:
  • signs (ndarray of shape (d,)) – Sign bits from qjl_encode().

  • norm (float) – ||r||₂ from encoding.

  • S (ndarray of shape (d, d)) – Same projection matrix used for encoding.

Returns:

Reconstructed residual estimate.

Return type:

ndarray of shape (d,)

turboquant.qjl_encode(residual: ndarray[tuple[Any, ...], dtype[float64]], S: ndarray[tuple[Any, ...], dtype[float64]]) tuple[ndarray[tuple[Any, ...], dtype[int8]], float][source]

QJL 1-bit sign encoding.

Q_qjl(r) = sign(S · r), stored alongside ||r||₂.

Parameters:
  • residual (ndarray of shape (d,)) – Quantization residual r = x - dequant(quant(x)).

  • S (ndarray of shape (d, d)) – Random projection matrix.

Returns:

  • signs (ndarray of shape (d,) dtype int8) – Sign bits: +1 or -1.

  • norm (float) – ||r||₂, needed for dequantization.
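
Putting qjl_encode/qjl_decode together, the unbiasedness claim can be checked empirically. Below is a self-contained NumPy sketch with illustrative names; note that the decode here rescales by the stored norm, which is why ||r||₂ must be kept alongside the sign bits:

```python
import numpy as np

def qjl_encode_sketch(r, S):
    # 1-bit code: signs of a random Gaussian projection, plus the exact norm
    return np.sign(S @ r).astype(np.int8), float(np.linalg.norm(r))

def qjl_decode_sketch(signs, norm, S):
    d = S.shape[1]
    # (√(π/2) / d) · Sᵀ · signs, rescaled by the stored residual norm
    return np.sqrt(np.pi / 2) / d * (S.T @ signs) * norm

# Averaged over fresh draws of S, <y, r_hat> converges to <y, r>
rng = np.random.default_rng(0)
d = 64
r = rng.standard_normal(d)
r /= np.linalg.norm(r)  # unit residual keeps the demo numbers small
y = rng.standard_normal(d)
estimates = []
for seed in range(500):
    S = np.random.default_rng(seed).standard_normal((d, d))
    signs, norm = qjl_encode_sketch(r, S)
    estimates.append(y @ qjl_decode_sketch(signs, norm, S))
```

Each single-draw estimate is noisy; it is only the expectation over S that is exact, which is the guarantee the docstrings state.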

turboquant.turboquant_mse(x: ndarray[tuple[Any, ...], dtype[float64]], bits: int = 3, rotation_seed: int = 42) TQBlock[source]

TurboQuant MSE-optimal quantization (Stage 1 only).

Per Algorithm 1 (Zandieh et al. 2026, arXiv:2504.19874):
  1. Rotate: y = Π · x

  2. Normalize to unit vector, store norm

  3. Scalar-quantize each coordinate of y_unit via Lloyd-Max codebook

  4. Store indices

MSE distortion bound: D_mse ≤ (√3π / 2) · (1 / 4^b).

Parameters:
  • x (ndarray of shape (d,)) – Input vector. d must be a power of 2.

  • bits (int) – Quantization bits per coordinate.

  • rotation_seed (int) – Seed for rotation matrix (must be same for encode/decode).

Returns:

Compressed representation.

Return type:

TQBlock
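
The four steps translate almost line-for-line into NumPy. The sketch below substitutes a uniform codebook clipped to ±3/√d for the Lloyd-Max one (a deliberate simplification, so its distortion is somewhat worse than the module's); all names are illustrative, not the package API:

```python
import numpy as np

def rotation(d, seed):
    # Haar-uniform orthogonal matrix: QR of a Gaussian, columns sign-corrected
    rng = np.random.default_rng(seed)
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    return Q * np.sign(np.diag(R))

def mse_encode_sketch(x, bits=3, seed=42):
    y = rotation(x.size, seed) @ x              # 1. rotate
    norm = float(np.linalg.norm(y))             # 2. normalize, store the norm
    u = y / norm
    K, c = 2 ** bits, 3.0 / np.sqrt(x.size)     # uniform codebook on [-c, c]
    idx = np.clip(np.round((u + c) / (2 * c) * K - 0.5), 0, K - 1)
    return norm, idx.astype(np.uint8)           # 3.-4. quantize, store indices

def mse_decode_sketch(norm, idx, bits=3, seed=42):
    d = idx.size
    K, c = 2 ** bits, 3.0 / np.sqrt(d)
    u_hat = (idx.astype(np.float64) + 0.5) * (2 * c) / K - c   # bin midpoints
    return rotation(d, seed).T @ (norm * u_hat)                # un-rotate
```

The ±3/√d range works because rotated unit-vector coordinates concentrate around zero with standard deviation 1/√d, which is exactly the concentration the Lloyd-Max codebook exploits.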

turboquant.turboquant_mse_decode(block: TQBlock) ndarray[tuple[Any, ...], dtype[float64]][source]

Decode a TurboQuant MSE block back to a vector.

Parameters:

block (TQBlock) – Compressed block from turboquant_mse().

Returns:

Reconstructed vector.

Return type:

ndarray of shape (d,)

turboquant.turboquant_prod(x: ndarray[tuple[Any, ...], dtype[float64]], bits: int = 3, rotation_seed: int = 42, qjl_seed: int = 137) TQBlock[source]

TurboQuant inner-product-optimal quantization (Stage 1 + QJL).

Applies MSE quantization with (bits-1) bits, then QJL error correction on the residual. Inner-product distortion bound:

D_prod ≤ (√3π² · ||y||² / d) · (1 / 4^b)

Parameters:
  • x (ndarray of shape (d,)) – Input vector.

  • bits (int) – Total effective bits (Stage 1 uses bits-1, Stage 2 uses 1 bit).

  • rotation_seed (int) – Seed for rotation matrix.

  • qjl_seed (int) – Seed for QJL projection matrix.

Returns:

Compressed representation with QJL error correction.

Return type:

TQBlock

turboquant.turboquant_prod_decode(block: TQBlock) ndarray[tuple[Any, ...], dtype[float64]][source]

Decode a TurboQuant inner-product block.

Parameters:

block (TQBlock) – Compressed block from turboquant_prod().

Returns:

Reconstructed vector (MSE reconstruction + QJL residual estimate).

Return type:

ndarray of shape (d,)

TurboQuant — data-oblivious vector quantization for ESML.

Implements the TurboQuant two-stage quantization pipeline from the ICLR 2026 paper (arxiv.org/abs/2504.19874).

What this module IS: A standalone compression library that quantizes arbitrary vectors using rotation + Lloyd-Max scalar quantization + optional QJL 1-bit error correction. Validated against the paper’s theoretical bounds.

What this module is NOT: An inference-time KV-cache optimizer. It does not hook into Ollama, llama.cpp, or HuggingFace transformers attention layers. For runtime KV-cache compression with Ollama, use OLLAMA_KV_CACHE_TYPE=q8_0.

Algorithms

Stage 1 — TurboQuant_MSE (Zandieh et al. 2026, arXiv:2504.19874):
  1. Random rotation via QR decomposition (Python) or WHT (C)

  2. Normalize to unit vector, store L2 norm

  3. Scalar-quantize each coordinate via Lloyd-Max codebook optimized for the Beta((d-1)/2, (d-1)/2) distribution

Stage 2 — QJL (Zandieh et al. 2025, arXiv:2406.03482):
  1. Compute residual: r = x - dequant(quant(x))

  2. 1-bit sign encoding via random projection: sign(S·r)

  3. Guarantees unbiased inner-product estimation: E[<y, r̂>] = <y, r>

Also includes PolarQuant utilities (Han et al. 2026, arXiv:2502.02617):

polar_transform() and inverse_polar() for recursive Cartesian→Polar, but these are NOT used by turboquant_mse(). They are available for research and comparison.

Theoretical Bounds

MSE distortion: D_mse ≤ (√3π / 2) · (1 / 4^b) ≈ 2.72 / 4^b

Inner-product distortion: D_prod ≤ (√3π² · ||y||² / d) · (1 / 4^b)

Info-theoretic lower bound: D_mse ≥ 1 / 4^b (gap: ~2.72×)

Validated Results (2026-03-31)

3-bit, d=256: MSE=0.028 (bound=0.043), cosine=0.987, compression=5.1×

4-bit, d=256: MSE=0.009 (bound=0.011), cosine=0.996, compression=3.9×

class turboquant.quant.TQBlock(d: int, bits: int | list[int], radius: float, angle_indices: list[ndarray[tuple[Any, ...], dtype[uint8]]], qjl_signs: ndarray[tuple[Any, ...], dtype[int8]] | None = None, qjl_norm: float = 0.0, rotation_seed: int = 42, qjl_seed: int | None = None)[source]

Bases: object

Storage format for a TurboQuant-compressed vector block.

d

Original dimension.

Type:

int

bits

Bits per level.

Type:

int or list of int

radius

||x||₂ (stored in full precision).

Type:

float

angle_indices

Quantized angle indices per level.

Type:

list of ndarray[uint8]

qjl_signs

QJL error-correction signs (Stage 2 only).

Type:

ndarray[int8] or None

qjl_norm

Residual norm for QJL decode.

Type:

float

rotation_seed

Seed used for the rotation matrix (reproducibility).

Type:

int

qjl_seed

Seed used for QJL projection matrix.

Type:

int or None

angle_indices: list[ndarray[tuple[Any, ...], dtype[uint8]]]
bits: int | list[int]
property compression_ratio: float

Compression ratio vs FP16 storage.

d: int
qjl_norm: float = 0.0
qjl_seed: int | None = None
qjl_signs: ndarray[tuple[Any, ...], dtype[int8]] | None = None
radius: float
rotation_seed: int = 42
property total_bits: int

Total storage bits (excluding metadata).

turboquant.quant.compress_kv_cache(keys: ndarray[tuple[Any, ...], dtype[float64]], values: ndarray[tuple[Any, ...], dtype[float64]], bits: int = 3, method: str = 'mse', rotation_seed: int = 42, qjl_seed: int = 137) tuple[list[TQBlock], list[TQBlock]][source]

Compress attention KV cache using TurboQuant.

Parameters:
  • keys (ndarray of shape (n_tokens, d)) – Key vectors from attention layer.

  • values (ndarray of shape (n_tokens, d)) – Value vectors from attention layer.

  • bits (int) – Quantization bits.

  • method (str) – "mse" for Stage 1 only, "prod" for Stage 1 + QJL.

  • rotation_seed (int) – Shared rotation seed for all vectors.

  • qjl_seed (int) – QJL seed (only for method="prod").

Returns:

  • key_blocks (list of TQBlock)

  • value_blocks (list of TQBlock)

turboquant.quant.decompress_kv_cache(key_blocks: list[TQBlock], value_blocks: list[TQBlock], method: str = 'mse') tuple[ndarray[tuple[Any, ...], dtype[float64]], ndarray[tuple[Any, ...], dtype[float64]]][source]

Decompress KV cache blocks back to arrays.

Parameters:
  • key_blocks (list of TQBlock) – Compressed key blocks from compress_kv_cache().

  • value_blocks (list of TQBlock) – Compressed value blocks.

  • method (str) – "mse" or "prod"; must match the method used for compression.

Returns:

  • keys (ndarray of shape (n_tokens, d))

  • values (ndarray of shape (n_tokens, d))

turboquant.quant.dequantize_angles(indices: list[ndarray[tuple[Any, ...], dtype[uint8]]], d: int, bits: int | list[int]) list[ndarray[tuple[Any, ...], dtype[float64]]][source]

Dequantize angle indices back to angles.

Parameters:
  • indices (list of ndarray[uint8]) – As returned by quantize_angles().

  • d (int) – Original vector dimension.

  • bits (int or list of int) – Bits per level (must match quantization).

Returns:

Reconstructed angles.

Return type:

list of ndarray

turboquant.quant.get_codebook(d: int, bits: int) ndarray[tuple[Any, ...], dtype[float64]][source]

Get or compute the Lloyd-Max codebook for (d, bits).

turboquant.quant.inner_product_distortion_bound(bits: int, norm_sq: float, d: int) float[source]

Theoretical inner-product distortion bound.

D_prod ≤ (√3π² · ||y||² / d) · (1/4^b)

turboquant.quant.inverse_polar(radius: float, angles: list[ndarray[tuple[Any, ...], dtype[float64]]]) ndarray[tuple[Any, ...], dtype[float64]][source]

Inverse polar transform: reconstruct Cartesian vector from polar.

Parameters:
  • radius (float) – ||x||₂, as returned by polar_transform().

  • angles (list of ndarray) – Per-level angle arrays from polar_transform().

Returns:

Reconstructed vector.

Return type:

ndarray of shape (d,)

turboquant.quant.lloyd_max_codebook(d: int, bits: int, n_iter: int = 200) ndarray[tuple[Any, ...], dtype[float64]][source]

Compute optimal Lloyd-Max codebook for Beta-distributed coordinates.

Solves the continuous 1-D k-means problem:

min_{c_1,…,c_K} E[ min_k |X - c_k|² ]

where X ~ Beta((d-1)/2, (d-1)/2) scaled to [-1, 1].

Parameters:
  • d (int) – Embedding dimension (determines the Beta distribution shape).

  • bits (int) – Number of quantization bits. K = 2^bits centroids.

  • n_iter (int) – Max Lloyd iterations.

Returns:

Sorted codebook centroids.

Return type:

ndarray of shape (2^bits,)
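
The continuous objective can be approximated by running Lloyd's algorithm on samples drawn from the Beta distribution. A Monte-Carlo sketch with illustrative names (the package may instead solve the integral form directly):

```python
import numpy as np

def lloyd_max_sketch(d, bits, n_samples=100_000, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    # Coordinates of a random unit vector follow Beta((d-1)/2, (d-1)/2) on [-1, 1]
    x = 2.0 * rng.beta((d - 1) / 2, (d - 1) / 2, size=n_samples) - 1.0
    # Initialize centroids at the distribution's quantiles, then iterate Lloyd
    c = np.quantile(x, (np.arange(2 ** bits) + 0.5) / 2 ** bits)
    for _ in range(n_iter):
        edges = (c[:-1] + c[1:]) / 2          # nearest-centroid cell boundaries
        cells = np.searchsorted(edges, x)     # assign each sample to a cell
        for k in range(c.size):
            members = x[cells == k]
            if members.size:
                c[k] = members.mean()         # centroid = conditional mean
    return np.sort(c)
```

For d = 256 the centroids all land well inside ±0.25, reflecting how tightly unit-vector coordinates concentrate around zero.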

turboquant.quant.mse_distortion_bound(bits: int) float[source]

Theoretical MSE distortion upper bound: D_mse ≤ (√3π/2) · (1/4^b).

turboquant.quant.pack_indices(indices: ndarray[tuple[Any, ...], dtype[uint8]], bits: int) bytes[source]

Pack uint8 indices into a compact byte representation.

Parameters:
  • indices (ndarray of uint8) – Quantization indices, each in [0, 2^bits).

  • bits (int) – Bits per index.

Returns:

Packed byte string.

Return type:

bytes
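
Packing matters for the compression ratio: 256 three-bit indices occupy 96 bytes packed versus 256 bytes stored as uint8. One possible LSB-first layout in NumPy (the package's actual byte order may differ):

```python
import numpy as np

def pack_sketch(indices, bits):
    # Keep the low `bits` bits of each index, concatenated LSB-first
    bit_rows = np.unpackbits(indices[:, None], axis=1, bitorder='little')[:, :bits]
    return np.packbits(bit_rows.ravel(), bitorder='little').tobytes()

def unpack_sketch(data, bits, count):
    stream = np.unpackbits(np.frombuffer(data, dtype=np.uint8), bitorder='little')
    bit_rows = stream[:count * bits].reshape(count, bits)
    padded = np.zeros((count, 8), dtype=np.uint8)
    padded[:, :bits] = bit_rows      # re-pad each index to a full byte
    return np.packbits(padded, axis=1, bitorder='little')[:, 0]
```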

turboquant.quant.polar_transform(x: ndarray[tuple[Any, ...], dtype[float64]]) tuple[float, list[ndarray[tuple[Any, ...], dtype[float64]]]][source]

Recursive Cartesian→Polar transform in O(d log d).

Parameters:

x (ndarray of shape (d,)) – Input vector (should be rotated first: y = Π · x).

Returns:

  • radius (float) – ||x||₂

  • angles (list of ndarray) – angles[ℓ] contains the angles at level ℓ (0-indexed). Level 0: d/2 angles in [0, 2π) (atan2 of adjacent pairs). Level ℓ ≥ 1: d/2^(ℓ+1) angles in [0, π/2] (atan2 of norm ratios).

turboquant.quant.qjl_decode(signs: ndarray[tuple[Any, ...], dtype[int8]], norm: float, S: ndarray[tuple[Any, ...], dtype[float64]]) ndarray[tuple[Any, ...], dtype[float64]][source]

QJL dequantization — unbiased inner-product estimation.

Q_qjl^{-1}(z) = (√(π/2) / d) · S^T · z

Guarantee: E[<y, Q_qjl^{-1}(Q_qjl(r))>] = <y, r> (unbiased).

Parameters:
  • signs (ndarray of shape (d,)) – Sign bits from qjl_encode().

  • norm (float) – ||r||₂ from encoding.

  • S (ndarray of shape (d, d)) – Same projection matrix used for encoding.

Returns:

Reconstructed residual estimate.

Return type:

ndarray of shape (d,)

turboquant.quant.qjl_encode(residual: ndarray[tuple[Any, ...], dtype[float64]], S: ndarray[tuple[Any, ...], dtype[float64]]) tuple[ndarray[tuple[Any, ...], dtype[int8]], float][source]

QJL 1-bit sign encoding.

Q_qjl(r) = sign(S · r), stored alongside ||r||₂.

Parameters:
  • residual (ndarray of shape (d,)) – Quantization residual r = x - dequant(quant(x)).

  • S (ndarray of shape (d, d)) – Random projection matrix.

Returns:

  • signs (ndarray of shape (d,) dtype int8) – Sign bits: +1 or -1.

  • norm (float) – ||r||₂, needed for dequantization.

turboquant.quant.qjl_projection_matrix(d: int, seed: int | None = None) ndarray[tuple[Any, ...], dtype[float64]][source]

Generate the random projection matrix S for QJL.

Parameters:
  • d (int) – Dimension (same as the vector dimension).

  • seed (int, optional) – Random seed.

Returns:

i.i.d. N(0, 1) matrix.

Return type:

ndarray of shape (d, d)

turboquant.quant.quantize_angles(angles: list[ndarray[tuple[Any, ...], dtype[float64]]], d: int, bits: int | list[int]) list[ndarray[tuple[Any, ...], dtype[uint8]]][source]

Quantize polar angles using Lloyd-Max codebooks.

Parameters:
  • angles (list of ndarray) – As returned by polar_transform().

  • d (int) – Original vector dimension.

  • bits (int or list of int) – Bits per level. If int, same for all levels. Typical: 4 bits for level 0 (full [0,2π) range), 2-3 for higher.

Returns:

Quantized index arrays, one per level.

Return type:

list of ndarray[uint8]
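
With uniform per-level codebooks standing in for the fitted Lloyd-Max ones, the round trip looks like this (illustrative names; note the differing angle range per level):

```python
import numpy as np

def quantize_angles_sketch(angles, bits):
    # Level 0 spans [0, 2π); every deeper level spans [0, π/2]
    out = []
    for level, a in enumerate(angles):
        hi = 2 * np.pi if level == 0 else np.pi / 2
        K = 2 ** bits
        out.append(np.clip(np.floor(a / hi * K), 0, K - 1).astype(np.uint8))
    return out

def dequantize_angles_sketch(indices, bits):
    out = []
    for level, q in enumerate(indices):
        hi = 2 * np.pi if level == 0 else np.pi / 2
        K = 2 ** bits
        out.append((q.astype(np.float64) + 0.5) * hi / K)  # bin midpoints
    return out
```

With uniform bins the worst-case per-angle error is half a bin width, i.e. hi / 2^(bits+1) for a level with range [0, hi).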

turboquant.quant.rotation_matrix(d: int, seed: int | None = None) ndarray[tuple[Any, ...], dtype[float64]][source]

Generate a random orthogonal rotation matrix via QR decomposition.

The rotation randomizes coordinate-wise distributions so that angular components after the polar transform follow a concentrated, predictable distribution — eliminating the need for per-block normalization constants.

Parameters:
  • d (int) – Dimension of the square rotation matrix.

  • seed (int, optional) – Random seed for reproducibility.

Returns:

Orthogonal matrix satisfying Π^T · Π = I.

Return type:

ndarray of shape (d, d)

Notes

Uses the QR decomposition of a matrix with i.i.d. N(0,1) entries, with sign correction to ensure a uniform distribution over O(d).
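
The Notes condense to a few lines of NumPy; the sign correction is the subtle part, since numpy's QR convention alone does not yield a Haar-uniform matrix. The function name is illustrative:

```python
import numpy as np

def rotation_matrix_sketch(d, seed=None):
    rng = np.random.default_rng(seed)
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    # Scaling each column by sign(R[i, i]) removes the sign convention
    # imposed by the QR routine, restoring uniformity over O(d).
    return Q * np.sign(np.diag(R))
```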

turboquant.quant.turboquant_mse(x: ndarray[tuple[Any, ...], dtype[float64]], bits: int = 3, rotation_seed: int = 42) TQBlock[source]

TurboQuant MSE-optimal quantization (Stage 1 only).

Per Algorithm 1 (Zandieh et al. 2026, arXiv:2504.19874): 1. Rotate: y = Π · x 2. Normalize to unit vector, store norm 3. Scalar-quantize each coordinate of y_unit via Lloyd-Max codebook 4. Store indices

MSE distortion bound: D_mse ≤ (√3π / 2) · (1 / 4^b).

Parameters:
  • x (ndarray of shape (d,)) – Input vector. d must be a power of 2.

  • bits (int) – Quantization bits per coordinate.

  • rotation_seed (int) – Seed for rotation matrix (must be same for encode/decode).

Returns:

Compressed representation.

Return type:

TQBlock

turboquant.quant.turboquant_mse_decode(block: TQBlock) ndarray[tuple[Any, ...], dtype[float64]][source]

Decode a TurboQuant MSE block back to a vector.

Parameters:

block (TQBlock) – Compressed block from turboquant_mse().

Returns:

Reconstructed vector.

Return type:

ndarray of shape (d,)

turboquant.quant.turboquant_prod(x: ndarray[tuple[Any, ...], dtype[float64]], bits: int = 3, rotation_seed: int = 42, qjl_seed: int = 137) TQBlock[source]

TurboQuant inner-product-optimal quantization (Stage 1 + QJL).

Applies MSE quantization with (bits-1) bits, then QJL error correction on the residual. Inner-product distortion bound:

D_prod ≤ (√3π² · ||y||² / d) · (1 / 4^b)

Parameters:
  • x (ndarray of shape (d,)) – Input vector.

  • bits (int) – Total effective bits (Stage 1 uses bits-1, Stage 2 uses 1 bit).

  • rotation_seed (int) – Seed for rotation matrix.

  • qjl_seed (int) – Seed for QJL projection matrix.

Returns:

Compressed representation with QJL error correction.

Return type:

TQBlock

turboquant.quant.turboquant_prod_decode(block: TQBlock) ndarray[tuple[Any, ...], dtype[float64]][source]

Decode a TurboQuant inner-product block.

Parameters:

block (TQBlock) – Compressed block from turboquant_prod().

Returns:

Reconstructed vector (MSE reconstruction + QJL residual estimate).

Return type:

ndarray of shape (d,)

turboquant.quant.unpack_indices(data: bytes, bits: int, count: int) ndarray[tuple[Any, ...], dtype[uint8]][source]

Unpack bit-packed indices back to uint8 array.

Parameters:
  • data (bytes) – Packed byte string from pack_indices().

  • bits (int) – Bits per index.

  • count (int) – Number of indices to unpack.

Return type:

ndarray of uint8

turboquant.quant.verify_orthogonal(Q: ndarray[tuple[Any, ...], dtype[float64]], atol: float = 1e-10) bool[source]

Verify that Q is orthogonal: Q^T · Q ≈ I.

Python ctypes bridge for GGML TurboQuant C kernels.

Provides GGMLTurboQuant which loads the compiled shared library and exposes quantize/dequantize/dot_product as Python callables.

The C implementation lives in quant_ggml.h (header) and quant_ggml.c (compiled separately). Until the shared library is built, this module falls back to the pure-NumPy implementation in esml.quant.

Usage

>>> from turboquant.quant_bridge import GGMLTurboQuant
>>> tq = GGMLTurboQuant()
>>> if tq.available:
...     block = tq.quantize(vector, bits=3)
... else:
...     # Falls back to pure Python
...     from turboquant.quant import turboquant_mse
...     block = turboquant_mse(vector, bits=3)

class turboquant.quant_bridge.BlockTBQ2[source]

Bases: Structure

ctypes mirror of block_tbq2_0 (70 bytes per 256 elements).

indices

Structure/Union member

norm

Structure/Union member

seed

Structure/Union member

class turboquant.quant_bridge.BlockTBQ3[source]

Bases: Structure

ctypes mirror of block_tbq3_0 (102 bytes per 256 elements).

indices

Structure/Union member

norm

Structure/Union member

seed

Structure/Union member

class turboquant.quant_bridge.BlockTBQ4[source]

Bases: Structure

ctypes mirror of block_tbq4_0 (134 bytes per 256 elements).

indices

Structure/Union member

norm

Structure/Union member

seed

Structure/Union member

class turboquant.quant_bridge.GGMLTurboQuant(lib_path: str | Path | None = None)[source]

Bases: object

Python interface to the GGML TurboQuant C library.

Falls back to pure-NumPy when the shared library isn’t compiled.

Parameters:

lib_path (str or Path, optional) – Path to the compiled shared library (.dylib / .so). If None, searches in the package directory.

property available: bool

True if the C library is loaded and ready.

dequantize(block: Any, bits: int = 3) ndarray[tuple[Any, ...], dtype[float32]][source]

Dequantize a block back to float32 vector.

quantize(vector: ndarray[tuple[Any, ...], dtype[float32]], bits: int = 3, seed: int = 42) Any[source]

Quantize a vector using the C kernel or NumPy fallback.

Parameters:
  • vector (ndarray of float32, shape (d,)) – Input vector. d must be 256 (TQ_BLOCK_SIZE).

  • bits (int) – Quantization bits (2, 3, or 4).

  • seed (int) – Rotation matrix seed.

Returns:

Compressed block.

Return type:

TQBlock or ctypes Structure

turboquant.quant_bridge.compile_ggml_lib(output_dir: str | Path | None = None, optimize: bool = True) Path | None[source]

Attempt to compile the GGML TurboQuant C library.

Requires cc (clang or gcc) on the system.

Parameters:
  • output_dir (str or Path, optional) – Where to write the shared library. Defaults to the esml package dir.

  • optimize (bool) – Use -O2 -march=native (default True).

Returns:

Path to compiled library, or None on failure.

Return type:

Path or None
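
For reference, a manual build would look roughly like the following. This is an assumption-laden sketch: -shared and -fPIC are conventional flags for building shared libraries but are not confirmed by the docs here (only -O2 -march=native is), and the output filename is illustrative.

```shell
# Hypothetical manual equivalent of compile_ggml_lib(); adjust paths as needed.
cc -O2 -march=native -shared -fPIC quant_ggml.c -o turboquant.so      # Linux
cc -O2 -march=native -shared -fPIC quant_ggml.c -o turboquant.dylib   # macOS
```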

TurboQuant-compressed KV cache for ESML’s inference engine.

Stores attention key/value vectors as compressed TQBlocks instead of raw float tensors, achieving 4-6x memory reduction during inference.

This plugs into esml.engine.ESMLEngine to provide KV-cache compression during actual transformer attention computation.

References

  • TurboQuant: Zandieh et al. (2026). ICLR 2026. arXiv:2504.19874

  • QJL: Zandieh et al. (2025). AAAI 2025. arXiv:2406.03482

class turboquant.kv_cache.CacheStats(compressed_bytes: int = 0, uncompressed_bytes: int = 0, n_tokens: int = 0, n_layers: int = 0)[source]

Bases: object

Memory statistics for a TurboQuantKVCache.

compressed_bytes: int = 0
property compression_ratio: float
n_layers: int = 0
n_tokens: int = 0
property savings_mb: float
uncompressed_bytes: int = 0
class turboquant.kv_cache.TurboQuantKVCache(n_layers: int, head_dim: int, bits: int = 3, rotation_seed: int = 42)[source]

Bases: object

KV cache that stores keys and values as TurboQuant-compressed blocks.

Each key/value vector is quantized via esml.quant.turboquant_mse() on append(), and decompressed on get_keys() / get_values().

Parameters:
  • n_layers (int) – Number of transformer layers.

  • head_dim (int) – Dimension per attention head (must be power of 2).

  • bits (int) – TurboQuant quantization bits (2, 3, or 4).

  • rotation_seed (int) – Shared rotation seed for reproducibility.

Examples

>>> cache = TurboQuantKVCache(n_layers=32, head_dim=128, bits=3)
>>> k = np.random.randn(128)
>>> v = np.random.randn(128)
>>> cache.append(layer=0, k_vec=k, v_vec=v)
>>> keys = cache.get_keys(0)   # (1, 128) decompressed
>>> values = cache.get_values(0)
>>> cache.stats.compression_ratio
5.1

append(layer: int, k_vec: ndarray[tuple[Any, ...], dtype[float64]], v_vec: ndarray[tuple[Any, ...], dtype[float64]]) None[source]

Compress and cache a new key/value pair for one token.

Parameters:
  • layer (int) – Transformer layer index.

  • k_vec (ndarray of shape (head_dim,)) – Key vector (will be quantized).

  • v_vec (ndarray of shape (head_dim,)) – Value vector (will be quantized).

clear() None[source]

Clear all cached blocks.

get_keys(layer: int) ndarray[tuple[Any, ...], dtype[float64]][source]

Decompress all cached keys for a layer.

Return type:

ndarray of shape (seq_len, head_dim)

get_values(layer: int) ndarray[tuple[Any, ...], dtype[float64]][source]

Decompress all cached values for a layer.

Return type:

ndarray of shape (seq_len, head_dim)

property seq_len: int

Number of tokens currently cached (from layer 0).

property stats: CacheStats

Compute memory statistics.

class turboquant.kv_cache.UncompressedKVCache(n_layers: int, head_dim: int)[source]

Bases: object

Baseline uncompressed KV cache for comparison benchmarks.

append(layer: int, k_vec: ndarray[tuple[Any, ...], dtype[float64]], v_vec: ndarray[tuple[Any, ...], dtype[float64]]) None[source]
clear() None[source]
get_keys(layer: int) ndarray[tuple[Any, ...], dtype[float64]][source]
get_values(layer: int) ndarray[tuple[Any, ...], dtype[float64]][source]
property memory_bytes: int
property seq_len: int