End-to-end: pretrain → compress → serve

A full small-scale LLM workflow using four llamaclaw repos: it trains a model on an M2 Mac in ~30 minutes, compresses it to 3-bit, and serves it via Ollama.

Architecture

autoresearch (train) ──▶  .pt checkpoint
                         │
                         ▼
turboquant (PTQ) ──▶  compressed blocks (10× smaller)
                         │
                         ▼
pt2gguf ──▶  .gguf file
                         │
                         ▼
Ollama (serve) ──▶  perseus uses it via provider chain

1. Clone autoresearch (MPS variant)

git clone git@github.com:llamaclaw/autoresearch.git
cd autoresearch
uv sync

2. Pretrain (30 min on M2 8 GB)

./run_full_experiment.sh

This runs the 5-phase pipeline: train_optimal.py → quantize_eval.py → pt2gguf.py → tq_benchmark.py → verify. At the end you get model_best.pt (~200 MB for a 50M-param model) and a val_bpb around 1.65.

Progress shows live:

[1/5] training       ████████████████░░░░░░░  1450/1800s  val_bpb=1.672

3. Compress with TurboQuant

The quantize_eval.py step already runs the bit-width sweep (2–5 bits). Results land in results.tsv:

| bits | compression | mean cos | val_bpb delta |
|------|-------------|----------|---------------|
| 4    | 7.6×        | 0.995    | +21.7%        |
| 3    | 10.0×       | 0.983    | +21.6%        |
| 2    | 14.6×       | 0.940    | +22.6%        |
| 5    | 6.2×        | 0.9985   | +0.02% ★      |

The 5-bit point is the sweet spot if ~6× compression is enough for your memory budget. Small models tolerate aggressive quantization surprisingly well here: 3-bit is almost as good as 4-bit while compressing 10×.
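To pick an operating point programmatically, here is a minimal sketch of reading results.tsv and choosing the highest-fidelity configuration that meets a compression budget. The column names (and the inline TSV, built from the table above) are assumptions; check the actual header that quantize_eval.py writes.

```python
import csv, io

# Inline stand-in for results.tsv, using the sweep numbers above.
# Column names are assumptions, not quantize_eval.py's verified schema.
TSV = """bits\tcompression\tmean_cos\tval_bpb_delta
4\t7.6\t0.995\t21.7
3\t10.0\t0.983\t21.6
2\t14.6\t0.940\t22.6
5\t6.2\t0.9985\t0.02
"""

def best_under_budget(tsv_text: str, min_compression: float) -> dict:
    """Highest mean-cosine config among those meeting the compression budget."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    eligible = [r for r in rows if float(r["compression"]) >= min_compression]
    return max(eligible, key=lambda r: float(r["mean_cos"]))

# With a 10x budget, 3-bit wins on fidelity; with a 6x budget, 5-bit does.
print(best_under_budget(TSV, 10.0)["bits"])  # -> 3
print(best_under_budget(TSV, 6.0)["bits"])   # -> 5
```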

4. Convert to GGUF

python pt2gguf.py --input model_best.pt --output model.gguf --bits 3

Under the hood this uses llamaclaw/turboquant to apply PolarQuant + QJL and pack into GGUF blocks.

5. Serve via Ollama

cat > Modelfile << 'EOF'
FROM ./model.gguf
PARAMETER temperature 0.1
SYSTEM "You are a scientific computing assistant."
EOF

ollama create mymodel -f Modelfile
ollama run mymodel "explain double machine learning"

Or publish to your Ollama namespace:

ollama push llamaclaw/mymodel:e3b

6. Point Perseus at it

llamaclaw/perseus auto-detects a running Ollama instance:

from perseus import ask_percy

response = ask_percy("What is the ATE?")
print(response["output_text"])
# Perseus uses the first available provider: your mymodel, then
# FreeAPI, then Gemini, then OpenAI, then local keyword fallback.
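The fallback behavior in that comment can be pictured as a simple try-in-order loop. This is an illustrative sketch of the provider-chain pattern, not Perseus's actual internals; the function and provider names here are hypothetical.

```python
from typing import Callable

def ask_with_fallback(prompt: str, providers: list[Callable[[str], str]]) -> str:
    """Try each provider in priority order; fall through on any failure."""
    for provider in providers:
        try:
            return provider(prompt)
        except Exception:
            continue  # provider unavailable -> try the next one
    raise RuntimeError("no provider available")

def unreachable_ollama(prompt: str) -> str:
    # Stands in for an Ollama backend that is not running.
    raise ConnectionError("ollama not running")

def keyword_fallback(prompt: str) -> str:
    # Stands in for the local keyword fallback at the end of the chain.
    return f"[fallback] {prompt}"

print(ask_with_fallback("What is the ATE?", [unreachable_ollama, keyword_fallback]))
# -> [fallback] What is the ATE?
```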

Performance (M2 8 GB, fp32 baseline → TQ 3-bit GGUF)

| metric             | before | after |
|--------------------|--------|-------|
| model size         | 200 MB | 20 MB |
| inference tok/s    | 12     | 14    |
| val_bpb            | 1.648  | 2.004 |
| cosine sim vs fp32 | 1.000  | 0.983 |

The tok/s improvement is modest because MPS doesn’t natively accelerate int8/int3 ops; the win is the 10× memory reduction, which lets you run larger base models in the same RAM.
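The size numbers are easy to sanity-check with back-of-envelope arithmetic: 50M parameters at fp32 (4 bytes each) is 200 MB, and the ~10× compression from the sweep table (rather than the ideal 32/3 ≈ 10.7×, since per-block scales add overhead) gives ~20 MB.

```python
# Back-of-envelope check of the before/after model sizes above.
params = 50_000_000
fp32_mb = params * 4 / 1e6   # 4 bytes/param at fp32 -> 200.0 MB

# Use the measured ~10x compression from the 3-bit sweep row, not the
# ideal 32/3 ratio, because block scales/offsets add storage overhead.
tq3_mb = fp32_mb / 10        # -> 20.0 MB

print(fp32_mb, tq3_mb)       # 200.0 20.0
```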

Why this matters

TurboQuant is unbiased (E[x_hat] = x), which is critical if you’re going to chain the quantized model into a causal-inference pipeline where biased estimators poison downstream estimates. That’s the whole reason esml depends on turboquant as a pip VCS dep: we need trustworthy compression for production inference.
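To see why unbiasedness matters, here is a toy illustration using stochastic rounding, which satisfies E[x̂] = x, whereas round-to-nearest systematically biases values toward grid points. This is a minimal sketch of the property, not TurboQuant's actual PolarQuant/QJL scheme.

```python
import random

def stochastic_round(x: float, step: float, rng: random.Random) -> float:
    """Round x to a multiple of step, up with probability (x - lo) / step."""
    lo = (x // step) * step
    p_up = (x - lo) / step
    return lo + step if rng.random() < p_up else lo

rng = random.Random(0)
x, step, n = 0.30, 1.0, 100_000
mean = sum(stochastic_round(x, step, rng) for _ in range(n)) / n
# The sample mean lands close to 0.30; round-to-nearest would give 0.0
# every time, a bias that compounds through downstream estimators.
print(round(mean, 2))
```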

See also