End-to-end: pretrain → compress → serve¶
A complete small-scale LLM workflow spanning four llamaclaw repos: pretrain on an M2 Mac in about 30 minutes, compress to 3-bit, and serve via Ollama.
Architecture¶
autoresearch (train) ──▶ .pt checkpoint
│
▼
turboquant (PTQ) ──▶ compressed blocks (10× smaller)
│
▼
pt2gguf ──▶ .gguf file
│
▼
Ollama (serve) ──▶ perseus uses it via provider chain
1. Clone autoresearch (MPS variant)¶
git clone git@github.com:llamaclaw/autoresearch.git
cd autoresearch
uv sync
2. Pretrain (30 min on M2 8 GB)¶
./run_full_experiment.sh
This runs the 5-phase pipeline: train_optimal.py → quantize_eval.py
→ pt2gguf.py → tq_benchmark.py → verify. At the end you get
model_best.pt (~200 MB for a 50 M param model) and a val_bpb around
1.65.
Progress shows live:
[1/5] training ████████████████░░░░░░░ 1450/1800s val_bpb=1.672
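The ~200 MB checkpoint size follows directly from the parameter count; a quick back-of-the-envelope check in Python:

```python
params = 50e6            # 50 M parameters, per the run above
bytes_per_weight = 4     # fp32 stores each weight in 4 bytes
size_mb = params * bytes_per_weight / 1e6
print(f"{size_mb:.0f} MB")  # → 200 MB, matching model_best.pt
```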
3. Compress with TurboQuant¶
The quantize_eval.py step already runs 2/3/4-bit sweeps. Results land
in results.tsv:
| bits | compression | mean cos | val_bpb delta |
|---|---|---|---|
| 4 | 7.6× | 0.995 | +21.7% |
| 3 | 10.0× | 0.983 | +21.6% |
| 2 | 14.6× | 0.940 | +22.6% |
| 5 | 6.2× | 0.9985 | +0.02% ★ |
The 5-bit point is the sweet spot if you can accept the smaller 6.2× compression ratio instead of 10×. Smaller models benefit more from aggressive quantization: here 3-bit is almost as good as 4-bit.
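Choosing a bit-width from the sweep is a simple quality-vs-compression trade-off, and a small helper can automate it. A sketch (the rows mirror the table above; `pick_bits` and the column order are illustrative, not part of quantize_eval.py):

```python
# Each row: (bits, compression ratio, mean cosine sim, val_bpb delta in %).
SWEEP = [
    (4, 7.6, 0.995, 21.7),
    (3, 10.0, 0.983, 21.6),
    (2, 14.6, 0.940, 22.6),
    (5, 6.2, 0.9985, 0.02),
]

def pick_bits(sweep, max_bpb_delta_pct=1.0):
    """Return the row with the best compression among those whose
    val_bpb degradation stays within budget; None if nothing qualifies."""
    ok = [row for row in sweep if row[3] <= max_bpb_delta_pct]
    return max(ok, key=lambda row: row[1]) if ok else None

print(pick_bits(SWEEP))  # only the 5-bit point fits a 1% bpb budget
```

Relaxing the budget (e.g. `max_bpb_delta_pct=25.0`) would instead select the 2-bit point, since it then wins on compression alone.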
4. Convert to GGUF¶
python pt2gguf.py --input model_best.pt --output model.gguf --bits 3
Under the hood this uses
llamaclaw/turboquant to
apply PolarQuant + QJL and pack into GGUF blocks.
5. Serve via Ollama¶
cat > Modelfile << 'EOF'
FROM ./model.gguf
PARAMETER temperature 0.1
SYSTEM "You are a scientific computing assistant."
EOF
ollama create mymodel -f Modelfile
ollama run mymodel "explain double machine learning"
Or publish to your Ollama namespace:
ollama push llamaclaw/mymodel:e3b
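Once `ollama run` works, the model is also reachable over Ollama's local HTTP API (default port 11434, `/api/generate` endpoint). A minimal sketch of building the request body; the helper name is made up for illustration:

```python
import json

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_generate_request(model: str, prompt: str, temperature: float = 0.1) -> str:
    """Serialize a non-streaming request body for Ollama's /api/generate."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
    })

body = build_generate_request("mymodel", "explain double machine learning")
# POST `body` to OLLAMA_URL with any HTTP client, e.g.:
#   curl http://localhost:11434/api/generate -d "$body"
```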
6. Point Perseus at it¶
llamaclaw/perseus auto-detects a running Ollama server:
from perseus import ask_percy
response = ask_percy("What is the ATE?")
print(response["output_text"])
# Perseus uses the first available provider: your mymodel, then
# FreeAPI, then Gemini, then OpenAI, then local keyword fallback.
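The fallback behaviour can be sketched generically. The providers below are hypothetical stand-ins, not Perseus's real client objects; only the "first one that answers wins" logic is the point:

```python
def first_available(providers):
    """Return (name, output) from the first provider in the chain that
    answers; providers signal unavailability by raising or returning None."""
    for name, call in providers:
        try:
            result = call()
        except Exception:
            continue  # provider down or erroring: try the next one
        if result is not None:
            return name, result
    return "fallback", "local keyword match"

def _unreachable():
    raise TimeoutError("provider unreachable")

# Hypothetical chain mirroring the comment above: Ollama first, then the rest.
chain = [
    ("ollama", lambda: None),        # not running → skipped
    ("freeapi", _unreachable),       # times out → skipped
    ("gemini", lambda: "answer from Gemini"),
]
print(first_available(chain))  # → ('gemini', 'answer from Gemini')
```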
Performance (M2 8 GB, fp32 baseline → TQ 3-bit GGUF)¶
| metric | before | after |
|---|---|---|
| model size | 200 MB | 20 MB |
| inference tok/s | 12 | 14 |
| val_bpb | 1.648 | 2.004 |
| cosine sim vs fp32 | 1.000 | 0.983 |
The tok/s improvement is modest because MPS doesn’t natively accelerate int8/int3 ops; the win is the 10× memory reduction, which lets you run larger base models in the same RAM.
Why this matters¶
TurboQuant is unbiased (E[x_hat] = x), which is critical if you’re
going to chain the quantized model into a causal-inference pipeline
where biased estimators poison downstream estimates. That’s the whole
reason esml depends on turboquant as a pip VCS dep: we need
trustworthy compression for production inference.
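TurboQuant's exact scheme isn't reproduced here, but the unbiasedness property E[x_hat] = x is the same guarantee that stochastic rounding provides; a minimal illustration:

```python
import math
import random

def stochastic_round(x: float) -> int:
    """Round x to an adjacent integer with probability proportional to
    proximity, so the expected value of the result equals x exactly."""
    lo = math.floor(x)
    frac = x - lo
    return lo + (1 if random.random() < frac else 0)

# Empirical check: the mean of many quantizations converges to x itself,
# whereas deterministic round-to-nearest would always return 0 for x=0.3.
random.seed(0)
x = 0.3
est = sum(stochastic_round(x) for _ in range(100_000)) / 100_000
print(est)  # ≈ 0.3
```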
See also¶
Pi deployment guide — run the same compressed model on zeus.local with 16 GB RAM + NVMe.
Perseus custom model tutorial — fine-tune a base model for a specific domain before quantizing.