End-to-end: pretrain → compress → serve

A full small-scale LLM workflow using four llamaclaw repos: it trains a model on an M2 Mac in ~30 minutes, compresses it to 3-bit, and serves it via Ollama.

Architecture

autoresearch (train) ──▶  .pt checkpoint
                         │
                         ▼
turboquant (PTQ) ──▶  compressed blocks (10× smaller)
                         │
                         ▼
pt2gguf ──▶  .gguf file
                         │
                         ▼
Ollama (serve) ──▶  perseus uses it via provider chain

1. Clone autoresearch (MPS variant)

git clone git@github.com:llamaclaw/autoresearch.git
cd autoresearch
uv sync

2. Pretrain (30 min on M2 8 GB)

./run_full_experiment.sh

This runs the 5-phase pipeline: train_optimal.py → quantize_eval.py → pt2gguf.py → tq_benchmark.py → verify. At the end you get model_best.pt (~200 MB for a 50M-param model) and a val_bpb around 1.65.

Progress shows live:

[1/5] training       ████████████████░░░░░░░  1450/1800s  val_bpb=1.672

3. Compress with TurboQuant

The quantize_eval.py step already runs the bit-width sweep (2–5 bits). Results land in results.tsv:

| bits | compression | mean cos | val_bpb delta |
|------|-------------|----------|---------------|
| 4    | 7.6×        | 0.995    | +21.7%        |
| 3    | 10.0×       | 0.983    | +21.6%        |
| 2    | 14.6×       | 0.940    | +22.6%        |
| 5    | 6.2×        | 0.9985   | +0.02% ★      |

The 5-bit point is the sweet spot if ~6× compression is enough for your memory budget. Small models tolerate aggressive quantization surprisingly well here: 3-bit is almost as good as 4-bit while compressing 10×.
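To pick an operating point programmatically, here is a minimal sketch of reading results.tsv and choosing the highest-fidelity configuration that meets a compression budget. The column names (and the inline TSV, built from the table above) are assumptions; check the actual header that quantize_eval.py writes.

```python
import csv, io

# Inline stand-in for results.tsv, using the sweep numbers above.
# Column names are assumptions, not quantize_eval.py's verified schema.
TSV = """bits\tcompression\tmean_cos\tval_bpb_delta
4\t7.6\t0.995\t21.7
3\t10.0\t0.983\t21.6
2\t14.6\t0.940\t22.6
5\t6.2\t0.9985\t0.02
"""

def best_under_budget(tsv_text: str, min_compression: float) -> dict:
    """Highest mean-cosine config among those meeting the compression budget."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    eligible = [r for r in rows if float(r["compression"]) >= min_compression]
    return max(eligible, key=lambda r: float(r["mean_cos"]))

# With a 10x budget, 3-bit wins on fidelity; with a 6x budget, 5-bit does.
print(best_under_budget(TSV, 10.0)["bits"])  # -> 3
print(best_under_budget(TSV, 6.0)["bits"])   # -> 5
```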

4. Convert to GGUF

python pt2gguf.py --input model_best.pt --output model.gguf --bits 3

Under the hood this uses llamaclaw/turboquant to apply PolarQuant + QJL and pack into GGUF blocks.

5. Serve via Ollama

cat > Modelfile << 'EOF'
FROM ./model.gguf
PARAMETER temperature 0.1
SYSTEM "You are a scientific computing assistant."
EOF

ollama create mymodel -f Modelfile
ollama run mymodel "explain double machine learning"

Or publish to your Ollama namespace:

ollama push llamaclaw/mymodel:e3b

6. Point Perseus at it

llamaclaw/perseus auto-detects a running Ollama instance:

from perseus import ask_percy

response = ask_percy("What is the ATE?")
print(response["output_text"])
# Perseus uses the first available provider: your mymodel, then
# FreeAPI, then Gemini, then OpenAI, then local keyword fallback.
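The fallback behavior in that comment can be pictured as a simple try-in-order loop. This is an illustrative sketch of the provider-chain pattern, not Perseus's actual internals; the function and provider names here are hypothetical.

```python
from typing import Callable

def ask_with_fallback(prompt: str, providers: list[Callable[[str], str]]) -> str:
    """Try each provider in priority order; fall through on any failure."""
    for provider in providers:
        try:
            return provider(prompt)
        except Exception:
            continue  # provider unavailable -> try the next one
    raise RuntimeError("no provider available")

def unreachable_ollama(prompt: str) -> str:
    # Stands in for an Ollama backend that is not running.
    raise ConnectionError("ollama not running")

def keyword_fallback(prompt: str) -> str:
    # Stands in for the local keyword fallback at the end of the chain.
    return f"[fallback] {prompt}"

print(ask_with_fallback("What is the ATE?", [unreachable_ollama, keyword_fallback]))
# -> [fallback] What is the ATE?
```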

Performance (M2 8 GB, fp32 baseline → TQ 3-bit GGUF)

| metric             | before | after |
|--------------------|--------|-------|
| model size         | 200 MB | 20 MB |
| inference tok/s    | 12     | 14    |
| val_bpb            | 1.648  | 2.004 |
| cosine sim vs fp32 | 1.000  | 0.983 |

The tok/s improvement is modest because MPS doesn’t natively accelerate int8/int3 ops; the win is the 10× memory reduction, which lets you run larger base models in the same RAM.
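The size numbers are easy to sanity-check with back-of-envelope arithmetic: 50M parameters at fp32 (4 bytes each) is 200 MB, and the ~10× compression from the sweep table (rather than the ideal 32/3 ≈ 10.7×, since per-block scales add overhead) gives ~20 MB.

```python
# Back-of-envelope check of the before/after model sizes above.
params = 50_000_000
fp32_mb = params * 4 / 1e6   # 4 bytes/param at fp32 -> 200.0 MB

# Use the measured ~10x compression from the 3-bit sweep row, not the
# ideal 32/3 ratio, because block scales/offsets add storage overhead.
tq3_mb = fp32_mb / 10        # -> 20.0 MB

print(fp32_mb, tq3_mb)       # 200.0 20.0
```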

Why this matters

TurboQuant is unbiased (E[x_hat] = x), which is critical if you’re going to chain the quantized model into a causal-inference pipeline where biased estimators poison downstream estimates. That’s the whole reason esml depends on turboquant as a pip VCS dep: we need trustworthy compression for production inference.
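To see why unbiasedness matters, here is a toy illustration using stochastic rounding, which satisfies E[x̂] = x, whereas round-to-nearest systematically biases values toward grid points. This is a minimal sketch of the property, not TurboQuant's actual PolarQuant/QJL scheme.

```python
import random

def stochastic_round(x: float, step: float, rng: random.Random) -> float:
    """Round x to a multiple of step, up with probability (x - lo) / step."""
    lo = (x // step) * step
    p_up = (x - lo) / step
    return lo + step if rng.random() < p_up else lo

rng = random.Random(0)
x, step, n = 0.30, 1.0, 100_000
mean = sum(stochastic_round(x, step, rng) for _ in range(n)) / n
# The sample mean lands close to 0.30; round-to-nearest would give 0.0
# every time, a bias that compounds through downstream estimators.
print(round(mean, 2))
```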

See also