autoresearch

Autonomous LLM pretraining: a macOS/MPS fork of Karpathy’s autoresearch for Apple Silicon. autoresearch is a set of command-line scripts, not an importable Python package, so this page is a runbook rather than an autodoc dump.

Source: llamaclaw/autoresearch.

Entry points

  • train.py: Pre-training loop. Defaults: depth=4, batch=2, TOTAL_BATCH=8192, TIME_BUDGET=1800 s (30 min).

  • train_optimal.py: Same loop with frozen hyperparameters; the “known-good” baseline.

  • quantize_eval.py: TurboQuant post-training-quantization (PTQ) evaluation against the trained checkpoint.

  • agent_loop.py: LLM-driven autonomous training loop: reads train.py, proposes edits, runs training, and keeps or reverts each edit based on val_bpb. Uses the ESML provider chain via esml.llm.

  • tq_benchmark.py: Rigorous multi-seed TurboQuant benchmark with RTN baseline, SQNR analysis, Pareto plot, and JSON output.

  • pt2gguf.py: Converts an autoresearch .pt checkpoint to GGUF v3, optionally applying TurboQuant compression.

  • run_full_experiment.sh: 5-phase orchestrator: train → quantize → benchmark → convert → verify. Pass --benchmark to add the rigorous Phase 6.

  • run_agent_experiment.sh: Wrapper that launches agent_loop.py with live progress.
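The keep-or-revert rule that agent_loop.py applies can be sketched in shell. The variable names and the bare numeric comparison below are illustrative assumptions, not the script's actual code; they only show the decision: an edit survives iff validation bits-per-byte improved.

```shell
# Sketch of the keep-or-revert decision (names are assumptions, not the real code).
best_bpb=1.65   # val_bpb of the current accepted train.py
new_bpb=1.60    # val_bpb measured after the LLM's proposed edit

# Keep the proposed edit only if validation bits-per-byte got strictly lower.
if awk -v a="$new_bpb" -v b="$best_bpb" 'BEGIN { exit !(a+0 < b+0) }'; then
  decision=keep
  best_bpb=$new_bpb
else
  decision=revert   # in the real loop this would restore the previous train.py
fi
echo "$decision (best val_bpb now $best_bpb)"
```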

Hardware

  • macOS / Apple Silicon: PyTorch MPS. Tested on M2 8 GB; batch=2 and PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.7 are required to avoid OOM.

  • Linux / CUDA: upstream Karpathy branch.
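A minimal MPS launch for the tested M2 8 GB configuration, using only the knobs documented above (no extra flags are assumed):

```shell
# Cap MPS allocations; required on 8 GB machines to avoid OOM.
export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.7
# batch=2 is already the train.py default; keep it on 8 GB.
python train.py
```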

Status files

train_status.txt, quant_status.txt, agent_status.txt, bench_status.txt are plain key=value files that the shell scripts poll to render animated progress bars. They are the observability surface for running experiments.
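A sketch of how a wrapper script can poll one of these files. The key names (phase, step, total, val_bpb) are assumptions for illustration; only the key=value format is documented above.

```shell
# Fake a status file the way train.py might write it (key names assumed).
cat > /tmp/train_status.txt <<'EOF'
phase=train
step=1200
total=4800
val_bpb=1.72
EOF

# Read one key; a wrapper would call this in a loop to redraw a progress bar.
get_status() { grep "^$1=" /tmp/train_status.txt | cut -d= -f2; }

step=$(get_status step)
total=$(get_status total)
pct=$(( 100 * step / total ))
echo "train ${pct}% (val_bpb $(get_status val_bpb))"
```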

Results persistence

results.tsv and results_history.tsv are append-only — written after each phase, never deferred to a “final documentation” phase. EXPERIMENT_LOG.md appends with --- separators between runs. Per-run snapshots land in logs/<RUN_TAG>/.
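The append-only pattern can be shown in a few lines of shell. The column names here are assumptions for illustration; the point is that each phase appends its row immediately rather than buffering results for a final write.

```shell
# Header is written once; each phase then appends exactly one row (assumed columns).
printf 'run_tag\tphase\tval_bpb\n' > /tmp/results.tsv
printf '%s\t%s\t%s\n' "demo_run" train 1.60 >> /tmp/results.tsv
printf '%s\t%s\t%s\n' "demo_run" quantize 1.62 >> /tmp/results.tsv

rows=$(wc -l < /tmp/results.tsv | tr -d ' ')
echo "results.tsv has $rows lines"
```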

Reproducible experiment result (2026-04-07)

  • val_bpb 1.60

  • TurboQuant: 4-bit gives 7.6× compression at cosine similarity 0.995; 3-bit 10× at 0.983; 2-bit 14.6× at ~0.95.

  • Config: depth=4, batch=2, TOTAL_BATCH=8192, TIME_BUDGET=1800, eval_tokens=10 × VAL_TOKENS.