autoresearch

Autonomous LLM pretraining: a macOS/MPS fork of Karpathy’s autoresearch for Apple Silicon. autoresearch is a set of command-line scripts, not an importable Python package, so this page is a runbook rather than an autodoc dump.

Source: llamaclaw/autoresearch.

Entry points

  • train.py: Pre-training loop. Defaults: depth=4, batch=2, TOTAL_BATCH=8192, TIME_BUDGET=1800 s (30 min).

  • train_optimal.py: Same loop with frozen hyperparameters; the “known-good” baseline.

  • quantize_eval.py: TurboQuant post-training-quantization (PTQ) evaluation against the trained checkpoint.

  • agent_loop.py: LLM-driven autonomous training loop: reads train.py, proposes edits, runs training, and keeps or reverts each edit based on val_bpb. Uses the ESML provider chain via esml.llm.

  • tq_benchmark.py: Rigorous multi-seed TurboQuant benchmark with RTN baseline, SQNR analysis, Pareto plot, and JSON output.

  • pt2gguf.py: Converts an autoresearch .pt checkpoint to GGUF v3, optionally applying TurboQuant compression.

  • run_full_experiment.sh: 5-phase orchestrator: train → quantize → benchmark → convert → verify. Pass --benchmark to add the rigorous Phase 6.

  • run_agent_experiment.sh: Wrapper that launches agent_loop.py with live progress.
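The keep-or-revert rule that agent_loop.py applies can be sketched in shell. The variable names and the bare numeric comparison below are illustrative assumptions, not the script's actual code; they only show the decision: an edit survives iff validation bits-per-byte improved.

```shell
# Sketch of the keep-or-revert decision (names are assumptions, not the real code).
best_bpb=1.65   # val_bpb of the current accepted train.py
new_bpb=1.60    # val_bpb measured after the LLM's proposed edit

# Keep the proposed edit only if validation bits-per-byte got strictly lower.
if awk -v a="$new_bpb" -v b="$best_bpb" 'BEGIN { exit !(a+0 < b+0) }'; then
  decision=keep
  best_bpb=$new_bpb
else
  decision=revert   # in the real loop this would restore the previous train.py
fi
echo "$decision (best val_bpb now $best_bpb)"
```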

Hardware

  • macOS / Apple Silicon: PyTorch MPS. Tested on M2 8 GB; batch=2 and PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.7 are required to avoid OOM.

  • Linux / CUDA: upstream Karpathy branch.
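A minimal MPS launch for the tested M2 8 GB configuration, using only the knobs documented above (no extra flags are assumed):

```shell
# Cap MPS allocations; required on 8 GB machines to avoid OOM.
export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.7
# batch=2 is already the train.py default; keep it on 8 GB.
python train.py
```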

Status files

train_status.txt, quant_status.txt, agent_status.txt, bench_status.txt are plain key=value files that the shell scripts poll to render animated progress bars. They are the observability surface for running experiments.
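A sketch of how a wrapper script can poll one of these files. The key names (phase, step, total, val_bpb) are assumptions for illustration; only the key=value format is documented above.

```shell
# Fake a status file the way train.py might write it (key names assumed).
cat > /tmp/train_status.txt <<'EOF'
phase=train
step=1200
total=4800
val_bpb=1.72
EOF

# Read one key; a wrapper would call this in a loop to redraw a progress bar.
get_status() { grep "^$1=" /tmp/train_status.txt | cut -d= -f2; }

step=$(get_status step)
total=$(get_status total)
pct=$(( 100 * step / total ))
echo "train ${pct}% (val_bpb $(get_status val_bpb))"
```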

Results persistence

results.tsv and results_history.tsv are append-only — written after each phase, never deferred to a “final documentation” phase. EXPERIMENT_LOG.md appends with --- separators between runs. Per-run snapshots land in logs/<RUN_TAG>/.
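The append-only pattern can be shown in a few lines of shell. The column names here are assumptions for illustration; the point is that each phase appends its row immediately rather than buffering results for a final write.

```shell
# Header is written once; each phase then appends exactly one row (assumed columns).
printf 'run_tag\tphase\tval_bpb\n' > /tmp/results.tsv
printf '%s\t%s\t%s\n' "demo_run" train 1.60 >> /tmp/results.tsv
printf '%s\t%s\t%s\n' "demo_run" quantize 1.62 >> /tmp/results.tsv

rows=$(wc -l < /tmp/results.tsv | tr -d ' ')
echo "results.tsv has $rows lines"
```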

Reproducible experiment result (2026-04-07)

  • val_bpb 1.60

  • TurboQuant: 4-bit gives 7.6× compression at cosine similarity 0.995; 3-bit 10× at 0.983; 2-bit 14.6× at ~0.95.

  • Config: depth=4, batch=2, TOTAL_BATCH=8192, TIME_BUDGET=1800, eval_tokens=10 × VAL_TOKENS.