Nonstop CI Autopilot

The autopilot is a self-sustaining loop that continuously proposes, runs, and records coil-optimization cases on a self-hosted CI runner. Its purpose is to autonomously explore the design space across multiple plasma surfaces and constraint settings, building the leaderboard without manual intervention.

How it works

Three phases repeat indefinitely:

Propose — python -m tools.propose_batch generates a batch of cases (default 10; configurable via policy/proposer_policy.yaml).
Run — The CI runner executes each case via stellcoilbench run-ci-case. Cases run in parallel (up to 10 concurrent jobs) with per-case timeouts.
Record — Results are written to cases/done/<case_id>/summary.json (locally, for runner polling; not committed) and to submissions/<surface>/auto/<case_id>/. Failures are recorded in policy/autopilot_failures.json. The completed results feed the next proposal round.

A batch barrier ensures that the proposer never writes new cases while cases/pending/ still contains unfinished work. The CI workflow update-db-self-hosted.yml drives the loop: each push to main (or a 10-minute cron safety net) triggers run_autopilot_cases → propose_autopilot_batch. Autopilot commits (submissions/, policy/autopilot_failures.json, cases/pending/) are filtered via paths-ignore so they do not re-trigger benchmark jobs, keeping the loop self-sustaining.

Proposer modes

The proposer supports two modes. Both share the same guardrails, validation, and output format; only the case-generation strategy differs.

Deterministic (genetic-algorithm) policy — default

The default proposer is a lightweight, deterministic genetic-algorithm (GA) style optimizer. It requires no external API and is fully reproducible with a --seed flag.

Each batch is split into two halves (ratio set by exploit_fraction):

Exploit (mutation) — The proposer selects a parent from the top-\(k\) feasible results (ranked by composite score), clones its case_config, and applies random perturbations:
- Threshold jitter — Each constraint threshold is multiplied by a log-normal factor (\(\sigma\) from threshold_sigma), nudging the optimizer toward a different trade-off on the Pareto frontier.
- Structural mutation — With probability structural_mutation_prob, the number of base coils or Fourier order is changed to an adjacent value.
- A novelty check (config hash) rejects duplicates of recent runs.
Explore — New random cases with surface, coil count, and order drawn from the policy’s allowed sets. Thresholds are sampled from log-uniform ranges to create diversity across the design space.

# Default (deterministic) proposer
python -m tools.propose_batch --batch-size 10 --dry-run --seed 42

LLM-powered policy — opt-in

When the --llm flag is passed, the proposer calls the LLM directly with context from build_context (top parents, failure statistics, surface exploration counts, run cards, postmortems) and llm_context.md. The LLM returns a list of mutate / explore actions that are converted into CI case dicts by apply_llm_action().

Requires ANTHROPIC_API_KEY (or KB_LLM_* env vars). If the LLM returns an error or fewer than batch_size valid cases, the proposer falls back automatically to the deterministic GA policy.

# LLM proposer (direct mode)
python -m tools.propose_batch --batch-size 10 --llm --dry-run

The LLM mode is useful when you want the proposer to reason about which surfaces are under-explored, which constraint combinations have not yet been tried, or which failure modes to avoid — tasks that go beyond what a fixed mutation/exploration policy can do.

Guardrails and safe mode

The proposer checks a sliding window (default 30) of recent results before every batch:

Fail-rate guard — If the failure rate exceeds max_fail_rate (default 0.6), proposals halt and PAUSE_AUTORUN is written.
Repeated-failure guard — If the most common failure reason repeats more than max_common_failure_count times, proposals halt.
Critical-class guard — Tracks classes like vmec_nonconverged, nan_in_objective, timeout; halts if any exceeds max_critical_class_count.

Between the safe-mode threshold (0.35) and the hard halt (0.6), the proposer enters safe mode: mutation sigma is reduced, iteration caps are lowered, and exploration is restricted to preferred (simpler) surfaces.

Pausing and resuming

Create the file PAUSE_AUTORUN in the repo root to halt all proposals. The proposer checks for this file on every invocation and exits immediately if it exists. Guardrail triggers can also create this file automatically (controlled by cooldown.write_pause_file in the policy). To resume, simply delete the file:

# Pause
touch PAUSE_AUTORUN

# Resume
rm PAUSE_AUTORUN

Directory layout

cases/pending/           Proposer writes new case JSONs here
cases/done/<case_id>/    summary.json, case.yaml, coils.json
submissions/<surface>/auto/<case_id>/   Results for leaderboard
policy/proposer_policy.yaml             Proposer configuration
tools/propose_batch/                    Proposer package (python -m tools.propose_batch)
tools/build_context.py                  Context builder

Pending-case format

Each file in cases/pending/ is a JSON dict with:

case_id — unique timestamp + random suffix
case_config — full case YAML equivalent (surface, coils, optimizer, objective terms)
resource — max_total_iterations (capped at 10 000) and timeout_minutes
parent_ids — list of parent case IDs (empty for exploration)
tags — ["exploit"], ["explore"], ["exploit", "llm"], etc.
random_seed — for reproducibility

Commands

# Preview a batch (dry run)
python -m tools.propose_batch --batch-size 10 --dry-run --seed 42

# Build context payload from completed cases
python tools/build_context.py [--out context.json]

# Run a single CI case locally
stellcoilbench run-ci-case <case_file> [--output-dir cases/done] \
    [--policy policy/proposer_policy.yaml]

# Emergency stop
touch PAUSE_AUTORUN

# Resume after stop
rm PAUSE_AUTORUN