Compare commits

...

1 Commits

Author SHA1 Message Date
Kim Morrison
bb577aaf88 chore: add lean4-benchmarks skill for Claude Code
This PR moves benchmark documentation from CLAUDE.md into a dedicated
project-level skill at .claude/skills/lean4-benchmarks/SKILL.md, covering
the measurement pipeline, variable-size benchmarks, JSONL format, custom
metrics, and combine.py behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 22:03:07 +00:00
2 changed files with 80 additions and 7 deletions

View File

@@ -33,14 +33,9 @@ tests/with_stage1_test_env.sh tests/elab_bench/run_bench.sh cbv_decide.lean
 tests/with_stage1_test_env.sh tests/elab/run_test.sh grind_indexmap.lean
 ```
-## Benchmark vs Test Problem Sizes
+## Benchmarks
-Benchmarks are also run as tests. Use the `TEST_BENCH` environment variable (unset in tests, set to `1` in benchmarks) to scale problem sizes:
-- In `compile_bench` `.init.sh` files: check `$TEST_BENCH` and set `TEST_ARGS` accordingly
-- In `elab_bench` Lean files: use `(← IO.getEnv "TEST_BENCH") == some "1"` to switch between small (test) and large (bench) inputs
-See `tests/README.md` for the full benchmark writing guide.
+See the `lean4-benchmarks` skill (`.claude/skills/lean4-benchmarks/SKILL.md`) and `tests/README.md`.
 ## Testing stage 2

View File

@@ -0,0 +1,78 @@
---
name: lean4-benchmarks
description: Working with the Lean4 benchmark system. Use when creating, running, or analyzing benchmarks, or when working in tests/elab_bench/ or tests/compile_bench/.
---
# Lean4 Benchmarks
## Two Benchmark Piles
- **`tests/elab_bench/`** -- elaboration benchmarks. Each `.lean` file is elaborated by `lean` and measured.
- **`tests/compile_bench/`** -- compile+run benchmarks. Each `.lean` is compiled to C, then native binary, then execution is measured.
New `.lean` files are auto-registered via the glob in `tests/CMakeLists.txt`.
## Running
```bash
# Full suite → tests/measurements.jsonl
make -C build/release -j$(nproc) bench
# Single benchmark manually (test sizes, fast):
tests/with_stage1_test_env.sh tests/elab_bench/run_bench.sh foo.lean
# Single benchmark at real sizes (sets TEST_BENCH=1):
tests/with_stage1_bench_env.sh tests/elab_bench/run_bench.sh foo.lean
```
## Variable-Size Benchmarks
`TEST_BENCH` env var: unset in tests, set to `1` in benchmarks. Two mechanisms:
**compile_bench** -- create `foo.lean.init.sh` sidecar:
```bash
TEST_ARGS=( 10 )
if [[ -n $TEST_BENCH ]]; then
TEST_ARGS=( 100 )
fi
```
**elab_bench** -- check in Lean:
```lean
let bench := (← IO.getEnv "TEST_BENCH") == some "1"
let sizes := if bench then [6, 10, 14, 18, 20] else [6]
```
## JSONL Measurement Format
Metrics use the format `topic//category`:
```json
{"metric": "elab/foo//wall-clock", "value": 0.5, "unit": "s"}
{"metric": "elab/foo//instructions", "value": 1200000000}
```
Default categories from `measure.py` (wraps `perf stat` + `rusage`):
`wall-clock`, `task-clock`, `instructions`, `cycles`, `maxrss`.
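When post-processing `measurements.jsonl`, the `topic//category` convention splits cleanly on the last `//`. A minimal Python sketch (the `split_metric` helper is hypothetical, not part of the repo):

```python
import json

def split_metric(metric: str):
    """Split a 'topic//category' metric name on its last '//'."""
    topic, _, category = metric.rpartition("//")
    return topic, category

lines = [
    '{"metric": "elab/foo//wall-clock", "value": 0.5, "unit": "s"}',
    '{"metric": "elab/foo//instructions", "value": 1200000000}',
]
for line in lines:
    entry = json.loads(line)
    topic, category = split_metric(entry["metric"])
    print(topic, category, entry["value"])
```

Using `rpartition` rather than `split` keeps topics like `elab/foo` (which contain `/`) intact.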
## Custom Metrics from Lean
Print to stdout in this exact format:
```
measurement: compress 0.123 s
measurement: parse 0.045 s
```
`extract_measurements` in `util.sh` converts these to JSONL automatically.
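The conversion `extract_measurements` performs can be approximated like this (the real helper is shell code in `util.sh`; this Python sketch with a made-up `extract` function only illustrates the mapping from stdout lines to JSONL records):

```python
import json
import re

# Matches lines like: "measurement: compress 0.123 s" (unit is optional).
PATTERN = re.compile(r"^measurement: (\S+) (\S+)(?: (\S+))?$")

def extract(stdout: str, topic: str):
    """Turn 'measurement:' stdout lines into JSONL-style records under a topic."""
    records = []
    for line in stdout.splitlines():
        m = PATTERN.match(line)
        if m:
            name, value, unit = m.groups()
            rec = {"metric": f"{topic}//{name}", "value": float(value)}
            if unit:
                rec["unit"] = unit
            records.append(rec)
    return records

out = "measurement: compress 0.123 s\nmeasurement: parse 0.045 s\n"
print("\n".join(json.dumps(r) for r in extract(out, "elab/foo")))
```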
## combine.py
Merges JSONL files. **Sums duplicate metrics** -- if you re-run without deleting
the old `.measurements.jsonl` file, values double. (CMake targets handle this via `rm -f`.)
```bash
python3 tests/combine.py -o combined.jsonl file1.jsonl file2.jsonl
```
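The summing behavior (and why stale files double values) can be sketched in a few lines of Python. This mimics the documented semantics, not the actual `combine.py`:

```python
import json
from collections import defaultdict

def combine(jsonl_texts):
    """Merge JSONL measurement streams, summing values for duplicate metrics."""
    totals = defaultdict(float)
    for text in jsonl_texts:
        for line in text.splitlines():
            rec = json.loads(line)
            totals[rec["metric"]] += rec["value"]
    return dict(totals)

run = '{"metric": "elab/foo//wall-clock", "value": 0.5, "unit": "s"}'
# Feeding the same measurement file in twice doubles the value:
print(combine([run, run]))
```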
## Scaling Analysis
No built-in tool. Write a custom script modeled on `tests/bench/mergeSort/bench.py`:
run at multiple sizes, fit curves with scipy, plot with matplotlib.
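The core of such a script is a curve fit over (size, time) pairs. A minimal sketch of the idea using a log-log fit with numpy (the mergeSort script uses scipy/matplotlib; the sizes and timings below are fabricated, exactly quadratic data for illustration):

```python
import numpy as np

# Hypothetical (size, seconds) pairs gathered by running a benchmark
# at several sizes via TEST_ARGS or the Lean `sizes` list.
sizes = np.array([1000, 2000, 4000, 8000])
times = np.array([0.10, 0.40, 1.60, 6.40])  # fabricated quadratic growth

# Fit time = c * size^k in log-log space; the slope k is the scaling exponent.
k, log_c = np.polyfit(np.log(sizes), np.log(times), 1)
print(f"estimated exponent: {k:.2f}")
```

An exponent near 1 suggests linear scaling, near 2 quadratic, and so on; real measurements will be noisier than this toy data.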