Compare commits

...

1 Commits

Author SHA1 Message Date
Kim Morrison
bb577aaf88 chore: add lean4-benchmarks skill for Claude Code
This PR moves benchmark documentation from CLAUDE.md into a dedicated
project-level skill at .claude/skills/lean4-benchmarks/SKILL.md, covering
the measurement pipeline, variable-size benchmarks, JSONL format, custom
metrics, and combine.py behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 22:03:07 +00:00
2 changed files with 80 additions and 7 deletions

View File

@@ -33,14 +33,9 @@ tests/with_stage1_test_env.sh tests/elab_bench/run_bench.sh cbv_decide.lean
 tests/with_stage1_test_env.sh tests/elab/run_test.sh grind_indexmap.lean
 ```
-## Benchmark vs Test Problem Sizes
+## Benchmarks
-Benchmarks are also run as tests. Use the `TEST_BENCH` environment variable (unset in tests, set to `1` in benchmarks) to scale problem sizes:
-- In `compile_bench` `.init.sh` files: check `$TEST_BENCH` and set `TEST_ARGS` accordingly
-- In `elab_bench` Lean files: use `(← IO.getEnv "TEST_BENCH") == some "1"` to switch between small (test) and large (bench) inputs
-See `tests/README.md` for the full benchmark writing guide.
+See the `lean4-benchmarks` skill (`.claude/skills/lean4-benchmarks/SKILL.md`) and `tests/README.md`.
 ## Testing stage 2

View File

@@ -0,0 +1,78 @@
---
name: lean4-benchmarks
description: Working with the Lean4 benchmark system. Use when creating, running, or analyzing benchmarks, or when working in tests/elab_bench/ or tests/compile_bench/.
---
# Lean4 Benchmarks
## Two Benchmark Piles
- **`tests/elab_bench/`** -- elaboration benchmarks. Each `.lean` file is elaborated by `lean` and measured.
- **`tests/compile_bench/`** -- compile+run benchmarks. Each `.lean` is compiled to C, then native binary, then execution is measured.
New `.lean` files are auto-registered via the glob in `tests/CMakeLists.txt`.
## Running
```bash
# Full suite → tests/measurements.jsonl
make -C build/release -j$(nproc) bench
# Single benchmark manually (test sizes, fast):
tests/with_stage1_test_env.sh tests/elab_bench/run_bench.sh foo.lean
# Single benchmark at real sizes (sets TEST_BENCH=1):
tests/with_stage1_bench_env.sh tests/elab_bench/run_bench.sh foo.lean
```
## Variable-Size Benchmarks
`TEST_BENCH` env var: unset in tests, set to `1` in benchmarks. Two mechanisms:
**compile_bench** -- create `foo.lean.init.sh` sidecar:
```bash
TEST_ARGS=( 10 )
if [[ -n $TEST_BENCH ]]; then
TEST_ARGS=( 100 )
fi
```
**elab_bench** -- check in Lean:
```lean
let bench := (← IO.getEnv "TEST_BENCH") == some "1"
let sizes := if bench then [6, 10, 14, 18, 20] else [6]
```
## JSONL Measurement Format
Metrics use the format `topic//category`:
```json
{"metric": "elab/foo//wall-clock", "value": 0.5, "unit": "s"}
{"metric": "elab/foo//instructions", "value": 1200000000}
```
Default categories from `measure.py` (wraps `perf stat` + `rusage`):
`wall-clock`, `task-clock`, `instructions`, `cycles`, `maxrss`.
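When post-processing `measurements.jsonl`, the `topic//category` convention splits cleanly on the last `//`. A minimal Python sketch (the `split_metric` helper is hypothetical, not part of the repo):

```python
import json

def split_metric(metric: str):
    """Split a 'topic//category' metric name on its last '//'."""
    topic, _, category = metric.rpartition("//")
    return topic, category

lines = [
    '{"metric": "elab/foo//wall-clock", "value": 0.5, "unit": "s"}',
    '{"metric": "elab/foo//instructions", "value": 1200000000}',
]
for line in lines:
    entry = json.loads(line)
    topic, category = split_metric(entry["metric"])
    print(topic, category, entry["value"])
```

Using `rpartition` rather than `split` keeps topics like `elab/foo` (which contain `/`) intact.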
## Custom Metrics from Lean
Print to stdout in this exact format:
```
measurement: compress 0.123 s
measurement: parse 0.045 s
```
`extract_measurements` in `util.sh` converts these to JSONL automatically.
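The conversion `extract_measurements` performs can be approximated like this (the real helper is shell code in `util.sh`; this Python sketch with a made-up `extract` function only illustrates the mapping from stdout lines to JSONL records):

```python
import json
import re

# Matches lines like: "measurement: compress 0.123 s" (unit is optional).
PATTERN = re.compile(r"^measurement: (\S+) (\S+)(?: (\S+))?$")

def extract(stdout: str, topic: str):
    """Turn 'measurement:' stdout lines into JSONL-style records under a topic."""
    records = []
    for line in stdout.splitlines():
        m = PATTERN.match(line)
        if m:
            name, value, unit = m.groups()
            rec = {"metric": f"{topic}//{name}", "value": float(value)}
            if unit:
                rec["unit"] = unit
            records.append(rec)
    return records

out = "measurement: compress 0.123 s\nmeasurement: parse 0.045 s\n"
print("\n".join(json.dumps(r) for r in extract(out, "elab/foo")))
```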
## combine.py
Merges JSONL files. **Sums duplicate metrics** -- if you re-run without deleting
the old `.measurements.jsonl` file, values double. (CMake targets handle this via `rm -f`.)
```bash
python3 tests/combine.py -o combined.jsonl file1.jsonl file2.jsonl
```
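The summing behavior (and why stale files double values) can be sketched in a few lines of Python. This mimics the documented semantics, not the actual `combine.py`:

```python
import json
from collections import defaultdict

def combine(jsonl_texts):
    """Merge JSONL measurement streams, summing values for duplicate metrics."""
    totals = defaultdict(float)
    for text in jsonl_texts:
        for line in text.splitlines():
            rec = json.loads(line)
            totals[rec["metric"]] += rec["value"]
    return dict(totals)

run = '{"metric": "elab/foo//wall-clock", "value": 0.5, "unit": "s"}'
# Feeding the same measurement file in twice doubles the value:
print(combine([run, run]))
```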
## Scaling Analysis
No built-in tool. Write a custom script modeled on `tests/bench/mergeSort/bench.py`:
run at multiple sizes, fit curves with scipy, plot with matplotlib.
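The core of such a script is a curve fit over (size, time) pairs. A minimal sketch of the idea using a log-log fit with numpy (the mergeSort script uses scipy/matplotlib; the sizes and timings below are fabricated, exactly quadratic data for illustration):

```python
import numpy as np

# Hypothetical (size, seconds) pairs gathered by running a benchmark
# at several sizes via TEST_ARGS or the Lean `sizes` list.
sizes = np.array([1000, 2000, 4000, 8000])
times = np.array([0.10, 0.40, 1.60, 6.40])  # fabricated quadratic growth

# Fit time = c * size^k in log-log space; the slope k is the scaling exponent.
k, log_c = np.polyfit(np.log(sizes), np.log(times), 1)
print(f"estimated exponent: {k:.2f}")
```

An exponent near 1 suggests linear scaling, near 2 quadratic, and so on; real measurements will be noisier than this toy data.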