# llama.cpp/examples/llama-eval
llama-eval.py is a single-script evaluation runner: it sends prompts to any OpenAI-compatible HTTP server (by default, llama-server) and grades the responses.
```sh
./llama-server -m model.gguf --port 8033
python examples/llama-eval/llama-eval.py --path_server http://localhost:8033 --n_prompts 100 --prompt_source arc
```
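Under the hood, each prompt becomes a request to the server's OpenAI-compatible `/v1/chat/completions` endpoint. The sketch below shows the shape of that exchange; the helper names (`build_request`, `extract_text`) are hypothetical and not llama-eval's actual code, only the JSON payload and response layout follow the OpenAI chat completions convention that llama-server exposes.

```python
# Minimal sketch of an OpenAI-compatible chat completion round trip.
# build_request/extract_text are hypothetical names, not llama-eval's API.
import json
import urllib.request


def build_request(prompt: str, server: str = "http://localhost:8033"):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # deterministic decoding suits evaluation
    }
    return urllib.request.Request(
        server + "/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )


def extract_text(response: dict) -> str:
    """Pull the generated text out of a chat completion response body."""
    return response["choices"][0]["message"]["content"]


# Abridged response shape, as returned by OpenAI-compatible servers:
example = {"choices": [{"message": {"role": "assistant", "content": "42"}}]}
print(extract_text(example))  # prints "42"
```

Sending the request (e.g. with `urllib.request.urlopen`) and parsing the JSON body would complete the loop; the evaluator then compares the extracted text against the dataset's reference answer.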
The supported tasks are:
- GSM8K — grade-school math
- AIME — competition math (integer answers)
- MMLU — multi-domain multiple choice
- HellaSwag — commonsense reasoning multiple choice
- ARC — grade-school science multiple choice
- WinoGrande — commonsense coreference multiple choice
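These tasks fall into two scoring families: numeric answers (GSM8K, AIME) and multiple choice (MMLU, HellaSwag, ARC, WinoGrande). A minimal sketch of how such graders can work is below; this is an illustration of the two families, not llama-eval's actual extraction logic.

```python
# Hypothetical graders for the two answer families; not llama-eval's code.
import re


def grade_multiple_choice(model_output: str, reference: str) -> bool:
    """Take the last standalone A-D letter in the output as the answer."""
    letters = re.findall(r"\b([A-D])\b", model_output.upper())
    return bool(letters) and letters[-1] == reference.upper()


def grade_numeric(model_output: str, reference: str) -> bool:
    """Compare the last number in the output to the reference answer."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", model_output.replace(",", ""))
    return bool(nums) and float(nums[-1]) == float(reference)
```

Taking the *last* match is a common heuristic for chain-of-thought outputs, where the model reasons first and states its final answer at the end.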