Commit Graph

63 Commits

Author SHA1 Message Date
Georgi Gerganov
f49c636db0 llama-eval : protect dump() with lock for thread safety
Assisted-by: llama.cpp:local pi
2026-05-10 21:52:43 +03:00
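The commit above protects `dump()` with a lock. A minimal sketch of the idea, with a hypothetical `EvalState` holding its results behind one `threading.Lock` (field and method names assumed for illustration):

```python
import json
import threading

class EvalState:
    """Sketch: all mutations and serialization share one lock."""
    def __init__(self):
        self.lock = threading.Lock()
        self.results = {}

    def record(self, task_id, outcome):
        with self.lock:
            self.results[task_id] = outcome

    def dump(self):
        # Hold the lock while serializing so a concurrent record()
        # cannot mutate results mid-serialization.
        with self.lock:
            return json.dumps(self.results, sort_keys=True)
```

Without the lock in `dump()`, a worker thread calling `record()` during serialization could raise `RuntimeError: dictionary changed size during iteration`.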
Georgi Gerganov
d5165e8f2e llama-eval : require --grader-model or --model when using --grader-type llm
Assisted-by: llama.cpp:local pi
2026-05-10 21:49:58 +03:00
Georgi Gerganov
85c6aa006d llama-server-simulator : fix comment - Dice coefficient, not Levenshtein
Assisted-by: llama.cpp:local pi
2026-05-10 21:49:02 +03:00
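The corrected comment names the Dice coefficient rather than Levenshtein distance. A minimal sketch of bigram Dice similarity (the function name and bigram choice are assumptions, not the simulator's actual code):

```python
def dice_coefficient(a: str, b: str) -> float:
    """Sørensen–Dice similarity over character bigrams.
    Unlike Levenshtein distance, which counts edit operations,
    this measures the overlap of shared bigrams: 2|X∩Y| / (|X|+|Y|)."""
    bigrams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0  # two empty/one-char strings: treat as identical
    return 2 * len(x & y) / (len(x) + len(y))
```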
Georgi Gerganov
e5ac6d1da6 llama-eval : track model name in eval state and verify on resume
- Store model_name in EvalState and JSON output
- Display model in HTML summary table
- Verify --model matches stored model when resuming

Assisted-by: llama.cpp:local pi
2026-05-10 21:43:35 +03:00
Georgi Gerganov
094554dbcc llama-eval : update README with PR link and quick-start examples
Assisted-by: llama.cpp:local pi
2026-05-10 21:22:48 +03:00
Georgi Gerganov
f64d56bcd8 llama-server-simulator : replace Flask with stdlib http.server
- Use HTTPServer + BaseHTTPRequestHandler instead of Flask
- RequestHandler handles POST /v1/chat/completions
- Server runs in daemon thread with clean Ctrl+C shutdown
- Remove flask and unused asdict imports

Assisted-by: llama.cpp:local pi
2026-05-10 20:47:08 +03:00
ggerganov
43f14a0a46 llama-eval : support multiple evaluation endpoints with dynamic task distribution
- Add ServerConfig dataclass (url, threads, name)
- Accept comma-separated --server, --threads, --server-name CLI args
- Dynamic shared-queue task distribution across servers (fast servers do more work)
- One ThreadPoolExecutor per server, workers pull from shared Queue
- Track which server processed each task (server_name in results)
- Thread-safe EvalState with threading.Lock for concurrent mutations
- Server column in HTML report and console output
- Backward compatible: single server works as before

Assisted-by: llama.cpp:local pi
2026-05-10 20:42:23 +03:00
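The dynamic distribution described above — one `ThreadPoolExecutor` per server, all workers pulling from a single shared `Queue` so faster servers naturally pick up more tasks — can be sketched like this (the `send` callback and result tuple shape are assumptions):

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class ServerConfig:
    url: str
    threads: int
    name: str

def run_tasks(tasks, servers, send):
    """Shared-queue distribution sketch: every worker, regardless of
    which server's pool it belongs to, pulls from the same queue."""
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    results, lock = [], threading.Lock()

    def worker(server):
        while True:
            try:
                task = q.get_nowait()
            except queue.Empty:
                return  # no work left for this worker
            out = send(server, task)
            with lock:
                # Track which server processed each task.
                results.append((server.name, task, out))

    pools = [ThreadPoolExecutor(s.threads) for s in servers]
    futures = [p.submit(worker, s)
               for p, s in zip(pools, servers)
               for _ in range(s.threads)]
    for f in futures:
        f.result()  # propagate worker exceptions
    for p in pools:
        p.shutdown()
    return results
```

With a single entry in `servers`, this degenerates to the original one-server behavior, which is what keeps the change backward compatible.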
Georgi Gerganov
d26b1ffcc9 llama-eval : rename display, escaped, and count variables to use prefix convention
- _display suffix → display_ prefix (answer, tokens, tps, t_gen)
- _escaped suffix → escaped_ prefix (response, prompt, reasoning)
- _count suffix → n_ prefix (correct, incorrect, pending)

Assisted-by: llama.cpp:local pi
2026-05-10 19:24:29 +03:00
Georgi Gerganov
9f10d8d195 llama-eval : add per-task generation time from server timings
Extract predicted_ms from the server timings response and store it as
t_gen_ms per task. Display in seconds with one decimal digit in console
progress, print_all_tasks, and HTML report.

Assisted-by: llama.cpp:local pi
2026-05-10 19:15:34 +03:00
Georgi Gerganov
4d5dedc569 llama-eval : add per-task generation speed from server timings
Extract predicted_per_second from the server timings response and store
it as tps_gen per task. Display in console progress, print_all_tasks,
and HTML report.

Assisted-by: llama.cpp:local pi
2026-05-10 19:05:20 +03:00
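The two commits above pull `predicted_ms` and `predicted_per_second` out of the server's `timings` object. A minimal extraction/formatting sketch (helper names are hypothetical; the timings field names are taken from the commit messages):

```python
def extract_timings(response: dict) -> dict:
    """Pull per-task generation speed and time out of a
    llama-server-style `timings` object (sketch)."""
    timings = response.get("timings", {})
    return {
        "tps_gen": timings.get("predicted_per_second"),  # tokens/s
        "t_gen_ms": timings.get("predicted_ms"),         # total gen time
    }

def format_t_gen(t_gen_ms) -> str:
    """Display generation time in seconds with one decimal digit,
    as described for console progress and the HTML report."""
    return "-" if t_gen_ms is None else f"{t_gen_ms / 1000:.1f}s"
```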
Georgi Gerganov
81a65cf035 eval : add Wilson score confidence interval to results
Compute 95% CI on-the-fly from completed cases. Displayed in
terminal output, HTML report, and JSON state.
2026-05-10 18:46:36 +03:00
Georgi Gerganov
7d433f767b eval : unify "judge" terminology to "grader"
Replace all occurrences of "judge" with "grader" for consistency
across the codebase (CLI args, Grader class fields, help text).

Assisted-by: llama.cpp:local pi
2026-05-10 18:23:28 +03:00
Georgi Gerganov
633a68d6c2 remove junk 2026-05-10 18:13:50 +03:00
Georgi Gerganov
e0a2cf48ca track total time 2026-05-10 18:13:50 +03:00
Georgi Gerganov
bad9565a1e refactor 2026-05-10 18:13:50 +03:00
Georgi Gerganov
752b703a5e reasoning and error handling 2026-05-10 18:13:50 +03:00
Georgi Gerganov
fc571f3a1e add tokens 2026-05-10 18:13:50 +03:00
Georgi Gerganov
6797d80dff store full response 2026-05-10 18:13:50 +03:00
Georgi Gerganov
3649793811 add html 2026-05-10 18:13:50 +03:00
Georgi Gerganov
7e8c88c5e0 fix prompts 2026-05-10 18:13:49 +03:00
Georgi Gerganov
2e0b6766f3 simplify 2026-05-10 18:13:49 +03:00
Georgi Gerganov
f95f4dd1ca fix counts 2026-05-10 18:13:49 +03:00
Georgi Gerganov
095c8ab655 cleanup 2026-05-10 18:13:49 +03:00
Georgi Gerganov
d830acacc5 resume eval 2026-05-10 18:13:49 +03:00
Georgi Gerganov
f35b10f0a9 ignore errors 2026-05-10 18:13:49 +03:00
Georgi Gerganov
802d85e26e add AGENTS.md 2026-05-10 18:13:49 +03:00
Georgi Gerganov
91bd92c6b6 cleanup 2026-05-10 18:13:48 +03:00
Georgi Gerganov
f20b5a72cf datasets : fix aime2025 2026-05-10 18:13:48 +03:00
Georgi Gerganov
122dfe3eab grade : improve regex + logs 2026-05-10 18:13:48 +03:00
Georgi Gerganov
8b94ab4f4a grader : update prompt 2026-05-10 18:13:48 +03:00
Georgi Gerganov
f99d77f3bd datasets : add aime2025 2026-05-10 18:13:48 +03:00
Georgi Gerganov
55a7cf4a06 cont 2026-05-10 18:13:48 +03:00
Georgi Gerganov
6e7e1a5a63 grader : improve example answers 2026-05-10 18:13:48 +03:00
Georgi Gerganov
9f02fa6382 rename 2026-05-10 18:13:47 +03:00
Georgi Gerganov
e7b8646098 add gpqa + sampling + docs 2026-05-10 18:13:47 +03:00
Georgi Gerganov
55ce1b4e2f datasets : add gsm8k 2026-05-10 18:13:47 +03:00
Georgi Gerganov
abec77e068 remove old files 2026-05-10 18:13:47 +03:00
Georgi Gerganov
65e3c5a928 docs 2026-05-10 18:13:47 +03:00
Georgi Gerganov
4f176f6a4d improve grader 2026-05-10 18:13:47 +03:00
Georgi Gerganov
9578e83ac2 minor 2026-05-10 18:13:47 +03:00
Georgi Gerganov
530f38f9c3 eval : support multiple dataset runs 2026-05-10 18:13:46 +03:00
Georgi Gerganov
cda8cae01a sim : fix answer matching 2026-05-10 18:13:46 +03:00
Georgi Gerganov
64720e1e01 test : fix path 2026-05-10 18:13:46 +03:00
Georgi Gerganov
1a780f7c44 eval : add prompts 2026-05-10 18:13:46 +03:00
Georgi Gerganov
940364e4c9 eval : print progress 2026-05-10 18:13:46 +03:00
Georgi Gerganov
ee9b715eb6 examples: add task summary table to llama-eval-new.py 2026-05-10 18:13:46 +03:00
Georgi Gerganov
d639ee52ea docs: update llama-eval-discussion.md with threading and model parameter updates
- Add threading support implementation details
- Document ThreadPoolExecutor usage and thread safety
- Add model parameter implementation details
- Include testing results for both features
2026-05-10 18:13:46 +03:00
Georgi Gerganov
fb40d1a04a examples: add threading support and model parameter to llama-eval-new.py
- Add ThreadPoolExecutor for parallel request processing controlled by --threads
- Add --model argument to specify model name in request data
- Refactor process() to use thread-safe _process_single_case() method
- Update progress tracking to work with concurrent execution
2026-05-10 18:13:45 +03:00
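The `--threads`/`--model` flags and the executor-based `process()` described in the bullets above can be sketched as follows (the `handle_case` callback stands in for the commit's `_process_single_case()`):

```python
import argparse
from concurrent.futures import ThreadPoolExecutor

def build_parser():
    """CLI flags as described in the commit (sketch)."""
    p = argparse.ArgumentParser()
    p.add_argument("--threads", type=int, default=1,
                   help="number of parallel request workers")
    p.add_argument("--model", default=None,
                   help="model name to send in request data")
    return p

def process(cases, handle_case, threads):
    # Each case is evaluated independently, so a plain
    # ThreadPoolExecutor.map parallelizes cleanly while
    # preserving the input order of results.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(handle_case, cases))
```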
Georgi Gerganov
2fe445cc60 docs: update llama-eval-discussion.md with session work summary 2026-05-10 18:13:45 +03:00
Georgi Gerganov
3732aea2df examples: use cached dataset path in simulator to avoid HF Hub requests 2026-05-10 18:13:45 +03:00