Commit Graph

9146 Commits

Author SHA1 Message Date
Georgi Gerganov
bad9565a1e refactor 2026-05-10 18:13:50 +03:00
Georgi Gerganov
752b703a5e reasoning and error handling 2026-05-10 18:13:50 +03:00
Georgi Gerganov
fc571f3a1e add tokens 2026-05-10 18:13:50 +03:00
Georgi Gerganov
6797d80dff store full response 2026-05-10 18:13:50 +03:00
Georgi Gerganov
3649793811 add html 2026-05-10 18:13:50 +03:00
Georgi Gerganov
7e8c88c5e0 fix prompts 2026-05-10 18:13:49 +03:00
Georgi Gerganov
2e0b6766f3 simplify 2026-05-10 18:13:49 +03:00
Georgi Gerganov
f95f4dd1ca fix counts 2026-05-10 18:13:49 +03:00
Georgi Gerganov
095c8ab655 cleanup 2026-05-10 18:13:49 +03:00
Georgi Gerganov
d830acacc5 resume eval 2026-05-10 18:13:49 +03:00
Georgi Gerganov
f35b10f0a9 ignore errors 2026-05-10 18:13:49 +03:00
Georgi Gerganov
802d85e26e add AGENTS.md 2026-05-10 18:13:49 +03:00
Georgi Gerganov
91bd92c6b6 cleanup 2026-05-10 18:13:48 +03:00
Georgi Gerganov
f20b5a72cf datasets : fix aime2025 2026-05-10 18:13:48 +03:00
Georgi Gerganov
122dfe3eab grade : improve regex + logs 2026-05-10 18:13:48 +03:00
Georgi Gerganov
8b94ab4f4a grader : update prompt 2026-05-10 18:13:48 +03:00
Georgi Gerganov
f99d77f3bd datasets : add aime2025 2026-05-10 18:13:48 +03:00
Georgi Gerganov
55a7cf4a06 cont 2026-05-10 18:13:48 +03:00
Georgi Gerganov
6e7e1a5a63 grader : improve example answers 2026-05-10 18:13:48 +03:00
Georgi Gerganov
9f02fa6382 rename 2026-05-10 18:13:47 +03:00
Georgi Gerganov
e7b8646098 add gpqa + sampling + docs 2026-05-10 18:13:47 +03:00
Georgi Gerganov
55ce1b4e2f datasets : add gsm8k 2026-05-10 18:13:47 +03:00
Georgi Gerganov
abec77e068 remove old files 2026-05-10 18:13:47 +03:00
Georgi Gerganov
65e3c5a928 docs 2026-05-10 18:13:47 +03:00
Georgi Gerganov
4f176f6a4d improve grader 2026-05-10 18:13:47 +03:00
Georgi Gerganov
9578e83ac2 minor 2026-05-10 18:13:47 +03:00
Georgi Gerganov
530f38f9c3 eval : support multiple dataset runs 2026-05-10 18:13:46 +03:00
Georgi Gerganov
cda8cae01a sim : fix answer matching 2026-05-10 18:13:46 +03:00
Georgi Gerganov
64720e1e01 test : fix path 2026-05-10 18:13:46 +03:00
Georgi Gerganov
1a780f7c44 eval : add prompts 2026-05-10 18:13:46 +03:00
Georgi Gerganov
940364e4c9 eval : print progress 2026-05-10 18:13:46 +03:00
Georgi Gerganov
ee9b715eb6 examples: add task summary table to llama-eval-new.py 2026-05-10 18:13:46 +03:00
Georgi Gerganov
d639ee52ea docs: update llama-eval-discussion.md with threading and model parameter updates
- Add threading support implementation details
- Document ThreadPoolExecutor usage and thread safety
- Add model parameter implementation details
- Include testing results for both features
2026-05-10 18:13:46 +03:00
Georgi Gerganov
fb40d1a04a examples: add threading support and model parameter to llama-eval-new.py
- Add ThreadPoolExecutor for parallel request processing controlled by --threads
- Add --model argument to specify model name in request data
- Refactor process() to use thread-safe _process_single_case() method
- Update progress tracking to work with concurrent execution
2026-05-10 18:13:45 +03:00
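The threading change in fb40d1a04a could look roughly like the sketch below. This is an illustration only, not the code from llama-eval-new.py: `process_single_case`, `process_all`, and the upper-casing stand-in for the HTTP request are hypothetical names and behavior, and `n_threads` plays the role of the `--threads` value.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_single_case(case_id: int, question: str) -> tuple[int, str]:
    # Hypothetical stand-in for sending one request to the server.
    # It touches no shared state, so it is safe to run in several threads.
    return case_id, question.upper()


def process_all(questions: list[str], n_threads: int = 4) -> dict[int, str]:
    results: dict[int, str] = {}
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [
            pool.submit(process_single_case, i, q)
            for i, q in enumerate(questions)
        ]
        # Results are collected in the main thread as futures finish,
        # so the progress counter and the dict need no lock.
        for done, fut in enumerate(as_completed(futures), start=1):
            case_id, answer = fut.result()
            results[case_id] = answer
            print(f"progress: {done}/{len(futures)}")
    return results
```

Keeping all bookkeeping in the collecting loop is one simple way to make progress tracking work with concurrent execution; the actual script may do this differently.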
Georgi Gerganov
2fe445cc60 docs: update llama-eval-discussion.md with session work summary 2026-05-10 18:13:45 +03:00
Georgi Gerganov
3732aea2df examples: use cached dataset path in simulator to avoid HF Hub requests 2026-05-10 18:13:45 +03:00
Georgi Gerganov
edc766c919 examples: use cached dataset path to avoid HF Hub requests 2026-05-10 18:13:45 +03:00
Georgi Gerganov
d7d2c22909 examples: remove HF_HUB_OFFLINE to allow dataset download 2026-05-10 18:13:45 +03:00
Georgi Gerganov
30ea5124de examples: use HF_HUB_OFFLINE to avoid HF Hub warnings 2026-05-10 18:13:45 +03:00
Georgi Gerganov
0ca458d892 examples: implement flexible grader system for answer validation
- Add Grader class supporting regex and CLI-based grading
- Implement built-in regex patterns for AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande
- Add CLI grader interface: python script.py --answer <pred> --expected <gold>
- Add HF telemetry disable to avoid warnings
- Support exact match requirement for regex patterns
- Add 30-second timeout for CLI grader
- Handle both boxed and plain text formats for AIME answers
2026-05-10 18:13:45 +03:00
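A minimal sketch of the two grading paths described in 0ca458d892, under stated assumptions: the regex patterns, the function names, and the convention that the CLI grader signals a correct answer with exit code 0 are all hypothetical. Only the `--answer`/`--expected` arguments and the 30-second timeout come from the commit message.

```python
import re
import subprocess
import sys
from typing import Optional

# Assumed patterns: AIME answers are integers from 0 to 999,
# written either as \boxed{N} or as plain text.
BOXED = re.compile(r"\\boxed\{\s*(\d+)\s*\}")
PLAIN = re.compile(r"\b(\d{1,3})\b")


def extract_aime_answer(text: str) -> Optional[str]:
    # Prefer the last boxed value; otherwise fall back to the last plain integer.
    boxed = BOXED.findall(text)
    if boxed:
        return boxed[-1]
    plain = PLAIN.findall(text)
    return plain[-1] if plain else None


def grade_regex(response: str, expected: str) -> bool:
    # Compare as integers so that "042" and "42" count as the same answer.
    pred = extract_aime_answer(response)
    return pred is not None and int(pred) == int(expected)


def grade_cli(script: str, pred: str, expected: str, timeout: float = 30.0) -> bool:
    # Run an external grader script; exit code 0 meaning "correct" is an assumption.
    try:
        proc = subprocess.run(
            [sys.executable, script, "--answer", pred, "--expected", expected],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0
```

For example, `grade_regex(r"The answer is \boxed{042}.", "42")` evaluates to `True`, while a response with no digits is graded as incorrect.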
Georgi Gerganov
de8eda468b docs: remove README.md from llama-eval 2026-05-10 18:13:44 +03:00
Georgi Gerganov
a2b96e0444 examples: add simplified llama-eval-new.py for AIME evaluation
- Create new simplified evaluation script focused only on AIME
- Implement EvalState and Processor dataclasses for structured state management
- Add real-time feedback showing correct/incorrect status per case
- Abstract grading interface for external grader support
- Use structured JSON output for eval state
- Apply HuggingFace dataset caching to avoid repeated downloads
- Remove Levenshtein matching - eval script only sends requests and validates answers
2026-05-10 18:13:44 +03:00
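The structured eval state mentioned in a2b96e0444 might be shaped like the sketch below. The field names, the `CaseResult` helper, and the methods are assumptions for illustration; the commit only states that `EvalState` is a dataclass and that the state is written as structured JSON.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class CaseResult:
    # Hypothetical per-case record.
    case_id: str
    expected: str
    predicted: Optional[str]
    correct: bool


@dataclass
class EvalState:
    # Hypothetical container for a whole evaluation run.
    dataset: str
    cases: List[CaseResult] = field(default_factory=list)

    def record(self, case_id: str, expected: str, predicted: Optional[str]) -> bool:
        correct = predicted is not None and predicted == expected
        self.cases.append(CaseResult(case_id, expected, predicted, correct))
        # Real-time feedback per case.
        print(f"{case_id}: {'correct' if correct else 'incorrect'}")
        return correct

    def accuracy(self) -> float:
        if not self.cases:
            return 0.0
        return sum(c.correct for c in self.cases) / len(self.cases)

    def to_json(self) -> str:
        # asdict() converts nested dataclasses into plain dicts and lists.
        return json.dumps(asdict(self), indent=2)
```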
Georgi Gerganov
deed078654 docs: update llama-eval-discussion.md with session work summary
Add summary of llama-server-simulator implementation work including
features, testing results, technical decisions, and refactoring.
2026-05-10 18:13:44 +03:00
Georgi Gerganov
05b8425bd6 examples: refactor test-simulator.sh for better readability
Extract repeating question string into TEST_QUESTION variable and
create make_request() helper function to reduce code duplication.
Add proper error handling for error responses.
2026-05-10 18:13:44 +03:00
Georgi Gerganov
58bd57ba99 examples: add llama-server simulator for testing eval scripts
Add a standalone Python script that simulates a llama-server HTTP endpoint
for testing the eval script. The simulator:

- Implements /v1/chat/completions endpoint with OpenAI-compatible format
- Loads AIME dataset from HuggingFace with local caching
- Uses Levenshtein distance for intelligent question matching
- Supports configurable success rate for correct/wrong answer generation
- Provides debug logging for troubleshooting

Also includes test scripts and documentation for testing and understanding
the simulator functionality.
2026-05-10 18:13:44 +03:00
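The question matching and configurable success rate described in 58bd57ba99 can be sketched as follows. This is a self-contained illustration, not the simulator's code: the function names, the `question`/`answer` record keys, and the way a wrong answer is produced are assumptions, and the edit distance is implemented inline rather than through whatever library the simulator may use.

```python
import random
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, keeping only two rows.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def find_best_match(query: str, dataset: List[Dict[str, str]]) -> Dict[str, str]:
    # Pick the dataset entry whose question text is closest to the query.
    return min(dataset, key=lambda item: levenshtein(query, item["question"]))


def simulate_answer(item: Dict[str, str], success_rate: float,
                    rng: random.Random) -> str:
    # Return the gold answer with probability success_rate,
    # otherwise a deliberately wrong integer (gold + 1, an arbitrary choice).
    if rng.random() < success_rate:
        return item["answer"]
    return str(int(item["answer"]) + 1)
```

Fuzzy matching lets the simulator recognize a question even when the eval script wraps it in a prompt template or changes whitespace, which exact string comparison would miss.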
gatbontonpc
5cbe95b6e5 add checkpointing 2026-05-10 18:13:44 +03:00
gatbontonpc
c7f3ce25f5 Add readme 2026-05-10 18:13:44 +03:00
gatbontonpc
4db4497ca7 multi source llama-eval 2026-05-10 18:13:43 +03:00
gatbontonpc
db8b09d6e8 working llama-eval mc and math suite 2026-05-10 18:13:42 +03:00
Georgi Gerganov
0b047287fe sync : ggml b9097 2026-05-10 17:00:11 +03:00