Commit Graph

1677 Commits

Author SHA1 Message Date
Georgi Gerganov
3649793811 add html 2026-05-10 18:13:50 +03:00
Georgi Gerganov
7e8c88c5e0 fix prompts 2026-05-10 18:13:49 +03:00
Georgi Gerganov
2e0b6766f3 simplify 2026-05-10 18:13:49 +03:00
Georgi Gerganov
f95f4dd1ca fix counts 2026-05-10 18:13:49 +03:00
Georgi Gerganov
095c8ab655 cleanup 2026-05-10 18:13:49 +03:00
Georgi Gerganov
d830acacc5 resume eval 2026-05-10 18:13:49 +03:00
Georgi Gerganov
f35b10f0a9 ignore errors 2026-05-10 18:13:49 +03:00
Georgi Gerganov
802d85e26e add AGENTS.md 2026-05-10 18:13:49 +03:00
Georgi Gerganov
91bd92c6b6 cleanup 2026-05-10 18:13:48 +03:00
Georgi Gerganov
f20b5a72cf datasets : fix aime2025 2026-05-10 18:13:48 +03:00
Georgi Gerganov
122dfe3eab grade : improve regex + logs 2026-05-10 18:13:48 +03:00
Georgi Gerganov
8b94ab4f4a grader : update prompt 2026-05-10 18:13:48 +03:00
Georgi Gerganov
f99d77f3bd datasets : add aime2025 2026-05-10 18:13:48 +03:00
Georgi Gerganov
55a7cf4a06 cont 2026-05-10 18:13:48 +03:00
Georgi Gerganov
6e7e1a5a63 grader : improve example answers 2026-05-10 18:13:48 +03:00
Georgi Gerganov
9f02fa6382 rename 2026-05-10 18:13:47 +03:00
Georgi Gerganov
e7b8646098 add gpqa + sampling + docs 2026-05-10 18:13:47 +03:00
Georgi Gerganov
55ce1b4e2f datasets : add gsm8k 2026-05-10 18:13:47 +03:00
Georgi Gerganov
abec77e068 remove old files 2026-05-10 18:13:47 +03:00
Georgi Gerganov
65e3c5a928 docs 2026-05-10 18:13:47 +03:00
Georgi Gerganov
4f176f6a4d improve grader 2026-05-10 18:13:47 +03:00
Georgi Gerganov
9578e83ac2 minor 2026-05-10 18:13:47 +03:00
Georgi Gerganov
530f38f9c3 eval : support multiple dataset runs 2026-05-10 18:13:46 +03:00
Georgi Gerganov
cda8cae01a sim : fix answer matching 2026-05-10 18:13:46 +03:00
Georgi Gerganov
64720e1e01 test : fix path 2026-05-10 18:13:46 +03:00
Georgi Gerganov
1a780f7c44 eval : add prompts 2026-05-10 18:13:46 +03:00
Georgi Gerganov
940364e4c9 eval : print progress 2026-05-10 18:13:46 +03:00
Georgi Gerganov
ee9b715eb6 examples: add task summary table to llama-eval-new.py 2026-05-10 18:13:46 +03:00
Georgi Gerganov
d639ee52ea docs: update llama-eval-discussion.md with threading and model parameter updates
- Add threading support implementation details
- Document ThreadPoolExecutor usage and thread safety
- Add model parameter implementation details
- Include testing results for both features
2026-05-10 18:13:46 +03:00
Georgi Gerganov
fb40d1a04a examples: add threading support and model parameter to llama-eval-new.py
- Add ThreadPoolExecutor for parallel request processing controlled by --threads
- Add --model argument to specify model name in request data
- Refactor process() to use thread-safe _process_single_case() method
- Update progress tracking to work with concurrent execution
2026-05-10 18:13:45 +03:00
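The threading change described in this commit can be pictured with a minimal sketch. It is not the code from llama-eval-new.py: `process_single_case`, `process_all`, and the case dictionary layout are hypothetical names chosen for illustration, and the real script sends HTTP requests rather than comparing fields.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_single_case(case):
    # Hypothetical stand-in for "send one request and grade the reply".
    # It touches no shared state, so it is safe to run in several threads.
    return {"id": case["id"], "correct": case["answer"] == case["expected"]}


def process_all(cases, threads=4):
    # `threads` plays the role of a --threads style option (assumed name).
    results = {}
    done = 0
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(process_single_case, c) for c in cases]
        # Results are collected in the calling thread as they finish,
        # so the progress counter and results dict need no lock.
        for fut in as_completed(futures):
            res = fut.result()
            done += 1
            results[res["id"]] = res["correct"]
            status = "correct" if res["correct"] else "incorrect"
            print(f"[{done}/{len(cases)}] case {res['id']}: {status}")
    return results
```

Keeping the per-case function free of shared state and doing all bookkeeping in the calling thread is one simple way to make progress tracking work with concurrent execution.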
Georgi Gerganov
2fe445cc60 docs: update llama-eval-discussion.md with session work summary 2026-05-10 18:13:45 +03:00
Georgi Gerganov
3732aea2df examples: use cached dataset path in simulator to avoid HF Hub requests 2026-05-10 18:13:45 +03:00
Georgi Gerganov
edc766c919 examples: use cached dataset path to avoid HF Hub requests 2026-05-10 18:13:45 +03:00
Georgi Gerganov
d7d2c22909 examples: remove HF_HUB_OFFLINE to allow dataset download 2026-05-10 18:13:45 +03:00
Georgi Gerganov
30ea5124de examples: use HF_HUB_OFFLINE to avoid HF Hub warnings 2026-05-10 18:13:45 +03:00
Georgi Gerganov
0ca458d892 examples: implement flexible grader system for answer validation
- Add Grader class supporting regex and CLI-based grading
- Implement built-in regex patterns for AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande
- Add CLI grader interface: python script.py --answer <pred> --expected <gold>
- Add HF telemetry disable to avoid warnings
- Support exact match requirement for regex patterns
- Add 30-second timeout for CLI grader
- Handle both boxed and plain text formats for AIME answers
2026-05-10 18:13:45 +03:00
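As an illustration of regex-based grading that accepts both boxed and plain-text AIME answers, here is a small sketch. The patterns and the function names `extract_aime_answer` and `grade` are assumptions made for this example; the commit's actual Grader class and its built-in patterns are not shown in this log.

```python
import re

# Assumed patterns: AIME answers are integers from 0 to 999.
BOXED = re.compile(r"\\boxed\{\s*(\d{1,3})\s*\}")
PLAIN = re.compile(r"\b(\d{1,3})\b")


def extract_aime_answer(text):
    # Prefer the last \boxed{...} value; otherwise fall back to the
    # last standalone 1-3 digit number in the text.
    boxed = BOXED.findall(text)
    if boxed:
        return boxed[-1]
    plain = PLAIN.findall(text)
    return plain[-1] if plain else None


def grade(pred_text, expected):
    # Compare as integers so that "042" and "42" count as the same answer.
    got = extract_aime_answer(pred_text)
    return got is not None and int(got) == int(expected)
```

Taking the last match reflects the common habit of stating the final answer at the end of a response; a stricter grader could instead require an exact match of the whole response.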
Georgi Gerganov
de8eda468b docs: remove README.md from llama-eval 2026-05-10 18:13:44 +03:00
Georgi Gerganov
a2b96e0444 examples: add simplified llama-eval-new.py for AIME evaluation
- Create new simplified evaluation script focused only on AIME
- Implement EvalState and Processor dataclasses for structured state management
- Add real-time feedback showing correct/incorrect status per case
- Abstract grading interface for external grader support
- Use structured JSON output for eval state
- Apply HuggingFace dataset caching to avoid repeated downloads
- Remove Levenshtein matching - eval script only sends requests and validates answers
2026-05-10 18:13:44 +03:00
Georgi Gerganov
deed078654 docs: update llama-eval-discussion.md with session work summary
Add summary of llama-server-simulator implementation work including
features, testing results, technical decisions, and refactoring.
2026-05-10 18:13:44 +03:00
Georgi Gerganov
05b8425bd6 examples: refactor test-simulator.sh for better readability
Extract repeating question string into TEST_QUESTION variable and
create make_request() helper function to reduce code duplication.
Add proper error handling for error responses.
2026-05-10 18:13:44 +03:00
Georgi Gerganov
58bd57ba99 examples: add llama-server simulator for testing eval scripts
Add a standalone Python script that simulates a llama-server HTTP endpoint
for testing the eval script. The simulator:

- Implements /v1/chat/completions endpoint with OpenAI-compatible format
- Loads AIME dataset from HuggingFace with local caching
- Uses Levenshtein distance for intelligent question matching
- Supports configurable success rate for correct/wrong answer generation
- Provides debug logging for troubleshooting

Also includes test scripts and documentation for testing and understanding
the simulator functionality.
2026-05-10 18:13:44 +03:00
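Question matching by Levenshtein distance, as mentioned in this commit, can be sketched as follows. This is a plain dynamic-programming version written for illustration; the simulator may well use a library instead, and `best_match` is a hypothetical helper name.

```python
def levenshtein(a, b):
    # Classic edit distance, keeping only the previous row of the table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def best_match(query, questions):
    # Return the stored question closest to the incoming prompt text.
    return min(questions, key=lambda q: levenshtein(query, q))
```

Fuzzy matching of this kind lets a simulator recognise a dataset question even when the eval script wraps it in a slightly different prompt, at a cost of O(len(a) * len(b)) per comparison.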
gatbontonpc
5cbe95b6e5 add checkpointing 2026-05-10 18:13:44 +03:00
gatbontonpc
c7f3ce25f5 Add readme 2026-05-10 18:13:44 +03:00
gatbontonpc
4db4497ca7 multi source llama-eval 2026-05-10 18:13:43 +03:00
gatbontonpc
db8b09d6e8 working llama-eval mc and math suite 2026-05-10 18:13:42 +03:00
Neo Zhang
6a2a2513dc sycl : fix script error (#22795) 2026-05-08 06:54:57 +03:00
Shane Tran Whitmire
cfff1fc300 sycl : fix test script (#22737)
The error:
./examples/sycl/test.sh: line 122: level_zero:${$GGML_SYCL_DEVICE}: bad
substitution

was thrown whenever the user used this command:
./examples/sycl/test.sh -mg 0

Fix is to get rid of a dollar sign.
2026-05-07 08:25:57 +03:00
Adrien Gallouët
bf76ac77be common : only load backends when required (#22290)
* common : only load backends when required

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* llama : call ggml_backend_load_all() directly from llama_backend_init()

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add ggml_backend_load_all() where llama_backend_init() is not used

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-05 09:23:50 +02:00
Georgi Gerganov
d6e7b033a4 llama : add option to save memory in device buffers (#22679)
* llama : add option to save memory in device buffers

* tests : extend llama-save-load-state
2026-05-05 06:35:07 +03:00
Shakhnazar Sailaukan
d8794eecd5 examples: refactor diffusion generation (#22590)
* examples: refactor diffusion generation

* renamed enum values
2026-05-04 20:19:30 +08:00