Commit Graph

8102 Commits

Author SHA1 Message Date
Georgi Gerganov
c6d70b9bea add AGENTS.md 2026-02-16 13:13:35 +02:00
Georgi Gerganov
de956a6ca8 cleanup 2026-02-16 12:02:16 +02:00
Georgi Gerganov
350e7c1409 datasets : fix aime2025 2026-02-16 11:55:57 +02:00
Georgi Gerganov
db10dda1f3 grade : improve regex + logs 2026-02-16 11:51:36 +02:00
Georgi Gerganov
52759bf078 grader : update prompt 2026-02-16 11:17:53 +02:00
Georgi Gerganov
99e3c3d02c datasets : add aime2025 2026-02-16 11:07:54 +02:00
Georgi Gerganov
c6315655b7 cont 2026-02-16 10:56:58 +02:00
Georgi Gerganov
f762a71d56 grader : improve example answers 2026-02-16 10:51:41 +02:00
Georgi Gerganov
73e61d5b75 rename 2026-02-16 10:30:10 +02:00
Georgi Gerganov
cffd268bb3 add gpqa + sampling + docs 2026-02-16 00:52:33 +02:00
Georgi Gerganov
e8a807519a datasets : add gsm8k 2026-02-15 23:19:46 +02:00
Georgi Gerganov
1db8428f00 remove old files 2026-02-15 22:16:54 +02:00
Georgi Gerganov
7751ae2796 docs 2026-02-15 22:15:50 +02:00
Georgi Gerganov
d2b10302ce improve grader 2026-02-15 22:12:02 +02:00
Georgi Gerganov
68dde884d6 minor 2026-02-15 21:21:40 +02:00
Georgi Gerganov
fd90796da2 eval : support multiple dataset runs 2026-02-15 21:08:24 +02:00
Georgi Gerganov
8156d549f6 sim : fix answer matching 2026-02-15 21:08:24 +02:00
Georgi Gerganov
9695e6feb4 test : fix path 2026-02-15 21:08:24 +02:00
Georgi Gerganov
fb1481d60d eval : add prompts 2026-02-15 21:08:24 +02:00
Georgi Gerganov
812ae13ec1 eval : print progress 2026-02-15 21:08:24 +02:00
Georgi Gerganov
e79e8d02d5 examples: add task summary table to llama-eval-new.py 2026-02-15 21:08:23 +02:00
Georgi Gerganov
a939f4c47e docs: update llama-eval-discussion.md with threading and model parameter updates
- Add threading support implementation details
- Document ThreadPoolExecutor usage and thread safety
- Add model parameter implementation details
- Include testing results for both features
2026-02-15 21:08:23 +02:00
Georgi Gerganov
62b04cef54 examples: add threading support and model parameter to llama-eval-new.py
- Add ThreadPoolExecutor for parallel request processing controlled by --threads
- Add --model argument to specify model name in request data
- Refactor process() to use thread-safe _process_single_case() method
- Update progress tracking to work with concurrent execution
2026-02-15 21:08:23 +02:00
Georgi Gerganov
37b26cafee docs: update llama-eval-discussion.md with session work summary 2026-02-15 21:08:23 +02:00
Georgi Gerganov
04f6872116 examples: use cached dataset path in simulator to avoid HF Hub requests 2026-02-15 21:08:23 +02:00
Georgi Gerganov
c2619c18bf examples: use cached dataset path to avoid HF Hub requests 2026-02-15 21:08:23 +02:00
Georgi Gerganov
87f8930968 examples: remove HF_HUB_OFFLINE to allow dataset download 2026-02-15 21:08:23 +02:00
Georgi Gerganov
9453f9de12 examples: use HF_HUB_OFFLINE to avoid HF Hub warnings 2026-02-15 21:08:23 +02:00
Georgi Gerganov
5a1be6ce37 examples: implement flexible grader system for answer validation
- Add Grader class supporting regex and CLI-based grading
- Implement built-in regex patterns for AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande
- Add CLI grader interface: python script.py --answer <pred> --expected <gold>
- Add HF telemetry disable to avoid warnings
- Support exact match requirement for regex patterns
- Add 30-second timeout for CLI grader
- Handle both boxed and plain text formats for AIME answers
2026-02-15 21:08:23 +02:00
Georgi Gerganov
a80814e97b docs: remove README.md from llama-eval 2026-02-15 21:08:23 +02:00
Georgi Gerganov
5cc2258e82 examples: add simplified llama-eval-new.py for AIME evaluation
- Create new simplified evaluation script focused only on AIME
- Implement EvalState and Processor dataclasses for structured state management
- Add real-time feedback showing correct/incorrect status per case
- Abstract grading interface for external grader support
- Use structured JSON output for eval state
- Apply HuggingFace dataset caching to avoid repeated downloads
- Remove Levenshtein matching - eval script only sends requests and validates answers
2026-02-15 21:08:22 +02:00
Georgi Gerganov
c87af1d527 docs: update llama-eval-discussion.md with session work summary
Add summary of llama-server-simulator implementation work including
features, testing results, technical decisions, and refactoring.
2026-02-15 21:08:22 +02:00
Georgi Gerganov
23d4e21a81 examples: refactor test-simulator.sh for better readability
Extract repeating question string into TEST_QUESTION variable and
create make_request() helper function to reduce code duplication.
Add proper error handling for error responses.
2026-02-15 21:08:22 +02:00
Georgi Gerganov
07d5e1e0ea examples: add llama-server simulator for testing eval scripts
Add a standalone Python script that simulates a llama-server HTTP endpoint
for testing the eval script. The simulator:

- Implements /v1/chat/completions endpoint with OpenAI-compatible format
- Loads AIME dataset from HuggingFace with local caching
- Uses Levenshtein distance for intelligent question matching
- Supports configurable success rate for correct/wrong answer generation
- Provides debug logging for troubleshooting

Also includes test scripts and documentation for testing and understanding
the simulator functionality.
2026-02-15 21:08:22 +02:00
gatbontonpc
8839037528 add checkpointing 2026-02-15 21:08:22 +02:00
gatbontonpc
89cab3dbc5 Add readme 2026-02-15 21:08:22 +02:00
gatbontonpc
c2d83ca048 multi source llama-eval 2026-02-15 21:08:22 +02:00
gatbontonpc
c05df17ce3 working llama-eval mc and math suite 2026-02-15 21:08:19 +02:00
David Friehs
27b93cbd15 cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization (#19624)
* cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization

- load all 8 int8 for a grid position in one load
- calculate signs via popcnt instead of fetching from ksigns table
- broadcast signs to drop individual shift/mask

* cuda: iq2xxs: simplify sum scaling

express `(sum * scale + sum / 2) / 4` as `(sum * (scale * 2 + 1)) / 8`
express `((aux32 >> 28) * 2 + 1)` as `(aux32 >> 27 | 1)`

saves 3 registers for mul_mat_vec_q (152 -> 149) according to nsight
AFAICT no overflow can occur here as iq2xxs values are far too small

* uint -> uint32_t

error: identifier "uint" is undefined
b8064
2026-02-15 22:38:42 +05:30
Aaron Teo
6e67fd2144 docs: update s390x build docs (#19643) 2026-02-16 00:33:34 +08:00
Adrien Gallouët
9e118b97c4 build : remove LLAMA_HTTPLIB option (#19623)
This option was introduced as a workaround because cpp-httplib could not
build on visionOS. Since it has been fixed and now compiles on all platforms,
we can remove it and simplify many things.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b8062
2026-02-15 15:38:50 +01:00
Daniel Bevenius
57088276d4 cmake : check if KleidiAI API has been fetched (#19640)
This commit addresses a build issue with the KleidiAI backend when
building multiple cpu backends. Commmit
3a00c98584 ("cmake : fix KleidiAI install
target failure with EXCLUDE_FROM_ALL") introduced a change where
FetchContent_Populate is called instead of FetchContent_MakeAvailable,
where the latter does handle this case (it is idempotent but
FetchContent_Populate is not).

I missed this during my review and I should not have commited without
verifying the CI failure, sorry about that.
b8061
2026-02-15 13:59:38 +01:00
Georgi Gerganov
341bc7d23c context : fix output reorder with backend sampling (#19638) b8060 2026-02-15 14:57:40 +02:00
Georgi Gerganov
08e6d914b8 ggml : avoid UB in gemm ukernel (#19642) b8059 2026-02-15 14:56:35 +02:00
Aaron Teo
184c694f45 ggml-cpu: optimize ggml_vec_dot_bf16 for s390x (#19399) b8058 2026-02-15 18:20:35 +08:00
Aman Gupta
684b36101c ggml-cpu: FA add GEMM microkernel (#19422)
* ggml-cpu: FA add GEMM microkernel

* add guard for sizeless vector types

* fix case where DV % GGML_F32_EPR !=0

* move memset out of the loop

* move another memset out of the loop

* use RM=4 for arm

* simd_gemm: convert everything to int

* convert everything to size_t to avoid warnings

* fixup

* add pragma for ignoring aggressive loop optimizations
b8057
2026-02-15 11:09:24 +05:30
SamareshSingh
3a00c98584 cmake : fix KleidiAI install target failure with EXCLUDE_FROM_ALL (#19581)
* cmake: fix KleidiAI install target failure with EXCLUDE_FROM_ALL

Fix for the bug #19501 by adding EXCLUDE_FROM_ALL to FetchContent_Declare. This properly excludes KleidiAI from both build and install targets, preventing install failures when GGML_CPU_KLEIDIAI=ON is used.

The KleidiAI source files are still compiled into libggml-cpu.so, preserving all functionality.

* addressed code review comments
b8056
2026-02-15 06:22:53 +01:00
Sigbjørn Skjæret
079feab9e3 convert : ensure all models handle new experts count (#19621)
* ensure all models handle new experts count

* revert removal for PhiMoeModel, does not inherit from base
b8055
2026-02-14 22:22:32 +01:00
Anav Prasad
01d8eaa28d mtmd : Add Nemotron Nano 12B v2 VL support (#19547)
* nemotron nano v2 vlm support added

* simplified code; addressed reviews

* pre-downsample position embeddings during GGUF conversion for fixed input size
b8054
2026-02-14 14:07:00 +01:00
Georgi Gerganov
1725e316c1 models : optimize qwen3next graph (#19375)
* models : optimizing qwen3next graph

* cont

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* cont : remove redundant q, g chunking

* minor

* minor

* avoid passing masks around

* avoid concats during chunking

* naming + shapes

* update names and use prefix to disable CUDA graphs
b8053
2026-02-14 12:57:36 +02:00