llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-05-01 22:54:05 +00:00

Author	SHA1	Message	Date
Aldehir Rojas	4aa962e2b0	vocab : add byte token handling to BPE detokenizer for Gemma4 (#21488 )	2026-04-06 09:08:37 -05:00
Georgi Gerganov	400ac8e194	convert : set "add bos" == True for Gemma 4 (#21500 ) * convert : set "add bos" == True for Gemma 4 * cont : handle old GGUFs	2026-04-06 13:52:07 +03:00
anchortense	58190cc84d	llama : correct platform-independent loading of BOOL metadata (#21428 ) * model-loader : fix GGUF bool array conversion * model-loader : fix remaining GGUF bool pointer uses	2026-04-06 01:40:38 +02:00
Richard Davison	af76639f72	model : add HunyuanOCR support (#21395 ) * HunyuanOCR: add support for text and vision models - Add HunyuanOCR vision projector (perceiver-based) with Conv2d merge - Add separate HUNYUAN_OCR chat template (content-before-role format) - Handle HunyuanOCR's invalid pad_token_id=-1 in converter - Fix EOS/EOT token IDs from generation_config.json - Support xdrope RoPE scaling type - Add tensor mappings for perceiver projector (mm.before_rms, mm.after_rms, etc.) - Register HunYuanVLForConditionalGeneration for both text and mmproj conversion * fix proper mapping * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Update tools/mtmd/clip.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * address comments * update * Fix typecheck * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-04-05 23:32:14 +02:00
Aldehir Rojas	b8635075ff	common : add gemma 4 specialized parser (#21418 ) * common : add gemma4 dedicated parser * cont : add '<\|tool_response>' as eog * cont : emit JSON from Gemma4 tool call AST * cont : more fixes * cont : refactor convert function * cont : refine rules and mapping * cont : add more tests * cont : clean up * cont : remove autoparser gemma4 implementation * cont : more cleanup * cont : rename gemma4.jinja to match the others * cont : add custom template to support interleaved thinking * cont : preserve reasoning in model turns * cont : fix initializer error * cont : fix unused vars * cont : fix accidental static * cont : fix specialized_template signature * fix extra semicolon * remove debug line and extra space [no ci]	2026-04-04 20:39:00 +02:00
SamareshSingh	650bf14eb9	llama-model: read final_logit_softcapping for Gemma 4 (#21390 )	2026-04-04 13:05:10 +02:00
Aman Gupta	b7ad48ebda	llama: add custom newline split for Gemma 4 (#21406 )	2026-04-04 15:06:34 +08:00
Piotr Wilkin (ilintar)	d3416a4aa9	fix: remove stale assert (#21369 )	2026-04-03 13:40:41 +02:00
Piotr Wilkin (ilintar)	b069b10ab4	vocab: fix Gemma4 tokenizer (#21343 ) * seems to work * fix case with new line Co-authored-by: sayap <sokann@gmail.com> * gemma 4: fix pre tok regex --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: sayap <sokann@gmail.com>	2026-04-03 10:33:03 +02:00
Georgi Gerganov	39b27f0da0	(revert) kv-cache : do not quantize SWA KV cache (#21332 ) This reverts commit `17193cce34`.	2026-04-03 09:07:01 +03:00
Bartowski	7992aa7c8e	tests : add unit test coverage for llama_tensor_get_type (#20112 ) * Add unit test coverage for llama_tensor_get_type * Fix merge conflicts, add more schemas * clang formatter changes * Trailing whitespace * Update name * Start rebase * Updating files with upstream changes prior to rebase * Changes needed from rebase * Update attn_qkv schema, change throw behaviour * Fix merge conflicts * White space * Update with latest changes to state counters * Revert accidental personal CLAUDE.md changes * Change quotation mark * Reuse metadata.name since we have it * Move test-only stuff out of llama-quant.cpp * Hide the regex functionality back in llama-quant.cpp, use a unique pointer to a new struct 'compiled_tensor_type_patterns' which contains the patterns * cont : inital deslop guidelines * Cleanup based on review comments * Continue cleanup * Small cleanup * Manually set proper ordering of tensors, mostly applies to gemma * Formatting * Update tests/test-quant-type-selection.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Fix merge conflicts --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-04-02 22:53:58 +02:00
Xuan-Son Nguyen	63f8fe0ef4	model, mtmd: fix gguf conversion for audio/vision mmproj (#21309 ) * fix gguf conversion for audio/vision mmproj * fix test	2026-04-02 17:10:32 +02:00
Jesus Talavera	6137c325a1	chat : add Granite 4.0 chat template with correct tool_call role mapping (#20804 ) * chat : add Granite 4.0 chat template with correct tool_call role mapping Introduce `LLM_CHAT_TEMPLATE_GRANITE_4_0` alongside the existing Granite 3.x template (renamed `LLM_CHAT_TEMPLATE_GRANITE_3_X`). The Granite 4.0 Jinja template uses `<tool_call>` XML tags and maps the `assistant_tool_call` role to `<\|start_of_role\|>assistant<\|end_of_role\|><\|tool_call\|>`. Without a matching C++ handler, the fallback path emits the literal role `assistant_tool_call` which the model does not recognize, breaking tool calling when `--jinja` is not used. Changes: - Rename `LLM_CHAT_TEMPLATE_GRANITE` to `LLM_CHAT_TEMPLATE_GRANITE_3_X` (preserves existing 3.x behavior unchanged) - Add `LLM_CHAT_TEMPLATE_GRANITE_4_0` enum, map entry, and handler - Detection: `<\|start_of_role\|>` + (`<tool_call>` or `<tools>`) → 4.0, otherwise → 3.x - Add production Granite 4.0 Jinja template - Add tests for both 3.x and 4.0 template paths (C++ and Jinja) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Code review: follow standard format and use common logic in test-chat-template.cpp * Rename custom_conversation variable for extra_conversation to give it a more meaningful name --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-02 11:28:56 +02:00
Georgi Gerganov	17193cce34	kv-cache : do not quantize SWA KV cache (#21277 )	2026-04-02 11:54:05 +03:00
Georgi Gerganov	744c0c7310	llama : rotate activations for better quantization (#21038 ) * llama : rotate activations for better quantization * cont : rotate V more + refactor * cont : rotate caches separately + support non-power-of-2 head sizes * cont : simplify * cont : add reference for V rotation * cont : refactor * cont : support context shift * cont : consolidate * cont : dedup + allow different types for the rotation matrix * cont : add env variable to disable rotation * cont : simplify attn rot kv cache logic + rename env * cont : pre-compute the Hadamard matrices	2026-04-01 16:58:01 +03:00
Ettore Di Giacinto	e1cb817483	memory: respect unified KV cache in hybrid memory for eval tasks (#21224 ) The hybrid memory paths (`llama-memory-hybrid.cpp` and `llama-memory-hybrid-iswa.cpp`) always used sequential equal split, ignoring the unified KV cache flag. This caused hellaswag, winogrande, and multiple-choice evaluations to fail on hybrid models (models with both attention and recurrent/SSM layers, such as Qwen3.5-35B-A3B) with: split_equal: sequential split is not supported when there are coupled sequences in the input batch (you may need to use the -kvu flag) PR #19954 fixed this for `llama-kv-cache-iswa.cpp` by automatically enabling unified KV mode and setting n_parallel >= 4 for multi-choice eval tasks. However, the hybrid memory paths were not updated. This commit mirrors the iswa fix: use non-sequential split when KV cache is unified (n_stream == 1), which is automatically set by llama-perplexity for hellaswag/winogrande/multiple-choice since #19954. Tested on Qwen3.5-35B-A3B (hybrid attention+SSM MoE model): - HellaSwag: 83.0% (400 tasks) - Winogrande: 74.5% (400 tasks) - MMLU: 41.2% - ARC-Challenge: 56.2% - TruthfulQA: 37.7% All previously failed with llama_decode() error.	2026-04-01 12:50:17 +03:00
Ed Addario	4951250235	llama : refactor llama_model_quantize_params to expose a pure C interface (#20346 ) * Refactor llama_model_quantize_params to expose a pure C interface * Restore comment and cleanup struct def * Code review refactoring Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Code review refactoring --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-01 08:43:00 +03:00
lainon1	0b6ff47996	fix: correct misspellings in code comments (#21217 ) - emdeddings → embeddings (gemma3.cpp, gemma3n-iswa.cpp, gemma-embedding.cpp) - imlpemented → implemented (llama-adapter.cpp) - interere → interfere (llama-graph.cpp) - overridde → overridden (chat.cpp) - stastistics → statistics (ngram-map.h) - layed → laid (llama-kv-cache.h) - worster → worst (llama-context.cpp) - sequantial → sequential (llama-batch.h)	2026-03-31 13:50:51 +02:00
Aman Gupta	278521c33a	llama-model-loader: print warning when using overrides with mmap (#20978 ) * llama-model-loader: use pinned memory for tensor overrides * change to warning	2026-03-30 17:40:17 +08:00
Sigbjørn Skjæret	7c203670f8	add missing ROPE_FACTORS_LONG/SHORT for MiniCPM (#21150 )	2026-03-29 19:45:40 +02:00
Saba Fallah	1743d98057	mtmd: fix "v.patch_embd" quant and unsupported im2col ops on Metal for deepseek-ocr (#21027 ) * mtmd: fix "v.patch_embd" quant and unsupported im2col ops on Metal for deepseek-ocr * Update src/llama-quant.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-27 00:07:55 +01:00
Michael Wand	f8d4abae86	convert : support Qwen3.5/Qwen3.5 Moe NVFP4 and add input scales (#20505 ) * convert : fix Qwen3.5 NVFP4 conversion * Updated copilot concerns and rebased * move into _LinearAttentionVReorderBase and simplify * --flake * new_name not needed * Added input_scale to gguf * Fixed input_scale addition as tensor * Added input scale to loader and named _in_s * Update convert_hf_to_gguf.py Re-removed input_scale from aux cleanup Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-26 16:52:06 +01:00
Saba Fallah	a970515bdb	mtmd: Add DeepSeekOCR Support (#17400 ) * mtmd: llama.cpp DeepSeekOCR support init commit * loading sam tensors * mtmd: fix vision model processing * deepseek-ocr clip-vit model impl * mtmd: add DeepSeek-OCR LM support with standard attention * mtmd: successfully runs DeepSeek-OCR LM in llama-cli * mtmd: Fix RoPE type for DeepSeek-OCR LM. * loading LM testing Vision model loading * sam warmup working * sam erroneous return corrected * clip-vit: corrected cls_embd concat * clip-vit: model convert qkv_proj split * corrected combining of image encoders' results * fix: update callback for ffn_moe_weighted and add callback for attn_out in deepseek2 model * concat image_newline and image_seperator tokens * visual_model warmup (technically) works * window partitioning using standard ggml ops * sam implementation without using CPU only ops * clip: fixed warnings * Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into sf/deepseek-ocr * mtmd: fix get_rel_pos * mtmd: fixed the wrong scaler for get_rel_pos * image encoding technically works but the output can't be checked singe image decoding fails * mtmd: minor changed * mtmd: add native resolution support * - image encoding debugged - issues fixed mainly related wrong config like n_patches etc. - configs need to be corrected in the converter * mtmd: correct token order * - dynamic resizing - changes are concerning PR https://github.com/sfallah/llama.cpp/pull/4 * mtmd: quick fix token order * mtmd: fix danling pointer * mtmd: SAM numerically works * mtmd: debug CLIP-L (vit_pre_ln) * mtmd: debug CLIP-L & first working DeepSeek-OCR model * mtmd : add --dsocr-mode CLI argument for DeepSeek-OCR resolution control & all native resolution modes work * mtmd: simplify SAM patch embedding * mtmd: adapt Pillow image resizing function * mtmd: simplify DeepSeek-OCR dynamic resolution preprocessing * mtmd: remove --dsocr-mode argument * mtmd: refactor code & remove unused helper functions * mtmd: fix tensor names for image newlines and view separator * clean up * reverting automatically removed spaces * reverting automatically removed spaces * mtmd: fixed bad ocr check in Deepseek2 (LM) * mtmd: support combined QKV projection in buid_vit * using common build_attn in sam * corrected code-branch when flash-attn disabled enabling usage of --flash-attn option * mtmd: minor fix * minor formatting and style * fixed flake8 lint issues * minor editorconfig-check fixes * minor editorconfig-check fixes * mtmd: simplify get_rel_pos * mtmd: make sam hparams configurable * mtmd: add detailed comments for resize_bicubic_pillow * mtmd: fixed wrong input setting * mtmd: convert model in FP16 * mtmd: minor fix * mtmd: remove tweak to llama-mtmd-cli & deepseek-ocr template * fix: test-1.jpg ORC issue with small (640) resolution setting min-resolution base (1024) max large (1280) for dynamic-resolution * minor: editconfig-check fix * merge with changes from https://github.com/ggml-org/llama.cpp/pull/17909 added new opt to tests.sh to disable flash-attn * minor: editconfig-check fix * testing deepseek-ocr quick and dirty test script comparing results of Qwen2.5-VL vs DeepSeek-OCR * quick and (potential) dirty merge with https://github.com/ggml-org/llama.cpp/pull/17909 * refactoring, one single builder function and static helpers * added deepseek-ocr test to tests.sh * minor formatting fixes * check with fixed expected resutls * minor formatting * editorconfig-check fix * merge with changes from https://github.com/ggml-org/llama.cpp/pull/18042 * minor - added GLM-4.6V to big tests - added missing deps for python test * convert: minor fix * mtmd: format code * convert: quick fix * convert: quick fix * minor python formatting * fixed merge build issue * merge resolved - fixed issues in convert - tested several deepseek models * minor fix * minor * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * - removed clip_is_deepseekocr - removed redundant RESIZE_ALGO_BICUBIC_PILLOW resize-algo - simplified image-preprocessing - removed/simplified debug functions * - cleaning commented out code * fixing instabilities issues reintroducing resize_bicubic_pillow * - use f16 model for deepseek-ocr test - ignore llama-arch test for deepseek-ocr * rename fc_w --> mm_fc_w * add links to OCR discussion * cleaner loading code * add missing .weight to some tensors * add default jinja template (to be used by server) * move test model to ggml-org * rolling back upscale change * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: bluebread <hotbread70127@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2026-03-25 19:57:40 +01:00
Saba Fallah	80322ebdaf	model: codefuse-ai/F2LLM-v2 support	2026-03-25 18:33:42 +01:00
Dowon	44c51e526b	model : allow causal_attn and pooling_type on all architectures (#20973 ) * models : allow causal_attn and pooling_type on all architectures * fix: move location	2026-03-25 18:12:38 +01:00
Johannes Gäßler	36dafba5c4	llama: fix llama-model-saver (#20503 ) * llama : add fd-based model loading via llama_model_load_from_fd * llama : address review feedback for fd-based model loading * llama : use FILE pointer instead of fd in public API * llama : use FILE pointer consistently, address review feedback * fixup * fix tensor names * fix llama-model-saver * roundtrip tests * fixup * refactor tests * fix prints * fix model saving * fix CI, disable Chameleon * print seed --------- Co-authored-by: Siddhesh2377 <siddheshsonar2377@gmail.com>	2026-03-25 12:53:16 +02:00
Georgi Gerganov	9f102a1407	models : move the token embedding norms to the first layer (#20943 ) * models : move the token embedding norms to the first layer * cont : fix LLM_TENSOR_CONV1D + fix il indexing	2026-03-24 17:00:30 +02:00
Aman Gupta	3fc6f1aed1	ggml-backend: re-enable graph reuse with pipeline parallelism (#20927 )	2026-03-24 20:47:00 +08:00
Aman Gupta	e852eb4901	llama-fit: fix regex pattern for gate_up tensors (#20910 ) * llama-fit: fix regex pattern for gate_up tensors * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-03-24 12:57:57 +08:00
Georgi Gerganov	f93c09e267	memory : fix seq_id bounds in llama_memory_recurrent::state_read_meta() (#20887 )	2026-03-23 14:08:46 +02:00
Andrea Arcangeli	990e4d9698	common/grammar: fix grammar parsing issues to prevent stack overflow and hangs (#18604 ) * grammar: add test case for nullable symbol loop Reproduce stack overflow (or OOM) with ( [x]* )* found while adding GBNF support to ripgrep-edit. llama-server reproducer: curl \ -X POST \ -d '{ "messages": [{ "role": "user", "content": "write yes" }], "grammar": "root ::= ( [x]* )" }' \ -H "Content-Type: application/json" \ http://localhost:8811/v1/chat/completions grammar: prevent stack overflow with nullable symbol loop Fix a potential stack overflow in llama_grammar_advance_stack that could occur when processing grammars with nullable symbols that lead to infinite derivations of empty strings. The fix introduces cycle detection by tracking visited stacks to prevent infinite recursion. rg-edit regexp: llama_grammar_advance_stack rg-edit extra-args: -A20 rg-edit directive: """Rewrite: fix the following segfault: [..] ⚫ Testing segfault. Grammar: root ::= ( [x]* )* root ::= ( [x]* )* Segmentation fault build/bin/test-grammar-integration""" gptel-context: (("~/llama.cpp/src/llama-grammar.cpp") ("~/llama.cpp/tests/test-grammar-integration.cpp") ("~/llama.cpp/grammars/./list.gbnf") ("~/llama.cpp/grammars/./json_arr.gbnf") ("~/llama.cpp/grammars/./json.gbnf") ("~/llama.cpp/grammars/./japanese.gbnf") ("~/llama.cpp/grammars/./english.gbnf") ("~/llama.cpp/grammars/./chess.gbnf") ("~/llama.cpp/grammars/./c.gbnf") ("~/llama.cpp/grammars/./arithmetic.gbnf") ("~/llama.cpp/grammars/./README.md")) * grammar: convert recursive llama_grammar_advance_stack to iterative This change converts the function to an iterative approach using explicit stacks, which prevents deep recursion and eliminates the risk of stack overflow. rg-edit regexp: llama_grammar_advance_stack rg-edit extra-args: -A30 rg-edit directive: """Rewrite: fix the following segfault: [..] ⚫ Testing segfault. Grammar: root ::= ( [x]* )* root ::= ( [x]* )* Segmentation fault build/bin/test-grammar-integration convert from recursive to interactive""" gptel-context: (("~/llama.cpp/src/llama-grammar.cpp") ("~/llama.cpp/tests/test-grammar-integration.cpp") ("~/llama.cpp/grammars/./list.gbnf") ("~/llama.cpp/grammars/./json_arr.gbnf") ("~/llama.cpp/grammars/./json.gbnf") ("~/llama.cpp/grammars/./japanese.gbnf") ("~/llama.cpp/grammars/./english.gbnf") ("~/llama.cpp/grammars/./chess.gbnf") ("~/llama.cpp/grammars/./c.gbnf") ("~/llama.cpp/grammars/./arithmetic.gbnf") ("~/llama.cpp/grammars/./README.md")) v2: Added a `std::set` to perform tree-based lookups with O(N log N) complexity. Testing with a parallel run of `test-grammar-integration` shows a double-digit percentage increase in runtime. An `unordered_set` with O(1) hashing was also evaluated, but the overhead of constructing hash keys from pointers made it significantly slower than the rbtree implementation that only requires an ordering operator. The performance regression in the test suite appears justified by the overall reduction in algorithmic complexity. Co-developed-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com> * grammar: add test case for hang in repetition grammar processing This commit adds a new test case to the grammar integration tests that specifically targets a hang scenario in the repetition grammar parser found while adding GBNF support to ripgrep-edit. llama-server reproducer: curl \ -X POST \ -d '{ "messages": [{ "role": "user", "content": "write yes" }], "grammar": "root ::= (([^x]){0,99}){0,99}" }' \ -H "Content-Type: application/json" \ http://localhost:8811/v1/chat/completions grammar: add repetition threshold check The change introduces a maximum repetition threshold to avoid excessive rule expansion during grammar parsing. When parsing repetition patterns like {m,n}, the parser now calculates the potential number of rules that would be generated and throws an error if the product of previous rules and new rules exceeds the threshold. A test case was added to verify the threshold is properly enforced for deeply nested repetition patterns that would otherwise cause hangs.	2026-03-21 18:43:35 +01:00
Tom Hillbrunner	212f4521b0	context : use n_embd_out for pooled embedding extraction (#20840 ) The MEAN/CLS/LAST pooling paths in encode() and decode() used n_embd_inp() (16384 for qwen3vl with deepstack) to read from the pooled embedding tensor, which only has n_embd_out() (4096) floats per sequence. This caused a tensor read out of bounds assertion. Fixes embedding mode for Qwen3-VL-Embedding models.	2026-03-21 19:35:00 +02:00
Victor Villar	58c81f7e81	model : fix Granite Hybrid type check for 7B.A1B (#20795 ) * Check granite hybriid expert count to set type as LLM_TYPE_7B_A1B or LLM_TYPE_1B * Use feed fwd dim instead of num of experts Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-20 15:16:09 +01:00
Ruikai Peng	dc6592431b	context: zero output buffer on allocation (#20781 ) * context: zero output buffer on allocation Address GHSA-wqq9-25mr-rw76. The logits output buffer allocated in output_reserve() uses posix_memalign(), which does not zero memory. The buffer is only written during decode when needs_raw_logits() returns true. When backend samplers cover all output sequences, needs_raw_logits() returns false and the buffer is never written, but llama_get_logits() still returns a pointer to it, exposing stale heap content. Zero the buffer after allocation to prevent information disclosure through the public logits API. Found-by: Pwno * Update src/llama-context.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-20 11:31:34 +02:00
Ruikai Peng	3adbef7776	model: assert nextn_predict_layers to prevent underflow (#20783 ) Address GHSA-645x-v54x-34w8. When nextn_predict_layers >= n_layer, n_layer - nextn_predict_layers can underflow (unsigned wrap), which corrupts n_layer_kv_from_start. Assert nextn_predict_layers immediately after parsing the GGUF key. Found-by: Pwno	2026-03-20 10:17:58 +01:00
Sigbjørn Skjæret	811397745e	vocab : assert array size of scores and toktypes (#20737 )	2026-03-19 08:34:04 +01:00
Michael Grau	6729d4920c	model : add control vector support where missing (#20653 ) * Add control vector functions to qwen3.5 and qwen-next models * Add missing cvec compatibility to the rest of the models * Adjust comments and formatting * cleanup * whitespace --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-18 23:25:12 +01:00
Pop Flamingo	312cf03328	llama : re-enable manual LoRA adapter free (#19983 ) * Re-enable manual LoRA adapter free * Remove stale "all adapters must be loaded before context creation" stale comments	2026-03-18 12:03:26 +02:00
Andreas Obersteiner	a69d54f990	context : fix graph not resetting when control vector changes (#20381 )	2026-03-18 08:10:13 +02:00
Xuan-Son Nguyen	d34ff7eb5b	model: mistral small 4 support (#20649 ) * model: mistral small 4 support * fix test * fix test (2) * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * change newline --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-17 00:31:14 +01:00
Aman Gupta	3c8521c4f5	llama-graph: replace cont with reshape for alpha in qwen35 (#20640 )	2026-03-16 22:07:13 +08:00
Sigbjørn Skjæret	de8f01c2d7	model : wire up Nemotron-H tensors for NVFP4 support (#20561 ) * wire up Nemotron-H tensors for NVFP4 support * add ssm tensors * alignment	2026-03-16 09:19:16 +01:00
sprayandwipe	6b10a82c00	kv-cache : fix reading llama_kv_cell_ext during state read (#20273 ) Co-authored-by: sid <sid@ragingfist.net>	2026-03-15 09:11:19 +02:00
Michael Wand	d23355afc3	model : wire up Qwen3.5/Qwen3.5MoE tensors for NVFP4 support (#20506 )	2026-03-14 22:44:42 +01:00
Georgi Gerganov	e30f1fdf74	graph : remove redundant GDN state transposes (#20443 ) * ggml : transpose fused GDN state access for coalesced memory reads (#20436) The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix column-wise on row-major storage, causing strided reads (stride S_v = 128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a 39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused path. Transpose the state indexing so threads read contiguously: - Metal: s_ptr[isS_v] -> s_ptr[is] (stride 1 vs S_v) - CUDA: curr_state[iS_v+col] -> curr_state[colS_v+i] (coalesced) - CPU: restructured loops for row-wise transposed access Also add --fused-gdn [on\|off\|auto] CLI flag (mirrors --flash-attn) so users can control fused GDN independently of auto-detection. All GATED_DELTA_NET backend-ops tests pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> ggml : use SIMD dot products in CPU GDN kernel, couple AR/chunked fused flags - Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized dot products in the CPU fused GDN kernel (delta and attention output) - Couple fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one path lacks device support, disable both to prevent state layout mismatch between transposed (fused) and non-transposed (unfused) formats Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * llama : rever fgdn argument changes * graph : remove GDN state transposes * vulkan : adapt * cuda : remove obsolete smem code --------- Co-authored-by: Paul Flynn <paul@arkavo.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Oliver Simons <osimons@nvidia.com>	2026-03-13 22:12:54 +02:00
ZeroV0LT	f17b3be63f	llama : fix pooling assertion crash in chunked GDN detection path (#20468 ) * llama : fix pooling assertion crash in chunked GDN detection path The chunked fused Gated Delta Net detection in sched_reserve() calls graph_reserve(16n_seqs, n_seqs, n_outputs, ...) where n_outputs = n_seqs. This creates a dimension mismatch in build_pooling() for embedding models with mean/rank pooling: build_inp_mean() creates a tensor with shape [n_tokens=16n_seqs, ...] while t_embd is reduced to [n_outputs=n_seqs, ...] via out_ids, causing ggml_mul_mat to assert on ggml_can_mul_mat(a, b). Fix: pass n_tokens as n_outputs in the chunked GDN graph reservation, matching the pattern used by the pp/tg worst-case reservations. Regression introduced by #20340 (`d28961d`). Same class of bug as #12517, fixed by #12545. * server : add mean pooling tests to embedding test suite Add test_embedding_pooling_mean and test_embedding_pooling_mean_multiple to cover the --pooling mean codepath, which was previously untested. These tests would have caught the regression introduced by #20340 where build_pooling() crashes with a ggml_mul_mat assertion due to mismatched dimensions in the chunked GDN detection path. --------- Co-authored-by: Domenico Crupi <domenico@zerovolt.it>	2026-03-13 20:53:42 +02:00
Georgi Gerganov	57819b8d4b	llama : disable graph reuse with pipeline parallelism (#20463 )	2026-03-12 21:04:13 +02:00
Ruben Ortlam	128142fe7d	test-backend-ops: allow loading tests from file and parsing model operators into file (#19896 ) * tests: allow loading test-backend-ops tests from json * add error threshold based on op * add error when file cannot be read * add graph operator json extraction tool * add nb parameter for non-contiguous input tensors * fix view check * only use view if non-contiguous/permuted, use C++ random instead of rand() * replace internal API calls with public llama_graph_reserve call * reduce test description length * fix nb[0] not getting set for view * add name to tests * fix inplace error * use text file instead of json * move llama_graph_reserve function to new llama-ext header, move export-graph-ops to tests/ * fix missing declaration * use pragma once * fix indent * fix Windows build	2026-03-12 13:26:00 +01:00
Asbjørn Olling	0a10c34dc1	grammar: Fix grammar root symbol check (#19761 ) * grammar: fix bad check for root symbol, correct error logging * add tests to demonstrate root symbol check failure	2026-03-12 12:04:56 +01:00
Richard Davison	1eea6a2968	graph : add optional scale parameter to build_lora_mm [no ci] (#20427 )	2026-03-12 00:22:49 +01:00

1 2 3 4 5 ...

899 Commits