8397 Commits

Author SHA1 Message Date
Aaron Teo
ae87863dc1 llama-bench: introduce -hf and -hff flags & use --mmap 1 by default (#20211) b8247 2026-03-09 09:05:44 +08:00
Piotr Wilkin (ilintar)
97c64fbdbd PEG parser for LFM2 (#20251)
* PEG parser for LFM2

* Simplify using python_value()
b8246
2026-03-09 01:11:22 +01:00
Georgi Gerganov
d417bc43dd server : do not create checkpoints right after mtmd chunks (#20232) b8245 2026-03-08 22:16:46 +02:00
Sigbjørn Skjæret
35bee031e1 graph : remove redundant scale_w parameter (#20235) b8244 2026-03-08 18:58:28 +01:00
Aldehir Rojas
451ef08432 common : gracefully handle incomplete output (#20191)
* common : handle incomplete UTF-8 at end of input in PEG parser

* cont : if reached end prematurely, emit needs_more_input to propagate partial output

* cont: refactor peg parse context to add lenient flag

* cont : remove partial flag, keep lenient flag
b8243
2026-03-08 17:17:02 +01:00
Piotr Wilkin (ilintar)
9b24886f78 Fix compile bug (#20203)
* Fix compile bug

* Update common/chat-auto-parser-helpers.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b8242
2026-03-08 17:15:49 +01:00
Piotr Wilkin (ilintar)
62b8143ad2 Fix structured outputs (#20223)
* Fix structured outputs

* Update common/chat-auto-parser-generator.cpp

Co-authored-by: Aldehir Rojas <hello@alde.dev>

---------

Co-authored-by: Aldehir Rojas <hello@alde.dev>
b8241
2026-03-08 17:14:43 +01:00
GiantPrince
d088d5b74f ggml-vulkan: Add ELU op support (#20183)
* ggml-Vulkan: add ELU support

* ggml-Vulkan: remove extra spaces and variables

* ggml-Vulkan: fix format issue

* ggml-Vulkan: fix format issue

* fix whitespace issue

* Update Vulkan.csv and ops.md
b8240
2026-03-08 12:38:17 +01:00
Jeff Bolz
cd18a50ea5 vulkan: Fix data races in coopmat1 mul_mat(_id) (#20084)
* vulkan: Fix data races in coopmat1 mul_mat(_id)

Add barriers between coopmat store and regular loads. We sort of got away with
this because it was the same subgroup accessing the values, but it's still a
race and may not work.

* switch to subgroup control barriers
b8239
2026-03-08 12:33:48 +01:00
Johannes Gäßler
a976ff081b llama: end-to-end tests (#19802)
* tests: add end-to-end tests per model architecture

* fixup for rebase

* fix use-after-free in llama-model-loader.cpp

* fix CI

* fix WebGPU

* fix CI

* disable CI for macOS-latest-cmake-arm64

* use expert_weights_scale only if != 0.0f

* comments
b8238
2026-03-08 12:30:21 +01:00
Christopher Maher
a95047979a readme : update infra list (#20212) 2026-03-08 12:42:28 +02:00
Piotr Wilkin (ilintar)
b283f6d5b3 Revert to OAI-compatible args (#20213)
* Revert to OAI-compatible args

* Apply workaround::func_args_not_string
b8236
2026-03-08 11:33:03 +01:00
decahedron1
ff52ee964d server : correct index on finish in OAI completion streams (#20226) b8235 2026-03-08 10:08:57 +01:00
Neo Zhang
213c4a0b81 [SYCL] supprt Flash Attention for fp32/fp16/Q4/Q5/Q8 (#20190)
* support flash-attention for fp32/fp16/Q4/Q5/Q8

* rm warining

* update for JIT
b8234
2026-03-08 12:00:07 +08:00
Aman Gupta
c5a778891b ggml: add GATED_DELTA_NET op (#19504)
* ggml: add GATED_DELTA_NET op

* remove the transpose

* add KDA

* add qwen35 dense

* llama : check for fused gated delta net backend support

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b8233
2026-03-07 15:41:10 +08:00
lhez
6fce5c6a7d opencl: add l2_norm (#20160) b8232 2026-03-06 18:03:05 -08:00
Piotr Wilkin (ilintar)
c024d85908 Autoparser: True streaming (#20177)
* Relax atomicity constraint for nicer, more pleasent, True Streaming parsing

* Whitespace

* Remove redundant atomics
b8231
2026-03-07 01:55:33 +01:00
Piotr Wilkin (ilintar)
2f2923f895 Autoparser: add optional argument reshuffle capability (#20171)
* Allow reshuffled arguments in tagged argument parser format tool calls.

* Remove shuffle just keep the optional parsers in any order

* Remove unnecessary import
b8230
2026-03-06 22:34:15 +01:00
Bartowski
649f06481e quants : Add memsets and other fixes for IQ quants (#19861)
* Add memsets and other fixes for IQ quants

* Make memset unconditional, change Laux back to L

* Move another memset
b8229
2026-03-06 23:06:56 +02:00
Piotr Wilkin (ilintar)
7463687161 Add @pwilkin to CODEOWNERS for autoparser code (#20174) 2026-03-06 21:25:41 +01:00
Piotr Wilkin (ilintar)
566059a26b Autoparser - complete refactoring of parser architecture (#18675)
* Autoparser - full single commit squish

* Final pre-merge changes: minor fixes, Kimi 2.5 model parser
b8227
2026-03-06 21:01:00 +01:00
Todor Boinovski
34df42f7be hexagon: add f32 ssm_conv op (#20122)
* hexagon: add ssm_conv op

* hexagon: hvx kernel is functional

* hexagon: improvements to ssm-conv hvx kernel

* hexagon: added dma to ssm-conv hvx kernel

* hexagon: ssm-conv dynamically compute gather scratchpad

* hex-ssm-conv: add local context and fix various issues (spad indexing, etc)

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
b8226
2026-03-06 09:59:26 -08:00
Tom Vaucourt
e68f2fb894 server : preserve anthropic thinking blocks in conversion (#20120)
* server : preserve anthropic thinking blocks in conversion (#20090)

* server : add tests for anthropic thinking block conversion

---------

Co-authored-by: root <root@llamacpp.home>
b8225
2026-03-06 17:41:12 +01:00
Max Krasnyansky
ba2fd11cdf cpu: skip redudant ROPE cache updates (#20149) b8224 2026-03-06 08:32:40 -08:00
Aman Gupta
d48e876467 ggml-cuda: add mem check for fusion (#19916)
* ggml-cuda: add mem check for fusion

* Replace NaNs with -FLT_MAX

* fix typo

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b8223
2026-03-07 00:05:43 +08:00
Aaron Teo
ba2ff79e43 ggml: update comments for backends which have no memory to report (#20157)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
b8222
2026-03-06 23:24:38 +08:00
shalinib-ibm
c6980ff29d ggml-cpu: Fix gcc 15 ICE on ppc64le (#20083) (#20130)
This patch addresses an Internal Compiler Error (Segmentation fault)
observed with gcc 15 by replacing the intrinsic + cast by doing
a cat on the data first and then calling the intrinsic. This bypasses the
buggy compiler path while maintaining identical instruction selection.

Performance Verification:
Assembly analysis on RHEL 9 (GCC 15.1.1) confirms that both the original
code and this fix generate the identical Power10 prefixed load instruction:
    `plxv 40, 2(14)`

This ensures zero performance regression while unblocking builds on
newer toolchains.

Reproduced on:
- Alpine Linux + GCC 15.2.0-r2
- RHEL 9  + GCC 15.1.1 (gcc-toolset-15)

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
b8221
2026-03-06 23:22:39 +08:00
Aman Gupta
1e38a7a6fa CUDA: use shared mem for ssm_conv (#20128)
* CUDA: use shared mem for ssm_conv

* fuse silu + ssm_conv

* fuse unary + mul

* enable for fp16

* formatting

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b8220
2026-03-06 23:09:59 +08:00
Tim Neumann
388baabc06 context: ignore zero scale LoRAs when checking sameness (#20166) b8219 2026-03-06 15:05:52 +02:00
Piotr Wilkin (ilintar)
f5ddcd1696 Checkpoint every n tokens: squash (#20087) b8218 2026-03-06 11:39:26 +01:00
Aleksander Grygier
f6235a41ef webui: Agentic Loop + MCP Client with support for Tools, Resources and Prompts (#18655) 2026-03-06 10:00:39 +01:00
Johannes Gäßler
2850bc6a13 ggml-cpu: fix data race for debug asserts (#20148) b8216 2026-03-06 09:12:49 +01:00
Georgi Gerganov
17a4258946 kv-cache : fix M-RoPE checkpoints (#20132) b8215 2026-03-06 08:46:51 +02:00
Roj234
f7db3f3789 cli : Don't clear system prompt when using '/clear' (#20067)
* Enhance /clear command to include system prompt

Add system prompt to messages when clearing chat history.

* Use lambda
b8214
2026-03-06 06:41:11 +01:00
lhez
6c97bffd65 opencl: add neg, exp and diag (#20127)
* opencl: add `neg`

* opencl: add `exp`

* opencl: add `diag`
b8213
2026-03-05 21:16:39 -08:00
YardenTal44
2b10b62677 hexagon: add fp16 support for binary ops: add,sub,mul,div (#20139)
* hexagon: add fp16 support for binary ops: add,sub,mul,div

* hexagon: fix test-backend-ops failures for fp16 binary ops on older arches (<v79)

* hexagon: decide on n_threads (aka n_jobs) early to avoid overallocating scratchpad

* snapdragon: fix readme link

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
b8212
2026-03-05 18:29:13 -08:00
ymcki
a0ed91a442 models : kda chunk size = 16 (#19827)
* models : add llm_build_delta_net_base

* cont : keep qwen35 and qwen35moe graphs intact

* cont : add comments [no ci]

* add kimi linear to delta-net-base

* removed unnecessary ggml_cont from g_exp_t

* removed ggml_cont from g_diff_exp_t. moved ggml_cont for o to kimi-linear.cpp

* removed unnecessary diag mask

* cont : simplify

* cont : avoid graph splits

* scale q after mul instead of beginning

* scale q after mul instead of beginning

* identical ppl

* cont : fix scale and decay mask

* minor : remove TODO

* block implementation for kda

* remove space at the end of line 101

* concat+pad

* pad+binary row concat

* chunk size 16 for kda

* removed minor differences to master

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-05 17:01:23 +02:00
Andreas Kieslinger
2cd20b72ed CUDA: Improve performance via less synchronizations between token (#17795)
* Adds CPU-to-CUDA copy capability to
ggml_backend_cuda_cpy_tensor_async()

* Adds function to relax sync requirements between input copies on
supported backends (CUDA for now)

* Exchanges synchronous copy with async copy function.

* Adds macro guards to allow compilation in non-CUDA builds

* Reworked backend detection in ggml-backend.cpp to avoid linking
conflicts

* Relax requirement of checks in async CUDA copies from backend and buffer type to just buffer type, to avoid linking issues

* Minor cleanup

* Makes opt-in to relax use of explicit syncs more general. Backends like
vulkan which require a synchronization between HtoD copies and graph
execution could also adopt this change now.

* Reintroduces stricter check for CPU->CUDA backend async copy via
GGML_DEVICE_TYPE_CPU.

* Corrects initialization of ggml_backend_sync_mode in
ggml_backend_sched_split initialization

* Simplifies synchronizations to adhere to `saaasg` pattern.

* Apply suggestion from @ggerganov (src->buffer to buf_src)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Apply suggestion from @ggerganov (src->buffer to buf_src) v2

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b8210
2026-03-05 13:53:21 +02:00
Eric Zhang
872646b30c model : update Qwen3.5 model type detection (#20126)
* model : fix Qwen3.5 model type detection

* Update src/llama-model.cpp

whoops, my bad

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b8209
2026-03-05 12:47:14 +01:00
Sigbjørn Skjæret
b5ed0e058c cli : add command and file auto-completion (#19985) b8208 2026-03-05 10:47:28 +01:00
Sigbjørn Skjæret
cf232515c9 convert : register Qwen 3.5 ForCausalLM for text only (#20119) 2026-03-05 10:30:02 +01:00
Aleksander Grygier
5e335ba113 webui: Improvements for Models Selector UI (#20066) 2026-03-05 08:52:22 +01:00
Marcel Petrick
92f7da00b4 chore : correct typos [no ci] (#20041)
* fix(docs): correct typos found during code review

Non-functional changes only:
- Fixed minor spelling mistakes in comments
- Corrected typos in user-facing strings
- No variables, logic, or functional code was modified.

Signed-off-by: Marcel Petrick <mail@marcelpetrick.it>

* Update docs/backend/CANN.md

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* Revert "Auxiliary commit to revert individual files from 846d1c301281178efbc6ce6060ad34c1ebe45af8"

This reverts commit 02fcf0c7db661d5ff3eff96b2b2db9fdb7213256.

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Signed-off-by: Marcel Petrick <mail@marcelpetrick.it>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-05 08:50:21 +01:00
Max Krasnyansky
7a99dc85e2 hexagon: Flash Attention optimizations (dma, mpyacc, multi-row) and MatMul updates (#20118)
* ggml-hexagon: enhance hvx_dot_f16_f16_aa_rx4 for improved performance by expanding vector handling and optimizing accumulation

# Conflicts:
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c

* ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx4 and enhance hvx_vec_reduce_sum_f32x4 for improved performance and reduced complexity

* ggml-hexagon: add hvx_dot_f16_f16_aa_rx32 for enhanced vector processing in flash attention

# Conflicts:
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c

* optimize hvx_dot_f16_f16_aa_rx4 and hvx_dot_f16_f16_aa_rx32 by removing unused scale parameter and improving vector accumulation

# Conflicts:
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c

* ggml-hexagon: refactor hvx_dot_f16_f16_aa_rx4 for improved readability and return HVX_Vector for better integration

# Conflicts:
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c

* ggml-hexagon: initialize sums variable in hvx_dot_f16_f16_aa_rx32 for clarity

* ggml-hexagon: fix compiling error

* fix hvx_dot_f16_f16_aa_rx4 to handle leftover elements correctly using masking

* refactor hvx_dot_f16_f16_aa_rx4 to accept vector and leftover element counts as parameters for improved clarity and flexibility

* wip

* fa: instrumentation and dma reordering

* hex-fa: use block-size 64 to improve DMA pipelining

* hex-fa: optimize vec-dot for v79 and above

* hex-fa: use block size 64

* hex-fa: avoid scalar fp32->fp16 conversions

* hex-fa: simplify dot_f16 functions using optimized vec_mpyacc

* hex-fa: rewrite mad_f32_f16 using hvx_vec_mpyacc

* hex-mm: use mpyacc in matmul dot functions

---------

Co-authored-by: chraac <chraac@gmail.com>
b8204
2026-03-04 21:55:29 -08:00
lhez
69fd345335 opencl: add SET, support i32 for CPY, minor refactor for cpy (#20101) b8203 2026-03-04 21:32:26 -08:00
Todor Boinovski
1a29907d2e hexagon: add llama-completion runner script (#20095) b8202 2026-03-04 15:04:59 -08:00
Nikhil Jain
24d2ee0527 [WebGPU] Fix wait logic for inflight jobs (#20096)
* Enable tmate debugging for investigating thread safety issue

* Refactor wait and submit to operate on vector<wgpu::FutureWaitInfo>, and fix wait to delete only the future that is completed.

* Cleanup

* Remove clear change and run clang-format

* Cleanup
b8201
2026-03-04 11:54:55 -08:00
Masashi Yoshimura
541bf37622 Add concat op to webgpu. (#20068) b8200 2026-03-04 11:19:00 -08:00
Sigbjørn Skjæret
d969e933e1 tools : add missing clocale include in mtmd-cli [no ci] (#20107) 2026-03-04 14:18:04 +01:00
Johannes Gäßler
7f5ee54968 ggml: fix ggml_is_contiguous_n for ne == 1 (#20092) b8198 2026-03-04 12:04:31 +01:00