Commit Graph

8235 Commits

Author SHA1 Message Date
Ruben Ortlam
715ed28683 use scalar sums 2026-03-07 22:11:40 +01:00
Ruben Ortlam
a9435151db apply scales inline 2026-03-07 14:56:25 +01:00
Ruben Ortlam
d1f8bbd085 vulkan: add int8 coopmat quantized matmul shader 2026-03-07 14:43:21 +01:00
lhez
6fce5c6a7d opencl: add l2_norm (#20160) b8232 2026-03-06 18:03:05 -08:00
Piotr Wilkin (ilintar)
c024d85908 Autoparser: True streaming (#20177)
* Relax atomicity constraint for nicer, more pleasent, True Streaming parsing

* Whitespace

* Remove redundant atomics
b8231
2026-03-07 01:55:33 +01:00
Piotr Wilkin (ilintar)
2f2923f895 Autoparser: add optional argument reshuffle capability (#20171)
* Allow reshuffled arguments in tagged argument parser format tool calls.

* Remove shuffle just keep the optional parsers in any order

* Remove unnecessary import
b8230
2026-03-06 22:34:15 +01:00
Bartowski
649f06481e quants : Add memsets and other fixes for IQ quants (#19861)
* Add memsets and other fixes for IQ quants

* Make memset unconditional, change Laux back to L

* Move another memset
b8229
2026-03-06 23:06:56 +02:00
Piotr Wilkin (ilintar)
7463687161 Add @pwilkin to CODEOWNERS for autoparser code (#20174) 2026-03-06 21:25:41 +01:00
Piotr Wilkin (ilintar)
566059a26b Autoparser - complete refactoring of parser architecture (#18675)
* Autoparser - full single commit squish

* Final pre-merge changes: minor fixes, Kimi 2.5 model parser
b8227
2026-03-06 21:01:00 +01:00
Todor Boinovski
34df42f7be hexagon: add f32 ssm_conv op (#20122)
* hexagon: add ssm_conv op

* hexagon: hvx kernel is functional

* hexagon: improvements to ssm-conv hvx kernel

* hexagon: added dma to ssm-conv hvx kernel

* hexagon: ssm-conv dynamically compute gather scratchpad

* hex-ssm-conv: add local context and fix various issues (spad indexing, etc)

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
b8226
2026-03-06 09:59:26 -08:00
Tom Vaucourt
e68f2fb894 server : preserve anthropic thinking blocks in conversion (#20120)
* server : preserve anthropic thinking blocks in conversion (#20090)

* server : add tests for anthropic thinking block conversion

---------

Co-authored-by: root <root@llamacpp.home>
b8225
2026-03-06 17:41:12 +01:00
Max Krasnyansky
ba2fd11cdf cpu: skip redudant ROPE cache updates (#20149) b8224 2026-03-06 08:32:40 -08:00
Aman Gupta
d48e876467 ggml-cuda: add mem check for fusion (#19916)
* ggml-cuda: add mem check for fusion

* Replace NaNs with -FLT_MAX

* fix typo

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b8223
2026-03-07 00:05:43 +08:00
Aaron Teo
ba2ff79e43 ggml: update comments for backends which have no memory to report (#20157)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
b8222
2026-03-06 23:24:38 +08:00
shalinib-ibm
c6980ff29d ggml-cpu: Fix gcc 15 ICE on ppc64le (#20083) (#20130)
This patch addresses an Internal Compiler Error (Segmentation fault)
observed with gcc 15 by replacing the intrinsic + cast by doing
a cat on the data first and then calling the intrinsic. This bypasses the
buggy compiler path while maintaining identical instruction selection.

Performance Verification:
Assembly analysis on RHEL 9 (GCC 15.1.1) confirms that both the original
code and this fix generate the identical Power10 prefixed load instruction:
    `plxv 40, 2(14)`

This ensures zero performance regression while unblocking builds on
newer toolchains.

Reproduced on:
- Alpine Linux + GCC 15.2.0-r2
- RHEL 9  + GCC 15.1.1 (gcc-toolset-15)

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
b8221
2026-03-06 23:22:39 +08:00
Aman Gupta
1e38a7a6fa CUDA: use shared mem for ssm_conv (#20128)
* CUDA: use shared mem for ssm_conv

* fuse silu + ssm_conv

* fuse unary + mul

* enable for fp16

* formatting

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b8220
2026-03-06 23:09:59 +08:00
Tim Neumann
388baabc06 context: ignore zero scale LoRAs when checking sameness (#20166) b8219 2026-03-06 15:05:52 +02:00
Piotr Wilkin (ilintar)
f5ddcd1696 Checkpoint every n tokens: squash (#20087) b8218 2026-03-06 11:39:26 +01:00
Aleksander Grygier
f6235a41ef webui: Agentic Loop + MCP Client with support for Tools, Resources and Prompts (#18655) 2026-03-06 10:00:39 +01:00
Johannes Gäßler
2850bc6a13 ggml-cpu: fix data race for debug asserts (#20148) b8216 2026-03-06 09:12:49 +01:00
Georgi Gerganov
17a4258946 kv-cache : fix M-RoPE checkpoints (#20132) b8215 2026-03-06 08:46:51 +02:00
Roj234
f7db3f3789 cli : Don't clear system prompt when using '/clear' (#20067)
* Enhance /clear command to include system prompt

Add system prompt to messages when clearing chat history.

* Use lambda
b8214
2026-03-06 06:41:11 +01:00
lhez
6c97bffd65 opencl: add neg, exp and diag (#20127)
* opencl: add `neg`

* opencl: add `exp`

* opencl: add `diag`
b8213
2026-03-05 21:16:39 -08:00
YardenTal44
2b10b62677 hexagon: add fp16 support for binary ops: add,sub,mul,div (#20139)
* hexagon: add fp16 support for binary ops: add,sub,mul,div

* hexagon: fix test-backend-ops failures for fp16 binary ops on older arches (<v79)

* hexagon: decide on n_threads (aka n_jobs) early to avoid overallocating scratchpad

* snapdragon: fix readme link

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
b8212
2026-03-05 18:29:13 -08:00
ymcki
a0ed91a442 models : kda chunk size = 16 (#19827)
* models : add llm_build_delta_net_base

* cont : keep qwen35 and qwen35moe graphs intact

* cont : add comments [no ci]

* add kimi linear to delta-net-base

* removed unnecessary ggml_cont from g_exp_t

* removed ggml_cont from g_diff_exp_t. moved ggml_cont for o to kimi-linear.cpp

* removed unnecessary diag mask

* cont : simplify

* cont : avoid graph splits

* scale q after mul instead of beginning

* scale q after mul instead of beginning

* identical ppl

* cont : fix scale and decay mask

* minor : remove TODO

* block implementation for kda

* remove space at the end of line 101

* concat+pad

* pad+binary row concat

* chunk size 16 for kda

* removed minor differences to master

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-05 17:01:23 +02:00
Andreas Kieslinger
2cd20b72ed CUDA: Improve performance via less synchronizations between token (#17795)
* Adds CPU-to-CUDA copy capability to
ggml_backend_cuda_cpy_tensor_async()

* Adds function to relax sync requirements between input copies on
supported backends (CUDA for now)

* Exchanges synchronous copy with async copy function.

* Adds macro guards to allow compilation in non-CUDA builds

* Reworked backend detection in ggml-backend.cpp to avoid linking
conflicts

* Relax requirement of checks in async CUDA copies from backend and buffer type to just buffer type, to avoid linking issues

* Minor cleanup

* Makes opt-in to relax use of explicit syncs more general. Backends like
vulkan which require a synchronization between HtoD copies and graph
execution could also adopt this change now.

* Reintroduces stricter check for CPU->CUDA backend async copy via
GGML_DEVICE_TYPE_CPU.

* Corrects initialization of ggml_backend_sync_mode in
ggml_backend_sched_split initialization

* Simplifies synchronizations to adhere to `saaasg` pattern.

* Apply suggestion from @ggerganov (src->buffer to buf_src)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Apply suggestion from @ggerganov (src->buffer to buf_src) v2

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b8210
2026-03-05 13:53:21 +02:00
Eric Zhang
872646b30c model : update Qwen3.5 model type detection (#20126)
* model : fix Qwen3.5 model type detection

* Update src/llama-model.cpp

whoops, my bad

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b8209
2026-03-05 12:47:14 +01:00
Sigbjørn Skjæret
b5ed0e058c cli : add command and file auto-completion (#19985) b8208 2026-03-05 10:47:28 +01:00
Sigbjørn Skjæret
cf232515c9 convert : register Qwen 3.5 ForCausalLM for text only (#20119) 2026-03-05 10:30:02 +01:00
Aleksander Grygier
5e335ba113 webui: Improvements for Models Selector UI (#20066) 2026-03-05 08:52:22 +01:00
Marcel Petrick
92f7da00b4 chore : correct typos [no ci] (#20041)
* fix(docs): correct typos found during code review

Non-functional changes only:
- Fixed minor spelling mistakes in comments
- Corrected typos in user-facing strings
- No variables, logic, or functional code was modified.

Signed-off-by: Marcel Petrick <mail@marcelpetrick.it>

* Update docs/backend/CANN.md

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* Revert "Auxiliary commit to revert individual files from 846d1c301281178efbc6ce6060ad34c1ebe45af8"

This reverts commit 02fcf0c7db661d5ff3eff96b2b2db9fdb7213256.

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Signed-off-by: Marcel Petrick <mail@marcelpetrick.it>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-05 08:50:21 +01:00
Max Krasnyansky
7a99dc85e2 hexagon: Flash Attention optimizations (dma, mpyacc, multi-row) and MatMul updates (#20118)
* ggml-hexagon: enhance hvx_dot_f16_f16_aa_rx4 for improved performance by expanding vector handling and optimizing accumulation

# Conflicts:
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c

* ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx4 and enhance hvx_vec_reduce_sum_f32x4 for improved performance and reduced complexity

* ggml-hexagon: add hvx_dot_f16_f16_aa_rx32 for enhanced vector processing in flash attention

# Conflicts:
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c

* optimize hvx_dot_f16_f16_aa_rx4 and hvx_dot_f16_f16_aa_rx32 by removing unused scale parameter and improving vector accumulation

# Conflicts:
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c

* ggml-hexagon: refactor hvx_dot_f16_f16_aa_rx4 for improved readability and return HVX_Vector for better integration

# Conflicts:
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c

* ggml-hexagon: initialize sums variable in hvx_dot_f16_f16_aa_rx32 for clarity

* ggml-hexagon: fix compiling error

* fix hvx_dot_f16_f16_aa_rx4 to handle leftover elements correctly using masking

* refactor hvx_dot_f16_f16_aa_rx4 to accept vector and leftover element counts as parameters for improved clarity and flexibility

* wip

* fa: instrumentation and dma reordering

* hex-fa: use block-size 64 to improve DMA pipelining

* hex-fa: optimize vec-dot for v79 and above

* hex-fa: use block size 64

* hex-fa: avoid scalar fp32->fp16 conversions

* hex-fa: simplify dot_f16 functions using optimized vec_mpyacc

* hex-fa: rewrite mad_f32_f16 using hvx_vec_mpyacc

* hex-mm: use mpyacc in matmul dot functions

---------

Co-authored-by: chraac <chraac@gmail.com>
b8204
2026-03-04 21:55:29 -08:00
lhez
69fd345335 opencl: add SET, support i32 for CPY, minor refactor for cpy (#20101) b8203 2026-03-04 21:32:26 -08:00
Todor Boinovski
1a29907d2e hexagon: add llama-completion runner script (#20095) b8202 2026-03-04 15:04:59 -08:00
Nikhil Jain
24d2ee0527 [WebGPU] Fix wait logic for inflight jobs (#20096)
* Enable tmate debugging for investigating thread safety issue

* Refactor wait and submit to operate on vector<wgpu::FutureWaitInfo>, and fix wait to delete only the future that is completed.

* Cleanup

* Remove clear change and run clang-format

* Cleanup
b8201
2026-03-04 11:54:55 -08:00
Masashi Yoshimura
541bf37622 Add concat op to webgpu. (#20068) b8200 2026-03-04 11:19:00 -08:00
Sigbjørn Skjæret
d969e933e1 tools : add missing clocale include in mtmd-cli [no ci] (#20107) 2026-03-04 14:18:04 +01:00
Johannes Gäßler
7f5ee54968 ggml: fix ggml_is_contiguous_n for ne == 1 (#20092) b8198 2026-03-04 12:04:31 +01:00
Adrien Gallouët
66199c9f03 ggml : use a simple std::thread in AMX without OpenMP (#20074)
Disabling OpenMP generally provides better inference performance (at
least in my testing) but the loading becomes slightly slower.

Benchmark results for `convert_B_packed_format()`:

Before this commit:

         N      K |  No OpenMP     OpenMP |    Diff |  Speedup
    ------------------------------------------------------------
       512   2880 |    640.9us    263.5us |  -58.9% |    0.41x
      2880   4096 |     2.55ms    261.7us |  -89.8% |    0.10x
    201088   2880 |   256.44ms    21.61ms |  -91.6% |    0.08x
    ------------------------------------------------------------

    Total: 325.43ms vs 31.05ms

After:

         N      K |  No OpenMP     OpenMP |    Diff |  Speedup
    ------------------------------------------------------------
       512   2880 |     1.49ms    263.5us |  -82.3% |    0.18x
      2880   4096 |     1.55ms    261.7us |  -83.1% |    0.17x
    201088   2880 |    24.03ms    21.61ms |  -10.1% |    0.90x
    ------------------------------------------------------------

    Total: 78.97ms vs 31.05ms

Tested with unsloth/gpt-oss-20b-GGUF:Q4_K_M.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b8197
2026-03-04 11:57:09 +01:00
ddh0
c99909dd0b impl : use 6 digits for tensor dims (#20094)
Many models have vocabulary sizes, and thus tensor shapes, with more
than 5 digits (ex: Gemma 3's vocab size is 262,208).

I already fixed this for `llama_format_tensor_shape` but missed it for
`llama_format_tensor_shape` until now. Oops.
b8196
2026-03-04 09:53:38 +01:00
SamareshSingh
cb8f4fa3f8 Fix locale-dependent float printing in GGUF metadata (#17331)
* Set C locale for consistent float formatting across all binaries.

* Add C locale setting to all tools binaries

Add std::setlocale(LC_NUMERIC, "C") to all 16 binaries in the tools/
directory to ensure consistent floating-point formatting.

* Apply suggestion from @JohannesGaessler

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b8195
2026-03-04 09:30:40 +01:00
standby24x7
54910bd4f3 completion : Fix a typo in warning message (#20082)
resuse -> reuse
b8194
2026-03-04 06:44:49 +01:00
Mickael Desgranges
ecd99d6a9a docs: Fix intel documentation link (#20040) b8193 2026-03-03 21:50:00 +08:00
Charles Xu
137435ff15 kleidiai : add sme fp16 compute path for q4_0 gemm on aarch64 (#20043) b8192 2026-03-03 11:40:26 +02:00
shaofeiqi
24350fdf9b opencl: add optimized q4_1 mm kernel for adreno (#19840)
* Add Q4_1 OpenCL Kernels

* opencl: refactor transpose

* opencl: format

* opencl: refactor q4_1 unpack

* opencl: move `ggml_cl_mul_mat_q4_1_f32_adreno`

* opencl: refactor `ggml_cl_mul_mat_q4_1_f32_adreno` and kernels

* opencl: rename kernel files and kernes

* opencl: fix build for non adreno

* opencl: move code around and format

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
b8191
2026-03-02 19:49:41 -08:00
Abhijit Ramesh
49a7564ac1 ggml webgpu: fix workgroup dispatch limit for large batch sizes (#19965)
* ggml-webgpu: fix workgroup dispatch limit for large batch sizes

WebGPU limits workgroup sizes to 65535 per dimension. Large MUL_MAT
operations with batch sizes exceedeing this limi would fail.

* add compute_2d_workgroups() helper to split total workgroup ID across
X/Y dimensions

* update mul_mat_reg_tile.wgsl to reconstruct linear workgroup ID from 2D
   dispatch

* update mul_mat_subgroup_matrix.wgsl to reconstruct linear workgroup ID
  from 2D dispatch

* update mul_mat.wgsl to compute global index from 2D workgroup
  coordinates

* refactor all three mul_mat dispatch paths to use the shared helper

* ggml-webgpu: add bounds checking for over-dispatched workgroups

2D workgroup dispatch can over-dispatch when total workgroups don't
divide evenly into the 65535 per-dimension limit. Extra workgroups
would compute invalid batch indices, causing memory corruption.

* add batch_idx bound check to mul_mat_reg_tile.wgsl and
mul_mat_subgroup_matrix.wgsl to prevent over-dispatched workgroups
from accessing invalid memory

* fixes test failures with large batch sizes (eg., bs=[128, 1024])

* ggml-webgpu: add back TODO for spliting large sizes into batches

* Optimize 2d workgroup provisioning

* Set some parameters that increase speed

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>
b8190
2026-03-02 19:35:11 -08:00
Nikhil Jain
4d828bd1ab ggml webgpu: Clean up per-thread parameter buffer pool and job submission logic (#19772)
* Allow webgpu_buf_pool to resize if needed, remove inflight_threads, and replace inflight_threads with num_kernels for submission

* Run clang-format

* Keep track of num batched kernels that have not been submitted yet

* Run clang-format

* Increase buf pool max size

* Increase param buf pool init size

* Remove webgpu buf pool resizing

* Merge with master

* Add buffer pool growth

* Move buffer pool growth outside of lock

* Reduce max pool size to 32

* Run clang-format

* Only resize param buf pool
b8189
2026-03-02 10:23:34 -08:00
Masashi Yoshimura
36a7a6589c ggml-webgpu: Support non-contiguous src0 and overlapping src0/src1 in binary ops (#19850)
* ggml-webgpu: Add binary op support for overlapping and non-contiguous.

* Add newline to binary.wgsl

* Append the test of binary op for src overlapping  to test_bin_bcast.

* Remove unnecessary newline.
b8188
2026-03-02 07:59:53 -08:00
Ruben Ortlam
feefb92836 vulkan: tune MMVQ for Intel Windows (#19988) b8187 2026-03-02 15:58:25 +01:00
Adrien Gallouët
ec88c3ceea scripts : improve get-wikitext-2.sh (#19952)
* scripts : improve get-wikitext-2.sh

Switch to sh, add curl fallback, and avoid redundant downloads

Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>

* fix indent

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-02 15:40:49 +01:00