8397 Commits

Author SHA1 Message Date
Richard Davison
5eae9cb1d9 ggml : add NVFP4 quantization type support (#19769)
* WIP: add NVFP4 quantization support

* tests

* improve NVFP4 dot product implementation performance and fix bad super call

* typo

* Use nvfp4 kvalues

* vulkan : fix NVFP4 shader compilation by including kvalues_mxfp4 lookup table

* vulkan and perf fixes

* wip

* Fix metal

* fix vulkan

* Rename threshold & fix wrong scale

* Fix MOE

* Shelve backend implementations (CUDA, Metal, Vulkan, arch-specific SIMD)

Remove NVFP4 support from GPU backends and architecture-specific
optimized dot products. These should be added in separate PRs so
backend specialists can review them independently.

Reverted files:
- ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh,
  quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh
- ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h,
  ggml-metal-ops.cpp
- ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/*
- ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c

Core NVFP4 support (type definition, CPU fallback dot product,
quantization, dequantization, conversion) is retained.

* Fix arch-fallback.h: add NVFP4 generic fallback for all platforms

After shelving backend-specific SIMD implementations, the generic
CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390
platforms that previously relied on arch-specific versions.

* quantize: add NVFP4 as a quantization type option

* Fix ggml_fp32_to_ue4m3: handle subnormal values

Previously, values with ue4m3_exp <= 0 were clamped to 0, causing
all small scales to underflow. This made NVFP4 quantization via
llama-quantize produce garbage (PPL = 5.8M) since typical transformer
weights have amax/6.0 in the range 0.001-0.01, which falls in the
UE4M3 subnormal range.

Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7),
matching the decode path in ggml_ue4m3_to_fp32.

Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33),
comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15).
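
The subnormal handling described above can be sketched in Python as follows. This is an illustrative sketch, not the ggml code: it assumes an exponent bias of 7 (which is what makes the subnormal value `man * 2^-9`), assumes code `0x7F` is reserved so the largest finite value is 448, and borrows the function names from the commit message purely for readability.

```python
import math

UE4M3_BIAS = 7         # assumption: bias 7, so exp == 0 decodes to man * 2^-9
UE4M3_MAX_CODE = 0x7E  # assumption: 0x7F is reserved, largest finite value is 448


def ue4m3_to_fp32(code: int) -> float:
    """Decode an unsigned scale with a 4-bit exponent and 3-bit mantissa."""
    exp = (code >> 3) & 0xF
    man = code & 0x7
    if exp == 0:
        # Subnormal: no implicit leading 1.
        return math.ldexp(man, -9)
    return math.ldexp(1.0 + man / 8.0, exp - UE4M3_BIAS)


def fp32_to_ue4m3(x: float) -> int:
    """Encode a non-negative float, keeping subnormals instead of flushing to 0."""
    if not x > 0.0:
        return 0
    if x >= ue4m3_to_fp32(UE4M3_MAX_CODE):
        return UE4M3_MAX_CODE
    frac, e = math.frexp(x)  # x == frac * 2**e with 0.5 <= frac < 1
    exp = e - 1 + UE4M3_BIAS
    if exp <= 0:
        # Subnormal range: count in units of 2^-9. A result of 8 is
        # exactly the bit pattern of the smallest normal (exp=1, man=0).
        return round(math.ldexp(x, 9))
    man = round((2.0 * frac - 1.0) * 8.0)
    if man == 8:
        # Mantissa rounded up to the next power of two.
        exp, man = exp + 1, 0
    return min((exp << 3) | man, UE4M3_MAX_CODE)
```

Under these assumptions a scale of 0.005 encodes to code 3 (decoding to 3 * 2^-9, about 0.0059) rather than collapsing to 0.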

* Restore ARM NEON NVFP4 dot product implementation

Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using
vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products.

tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup

* Optimize ARM NEON NVFP4 dot product: LUT + vpaddq + vfmaq

- Add ue4m3_scale_lut[128] to ggml-common.h replacing branch-heavy
  ggml_ue4m3_to_fp32() in the hot loop
- Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32
- Accumulate with vfmaq_f32 into float32x4_t vector accumulators

tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed)
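
The first bullet's idea, replacing a branching decode with a table lookup, can be sketched as below. Python is used for illustration only. `decode_ue4m3_branchy` is a hypothetical stand-in for `ggml_ue4m3_to_fp32()` that assumes an exponent bias of 7 and gives no special treatment to reserved codes, and the table is built at import time here rather than written out as a constant array.

```python
import math


def decode_ue4m3_branchy(code: int) -> float:
    """Hypothetical branching decode (assumes bias 7)."""
    exp = (code >> 3) & 0xF
    man = code & 0x7
    if exp == 0:
        return math.ldexp(man, -9)
    return math.ldexp(1.0 + man / 8.0, exp - 7)


# A 7-bit code has 128 possible values, so the whole decode fits in a
# 128-entry table that is computed once.
UE4M3_SCALE_LUT = tuple(decode_ue4m3_branchy(c) for c in range(128))


def block_scales(codes: bytes) -> list:
    """Decode a run of scale bytes with one table lookup each, no branches."""
    return [UE4M3_SCALE_LUT[c & 0x7F] for c in codes]
```

The lookup trades a small constant table for the removal of the per-scale branch and `ldexp` call in the inner loop.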

* ARM NEON NVFP4: rearrange q8 to match nibble layout

Alternative approach: rearrange q8 data to match the NVFP4 lo/hi
nibble layout instead of rearranging the looked-up NVFP4 values.
Eliminates vcombine_s8(vget_low, vget_low) shuffles.

Performance is equivalent (~18.5 t/s) - the bottleneck is the 2x
block overhead from QK=16 vs QK=32, not the shuffle instructions.

* CPU only backend 64 super-block layout

* cleanup

* Remove unused LUT

* int

* exclude NVFP4 from unsupported ops in metal build

* remove quantization for now

* store scales as native UE4M3, preserve original model bits when possible

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* correct comment

* format

* reduce duplication and cleanup

* Address comments

* move detection to prepare_tensors

* Use math instead of const

* Move

* fix comment

* Shelve quantize tests

* Rebase and move check

* cleanup

* lint

* Update gguf-py/gguf/scripts/gguf_convert_endian.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Use fallback quant config

* Simplify

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* organize

* Refactor

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* add quantize_nvfp4 (required for test_quants.py)

* add quantize_nvfp4 (required for test_quants.py)

* add quantize_nvfp4 (required for test_quants.py)

* fix return type

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b8297
2026-03-11 21:02:54 +01:00
Georgi Gerganov
3ca19b0e9f benches : add nemotron super (#20420) 2026-03-11 21:39:40 +02:00
Daniel Bevenius
eaf1d7930c llama : add support for Nemotron 3 Super (#20411)
* llama : add support for Nemotron 3 Super

This commit adds support for the Nemotron 3 Super model (120B.A12B)
enabling this model to be converted to GGUF format and run in llama.cpp.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Matt Clayton <156335168+mattjcly@users.noreply.github.com>
b8295
2026-03-11 19:27:53 +01:00
Georgi Gerganov
76ea1c1c46 metal : fix capture_compute counter logic (#20410) 2026-03-11 18:38:22 +02:00
Aman Gupta
bd1ec818e9 compare-llama-bench: check remotes as well (#20406) 2026-03-12 00:14:42 +08:00
Georgi Gerganov
b541241104 metal : fix q5_k mul_mv register spill (#20399) b8292 2026-03-11 16:25:27 +02:00
Georgi Gerganov
c363256839 metal : add env var to trigger graph capture (#20398) b8291 2026-03-11 16:25:10 +02:00
Neo Zhang
ecac98ee53 [SYCL] Update SYCL.md for binary package for Windows (#20401)
* add download binary package

* update prefix
2026-03-11 22:21:22 +08:00
Ruben Ortlam
182acfe5c5 ci: disable coopmat on ubuntu-24-cmake-vulkan job (#20294) 2026-03-11 14:12:29 +01:00
Aldehir Rojas
b5fe4559ae common/parser: use nlohmann::ordered_json to preserve parameter order (#20385) 2026-03-11 10:26:51 +01:00
Piotr Wilkin (ilintar)
acb7c79069 common/parser: handle reasoning budget (#20297)
* v1

* Finished!

* Handle cli

* Reasoning sampler

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Less explosive terminology :)

* Add utf-8 case and tests

* common : migrate reasoning budget sampler to common

* cont : clean up

* cont : expose state and allow passing as initial state

* cont : remove unused imports

* cont : update state machine doc string

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
b8287
2026-03-11 10:26:12 +01:00
uvos
5f91b1d5d5 ggml-cuda: gdn use shared mem for HIP (#20366)
Suggested-by: Aman Gupta <amangupta052@gmail.com>
b8286
2026-03-11 13:06:19 +08:00
uvos
9ef7523ee9 cuda/hip: fix loop unrolling in ssm-conv (#20369) b8285 2026-03-11 13:04:32 +08:00
Pascal
00de615345 Fix agentic mcp image single model (#20339)
* webui: fix MCP image attachments dropped during the agentic loop in single-model mode

* chore: update webui build output
b8284
2026-03-11 05:31:33 +01:00
Alessandro de Oliveira Faria (A.K.A.CABELO)
e1a399992b vendor : update cpp-httplib to 0.37.0 (#20207) 2026-03-11 11:03:53 +08:00
Alessandro de Oliveira Faria (A.K.A.CABELO)
4f2f0a163d vendor : update miniaudio to 0.11.25 (#20209) 2026-03-11 11:01:56 +08:00
Neo Zhang
0cec84f999 fix op rope, add rope_back (#20293) b8281 2026-03-11 09:53:34 +08:00
Neo Zhang
b2e1427c9b fix for failed UT case: ACC, L2_NORM, UPSCALE, fused_glu, unary (#20283) b8280 2026-03-11 09:53:05 +08:00
Vinicios Lugli
4d99d45084 model : qwen3vl reranker text support (#20332)
* model : fix qwen3vl reranker support

* Remove CLS_OUT

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b8279
2026-03-10 23:40:14 +01:00
ddh0
10e5b148b0 llama-quant : correct n_attention_wv usage (#20357)
* llama-quant : correct `n_attention_wv` usage

In #19770, I introduced a regression in the way the
`quantize_state_impl` counter values were initialized. I was
incrementing and using `n_attention_wv` in the same loop, when it should
have been fixed by the time we're deciding tensor types in
`llama_tensor_get_type_impl` (for `use_more_bits`).

I never observed a difference in any of [my
tests](https://github.com/ggml-org/llama.cpp/pull/19770#issuecomment-4000424712)
- it was only after @bartowski kindly pointed this out that I realized
it was incorrect. (Thanks!)
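
The kind of bug described above can be sketched in Python. Everything here is hypothetical and simplified: `use_more_bits` is a stand-in rule that depends on the total layer count, and the function names are not taken from `llama-quant.cpp`.

```python
def use_more_bits(i_layer: int, n_layers: int) -> bool:
    # Hypothetical rule: give more bits to the first and last eighth of layers.
    return i_layer < n_layers // 8 or i_layer >= 7 * n_layers // 8


def pick_buggy(tensor_names):
    """Counts and uses the counter in the same loop (the regression pattern)."""
    n_attention_wv = 0
    picks = []
    for name in tensor_names:
        if "attn_v" in name:
            i_layer = n_attention_wv
            n_attention_wv += 1  # total is still growing while it is being used
            picks.append(use_more_bits(i_layer, n_attention_wv))
    return picks


def pick_fixed(tensor_names):
    """Counts in a preliminary pass so the total is final before any decision."""
    n_attention_wv = sum("attn_v" in name for name in tensor_names)
    return [use_more_bits(i, n_attention_wv) for i in range(n_attention_wv)]
```

With 16 `attn_v` tensors, the fixed version selects only the two first and two last layers, while the buggy version compares each layer against a moving total and selects a different set.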

* simplify
b8278
2026-03-10 21:43:29 +02:00
Georgi Gerganov
90b2731894 ggml : bump RPC version (#20330) b8277 2026-03-10 21:36:57 +02:00
Reese Levine
aa2d278a11 ggml webgpu: faster normal quant and some k-quant matrix operations, better shader parameter handling (#20173)
* K quant speedup (#20)

* Basic JIT compilation for mul_mat, get_rows, and scale (#17)

* scale jit working

* preliminary working jit for getrows and mulmat, needs refining

* simplified mul_mat preprocessing switch statement

* get_rows fixes, mul_mat refinement

* formatted + last edits

* removed some extraneous prints

* fixed get_rows, fixed workgroup dispatch in mul_mat. no gibberish

* small fix

* some changes, working

* get_rows and mul_mat jit fixed and working

* Update formatting

* formatting

* Add header

---------

Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Start work on all-encompassing shader library

* refactor argmax, set_rows

* Refactor all but flashattention, mat mul

* no gibberish, all k quants added, merged

* vec memory fix

* q6_k matching metal on my machine, tests passing

* Set tile size for q6_k separately

* Separate out fast shaders

---------

Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>

* Move towards writeBuffer for params

* Move away from multiple buffers for set_rows errors, remove host buffer for parameter buffers, minor cleanups

* Remove extra file

* Formatting

---------

Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>
b8276
2026-03-10 09:14:27 -07:00
Piotr Wilkin (ilintar)
6c770d16ca Reduce level of content parser warning message to avoid log spam on non-debug verbosity (#20347) 2026-03-10 15:21:51 +01:00
Ray Xu
8d880ac012 examples : fix empty items in json_schema_to_grammar.py [no ci] (#19968)
* Fix logic for retrieving schema items in `json_schema_to_grammar.py`

If `schema['items']` is `{}` and `prefixItems` is not in `schema`, the original code here will raise an error, because `{}` is falsy.

I think if `schema['items']` is `{}`, then items should just be `{}`.
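
The failure mode can be reproduced in isolation. The snippet below is a minimal sketch: `get_items_buggy` paraphrases the pattern described above, and `get_items_fixed` is one possible correction that follows the stated intent. Both function names are hypothetical and neither is the code that was merged.

```python
def get_items_buggy(schema: dict):
    # `{}` is falsy, so `or` falls through to the `prefixItems` lookup,
    # which raises KeyError when that key is absent.
    return schema.get("items") or schema["prefixItems"]


def get_items_fixed(schema: dict):
    # Test for the key's presence instead of the value's truthiness,
    # so an empty `items` schema is returned as-is.
    if "items" in schema:
        return schema["items"]
    return schema["prefixItems"]
```

For `{"type": "array", "items": {}}` the first function raises `KeyError`, while the second returns `{}`.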

* Apply suggestion from @CISC

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Add tests for arrays with empty items

Add two unit tests to `tests/test-json-schema-to-grammar.cpp` that validate handling of arrays when 'items' is an empty schema and when 'prefixItems' is present alongside an empty 'items'. Both tests expect the same generated grammar, ensuring the JSON Schema->grammar conversion treats an empty 'items' schema (and the presence of 'prefixItems') correctly and covering this edge case.

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-10 14:38:18 +01:00
a3894281
0f1e9d14cc docs: update CPU backend ops to mark POOL_1D as supported (#20304) 2026-03-10 21:31:24 +08:00
Georgi Gerganov
1274fbee9e models : fix assert in mamba2 (cont) (#20335)
* models : fix assert in mamba2 (cont)

* cont : add n_group mod

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b8272
2026-03-10 15:00:08 +02:00
Georgi Gerganov
a7b3dee7a5 server : make 2 checkpoints near the end of the prompt (#20288)
* server : make 2 checkpoints near the end of the prompt

* cont : adjust checkpoints
b8271
2026-03-10 14:28:23 +02:00
Sigbjørn Skjæret
ec947d2b16 common : fix incorrect uses of stoul (#20313) b8270 2026-03-10 11:40:26 +01:00
Charles Xu
0cd4f4720b kleidiai : support for concurrent sme and neon kernel execution (#20070) b8269 2026-03-10 09:25:25 +02:00
Taimur Ahmad
af237f3026 ggml-cpu: add RVV repack GEMM and GEMV for quantization types (#19121)
* ggml-cpu: add rvv ggml_quantize_mat_4x8 for q8_0

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

* ggml-cpu: add rvv repacking for iq4_nl

* ggml-cpu: add generic impl for iq4_nl gemm/gemv

* ggml-cpu: add rvv repacking for q8_0

* ggml-cpu: refactor; add rvv repacking for q4_0, q4_K

* ggml-cpu: refactor; add rvv repacking for q2_K

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

* ggml-cpu: refactor rvv repack

---------

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>
b8268
2026-03-10 08:49:52 +02:00
Julian Pscheid
1a5631beaa metal: handle command buffer failures gracefully in synchronize (#20306)
Replace GGML_ABORT("fatal error") in ggml_metal_synchronize() with
error flag + return. This aligns synchronize error handling with
graph_compute, which already returns GGML_STATUS_FAILED for the same
condition.

When a command buffer fails (e.g., iOS GPU access revocation during
backgrounding, macOS eGPU disconnect, OOM), the backend enters an
error state instead of killing the host process. Subsequent
graph_compute calls return GGML_STATUS_FAILED immediately. Recovery
requires recreating the backend.

Failed extra command buffers are properly released on the error path
to avoid Metal object leaks.
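
The error-state behaviour described above follows a common pattern, sketched here in Python. The class, its fields, and the `command_buffer_ok` parameter are all hypothetical; the sketch only models the control flow, not the Metal backend itself.

```python
from enum import Enum


class Status(Enum):
    SUCCESS = 0
    FAILED = 1


class SketchBackend:
    """Hypothetical backend that latches an error instead of aborting."""

    def __init__(self) -> None:
        self.has_error = False

    def synchronize(self, command_buffer_ok: bool) -> None:
        if not command_buffer_ok:
            # Previously this point would abort the host process.
            self.has_error = True

    def graph_compute(self) -> Status:
        # Once in the error state, fail immediately without submitting work.
        if self.has_error:
            return Status.FAILED
        return Status.SUCCESS
```

In this sketch the flag is never cleared, which mirrors the statement that recovery requires recreating the backend.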
b8267
2026-03-10 08:32:24 +02:00
ddh0
1dab5f5a44 llama-quant : fail early on missing imatrix, refactor type selection, code cleanup (#19770)
* quantize : imatrix-fail early + code cleanup

* fix manual override printing

it's in the preliminary loop now, so needs to be on its own line

* revert header changes per ggerganov

* remove old #includes

* clarify naming

rename `tensor_quantization` to `tensor_type_option` to describe its
functionality

* fix per barto
b8266
2026-03-10 08:16:05 +02:00
Aldehir Rojas
c96f608d98 common: consolidate PEG string parsers (#20263)
* common : consolidate PEG string parsers
* cont : fix json_string_content()
b8265
2026-03-10 00:29:21 +01:00
Xuan-Son Nguyen
0842b9b465 model: fix step3.5 n_rot (#20318) b8264 2026-03-09 23:42:24 +01:00
Xuan-Son Nguyen
59db9a357d llama: dynamic head_dim and n_rot for SWA (#20301)
* llama: dynamic head_dim and n_rot for SWA

* also add gguf_writer wrappers

* fix build

* build_rope_shift arg reorder
b8263
2026-03-09 22:22:39 +01:00
Evan Huus
23fbfcb1ad server: Parse port numbers from MCP server URLs in CORS proxy (#20208)
* Parse port numbers from MCP server URLs

* Pass scheme to http proxy for determining whether to use SSL

* Fix download on non-standard port and re-add port to logging

* add test
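
The steps above (read the port from the URL, fall back to the scheme's default, and use the scheme to decide on SSL) can be sketched with Python's standard library. The server itself is C++; `split_target` and `DEFAULT_PORTS` are hypothetical names used only for this illustration.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def split_target(url: str):
    """Return (scheme, host, port, use_ssl) for an http(s) URL."""
    parts = urlsplit(url)
    scheme = parts.scheme
    if scheme not in DEFAULT_PORTS:
        raise ValueError(f"unsupported scheme: {scheme!r}")
    if parts.hostname is None:
        raise ValueError("URL has no host")
    # An explicit port wins; otherwise use the scheme's default.
    port = parts.port if parts.port is not None else DEFAULT_PORTS[scheme]
    return scheme, parts.hostname, port, scheme == "https"
```

For example, `split_target("https://example.com:8443/mcp")` keeps the non-standard port 8443 and reports that SSL should be used.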

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
b8262
2026-03-09 17:47:54 +01:00
Paul Flynn
e22cd0aa15 metal : extend mul_mv_ext to BF16, Q2_K, Q3_K (#20250)
Enable mul_mv_ext small-batch kernels (BS 2-8) for BF16, Q2_K,
and Q3_K quantization types. These types previously fell through
to the slower single-row mul_mv path.

BF16 uses the float4 dequantize path (like F16). Q2_K and Q3_K
use the float4x4 K-quant path (like Q4_K/Q5_K/Q6_K).

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
b8261
2026-03-09 16:48:12 +02:00
Georgi Gerganov
96cfc4992c server : fix checkpoints n_tokens calculation (#20287) b8260 2026-03-09 16:47:06 +02:00
Georgi Gerganov
ed0007aa32 metal : add upscale (#20284) b8259 2026-03-09 16:45:11 +02:00
Georgi Gerganov
344ee2a38a server : warn swa-full is not supported for non-SWA models (#20291) b8258 2026-03-09 16:44:25 +02:00
Georgi Gerganov
d6e1556499 server : fix off-by-1 in server_tokens::size_up_to_pos() (#20279)
* server : fix off-by-1 in server_tokens::size_up_to_pos()

* cont : fix typo [no ci]
2026-03-09 16:43:38 +02:00
Piotr Wilkin (ilintar)
f76565db92 common: map developer role to system (#20215)
* Map developer role to system
* Simplify
b8256
2026-03-09 14:25:11 +01:00
Georgi Gerganov
43e1cbd6c1 models : fix assert in mamba2 graph (#20270) b8255 2026-03-09 13:15:15 +02:00
Georgi Gerganov
107d599952 server : add kill switch when server is stuck (#20277) b8254 2026-03-09 10:33:12 +02:00
Aman Gupta
e8bbc736cb ggml-cuda: disable gdn for musa (#20278) b8253 2026-03-09 16:15:36 +08:00
ddh0
b518195101 llama-quant : left-align tensor names in output (#20117) b8252 2026-03-09 09:28:41 +02:00
Aman Gupta
e2763a6723 contributing: limit open PRs for new contributors to 1 (#20036) 2026-03-09 15:05:34 +08:00
Bertay Eren
0beb8db3a0 ggml-vulkan: add SGN operator, auto-generate Vulkan.csv and ops.md (#20219) b8250 2026-03-09 07:24:16 +01:00
Ruben Ortlam
b2f460bd3c vulkan: skip zero size tensors in backend copies (#20233) b8249 2026-03-09 07:23:45 +01:00
Michael Huang
5f4cdac385 cuda : display total and free VRAM capacity during device initialization (#20185) b8248 2026-03-09 12:45:43 +08:00