Commit Graph

8326 Commits

Author SHA1 Message Date
Piotr Wilkin (ilintar)
2948e6049a general: CONTRIBUTING.md - guidelines for quantization schemes (#19762)
* Guidelines for quantization schemes

* Update CONTRIBUTING.md

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Change required precision from Q8 to FP16/BF16

* Update CONTRIBUTING.md

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update CONTRIBUTING.md

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update CONTRIBUTING.md

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update CONTRIBUTING.md

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update CONTRIBUTING.md [no ci]

* Update CONTRIBUTING.md [no ci]

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-03-13 12:21:33 +01:00
Georgi Gerganov
73c9eb8ced metal : fix l2 norm scale (#20493) b8325 2026-03-13 11:43:20 +02:00
Daniel Bevenius
983df142a9 convert : fix/suppress pyright errors (#20442)
* convert : fix/suppress pyright errors

This commit fixes the pyright errors that are generated by pyright for
convert_hf_to_gguf.py.

The motivation for this is that running this locally generates errors
that CI does not, and it can be difficult to spot new errors. One use
case is when working on new models which cannot be run in CI due to
privacy. Having the ability to run pyright locally is would be helpful
in this cases.

In the linked issue there is the mention of switching to `ty` which I
don't know anything about but in the meantime I would appreciate if we
could suppress these errors for now, and later perhaps revert this
commit.

With this change there are no errors but there are 4 informations
messages if the `mistral_common` package is installed. The
`--level error` flag can be used to suppress them.

Resolves: https://github.com/ggml-org/llama.cpp/issues/20417
2026-03-13 06:00:52 +01:00
Georgi Gerganov
57819b8d4b llama : disable graph reuse with pipeline parallelism (#20463) b8323 2026-03-12 21:04:13 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO)
557fe2d913 vendor : update cpp-httplib to 0.37.1 (#20390) b8322 2026-03-12 13:57:06 +01:00
Piotr Wilkin (ilintar)
0e810413bb tests : use reasoning instead of reasoning_budget in server tests (#20432) 2026-03-12 13:41:01 +01:00
Ruben Ortlam
128142fe7d test-backend-ops: allow loading tests from file and parsing model operators into file (#19896)
* tests: allow loading test-backend-ops tests from json

* add error threshold based on op

* add error when file cannot be read

* add graph operator json extraction tool

* add nb parameter for non-contiguous input tensors

* fix view check

* only use view if non-contiguous/permuted, use C++ random instead of rand()

* replace internal API calls with public llama_graph_reserve call

* reduce test description length

* fix nb[0] not getting set for view

* add name to tests

* fix inplace error

* use text file instead of json

* move llama_graph_reserve function to new llama-ext header, move export-graph-ops to tests/

* fix missing declaration

* use pragma once

* fix indent

* fix Windows build
b8320
2026-03-12 13:26:00 +01:00
Daniel Bevenius
6de1bc631d common : update completion executables list [no ci] (#19934)
This commit updates the bash completion executables list, adding missing
executables and removing some that non longer exist.
2026-03-12 12:12:01 +01:00
Asbjørn Olling
0a10c34dc1 grammar: Fix grammar root symbol check (#19761)
* grammar: fix bad check for root symbol, correct error logging

* add tests to demonstrate root symbol check failure
b8318
2026-03-12 12:04:56 +01:00
ProgenyAlpha
deee23863b vulkan: add GATED_DELTA_NET op support (#20334)
* vulkan: add GATED_DELTA_NET op support

Implements the fused gated delta net recurrence as a Vulkan compute
shader with full support for scalar gate, KDA vector gate, GQA
broadcast, multi-token sequences, and permuted (non-contiguous) q/k
inputs. Specialization constants select head size (32/64/128) and
KDA mode at pipeline creation time.

Passes all 13 test-backend-ops cases on AMD Radeon 890M (RADV GFX1150).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: optimize GATED_DELTA_NET shader (Phase 1)

- vec4 dot products on all inner loops (dp4 hardware intrinsic)
- Cache exp(g) in shared memory for KDA path, eliminating ~32K
  redundant global reads and ~16K redundant exp() calls per token
- vec4 fused decay + rank-1 update (3 vec4 ops vs 12 scalar ops)
- Add perf benchmark cases for GATED_DELTA_NET to test-backend-ops

KDA TG: +5.4% throughput. Non-KDA: no regressions.
13/13 test-backend-ops passing on AMD Radeon 890M (RADV GFX1150).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: address review feedback for GATED_DELTA_NET

Pipeline array refactor [3][2], A_TYPE/D_TYPE/FLOAT_TYPE shader macros,
scale in push constants, supports_op fix, dispatch restructuring.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: use FLOAT_TYPE for buffer/shared declarations, align formatting

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: add explicit FLOAT_TYPE casts for buffer loads

Wrap data_q, data_k, and data_g buffer reads with FLOAT_TYPE() casts
to ensure correct behavior across all Vulkan configurations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: fix Q/K broadcast for interleaved head layout

Adapt to the interleaved broadcast convention from #20340:
head_id / rq1 → head_id % neq1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Progeny Alpha <ProgenyAlpha@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
b8317
2026-03-12 11:32:04 +01:00
Sigbjørn Skjæret
c3e3f9e533 convert : better mtp check and fix return [no ci] (#20419) 2026-03-12 10:04:20 +01:00
ProgenyAlpha
40c550d4f6 vulkan: fix SSM_CONV PP scaling with large ubatch sizes (#20379)
* vulkan: optimize SSM_CONV workgroup dispatch for large ubatch

Tile tokens into 2D workgroups (32x16) to reduce workgroup launch
overhead at large ubatch sizes. Add vec4 fast path for nc=4 (common
d_conv size). Fixes PP performance degradation with ubatch > 512.

Ref: ggml-org/llama.cpp#18725

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: remove unused shared memory declaration in SSM_CONV

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Progeny Alpha <ProgenyAlpha@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
b8315
2026-03-12 10:03:18 +01:00
Pascal
de190154c8 New conversations now auto-select the first loaded model (#20403)
* webui: auto-select first loaded model for new conversations in router mode

* chore: update webui build output
2026-03-12 09:07:05 +01:00
Masashi Yoshimura
05039967da ggml-virtgpu: Fix some build commands (#20341) 2026-03-12 15:47:45 +08:00
Georgi Gerganov
e4cff0956b metal : avoid divisions in bin kernel (#20426)
* metal : avoid modulus in bin kernel when not broadcasting

* metal : fix capture_started flag
b8312
2026-03-12 09:42:40 +02:00
Masato Nakasaka
4cc6eb158c ci: Setup self-hosted CI for Intel Linux Vulkan backend (#20154) 2026-03-12 06:43:22 +01:00
Jeff Bolz
246ffc4b05 vulkan: fix l2_norm epsilon handling (#20350) b8310 2026-03-12 06:39:41 +01:00
Jeff Bolz
aa429cf507 vulkan: fix OOB check in flash_attn_mask_opt (#20296) b8309 2026-03-12 06:35:49 +01:00
Masato Nakasaka
5866e3bbc8 vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large models with --no-mmap (#20059)
* Changed to reuse command buffers to fix crashing on Intel GPU

* Removed unused parameter

* Fixed compile error and minor mistake

* Fix logging

* Changing to use usage flag per command buffer

* fixed style

* added buffer reset

* Removed cmd_buffer_idx for reuse consistency

* Fixed style
b8308
2026-03-12 06:30:16 +01:00
lhez
0516e04bf9 opencl: use larger workgroup size for get_rows (#20316) 2026-03-11 22:03:27 -07:00
shaofeiqi
3d9ab225e7 opencl: add cumsum op (#18981)
* OpenCL: add CUMSUM op support

* remove unused argument

* opencl: refactor cumsum

* opencl: refactor

* opencl: refactor tmp buffer

* opencl: adjust max number of subgroups

* opencl: fix whitespace

* opencl: fix global size when cumsum the tmp buffer

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
2026-03-11 22:03:07 -07:00
uvos
d63aa398de hip: compile debug builds with -O2 on hip to avoid a compiler bug (#20392) b8305 2026-03-12 10:37:10 +08:00
Mishusha
a8304b4d27 common/parser: add GigaChatV3/3.1 models support (#19931)
Co-authored-by: Mishusha <pmv26021975@gmail.com>
b8304
2026-03-12 01:22:25 +01:00
DAN™
fdb17643d3 model : add support for Phi4ForCausalLMV (#20168)
* Add support for Phi4ForCausalLMV.

* Fix Phi-4 vision parity (correcting SigLIP2 patch-kernel export layout) and matching HF NaFlex resize behavior in mtmd.

* Rename contants + fix tokenizer label

* Clean-ups.

* Fix GGUF export.

* Set tokenizer.ggml.pre explicitly.

* Default vocab name rather than forcing it.

* Clean-ups.

* Fix indent.

* Fix subscriptable error.

* remov overcomplicated code path

* Clean-ups.

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
b8303
2026-03-12 00:25:54 +01:00
Richard Davison
1eea6a2968 graph : add optional scale parameter to build_lora_mm [no ci] (#20427) 2026-03-12 00:22:49 +01:00
ddh0
4a748b8f15 common : fix --n-cpu-moe, --cpu-moe for models with fused gate + up (#20416) b8301 2026-03-12 00:13:28 +01:00
Masashi Yoshimura
f2ab047f27 ggml-webgpu: Add supports for GGML_OP_REPEAT (#20230)
* Add GGML_OP_REPEAT to webgpu backend.

* Add i16 support for GGML_OP_REPEAT.
b8300
2026-03-11 14:40:36 -07:00
Georgi Gerganov
d28961d81e llama : enable chunked fused GDN path (#20340)
* llama : enable chunked fused GDN path

* models : avoid Q and K repeats when using fused GDA

* cont : fix comment

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* cont : fix the fix

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* cont : fix

* metal : add GDN kernel (#20361)

* metal : add Metal backend for GGML_OP_GATED_DELTA_NET

Add a fused Metal kernel for the gated delta net recurrence op
(#19504), enabling GPU-accelerated inference for DeltaNet-based
models (Qwen3.5, etc.) on Apple Silicon.

Supports both GDA (scalar gate) and KDA (per-row gate) modes
with head_size 64 and 128. Unsupported configurations (head_size
32, non-contiguous tensors) gracefully fall back to CPU.

Performance: Qwen3.5-0.8B Q4_K_M on M4 Max
  tg128: 170 -> 213 t/s (+25%)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* metal : validate contiguity of all input tensors in supports_op

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* metal : add algorithm equivalence comment for GDA decay path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* cont : unslop + optimize

* cont : clean-up

---------

Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* CUDA: AR gated delta net improvements (#20391)

* Add FastDiv to gated_delta_net_cuda

* Shard columns across warps

This reduces register pressure (avoids spill for S_v = 128) and gives
the warp-scheduler more CTAs to schedule (thus hiding data-access
latencies).

* Remove unneded include in gated_delta_net.cu

* Improve comments

* Apply code-formating

* Make sharding HIP-compatible

1. Use ggml_cuda_get_physical_warp_size() to determine warp size flexibly
2. Add test with partial warp to test sum reduction on CUDA

* Remove fastdiv_s64, as we can treat neqk1 and rq3 as uint32_t

* Rename variables

* Enable GDN also for prefill, move TODO for chunked_GDN

* Actually remove the TODO from 2068908975

* Get warp size at runtime

warp_size is not known at compile time in hip host code.

* Don't expose ggml_cuda_get_physical_warp_size on host

---------

Co-authored-by: uvos <devnull@uvos.xyz>

* llama : refactor llm_build_delta_net_base API

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
Co-authored-by: uvos <devnull@uvos.xyz>
b8299
2026-03-11 22:46:40 +02:00
Sigbjørn Skjæret
f90bd1dd84 llama : whitespace cleanup (#20422) b8298 2026-03-11 21:18:29 +01:00
Richard Davison
5eae9cb1d9 ggml : add NVFP4 quantization type support (#19769)
* WIP: add NVFP4 quantization support

* tests

* improve NVFP4 dot product implementation performance and fix bad super call

* typo

* Use nvfp4 kvalues

* vulkan : fix NVFP4 shader compilation by including kvalues_mxfp4 lookup table

* vulcal and perf fixes

* wip

* Fix metal

* fix vulcan

* Rename threshold & fix wrong scale

* Fix MOE

* Shelf backend implementations (CUDA, Metal, Vulkan, arch-specific SIMD)

Remove NVFP4 support from GPU backends and architecture-specific
optimized dot products. These should be added in separate PRs so
backend specialists can review them independently.

Reverted files:
- ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh,
  quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh
- ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h,
  ggml-metal-ops.cpp
- ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/*
- ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c

Core NVFP4 support (type definition, CPU fallback dot product,
quantization, dequantization, conversion) is retained.

* Fix arch-fallback.h: add NVFP4 generic fallback for all platforms

After shelving backend-specific SIMD implementations, the generic
CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390
platforms that previously relied on arch-specific versions.

* quantize: add NVFP4 as a quantization type option

* Fix ggml_fp32_to_ue4m3: handle subnormal values

Previously, values with ue4m3_exp <= 0 were clamped to 0, causing
all small scales to underflow. This made NVFP4 quantization via
llama-quantize produce garbage (PPL = 5.8M) since typical transformer
weights have amax/6.0 in the range 0.001-0.01, which falls in the
UE4M3 subnormal range.

Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7),
matching the decode path in ggml_ue4m3_to_fp32.

Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33),
comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15).

* Restore ARM NEON NVFP4 dot product implementation

Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using
vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products.

tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup

* Optimize ARM NEON NVFP4 dot product: LUT + vpaddq + vfmaq

- Add ue4m3_scale_lut[128] to ggml-common.h replacing branch-heavy
  ggml_ue4m3_to_fp32() in the hot loop
- Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32
- Accumulate with vfmaq_f32 into float32x4_t vector accumulators

tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed)

* ARM NEON NVFP4: rearrange q8 to match nibble layout

Alternative approach: rearrange q8 data to match the NVFP4 lo/hi
nibble layout instead of rearranging the looked-up NVFP4 values.
Eliminates vcombine_s8(vget_low, vget_low) shuffles.

Performance is equivalent (~18.5 t/s) - the bottleneck is the 2x
block overhead from QK=16 vs QK=32, not the shuffle instructions.

* CPU only backend 64 super-block layout

* cleanup

* Remove unused LUT

* int

* exclude NVFP4 from unsupported ops in metal build

* remove quantization for now

* store scales as native UE4M3, preserve original model bits when possible

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* correct comment

* format

* reduce duplication and cleanup

* Address comments

* move detection to prepare_tensors

* Use math instead of const

* Move

* fix comment

* Shelf quantize tests

* Rebase and move check

* cleanup

* lint

* Update gguf-py/gguf/scripts/gguf_convert_endian.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Use fallback quant config

* Simplify

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* organize

* Refactor

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* add quantize_nvfp4 (required for test_quants.py)

* add quantize_nvfp4 (required for test_quants.py)

* add quantize_nvfp4 (required for test_quants.py)

* fix return type

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b8297
2026-03-11 21:02:54 +01:00
Georgi Gerganov
3ca19b0e9f benches : add nemotron super (#20420) 2026-03-11 21:39:40 +02:00
Daniel Bevenius
eaf1d7930c llama : add support for Nemotron 3 Super (#20411)
* llama : add support for Nemotron 3 Super

This commit adds support for the Nemotron 3 Super model (120B.A12B)
enabling this model to be converted to GGUF format and run in llama.cpp.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Matt Clayton <156335168+mattjcly@users.noreply.github.com>
b8295
2026-03-11 19:27:53 +01:00
Georgi Gerganov
76ea1c1c46 metal : fix capture_compute counter logic (#20410) 2026-03-11 18:38:22 +02:00
Aman Gupta
bd1ec818e9 compare-llama-bench: check remotes as well (#20406) 2026-03-12 00:14:42 +08:00
Georgi Gerganov
b541241104 metal : fix q5_k mul_mv register spill (#20399) b8292 2026-03-11 16:25:27 +02:00
Georgi Gerganov
c363256839 metal : add env var to trigger graph capture (#20398) b8291 2026-03-11 16:25:10 +02:00
Neo Zhang
ecac98ee53 [SYCL] Update SYCL.md for binary package for Windows (#20401)
* add download binary package

* update prefix
2026-03-11 22:21:22 +08:00
Ruben Ortlam
182acfe5c5 ci: disable coopmat on ubuntu-24-cmake-vulkan job (#20294) 2026-03-11 14:12:29 +01:00
Aldehir Rojas
b5fe4559ae common/parser: use nlohmann::ordered_json to preserve parameter order (#20385) 2026-03-11 10:26:51 +01:00
Piotr Wilkin (ilintar)
acb7c79069 common/parser: handle reasoning budget (#20297)
* v1

* Finished!

* Handlie cli

* Reasoning sampler

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Less explosive terminology :)

* Add utf-8 case and tests

* common : migrate reasoning budget sampler to common

* cont : clean up

* cont : expose state and allow passing as initial state

* cont : remove unused imports

* cont : update state machine doc string

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
b8287
2026-03-11 10:26:12 +01:00
uvos
5f91b1d5d5 ggml-cuda: gdn use shared mem for HIP (#20366)
Suggested-by: Aman Gupta <amangupta052@gmail.com>
b8286
2026-03-11 13:06:19 +08:00
uvos
9ef7523ee9 cuda/hip: fix loop unrolling in ssm-conv (#20369) b8285 2026-03-11 13:04:32 +08:00
Pascal
00de615345 Fix agentic mcp image single model (#20339)
* webui: fix MCP image attachments dropped during the agentic loop in single-model mode

* chore: update webui build output
b8284
2026-03-11 05:31:33 +01:00
Alessandro de Oliveira Faria (A.K.A.CABELO)
e1a399992b vendor : update cpp-httplib to 0.37.0 (#20207) 2026-03-11 11:03:53 +08:00
Alessandro de Oliveira Faria (A.K.A.CABELO)
4f2f0a163d vendor : update miniaudio to 0.11.25 (#20209) 2026-03-11 11:01:56 +08:00
Neo Zhang
0cec84f999 fix op rope, add rope_back (#20293) b8281 2026-03-11 09:53:34 +08:00
Neo Zhang
b2e1427c9b fix for failed UT case: ACC, L2_NORM, UPSCALE, fused_glu, unary (#20283) b8280 2026-03-11 09:53:05 +08:00
Vinicios Lugli
4d99d45084 model : qwen3vl reranker text support (#20332)
* model : fix qwen3vl reranker support

* Remove CLS_OUT

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b8279
2026-03-10 23:40:14 +01:00
ddh0
10e5b148b0 llama-quant : correct n_attention_wv usage (#20357)
* llama-quant : correct `n_attention_wv` usage

In #19770, I introduced a regression in the way the
`quantize_state_impl` counter values were initialized. I was
incrementing and using `n_attention_wv` in the same loop, when it should
have been fixed by the time we're deciding tensor types in
`llama_tensor_get_type_impl` (for `use_more_bits`).

I never observed a difference in any of [my
tests](https://github.com/ggml-org/llama.cpp/pull/19770#issuecomment-4000424712)
- it was only after @bartowski kindly pointed this out that I realized
it was incorrect. (Thanks!)

* simplify
b8278
2026-03-10 21:43:29 +02:00
Georgi Gerganov
90b2731894 ggml : bump RPC version (#20330) b8277 2026-03-10 21:36:57 +02:00