llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-03-17 16:44:07 +00:00

Author	SHA1	Message	Date
Piotr Wilkin (ilintar)	2948e6049a	general: CONTRIBUTING.md - guidelines for quantization schemes (#19762 ) * Guidelines for quantization schemes * Update CONTRIBUTING.md Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Change required precision from Q8 to FP16/BF16 * Update CONTRIBUTING.md Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update CONTRIBUTING.md Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update CONTRIBUTING.md Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update CONTRIBUTING.md Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update CONTRIBUTING.md [no ci] * Update CONTRIBUTING.md [no ci] --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-03-13 12:21:33 +01:00
Georgi Gerganov	73c9eb8ced	metal : fix l2 norm scale (#20493 ) b8325	2026-03-13 11:43:20 +02:00
Daniel Bevenius	983df142a9	convert : fix/suppress pyright errors (#20442 ) * convert : fix/suppress pyright errors This commit fixes the pyright errors that are generated by pyright for convert_hf_to_gguf.py. The motivation for this is that running this locally generates errors that CI does not, and it can be difficult to spot new errors. One use case is when working on new models which cannot be run in CI due to privacy. Having the ability to run pyright locally would be helpful in these cases. In the linked issue there is the mention of switching to `ty` which I don't know anything about but in the meantime I would appreciate if we could suppress these errors for now, and later perhaps revert this commit. With this change there are no errors but there are 4 informational messages if the `mistral_common` package is installed. The `--level error` flag can be used to suppress them. Resolves: https://github.com/ggml-org/llama.cpp/issues/20417	2026-03-13 06:00:52 +01:00
Georgi Gerganov	57819b8d4b	llama : disable graph reuse with pipeline parallelism (#20463 ) b8323	2026-03-12 21:04:13 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO)	557fe2d913	vendor : update cpp-httplib to 0.37.1 (#20390 ) b8322	2026-03-12 13:57:06 +01:00
Piotr Wilkin (ilintar)	0e810413bb	tests : use `reasoning` instead of `reasoning_budget` in server tests (#20432 )	2026-03-12 13:41:01 +01:00
Ruben Ortlam	128142fe7d	test-backend-ops: allow loading tests from file and parsing model operators into file (#19896 ) * tests: allow loading test-backend-ops tests from json * add error threshold based on op * add error when file cannot be read * add graph operator json extraction tool * add nb parameter for non-contiguous input tensors * fix view check * only use view if non-contiguous/permuted, use C++ random instead of rand() * replace internal API calls with public llama_graph_reserve call * reduce test description length * fix nb[0] not getting set for view * add name to tests * fix inplace error * use text file instead of json * move llama_graph_reserve function to new llama-ext header, move export-graph-ops to tests/ * fix missing declaration * use pragma once * fix indent * fix Windows build b8320	2026-03-12 13:26:00 +01:00
Daniel Bevenius	6de1bc631d	common : update completion executables list [no ci] (#19934 ) This commit updates the bash completion executables list, adding missing executables and removing some that no longer exist.	2026-03-12 12:12:01 +01:00
Asbjørn Olling	0a10c34dc1	grammar: Fix grammar root symbol check (#19761 ) * grammar: fix bad check for root symbol, correct error logging * add tests to demonstrate root symbol check failure b8318	2026-03-12 12:04:56 +01:00
ProgenyAlpha	deee23863b	vulkan: add GATED_DELTA_NET op support (#20334 ) * vulkan: add GATED_DELTA_NET op support Implements the fused gated delta net recurrence as a Vulkan compute shader with full support for scalar gate, KDA vector gate, GQA broadcast, multi-token sequences, and permuted (non-contiguous) q/k inputs. Specialization constants select head size (32/64/128) and KDA mode at pipeline creation time. Passes all 13 test-backend-ops cases on AMD Radeon 890M (RADV GFX1150). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vulkan: optimize GATED_DELTA_NET shader (Phase 1) - vec4 dot products on all inner loops (dp4 hardware intrinsic) - Cache exp(g) in shared memory for KDA path, eliminating ~32K redundant global reads and ~16K redundant exp() calls per token - vec4 fused decay + rank-1 update (3 vec4 ops vs 12 scalar ops) - Add perf benchmark cases for GATED_DELTA_NET to test-backend-ops KDA TG: +5.4% throughput. Non-KDA: no regressions. 13/13 test-backend-ops passing on AMD Radeon 890M (RADV GFX1150). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vulkan: address review feedback for GATED_DELTA_NET Pipeline array refactor [3][2], A_TYPE/D_TYPE/FLOAT_TYPE shader macros, scale in push constants, supports_op fix, dispatch restructuring. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vulkan: use FLOAT_TYPE for buffer/shared declarations, align formatting Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vulkan: add explicit FLOAT_TYPE casts for buffer loads Wrap data_q, data_k, and data_g buffer reads with FLOAT_TYPE() casts to ensure correct behavior across all Vulkan configurations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vulkan: fix Q/K broadcast for interleaved head layout Adapt to the interleaved broadcast convention from #20340: head_id / rq1 → head_id % neq1 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Progeny Alpha <ProgenyAlpha@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> b8317	2026-03-12 11:32:04 +01:00
Sigbjørn Skjæret	c3e3f9e533	convert : better mtp check and fix return [no ci] (#20419 )	2026-03-12 10:04:20 +01:00
ProgenyAlpha	40c550d4f6	vulkan: fix SSM_CONV PP scaling with large ubatch sizes (#20379 ) * vulkan: optimize SSM_CONV workgroup dispatch for large ubatch Tile tokens into 2D workgroups (32x16) to reduce workgroup launch overhead at large ubatch sizes. Add vec4 fast path for nc=4 (common d_conv size). Fixes PP performance degradation with ubatch > 512. Ref: ggml-org/llama.cpp#18725 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vulkan: remove unused shared memory declaration in SSM_CONV Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Progeny Alpha <ProgenyAlpha@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> b8315	2026-03-12 10:03:18 +01:00
Pascal	de190154c8	New conversations now auto-select the first loaded model (#20403 ) * webui: auto-select first loaded model for new conversations in router mode * chore: update webui build output	2026-03-12 09:07:05 +01:00
Masashi Yoshimura	05039967da	ggml-virtgpu: Fix some build commands (#20341 )	2026-03-12 15:47:45 +08:00
Georgi Gerganov	e4cff0956b	metal : avoid divisions in bin kernel (#20426 ) * metal : avoid modulus in bin kernel when not broadcasting * metal : fix capture_started flag b8312	2026-03-12 09:42:40 +02:00
Masato Nakasaka	4cc6eb158c	ci: Setup self-hosted CI for Intel Linux Vulkan backend (#20154 )	2026-03-12 06:43:22 +01:00
Jeff Bolz	246ffc4b05	vulkan: fix l2_norm epsilon handling (#20350 ) b8310	2026-03-12 06:39:41 +01:00
Jeff Bolz	aa429cf507	vulkan: fix OOB check in flash_attn_mask_opt (#20296 ) b8309	2026-03-12 06:35:49 +01:00
Masato Nakasaka	5866e3bbc8	vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large models with --no-mmap (#20059 ) * Changed to reuse command buffers to fix crashing on Intel GPU * Removed unused parameter * Fixed compile error and minor mistake * Fix logging * Changing to use usage flag per command buffer * fixed style * added buffer reset * Removed cmd_buffer_idx for reuse consistency * Fixed style b8308	2026-03-12 06:30:16 +01:00
lhez	0516e04bf9	opencl: use larger workgroup size for get_rows (#20316 )	2026-03-11 22:03:27 -07:00
shaofeiqi	3d9ab225e7	opencl: add cumsum op (#18981 ) * OpenCL: add CUMSUM op support * remove unused argument * opencl: refactor cumsum * opencl: refactor * opencl: refactor tmp buffer * opencl: adjust max number of subgroups * opencl: fix whitespace * opencl: fix global size when cumsum the tmp buffer --------- Co-authored-by: Li He <lih@qti.qualcomm.com>	2026-03-11 22:03:07 -07:00
uvos	d63aa398de	hip: compile debug builds with -O2 on hip to avoid a compiler bug (#20392 ) b8305	2026-03-12 10:37:10 +08:00
Mishusha	a8304b4d27	common/parser: add GigaChatV3/3.1 models support (#19931 ) Co-authored-by: Mishusha <pmv26021975@gmail.com> b8304	2026-03-12 01:22:25 +01:00
DAN™	fdb17643d3	model : add support for Phi4ForCausalLMV (#20168 ) * Add support for Phi4ForCausalLMV. * Fix Phi-4 vision parity (correcting SigLIP2 patch-kernel export layout) and matching HF NaFlex resize behavior in mtmd. * Rename contants + fix tokenizer label * Clean-ups. * Fix GGUF export. * Set tokenizer.ggml.pre explicitly. * Default vocab name rather than forcing it. * Clean-ups. * Fix indent. * Fix subscriptable error. * remov overcomplicated code path * Clean-ups. --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> b8303	2026-03-12 00:25:54 +01:00
Richard Davison	1eea6a2968	graph : add optional scale parameter to build_lora_mm [no ci] (#20427 )	2026-03-12 00:22:49 +01:00
ddh0	4a748b8f15	common : fix --n-cpu-moe, --cpu-moe for models with fused gate + up (#20416 ) b8301	2026-03-12 00:13:28 +01:00
Masashi Yoshimura	f2ab047f27	ggml-webgpu: Add supports for `GGML_OP_REPEAT` (#20230 ) * Add GGML_OP_REPEAT to webgpu backend. * Add i16 support for GGML_OP_REPEAT. b8300	2026-03-11 14:40:36 -07:00
Georgi Gerganov	d28961d81e	llama : enable chunked fused GDN path (#20340 ) * llama : enable chunked fused GDN path * models : avoid Q and K repeats when using fused GDA * cont : fix comment Co-authored-by: Aman Gupta <amangupta052@gmail.com> * cont : fix the fix Co-authored-by: Aman Gupta <amangupta052@gmail.com> * cont : fix * metal : add GDN kernel (#20361) * metal : add Metal backend for GGML_OP_GATED_DELTA_NET Add a fused Metal kernel for the gated delta net recurrence op (#19504), enabling GPU-accelerated inference for DeltaNet-based models (Qwen3.5, etc.) on Apple Silicon. Supports both GDA (scalar gate) and KDA (per-row gate) modes with head_size 64 and 128. Unsupported configurations (head_size 32, non-contiguous tensors) gracefully fall back to CPU. Performance: Qwen3.5-0.8B Q4_K_M on M4 Max tg128: 170 -> 213 t/s (+25%) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * metal : validate contiguity of all input tensors in supports_op Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * metal : add algorithm equivalence comment for GDA decay path Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * cont : unslop + optimize * cont : clean-up --------- Co-authored-by: Paul Flynn <paul@arkavo.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * CUDA: AR gated delta net improvements (#20391) * Add FastDiv to gated_delta_net_cuda * Shard columns across warps This reduces register pressure (avoids spill for S_v = 128) and gives the warp-scheduler more CTAs to schedule (thus hiding data-access latencies). * Remove unneded include in gated_delta_net.cu * Improve comments * Apply code-formating * Make sharding HIP-compatible 1. Use ggml_cuda_get_physical_warp_size() to determine warp size flexibly 2. Add test with partial warp to test sum reduction on CUDA * Remove fastdiv_s64, as we can treat neqk1 and rq3 as uint32_t * Rename variables * Enable GDN also for prefill, move TODO for chunked_GDN * Actually remove the TODO from `2068908975` * Get warp size at runtime warp_size is not known at compile time in hip host code. * Don't expose ggml_cuda_get_physical_warp_size on host --------- Co-authored-by: uvos <devnull@uvos.xyz> * llama : refactor llm_build_delta_net_base API --------- Co-authored-by: Aman Gupta <amangupta052@gmail.com> Co-authored-by: Paul Flynn <paul@arkavo.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Oliver Simons <osimons@nvidia.com> Co-authored-by: uvos <devnull@uvos.xyz> b8299	2026-03-11 22:46:40 +02:00
Sigbjørn Skjæret	f90bd1dd84	llama : whitespace cleanup (#20422 ) b8298	2026-03-11 21:18:29 +01:00
Richard Davison	5eae9cb1d9	ggml : add NVFP4 quantization type support (#19769 ) * WIP: add NVFP4 quantization support * tests * improve NVFP4 dot product implementation performance and fix bad super call * typo * Use nvfp4 kvalues * vulkan : fix NVFP4 shader compilation by including kvalues_mxfp4 lookup table * vulcal and perf fixes * wip * Fix metal * fix vulcan * Rename threshold & fix wrong scale * Fix MOE * Shelf backend implementations (CUDA, Metal, Vulkan, arch-specific SIMD) Remove NVFP4 support from GPU backends and architecture-specific optimized dot products. These should be added in separate PRs so backend specialists can review them independently. Reverted files: - ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh, quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh - ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h, ggml-metal-ops.cpp - ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/* - ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c Core NVFP4 support (type definition, CPU fallback dot product, quantization, dequantization, conversion) is retained. * Fix arch-fallback.h: add NVFP4 generic fallback for all platforms After shelving backend-specific SIMD implementations, the generic CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390 platforms that previously relied on arch-specific versions. * quantize: add NVFP4 as a quantization type option * Fix ggml_fp32_to_ue4m3: handle subnormal values Previously, values with ue4m3_exp <= 0 were clamped to 0, causing all small scales to underflow. This made NVFP4 quantization via llama-quantize produce garbage (PPL = 5.8M) since typical transformer weights have amax/6.0 in the range 0.001-0.01, which falls in the UE4M3 subnormal range. Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7), matching the decode path in ggml_ue4m3_to_fp32. Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33), comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15). * Restore ARM NEON NVFP4 dot product implementation Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products. tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup * Optimize ARM NEON NVFP4 dot product: LUT + vpaddq + vfmaq - Add ue4m3_scale_lut[128] to ggml-common.h replacing branch-heavy ggml_ue4m3_to_fp32() in the hot loop - Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32 - Accumulate with vfmaq_f32 into float32x4_t vector accumulators tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed) * ARM NEON NVFP4: rearrange q8 to match nibble layout Alternative approach: rearrange q8 data to match the NVFP4 lo/hi nibble layout instead of rearranging the looked-up NVFP4 values. Eliminates vcombine_s8(vget_low, vget_low) shuffles. Performance is equivalent (~18.5 t/s) - the bottleneck is the 2x block overhead from QK=16 vs QK=32, not the shuffle instructions. * CPU only backend 64 super-block layout * cleanup * Remove unused LUT * int * exclude NVFP4 from unsupported ops in metal build * remove quantization for now * store scales as native UE4M3, preserve original model bits when possible * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * correct comment * format * reduce duplication and cleanup * Address comments * move detection to prepare_tensors * Use math instead of const * Move * fix comment * Shelf quantize tests * Rebase and move check * cleanup * lint * Update gguf-py/gguf/scripts/gguf_convert_endian.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Use fallback quant config * Simplify Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * organize * Refactor * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * add quantize_nvfp4 (required for test_quants.py) * add quantize_nvfp4 (required for test_quants.py) * add quantize_nvfp4 (required for test_quants.py) * fix return type --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b8297	2026-03-11 21:02:54 +01:00
Georgi Gerganov	3ca19b0e9f	benches : add nemotron super (#20420 )	2026-03-11 21:39:40 +02:00
Daniel Bevenius	eaf1d7930c	llama : add support for Nemotron 3 Super (#20411 ) * llama : add support for Nemotron 3 Super This commit adds support for the Nemotron 3 Super model (120B.A12B) enabling this model to be converted to GGUF format and run in llama.cpp. Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Matt Clayton <156335168+mattjcly@users.noreply.github.com> b8295	2026-03-11 19:27:53 +01:00
Georgi Gerganov	76ea1c1c46	metal : fix capture_compute counter logic (#20410 )	2026-03-11 18:38:22 +02:00
Aman Gupta	bd1ec818e9	compare-llama-bench: check remotes as well (#20406 )	2026-03-12 00:14:42 +08:00
Georgi Gerganov	b541241104	metal : fix q5_k mul_mv register spill (#20399 ) b8292	2026-03-11 16:25:27 +02:00
Georgi Gerganov	c363256839	metal : add env var to trigger graph capture (#20398 ) b8291	2026-03-11 16:25:10 +02:00
Neo Zhang	ecac98ee53	[SYCL] Update SYCL.md for binary package for Windows (#20401 ) * add download binary package * update prefix	2026-03-11 22:21:22 +08:00
Ruben Ortlam	182acfe5c5	ci: disable coopmat on ubuntu-24-cmake-vulkan job (#20294 )	2026-03-11 14:12:29 +01:00
Aldehir Rojas	b5fe4559ae	common/parser: use nlohmann::ordered_json to preserve parameter order (#20385 )	2026-03-11 10:26:51 +01:00
Piotr Wilkin (ilintar)	acb7c79069	common/parser: handle reasoning budget (#20297 ) * v1 * Finished! * Handlie cli * Reasoning sampler * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Less explosive terminology :) * Add utf-8 case and tests * common : migrate reasoning budget sampler to common * cont : clean up * cont : expose state and allow passing as initial state * cont : remove unused imports * cont : update state machine doc string --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Alde Rojas <hello@alde.dev> b8287	2026-03-11 10:26:12 +01:00
uvos	5f91b1d5d5	ggml-cuda: gdn use shared mem for HIP (#20366 ) Suggested-by: Aman Gupta <amangupta052@gmail.com> b8286	2026-03-11 13:06:19 +08:00
uvos	9ef7523ee9	cuda/hip: fix loop unrolling in ssm-conv (#20369 ) b8285	2026-03-11 13:04:32 +08:00
Pascal	00de615345	Fix agentic mcp image single model (#20339 ) * webui: fix MCP image attachments dropped during the agentic loop in single-model mode * chore: update webui build output b8284	2026-03-11 05:31:33 +01:00
Alessandro de Oliveira Faria (A.K.A.CABELO)	e1a399992b	vendor : update cpp-httplib to 0.37.0 (#20207 )	2026-03-11 11:03:53 +08:00
Alessandro de Oliveira Faria (A.K.A.CABELO)	4f2f0a163d	vendor : update miniaudio to 0.11.25 (#20209 )	2026-03-11 11:01:56 +08:00
Neo Zhang	0cec84f999	fix op rope, add rope_back (#20293 ) b8281	2026-03-11 09:53:34 +08:00
Neo Zhang	b2e1427c9b	fix for failed UT case: ACC, L2_NORM, UPSCALE, fused_glu, unary (#20283 ) b8280	2026-03-11 09:53:05 +08:00
Vinicios Lugli	4d99d45084	model : qwen3vl reranker text support (#20332 ) * model : fix qwen3vl reranker support * Remove CLS_OUT Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b8279	2026-03-10 23:40:14 +01:00
ddh0	10e5b148b0	llama-quant : correct `n_attention_wv` usage (#20357 ) * llama-quant : correct `n_attention_wv` usage In #19770, I introduced a regression in the way the `quantize_state_impl` counter values were initialized. I was incrementing and using `n_attention_wv` in the same loop, when it should have been fixed by the time we're deciding tensor types in `llama_tensor_get_type_impl` (for `use_more_bits`). I never observed a difference in any of [my tests](https://github.com/ggml-org/llama.cpp/pull/19770#issuecomment-4000424712) - it was only after @bartowski kindly pointed this out that I realized it was incorrect. (Thanks!) * simplify b8278	2026-03-10 21:43:29 +02:00
Georgi Gerganov	90b2731894	ggml : bump RPC version (#20330 ) b8277	2026-03-10 21:36:57 +02:00


8326 Commits