Commit Graph

8067 Commits

Author SHA1 Message Date
Georgi Gerganov
ff4affb4c1 sync : ggml b8067 2026-02-15 22:24:29 +02:00
Georgi Gerganov
55d58599c8 ggml : bump version to 0.9.7 (ggml/1425) 2026-02-15 22:24:29 +02:00
Georgi Gerganov
1a8c700bfd ggml : bump version to 0.9.6 (ggml/1423) 2026-02-15 22:24:29 +02:00
David Friehs
27b93cbd15 cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization (#19624)
* cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization

- load all 8 int8 for a grid position in one load
- calculate signs via popcnt instead of fetching from ksigns table
- broadcast signs to drop individual shift/mask
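
The popcnt bullet above can be illustrated with a small sketch. It assumes (the commit does not spell this out) that the sign table maps a 7-bit index to a byte whose low 7 bits are the index and whose top bit is an even-parity bit; the names `sign_byte_from_index`, `build_sign_table` and `apply_signs` are hypothetical, and a real kernel would use a hardware popcount rather than Python string counting.

```python
def sign_byte_from_index(i):
    """Rebuild an 8-bit sign mask from a 7-bit index without a table.

    Assumption: bit 7 is a parity bit chosen so that the number of set
    bits (negated values) in the whole byte is even.
    """
    if not 0 <= i < 128:
        raise ValueError("index must fit in 7 bits")
    parity = bin(i).count("1") & 1      # popcount(i) mod 2
    return i | (parity << 7)


def build_sign_table():
    """Table-based reference: pick the top bit that makes the popcount even."""
    table = []
    for i in range(128):
        for top in (0x00, 0x80):
            if bin(i | top).count("1") % 2 == 0:
                table.append(i | top)
    return table


def apply_signs(values, mask):
    """Negate values[j] wherever bit j of the mask is set."""
    return [-v if (mask >> j) & 1 else v for j, v in enumerate(values)]


if __name__ == "__main__":
    # The computed byte should agree with the table for every index.
    table = build_sign_table()
    print(all(sign_byte_from_index(i) == table[i] for i in range(128)))
```

Under that assumption the table lookup becomes a popcount plus an OR, which is the kind of saving the bullet describes.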

* cuda: iq2xxs: simplify sum scaling

express `(sum * scale + sum / 2) / 4` as `(sum * (scale * 2 + 1)) / 8`
express `((aux32 >> 28) * 2 + 1)` as `(aux32 >> 27 | 1)`

saves 3 registers for mul_mat_vec_q (152 -> 149) according to nsight
AFAICT no overflow can occur here as iq2xxs values are far too small
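
The two rewrites quoted above can be checked by brute force. This is a sketch only: it models C-style truncating division in Python via a hypothetical helper `c_div`, takes the scale to be the 4-bit value `aux32 >> 28`, and uses an arbitrary range of sums chosen for illustration. It says nothing about the overflow remark.

```python
def c_div(a, b):
    """Integer division truncating toward zero, as in C/CUDA (b > 0)."""
    q = abs(a) // b
    return q if a >= 0 else -q


def scale_old(total, scale):
    # (sum * scale + sum / 2) / 4
    return c_div(total * scale + c_div(total, 2), 4)


def scale_new(total, scale):
    # (sum * (scale * 2 + 1)) / 8
    return c_div(total * (scale * 2 + 1), 8)


def factor_old(aux32):
    # ((aux32 >> 28) * 2 + 1)
    return (aux32 >> 28) * 2 + 1


def factor_new(aux32):
    # (aux32 >> 27 | 1); the shift binds tighter than the OR
    return (aux32 >> 27) | 1


if __name__ == "__main__":
    same_scale = all(
        scale_old(t, s) == scale_new(t, s)
        for t in range(-5000, 5001)
        for s in range(16)
    )
    same_factor = all(
        factor_old((k << 27) | low) == factor_new((k << 27) | low)
        for k in range(32)
        for low in (0, 1, 0x7FFFFFF)
    )
    print(same_scale, same_factor)
```

The scaling forms agree because `sum * (2 * scale + 1)` is odd whenever `sum` is odd, so the truncated `sum / 2` never changes the quotient by 8.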

* uint -> uint32_t

error: identifier "uint" is undefined
b8064
2026-02-15 22:38:42 +05:30
Aaron Teo
6e67fd2144 docs: update s390x build docs (#19643) 2026-02-16 00:33:34 +08:00
Adrien Gallouët
9e118b97c4 build : remove LLAMA_HTTPLIB option (#19623)
This option was introduced as a workaround because cpp-httplib could not
build on visionOS. Since it has been fixed and now compiles on all platforms,
we can remove it and simplify many things.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b8062
2026-02-15 15:38:50 +01:00
Daniel Bevenius
57088276d4 cmake : check if KleidiAI API has been fetched (#19640)
This commit addresses a build issue with the KleidiAI backend when
building multiple CPU backends. Commit
3a00c98584 ("cmake : fix KleidiAI install
target failure with EXCLUDE_FROM_ALL") introduced a change where
FetchContent_Populate is called instead of FetchContent_MakeAvailable,
where the latter does handle this case (it is idempotent but
FetchContent_Populate is not).

I missed this during my review and I should not have committed without
verifying the CI failure, sorry about that.
b8061
2026-02-15 13:59:38 +01:00
Georgi Gerganov
341bc7d23c context : fix output reorder with backend sampling (#19638) b8060 2026-02-15 14:57:40 +02:00
Georgi Gerganov
08e6d914b8 ggml : avoid UB in gemm ukernel (#19642) b8059 2026-02-15 14:56:35 +02:00
Aaron Teo
184c694f45 ggml-cpu: optimize ggml_vec_dot_bf16 for s390x (#19399) b8058 2026-02-15 18:20:35 +08:00
Aman Gupta
684b36101c ggml-cpu: FA add GEMM microkernel (#19422)
* ggml-cpu: FA add GEMM microkernel

* add guard for sizeless vector types

* fix case where DV % GGML_F32_EPR !=0

* move memset out of the loop

* move another memset out of the loop

* use RM=4 for arm

* simd_gemm: convert everything to int

* convert everything to size_t to avoid warnings

* fixup

* add pragma for ignoring aggressive loop optimizations
b8057
2026-02-15 11:09:24 +05:30
SamareshSingh
3a00c98584 cmake : fix KleidiAI install target failure with EXCLUDE_FROM_ALL (#19581)
* cmake: fix KleidiAI install target failure with EXCLUDE_FROM_ALL

Fix for the bug #19501 by adding EXCLUDE_FROM_ALL to FetchContent_Declare. This properly excludes KleidiAI from both build and install targets, preventing install failures when GGML_CPU_KLEIDIAI=ON is used.

The KleidiAI source files are still compiled into libggml-cpu.so, preserving all functionality.

* addressed code review comments
b8056
2026-02-15 06:22:53 +01:00
Sigbjørn Skjæret
079feab9e3 convert : ensure all models handle new experts count (#19621)
* ensure all models handle new experts count

* revert removal for PhiMoeModel, does not inherit from base
b8055
2026-02-14 22:22:32 +01:00
Anav Prasad
01d8eaa28d mtmd : Add Nemotron Nano 12B v2 VL support (#19547)
* nemotron nano v2 vlm support added

* simplified code; addressed reviews

* pre-downsample position embeddings during GGUF conversion for fixed input size
b8054
2026-02-14 14:07:00 +01:00
Georgi Gerganov
1725e316c1 models : optimize qwen3next graph (#19375)
* models : optimizing qwen3next graph

* cont

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* cont : remove redundant q, g chunking

* minor

* minor

* avoid passing masks around

* avoid concats during chunking

* naming + shapes

* update names and use prefix to disable CUDA graphs
b8053
2026-02-14 12:57:36 +02:00
Adrien Gallouët
b7742cf321 ggml : fix GGML_DEBUG with OpenMP (#19599)
last_graph is only available without OpenMP, but
ggml_graph_compute_thread() is called in both cases.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b8052
2026-02-14 11:22:57 +01:00
iMil
badba89320 NetBSD build support (#19589) b8051 2026-02-14 09:47:01 +01:00
Aleksander Grygier
baa12f3831 webui: Architecture and UI improvements (#19596) 2026-02-14 09:06:41 +01:00
agent-enemy-2
2d8015e8a4 llama : update LoRA API. + fix excessive graph reserves (#19280)
* Refactoring to use new llama_put_adapter_loras

* cont : alternative lora API

---------

Co-authored-by: Jake Chavis <jakechavis6@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b8049
2026-02-14 10:06:27 +02:00
George
eb145c0753 mmap: Fix Windows handle lifetime (#19598)
* ggml: added cleanups in ggml_quantize_free
Add missing cleanup calls for IQ2_S, IQ1_M quantization types and IQ3XS with 512 blocks during quantization cleanup.

* mmap: Fix Windows handle lifetime
Move hMapping from local variable to member variable so it stays alive for the entire lifetime of the mapping.
The file mapping handle must remain valid until UnmapViewOfFile is called.
Fixes cleanup order in destructor.

* Update llama-mmap.cpp

* Update llama-mmap.cpp

Remove trailing whitespace from line 567
b8048
2026-02-14 10:05:12 +02:00
Georgi Gerganov
6e473fb384 metal : fix ACC op (#19427) b8047 2026-02-14 09:54:03 +02:00
Adrien Gallouët
c7db95f106 scripts : use official split.py for cpp-httplib (#19588)
* scripts : use official split.py for cpp-httplib

Using the official script is safer and ensures the generated code aligns
with the library's standards.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Catch generic errors

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Allow print()

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Ensure robust cleanup

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b8046
2026-02-14 08:41:16 +01:00
Sigbjørn Skjæret
0d00ef65ed convert : store ffn_gate_inp_shexp as F32 (#19606) 2026-02-14 08:17:43 +01:00
Adrien Gallouët
91ea5d67f2 build : fix libtool call in build-xcframework.sh (#19605)
Run libtool via xcrun like strip and dsymutil, to have proper tool resolution.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-14 06:48:37 +01:00
Jeff Bolz
dbb023336b vulkan: support L2_NORM with contiguous rows (#19604) b8043 2026-02-14 06:42:04 +01:00
Jeff Bolz
53aef25a88 vulkan: support GGML_OP_SET (#19584) b8042 2026-02-14 06:36:38 +01:00
Sophon
2dec548094 vulkan: Add vendor id for Qualcomm drivers (#19569)
This commit allows Qualcomm native vulkan driver to be used on Windows
instead of Mesa Dozen.
b8041
2026-02-14 06:29:17 +01:00
Max Krasnyansky
0ccbfdef3e hexagon: further optimizations and refactoring for flash attention (#19583)
* ggml-hexagon: fa improvements

ggml-hexagon: optimize flash attention calculations with improved variable handling

ggml-hexagon: streamline flash attention operations by removing redundant checks for FP32

ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx2 by simplifying variable handling for unused elements

ggml-hexagon: optimize flash attention by changing slope vector type to F16

* hexfa: fixed test-backend-ops failures due to leftover element handling

* hexagon: refactor and optimize fa to use local context struct

* ggml-hexagon: optimize flash-attention using hvx_vec_expf

Use HVX for online softmax.
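
For readers unfamiliar with the term, online softmax can be sketched in scalar form as below. This is a generic illustration, not the HVX code from the commit: the function name `online_softmax` is hypothetical and the inputs are assumed to be finite floats.

```python
import math


def online_softmax(scores):
    """Softmax with a single statistics pass: running max plus rescaled sum."""
    running_max = -math.inf
    running_sum = 0.0
    for x in scores:
        new_max = max(running_max, x)
        # Rescale the sum accumulated so far to the new reference maximum,
        # then add the contribution of the current element.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    return [math.exp(x - running_max) / running_sum for x in scores]


if __name__ == "__main__":
    print(online_softmax([1.0, 2.0, 3.0]))
```

Because the maximum and the normalizer are updated together, the scores can be consumed block by block, which is why flash-attention style kernels rely on this formulation.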

---------

Co-authored-by: chraac <chraac@gmail.com>
b8040
2026-02-13 16:27:30 -08:00
Mengsheng Wu
94a602db66 github : add missing backends to issue templates (#19603) 2026-02-14 00:56:53 +01:00
Jeff Bolz
05a6f0e894 vulkan: restore -inf check in FA shaders (#19582) b8038 2026-02-13 13:35:29 -06:00
Adrien Gallouët
b48e80f677 common : update download code (#19573)
* common : remove legacy .json to .etag migration code

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* common : simplify common_download_file_single_online

This commit also forces a redownload if the file exists
but has no .etag file.
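
The redownload rule described above might look roughly like the following. This is a sketch under assumptions, not the code in `common_download_file_single_online`: the function name `needs_download`, the `<file>.etag` sidecar naming, and the behaviour when no remote ETag is available are all hypothetical.

```python
from pathlib import Path
from typing import Optional


def needs_download(path: Path, remote_etag: Optional[str]) -> bool:
    """Decide whether a cached file has to be fetched again."""
    etag_path = path.with_name(path.name + ".etag")  # assumed sidecar name
    if not path.exists():
        return True                      # nothing cached yet
    if not etag_path.exists():
        return True                      # file without .etag: force a redownload
    if remote_etag is None:
        return False                     # assumption: keep the cache if the server sent no ETag
    return etag_path.read_text().strip() != remote_etag


if __name__ == "__main__":
    print(needs_download(Path("model.gguf"), "abc"))
```

Treating a missing sidecar as a cache miss means there is no need to guess whether an old file is still current.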

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b8037
2026-02-13 15:10:46 +01:00
Xuan-Son Nguyen
752584d5f5 model: support GLM MoE DSA arch (NOTE: indexer is not yet supported) (#19460)
* model: support GLM MoE DSA arch

* working version

* pyright

* keep indexer tensors

* add indexer gguf params

* loaded now

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* update

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* minor fix and cleanup

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b8036
2026-02-13 14:56:53 +01:00
Alberto Cabrera Pérez
cc2aa81513 Fix wrong memcpy length for block_interleave == 4 (#19575) b8035 2026-02-13 20:32:14 +08:00
ymcki
0e21991472 fix vulkan ggml_acc only works in 3d but not 4d (#19426)
* fix vulkan ggml_acc only works in 3d but not 4d

* removed clamp in test_acc_block

* use the correct stride and its test case

* cuda : fix "supports op" condition

* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv's suggestion except to keep the boundary check

* version without boundary check

* revert back to boundary check version

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b8034
2026-02-13 13:31:37 +01:00
Sigbjørn Skjæret
b2ecc0cdb4 support --verbose-prompt (#19576) b8033 2026-02-13 12:49:10 +01:00
Aman Gupta
5065da554e CUDA: loop over ne2*ne3 in case it overflows (#19538)
* CUDA: loop over ne2*ne3 in case it overflows

* use fastdiv
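
The two bullets above can be pictured with a host-side simulation. It assumes the "overflow" refers to `ne2*ne3` exceeding the CUDA limit of 65535 for `gridDim.y`/`gridDim.z`, and it uses a well-known multiply-and-shift division scheme; whether that matches the backend's own `fastdiv` helper is an assumption, and the names `init_fastdiv`, `fastdiv` and `visit_planes` are hypothetical.

```python
MAX_GRID_DIM = 65535  # CUDA limit for gridDim.y and gridDim.z


def init_fastdiv(d):
    """Precompute (multiplier, shift) for dividing 32-bit values by d."""
    if not 1 <= d < 2**32:
        raise ValueError("divisor must be a positive 32-bit value")
    shift = (d - 1).bit_length()                      # ceil(log2(d))
    mult = ((1 << 32) * ((1 << shift) - d)) // d + 1
    return mult, shift


def fastdiv(n, params):
    """Compute n // d using one multiply, one add and one shift."""
    mult, shift = params
    hi = (n * mult) >> 32          # high 32 bits of the 64-bit product
    return (hi + n) >> shift


def visit_planes(ne2, ne3, grid_dim=MAX_GRID_DIM):
    """Yield (block, i2, i3) for every plane, striding if ne2*ne3 > grid_dim."""
    params = init_fastdiv(ne2)
    total = ne2 * ne3
    blocks = min(total, grid_dim)                 # never launch more than the limit
    for block in range(blocks):                   # stands in for blockIdx
        for flat in range(block, total, blocks):  # stride loop inside the kernel
            i3 = fastdiv(flat, params)            # flat // ne2 without a divide
            i2 = flat - i3 * ne2                  # flat % ne2
            yield block, i2, i3


if __name__ == "__main__":
    planes = list(visit_planes(300, 400, grid_dim=1000))
    print(len(planes), len({(i2, i3) for _, i2, i3 in planes}))
```

Each simulated block handles several planes when the product exceeds the grid limit, and the index split avoids a hardware integer division in the inner loop.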
b8032
2026-02-13 17:01:40 +05:30
Aleksander Grygier
5174d7206f webui: UI and routing fixes (#19586)
* chore: update webui build output

* chore: update webui build output

* fix: Scroll issues in DropdownMenuSearchable

* webui: fix redirect to root ignoring base path

* fix: Word wrapping

* fix: remove obsolete modality UI tests causing CI failures

- Remove VisionModality/AudioModality test stories
- Remove mockServerProps usage and imports
- Simplify Default test (remove dropdown interaction checks)
- Simplify FileAttachments test (remove mocks)

* feat: Improve formatting performance time

---------

Co-authored-by: Pascal <admin@serveurperso.com>
2026-02-13 12:31:00 +01:00
Oliver Simons
43919b7f4f CUDA: Do not mutate cgraph for fused ADDs (#19566)
* Do not mutate cgraph for fused ADDs

1. We should try to minimize in-place changes to the incoming
   ggml_cgraph where possible (those should happen in graph_optimize)
2. Modifying in-place leads to an additional, unnecessary graph capture
   step as we store the properties before modifying the graph in-place
   in the cuda-backend

* Assert ggml_tensor is trivially copyable

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
b8030
2026-02-13 15:07:55 +05:30
Pavan Shinde
423cf0b26f docs : fix broken link and typo (#19560) 2026-02-13 09:38:09 +01:00
ymcki
33a56f90a6 model : Kimi Linear fix conv state update (#19531)
* fix conv state update for llama-server parallel serving

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
b8028
2026-02-13 09:10:18 +01:00
Adrien Gallouët
25224c8021 llama : remove deprecated codecvt (#19565)
Using the same conversion function ensures a consistent matching between
the regex pattern and the text.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b8027
2026-02-13 06:43:53 +01:00
Adrien Gallouët
2f5d8f8edc vendor : update BoringSSL to 0.20260211.0 (#19562)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b8026
2026-02-13 06:43:26 +01:00
Georgi Gerganov
bb96bfd361 memory : fix kv cache size for hybrid models (#19559) b8025 2026-02-13 07:36:24 +02:00
Georgi Gerganov
0644baefde metal : improve concurrency (#19555) b8024 2026-02-13 07:35:57 +02:00
Georgi Gerganov
490eb96b88 metal : support GGML_OP_SET (#19548) b8023 2026-02-13 07:34:52 +02:00
Shupei Fan
3bb78133ab hexagon: fix typo in vtcm_needs_release (#19545) b8022 2026-02-12 15:07:49 -08:00
lhez
79cc0f2daf opencl: add basic support for q4_1 (#19534)
* opencl: add q4_1 mv

* opencl: clean up

* opencl: add flattened q4_1 mv

* opencl: clean up

* opencl: add basic q4_1 mm

* opencl: fix whitespace

* opencl: add general q4_0 mm
b8021
2026-02-12 14:52:37 -08:00
Georgi Gerganov
338085c69e args : add -kvu to llama-parallel (#19577) b8020 2026-02-12 21:52:41 +02:00
Aleksander Grygier
4c61875bf8 webui: Add switcher to Chat Message UI to show raw LLM output (#19571) 2026-02-12 19:55:51 +01:00
Adrien Gallouët
4b385bfcf8 vendor : update cpp-httplib (#19537)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b8018
2026-02-12 16:11:22 +01:00