8864 Commits

Author SHA1 Message Date
xris99
ff6b1062af server : fix hardcoded proxy connection timeout in router mode (#18760) (#22003)
Fixes: https://github.com/ggml-org/llama.cpp/issues/18760

Co-authored-by: Christian <christian@example.com>
b8864
2026-04-21 06:41:14 +02:00
leonardHONG
97895129e5 ggml-cuda: flush legacy pool on OOM and retry (#22155)
* ggml-cuda: flush legacy pool on OOM and retry

Signed-off-by: 梁厚宏 <2695316095@qq.com>

* Address review comments: add explicit sync, update destructor, clean up MUSA macros

Signed-off-by: 梁厚宏 <2695316095@qq.com>

---------

Signed-off-by: 梁厚宏 <2695316095@qq.com>
b8863
2026-04-20 23:30:38 +02:00
Xuan-Son Nguyen
86f8daacfe mtmd: correct get_n_pos / get_decoder_pos (#22175) b8862 2026-04-20 23:29:19 +02:00
Georgi Gerganov
cf8b0dbda9 server : remove /api endpoints (#22165)
* server : remove /api endpoints

* cont : remove /api/tags
b8861
2026-04-20 20:41:19 +03:00
Gaurav Garg
fd6ae4ca1c Tensor-parallel: Fix delayed AllReduce on Gemma-4 MoE (#22129)
* Fix delayed AllReduce on Gemma-4 MoE

Skip forward past nodes that don't consume the current one, and allow a chain of MULs.

* Check for all sources before skipping nodes

* Address review comments
b8860
2026-04-20 18:25:39 +02:00
Johannes Gäßler
fb19f94c71 TP: fix 0-sized tensor slices, AllReduce fallback (#21808)
* TP: fix 0-sized tensor slices, AllReduce fallback

* fix layer structure <-> GPU count aliasing

* add missing std::fill

* fix CUDA device set, max ggml ctx size
b8859
2026-04-20 18:09:39 +02:00
pl752
7f251fdbce ggml-cpu: Optimized x86 and generic cpu q1_0 dot (follow up) (#21636)
* Implemented optimized q1_0 dot for x86 and generic

* Removed redundant helper definition

* Removed two redundant instructions from AVX q1_0 dot

* Fixed inconsistency with fp16 conversion for generic q1_0 dot and deduplicated generic fallback

* Style cleanup around AVX q1_0 dot

* Replaced explicitly unrolled blocks with inner for loop for q1_0

* Replaced scalar ARM q1_0 impl with new generic one
b8858
2026-04-20 19:02:54 +03:00
neha-ha
a6cc43c286 ggml-webgpu: updated matrix-vector multiplication (#21738)
* merged properly, but slow q3_k and q5_k with u32 indexing

* Start on new mat-vec

* New format float paths working

* Working q4_0

* Work on remaining legacy q-types

* port k-quants to new matvec

* remove old shader

* Remove old constants, format

* remove accidental file

---------

Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
b8857
2026-04-20 07:37:17 -07:00
Xuan-Son Nguyen
a678916623 mtmd: refactor mtmd_decode_use_mrope (#22161) 2026-04-20 14:45:11 +02:00
SamareshSingh
81df3f7cfa fix: GLM-DSA crash in llama-tokenize when using vocab_only (#22102)
* llama: fix crash in print_info for GLM-DSA when vocab_only is set

* addressed code review comments

* cont : simplify

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b8855
2026-04-20 10:32:46 +03:00
Georgi Gerganov
de71b5f81c server : refactor "use checkpoint" logic (#22114) b8854 2026-04-20 08:42:37 +03:00
Katostrofik
788fcbc5dd [SYCL] Fix reorder MMVQ assert on unaligned vocab sizes (#22035)
* [SYCL] Fix reorder MMVQ assert on unaligned vocab sizes

The reorder mul_mat_vec_q dispatchers for Q4_0, Q8_0, Q4_K, and Q6_K
asserted that block_num_y was a multiple of 16 subgroups. Models with
a vocab size not divisible by 16 (for example HY-MT at 120818) aborted
on model load when the output projection tripped the assert.

I replaced the assert with padding: block_num_y now rounds up to a
whole number of subgroup-sized workgroups. The kernel already has the
row bounds check (`if (row >= nrows) return;`) so the extra padded
threads early-exit cleanly. Row values are uniform across a subgroup
so the collective reduce stays safe.

For aligned vocab sizes the padded block_num_y equals the old value,
so the kernel launch is identical and there is no regression.

Thanks to @arthw for flagging the relationship to #21527.

Fixes #22020.

AI assisted coding, tested on Intel B70 hardware.

* sycl: use WARP_SIZE for num_subgroups in reorder MMVQ launches

Replaces the hardcoded 16 with WARP_SIZE in the four reorder_mul_mat_vec
launch helpers (Q4_0, Q8_0, Q4_K, Q6_K). Compile-time no-op on the Intel
target where WARP_SIZE is 16, but makes the relationship to subgroup
size explicit. Per review by @NeoZhangJianyu on #22035.

Assisted by Claude.
b8853
2026-04-20 08:39:45 +03:00
Yes You Can Have Your Own
9d49acb2a7 server: rename --clear-idle to --cache-idle-slots (#21741) b8852 2026-04-20 08:30:24 +03:00
Alessandro de Oliveira Faria (A.K.A.CABELO)
e365e658f0 vendor : update cpp-httplib to 0.42.0 (#21781) b8851 2026-04-20 06:41:43 +08:00
Johannes Gäßler
4eac5b4509 CUDA: refactor mma data loading for AMD (#22051)
* CUDA: refactor mma data loading for AMD

* fix CDNA MMQ occupancy

* fix CDNA3 mma

* fix RDNA3 compile
b8850
2026-04-19 18:26:59 +02:00
Aldehir Rojas
d5b780a676 common/autoparser : allow space after tool call (#22073) b8849 2026-04-19 13:28:35 +02:00
uvos
471540ae8a HIP: Remove unesscary NCCL_CHECK (#21914) b8848 2026-04-19 12:59:44 +02:00
Xuan-Son Nguyen
19124078be mtmd: add pos_0 to mtmd_image_tokens_get_decoder_pos (breaking change) (#22082)
* mtmd: add pos_0 to mtmd_image_tokens_get_decoder_pos

* fix build
b8847
2026-04-19 11:57:21 +02:00
Gaurav Garg
bcdcc1044f ggml : reduce CPU overhead in meta backend (#22041)
* cache subgraph splits when cgraph is unchanged

Skip per-call subgraph construction in ggml_backend_meta_graph_compute when the same ggml_cgraph is used consecutively.

Assign uid to every sub-graph so that CUDA's fast uid check path hits too.

* Address review comments

* Keep the scope as is

* Rename last_uid and last_n_subgraphs field. Remove last_max_tmp_size field. Refactor code.

* Address review comments

* Update ggml/src/ggml-backend-meta.cpp

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-backend-meta.cpp

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b8846
2026-04-19 12:48:35 +03:00
Sigbjørn Skjæret
037bfe38d0 ci : install spirv-headers for vulkan-cross (#22109) 2026-04-19 10:32:08 +03:00
Dowon
8685e7b075 convert : support sentence-transformer 5.4 config files (#22087)
* convert : support sentence-transformer 5.4 config files

* fix: embeddinggemma

* fix: mapping

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix: pooling_mode

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-04-19 10:25:39 +03:00
texasich
09b4efa95f cmake: remove CMP0194 policy to restore MSVC builds (#21934)
#21630 added the CMP0194 NEW policy to silence a CMake warning, but on Windows runners it caused CMake to prefer the MinGW toolchain for ASM and broke MSVC builds.

Reverting only that policy block restores the previous working behavior. The CMake 4.1+ warning comes back, but that is cosmetic and does not break any platform.

Reported-by: oobabooga

Refs: #21630

Co-authored-by: texasich <texasich@users.noreply.github.com>
b8843
2026-04-19 10:25:05 +03:00
Sascha Rogmann
455d8e4be8 server : speculative checkpointing (#19493)
* server : speculative decoding using checkpoints

* server : fix draft check with checkpoints

* server : rename spec vars

* server : log levels

* server : refactored spec logic to speculative.cpp

* server : renamed spec checkpoints option

* server : fix spec checkpoints, logging

* speculative : checkpoints with draft model, logging

* server : n_tokens_cur and create_checkpoint in draft

* server : fix server_speculative_callback (slot.id)

* spec : fix ngram-map/begin idx_last_check

* spec : init ckpt (begin() wasn't called)

* chore: update webui build output

* server : restore sampler in spec checkpoint and clear mem

* cont : avoid --spec-use-checkpoints argument

* cont : remove server_prompt_checkpoint_with_size

* spec : rename (leave_draft_state)

* cont : clean-up

* cont : do not ignore partial drafts even if the are short

* cont : spec callback owned by session

* cont : simplify

* cont : avoid empty speculative session

* cont : simplify

* cont : simplify

* cont : enable mtmd speculative decoding

* cont : keep the spec sampler alive

* cont : simplify

* cont : fix nullptr deref + draft checkpoints

* cont : remove common_speculative_accept_response

* cont : remove callback

* cont : simplify

* cont : minor

* cont : simplify

* cont : fix accepted number

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b8842
2026-04-19 10:24:06 +03:00
Radoslav Gerganov
91fef95362 rpc : refactor the RPC transport (#21998)
* rpc : refactor the RPC transport

Move all transport related code into a separate file and use the
socket_t interface to hide all transport implementation details.

* fix win32

* better socket_t construction
b8841
2026-04-19 10:21:53 +03:00
Cetarthoriphros
9e5647affa server: Expose media_tag on /props endpoint. (#22028) b8840 2026-04-19 00:27:17 +02:00
Sigbjørn Skjæret
4f02d47339 model : refactor bias tensor variable names (#22079)
* refactor bias tensor variable names

* use create_tensor_qkv for jina-bert-v2
b8839
2026-04-18 20:12:00 +02:00
Sigbjørn Skjæret
23b8cc4991 android : libcommon -> libllama-common (#22076) b8838 2026-04-18 11:19:40 +02:00
SamareshSingh
59accc8863 ggml-backend-meta: add multi-segment read support in get_tensor (#22063) b8837 2026-04-18 10:04:51 +02:00
Sigbjørn Skjæret
83d58e02fc ci : free disk space for rocm release (#22012) b8836 2026-04-18 09:37:30 +02:00
Sigbjørn Skjæret
89a5474f0e convert : fix (ignore for now) typings errors (#22002) 2026-04-18 09:36:41 +02:00
Johannes Gäßler
fd1c0ec3f0 llama: fit ctx size for CPU only (#21568) 2026-04-18 08:16:04 +02:00
Reese Levine
45cac7ca70 ggml-webgpu: fix compiler warnings and refactor FlashAttention encoding (#21052)
* Update workflows to remove dependence on llvmpipe

* Try setting Dawn_DIR

* remove c++20 initializers

* Move to proper guid

* Try avoiding segfaults on vulkan backend process exit

* Remove compiler warnings on parameter casting

* Fix soft_max and update reg_tile accumulation to f32 for better precision

* Refactor flash_attn a bit

* remove c++20 initializers and format

* Increase div precision for NVIDIA

* revert div precision and comment out ggml-ci node for now

* Formatting

* Try debugging on a failing CI node

* Revert "Try debugging on a failing CI node"

This reverts commit 1971e33cba.
b8833
2026-04-17 09:17:11 -07:00
Aman Gupta
b94050e896 CUDA: use LRU based eviction for cuda graphs (#21611)
* CUDA: use a ring-buffer for cuda graphs

* bump limit to 128

* use LRU eviction

* better naming

* do periodic clean-up
b8832
2026-04-17 23:24:21 +08:00
Yuri Khrustalev
a279d0f0f4 ci : add android arm64 build and release (#21647)
* server: respect the ignore eos flag

* ci: add android arm64 build and release

* patch

* pin android-setup actions to v4

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* lf in the suggestion

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b8831
2026-04-17 11:32:24 +02:00
65a
268d61e178 mtmd: add missing struct tag (#22023) b8830 2026-04-17 10:48:33 +02:00
Georgi Gerganov
6990e2f1f7 libs : rename libcommon -> libllama-common (#21936)
* cmake : allow libcommon to be shared

* cmake : rename libcommon to libllama-common

* cont : set -fPIC for httplib

* cont : export all symbols

* cont : fix build_info exports

* libs : add libllama-common-base

* log : add common_log_get_verbosity_thold()
b8829
2026-04-17 11:11:46 +03:00
Eric Zhang
fcc7508759 model : Gemma4 model type detection (#22027)
* model : Gemma4 model type detection

* model : Gemma4 model type detection
b8828
2026-04-17 10:07:11 +02:00
lhez
5e6c0e18b6 opencl: refactor q8_0 set_tensor and mul_mat host side dispatch for Adreno (#21938)
* opencl: refactor q8_0 gemm/gemv Adreno dispatch

* opencl: refactor q8_0 set_tensor

* opencl: fix whitespace
b8827
2026-04-16 22:28:33 -07:00
Sigbjørn Skjæret
30dce2cf29 cli : use get_media_marker (#22017) b8826 2026-04-17 00:12:31 +02:00
Xuan-Son Nguyen
089dd41fe3 cmake: use glob to collect src/models sources (#22005) b8825 2026-04-16 23:25:16 +02:00
nullname
85dde8dc4a hexagon: optimize HMX matmul operations (#21071)
* optimize hmx_mat_mul functions by calculating row and column tiles upfront

* refactor core_dot_chunk_fp16 to use size_t for tile counts and improve readability

* wip

* set scale outside of loop

* wip

* refactor core_mma_chunk_fp16 and mat_mul_qk_0_d16a32 to use size_t for tile counts

* wip

* wip

* refactor transfer_output_chunk_fp16_to_fp32 to use size_t for dimensions

* refactor core_dot_chunk_fp16 to use size_t for tile row stride calculation

* wip

* refactor hmx_mat_mul functions to use hvx_vec_splat_f16 for column scales initialization

* refactor hmx_mat_mul_permuted_w16a32_batched to streamline scale setting and locking

* refactor core_dot_chunk_fp16 to improve tile stride calculations for output

* refactor hmx_mat_mul functions to use Q6_V_vsplat_R for column scales initialization

* fix compiling error

* wip

* optimize row and column tile indexing in core_mma_chunk_fp16 function

* wip

* Revert "wip"

This reverts commit cde679eff7.

* Add size limit check for HAP_mmap in htp_iface_mmap and drop_mmap functions

* wip
b8824
2026-04-16 13:48:34 -07:00
Xuan-Son Nguyen
4fbdabdc61 model: using single llm_build per arch (#21970)
* model: using single llm_build per arch

* fix merge

* nits
b8823
2026-04-16 21:10:22 +02:00
shaofeiqi
e45dbdece8 opencl: add q5_K gemm and gemv kernels for Adreno (#21595) b8822 2026-04-16 12:08:33 -07:00
Pascal
4adac43f6f server: tests: fetch random media marker via /apply-template (#21962) (#21980)
* server: tests: fetch random media marker via /apply-template (#21962 fix)

* server: allow pinning media marker via LLAMA_MEDIA_MARKER env var

get_media_marker() checks LLAMA_MEDIA_MARKER at first call and uses it
as-is if set, falling back to the random marker otherwise.

Tests no longer need to fetch the marker dynamically via /apply-template:
the fixture sets LLAMA_MEDIA_MARKER=<__media__> so the hardcoded prompts
work as before.

Address review feedback from ngxson

* server: make get_media_marker() thread-safe via magic statics

Use a C++11 static local with a lambda initializer instead of a global
static with an empty-check. The runtime guarantees initialization exactly
once without explicit locking.

Address review feedback from ggerganov

* nits

* nits
b8821
2026-04-16 20:46:21 +03:00
PikaPikachu
9db77a020c model : refactor QKV into common build_qkv and create_tensor_qkv helpers (#21245)
* model : refactor QKV into common build_qkv and create_tensor_qkv helpers

* model : extend build_qkv to bert/mpt/dbrx/olmo/lfm2/nemotron-h/granite-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s
2026-04-16 17:41:34 +02:00
Sigbjørn Skjæret
f772f6e434 model : support NVFP4 tensors for Gemma4 (#21971)
* support nvfp4 tensors for Gemma4

* add wo_s to build_attn

* add wo_s to build_attn

* fix glm4
2026-04-16 16:51:47 +02:00
Ruben Ortlam
b572d1ecd6 codeowners: add team member comments (#21714) 2026-04-16 13:13:11 +03:00
Anav Prasad
03b3d07798 Convert: Fix NemotronH Config Parsing (#21664)
* fix NemotronH vocab loading by using trust_remote_code for unsupported config patterns

* fix NemotronH tokenizer loading by overriding set_vocab with trust_remote_code
2026-04-16 13:11:45 +03:00
Aman Gupta
3f7c29d318 ggml: add graph_reused (#21764)
* ggml: add graph_reused

* use versioning instead of reuse flag

* increment version with atomic

* use top bits for split numbering

* add assert

* move counter to ggml.c

* set uid in split_graph only

* fix windows

* address further review comments

* get next_uid rather than doing bit manipulation

* rename + add comment about uid
b8816
2026-04-16 17:21:28 +08:00
Kusha Gharahi
ae2d34899e metal: Implement ROLL op (#21946)
* nix: support unified apple-sdk

* Impl roll op for Metal

* Revert "nix: support unified apple-sdk"

This reverts commit abfa473360.

* update ops.md

* update op docs
b8815
2026-04-16 11:54:37 +03:00