Commit Graph

9009 Commits

Author SHA1 Message Date
Ruben Ortlam
da1f16886f load directly from downloaded state 2026-05-01 15:39:00 +02:00
Ruben Ortlam
884901f04d handle models that need to be downloaded before estimation 2026-05-01 15:39:00 +02:00
Georgi Gerganov
01dd39342d cont : clean-up 2026-05-01 15:39:00 +02:00
Ruben Ortlam
972813c253 also strip models memory margin from child processes 2026-05-01 15:39:00 +02:00
Ruben Ortlam
b440ee05b8 improve variable naming, fix style 2026-05-01 15:39:00 +02:00
Ruben Ortlam
f24011f2cf improve memory_per_device map naming 2026-05-01 15:38:59 +02:00
Ruben Ortlam
f4a384b46c fix model count exceeded check 2026-05-01 15:38:59 +02:00
Ruben Ortlam
f750bae2d3 move llama_context_device_memory function to llama-ext.h 2026-05-01 15:38:59 +02:00
Ruben Ortlam
527c91ac87 add server memory debug logging 2026-05-01 15:38:59 +02:00
Ruben Ortlam
3c815b369e use memory margin instead of total size limit, apply to each device separately 2026-05-01 15:38:59 +02:00
Ruben Ortlam
18163c4143 only set model memory_mb if not previously calculated 2026-05-01 15:38:59 +02:00
Ruben Ortlam
af28cd24dc use no_alloc to get memory requirements for model load 2026-05-01 15:38:59 +02:00
Ruben Ortlam
e6468c1715 estimate with to-be-loaded model size included 2026-05-01 15:38:59 +02:00
Ruben Ortlam
0a019ed812 server: add --models-memory-max parameter to allow dynamically unloading models when they exceed a memory size threshold 2026-05-01 15:38:59 +02:00
Jeff Bolz
05e141a6b3 vulkan: Support asymmetric FA in coopmat2 path (#21753)
* vulkan: Support asymmetric FA in coopmat2 path

There has been some recent interest/experimentation with mixed quantization
types for FA. I had originally designed the cm2 FA shader with this in mind
(because I didn't realize it wasn't supported at the time!), this change
adds the missing pieces and enables it.

Also support Q1_0 since people have been trying that out (seems crazy, but
who knows).

We should be able to do similar things in the coopmat1/scalar path, but
there's another change open against the scalar path and I don't want to
conflict.

* reorder cases
b8995
2026-05-01 15:28:32 +02:00
Chen Yuan
aab68217b7 ggml-webgpu: add the upscale shader (#22419)
* shader(upscale): add the upscale shader with nearest, bilinear and bicubic implementations

* shader(upscale): use macro
b8994
2026-04-30 22:22:18 -07:00
Masashi Yoshimura
a95a11e5b8 ggml-webgpu: Improve performance of mat-vec and mat-mat for MUL_MAT_ID (#22464)
* Add mat-vec fast path of MUL_MAT_ID.

* Add shared accumulation vec logic and support for the other types.

* Add i-quant mat-mat for MUL_MAT_ID and fix some parts

* Remove n_experts from shader_lib_context.
2026-04-30 14:19:10 -07:00
Reese Levine
5cbfb18075 Update llama-mmap to use ftello/fseeko (#22497)
* Update llama-mmap to work with 32-bit wasm and >2GB models

* Update to gguf.cpp style
b8992
2026-04-30 14:17:52 -07:00
Adrien Gallouët
beb42fffa4 common : check for null getpwuid in hf-cache (#22550)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b8991
2026-04-30 21:32:41 +02:00
Ruben Ortlam
660b1b4bdc vulkan: add get/set tensor 2d functions (#22514)
* vulkan: add get/set_tensor_2d functions

* fix backend interface comments

* Update ggml/src/ggml-metal/ggml-metal.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b8990
2026-04-30 17:37:13 +02:00
Ben Guidarelli
c20c44514a spec: fix argument typo (#22552) b8989 2026-04-30 17:32:32 +03:00
Sigbjørn Skjæret
6118c043b1 ci : bump ty to 0.0.33 (#22535)
* bump ty to 0.0.33

* update typings
2026-04-30 16:15:54 +03:00
Adrien Gallouët
5f0ab726f7 vendor : update cpp-httplib to 0.43.2 (#22548)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b8987
2026-04-30 15:04:39 +02:00
Johannes Gäßler
e82aaf2587 CUDA: fix tile FA kernel on Pascal (#22541) b8986 2026-04-30 13:04:50 +02:00
Georgi Gerganov
27aef3dd91 scripts : add wc2wt.sh - create worktree from current HEAD (#22513)
* scripts : add wc2wt.sh - create worktree from current HEAD

Add a script to create a git worktree on a new branch from the current
HEAD. Similar to pr2wt.sh but for local development branches instead of
PRs.

Usage:
  ./scripts/wc2wt.sh gg/new-feature
  ./scripts/wc2wt.sh gg/new-feature "bash -l"

Assisted-by: llama.cpp:local pi

* cont : no need to try to delete the branch
2026-04-30 09:20:26 +03:00
Rithik Sharma
45155597aa add fast matmul iquants (#22504) b8984 2026-04-29 22:58:32 -07:00
Georgi Gerganov
80afa33aad spec : fix draft model checkpoints (#22521)
* spec : fix draft model checkpoints

* cont : clean-up

* cont : gate the ngram-mod reset warning behind verbose flag
b8983
2026-04-30 08:32:18 +03:00
Peter Sideris
b42c7fa5b8 spec : fix vocab compat checks in spec example (#22426)
* port #22358 PR to examples/speculative/speculative.cpp
* use vocab_[tgt,dft] instead of ctx_[tgt,dft] when logging on draft
  model / target model vocabulary mismatch

Co-authored-by: Petros Sideris <petros.sideris@nokia.com>
b8982
2026-04-30 08:18:25 +03:00
Aldehir Rojas
d77599234e common : do not pass prompt tokens to reasoning budget sampler (#22488) b8981 2026-04-29 14:10:58 -05:00
Max Krasnyansky
41a63be28e hexagon: make vmem and buffer-size configurable (#22487)
* hexagon: allow host to set max vmem size

We use a sane default but it's helpful to allow for an override if needed.

* hexagon: add support for measuring vmem space and move pinned mmaping management to host

* hexagon: update vmem checks to use uint64

* hexagon: bump op buffers to 16 (matches max mmaps)

* hexagon: bump default vmem to 3.2GB

* hexagon: add support for autodetecting vmem space and some logging cleanup in that area

* hexagon: fix whitespace warnings

* Update scripts/snapdragon/adb/run-cli.sh

Co-authored-by: Pascal <admin@serveurperso.com>

* hex-adb: fix run-completion script

---------

Co-authored-by: Pascal <admin@serveurperso.com>
b8980
2026-04-29 11:51:21 -07:00
Anav Prasad
098705a29e CUDA: fuse SSM_CONV + ADD(bias) + SILU (#22478) b8979 2026-04-30 02:39:56 +08:00
Georgi Gerganov
683c5acb90 spec : discard last drafted token with low prob (#22506) b8978 2026-04-29 17:00:00 +03:00
Georgi Gerganov
b1d5f5b449 sync : ggml b8977 2026-04-29 16:43:47 +03:00
Georgi Gerganov
4b221b7f1e ggml : bump version to 0.10.1 (ggml/1469) 2026-04-29 16:43:47 +03:00
Pascal
59237bfbbc webui: fix slow mic stop and WAV encode (#22480)
* webui: instant mic stop, race-free recorder restart

* webui: faster WAV PCM encode via hoisted channels and Int16Array

* chore: update webui build output

* webui: drop setTimeout(0) hack and harden cancelRecording

* chore: update webui build output
2026-04-29 12:58:35 +02:00
shalinib-ibm
1cbc846eba ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault (#22293)
* ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault

vec_xst operations in the tiled path crash on AIX when writing
near 4KB page boundaries due to strict memory protection. Fall
back to mnpack implementation on AIX for stable execution.

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>

* Update ggml/src/ggml-cpu/llamafile/sgemm.cpp

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* Update sgemm.cpp

* Update sgemm.cpp

---------

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
b8974
2026-04-29 13:32:40 +03:00
Aman Gupta
3142f1dbb9 ggml-cuda: refactor fusion code (#22468)
* ggml-cuda: refactor fusion code

* apply formatting + make env variable truthy
b8973
2026-04-29 16:19:33 +08:00
qiurui144
b5c4227dc6 ggml-cpu: cmake: append xsmtvdotii march for SpacemiT IME (#22317)
* ggml-cpu: cmake: append xsmtvdotii march for SpacemiT IME

When GGML_CPU_RISCV64_SPACEMIT=ON is set, ime1_kernels.cpp contains
inline asm for the vmadot family, which requires the xsmtvdotii custom
extension (the issue is described in several blogs and was confirmed on the K3 platform).
The current CMakeLists does not include xsmtvdotii, so any toolchain
that honours the explicit -march (tested with SpacemiT GCC 15.2) fails
at the assembler stage:

  Error: unrecognized opcode `vmadot v16,v14,v0',
         extension `xsmtvdotii' required

Append _xsmtvdotii to MARCH_STR when GGML_CPU_RISCV64_SPACEMIT is
enabled so the IME path can actually build with a capable toolchain.
No effect on builds that leave GGML_CPU_RISCV64_SPACEMIT off.

toolchain from https://www.spacemit.com/community/resources-download/Tools

* Update ggml/src/ggml-cpu/CMakeLists.txt

Co-authored-by: alex-spacemit <jinghui.huang@spacemit.com>

---------

Co-authored-by: alex-spacemit <jinghui.huang@spacemit.com>
b8972
2026-04-29 10:59:21 +03:00
Reese Levine
d6a5094004 ggml-webgpu: Fix bug in FlashAttention support check (#22492)
* Fix flashattention support check for devices that don't support subgroups

* set path to none if kv_tile doesn't fit
b8971
2026-04-29 10:59:00 +03:00
Masato Nakasaka
7b95ea5d11 common: Intentionally leak logger instance to fix hanging on Windows (#22273)
* Changed to leak logger singleton to prevent hanging on Windows

* Fix comment

* Stopped using static vector

Using a static std::vector causes g_col to be released before the logger thread exits; the logger thread then touches freed memory and crashes

* Change so all logs are output before exit

* Added debug logging

* added more logging

* Added logging

* Explicitly free logger to avoid hanging on Win

* Reverted to leak logger instance again

* Removed debug log and fixed comment

* Fixed comment

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b8970
2026-04-29 10:58:43 +03:00
hrushitfujitsu
bdc9c743a5 ggml : add sve tuned code for gemm_q8_0_4x8_q8_0() kernel (#21916)
* Added sve tuned code for gemm_q8_0_4x8_q8_0() kernel

* Change arrays to static const in repack.cpp

---------

Co-authored-by: Vithulep <prashant.vithule@fujitsu.com>
b8969
2026-04-29 10:57:37 +03:00
Johannes Gäßler
739393beeb TP: fix delayed AllReduce + zero-sized slices (#22489) b8968 2026-04-29 08:55:07 +02:00
Michael Wand
fc2b0053ff ggml-cuda: Repost of 21896: Blackwell native NVFP4 support (#22196) b8967 2026-04-29 06:47:42 +08:00
lnigam
7b8443ac78 ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (… (#22286)
* ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (GQA=32)

Adds MMA-f16 and tile kernel configs, dispatch logic, template instances,
and tile .cu file for Mistral Small 4 (head sizes 320/256), restricting to
ncols2=32 to support GQA ratio 32 only.

* Adding check to return BEST_FATTN_KERNEL_NONE in case GQA!=32

* Apply suggestions from code review

Address review comments

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments and making kernel config default to DQK=512, DV=512 instead of DQK=256,DV=256

* Fixed a bug with sinks=1: with ncols=32, two warp groups are created but the sinks index was the same (0,...,15) for both groups, so with sinks=1 the output did not match the CPU output. Added sink_base as the base index for each warp group (threadIdx.y / np)

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/template-instances/generate_cu_files.py

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b8966
2026-04-28 21:37:35 +02:00
Daniel Bevenius
5d56effdee convert : add support for Nemotron Nano 3 Omni (#22481)
This commit adds support for NVIDIA Nemotron Nano 3 Omni model enabling
this model to be converted to GGUF.
2026-04-28 19:17:57 +02:00
Jillis ter Hove
52e5f0a5c1 common : re-arm reasoning budget after DONE on new <think> (#22323)
DONE state absorbs all tokens including a new start tag, causing any think blocks after the first to run unbudgeted. Observed on unsloth/Qwen3.6-27B-GGUF which interleaves multiple <think> blocks per response.

Fixed by advancing start_matcher in the DONE branch and re-arming to COUNTING with a fresh budget on a match. Adds a regression test (test-reasoning-budget: test 6).
b8964
2026-04-28 19:15:36 +02:00
Matt Corallo
f9f33654a6 vulkan: Coalesce Q4_K/Q5_K scale loads (#21751)
Some SPIR-V compilers (notably mesa) don't handle the current
vulkan Q4_K/Q5_K scale load pattern in mul_mat particularly well.
While reading three `u8`s from the 12-byte scale array should (at
least on some hardware) result in loading the full 12 bytes in a
single LOAD followed by whatever extraction is needed, at least
the ANV Intel driver really can't practically perform this
optimization.

`mesa`'s unsigned upper bound logic doesn't handle tracking bounds
through ternary, resulting in the `(is < 4) ? ... : is - 4` having
an infinite upper bound (as it cannot prove `is - 4` doesn't
underflow). While this could still be rectified if mesa looked at
the array bounds, it currently doesn't and `glslc` currently emits
SPIR-V that doesn't allow for this optimization anyway (though
maybe it will at some point, see
https://github.com/KhronosGroup/glslang/issues/4206).

In mul_mat_vecq we took a different approach to loading the same
fields. We read the first two bytes we needed from `scale` then
took a branch before deciding whether we needed to read a third
byte. In mesa this did, indeed, lead to a top-level branch with
conditional loads. As such these loads ended up not being
coalesced either (at least in the ANV driver) resulting in
additional instructions in our hot loop.

Instead, here, we go ahead and force loading the full 12 bytes and
extract the bits we need from the packed-u32s instead. In mul_mat
there are a few fewer ternaries and only one extra shift, so even on
drivers that did optimize the previous loads properly the only
material change should be pulling a few extra bytes into registers
(which on most hardware won't cost anything anyway, though
ironically on Intel it theoretically could). In mul_mat_vecq this
requires a bit of extra math and may read bytes from the u32 that
weren't needed, but it seems likely avoiding the branch is a win
on most platforms.

On Intel Xe2/mesa 26.0.4 with the optimizations from
https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15162,

for shader matmul_id_subgroup_q4_k_f32_f16acc_aligned_l:
 * Instruction Count: 2753 -> 2688
 * SEND Count: 269 -> 261
 * Cycle Count: 273976 -> 266138
 * Max live registers: 248 -> 246
 * Non SSA regs after NIR: 381 -> 382

for shader matmul_id_subgroup_q5_k_f32_f16acc_aligned_l:
 * Instruction Count: 2767 -> 2702
 * SEND Count: 271 -> 263
 * Cycle Count: 274140 -> 268144
 * Max live registers: 248 -> 246
 * Non SSA regs after NIR: 381 -> 382

for shader mul_mat_vec_id_q4_k_q8_1_f32:
 * Instruction Count: 1930 -> 1646
 * SEND Count: 116 -> 71
 * Cycle Count: 1348306 -> 843350
 * Max live registers: 78 -> 84
 * Non SSA regs after NIR: 300 -> 135

for shader mul_mat_vec_id_q5_k_q8_1_f32:
 * Instruction Count: 2207 -> 1922
 * SEND Count: 131 -> 86
 * Cycle Count: 1392012 -> 1037836
 * Max live registers: 90 -> 90
 * Non SSA regs after NIR: 300 -> 135

for shader mul_mat_vec_q4_k_q8_1_f32:
 * Instruction Count: 2029 -> 1749
 * SEND Count: 111 -> 66
 * Cycle Count: 1347278 -> 840118
 * Max live registers: 74 -> 80
 * Non SSA regs after NIR: 299 -> 134

for shader mul_mat_vec_q5_k_q8_1_f32:
 * Instruction Count: 2307 -> 2022
 * SEND Count: 126 -> 81
 * Cycle Count: 1379820 -> 954042
 * Max live registers: 86 -> 86
 * Non SSA regs after NIR: 299 -> 134

On one Arc Pro B60, unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL:
 * pp512: 907.34 ± 9.28 -> 941.94 ± 10.53 (+4%)
 * pp2048: 897.95 ± 1.82 -> 931.55 ± 1.79 (+4%)
 * tg128: 49.49 ± 0.02 -> 49.86 ± 0.05 (+ <1%)

On one Arc Pro B60, unsloth/Qwen3.5-27B-GGUF:Q4_K_S:
 * pp512: 324.13 ± 10.52 -> 354.33 ± 6.81 (+9%)
 * pp2048: 329.80 ± 0.25 -> 357.10 ± 0.06 (+8%)
 * tg128: 17.11 ± 0.01 -> 18.11 ± 0.01 (+6%)

On four Arc Pro B60s, unsloth/Qwen3.5-122B-A10B-GGUF:Q5_K_S with
-sm layer (note that -sm tensor improvements will naturally be
less):
 * pp512: 264.55 ± 2.81 -> 280.45 ± 3.94 (+6%)
 * pp2048: 319.32 ± 2.72 -> 335.70 ± 3.48 (+5%)
 * tg128: 26.39 ± 0.01 -> 26.67 ± 0.01 (+1%)
b8963
2026-04-28 17:31:04 +02:00
Reese Levine
98bb57916a ggml-webgpu: fix buffer aliasing for ssm_scan and refactor aliasing logic (#22456)
* Refactor buffer aliasing to be part of shader lib decisions

* cleanup

* formatting
b8962
2026-04-28 07:27:17 -07:00
Aleksander Grygier
f42e29fdf1 webui: Server tools (#21237)
* wip: server_tools

* feat: Integrate with `/tools` endpoint

* feat: Builtin + MCP + JSON Schema Tools WIP

* refactor

* displayName -> display_name

* snake_case everywhere

* rm redundant field

* feat: Improvements

* chore: update webui build output

* refactor: Updates after server updates

* chore: update webui build output

* change arg to --tools all

* feat: UI improvements

* chore: update webui build output

* add readme mention

* llama-gen-docs

* chore: update webui build output

* chore: update webui build output

* chore: update webui build output

* feat: Reorganize settings sections

* feat: Separate dialogs for MCP Servers Settings and Import/Export

* feat: WIP

* feat: WIP

* feat: WIP

* feat: WIP

* feat: WIP

* feat: WIP

* WIP on allozaur/20677-webui-server-tools

* feat: UI improvements

* chore: Update package lock

* chore: Run `npm audit fix`

* feat: UI WIP

* feat: UI

* refactor: Desktop Icon Strip DRY

* feat: Cleaner rendering and transition for ChatScreen

* feat: UI improvements

* feat: UI improvement

* feat: Remove MCP Server "enable" switch from Tools submenu

* chore: Run `npm audit fix`

* feat: WIP

* feat: Logic improvements

* refactor: Cleanup

* refactor: DRY

* test: Fix Chat Sidebar UI Tests

* chore: Update package lock

* refactor: Cleanup

* feat: Chat Message Action Card with Continue and Permission flow implementations

* feat: Add agentic steering messages, draft messages and improve chat UX

* fix: Search results UI

* test: Fix unit test

* feat: UI/UX improvements

* refactor: Simplify `useToolsPanel` access in components

* feat: Implement Processing Info Context API

* feat: Implement 'Go back to chat' functionality for settings

* feat: Enhance MCP Server management in Chat Form Attachments

* style: Minor UI and branding adjustments

* chore: Update webui static build output

* chore: Formatting, linting & type checks

* feat: Draft messages logic

* feat: UI improvements

* feat: Steering Messages improvements

* refactor: Cleanup

* refactor: Cleanup

* feat: Improve UI

* refactor: Settings navigation hook

* refactor: DRY code

* refactor: DRY ChatMessageUser UI components

* refactor: Desktop Icon Strip DRY

* refactor: Tools & permissions

* fix: Navigation condition

* refactor: Cleanup

* refactor: Cleanup

* refactor: Cleanup

* fix: preserve reasoning_content in agentic flow

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-04-28 14:35:49 +03:00
Jeff Bolz
19821178be vulkan: add barrier after writetimestamp (#21865) b8960 2026-04-28 12:28:12 +02:00