8390 Commits

Author SHA1 Message Date
Neo Zhang
b6c83aad55 [SYCL] enhance UPSCALE to support all UT cases (#20637)
* [SYCL] enhance UPSCALE to support more cases

* rm test case result of SYCL1
b8390
2026-03-17 10:01:52 +08:00
Piotr Wilkin (ilintar)
2e4a6edd4a tools/server: support refusal content for Responses API (#20285)
* Support refusal content for Responses API

* Update tools/server/server-common.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tools/server/server-common.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b8389
2026-03-17 01:42:04 +01:00
Xuan-Son Nguyen
d34ff7eb5b model: mistral small 4 support (#20649)
* model: mistral small 4 support

* fix test

* fix test (2)

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* change newline

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b8388
2026-03-17 00:31:14 +01:00
Georgi Gerganov
45172df4d6 ci : disable AMX jobs (#20654)
[no ci]
2026-03-16 22:38:59 +02:00
Georgi Gerganov
9b342d0a9f benches : add Nemotron 3 Nano on DGX Spark (#20652)
[no ci]
2026-03-16 21:50:43 +02:00
Sigbjørn Skjæret
55e87026f7 tests : write to binary buffer to avoid newline translation in jinja -py [no ci] (#20365) 2026-03-16 20:40:22 +01:00
Martin Klacer
cf21cdf36c kleidiai: add data type check to get_tensor_traits (#20639)
* kleidiai: add data type check to get_tensor_traits

 * Added check for F16 data type into get_tensor_traits path with input data
   not in ggml_backend_cpu_kleidiai_buffer_type format (unsupported for Q4/8)

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Change-Id: I9aca4b9b8d669d35db6f1dbcc4e080b1919b1de7

* updated ggml/src/ggml-cpu/kleidiai/kleidiai.cpp

updated kleidiai.cpp file as per suggestion

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-16 21:25:54 +02:00
Sigbjørn Skjæret
0ed992973b ci : update labeler (#20629) 2026-03-16 20:24:20 +01:00
Aldehir Rojas
1bbec6a75d jinja : add capability check for object args (#20612) 2026-03-16 17:43:14 +01:00
Georgi Gerganov
f47a246a08 sync : ggml 2026-03-16 17:22:06 +02:00
Georgi Gerganov
c0ccbd1f86 ggml : try fix arm build (whisper/0) 2026-03-16 17:22:06 +02:00
David366AI
f6da02c3f2 ggml : extend im2col f16 (ggml/1434)
* examples/yolo: fix load_model memory leak

* fix/issue-1433 ggml_compute_forward_im2col_f16 assert error

* fix/issue-1433
2026-03-16 17:22:06 +02:00
Pascal
dddca026bf webui: add model information dialog to router mode (#20600)
* webui: add model information dialog to router mode

* webui: add "Available models" section header in model list

* webui: remove nested scrollbar from chat template in model info dialog

* chore: update webui build output

* feat: UI improvements

* refactor: Cleaner rendering + UI docs

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-03-16 15:38:11 +01:00
Aman Gupta
3c8521c4f5 llama-graph: replace cont with reshape for alpha in qwen35 (#20640) b8377 2026-03-16 22:07:13 +08:00
Aleksander Grygier
67a2209fab webui: Add MCP CORS Proxy detection logic & UI (#20167)
* refactor: MCP store cleanup

* feat: Add MCP proxy availability detection

* fix: Sidebar icon

* chore: update webui build output

* chore: Formatting

* chore: update webui build output

* chore: Update package lock

* chore: update webui build output

* chore: update webui build output

* chore: update webui build output
2026-03-16 13:05:36 +01:00
Pascal
d65c4f2dc9 Fix model selector locked to first loaded model with multiple models (#20580)
* webui: fix model selector being locked to first loaded model

When multiple models are loaded, the auto-select effect would re-fire
on every loadedModelIds change, overriding the user's manual model
selection. Guard with selectedModelId so auto-select only kicks in
when no model is chosen yet.

* chore: update webui build output
2026-03-16 12:04:06 +01:00
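The guard described in d65c4f2dc9 can be illustrated with a minimal sketch. It is written in Python rather than the webui's own code, and the function name `auto_select_model` is hypothetical; only `selectedModelId` and `loadedModelIds` come from the commit message. It assumes that "auto-select" means falling back to the first loaded model.

```python
from typing import Optional, Sequence


def auto_select_model(
    selected_model_id: Optional[str],
    loaded_model_ids: Sequence[str],
) -> Optional[str]:
    """Return the model id to keep selected after the loaded list changes.

    Hypothetical stand-in for the webui effect; not the actual implementation.
    """
    # Guard: the user (or an earlier auto-select) already chose a model,
    # so a change in the loaded list must not override it.
    if selected_model_id is not None:
        return selected_model_id
    # Nothing chosen yet: fall back to the first loaded model, if any.
    return loaded_model_ids[0] if loaded_model_ids else None
```

Without the first check, every change to the loaded list would reset the selection to the first entry, which matches the "locked to first loaded model" symptom in the commit title.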
Woof Dog
d8c331c0af webui: use date in more human readable exported filename (#19939)
* webui: use date in exported filename

Move conversation naming and export to utils

update index.html.gz

* webui: move literals to message export constants file

* webui: move export naming and download back to the conversation store

* chore: update webui build output

* webui: add comments to some constants

* chore: update webui build output
2026-03-16 11:18:13 +01:00
Ruben Ortlam
46dba9fce8 vulkan: fix flash attention dot product precision (#20589) b8373 2026-03-16 10:45:49 +01:00
Sigbjørn Skjæret
de8f01c2d7 model : wire up Nemotron-H tensors for NVFP4 support (#20561)
* wire up Nemotron-H tensors for NVFP4 support

* add ssm tensors

* alignment
b8372
2026-03-16 09:19:16 +01:00
Richard Davison
079e5a45f0 convert : support mixed-precision ModelOpt models with per-tensor NVFP4/FP8 quantization (#20539)
* support mixed-precision ModelOpt models with per-tensor NVFP4/FP8 quantization

* cleanup

* fallback

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-16 09:18:47 +01:00
Masato Nakasaka
d3936498a3 common : fix iterator::end() dereference (#20445) b8370 2026-03-16 08:50:38 +02:00
Aman Gupta
34818ea6c0 CUDA: GDN hide memory latency (#20537) b8369 2026-03-16 11:41:45 +08:00
Piotr Wilkin (ilintar)
9e2e2198b0 tools/cli: fix disable reasoning (#20606) b8368 2026-03-15 22:40:53 +01:00
Georgi Gerganov
88915cb55c server : fix wait in test_cancel_requests() test (#20601)
* server : fix wait in test_cancel_requests() test

* codeowners : add team for server tests
2026-03-15 20:54:37 +02:00
Sigbjørn Skjæret
ebbf544ed1 sycl : fix for untransposed GDA recurrent state (#20583) b8366 2026-03-15 19:10:15 +01:00
Sigbjørn Skjæret
b91d7dfe5b ci : only save openvino caches on github-hosted master (#20593)
* only save openvino ccache on master

* disable toolkit cache if self-hosted

* only cache on github-hosted runners

* remove toolkit cache [no ci]
2026-03-15 18:58:13 +01:00
Johannes Gäßler
ae40cd27c8 CUDA: limit number of FA stream-k CUDA blocks (#20586) b8364 2026-03-15 18:30:47 +01:00
Pascal
ceef6b5233 ggml: avoid creating CUDA context during device init (#20595) b8363 2026-03-16 00:42:56 +08:00
Adrien Gallouët
07c6a59b4f vendor : update cpp-httplib to 0.38.0 (#20578)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b8362
2026-03-15 17:30:06 +01:00
MoonShadow
8b7d340b6f ggml/hip: fix APU compatibility - soft error handling for hipMemAdviseSetCoarseGrain (#20536)
* ggml/hip: fix APU compatibility - soft error handling for hipMemAdviseSetCoarseGrain

On AMD APU/iGPU devices (unified memory architecture), hipMemAdviseSetCoarseGrain
returns hipErrorInvalidValue because the hint is not applicable to UMA systems.
The previous CUDA_CHECK() call treated this as a fatal error, causing crashes on
APU systems such as AMD Strix Halo (gfx1151).

Fix: treat hipMemAdviseSetCoarseGrain as an optional performance hint - call it
without error checking and clear any resulting error with hipGetLastError().

Also add pre-allocation debug logging (GGML_LOG_DEBUG) to help diagnose memory
issues on APU systems, and store totalGlobalMem in device info.

Context: AMD APUs on Windows are affected by a ROCm runtime bug that limits
hipMallocManaged to ~64GB regardless of available system RAM. A fix has been
submitted upstream: https://github.com/ROCm/rocm-systems/pull/4077

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ggml/hip: remove unrelated changes, keep only hipMemAdviseSetCoarseGrain fix

---------

Co-authored-by: moonshadow-25 <moonshadow-25@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
b8361
2026-03-15 17:23:58 +01:00
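The "optional hint" pattern from 8b7d340b6f can be sketched without a GPU. The snippet below is a Python simulation, not HIP code: `FakeRuntime`, `advise_strict`, `advise_soft`, and the numeric status values are all hypothetical stand-ins. It assumes only what the commit message states, namely that the hint fails on UMA devices, that the failure was previously fatal, and that the fix ignores the return value and clears the recorded error.

```python
# Illustrative status codes; the values are assumptions for this sketch.
HIP_SUCCESS = 0
HIP_ERROR_INVALID_VALUE = 1


class FakeRuntime:
    """Hypothetical stand-in for a GPU runtime with a 'last error' slot."""

    def __init__(self, is_uma: bool) -> None:
        self.is_uma = is_uma
        self.last_error = HIP_SUCCESS

    def mem_advise_coarse_grain(self) -> int:
        # On unified-memory (APU/iGPU) devices the hint is rejected.
        if self.is_uma:
            self.last_error = HIP_ERROR_INVALID_VALUE
            return HIP_ERROR_INVALID_VALUE
        return HIP_SUCCESS

    def get_last_error(self) -> int:
        # Return the recorded error and reset the slot.
        err, self.last_error = self.last_error, HIP_SUCCESS
        return err


def advise_strict(rt: FakeRuntime) -> None:
    """Old behaviour: any non-success status is treated as fatal."""
    status = rt.mem_advise_coarse_grain()
    if status != HIP_SUCCESS:
        raise RuntimeError(f"fatal runtime error {status}")


def advise_soft(rt: FakeRuntime) -> None:
    """New behaviour: the hint is best-effort, so failure is ignored."""
    rt.mem_advise_coarse_grain()  # return value deliberately not checked
    rt.get_last_error()           # clear the error so later checks stay clean
```

Clearing the error matters as much as ignoring it: a stale error left in the slot could otherwise be picked up by the next checked call and reported against the wrong operation.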
Eric Hsieh
559646472d fix: prevent nullptr dereference (#20552) b8360 2026-03-15 16:51:49 +01:00
Sigbjørn Skjæret
cf45437d35 codeowners : use teams (#20526)
* use teams

* update

* update

* update

* update

* update
2026-03-15 14:26:10 +01:00
Georgi Gerganov
9cd4ebcfb1 ci : split build.yml + server.yml (#20546)
* ci : split build.yml

* cont : split server.yml

* cont : reduce paths

* cont : split build-android.yml + update paths

* ci : make msys workflows manual (#20588)

* ci : make cross-build workflows manual (#20585)

* cont : fix release paths

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b8358
2026-03-15 15:11:17 +02:00
Sigbjørn Skjæret
89d0aec042 convert : support contiguous method on lora tensors (#20489) 2026-03-15 12:15:12 +01:00
Bartowski
b9da4444df ggml : guard against sumq2 being 0 in IQ4_NL (#20460) b8356 2026-03-15 10:47:28 +02:00
PikaPikachu
617db241aa cuda : add RDNA4-specific MMVQ parameter table for bs=1 decode (#19478)
* mmvq: add RDNA3/RDNA4-specific parameter table (nwarps=8, rows=1)

* mmvq: add dedicated RDNA3 parameter table

* mmvq: exclude RDNA3.5 (gfx1150/1151) from RDNA3 table
b8355
2026-03-15 08:33:39 +01:00
Ruben Ortlam
1a3d8edbba vulkan: use graphics queue on AMD (#20551)
* vulkan: use graphics queue on AMD for slightly better performance

* disable async transfer queue on AMD
b8354
2026-03-15 08:18:54 +01:00
sprayandwipe
6b10a82c00 kv-cache : fix reading llama_kv_cell_ext during state read (#20273)
Co-authored-by: sid <sid@ragingfist.net>
b8353
2026-03-15 09:11:19 +02:00
Michael Wand
d23355afc3 model : wire up Qwen3.5/Qwen3.5MoE tensors for NVFP4 support (#20506) b8352 2026-03-14 22:44:42 +01:00
Georgi Gerganov
b30a5fdf37 metal : add FA specialization for HSK = 320, HSV = 256 (#20549) b8351 2026-03-14 23:15:47 +02:00
Georgi Gerganov
b4768955c4 ci : move self-hosted workflows to separate files (#20540) b8350 2026-03-14 23:15:35 +02:00
Gerard Guillemas Martos
fc350fdf96 docker : force Python 3.13 in Vulkan container (#20530)
* ci: force Python 3.13 in Vulkan container

* remove unnecessary `update-alternatives` line
2026-03-14 21:37:09 +01:00
Eve
3a6f059909 ci : try to optimize some jobs (#20521)
* force arm version to test

* run on either x86 or arm if we can help it; this only works for runs without ccache

* readd other jobs

* remove ccache
b8348
2026-03-14 20:27:52 +01:00
Max Krasnyansky
609ea50026 hexagon: Q4_0 and MXFP4 repack fixes (#20527)
* hexagon: fix tail corruption with rows sizes not multiple of 256

* hexagon: use different stride for repacking partial blocks

* hex-mm: update repack and kernels to avoid shuffles for full 256-element blocks

Previous commit changed the repacking to use even:odd (0:1,2:3,..) packing
instead of the original (0:128,1:129,...) packing in order to fix tail corruption.
Since the mm kernels already deal with partial tails we can use even:odd
packing only for the last block.
This avoids the performance penalty of having to shuffle to zip the elements
in the common case.

* hex-mm: update rmpy x8 for better optimizations

* hex-mm: tighten supported MUL_MAT checks to avoid spurious failures

* hex-mm: use vzero to init accumulators

* hex-mm: properly call partial rmpy_x8
b8347
2026-03-14 11:09:08 -07:00
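The two pairing schemes named in 609ea50026 can be compared with a small index-only sketch. It reads "(0:128,1:129,...)" and "even:odd (0:1,2:3,..)" as pairs of element indices packed together, which is an interpretation of the commit message. The function names are hypothetical, and the sketch assumes the tail length is even. It does not model the actual Hexagon kernels or data layout.

```python
BLOCK = 256  # elements per full block, per the commit message


def pairs_split_half(block: int = BLOCK) -> list[tuple[int, int]]:
    """Original scheme: element i is paired with element i + block/2."""
    half = block // 2
    return [(i, i + half) for i in range(half)]


def pairs_even_odd(n: int) -> list[tuple[int, int]]:
    """Even:odd scheme: element 2k is paired with element 2k + 1."""
    return [(i, i + 1) for i in range(0, n, 2)]


def row_pairing(row_len: int, block: int = BLOCK) -> list[tuple[int, int]]:
    """Split-half pairing for full blocks, even:odd only for the last partial block."""
    full, tail = divmod(row_len, block)
    if tail % 2:
        raise ValueError("this sketch assumes an even tail length")
    out: list[tuple[int, int]] = []
    for b in range(full):
        base = b * block
        out += [(base + lo, base + hi) for lo, hi in pairs_split_half(block)]
    base = full * block
    out += [(base + lo, base + hi) for lo, hi in pairs_even_odd(tail)]
    return out
```

For a 320-element row, the first 256 elements keep the split-half pairing and the 64-element tail is paired as neighbours, so no pair refers to an index past the end of the row. Applying split-half pairing to that tail would pair element 256 with element 384, which does not exist.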
Georgi Gerganov
9f774e45ee ci : reduce webgpu tests timeout to 900s (#20538)
[no ci]
2026-03-14 17:08:26 +02:00
Xuan-Son Nguyen
94d0262277 mtmd: add llama-mtmd-debug binary (#20508)
* mtmd: add llama-mtmd-debug binary

* adapt

* fixes

* fix compile error

* fix windows compile error

* rm legacy clip_debug_encode()

* add MTMD_API to fix build
2026-03-14 15:52:29 +01:00
Neo Zhang
a93c0ef0fa add op gated_delta_net (#20455) 2026-03-14 22:01:57 +08:00
Chedrian07
710878a7dd webui: restore code preview iframe origin isolation (#20477) 2026-03-14 11:28:28 +01:00
Adrien Gallouët
0685848bc6 scripts : remove get-wikitext-103.sh (#20543)
It doesn't work and no one seems to use it.

    $ wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
    HTTP request sent, awaiting response... 301 Moved Permanently
    Location: unspecified
    ERROR: Redirection (301) without location.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-14 11:22:04 +01:00
Adrien Gallouët
0024a69b70 scripts : update get-hellaswag.sh and get-winogrande.sh (#20542)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-14 11:21:50 +01:00