9075 Commits

Author SHA1 Message Date
Pascal
58e68df0f9 cuda: fuse snake activation (mul, sin, sqr, mul, add) (#22667)
* cuda: fuse snake activation (mul, sin, sqr, mul, add)

Add ggml_cuda_op_snake_fused with F32 / F16 / BF16 templates. The
matcher recognizes the naive 5 op decomposition emitted by audio
decoders (BigVGAN, Vocos) for snake activation
y = x + sin(a*x)^2 * inv_b and rewrites it to a single elementwise
kernel.

Add test_snake_fuse comparing CPU naive vs CUDA fused across
F32 / F16 / BF16.
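
For illustration, the equivalence that such a test checks can be sketched in plain Python. This is not the CUDA kernel: the function names are hypothetical, and `a` and `inv_b` are simplified to scalars (an assumption made here for brevity). The sketch only mirrors the arithmetic `y = x + sin(a*x)^2 * inv_b`, once as the five-op decomposition and once as a single pass.

```python
import math

def snake_naive(x, a, inv_b):
    """Five separate elementwise ops: mul, sin, sqr, mul, add."""
    ax = [a * v for v in x]                    # mul
    s = [math.sin(v) for v in ax]              # sin
    s2 = [v * v for v in s]                    # sqr
    scaled = [v * inv_b for v in s2]           # mul
    return [xv + sv for xv, sv in zip(x, scaled)]  # add

def snake_fused(x, a, inv_b):
    """Same arithmetic, computed in one pass per element."""
    out = []
    for v in x:
        s = math.sin(a * v)
        out.append(v + s * s * inv_b)
    return out

# Hypothetical sample values; scalar a and inv_b are a simplification.
xs = [-1.5, 0.0, 0.25, 2.0]
naive = snake_naive(xs, a=0.7, inv_b=1.25)
fused = snake_fused(xs, a=0.7, inv_b=1.25)
assert all(math.isclose(p, q, rel_tol=1e-12, abs_tol=1e-12)
           for p, q in zip(naive, fused))
```

The fused form avoids materializing four intermediate arrays, which is the usual motivation for this kind of rewrite.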

* cuda: address review feedback from @am17an

Use ggml_cuda_cast for F32/F16/BF16 conversions and rename
kernel_snake to snake_kernel to match upstream conventions.

* cuda: snake fusion fastdiv on T_len, Suggested-by: @am17an

* Update tests/test-backend-ops.cpp

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* cuda: snake fusion check add->type matches x->type

Address review feedback from @am17an

* cuda: snake fusion check add->type matches x->type

Moved for readability (equivalent)
Address review feedback from @am17an

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
b9075
2026-05-08 17:44:09 +08:00
Aleksander Grygier
9b2925e1e0 webui: Add Import/Export of Settings configuration + improve architecture (#22803)
* refactor: Settings keys as constant object keys

* chore: Run `npm audit fix`

* refactor: Settings Sections UI

* feat: Refactor Settings structure and implement import/export logic

* feat: Introduce ROUTES constant and RouterService

* refactor: Consolidate settings definitions into registry

* refactor: Update settings page routing structure

* chore: Migrate hardcoded URLs to use ROUTES and RouterService

* feat: Enhance model selection logic for settings and chat

* chore: Update webui static build

* refactor: Address PR review comments

* fix: Remove unneeded setting

* fix: Re-add missing settings

* fix: Add missing `/slots` proxy for webui dev mode

* chore: Dev-mode logs

* fix: Data binding

* fix: Steering for non-agentic flow
2026-05-08 11:26:04 +02:00
Johannes Gäßler
a8fd165fec CUDA: lower-case PCI bus id, standardize for ggml (#22820) b9073 2026-05-08 10:09:38 +02:00
miyan
6d57a49a70 vulkan: fix spv shadowing (#22760) b9072 2026-05-08 09:35:22 +02:00
Max Krasnyansky
3e941b813b ggml: update SCHED_DEBUG output to use ggml_op_desc() (#22825) b9071 2026-05-07 22:43:04 -07:00
Shawn Gu
f3e8d149ce opencl: add q4_0 MoE GEMM for Adreno (#22731)
* Q4_0 MoE CLC pass sanity check

* release program

* opencl: fix whitespace

* opencl: remove unused cl_program

* opencl: break #if block to make it more clear

* opencl: adjust format

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
b9070
2026-05-07 21:17:07 -07:00
Michał Piszczek
1d72d87349 convert : fix RuntimeError when stripping FP8 KV-cache scales (#22818)
* convert : fix RuntimeError when stripping FP8 KV-cache scales

In ModelBase._generate_nvfp4_tensors the final cleanup loop iterates
self.model_tensors.keys() and calls del on the same dict, which raises
RuntimeError: dictionary changed size during iteration when a ModelOpt
NVFP4 model also has FP8 KV-cache scales (e.g. mmangkad/Qwen3.6-35B-A3B-NVFP4
and any modelopt config with kv_cache_quant_algo: FP8).

Wrap the keys view in list() so the deletions happen on a snapshot.
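
The failure mode and the fix can be reproduced in isolation. The sketch below uses hypothetical tensor names and helper functions, not the converter's actual code; only the iteration pattern is the point.

```python
# Hypothetical tensor names; only the iteration pattern matters here.
model_tensors = {
    "layers.0.attn.k_scale": 1.0,
    "layers.0.attn.v_scale": 1.0,
    "layers.0.attn.q_proj.weight": 0.5,
}

def strip_scales_broken(tensors):
    # Deleting while iterating the live keys view raises RuntimeError.
    for name in tensors.keys():
        if name.endswith(("k_scale", "v_scale")):
            del tensors[name]

def strip_scales_fixed(tensors):
    # list() takes a snapshot of the keys, so deletions are safe.
    for name in list(tensors.keys()):
        if name.endswith(("k_scale", "v_scale")):
            del tensors[name]

try:
    strip_scales_broken(dict(model_tensors))
    raised = False
except RuntimeError:
    raised = True

cleaned = dict(model_tensors)
strip_scales_fixed(cleaned)
```

Here `raised` ends up true for the broken variant, while `cleaned` keeps only the non-scale tensor.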

* re-add another accidentally removed list

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-08 06:55:48 +03:00
Neo Zhang
6a2a2513dc sycl : fix script error (#22795) 2026-05-08 06:54:57 +03:00
samuraieng
44dbe8c521 model: Support sarashina2.2-vision-3b model (#22103) 2026-05-07 23:10:29 +02:00
leonardHONG
05ff59cb57 CUDA: batch out_prod inner loop with cublasSgemmStridedBatched (#22651)
* CUDA: batch out_prod inner loop with cublasSgemmStridedBatched

* CUDA: batch out_prod inner loop with cublasSgemmStridedBatched

* CUDA: add cublasSgemmStridedBatched mapping for HIP and MUSA backends
b9066
2026-05-07 21:59:29 +02:00
smugman-dot
aaf4a4d5e0 webui: add option for LLM title generation (#22265)
* webui: add LLM title generation option

* webui: use chat_template_kwargs for title gen + fix conversation check

* webui: capture firstUserMessage before async streamChatCompletion to fix race condition

* webui: extract LLM title generation into separate method

* webui: use constants and ChatService for LLM generated titles

* webui: rebuild static output

* webui: add LLM title generation setting to new settings location

* webui: use sendMessage in generateTitle

* webui: rebuild static output

* webui: fix formatting

* webui: configurable title prompt, remove think tag regexes, fix TS error

* webui: group title constants into TITLE object, use TruncatedText for CSS truncation and fix race condition

* webui: rebuild static output
2026-05-07 21:14:03 +02:00
Georgi Gerganov
e43431b381 llama : fix device state save/load (#22805) b9064 2026-05-07 21:43:40 +03:00
shaofeiqi
ceb7e14b96 opencl: add opfilter regex for debugging (#22782) b9063 2026-05-07 11:00:20 -07:00
Aldehir Rojas
093be624cc common/chat : preserve media markers for typed-content templates (#22634) b9062 2026-05-07 12:50:56 -05:00
HaoJun ZHANG
deab41ec68 tests: add long-sequence cases and fix inputs for gated_delta_net (#22794)
* tests : add long-seq + tail cases for gated_delta_net

* tests : realistic input ranges for gated_delta_net
b9061
2026-05-08 00:23:36 +08:00
Intel AI Get-to Market Customer Success and Solutions
ad09224658 sycl: add FILL, CUMSUM, DIAG, SOLVE_TRI, SSM_SCAN, GATED_DELTA_NET (#22149)
* sycl: add FILL, CUMSUM, DIAG, SOLVE_TRI, SSM_SCAN, GATED_DELTA_NET

Signed-off-by: Chun Tao <chun.tao@intel.com>

* Fix abort during test-backend-ops

Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>

* Regenerate ops.md

Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>

* Add scope_dbg_print to newly added SYCL ops.

Also add scope_dbg_print to existing ssm_conv op.

Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>

---------

Signed-off-by: Chun Tao <chun.tao@intel.com>
Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Todd Malsbary <todd.malsbary@intel.com>
b9060
2026-05-07 18:51:33 +03:00
Gaurav Garg
b9afc19cb4 Write a readme on Multi-GPU usage in llama.cpp (#22729)
* Write a readme on Multi-GPU usage in llama.cpp

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-07 17:48:40 +02:00
Georgi Gerganov
803627f121 llama : remove unnecessary seq_id check during state restore (#22797) b9058 2026-05-07 16:37:26 +03:00
pl752
68380ae11b ggml-cpu: Optimized risc-v cpu q1_0 dot b9057 2026-05-07 21:09:25 +08:00
Pascal
cc97e45a14 mtmd: fix whisper audio tail truncation by exposing padded buffer to FFT (#22770) b9056 2026-05-07 14:01:01 +02:00
AesSedai
8e52631d55 model: Add Mimo v2.5 model support (#22493)
* add mimo-v2.5 support

* mimo-v2.5: fix modify_tensors row split

* mimo-v2.5: forgot `add_attn_value_scale` plumbing

* mimo-v2.5: fix tp dequant to detect tp rows

* mimo-v2.5: fix TP iteration to be descending

* mimo-v2.5: fix comment

* mimo-v2.5: retain fused qkv

* mimo-v2.5: missed the attn_value scale during merge

* mimo-v2.5: fused QKV needs contiguous for scaling attention value

* mimo-v2.5: move `speech_embeddings.` to TextModel filter_tensors

* Update src/llama-hparams.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/mimo2.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/mimo2.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/mimo2.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* mimo-v2.5: include MTP weights in gguf

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b9055
2026-05-07 13:21:58 +02:00
Pascal
f4b5a2ee91 webui: fix ?model= URL param race in router mode (#22771)
* webui: fix ?model= URL param race in router mode

* chore: update webui build output
2026-05-07 13:09:32 +02:00
Vishal Singh
97f06e9eed codeowners : add ZenDNN backend codeowner (#22772)
* codeowners : add ZenDNN backend codeowner

* codeowners : fix zendnn owners to use individual github handles
2026-05-07 14:46:51 +08:00
viggy
e358d75adb webui: fix flicker issue on dismiss animation on overlay primitives (#22773)
* add fill-mode-forwards

* generated diffs
2026-05-07 08:11:31 +02:00
Shane Tran Whitmire
cfff1fc300 sycl : fix test script (#22737)
The error:
./examples/sycl/test.sh: line 122: level_zero:${$GGML_SYCL_DEVICE}: bad
substitution

was thrown whenever the user used this command:
./examples/sycl/test.sh -mg 0

Fix is to get rid of a dollar sign.
2026-05-07 08:25:57 +03:00
Adrien Gallouët
3980e04d5a llama : add missing call to ggml_backend_load_all() (#22752)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b9050
2026-05-07 08:24:47 +03:00
tc-mb
2496f9c149 mtmd : support MiniCPM-V 4.6 (#22529)
* Support MiniCPM-V 4.6 in new branch

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* fix code bug

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* fix pre-commit

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* fix convert

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* rename clip_graph_minicpmv4_6

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* use new TYPE_MINICPMV4_6

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* use build_attn to allow flash attention support

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* no use legacy code, restored here.

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* use the existing tensors name

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* unused ctx->model.hparams.minicpmv_version

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* use n_merge for slice alignment

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* borrow wa_layer_indexes for vit_merger insertion point

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* fix code style

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* use filter_tensors and add model.vision_tower

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* fix chkhsh

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* fix type check

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

---------

Signed-off-by: tc-mb <tianchi_cai@icloud.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b9049
2026-05-06 21:54:09 +02:00
Gilad S.
5207d120ea model : don't crash on unsupported architecture (#22742)
* model: don't crash on unsupported architecture

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b9048
2026-05-06 18:51:21 +02:00
fl0rianr
a0101225bc common: do not fit to unknown device memory (#22614)
* common: do not fit to unknown device memory

Signed-off-by: Florian Reinle <f.reinle@otec.de>

* common: preserve host fallback for non-GPU fit devices

Signed-off-by: Florian Reinle <f.reinle@otec.de>

* common: keep unknown GPU fit memory at zero

Signed-off-by: Florian Reinle <f.reinle@otec.de>

---------

Signed-off-by: Florian Reinle <f.reinle@otec.de>
b9047
2026-05-06 17:03:45 +02:00
Georgi Gerganov
a290ce6266 gguf-py : bump version to 0.19.0 (#22664)
* gguf-py : bump version to 0.19.0

* bump poetry

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
gguf-v0.19.0
2026-05-06 14:46:14 +02:00
Yakine Tahtah
a00e47e422 mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech) (#22101)
* mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech)

Conformer encoder with Shaw relative position encoding,
QFormer projector, log-mel spectrogram with frame stacking.

Encoder uses GLU gating, folded batch norm, and SSM depthwise
conv. QFormer compresses encoder output via windowed
cross-attention (window=15, queries=3) into the LLM embedding
space.

Audio preprocessing: reflect-padded STFT, 80-bin mel filterbank,
dynamic range compression, 2x frame stacking (80->160 mel).

GGUF converter handles batch norm folding at export time,
fused K/V split, and Conv1d weight reshaping.

Tested against HF transformers reference: token-for-token match
on 30s/60s audio clips with greedy decoding.
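
As background on the batch norm folding mentioned above, the standard identity can be sketched on a single channel. This is a generic illustration, not the converter's code: the function names, the scalar (per-channel) simplification, and the `eps` default are assumptions.

```python
import math

def fold_batch_norm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold gamma * (w*x + b - mean) / sqrt(var + eps) + beta into (w', b')."""
    scale = gamma / math.sqrt(var + eps)
    return weight * scale, (bias - mean) * scale + beta

def linear_then_bn(x, weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Reference path: affine op followed by inference-time batch norm."""
    y = weight * x + bias
    return gamma * (y - mean) / math.sqrt(var + eps) + beta

# Hypothetical per-channel parameters.
w, b = 0.8, -0.1
gamma, beta, mean, var = 1.5, 0.2, 0.05, 0.9
w_f, b_f = fold_batch_norm(w, b, gamma, beta, mean, var)

for x in (-2.0, 0.0, 3.5):
    expected = linear_then_bn(x, w, b, gamma, beta, mean, var)
    assert math.isclose(w_f * x + b_f, expected, rel_tol=1e-9, abs_tol=1e-12)
```

Folding at export time means the runtime graph needs no separate normalization op for that layer.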

* mtmd: rename gs_ prefixed tensors to generic/architecture names

* mtmd: use tensor_mapping.py for all granite_speech tensors

* convert: fold GraniteSpeechTextModel into GraniteModel

* mtmd: replace n_layer hack with explicit has_standard_layers flag

* mtmd: replace hardcoded magic numbers with GGUF hparams for granite speech

* mtmd: align KEY_A_ define spacing

* convert: register GraniteModel for GraniteSpeechForConditionalGeneration

* convert: fix ty type-check for GraniteSpeechMmprojModel registration

* mtmd: align TN_ define spacing

* mtmd: use generic layer loop for granite speech tensor loading

* mtmd: merge qformer_proj_layer into clip_layer

* mtmd: granite_speech remove redundant ggml_build_forward_expand on inputs

* mtmd: granite_speech add comment explaining why build_attn is not used

* mtmd: granite_speech hard-code eps in cpp, remove from GGUF metadata

* gguf: add spacing between granite_speech tensor mapping blocks

* mtmd: make generic audio layer_norm_eps read optional

* mtmd: granite_speech keep encoder eps in GGUF, only hard-code projector eps

* mtmd: align defines and struct fields in clip-impl.h and clip-model.h

* mtmd: fix alignment and ordering issues across granite speech files

* convert: granite_speech use filter_tensors instead of modify_tensors for skipping
b9045
2026-05-06 14:40:59 +02:00
David Huggins-Daines
750141969c feat: migrate to PEP 621 and add uv support (#21907)
* feat: migrate to PEP 621 and add uv support

* fix: remove upper bound on protobuf

* remove poetry.lock and uv.lock

* fix/add torch dependency version and markers

* fix dev-dependency deprecation warning

* gguf-py : update python version requirement to 3.10

---------

Co-authored-by: David Huggins-Daines <dhd@dhd.ecolingui.ca>
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2026-05-06 14:04:10 +02:00
Daniel Bevenius
a736e6c0ac convert : ignore non-language tensors for Gemma4Model (#22753)
* convert : ignore non-language tensors for Gemma4Model

This commit adds a check to make sure only text language tensors are
handled in filter_tensors.

The motivation is that currently when trying to convert a Gemma4 model
the following error occurs:
```console
(venv) $ ./convert-gemma.sh
INFO:hf-to-gguf:Loading model: gemma-4-E2B-it
INFO:hf-to-gguf:Model architecture: Gemma4ForConditionalGeneration
INFO:hf-to-gguf:gguf: indexing model part 'model.safetensors'
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,                 torch.float32 --> F32, shape = {256}
Traceback (most recent call last):
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 13752, in <module>
    main()
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 13746, in main
    model_instance.write()
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 945, in write
    self.prepare_tensors()
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 805, in prepare_tensors
    for new_name, data_torch in (self.modify_tensors(data_torch, name, bid)):
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 7925, in modify_tensors
    yield from super().modify_tensors(data_torch, name, bid)
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 7290, in modify_tensors
    yield from super().modify_tensors(data_torch, name, bid)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 579, in modify_tensors
    new_name = self.map_tensor_name(name)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 572, in map_tensor_name
    raise ValueError(f"Can not map tensor {name!r}")
ValueError: Can not map tensor 'model.embed_vision.embedding_projection.weight'
```
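
A name-based pre-filter of the kind described can be sketched as follows. This is an assumption-laden illustration, not the code in `convert_hf_to_gguf.py`: the helper names are hypothetical, `embed_vision` and `embed_audio` come from this commit's messages, and the tower names are assumed.

```python
# Name components treated as non-language. embed_vision and embed_audio are
# taken from the commit messages; the tower names are assumed for illustration.
NON_LANGUAGE_MARKERS = ("embed_vision", "embed_audio", "vision_tower", "audio_tower")

def is_language_tensor(name: str) -> bool:
    """True if no dotted component of the name is a non-language marker."""
    parts = name.split(".")
    return not any(marker in parts for marker in NON_LANGUAGE_MARKERS)

def filter_language_tensors(names):
    """Keep only tensors that should reach the text-model name mapping."""
    return [n for n in names if is_language_tensor(n)]

names = [
    "model.embed_vision.embedding_projection.weight",          # from the traceback
    "model.language_model.layers.0.self_attn.q_proj.weight",   # hypothetical
]
kept = filter_language_tensors(names)
```

With this filter in front of the mapping step, the tensor from the traceback never reaches `map_tensor_name`.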

* add forgotten embed_vision and embed_audio

* improve

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-06 13:50:44 +02:00
Aleksander Grygier
e3e3f8e46a webui: Remove Google Favicons & Improve MCP Information logic & UI (#22719)
* refactor: Remove Google favicon utility

* fix: MCP Server favicon

* refactor: Cleanup

* refactor: MCP Server Information

* fix: Fix MCP Settings UI

* refactor: Cleanup
2026-05-06 11:12:27 +02:00
zzzzwc
f08f20a0e3 ggml-cpu: fuse RMS_NORM + MUL on CPU backend (#22423) b9041 2026-05-06 15:41:14 +08:00
viggy
07eaf919ed add tabindex and aria-hidden (#22699) 2026-05-06 09:21:58 +02:00
Sigbjørn Skjæret
74d6248f71 convert : add filter_tensors method to pre-filter tensors (#22597)
* add filter_tensors classmethod

* remove language_model

* fix parts validation
2026-05-06 08:06:05 +02:00
fl0rianr
2ca1161bd7 ggml : use CL_DEVICE_GLOBAL_MEM_SIZE as memory estimate for OpenCL --fit (#22688)
* ggml : report estimated OpenCL memory for --fit

Signed-off-by: Florian Reinle <f.reinle@otec.de>

* ggml : estimated OpenCL memory backend integrated

Signed-off-by: Florian Reinle <f.reinle@otec.de>

---------

Signed-off-by: Florian Reinle <f.reinle@otec.de>
b9038
2026-05-05 22:12:48 -07:00
Trivikram Reddy
bbeb89d76c Hexagon: Process M-tail rows on HMX instead of HVX (#22724)
* hex-mm: process m-tail rows on HMX instead of HVX

* hmx-mm: unroll and optimize padded activation loop

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
b9037
2026-05-05 09:43:03 -07:00
lhez
ff806a110d opencl: refactor Adreno q4_0 (#22335)
* opencl: refactor adreno q4_0 gemm/gemv dispatch

* opencl: refactor q4_0 gemm/gemv loading, use consistent names

* opencl: use consistent name for adreno q8_0 gemm/gemv

* opencl: use consistent names for adreno q4_0 gemm/gemv

* opencl: simplify adreno q4_0 set_tensor

* opencl: refactor q4_0 get_tensor
2026-05-05 09:38:57 -07:00
Radoslav Gerganov
d5003b6e4d rpc : use graph uid instead of graph cache (#22701)
Store the last graph uid and compare against it to determine if the same
graph is being computed.
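
The comparison described above can be sketched as below. The class and what happens on a mismatch are hypothetical; only "store the last uid and compare" is taken from the commit message.

```python
class GraphRunner:
    """Hypothetical stand-in: redo setup work only when the graph uid changes."""

    def __init__(self):
        self.last_uid = None   # uid of the most recently computed graph
        self.rebuilds = 0      # how many times setup had to be redone

    def compute(self, graph_uid):
        if graph_uid != self.last_uid:
            # Different graph: assumed to require fresh setup.
            self.rebuilds += 1
            self.last_uid = graph_uid
        return self.rebuilds

runner = GraphRunner()
counts = [runner.compute(uid) for uid in (7, 7, 7, 8, 8, 7)]
```

Keeping a single uid instead of a cache trades reuse of older graphs for much simpler bookkeeping.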
2026-05-05 13:47:13 +03:00
Adrien Gallouët
2635ac76e8 common : fix missing-noreturn warnings when compiling with clang 21 (#22702)
common/arg.cpp:3719:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
     3719 |         [](common_params & /*params*/, int /*value*/) {
          |         ^
    common/arg.cpp:3726:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
     3726 |         [](common_params & /*params*/, int /*value*/) {
          |         ^
    common/arg.cpp:3733:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
     3733 |         [](common_params & /*params*/, int /*value*/) {
          |         ^
    common/arg.cpp:3740:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
     3740 |         [](common_params & /*params*/, int /*value*/) {
          |         ^
    common/arg.cpp:3747:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
     3747 |         [](common_params & /*params*/, int /*value*/) {
          |         ^

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-05 13:16:25 +03:00
Georgi Gerganov
70a8309114 sync : ggml b9033 2026-05-05 13:15:59 +03:00
Georgi Gerganov
c91faf997f ggml : bump version to 0.11.0 (ggml/1478) 2026-05-05 13:15:59 +03:00
Adrien Gallouët
bf76ac77be common : only load backends when required (#22290)
* common : only load backends when required

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* llama : call ggml_backend_load_all() directly from llama_backend_init()

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add ggml_backend_load_all() where llama_backend_init() is not used

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b9031
2026-05-05 09:23:50 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO)
a09a00e502 vendor : update cpp-httplib to 0.43.3 (#22686) b9030 2026-05-05 09:04:57 +02:00
Georgi Gerganov
2bacb1eb77 server : validate --tools CLI argument against known tool names (#22538)
Previously, unknown tool names passed via --tools were silently ignored.
Now the server validates each tool name at startup and exits with an
error if an unrecognized tool is specified, listing the available tools.
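
The behaviour change can be sketched in a few lines. The tool names and the function are hypothetical, and the real server is C++; this only illustrates failing fast on unknown names instead of ignoring them.

```python
KNOWN_TOOLS = ("read_file", "write_file", "shell")  # hypothetical tool names

def validate_tools(requested, known=KNOWN_TOOLS):
    """Return the requested tools, or exit with an error listing valid names."""
    unknown = [t for t in requested if t not in known]
    if unknown:
        raise SystemExit(
            f"error: unknown tool(s): {', '.join(unknown)}; "
            f"available tools: {', '.join(known)}"
        )
    return list(requested)

accepted = validate_tools(["shell", "read_file"])

try:
    validate_tools(["shell", "does_not_exist"])
    rejected = False
except SystemExit as exc:
    rejected = "does_not_exist" in str(exc)
```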

Assisted-by: llama.cpp:local pi
b9029
2026-05-05 06:35:27 +03:00
Georgi Gerganov
d6e7b033a4 llama : add option to save memory in device buffers (#22679)
* llama : add option to save memory in device buffers

* tests : extend llama-save-load-state
b9028
2026-05-05 06:35:07 +03:00
Sigbjørn Skjæret
fa595462ca graph : handle non-contiguous Q/K/V in mul_mat_aux (#22630)
* qkv may not always be contiguous

* cont : make the cont conditional

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-05-05 06:34:44 +03:00
Ismail
a817a22bc6 ggml : implement fast walsh-hadamard transform for kv rotation (#21352) (#22631) b9026 2026-05-05 10:05:05 +08:00
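
For readers unfamiliar with the transform named in this commit title, here is a generic textbook sketch of the unnormalized fast Walsh-Hadamard transform. It is not the ggml implementation, and how the commit applies it to KV rotation is not shown here.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform; length must be a power of two."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly pass: combine elements that are h apart within blocks of 2h.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out

x = [1, 0, 1, 0, 0, 1, 1, 0]
y = fwht(x)                    # [4, 2, 0, -2, 0, 2, 0, 2]
roundtrip = fwht(y)            # n times the input, since H * H = n * I
assert roundtrip == [len(x) * v for v in x]
```

The transform costs O(n log n) additions and subtractions; scaling the output by 1/sqrt(n) makes it orthogonal.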