Commit Graph

9086 Commits

Author SHA1 Message Date
Georgi Gerganov
1dbc054da5 server : fix slot ctx_drft ptr 2026-05-08 11:55:05 +03:00
Georgi Gerganov
161eae0adf spec : fix n_past type 2026-05-08 11:54:32 +03:00
Georgi Gerganov
e5b1401318 speculative-simple : update 2026-05-08 11:09:34 +03:00
Georgi Gerganov
3b1a8df8fd server : clean-up + dry 2026-05-08 10:20:01 +03:00
Georgi Gerganov
233d1aee69 server : add comment
[no ci]
2026-05-08 08:50:23 +03:00
Georgi Gerganov
12c7cfbe83 server : fix URL for draft model 2026-05-08 08:03:49 +03:00
Georgi Gerganov
6a4b05a030 server : fix mtmd draft processing 2026-05-08 08:02:11 +03:00
Georgi Gerganov
8be14e40de spec : handle draft running out of context 2026-05-08 07:11:51 +03:00
Georgi Gerganov
7e118cdce0 cont : process images throught the draft context 2026-05-07 21:44:09 +03:00
Georgi Gerganov
ae6703fa89 cont : pass correct n_past for drafting 2026-05-07 21:44:08 +03:00
Georgi Gerganov
0239f4c611 cont : handle non-ckpt models 2026-05-07 21:44:08 +03:00
Georgi Gerganov
c7facb0fe1 cont : async drft eval when possible 2026-05-07 21:44:08 +03:00
Georgi Gerganov
08c8012bde cont : sync main and drft contexts 2026-05-07 21:44:08 +03:00
Georgi Gerganov
de35b1255c server, spec : transition to unified spec context 2026-05-07 21:44:08 +03:00
Georgi Gerganov
1afee5b262 server : improve ctx names
[no ci]
2026-05-07 21:44:08 +03:00
Georgi Gerganov
11fd5e7272 server : draft prompt cache and checkpoints
[no ci]
2026-05-07 21:44:08 +03:00
Georgi Gerganov
c97dc3605e server : sketch the ctx_dft decode loop
[no ci]
2026-05-07 21:44:08 +03:00
Georgi Gerganov
8a50f6f0b9 cont : dedup ctx_seq_rm_type
[no ci]
2026-05-07 21:44:07 +03:00
Georgi Gerganov
77269ad8a7 cont : pass seq_id
[no ci]
2026-05-07 21:44:07 +03:00
Georgi Gerganov
4550f0f08b spec : update common_speculative_init()
[no ci]
2026-05-07 21:44:07 +03:00
Georgi Gerganov
befc7ef635 spec : drop support for incompatible vocabs
[no ci]
2026-05-07 21:44:07 +03:00
Georgi Gerganov
2c9a40849f spec : refactor
[no ci]
2026-05-07 21:44:07 +03:00
Georgi Gerganov
e43431b381 llama : fix device state save/load (#22805) b9064 2026-05-07 21:43:40 +03:00
shaofeiqi
ceb7e14b96 opencl: add opfilter regex for debugging (#22782) b9063 2026-05-07 11:00:20 -07:00
Aldehir Rojas
093be624cc common/chat : preserve media markers for typed-content templates (#22634) b9062 2026-05-07 12:50:56 -05:00
HaoJun ZHANG
deab41ec68 tests: add long-sequence cases and fix inputs for gated_delta_net (#22794)
* tests : add long-seq + tail cases for gated_delta_net

* tests : realistic input ranges for gated_delta_net
b9061
2026-05-08 00:23:36 +08:00
Intel AI Get-to Market Customer Success and Solutions
ad09224658 sycl: add FILL, CUMSUM, DIAG, SOLVE_TRI, SSM_SCAN, GATED_DELTA_NET (#22149)
* sycl: add FILL, CUMSUM, DIAG, SOLVE_TRI, SSM_SCAN, GATED_DELTA_NET

Signed-off-by: Chun Tao <chun.tao@intel.com>

* Fix abort during test-backend-ops

Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>

* Regenerate ops.md

Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>

* Add scope_dbg_print to newly added SYCL ops.

Also add scope_dbg_print to existing ssm_conv op.

Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>

---------

Signed-off-by: Chun Tao <chun.tao@intel.com>
Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Todd Malsbary <todd.malsbary@intel.com>
b9060
2026-05-07 18:51:33 +03:00
Gaurav Garg
b9afc19cb4 Write a readme on Multi-GPU usage in llama.cpp (#22729)
* Write a readme on Multi-GPU usage in llama.cpp

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-07 17:48:40 +02:00
Georgi Gerganov
803627f121 llama : remove unnecessary seq_id check during state restore (#22797) b9058 2026-05-07 16:37:26 +03:00
pl752
68380ae11b ggml-cpu: Optimized risc-v cpu q1_0 dot b9057 2026-05-07 21:09:25 +08:00
Pascal
cc97e45a14 mtmd: fix whisper audio tail truncation by exposing padded buffer to FFT (#22770) b9056 2026-05-07 14:01:01 +02:00
AesSedai
8e52631d55 model: Add Mimo v2.5 model support (#22493)
* add mimo-v2.5 support

* mimo-v2.5: fix modify_tensors row split

* mimi-v2.5: forgot `add_attn_value_scale` plumbing

* mimi-v2.5: fix tp dequant to detect tp rows

* mimo-v2.5: fix TP iteration to be descending

* mimo-v2.5: fix comment

* mimo-v2.5: retain fused qkv

* mimo-v2.5: missed the attn_value scale during merge

* mimo-v2.5: fused QKV needs contiguous for scaling attention value

* mimo-v2.5: move `speech_embeddings.` to TextModel filter_tensors

* Update src/llama-hparams.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/mimo2.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/mimo2.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/mimo2.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* mimo-v2.5: include MTP weights in gguf

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b9055
2026-05-07 13:21:58 +02:00
Pascal
f4b5a2ee91 webui: fix ?model= URL param race in router mode (#22771)
* webui: fix ?model= URL param race in router mode

* chore: update webui build output
2026-05-07 13:09:32 +02:00
Vishal Singh
97f06e9eed codeowners : add ZenDNN backend codeowner (#22772)
* codeowners : add ZenDNN backend codeowner

* codeowners : fix zendnn owners to use individual github handles
2026-05-07 14:46:51 +08:00
viggy
e358d75adb webui: fix flicker issue on dismiss animation on overlay primitives (#22773)
* add fill-mode-forwards

* generated diffs
2026-05-07 08:11:31 +02:00
Shane Tran Whitmire
cfff1fc300 sycl : fix test script (#22737)
The error:
./examples/sycl/test.sh: line 122: level_zero:${$GGML_SYCL_DEVICE}: bad
substitution

was thrown whenever the user used this command:
./examples/sycl/test.sh -mg 0

Fix is to get rid of a dollar sign.
2026-05-07 08:25:57 +03:00
Adrien Gallouët
3980e04d5a llama : add missing call to ggml_backend_load_all() (#22752)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b9050
2026-05-07 08:24:47 +03:00
tc-mb
2496f9c149 mtmd : support MiniCPM-V 4.6 (#22529)
* Support MiniCPM-V 4.6 in new branch

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* fix code bug

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* fix pre-commit

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* fix convert

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* rename clip_graph_minicpmv4_6

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* use new TYPE_MINICPMV4_6

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* use build_attn to allow flash attention support

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* no use legacy code, restored here.

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* use the existing tensors name

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* unused ctx->model.hparams.minicpmv_version

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* use n_merge for slice alignment

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* borrow wa_layer_indexes for vit_merger insertion point

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* fix code style

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* use filter_tensors and add model.vision_tower

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* fix chkhsh

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* fix type check

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

---------

Signed-off-by: tc-mb <tianchi_cai@icloud.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b9049
2026-05-06 21:54:09 +02:00
Gilad S.
5207d120ea model : don't crash on unsupported architecture (#22742)
* model: don't crash on unsupported architecture

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b9048
2026-05-06 18:51:21 +02:00
fl0rianr
a0101225bc common: do not fit to unknown device memory (#22614)
* common: do not fit to unknown device memory

Signed-off-by: Florian Reinle <f.reinle@otec.de>

* common: preserve host fallback for non-GPU fit devices

Signed-off-by: Florian Reinle <f.reinle@otec.de>

* common: keep unknown GPU fit memory at zero

Signed-off-by: Florian Reinle <f.reinle@otec.de>

---------

Signed-off-by: Florian Reinle <f.reinle@otec.de>
b9047
2026-05-06 17:03:45 +02:00
Georgi Gerganov
a290ce6266 gguf-py : bump version to 0.19.0 (#22664)
* gguf-py : bump version to 0.19.0

* bump poetry

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
gguf-v0.19.0
2026-05-06 14:46:14 +02:00
Yakine Tahtah
a00e47e422 mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech) (#22101)
* mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech)

Conformer encoder with Shaw relative position encoding,
QFormer projector, log-mel spectrogram with frame stacking.

Encoder uses GLU gating, folded batch norm, and SSM depthwise
conv. QFormer compresses encoder output via windowed
cross-attention (window=15, queries=3) into the LLM embedding
space.

Audio preprocessing: reflect-padded STFT, 80-bin mel filterbank,
dynamic range compression, 2x frame stacking (80->160 mel).

GGUF converter handles batch norm folding at export time,
fused K/V split, and Conv1d weight reshaping.

Tested against HF transformers reference: token-for-token match
on 30s/60s audio clips with greedy decoding.

* mtmd: rename gs_ prefixed tensors to generic/architecture names

* mtmd: use tensor_mapping.py for all granite_speech tensors

* convert: fold GraniteSpeechTextModel into GraniteModel

* mtmd: replace n_layer hack with explicit has_standard_layers flag

* mtmd: replace hardcoded magic numbers with GGUF hparams for granite speech

* mtmd: align KEY_A_ define spacing

* convert: register GraniteModel for GraniteSpeechForConditionalGeneration

* convert: fix ty type-check for GraniteSpeechMmprojModel registration

* mtmd: align TN_ define spacing

* mtmd: use generic layer loop for granite speech tensor loading

* mtmd: merge qformer_proj_layer into clip_layer

* mtmd: granite_speech remove redundant ggml_build_forward_expand on inputs

* mtmd: granite_speech add comment explaining why build_attn is not used

* mtmd: granite_speech hard-code eps in cpp, remove from GGUF metadata

* gguf: add spacing between granite_speech tensor mapping blocks

* mtmd: make generic audio layer_norm_eps read optional

* mtmd: granite_speech keep encoder eps in GGUF, only hard-code projector eps

* mtmd: align defines and struct fields in clip-impl.h and clip-model.h

* mtmd: fix alignment and ordering issues across granite speech files

* convert: granite_speech use filter_tensors instead of modify_tensors for skipping
b9045
2026-05-06 14:40:59 +02:00
David Huggins-Daines
750141969c feat: migrate to PEP 621 and add uv support (#21907)
* feat: migrate to PEP 621 and add uv support

* fix: remove upper bound on protobuf

* remove poetry.lock and uv.lock

* fix/add torch dependency version and markers

* fix dev-dependency deprecation warning

* gguf-py : update python version requirement to 3.10

---------

Co-authored-by: David Huggins-Daines <dhd@dhd.ecolingui.ca>
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2026-05-06 14:04:10 +02:00
Daniel Bevenius
a736e6c0ac convert : ignore non-language tensors for Gemma4Model (#22753)
* convert : ignore non-language tensors for Gemma4Model

This commit adds a check to make sure only text language tensors are
handled in filter_tensors.

The motivation is that currently when trying to convert a Gemma4 model
the following error occurs:
```console
(venv) $ ./convert-gemma.sh
INFO:hf-to-gguf:Loading model: gemma-4-E2B-it
INFO:hf-to-gguf:Model architecture: Gemma4ForConditionalGeneration
INFO:hf-to-gguf:gguf: indexing model part 'model.safetensors'
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,                 torch.float32 --> F32, shape = {256}
Traceback (most recent call last):
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 13752, in <module>
    main()
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 13746, in main
    model_instance.write()
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 945, in write
    self.prepare_tensors()
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 805, in prepare_tensors
    for new_name, data_torch in (self.modify_tensors(data_torch, name, bid)):
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 7925, in modify_tensors
    yield from super().modify_tensors(data_torch, name, bid)
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 7290, in modify_tensors
    yield from super().modify_tensors(data_torch, name, bid)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 579, in modify_tensors
    new_name = self.map_tensor_name(name)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 572, in map_tensor_name
    raise ValueError(f"Can not map tensor {name!r}")
ValueError: Can not map tensor 'model.embed_vision.embedding_projection.weight'
```

* add forgotten embed_vision and embed_audio

* improve

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-06 13:50:44 +02:00
Aleksander Grygier
e3e3f8e46a webui: Remove Google Favicons & Improve MCP Information logic & UI (#22719)
* refactor: Remove Google favicon utility

* fix: MCP Server favicon

* refactor: Cleanup

* refactor: MCP Server Information

* fix: Fix MCP Settings UI

* refactor: Cleanup
2026-05-06 11:12:27 +02:00
zzzzwc
f08f20a0e3 ggml-cpu: fuse RMS_NORM + MUL on CPU backend (#22423) b9041 2026-05-06 15:41:14 +08:00
viggy
07eaf919ed add tabindex and aria-hidden (#22699) 2026-05-06 09:21:58 +02:00
Sigbjørn Skjæret
74d6248f71 convert : add filter_tensors method to pre-filter tensors (#22597)
* add filter_tensors classmethod

* remove language_model

* fix parts validation
2026-05-06 08:06:05 +02:00
fl0rianr
2ca1161bd7 ggml : use CL_DEVICE_GLOBAL_MEM_SIZE as memory estimate for OpenCL --fit (#22688)
* ggml : report estimated OpenCL memory for --fit

Signed-off-by: Florian Reinle <f.reinle@otec.de>

* ggml : estimated OpenCL memory backend integrated

Signed-off-by: Florian Reinle <f.reinle@otec.de>

---------

Signed-off-by: Florian Reinle <f.reinle@otec.de>
b9038
2026-05-05 22:12:48 -07:00
Trivikram Reddy
bbeb89d76c Hexagon: Process M-tail rows on HMX instead of HVX (#22724)
* hex-mm: process m-tail rows on HMX instead of HVX

* hmx-mm: unroll and optimize padded activation loop

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
b9037
2026-05-05 09:43:03 -07:00