Georgi Gerganov
1dbc054da5
server : fix slot ctx_drft ptr
2026-05-08 11:55:05 +03:00
Georgi Gerganov
161eae0adf
spec : fix n_past type
2026-05-08 11:54:32 +03:00
Georgi Gerganov
e5b1401318
speculative-simple : update
2026-05-08 11:09:34 +03:00
Georgi Gerganov
3b1a8df8fd
server : clean-up + dry
2026-05-08 10:20:01 +03:00
Georgi Gerganov
233d1aee69
server : add comment
[no ci]
2026-05-08 08:50:23 +03:00
Georgi Gerganov
12c7cfbe83
server : fix URL for draft model
2026-05-08 08:03:49 +03:00
Georgi Gerganov
6a4b05a030
server : fix mtmd draft processing
2026-05-08 08:02:11 +03:00
Georgi Gerganov
8be14e40de
spec : handle draft running out of context
2026-05-08 07:11:51 +03:00
Georgi Gerganov
7e118cdce0
cont : process images through the draft context
2026-05-07 21:44:09 +03:00
Georgi Gerganov
ae6703fa89
cont : pass correct n_past for drafting
2026-05-07 21:44:08 +03:00
Georgi Gerganov
0239f4c611
cont : handle non-ckpt models
2026-05-07 21:44:08 +03:00
Georgi Gerganov
c7facb0fe1
cont : async drft eval when possible
2026-05-07 21:44:08 +03:00
Georgi Gerganov
08c8012bde
cont : sync main and drft contexts
2026-05-07 21:44:08 +03:00
Georgi Gerganov
de35b1255c
server, spec : transition to unified spec context
2026-05-07 21:44:08 +03:00
Georgi Gerganov
1afee5b262
server : improve ctx names
[no ci]
2026-05-07 21:44:08 +03:00
Georgi Gerganov
11fd5e7272
server : draft prompt cache and checkpoints
[no ci]
2026-05-07 21:44:08 +03:00
Georgi Gerganov
c97dc3605e
server : sketch the ctx_dft decode loop
[no ci]
2026-05-07 21:44:08 +03:00
Georgi Gerganov
8a50f6f0b9
cont : dedup ctx_seq_rm_type
[no ci]
2026-05-07 21:44:07 +03:00
Georgi Gerganov
77269ad8a7
cont : pass seq_id
[no ci]
2026-05-07 21:44:07 +03:00
Georgi Gerganov
4550f0f08b
spec : update common_speculative_init()
[no ci]
2026-05-07 21:44:07 +03:00
Georgi Gerganov
befc7ef635
spec : drop support for incompatible vocabs
[no ci]
2026-05-07 21:44:07 +03:00
Georgi Gerganov
2c9a40849f
spec : refactor
[no ci]
2026-05-07 21:44:07 +03:00
Georgi Gerganov
e43431b381
llama : fix device state save/load (#22805)
b9064
2026-05-07 21:43:40 +03:00
shaofeiqi
ceb7e14b96
opencl: add opfilter regex for debugging (#22782)
b9063
2026-05-07 11:00:20 -07:00
Aldehir Rojas
093be624cc
common/chat : preserve media markers for typed-content templates (#22634)
b9062
2026-05-07 12:50:56 -05:00
HaoJun ZHANG
deab41ec68
tests: add long-sequence cases and fix inputs for gated_delta_net (#22794)
* tests : add long-seq + tail cases for gated_delta_net
* tests : realistic input ranges for gated_delta_net
b9061
2026-05-08 00:23:36 +08:00
Intel AI Get-to Market Customer Success and Solutions
ad09224658
sycl: add FILL, CUMSUM, DIAG, SOLVE_TRI, SSM_SCAN, GATED_DELTA_NET (#22149)
* sycl: add FILL, CUMSUM, DIAG, SOLVE_TRI, SSM_SCAN, GATED_DELTA_NET
Signed-off-by: Chun Tao <chun.tao@intel.com>
* Fix abort during test-backend-ops
Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>
* Regenerate ops.md
Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>
* Add scope_dbg_print to newly added SYCL ops.
Also add scope_dbg_print to existing ssm_conv op.
Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>
---------
Signed-off-by: Chun Tao <chun.tao@intel.com>
Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Todd Malsbary <todd.malsbary@intel.com>
b9060
2026-05-07 18:51:33 +03:00
Gaurav Garg
b9afc19cb4
Write a readme on Multi-GPU usage in llama.cpp (#22729)
* Write a readme on Multi-GPU usage in llama.cpp
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Address review comments
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-07 17:48:40 +02:00
Georgi Gerganov
803627f121
llama : remove unnecessary seq_id check during state restore (#22797)
b9058
2026-05-07 16:37:26 +03:00
pl752
68380ae11b
ggml-cpu: Optimized risc-v cpu q1_0 dot
b9057
2026-05-07 21:09:25 +08:00
Pascal
cc97e45a14
mtmd: fix whisper audio tail truncation by exposing padded buffer to FFT (#22770)
b9056
2026-05-07 14:01:01 +02:00
AesSedai
8e52631d55
model: Add Mimo v2.5 model support (#22493)
* add mimo-v2.5 support
* mimo-v2.5: fix modify_tensors row split
* mimi-v2.5: forgot `add_attn_value_scale` plumbing
* mimi-v2.5: fix tp dequant to detect tp rows
* mimo-v2.5: fix TP iteration to be descending
* mimo-v2.5: fix comment
* mimo-v2.5: retain fused qkv
* mimo-v2.5: missed the attn_value scale during merge
* mimo-v2.5: fused QKV needs contiguous for scaling attention value
* mimo-v2.5: move `speech_embeddings.` to TextModel filter_tensors
* Update src/llama-hparams.h
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/mimo2.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/mimo2.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/models/mimo2.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* mimo-v2.5: include MTP weights in gguf
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b9055
2026-05-07 13:21:58 +02:00
Pascal
f4b5a2ee91
webui: fix ?model= URL param race in router mode (#22771)
* webui: fix ?model= URL param race in router mode
* chore: update webui build output
2026-05-07 13:09:32 +02:00
Vishal Singh
97f06e9eed
codeowners : add ZenDNN backend codeowner (#22772)
* codeowners : add ZenDNN backend codeowner
* codeowners : fix zendnn owners to use individual github handles
2026-05-07 14:46:51 +08:00
viggy
e358d75adb
webui: fix flicker issue on dismiss animation on overlay primitives (#22773)
* add fill-mode-forwards
* generated diffs
2026-05-07 08:11:31 +02:00
Shane Tran Whitmire
cfff1fc300
sycl : fix test script (#22737)
The error:
./examples/sycl/test.sh: line 122: level_zero:${$GGML_SYCL_DEVICE}: bad substitution
was thrown whenever the user ran:
./examples/sycl/test.sh -mg 0
The fix is to drop the extra dollar sign: ${$GGML_SYCL_DEVICE} -> ${GGML_SYCL_DEVICE}.
2026-05-07 08:25:57 +03:00
Adrien Gallouët
3980e04d5a
llama : add missing call to ggml_backend_load_all() (#22752)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b9050
2026-05-07 08:24:47 +03:00
tc-mb
2496f9c149
mtmd : support MiniCPM-V 4.6 (#22529)
* Support MiniCPM-V 4.6 in new branch
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* fix code bug
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* fix pre-commit
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* fix convert
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* rename clip_graph_minicpmv4_6
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* use new TYPE_MINICPMV4_6
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* use build_attn to allow flash attention support
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* no use legacy code, restored here.
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* use the existing tensors name
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* unused ctx->model.hparams.minicpmv_version
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* use n_merge for slice alignment
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* borrow wa_layer_indexes for vit_merger insertion point
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* fix code style
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* use filter_tensors and add model.vision_tower
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* fix chkhsh
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* fix type check
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
---------
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b9049
2026-05-06 21:54:09 +02:00
Gilad S.
5207d120ea
model : don't crash on unsupported architecture (#22742)
* model: don't crash on unsupported architecture
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b9048
2026-05-06 18:51:21 +02:00
fl0rianr
a0101225bc
common: do not fit to unknown device memory (#22614)
* common: do not fit to unknown device memory
Signed-off-by: Florian Reinle <f.reinle@otec.de>
* common: preserve host fallback for non-GPU fit devices
Signed-off-by: Florian Reinle <f.reinle@otec.de>
* common: keep unknown GPU fit memory at zero
Signed-off-by: Florian Reinle <f.reinle@otec.de>
---------
Signed-off-by: Florian Reinle <f.reinle@otec.de>
b9047
2026-05-06 17:03:45 +02:00
Georgi Gerganov
a290ce6266
gguf-py : bump version to 0.19.0 (#22664)
* gguf-py : bump version to 0.19.0
* bump poetry
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
gguf-v0.19.0
2026-05-06 14:46:14 +02:00
Yakine Tahtah
a00e47e422
mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech) (#22101)
* mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech)
Conformer encoder with Shaw relative position encoding,
QFormer projector, log-mel spectrogram with frame stacking.
Encoder uses GLU gating, folded batch norm, and SSM depthwise
conv. QFormer compresses encoder output via windowed
cross-attention (window=15, queries=3) into the LLM embedding
space.
Audio preprocessing: reflect-padded STFT, 80-bin mel filterbank,
dynamic range compression, 2x frame stacking (80->160 mel).
GGUF converter handles batch norm folding at export time,
fused K/V split, and Conv1d weight reshaping.
Tested against HF transformers reference: token-for-token match
on 30s/60s audio clips with greedy decoding.
* mtmd: rename gs_ prefixed tensors to generic/architecture names
* mtmd: use tensor_mapping.py for all granite_speech tensors
* convert: fold GraniteSpeechTextModel into GraniteModel
* mtmd: replace n_layer hack with explicit has_standard_layers flag
* mtmd: replace hardcoded magic numbers with GGUF hparams for granite speech
* mtmd: align KEY_A_ define spacing
* convert: register GraniteModel for GraniteSpeechForConditionalGeneration
* convert: fix ty type-check for GraniteSpeechMmprojModel registration
* mtmd: align TN_ define spacing
* mtmd: use generic layer loop for granite speech tensor loading
* mtmd: merge qformer_proj_layer into clip_layer
* mtmd: granite_speech remove redundant ggml_build_forward_expand on inputs
* mtmd: granite_speech add comment explaining why build_attn is not used
* mtmd: granite_speech hard-code eps in cpp, remove from GGUF metadata
* gguf: add spacing between granite_speech tensor mapping blocks
* mtmd: make generic audio layer_norm_eps read optional
* mtmd: granite_speech keep encoder eps in GGUF, only hard-code projector eps
* mtmd: align defines and struct fields in clip-impl.h and clip-model.h
* mtmd: fix alignment and ordering issues across granite speech files
* convert: granite_speech use filter_tensors instead of modify_tensors for skipping
b9045
2026-05-06 14:40:59 +02:00
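The 2x frame stacking described in the granite-speech commit above (80 -> 160 mel) can be sketched in Python. The `stack_frames` name, the zero-padding of an odd tail frame, and the shapes are illustrative assumptions, not the actual mtmd implementation:

```python
import numpy as np

def stack_frames(mel: np.ndarray) -> np.ndarray:
    """2x frame stacking: concatenate each pair of consecutive 80-bin
    mel frames into one 160-dim frame, halving the time dimension.
    Zero-pads an odd tail frame. Illustrative sketch only; the real
    preprocessing lives in mtmd and may differ in details."""
    n_frames, n_mel = mel.shape
    if n_frames % 2:  # pad so frames divide evenly into pairs
        mel = np.vstack([mel, np.zeros((1, n_mel), dtype=mel.dtype)])
    return mel.reshape(-1, 2 * n_mel)

mel = np.zeros((101, 80), dtype=np.float32)  # 80-bin log-mel frames
print(stack_frames(mel).shape)  # (51, 160)
```

Halving the frame count this way reduces the sequence length the encoder sees while keeping all spectral information, which is why the commit describes the mel dimension going from 80 to 160.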
David Huggins-Daines
750141969c
feat: migrate to PEP 621 and add uv support (#21907)
* feat: migrate to PEP 621 and add uv support
* fix: remove upper bound on protobuf
* remove poetry.lock and uv.lock
* fix/add torch dependency version and markers
* fix dev-dependency deprecation warning
* gguf-py : update python version requirement to 3.10
---------
Co-authored-by: David Huggins-Daines <dhd@dhd.ecolingui.ca>
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2026-05-06 14:04:10 +02:00
Daniel Bevenius
a736e6c0ac
convert : ignore non-language tensors for Gemma4Model (#22753)
* convert : ignore non-language tensors for Gemma4Model
This commit adds a check to make sure only text language tensors are
handled in filter_tensors.
The motivation is that currently when trying to convert a Gemma4 model
the following error occurs:
```console
(venv) $ ./convert-gemma.sh
INFO:hf-to-gguf:Loading model: gemma-4-E2B-it
INFO:hf-to-gguf:Model architecture: Gemma4ForConditionalGeneration
INFO:hf-to-gguf:gguf: indexing model part 'model.safetensors'
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight, torch.float32 --> F32, shape = {256}
Traceback (most recent call last):
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 13752, in <module>
main()
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 13746, in main
model_instance.write()
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 945, in write
self.prepare_tensors()
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 805, in prepare_tensors
for new_name, data_torch in (self.modify_tensors(data_torch, name, bid)):
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 7925, in modify_tensors
yield from super().modify_tensors(data_torch, name, bid)
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 7290, in modify_tensors
yield from super().modify_tensors(data_torch, name, bid)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 579, in modify_tensors
new_name = self.map_tensor_name(name)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 572, in map_tensor_name
raise ValueError(f"Can not map tensor {name!r}")
ValueError: Can not map tensor 'model.embed_vision.embedding_projection.weight'
```
* add forgotten embed_vision and embed_audio
* improve
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-06 13:50:44 +02:00
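The filter_tensors check the Gemma4 commit above describes can be sketched as follows. The prefixes are taken from the tensor names visible in the traceback (`model.embed_vision.`, `model.embed_audio.`); the function shape is an illustrative assumption, not the actual convert_hf_to_gguf.py code:

```python
def filter_tensors(names: list[str]) -> list[str]:
    """Keep only text-language tensors and drop the vision/audio
    towers so map_tensor_name never sees unmappable names.
    Illustrative sketch; the real filter is more thorough."""
    skip_prefixes = ("model.embed_vision.", "model.embed_audio.")
    return [n for n in names if not n.startswith(skip_prefixes)]

names = [
    "model.embed_tokens.weight",
    "model.embed_vision.embedding_projection.weight",
]
print(filter_tensors(names))  # ['model.embed_tokens.weight']
```

Filtering before modify_tensors runs avoids the `ValueError: Can not map tensor ...` shown in the traceback, because the unmappable multimodal tensors never reach the name-mapping step.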
Aleksander Grygier
e3e3f8e46a
webui: Remove Google Favicons & Improve MCP Information logic & UI (#22719)
* refactor: Remove Google favicon utility
* fix: MCP Server favicon
* refactor: Cleanup
* refactor: MCP Server Information
* fix: Fix MCP Settings UI
* refactor: Cleanup
2026-05-06 11:12:27 +02:00
zzzzwc
f08f20a0e3
ggml-cpu: fuse RMS_NORM + MUL on CPU backend (#22423)
b9041
2026-05-06 15:41:14 +08:00
viggy
07eaf919ed
add tabindex and aria-hidden (#22699)
2026-05-06 09:21:58 +02:00
Sigbjørn Skjæret
74d6248f71
convert : add filter_tensors method to pre-filter tensors (#22597)
* add filter_tensors classmethod
* remove language_model
* fix parts validation
2026-05-06 08:06:05 +02:00
fl0rianr
2ca1161bd7
ggml : use CL_DEVICE_GLOBAL_MEM_SIZE as memory estimate for OpenCL --fit (#22688)
* ggml : report estimated OpenCL memory for --fit
Signed-off-by: Florian Reinle <f.reinle@otec.de>
* ggml : estimated OpenCL memory backend integrated
Signed-off-by: Florian Reinle <f.reinle@otec.de>
---------
Signed-off-by: Florian Reinle <f.reinle@otec.de>
b9038
2026-05-05 22:12:48 -07:00
Trivikram Reddy
bbeb89d76c
Hexagon: Process M-tail rows on HMX instead of HVX (#22724)
* hex-mm: process m-tail rows on HMX instead of HVX
* hmx-mm: unroll and optimize padded activation loop
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
b9037
2026-05-05 09:43:03 -07:00