llama.cpp

Mirror of https://github.com/ggml-org/llama.cpp.git, synced 2026-05-12 20:14:09 +00:00

Author	SHA1	Message	Date
Georgi Gerganov	1dbc054da5	server : fix slot ctx_drft ptr	2026-05-08 11:55:05 +03:00
Georgi Gerganov	161eae0adf	spec : fix n_past type	2026-05-08 11:54:32 +03:00
Georgi Gerganov	e5b1401318	speculative-simple : update	2026-05-08 11:09:34 +03:00
Georgi Gerganov	3b1a8df8fd	server : clean-up + dry	2026-05-08 10:20:01 +03:00
Georgi Gerganov	233d1aee69	server : add comment [no ci]	2026-05-08 08:50:23 +03:00
Georgi Gerganov	12c7cfbe83	server : fix URL for draft model	2026-05-08 08:03:49 +03:00
Georgi Gerganov	6a4b05a030	server : fix mtmd draft processing	2026-05-08 08:02:11 +03:00
Georgi Gerganov	8be14e40de	spec : handle draft running out of context	2026-05-08 07:11:51 +03:00
Georgi Gerganov	7e118cdce0	cont : process images throught the draft context	2026-05-07 21:44:09 +03:00
Georgi Gerganov	ae6703fa89	cont : pass correct n_past for drafting	2026-05-07 21:44:08 +03:00
Georgi Gerganov	0239f4c611	cont : handle non-ckpt models	2026-05-07 21:44:08 +03:00
Georgi Gerganov	c7facb0fe1	cont : async drft eval when possible	2026-05-07 21:44:08 +03:00
Georgi Gerganov	08c8012bde	cont : sync main and drft contexts	2026-05-07 21:44:08 +03:00
Georgi Gerganov	de35b1255c	server, spec : transition to unified spec context	2026-05-07 21:44:08 +03:00
Georgi Gerganov	1afee5b262	server : improve ctx names [no ci]	2026-05-07 21:44:08 +03:00
Georgi Gerganov	11fd5e7272	server : draft prompt cache and checkpoints [no ci]	2026-05-07 21:44:08 +03:00
Georgi Gerganov	c97dc3605e	server : sketch the ctx_dft decode loop [no ci]	2026-05-07 21:44:08 +03:00
Georgi Gerganov	8a50f6f0b9	cont : dedup ctx_seq_rm_type [no ci]	2026-05-07 21:44:07 +03:00
Georgi Gerganov	77269ad8a7	cont : pass seq_id [no ci]	2026-05-07 21:44:07 +03:00
Georgi Gerganov	4550f0f08b	spec : update common_speculative_init() [no ci]	2026-05-07 21:44:07 +03:00
Georgi Gerganov	befc7ef635	spec : drop support for incompatible vocabs [no ci]	2026-05-07 21:44:07 +03:00
Georgi Gerganov	2c9a40849f	spec : refactor [no ci]	2026-05-07 21:44:07 +03:00
Georgi Gerganov	e43431b381	llama : fix device state save/load (#22805 ) b9064	2026-05-07 21:43:40 +03:00
shaofeiqi	ceb7e14b96	opencl: add opfilter regex for debugging (#22782 ) b9063	2026-05-07 11:00:20 -07:00
Aldehir Rojas	093be624cc	common/chat : preserve media markers for typed-content templates (#22634 ) b9062	2026-05-07 12:50:56 -05:00
HaoJun ZHANG	deab41ec68	tests: add long-sequence cases and fix inputs for gated_delta_net (#22794 ) * tests : add long-seq + tail cases for gated_delta_net * tests : realistic input ranges for gated_delta_net b9061	2026-05-08 00:23:36 +08:00
Intel AI Get-to Market Customer Success and Solutions	ad09224658	sycl: add FILL, CUMSUM, DIAG, SOLVE_TRI, SSM_SCAN, GATED_DELTA_NET (#22149 ) * sycl: add FILL, CUMSUM, DIAG, SOLVE_TRI, SSM_SCAN, GATED_DELTA_NET Signed-off-by: Chun Tao <chun.tao@intel.com> * Fix abort during test-backend-ops Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> * Regenerate ops.md Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> * Add scope_dbg_print to newly added SYCL ops. Also add scope_dbg_print to existing ssm_conv op. Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> --------- Signed-off-by: Chun Tao <chun.tao@intel.com> Signed-off-by: Todd Malsbary <todd.malsbary@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Todd Malsbary <todd.malsbary@intel.com> b9060	2026-05-07 18:51:33 +03:00
Gaurav Garg	b9afc19cb4	Write a readme on Multi-GPU usage in llama.cpp (#22729 ) * Write a readme on Multi-GPU usage in llama.cpp * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Address review comments * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-05-07 17:48:40 +02:00
Georgi Gerganov	803627f121	llama : remove unnecessary seq_id check during state restore (#22797 ) b9058	2026-05-07 16:37:26 +03:00
pl752	68380ae11b	ggml-cpu: Optimized risc-v cpu q1_0 dot b9057	2026-05-07 21:09:25 +08:00
Pascal	cc97e45a14	mtmd: fix whisper audio tail truncation by exposing padded buffer to FFT (#22770 ) b9056	2026-05-07 14:01:01 +02:00
AesSedai	8e52631d55	model: Add Mimo v2.5 model support (#22493 ) * add mimo-v2.5 support * mimo-v2.5: fix modify_tensors row split * mimi-v2.5: forgot `add_attn_value_scale` plumbing * mimi-v2.5: fix tp dequant to detect tp rows * mimo-v2.5: fix TP iteration to be descending * mimo-v2.5: fix comment * mimo-v2.5: retain fused qkv * mimo-v2.5: missed the attn_value scale during merge * mimo-v2.5: fused QKV needs contiguous for scaling attention value * mimo-v2.5: move `speech_embeddings.` to TextModel filter_tensors * Update src/llama-hparams.h Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/models/mimo2.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/models/mimo2.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/models/mimo2.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * mimo-v2.5: include MTP weights in gguf --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b9055	2026-05-07 13:21:58 +02:00
Pascal	f4b5a2ee91	webui: fix ?model= URL param race in router mode (#22771 ) * webui: fix ?model= URL param race in router mode * chore: update webui build output	2026-05-07 13:09:32 +02:00
Vishal Singh	97f06e9eed	codeowners : add ZenDNN backend codeowner (#22772 ) * codeowners : add ZenDNN backend codeowner * codeowners : fix zendnn owners to use individual github handles	2026-05-07 14:46:51 +08:00
viggy	e358d75adb	webui: fix flicker issue on dismiss animation on overlay primitives (#22773 ) * add fill-mode-forwards * generated diffs	2026-05-07 08:11:31 +02:00
Shane Tran Whitmire	cfff1fc300	sycl : fix test script (#22737 ) The error: ./examples/sycl/test.sh: line 122: level_zero:${$GGML_SYCL_DEVICE}: bad substitution was thrown whenever the user used this command: ./examples/sycl/test.sh -mg 0 Fix is to get rid of a dollar sign.	2026-05-07 08:25:57 +03:00
Adrien Gallouët	3980e04d5a	llama : add missing call to ggml_backend_load_all() (#22752 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co> b9050	2026-05-07 08:24:47 +03:00
tc-mb	2496f9c149	mtmd : support MiniCPM-V 4.6 (#22529 ) * Support MiniCPM-V 4.6 in new branch Signed-off-by: tc-mb <tianchi_cai@icloud.com> * fix code bug Signed-off-by: tc-mb <tianchi_cai@icloud.com> * fix pre-commit Signed-off-by: tc-mb <tianchi_cai@icloud.com> * fix convert Signed-off-by: tc-mb <tianchi_cai@icloud.com> * rename clip_graph_minicpmv4_6 Signed-off-by: tc-mb <tianchi_cai@icloud.com> * use new TYPE_MINICPMV4_6 Signed-off-by: tc-mb <tianchi_cai@icloud.com> * use build_attn to allow flash attention support Signed-off-by: tc-mb <tianchi_cai@icloud.com> * no use legacy code, restored here. Signed-off-by: tc-mb <tianchi_cai@icloud.com> * use the existing tensors name Signed-off-by: tc-mb <tianchi_cai@icloud.com> * unused ctx->model.hparams.minicpmv_version Signed-off-by: tc-mb <tianchi_cai@icloud.com> * use n_merge for slice alignment Signed-off-by: tc-mb <tianchi_cai@icloud.com> * borrow wa_layer_indexes for vit_merger insertion point Signed-off-by: tc-mb <tianchi_cai@icloud.com> * fix code style Signed-off-by: tc-mb <tianchi_cai@icloud.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * use filter_tensors and add model.vision_tower Signed-off-by: tc-mb <tianchi_cai@icloud.com> * fix chkhsh Signed-off-by: tc-mb <tianchi_cai@icloud.com> * fix type check Signed-off-by: tc-mb <tianchi_cai@icloud.com> --------- Signed-off-by: tc-mb <tianchi_cai@icloud.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b9049	2026-05-06 21:54:09 +02:00
Gilad S.	5207d120ea	model : don't crash on unsupported architecture (#22742 ) * model: don't crash on unsupported architecture * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b9048	2026-05-06 18:51:21 +02:00
fl0rianr	a0101225bc	common: do not fit to unknown device memory (#22614 ) * common: do not fit to unknown device memory Signed-off-by: Florian Reinle <f.reinle@otec.de> * common: preserve host fallback for non-GPU fit devices Signed-off-by: Florian Reinle <f.reinle@otec.de> * common: keep unknown GPU fit memory at zero Signed-off-by: Florian Reinle <f.reinle@otec.de> --------- Signed-off-by: Florian Reinle <f.reinle@otec.de> b9047	2026-05-06 17:03:45 +02:00
Georgi Gerganov	a290ce6266	gguf-py : bump version to 0.19.0 (#22664 ) * gguf-py : bump version to 0.19.0 * bump poetry --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> gguf-v0.19.0	2026-05-06 14:46:14 +02:00
Yakine Tahtah	a00e47e422	mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech) (#22101 ) * mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech) Conformer encoder with Shaw relative position encoding, QFormer projector, log-mel spectrogram with frame stacking. Encoder uses GLU gating, folded batch norm, and SSM depthwise conv. QFormer compresses encoder output via windowed cross-attention (window=15, queries=3) into the LLM embedding space. Audio preprocessing: reflect-padded STFT, 80-bin mel filterbank, dynamic range compression, 2x frame stacking (80->160 mel). GGUF converter handles batch norm folding at export time, fused K/V split, and Conv1d weight reshaping. Tested against HF transformers reference: token-for-token match on 30s/60s audio clips with greedy decoding. * mtmd: rename gs_ prefixed tensors to generic/architecture names * mtmd: use tensor_mapping.py for all granite_speech tensors * convert: fold GraniteSpeechTextModel into GraniteModel * mtmd: replace n_layer hack with explicit has_standard_layers flag * mtmd: replace hardcoded magic numbers with GGUF hparams for granite speech * mtmd: align KEY_A_ define spacing * convert: register GraniteModel for GraniteSpeechForConditionalGeneration * convert: fix ty type-check for GraniteSpeechMmprojModel registration * mtmd: align TN_ define spacing * mtmd: use generic layer loop for granite speech tensor loading * mtmd: merge qformer_proj_layer into clip_layer * mtmd: granite_speech remove redundant ggml_build_forward_expand on inputs * mtmd: granite_speech add comment explaining why build_attn is not used * mtmd: granite_speech hard-code eps in cpp, remove from GGUF metadata * gguf: add spacing between granite_speech tensor mapping blocks * mtmd: make generic audio layer_norm_eps read optional * mtmd: granite_speech keep encoder eps in GGUF, only hard-code projector eps * mtmd: align defines and struct fields in clip-impl.h and clip-model.h * mtmd: fix alignment and ordering issues across granite speech files * convert: granite_speech use filter_tensors instead of modify_tensors for skipping b9045	2026-05-06 14:40:59 +02:00
David Huggins-Daines	750141969c	feat: migrate to PEP 621 and add uv support (#21907 ) * feat: migrate to PEP 621 and add uv support * fix: remove upper bound on protobuf * remove poetry.lock and uv.lock * fix/add torch dependency version and markers * fix dev-dependency deprecation warning * gguf-py : update python version requirement to 3.10 --------- Co-authored-by: David Huggins-Daines <dhd@dhd.ecolingui.ca> Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2026-05-06 14:04:10 +02:00
Daniel Bevenius	a736e6c0ac	convert : ignore non-language tensors for Gemma4Model (#22753 ) * convert : ignore non-language tensors for Gemma4Model This commit adds a check to make sure only text language tensors are handled in filter_tensors. The motivation is that currently when trying to convert a Gemma4 model the following error occurs: ```console (venv) $ ./convert-gemma.sh INFO:hf-to-gguf:Loading model: gemma-4-E2B-it INFO:hf-to-gguf:Model architecture: Gemma4ForConditionalGeneration INFO:hf-to-gguf:gguf: indexing model part 'model.safetensors' INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only INFO:hf-to-gguf:Exporting model... INFO:hf-to-gguf:rope_freqs.weight, torch.float32 --> F32, shape = {256} Traceback (most recent call last): File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 13752, in <module> main() File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 13746, in main model_instance.write() File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 945, in write self.prepare_tensors() File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 805, in prepare_tensors for new_name, data_torch in (self.modify_tensors(data_torch, name, bid)): File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 7925, in modify_tensors yield from super().modify_tensors(data_torch, name, bid) File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 7290, in modify_tensors yield from super().modify_tensors(data_torch, name, bid) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 579, in modify_tensors new_name = self.map_tensor_name(name) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 572, in map_tensor_name raise ValueError(f"Can not map tensor {name!r}") ValueError: Can not map tensor 'model.embed_vision.embedding_projection.weight' ``` * add forgotten embed_vision and embed_audio * improve --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-05-06 13:50:44 +02:00
Aleksander Grygier	e3e3f8e46a	webui: Remove Google Favicons & Improve MCP Information logic & UI (#22719 ) * refactor: Remove Google favicon utility * fix: MCP Server favicon * refactor: Cleanup * refactor: MCP Server Information * fix: Fix MCP Settings UI * refactor: Cleanup	2026-05-06 11:12:27 +02:00
zzzzwc	f08f20a0e3	ggml-cpu: fuse RMS_NORM + MUL on CPU backend (#22423 ) b9041	2026-05-06 15:41:14 +08:00
viggy	07eaf919ed	add tabindex and aria-hidden (#22699 )	2026-05-06 09:21:58 +02:00
Sigbjørn Skjæret	74d6248f71	convert : add filter_tensors method to pre-filter tensors (#22597 ) * add filter_tensors classmethod * remove language_model * fix parts validation	2026-05-06 08:06:05 +02:00
fl0rianr	2ca1161bd7	ggml : use `CL_DEVICE_GLOBAL_MEM_SIZE` as memory estimate for OpenCL --fit (#22688 ) * ggml : report estimated OpenCL memory for --fit Signed-off-by: Florian Reinle <f.reinle@otec.de> * ggml : estimated OpenCL memory backend integrated Signed-off-by: Florian Reinle <f.reinle@otec.de> --------- Signed-off-by: Florian Reinle <f.reinle@otec.de> b9038	2026-05-05 22:12:48 -07:00
Trivikram Reddy	bbeb89d76c	Hexagon: Process M-tail rows on HMX instead of HVX (#22724 ) * hex-mm: process m-tail rows on HMX instead of HVX * hmx-mm: unroll and optimize padded activation loop --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com> b9037	2026-05-05 09:43:03 -07:00

9086 Commits