Commit Graph

9059 Commits

Gaurav Garg
b9afc19cb4 Write a readme on Multi-GPU usage in llama.cpp (#22729)
* Write a readme on Multi-GPU usage in llama.cpp

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-07 17:48:40 +02:00
Georgi Gerganov
803627f121 llama : remove unnecessary seq_id check during state restore (#22797) b9058 2026-05-07 16:37:26 +03:00
pl752
68380ae11b ggml-cpu: Optimized RISC-V CPU q1_0 dot product b9057 2026-05-07 21:09:25 +08:00
Pascal
cc97e45a14 mtmd: fix whisper audio tail truncation by exposing padded buffer to FFT (#22770) b9056 2026-05-07 14:01:01 +02:00
AesSedai
8e52631d55 model: Add Mimo v2.5 model support (#22493)
* add mimo-v2.5 support

* mimo-v2.5: fix modify_tensors row split

* mimo-v2.5: forgot `add_attn_value_scale` plumbing

* mimo-v2.5: fix TP dequant to detect TP rows

* mimo-v2.5: fix TP iteration to be descending

* mimo-v2.5: fix comment

* mimo-v2.5: retain fused qkv

* mimo-v2.5: missed the attn_value scale during merge

* mimo-v2.5: fused QKV needs contiguous for scaling attention value

* mimo-v2.5: move `speech_embeddings.` to TextModel filter_tensors

* Update src/llama-hparams.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/mimo2.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/mimo2.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/mimo2.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* mimo-v2.5: include MTP weights in gguf

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b9055
2026-05-07 13:21:58 +02:00
Pascal
f4b5a2ee91 webui: fix ?model= URL param race in router mode (#22771)
* webui: fix ?model= URL param race in router mode

* chore: update webui build output
2026-05-07 13:09:32 +02:00
Vishal Singh
97f06e9eed codeowners : add ZenDNN backend codeowner (#22772)
* codeowners : add ZenDNN backend codeowner

* codeowners : fix zendnn owners to use individual github handles
2026-05-07 14:46:51 +08:00
viggy
e358d75adb webui: fix flicker issue on dismiss animation on overlay primitives (#22773)
* add fill-mode-forwards

* generated diffs
2026-05-07 08:11:31 +02:00
Shane Tran Whitmire
cfff1fc300 sycl : fix test script (#22737)
The error:
./examples/sycl/test.sh: line 122: level_zero:${$GGML_SYCL_DEVICE}: bad substitution

was thrown whenever the user ran this command:
./examples/sycl/test.sh -mg 0

The fix is to remove the stray dollar sign: ${$GGML_SYCL_DEVICE} becomes ${GGML_SYCL_DEVICE}.
2026-05-07 08:25:57 +03:00
Adrien Gallouët
3980e04d5a llama : add missing call to ggml_backend_load_all() (#22752)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b9050
2026-05-07 08:24:47 +03:00
tc-mb
2496f9c149 mtmd : support MiniCPM-V 4.6 (#22529)
* Support MiniCPM-V 4.6 in new branch

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* fix code bug

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* fix pre-commit

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* fix convert

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* rename clip_graph_minicpmv4_6

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* use new TYPE_MINICPMV4_6

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* use build_attn to allow flash attention support

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* no longer use legacy code; restored here

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* use the existing tensors name

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* remove unused ctx->model.hparams.minicpmv_version

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* use n_merge for slice alignment

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* borrow wa_layer_indexes for vit_merger insertion point

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* fix code style

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* use filter_tensors and add model.vision_tower

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* fix chkhsh

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* fix type check

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

---------

Signed-off-by: tc-mb <tianchi_cai@icloud.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b9049
2026-05-06 21:54:09 +02:00
Gilad S.
5207d120ea model : don't crash on unsupported architecture (#22742)
* model: don't crash on unsupported architecture

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b9048
2026-05-06 18:51:21 +02:00
fl0rianr
a0101225bc common: do not fit to unknown device memory (#22614)
* common: do not fit to unknown device memory

Signed-off-by: Florian Reinle <f.reinle@otec.de>

* common: preserve host fallback for non-GPU fit devices

Signed-off-by: Florian Reinle <f.reinle@otec.de>

* common: keep unknown GPU fit memory at zero

Signed-off-by: Florian Reinle <f.reinle@otec.de>

---------

Signed-off-by: Florian Reinle <f.reinle@otec.de>
b9047
2026-05-06 17:03:45 +02:00
Georgi Gerganov
a290ce6266 gguf-py : bump version to 0.19.0 (#22664)
* gguf-py : bump version to 0.19.0

* bump poetry

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
gguf-v0.19.0
2026-05-06 14:46:14 +02:00
Yakine Tahtah
a00e47e422 mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech) (#22101)
* mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech)

Conformer encoder with Shaw relative position encoding,
QFormer projector, log-mel spectrogram with frame stacking.

Encoder uses GLU gating, folded batch norm, and SSM depthwise
conv. QFormer compresses encoder output via windowed
cross-attention (window=15, queries=3) into the LLM embedding
space.

Audio preprocessing: reflect-padded STFT, 80-bin mel filterbank,
dynamic range compression, 2x frame stacking (80->160 mel).

GGUF converter handles batch norm folding at export time,
fused K/V split, and Conv1d weight reshaping.

Tested against HF transformers reference: token-for-token match
on 30s/60s audio clips with greedy decoding.
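
As a rough illustration of the 2x frame stacking step described in the preprocessing notes above, here is a hedged numpy sketch (the 80->160 shapes follow the commit message; the function name is made up for illustration):

```python
import numpy as np

def stack_frames(mel: np.ndarray, factor: int = 2) -> np.ndarray:
    """Stack consecutive log-mel frames: (T, 80) -> (T // factor, 160)."""
    t = (mel.shape[0] // factor) * factor            # drop any ragged tail
    return mel[:t].reshape(t // factor, factor * mel.shape[1])

mel = np.random.randn(3000, 80).astype(np.float32)   # ~30 s of 10 ms frames
print(stack_frames(mel).shape)                        # (1500, 160)
```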

* mtmd: rename gs_ prefixed tensors to generic/architecture names

* mtmd: use tensor_mapping.py for all granite_speech tensors

* convert: fold GraniteSpeechTextModel into GraniteModel

* mtmd: replace n_layer hack with explicit has_standard_layers flag

* mtmd: replace hardcoded magic numbers with GGUF hparams for granite speech

* mtmd: align KEY_A_ define spacing

* convert: register GraniteModel for GraniteSpeechForConditionalGeneration

* convert: fix ty type-check for GraniteSpeechMmprojModel registration

* mtmd: align TN_ define spacing

* mtmd: use generic layer loop for granite speech tensor loading

* mtmd: merge qformer_proj_layer into clip_layer

* mtmd: granite_speech remove redundant ggml_build_forward_expand on inputs

* mtmd: granite_speech add comment explaining why build_attn is not used

* mtmd: granite_speech hard-code eps in cpp, remove from GGUF metadata

* gguf: add spacing between granite_speech tensor mapping blocks

* mtmd: make generic audio layer_norm_eps read optional

* mtmd: granite_speech keep encoder eps in GGUF, only hard-code projector eps

* mtmd: align defines and struct fields in clip-impl.h and clip-model.h

* mtmd: fix alignment and ordering issues across granite speech files

* convert: granite_speech use filter_tensors instead of modify_tensors for skipping
b9045
2026-05-06 14:40:59 +02:00
David Huggins-Daines
750141969c feat: migrate to PEP 621 and add uv support (#21907)
* feat: migrate to PEP 621 and add uv support

* fix: remove upper bound on protobuf

* remove poetry.lock and uv.lock

* fix/add torch dependency version and markers

* fix dev-dependency deprecation warning

* gguf-py : update python version requirement to 3.10

---------

Co-authored-by: David Huggins-Daines <dhd@dhd.ecolingui.ca>
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2026-05-06 14:04:10 +02:00
Daniel Bevenius
a736e6c0ac convert : ignore non-language tensors for Gemma4Model (#22753)
* convert : ignore non-language tensors for Gemma4Model

This commit adds a check to make sure only text language tensors are
handled in filter_tensors.

The motivation is that currently when trying to convert a Gemma4 model
the following error occurs:
```console
(venv) $ ./convert-gemma.sh
INFO:hf-to-gguf:Loading model: gemma-4-E2B-it
INFO:hf-to-gguf:Model architecture: Gemma4ForConditionalGeneration
INFO:hf-to-gguf:gguf: indexing model part 'model.safetensors'
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,                 torch.float32 --> F32, shape = {256}
Traceback (most recent call last):
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 13752, in <module>
    main()
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 13746, in main
    model_instance.write()
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 945, in write
    self.prepare_tensors()
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 805, in prepare_tensors
    for new_name, data_torch in (self.modify_tensors(data_torch, name, bid)):
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 7925, in modify_tensors
    yield from super().modify_tensors(data_torch, name, bid)
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 7290, in modify_tensors
    yield from super().modify_tensors(data_torch, name, bid)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 579, in modify_tensors
    new_name = self.map_tensor_name(name)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 572, in map_tensor_name
    raise ValueError(f"Can not map tensor {name!r}")
ValueError: Can not map tensor 'model.embed_vision.embedding_projection.weight'
```
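
The shape of the fix, as a hedged sketch: a predicate that keeps only text-model tensors (the name `is_language_tensor` is illustrative, not the actual filter_tensors implementation; the skipped prefixes come from the commit notes below):

```python
def is_language_tensor(name: str) -> bool:
    """Keep text-model tensors; drop multimodal ones the text converter cannot map."""
    return not name.startswith(("model.embed_vision.", "model.embed_audio."))

assert not is_language_tensor("model.embed_vision.embedding_projection.weight")
assert is_language_tensor("model.layers.0.self_attn.q_proj.weight")
```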

* add forgotten embed_vision and embed_audio

* improve

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-06 13:50:44 +02:00
Aleksander Grygier
e3e3f8e46a webui: Remove Google Favicons & Improve MCP Information logic & UI (#22719)
* refactor: Remove Google favicon utility

* fix: MCP Server favicon

* refactor: Cleanup

* refactor: MCP Server Information

* fix: Fix MCP Settings UI

* refactor: Cleanup
2026-05-06 11:12:27 +02:00
zzzzwc
f08f20a0e3 ggml-cpu: fuse RMS_NORM + MUL on CPU backend (#22423) b9041 2026-05-06 15:41:14 +08:00
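
For reference, the two ops being fused compute the following (a numpy sketch of the math, not the CPU kernel; fusing avoids materializing the intermediate normalized tensor):

```python
import numpy as np

def rms_norm_mul(x: np.ndarray, w: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # RMS_NORM immediately followed by an elementwise MUL with the norm weight.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * w
```
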
viggy
07eaf919ed add tabindex and aria-hidden (#22699) 2026-05-06 09:21:58 +02:00
Sigbjørn Skjæret
74d6248f71 convert : add filter_tensors method to pre-filter tensors (#22597)
* add filter_tensors classmethod

* remove language_model

* fix parts validation
2026-05-06 08:06:05 +02:00
fl0rianr
2ca1161bd7 ggml : use CL_DEVICE_GLOBAL_MEM_SIZE as memory estimate for OpenCL --fit (#22688)
* ggml : report estimated OpenCL memory for --fit

Signed-off-by: Florian Reinle <f.reinle@otec.de>

* ggml : estimated OpenCL memory backend integrated

Signed-off-by: Florian Reinle <f.reinle@otec.de>

---------

Signed-off-by: Florian Reinle <f.reinle@otec.de>
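
The queried value is what pyopencl exposes as global_mem_size; a hedged sketch of reading it (assuming pyopencl is installed; this is not the ggml OpenCL backend code):

```python
import pyopencl as cl

# CL_DEVICE_GLOBAL_MEM_SIZE, the value now used as the --fit memory estimate
for platform in cl.get_platforms():
    for dev in platform.get_devices():
        print(dev.name, dev.global_mem_size // (1024 * 1024), "MiB")
```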
b9038
2026-05-05 22:12:48 -07:00
Trivikram Reddy
bbeb89d76c Hexagon: Process M-tail rows on HMX instead of HVX (#22724)
* hex-mm: process m-tail rows on HMX instead of HVX

* hmx-mm: unroll and optimize padded activation loop

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
b9037
2026-05-05 09:43:03 -07:00
lhez
ff806a110d opencl: refactor Adreno q4_0 (#22335)
* opencl: refactor adreno q4_0 gemm/gemv dispatch

* opencl: refactor q4_0 gemm/gemv loading, use consistent names

* opencl: use consistent name for adreno q8_0 gemm/gemv

* opencl: use consistent names for adreno q4_0 gemm/gemv

* opencl: simplify adreno q4_0 set_tensor

* opencl: refactor q4_0 get_tensor
2026-05-05 09:38:57 -07:00
Radoslav Gerganov
d5003b6e4d rpc : use graph uid instead of graph cache (#22701)
Store the last graph uid and compare against it to determine if the same
graph is being computed.
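
The idea, sketched in Python for clarity (the real change lives in the C++ RPC code; all names here are illustrative):

```python
class GraphTracker:
    """Remember only the uid of the last computed graph instead of a full cache."""
    def __init__(self):
        self.last_uid = None

    def compute(self, uid, build_graph):
        if uid != self.last_uid:   # a different graph arrived: (re)build it
            build_graph()
            self.last_uid = uid
        # else: same graph as last time, reuse what was already built
```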
2026-05-05 13:47:13 +03:00
Adrien Gallouët
2635ac76e8 common : fix missing-noreturn warnings when compiling with clang 21 (#22702)
    common/arg.cpp:3719:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
     3719 |         [](common_params & /*params*/, int /*value*/) {
          |         ^
    common/arg.cpp:3726:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
     3726 |         [](common_params & /*params*/, int /*value*/) {
          |         ^
    common/arg.cpp:3733:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
     3733 |         [](common_params & /*params*/, int /*value*/) {
          |         ^
    common/arg.cpp:3740:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
     3740 |         [](common_params & /*params*/, int /*value*/) {
          |         ^
    common/arg.cpp:3747:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
     3747 |         [](common_params & /*params*/, int /*value*/) {
          |         ^

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-05 13:16:25 +03:00
Georgi Gerganov
70a8309114 sync : ggml b9033 2026-05-05 13:15:59 +03:00
Georgi Gerganov
c91faf997f ggml : bump version to 0.11.0 (ggml/1478) 2026-05-05 13:15:59 +03:00
Adrien Gallouët
bf76ac77be common : only load backends when required (#22290)
* common : only load backends when required

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* llama : call ggml_backend_load_all() directly from llama_backend_init()

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add ggml_backend_load_all() where llama_backend_init() is not used

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b9031
2026-05-05 09:23:50 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO)
a09a00e502 vendor : update cpp-httplib to 0.43.3 (#22686) b9030 2026-05-05 09:04:57 +02:00
Georgi Gerganov
2bacb1eb77 server : validate --tools CLI argument against known tool names (#22538)
Previously, unknown tool names passed via --tools were silently ignored.
Now the server validates each tool name at startup and exits with an
error if an unrecognized tool is specified, listing the available tools.
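
A hedged sketch of that startup check (illustrative Python, not the server's C++; only get_datetime is a tool name confirmed elsewhere in this log, so treat the registry contents as hypothetical):

```python
KNOWN_TOOLS = {"get_datetime"}  # hypothetical registry of built-in server tools

def validate_tools(requested: list[str]) -> None:
    unknown = sorted(set(requested) - KNOWN_TOOLS)
    if unknown:
        raise SystemExit(
            f"unknown tool(s): {', '.join(unknown)}; "
            f"available: {', '.join(sorted(KNOWN_TOOLS))}"
        )
```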

Assisted-by: llama.cpp:local pi
b9029
2026-05-05 06:35:27 +03:00
Georgi Gerganov
d6e7b033a4 llama : add option to save memory in device buffers (#22679)
* llama : add option to save memory in device buffers

* tests : extend llama-save-load-state
b9028
2026-05-05 06:35:07 +03:00
Sigbjørn Skjæret
fa595462ca graph : handle non-contiguous Q/K/V in mul_mat_aux (#22630)
* qkv may not always be contiguous

* cont : make the cont conditional
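
The gist of the conditional cont, in numpy terms (a sketch of the idea; the actual graph code uses ggml's contiguity check and cont op, and only inserts the copy when needed):

```python
import numpy as np

def ensure_contiguous(t: np.ndarray) -> np.ndarray:
    # Only copy when the tensor is actually non-contiguous,
    # mirroring "make the cont conditional" above.
    return t if t.flags["C_CONTIGUOUS"] else np.ascontiguousarray(t)

q = np.arange(12).reshape(3, 4).T   # a transposed view is not contiguous
assert ensure_contiguous(q).flags["C_CONTIGUOUS"]
```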

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-05-05 06:34:44 +03:00
Ismail
a817a22bc6 ggml : implement fast Walsh-Hadamard transform for KV rotation (#21352) (#22631) b9026 2026-05-05 10:05:05 +08:00
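
For reference, the transform itself is the classic in-place butterfly; a minimal Python sketch (not the ggml kernel):

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = x.astype(np.float32).copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b   # butterfly: sum and difference
        h *= 2
    return x

print(fwht(np.array([1.0, 0.0, 1.0, 0.0])))  # [2. 2. 0. 0.]
```
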
Charles Xu
eff06702b2 kleidiai : update to v1.24.0 and use release archive (#22549) b9025 2026-05-04 22:13:31 +03:00
leonardHONG
e77056f9b2 CUDA: use fastdiv for batch index split in get_rows (#22650) 2026-05-04 16:24:05 +02:00
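
fastdiv here is the standard magic-multiplier trick: replace integer division by a runtime-constant divisor with a multiply and a shift. A hedged Python sketch of the technique (not ggml's CUDA implementation):

```python
def make_fastdiv(d: int):
    # Precompute m = floor(2**64 / d) + 1; then x // d == (x * m) >> 64
    # for all 0 <= x < 2**32 (the usual correctness window for this choice).
    m = (1 << 64) // d + 1
    return lambda x: (x * m) >> 64

div_rows = make_fastdiv(7)
assert all(div_rows(x) == x // 7 for x in range(100_000))
```
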
Xuan-Son Nguyen
935a340292 server: implement /models?reload=1 (#21848) b9023 2026-05-04 16:23:26 +02:00
Shakhnazar Sailaukan
d8794eecd5 examples: refactor diffusion generation (#22590)
* examples: refactor diffusion generation

* renamed enum values
b9022
2026-05-04 20:19:30 +08:00
JusteLeo
36a694c965 webui : fix circular dependency between chat.service.ts and models.svelte.ts (#22625) 2026-05-04 13:38:10 +02:00
Piotr Wilkin (ilintar)
a4701c98f7 common/autoparser: fixes for newline handling / forced tool calls (#22654)
* chat/autoparser: the fixes

* Move optspace() to chat-peg-parser, comment out server tests invalidated due to content now allowed with forced tool calls.

* Trim whitespace on apply instead
b9020
2026-05-04 13:18:11 +02:00
Xuan-Son Nguyen
994118a183 model: move load_hparams and load_tensors to per-model definition (#22004)
* git-friendly migration

* add build_graph

* nits

* exclude old code from build

* wip

* add llm_arch_model_i

* prepare downstream functions

* nits

* nits

* wip

* wip

* add back create_tensor_qkv

* fix files missing include

* enforce one llm_build per arch

* cmake: use glob

* missing model params

* nits

* wip

* wip (2)

* wip (3)

* test-llama-archs is happy

* improve switch case

* move more stuff into llm_arch_model_i

* fix downstream code

* nits

* nits (2)

* fix order

* llama_model_base

* LLAMA_LOAD_LOCALS

* small fix

* fix build errors

* auto

* rm migration script and ifdef
b9019
2026-05-04 12:36:59 +02:00
Evan Huus
c84e6d6db5 server: Add a simple get_datetime server tool (#22649) b9018 2026-05-04 12:19:41 +02:00
Nick Towle
fa8feaed34 webui: restore missing settings (#22666) 2026-05-04 09:04:07 +02:00
Georgi Gerganov
846262d787 docs : update speculative decoding parameters after refactor (#22397) (#22539)
* docs : update speculative decoding parameters after refactor (#22397)

Update docs/speculative.md to reflect the new parameter naming scheme
introduced in PR #22397:

- Replace --draft-max/--draft-min with --spec-draft-n-max/--spec-draft-n-min
- Replace --spec-ngram-size-n/m with per-implementation variants
- Add documentation for all new --spec-ngram-* parameters
- Update all example commands

Assisted-by: llama.cpp:local pi

* pi : add rule to use gh CLI for GitHub resources

Assisted-by: llama.cpp:local pi

* docs : run llama-gen-docs

* arg : fix typo
b9016
2026-05-04 08:52:07 +03:00
Atomic-Germ
6dcd824fce vulkan: delete dead GGML_VK_MAX_NODES def (#22621) b9015 2026-05-04 07:49:29 +02:00
Chen Yuan
d4b0c22f9e ggml-webgpu: add layer norm ops (#22406)
* shader(norm): add layer norm ops

* shader(norm): stabilize floating-point computation with Kahan summation (sketched below) and handle mixed types

* shader(norm): remove the non-contiguous strides

* shader(norm): use the original implementation rather than the Kahan summation
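
For context, this is the compensated summation the second bullet refers to, written here in plain Python for clarity (the final bullet reverts the shader to straightforward summation):

```python
def kahan_sum(values):
    """Compensated summation: track the low-order bits lost at each add."""
    total = 0.0
    c = 0.0                  # running compensation for lost precision
    for v in values:
        y = v - c
        t = total + y
        c = (t - total) - y  # what got rounded away in total + y
        total = t
    return total
```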
b9014
2026-05-03 20:52:53 -07:00
Aldehir Rojas
e48034dfc9 common : determine generation prompt using longest common prefix (#22657) 2026-05-04 00:18:23 +02:00
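
The longest-common-prefix idea, sketched over token sequences (illustrative only; the actual change is in common/):

```python
def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the longest common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

assert common_prefix_len([1, 2, 3, 4], [1, 2, 9]) == 2
```
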
Julien Denize
048a490f76 convert : Mistral format yarn apply_scale support (#22612)
* [BUGFIX] Mistral format apply_scale support.

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix misunderstood boolean parameters

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b9012
2026-05-03 21:51:21 +02:00
JM Robles
db44417b02 convert : apply Q/K RoPE permutation in NVFP4 repack path (#22611)
Llama-architecture q_proj/k_proj weights need an axis-0 row permutation
to match GGML's RoPE convention. The BF16 path applies this in
LlamaModel.modify_tensors via LlamaModel.permute, but the NVFP4 path
bypasses modify_tensors and writes weights directly through
ModelBase._repack_nvfp4. Without the permutation, attention heads end
up scrambled at inference and the model produces gibberish.

This change overrides _repack_nvfp4 on LlamaModel and applies the same
permutation to both the nibble-packed weight and the per-block scale
before delegating to ModelBase._repack_nvfp4 via super(). Reuses the
existing LlamaModel.permute static helper and respects the existing
undo_permute flag, so subclasses (Mistral, Granite, Llama4, etc.)
inherit the fix automatically.

Verified on TinyLlama-1.1B reproducer: perplexity drops from 4419
(gibberish) to 43.9, matching the BF16-dequantized baseline (44.0).
Also verified end-to-end on ALIA-40b-instruct-2601 (BSC, Llama
architecture) with multilingual generation in Spanish/Catalan/Basque/
Galician all coherent with the fix applied.
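
The permutation in question interleaves the two halves of each attention head's rows. A numpy sketch of that reshape, reconstructed from the shape of LlamaModel.permute (treat the exact code as illustrative; the fix applies the same reordering to the packed NVFP4 nibbles and the per-block scales):

```python
import numpy as np

def permute_rows(w: np.ndarray, n_head: int) -> np.ndarray:
    # Reorder axis 0 so GGML's RoPE, which rotates adjacent row pairs,
    # sees the head layout it expects.
    return (w.reshape(n_head, 2, w.shape[0] // n_head // 2, *w.shape[1:])
             .swapaxes(1, 2)
             .reshape(w.shape))

w = np.arange(8 * 4, dtype=np.float32).reshape(8, 4)  # 2 heads x head_dim 4
print(permute_rows(w, n_head=2))
```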

Co-authored-by: Chema <chema@montevive.ai>
2026-05-03 18:22:00 +03:00
lucy
d05fe1d7da fix: CUDA device PCI bus ID de-dupe OOMing (ignoring the other 3 GPUs entirely) (#22533)
* fix: CUDA device PCI bus ID detection for multi-GPU de-dupe

* HIP, MUSA macros

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b9010
2026-05-02 22:19:25 +02:00