llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-05-11 11:34:10 +00:00

Author	SHA1	Message	Date
Adrien Gallouët	2635ac76e8	common : fix missing-noreturn warnings when compiling with clang 21 (#22702 ) common/arg.cpp:3719:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn] 3719 | [](common_params & /*params*/, int /*value*/) { | ^ common/arg.cpp:3726:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn] 3726 | [](common_params & /*params*/, int /*value*/) { | ^ common/arg.cpp:3733:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn] 3733 | [](common_params & /*params*/, int /*value*/) { | ^ common/arg.cpp:3740:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn] 3740 | [](common_params & /*params*/, int /*value*/) { | ^ common/arg.cpp:3747:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn] 3747 | [](common_params & /*params*/, int /*value*/) { | ^ Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-05-05 13:16:25 +03:00
Adrien Gallouët	bf76ac77be	common : only load backends when required (#22290 ) * common : only load backends when required Signed-off-by: Adrien Gallouët <angt@huggingface.co> * llama : call ggml_backend_load_all() directly from llama_backend_init() Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add ggml_backend_load_all() where llama_backend_init() is not used Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-05-05 09:23:50 +02:00
Georgi Gerganov	d6e7b033a4	llama : add option to save memory in device buffers (#22679 ) * llama : add option to save memory in device buffers * tests : extend llama-save-load-state	2026-05-05 06:35:07 +03:00
Shakhnazar Sailaukan	d8794eecd5	examples: refactor diffusion generation (#22590 ) * examples: refactor diffusion generation * renamed enum values	2026-05-04 20:19:30 +08:00
Piotr Wilkin (ilintar)	a4701c98f7	common/autoparser: fixes for newline handling / forced tool calls (#22654 ) * chat/autoparser: the fixes * Move optspace() to chat-peg-parser, comment out server tests invalidated due to content now allowed with forced tool calls. * Trim whitespace on apply instead	2026-05-04 13:18:11 +02:00
Evan Huus	c84e6d6db5	server: Add a simple get_datetime server tool (#22649 )	2026-05-04 12:19:41 +02:00
Georgi Gerganov	846262d787	docs : update speculative decoding parameters after refactor (#22397 ) (#22539 ) * docs : update speculative decoding parameters after refactor (#22397) Update docs/speculative.md to reflect the new parameter naming scheme introduced in PR #22397: - Replace --draft-max/--draft-min with --spec-draft-n-max/--spec-draft-n-min - Replace --spec-ngram-size-n/m with per-implementation variants - Add documentation for all new --spec-ngram-- parameters - Update all example commands Assisted-by: llama.cpp:local pi pi : add rule to use gh CLI for GitHub resources Assisted-by: llama.cpp:local pi * docs : run llama-gen-docs * arg : fix typo	2026-05-04 08:52:07 +03:00
Aldehir Rojas	e48034dfc9	common : determine generation prompt using longest common prefix (#22657 )	2026-05-04 00:18:23 +02:00
Adrien Gallouët	beb42fffa4	common : check for null getpwuid in hf-cache (#22550 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-04-30 21:32:41 +02:00
Ben Guidarelli	c20c44514a	spec: fix argument typo (#22552 )	2026-04-30 17:32:32 +03:00
Georgi Gerganov	80afa33aad	spec : fix draft model checkpoints (#22521 ) * spec : fix draft model checkpoints * cont : clean-up * cont : gate the ngram-mod reset warning behind verbose flag	2026-04-30 08:32:18 +03:00
Aldehir Rojas	d77599234e	common : do not pass prompt tokens to reasoning budget sampler (#22488 )	2026-04-29 14:10:58 -05:00
Georgi Gerganov	683c5acb90	spec : discard last drafted token with low prob (#22506 )	2026-04-29 17:00:00 +03:00
Masato Nakasaka	7b95ea5d11	common: Intentionally leak logger instance to fix hanging on Windows (#22273 ) * Changed to leak logger singleton to prevent hanging on Windows * Fix comment * Stopped using static vector Using std::vector will cause g_col to be released before the logger thread exits, causing the logger thread to touch freed memory causing a crash * Change so all logs are output before exit * Added debug logging * added more logging * Added logging * Explicitly free logger to avoid hanging on Win * Reverted to leak logger instance again * Removed debug log and fixed comment * Fixed comment --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-29 10:58:43 +03:00
Jillis ter Hove	52e5f0a5c1	common : re-arm reasoning budget after DONE on new <think> (#22323 ) DONE state absorbs all tokens including a new start tag, causing any think blocks after the first to run unbudgeted. Observed on unsloth/Qwen3.6-27B-GGUF which interleaves multiple <think> blocks per response. Fixed by advancing start_matcher in DONE branch and re-arming to COUNTING with a fresh budget on match. Adds regression test (test-reasoning-budget: test 6).	2026-04-28 19:15:36 +02:00
Georgi Gerganov	14e733e36f	spec : refactor params (#22397 ) * spec : refactor params * cont : fix * cont : rename "sparam" to "sampling" * cont : add spec params category * cont : add info about removed arguments * cont : skip param length check for spec params * cont : adapt server tests	2026-04-28 09:07:33 +03:00
rankaiyx	42401c72b8	Fix type casting for unaccounted memory calculation (#22424 )	2026-04-27 14:31:13 +02:00
Georgi Gerganov	e940b3d468	download : prefer q8_0 when q4_k not available (#22428 )	2026-04-27 14:30:29 +02:00
Max Krasnyansky	5594d13224	common: fix missing exports in llama-common (#22340 ) * common: refactor common/debug to move abort_on_nan into base_callback_data Passing bool abort_on_nan as template parameter for common_debug_cb_eval is unnecessary and creates an issue with LTO. It should just be a member of the base_callback_data instead. * cont : cleanup * common : use pimpl in debug.h to reduce header dependencies Move common_debug_cb_user_data's data members (std::regex, std::vector<uint8_t>) into a private impl struct in debug.cpp. This removes the includes of common.h and <regex> from debug.h, reducing transitive dependencies for any translation unit that includes the header. Assisted-by: llama.cpp:local pi --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-27 08:06:39 +03:00
Piotr Wilkin (ilintar)	dcad77cc3b	chat: fix handling of space in reasoning markers (#22353 ) * chat: fix handling of space in reasoning markers * fix tests * whitespace	2026-04-25 21:24:13 +02:00
Georgi Gerganov	98dc1418ea	spec : fix vocab compat checks (#22358 )	2026-04-25 20:11:35 +03:00
Adrien Gallouët	dc80c5252a	common : fix jinja warnings with clang 21 (#22313 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-04-24 12:36:02 +02:00
Georgi Gerganov	017f090442	jinja : remove unused header (#22310 )	2026-04-24 11:01:46 +03:00
Matthias Straka	0dd7f915fd	cli : cleanup auto-completion code (#21745 )	2026-04-23 15:03:28 +02:00
Tarek Dakhran	550d684bd1	server: Enable transcriptions API for LFM2-Audio (#22000 )	2026-04-23 10:47:26 +02:00
Ethan Turner	750579ff14	common: Refactoring sampler parameters (#20429 ) (#22233 ) This change refactors the reasoning_budget_message parameter from the common params into the sampling parameters specifically. It also removes the reasoning_budget common parameter and standardizes on the existing reasoning_budget_tokens parameter in the sampling configuration. Issue: https://github.com/ggml-org/llama.cpp/issues/20429 Original PR: https://github.com/ggml-org/llama.cpp/pull/20297	2026-04-22 10:40:19 +02:00
Piotr Wilkin (ilintar)	134d6e54d4	common/chat, server: refactor, move all conversion functions to common, add tests (#20690 ) * Refactor conversion functions	2026-04-22 10:28:45 +02:00
Paul Dubs	72d693e4fb	spec : reset i_last when low acceptance streak occurs (#22168 ) By resetting i_last to zero, we will include the current context when rebuilding the speculative map.	2026-04-21 21:29:07 +03:00
Georgi Gerganov	84652b80cf	arg : add --spec-default (#22223 )	2026-04-21 19:52:02 +03:00
Georgi Gerganov	cfe9838d26	fit-params : refactor + add option to output estimated memory per device (#22171 ) * fit-params : add option to output estimated memory per device * cont : minor * cont : refactor * cont : move fit params implementation to libcommon * cont : header * cont : headers * cont : codeowners	2026-04-21 09:54:36 +03:00
Georgi Gerganov	de71b5f81c	server : refactor "use checkpoint" logic (#22114 )	2026-04-20 08:42:37 +03:00
Yes You Can Have Your Own	9d49acb2a7	server: rename --clear-idle to --cache-idle-slots (#21741 )	2026-04-20 08:30:24 +03:00
Aldehir Rojas	d5b780a676	common/autoparser : allow space after tool call (#22073 )	2026-04-19 13:28:35 +02:00
Sascha Rogmann	455d8e4be8	server : speculative checkpointing (#19493 ) * server : speculative decoding using checkpoints * server : fix draft check with checkpoints * server : rename spec vars * server : log levels * server : refactored spec logic to speculative.cpp * server : renamed spec checkpoints option * server : fix spec checkpoints, logging * speculative : checkpoints with draft model, logging * server : n_tokens_cur and create_checkpoint in draft * server : fix server_speculative_callback (slot.id) * spec : fix ngram-map/begin idx_last_check * spec : init ckpt (begin() wasn't called) * chore: update webui build output * server : restore sampler in spec checkpoint and clear mem * cont : avoid --spec-use-checkpoints argument * cont : remove server_prompt_checkpoint_with_size * spec : rename (leave_draft_state) * cont : clean-up * cont : do not ignore partial drafts even if they are short * cont : spec callback owned by session * cont : simplify * cont : avoid empty speculative session * cont : simplify * cont : simplify * cont : enable mtmd speculative decoding * cont : keep the spec sampler alive * cont : simplify * cont : fix nullptr deref + draft checkpoints * cont : remove common_speculative_accept_response * cont : remove callback * cont : simplify * cont : minor * cont : simplify * cont : fix accepted number --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-19 10:24:06 +03:00
Georgi Gerganov	6990e2f1f7	libs : rename libcommon -> libllama-common (#21936 ) * cmake : allow libcommon to be shared * cmake : rename libcommon to libllama-common * cont : set -fPIC for httplib * cont : export all symbols * cont : fix build_info exports * libs : add libllama-common-base * log : add common_log_get_verbosity_thold()	2026-04-17 11:11:46 +03:00
Piotr Wilkin (ilintar)	e1a9a6dcbe	autoparser: support case of JSON_NATIVE with per-call markers (test case: Reka-Edge) (#21892 )	2026-04-15 10:51:50 +02:00
Berk Idem	56666fa607	common: skip reasoning budget sampler when no budget is requested (#21870 ) * common: skip reasoning budget sampler when no budget is requested After I added thinking_start_tag / thinking_end_tag for gemma4 in #21697, the reasoning budget sampler gets unconditionally created even when no budget is configured (the default -1). The same applies to kimi_k2, lfm2, lfm2_5, and ministral_3 which also set these tags. The budget gets converted to INT_MAX, so the sampler never actually forces any tokens but still runs per-token checks (start tag matching in IDLE state, token-to-piece conversion + UTF-8 checks in COUNTING state). More importantly, the mere existence of the sampler (non-null rbudget) disables backend sampling. Backend sampling lets the GPU select tokens directly, avoiding a full logits transfer from GPU to CPU every token. This could explain the 30% speed regression reported in #21784 (98 t/s to 70 t/s on Vulkan). So I added a reasoning_budget_tokens >= 0 check to the sampler creation condition. When the budget is unlimited, the sampler is not created, backend sampling stays enabled, and no per-token overhead is added. When a budget is explicitly set (0, 128, 1024, etc.), the sampler is created and works as before. * common: preserve rbudget when grammar is lazy Following up on the review feedback on #21870: keep the reasoning budget sampler when grammar_lazy is true, so the thinking-block grammar suppression from #20970 still works when tools are in use. This way, we only skip the sampler when both no budget is set AND grammar is not lazy.	2026-04-14 12:43:06 +02:00
Aldehir Rojas	e21cdc11a0	common/gemma4 : handle parsing edge cases (#21760 )	2026-04-13 18:18:18 -05:00
Piotr Wilkin (ilintar)	1c0d9081fd	chat: dedicated DeepSeek v3.2 parser + "official" template (#21785 )	2026-04-13 22:23:53 +02:00
Adrien Gallouët	aa00911d12	common : add download cancellation and temp file cleanup (#21813 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-04-13 11:18:23 +02:00
Galunid	b136b62cf9	fix: Fix broken structured output when using $refs in json_schema (#21699 )	2026-04-10 18:26:36 -05:00
Aldehir Rojas	3fc65063d9	common : better align to the updated official gemma4 template (#21704 )	2026-04-10 16:12:53 -05:00
Adrien Gallouët	05b3caaa48	common : add callback interface for download progress (#21735 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-04-10 22:17:00 +02:00
Adrien Gallouët	fb38d6f278	common : fix loading of cached HF models when the API is unavailable (#21670 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-04-10 16:37:46 +02:00
Johannes Gäßler	0893f50f2d	common: mark --split-mode tensor as experimental (#21684 )	2026-04-10 12:27:27 +02:00
Berk Idem	d7ff074c87	common : enable reasoning budget sampler for gemma4 (#21697 ) * fix: enable reasoning budget sampler for gemma4 Add thinking_start_tag and thinking_end_tag to common_chat_params_init_gemma4(). Without these, the reasoning budget sampler never activates for gemma4. Make the newline after "thought" optional in the PEG parser to handle budget=0 (sampler forces end tag before the newline). Add test case for empty thinking block. Fixes #21487 * use p.space() instead of p.optional(p.literal("\n")) in gemma4 thought parser	2026-04-10 11:49:14 +02:00
Adrien Gallouët	e095a482a0	common : add fluidity to the progress bar (#21671 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-04-10 08:24:53 +02:00
Johannes Gäßler	d6f3030047	ggml: backend-agnostic tensor parallelism (experimental) (#19378 ) * ggml: backend-agnostic tensor parallelism * support for GPT-OSS, Qwen 3 MoE * partial Vulkan fix * add support for 4/8 GPUs * unconditional peer access * re-use buffers + ggml contexts * fix output pattern * NCCL support * GGML: HIP: add RCCL support * Remove shfl and AllReduce from backend interface * move allocation workaround out of ggml-alloc.c * 2d tensor set/get support * Fix the seg fault without NCCL * Apply suggestion from JohannesGaessler * support for tensor dims % n_devs != 0 * fix view_offs scaling * arbitrary num. of GPUs/tensor split * fix compilation * better granularity estimate * Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA. Fix compilation errors. * partial Qwen 3 Next support * Fix qwen3 30b (#8) * Fix crash with Qwen-30B-A3B Q4_0 Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation. * Decide block size based on tensor quantization type * Fix crashes due to KV cache serialization (#9) KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset. * metal : fix build (#7) * static memory allocations, fix usage count * fix tensor granularity * more even memory distribution * use BF16 for allreduce * rebase fixup * better error message for unsupported architectures * Fix device mismatch during scatter of allReduce. (#11) There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies * Enable the previous allreduce implementation. 
It is better in both perf and stability (#12) * delay AllReduce for Moe for less I/O * build : clean-up compile warnings * backend : move most of the meta backend API to ggml-backend-impl.h * cont : hide unused public API in the implementation * llama : use llama_device + remove ggml_backend_dev_is_meta() * ggml-backend : remove unused alloc include * minor : remove regex include * ggml : introduce ggml-ext.h for staging new APIs * rebase fixup * fix tests * llama : more robust logic for determining Meta devices (#16) * llama : more robust logic for determining Meta devices * cont : fix devs size check Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * cont : fix log type Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * disable roundtrip for meta backend * fix arch selection * Qwen 3.5 support * fix Gemma 4 MoE * fix OpenVino, SYCL * fix test-llama-archs for CPU-only builds * Fix Qwen 3.5 MoE * disable meta backend tests for WebGPU * tests : filter CPU-based devices from the Meta backend tests (#17) * meta : formatting, naming, indentation (#18) * formatting : llama-model.cpp * formatting : ggml-ext.h * formatting : ggml-backend-meta.cpp * meta : add TODO * add documentation * better error messages * fix GPT-OSS --------- Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz> Co-authored-by: Gaurav Garg <gaugarg@nvidia.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-09 16:42:19 +02:00
Aldehir Rojas	ddf03c6d9a	common : fix ambiguous grammar rule in gemma4 (#21661 ) * common : fix ambiguous grammar rule in gemma4 * cont : fix missing comma...	2026-04-09 12:25:07 +02:00
Aldehir Rojas	26229755c5	common : simplify autoparser tagged parser rules (#21216 ) * common : simplify autoparser tagged parser rules * cont : remove upper limit on optional args * cont : revert changes to parsing at the end * cont : undo arbitrary ordering of optional args * cont : fix uninitialized required parameters * revert to simplify merge * re-apply patches * restore flexible optional arg ordering tests	2026-04-09 12:24:20 +02:00
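
Note on the reasoning budget commits above (52e5f0a5c1, #22323 and 56666fa607, #21870): the two behaviours they describe can be sketched roughly as follows. This is a simplified illustration, not the llama.cpp implementation. The names budget_tracker, sampling_sketch, and wants_budget_sampler are hypothetical; the sketch matches whole string pieces instead of tokens, and treating "thinking tags are set" as a precondition for creating the sampler is an assumption drawn from the commit text.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Illustrative only: hypothetical names, not the llama.cpp API.
enum class budget_state { IDLE, COUNTING, DONE };

struct budget_tracker {
    std::string  start_tag;
    std::string  end_tag;
    int          budget;                        // max pieces allowed inside one think block
    budget_state state     = budget_state::IDLE;
    int          remaining = 0;

    budget_tracker(std::string s, std::string e, int b)
        : start_tag(std::move(s)), end_tag(std::move(e)), budget(b) {}

    // Record one generated piece. Returns true when the budget is exhausted
    // and the caller should force the end tag next.
    bool accept(const std::string & piece) {
        assert(budget >= 0 && "tracker is only meant for an explicit budget");
        if (state == budget_state::COUNTING) {
            if (piece == end_tag) {
                state = budget_state::DONE;
            } else {
                --remaining;
            }
        } else if (piece == start_tag) {
            // Reached from IDLE and also from DONE: a later start tag
            // re-arms counting with a fresh budget (the #22323 behaviour).
            state     = budget_state::COUNTING;
            remaining = budget;
        }
        return state == budget_state::COUNTING && remaining <= 0;
    }
};

// Hypothetical, reduced stand-in for the sampling parameters.
struct sampling_sketch {
    int  reasoning_budget_tokens = -1;    // -1 means no budget requested
    bool grammar_lazy            = false;
    bool has_thinking_tags       = false; // assumption: start/end tags are configured
};

// Per #21870: skip the sampler only when no budget is set AND grammar is not lazy.
inline bool wants_budget_sampler(const sampling_sketch & p) {
    return p.has_thinking_tags && (p.reasoning_budget_tokens >= 0 || p.grammar_lazy);
}
```

In this sketch a caller would check wants_budget_sampler() once at setup, then call accept() for each generated piece and force the end tag whenever it returns true. With a budget of 0 the first accept() of the start tag already returns true, which matches the "budget=0" case mentioned in d7ff074c87 (#21697).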

887 Commits