llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-05-13 12:34:05 +00:00

Author	SHA1	Message	Date
Daniel Bevenius	0da7e7dccc	sampling : remove version from sampler chain This commit removes the version field from the sampler chain and instead uses the sampler pointer itself for change detection.	2025-11-19 06:59:03 +01:00
Oliver Simons	26be108be8	CUDA: Optimize argsort for gpu-based token sampling Argsort is used for top-k currently. We optimize argsort by 2 things: 1. Use `DeviceRadixSort` for single-row/sequence to parallelize it across our SMs 2. Use `DeviceSegmentedSort` for multi-row/sequence as this is the correct entrypoint (the function chooses different execution paths, it contains `DeviceSegmentedRadixSort` as one of the paths and will choose the best one according to heuristics). https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceSegmentedSort.html#overview Some perf numbers for an RTX PRO 6000: On the kernel level, tested with `GGML_CUDA_DISABLE_GRAPHS=1 ./test-backend-ops -o ARGSORT perf` Before: ``` ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 359.24 us/run ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 8192 runs - 861.34 us/run ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 1020.01 us/run ``` After: ``` ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 312.41 us/run ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 16384 runs - 63.48 us/run ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 874.36 us/run ``` --- On the model level, tested with `llama-cli -m gpt-oss-20b-mxfp4.gguf -n 200 -p "What is the Capital of Sweden?" -no-cnv -fa 1 --backend-sampling` Before: ``` llama_perf_sampler_print: sampling time = 0.25 ms / 207 runs ( 0.00 ms per token, 824701.20 tokens per second) llama_perf_context_print: load time = 18215.58 ms llama_perf_context_print: prompt eval time = 28.20 ms / 7 tokens ( 4.03 ms per token, 248.19 tokens per second) llama_perf_context_print: eval time = 714.79 ms / 199 runs ( 3.59 ms per token, 278.40 tokens per second) llama_perf_context_print: total time = 857.62 ms / 206 tokens ``` After: ``` llama_perf_sampler_print: sampling time = 0.25 ms / 207 runs ( 0.00 ms per token, 828000.00 tokens per second) llama_perf_context_print: load time = 18366.92 ms llama_perf_context_print: prompt eval time = 35.92 ms / 7 tokens ( 5.13 ms per token, 194.87 tokens per second) llama_perf_context_print: eval time = 532.79 ms / 199 runs ( 2.68 ms per token, 373.50 tokens per second) llama_perf_context_print: total time = 683.65 ms / 206 tokens ```	2025-11-18 18:17:44 +01:00
Daniel Bevenius	311c1a347f	sampling : ensure at most one output token per seq This commit adds a check in the batch allocator to ensure that when backend sampling is enabled, at most one output token is specified per sequence.	2025-11-18 16:06:23 +01:00
Daniel Bevenius	82957a90f2	sampling : always expose sampled_ids This commit precomputes and caches the full-vocab token id list in llama_context's constructor, so llama_get_backend_sampled_token_ids_ith always returns a valid pointer. The motivation for this is that it allows both common/sampling.cpp and src/llama-sampling.cpp to simplify their logic. Not all backend samplers that process logits need to set the sampled_tokens_id as they may not change the order of the logits, for example the temperature sampler only scales the logits but does not change their order. Similarly, the logit bias sampler only adds bias to specific token ids but does not change the order of the logits. In these cases there will not be a device to host copy of the sampled token ids, and this is the use case where having this precomputed list is useful.	2025-11-18 15:11:59 +01:00
Georgi Gerganov	4b52e59903	graph : do not include llama-model.h	2025-11-18 13:53:25 +02:00
Daniel Bevenius	71574f9273	sampling : enable all backend sampler tests This commit enables all existing backend sampler tests in the test-backend-sampler. Previously, some tests were disabled because there were missing ggml operation implementations.	2025-11-18 07:31:54 +01:00
Daniel Bevenius	67d3b8e84d	ggml : add initial cumsum implementation for CUDA	2025-11-17 16:16:05 +01:00
Daniel Bevenius	a3eb847d24	webui : add backend sampling options	2025-11-17 16:16:05 +01:00
Daniel Bevenius	f1f3e68511	server : add backend sampling options/configuration	2025-11-17 16:16:05 +01:00
Daniel Bevenius	9fe9a00a8a	llama-cli : add backend sampler configuration	2025-11-17 16:16:05 +01:00
Daniel Bevenius	7884b0e0ac	sampling : add support for backend sampling This commit adds support for performing sampling operations on the backend (e.g. GPU) as part of the model computation graph. The motivation for this feature is to enable sampling to be performed directly on the backend as part of the computation graph being executed, allowing for some or all of the sampling to be done on the backend. For example, the backend sampler chain might select/sample a token directly, in which case only the sampled token needs to be transferred from device memory to host memory. It is also possible for the backend samplers to perform filtering of the logits, or compute and filter the probability distribution, in which case only the filtered logits or probabilities need to be transferred back to system memory for further processing by CPU samplers. Currently the backend sampling works in a similar manner to how pooling works: it is a function that is called by build_graph, and the sampler operations become part of the model's computation graph.	2025-11-17 16:15:58 +01:00
Adrien Gallouët	cb623de3fc	ggml : add missing AVX512 feature checks (#17270 ) _mm512_cvtepu8_epi16 requires __AVX512BW__ _mm512_srli_epi16 requires __AVX512BW__ __builtin_ia32_inserti32x8 requires __AVX512DQ__ Signed-off-by: Adrien Gallouët <angt@huggingface.co> b7087	2025-11-17 12:12:00 +01:00
Georgi Gerganov	7aaeedc098	metal : support I32 -> I32 copy (#17317 ) b7086	2025-11-17 11:52:00 +02:00
Georgi Gerganov	3347e6d904	metal : faster argsort (#17315 ) * metal : faster argsort * cont : keep data in registers b7085	2025-11-17 11:51:48 +02:00
Georgi Gerganov	1a139644a8	metal : add cumsum (#17305 ) b7084	2025-11-17 11:51:13 +02:00
hipudding	2376b7758c	CANN: Use smart pointers to manage ACL objects (#17238 ) * CANN: Use smart pointers to manage ACL objects Previously, ACL objects were managed via manual destruction, which led to multiple memory-leak issues during runtime. This patch replaces manual memory management with smart pointers so that ACL objects are properly released and ownership is clearly defined. Note that the ownership of an ACL object belongs to the function that creates it. Other internal functions should operate on these ACL objects using raw pointers to avoid unintended ownership transfers. Additionally, since aclTensorList automatically frees its contained aclTensor objects, any aclTensor added to a tensor list must release ownership to avoid double free operations. This PR also removes the asynchronous task submission mechanism. Due to changes in recent CANN versions, tiling time has significantly decreased. Even with a dual-thread submission model, the dispatch overhead still falls on the critical path, making async submission less beneficial. Moreover, aclGraph support provides a much better path to reducing operator dispatch latency. * CANN: resolve review comments b7083	2025-11-17 08:43:59 +08:00
Pavels Zaicenkovs	dbed61294a	vulkan: add LOG operation support for F32 and F16 (#17183 ) * vulkan: add LOG operation support for F32 and F16 Part of #14909. * vulkan: Fix LOG operation types * docs: Update operation support documentation for Vulkan LOG operation * vulkan: fix log_f16 shader * docs: restore missing LOG test cases and regenerate ops.md b7082	2025-11-16 22:50:09 +01:00
Ruben Ortlam	80deff3648	vulkan: fix MMQ quantize_y condition (#17301 ) b7081	2025-11-16 19:38:17 +01:00
Eve	8b1c339bd2	ci : revert #16249 (#17303 ) * Delete .github/workflows/build-amd.yml * Update build.yml	2025-11-16 19:09:17 +01:00
Georgi Gerganov	416e7c7f47	metal : remove obsolete asserts (#17295 ) b7079	2025-11-16 09:50:26 +02:00
Georgi Gerganov	5b2093becc	server : handle context overflow during decode (#17267 ) * server : handle context overflow during decode * server : minor refactor b7078	2025-11-16 09:23:37 +02:00
lhez	52e5d421f1	opencl: fix rms_norm_mul (#17250 ) * opencl: use subgroup reduce for reduction in rms_norm_mul * opencl: add comment about workgroup size b7077	2025-11-15 17:40:14 -08:00
shaofeiqi	4db5641210	opencl: add kernel to handle mat mul in attention to improve encoding speed (#17181 ) * Add mul_mm_f16_f32_kq_kqv kernel * Add ggml_cl_mul_mat_kq_kqv_adreno func * fix whitespace * remove unused variable * remove redundant * refactor and clean up * remove trailing whitespace b7076	2025-11-15 17:33:10 -08:00
shani-f	72bd7321a7	sycl : unify unary kernels with a generic implementation and enable wide operator support (#17213 ) * SYCL: add generic unary op implementation for multiple ops (ABS/SGN/…); unify non-contiguous access * SYCL: update documentation and sycl.csv to reflect new unary op support * update ops.md after syncing SYCL.csv changes * Fix SYCL.csv merge conflict * Update ops.md after fixing SYCL.csv conflicts * Fix SYCL.csv tail after merge conflict and regenerate ops.md * Fix line endings and final newline in SYCL.csv * Remove TOPK_MOE entries from SYCL.csv as requested * Update ops.md after removing TOPK_MOE from SYCL.csv * Regenerated SYCL.csv and synced ops.md with upstream * Update ops.md using create_ops_docs.py b7075	2025-11-16 00:52:42 +01:00
Aleksander Grygier	22e1ce2f81	webui: Fix clickability around chat processing statistics UI (#17278 ) * fix: Better pointer events handling in chat processing info elements * chore: update webui build output	2025-11-15 22:41:41 +01:00
Pascal	1411d9275a	webui: add OAI-Compat Harmony tool-call streaming visualization and persistence in chat UI (#16618 ) * webui: add OAI-Compat Harmony tool-call live streaming visualization and persistence in chat UI - Purely visual and diagnostic change, no effect on model context, prompt construction, or inference behavior - Captured assistant tool call payloads during streaming and non-streaming completions, and persisted them in chat state and storage for downstream use - Exposed parsed tool call labels beneath the assistant's model info line with graceful fallback when parsing fails - Added tool call badges beneath assistant responses that expose JSON tooltips and copy their payloads when clicked, matching the existing model badge styling - Added a user-facing setting to toggle tool call visibility to the Developer settings section directly under the model selector option * webui: remove scroll listener causing unnecessary layout updates (model selector) * Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * chore: npm run format & update webui build output * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2025-11-15 21:09:32 +01:00
Sigbjørn Skjæret	662192e1dc	convert : remove unnecessary chat template patching (#17289 )	2025-11-15 20:58:59 +01:00
Jeff Bolz	24dc769f1b	vulkan: Fuse mul_mat_id+add_id+mul and mul_mat+add+add. (#17287 ) These both show up in gpt-oss. Also, cleanup the mul_mat_vec fusion code a bit. b7071	2025-11-15 19:54:23 +01:00
Ruben Ortlam	4dca015b7e	vulkan: Replace 16-bit unpack8 calls to work around legacy Windows AMD driver bug (#17285 ) b7070	2025-11-15 15:18:58 +01:00
Sigbjørn Skjæret	9a8860cf5d	convert : use all parts in safetensors index (#17286 )	2025-11-15 14:12:39 +01:00
Sigbjørn Skjæret	9d3ef4809f	convert : set expert gating func in base class (#17279 )	2025-11-15 14:06:24 +01:00
Ankur Verma	c7b7db0445	mtmd-cli: Avoid logging to stdout for model loading messages in mtmd-cli (#17277 ) b7067	2025-11-15 12:41:16 +01:00
Giuseppe Scrivano	1568d13c2c	vulkan: implement ABS and NEG (#17245 ) * docs: update Vulkan ops * vulkan: add NEG op * vulkan: add ABS op --------- Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> b7066	2025-11-15 12:00:29 +01:00
Jeff Bolz	439342ea0b	vulkan: Use ggml_vk_tensor_subbuffer in mul_mat_vec(id) paths (#17244 ) * vulkan: Use ggml_vk_tensor_subbuffer in mul_mat_vec(id) paths * set allow_misalign b7065	2025-11-15 11:56:15 +01:00
Jeff Bolz	234ae7d7bd	vulkan: skip all-negative-inf blocks in FA (#17186 ) b7064	2025-11-15 10:37:25 +01:00
Jeff Bolz	38eaf32af1	vulkan: change graph_compute to be async and enable get_tensor_async (#17158 ) * vulkan: change graph_compute to be async and enable get_tensor_async This allows some additional CPU/GPU overlap for large pp workloads. Also seems to help a bit for token gen, maybe getting rid of a small bubble between graph_compute and get_tensor. Async set and copy functions seem to be very rarely used, so I didn't enable them because I didn't have a good way to test them. The async commands need to be ordered against each other, so put them all on the compute queue. The non-async commands still use the transfer queue. The fence for graph_compute/get_tensor_async is submitted and waited on in ggml_vk_synchronize. * fix thread safety errors * teardown context cleanly * Handle async read to non-pinned dst b7063	2025-11-15 09:06:41 +01:00
Xuan-Son Nguyen	9b17d74ab7	mtmd: add mtmd_log_set (#17268 ) b7062	2025-11-14 15:56:19 +01:00
Bartowski	e1fcf8b09b	model : add AfmoeForCausalLM support (#16477 ) * Add AFMOE model support * Update to vocab * Add model sizing * Undo Rope change for ARCEE model * Address review comments * Update modeling code is_sliding -> use_rope, replace hard-coded logic * Fix AFMOE tokenizer * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update AFMoE tokenizer class identification to be more unique --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b7061	2025-11-14 13:54:10 +01:00
Marek Hradil jr.	6cd0cf72ce	fix : Dangling pointer for non-empty trigger words in lazy grammar construction (#17048 ) * fix : Dangling pointer for non-empty trigger words in llama_sampler_init_grammar_impl (#17047) * Replace 'static' workaround, with keeping variable in scope for longer * Create std::array directly and pass into llama_grammar_init_impl * Add back the trigger pattern * Missed array include b7060	2025-11-14 14:35:26 +02:00
Georgi Gerganov	d396b43748	server : fix "can batch with" bug (#17263 ) b7059	2025-11-14 14:03:45 +02:00
Georgi Gerganov	45c6ef7307	metal : support argsort for ne00 > 1024 (#17247 ) * metal : refactor argsort * cont : sort chunks * cont : merge sorted buckets * cont : cleanup b7058	2025-11-14 09:36:06 +02:00
Georgi Gerganov	2606b0adab	metal : make the FA extra sizes consistent (#17143 ) b7057	2025-11-14 09:13:34 +02:00
ixgbe	307772fcda	readme : add RVV,ZVFH,ZFH,ZICBOP support for RISC-V (#17259 ) Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>	2025-11-14 09:12:56 +02:00
Aleksander Grygier	f1bad23f88	Better UX for handling multiple attachments in WebUI (#17246 ) b7055	2025-11-14 01:19:08 +01:00
Alberto Cabrera Pérez	becc4816dd	ggml-cpu: handle 3d tensors in repack mat_mul (#17241 ) * ggml-cpu: handle 3d tensors in repack mul_mat * Removed unnecessary branch, removed need for <algorithm> * Fixed dst_ptr pointer in chunk + clang_format * GGML_ASSERT to check wdata within bounds * Accidental ggml.h inclusion * Improved GGML_ASSERT on wdata boundaries * Address performance regression in Qwen and llama.cpp due to chunking b7054	2025-11-13 12:53:00 -08:00
Xuan-Son Nguyen	c4abcb2457	server: fixing naming conflict res_error (#17243 ) b7053	2025-11-13 20:53:47 +01:00
Piotr Wilkin (ilintar)	389ac78b26	ggml : add ops SOFTPLUS, EXPM1, TRI, SOLVE_TRI, CUMSUM (#17063 ) * Add ops needed for new hybrid models: SOFTPLUS, EXPM1, TRI, SOLVE_TRI, CUMSUM * Update ggml/include/ggml.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update tests/test-backend-ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Code review * Whitespace * Update tests/test-backend-ops.cpp Co-authored-by: Diego Devesa <slarengh@gmail.com> * This is actually sigmoid, duh. * Add CONST, remove TRI_KEEP, other changes from review * Update tests/test-backend-ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml.c Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml.c Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml-cuda/unary.cu Co-authored-by: Aman Gupta <amangupta052@gmail.com> * Remove extra script * Update ggml/src/ggml.c Co-authored-by: Diego Devesa <slarengh@gmail.com> * Update tests/test-backend-ops.cpp Co-authored-by: Diego Devesa <slarengh@gmail.com> * moving changes from laptop [no ci] * pre-rebase * Update tests/test-backend-ops.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-backend-ops.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Refactor tests * ggml : cleanup * cont : fix ggml_fill srcs * tests : add note * ggml : add ggml_fill_inplace * ggml : add asserts * ggml : fix ggml_fill constant cast * cont : ggml_tri minor * Use TENSOR_LOCALS * Fix regression from #14596, regenerate * Don't make commits at night... --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Diego Devesa <slarengh@gmail.com> Co-authored-by: Aman Gupta <amangupta052@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b7052	2025-11-13 20:54:47 +02:00
Ruben Ortlam	a19bd6f7ce	vulkan: remove shell call from vulkan-shaders-gen tool, revert file check (#17219 ) * vulkan: remove shell call from vulkan-shaders-gen tool * use string vector for command execution * Fix condition * use string, remove const_cast * Fix dependency file quotation on Windows --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com> b7051	2025-11-13 14:51:21 +01:00
Diego Devesa	dd091e52f8	sched : fix reserve ignoring user tensor assignments (#17232 ) b7050	2025-11-13 13:14:02 +01:00
ixgbe	1215dde7b0	ggml-cpu : add RISC-V vector intrinsic support for silu and cvar operations (#17227 ) Signed-off-by: Wang Yang <yangwang@iscas.ac.cn> b7049	2025-11-13 13:13:32 +01:00

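Commit 26be108be8 above notes that argsort is currently used for top-k. As a rough illustration of that idea only, here is a minimal CPU sketch: argsort the logits in descending order and keep the first k indices. The function name `top_k_indices` is hypothetical and not part of llama.cpp; the commit itself does the sort on the GPU with CUB, which this sketch does not attempt to reproduce.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper (not a llama.cpp function): returns the indices of the
// k largest logits, highest first, by doing a full descending argsort.
std::vector<int32_t> top_k_indices(const std::vector<float> & logits, size_t k) {
    assert(k <= logits.size());

    // Start from the identity permutation 0, 1, ..., n-1.
    std::vector<int32_t> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);

    // Argsort: order the indices by the logit they point to, largest first.
    // stable_sort keeps the lower index first when two logits are equal.
    std::stable_sort(idx.begin(), idx.end(),
        [&logits](int32_t a, int32_t b) { return logits[a] > logits[b]; });

    // Top-k is the first k entries of the sorted index list.
    idx.resize(k);
    return idx;
}
```

A full sort is more work than top-k strictly needs; it is used here only because it mirrors the argsort-based approach the commit message describes.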

7098 Commits
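
Commit 2376b7758c above describes three ownership rules: the creating function owns the object through a smart pointer, internal helpers receive raw pointers, and an object added to a self-freeing list must have its ownership released to avoid a double free. The sketch below illustrates that pattern with `std::unique_ptr` and a custom deleter. Every `fake_*` name is a hypothetical stand-in invented for this example; none of them is the real ACL or CANN API.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical C-style handle API, standing in for a library with manual destroy calls.
struct fake_tensor { int id; };
static int g_live_tensors = 0;  // counts handles that have not been destroyed yet

fake_tensor * fake_tensor_create(int id) { ++g_live_tensors; return new fake_tensor{id}; }
void fake_tensor_destroy(fake_tensor * t) { --g_live_tensors; delete t; }

// Custom deleter so unique_ptr calls the library's destroy function.
struct fake_tensor_deleter {
    void operator()(fake_tensor * t) const { fake_tensor_destroy(t); }
};
using fake_tensor_ptr = std::unique_ptr<fake_tensor, fake_tensor_deleter>;

// A list that frees its own elements, like the tensor list in the commit message.
struct fake_tensor_list {
    std::vector<fake_tensor *> items;
    ~fake_tensor_list() { for (fake_tensor * t : items) fake_tensor_destroy(t); }
};

// Internal helpers borrow a raw pointer and never take ownership.
int read_id(const fake_tensor * t) { return t->id; }

int demo() {
    int sum = 0;
    {
        fake_tensor_ptr a(fake_tensor_create(1));  // owned by this scope
        fake_tensor_ptr b(fake_tensor_create(2));
        sum = read_id(a.get()) + read_id(b.get());

        fake_tensor_list list;
        list.items.push_back(b.get());
        (void) b.release();  // the list owns it now; b must not destroy it again
    }  // list frees b's tensor, b is empty, a frees its own tensor
    return sum;
}
```

Calling `demo()` should leave `g_live_tensors` at zero, since each handle is destroyed exactly once.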