llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-03-17 16:44:07 +00:00

Author	SHA1	Message	Date
Ruben Ortlam	715ed28683	use scalar sums	2026-03-07 22:11:40 +01:00
Ruben Ortlam	a9435151db	apply scales inline	2026-03-07 14:56:25 +01:00
Ruben Ortlam	d1f8bbd085	vulkan: add int8 coopmat quantized matmul shader	2026-03-07 14:43:21 +01:00
lhez	6fce5c6a7d	opencl: add l2_norm (#20160 ) b8232	2026-03-06 18:03:05 -08:00
Piotr Wilkin (ilintar)	c024d85908	Autoparser: True streaming (#20177 ) * Relax atomicity constraint for nicer, more pleasant, True Streaming parsing * Whitespace * Remove redundant atomics b8231	2026-03-07 01:55:33 +01:00
Piotr Wilkin (ilintar)	2f2923f895	Autoparser: add optional argument reshuffle capability (#20171 ) * Allow reshuffled arguments in tagged argument parser format tool calls. * Remove shuffle just keep the optional parsers in any order * Remove unnecessary import b8230	2026-03-06 22:34:15 +01:00
Bartowski	649f06481e	quants : Add memsets and other fixes for IQ quants (#19861 ) * Add memsets and other fixes for IQ quants * Make memset unconditional, change Laux back to L * Move another memset b8229	2026-03-06 23:06:56 +02:00
Piotr Wilkin (ilintar)	7463687161	Add @pwilkin to CODEOWNERS for autoparser code (#20174 )	2026-03-06 21:25:41 +01:00
Piotr Wilkin (ilintar)	566059a26b	Autoparser - complete refactoring of parser architecture (#18675 ) * Autoparser - full single commit squish * Final pre-merge changes: minor fixes, Kimi 2.5 model parser b8227	2026-03-06 21:01:00 +01:00
Todor Boinovski	34df42f7be	hexagon: add f32 ssm_conv op (#20122 ) * hexagon: add ssm_conv op * hexagon: hvx kernel is functional * hexagon: improvements to ssm-conv hvx kernel * hexagon: added dma to ssm-conv hvx kernel * hexagon: ssm-conv dynamically compute gather scratchpad * hex-ssm-conv: add local context and fix various issues (spad indexing, etc) --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com> b8226	2026-03-06 09:59:26 -08:00
Tom Vaucourt	e68f2fb894	server : preserve anthropic thinking blocks in conversion (#20120 ) * server : preserve anthropic thinking blocks in conversion (#20090) * server : add tests for anthropic thinking block conversion --------- Co-authored-by: root <root@llamacpp.home> b8225	2026-03-06 17:41:12 +01:00
Max Krasnyansky	ba2fd11cdf	cpu: skip redundant ROPE cache updates (#20149 ) b8224	2026-03-06 08:32:40 -08:00
Aman Gupta	d48e876467	ggml-cuda: add mem check for fusion (#19916 ) * ggml-cuda: add mem check for fusion * Replace NaNs with -FLT_MAX * fix typo Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> b8223	2026-03-07 00:05:43 +08:00
Aaron Teo	ba2ff79e43	ggml: update comments for backends which have no memory to report (#20157 ) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> b8222	2026-03-06 23:24:38 +08:00
shalinib-ibm	c6980ff29d	ggml-cpu: Fix gcc 15 ICE on ppc64le (#20083 ) (#20130 ) This patch addresses an Internal Compiler Error (Segmentation fault) observed with gcc 15 by replacing the intrinsic + cast by doing a cast on the data first and then calling the intrinsic. This bypasses the buggy compiler path while maintaining identical instruction selection. Performance Verification: Assembly analysis on RHEL 9 (GCC 15.1.1) confirms that both the original code and this fix generate the identical Power10 prefixed load instruction: `plxv 40, 2(14)` This ensures zero performance regression while unblocking builds on newer toolchains. Reproduced on: - Alpine Linux + GCC 15.2.0-r2 - RHEL 9 + GCC 15.1.1 (gcc-toolset-15) Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com> b8221	2026-03-06 23:22:39 +08:00
Aman Gupta	1e38a7a6fa	CUDA: use shared mem for ssm_conv (#20128 ) * CUDA: use shared mem for ssm_conv * fuse silu + ssm_conv * fuse unary + mul * enable for fp16 * formatting Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> b8220	2026-03-06 23:09:59 +08:00
Tim Neumann	388baabc06	context: ignore zero scale LoRAs when checking sameness (#20166 ) b8219	2026-03-06 15:05:52 +02:00
Piotr Wilkin (ilintar)	f5ddcd1696	Checkpoint every n tokens: squash (#20087 ) b8218	2026-03-06 11:39:26 +01:00
Aleksander Grygier	f6235a41ef	webui: Agentic Loop + MCP Client with support for Tools, Resources and Prompts (#18655 )	2026-03-06 10:00:39 +01:00
Johannes Gäßler	2850bc6a13	ggml-cpu: fix data race for debug asserts (#20148 ) b8216	2026-03-06 09:12:49 +01:00
Georgi Gerganov	17a4258946	kv-cache : fix M-RoPE checkpoints (#20132 ) b8215	2026-03-06 08:46:51 +02:00
Roj234	f7db3f3789	cli : Don't clear system prompt when using '/clear' (#20067 ) * Enhance /clear command to include system prompt Add system prompt to messages when clearing chat history. * Use lambda b8214	2026-03-06 06:41:11 +01:00
lhez	6c97bffd65	opencl: add neg, exp and diag (#20127 ) * opencl: add `neg` * opencl: add `exp` * opencl: add `diag` b8213	2026-03-05 21:16:39 -08:00
YardenTal44	2b10b62677	hexagon: add fp16 support for binary ops: add,sub,mul,div (#20139 ) * hexagon: add fp16 support for binary ops: add,sub,mul,div * hexagon: fix test-backend-ops failures for fp16 binary ops on older arches (<v79) * hexagon: decide on n_threads (aka n_jobs) early to avoid overallocating scratchpad * snapdragon: fix readme link --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com> b8212	2026-03-05 18:29:13 -08:00
ymcki	a0ed91a442	models : kda chunk size = 16 (#19827 ) * models : add llm_build_delta_net_base * cont : keep qwen35 and qwen35moe graphs intact * cont : add comments [no ci] * add kimi linear to delta-net-base * removed unnecessary ggml_cont from g_exp_t * removed ggml_cont from g_diff_exp_t. moved ggml_cont for o to kimi-linear.cpp * removed unnecessary diag mask * cont : simplify * cont : avoid graph splits * scale q after mul instead of beginning * scale q after mul instead of beginning * identical ppl * cont : fix scale and decay mask * minor : remove TODO * block implementation for kda * remove space at the end of line 101 * concat+pad * pad+binary row concat * chunk size 16 for kda * removed minor differences to master --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-05 17:01:23 +02:00
Andreas Kieslinger	2cd20b72ed	CUDA: Improve performance via less synchronizations between token (#17795 ) * Adds CPU-to-CUDA copy capability to ggml_backend_cuda_cpy_tensor_async() * Adds function to relax sync requirements between input copies on supported backends (CUDA for now) * Exchanges synchronous copy with async copy function. * Adds macro guards to allow compilation in non-CUDA builds * Reworked backend detection in ggml-backend.cpp to avoid linking conflicts * Relax requirement of checks in async CUDA copies from backend and buffer type to just buffer type, to avoid linking issues * Minor cleanup * Makes opt-in to relax use of explicit syncs more general. Backends like vulkan which require a synchronization between HtoD copies and graph execution could also adopt this change now. * Reintroduces stricter check for CPU->CUDA backend async copy via GGML_DEVICE_TYPE_CPU. * Corrects initialization of ggml_backend_sync_mode in ggml_backend_sched_split initialization * Simplifies synchronizations to adhere to `saaasg` pattern. * Apply suggestion from @ggerganov (src->buffer to buf_src) Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Apply suggestion from @ggerganov (src->buffer to buf_src) v2 Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b8210	2026-03-05 13:53:21 +02:00
Eric Zhang	872646b30c	model : update Qwen3.5 model type detection (#20126 ) * model : fix Qwen3.5 model type detection * Update src/llama-model.cpp whoops, my bad Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b8209	2026-03-05 12:47:14 +01:00
Sigbjørn Skjæret	b5ed0e058c	cli : add command and file auto-completion (#19985 ) b8208	2026-03-05 10:47:28 +01:00
Sigbjørn Skjæret	cf232515c9	convert : register Qwen 3.5 ForCausalLM for text only (#20119 )	2026-03-05 10:30:02 +01:00
Aleksander Grygier	5e335ba113	webui: Improvements for Models Selector UI (#20066 )	2026-03-05 08:52:22 +01:00
Marcel Petrick	92f7da00b4	chore : correct typos [no ci] (#20041 ) * fix(docs): correct typos found during code review Non-functional changes only: - Fixed minor spelling mistakes in comments - Corrected typos in user-facing strings - No variables, logic, or functional code was modified. Signed-off-by: Marcel Petrick <mail@marcelpetrick.it> * Update docs/backend/CANN.md Co-authored-by: Aaron Teo <taronaeo@gmail.com> * Revert "Auxiliary commit to revert individual files from 846d1c301281178efbc6ce6060ad34c1ebe45af8" This reverts commit 02fcf0c7db661d5ff3eff96b2b2db9fdb7213256. * Update tests/test-backend-ops.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-backend-ops.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Signed-off-by: Marcel Petrick <mail@marcelpetrick.it> Co-authored-by: Aaron Teo <taronaeo@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-05 08:50:21 +01:00
Max Krasnyansky	7a99dc85e2	hexagon: Flash Attention optimizations (dma, mpyacc, multi-row) and MatMul updates (#20118 ) * ggml-hexagon: enhance hvx_dot_f16_f16_aa_rx4 for improved performance by expanding vector handling and optimizing accumulation # Conflicts: # ggml/src/ggml-hexagon/htp/flash-attn-ops.c * ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx4 and enhance hvx_vec_reduce_sum_f32x4 for improved performance and reduced complexity * ggml-hexagon: add hvx_dot_f16_f16_aa_rx32 for enhanced vector processing in flash attention # Conflicts: # ggml/src/ggml-hexagon/htp/flash-attn-ops.c * optimize hvx_dot_f16_f16_aa_rx4 and hvx_dot_f16_f16_aa_rx32 by removing unused scale parameter and improving vector accumulation # Conflicts: # ggml/src/ggml-hexagon/htp/flash-attn-ops.c * ggml-hexagon: refactor hvx_dot_f16_f16_aa_rx4 for improved readability and return HVX_Vector for better integration # Conflicts: # ggml/src/ggml-hexagon/htp/flash-attn-ops.c * ggml-hexagon: initialize sums variable in hvx_dot_f16_f16_aa_rx32 for clarity * ggml-hexagon: fix compiling error * fix hvx_dot_f16_f16_aa_rx4 to handle leftover elements correctly using masking * refactor hvx_dot_f16_f16_aa_rx4 to accept vector and leftover element counts as parameters for improved clarity and flexibility * wip * fa: instrumentation and dma reordering * hex-fa: use block-size 64 to improve DMA pipelining * hex-fa: optimize vec-dot for v79 and above * hex-fa: use block size 64 * hex-fa: avoid scalar fp32->fp16 conversions * hex-fa: simplify dot_f16 functions using optimized vec_mpyacc * hex-fa: rewrite mad_f32_f16 using hvx_vec_mpyacc * hex-mm: use mpyacc in matmul dot functions --------- Co-authored-by: chraac <chraac@gmail.com> b8204	2026-03-04 21:55:29 -08:00
lhez	69fd345335	opencl: add `SET`, support i32 for `CPY`, minor refactor for cpy (#20101 ) b8203	2026-03-04 21:32:26 -08:00
Todor Boinovski	1a29907d2e	hexagon: add llama-completion runner script (#20095 ) b8202	2026-03-04 15:04:59 -08:00
Nikhil Jain	24d2ee0527	[WebGPU] Fix wait logic for inflight jobs (#20096 ) * Enable tmate debugging for investigating thread safety issue * Refactor wait and submit to operate on vector<wgpu::FutureWaitInfo>, and fix wait to delete only the future that is completed. * Cleanup * Remove clear change and run clang-format * Cleanup b8201	2026-03-04 11:54:55 -08:00
Masashi Yoshimura	541bf37622	Add concat op to webgpu. (#20068 ) b8200	2026-03-04 11:19:00 -08:00
Sigbjørn Skjæret	d969e933e1	tools : add missing clocale include in mtmd-cli [no ci] (#20107 )	2026-03-04 14:18:04 +01:00
Johannes Gäßler	7f5ee54968	ggml: fix ggml_is_contiguous_n for ne == 1 (#20092 ) b8198	2026-03-04 12:04:31 +01:00
Adrien Gallouët	66199c9f03	ggml : use a simple std::thread in AMX without OpenMP (#20074 ) Disabling OpenMP generally provides better inference performance (at least in my testing) but the loading becomes slightly slower. Benchmark results for `convert_B_packed_format()`: Before this commit: N K \| No OpenMP OpenMP \| Diff \| Speedup ------------------------------------------------------------ 512 2880 \| 640.9us 263.5us \| -58.9% \| 0.41x 2880 4096 \| 2.55ms 261.7us \| -89.8% \| 0.10x 201088 2880 \| 256.44ms 21.61ms \| -91.6% \| 0.08x ------------------------------------------------------------ Total: 325.43ms vs 31.05ms After: N K \| No OpenMP OpenMP \| Diff \| Speedup ------------------------------------------------------------ 512 2880 \| 1.49ms 263.5us \| -82.3% \| 0.18x 2880 4096 \| 1.55ms 261.7us \| -83.1% \| 0.17x 201088 2880 \| 24.03ms 21.61ms \| -10.1% \| 0.90x ------------------------------------------------------------ Total: 78.97ms vs 31.05ms Tested with unsloth/gpt-oss-20b-GGUF:Q4_K_M. Signed-off-by: Adrien Gallouët <angt@huggingface.co> b8197	2026-03-04 11:57:09 +01:00
ddh0	c99909dd0b	impl : use 6 digits for tensor dims (#20094 ) Many models have vocabulary sizes, and thus tensor shapes, with more than 5 digits (ex: Gemma 3's vocab size is 262,208). I already fixed this for `llama_format_tensor_shape` but missed it for `llama_format_tensor_shape` until now. Oops. b8196	2026-03-04 09:53:38 +01:00
SamareshSingh	cb8f4fa3f8	Fix locale-dependent float printing in GGUF metadata (#17331 ) * Set C locale for consistent float formatting across all binaries. * Add C locale setting to all tools binaries Add std::setlocale(LC_NUMERIC, "C") to all 16 binaries in the tools/ directory to ensure consistent floating-point formatting. * Apply suggestion from @JohannesGaessler --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> b8195	2026-03-04 09:30:40 +01:00
standby24x7	54910bd4f3	completion : Fix a typo in warning message (#20082 ) resuse -> reuse b8194	2026-03-04 06:44:49 +01:00
Mickael Desgranges	ecd99d6a9a	docs: Fix intel documentation link (#20040 ) b8193	2026-03-03 21:50:00 +08:00
Charles Xu	137435ff15	kleidiai : add sme fp16 compute path for q4_0 gemm on aarch64 (#20043 ) b8192	2026-03-03 11:40:26 +02:00
shaofeiqi	24350fdf9b	opencl: add optimized q4_1 mm kernel for adreno (#19840 ) * Add Q4_1 OpenCL Kernels * opencl: refactor transpose * opencl: format * opencl: refactor q4_1 unpack * opencl: move `ggml_cl_mul_mat_q4_1_f32_adreno` * opencl: refactor `ggml_cl_mul_mat_q4_1_f32_adreno` and kernels * opencl: rename kernel files and kernels * opencl: fix build for non adreno * opencl: move code around and format --------- Co-authored-by: Li He <lih@qti.qualcomm.com> b8191	2026-03-02 19:49:41 -08:00
Abhijit Ramesh	49a7564ac1	ggml webgpu: fix workgroup dispatch limit for large batch sizes (#19965 ) * ggml-webgpu: fix workgroup dispatch limit for large batch sizes WebGPU limits workgroup sizes to 65535 per dimension. Large MUL_MAT operations with batch sizes exceeding this limit would fail. * add compute_2d_workgroups() helper to split total workgroup ID across X/Y dimensions * update mul_mat_reg_tile.wgsl to reconstruct linear workgroup ID from 2D dispatch * update mul_mat_subgroup_matrix.wgsl to reconstruct linear workgroup ID from 2D dispatch * update mul_mat.wgsl to compute global index from 2D workgroup coordinates * refactor all three mul_mat dispatch paths to use the shared helper * ggml-webgpu: add bounds checking for over-dispatched workgroups 2D workgroup dispatch can over-dispatch when total workgroups don't divide evenly into the 65535 per-dimension limit. Extra workgroups would compute invalid batch indices, causing memory corruption. * add batch_idx bound check to mul_mat_reg_tile.wgsl and mul_mat_subgroup_matrix.wgsl to prevent over-dispatched workgroups from accessing invalid memory * fixes test failures with large batch sizes (e.g., bs=[128, 1024]) * ggml-webgpu: add back TODO for splitting large sizes into batches * Optimize 2d workgroup provisioning * Set some parameters that increase speed --------- Co-authored-by: Reese Levine <reeselevine1@gmail.com> b8190	2026-03-02 19:35:11 -08:00
Nikhil Jain	4d828bd1ab	ggml webgpu: Clean up per-thread parameter buffer pool and job submission logic (#19772 ) * Allow webgpu_buf_pool to resize if needed, remove inflight_threads, and replace inflight_threads with num_kernels for submission * Run clang-format * Keep track of num batched kernels that have not been submitted yet * Run clang-format * Increase buf pool max size * Increase param buf pool init size * Remove webgpu buf pool resizing * Merge with master * Add buffer pool growth * Move buffer pool growth outside of lock * Reduce max pool size to 32 * Run clang-format * Only resize param buf pool b8189	2026-03-02 10:23:34 -08:00
Masashi Yoshimura	36a7a6589c	ggml-webgpu: Support non-contiguous `src0` and overlapping `src0/src1` in binary ops (#19850 ) * ggml-webgpu: Add binary op support for overlapping and non-contiguous. * Add newline to binary.wgsl * Append the test of binary op for src overlapping to test_bin_bcast. * Remove unnecessary newline. b8188	2026-03-02 07:59:53 -08:00
Ruben Ortlam	feefb92836	vulkan: tune MMVQ for Intel Windows (#19988 ) b8187	2026-03-02 15:58:25 +01:00
Adrien Gallouët	ec88c3ceea	scripts : improve get-wikitext-2.sh (#19952 ) * scripts : improve get-wikitext-2.sh Switch to sh, add curl fallback, and avoid redundant downloads Signed-off-by: Adrien Gallouët <adrien@gallouet.fr> * fix indent Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <adrien@gallouet.fr> Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-02 15:40:49 +01:00

8235 Commits