llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-05-01 22:54:05 +00:00

Author	SHA1	Message	Date
Daniel Bevenius	20ca2e12c4	model-conversion : remove -c 0 from model card template [no ci] (#18807 ) This commit removes the `-c, --ctx-size N` from the llama-server command in the model card template for causal models. The motivation for this is that -c 0 is the default and specifying it is redundant.	2026-01-13 14:13:10 +01:00
yulo	ea4a321f2a	HIP: add fattn-mma-f16 for RDNA4 (#18481 ) * finish VQ mma * flash_attn_ext_f16_iter * KQ_rowsum * correct exp * fix scale error * fix softmax scale * fix softmax scale * enable fattn on cpu side * fix random error * disable fattn-mma-f16 on rdna3 * fix wrong col for rdna * use identity mat to transpose * resolve conflicts * basic tuning for DeepSeek-R1-Distill-Qwen-1.5B * fix volta compile error * align rdna4 policy for fattn * adjust fattn policy * adjust kernel selection logic * update as the review comments * keep fattn-wmma logic * adjust kernel selection logic --------- Co-authored-by: zhang hui <you@example.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> b7723	2026-01-13 13:52:16 +01:00
Johannes Gäßler	c1e79e610f	doc: ban AI-generated PR descriptions [no ci] (#18765 )	2026-01-13 13:43:12 +01:00
Xuan-Son Nguyen	e047f9ee9d	mtmd: fix use_non_causal being reported incorrectly (#18793 ) * mtmd: fix use_non_causal being reported incorrectly * move clip_is_mrope to mtmd_decode_use_mrope * fix sloppy code ggml_cpy b7721	2026-01-13 12:19:38 +01:00
Georgi Gerganov	0a57271ab6	CUDA : fix unused argument when USE_CUDA_GRAPH=OFF (#18800 ) b7720	2026-01-13 12:25:53 +02:00
Gabe Goodhart	076b0faf7d	graph : clean up t5 input builders (#18795 ) * fix: Remove unnecessary `h` loops where `h` was only ever 0 Branch: CleanUpT5InputBuilders Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unnecessary padding loop that is never hit anymore The upper bound used to use GGML_PAD(n_tokens, GGML_KQ_MASK_PAD), but was removed in https://github.com/ggml-org/llama.cpp/pull/17910 leaving the loop dead. Branch: CleanUpT5InputBuilders Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> b7719	2026-01-13 09:43:51 +01:00
Ruben Ortlam	db79dc06b1	llama-bench: add direct_io parameter (#18778 ) b7718	2026-01-13 08:49:10 +01:00
Adrien Gallouët	537d4240d4	ci : remove libcurl in releases (#18775 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co> b7717	2026-01-12 21:43:02 +01:00
Radoslav Gerganov	bcf7546160	server : add arg for disabling prompt caching (#18776 ) * server : add arg for disabling prompt caching Disabling prompt caching is useful for clients who are restricted to sending only OpenAI-compat requests and want deterministic responses. * address review comments * address review comments b7716	2026-01-12 19:21:34 +02:00
Adrien Gallouët	36c5913c45	ci : use openssl for openEuler-latest-cmake-cann (#18779 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-01-12 17:29:00 +01:00
Adrien Gallouët	8e649571cd	vendor : update cpp-httplib to 0.30.1 (#18771 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co> b7714	2026-01-12 15:58:52 +01:00
Daniel Bevenius	4150da9a95	examples : add --kv-unified to batched example (#18774 ) This commit adds the --kv-unified flag to the batched example. This flag is currently specified in the README.md as required, but is currently not available as a command line option for the batched example. The motivation for this is that specifying this flag as the README instructs, will lead to an error about the flag not being recognized, and without this option the example fail with the following error: ```console split_equal: sequential split is not supported when there are coupled sequences in the input batch (you may need to use the -kvu flag) decode: failed to find a memory slot for batch of size 4 main: llama_decode() failed ``` b7713	2026-01-12 13:47:58 +01:00
Jeff Bolz	8e2da778da	vulkan: change memory_logger to be controlled by an env var (#18769 ) b7712	2026-01-12 13:32:55 +01:00
Xuan-Son Nguyen	ce3bf9b1a4	server: update docs for sleeping [no ci] (#18777 )	2026-01-12 13:01:24 +01:00
Jeff Bolz	2bbe4c2cf8	vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (#18678 ) This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128. This should work when the number of blocks in the A matrix is less than 2^32 (for mul_mat_vec or mul_mm_cm2), or for mul_mm I think the limit is like 2^32*LOAD_VEC_A elements. - Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b. - Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle variants. So far this change just adds a single use case for this, compiling with the e64BitIndexingEXT flag. - Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange. 64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort to avoid enabling it unconditionally. b7710	2026-01-12 12:32:13 +01:00
Ruben Ortlam	1051ecd289	vulkan: Disable large coopmat matmul configuration on proprietary AMD driver (#18763 ) * vulkan: Disable large coopmat matmul configuration on proprietary AMD driver * Also disable the large tile size b7709	2026-01-12 07:29:35 +01:00
Xuan-Son Nguyen	0c3b7a9efe	model: fix qwen3next broken due to #18683 (#18762 ) b7708	2026-01-11 21:00:10 +01:00
Ruben Ortlam	0e76501e1d	Vulkan: Optimize Matmul parameters for AMD GPUs with Coopmat support (#18749 ) * vulkan: Enable and optimize large matmul parameter combination for AMD * limit tuning to AMD GPUs with coopmat support * use tx_m values instead of _l b7707	2026-01-11 17:33:33 +01:00
Xuan-Son Nguyen	4b060bf240	security: make it clear about subtopics in server (#18754 ) * security: make it clear about subtopics in server * exclude DoS	2026-01-11 16:51:03 +01:00
Daniel Bevenius	9789e28459	debug : include LLAMA_POOLING_TYPE_UNSPECIFIED in pooling check (#18692 ) * debug : include LLAMA_POOLING_TYPE_UNSPECIFIED in pooling check This commit updates the pooling check in the debug example to also include LLAMA_POOLING_TYPE_UNSPECIFIED and not just LLAMA_POOLING_TYPE_NONE. * debug : normalize both pooled and token embeddings This commit updates debug.cpp to normalize embeddings for both pooled and non-pooled outputs. For pooled embeddings, normalization is applied to the single vector, and for non-pooled embeddings, normalization is applied to each token embedding vector individually. The motivation for this is to enable non-pooled embeddings to be normalized which was not possible previously. b7705	2026-01-11 16:34:41 +01:00
Georgi Gerganov	84ae04f163	tests : refactor test-backend-sampler (#18753 ) * tests : use "auto", use std::string * tests : refactor test-backend-sampler.cpp * cmake : remove redundant declarations * ci : use smaller model * tests : add struct test_params * tests : reduce logit bias 100.0f -> 10.0f b7704	2026-01-11 17:31:03 +02:00
Xuan-Son Nguyen	506bb6e010	model: try to improve Qwen3 Next (#18683 ) * qwen3next: simplify qkvz projection * use ggml_swiglu_split * revert swiglu_split, but remove redundant repeat() * fix missing reshape * rm 2 redundant transposes * move mul_mat(k,q) to outside of chunking * rm redundant cont * improve g_cs_chunk * add comments about no cont * use std::pair instead of ggml_concat * vectorize key_gdiff calculation * rm unused tensor * avoid ggml_concat inside loop * bring back ggml_concat as it may not work on other backend * nits b7703	2026-01-11 12:53:33 +01:00
thom-dev-fr	79456a690a	readme : update UIs (#18751 )	2026-01-11 13:46:50 +02:00
Xuan-Son Nguyen	28068af789	security: narrow down the scope of what we consider a vulnerability (#18752 ) * security: narrow down the scope of what we consider a vulnerability * fix typo	2026-01-11 12:23:36 +01:00
shaofeiqi	707cbafcaa	opencl: add SOFTPLUS op support (#18726 ) b7700	2026-01-10 21:57:44 -08:00
Aman Gupta	b137718878	test-backend-ops: fix mxfp4 tests on blackwell (#18736 ) b7699	2026-01-11 01:12:57 +08:00
Johannes Gäßler	d2ff4e23ac	HIP: adjust RDNA3.5 MMQ kernel selction logic (#18666 ) b7698	2026-01-10 17:19:01 +01:00
Perry Naseck	657a2e644b	cmake : update blas logic (#18205 ) b7697	2026-01-10 18:00:54 +02:00
Georgi Gerganov	f307926482	server : adjust unified KV cache tests (#18716 )	2026-01-10 17:51:56 +02:00
Sigbjørn Skjæret	7fdc8c893d	scripts : follow api redirects in pr2wt.sh (#18739 )	2026-01-10 16:04:05 +01:00
Xuan-Son Nguyen	23f82f2420	preset: allow named remote preset (#18728 ) * preset: allow named remote preset * nits: fix docs * cont docs b7694	2026-01-10 15:12:29 +01:00
Aaron Teo	2656c0d265	docs(ggml): update backend ops (#18734 ) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2026-01-10 18:48:17 +08:00
Michael Wand	600a366478	Corrected: changed s13 = src1->nb[3] instead of nb[2] (#18724 ) b7692	2026-01-10 10:16:07 +01:00
Adrien Gallouët	ea23c15990	common : add --license to display embedded licenses (#18696 ) This commit introduces a mechanism to embed all licenses directly into the compiled binaries. This eliminates the need to distribute separate LICENSE files alongside the executable, making the binaries self-contained and simplifying deployment. b7691	2026-01-10 09:46:24 +01:00
Xuan-Son Nguyen	9ac2693a30	server: fix n_cmpl not skipping processing prompt (#18663 ) * server: fix n_cmpl not skipping processing * fix infinite loop on empty batch * cont : init child samplers + modify child logic * cont : cleanup * cont : improve n_cmpl logic - launch the parent task first so it finds the slot with best cache - parent task waits for child tasks to be launched - when a child task finishes - remove its cache * cont : remove redundant function * cont : reduce parent checks * fix : nullptr task dereference --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b7690	2026-01-10 00:00:41 +01:00
Simranjeet Singh	a61c8bc3bf	mtmd: Add Gemma3n multimodal support with MobileNetV5 vision encoder (#18256 ) * Add Gemma3nVisionModel - MobileNetV5 vision encoder convertor to convert_hf_to_gguf.py. Add gemma3n to vision projectors in gguf-py/gguf/constants.py. * Add mobilenetv5 impl * Fix comments, remove unused vars * Fix permute and remove transpose of projection weights * Fix comments, remove debugging prints from hf_to_gguf * 1. Hard-code image_mean = 0 and image_std = 1 2. Use available tensor mapping logic 3. Remove redundant chat template replacement of soft tokens placeholder with media placeholder * 1. Move mobilenetv5 helpers declarations to `clip_graph_mobilenetv5` struct and definitions to mobilenetv5.cpp 2.Remove unused `clip_is_gemma3n` func declarations and definitions 3. Remove redundant `rescale_image_u8_to_f32` func and use `normalize_image_u8_to_f32` with zero mean and unit std 4. Calculate n_patches using image_size / patch_size * Remove obsolete comments * - convert_hf_to_gguf.py & constants.py & tensor_mapping.py: Use explicit mapping: Custom map for double indexed blocks and tensor_mapping.py for rest - convert_hf_to_gguf.py: Unsqueeze Stem Bias and Layer scale tensors to correct shape while converting to gguf - mobilenetv5.cpp: Remove explicit reshaping of Stem Bias and Layer scale which are now handled while converting to gguf, replace fprintf with LOG_* - clip.cpp: Remove unused embedding and hard_emb_norm tensor loading * - Rename tensors to v.conv..., v.blk..., v.msfa... 
to better align with already existing terminology * Fix stem conv bias name * Remove explicit handling of bias term for stem conv * - Change order of addition in "project_per_layer_inputs" to support broadcasting of vision inp_per_layer - Simplify the vision embeddings path of "get_per_layer_inputs" to output [n_embd_altup, n_layer, 1], broadcastable * clean up conversion script * fix code style * also preserve audio tensors * trailing space * split arch A and V * rm unused gemma3 func * fix alignment --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> b7689	2026-01-09 23:42:38 +01:00
shaofeiqi	593da7fa49	opencl: add EXPM1 op (#18704 ) b7688	2026-01-09 10:13:13 -08:00
Reese Levine	9e41884dce	Updates to webgpu get_memory (#18707 ) b7687	2026-01-09 08:17:18 -08:00
Pascal	ec8fd7876b	Webui/file upload (#18694 ) * webui: fix restrictive file type validation * webui: simplify file processing logic * chore: update webui build output * webui: remove file picker extension whitelist (1/2) * webui: remove file picker extension whitelist (2/2) * chore: update webui build output * refactor: Cleanup * chore: update webui build output * fix: update ChatForm storybook test after removing accept attribute * chore: update webui build output * refactor: more cleanup * chore: update webui build output	2026-01-09 16:45:32 +01:00
Asbjørn Olling	a180ba78c7	cmake: only build cli when server is enabled (#18670 ) b7685	2026-01-09 16:43:26 +01:00
Georgi Gerganov	53eb9435da	server : fix timing of prompt/generation (#18713 ) b7684	2026-01-09 12:59:50 +02:00
Georgi Gerganov	d3435efc8a	scripts : pr2wt.sh reset to remote head (#18695 ) * scripts : pr2wt.sh reset to remote head * cont : cleaner * cont : restore --set-upstream-to	2026-01-09 12:16:40 +02:00
Georgi Gerganov	f5f8812f7c	server : use different seeds for child completions (#18700 ) * server : use different seeds for child completions * cont : handle default seed * cont : note b7682	2026-01-09 09:33:50 +02:00
Xuan-Son Nguyen	8ece3836b4	common: support remote preset (#18520 ) * arg: support remote preset * proof reading * allow one HF repo to point to multiple HF repos * docs: mention about multiple GGUF use case * correct clean_file_name * download: also return HTTP status code * fix case with cache file used * fix --offline option b7681	2026-01-08 22:35:40 +01:00
Aaron Teo	046d5fd44e	llama: use host memory if device reports 0 memory (#18587 ) b7680	2026-01-09 05:34:56 +08:00
Masashi Yoshimura	480160d472	ggml-webgpu: Fix GGML_MEM_ALIGN to 8 for emscripten. (#18628 ) * Fix GGML_MEM_ALIGN to 8 for emscripten. * Add a comment explaining the need for GGML_MEM_ALIGN == 8 in 64-bit wasm with emscripten b7679	2026-01-08 08:36:42 -08:00
Reese Levine	15bff84bf5	ggml webgpu: initial flashattention implementation (#18610 ) * FlashAttention (#13) * Add inplace softmax * Move rms_norm to split row approach * Update debug for supports_op * clean up debug statements * neg f16xf32xip builds and runs, havent actually ran a model that uses neg kernel yet though * neg passes backend test * unary operators pass ggml tests * rms_norm double declaration bug atoned * abides by editor-config * removed vestigial files * fixed autoconfig * All operators (inlcluding xielu) working * removed unnecesarry checking if node->src[1] exists for unary operators * responded and dealt with PR comments * implemented REPL_Template support and removed bug in unary operators kernel * formatted embed wgsl and ggml-webgpu.cpp * Faster tensors (#8) Add fast matrix and matrix/vector multiplication. * Use map for shader replacements instead of pair of strings * Wasm (#9) * webgpu : fix build on emscripten * more debugging stuff * test-backend-ops: force single thread on wasm * fix single-thread case for init_tensor_uniform * use jspi * add pthread * test: remember to set n_thread for cpu backend * Add buffer label and enable dawn-specific toggles to turn off some checks * Intermediate state * Fast working f16/f32 vec4 * Working float fast mul mat * Clean up naming of mul_mat to match logical model, start work on q mul_mat * Setup for subgroup matrix mat mul * Basic working subgroup matrix * Working subgroup matrix tiling * Handle weirder sg matrix sizes (but still % sg matrix size) * Working start to gemv * working f16 accumulation with shared memory staging * Print out available subgroup matrix configurations * Vectorize dst stores for sg matrix shader * Gemv working scalar * Minor set_rows optimization (#4) * updated optimization, fixed errors * non vectorized version now dispatches one thread per element * Simplify * Change logic for set_rows pipelines --------- Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan> Co-authored-by: 
Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local> Co-authored-by: Reese Levine <reeselevine1@gmail.com> * Comment on dawn toggles * Working subgroup matrix code for (semi)generic sizes * Remove some comments * Cleanup code * Update dawn version and move to portable subgroup size * Try to fix new dawn release * Update subgroup size comment * Only check for subgroup matrix configs if they are supported * Add toggles for subgroup matrix/f16 support on nvidia+vulkan * Make row/col naming consistent * Refactor shared memory loading * Move sg matrix stores to correct file * Working q4_0 * Formatting * Work with emscripten builds * Fix test-backend-ops emscripten for f16/quantized types * Use emscripten memory64 to support get_memory * Add build flags and try ci --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> * Remove extra whitespace * Move wasm single-thread logic out of test-backend-ops for cpu backend * Disable multiple threads for emscripten single-thread builds in ggml_graph_plan * Refactored pipelines and workgroup calculations (#10) * refactored pipelines * refactored workgroup calculation * removed commented out block of prior maps * Clean up ceiling division pattern --------- Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu> Co-authored-by: Reese Levine <reeselevine1@gmail.com> * Start work on flash attention * Shader structure set up (many bugs still) * debugging * Working first test * Working with head grouping, head sizes to 128, logit softcap, mask/sinks enabled, f32 * Generalize softmax to work with multiple subgroups, f16 accumulation, mask shared memory tiling * Start work on integrating pre-wgsl * Separate structs/initial shader compilation library into separate files * Work on compilation choices for flashattention * Work on subgroup matrix/tile size portability * subgroup size agnostic online softmax * Cleanups, quantization types * more cleanup * fix wasm build * Refactor flashattention to increase parallelism, 
use direct loads for KV in somce cases * Checkpoint * formatting * Update to account for default kv cache padding * formatting shader * Add workflow for ggml-ci webgpu * Try passing absolute path to dawn in ggml-ci * Avoid error on device destruction, add todos for proper cleanup * Fix unused warning * Forgot one parameter unused * Move some flashattn computation to f32 for correctness b7678	2026-01-08 08:23:39 -08:00
Jeff Bolz	2524c26164	vulkan: fix push constant size for quantize_q8_1 (#18687 ) I added an assert to catch further mismatches, and it found several. Fix those, too. b7677	2026-01-08 15:40:58 +01:00
Jeff Bolz	cb14b06995	vulkan: optimize ssm_scan (#18630 ) * vulkan: optimize ssm_scan * fix warp vs subgroup naming b7676	2026-01-08 15:16:54 +01:00
Adrien Gallouët	55abc39355	vendor : update cpp-httplib to 0.30.0 (#18660 ) * vendor : update cpp-httplib to 0.30.0 * common : allow custom headers when downloading b7675	2026-01-08 13:53:54 +01:00

7724 Commits