llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-05-01 22:54:05 +00:00

Author	SHA1	Message	Date
Xuan-Son Nguyen	eeef3cfced	model: support GLM-OCR (#19677 ) * model: support GLM-OCR * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b8093	2026-02-18 17:51:40 +01:00
Maciej Lisowski	e99f1083a0	docs: Fix broken links for preparing models in Backends (#19684 )	2026-02-18 23:50:23 +08:00
Reese Levine	238856ec8f	ggml webgpu: shader library organization (#19530 ) * Basic JIT compilation for mul_mat, get_rows, and scale (#17) * scale jit working * preliminary working jit for getrows and mulmat, needs refining * simplified mul_mat preprocessing switch statement * get_rows fixes, mul_mat refinement * formatted + last edits * removed some extraneous prints * fixed get_rows, fixed workgroup dispatch in mul_mat. no gibberish * small fix * some changes, working * get_rows and mul_mat jit fixed and working * Update formatting * formatting * Add header --------- Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local> Co-authored-by: Reese Levine <reeselevine1@gmail.com> * Start work on all-encompassing shader library * refactor argmax, set_rows * Refactor all but flashattention, mat mul * flashattention and matrix multiplication moved to new format * clean up preprocessing * Formatting * remove duplicate constants * Split large shaders into multiple static strings --------- Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com> b8091	2026-02-18 07:51:02 -07:00
Aleksander Grygier	ea003229d3	Pre-MCP UI and architecture cleanup (#19689 )	2026-02-18 12:02:02 +01:00
Jeff Bolz	d0061be838	vulkan: split mul_mat into multiple dispatches to avoid overflow (#19509 ) * vulkan: split mul_mat into multiple dispatches to avoid overflow The batch dimensions can be greater than the max workgroup count limit, in which case we need to split into multiple dispatches and pass the base index through a push constant. Fall back for the less common p021 and nc variants. * address feedback b8089	2026-02-18 10:47:10 +01:00
Adrien Gallouët	a569bda445	common : make small string helpers as inline functions (#19693 ) Also use string_view when it makes sense and fix some corner cases. Signed-off-by: Adrien Gallouët <angt@huggingface.co> b8088	2026-02-18 08:03:01 +01:00
shaofeiqi	e2f19b320f	opencl: refactor expm1 and softplus (#19404 ) * opencl: refactor expm1 * opencl: refactor softplus * opencl: use h for half literals --------- Co-authored-by: Li He <lih@qti.qualcomm.com> b8087	2026-02-17 14:47:18 -08:00
shaofeiqi	983559d24b	opencl: optimize mean and sum_row kernels (#19614 ) * opencl: optimize mean and sum_row kernels * opencl: add comment for max subgroups * opencl: format --------- Co-authored-by: Li He <lih@qti.qualcomm.com> b8086	2026-02-17 13:56:09 -08:00
Daniel Bevenius	2b089c7758	model-conversion : add option to print tensor values (#19692 ) This commit updates the tensor-info.py script to support the option to print the first N values of a tensor when displaying its information. The motivation for this is that it can be useful to inspect some actual values in addition to the shapes of the tensors.	2026-02-17 20:43:22 +01:00
Aleksander Grygier	afa6bfe4f7	Pre-MCP UI and architecture cleanup (#19685 ) * webui: extract non-MCP changes from mcp-mvp review split * webui: extract additional pre-MCP UI and architecture cleanup * chore: update webui build output	2026-02-17 13:47:45 +01:00
Talha Can Havadar	ae2d3f28a8	ggml: ggml-cpu: force-no-lto-for-cpu-feats (#19609 ) When LTO is enabled in build environments, it forces all builds to have LTO in place. But the feature detection logic is fragile and causes illegal instruction errors with LTO. This disables LTO for the feature detection code to prevent cross-module optimization from inlining architecture-specific instructions into the score function. Without this, LTO can cause SIGILL when loading backends on older CPUs (e.g., loading power10 backend on power9 crashes before feature check runs). b8083	2026-02-17 13:22:46 +02:00
Georgi Gerganov	ad8207af77	cuda : enable CUDA graphs for MMID 1 <= BS <= 4 (#19645 ) * cuda : enable CUDA graphs for MMID BS <= 4 * cont : add stream capture check Co-authored-by: Oliver Simons <osimons@nvidia.com> * cont : add MMVQ_MMID_MAX_BATCH_SIZE --------- Co-authored-by: Oliver Simons <osimons@nvidia.com> b8082	2026-02-17 12:31:49 +02:00
Daniel Bevenius	667b694278	model-conversion : make printing of config values optional (#19681 ) * model-conversion : make printing of config values optional This commit updates run-org-model.py to make the printing of model configuration values optional. The motivation for this change is that not all models have these configuration values defined and those that do not will error when running this script. With these changes we only print the values if they exist or a default value. We could optionally just remove them but it can be useful to see these values when running the original model.	2026-02-17 10:46:53 +01:00
Sigbjørn Skjæret	e48349a49d	ci : bump komac version (#19682 )	2026-02-17 09:30:31 +01:00
Adrien Gallouët	ae46a61e41	build : link ws2_32 as PUBLIC on Windows (#19666 ) Signed-off-by: Adrien Gallouët <adrien@gallouet.fr> b8079	2026-02-17 08:37:07 +01:00
Adrien Gallouët	65cede7c70	build : cleanup library linking logic (#19665 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co> b8078	2026-02-17 08:36:45 +01:00
DAN™	05fa625eac	convert : add JoyAI-LLM-Flash (#19651 ) * convert_hf_to_gguf: add JoyAI-LLM-Flash tokenizer hash mapping to deepseek-v3 * llama-vocab: create a new pre-tokenizer name for joyai-llm. * add missing vocab type section * Update convert_hf_to_gguf_update.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b8077	2026-02-16 22:49:57 +01:00
AesSedai	d612901116	perplexity: add proper batching (#19661 ) b8076	2026-02-16 18:44:44 +02:00
Ivan Chikish	cceb1b4e33	common : inline functions (#18639 ) b8075	2026-02-16 17:52:24 +02:00
Judd	d23a55997d	ggml : make `ggml_is_view` as API (#19539 ) * make `ggml_is_view` as API * introduce `ggml_aux_is_view` as inline version for internal use. * change `ggml_aux_is_view` to `ggml_impl_is_view` b8074	2026-02-16 17:43:34 +02:00
Saurabh Dash	5f28c53d11	model: Add support for Tiny Aya Models (#19611 ) * changes for tiny aya * changes to hash * changes to vocab * fix some tokenizer regex edge cases * update comment * add some comments for regex * Apply suggestion from @ngxson --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> b8073	2026-02-16 16:28:46 +01:00
Adrien Gallouët	4408494144	build : rework llama_option_depr to handle LLAMA_CURL (#19658 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co> b8072	2026-02-16 16:06:48 +01:00
Mario Limonciello	2ba9adc093	Adjust workaround for ROCWMMA_FATTN/GFX9 to only newer ROCm versions (#19591 ) Avoids issues with ROCm 6.4.4. Closes: https://github.com/ggml-org/llama.cpp/issues/19580 Fixes: `6845f7f87` ("Add a workaround for compilation with ROCWMMA_FATTN and gfx9 (#19461)") Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> b8071	2026-02-16 14:46:08 +01:00
Georgi Gerganov	cc45f2ada6	models : deduplicate delta-net graphs for Qwen family (#19597 ) * models : add llm_build_delta_net_base * cont : keep qwen35 and qwen35moe graphs intact * cont : add comments b8070	2026-02-16 14:35:04 +02:00
Georgi Gerganov	d5dfc33027	graph : fix KQ mask, lora, cvec reuse checks (#19644 ) * graph : fix KQ mask reuse condition * cont : dedup KQ mask build and can_reuse * cont : fix build * graph : fix adapter check for reuse b8069	2026-02-16 09:21:11 +02:00
abhijain1204fujitsu	267ba5a1d9	ggml: aarch64: Implement SVE in Gemm q4_k 8x8 q8_k Kernel (#19132 ) * Updated repack.cpp * Updated repack.cpp * Updated repack.cpp * Added if condition to support only vector length 256. * Changed the format removed comments and duplicate variable * If SVE 256 not present then was using generic function to compute, hence slowing the performance. So added code if SVE 256 is not present then use NEON code. * Code format change suggestion --------- Co-authored-by: Vithule, Prashant <Prashant.Vithule@fujitsu.com> b8068	2026-02-16 14:38:43 +08:00
Georgi Gerganov	ff4affb4c1	sync : ggml b8067	2026-02-15 22:24:29 +02:00
Georgi Gerganov	55d58599c8	ggml : bump version to 0.9.7 (ggml/1425)	2026-02-15 22:24:29 +02:00
Georgi Gerganov	1a8c700bfd	ggml : bump version to 0.9.6 (ggml/1423)	2026-02-15 22:24:29 +02:00
David Friehs	27b93cbd15	cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization (#19624 ) * cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization - load all 8 int8 for a grid position in one load - calculate signs via popcnt instead of fetching from ksigns table - broadcast signs to drop individual shift/mask * cuda: iq2xxs: simplify sum scaling express `(sum * scale + sum / 2) / 4` as `(sum * (scale * 2 + 1)) / 8` express `((aux32 >> 28) * 2 + 1)` as `(aux32 >> 27 | 1)` saves 3 registers for mul_mat_vec_q (152 -> 149) according to nsight AFAICT no overflow can occur here as iq2xxs values are far too small * uint -> uint32_t error: identifier "uint" is undefined b8064	2026-02-15 22:38:42 +05:30
Aaron Teo	6e67fd2144	docs: update s390x build docs (#19643 )	2026-02-16 00:33:34 +08:00
Adrien Gallouët	9e118b97c4	build : remove LLAMA_HTTPLIB option (#19623 ) This option was introduced as a workaround because cpp-httplib could not build on visionOS. Since it has been fixed and now compiles on all platforms, we can remove it and simplify many things. Signed-off-by: Adrien Gallouët <angt@huggingface.co> b8062	2026-02-15 15:38:50 +01:00
Daniel Bevenius	57088276d4	cmake : check if KleidiAI API has been fetched (#19640 ) This commit addresses a build issue with the KleidiAI backend when building multiple cpu backends. Commit `3a00c98584` ("cmake : fix KleidiAI install target failure with EXCLUDE_FROM_ALL") introduced a change where FetchContent_Populate is called instead of FetchContent_MakeAvailable, where the latter does handle this case (it is idempotent but FetchContent_Populate is not). I missed this during my review and I should not have committed without verifying the CI failure, sorry about that. b8061	2026-02-15 13:59:38 +01:00
Georgi Gerganov	341bc7d23c	context : fix output reorder with backend sampling (#19638 ) b8060	2026-02-15 14:57:40 +02:00
Georgi Gerganov	08e6d914b8	ggml : avoid UB in gemm ukernel (#19642 ) b8059	2026-02-15 14:56:35 +02:00
Aaron Teo	184c694f45	ggml-cpu: optimize ggml_vec_dot_bf16 for s390x (#19399 ) b8058	2026-02-15 18:20:35 +08:00
Aman Gupta	684b36101c	ggml-cpu: FA add GEMM microkernel (#19422 ) * ggml-cpu: FA add GEMM microkernel * add guard for sizeless vector types * fix case where DV % GGML_F32_EPR !=0 * move memset out of the loop * move another memset out of the loop * use RM=4 for arm * simd_gemm: convert everything to int * convert everything to size_t to avoid warnings * fixup * add pragma for ignoring aggressive loop optimizations b8057	2026-02-15 11:09:24 +05:30
SamareshSingh	3a00c98584	cmake : fix KleidiAI install target failure with EXCLUDE_FROM_ALL (#19581 ) * cmake: fix KleidiAI install target failure with EXCLUDE_FROM_ALL Fix for the bug #19501 by adding EXCLUDE_FROM_ALL to FetchContent_Declare. This properly excludes KleidiAI from both build and install targets, preventing install failures when GGML_CPU_KLEIDIAI=ON is used. The KleidiAI source files are still compiled into libggml-cpu.so, preserving all functionality. * addressed code review comments b8056	2026-02-15 06:22:53 +01:00
Sigbjørn Skjæret	079feab9e3	convert : ensure all models handle new experts count (#19621 ) * ensure all models handle new experts count * revert removal for PhiMoeModel, does not inherit from base b8055	2026-02-14 22:22:32 +01:00
Anav Prasad	01d8eaa28d	mtmd : Add Nemotron Nano 12B v2 VL support (#19547 ) * nemotron nano v2 vlm support added * simplified code; addressed reviews * pre-downsample position embeddings during GGUF conversion for fixed input size b8054	2026-02-14 14:07:00 +01:00
Georgi Gerganov	1725e316c1	models : optimize qwen3next graph (#19375 ) * models : optimizing qwen3next graph * cont * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * cont : remove redundant q, g chunking * minor * minor * avoid passing masks around * avoid concats during chunking * naming + shapes * update names and use prefix to disable CUDA graphs b8053	2026-02-14 12:57:36 +02:00
Adrien Gallouët	b7742cf321	ggml : fix GGML_DEBUG with OpenMP (#19599 ) last_graph is only available without OpenMP, but ggml_graph_compute_thread() is called in both cases. Signed-off-by: Adrien Gallouët <angt@huggingface.co> b8052	2026-02-14 11:22:57 +01:00
iMil	badba89320	NetBSD build support (#19589 ) b8051	2026-02-14 09:47:01 +01:00
Aleksander Grygier	baa12f3831	webui: Architecture and UI improvements (#19596 )	2026-02-14 09:06:41 +01:00
agent-enemy-2	2d8015e8a4	llama : update LoRA API. + fix excessive graph reserves (#19280 ) * Refactoring to use new llama_put_adapter_loras * cont : alternative lora API --------- Co-authored-by: Jake Chavis <jakechavis6@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b8049	2026-02-14 10:06:27 +02:00
George	eb145c0753	mmap: Fix Windows handle lifetime (#19598 ) * ggml: added cleanups in ggml_quantize_free Add missing cleanup calls for IQ2_S, IQ1_M quantization types and IQ3XS with 512 blocks during quantization cleanup. * mmap: Fix Windows handle lifetime Move hMapping from local variable to member variable so it stays alive for the entire lifetime of the mapping. The file mapping handle must remain valid until UnmapViewOfFile is called. Fixes cleanup order in destructor. * Update llama-mmap.cpp * Update llama-mmap.cpp Remove trailing whitespace from line 567 b8048	2026-02-14 10:05:12 +02:00
Georgi Gerganov	6e473fb384	metal : fix ACC op (#19427 ) b8047	2026-02-14 09:54:03 +02:00
Adrien Gallouët	c7db95f106	scripts : use official split.py for cpp-httplib (#19588 ) * scripts : use official split.py for cpp-httplib Using the official script is safer and ensures the generated code aligns with the library's standards. Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Catch generic errors Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Allow print() Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Ensure robust cleanup Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co> b8046	2026-02-14 08:41:16 +01:00
Sigbjørn Skjæret	0d00ef65ed	convert : store ffn_gate_inp_shexp as F32 (#19606 )	2026-02-14 08:17:43 +01:00
Adrien Gallouët	91ea5d67f2	build : fix libtool call in build-xcframework.sh (#19605 ) Run libtool via xcrun like strip and dsymutil, to have proper tool resolution. Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-02-14 06:48:37 +01:00


8093 Commits