llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-05-01 22:54:05 +00:00

Author	SHA1	Message	Date
copilot-swe-agent[bot]	4943e3a396	gen-libllama-abi: compile sort-key regex once outside the lambda Agent-Logs-Url: https://github.com/ggml-org/llama.cpp/sessions/cd21903e-afd2-477a-8285-0a2d46e1398c Co-authored-by: ggerganov <1991296+ggerganov@users.noreply.github.com>	2026-04-15 12:04:44 +00:00
copilot-swe-agent[bot]	51b679a5d6	semver: revert llama_export.h, fix ABI baseline to track full signatures - Revert include/llama.h to use the original manual LLAMA_API visibility macro block (LLAMA_SHARED / LLAMA_BUILD) - Revert src/CMakeLists.txt: remove GenerateExportHeader, restore LLAMA_BUILD/LLAMA_SHARED compile definitions and original target_include_directories - Revert CMakeLists.txt: remove llama_export.h from LLAMA_PUBLIC_HEADERS - Add scripts/gen-libllama-abi.py: Python parser that reads include/llama.h and extracts normalized full LLAMA_API function signatures (return type + name + parameter list), handling both plain and DEPRECATED() patterns - Regenerate scripts/libllama.abi with full signatures (233 entries) - Update .github/workflows/libllama-abi-check.yml to use the header parser script instead of building the library and running nm; the check now runs in seconds with no compiler dependency Agent-Logs-Url: https://github.com/ggml-org/llama.cpp/sessions/cd21903e-afd2-477a-8285-0a2d46e1398c Co-authored-by: ggerganov <1991296+ggerganov@users.noreply.github.com>	2026-04-15 12:02:36 +00:00
copilot-swe-agent[bot]	c00ac13fee	libllama-abi-check: add explicit read-only permissions to workflow job Agent-Logs-Url: https://github.com/ggml-org/llama.cpp/sessions/e9059c50-ffff-4233-a16d-13a7214f7b98 Co-authored-by: ggerganov <1991296+ggerganov@users.noreply.github.com>	2026-04-15 11:45:14 +00:00
copilot-swe-agent[bot]	3f3d62ffec	semver: add proper semantic versioning and ABI check workflow for libllama - Add LLAMA_VERSION_MAJOR/MINOR variables to CMakeLists.txt (both default 0) replacing the hard-coded 0.0.{build_number} scheme - Use GenerateExportHeader in src/CMakeLists.txt to generate llama_export.h and replace the manual LLAMA_API visibility macro dance in include/llama.h - Set SOVERSION to LLAMA_VERSION_MAJOR so the .so symlink tracks the major ABI version (libllama.so.0 -> libllama.so.0.MINOR.PATCH) - Install the generated llama_export.h alongside llama.h as a public header - Add scripts/libllama.abi: committed baseline of exported llama_* symbols (233 symbols extracted from the current build) - Add .github/workflows/libllama-abi-check.yml: CI workflow that builds libllama, extracts symbols with nm, and compares against the baseline to determine whether a MAJOR (symbols removed) or MINOR (symbols added) version bump is required Agent-Logs-Url: https://github.com/ggml-org/llama.cpp/sessions/e9059c50-ffff-4233-a16d-13a7214f7b98 Co-authored-by: ggerganov <1991296+ggerganov@users.noreply.github.com>	2026-04-15 11:44:00 +00:00
Ruben Ortlam	8dc530b86d	ci: disable test-backend-ops on Vulkan llvmpipe run and restore default timeout (#21901 )	2026-04-15 10:55:21 +02:00
Piotr Wilkin (ilintar)	e1a9a6dcbe	autoparser: support case of JSON_NATIVE with per-call markers (test case: Reka-Edge) (#21892 ) b8799	2026-04-15 10:51:50 +02:00
Matt	e39eba26f3	read n_ctx back after making llama_context (#21939 ) b8798	2026-04-15 15:24:57 +08:00
Yiwei Shao	5d14e5d19b	hexagon: optimization for HMX mat_mul (#21554 ) * hexagon: add async HMX worker Introduce hmx-worker (dedicated thread for HMX compute) to overlap HMX matmul with HVX dequant/DMA stages in the pipeline path, replacing the previous synchronous HMX calls that blocked the main thread. * hexagon: cost-based VTCM chunk search for out-stationary matmul * hexagon: fix futex race in hmx_worker_drain Store the boolean in a local variable to avoid loading the atomic twice * hex-mm: hmx optimize scatter/transpose and use HMX intrinsics * hex-vmem: drop vmem limit a touch under 3GB on v73 * hexagon: add fwd declaration of htp_context * hex-hmx: replace hmx-worker with hmx-queue that mimics dma-queue interface Simplifies the overall implementation, reduces thread wakeup roundtrips. * hex-mm: add debug log to hmx work func called from hmx-queue * Update hmx-queue.h Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com> --------- Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com> Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com> Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com> b8797	2026-04-14 14:09:03 -07:00
Xuan-Son Nguyen	fae3a28070	ggml : remove ggml-ext.h (#21869 ) * ggml: correct placement of ggml-ext.h * ggml : remove ggml-ext.h --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b8796	2026-04-14 17:32:58 +03:00
Georgi Gerganov	c0de6eda72	metal : fix FA support logic (#21898 ) b8795	2026-04-14 17:32:29 +03:00
Xuan-Son Nguyen	707c0b7a6e	mtmd: add mtmd_image_tokens_get_decoder_pos() API (#21851 ) * mtmd: add mtmd_image_tokens_get_decoder_pos() API * consistent naming * fix build b8794	2026-04-14 16:07:41 +02:00
Jeff Bolz	1f30ac0cea	vulkan: Programmatically add RoundingModeRTE to all shaders when the device supports it (#21572 ) * vulkan: Programmatically add RoundingModeRTE to all shaders when the device supports it * use FetchContent to get SPIRV-Headers * Fetch spirv-headers unconditionally * remove fetchcontent, rely on installed headers * fix ubuntu job * Update docs/build.md b8793	2026-04-14 15:17:45 +02:00
Georgi Gerganov	f4b5bf2f32	ci : re-enable mac workflows (#21894 ) * ci : re-enable mac workflows * vulkan : fix compile warning b8792	2026-04-14 15:58:09 +03:00
Seyoung Jeong	aa0f1897b7	metal : add XIELU unary op (#20802 ) b8791	2026-04-14 15:43:59 +03:00
Adrien Gallouët	be76dd0bb2	vendor : update BoringSSL to 0.20260413.0 (#21881 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co> b8790	2026-04-14 14:25:09 +03:00
Richard Davison	2e05f06ffb	ggml : fix ARM NEON nvfp4 dot product on non-dotprod targets (#21559 ) b8789	2026-04-14 14:23:45 +03:00
texasich	acc37a42ea	cmake: fix CMP0194 warning on Windows with MSVC (#21630 ) * cmake: fix CMP0194 warning on Windows with MSVC Set CMP0194 policy to NEW before project() call in ggml/CMakeLists.txt to suppress the "MSVC is not an assembler for language ASM" warning introduced in CMake 4.1. The ggml project enables ASM globally for Metal (macOS) and KleidiAI (ARM) backends. On Windows/MSVC, no assembler sources are used, but CMake 4.1+ warns because cl.exe is not a valid ASM compiler. This follows the same pattern used in ggml-vulkan (CMP0114, CMP0147). Closes ggml-org/llama.cpp#20311 * cmake: apply cisc's formatting suggestion --------- Co-authored-by: texasich <texasich@users.noreply.github.com> b8788	2026-04-14 13:47:56 +03:00
Reese Levine	5a23695d5a	ggml-webgpu: Update register tiling matmul to use f32 accumulation (#21644 ) * Update register tiling matmul to use f32 accumulation * fix profiling code * Fix register tiling matmul for chrome, i'm blaming dawn * Update batch tuning value for iOS * compile fix * Fix use of new load function b8787	2026-04-14 13:46:41 +03:00
Berk Idem	56666fa607	common: skip reasoning budget sampler when no budget is requested (#21870 ) * common: skip reasoning budget sampler when no budget is requested After I added thinking_start_tag / thinking_end_tag for gemma4 in #21697, the reasoning budget sampler gets unconditionally created even when no budget is configured (the default -1). The same applies to kimi_k2, lfm2, lfm2_5, and ministral_3 which also set these tags. The budget gets converted to INT_MAX, so the sampler never actually forces any tokens but still runs per-token checks (start tag matching in IDLE state, token-to-piece conversion + UTF-8 checks in COUNTING state). More importantly, the mere existence of the sampler (non-null rbudget) disables backend sampling. Backend sampling lets the GPU select tokens directly, avoiding a full logits transfer from GPU to CPU every token. This could explain the 30% speed regression reported in #21784 (98 t/s to 70 t/s on Vulkan). So I added a reasoning_budget_tokens >= 0 check to the sampler creation condition. When the budget is unlimited, the sampler is not created, backend sampling stays enabled, and no per-token overhead is added. When a budget is explicitly set (0, 128, 1024, etc.), the sampler is created and works as before. * common: preserve rbudget when grammar is lazy Following up on the review feedback on #21870: keep the reasoning budget sampler when grammar_lazy is true, so the thinking-block grammar suppression from #20970 still works when tools are in use. This way, we only skip the sampler when both no budget is set AND grammar is not lazy. b8786	2026-04-14 12:43:06 +02:00
Jeff Bolz	6a6780a232	vulkan: Support GGML_TYPE_NVFP4 (#21455 ) This adds nvfp4 support for get_rows, dequant, and mul_mat(_id). For mul_mat, it does not add support for the dp4/q8_1 path, it's all via fp16/fp32. b8785	2026-04-14 11:34:23 +02:00
Xuan-Son Nguyen	e489a5ca0e	server: support OAI /v1/audio/transcriptions API (#21863 ) * server: support OAI /v1/audio/transcriptions API * address autoreview comments * correct default response_format value b8784	2026-04-14 11:09:52 +02:00
Aldehir Rojas	e21cdc11a0	common/gemma4 : handle parsing edge cases (#21760 ) b8783	2026-04-13 18:18:18 -05:00
Xuan-Son Nguyen	e974923698	docs: listing qwen3-asr and qwen3-omni as supported (#21857 ) * docs: listing qwen3-asr and qwen3-omni as supported * nits	2026-04-13 22:28:17 +02:00
Piotr Wilkin (ilintar)	1c0d9081fd	chat: dedicated DeepSeek v3.2 parser + "official" template (#21785 ) b8781	2026-04-13 22:23:53 +02:00
Christian Kastner	a8bad3842e	ci: Also exempt 'security' tag from auto-close (#21844 )	2026-04-14 01:18:44 +08:00
Ruben Ortlam	75f3bc94e6	vulkan: Flash Attention DP4A shader for quantized KV cache (#20797 ) * use integer dot product for quantized KV flash attention * small improvements * fix SHMEM_STAGING indexing * add missing KV type quants * fixes * add supported quants to FA tests * readd fast paths for <8bit quants * fix mmq gate and shmem checks b8779	2026-04-13 14:21:31 +02:00
Adrien Gallouët	aa00911d12	common : add download cancellation and temp file cleanup (#21813 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co> b8778	2026-04-13 11:18:23 +02:00
Gaspard Petit	ce8fd4b1a6	server: Expose build_info in router mode (#21835 ) b8777	2026-04-13 11:14:42 +02:00
Oliver Simons	9f5e1edb10	CUDA: Limit DeviceSegmentedSort to immediate mode (#21718 ) * CUDA: Limit DeviceSegmentedSort to immediate mode DeviceSegmentedSort is currently not capturable in a cuda graph. Hence, we have to go for the slower DeviceSegmentedRadixSort in that case. Perf numbers on RTX Pro 6000 Blackwell Max-Q: DeviceSegmentedRadixSort in graph mode (i.e. CUDA Graphs) ARGSORT(type=f32,ne=[2048,512,1,1],order=1): 12291 runs - 105.94 us/run - 8192 kB/run - 73.75 GB/s ARGSORT(type=f32,ne=[4096,512,1,1],order=1): 10245 runs - 115.08 us/run - 16384 kB/run - 135.77 GB/s ARGSORT(type=f32,ne=[8192,512,1,1],order=1): 5125 runs - 221.22 us/run - 32768 kB/run - 141.26 GB/s ARGSORT(type=f32,ne=[16384,512,1,1],order=1): 2565 runs - 430.98 us/run - 65536 kB/run - 145.02 GB/s ARGSORT(type=f32,ne=[32768,512,1,1],order=1): 1028 runs - 1185.83 us/run - 131072 kB/run - 105.41 GB/s ARGSORT(type=f32,ne=[65536,512,1,1],order=1): 387 runs - 2748.62 us/run - 262144 kB/run - 90.95 GB/s DeviceSegmentedSort in immediate mode ARGSORT(type=f32,ne=[2048,512,1,1],order=1): 16388 runs - 71.17 us/run - 8192 kB/run - 109.78 GB/s ARGSORT(type=f32,ne=[4096,512,1,1],order=1): 12294 runs - 81.38 us/run - 16384 kB/run - 192.00 GB/s ARGSORT(type=f32,ne=[8192,512,1,1],order=1): 5125 runs - 240.81 us/run - 32768 kB/run - 129.77 GB/s ARGSORT(type=f32,ne=[16384,512,1,1],order=1): 2565 runs - 406.60 us/run - 65536 kB/run - 153.71 GB/s ARGSORT(type=f32,ne=[32768,512,1,1],order=1): 1285 runs - 873.23 us/run - 131072 kB/run - 143.15 GB/s ARGSORT(type=f32,ne=[65536,512,1,1],order=1): 516 runs - 2288.46 us/run - 262144 kB/run - 109.24 GB/s * Add test case for dispatch to DeviceSegmentedRadixSort We currently lack a way to force graph mode in CUDA, patch callback to invoke ggml_backend_compare_graph_backend twice to enforce each test to run in graph mode b8776	2026-04-13 11:14:06 +02:00
Xuan-Son Nguyen	920b3e78cb	mtmd: use causal attn for gemma 4 audio (#21824 ) b8775	2026-04-13 09:47:55 +02:00
Rohan Jain	974c8c94cc	webui: add setting for first-line chat titles (#21797 ) * webui: add setting for first-line chat titles Add an opt-in setting (`titleGenerationUseFirstLine`) to use the first non-empty line of a prompt as the generated conversation title. Previously, the complete multi-line prompt was being used, which created long titles for complex queries. Coupled with "Ask for confirmation before changing conversation title", the dialog would overflow. * Update tools/server/webui/src/lib/utils/text.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/utils/text.ts Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: Run build to update the bundle As requested in: https://github.com/ggml-org/llama.cpp/pull/21797#pullrequestreview-4094935065 * webui: Fix missing import for NEWLINE_SEPARATOR --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2026-04-13 09:30:46 +02:00
Aleksander Grygier	227ed28e12	webui: MCP Diagnostics improvements (#21803 ) * Add MCP Connection diagnostics and CORS hint to web-ui * tidy up test * webui: Refactor and improve MCP diagnostic logging --------- Co-authored-by: evalstate <1936278+evalstate@users.noreply.github.com>	2026-04-13 07:58:38 +02:00
Masashi Yoshimura	bafae27654	Remove extra conditional check on debug mode. (#21798 ) b8772	2026-04-12 20:13:04 -07:00
Akarshan Biswas	873c825611	sycl: disable Q1_0 in backend and cleanup unused variables (#21807 ) b8771	2026-04-13 09:44:58 +08:00
Sergiu	82764d8f40	mtmd: fix crash when sending image under 2x2 pixels (#21711 ) b8770	2026-04-12 23:59:21 +02:00
Xuan-Son Nguyen	21a4933042	mtmd: qwen3 audio support (qwen3-omni and qwen3-asr) (#19441 ) * add qwen3a * wip * vision ok * no more deepstack for audio * convert ASR model ok * qwen3 asr working * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * nits * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix bad merge * fix multi inheritance --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b8769	2026-04-12 23:57:25 +02:00
Sigbjørn Skjæret	1e9d771e2c	convert : force f16 or f32 on step3-vl conv weights (#21646 )	2026-04-12 19:22:29 +02:00
Xuan-Son Nguyen	aa4695c5e5	mtmd: add gemma 4 test (vision + audio) [no ci] (#21806 ) * mtmd: add gemma 4 test (vision + audio) * add to docs	2026-04-12 16:29:03 +02:00
Stephen Cox	547765a93e	mtmd: add Gemma 4 audio conformer encoder support (#21421 ) * mtmd: add Gemma 4 audio conformer encoder support Add audio processing for Gemma 4 E2B/E4B via a USM-style Conformer. Architecture: - 12-layer Conformer: FFN → Self-Attention → Causal Conv1D → FFN → Norm - Subsampling Conv Projection: 2x Conv2D(stride=2) with LayerNorm - Full self-attention with sinusoidal RPE and sliding window mask (24) - Logit softcapping at 50.0, ClippableLinear clamping - Output: 1024 → 1536 → RMSNorm → multimodal embedder Mel preprocessing (dedicated mtmd_audio_preprocessor_gemma4a): - HTK mel scale, 128 bins, magnitude STFT, mel_floor=1e-3 - Standard periodic Hann window (320 samples), zero-padded to FFT size - Semicausal left-padding (frame_length/2 samples) - Frame count matched to PyTorch (unfold formula) - No pre-emphasis, no Whisper-style normalization - Mel cosine similarity vs PyTorch: 0.9998 Key fixes: - Tensor loading dedup: prevent get_tensor() from creating duplicate entries in ctx_data. Fixed with std::set guard. - ClippableLinear clamp_info loading moved after per-layer tensors. - Sliding window mask (24 positions) matching PyTorch context_size. - Skip Whisper normalization for Gemma4 mel output. Tested on E2B and E4B with CPU and Vulkan backends. Transcribes: "Glad to see things are going well and business is starting to pick up" (matching ground truth). Ref: #21325 b8766	2026-04-12 14:15:26 +02:00
Aleksander Grygier	9e209c5aee	fix: Proper messages rendering for "Show raw output" (#21672 )	2026-04-12 13:08:11 +02:00
Xuan-Son Nguyen	6313acbef0	docs: add guide on how to add multimodal support (#21778 ) * docs: add guide on how to add multimodal support * nits	2026-04-12 13:02:38 +02:00
Johannes Gäßler	ff5ef82786	CUDA: skip compilation of superfluous FA kernels (#21768 ) b8763	2026-04-11 18:52:11 +02:00
Sirui He	073bb2c20b	mtmd : add MERaLiON-2 multimodal audio support (#21756 ) * mtmd : add MERaLiON-2 multimodal audio support Adds support for ASTAR's MERaLiON-2 audio-language model (3B and 10B) to the multimodal framework. Architecture: - Whisper large-v2 encoder for audio feature extraction - Gated MLP adaptor: ln_speech -> frame stack (x15) -> Linear+SiLU -> GLU -> out_proj - Gemma2 3B / 27B decoder The mmproj GGUF is generated via convert_hf_to_gguf.py --mmproj on the full MERaLiON-2 model directory (architecture: MERaLiON2ForConditionalGeneration). The decoder is converted separately as a standard Gemma2 model after stripping the text_decoder. weight prefix. New projector type: PROJECTOR_TYPE_MERALION Supports tasks: speech transcription (EN/ZH/MS/TA), translation, spoken QA. Model: https://huggingface.co/MERaLiON/MERaLiON-2-3B https://huggingface.co/MERaLiON/MERaLiON-2-10B simplify comments in meralion adaptor * meralion: use format_tensor_name, ascii arrows in comments b8762	2026-04-11 14:15:48 +02:00
shaofeiqi	af1127d3c4	opencl: add basic support for q5_k (#21593 ) * opencl: add general q5_k mv * opencl: add flattened Q5_K mv and general Q5_K mm * opencl: fix Q5_K unit tests b8761	2026-04-11 01:46:19 -07:00
Johannes Gäßler	865ff06b2f	TP: fix Qwen 3 Next data split (#21732 ) b8760	2026-04-11 09:23:42 +02:00
Sigbjørn Skjæret	2b2cd57de6	ggml : fix a few instances of missing GGML_TYPE_Q1_0 cases (#21716 ) b8759	2026-04-11 09:45:00 +03:00
Bartowski	660386f6f8	py : Bump typer to latest to fix huggingface_hub issue (#21701 )	2026-04-11 09:44:15 +03:00
Aman Gupta	a29e4c0b7b	CUDA: also store node->src ne/nb for graph equality (#21736 ) b8757	2026-04-11 10:30:30 +08:00
Galunid	b136b62cf9	fix: Fix broken structured output when using $refs in json_schema (#21699 ) b8756	2026-04-10 18:26:36 -05:00
Todor Boinovski	81069a808a	hexagon: add support for linux on snapdragon (#21707 ) * hexagon: add support for debian on ex2 * hexagon: add -fvectorize to c/c++ cmake flags * hexagon: remove trailing white space * update onboarding steps * hexagon: update linux setup documentation * hexagon: update installation scripts * Hexagon: update docs * hexagon: update onboarding scripts --------- Co-authored-by: Zack Li <zackli@qti.qualcomm.com> b8755	2026-04-10 15:57:23 -07:00
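
Commit 3f3d62ffec above describes comparing the current libllama symbols against a committed baseline to decide whether a MAJOR (symbols removed) or MINOR (symbols added) version bump is required. A minimal sketch of that comparison rule follows. The function name `required_bump` and the plain-list input format are assumptions made for illustration; this is not the code in the workflow or in scripts/gen-libllama-abi.py.

```python
from typing import Iterable


def required_bump(baseline: Iterable[str], current: Iterable[str]) -> str:
    """Return "major", "minor", or "none" for two ABI entry lists.

    Hypothetical helper: entries are treated as opaque strings (symbol
    names or normalized signatures), one per list element.
    """
    baseline_set = set(baseline)
    current_set = set(current)

    removed = baseline_set - current_set  # entries that disappeared
    added = current_set - baseline_set    # entries that are new

    if removed:
        # Anything removed breaks existing consumers: MAJOR bump.
        return "major"
    if added:
        # Purely additive change: MINOR bump.
        return "minor"
    return "none"


if __name__ == "__main__":
    old = ["llama_a", "llama_b"]
    print(required_bump(old, ["llama_a"]))
    print(required_bump(old, ["llama_a", "llama_b", "llama_c"]))
    print(required_bump(old, list(old)))
```

When the baseline stores full signatures, as in commit 51b679a5d6, a changed parameter list shows up under this rule as one entry removed and one added, so it is reported as a major change.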

8804 Commits
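
Commit 56666fa607 in the list above states when the reasoning budget sampler is skipped: only when no budget is set (the default -1) and the grammar is not lazy. The sketch below restates that condition as a predicate. The function and parameter names are hypothetical, and the `has_thinking_tags` precondition is inferred from the commit message rather than taken from the llama.cpp source.

```python
def should_create_budget_sampler(
    has_thinking_tags: bool,
    reasoning_budget_tokens: int,
    grammar_lazy: bool,
) -> bool:
    """Hypothetical restatement of the rule in the commit message."""
    if not has_thinking_tags:
        # Assumption: without start/end thinking tags there is nothing
        # for a budget sampler to track.
        return False
    # Keep the sampler if a budget was explicitly requested (0, 128, ...)
    # or if the grammar is lazy; skip it only when neither holds.
    return reasoning_budget_tokens >= 0 or grammar_lazy


if __name__ == "__main__":
    # Default budget (-1) with a non-lazy grammar: sampler is skipped.
    print(should_create_budget_sampler(True, -1, False))
    # Explicit budget of 0: sampler is created.
    print(should_create_budget_sampler(True, 0, False))
```

Per the commit message, skipping the sampler in the default case matters because its mere existence disables backend sampling.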