llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-05-01 22:54:05 +00:00

Author	SHA1	Message	Date
xris99	ff6b1062af	server : fix hardcoded proxy connection timeout in router mode (#18760 ) (#22003 ) Fixes: https://github.com/ggml-org/llama.cpp/issues/18760 Co-authored-by: Christian <christian@example.com> b8864	2026-04-21 06:41:14 +02:00
leonardHONG	97895129e5	ggml-cuda: flush legacy pool on OOM and retry (#22155 ) * ggml-cuda: flush legacy pool on OOM and retry Signed-off-by: 梁厚宏 <2695316095@qq.com> * Address review comments: add explicit sync, update destructor, clean up MUSA macros Signed-off-by: 梁厚宏 <2695316095@qq.com> --------- Signed-off-by: 梁厚宏 <2695316095@qq.com> b8863	2026-04-20 23:30:38 +02:00
Xuan-Son Nguyen	86f8daacfe	mtmd: correct get_n_pos / get_decoder_pos (#22175 ) b8862	2026-04-20 23:29:19 +02:00
Georgi Gerganov	cf8b0dbda9	server : remove /api endpoints (#22165 ) * server : remove /api endpoints * cont : remove /api/tags b8861	2026-04-20 20:41:19 +03:00
Gaurav Garg	fd6ae4ca1c	Tensor-parallel: Fix delayed AllReduce on Gemma-4 MoE (#22129 ) * Fix delayed AllReduce on Gemma-4 MoE Skip forward past nodes that don't consume the current one, and allow a chain of MULs. * Check for all sources before skipping nodes * Address review comments b8860	2026-04-20 18:25:39 +02:00
Johannes Gäßler	fb19f94c71	TP: fix 0-sized tensor slices, AllReduce fallback (#21808 ) * TP: fix 0-sized tensor slices, AllReduce fallback * fix layer structure <-> GPU count aliasing * add missing std::fill * fix CUDA device set, max ggml ctx size b8859	2026-04-20 18:09:39 +02:00
pl752	7f251fdbce	ggml-cpu: Optimized x86 and generic cpu q1_0 dot (follow up) (#21636 ) * Implemented optimized q1_0 dot for x86 and generic * Removed redundant helper definition * Removed two redundant instructions from AVX q1_0 dot * Fixed inconsistency with fp16 conversion for generic q1_0 dot and deduplicated generic fallback * Style cleanup around AVX q1_0 dot * Replaced explicitly unrolled blocks with inner for loop for q1_0 * Replaced scalar ARM q1_0 impl with new generic one b8858	2026-04-20 19:02:54 +03:00
neha-ha	a6cc43c286	ggml-webgpu: updated matrix-vector multiplication (#21738 ) * merged properly, but slow q3_k and q5_k with u32 indexing * Start on new mat-vec * New format float paths working * Working q4_0 * Work on remaining legacy q-types * port k-quants to new matvec * remove old shader * Remove old constants, format * remove accidental file --------- Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local> Co-authored-by: Reese Levine <reeselevine1@gmail.com> b8857	2026-04-20 07:37:17 -07:00
Xuan-Son Nguyen	a678916623	mtmd: refactor mtmd_decode_use_mrope (#22161 )	2026-04-20 14:45:11 +02:00
SamareshSingh	81df3f7cfa	fix: GLM-DSA crash in llama-tokenize when using vocab_only (#22102 ) * llama: fix crash in print_info for GLM-DSA when vocab_only is set * addressed code review comments * cont : simplify --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b8855	2026-04-20 10:32:46 +03:00
Georgi Gerganov	de71b5f81c	server : refactor "use checkpoint" logic (#22114 ) b8854	2026-04-20 08:42:37 +03:00
Katostrofik	788fcbc5dd	[SYCL] Fix reorder MMVQ assert on unaligned vocab sizes (#22035 ) * [SYCL] Fix reorder MMVQ assert on unaligned vocab sizes The reorder mul_mat_vec_q dispatchers for Q4_0, Q8_0, Q4_K, and Q6_K asserted that block_num_y was a multiple of 16 subgroups. Models with a vocab size not divisible by 16 (for example HY-MT at 120818) aborted on model load when the output projection tripped the assert. I replaced the assert with padding: block_num_y now rounds up to a whole number of subgroup-sized workgroups. The kernel already has the row bounds check (`if (row >= nrows) return;`) so the extra padded threads early-exit cleanly. Row values are uniform across a subgroup so the collective reduce stays safe. For aligned vocab sizes the padded block_num_y equals the old value, so the kernel launch is identical and there is no regression. Thanks to @arthw for flagging the relationship to #21527. Fixes #22020. AI assisted coding, tested on Intel B70 hardware. * sycl: use WARP_SIZE for num_subgroups in reorder MMVQ launches Replaces the hardcoded 16 with WARP_SIZE in the four reorder_mul_mat_vec launch helpers (Q4_0, Q8_0, Q4_K, Q6_K). Compile-time no-op on the Intel target where WARP_SIZE is 16, but makes the relationship to subgroup size explicit. Per review by @NeoZhangJianyu on #22035. Assisted by Claude. b8853	2026-04-20 08:39:45 +03:00
Yes You Can Have Your Own	9d49acb2a7	server: rename --clear-idle to --cache-idle-slots (#21741 ) b8852	2026-04-20 08:30:24 +03:00
Alessandro de Oliveira Faria (A.K.A.CABELO)	e365e658f0	vendor : update cpp-httplib to 0.42.0 (#21781 ) b8851	2026-04-20 06:41:43 +08:00
Johannes Gäßler	4eac5b4509	CUDA: refactor mma data loading for AMD (#22051 ) * CUDA: refactor mma data loading for AMD * fix CDNA MMQ occupancy * fix CDNA3 mma * fix RDNA3 compile b8850	2026-04-19 18:26:59 +02:00
Aldehir Rojas	d5b780a676	common/autoparser : allow space after tool call (#22073 ) b8849	2026-04-19 13:28:35 +02:00
uvos	471540ae8a	HIP: Remove unnecessary NCCL_CHECK (#21914 ) b8848	2026-04-19 12:59:44 +02:00
Xuan-Son Nguyen	19124078be	mtmd: add pos_0 to mtmd_image_tokens_get_decoder_pos (breaking change) (#22082 ) * mtmd: add pos_0 to mtmd_image_tokens_get_decoder_pos * fix build b8847	2026-04-19 11:57:21 +02:00
Gaurav Garg	bcdcc1044f	ggml : reduce CPU overhead in meta backend (#22041 ) * cache subgraph splits when cgraph is unchanged Skip per-call subgraph construction in ggml_backend_meta_graph_compute when the same ggml_cgraph is used consecutively. Assign uid to every sub-graph so that CUDA's fast uid check path hits too. * Address review comments * Keep the scope as is * Rename last_uid and last_n_subgraphs field. Remove last_max_tmp_size field. Refactor code. * Address review comments * Update ggml/src/ggml-backend-meta.cpp Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-backend-meta.cpp Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> b8846	2026-04-19 12:48:35 +03:00
Sigbjørn Skjæret	037bfe38d0	ci : install spirv-headers for vulkan-cross (#22109 )	2026-04-19 10:32:08 +03:00
Dowon	8685e7b075	convert : support sentence-transformer 5.4 config files (#22087 ) * convert : support sentence-transformer 5.4 config files * fix: embeddinggemma * fix: mapping Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix: pooling_mode Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-04-19 10:25:39 +03:00
texasich	09b4efa95f	cmake: remove CMP0194 policy to restore MSVC builds (#21934 ) #21630 added the CMP0194 NEW policy to silence a CMake warning, but on Windows runners it caused CMake to prefer the MinGW toolchain for ASM and broke MSVC builds. Reverting only that policy block restores the previous working behavior. The CMake 4.1+ warning comes back, but that is cosmetic and does not break any platform. Reported-by: oobabooga Refs: #21630 Co-authored-by: texasich <texasich@users.noreply.github.com> b8843	2026-04-19 10:25:05 +03:00
Sascha Rogmann	455d8e4be8	server : speculative checkpointing (#19493 ) * server : speculative decoding using checkpoints * server : fix draft check with checkpoints * server : rename spec vars * server : log levels * server : refactored spec logic to speculative.cpp * server : renamed spec checkpoints option * server : fix spec checkpoints, logging * speculative : checkpoints with draft model, logging * server : n_tokens_cur and create_checkpoint in draft * server : fix server_speculative_callback (slot.id) * spec : fix ngram-map/begin idx_last_check * spec : init ckpt (begin() wasn't called) * chore: update webui build output * server : restore sampler in spec checkpoint and clear mem * cont : avoid --spec-use-checkpoints argument * cont : remove server_prompt_checkpoint_with_size * spec : rename (leave_draft_state) * cont : clean-up * cont : do not ignore partial drafts even if the are short * cont : spec callback owned by session * cont : simplify * cont : avoid empty speculative session * cont : simplify * cont : simplify * cont : enable mtmd speculative decoding * cont : keep the spec sampler alive * cont : simplify * cont : fix nullptr deref + draft checkpoints * cont : remove common_speculative_accept_response * cont : remove callback * cont : simplify * cont : minor * cont : simplify * cont : fix accepted number --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> b8842	2026-04-19 10:24:06 +03:00
Radoslav Gerganov	91fef95362	rpc : refactor the RPC transport (#21998 ) * rpc : refactor the RPC transport Move all transport related code into a separate file and use the socket_t interface to hide all transport implementation details. * fix win32 * better socket_t construction b8841	2026-04-19 10:21:53 +03:00
Cetarthoriphros	9e5647affa	server: Expose `media_tag` on /props endpoint. (#22028 ) b8840	2026-04-19 00:27:17 +02:00
Sigbjørn Skjæret	4f02d47339	model : refactor bias tensor variable names (#22079 ) * refactor bias tensor variable names * use create_tensor_qkv for jina-bert-v2 b8839	2026-04-18 20:12:00 +02:00
Sigbjørn Skjæret	23b8cc4991	android : libcommon -> libllama-common (#22076 ) b8838	2026-04-18 11:19:40 +02:00
SamareshSingh	59accc8863	ggml-backend-meta: add multi-segment read support in get_tensor (#22063 ) b8837	2026-04-18 10:04:51 +02:00
Sigbjørn Skjæret	83d58e02fc	ci : free disk space for rocm release (#22012 ) b8836	2026-04-18 09:37:30 +02:00
Sigbjørn Skjæret	89a5474f0e	convert : fix (ignore for now) typings errors (#22002 )	2026-04-18 09:36:41 +02:00
Johannes Gäßler	fd1c0ec3f0	llama: fit ctx size for CPU only (#21568 )	2026-04-18 08:16:04 +02:00
Reese Levine	45cac7ca70	ggml-webgpu: fix compiler warnings and refactor FlashAttention encoding (#21052 ) * Update workflows to remove dependence on llvmpipe * Try setting Dawn_DIR * remove c++20 initializers * Move to proper guid * Try avoiding segfaults on vulkan backend process exit * Remove compiler warnings on parameter casting * Fix soft_max and update reg_tile accumulation to f32 for better precision * Refactor flash_attn a bit * remove c++20 initializers and format * Increase div precision for NVIDIA * revert div precision and comment out ggml-ci node for now * Formatting * Try debugging on a failing CI node * Revert "Try debugging on a failing CI node" This reverts commit `1971e33cba`. b8833	2026-04-17 09:17:11 -07:00
Aman Gupta	b94050e896	CUDA: use LRU based eviction for cuda graphs (#21611 ) * CUDA: use a ring-buffer for cuda graphs * bump limit to 128 * use LRU eviction * better naming * do periodic clean-up b8832	2026-04-17 23:24:21 +08:00
Yuri Khrustalev	a279d0f0f4	ci : add android arm64 build and release (#21647 ) * server: respect the ignore eos flag * ci: add android arm64 build and release * patch * pin android-setup actions to v4 * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * lf in the suggestion --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b8831	2026-04-17 11:32:24 +02:00
65a	268d61e178	mtmd: add missing struct tag (#22023 ) b8830	2026-04-17 10:48:33 +02:00
Georgi Gerganov	6990e2f1f7	libs : rename libcommon -> libllama-common (#21936 ) * cmake : allow libcommon to be shared * cmake : rename libcommon to libllama-common * cont : set -fPIC for httplib * cont : export all symbols * cont : fix build_info exports * libs : add libllama-common-base * log : add common_log_get_verbosity_thold() b8829	2026-04-17 11:11:46 +03:00
Eric Zhang	fcc7508759	model : Gemma4 model type detection (#22027 ) * model : Gemma4 model type detection * model : Gemma4 model type detection b8828	2026-04-17 10:07:11 +02:00
lhez	5e6c0e18b6	opencl: refactor q8_0 set_tensor and mul_mat host side dispatch for Adreno (#21938 ) * opencl: refactor q8_0 gemm/gemv Adreno dispatch * opencl: refactor q8_0 set_tensor * opencl: fix whitespace b8827	2026-04-16 22:28:33 -07:00
Sigbjørn Skjæret	30dce2cf29	cli : use get_media_marker (#22017 ) b8826	2026-04-17 00:12:31 +02:00
Xuan-Son Nguyen	089dd41fe3	cmake: use glob to collect src/models sources (#22005 ) b8825	2026-04-16 23:25:16 +02:00
nullname	85dde8dc4a	hexagon: optimize HMX matmul operations (#21071 ) * optimize hmx_mat_mul functions by calculating row and column tiles upfront * refactor core_dot_chunk_fp16 to use size_t for tile counts and improve readability * wip * set scale outside of loop * wip * refactor core_mma_chunk_fp16 and mat_mul_qk_0_d16a32 to use size_t for tile counts * wip * wip * refactor transfer_output_chunk_fp16_to_fp32 to use size_t for dimensions * refactor core_dot_chunk_fp16 to use size_t for tile row stride calculation * wip * refactor hmx_mat_mul functions to use hvx_vec_splat_f16 for column scales initialization * refactor hmx_mat_mul_permuted_w16a32_batched to streamline scale setting and locking * refactor core_dot_chunk_fp16 to improve tile stride calculations for output * refactor hmx_mat_mul functions to use Q6_V_vsplat_R for column scales initialization * fix compiling error * wip * optimize row and column tile indexing in core_mma_chunk_fp16 function * wip * Revert "wip" This reverts commit `cde679eff7`. * Add size limit check for HAP_mmap in htp_iface_mmap and drop_mmap functions * wip b8824	2026-04-16 13:48:34 -07:00
Xuan-Son Nguyen	4fbdabdc61	model: using single llm_build per arch (#21970 ) * model: using single llm_build per arch * fix merge * nits b8823	2026-04-16 21:10:22 +02:00
shaofeiqi	e45dbdece8	opencl: add q5_K gemm and gemv kernels for Adreno (#21595 ) b8822	2026-04-16 12:08:33 -07:00
Pascal	4adac43f6f	server: tests: fetch random media marker via /apply-template (#21962 ) (#21980 ) * server: tests: fetch random media marker via /apply-template (#21962 fix) * server: allow pinning media marker via LLAMA_MEDIA_MARKER env var get_media_marker() checks LLAMA_MEDIA_MARKER at first call and uses it as-is if set, falling back to the random marker otherwise. Tests no longer need to fetch the marker dynamically via /apply-template: the fixture sets LLAMA_MEDIA_MARKER=<__media__> so the hardcoded prompts work as before. Address review feedback from ngxson * server: make get_media_marker() thread-safe via magic statics Use a C++11 static local with a lambda initializer instead of a global static with an empty-check. The runtime guarantees initialization exactly once without explicit locking. Address review feedback from ggerganov * nits * nits b8821	2026-04-16 20:46:21 +03:00
PikaPikachu	9db77a020c	model : refactor QKV into common build_qkv and create_tensor_qkv helpers (#21245 ) * model : refactor QKV into common build_qkv and create_tensor_qkv helpers * model : extend build_qkv to bert/mpt/dbrx/olmo/lfm2/nemotron-h/granite-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s	2026-04-16 17:41:34 +02:00
Sigbjørn Skjæret	f772f6e434	model : support NVFP4 tensors for Gemma4 (#21971 ) * support nvfp4 tensors for Gemma4 * add wo_s to build_attn * add wo_s to build_attn * fix glm4	2026-04-16 16:51:47 +02:00
Ruben Ortlam	b572d1ecd6	codeowners: add team member comments (#21714 )	2026-04-16 13:13:11 +03:00
Anav Prasad	03b3d07798	Convert: Fix NemotronH Config Parsing (#21664 ) * fix NemotronH vocab loading by using trust_remote_code for unsupported config patterns * fix NemotronH tokenizer loading by overriding set_vocab with trust_remote_code	2026-04-16 13:11:45 +03:00
Aman Gupta	3f7c29d318	ggml: add graph_reused (#21764 ) * ggml: add graph_reused * use versioning instead of reuse flag * increment version with atomic * use top bits for split numbering * add assert * move counter to ggml.c * set uid in split_graph only * fix windows * address further review comments * get next_uid rather than doing bit manipulation * rename + add comment about uid b8816	2026-04-16 17:21:28 +08:00
Kusha Gharahi	ae2d34899e	metal: Implement ROLL op (#21946 ) * nix: support unified apple-sdk * Impl roll op for Metal * Revert "nix: support unified apple-sdk" This reverts commit `abfa473360`. * update ops.md * update op docs b8815	2026-04-16 11:54:37 +03:00
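
Note on commit 788fcbc5dd (SYCL reorder MMVQ): the message describes replacing an alignment assert with padding, so that block_num_y rounds up to a whole number of subgroup-sized workgroups and the extra threads leave through the existing `if (row >= nrows) return;` check. A minimal sketch of that rounding arithmetic follows. It is not the code from the commit: the helper names are hypothetical, and it treats the row count itself as block_num_y for simplicity, which is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not taken from llama.cpp): round `n` up to the next
// multiple of `multiple`. Assumes n >= 0 and multiple > 0.
constexpr int64_t round_up_to_multiple(int64_t n, int64_t multiple) {
    return ((n + multiple - 1) / multiple) * multiple;
}

// Hypothetical helper: how many padded rows are launched beyond `nrows`.
// In the scheme the commit describes, these would exit early through the
// kernel's bounds check (`if (row >= nrows) return;`).
constexpr int64_t padded_rows(int64_t nrows, int64_t num_subgroups) {
    return round_up_to_multiple(nrows, num_subgroups) - nrows;
}

// An already aligned size is left unchanged, which matches the commit's claim
// that aligned vocab sizes produce an identical launch.
static_assert(round_up_to_multiple(4096, 16) == 4096,
              "aligned sizes must not be padded");
```

With the vocab size quoted in the commit message (120818) and 16 subgroups, `round_up_to_multiple(120818, 16)` gives 120832, so 14 padded rows would take the early-exit path.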


8864 Commits