llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-03-17 16:44:07 +00:00

Author	SHA1	Message	Date
Piotr Wilkin (ilintar)	2e4a6edd4a	tools/server: support refusal content for Responses API (#20285) * Support refusal content for Responses API * Update tools/server/server-common.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tools/server/server-common.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b8389	2026-03-17 01:42:04 +01:00
Xuan-Son Nguyen	d34ff7eb5b	model: mistral small 4 support (#20649) * model: mistral small 4 support * fix test * fix test (2) * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * change newline --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b8388	2026-03-17 00:31:14 +01:00
Georgi Gerganov	45172df4d6	ci : disable AMX jobs (#20654) [no ci]	2026-03-16 22:38:59 +02:00
Georgi Gerganov	9b342d0a9f	benches : add Nemotron 3 Nano on DGX Spark (#20652) [no ci]	2026-03-16 21:50:43 +02:00
Sigbjørn Skjæret	55e87026f7	tests : write to binary buffer to avoid newline translation in jinja -py [no ci] (#20365)	2026-03-16 20:40:22 +01:00
Martin Klacer	cf21cdf36c	kleidiai: add data type check to get_tensor_traits (#20639) * kleidiai: add data type check to get_tensor_traits * Added check for F16 data type into get_tensor_traits path with input data not in ggml_backend_cpu_kleidiai_buffer_type format (unsupported for Q4/8) Signed-off-by: Martin Klacer <martin.klacer@arm.com> Change-Id: I9aca4b9b8d669d35db6f1dbcc4e080b1919b1de7 * updated ggml/src/ggml-cpu/kleidiai/kleidiai.cpp updated kleidiai.cpp file as per suggestion Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Martin Klacer <martin.klacer@arm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-16 21:25:54 +02:00
Sigbjørn Skjæret	0ed992973b	ci : update labeler (#20629)	2026-03-16 20:24:20 +01:00
Aldehir Rojas	1bbec6a75d	jinja : add capability check for object args (#20612)	2026-03-16 17:43:14 +01:00
Georgi Gerganov	f47a246a08	sync : ggml	2026-03-16 17:22:06 +02:00
Georgi Gerganov	c0ccbd1f86	ggml : try fix arm build (whisper/0)	2026-03-16 17:22:06 +02:00
David366AI	f6da02c3f2	ggml : extend im2col f16 (ggml/1434) * examples/yolo: fix load_model memory leak * fix/issue-1433 ggml_compute_forward_im2col_f16 assert error * fix/issue-1433	2026-03-16 17:22:06 +02:00
Pascal	dddca026bf	webui: add model information dialog to router mode (#20600) * webui: add model information dialog to router mode * webui: add "Available models" section header in model list * webui: remove nested scrollbar from chat template in model info dialog * chore: update webui build output * feat: UI improvements * refactor: Cleaner rendering + UI docs * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2026-03-16 15:38:11 +01:00
Aman Gupta	3c8521c4f5	llama-graph: replace cont with reshape for alpha in qwen35 (#20640) b8377	2026-03-16 22:07:13 +08:00
Aleksander Grygier	67a2209fab	webui: Add MCP CORS Proxy detection logic & UI (#20167) * refactor: MCP store cleanup * feat: Add MCP proxy availability detection * fix: Sidebar icon * chore: update webui build output * chore: Formatting * chore: update webui build output * chore: Update package lock * chore: update webui build output * chore: update webui build output * chore: update webui build output	2026-03-16 13:05:36 +01:00
Pascal	d65c4f2dc9	Fix model selector locked to first loaded model with multiple models (#20580) * webui: fix model selector being locked to first loaded model When multiple models are loaded, the auto-select effect would re-fire on every loadedModelIds change, overriding the user's manual model selection. Guard with selectedModelId so auto-select only kicks in when no model is chosen yet. * chore: update webui build output	2026-03-16 12:04:06 +01:00
Woof Dog	d8c331c0af	webui: use date in more human readable exported filename (#19939) * webui: use date in exported filename Move conversation naming and export to utils update index.html.gz * webui: move literals to message export constants file * webui: move export naming and download back to the conversation store * chore: update webui build output * webui: add comments to some constants * chore: update webui build output	2026-03-16 11:18:13 +01:00
Ruben Ortlam	46dba9fce8	vulkan: fix flash attention dot product precision (#20589) b8373	2026-03-16 10:45:49 +01:00
Sigbjørn Skjæret	de8f01c2d7	model : wire up Nemotron-H tensors for NVFP4 support (#20561) * wire up Nemotron-H tensors for NVFP4 support * add ssm tensors * alignment b8372	2026-03-16 09:19:16 +01:00
Richard Davison	079e5a45f0	convert : support mixed-precision ModelOpt models with per-tensor NVFP4/FP8 quantization (#20539) * support mixed-precision ModelOpt models with per-tensor NVFP4/FP8 quantization * cleanup * fallback --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-16 09:18:47 +01:00
Masato Nakasaka	d3936498a3	common : fix iterator::end() dereference (#20445) b8370	2026-03-16 08:50:38 +02:00
Aman Gupta	34818ea6c0	CUDA: GDN hide memory latency (#20537) b8369	2026-03-16 11:41:45 +08:00
Piotr Wilkin (ilintar)	9e2e2198b0	tools/cli: fix disable reasoning (#20606) b8368	2026-03-15 22:40:53 +01:00
Georgi Gerganov	88915cb55c	server : fix wait in test_cancel_requests() test (#20601) * server : fix wait in test_cancel_requests() test * codeowners : add team for server tests	2026-03-15 20:54:37 +02:00
Sigbjørn Skjæret	ebbf544ed1	sycl : fix for untransposed GDA recurrent state (#20583) b8366	2026-03-15 19:10:15 +01:00
Sigbjørn Skjæret	b91d7dfe5b	ci : only save openvino caches on github-hosted master (#20593) * only save openvino ccache on master * disable toolkit cache if self-hosted * only cache on github-hosted runners * remove toolkit cache [no ci]	2026-03-15 18:58:13 +01:00
Johannes Gäßler	ae40cd27c8	CUDA: limit number of FA stream-k CUDA blocks (#20586) b8364	2026-03-15 18:30:47 +01:00
Pascal	ceef6b5233	ggml: avoid creating CUDA context during device init (#20595) b8363	2026-03-16 00:42:56 +08:00
Adrien Gallouët	07c6a59b4f	vendor : update cpp-httplib to 0.38.0 (#20578) Signed-off-by: Adrien Gallouët <angt@huggingface.co> b8362	2026-03-15 17:30:06 +01:00
MoonShadow	8b7d340b6f	ggml/hip: fix APU compatibility - soft error handling for hipMemAdviseSetCoarseGrain (#20536) * ggml/hip: fix APU compatibility - soft error handling for hipMemAdviseSetCoarseGrain On AMD APU/iGPU devices (unified memory architecture), hipMemAdviseSetCoarseGrain returns hipErrorInvalidValue because the hint is not applicable to UMA systems. The previous CUDA_CHECK() call treated this as a fatal error, causing crashes on APU systems such as AMD Strix Halo (gfx1151). Fix: treat hipMemAdviseSetCoarseGrain as an optional performance hint - call it without error checking and clear any resulting error with hipGetLastError(). Also add pre-allocation debug logging (GGML_LOG_DEBUG) to help diagnose memory issues on APU systems, and store totalGlobalMem in device info. Context: AMD APUs on Windows are affected by a ROCm runtime bug that limits hipMallocManaged to ~64GB regardless of available system RAM. A fix has been submitted upstream: https://github.com/ROCm/rocm-systems/pull/4077 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * ggml/hip: remove unrelated changes, keep only hipMemAdviseSetCoarseGrain fix --------- Co-authored-by: moonshadow-25 <moonshadow-25@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> b8361	2026-03-15 17:23:58 +01:00
Eric Hsieh	559646472d	fix: prevent nullptr dereference (#20552) b8360	2026-03-15 16:51:49 +01:00
Sigbjørn Skjæret	cf45437d35	codeowners : use teams (#20526) * use teams * update * update * update * update * update	2026-03-15 14:26:10 +01:00
Georgi Gerganov	9cd4ebcfb1	ci : split build.yml + server.yml (#20546) * ci : split build.yml * cont : split server.yml * cont : reduce paths * cont : split build-android.yml + update paths * ci : make msys workflows manual (#20588) * ci : make cross-build workflows manual (#20585) * cont : fix release paths Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> b8358	2026-03-15 15:11:17 +02:00
Sigbjørn Skjæret	89d0aec042	convert : support contiguous method on lora tensors (#20489)	2026-03-15 12:15:12 +01:00
Bartowski	b9da4444df	ggml : guard against sumq2 being 0 in IQ4_NL (#20460) b8356	2026-03-15 10:47:28 +02:00
PikaPikachu	617db241aa	cuda : add RDNA4-specific MMVQ parameter table for bs=1 decode (#19478) * mmvq: add RDNA3/RDNA4-specific parameter table (nwarps=8, rows=1) * mmvq: add dedicated RDNA3 parameter table * mmvq: exclude RDNA3.5 (gfx1150/1151) from RDNA3 table b8355	2026-03-15 08:33:39 +01:00
Ruben Ortlam	1a3d8edbba	vulkan: use graphics queue on AMD (#20551) * vulkan: use graphics queue on AMD for slightly better performance * disable async transfer queue on AMD b8354	2026-03-15 08:18:54 +01:00
sprayandwipe	6b10a82c00	kv-cache : fix reading llama_kv_cell_ext during state read (#20273) Co-authored-by: sid <sid@ragingfist.net> b8353	2026-03-15 09:11:19 +02:00
Michael Wand	d23355afc3	model : wire up Qwen3.5/Qwen3.5MoE tensors for NVFP4 support (#20506) b8352	2026-03-14 22:44:42 +01:00
Georgi Gerganov	b30a5fdf37	metal : add FA specialization for HSK = 320, HSV = 256 (#20549) b8351	2026-03-14 23:15:47 +02:00
Georgi Gerganov	b4768955c4	ci : move self-hosted workflows to separate files (#20540) b8350	2026-03-14 23:15:35 +02:00
Gerard Guillemas Martos	fc350fdf96	docker : force Python 3.13 in Vulkan container (#20530) * ci: force Python 3.13 in Vulkan container * remove unnecessary `update-alternatives` line	2026-03-14 21:37:09 +01:00
Eve	3a6f059909	ci : try to optimize some jobs (#20521) * force arm version to test * run on either x86 or arm if we can help it, this only works for runs without ccache * readd other jobs * remove ccache b8348	2026-03-14 20:27:52 +01:00
Max Krasnyansky	609ea50026	hexagon: Q4_0 and MXFP4 repack fixes (#20527) * hexagon: fix tail corruption with rows sizes not multiple of 256 * hexagon: use different stride for repacking partial blocks * hex-mm: update repack and kernels to avoid shuffles for full 256-element blocks Previous commit changed the repacking to use even:odd (0:1,2:3,..) packing instead of the original (0:128,1:129,...) packing in order to fix tail corruption. Since the mm kernels already deal with partial tails we can use even:odd packing only for the last block. This avoid performance penalty of having to shuffle to zip the elements in the common case. * hex-mm: update rmpy x8 for better optimizations * hex-mm: tighten supported MUL_MAT checks to avoid spurios failures * hex-mm: use vzero to init accumulators * hex-mm: properly call partial rmpy_x8 b8347	2026-03-14 11:09:08 -07:00
Georgi Gerganov	9f774e45ee	ci : reduce webgpu tests timeout to 900s (#20538) [no ci]	2026-03-14 17:08:26 +02:00
Xuan-Son Nguyen	94d0262277	mtmd: add llama-mtmd-debug binary (#20508) * mtmd: add llama-mtmd-debug binary * adapt * fixes * fix compile error * fix windows compile error * rm legacy clip_debug_encode() * add MTMD_API to fix build	2026-03-14 15:52:29 +01:00
Neo Zhang	a93c0ef0fa	add op gated_delta_net (#20455)	2026-03-14 22:01:57 +08:00
Chedrian07	710878a7dd	webui: restore code preview iframe origin isolation (#20477)	2026-03-14 11:28:28 +01:00
Adrien Gallouët	0685848bc6	scripts : remove get-wikitext-103.sh (#20543) It doesn't work and no one seems to use it. $ wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip HTTP request sent, awaiting response... 301 Moved Permanently Location: unspecified ERROR: Redirection (301) without location. Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-14 11:22:04 +01:00
Adrien Gallouët	0024a69b70	scripts : update get-hellaswag.sh and get-winogrande.sh (#20542) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-14 11:21:50 +01:00
Adrien Gallouët	d0b79aaa2f	ggml : add native AVX512-FP16 support for F16 operations (#20529) The overall benchmark speed remains almost the same because the CPU is now calculating faster than the RAM can deliver the data. (See perf stat results below showing 2.7 billion fewer instructions). Also note that this path will be only enabled for native build or with custom flags. now: ``` Performance counter stats for 'build/bin/llama-bench -m Qwen3-0.6B-f16.gguf -p 512 -n 128': 189,073.52 msec task-clock # 14.658 CPUs utilized 404 context-switches # 2.137 /sec 19 cpu-migrations # 0.100 /sec 372,390 page-faults # 1.970 K/sec 310,877,195,595 instructions # 0.54 insn per cycle 581,071,530,602 cycles # 3.073 GHz 19,352,107,994 branches # 102.352 M/sec 48,304,438 branch-misses # 0.25% of all branches 84,998,431,152 L1-dcache-loads # 449.552 M/sec 12,186,410,279 L1-dcache-load-misses # 14.34% of all L1-dcache accesses 12.899358742 seconds time elapsed 187.823044000 seconds user 1.253416000 seconds sys ``` before: ``` Performance counter stats for 'build/bin/llama-bench -m Qwen3-0.6B-f16.gguf -p 512 -n 128': 190,594.56 msec task-clock # 14.652 CPUs utilized 436 context-switches # 2.288 /sec 22 cpu-migrations # 0.115 /sec 372,782 page-faults # 1.956 K/sec 313,574,921,966 instructions # 0.54 insn per cycle 586,064,970,425 cycles # 3.075 GHz 19,585,778,563 branches # 102.761 M/sec 48,437,488 branch-misses # 0.25% of all branches 86,219,336,628 L1-dcache-loads # 452.370 M/sec 12,232,085,771 L1-dcache-load-misses # 14.19% of all L1-dcache accesses 13.007923164 seconds time elapsed 189.395316000 seconds user 1.202612000 seconds sys ``` Signed-off-by: Adrien Gallouët <angt@huggingface.co> b8340	2026-03-14 10:06:14 +01:00
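
The `perf stat` figures quoted in commit d0b79aaa2f (AVX512-FP16 support) can be cross-checked with a little arithmetic. The sketch below is not part of the repository: it copies three counters from that commit message and computes the absolute and relative change. The `deltas` helper and the dictionary keys are hypothetical names chosen for illustration.

```python
# Counter values copied from the `perf stat` output quoted in commit d0b79aaa2f.
# The dict keys and the helper name are illustrative, not from llama.cpp.
before = {
    "instructions": 313_574_921_966,
    "cycles": 586_064_970_425,
    "elapsed_s": 13.007923164,
}
after = {
    "instructions": 310_877_195_595,
    "cycles": 581_071_530_602,
    "elapsed_s": 12.899358742,
}


def deltas(before, after):
    """Return {counter: (absolute change, percent change)} relative to `before`."""
    return {
        name: (after[name] - before[name],
               100.0 * (after[name] - before[name]) / before[name])
        for name in before
    }


if __name__ == "__main__":
    for name, (abs_change, pct) in deltas(before, after).items():
        print(f"{name}: {abs_change:+,} ({pct:+.2f}%)")
```

The instruction count drops by 2,697,726,371, which matches the "2.7 billion fewer instructions" in the commit message, while each of the three counters changes by less than 1%, consistent with the statement that overall benchmark speed stays almost the same.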


8389 Commits