* Q4_0 MoE CLC pass sanity check
* release program
* opencl: fix whitespace
* opencl: remove unused cl_program
* opencl: break up #if block to make it clearer
* opencl: adjust format
---------
Co-authored-by: Li He <lih@qti.qualcomm.com>
* convert : fix RuntimeError when stripping FP8 KV-cache scales
In `ModelBase._generate_nvfp4_tensors` the final cleanup loop iterates over
`self.model_tensors.keys()` and calls `del` on the same dict, which raises
`RuntimeError: dictionary changed size during iteration` when a ModelOpt
NVFP4 model also has FP8 KV-cache scales (e.g. mmangkad/Qwen3.6-35B-A3B-NVFP4,
or any ModelOpt config with `kv_cache_quant_algo: FP8`).
Wrap the keys view in `list()` so the deletions happen on a snapshot.
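The fix follows the standard Python pattern for deleting entries while iterating; a minimal sketch (the tensor names and the `.kv_scale` condition are illustrative, not the actual `convert_hf_to_gguf.py` logic):

```python
def strip_scales(tensors: dict) -> dict:
    # Iterating tensors.keys() directly while deleting would raise
    # "RuntimeError: dictionary changed size during iteration";
    # list() snapshots the keys first, so the deletions are safe.
    for name in list(tensors.keys()):
        if name.endswith(".kv_scale"):  # illustrative condition
            del tensors[name]
    return tensors
```

Without the `list()` call, the first deletion invalidates the live keys view and the loop fails on the next step.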
* re-add another accidentally removed list
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* CUDA: batch out_prod inner loop with cublasSgemmStridedBatched
* CUDA: add cublasSgemmStridedBatched mapping for HIP and MUSA backends
* webui: add LLM title generation option
* webui: use chat_template_kwargs for title gen + fix conversation check
* webui: capture firstUserMessage before async streamChatCompletion to fix race condition
* webui: extract LLM title generation into separate method
* webui: use constants and ChatService for LLM generated titles
* webui: rebuild static output
* webui: add LLM title generation setting to new settings location
* webui: use sendMessage in generateTitle
* webui: rebuild static output
* webui: fix formatting
* webui: configurable title prompt, remove think tag regexes, fix TS error
* webui: group title constants into TITLE object, use TruncatedText for CSS truncation and fix race condition
* webui: rebuild static output
* Write a readme on Multi-GPU usage in llama.cpp
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Address review comments
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
The error:
`./examples/sycl/test.sh: line 122: level_zero:${$GGML_SYCL_DEVICE}: bad substitution`
was thrown whenever the user ran:
`./examples/sycl/test.sh -mg 0`
The fix is to drop the stray dollar sign so the expansion reads `${GGML_SYCL_DEVICE}`.
* common: do not fit to unknown device memory
Signed-off-by: Florian Reinle <f.reinle@otec.de>
* common: preserve host fallback for non-GPU fit devices
Signed-off-by: Florian Reinle <f.reinle@otec.de>
* common: keep unknown GPU fit memory at zero
Signed-off-by: Florian Reinle <f.reinle@otec.de>
---------
Signed-off-by: Florian Reinle <f.reinle@otec.de>
* feat: migrate to PEP 621 and add uv support
* fix: remove upper bound on protobuf
* remove poetry.lock and uv.lock
* fix/add torch dependency version and markers
* fix dev-dependency deprecation warning
* gguf-py : update python version requirement to 3.10
---------
Co-authored-by: David Huggins-Daines <dhd@dhd.ecolingui.ca>
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* convert : ignore non-language tensors for Gemma4Model
This commit adds a check to ensure that only text (language-model) tensors
are handled in `filter_tensors`.
The motivation is that currently when trying to convert a Gemma4 model
the following error occurs:
```console
(venv) $ ./convert-gemma.sh
INFO:hf-to-gguf:Loading model: gemma-4-E2B-it
INFO:hf-to-gguf:Model architecture: Gemma4ForConditionalGeneration
INFO:hf-to-gguf:gguf: indexing model part 'model.safetensors'
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight, torch.float32 --> F32, shape = {256}
Traceback (most recent call last):
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 13752, in <module>
main()
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 13746, in main
model_instance.write()
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 945, in write
self.prepare_tensors()
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 805, in prepare_tensors
for new_name, data_torch in (self.modify_tensors(data_torch, name, bid)):
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 7925, in modify_tensors
yield from super().modify_tensors(data_torch, name, bid)
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 7290, in modify_tensors
yield from super().modify_tensors(data_torch, name, bid)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 579, in modify_tensors
new_name = self.map_tensor_name(name)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 572, in map_tensor_name
raise ValueError(f"Can not map tensor {name!r}")
ValueError: Can not map tensor 'model.embed_vision.embedding_projection.weight'
```
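The check can be sketched as a simple prefix filter; the prefixes below are taken from the error message above, while the function shape and names are assumptions, not the actual `filter_tensors` implementation:

```python
# Hypothetical filter: drop vision/audio tensors so that only
# text-model tensors reach map_tensor_name during conversion.
NON_LANGUAGE_PREFIXES = ("model.embed_vision.", "model.embed_audio.")

def filter_language_tensors(names: list[str]) -> list[str]:
    # str.startswith accepts a tuple of prefixes.
    return [n for n in names if not n.startswith(NON_LANGUAGE_PREFIXES)]
```

With such a filter in place, `model.embed_vision.embedding_projection.weight` is skipped instead of hitting the `ValueError` in `map_tensor_name`.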
* add forgotten embed_vision and embed_audio
* improve
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* hex-mm: process m-tail rows on HMX instead of HVX
* hmx-mm: unroll and optimize padded activation loop
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
common/arg.cpp:3719:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
3719 | [](common_params & /*params*/, int /*value*/) {
| ^
common/arg.cpp:3726:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
3726 | [](common_params & /*params*/, int /*value*/) {
| ^
common/arg.cpp:3733:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
3733 | [](common_params & /*params*/, int /*value*/) {
| ^
common/arg.cpp:3740:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
3740 | [](common_params & /*params*/, int /*value*/) {
| ^
common/arg.cpp:3747:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
3747 | [](common_params & /*params*/, int /*value*/) {
| ^
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Previously, unknown tool names passed via `--tools` were silently ignored.
Now the server validates each tool name at startup and exits with an
error listing the available tools if an unrecognized name is specified.
Assisted-by: llama.cpp:local pi
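The startup validation described above can be sketched as follows; the tool registry, the tool names, and the function are hypothetical, and the sketch is in Python rather than the server's C++:

```python
import sys

# Hypothetical registry of known tool names (not the server's actual list).
AVAILABLE_TOOLS = {"search", "calculator", "code_interpreter"}

def validate_tools(requested: list[str]) -> None:
    # Exit at startup with an error listing the available tools
    # if any requested name is unknown, instead of ignoring it.
    unknown = [t for t in requested if t not in AVAILABLE_TOOLS]
    if unknown:
        sys.exit(f"error: unknown tool(s) {unknown}; "
                 f"available tools: {sorted(AVAILABLE_TOOLS)}")
```

`sys.exit` with a string prints the message to stderr and exits with status 1, which matches the fail-fast behaviour the commit describes.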