7616 Commits

Author SHA1 Message Date
Jeff Bolz
18ddaea2ae vulkan: Optimize GGML_OP_CUMSUM (#18417)
* vulkan: Optimize GGML_OP_CUMSUM

There are two paths: The preexisting one that does a whole row per workgroup
in a single shader, and one that splits each row into multiple blocks and does
two passes. The first pass computes partials within a block, the second adds
the block partials to compute the final result. The multipass shader is used
when there are a small number of large rows.

In the whole-row shader, handle multiple elements per invocation.

* use 2 ELEM_PER_THREAD for AMD/Intel

* address feedback
b7616
2026-01-02 15:32:30 -06:00
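Note on 18ddaea2ae: the two CUMSUM paths described above can be illustrated on the CPU. This is only a minimal sketch of the idea, not the Vulkan shader; the function names and the `block_size` parameter are hypothetical, and the real implementation runs the blocks in parallel across workgroups.

```python
from itertools import accumulate


def cumsum_whole_row(row):
    """Single-pass path: one inclusive prefix sum over the whole row."""
    return list(accumulate(row))


def cumsum_two_pass(row, block_size):
    """Multipass path: split the row into blocks and scan in two passes.

    block_size is a hypothetical tuning parameter and must be > 0.
    """
    blocks = [row[i:i + block_size] for i in range(0, len(row), block_size)]

    # Pass 1: inclusive prefix sum inside each block, independent of the others.
    partials = [list(accumulate(block)) for block in blocks]

    # Pass 2: add the total of all preceding blocks to every element of a block.
    out = []
    offset = 0
    for partial in partials:
        out.extend(value + offset for value in partial)
        offset += partial[-1]
    return out
```

Both functions return the same result for any row; the blocked form only changes how the work can be distributed, which is why it suits a small number of large rows.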
Jeff Bolz
706e3f93a6 vulkan: Implement mmvq for iq1_s/iq1_m (#18450) b7615 2026-01-02 20:19:04 +01:00
Prabod
5755e52d15 model : Maincoder-1B support (#18534)
* Add Maincoder model support

* Removed SPM model vocabulary setting and MOE related GGUF parameters
Removed trailing spaces from maincoder.cpp

* removed set_vocab

* added new line

* Fix formatting

* Add a new line for PEP8
b7614
2026-01-02 20:11:59 +01:00
Georgi Gerganov
f38de16341 metal : adjust extra size for FA buffer to avoid reallocations (#18545) b7613 2026-01-02 19:02:18 +02:00
Georgi Gerganov
af1e8e1a6c graph : reduce topology branching (#18548) b7612 2026-01-02 19:01:56 +02:00
Georgi Gerganov
d84a6a98be vocab : reduce debug logs about non-EOG control tokens (#18541)
* vocab : reduce debug logs about non-EOG control tokens

* cont : add comment
b7611
2026-01-02 16:17:33 +02:00
Chris Rohlf
c6f0e832da rpc : use unordered_map::reserve and emplace (#18513) b7610 2026-01-02 12:09:36 +02:00
MeeMin
e86f3c2221 cuda : fix copy of large tensors (ggml_nbytes <= INT_MAX assertion) (#18433)
* ggml-cuda: fixed assertion in ggml_cuda_cpy (#18140)

* ggml-cuda: changes in data types to int64_t

* ggml-cuda: added asserts for CUDA block numbers

* ggml-cuda: changed the condition for y and z dimension
b7609
2026-01-02 00:24:20 +01:00
Sigbjørn Skjæret
169ee68ffb model : remove modern-bert iswa template (#18529)
* remove modern-bert iswa template

* forgotten
b7608
2026-01-02 00:06:42 +01:00
tt
ced765be44 model: support youtu-vl model (#18479)
* Support Youtu-VL Model

* merge code

* fix bug

* revert qwen2 code & support rsplit in minja.hpp

* update warm info

* fix annotation

* u

* revert minja.hpp

* fix

* Do not write routed_scaling_factor to gguf when routed_scaling_factor is None

* fix expert_weights_scale

* LGTM after whitespace fixes

* fix

* fix

* fix

* layers to layer_index

* enum fix

---------

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b7607
2026-01-01 19:25:54 +01:00
Piotr Wilkin (ilintar)
3ccccc83f7 Add conversion support for IQuestCoderForCausalLM (#18524) 2026-01-01 18:45:55 +01:00
o7si
d0a6a31470 model : add support for JinaBertModel with non-gated ffn (#18475)
* WIP: Initial commit for fixing JinaBert original FF type support

* convert: add jina-v2-de tokenizer variant for German_Semantic_V3

* convert: fix token collision in BERT phantom vocab conversion

* convert: add feed_forward_type metadata

* model: add feed_forward_type metadata for jina-bert-v2

* model: jina-bert-v2 support standard GELU FFN variant

* model: remove ffn_type, detect FFN variant from tensor dimensions

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/bert.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/bert.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* revert collision fix to be handled in separate PR

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b7605
2026-01-01 18:38:51 +01:00
o7si
2b2afade9f convert : fix encoding of WPM vocab for BERT models (#18500)
* convert: avoid token collision when stripping ## prefix

* convert: use token types for BERT special tokens check

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-01 18:27:07 +01:00
HelloKS
f4f5019254 model: add Solar Open model (#18511)
* model: add Solar-Open model

* vocab: add solar-open to end eog blacklist

* model: add proper llm type

* chat: basic template for solar open

* typo: fix comment about vocab

* convert: suggested changes

* convert: suggested changes

* chat: change reasoning end tag for solar-open

* llama-chat: add solar-open template
b7603
2026-01-01 18:01:43 +01:00
Anri Lombard
d5574c919c webui: fix code copy stripping XML/HTML tags (#18518)
* webui: fix code copy stripping XML/HTML tags

* webui: update static build
2026-01-01 13:44:11 +01:00
Aman Gupta
26831bded9 ggml-cuda: remove unnecessary prints on ggml_cuda_init (#18502) b7601 2026-01-01 19:18:43 +08:00
Jeff Bolz
be47fb9285 vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron (#18295)
* vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron

Also handle GGML_OP_SCALE at the end (nemotron, deepseek2).

Fewer pipeline variants and spec constants, just use push constants.

In test_topk_moe, change exp_probs_b to be 1D, matching real networks.

Update test-backend-ops and ggml-backend to allow verifying multiple outputs
in a fusion test (topk_moe has two outputs). Previously only the final node
was verified.

* change test_topk_moe to allow results in arbitrary order

* disable sigmoid fusion for moltenvk
b7600
2026-01-01 08:58:27 +01:00
triplenom
9e10bd2eaf llama: handle short reads in direct I/O path (#18504) b7599 2026-01-01 10:24:43 +08:00
Anri Lombard
4cd162a123 chat: make tool description and parameters optional per OpenAI spec (#18478)
* chat: make tool description and parameters optional per OpenAI spec

Per the OpenAI API specification, both 'description' and 'parameters'
fields in tool function definitions are optional. Previously, the parser
would throw an exception if these fields were missing.

Attempts to fix #17667

* refactor: use value() for cleaner optional field access
b7598
2025-12-31 17:21:37 -06:00
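Note on 4cd162a123: the behaviour described above, tolerating a missing 'description' or 'parameters' field instead of throwing, might look like the following sketch. It is written in Python rather than the project's C++ for brevity; `parse_tool` and the chosen defaults (an empty string and an empty object) are assumptions made for illustration, not the parser's actual code.

```python
def parse_tool(tool: dict) -> dict:
    """Read a tool definition, treating 'description' and 'parameters' as optional."""
    function = tool["function"]  # still required in this sketch
    return {
        "name": function["name"],  # still required in this sketch
        # dict.get() supplies a default instead of raising on a missing key.
        "description": function.get("description", ""),
        "parameters": function.get("parameters", {}),
    }
```

A tool given only as `{"type": "function", "function": {"name": "ping"}}` then parses without an exception.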
Georgi Gerganov
13814eb370 sync : ggml 2025-12-31 18:54:43 +02:00
Georgi Gerganov
54f67b9b66 ggml : bump version to 0.9.5 (ggml/1410) 2025-12-31 18:54:43 +02:00
Anri Lombard
33ded988ba quantize: prevent input/output file collision (#18451)
Check if input and output files are the same before quantizing to prevent
file corruption when mmap reads from a file being written to.

Fixes #12753
b7595
2025-12-31 23:29:03 +08:00
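Note on 33ded988ba: a check of this kind can be sketched as below. This is an illustrative Python version, not the code added to the quantize tool; `is_same_file` is a hypothetical helper. It compares the files themselves rather than the path strings, so differently spelled paths to one file are still caught.

```python
import os


def is_same_file(input_path: str, output_path: str) -> bool:
    """Return True if both paths refer to the same existing file."""
    try:
        # samefile() compares device and inode numbers, so it sees through
        # relative paths, symlinks and hard links.
        return os.path.samefile(input_path, output_path)
    except FileNotFoundError:
        # An output file that does not exist yet cannot collide with the input.
        return False
```

A caller would refuse to start when `is_same_file(fname_inp, fname_out)` is true; both variable names are placeholders.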
Sigbjørn Skjæret
0db8109849 convert : lint fix (#18507) 2025-12-31 14:28:21 +01:00
Henry147147
9b8329de7a mtmd : Adding support for Nvidia Music Flamingo Model (#18470)
* Initial commit, debugging q5_k_s quant

* Made hf_to_gguf extend whisper to reduce code duplication

* addressed convert_hf_to_gguf pull request issue

---------

Co-authored-by: Henry D <henrydorsey147@gmail.com>
b7593
2025-12-31 12:13:23 +01:00
gatbontonpc
9a6369bb60 metal : add count_equal op (#18314)
* add count equal for metal

* remove trailing whitespace

* updated doc ops table

* changed shmem to i32

* added multi tg and templating

* removed BLAS support from Metal docs

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add memset to set dst to 0

* metal : cleanup

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b7592
2025-12-31 10:39:48 +02:00
Johannes Gäßler
ecc343de63 CUDA: fix KQ max calculation (#18487) b7591 2025-12-31 09:37:00 +01:00
Georgi Gerganov
01ade96e71 metal : remove BF16 x F16 kernels (#18456) b7590 2025-12-31 09:53:48 +02:00
Aman Gupta
7bcaf815c2 sycl: add newline at the end of CMakeLists.txt (#18503) b7589 2025-12-31 14:23:44 +08:00
Rahul Sathe
c8a3798041 Work around broken IntelSYCLConfig.cmake in Intel oneAPI 2025.x (#18345)
* cmake: work around broken IntelSYCLConfig.cmake in oneAPI 2025.x

* [AI] sycl: auto-detect and skip incompatible IntelSYCL package

Automatically detect compiler versions with incompatible IntelSYCL
CMake configuration files and fall back to manual SYCL flags instead
of requiring users to set options manually.

Fixes build failures with oneAPI 2025.x where IntelSYCLConfig.cmake
has SYCL_FEATURE_TEST_EXTRACT invocation errors.

* refactor: improve SYCL provider handling and error messages in CMake configuration

* refactor: enhance SYCL provider validation and error handling in CMake configuration

* ggml-sycl: wrap find_package(IntelSYCL) to prevent build crashes
b7588
2025-12-31 09:08:44 +08:00
Sigbjørn Skjæret
4849661d98 docker : add CUDA 13.1 image build (#18441)
* add updated cuda-new.Dockerfile for Ubuntu 24.04 compatibility

* add cuda13 build
2025-12-30 22:28:53 +01:00
Bart Louwers
6e0c8cbc40 docs : document that JSON Schema is not available to model when using response_format (#18492)
* Document unsupported JSON Schema annotations

Add note about unsupported JSON Schema annotations.

* Update README.md

* Update README.md

* Update README.md
2025-12-30 15:13:49 -06:00
Aldehir Rojas
0f89d2ecf1 common : default content to an empty string (#18485)
* common : default content to an empty string

* common : fix tests that break when content != null
b7585
2025-12-30 12:00:57 -06:00
Daniel Bevenius
ac1d0eb7bf llama : fix typo in comment in llama-kv-cache.h [no ci] (#18489) 2025-12-30 17:20:14 +01:00
Xuan-Son Nguyen
cd78e57c3a lora: count lora nodes in graph_max_nodes (#18469)
* lora: count lora nodes in graph_max_nodes

* 3 nodes per weight

* 4 nodes

* keep track n_lora_nodes from llama_model

* fix assert

* rm redundant header

* common: load adapters before context creation

* use 6 nodes
b7583
2025-12-30 15:53:12 +01:00
Jay Zenith
c32fa21db8 sampling: reuse token data buffer in llama_sampler_sample (#18365)
* sampling: reuse token data buffer in llama_sampler_sample

* move cur buffer before timing section, after samplers

* minor : fix build

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b7582
2025-12-30 16:27:49 +02:00
Jeff Bolz
f14f4e421b server: fix files built redundantly (#18474) b7581 2025-12-30 13:11:13 +01:00
Charles Xu
2d6c00a9b8 kleidiai: add and integrate SVE 256-bit vector-length kernel (#18458)
* kleidiai: add and integrate SVE 256-bit vector-length kernel

* updated for review comments
b7580
2025-12-30 14:04:53 +02:00
Aman Gupta
d77d7c5c06 CUDA: add log line when mxfp4 acceleration is used (#18483)
* CUDA: add log line when mxfp4 acceleration is used

* add in backend_get_features
b7579
2025-12-30 17:40:46 +08:00
Daniel Bevenius
a864fb1c14 model-conversion : use CONVERTED_MODEL for compare-embeddings (#18461)
This commit updates the causal model verification script to use the
CONVERTED_MODEL environment variable instead of using the MODEL_PATH
(the original model path) as the basis for the converted model file
name.

The motivation for this is that currently, if the converted model file name
differs from the original model directory/name, the verification script
will look for the wrong .bin file that was generated when running
the converted model.

This is similar to the change made for the embeddings models script in
Commit db81d5ec4b ("model-conversion :
use CONVERTED_EMBEDDING_MODEL for embedding_verify_logits (#18079)"),
but we also verify the embeddings for causal models as well.
2025-12-30 10:13:12 +01:00
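Note on a864fb1c14: the idea of deriving the lookup name from CONVERTED_MODEL rather than MODEL_PATH could be sketched as follows. Only the two environment variable names come from the commit message; the helper name and the use of the file stem as the base name are assumptions for illustration.

```python
import os
from pathlib import Path


def converted_model_name(env=os.environ) -> str:
    """Base name used to locate files produced by running the converted model.

    Hypothetical helper: reads CONVERTED_MODEL instead of MODEL_PATH.
    """
    converted = env.get("CONVERTED_MODEL")
    if not converted:
        raise KeyError("CONVERTED_MODEL is not set")
    # Strip the directory and the extension, e.g. /models/foo-q8.gguf -> foo-q8.
    return Path(converted).stem
```

With this approach the lookup keeps working when the converted file was given a name unrelated to the original model directory.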
Xuan-Son Nguyen
51a48720b8 webui: fix prompt progress ETA calculation (#18468)
* webui: fix prompt progress ETA calculation

* handle case done === 0
b7577
2025-12-29 21:42:11 +01:00
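Note on 51a48720b8: an ETA that guards the "done === 0" case might be computed as in this sketch. The webui itself is TypeScript; this Python version, its function name and its parameters are assumptions chosen only to show the guard, not the actual fix.

```python
def estimate_eta_seconds(done: int, total: int, elapsed_s: float):
    """Estimate remaining seconds, or None when no rate can be derived yet."""
    # With nothing processed (or no time elapsed) the rate is undefined, so
    # report "unknown" instead of dividing by zero.
    if done <= 0 or elapsed_s <= 0:
        return None
    rate = done / elapsed_s  # tokens per second so far
    remaining = max(total - done, 0)
    return remaining / rate
```

Returning None lets the caller hide the ETA until the first chunk has been processed.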
Pascal
c9a3b40d65 Webui/prompt processing progress (#18300)
* webui: display prompt preprocessing progress

* webui: add percentage/ETA and exclude cached tokens from progress

Address review feedback from ngxson

* webui: add minutes and first chunk (0%) case

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* webui: address review feedback from allozaur

* chore: update webui build output

* webui: address review feedback from allozaur

* nit

* chore: update webui build output

* feat: Enhance chat processing state

* feat: Improve chat processing statistics UI

* chore: update webui build output

* feat: Add live generation statistics to processing state hook

* feat: Persist prompt processing stats in hook for better UX

* refactor: Enhance ChatMessageStatistics for live stream display

* feat: Implement enhanced live chat statistics into assistant message

* chore: update webui build output

* fix: Proper tab for each stage of prompt processing/generation

* chore: update webui build output

* fix: Improved ETA calculation & display logic

* chore: update webui build output

* feat: Simplify logic & remove ETA from prompt progress

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2025-12-29 19:32:21 +01:00
Johannes Gäßler
0bd1212a43 CUDA: fix replacement of bad archs in CMake (#18457) 2025-12-29 17:58:20 +01:00
wbtek
5b1248c9af server : Cmdline arg -to changes http read timeout from current 600sec default (#18279)
* Prevent crash if TTFT >300sec, boosted to 90 days

* server : allow configurable HTTP timeouts for child models

* server : pass needed timeouts from params only

---------

Co-authored-by: Greg Slocum <fromgit@wbtek.slocum.net>
b7574
2025-12-29 17:12:48 +01:00
Xuan-Son Nguyen
3595ae5963 contributing: tighten AI usage policy (#18388)
* contributing: tighten AI usage policy

* refactor AGENTS.md

* proofreading

* update contributing

* add claude.md

* add trailing newline

* add note about dishonest practices

* rm point about dishonest

* rm requirement watermarking

* add .gemini/settings.json

* allow initially AI-generated content

* revise

* Update CONTRIBUTING.md

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* improve

* trailing space

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* update

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-12-29 16:01:32 +01:00
Naco Siren
c1366056f6 android: routine maintenance - Dec 2025 (#18338)
* Fix `msg` typo

* Fix thread safety in destroy() to support generation abortion in lifecycle callbacks.

* UI polish: stack new message change from below; fix GGUF margin not in view port

* Bug fixes: rare race condition when the main thread updates the view and the default thread updates messages at the same time; user input not disabled during generation.

* Bump dependencies' versions; Deprecated outdated dsl usage.
b7572
2025-12-29 15:51:13 +02:00
Georgi Gerganov
2a85f720b8 server : handle closed connection for tasks (#18459) b7571 2025-12-29 15:34:41 +02:00
Daniel Bevenius
7cbec34a63 model-conversion : add device option to embd run orig model (#18386)
This commit refactors the original model embedding script to include a
device selection option. Users can now specify the device (cpu, cuda,
mps, auto) via command-line arguments. It also refactors the code to be
more structured.
2025-12-29 13:37:02 +01:00
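Note on 7cbec34a63: a device option with the four values listed above could be wired up as in the sketch below. The `--device` flag name, the "auto" default and the cuda-then-mps-then-cpu preference are assumptions; the availability flags are passed in as plain booleans so the example does not depend on any ML framework.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Command-line parser with a hypothetical --device option."""
    parser = argparse.ArgumentParser(description="Run the original embedding model")
    parser.add_argument(
        "--device",
        choices=["cpu", "cuda", "mps", "auto"],
        default="auto",
        help="device to run the model on",
    )
    return parser


def resolve_device(requested: str, cuda_available: bool = False,
                   mps_available: bool = False) -> str:
    """Turn 'auto' into a concrete device; pass explicit choices through."""
    if requested != "auto":
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"
```

In a real script the two booleans would come from the framework in use, and the resolved name would be passed to the model loading call.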
Héctor Estrada Moreno
0c8986403b retrieval : use at most n_seq_max chunks (#18400) b7569 2025-12-29 13:21:13 +02:00
o7si
daa242dfc8 common: fix return value check for setpriority (#18412)
* common: fix return value check for setpriority

* tools: add logging for process priority setting
b7568
2025-12-29 11:07:49 +02:00
Johannes Gäßler
e70e640db3 CUDA: Blackwell features for non-native builds (#18436) b7567 2025-12-29 09:35:42 +01:00