395 Commits

Author SHA1 Message Date
Piotr Wilkin (ilintar)
d2ecd2d1cf common/parser: add --skip-chat-parsing to force a pure content parser. (#20289)
* Add `--force-pure-content` to force a pure content parser.

* Update common/arg.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Change parameter name [no ci]

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-17 16:16:43 +01:00
Georgi Gerganov
8cc2d81264 server : fix ctx checkpoint invalidation (#20671) 2026-03-17 15:21:14 +02:00
Piotr Wilkin (ilintar)
2e4a6edd4a tools/server: support refusal content for Responses API (#20285)
* Support refusal content for Responses API

* Update tools/server/server-common.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tools/server/server-common.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-17 01:42:04 +01:00
Pascal
dddca026bf webui: add model information dialog to router mode (#20600)
* webui: add model information dialog to router mode

* webui: add "Available models" section header in model list

* webui: remove nested scrollbar from chat template in model info dialog

* chore: update webui build output

* feat: UI improvements

* refactor: Cleaner rendering + UI docs

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-03-16 15:38:11 +01:00
Aleksander Grygier
67a2209fab webui: Add MCP CORS Proxy detection logic & UI (#20167)
* refactor: MCP store cleanup

* feat: Add MCP proxy availability detection

* fix: Sidebar icon

* chore: update webui build output

* chore: Formatting

* chore: update webui build output

* chore: Update package lock

* chore: update webui build output

* chore: update webui build output

* chore: update webui build output
2026-03-16 13:05:36 +01:00
Pascal
d65c4f2dc9 Fix model selector locked to first loaded model with multiple models (#20580)
* webui: fix model selector being locked to first loaded model

When multiple models are loaded, the auto-select effect would re-fire
on every loadedModelIds change, overriding the user's manual model
selection. Guard with selectedModelId so auto-select only kicks in
when no model is chosen yet.

* chore: update webui build output
2026-03-16 12:04:06 +01:00
Woof Dog
d8c331c0af webui: use date in more human readable exported filename (#19939)
* webui: use date in exported filename

Move conversation naming and export to utils

update index.html.gz

* webui: move literals to message export constants file

* webui: move export naming and download back to the conversation store

* chore: update webui build output

* webui: add comments to some constants

* chore: update webui build output
2026-03-16 11:18:13 +01:00
Georgi Gerganov
88915cb55c server : fix wait in test_cancel_requests() test (#20601)
* server : fix wait in test_cancel_requests() test

* codeowners : add team for server tests
2026-03-15 20:54:37 +02:00
Chedrian07
710878a7dd webui: restore code preview iframe origin isolation (#20477) 2026-03-14 11:28:28 +01:00
ZeroV0LT
f17b3be63f llama : fix pooling assertion crash in chunked GDN detection path (#20468)
* llama : fix pooling assertion crash in chunked GDN detection path

The chunked fused Gated Delta Net detection in sched_reserve() calls
graph_reserve(16*n_seqs, n_seqs, n_outputs, ...) where n_outputs = n_seqs.
This creates a dimension mismatch in build_pooling() for embedding models
with mean/rank pooling: build_inp_mean() creates a tensor with shape
[n_tokens=16*n_seqs, ...] while t_embd is reduced to [n_outputs=n_seqs, ...]
via out_ids, causing ggml_mul_mat to assert on ggml_can_mul_mat(a, b).

Fix: pass n_tokens as n_outputs in the chunked GDN graph reservation,
matching the pattern used by the pp/tg worst-case reservations.

Regression introduced by #20340 (d28961d).
Same class of bug as #12517, fixed by #12545.

* server : add mean pooling tests to embedding test suite

Add test_embedding_pooling_mean and test_embedding_pooling_mean_multiple
to cover the --pooling mean codepath, which was previously untested.

These tests would have caught the regression introduced by #20340 where
build_pooling() crashes with a ggml_mul_mat assertion due to mismatched
dimensions in the chunked GDN detection path.

---------

Co-authored-by: Domenico Crupi <domenico@zerovolt.it>
2026-03-13 20:53:42 +02:00
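The dimension mismatch described in f17b3be63f can be illustrated with a simplified stand-in for the shape check. This is a minimal sketch, not ggml code: `shape2d`, `can_mul_mat_sketch` and `pooling_shapes_match` are hypothetical names, only the leading-dimension requirement is modeled, and the exact operand layout is an assumption based on the shapes quoted in the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 2-D shape; ne0 is the dimension a matrix multiply contracts over.
struct shape2d {
    int64_t ne0;
    int64_t ne1;
};

// Simplified stand-in for the shape check: both operands must agree on ne0.
// Assumption: the real check also covers broadcast dimensions, which are not modeled here.
inline bool can_mul_mat_sketch(const shape2d & a, const shape2d & b) {
    return a.ne0 == b.ne0;
}

// Mimics the graph reservation: the mean-pooling input spans n_tokens entries,
// while the embeddings have been reduced to n_outputs entries.
inline bool pooling_shapes_match(int64_t n_tokens, int64_t n_outputs, int64_t n_seqs, int64_t n_embd) {
    const shape2d embd     = { n_outputs, n_embd }; // assumed layout of the reduced embeddings
    const shape2d inp_mean = { n_tokens,  n_seqs };
    return can_mul_mat_sketch(embd, inp_mean);
}
```

With `n_outputs = n_seqs` and `n_tokens = 16*n_seqs` the check fails; passing `n_tokens` as `n_outputs`, as the fix does, makes both sides agree.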
SoftwareRenderer
d7ba99c485 server: reset counter related to kill-switch on client error (#20513)
* server: reset kill-switch on client error

This avoids triggering a server kill switch.

If the client sends a request that exceeds the configured context size, an appropriate HTTP 400 response is provided and no tokens are generated.

However, since no tokens are generated, update_slots() increments n_empty_consecutive. If the client sends 3 such requests in a row, the server terminates.

* moved counter reset as per recommendation

* cont : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-13 19:58:09 +02:00
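The counter behaviour described in d7ba99c485 can be sketched as follows. This is an illustration under stated assumptions, not the server code: `empty_iteration_guard` and its members are hypothetical, and only the counter name `n_empty_consecutive` and the threshold of 3 come from the commit message.

```cpp
#include <cassert>

// Hypothetical tracker for consecutive iterations that produced no tokens.
struct empty_iteration_guard {
    static constexpr int max_empty = 3; // threshold taken from the commit message
    int n_empty_consecutive = 0;

    // Call after an iteration that generated no tokens.
    // Returns true when the kill switch would fire.
    bool on_empty_iteration() {
        return ++n_empty_consecutive >= max_empty;
    }

    // Call when tokens were generated, or when a request was rejected
    // with a client error (the case this commit adds).
    void reset() {
        n_empty_consecutive = 0;
    }
};
```

Resetting on a client error means that three oversized requests in a row no longer count as a stuck server.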
Piotr Wilkin (ilintar)
0e810413bb tests : use reasoning instead of reasoning_budget in server tests (#20432) 2026-03-12 13:41:01 +01:00
Pascal
de190154c8 New conversations now auto-select the first loaded model (#20403)
* webui: auto-select first loaded model for new conversations in router mode

* chore: update webui build output
2026-03-12 09:07:05 +01:00
Piotr Wilkin (ilintar)
acb7c79069 common/parser: handle reasoning budget (#20297)
* v1

* Finished!

* Handle cli

* Reasoning sampler

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Less explosive terminology :)

* Add utf-8 case and tests

* common : migrate reasoning budget sampler to common

* cont : clean up

* cont : expose state and allow passing as initial state

* cont : remove unused imports

* cont : update state machine doc string

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
2026-03-11 10:26:12 +01:00
Pascal
00de615345 Fix agentic mcp image single model (#20339)
* webui: fix MCP image attachments dropped during the agentic loop in single-model mode

* chore: update webui build output
2026-03-11 05:31:33 +01:00
Georgi Gerganov
a7b3dee7a5 server : make 2 checkpoints near the end of the prompt (#20288)
* server : make 2 checkpoints near the end of the prompt

* cont : adjust checkpoints
2026-03-10 14:28:23 +02:00
Evan Huus
23fbfcb1ad server: Parse port numbers from MCP server URLs in CORS proxy (#20208)
* Parse port numbers from MCP server URLs

* Pass scheme to http proxy for determining whether to use SSL

* Fix download on non-standard port and re-add port to logging

* add test

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-03-09 17:47:54 +01:00
Georgi Gerganov
96cfc4992c server : fix checkpoints n_tokens calculation (#20287) 2026-03-09 16:47:06 +02:00
Georgi Gerganov
344ee2a38a server : warn swa-full is not supported for non-SWA models (#20291) 2026-03-09 16:44:25 +02:00
Georgi Gerganov
d6e1556499 server : fix off-by-1 in server_tokens::size_up_to_pos() (#20279)
* server : fix off-by-1 in server_tokens::size_up_to_pos()

* cont : fix typo [no ci]
2026-03-09 16:43:38 +02:00
Georgi Gerganov
107d599952 server : add kill switch when server is stuck (#20277) 2026-03-09 10:33:12 +02:00
Georgi Gerganov
d417bc43dd server : do not create checkpoints right after mtmd chunks (#20232) 2026-03-08 22:16:46 +02:00
decahedron1
ff52ee964d server : correct index on finish in OAI completion streams (#20226) 2026-03-08 10:08:57 +01:00
Piotr Wilkin (ilintar)
566059a26b Autoparser - complete refactoring of parser architecture (#18675)
* Autoparser - full single commit squish

* Final pre-merge changes: minor fixes, Kimi 2.5 model parser
2026-03-06 21:01:00 +01:00
Tom Vaucourt
e68f2fb894 server : preserve anthropic thinking blocks in conversion (#20120)
* server : preserve anthropic thinking blocks in conversion (#20090)

* server : add tests for anthropic thinking block conversion

---------

Co-authored-by: root <root@llamacpp.home>
2026-03-06 17:41:12 +01:00
Piotr Wilkin (ilintar)
f5ddcd1696 Checkpoint every n tokens: squash (#20087) 2026-03-06 11:39:26 +01:00
Aleksander Grygier
f6235a41ef webui: Agentic Loop + MCP Client with support for Tools, Resources and Prompts (#18655) 2026-03-06 10:00:39 +01:00
Aleksander Grygier
5e335ba113 webui: Improvements for Models Selector UI (#20066) 2026-03-05 08:52:22 +01:00
Marcel Petrick
92f7da00b4 chore : correct typos [no ci] (#20041)
* fix(docs): correct typos found during code review

Non-functional changes only:
- Fixed minor spelling mistakes in comments
- Corrected typos in user-facing strings
- No variables, logic, or functional code was modified.

Signed-off-by: Marcel Petrick <mail@marcelpetrick.it>

* Update docs/backend/CANN.md

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* Revert "Auxiliary commit to revert individual files from 846d1c301281178efbc6ce6060ad34c1ebe45af8"

This reverts commit 02fcf0c7db661d5ff3eff96b2b2db9fdb7213256.

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Signed-off-by: Marcel Petrick <mail@marcelpetrick.it>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-05 08:50:21 +01:00
SamareshSingh
cb8f4fa3f8 Fix locale-dependent float printing in GGUF metadata (#17331)
* Set C locale for consistent float formatting across all binaries.

* Add C locale setting to all tools binaries

Add std::setlocale(LC_NUMERIC, "C") to all 16 binaries in the tools/
directory to ensure consistent floating-point formatting.

* Apply suggestion from @JohannesGaessler

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-03-04 09:30:40 +01:00
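The effect of the `std::setlocale(LC_NUMERIC, "C")` call from cb8f4fa3f8 can be shown with a small helper. This is a sketch only: `format_float_c_locale` is a hypothetical name, and the locale is set inside the helper to keep the example self-contained, whereas the commit sets it once per binary.

```cpp
#include <cassert>
#include <clocale>
#include <cstdio>
#include <string>

// Formats a float with "%f" after forcing the numeric locale to "C",
// so the decimal separator is always '.' regardless of the user's environment.
inline std::string format_float_c_locale(float value) {
    std::setlocale(LC_NUMERIC, "C");
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%f", static_cast<double>(value));
    return std::string(buf);
}
```

Without the call, a process running under a locale such as `de_DE` could print `1,500000` instead of `1.500000`.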
Roj234
3e6ab244ad server: Add pragma once to server-context.h (#19944) 2026-02-27 18:28:36 +01:00
Sami Kama
5596a35791 server: Mirroring /v1/responses to /responses to match /v1/chat/completions pattern (#19873) 2026-02-28 00:44:42 +08:00
Pascal
2e7e638523 server : support multiple model aliases via comma-separated --alias (#19926)
* server : support multiple model aliases via comma-separated --alias

* server : update --alias description and regenerate docs

* server : multiple model aliases and tags

- address review feedback from ngxson
- --alias accepts comma-separated values (std::set, no duplicates)
- --tags for informational metadata (not used for routing)
- aliases resolve transparently in router via get_meta/has_model
- /v1/models exposes aliases and tags fields

* regenerate docs

* nits

* server : use first alias as model_name for backward compat

address review feedback from ngxson

* server : add single-model test for aliases and tags
2026-02-27 07:05:23 +01:00
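The comma-separated `--alias` handling from 2e7e638523 (values collected in a `std::set`, no duplicates) might look roughly like this. It is a minimal sketch: `parse_aliases` is a hypothetical helper, and skipping empty entries without trimming whitespace is an assumption rather than documented behaviour.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// Splits a comma-separated alias string into a duplicate-free set.
// Assumption: empty entries are skipped and whitespace is left untouched.
inline std::set<std::string> parse_aliases(const std::string & value) {
    std::set<std::string> aliases;
    std::stringstream ss(value);
    std::string item;
    while (std::getline(ss, item, ',')) {
        if (!item.empty()) {
            aliases.insert(item);
        }
    }
    return aliases;
}
```

Note that a `std::set` does not preserve insertion order, so a "first alias" used for backward compatibility would have to be recorded separately.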
Georgi Gerganov
01cd448b8c server : fix ctx checkpoint restore logic (#19924) 2026-02-26 18:20:16 +02:00
drrros
efba35a860 server: fix load-on-startup not respected in ini file (#19897)
Co-authored-by: Roman Marchenko <r.marchenko@ideco.ru>
2026-02-26 12:32:31 +01:00
yggdrasil75
bd72300591 server : fix typo in server README.md (#19900)
fix typo
2026-02-26 11:26:16 +01:00
Georgi Gerganov
f20469d919 server : enable multi-modal prompt caching (#19877) 2026-02-25 15:15:42 +02:00
Georgi Gerganov
d7d826b3c1 server : support multi-modal context checkpoints (#19849)
* Modify llama-memory-hybrid-iswa.cpp

* Modify llama-memory-recurrent.cpp

* Modify server-common.cpp

* Modify server-common.h

* Modify server-context.cpp

* Modify server-task.h

* Added comment to llama-memory-hybrid-iswa.cpp

* Remove comment from server-context.cpp

* Stylistic fix server-context.cpp

* Fix an issue when seqrm isn't called in server-context.cpp

* cont : alternative impl

* cont : cleanup

* cont : n_tokens -> int64_t

---------

Co-authored-by: timkhronos <timkhronos@gmail.com>
2026-02-25 15:14:27 +02:00
Pascal
47eb12b953 server: fix query params lost when proxying requests in multi-model router mode (#19854)
* server: fix query params lost when proxying requests in multi-model router mode

* server: re-encode query params using httplib::encode_query_component in proxy
2026-02-24 21:46:06 +01:00
Radoslav Gerganov
c830f99cfa server : support max_completion_tokens request property (#19831)
"max_tokens" is deprecated in favor of "max_completion_tokens", which
sets the upper bound for reasoning+output tokens.

Closes: #13700
2026-02-24 10:30:00 +02:00
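One way to picture the request handling from c830f99cfa is a lookup that prefers the new property. This is a sketch under assumptions: `resolve_max_tokens` is a hypothetical helper, the request is modeled as a plain map instead of JSON, and both the precedence order and the `-1` "no limit" value are assumptions.

```cpp
#include <cassert>
#include <map>
#include <string>

// Picks the generation limit from request fields.
// Assumption: "max_completion_tokens" wins over the deprecated "max_tokens";
// -1 means that neither field was provided.
inline int resolve_max_tokens(const std::map<std::string, int> & request) {
    auto it = request.find("max_completion_tokens");
    if (it != request.end()) {
        return it->second;
    }
    it = request.find("max_tokens");
    if (it != request.end()) {
        return it->second;
    }
    return -1;
}
```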
Aleksander Grygier
5eb0ea32f0 feat: Add code blocks full height setting to parameter sync service (#19835) 2026-02-23 22:30:13 +01:00
Aleksander Grygier
9051663d5d webui: Add setting to have full height Code Blocks in Chat Messages (#19829) 2026-02-23 14:16:50 +01:00
Sigbjørn Skjæret
e8e261699a cli : provide model with text filename (#19783) 2026-02-22 22:33:49 +01:00
Kilian Krampf
cacc371f99 Fix wrong cli-argument in documentation (#19804) 2026-02-22 16:26:33 +01:00
Aldehir Rojas
34ec1c3f18 server : merge contiguous Responses input items into a single assistant message (#19773)
* server : merge contiguous input items into a single assistant message

* cont : simplify tool call msg

* cont : reduce and combine content

* cont : fix merging content items
2026-02-22 14:11:31 +01:00
crsawyer
07968d53e4 fix: UI single model selection in router mode (#19767) 2026-02-21 09:28:39 +01:00
crsawyer
10b26ee23a WebUI hide models in router mode (#19374) 2026-02-19 22:53:42 +01:00
Tarek Dakhran
c5897995a7 mtmd : chat : Fix extra \n between text and media marker (#19595)
* mtmd : chat : Fix extra \n between text and media marker

Thanks to @tugot17 for detecting and reporting the issue.

For vision models (e.g. LFM2.5-VL-1.6B and Qwen/Qwen3-VL-4B-Instruct), `llama-mtmd-cli` produces output identical to the HF implementation.

However, `llama-server` doesn't. I traced it down to an extra newline
inserted after `<__media__>`.

This happens in `to_json_oaicompat`, which treats media markers as text
and joins all parts with a `\n` separator.

This PR introduces a new type, `media_marker`, and uses it for media markers.
Extra logic is added to prevent insertion of newlines before and after
media markers.

With this change, the number of input tokens is identical to the HF
implementation and, as a result, the output is also identical.

I explored other ways to address the issue
* remove completely `\n` between text parts in `to_json_oaicompat`
* merge text messages in server-common.cpp before sending them to `to_json_oaicompat`

Please propose alternative ways of fixing this issue.

* Refactor to use explicit per-type ifs

* Update common/chat.cpp

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>

* Update common_chat_templates_apply_legacy

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
2026-02-19 12:18:57 +01:00
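The joining rule described in c5897995a7 can be sketched as follows. This is not the code from `common/chat.cpp`: `part_type`, `content_part` and `join_parts` are hypothetical names, and the sketch only models "join with `\n`, but never next to a media marker".

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class part_type { text, media_marker };

struct content_part {
    part_type   type;
    std::string text;
};

// Joins parts with "\n", except that no newline is inserted
// directly before or after a media marker.
inline std::string join_parts(const std::vector<content_part> & parts) {
    std::string out;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (i > 0) {
            const bool prev_media = parts[i - 1].type == part_type::media_marker;
            const bool curr_media = parts[i].type     == part_type::media_marker;
            if (!prev_media && !curr_media) {
                out += "\n";
            }
        }
        out += parts[i].text;
    }
    return out;
}
```

Treating the marker as its own type keeps the newline separator between ordinary text parts while leaving the token sequence around `<__media__>` unchanged.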
Aleksander Grygier
03fd9d3bb4 webui: Fix Attachments not being included in completion request (#19731)
* fix: Add missing argument

* chore: update webui build output
2026-02-19 10:27:38 +01:00
matteo
b55dcdef5d server: save generated text for the /slots endpoint (for LLAMA_SERVER_SLOTS_DEBUG=1) (#19622)
* save generated text for the /slots endpoint

* update debug_generated_text only when LLAMA_SERVER_SLOTS_DEBUG > 0

* Apply suggestions from code review

---------

Co-authored-by: Matteo <matteo@matteo>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2026-02-18 18:53:37 +01:00