Commit Graph

94 Commits

Author SHA1 Message Date
Georgi Gerganov
78faa2b79f server, spec : transition to unified spec context 2026-05-07 17:57:59 +03:00
Georgi Gerganov
d719d8aafc server : improve ctx names
[no ci]
2026-05-07 16:48:39 +03:00
Georgi Gerganov
4c957c4749 server : draft prompt cache and checkpoints
[no ci]
2026-05-07 16:48:39 +03:00
Georgi Gerganov
e22a090f12 server : sketch the ctx_dft decode loop
[no ci]
2026-05-07 16:48:39 +03:00
Georgi Gerganov
ae1f10b110 cont : dedup ctx_seq_rm_type
[no ci]
2026-05-07 16:48:39 +03:00
Georgi Gerganov
0791b0d95b cont : pass seq_id
[no ci]
2026-05-07 16:48:38 +03:00
Georgi Gerganov
4eec5542ce spec : update common_speculative_init()
[no ci]
2026-05-07 16:48:38 +03:00
Georgi Gerganov
2466149c25 spec : drop support for incompatible vocabs
[no ci]
2026-05-07 16:48:38 +03:00
Georgi Gerganov
34f1515783 spec : refactor
[no ci]
2026-05-07 16:48:38 +03:00
Georgi Gerganov
d6e7b033a4 llama : add option to save memory in device buffers (#22679)
* llama : add option to save memory in device buffers

* tests : extend llama-save-load-state
2026-05-05 06:35:07 +03:00
Georgi Gerganov
0754b7b6fe server : avoid checkpoint data host copies (#22558)
* server : avoid checkpoint data host copies

* llama : refactor llama_io_read_i
2026-05-02 18:03:25 +03:00
Georgi Gerganov
80afa33aad spec : fix draft model checkpoints (#22521)
* spec : fix draft model checkpoints

* cont : clean-up

* cont : gate the ngram-mod reset warning behind verbose flag
2026-04-30 08:32:18 +03:00
Georgi Gerganov
683c5acb90 spec : discard last drafted token with low prob (#22506) 2026-04-29 17:00:00 +03:00
Georgi Gerganov
14e733e36f spec : refactor params (#22397)
* spec : refactor params

* cont : fix

* cont : rename "sparam" to "sampling"

* cont : add spec params category

* cont : add info about removed arguments

* cont : skip param length check for spec params

* cont : adapt server tests
2026-04-28 09:07:33 +03:00
Aman Gupta
516e8d7a8a server: use pos_next instead of n_tokens for m-rope (#22439) 2026-04-28 08:41:00 +03:00
Georgi Gerganov
ffdd983fb8 server : fix swa-full logic (#22288) 2026-04-24 10:17:37 +03:00
Yes You Can Have Your Own
793d0a7931 server: rename debug tags to match --cache-idle-slots naming (#22292) 2026-04-24 09:28:44 +03:00
Tarek Dakhran
550d684bd1 server: Enable transcriptions API for LFM2-Audio (#22000) 2026-04-23 10:47:26 +02:00
Georgi Gerganov
bcb5eeb645 speculative-simple : add checkpoint support (#22227)
* speculative-simple : add checkpoint support

* cont : fix build
2026-04-22 15:44:45 +03:00
Ethan Turner
750579ff14 common: Refactoring sampler parameters (#20429) (#22233)
This change refactors the reasoning_budget_message parameter from the
common params into the sampling parameters specifically. It also removes
the reasoning_budget common parameter and standardizes on the existing
reasoning_budget_tokens parameter in the sampling configuration.

Issue: https://github.com/ggml-org/llama.cpp/issues/20429
Original PR: https://github.com/ggml-org/llama.cpp/pull/20297
2026-04-22 10:40:19 +02:00
Piotr Wilkin (ilintar)
134d6e54d4 common/chat, server: refactor, move all conversion functions to common, add tests (#20690)
* Refactor conversion functions
2026-04-22 10:28:45 +02:00
Georgi Gerganov
cf8b0dbda9 server : remove /api endpoints (#22165)
* server : remove /api endpoints

* cont : remove /api/tags
2026-04-20 20:41:19 +03:00
Georgi Gerganov
de71b5f81c server : refactor "use checkpoint" logic (#22114) 2026-04-20 08:42:37 +03:00
Yes You Can Have Your Own
9d49acb2a7 server: rename --clear-idle to --cache-idle-slots (#21741) 2026-04-20 08:30:24 +03:00
Sascha Rogmann
455d8e4be8 server : speculative checkpointing (#19493)
* server : speculative decoding using checkpoints

* server : fix draft check with checkpoints

* server : rename spec vars

* server : log levels

* server : refactored spec logic to speculative.cpp

* server : renamed spec checkpoints option

* server : fix spec checkpoints, logging

* speculative : checkpoints with draft model, logging

* server : n_tokens_cur and create_checkpoint in draft

* server : fix server_speculative_callback (slot.id)

* spec : fix ngram-map/begin idx_last_check

* spec : init ckpt (begin() wasn't called)

* chore: update webui build output

* server : restore sampler in spec checkpoint and clear mem

* cont : avoid --spec-use-checkpoints argument

* cont : remove server_prompt_checkpoint_with_size

* spec : rename (leave_draft_state)

* cont : clean-up

* cont : do not ignore partial drafts even if they are short

* cont : spec callback owned by session

* cont : simplify

* cont : avoid empty speculative session

* cont : simplify

* cont : simplify

* cont : enable mtmd speculative decoding

* cont : keep the spec sampler alive

* cont : simplify

* cont : fix nullptr deref + draft checkpoints

* cont : remove common_speculative_accept_response

* cont : remove callback

* cont : simplify

* cont : minor

* cont : simplify

* cont : fix accepted number

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-19 10:24:06 +03:00
Cetarthoriphros
9e5647affa server: Expose media_tag on /props endpoint. (#22028) 2026-04-19 00:27:17 +02:00
Georgi Gerganov
6990e2f1f7 libs : rename libcommon -> libllama-common (#21936)
* cmake : allow libcommon to be shared

* cmake : rename libcommon to libllama-common

* cont : set -fPIC for httplib

* cont : export all symbols

* cont : fix build_info exports

* libs : add libllama-common-base

* log : add common_log_get_verbosity_thold()
2026-04-17 11:11:46 +03:00
Xuan-Son Nguyen
408225bb1a server: use random media marker (#21962)
* server: use random media marker

* nits

* remove legacy <__image__> token

* revert special char in random
2026-04-15 23:52:22 +02:00
Xuan-Son Nguyen
e489a5ca0e server: support OAI /v1/audio/transcriptions API (#21863)
* server: support OAI /v1/audio/transcriptions API

* address autoreview comments

* correct default response_format value
2026-04-14 11:09:52 +02:00
Yuri Khrustalev
660600081f server: respect the ignore eos flag (#21203) 2026-04-08 17:12:15 +02:00
Aaron Teo
69c28f1547 llama-server: fix model params not propagated (#21509)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2026-04-07 21:39:41 +08:00
Georgi Gerganov
e8f5082697 server : fix restore for checkpoints with pos_min == 0 (#21510) 2026-04-07 15:29:17 +03:00
Dan Hoffman
9c699074c9 server: Fix undefined timing measurement errors in server context (#21201)
Co-authored-by: Dan Hoffman <dhoffman@cyket.net>
2026-04-04 22:11:19 +08:00
Yes You Can Have Your Own
50e0ad08fb server: save and clear idle slots on new task (--clear-idle) (#20993)
* server: clear idle slots KV from VRAM (LLAMA_KV_KEEP_ONLY_ACTIVE)

* server: move idle slot KV clearing to slot release

The save "cost" is now paid by the finishing request.

* server: add --kv-clear-idle flag, enable by default

* server: skip clearing last idle slot, clear on launch

* server: test --no-kv-clear-idle flag

* server: simplify on-release clearing loop

* server: remove on-release KV clearing, keep launch-only

* cont : clean-up

* tests: update log strings after --clear-idle rename

* tests: use debug tags instead of log message matching

* test: fix Windows CI by dropping temp log file unlink

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-03 19:02:27 +02:00
Georgi Gerganov
edfb440a2f server : fix processing of multiple back-to-back mtmd chunks (#21107) 2026-03-28 16:27:36 +02:00
Xuan-Son Nguyen
49bfddeca1 server: allow router to report child instances sleep status (#20849)
* server: allow router to report child instances sleep status

* refactor

* move sleeping to state

* nits
2026-03-22 18:33:52 +01:00
Georgi Gerganov
ab9d4c3678 server : improve mtmd ctx checkpoints (#20726)
* server : improve mtmd ctx checkpoints

* server : fix off-by-one in pos_min_thold
2026-03-20 11:13:12 +02:00
Ryan Goulden
26c9ce1288 server: Add cached_tokens info to oaicompat responses (#19361)
* tests : fix fetch_server_test_models.py

* server: to_json_oaicompat cached_tokens

Adds OpenAI and Anthropic compatible information about the
number of cached prompt tokens used in a response.
2026-03-19 19:09:33 +01:00
Piotr Wilkin (ilintar)
5e54d51b19 common/parser: add proper reasoning tag prefill reading (#20424)
* Implement proper prefill extraction

* Refactor cli parameters, update docs, move reasoning budget sampler part to common/reasoning-budget.cpp

* Update tools/server/server-task.cpp

* refactor: move grammars to variant, remove grammar_external, handle exception internally

* Make code less C++y

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-19 16:58:21 +01:00
Piotr Wilkin (ilintar)
d2ecd2d1cf common/parser: add --skip-chat-parsing to force a pure content parser. (#20289)
* Add `--force-pure-content` to force a pure content parser.

* Update common/arg.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Change parameter name [no ci]

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-17 16:16:43 +01:00
Georgi Gerganov
8cc2d81264 server : fix ctx checkpoint invalidation (#20671) 2026-03-17 15:21:14 +02:00
SoftwareRenderer
d7ba99c485 server: reset counter related to kill-switch on client error (#20513)
* server: reset kill-switch on client error

This avoids triggering a server kill switch.

If the client sends a request that exceeds the configured context size, an appropriate HTTP 400 response is provided and no tokens are generated.

However, since no tokens are generated, update_slots() increments n_empty_consecutive. If the client sends 3 such messages in a row, the server terminates.

* moved counter reset as per recommendation

* cont : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-13 19:58:09 +02:00
Piotr Wilkin (ilintar)
acb7c79069 common/parser: handle reasoning budget (#20297)
* v1

* Finished!

* Handle cli

* Reasoning sampler

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Less explosive terminology :)

* Add utf-8 case and tests

* common : migrate reasoning budget sampler to common

* cont : clean up

* cont : expose state and allow passing as initial state

* cont : remove unused imports

* cont : update state machine doc string

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
2026-03-11 10:26:12 +01:00
Georgi Gerganov
a7b3dee7a5 server : make 2 checkpoints near the end of the prompt (#20288)
* server : make 2 checkpoints near the end of the prompt

* cont : adjust checkpoints
2026-03-10 14:28:23 +02:00
Georgi Gerganov
96cfc4992c server : fix checkpoints n_tokens calculation (#20287) 2026-03-09 16:47:06 +02:00
Georgi Gerganov
344ee2a38a server : warn swa-full is not supported for non-SWA models (#20291) 2026-03-09 16:44:25 +02:00
Georgi Gerganov
d6e1556499 server : fix off-by-1 in server_tokens::size_up_to_pos() (#20279)
* server : fix off-by-1 in server_tokens::size_up_to_pos()

* cont : fix typo [no ci]
2026-03-09 16:43:38 +02:00
Georgi Gerganov
107d599952 server : add kill switch when server is stuck (#20277) 2026-03-09 10:33:12 +02:00
Georgi Gerganov
d417bc43dd server : do not create checkpoints right after mtmd chunks (#20232) 2026-03-08 22:16:46 +02:00
Piotr Wilkin (ilintar)
f5ddcd1696 Checkpoint every n tokens: squash (#20087) 2026-03-06 11:39:26 +01:00