Commit Graph

142 Commits

Author SHA1 Message Date
Daniel Bevenius
82c2600585 Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-12-28 07:34:17 +01:00
Johannes Gäßler
026d2ad472 llama: fix magic number of 999 for GPU layers (#18266)
* llama: fix magic number of 999 for GPU layers

* use strings for -ngl, -ngld

* encapsulate n_gpu_layers, split_mode
2025-12-27 20:18:35 +01:00
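The commit above replaces a magic sentinel (999 meaning "offload everything") with string parsing for `-ngl`/`-ngld`. A minimal sketch of that idea, with hypothetical names (`parse_n_gpu_layers` is not the actual llama.cpp function), assuming an "auto"/unset value is represented without a sentinel:

```cpp
#include <optional>
#include <string>

// Hypothetical sketch: parse the -ngl argument into an optional so that
// "auto" (or an empty value) is representable without a magic number
// like 999 standing in for "all layers".
static std::optional<int> parse_n_gpu_layers(const std::string & arg) {
    if (arg == "auto" || arg.empty()) {
        return std::nullopt; // let the runtime decide how many layers fit
    }
    return std::stoi(arg); // an explicit layer count
}
```

An `std::optional` makes the "user did not choose" state a first-class value instead of a number that downstream code must know to treat specially.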
Georgi Gerganov
0ce03597e8 Merge branch 'master' into HEAD 2025-12-24 10:33:21 +02:00
Johannes Gäßler
147a521636 tool/ex/tests: consistently free ctx, then model (#18168) 2025-12-22 11:00:37 +01:00
Georgi Gerganov
3b3f5fed31 common : disable backend sampling when grammar is involved 2025-12-18 10:52:21 +02:00
Daniel Bevenius
ad1b60abc4 Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-12-16 09:45:08 +01:00
Johannes Gäßler
b1f3a6e5db llama: automatically set parameters not set by the user in such a way that maximizes GPU utilization (#16653)
* llama: automatically fit args to free memory

llama-fit-params tool

* fix CI

* hints for bug reports, ensure no reallocation

* fix segfault with Vulkan

* add llama-fit-params to CI

* fix CI

* fix CI

* fix CI

* minor adjustments

* fix assignment of 1 dense layer

* fix logger not being reset on model load failure

* remove --n-gpu-layer hint on model load failure

* fix llama-fit-params verbosity

* fix edge case

* fix typo [no ci]
2025-12-15 09:24:59 +01:00
Georgi Gerganov
22c7f85b9c Merge branch 'master' into HEAD 2025-12-14 10:19:58 +02:00
Georgi Gerganov
609a2d0268 models : fix YaRN regression + consolidate logic (#18006)
* models : fix YaRN regression + consolidate logic

* cont : fix the fix

* cont : remove header

* cont : add header
2025-12-14 08:34:56 +02:00
Jeff Bolz
5266379bca llama_context: synchronize before reallocating output buffer (#17974) 2025-12-13 09:19:51 -06:00
Georgi Gerganov
4d10b78e23 Merge branch 'master' into HEAD 2025-12-11 14:42:56 +02:00
Georgi Gerganov
4dff236a52 ggml : remove GGML_KQ_MASK_PAD constant (#17910)
* ggml : remove GGML_KQ_MASK_PAD constant

* cont : remove comment
2025-12-10 20:53:16 +02:00
Georgi Gerganov
804e7e3795 graph : respect sampler order for graph reuse 2025-12-10 20:40:15 +02:00
Georgi Gerganov
c02654eb7d graph : make the compute graph constant with respect to active samplers 2025-12-10 16:19:18 +02:00
Georgi Gerganov
34b407b41c sampling : use host buffer type for inputs 2025-12-09 17:53:17 +02:00
Georgi Gerganov
92ff767918 llama : require backend samplers to be of type llama_sampler_chain 2025-12-09 15:38:37 +02:00
Georgi Gerganov
6d38db5dfe Merge branch 'master' into HEAD 2025-12-08 17:55:24 +02:00
Piotr Wilkin (ilintar)
e4e9c4329c Make graph_max_nodes vary by ubatch size (#17794)
* Make graph_max_nodes vary by ubatch size for models where chunking might explode the graph

* Update src/llama-context.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Add missing const

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-08 14:32:41 +01:00
Georgi Gerganov
30742a6ff5 sampling : expand support (wip) 2025-12-06 16:51:56 +02:00
Georgi Gerganov
6958d41366 sampling : check backend support during init 2025-12-04 17:29:08 +02:00
Daniel Bevenius
cf0e1475c5 sampling : lower log level for output buffer reallocations [no ci]
This commit changes the logging level for output buffer reallocations
in the llama_context::output_reserve function from INFO to DEBUG.

The motivation is that the message is currently logged at INFO level, so
when verbose logging is enabled for llama-cli it gets mixed in with the
generated output, for example:

```console
What is the capital of Sweden?output_reserve: reallocating output buffer from size 0.58 MiB to 1.74 MiB
 1. Stockholm
2\. Helsinki
Based are the options
1. Stockholm
Explanation: Stockholm is the capital of
...
```
2025-12-01 09:13:47 +01:00
Georgi Gerganov
80742cbaeb cont : naming 2025-11-30 11:24:30 +02:00
Georgi Gerganov
c187003d81 llama : naming 2025-11-30 00:05:47 +02:00
Georgi Gerganov
1760bd69b3 llama : reserve graphs with samplers 2025-11-29 23:57:25 +02:00
Georgi Gerganov
ff7b0bf632 llama : call backend_init once 2025-11-29 23:09:53 +02:00
Georgi Gerganov
d8d98bb4bb Merge branch 'master' into HEAD 2025-11-29 22:38:44 +02:00
Diego Devesa
e072b2052e ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched (#17276)
* ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched
Enabled in ggml-ci for testing.

* llama : update worst-case graph for unified cache

* ci : disable op offload in some tests

* fix spelling

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-28 17:33:23 +02:00
Georgi Gerganov
117e2079a9 refactor : simplify and improve memory management 2025-11-28 16:09:42 +02:00
Daniel Bevenius
459b7ae7b9 squash! sampling : support intermixed backend/cpu samplers
Fix llama-save-load-state, which currently fails, by handling the case
where batch.logits is nullptr (such as when loading state): allocate
space for all outputs as CPU logits.
2025-11-28 13:50:47 +01:00
Piotr Wilkin (ilintar)
ff55414c42 model : Qwen3 Next (#16095)
* Qwen3 Next - cleaned up version

* Whitespaces and stuff

* Correct minor errors

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Misc. fixes.

* Clean up code, add missing hybrid qualifier

* Did someone transpose the SOLVE_TRI result matrix? Perhaps...

* Whitespace

* Proper tensors for cb calls

* Use llama-graph.h vertical alignment

* BROKEN: chunking

* Set new tensors as inputs.

* Proper chunk logic

* It's the circle of life...

* More shenanigans for n_seq > 1

* Nail in the coffin?

* Fix Windows build

* Eh, one fails on Windows, the other fails on Mac... just use general capture.

* quant : cleanup

* model : cleanup

* qwen3 : cleanup

* cont : cleanup

* cont : cleanup

* ggml : revert change

* qwen3 : cleanup

* cont : cleanup

* Readd cmath

* qwen3 : fix typo

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Usual suspects

* fix my bad suggestion

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-28 12:02:56 +01:00
Daniel Bevenius
9ad6522be6 squash! sampling : support intermixed backend/cpu samplers
Add a check that logits is not null, which can happen for embeddings.
2025-11-28 08:57:48 +01:00
Daniel Bevenius
74be332e24 sampling : support intermixed backend/cpu samplers
This commit updates the backend sampling implementation to support
intermixed usage of backend and CPU samplers within the same batch.

The initial implementation was developed as an all-or-nothing solution:
either perform backend sampling for the entire batch, or perform CPU
sampling for the entire batch.

The motivation for this change is to support batches with mixed
sequences. For example, we may have a backend sampler configured for
sequence 0, while sequence 1 in the same batch uses CPU sampling. This
was not supported in the initial implementation.

This issue manifested in llama-server with the webui: decoding with
backend samplers would work initially, but after changing to CPU
sampling, a slot (sequence) could still be using a backend sampler.
This meant that logits in output_reserve would not be allocated,
resulting in an error.

The solution in this commit inspects the batch to determine which
sampling modes are needed and allocates buffers accordingly. However,
there is a known inefficiency: when we have intermixed backend/CPU
samplers in the same batch, we currently copy all logits to the host,
even for sequences using backend samplers.

Added test_backend_cpu_mixed_batch to verify correct behavior with
mixed backend/CPU samplers in a single batch, including dynamic
sampler switching between decode calls.
2025-11-28 08:38:05 +01:00
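The commit above describes inspecting the batch to decide which sampling modes are needed before allocating buffers. A minimal sketch of that inspection, with hypothetical names (`sampling_needs`, `inspect_batch` are illustrative, not the actual llama.cpp API), assuming each output token carries a sequence id and that some sequences have a backend sampler attached:

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Which logit buffers must be allocated for this batch?
struct sampling_needs {
    bool backend = false; // at least one output uses a backend sampler
    bool cpu     = false; // at least one output needs host-side logits
};

// Hypothetical sketch: walk the outputs of the batch and mark which
// sampling modes appear, so allocation matches actual usage instead of
// assuming an all-or-nothing batch.
static sampling_needs inspect_batch(
        const std::vector<int32_t> & output_seq_ids,
        const std::set<int32_t>    & backend_sampled_seqs) {
    sampling_needs needs;
    for (int32_t seq : output_seq_ids) {
        if (backend_sampled_seqs.count(seq)) {
            needs.backend = true;
        } else {
            needs.cpu = true;
        }
    }
    return needs;
}
```

With this shape, a batch mixing sequence 0 (backend sampler) and sequence 1 (CPU sampler) sets both flags, matching the mixed-slot scenario the commit describes from llama-server.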
Daniel Bevenius
172208afbf sampling : add comments about backend sampler [no ci]
This commit adds a comment to llama_context's constructor explaining why
backend samplers are initialized early in the process.
2025-11-27 14:59:52 +01:00
Daniel Bevenius
2b4c7927ee Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-11-25 06:10:33 +01:00
Daniel Bevenius
134e6940ca llama : skip output reordering for single token batches (#17466)
This commit adds a check to skip the output reordering logic when
n_outputs == 1. With a single output token, the data is trivially
sorted and the reordering code is currently doing unnecessary work
(resetting and rebuilding output_ids to the same values).

The motivation for this change is improved code clarity and avoiding
confusion when debugging. While the performance impact is probably
negligible, this unnecessary work happens on every decode call in
llama-server when processing batches with single-token outputs.
2025-11-24 21:06:17 +01:00
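The early-out described above can be sketched as follows. This is a hedged illustration, not the actual reordering code: `maybe_reorder_outputs` is hypothetical, and a plain sort stands in for the real output_ids rebuild:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch: with zero or one output, the data is trivially
// sorted, so the reset-and-rebuild work is skipped entirely.
// Returns true only if a reorder was actually performed.
static bool maybe_reorder_outputs(std::vector<int32_t> & output_ids) {
    if (output_ids.size() <= 1) {
        return false; // trivially sorted; nothing to do
    }
    std::sort(output_ids.begin(), output_ids.end()); // stand-in for the real rebuild
    return true;
}
```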
Daniel Bevenius
a02adf4211 sampling : add assertions for contiguous tensors in async copy functions 2025-11-24 21:01:06 +01:00
Daniel Bevenius
8eb9b4769d sampling : remove redundant checks for stride and size [no ci] 2025-11-24 13:53:29 +01:00
Daniel Bevenius
4a90583d7d sampling : cleanup and clarify output_reserve 2025-11-24 13:26:18 +01:00
Daniel Bevenius
9e273f7aa4 sampling : fix copying both sampled tokens and logits/probs from backend
This commit fixes the issue where both sampled tokens and logits/probs
were not being copied correctly from the backend to the host when
multiple backend samplers were used.

A test for this scenario has also been added to ensure that both types
of data are copied correctly when different backend samplers are
employed.
2025-11-23 13:12:01 +01:00
Daniel Bevenius
ae23d2d2c1 sampling: clarify candidate ids usage in comments 2025-11-23 11:28:19 +01:00
Daniel Bevenius
65500d05ab sampling : add stride variable for clarity 2025-11-23 11:27:54 +01:00
Daniel Bevenius
61ffe41dc1 sampling : use pinned memory for backend sampling buffers 2025-11-21 14:02:16 +01:00
Daniel Bevenius
0d28b16bdc sampling : introduce sampling_info struct
This commit introduces a sampling_info struct to encapsulate all
backend sampling related data within the llama_context class.

It also updates to use more descriptive names for sampled tokens and
candidates in the backend sampler ggml data structure.
2025-11-20 14:45:56 +01:00
Daniel Bevenius
18ed4d8f96 squash! sampling : simplify backend sampling logic decode
The commit fixes a variable shadowing issue in the
`llama_context::decode` function which was introduced in a previous
refactoring.
2025-11-19 15:10:15 +01:00
Daniel Bevenius
d74eb61aa7 squash! sampling : simplify backend sampling logic decode
Fix condition to check if backend actually sampled tokens, not just that
backend samplers are available.
2025-11-19 11:29:26 +01:00
Daniel Bevenius
7e98ebcc6b sampling : simplify backend sampling logic decode
This commit tries to simplify the backend sampling logic in
llama_context::decode.
2025-11-19 09:31:33 +01:00
Daniel Bevenius
311c1a347f sampling : ensure at most one output token per seq
This commit adds a check in the batch allocator to ensure that when
backend sampling is enabled, at most one output token is specified per
sequence.
2025-11-18 16:06:23 +01:00
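The batch-allocator check described above can be sketched like this (hypothetical helper name; assumes outputs are identified by their sequence ids):

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical sketch: with backend sampling enabled, reject any batch
// that marks more than one output token for the same sequence.
static bool validate_one_output_per_seq(const std::vector<int32_t> & output_seq_ids) {
    std::map<int32_t, int> counts;
    for (int32_t seq : output_seq_ids) {
        if (++counts[seq] > 1) {
            return false; // second output for this sequence: invalid batch
        }
    }
    return true;
}
```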
Daniel Bevenius
82957a90f2 sampling : always expose sampled_ids
This commit precomputes and caches the full-vocab token id list in
llama_context's constructor, so llama_get_backend_sampled_token_ids_ith
always returns a valid pointer.

The motivation for this is that it enables both common/sampling.cpp
and src/llama-sampling.cpp to simplify their logic.

Not all backend samplers that process logits need to set
sampled_tokens_id, as they may not change the order of the logits. For
example, the temperature sampler only scales the logits but does not
change their order. Similarly, the logit bias sampler only adds bias to
specific token ids but does not change the order of the logits. In
these cases there will not be a device-to-host copy of the sampled
token ids, and this is the use case where having this precomputed
list is useful.
2025-11-18 15:11:59 +01:00
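The precomputed full-vocab id list described above amounts to an identity mapping 0..n_vocab-1, valid whenever a backend sampler rescales logits without reordering them. A minimal sketch (hypothetical helper name):

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical sketch: build the identity token id list once, so callers
// always get a valid pointer even when no device-to-host copy of sampled
// token ids takes place (e.g. after temperature or logit-bias sampling).
static std::vector<int32_t> make_identity_token_ids(int32_t n_vocab) {
    std::vector<int32_t> ids(n_vocab);
    std::iota(ids.begin(), ids.end(), 0); // 0, 1, 2, ..., n_vocab - 1
    return ids;
}
```

Computing this once in the constructor trades a small, fixed allocation for never having to special-case "the sampler did not reorder anything" at lookup time.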
Georgi Gerganov
4b52e59903 graph : do not include llama-model.h 2025-11-18 13:53:25 +02:00
Daniel Bevenius
7884b0e0ac sampling : add support for backend sampling
This commit adds support for performing sampling operations on the
backend (e.g. GPU) as part of the model computation graph.

The motivation for this feature is to enable sampling to be performed
directly on the backend as part of the computation graph being executed,
allowing for some or all of the sampling to be done on the backend.

For example, the backend sampler chain might select/sample a token
directly in which case only the sampled token needs to be transferred
from device memory to host memory.

It is also possible for the backend samplers to perform filtering of
the logits, or to compute and filter the probability distribution, in
which case only the filtered logits or probabilities need to be
transferred back to system memory for further processing by CPU
samplers.

Currently, backend sampling works in a similar manner to pooling: it is
a function that is called by build_graph, and the sampler operations
become part of the model's computation graph.
2025-11-17 16:15:58 +01:00
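The payoff described in this commit is the reduced transfer: when the sampler chain selects a token inside the graph, only one token id crosses from device to host instead of the full logit vector. A hedged sketch with a greedy argmax standing in for the on-device selection (`greedy_sample` is illustrative, not the llama.cpp sampler API):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: if the backend sampler chain resolves to a single
// token (here emulated by greedy argmax over the logits), the only value
// that needs to leave device memory is this one token id.
static int32_t greedy_sample(const std::vector<float> & logits) {
    int32_t best = 0;
    for (size_t i = 1; i < logits.size(); ++i) {
        if (logits[i] > logits[best]) {
            best = (int32_t) i;
        }
    }
    return best;
}
```

For a vocabulary of ~32k float logits that is a 4-byte copy instead of ~128 KiB per output token, which is the transfer saving backend sampling is after.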