Commit Graph

749 Commits

Author SHA1 Message Date
Georgi Gerganov
6d38db5dfe Merge branch 'master' into HEAD 2025-12-08 17:55:24 +02:00
Piotr Wilkin (ilintar)
e4e9c4329c Make graph_max_nodes vary by ubatch size (#17794)
* Make graph_max_nodes vary by ubatch size for models where chunking might explode the graph

* Update src/llama-context.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Add missing const

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-08 14:32:41 +01:00
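Below is a purely hypothetical sketch of the idea in the commit above: letting the graph node budget grow with the micro-batch size so that models whose chunked operations add nodes per ubatch token do not overflow a fixed-size graph. The function name, constants, and scaling formula are illustrative assumptions, not llama.cpp's actual implementation.

```cpp
#include <cstdint>

// Hypothetical sketch: scale the node budget with the micro-batch size.
// The base and per-token costs are made-up placeholder values.
static uint32_t graph_max_nodes(uint32_t n_ubatch) {
    const uint32_t base      = 8192; // fixed overhead per graph (assumed)
    const uint32_t per_token = 8;    // worst-case extra nodes per ubatch token (assumed)
    return base + per_token * n_ubatch;
}
```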
Xuan-Son Nguyen
4d3726278b model: add llama 4 scaling for mistral-large (deepseek arch) (#17744) 2025-12-07 22:29:54 +01:00
Georgi Gerganov
72e3681073 sampling : fix top-p 2025-12-07 17:11:50 +02:00
Georgi Gerganov
8ef5f900db cont : fixes 2025-12-07 15:45:00 +02:00
Georgi Gerganov
fdac9686f7 Merge branch 'master' into HEAD 2025-12-06 16:55:33 +02:00
Georgi Gerganov
30742a6ff5 sampling : expand support (wip) 2025-12-06 16:51:56 +02:00
Daniel Bevenius
444f00b0ec llama : remove quantization sanity check (#17788)
* llama : remove quantization sanity check

This commit removes the quantization sanity check for attention layers.

The motivation for this is that there are hybrid models that combine
recurrent layers, expert layers, and attention layers. For these models
the current check fails because the expert layers are not taken into
account. After consideration, it was decided that this check is not
strictly necessary and can be removed to allow for more flexible model
architectures.

* llama : remove unused pruned_attention_w and is_clip_model vars
2025-12-06 12:26:20 +01:00
Oliver Simons
7668999518 Merge branch 'master' into gpu-sampling
Let's keep `master`'s cumsum implementation for its likely better AMD
perf and add back the pure-CUB implementation in a follow-up commit
2025-12-05 14:41:08 +01:00
Georgi Gerganov
cf74b1a8ec sampling : fix candidates logic 2025-12-05 14:24:28 +02:00
Pascal
1be97831e4 fix: prevent segfault in tokenizer on highly repetitive input (#17786)
Add nosubs|optimize flags to std::regex constructors to prevent
catastrophic backtracking when processing prompts with repeated
identical characters (e.g., 'A' * 10000).

The nosubs flag disables subgroup capture, significantly reducing
memory usage and backtracking on uniform token sequences
2025-12-05 13:52:23 +02:00
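A minimal standalone sketch of the std::regex flags mentioned in the commit above; the pattern and input are illustrative and not the tokenizer's actual regexes.

```cpp
#include <regex>
#include <string>

int main() {
    // nosubs disables capture-group bookkeeping and optimize favors matching
    // speed, which keeps backtracking bounded on highly repetitive input.
    const std::regex re("([A-Za-z]+)|([0-9]+)",
                        std::regex::nosubs | std::regex::optimize);

    const std::string input(10000, 'A'); // worst-case repetitive prompt
    return std::regex_search(input, re) ? 0 : 1;
}
```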
Georgi Gerganov
7864074fdb sampling : fix outputs and device checks 2025-12-04 19:33:01 +02:00
Georgi Gerganov
6958d41366 sampling : check backend support during init 2025-12-04 17:29:08 +02:00
Georgi Gerganov
1bde70785d sampling : remove redundant calls to ggml_build_forward_expand 2025-12-04 14:25:28 +02:00
Georgi Gerganov
fce571ee51 sampling : simplify temp sampling 2025-12-04 14:23:02 +02:00
Daniel Bevenius
ac9e164714 sampling : fix backend temp sampling to use logits masking 2025-12-04 09:39:20 +01:00
Georgi Gerganov
a67ef0f47f llama : fix sanity checks during quantization (#17721) 2025-12-04 10:33:42 +02:00
Daniel Bevenius
10bd640aae Revert "sampling : stop short if backend sampler sampled a token"
This reverts commit 87b2719eca.
2025-12-04 08:26:33 +01:00
Daniel Bevenius
c0b182f4d6 Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-12-04 08:17:50 +01:00
Daniel Bevenius
87b2719eca sampling : stop short if backend sampler sampled a token
This commit modifies the graph building logic to immediately continue
when a token has already been sampled by the backend sampler.

It also updates the test for backend temperature sampling to include
top-k and distribution samplers in the chain to verify that they do not
produce any logits (they are not run).
2025-12-04 08:13:49 +01:00
Georgi Gerganov
cce3b2a8ad sampling : minor cleanup 2025-12-03 15:39:44 +02:00
Herman Semenoff
37adc9c6ba ggml, llama : use defaulted constructors/destructors (#17649) 2025-12-03 07:12:18 +01:00
Daniel Bevenius
aad5a6afd7 sampling : implement temp_ext_backend sampling
This commit implements the apply function for the extended temperature
sampling.
2025-12-02 17:26:04 +01:00
Adrien Gallouët
f3a9674ae8 llama : fix signed comparison warning on FreeBSD (#17497)
This ensures correct RLIM_INFINITY handling and compatibility on all platforms (32/64-bit).

    warning: comparison of integers of different signs: 'rlim_t' (aka 'long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
      488 |         if (suggest && (lock_limit.rlim_max > lock_limit.rlim_cur + size)) {
          |                         ~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-12-02 12:05:38 +01:00
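A hedged sketch of the class of fix behind the warning above: rlim_t may be a signed type, so comparing it directly against size_t arithmetic triggers -Wsign-compare. The function name and structure below are illustrative, not the actual llama-mmap code.

```cpp
#include <sys/resource.h>
#include <cstddef>

// Handle RLIM_INFINITY explicitly, then compare in a common unsigned type.
static bool mlock_would_fit(const struct rlimit & lock_limit, size_t size) {
    if (lock_limit.rlim_max == RLIM_INFINITY) {
        return true;
    }
    return (unsigned long long) lock_limit.rlim_max >
           (unsigned long long) lock_limit.rlim_cur + size;
}
```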
Daniel Bevenius
db8972e251 squash! sampling : fix backend temp sampler for zero temperature
This modifies the parent commit to simply return the most probable token
instead of masking the logits.
2025-12-02 11:53:29 +01:00
Daniel Bevenius
3e9a258c14 Merge remote-tracking branch 'upstream/master' into gpu-sampling 2025-12-02 09:26:04 +01:00
Daniel Bevenius
739b597804 sampling : fix backend temp sampler for zero temperature
This commit fixes the implementation of the temperature-based sampler
for the case when the temperature is set to zero. This now correctly
selects the most probable token by masking out all other tokens in the
logits.
2025-12-02 09:13:07 +01:00
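A minimal CPU-side reference for the behavior described in the commit above: at temperature zero the sampler becomes greedy, which can be expressed by masking every logit except the maximum to -INFINITY so a downstream softmax/dist sampler can only pick the most probable token. This is an illustrative sketch, not the ggml graph implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Mask all logits except the argmax so only the most probable token survives.
static void mask_all_but_argmax(std::vector<float> & logits) {
    const size_t best = std::distance(
        logits.begin(), std::max_element(logits.begin(), logits.end()));
    for (size_t i = 0; i < logits.size(); ++i) {
        if (i != best) {
            logits[i] = -INFINITY;
        }
    }
}
```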
Piotr Wilkin (ilintar)
746f9ee889 Override SSM_A op for Qwen3 Next to reduce splits (#17587)
* Override SSM_A op for Qwen3 Next to reduce splits

* New tensor mapping SSM_A_NOSCAN for SSM_A used outside of OP_SSM_SCAN context.

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-02 00:43:13 +01:00
Gilad S.
00c361fe53 fix: llama arch implementation (#17665) 2025-12-01 21:21:13 +01:00
Georgi Gerganov
88cca45bb8 sampling : fix top_p empty condition 2025-12-01 18:02:34 +02:00
Georgi Gerganov
04f2822a86 sampling : do not create empty samplers 2025-12-01 17:52:07 +02:00
Georgi Gerganov
4032ce2378 common : simplify sampler chain initialization 2025-12-01 17:11:11 +02:00
Oliver Simons
217469f07f Make backend's top_p sampler inclusive
In addition to matching the algorithm proposed in the original
[paper](https://arxiv.org/abs/1904.09751), this resolves the edge case
where `max_p > top_p` for a single logit, in which case the mask would
otherwise be empty (and we would thus sample from the whole vocabulary
with equal likelihood)
2025-12-01 15:28:06 +01:00
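An illustrative CPU sketch of the inclusive cutoff described above: the token whose cumulative probability first reaches top_p is kept, so the candidate set is never empty even when a single token's probability already exceeds top_p. It assumes `probs` is sorted in descending order and sums to 1, and it mirrors the idea rather than the backend graph code.

```cpp
#include <cstddef>
#include <vector>

// Return how many of the leading (sorted, descending) probabilities to keep.
static size_t top_p_keep_inclusive(const std::vector<float> & probs, float top_p) {
    float cum = 0.0f;
    for (size_t i = 0; i < probs.size(); ++i) {
        cum += probs[i];
        if (cum >= top_p) {
            return i + 1; // inclusive: keep the token that crosses the threshold
        }
    }
    return probs.size();
}
```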
Oliver Simons
ae0bb6a6da Factor out ggml_sort into its own function 2025-12-01 15:28:06 +01:00
Georgi Gerganov
16451d6bc3 Merge branch 'master' into HEAD 2025-12-01 14:47:50 +02:00
Xuan-Son Nguyen
cd3c118908 model: support Ministral3 (#17644)
* conversion script

* support ministral 3

* maybe this is better?

* add TODO for rope_yarn_log_mul

* better ppl (tested on 14B-Instruct)

* Add Ministral3 support to Mistral format

* improve arch handling

* add sizes

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* nits

---------

Co-authored-by: Julien Denize <julien.denize@mistral.ai>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-01 12:26:52 +01:00
Oliver Simons
8bee483c97 Fix backend_top_p_sampler
softmax(softmax) flattens the distribution toward uniform, so we should
not return the softmax output but the logits instead.
2025-12-01 12:07:30 +01:00
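A small numeric illustration of the issue described in the commit above: applying softmax to an already-normalized distribution compresses it toward uniform, so the sampler has to keep operating on logits rather than on re-softmaxed probabilities. The values are arbitrary.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

static std::vector<float> softmax(std::vector<float> x) {
    const float maxv = *std::max_element(x.begin(), x.end());
    float sum = 0.0f;
    for (float & v : x) { v = std::exp(v - maxv); sum += v; }
    for (float & v : x) { v /= sum; }
    return x;
}

int main() {
    const std::vector<float> logits = {5.0f, 1.0f, 0.0f};
    const auto p1 = softmax(logits); // strongly peaked: ~0.98, 0.02, 0.01
    const auto p2 = softmax(p1);     // much flatter:    ~0.57, 0.22, 0.21
    std::printf("p1[0] = %.3f, p2[0] = %.3f\n", p1[0], p2[0]);
    return 0;
}
```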
Aman Gupta
6eea666912 llama-graph: avoid expand_forward for fusion (#17633) 2025-12-01 11:12:48 +02:00
Daniel Bevenius
cf0e1475c5 sampling : lower log level for output buffer reallocations [no ci]
This commit changes the logging level for output buffer reallocations
in the llama_context::output_reserve function from INFO to DEBUG.

The motivation for this is that the message is currently logged at info
level, and when verbose logging is enabled for llama-cli it gets mixed in
with the generated output, for example:

```console
What is the capital of Sweden?output_reserve: reallocating output buffer from size 0.58 MiB to 1.74 MiB
 1. Stockholm
2\. Helsinki
Based are the options
1. Stockholm
Explanation: Stockholm is the capital of
...
```
2025-12-01 09:13:47 +01:00
Georgi Gerganov
80742cbaeb cont : naming 2025-11-30 11:24:30 +02:00
Georgi Gerganov
c187003d81 llama : naming 2025-11-30 00:05:47 +02:00
Georgi Gerganov
1760bd69b3 llama : reserve graphs with samplers 2025-11-29 23:57:25 +02:00
Georgi Gerganov
ff7b0bf632 llama : call backend_init once 2025-11-29 23:09:53 +02:00
Georgi Gerganov
d8d98bb4bb Merge branch 'master' into HEAD 2025-11-29 22:38:44 +02:00
Georgi Gerganov
9028ebfea8 llama : cleanup + naming 2025-11-29 22:37:07 +02:00
Georgi Gerganov
fbc8f49f3c llama : simplify 2025-11-29 17:01:00 +02:00
Diego Devesa
e072b2052e ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched (#17276)
* ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched
Enabled in ggml-ci for testing.

* llama : update worst-case graph for unified cache

* ci : disable op offload in some tests

* fix spelling

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-28 17:33:23 +02:00
Georgi Gerganov
2464d1b3fc sampling : simplify 2025-11-28 17:21:12 +02:00
Daniel Bevenius
8cac9dee45 sampling : use logits directly for min-p filtering 2025-11-28 16:12:05 +01:00
Oliver Simons
333da805fe Add initial version for top-p sampling
As we only support static graphs for the time being and we don't know the
size of the output of top-p, we have to do value-scaling the same as for
the min-p operator.

Further improvements can be applied to the unit test (e.g. checking that
top_p on the backend matches top_p on the CPU) and also by constructing
candidates and sorting those, as opposed to reversing the sort of the
logits (this would be arange + get_rows instead of argsort + get_rows)
2025-11-28 15:16:20 +01:00
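An illustrative CPU sketch of the static-graph constraint described in the commit above: because the output size of top-p is not known ahead of time, excluded tokens are masked rather than removed, so the logits keep their full vocabulary size. The index sort loosely mirrors the argsort + get_rows pattern mentioned in the message; this is not the ggml implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Mask every logit outside the top-p nucleus to -INFINITY in place.
static void top_p_mask_static(std::vector<float> & logits, float top_p) {
    const size_t n = logits.size();
    if (n == 0) {
        return;
    }

    std::vector<size_t> idx(n);
    std::iota(idx.begin(), idx.end(), (size_t) 0);
    std::sort(idx.begin(), idx.end(),
              [&](size_t a, size_t b) { return logits[a] > logits[b]; });

    // softmax over the sorted logits
    std::vector<float> p(n);
    const float maxl = logits[idx[0]];
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) { p[i] = std::exp(logits[idx[i]] - maxl); sum += p[i]; }
    for (float & v : p) { v /= sum; }

    // inclusive cutoff, then mask everything past it
    float cum = 0.0f;
    size_t keep = n;
    for (size_t i = 0; i < n; ++i) {
        cum += p[i];
        if (cum >= top_p) { keep = i + 1; break; }
    }
    for (size_t i = keep; i < n; ++i) {
        logits[idx[i]] = -INFINITY; // mask instead of shrinking the tensor
    }
}
```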