llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-05-15 05:24:06 +00:00

Author	SHA1	Message	Date
Georgi Gerganov	16451d6bc3	Merge branch 'master' into HEAD	2025-12-01 14:47:50 +02:00
Adrien Gallouët	beb1f0c503	common : throttle download progress output to reduce IO flush (#17427 ) This change limits progress updates to approximately every 0.1% of the file size to minimize stdio overhead. Also fixes compiler warnings regarding __func__ in lambdas. Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-11-30 14:22:44 +02:00
Aaron Teo	def5404f26	common: add LLAMA_LOG_FILE env var (#17609 ) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-11-30 12:12:32 +01:00
Georgi Gerganov	80742cbaeb	cont : naming	2025-11-30 11:24:30 +02:00
ddh0	5a6241feb0	common: update env var name (#17588 )	2025-11-30 09:59:25 +08:00
Georgi Gerganov	c187003d81	llama : naming	2025-11-30 00:05:47 +02:00
Georgi Gerganov	d8d98bb4bb	Merge branch 'master' into HEAD	2025-11-29 22:38:44 +02:00
Igor Smirnov	0874693b44	common : fix json schema with '\' in literals (#17307 ) * Fix json schema with '\' in literals * Add "literal string with escapes" test	2025-11-29 17:06:32 +01:00
Georgi Gerganov	fbc8f49f3c	llama : simplify	2025-11-29 17:01:00 +02:00
DAN™	03914c7ef8	common : move all common_chat_parse_* to chat-parser.cpp. (#17481 )	2025-11-28 19:29:36 +01:00
Oliver Simons	333da805fe	Add initial version for top-p sampling As we only support static graphs for the time being and we don't know the size of the output of top-p, we have to do value-scaling, the same as for the min-p operator. Further improvements can be applied to the unit-test (i.e. check for equivalence of top_p happening on backend with top_p happening on cpu) and also by constructing candidates and sorting those as opposed to reversing the sort of the logits (this would be arange + get_rows instead of argsort + get_rows)	2025-11-28 15:16:20 +01:00
Georgi Gerganov	117e2079a9	refactor : simplify and improve memory management	2025-11-28 16:09:42 +02:00
Daniel Bevenius	e9d070980b	sampling : remove backend sampling chain from common_sampler This commit removes the backend sampling chain from the common_sampler structure and related functions. The motivation for this change is that the backend samplers are not currently set on the context, and if they were they would cause a graph reallocation to occur. Instead, the initialization is handled like it currently is by llama_context's constructor.	2025-11-27 15:28:37 +01:00
Daniel Bevenius	51107a0b63	sampling : fix temperature check to allow zero temperature This commit modifies the temperature sampling check to allow a temperature value of zero. Previously, the check only allowed positive temperature values, which excluded the valid case of zero temperature. The motivation for this is to enable a zero temperature setting which is also currently causing the following test to fail: ```console (venv) $ cd tools/server/tests (venv) $ ./tests.sh unit/test_basic.py::test_load_split_model ```	2025-11-27 09:18:43 +01:00
Xuan-Son Nguyen	e509411cf1	server: enable jinja by default, update docs (#17524 ) * server: enable jinja by default, update docs * fix tests	2025-11-27 01:02:50 +01:00
Daniel Bevenius	0f7805f32a	common : add get_active_samplers function to check enabled samplers This commit adds a function to check if a sampler is actually enabled, meaning that it does not have values that disable its effect. This is then used by the backend samplers initialization to avoid considering samplers that are not enabled when determining the split point between them. The motivation for this is that this allows the default sampler chain for `--samplers` to be used and any sampler that is not enabled will not cause the backend samplers to be skipped. For example, before this change if the penalties sampler was included in the samplers list but had default values that disable it, it would cause the backend samplers to be skipped entirely. This commit also contains some refactoring to remove some code duplication.	2025-11-26 15:46:33 +01:00
Daniel Bevenius	b45d504e70	sampling : add min-p backend sampler	2025-11-26 10:50:58 +01:00
Daniel Bevenius	9e5e09d087	sampling : remove backend-dist option (wip) This commit removes the `--backend-dist` option and instead uses the configured --samplers chain to determine which samplers run on the backend. Backend sampling is still enabled using `--backend-sampling`, and the sampler chain, either explicitly specified using `--samplers` or the default, is automatically analyzed to determine which samplers can run on the backend. The system finds the longest contiguous chain of backend supported samplers from the start of the sampler sequence. For example: * If the chain is `top-k -> temperature -> top-p`, and both `top-k` and `temperature` are backend-supported but `top-p` is not, then `top-k` and `temperature` will run on the backend, while `top-p` and subsequent samplers run on the CPU. * If all configured samplers are supported, the final distribution sampling will also happen on the backend, transferring only the sampled token IDs back to the host. * If the sampler chain starts with an unsupported sampler (e.g., `penalties`), all sampling runs on the CPU. Note that this is currently the case with the default sampler chain, so to use backend sampling it is required to specify a sampler chain. See below for an example. The following shows how llama-cli can be run with backend sampling: ```console $ llama-cli -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \ --prompt 'What is the capital of Sweden?' \ -n 20 \ -no-cnv \ --verbose-prompt \ -ngl 40 \ --backend-sampling \ --samplers 'top_k;temperature' ``` In this case all sampling will happen on the backend since both `top_k` and `temperature` are supported backend samplers. To enable partial backend sampling (hybrid sampling), for example running `top_k` and `temperature` on the backend and `top_p` on the CPU, the following sampler chain could be specified: ```console $ llama-cli -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \ --prompt 'What is the capital of Sweden?' \ -n 20 \ -no-cnv \ --verbose-prompt \ -ngl 40 \ --backend-sampling \ --samplers 'top_k;temperature;top_p' ``` If this looks good then I'll follow up with updates to the llama-cli and llama-server documentation to reflect these changes.	2025-11-25 14:01:23 +01:00
Daniel Bevenius	2b4c7927ee	Merge remote-tracking branch 'upstream/master' into backend-sampling	2025-11-25 06:10:33 +01:00
Aaron Teo	877566d512	llama: introduce support for model-embedded sampling parameters (#17120 )	2025-11-25 09:56:07 +08:00
Georgi Gerganov	b26c7069fb	common : initialize backend samplers	2025-11-24 20:25:44 +02:00
Georgi Gerganov	e2d4f0829c	llama-cli : fix dangling reference to sampler config	2025-11-24 19:51:32 +02:00
Daniel Bevenius	d88ba1813c	common : remove build-info.cpp from commit [no ci] This file was generated during the build process and should not be included in previous commits.	2025-11-24 09:31:14 +01:00
Daniel Bevenius	79b8cf2a75	Merge remote-tracking branch 'upstream/master' into backend-sampling	2025-11-21 16:38:32 +01:00
Daniel Bevenius	9b2439347f	common, tools : refactor model loading to support backend samplers This commit refactors the model loading process in common/common.cpp to enable backend samplers to be configured prior to the llama_context creation. The motivation for this change is that just being able to set/reset the backend samplers after the llama_context has been created will cause a resize to occur in llama_context::output_reserve which we want to avoid.	2025-11-21 14:26:52 +01:00
Daniel Bevenius	61ffe41dc1	sampling : use pinned memory for backend sampling buffers	2025-11-21 14:02:16 +01:00
Daniel Bevenius	c1625620f6	sampling : return early if backend sampling is disabled	2025-11-21 08:47:31 +01:00
Georgi Gerganov	196f5083ef	common : more accurate sampling timing (#17382 ) * common : more accurate sampling timing * eval-callback : minor fixes * cont : add time_meas impl * cont : fix log msg [no ci] * cont : fix multiple definitions of time_meas * llama-cli : exclude chat template init from time measurement * cont : print percentage of unaccounted time * cont : do not reset timings	2025-11-20 13:40:10 +02:00
Daniel Bevenius	0c660e7390	Merge remote-tracking branch 'upstream/master' into backend-sampling	2025-11-20 06:57:24 +01:00
Georgi Gerganov	38f408c253	common : fix regression caused by extra memory allocations during sampling	2025-11-19 13:43:29 +02:00
Daniel Bevenius	51fee29822	sampling : always populate logits for sampled probs This commit updates common/sampler.cpp set_logits and src/llama-sampling.cpp llama_sampler_sample to always populate the logits field when backend sampled probabilities are available. The motivation for this is that this ensures that CPU samplers always have access to the logits values even when probabilities have been produced by backend samplers.	2025-11-19 07:14:11 +01:00
Xuan-Son Nguyen	10e9780154	chat: fix int overflow, prevent size calculation in float/double (#17357 ) * chat: fix int overflow, prevent size calculation in float/double * Update common/chat.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-18 19:11:53 +01:00
hksdpc255	1920345c3b	common : Generalized XML-style tool-call parsing with streaming support (GLM 4.5/4.6 + MiniMax M2 + SeedOSS + Kimi-K2 + Qwen3-Coder + Apriel-1.5 + Xiaomi-MiMo) (#16932 ) * Add files via upload * fix unit test * fix crashes for --reasoning-format=none * Patch buggy official MiniMax-M2 chat template * add upstream minja fix: https://github.com/ochafik/minja/pull/7 * Fix <think> token not generated * add test copied from https://github.com/ggml-org/llama.cpp/pull/16946 * cleanup * Hopes to fix the compilation error on CI * Delete chat template patching since it’s fixed by upstream Minja * Remove unneeded Minimax-M2 template patch https://github.com/ochafik/minja/pull/7#issuecomment-3480356100 * Add proper handling of optional parameters with test merged tests from: `23d4bb75c4` * Fix making all tool parameters optional * Move xml tool parser to separate file * cleanup & add tests for GLM4.5 * add streaming tests & enhancement & cleanups Add streaming test for both GLM 4.5 and minimax-m2. Cleanup for preserved_tokens. Cleanup for grammar rule name. Enhance the parser's stability. * cleanup & add support for Kimi-K2 Qwen3-Coder Apriel-1.5 Xiaomi-MiMo * apply suggestions from reviewers * fix a misuse for data.grammar_lazy * fix grammar when tool have no argument * Fix `no triggers set for lazy grammar!` for GLM4.5/4.6. Insert additional stops for Kimi-K2 * update chat.cpp * fix grammar for GLM 4.5/4.6 * Try fix Jinja template for GLM * Try fix GLM-4.6.jinja * Update common/chat-parser-xml-toolcall.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-chat.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * improve chat template for GLM, rename Kimi-K2 template to Kimi-K2-Thinking * Improve Kimi-K2 chat template * Fix unit test * Fix "Invalid tool call arguments passed" in a rare case. In a rare case, the model may emit a raw string that begins with a valid JSON string. This commit adds unit tests to cover that scenario and fixes the regression introduced during the Kimi-K2 adaptation. --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-18 18:54:15 +01:00
Daniel Bevenius	82957a90f2	sampling : always expose sampled_ids This commit precomputes and caches the full-vocab token id list in llama_context's constructor, so llama_get_backend_sampled_token_ids_ith always returns a valid pointer. The motivation for this is that it allows both common/sampling.cpp and src/llama-sampling.cpp to simplify their logic. Not all backend samplers that process logits need to set the sampled_tokens_id as they may not change the order of the logits, for example the temperature sampler only scales the logits but does not change their order. Similarly, the logit bias sampler only adds bias to specific token ids but does not change the order of the logits. In these cases there will not be a device to host copy of the sampled token ids, and this is the use case where having this precomputed list is useful.	2025-11-18 15:11:59 +01:00
Georgi Gerganov	4b52e59903	graph : do not include llama-model.h	2025-11-18 13:53:25 +02:00
Daniel Bevenius	7884b0e0ac	sampling : add support for backend sampling This commit adds support for performing sampling operations on the backend (e.g. GPU) as part of the model computation graph. The motivation for this feature is to enable sampling to be performed directly on the backend as part of the computation graph being executed, allowing for some or all of the sampling to be done on the backend. For example, the backend sampler chain might select/sample a token directly in which case only the sampled token needs to be transferred from device memory to host memory. It is also possible for the backend samplers to perform filtering of the logits, or compute and filter the probability distribution, in which case only the filtered logits or probabilities need to be transferred back to system memory for further processing by CPU samplers. Currently the backend sampling works in a similar manner to how pooling works: it is a function that is called by build_graph and the sampler operations become part of the model's computation graph.	2025-11-17 16:15:58 +01:00
Xuan-Son Nguyen	9b17d74ab7	mtmd: add mtmd_log_set (#17268 )	2025-11-14 15:56:19 +01:00
Adrien Gallouët	52cf111b31	cmake : cleanup (#17199 )	2025-11-12 14:48:30 +02:00
Adrien Gallouët	78010a0d52	cmake : move OpenSSL linking to vendor/cpp-httplib (#17177 ) * cmake : move OpenSSL linking to vendor/cpp-httplib Signed-off-by: Adrien Gallouët <angt@huggingface.co> * bring back httplib 0.27.0 * add -DLLAMA_HTTPLIB * update cmake config for visionos --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co> Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-11-12 12:32:50 +01:00
Xuan-Son Nguyen	1d45b4228f	vendor: split httplib to cpp/h files (#17150 ) * vendor: split httplib to cpp/h files * move defines * include httplib if curl is not used * add TODO * fix build ios * fix build visionos instead	2025-11-11 13:32:58 +01:00
Georgi Gerganov	f914544b16	batched-bench : add "separate text gen" mode (#17103 )	2025-11-10 12:59:29 +02:00
Xuan-Son Nguyen	aa3b7a90b4	arg: add --cache-list argument to list cached models (#17073 ) * arg: add --cache-list argument to list cached models * new manifest naming format * improve naming * Update common/arg.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-08 21:54:14 +01:00
Xuan-Son Nguyen	5c9a18e674	common: move download functions to download.(cpp\|h) (#17059 ) * common: move download functions to download.(cpp\|h) * rm unused includes * minor cleanup --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-07 11:23:34 +01:00
Georgi Gerganov	13b339bcd9	server : do not default to multiple slots with speculative decoding (#17017 ) * server : do not default to multiple slots with speculative decoding * cont : fix	2025-11-05 14:32:55 +02:00
Xuan-Son Nguyen	070ff4d535	mtmd: add --image-min/max-tokens (#16921 )	2025-11-03 11:11:18 +01:00
Aldehir Rojas	87c9efc3b2	common : move gpt-oss reasoning processing to init params (#16937 )	2025-11-02 16:56:28 +02:00
Sigbjørn Skjæret	961660b8c3	common : allow --system-prompt-file for diffusion-cli (#16903 )	2025-11-01 11:01:42 +01:00
Shagun Bera	835e918d84	common: fix typo in cli help text (#16864 )	2025-10-30 17:47:31 +02:00
Sam Malayek	1c1409e131	embedding: add raw option for --embd-output-format (#16541 ) * Add --embd-output-format raw for plain numeric embedding output This new option outputs embeddings as raw space-separated floats, without JSON or 'embedding N:' prefixes. Useful for downstream vector pipelines and scripting. * Move raw output handling into format handling section * Move raw output handling into else-if block with other format handlers * Use LOG instead of printf for raw embedding output * docs: document 'raw' embedding output format in arg.cpp and README	2025-10-28 12:51:41 +02:00
Aldehir Rojas	280d97be96	grammar : support array references in json schema (#16792 ) * grammar : support array references in json schema * Update json-schema-to-grammar.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * grammar : improve regex when naming ref derived rules * grammar : replace non-conformant definitions array with anyOf test case --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-28 09:37:52 +01:00
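
Commit 9e5e09d087 in the list above describes splitting the sampler chain at the longest prefix of backend-supported samplers. The rule can be illustrated with a minimal Python sketch. This is not llama.cpp code: the function names and the `BACKEND_SUPPORTED` set are hypothetical and only mirror the examples given in that commit message.

```python
from typing import List, Set, Tuple

# Hypothetical set of samplers assumed to have a backend implementation.
# It follows the examples in the commit message, not the real code base.
BACKEND_SUPPORTED: Set[str] = {"top_k", "temperature"}


def parse_samplers(arg: str) -> List[str]:
    """Split a ';'-separated sampler list such as 'top_k;temperature'."""
    return [name for name in arg.split(";") if name]


def split_sampler_chain(
    chain: List[str], supported: Set[str] = BACKEND_SUPPORTED
) -> Tuple[List[str], List[str]]:
    """Return (backend_part, cpu_part) for a sampler chain.

    The backend part is the longest contiguous run of supported samplers
    counted from the start of the chain; everything from the first
    unsupported sampler onward stays on the CPU.
    """
    split = 0
    for name in chain:
        if name not in supported:
            break
        split += 1
    return chain[:split], chain[split:]


if __name__ == "__main__":
    for arg in ("top_k;temperature", "top_k;temperature;top_p", "penalties;top_k"):
        backend, cpu = split_sampler_chain(parse_samplers(arg))
        print(f"{arg!r}: backend={backend} cpu={cpu}")
```

Under these assumptions, a chain that starts with an unsupported sampler yields an empty backend part, which corresponds to the "all sampling runs on the CPU" case in the commit message.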


645 Commits