7248 Commits

Author SHA1 Message Date
Pascal
5ceed62421 server: fix duplicate HTTP headers in multiple models mode (#17698)
* llama-server: fix duplicate HTTP headers in multiple models mode (#17693)

* llama-server: address review feedback from ngxson

- restrict scope of header after std::move
- simplify header check (remove unordered_set)
b7248
2025-12-03 10:28:43 +01:00
Reese Levine
7ca5991d2b ggml webgpu: add support for emscripten builds (#17184)
* Faster tensors (#8)

Add fast matrix and matrix/vector multiplication.

* Use map for shader replacements instead of pair of strings

* Wasm (#9)

* webgpu : fix build on emscripten

* more debugging stuff

* test-backend-ops: force single thread on wasm

* fix single-thread case for init_tensor_uniform

* use jspi

* add pthread

* test: remember to set n_thread for cpu backend

* Add buffer label and enable dawn-specific toggles to turn off some checks

* Intermediate state

* Fast working f16/f32 vec4

* Working float fast mul mat

* Clean up naming of mul_mat to match logical model, start work on q mul_mat

* Setup for subgroup matrix mat mul

* Basic working subgroup matrix

* Working subgroup matrix tiling

* Handle weirder sg matrix sizes (but still % sg matrix size)

* Working start to gemv

* working f16 accumulation with shared memory staging

* Print out available subgroup matrix configurations

* Vectorize dst stores for sg matrix shader

* Gemv working scalar

* Minor set_rows optimization (#4)

* updated optimization, fixed errors

* non vectorized version now dispatches one thread per element

* Simplify

* Change logic for set_rows pipelines

---------

Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Comment on dawn toggles

* Working subgroup matrix code for (semi)generic sizes

* Remove some comments

* Cleanup code

* Update dawn version and move to portable subgroup size

* Try to fix new dawn release

* Update subgroup size comment

* Only check for subgroup matrix configs if they are supported

* Add toggles for subgroup matrix/f16 support on nvidia+vulkan

* Make row/col naming consistent

* Refactor shared memory loading

* Move sg matrix stores to correct file

* Working q4_0

* Formatting

* Work with emscripten builds

* Fix test-backend-ops emscripten for f16/quantized types

* Use emscripten memory64 to support get_memory

* Add build flags and try ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* Remove extra whitespace

* Move wasm single-thread logic out of test-backend-ops for cpu backend

* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan

* Fix .gitignore

* Add memory64 option and remove unneeded macros for setting threads to 1

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
b7247
2025-12-03 10:25:34 +01:00
Sigbjørn Skjæret
b3e3060f4e ci : move release details to the top visible by default (#17719) 2025-12-03 09:22:46 +01:00
Herman Semenoff
37adc9c6ba ggml, llama : use defaulted constructors/destructors (#17649) b7245 2025-12-03 07:12:18 +01:00
Marcos Del Sol Vives
16cc3c606e build: document how to compile with Vulkan using Debian/Ubuntu packages (#17688) b7244 2025-12-03 08:25:11 +08:00
Xuan-Son Nguyen
13628d8bdb server: add --media-path for local media files (#17697)
* server: add --media-path for local media files

* remove unused fn
b7243
2025-12-02 22:49:20 +01:00
Xuan-Son Nguyen
a96283adc4 mtmd: fix --no-warmup (#17695) 2025-12-02 22:48:08 +01:00
Ali Tariq
4eba8d9451 ci : RVV1.0 builds with tests (#16682)
* Added RISC-V supported tests

* Added default value for LLAMA_FATAL_WARNINGS and option to specify by user

* Added RISC-V supported tests

* Added default value for LLAMA_FATAL_WARNINGS and option to specify by user

* Removed apt prompt

* Added RISC-V specific tests with corrections

Corrections included:
1. Changed the test names from debian to ubuntu as it is more stable than Debian Trixie
2. Added explicit compiler in cmake command as GCC compiler below version 14 have been recorded
to throw errors with rvv1.0 and some other extensions
3. Added dependencies which are not installed by default in the RISC-V Ubuntu 24.04
4. Separate ccache directory for all jobs as all the ccache results are not the same and may cause ccache to not work

* Resolved the merge conflict and cleaned up run.sh

* Update ci/run.sh

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Removed previously added build ci for RISC-V

* Removed trailing whitespaces

* corrected build name

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* cleanup

* Enabled build tests (1)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Enabled build tests (2)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* enable openssl

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-02 21:46:10 +01:00
Jeff Bolz
61bde8e21f vulkan: Reduce temporary memory usage for TOP_K (#17623)
- Compute row size for the temp buffer based on the output of the first pass.
- Update shader addressing math to use the output row size
- Pass the output row size as "ncols_output", what used to be "ncols_output" is now "k"

For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer
from about 3.2MB to 500KB.
b7240
2025-12-02 19:22:04 +01:00
xiaobing318
e251e5ebbe cmake : add utf8 compilation options for msvc (#17682) b7239 2025-12-02 19:50:57 +02:00
Chad Voegele
c4357dcc35 Server: Change Invalid Schema from Server Error (500) to User Error (400) (#17572)
* Make invalid schema a user error (400)

* Move invalid_argument exception handler to ex_wrapper

* Fix test

* Simplify test back to original pattern
2025-12-02 17:33:50 +01:00
Adrien Gallouët
e148380c7c ggml : use svcntb() for SVE vector length detection (#17474)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b7237
2025-12-02 18:21:11 +02:00
TianHao324
a2b0fe8d37 CANN: Disable Ger operator of OUT_PROD on 310p device (#17563) b7236 2025-12-02 20:35:23 +08:00
Daniel Bevenius
7f3a72a8ed ggml : remove redundant n_copies check when setting input/output (#17612)
This commit removes a redundant check for sched->n_copies > 1 when
setting input and output flags on tensor copies in
ggml_backend_sched_split_graph.

The motivation for this change is to clarify the code as the outer if
statement already performs this check.
b7235
2025-12-02 12:52:45 +01:00
Eric Curtin
b9a37717b0 codeowners : remove ericcurtin (#17658)
Taking a break from llama.cpp . I wasn't around at the start of llama.cpp
but I want to thank @ggerganov and @slaren for creating a neat community
here.

Signed-off-by: Eric Curtin <eric.curtin@docker.com>
2025-12-02 12:18:15 +01:00
Adrien Gallouët
f3a9674ae8 llama : fix signed comparison warning on FreeBSD (#17497)
This ensures correct RLIM_INFINITY handling and compatibility on all platforms (32/64-bit).

    warning: comparison of integers of different signs: 'rlim_t' (aka 'long') and 'size_t' (aka 'unsigned long') [-Wsign-compare]
      488 |         if (suggest && (lock_limit.rlim_max > lock_limit.rlim_cur + size)) {
          |                         ~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b7233
2025-12-02 12:05:38 +01:00
Xuan-Son Nguyen
2c453c6c77 convert: add error message for mistral3 quantized weight (#17686) 2025-12-02 11:48:31 +01:00
Xuan-Son Nguyen
5d6bd842ea server: remove default "gpt-3.5-turbo" model name (#17668)
* server: remove default "gpt-3.5-turbo" model name

* do not reflect back model name from request

* fix test
b7231
2025-12-02 11:38:57 +01:00
senhtry
fd3abe849e server: fixing naming conflict res_error in server-models.cpp (#17679) b7230 2025-12-02 11:18:39 +01:00
Xuan-Son Nguyen
682e6658bb server: explicitly set exec path when create new instance (#17669)
* Revert "rm unused fn"

This reverts commit f2dbe9c087.

* server: explicitly set exec path when create new instance

* put back TODO

* only call get_server_exec_path() once

* add fallback logic
b7229
2025-12-02 10:25:11 +01:00
Adrien Gallouët
4574f2949e ci : skip winget update when not in ggml-org (#17465)
Prevent forks from generating daily failure notifications.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-12-02 10:15:01 +01:00
Adrien Gallouët
ab6726eeff ggml : add fallback definition for HWCAP2_SVE2 (#17683)
This align with other HWCAP2 feature flags

See #17528

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b7227
2025-12-02 10:41:26 +02:00
Aleksander Grygier
cee92af553 Add context info to server error (#17663)
* fix: Add context info to server error

* chore: update webui build output
2025-12-02 09:20:57 +01:00
Aman Gupta
ed32089927 ggml-cuda: reorder only relevant nodes (#17639) b7225 2025-12-02 12:36:31 +08:00
Aaron Teo
7b6d745364 release: fix duplicate libs, store symbolic links (#17299) b7224 2025-12-02 11:52:05 +08:00
Neo Zhang Jianyu
98bd9ab1e4 enhance argsort for UT (#17573)
Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>
b7223
2025-12-02 08:56:46 +08:00
Piotr Wilkin (ilintar)
746f9ee889 Override SSM_A op for Qwen3 Next to reduce splits (#17587)
* Override SSM_A op for Qwen3 Next to reduce splits

* New tensor mapping SSM_A_NOSCAN for SSM_A used outside of OP_SSM_SCAN context.

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b7222
2025-12-02 00:43:13 +01:00
Jeff Bolz
9810cb8247 ops.md: update vulkan support (#17661) 2025-12-01 15:26:21 -06:00
Xuan-Son Nguyen
ecf74a8417 mtmd: add mtmd_context_params::warmup option (#17652)
* mtmd: add mtmd_context_params::warmup option

* reuse the common_params::warmup
b7220
2025-12-01 21:32:25 +01:00
Gilad S.
00c361fe53 fix: llama arch implementation (#17665) b7219 2025-12-01 21:21:13 +01:00
Xuan-Son Nguyen
ec18edfcba server: introduce API for serving / loading / unloading multiple models (#17470)
* server: add model management and proxy

* fix compile error

* does this fix windows?

* fix windows build

* use subprocess.h, better logging

* add test

* fix windows

* feat: Model/Router server architecture WIP

* more stable

* fix unsafe pointer

* also allow terminate loading model

* add is_active()

* refactor: Architecture improvements

* tmp apply upstream fix

* address most problems

* address thread safety issue

* address review comment

* add docs (first version)

* address review comment

* feat: Improved UX for model information, modality interactions etc

* chore: update webui build output

* refactor: Use only the message data `model` property for displaying model used info

* chore: update webui build output

* add --models-dir param

* feat: New Model Selection UX WIP

* chore: update webui build output

* feat: Add auto-mic setting

* feat: Attachments UX improvements

* implement LRU

* remove default model path

* better --models-dir

* add env for args

* address review comments

* fix compile

* refactor: Chat Form Submit component

* ad endpoint docs

* Merge remote-tracking branch 'webui/allozaur/server_model_management_v1_2' into xsn/server_model_maagement_v1_2

Co-authored-by: Aleksander <aleksander.grygier@gmail.com>

* feat: Add copy to clipboard to model name in model info dialog

* feat: Model unavailable UI state for model selector

* feat: Chat Form Actions UI logic improvements

* feat: Auto-select model from last assistant response

* chore: update webui build output

* expose args and exit_code in API

* add note

* support extra_args on loading model

* allow reusing args if auto_load

* typo docs

* oai-compat /models endpoint

* cleaner

* address review comments

* feat: Use `model` property for displaying the `repo/model-name` naming format

* refactor: Attachments data

* chore: update webui build output

* refactor: Enum imports

* feat: Improve Model Selector responsiveness

* chore: update webui build output

* refactor: Cleanup

* refactor: Cleanup

* refactor: Formatters

* chore: update webui build output

* refactor: Copy To Clipboard Icon component

* chore: update webui build output

* refactor: Cleanup

* chore: update webui build output

* refactor: UI badges

* chore: update webui build output

* refactor: Cleanup

* refactor: Cleanup

* chore: update webui build output

* add --models-allow-extra-args for security

* nits

* add stdin_file

* fix merge

* fix: Retrieve lost setting after resolving merge conflict

* refactor: DatabaseStore -> DatabaseService

* refactor: Database, Conversations & Chat services + stores architecture improvements (WIP)

* refactor: Remove redundant settings

* refactor: Multi-model business logic WIP

* chore: update webui build output

* feat: Switching models logic for ChatForm or when regenerating messges + modality detection logic

* chore: update webui build output

* fix: Add `untrack` inside chat processing info data logic to prevent infinite effect

* fix: Regenerate

* feat: Remove redundant settigns + rearrange

* fix: Audio attachments

* refactor: Icons

* chore: update webui build output

* feat: Model management and selection features WIP

* chore: update webui build output

* refactor: Improve server properties management

* refactor: Icons

* chore: update webui build output

* feat: Improve model loading/unloading status updates

* chore: update webui build output

* refactor: Improve API header management via utility functions

* remove support for extra args

* set hf_repo/docker_repo as model alias when posible

* refactor: Remove ConversationsService

* refactor: Chat requests abort handling

* refactor: Server store

* tmp webui build

* refactor: Model modality handling

* chore: update webui build output

* refactor: Processing state reactivity

* fix: UI

* refactor: Services/Stores syntax + logic improvements

Refactors components to access stores directly instead of using exported getter functions.

This change centralizes store access and logic, simplifying component code and improving maintainability by reducing the number of exported functions and promoting direct store interaction.

Removes exported getter functions from `chat.svelte.ts`, `conversations.svelte.ts`, `models.svelte.ts` and `settings.svelte.ts`.

* refactor: Architecture cleanup

* feat: Improve statistic badges

* feat: Condition available models based on modality + better model loading strategy & UX

* docs: Architecture documentation

* feat: Update logic for PDF as Image

* add TODO for http client

* refactor: Enhance model info and attachment handling

* chore: update webui build output

* refactor: Components naming

* chore: update webui build output

* refactor: Cleanup

* refactor: DRY `getAttachmentDisplayItems` function + fix UI

* chore: update webui build output

* fix: Modality detection improvement for text-based PDF attachments

* refactor: Cleanup

* docs: Add info comment

* refactor: Cleanup

* re

* refactor: Cleanup

* refactor: Cleanup

* feat: Attachment logic & UI improvements

* refactor: Constants

* feat: Improve UI sidebar background color

* chore: update webui build output

* refactor: Utils imports + move types to `app.d.ts`

* test: Fix Storybook mocks

* chore: update webui build output

* test: Update Chat Form UI tests

* refactor: Tooltip Provider from core layout

* refactor: Tests to separate location

* decouple server_models from server_routes

* test: Move demo test  to tests/server

* refactor: Remove redundant method

* chore: update webui build output

* also route anthropic endpoints

* fix duplicated arg

* fix invalid ptr to shutdown_handler

* server : minor

* rm unused fn

* add ?autoload=true|false query param

* refactor: Remove redundant code

* docs: Update README documentations + architecture & data flow diagrams

* fix: Disable autoload on calling server props for the model

* chore: update webui build output

* fix ubuntu build

* fix: Model status reactivity

* fix: Modality detection for MODEL mode

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b7218
2025-12-01 19:41:04 +01:00
Xuan-Son Nguyen
7733409734 common: improve verbosity level definitions (#17630)
* common: improve verbosity level definitions

* string_format

* update autogen docs
b7217
2025-12-01 14:38:13 +01:00
Xuan-Son Nguyen
cd3c118908 model: support Ministral3 (#17644)
* conversion script

* support ministral 3

* maybe this is better?

* add TODO for rope_yarn_log_mul

* better ppl (tested on 14B-Instruct)

* Add Ministral3 support to Mistral format

* improve arch handling

* add sizes

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* nits

---------

Co-authored-by: Julien Denize <julien.denize@mistral.ai>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b7216
2025-12-01 12:26:52 +01:00
Georgi Gerganov
649495c9d9 metal : add FA head size 48 (#17619) b7215 2025-12-01 12:49:53 +02:00
Georgi Gerganov
90c72a614a ggml : extend the GGML_SCHED_NO_REALLOC debug logic of the scheduler (#17617) b7214 2025-12-01 12:49:33 +02:00
Aman Gupta
6eea666912 llama-graph: avoid expand_forward for fusion (#17633) b7213 2025-12-01 11:12:48 +02:00
Xuan-Son Nguyen
ff90508d68 contributing: update guidelines for AI-generated code (#17625)
* contributing: update guidelines for AI-generated code

* revise
b7212
2025-11-30 22:51:34 +01:00
Adrien Gallouët
0a4aeb927d cmake : add option to build and link LibreSSL (#17552)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b7211
2025-11-30 22:14:32 +01:00
Tarek Dakhran
2ba719519d model: LFM2-VL fixes (#17577)
* Adjust to pytorch

* Add antialiasing upscale

* Increase number of patches to 1024

* Handle default marker insertion for LFM2

* Switch to flag

* Reformat

* Cuda implementation of antialias kernel

* Change placement in ops.cpp

* consistent float literals

* Pad only for LFM2

* Address PR feedback

* Rollback default marker placement changes

* Fallback to CPU implementation for antialias implementation of upscale
b7210
2025-11-30 21:57:31 +01:00
Xuan-Son Nguyen
7f8ef50cce clip: fix nb calculation for qwen3-vl (#17594) b7209 2025-11-30 15:33:55 +01:00
Xuan-Son Nguyen
3c136b21a3 cli: add migration warning (#17620) b7208 2025-11-30 15:32:43 +01:00
Adrien Gallouët
beb1f0c503 common : throttle download progress output to reduce IO flush (#17427)
This change limits progress updates to approximately every 0.1% of the
file size to minimize stdio overhead.

Also fixes compiler warnings regarding __func__ in lambdas.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b7207
2025-11-30 14:22:44 +02:00
Aaron Teo
def5404f26 common: add LLAMA_LOG_FILE env var (#17609)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
b7206
2025-11-30 12:12:32 +01:00
Gilad S.
fa0465954f ggml: fix: macOS build with -DGGML_BACKEND_DL=ON (#17581) b7205 2025-11-30 10:00:59 +08:00
ddh0
5a6241feb0 common: update env var name (#17588) b7204 2025-11-30 09:59:25 +08:00
Aman Gupta
c7af376c29 CUDA: add stream-based concurrency (#16991)
* CUDA: add stream-based concurrency

* HIP: fix hipStreamWaitEvent define and nodiscard warnings

* ggml-cuda: fix fusion inside stream

* ggml-cuda: fix bug w.r.t first stream launch

* ggml-cuda: format

* ggml-cuda: improve assert message

* ggml-cuda: use lambda instead of duplicating code

* ggml-cuda: add some more comments

* ggml-cuda: add more detailed comments about concurrency

* ggml-cuda: rename + remove unused var

* ggml-cuda: fix condition for stream launch

* ggml-cuda: address review comments, add destructor

* common.cuh: add is_valid for concurrent events

* common.cuh: make comment better

* update comment

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* update comment

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* common.cuh: fix lower_bound condition + remove join_node data from write_ranges

* ggml-cuda: fix overlap condition + shadowing parameter

---------

Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b7203
2025-11-30 08:17:55 +08:00
Mahekk Shaikh
00425e2ed1 cuda : add error checking for cudaMemcpyAsync in argsort (#17599)
* cuda : add error checking for cudaMemcpyAsync in argsort (#12836)

* fix indentation
b7202
2025-11-30 08:16:28 +08:00
Acly
385c3da5e6 vulkan : fix FA mask load with bounds check (coopmat2) (#17606) b7201 2025-11-30 01:03:21 +01:00
Xuan-Son Nguyen
ab49f094d2 server: move server-context to its own cpp|h (#17595)
* git mv

* add server-context.h

* add server-context.h

* clean up headers

* cont : cleanup

* also expose server_response_reader (to be used by CLI)

* fix windows build

* decouple server_routes and server_http

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b7200
2025-11-29 22:04:44 +01:00
Haiyue Wang
8c32d9d96d server: explicitly set the function name in lambda (#17538)
As [1] explained, the real debug message will be like:
	"res    operator(): operator() : queue result stop"

Set the name explicitly, the message is easy for debugging:
	"res    operator(): recv : queue result stop"

The left "operator()" is generated by 'RES_DBG() ... __func__'

[1]: https://clang.llvm.org/extra/clang-tidy/checks/bugprone/lambda-function-name.html

Signed-off-by: Haiyue Wang <haiyuewa@163.com>
b7199
2025-11-29 18:43:29 +01:00