Piotr Wilkin
1171723f9d
Fix FATTN profiling
2026-05-12 23:58:28 +02:00
Piotr Wilkin
f0210cc40d
Converge implementation with export-graph-ops
2026-05-09 17:24:54 +02:00
Piotr Wilkin
2bbcc61af7
Add missing op parameters to the profiler; add support for test-backend-ops to run performance tests with exactly the tensor shapes from the run
2026-05-09 17:24:54 +02:00
Piotr Wilkin
395c43eadf
fix mul_mat_id stats, add throughput stat, add envvar trigger, fix concurrent mode
2026-05-09 17:24:54 +02:00
Piotr Wilkin
b1252bcd73
fix builds, integrate vulkan profiler, fix copy events, fix export
2026-05-09 17:24:54 +02:00
Piotr Wilkin
8291ecd707
Fix more missing backend stuff (and Python errors)
2026-05-09 17:24:54 +02:00
Piotr Wilkin
391d2bf23a
add second dimension to reported tensors, fix Mac build, add missing initializer to all backends
2026-05-09 17:24:54 +02:00
Piotr Wilkin
2e66d2c130
feat: cool profiler thingy
2026-05-09 17:24:54 +02:00
smugman-dot
5d6f18a638
webui: fix LLM title generation for agentic conversations ( #22840 )
2026-05-08 16:36:04 +02:00
Xuan-Son Nguyen
29debb3a6a
server: support Vertex AI compatible API ( #22545 )
* server: support Vertex AI compatible API
* a bit safer
* support other AIP_* env var
* various fixes
* if AIP_MODE is unset, do nothing
* fix test case
* fix windows build
2026-05-08 15:23:04 +02:00
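The "if AIP_MODE is unset, do nothing" behavior noted above could be gated roughly as follows. This is an illustrative Python sketch only; the function name and the truthiness check are assumptions, and the actual server is C++:

```python
import os

def vertex_ai_compat_enabled() -> bool:
    # Per the commit notes: if AIP_MODE is unset, the Vertex AI
    # compatibility layer does nothing. Treating any non-empty value as
    # "enabled" is an assumption of this sketch; the real check may be
    # stricter about which AIP_* values it accepts.
    return bool(os.environ.get("AIP_MODE"))
```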
Xuan-Son Nguyen
9dcf835528
server: (router) expose child model info from router's /v1/models ( #22683 )
* server: (router) expose child model info from router's /v1/models
* update docs
2026-05-08 14:42:15 +02:00
Aleksander Grygier
9b2925e1e0
webui: Add Import/Export of Settings configuration + improve architecture ( #22803 )
* refactor: Settings keys as constant object keys
* chore: Run `npm audit fix`
* refactor: Settings Sections UI
* feat: Refactor Settings structure and implement import/export logic
* feat: Introduce ROUTES constant and RouterService
* refactor: Consolidate settings definitions into registry
* refactor: Update settings page routing structure
* chore: Migrate hardcoded URLs to use ROUTES and RouterService
* feat: Enhance model selection logic for settings and chat
* chore: Update webui static build
* refactor: Address PR review comments
* fix: Remove unneeded setting
* fix: Re-add missing settings
* fix: Add missing `/slots` proxy for webui dev mode
* chore: Dev-mode logs
* fix: Data binding
* fix: Steering for non-agentic flow
2026-05-08 11:26:04 +02:00
smugman-dot
aaf4a4d5e0
webui: add option for LLM title generation ( #22265 )
* webui: add LLM title generation option
* webui: use chat_template_kwargs for title gen + fix conversation check
* webui: capture firstUserMessage before async streamChatCompletion to fix race condition
* webui: extract LLM title generation into separate method
* webui: use constants and ChatService for LLM generated titles
* webui: rebuild static output
* webui: add LLM title generation setting to new settings location
* webui: use sendMessage in generateTitle
* webui: rebuild static output
* webui: fix formatting
* webui: configurable title prompt, remove think tag regexes, fix TS error
* webui: group title constants into TITLE object, use TruncatedText for CSS truncation and fix race condition
* webui: rebuild static output
2026-05-07 21:14:03 +02:00
Pascal
cc97e45a14
mtmd: fix whisper audio tail truncation by exposing padded buffer to FFT ( #22770 )
2026-05-07 14:01:01 +02:00
Pascal
f4b5a2ee91
webui: fix ?model= URL param race in router mode ( #22771 )
* webui: fix ?model= URL param race in router mode
* chore: update webui build output
2026-05-07 13:09:32 +02:00
viggy
e358d75adb
webui: fix flicker issue on dismiss animation on overlay primitives ( #22773 )
* add fill-mode-forwards
* generated diffs
2026-05-07 08:11:31 +02:00
tc-mb
2496f9c149
mtmd : support MiniCPM-V 4.6 ( #22529 )
* Support MiniCPM-V 4.6 in new branch
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* fix code bug
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* fix pre-commit
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* fix convert
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* rename clip_graph_minicpmv4_6
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* use new TYPE_MINICPMV4_6
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* use build_attn to allow flash attention support
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* don't use legacy code; restored here
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* use the existing tensor names
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* unused ctx->model.hparams.minicpmv_version
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* use n_merge for slice alignment
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* borrow wa_layer_indexes for vit_merger insertion point
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* fix code style
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* use filter_tensors and add model.vision_tower
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* fix chkhsh
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
* fix type check
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
---------
Signed-off-by: tc-mb <tianchi_cai@icloud.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-06 21:54:09 +02:00
Yakine Tahtah
a00e47e422
mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech) ( #22101 )
* mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech)
Conformer encoder with Shaw relative position encoding,
QFormer projector, log-mel spectrogram with frame stacking.
Encoder uses GLU gating, folded batch norm, and SSM depthwise
conv. QFormer compresses encoder output via windowed
cross-attention (window=15, queries=3) into the LLM embedding
space.
Audio preprocessing: reflect-padded STFT, 80-bin mel filterbank,
dynamic range compression, 2x frame stacking (80->160 mel).
GGUF converter handles batch norm folding at export time,
fused K/V split, and Conv1d weight reshaping.
Tested against HF transformers reference: token-for-token match
on 30s/60s audio clips with greedy decoding.
* mtmd: rename gs_ prefixed tensors to generic/architecture names
* mtmd: use tensor_mapping.py for all granite_speech tensors
* convert: fold GraniteSpeechTextModel into GraniteModel
* mtmd: replace n_layer hack with explicit has_standard_layers flag
* mtmd: replace hardcoded magic numbers with GGUF hparams for granite speech
* mtmd: align KEY_A_ define spacing
* convert: register GraniteModel for GraniteSpeechForConditionalGeneration
* convert: fix ty type-check for GraniteSpeechMmprojModel registration
* mtmd: align TN_ define spacing
* mtmd: use generic layer loop for granite speech tensor loading
* mtmd: merge qformer_proj_layer into clip_layer
* mtmd: granite_speech remove redundant ggml_build_forward_expand on inputs
* mtmd: granite_speech add comment explaining why build_attn is not used
* mtmd: granite_speech hard-code eps in cpp, remove from GGUF metadata
* gguf: add spacing between granite_speech tensor mapping blocks
* mtmd: make generic audio layer_norm_eps read optional
* mtmd: granite_speech keep encoder eps in GGUF, only hard-code projector eps
* mtmd: align defines and struct fields in clip-impl.h and clip-model.h
* mtmd: fix alignment and ordering issues across granite speech files
* convert: granite_speech use filter_tensors instead of modify_tensors for skipping
2026-05-06 14:40:59 +02:00
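The 2x frame stacking described in the commit body (80 -> 160 mel bins) can be sketched as below. This is a stdlib-only illustration, not the converter's code; in particular, dropping a trailing incomplete group is an assumption about edge handling:

```python
def stack_frames(mel_frames, factor=2):
    """Concatenate each group of `factor` consecutive mel frames into one
    wider frame (e.g. 2 x 80 bins -> 160 bins), halving the frame rate."""
    # Drop trailing frames that don't fill a complete group (assumption).
    usable = len(mel_frames) - len(mel_frames) % factor
    return [
        sum((mel_frames[i + k] for k in range(factor)), [])
        for i in range(0, usable, factor)
    ]
```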
Aleksander Grygier
e3e3f8e46a
webui: Remove Google Favicons & Improve MCP Information logic & UI ( #22719 )
* refactor: Remove Google favicon utility
* fix: MCP Server favicon
* refactor: Cleanup
* refactor: MCP Server Information
* fix: Fix MCP Settings UI
* refactor: Cleanup
2026-05-06 11:12:27 +02:00
viggy
07eaf919ed
add tabindex and aria-hidden ( #22699 )
2026-05-06 09:21:58 +02:00
Adrien Gallouët
bf76ac77be
common : only load backends when required ( #22290 )
* common : only load backends when required
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* llama : call ggml_backend_load_all() directly from llama_backend_init()
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Add ggml_backend_load_all() where llama_backend_init() is not used
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-05 09:23:50 +02:00
Georgi Gerganov
2bacb1eb77
server : validate --tools CLI argument against known tool names ( #22538 )
Previously, unknown tool names passed via --tools were silently ignored.
Now the server validates each tool name at startup and exits with an
error if an unrecognized tool is specified, listing the available tools.
Assisted-by: llama.cpp:local pi
2026-05-05 06:35:27 +03:00
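The startup validation this commit describes amounts to checking each requested name against a registry and failing fast instead of silently ignoring it. A minimal Python sketch with a made-up tool set (the server itself is C++, and its actual registry differs):

```python
import sys

# Illustrative only; not the server's real tool registry.
KNOWN_TOOLS = {"get_datetime", "web_search"}

def validate_tool_names(requested):
    # Exit with an error listing the available tools if any name is
    # unknown, rather than silently dropping it.
    unknown = sorted(set(requested) - KNOWN_TOOLS)
    if unknown:
        sys.exit(f"error: unknown tool(s): {', '.join(unknown)}; "
                 f"available: {', '.join(sorted(KNOWN_TOOLS))}")
    return list(requested)
```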
Georgi Gerganov
d6e7b033a4
llama : add option to save memory in device buffers ( #22679 )
* llama : add option to save memory in device buffers
* tests : extend llama-save-load-state
2026-05-05 06:35:07 +03:00
Xuan-Son Nguyen
935a340292
server: implement /models?reload=1 ( #21848 )
2026-05-04 16:23:26 +02:00
JusteLeo
36a694c965
webui : fix circular dependency between chat.service.ts and models.svelte.ts ( #22625 )
2026-05-04 13:38:10 +02:00
Piotr Wilkin (ilintar)
a4701c98f7
common/autoparser: fixes for newline handling / forced tool calls ( #22654 )
* chat/autoparser: the fixes
* Move optspace() to chat-peg-parser; comment out server tests invalidated because content is now allowed with forced tool calls.
* Trim whitespace on apply instead
2026-05-04 13:18:11 +02:00
Evan Huus
c84e6d6db5
server: Add a simple get_datetime server tool ( #22649 )
2026-05-04 12:19:41 +02:00
Nick Towle
fa8feaed34
webui: restore missing settings ( #22666 )
2026-05-04 09:04:07 +02:00
Georgi Gerganov
846262d787
docs : update speculative decoding parameters after refactor ( #22397 ) ( #22539 )
* docs : update speculative decoding parameters after refactor (#22397)
Update docs/speculative.md to reflect the new parameter naming scheme
introduced in PR #22397:
- Replace --draft-max/--draft-min with --spec-draft-n-max/--spec-draft-n-min
- Replace --spec-ngram-size-n/m with per-implementation variants
- Add documentation for all new --spec-ngram-* parameters
- Update all example commands
Assisted-by: llama.cpp:local pi
* pi : add rule to use gh CLI for GitHub resources
Assisted-by: llama.cpp:local pi
* docs : run llama-gen-docs
* arg : fix typo
2026-05-04 08:52:07 +03:00
Georgi Gerganov
0754b7b6fe
server : avoid checkpoint data host copies ( #22558 )
* server : avoid checkpoint data host copies
* llama : refactor llama_io_read_i
2026-05-02 18:03:25 +03:00
Aleksander Grygier
ab6120cde5
webui: Spring Cleaning Refactor v1 ( #22505 )
* wip: server_tools
* feat: Integrate with `/tools` endpoint
* feat: Builtin + MCP + JSON Schema Tools WIP
* refactor
* displayName -> display_name
* snake_case everywhere
* rm redundant field
* feat: Improvements
* chore: update webui build output
* refactor: Updates after server updates
* chore: update webui build output
* change arg to --tools all
* feat: UI improvements
* chore: update webui build output
* add readme mention
* llama-gen-docs
* chore: update webui build output
* chore: update webui build output
* chore: update webui build output
* feat: Reorganize settings sections
* feat: Separate dialogs for MCP Servers Settings and Import/Export
* feat: WIP
* feat: WIP
* feat: WIP
* feat: WIP
* feat: WIP
* feat: WIP
* WIP on allozaur/20677-webui-server-tools
* feat: UI improvements
* chore: Update package lock
* chore: Run `npm audit fix`
* feat: UI WIP
* feat: UI
* refactor: Desktop Icon Strip DRY
* feat: Cleaner rendering and transition for ChatScreen
* feat: UI improvements
* feat: UI improvement
* feat: Remove MCP Server "enable" switch from Tools submenu
* chore: Run `npm audit fix`
* feat: WIP
* feat: Logic improvements
* refactor: Cleanup
* refactor: DRY
* test: Fix Chat Sidebar UI Tests
* chore: Update package lock
* refactor: Cleanup
* feat: Chat Message Action Card with Continue and Permission flow implementations
* feat: Add agentic steering messages, draft messages and improve chat UX
* fix: Search results UI
* test: Fix unit test
* feat: UI/UX improvements
* refactor: Simplify `useToolsPanel` access in components
* feat: Implement Processing Info Context API
* feat: Implement 'Go back to chat' functionality for settings
* feat: Enhance MCP Server management in Chat Form Attachments
* style: Minor UI and branding adjustments
* chore: Update webui static build output
* chore: Formatting, linting & type checks
* feat: Draft messages logic
* feat: UI improvements
* feat: Steering Messages improvements
* refactor: Cleanup
* refactor: Cleanup
* feat: Improve UI
* refactor: Settings navigation hook
* refactor: DRY code
* refactor: DRY ChatMessageUser UI components
* refactor: Desktop Icon Strip DRY
* refactor: Tools & permissions
* fix: Navigation condition
* refactor: Cleanup
* refactor: Cleanup
* refactor: Cleanup
* fix: preserve reasoning_content in agentic flow
* refactor: Storybook cleanup
* refactor: isInViewport util function
* refactor: Rename globally `onClick` to `onclick`
* chore: `npm audit fix`
* refactor: Action Icon usage
* refactor: Naming
* refactor: JS in `class` directive
* refactor: Chat components cleanup WIP
* refactor: Components structure
* refactor: Cleanup WIP
* feat: New ChatAttachmentsPreview component
* feat: UI improvements
* feat: UI improvements
* refactor: Cleanup
* refactor: ChatAttachmentsPreview UI/UX
* refactor: Remove dead code
* refactor: Cleanup
* fix: Model Name aliases displaying
* feat: Shortcut improvements
* refactor: Chat Message
* feat: Move Import/Export to settings
* refactor: Cleanup
* refactor: Cleanup
* refactor: Cleanup
* refactor: Cleanup
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-05-01 18:36:29 +02:00
Georgi Gerganov
80afa33aad
spec : fix draft model checkpoints ( #22521 )
* spec : fix draft model checkpoints
* cont : clean-up
* cont : gate the ngram-mod reset warning behind verbose flag
2026-04-30 08:32:18 +03:00
Georgi Gerganov
683c5acb90
spec : discard last drafted token with low prob ( #22506 )
2026-04-29 17:00:00 +03:00
Pascal
59237bfbbc
webui: fix slow mic stop and WAV encode ( #22480 )
* webui: instant mic stop, race-free recorder restart
* webui: faster WAV PCM encode via hoisted channels and Int16Array
* chore: update webui build output
* webui: drop setTimeout(0) hack and harden cancelRecording
* chore: update webui build output
2026-04-29 12:58:35 +02:00
Aleksander Grygier
f42e29fdf1
webui: Server tools ( #21237 )
* wip: server_tools
* feat: Integrate with `/tools` endpoint
* feat: Builtin + MCP + JSON Schema Tools WIP
* refactor
* displayName -> display_name
* snake_case everywhere
* rm redundant field
* feat: Improvements
* chore: update webui build output
* refactor: Updates after server updates
* chore: update webui build output
* change arg to --tools all
* feat: UI improvements
* chore: update webui build output
* add readme mention
* llama-gen-docs
* chore: update webui build output
* chore: update webui build output
* chore: update webui build output
* feat: Reorganize settings sections
* feat: Separate dialogs for MCP Servers Settings and Import/Export
* feat: WIP
* feat: WIP
* feat: WIP
* feat: WIP
* feat: WIP
* feat: WIP
* WIP on allozaur/20677-webui-server-tools
* feat: UI improvements
* chore: Update package lock
* chore: Run `npm audit fix`
* feat: UI WIP
* feat: UI
* refactor: Desktop Icon Strip DRY
* feat: Cleaner rendering and transition for ChatScreen
* feat: UI improvements
* feat: UI improvement
* feat: Remove MCP Server "enable" switch from Tools submenu
* chore: Run `npm audit fix`
* feat: WIP
* feat: Logic improvements
* refactor: Cleanup
* refactor: DRY
* test: Fix Chat Sidebar UI Tests
* chore: Update package lock
* refactor: Cleanup
* feat: Chat Message Action Card with Continue and Permission flow implementations
* feat: Add agentic steering messages, draft messages and improve chat UX
* fix: Search results UI
* test: Fix unit test
* feat: UI/UX improvements
* refactor: Simplify `useToolsPanel` access in components
* feat: Implement Processing Info Context API
* feat: Implement 'Go back to chat' functionality for settings
* feat: Enhance MCP Server management in Chat Form Attachments
* style: Minor UI and branding adjustments
* chore: Update webui static build output
* chore: Formatting, linting & type checks
* feat: Draft messages logic
* feat: UI improvements
* feat: Steering Messages improvements
* refactor: Cleanup
* refactor: Cleanup
* feat: Improve UI
* refactor: Settings navigation hook
* refactor: DRY code
* refactor: DRY ChatMessageUser UI components
* refactor: Desktop Icon Strip DRY
* refactor: Tools & permissions
* fix: Navigation condition
* refactor: Cleanup
* refactor: Cleanup
* refactor: Cleanup
* fix: preserve reasoning_content in agentic flow
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-04-28 14:35:49 +03:00
Georgi Gerganov
14e733e36f
spec : refactor params ( #22397 )
* spec : refactor params
* cont : fix
* cont : rename "sparam" to "sampling"
* cont : add spec params category
* cont : add info about removed arguments
* cont : skip param length check for spec params
* cont : adapt server tests
2026-04-28 09:07:33 +03:00
Aman Gupta
516e8d7a8a
server: use pos_next instead of n_tokens for m-rope ( #22439 )
2026-04-28 08:41:00 +03:00
tha80
983ca8992e
server: (router) Forward form-data to model server ( Fixes #22044 ) ( #22118 )
* This commit enables the router to forward form-data to the model server.
Fixes #22044 (enables use of /v1/audio/transcriptions in router mode)
* Applied the suggestion from Copilot's first comment: use the non-throwing json::parse overload.
* Addressed Copilot's third comment by extending the files representation to also include filename and content-type
* Addressed Copilot's fourth comment by making the RNG thread_local
* Changed variable body from std::string to std::ostringstream in build_multipart_body
as suggested by ngxson in https://github.com/ggml-org/llama.cpp/pull/22118#discussion_r3127099053
* Added sanitize_field lambda in build_multipart_body for key, filename and content_type
as suggested by ngxson in https://github.com/ggml-org/llama.cpp/pull/22118#discussion_r3127104647
* Explicitly check that value/item is a string before calling value/item.get<std::string>()
as requested by ngxson in https://github.com/ggml-org/llama.cpp/pull/22118#discussion_r3127111279
* Added double quote to the sanitize lambda and throw on json parse failure
---------
Co-authored-by: Ralph Paßgang <ralph@trust-it.de>
2026-04-27 23:55:00 +02:00
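The sanitize_field lambda mentioned in this commit guards multipart headers against injection from client-supplied values. A rough Python equivalent of the idea; the exact character set stripped by the C++ lambda is an assumption here:

```python
def sanitize_field(value: str) -> str:
    # Strip CR/LF and double quotes so a client-supplied key, filename or
    # content-type cannot break out of its multipart header line, e.g.
    #   Content-Disposition: form-data; name="<value>"; filename="<value>"
    return value.replace("\r", "").replace("\n", "").replace('"', "")
```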
unraido
ceaf47c4b1
fix: rpc-server cache may not work in Windows environments ( #22394 )
* fix: create directory and log cache file name.
* Remove GGML_LOG_INFO conditional compilation.
---------
Co-authored-by: kotaro <kotaro.kusunoki@gmail.com>
2026-04-27 17:25:09 +03:00
Max Krasnyansky
5594d13224
common: fix missing exports in llama-common ( #22340 )
* common: refactor common/debug to move abort_on_nan into base_callback_data
Passing bool abort_on_nan as template parameter for common_debug_cb_eval is unnecessary and creates an issue with LTO.
It should just be a member of the base_callback_data instead.
* cont : cleanup
* common : use pimpl in debug.h to reduce header dependencies
Move common_debug_cb_user_data's data members (std::regex,
std::vector<uint8_t>) into a private impl struct in debug.cpp.
This removes the includes of common.h and <regex> from debug.h,
reducing transitive dependencies for any translation unit that
includes the header.
Assisted-by: llama.cpp:local pi
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-27 08:06:39 +03:00
Piotr Wilkin (ilintar)
0adede866d
parser: fix structured output bug ( #22302 )
* fix very stupid structured output bug
* Things just cannot be too easy.
2026-04-24 23:19:55 +02:00
Georgi Gerganov
ffdd983fb8
server : fix swa-full logic ( #22288 )
2026-04-24 10:17:37 +03:00
Yes You Can Have Your Own
793d0a7931
server: rename debug tags to match --cache-idle-slots naming ( #22292 )
2026-04-24 09:28:44 +03:00
Ethan Turner
fa0b8a70a8
cli: Remove redundant local sampling variables ( #20429 ) ( #22264 )
...
This change implements the third requested change in issue 20429.
Because defaults.sampling contains the reasoning budget token count and
the reasoning budget message, it's not necessary to assign them to
struct variables.
2026-04-24 00:53:23 +02:00
srkizer
185cbff6f1
server : convert_anthropic_to_oai: also copy chat_template_kwargs ( #22154 )
2026-04-23 13:32:46 -05:00
Song Li
c78fb909b2
server: fix heap-buffer-overflow from negative n_discard (CVE-2026-21869) ( #22267 )
* server: clamp n_discard to non-negative at JSON parse boundary (CVE-2026-21869)
A negative n_discard from client JSON causes heap-buffer-overflow in
update_slots() context-shift loop (CWE-787, CVSS 8.8). Clamp to 0 at
ingress; n_discard=0 already triggers auto-discard (n_left/2).
Ref: GHSA-8947-pfff-2f3c
* cont : cleaner
* cont : cleanerer
* cont : cleanest
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-23 18:39:07 +02:00
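The fix clamps at the JSON parse boundary, so no negative value can ever reach the context-shift loop. A hedged sketch of that ingress check (the field access and default are illustrative; the server code is C++):

```python
def parse_n_discard(request_body: dict) -> int:
    # Clamp to non-negative at ingress. n_discard == 0 already triggers
    # the auto-discard path (n_left / 2), so mapping negatives to 0
    # preserves sensible behavior while closing the overflow (CWE-787).
    return max(0, int(request_body.get("n_discard", 0)))
```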
kvc0
c807c6e3b0
server: (anthropic API) fix prefix caching ( #21793 )
When testing claude code against llama.cpp, I noticed that only
n_past 18577 was used even when context was 60k or more. The log
in llama-server says:
```
slot update_slots: id 3 | task 10342 | old: ... ; cch= | defa0;You are
slot update_slots: id 3 | task 10342 | new: ... ; cch= | 1c8b4;
```
I observed that the cch value changed every time. Reading about that,
the x-anthropic-billing-header system message seems to be specially
handled inside of the anthropic api. I could remove it, but there
is a meaningful string sometimes included at the end. So instead,
I just replace the changing cch checksum with fffff.
I'm treating this as an anthropic message body API detail - I think this
is the right way to do this, but by all means please correct me!
It's always 5 hexadecimal characters, but I've written the replacement
defensively in case they change the protocol.
2026-04-23 17:45:02 +02:00
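The replacement described above can be sketched with a regex. The `cch=` field shape is inferred from the quoted log lines, and the `+` quantifier is deliberately defensive, matching the commit's note that the checksum is currently always 5 hex characters but the protocol might change:

```python
import re

def normalize_cch(message: str) -> str:
    # Replace the rotating checksum with a constant so the prompt prefix
    # stays byte-identical across requests and prefix caching can hit.
    return re.sub(r"cch=[0-9a-f]+", "cch=fffff", message)
```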
Matthias Straka
0dd7f915fd
cli : cleanup auto-completion code ( #21745 )
2026-04-23 15:03:28 +02:00
Tarek Dakhran
550d684bd1
server: Enable transcriptions API for LFM2-Audio ( #22000 )
2026-04-23 10:47:26 +02:00
Piotr Wilkin (ilintar)
8bccdbbff9
chat: fix parallel_tool_calls default setting based on model capabilities, add tests for parallel tool calls and structured outputs ( #22217 )
* chat: fix parallel_tool_calls default setting based on model capabilities, add tests for parallel tool calls and structured outputs
* Fix ty errors.
* Fix flake8 err
2026-04-22 18:10:56 +02:00