Commit Graph

7518 Commits

Author SHA1 Message Date
Daniel Bevenius
847c35f7d5 model-conversion : add trust_remote_code for embedding scripts (#18288)
This commit adds the trust_remote_code=True parameter when loading
models and configurations in the embedding model conversion scripts.
It also adds a cast to float for models that might use a data type that
is not supported by python, for example bfloat16.

The motivation for this is that some models may require custom code to
be executed during loading, and setting trust_remote_code to True avoids
getting prompted for confirmation.

Future work will consolidate the embedding conversion scripts with the
causal conversion scripts to avoid code duplication. But in the mean
time it would be nice to have this fix in place.
2025-12-23 07:27:37 +01:00
Neo Zhang
a6a552e4ec [SYCL] replace llama-cli by llama-completion to rm the impact to test script (#18290)
* replace llama-cli by llama-completion to rm the impact to test script

* Update examples/sycl/run-llama2.sh

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update examples/sycl/run-llama2.sh

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update examples/sycl/run-llama3.sh

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update examples/sycl/run-llama3.sh

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update examples/sycl/win-run-llama2.bat

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update examples/sycl/win-run-llama3.bat

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-23 12:59:12 +08:00
Alessandro98-git
96e33a814e model : fix div-by-zero for Nemotron V2 (#18309)
* llama-model : fix Nemotron V2 crash by moving MoE parameters calculation

* remove whitespace

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b7516
2025-12-23 03:04:57 +01:00
Ryan Mangeno
dfc959b886 model : Granite Embedding support (#15641)
ModernBERT but without `head.norm` so will currently fail to convert and run any other ModernBERT models, PRs with `head.norm` support welcome!

* constants and tensor mappings for modern bert support, model not supported yet but working on getting conversion to work for encoder only

* conversion now working, hf -> gguf

* working on support, now working on building graph

* some cleanup

* cleanup

* continuing

* correct tensor shape for qkv

* fixed tensor mappings and working on buildin graph

* tensor debugging now works -> (llama-eval-callback), instead of simulated gate split with views, GEGLU is now used which does exactly this

* cleanup

* cleanup

* cleanup

* more cleanup

* ubatch issues, the assert for checking equal seqs in llama-graph.cpp when building attention  keeps failing, setting ubatch size to 1 when running llama-embedding with --ubatch-size 1 makes it work, but needs to be looked into more

* added cls token per previous modern bert attempt, still working on checking out the rest

* fixed pre tokenizer and still working through previous pr

* working through previous attemp, implimented more accurate conversion per previous attempt, added local sliding window attention that alternates every third layer

* fixed pre tokenizer

* working on swa with local and global alternating attention

* some cleanup and now fails on build attn

* starting to work, and some cleanup, currently failing on last layer construction in graph build

* alternating rope implemented and modern bert graph build succeeds

* fixed asser for equal ubatch seq

* cleanup

* added mask check in vocab

* fixed alternating rope, the hparams.rope_freq_base_train and hparams.rope_freq_base_train_swa were the same and i set them to correct values

* reuse variable

* removed repeat

* standard swa method can be used instead of a new enum being LLAMA_SWA_TYPE_LOCAL

* correct swa layer indexing, is supposed to be 0, 3, 6 ... instead of 1, 4, 7 ...

* more modular hparam setting

* replaced attn out norm with ffn_norm and cosine similarity between hf embds and llama.cpp embds went way up, from 0.05 to 0.24, replaced the cacheless kv with swa todo per the previous conversion

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf_update.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-graph.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-arch.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* removed redundant hparam set

* enums for model sizes

* conversion for modern-bert model supported rather than just granite-small

* Update src/llama-model.cpp

Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>

* Update src/llama-model.cpp

Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>

* fixed ordering of enum for freq_base_swa

* fixed where I added residual, now gives much much better embeddings~

* readded cacheless logic

* removing whitespace

* conversion now working for swa pattern - dense every n layers

* modern bert put into seperate src file

* removing whitespace

* fixed whitespace and newline errors in editorconfig job

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* better naming convention, n_swa_pattern -> swa_period

* reusing sliding_window_pattern key rather than making new dense_every_n_layers key, and adding writing and reading support

* fixing pyright type-check fail

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-hparams.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model-saver.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/modern-bert.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/modern-bert.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/modern-bert.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/modern-bert.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/modern-bert.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model-loader.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model-loader.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model-loader.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* added descriptions in llama-model

* fixed tensor mappings for conversion

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* mapping name for size

* nits

* unused

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
b7515
2025-12-23 00:28:19 +01:00
compilade
8f48807380 gguf-py : do not align the data start offset (#18291)
The safetensors format doesn't require alignment.
2025-12-22 20:25:16 +01:00
Shouyu
bf6bc3c155 ggml-hexagon: gelu optimization (#18151)
* feat: working gelu with src0 put on vtcm

* feat: gelu ping-pong for both in and out

* fix: fixu compile error

* break: distinguish dma ddr->vtcm and vtcm->ddr operation

* fix: fix dma queue size

* break: update dma api to either pop src or dst ptr

* fix: fix activation vtcm allocation issue for src1 when swapperd

* refactor: ping-pong gelu logic to avoid unnecessary if else

* dma: improved queue interface and prefetch handling

* gelu: fix N+2 block prefetch

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
b7513
2025-12-22 10:56:52 -08:00
Xuan-Son Nguyen
179fd82a72 gen-docs: automatically update markdown file (#18294)
* gen-docs: automatically update markdown file

* also strip whitespace

* do not add extra newline

* update TOC
b7512
2025-12-22 19:30:19 +01:00
Taimur Ahmad
d34d5ca1e9 llamafile: add rvv support for sgemm kernels (#18199)
Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>
b7511
2025-12-22 20:20:23 +02:00
lhez
eb492bf43f opencl: unpack q4_0 for adreno in get_tensor (#18278) b7510 2025-12-22 10:19:01 -08:00
Jeff Bolz
e3b35ddf1c vulkan: Extend rope fusions to allow mrope (#18264)
Extend the test-backend-ops tests as well.
b7509
2025-12-22 11:03:13 -06:00
Xuan-Son Nguyen
6ce863c803 server: prevent data race from HTTP threads (#18263)
* server: prevent data race from HTTP threads

* fix params

* fix default_generation_settings

* nits: make handle_completions_impl looks less strange

* stricter const

* fix GGML_ASSERT(idx < states.size())

* move index to be managed by server_response_reader

* http: make sure req & res lifecycle are tied together

* fix compile

* fix index handling buggy

* fix data race for lora endpoint

* nits: fix shadow variable

* nits: revert redundant changes

* nits: correct naming for json_webui_settings
b7508
2025-12-22 14:23:34 +01:00
Xuan-Son Nguyen
3997c78e33 server: fix data race in to_json_anthropic (#18283) b7507 2025-12-22 13:21:43 +01:00
Mattt
ee74642982 release: update release workflow to store XCFramework as Zip file (#18284)
* Update release workflow to store XCFramework as Zip file

* Add comments to document Zip file requirement for XCFramework

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b7506
2025-12-22 20:11:46 +08:00
Aaron Teo
a28310488c convert: rework ftype heuristics (#18214)
* convert: rework ftype heuristics

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

convert: fix type-check

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

convert: bring back heuristics comment

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* convert: revert to using first tensor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* convert: rework heuristics logic

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* convert: rm redundant float32 check

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-22 20:03:49 +08:00
Xuan-Son Nguyen
86af848153 server: (docs) remove mention about extra_args (#18262) 2025-12-22 12:22:01 +01:00
Johannes Gäßler
147a521636 tool/ex/tests: consistently free ctx, then model (#18168) b7503 2025-12-22 11:00:37 +01:00
Jeff Bolz
e1f15b454f vulkan: Implement set_tensor_async and the event interfaces (#18047)
The goal is to enable the async loading code paths in
llama_model_loader::load_all_data, originally from #7896. This works and the
loads themselves are faster, but with host visible vidmem I think the cost of
allocating/mapping vidmem moves and becomes more expensive, and I don't see a
benefit by default. But with GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 I do see a
significant improvement in model loading time.
b7502
2025-12-21 21:52:09 +01:00
Johannes Gäßler
0e1ccf15c7 llama: fix RPC for -fit on (#18233) b7501 2025-12-21 19:33:08 +01:00
Xuan-Son Nguyen
5e25ddebff move copilot instructions to AGENTS.md (#18259)
* move copilot --> agents.md

* agents: add disclose AI usage

* refine
2025-12-21 19:09:21 +01:00
Jeff Bolz
fd05c51cec vulkan: fix im2col overflowing maxworkgroupcount (#18180) b7499 2025-12-21 10:32:58 +01:00
Jeff Bolz
b365c3ff01 vulkan/cuda: fix topk_moe with exp_probs_b (#18071)
I updated test_topk_moe to more closely match llm_graph_context::build_moe_ffn
and added coverage for exp_probs_b and some other missing combinations. This
exposed a bug in both CUDA and Vulkan backends where they were assuming the
input to argsort and the input to get_rows are the same. I'd like to optimize
this graph in another change, but for now just get it functional.

CUDA also had a bug where it got n_experts from the wrong place, leading to
GGML_ASSERT failures in some of the new tests.
b7498
2025-12-21 10:27:34 +01:00
Jeff Bolz
cb64222b0c vulkan: support GGML_UNARY_OP_XIELU (#18062) b7497 2025-12-21 10:17:58 +01:00
Jeff Bolz
6eb7081860 vulkan: in graph_optimize, try to group ADD operations (#18060)
I saw the adds not staying together in the new nemotron 3 nano model.
b7496
2025-12-21 10:05:08 +01:00
lovedheart
4117ae5557 Vulkan: some improvement on mul_mat_iq2_xs (#18031)
* Some improvement on mul_mat_iq2_xs

Refactor calculations for db values and grid data to optimize performance and reduce redundancy.

* Fix trailing whitespace
b7495
2025-12-21 09:59:52 +01:00
Daniel Bevenius
65e96a2464 docs : fix links in parsing.md (#18245)
This commit corrects the links in the parsing.md which currently result
in 404 errors.
2025-12-21 09:35:40 +01:00
Aldehir Rojas
9496bbb808 common : reorganize includes to prioritize vendored deps (#18222) b7493 2025-12-20 21:43:21 -06:00
Xuan-Son Nguyen
ddcb75dd8a server: add auto-sleep after N seconds of idle (#18228)
* implement sleeping at queue level

* implement server-context suspend

* add test

* add docs

* optimization: add fast path

* make sure to free llama_init

* nits

* fix use-after-free

* allow /models to be accessed during sleeping, fix use-after-free

* don't allow accessing /models during sleep, it is not thread-safe

* fix data race on accessing props and model_meta

* small clean up

* trailing whitespace

* rm outdated comments
b7492
2025-12-21 02:24:42 +01:00
Jeff Bolz
52ab19df63 tests: Avoid floating point precision false positives in SUM (#17471)
* tests: Avoid floating point precision false positives in SUM

* also apply to test_mean
b7491
2025-12-20 13:46:46 -06:00
Jeff Bolz
5182dd64cd test-backend-ops: improve msvc build time (#18209) b7490 2025-12-20 13:45:45 -06:00
Aadeshveer Singh
10b4f82d44 Added comments explaining thread block size selection logic based on row count and column size, derived from historical commit context (#18212) b7489 2025-12-20 19:28:57 +08:00
Oleksandr Kuvshynov
408616adbd server : [easy] fix per round speculative decode logging (#18211)
Currently we always log 0, as we clear slot.drafted before.

To reproduce:
Run llama-server with devstral-2 as main model and devstral-2-small as
md, and verbose logging:

```
% ./build/bin/llama-server -v  \
  -m ~/llms/Devstral-2-123B-Instruct-2512-UD-Q6_K_XL-00001-of-00003.gguf \
  -md ~/llms/Devstral-Small-2-24B-Instruct-2512-UD-Q2_K_XL.gguf \
  -c 8192 2> /tmp/llama.cpp.debug

Check the log:

slot update_slots: id  3 | task 0 | accepted 11/0 draft tokens, new
n_tokens = 741
slot update_slots: id  3 | task 0 | accepted 4/0 draft tokens, new
n_tokens = 746
slot update_slots: id  3 | task 0 | accepted 16/0 draft tokens, new
n_tokens = 763
slot update_slots: id  3 | task 0 | accepted 11/0 draft tokens, new
n_tokens = 775
slot update_slots: id  3 | task 0 | accepted 2/0 draft tokens, new
n_tokens = 778
slot update_slots: id  3 | task 0 | accepted 4/0 draft tokens, new
n_tokens = 783
slot update_slots: id  3 | task 0 | accepted 8/0 draft tokens, new
n_tokens = 792
slot update_slots: id  3 | task 0 | accepted 2/0 draft tokens, new
n_tokens = 795
slot update_slots: id  3 | task 0 | accepted 1/0 draft tokens, new
n_tokens = 797
slot update_slots: id  3 | task 0 | accepted 1/0 draft tokens, new
n_tokens = 799
slot update_slots: id  3 | task 0 | accepted 0/0 draft tokens, new
n_tokens = 800
slot update_slots: id  3 | task 0 | accepted 2/0 draft tokens, new
n_tokens = 803
slot update_slots: id  3 | task 0 | accepted 1/0 draft tokens, new
n_tokens = 805
slot update_slots: id  3 | task 0 | accepted 6/0 draft tokens, new
n_tokens = 812
slot update_slots: id  3 | task 0 | accepted 3/0 draft tokens, new
n_tokens = 816
```

After the fix, get correct per round logging:

```
slot update_slots: id  3 | task 0 | accepted 7/8 draft tokens, new
n_tokens = 654
slot update_slots: id  3 | task 0 | accepted 1/2 draft tokens, new
n_tokens = 656
slot update_slots: id  3 | task 0 | accepted 2/16 draft tokens, new
n_tokens = 659
slot update_slots: id  3 | task 0 | accepted 1/16 draft tokens, new
n_tokens = 661
slot update_slots: id  3 | task 0 | accepted 2/16 draft tokens, new
n_tokens = 664
slot update_slots: id  3 | task 0 | accepted 16/16 draft tokens, new
n_tokens = 681
slot update_slots: id  3 | task 0 | accepted 16/16 draft tokens, new
n_tokens = 698
slot update_slots: id  3 | task 0 | accepted 3/4 draft tokens, new
n_tokens = 702
slot update_slots: id  3 | task 0 | accepted 5/12 draft tokens, new
n_tokens = 708
slot update_slots: id  3 | task 0 | accepted 16/16 draft tokens, new
n_tokens = 725
slot update_slots: id  3 | task 0 | accepted 1/1 draft tokens, new
n_tokens = 727
slot update_slots: id  3 | task 0 | accepted 8/16 draft tokens, new
n_tokens = 736
```
b7488
2025-12-20 10:57:40 +01:00
Xuan-Son Nguyen
9e39a1e6a9 server: support load model on startup, support preset-only options (#18206)
* server: support autoload model, support preset-only options

* add docs

* load-on-startup

* fix

* Update common/arg.cpp

Co-authored-by: Pascal <admin@serveurperso.com>

---------

Co-authored-by: Pascal <admin@serveurperso.com>
b7487
2025-12-20 09:25:27 +01:00
Sigbjørn Skjæret
74e05131e9 ci : remove non-windows zip artifacts (#18201)
* remove non-windows zip artifacts

* add cuda dll links
b7486
2025-12-19 22:29:46 +01:00
Sigbjørn Skjæret
f74747d886 ci : only save ccache on master (#18207) 2025-12-19 22:29:37 +01:00
Alfred
ce734a8a2f ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU for more accurate mixed-precision matmul operations (#17977)
* feat: implement real Q8_0

* feat: adding cmake option for configuring FP32 quantize group size

* typo: set() shall be used

---------

Co-authored-by: ngdxzy <zhenyu_xu@uri.edu>
b7484
2025-12-19 09:42:28 -08:00
Pascal
14931a826e arg: fix order to use short form before long form (#18196)
* arg: fix order to use short form before long form

* arg: update doc

* arg: update test-arg-parser

* arg: address review feedback from ngxson

simplified to check first.length() <= last.length() only
fixed: --sampler-seq, --rerank, --draft ordering
note: middle positions in 3+ arg sets are not verified

* arg: update doc
b7483
2025-12-19 18:01:56 +01:00
Julius Tischbein
f99ef53d2a llama : Changing off_t to size_t for Windows (#18204) b7482 2025-12-19 16:42:46 +02:00
Aman Gupta
cc0a04343e server: friendlier error msg when ctx < input (#18174)
* llama-server: friendlier error msg when ctx < input

This PR adds formatted strings to the server's send_error function

* llama-server: use string_format inline

* fix test
b7481
2025-12-19 12:10:00 +01:00
Xuan-Son Nguyen
98c1c7a7bf presets: refactor, allow cascade presets from different sources, add global section (#18169)
* presets: refactor, allow cascade presets from different sources

* update docs

* fix neg arg handling

* fix empty mmproj

* also filter out server-controlled args before to_ini()

* skip loading custom_models if not specified

* fix unset_reserved_args

* fix crash on windows
b7480
2025-12-19 12:08:20 +01:00
Aleksander Grygier
acb73d8340 webui: Add editing attachments in user messages (#18147)
* feat: Enable editing attachments in user messages

* feat: Improvements for data handling & UI

* docs: Update Architecture diagrams

* chore: update webui build output

* refactor: Exports

* chore: update webui build output

* feat: Add handling paste for Chat Message Edit Form

* chore: update webui build output

* refactor: Cleanup

* chore: update webui build output
2025-12-19 11:14:07 +01:00
Daniel Bevenius
0a271d82b4 model-conversion : add verbose flag in run-org-model.py (#18194)
This commit adds a --verbose flag to the run-org-model.py script to
enable or disable detailed debug output, such as input and output
tensors for each layer. Debug utilities (summarize, debug_hook,
setup_rope_debug) have been moved to utils/common.py.

The motivation for this is that the detailed debug output can be useful
for diagnosing issues with model conversion or execution, but it can
also produce a large amount of output that may not always be needed.

The script will also be further cleaned/refactored in follow-up commits.
2025-12-19 08:43:16 +01:00
Naco Siren
52fc7fee8a android: fix missing screenshots for Android.md (#18156)
* Android basic sample app layout polish

* Add missing screenshots and polish android README doc

* Replace file blobs with URLs served by GitHub pages service.
2025-12-19 09:32:04 +02:00
Jeff Bolz
cdbada8d10 vulkan: Add perf logger mode with concurrency (#17944)
This implements a variation of the perf logger where rather than timing each
operation individually with effectively a barrier in between, we put the
timing boundaries where we already synchronize and time the groups of work
that normally overlap. This can be useful to help understand whether
individual operations need to be optimized, or if the group is already running
efficiently.

GGML_VK_PERF_LOGGER_CONCURRENT=1 enables the new mode (when
GGML_VK_PERF_LOGGER is also set).

GGML_VK_SYNC_LOGGER=1 replaces the ENABLE_SYNC_LOGGING compile time switch.
b7476
2025-12-19 06:36:46 +01:00
Xuan-Son Nguyen
8ea958d4d9 model : add ASR support for LFM2-Audio-1.5B (conformer) (#18106)
* ASR with LFM2-Audio-1.5B

* Set rope_theta

* Fix comment

* Remove rope_theta setting

* Address PR feedback

* rename functions to conformer

* remove some redundant ggml_cont

* fix missing tensor

* add prefix "a." for conv tensors

* remove redundant reshape

* clean up

* add test model

---------

Co-authored-by: Tarek Dakhran <tarek@liquid.ai>
b7475
2025-12-19 00:18:01 +01:00
Pascal
f9ec8858ed webui: display prompt processing stats (#18146)
* webui: display prompt processing stats

* feat: Improve UI of Chat Message Statistics

* chore: update webui build output

* refactor: Post-review improvements

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2025-12-18 17:55:03 +01:00
Taimur Ahmad
f716588e63 ggml-cpu: extend support for RVV floating-point kernels (#17318)
* cmake: add BF16 RVV flag for ggml-cpu

* ggml-cpu: add floating-point conversion kernels

* ggml: add floating-point kernels

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

* ggml-cpu: fix lmul in vec_dot_bf16

* ggml-cpu: change redsum to lmul 4, fix leftover

---------

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>
2025-12-18 16:02:09 +02:00
Xuan-Son Nguyen
4d1316c440 arg: fix ASAN error on sampler_type_names empty (#18167) b7472 2025-12-18 14:30:32 +01:00
Sigbjørn Skjæret
ec7b9329ae gguf-py : use copy-on-write mode for localtensor (#18162) 2025-12-18 13:45:38 +01:00
yulo
54189c0d39 remove i_major_dual (#18157)
Co-authored-by: zhang hui <you@example.com>
b7470
2025-12-18 12:50:56 +01:00
Aleksander Grygier
9ce64aed7d webui: Fix selecting generated output issues during active streaming (#18091)
* draft: incremental markdown rendering with stable blocks

* refactor: Logic improvements

* refactor: DRY Markdown post-processing logic

* refactor: ID generation improvements

* fix: Remove runes

* refactor: Clean up & add JSDocs

* chore: update webui static output

* fix: Add tick to prevent race conditions for rendering Markdown blocks

Suggestion from @ServeurpersoCom

Co-authored-by: Pascal <admin@serveurperso.com>

* chore: Run `npm audit fix`

* chore: update webui static output

* feat: Improve performance using global counter & id instead of UUID

* refactor: Enhance Markdown rendering with link and code features

* chore: update webui static output

* fix: Code block content extraction

* chore: update webui static output

* chore: update webui static output

---------

Co-authored-by: Pascal <admin@serveurperso.com>
2025-12-18 11:13:52 +01:00