Héctor Estrada Moreno
0c8986403b
retrieval : use at most n_seq_max chunks (#18400)
b7569
2025-12-29 13:21:13 +02:00
o7si
daa242dfc8
common: fix return value check for setpriority (#18412)
* common: fix return value check for setpriority
* tools: add logging for process priority setting
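To make the pitfall concrete, a minimal sketch of a correct check, assuming plain POSIX `setpriority()`; the helper name and log message are illustrative, not the PR's actual code:

```cpp
// Hedged sketch, not the PR's code: setpriority() returns 0 on success and
// -1 on failure with errno set, so the result must be compared against 0
// rather than treated as a truthy value.
#include <sys/resource.h>
#include <cerrno>
#include <cstring>
#include <cstdio>

static bool set_process_priority(int prio) {
    if (setpriority(PRIO_PROCESS, 0, prio) != 0) {
        // errno explains the failure (e.g. EPERM when raising priority
        // without the required privileges)
        fprintf(stderr, "warn: failed to set process priority %d: %s\n", prio, strerror(errno));
        return false;
    }
    return true;
}
```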
b7568
2025-12-29 11:07:49 +02:00
Johannes Gäßler
e70e640db3
CUDA: Blackwell features for non-native builds (#18436)
b7567
2025-12-29 09:35:42 +01:00
Aman Gupta
5fa66c6e67
cuda: fix race condition in cumsum (#18448)
* ggml-cuda: fix race condition in cumsum
* remove unnecessary sync_threads
b7566
2025-12-29 14:07:17 +08:00
Tim Neumann
382808c14b
ci : re-enable rocm build on amd64 (#18439)
This was disabled in #9340 due to a compiler crash, but it seems to build now, as confirmed by the latest comments in #11913.
I've also managed to build the image with `docker build -f .devops/rocm.Dockerfile .` (for all three stages: `full`, `server`, and `light`).
A quick attempt at building an arm64 image failed. Since none of the other images are built for arm, I only enabled the amd64 one.
The `runs_on` option was added to match the other entries.
b7565
2025-12-29 00:29:23 +01:00
uvos
4ffc47cb20
HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated (#18202)
b7564
2025-12-28 20:12:55 +01:00
momonga
9c675c7140
model : Plamo3 support (#17304)
* plamo3
* fix plamo3
* clean code
* clean up the code
* fix diff
* clean up the code
* clean up the code
* clean up the code
* clean up the code
* clean up the code
* clean up the code
* add chat_template if it exists
* clean up the code
* fix cpu-backend
* chore: whitespace trim fix + typo fix
* Fix: address review feedback
* restore `FREQ_BASE_SWA` constant
* Fix: address review feedback 2
* Fix: typecheck
* Fix: address review feedback 3
* final cleanup
---------
Co-authored-by: mmngays <146910567+mmngays@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b7563
2025-12-28 17:28:31 +01:00
Aman Gupta
07a0c4ba92
Revert "ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON ( #18413 )" ( #18426 )
b7562
2025-12-28 20:53:36 +08:00
o7si
60f17f56da
rpc: fix segfault on invalid endpoint format (#18387)
* rpc: fix segfault on invalid endpoint format
* rpc: add error log for failed endpoint connection
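Segfaults of this kind usually come from using the result of a failed split without validating it. A hedged sketch of defensive `host:port` parsing, with illustrative names rather than the PR's actual code:

```cpp
// Illustrative sketch: validate the endpoint string before indexing into
// it, instead of assuming the ':' separator is always present.
#include <cstdio>
#include <cstdlib>
#include <string>

static bool parse_endpoint(const std::string & endpoint, std::string & host, int & port) {
    const size_t pos = endpoint.rfind(':');
    if (pos == std::string::npos || pos + 1 >= endpoint.size()) {
        fprintf(stderr, "error: invalid endpoint '%s' (expected host:port)\n", endpoint.c_str());
        return false;
    }
    host = endpoint.substr(0, pos);
    char * end = nullptr;
    const long val = strtol(endpoint.c_str() + pos + 1, &end, 10);
    if (end == nullptr || *end != '\0' || val <= 0 || val > 65535) {
        fprintf(stderr, "error: invalid port in endpoint '%s'\n", endpoint.c_str());
        return false;
    }
    port = (int) val;
    return true;
}
```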
b7561
2025-12-28 12:34:41 +02:00
Johannes Gäßler
f8d561eb87
llama-fit-params: fix step size for last device (#18415)
b7560
2025-12-28 10:52:09 +01:00
Johannes Gäßler
e59efe6a78
github: update issue templates [no ci] (#18410)
* github: update issue templates [no ci]
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-28 10:50:56 +01:00
Xuan-Son Nguyen
cffa5c46ea
mtmd: clarify that we no longer accept AI-generated PRs (#18406)
b7558
2025-12-28 09:57:04 +01:00
Boian Berberov
94de74e7b1
cmake: Added more x86_64 CPU backends when building with GGML_CPU_ALL_VARIANTS=On (#18186)
* minor: Consolidated `#include <immintrin.h>` under `ggml-cpu-impl.h`
* cmake: Added more x86-64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On`:
- `ivybridge`
- `piledriver`
- `cannonlake`
- `cascadelake`
- `cooperlake`
- `zen4`
Resolves: #17966
b7557
2025-12-28 09:33:29 +02:00
QDelta
4fd59e8427
ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413)
b7556
2025-12-28 09:33:14 +08:00
lhez
08566977a7
opencl: allow resizing transpose buffers (#18384)
* opencl: allow resizing transpose buffers instead of using fixed sizes
* opencl: remove commented code
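The underlying pattern is grow-on-demand allocation. A hedged host-memory sketch of the idea; the actual code manages OpenCL `cl_mem` objects rather than `malloc` buffers:

```cpp
// Hedged sketch of a resizable scratch buffer: reallocate only when the
// requested size exceeds the current capacity, instead of sizing the
// buffer once with a fixed upper bound.
#include <cstdlib>

struct scratch_buffer {
    void * data = nullptr;
    size_t size = 0;
};

static void * scratch_reserve(scratch_buffer & buf, size_t required) {
    if (required > buf.size) {
        free(buf.data);
        buf.data = malloc(required);
        buf.size = (buf.data != nullptr) ? required : 0;
    }
    return buf.data;
}
```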
b7555
2025-12-27 15:51:14 -08:00
Johannes Gäßler
a4bf35889e
llama-fit-params: fix overflow check (#18354)
b7554
2025-12-27 20:20:45 +01:00
Johannes Gäßler
026d2ad472
llama: fix magic number of 999 for GPU layers (#18266)
* llama: fix magic number of 999 for GPU layers
* use strings for -ngl, -ngld
* encapsulate n_gpu_layers, split_mode
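The smell being removed is a sentinel (999 standing for "offload all layers") leaking into arithmetic throughout the code. A hypothetical sketch of the encapsulation idea, not the accessors actually introduced by the PR:

```cpp
// Hypothetical sketch: wrap the layer count so callers never see a magic
// upper-bound value. Field and method names are illustrative.
struct gpu_layers {
    int requested = -1; // < 0 means "not set: offload as many layers as possible"

    bool is_auto() const { return requested < 0; }

    // Resolve against the model's real layer count, so no caller has to
    // know about any 999-style convention.
    int resolve(int n_layer_total) const {
        if (is_auto() || requested > n_layer_total) {
            return n_layer_total;
        }
        return requested;
    }
};
```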
b7553
2025-12-27 20:18:35 +01:00
Aman Gupta
06705fdcb3
ggml-cuda: Use same regex for GGML_NATIVE=OFF (#18407)
b7552
2025-12-27 19:56:27 +08:00
Johannes Gäßler
a52dc60ba3
llama_fit_params: return enum for fail vs. error (#18374)
b7551
2025-12-27 09:59:19 +01:00
Johannes Gäßler
9045c9afe5
llama-fit-params: fix Gemma 3 calculation (#18372)
b7550
2025-12-27 09:56:04 +01:00
Jeff Bolz
c9ced4910b
vulkan: preprocess mul_mat_id experts and discard workgroups more quickly (#18352)
Run a preprocess to count how many times each expert is used, and use this to quickly discard workgroups that aren't needed.
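The counting step, shown as a hedged CPU-side sketch; the real implementation is a Vulkan compute pass, and the names here are illustrative:

```cpp
// One pass over the routing ids yields per-expert usage counts; a workgroup
// assigned to an expert with a zero count can then exit immediately instead
// of scanning all rows to discover it has no work.
#include <vector>

static std::vector<int> count_expert_usage(const std::vector<int> & row_expert_ids, int n_experts) {
    std::vector<int> counts(n_experts, 0);
    for (const int id : row_expert_ids) {
        counts[id]++;
    }
    return counts;
}
```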
b7549
2025-12-26 16:12:58 -06:00
Jeff Bolz
7ac8902133
vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader (#18349)
* vulkan: Use BK=32 for coopmat2 mul_mat_id
* vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader
Disable robustness, remove the OOB check in decodeFuncB, and initialize the row_ids to zero to avoid OOB access.
Don't slice/offset the B matrix to ic * BN, only to adjust the coord back down to the range [0, BN) in decodeFuncB. Instead just slice with a row offset of zero and remove the '& (BN - 1)'. This allows the compiler to common some of the shared memory loads.
b7548
2025-12-26 18:15:50 +01:00
Jeff Bolz
9bf20d8ac3
vulkan: Use BK=32 for coopmat2 mul_mat_id (#18332)
b7547
2025-12-26 18:15:02 +01:00
Eve
cb999704fb
vulkan: small dequantization improvements (#18380)
* iq4_xs
* quants
2025-12-26 18:12:11 +01:00
Jeff Bolz
b96b82fc85
vulkan: Support UPSCALE w/antialias (#18327)
b7545
2025-12-26 17:00:57 +01:00
Jeff Bolz
10dc500bdb
vulkan: handle rope with large number of rows (#18306)
b7544
2025-12-26 16:53:46 +01:00
o7si
4893cc07bb
server : fix crash when seq_rm fails for hybrid/recurrent models (#18391)
* server : fix crash when seq_rm fails for hybrid/recurrent models
* server : add allow_processing param to clear_slot
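A hedged sketch of the defensive pattern, assuming the `llama_memory_seq_rm` API (which returns false when a backend cannot remove part of a sequence); the helper and field names are illustrative, not the server's actual code:

```cpp
// Illustrative sketch: if removing the cached suffix of a sequence fails
// (as it can for hybrid/recurrent memory), wipe the whole sequence and
// force prompt reprocessing instead of continuing with inconsistent state.
#include "llama.h"

static void truncate_or_reset(llama_memory_t mem, llama_seq_id seq_id, llama_pos n_keep, int & n_past) {
    if (!llama_memory_seq_rm(mem, seq_id, n_keep, -1)) {
        llama_memory_seq_rm(mem, seq_id, -1, -1); // removing the whole sequence always succeeds
        n_past = 0;                               // reprocess the prompt from scratch
    }
}
```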
b7543
2025-12-26 16:35:29 +01:00
Francisco Herrera
af3be131c0
docs: added note for pre-SYCL Intel hardware (#18016)
Specify that it's for pre-SYCL hardware.
b7542
2025-12-26 10:34:30 +08:00
0Marble
b07cda687c
CANN: implement the SSM_CONV operator (#17737)
* CANN: implement SSM_CONV operator
Co-authored-by: Aleksei Lobanov <zeromarblectm@gmail.com>
Co-authored-by: Sujin Kang <waterjin326@gmail.com>
* CANN: remove custom error limit for SSM_CONV
* CANN: merge SSM_CONV tensor shape/strides into one line
---------
Co-authored-by: Sujin Kang <waterjin326@gmail.com>
b7541
2025-12-26 09:12:04 +08:00
Aman Gupta
85c40c9b02
ggml-cuda: fix regex for arch list (#18371)
* ggml-cuda: fix regex for arch list
* make regex exact
b7540
2025-12-26 01:35:14 +08:00
Aman Gupta
83b3b1c271
cuda: optimize cumsum cub path (#18362)
* cuda: optimize cumsum cub path
* remove heavy perf test
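For reference, the standard two-phase CUB host pattern behind an inclusive scan (cumsum); a hedged sketch of the library usage, not the PR's exact code:

```cpp
// Hedged sketch of the CUB cumsum path: the first call only queries the
// required temporary-storage size, the second performs the scan on device.
#include <cub/device/device_scan.cuh>

static void cumsum_f32(const float * d_in, float * d_out, int n, cudaStream_t stream) {
    void * d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, d_in, d_out, n, stream);
    cudaMallocAsync(&d_temp, temp_bytes, stream);
    cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, d_in, d_out, n, stream);
    cudaFreeAsync(d_temp, stream);
}
```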
b7539
2025-12-25 23:55:38 +08:00
Aman Gupta
b0fb0f0aee
ggml-cuda: fix blackwell native builds (#18361)
* ggml-cuda: fix blackwell native builds
Replace 12x in native architectures by 12xa
* replace for GGML_NATIVE=OFF too
* only replace for native
* remove 120f-virtual for default compilation
---------
Co-authored-by: Aman Gupta <aman>
b7538
2025-12-25 22:12:11 +08:00
Penglin Cai
e68c19b0fd
CANN: Add support for CONV_TRANSPOSE_1D when kernel size > 255 (#17934)
* CONV_TRANSPOSE_1D kernel_size > 255
* remove condition check
* fix a type conversion bug
* remove trailing whitespace
* fix: return true in the switch case
2025-12-25 16:46:09 +08:00
Aadeshveer Singh
c54bba869d
ggml : optimize cuda cumsum fallback kernel (#18343)
b7536
2025-12-25 12:11:13 +08:00
Xuan-Son Nguyen
f5acfb2ffa
server: (router) add stop-timeout option (#18350)
* server: (router) add stop-timeout option
* also allow stop while loading
* add docs
* unload_lru: also wait for unload to complete
2025-12-24 23:47:49 +01:00
Xuan-Son Nguyen
4cbafad4f0
model: support MiMo-V2-Flash (#18328)
* mimov2: convert ok
* rename mimov2 --> mimo2
* fix conversion
* runnable, not yet correct
* use sink
* add_sliding_window_pattern
* add swa and per-layer n_head_kv
* correct params
* somewhat working
* correct gating func
* nits
* mimo2: wire RMS eps + MoE bias + converter guards
* add co-author
Co-authored-by: Aaryan-Kapoor <Aaryan-Kapoor@users.noreply.github.com>
* use add_rope_freq_base_swa
---------
Co-authored-by: Aaryan Kapoor <aaryankapoor2006@gmail.com>
Co-authored-by: Aaryan-Kapoor <Aaryan-Kapoor@users.noreply.github.com>
2025-12-24 23:07:08 +01:00
Aadeshveer Singh
c184284230
fit-params : fix race condition in fit-params output (#18276)
2025-12-24 15:57:38 +01:00
Aman Gupta
c8a2417d7b
CUDA: experimental native mxfp4 support for blackwell (#17906)
* CUDA: experimental native mxfp4 support for blackwell
* optimize load_tiles
* optimize quantize_mxfp4
* cleanup
* first pass review: formatting
* use interleaved layout for mma
* mmq: add assert for size
* use __nv_fp4x4_e2m1
* use iter_k as 512, cleanup
* Use 1200 as blackwell instead of 1000
* address review comments
* mmq: fix stride
* quantize.cu: use reference impl of e8m0 scale
* address review comments
* add 120f-virtual + minor fixes
---------
Co-authored-by: Aman Gupta <aman>
2025-12-24 22:28:26 +08:00
Saba Fallah
54132f1b1f
model : support for LlamaBidirectionalModel architecture (#18220)
* model: llama-embed-nemotron
* minor: python lint
* changed arch-name
* templated llm_build_llama to be used for both llama and llama-embed arch
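A hedged illustration of the "templated builder" idea: one graph builder shared by the causal and embedding variants, with the variant as a compile-time parameter. Names are hypothetical; this is not the actual llm_build_llama signature:

```cpp
// Hypothetical sketch: the layer stack is identical for both architectures,
// so only the attention masking needs to vary, and the branch is resolved
// at compile time per variant.
enum class llama_arch_variant { causal, bidirectional_embed };

template <llama_arch_variant V>
static void build_llama_layers(/* graph-building context elided */) {
    constexpr bool causal_mask = (V == llama_arch_variant::causal);
    // ... build attention with causal_mask, then the FFN, for each layer ...
    (void) causal_mask;
}
```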
b7531
2025-12-24 14:02:36 +01:00
Jeff Bolz
2a9ea2020c
vulkan: fix command buffer corruption in ggml_backend_vk_event_wait (#18302)
b7530
2025-12-24 12:36:34 +01:00
Wang Weixuan
ce7a6dc0fc
CANN : refactor ACL graph cache (#17752)
Move the graph property checking code into methods of the LRU cache.
Signed-off-by: Wang Weixuan <wangweixvan@gmail.com>
b7529
2025-12-24 17:50:24 +08:00
Jesse Ikonen
1ce0126b18
docs: Fix typos in SYCL documentation (#18269)
2025-12-24 17:19:47 +08:00
Ruben Ortlam
7f459c98e7
vulkan: use fewer FA rows for small cache runs (#18280)
b7527
2025-12-24 08:59:14 +01:00
TianHao324
cf2ffc02bc
CANN: Uses yarn_ramp cache in ROPE (#17725)
b7526
2025-12-24 14:55:33 +08:00
ddh0
10355dc7d0
common: add LLAMA_ARG_OVERRIDE_TENSOR env var for -ot arg (#18267)
b7525
2025-12-24 14:19:12 +08:00
Xuan-Son Nguyen
5ee4e43f26
server: return_progress to also report 0% processing state (#18305)
b7524
2025-12-23 21:49:05 +01:00
Pascal
5b6c9bc0f3
webui: apply webui_settings on first load (#18223)
* webui: apply webui_settings on first load
The webui_settings from /props were not applied on initial load when default_generation_settings.params was null.
Now the settings sync whenever serverProps is available, regardless of params, which works for both single-model and router modes.
* chore: update webui build output
2025-12-23 15:48:03 +01:00
Xuan-Son Nguyen
849d021104
server: fix crash with model not having BOS/EOS (#18321)
b7522
2025-12-23 14:39:36 +01:00
Daniel Bevenius
8e3ead6e4d
model-conversion : add device option to run-org-model.py (#18318)
* model-conversion : add device option to run-org-model.py
This commit refactors the `run-org-model.py` script to add a `--device` argument, allowing users to specify the device on which to run the model (e.g., cpu, cuda, mps, auto).
It also extracts a few common functions in preparation for future changes that will remove some code duplication that currently exists in the embedding scripts.
The Makefile has also been updated to pass the device argument, for example:
```console
(venv) $ make causal-verify-logits DEVICE=cpu
```
* fix error handling and remove parser reference
This commit fixes the error handling, which previously referenced an undefined `parser` variable.
2025-12-23 14:07:25 +01:00
Chris Rohlf
12ee1763a6
rpc : add check for rpc buffer type (#18242)
b7520
2025-12-23 11:56:49 +02:00