llama.cpp

Mirror of https://github.com/ggml-org/llama.cpp.git, synced 2026-05-10 02:54:06 +00:00.

Files

wsbagnsv1 5814b4dce1 cuda: optimize SOLVE_TRI using registers and FMAF (#17703)

* ggml-cuda: optimize solve_tri_f32_fast and fix stride handling

- Switch from using shared memory for the RHS/solution matrix to a register-based approach (x_low, x_high), reducing shared memory pressure and bank conflicts.
- Implement explicit `fmaf` instructions for the reduction loop.
- Update kernel arguments to pass strides in bytes rather than elements to align with standard ggml tensor arithmetic (casting to `char *` before addition).
- Remove unused `MAX_K_FAST` definition.

* Small cleanup

* Remove comments in solve_tri.cu

* Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Use const for variables in solve_tri.cu

* Replace fmaf with more readable code

* remove last fmaf

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

2025-12-08 10:41:08 +01:00

cmake

ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (#15094)

2025-08-07 13:45:41 +02:00

include

ggml-zendnn : add ZenDNN backend for AMD CPUs (#17690)

2025-12-07 00:13:33 +08:00

src

cuda: optimize SOLVE_TRI using registers and FMAF (#17703)

2025-12-08 10:41:08 +01:00

.gitignore

vulkan : cmake integration (#8119)

2024-07-13 18:12:39 +02:00

CMakeLists.txt

ggml-cpu: add ggml_thread_cpu_relax with Zihintpause support (#17784)

2025-12-08 10:41:34 +02:00