Compare commits

...

57 Commits
b8772 ... b8829

Author SHA1 Message Date
Georgi Gerganov
6990e2f1f7 libs : rename libcommon -> libllama-common (#21936)
* cmake : allow libcommon to be shared

* cmake : rename libcommon to libllama-common

* cont : set -fPIC for httplib

* cont : export all symbols

* cont : fix build_info exports

* libs : add libllama-common-base

* log : add common_log_get_verbosity_thold()
2026-04-17 11:11:46 +03:00
Eric Zhang
fcc7508759 model : Gemma4 model type detection (#22027)
* model : Gemma4 model type detection

* model : Gemma4 model type detection
2026-04-17 10:07:11 +02:00
lhez
5e6c0e18b6 opencl: refactor q8_0 set_tensor and mul_mat host side dispatch for Adreno (#21938)
* opencl: refactor q8_0 gemm/gemv Adreno dispatch

* opencl: refactor q8_0 set_tensor

* opencl: fix whitespace
2026-04-16 22:28:33 -07:00
Sigbjørn Skjæret
30dce2cf29 cli : use get_media_marker (#22017) 2026-04-17 00:12:31 +02:00
Xuan-Son Nguyen
089dd41fe3 cmake: use glob to collect src/models sources (#22005) 2026-04-16 23:25:16 +02:00
nullname
85dde8dc4a hexagon: optimize HMX matmul operations (#21071)
* optimize hmx_mat_mul functions by calculating row and column tiles upfront

* refactor core_dot_chunk_fp16 to use size_t for tile counts and improve readability

* wip

* set scale outside of loop

* wip

* refactor core_mma_chunk_fp16 and mat_mul_qk_0_d16a32 to use size_t for tile counts

* wip

* wip

* refactor transfer_output_chunk_fp16_to_fp32 to use size_t for dimensions

* refactor core_dot_chunk_fp16 to use size_t for tile row stride calculation

* wip

* refactor hmx_mat_mul functions to use hvx_vec_splat_f16 for column scales initialization

* refactor hmx_mat_mul_permuted_w16a32_batched to streamline scale setting and locking

* refactor core_dot_chunk_fp16 to improve tile stride calculations for output

* refactor hmx_mat_mul functions to use Q6_V_vsplat_R for column scales initialization

* fix compiling error

* wip

* optimize row and column tile indexing in core_mma_chunk_fp16 function

* wip

* Revert "wip"

This reverts commit cde679eff7.

* Add size limit check for HAP_mmap in htp_iface_mmap and drop_mmap functions

* wip
2026-04-16 13:48:34 -07:00
Xuan-Son Nguyen
4fbdabdc61 model: using single llm_build per arch (#21970)
* model: using single llm_build per arch

* fix merge

* nits
2026-04-16 21:10:22 +02:00
shaofeiqi
e45dbdece8 opencl: add q5_K gemm and gemv kernels for Adreno (#21595) 2026-04-16 12:08:33 -07:00
Pascal
4adac43f6f server: tests: fetch random media marker via /apply-template (#21962) (#21980)
* server: tests: fetch random media marker via /apply-template (#21962 fix)

* server: allow pinning media marker via LLAMA_MEDIA_MARKER env var

get_media_marker() checks LLAMA_MEDIA_MARKER at first call and uses it
as-is if set, falling back to the random marker otherwise.

Tests no longer need to fetch the marker dynamically via /apply-template:
the fixture sets LLAMA_MEDIA_MARKER=<__media__> so the hardcoded prompts
work as before.

Address review feedback from ngxson

* server: make get_media_marker() thread-safe via magic statics

Use a C++11 static local with a lambda initializer instead of a global
static with an empty-check. The runtime guarantees initialization exactly
once without explicit locking.

Address review feedback from ggerganov

* nits

* nits
2026-04-16 20:46:21 +03:00
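For readers who want the concrete shape of the magic-static pattern requested in the commit above, here is a minimal sketch; the marker format, env-var handling, and function signature are illustrative assumptions, not copied from the PR:

#include <cstdlib>
#include <random>
#include <string>

// Sketch of get_media_marker() as described in the commit messages above.
// C++11 guarantees the static local's initializer runs exactly once, even
// under concurrent first calls, with no explicit locking.
static const std::string & get_media_marker() {
    static const std::string marker = [] {
        // pinned marker: LLAMA_MEDIA_MARKER is checked at first call only
        if (const char * env = std::getenv("LLAMA_MEDIA_MARKER")) {
            return std::string(env);
        }
        // otherwise fall back to a random marker (format assumed here)
        std::random_device rd;
        return "<__media_" + std::to_string(rd()) + "__>";
    }();
    return marker;
}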
PikaPikachu
9db77a020c model : refactor QKV into common build_qkv and create_tensor_qkv helpers (#21245)
* model : refactor QKV into common build_qkv and create_tensor_qkv helpers

* model : extend build_qkv to bert/mpt/dbrx/olmo/lfm2/nemotron-h/granite-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s
2026-04-16 17:41:34 +02:00
Sigbjørn Skjæret
f772f6e434 model : support NVFP4 tensors for Gemma4 (#21971)
* support nvfp4 tensors for Gemma4

* add wo_s to build_attn

* add wo_s to build_attn

* fix glm4
2026-04-16 16:51:47 +02:00
Ruben Ortlam
b572d1ecd6 codeowners: add team member comments (#21714) 2026-04-16 13:13:11 +03:00
Anav Prasad
03b3d07798 Convert: Fix NemotronH Config Parsing (#21664)
* fix NemotronH vocab loading by using trust_remote_code for unsupported config patterns

* fix NemotronH tokenizer loading by overriding set_vocab with trust_remote_code
2026-04-16 13:11:45 +03:00
Aman Gupta
3f7c29d318 ggml: add graph_reused (#21764)
* ggml: add graph_reused

* use versioning instead of reuse flag

* increment version with atomic

* use top bits for split numbering

* add assert

* move counter to ggml.c

* set uid in split_graph only

* fix windows

* address further review comments

* get next_uid rather than doing bit manipulation

* rename + add comment about uid
2026-04-16 17:21:28 +08:00
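Taken together, the bullet points above suggest a versioning scheme roughly like the following sketch; the field names and the exact bit split are assumptions, since the diff for ggml.c is not shown here:

#include <atomic>
#include <cstdint>

// Each scheduled graph takes a fresh uid from a process-wide atomic
// counter, so graph reuse is detected by comparing uids instead of
// carrying a separate "reused" flag.
static std::atomic<uint64_t> g_graph_uid{0};

static uint64_t next_graph_uid() {
    return g_graph_uid.fetch_add(1, std::memory_order_relaxed);
}

// Split numbering lives in the top bits, so all splits of one graph share
// the same base uid ("use top bits for split numbering"). The 48/16 split
// is an illustrative choice.
static uint64_t make_split_uid(uint64_t graph_uid, uint32_t split_idx) {
    return (uint64_t(split_idx) << 48) | (graph_uid & ((uint64_t(1) << 48) - 1));
}

Callers then ask for next_graph_uid() rather than doing bit manipulation at the call site, matching the last bullet above.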
Kusha Gharahi
ae2d34899e metal: Implement ROLL op (#21946)
* nix: support unified apple-sdk

* Impl roll op for Metal

* Revert "nix: support unified apple-sdk"

This reverts commit abfa473360.

* update ops.md

* update op docs
2026-04-16 11:54:37 +03:00
rehan-10xengineer
1e796eb41f ggml-cpu: add 128-bit RVV implementation for Quantization Vector Dot (#20633)
* ggml-cpu: add 128-bit impls for i-quants, ternary quants

* ggml-cpu: add 128-bit impls for iq2_xs, iq3_s, iq3_xxs, tq2_0

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

* ggml-cpu: refactor; add rvv checks

---------

Co-authored-by: taimur-10x <taimur.ahmad@10xengineers.ai>
Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>
2026-04-16 11:15:15 +03:00
rehan-10xengineer
5637536517 ggml : implemented simd_gemm kernel for riscv vector extension (#20627)
Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>
2026-04-16 11:14:26 +03:00
Yuannan
90fb96a7b3 devops : added spirv-headers to nix (#21965) 2026-04-16 11:12:52 +03:00
Reese Levine
82677a6ede ggml-webgpu: compute pass batching and removing profiling overhead (#21873)
* Update register tiling matmul to use f32 accumulation

* fix profiling code

* Fix register tiling matmul for chrome, i'm blaming dawn

* Update batch tuning value for iOS

* compile fix

* Fix use of new load function

* Move to a single query set for GPU profiling

* Move to batching compute passes when not profiling

* Refactor build_multi

* remove iOS throttling now that we're batching compute passes
2026-04-16 11:12:19 +03:00
Ludovic Henry
8612ed18b7 ci : Use ggml-org/ccache-action on RISC-V as well (#21632) 2026-04-16 11:11:25 +03:00
Katostrofik
b1be68e8ca [SYCL] Fix Q8_0 reorder: garbage on 2nd prompt + crash on full VRAM (#21638)
* [SYCL] Fix Q8_0 reorder: add missing dequantize path for GEMM

The Q8_0 reorder optimization (#21527) was missing a reorder-aware
dequantizer for the GEMM code path used during prompt processing.
After token generation reordered Q8_0 weights (via DMMV/MMVQ), the
next prompt processing pass would read them with the standard
dequantizer, producing garbage output.

Add dequantize_block_q8_0_reorder() and wire it into both
ggml_get_to_fp16_sycl() and ggml_get_to_fp32_sycl(), matching the
pattern already used by Q4_0, Q4_K, and Q6_K.

Fixes #21589

AI (Claude) was used to assist with root cause investigation and
writing the kernel code. All code was human-reviewed and tested
on real hardware.

* SYCL: fix reorder crash when device memory is full

The reorder optimization allocates a temporary buffer the full size of
the weight tensor on the device. When VRAM is nearly full (large models
on a single GPU), this allocation fails and the subsequent memcpy crashes
on a NULL pointer.

Fix: try device allocation first, fall back to host memory if device
memory is full. The reorder kernel still works correctly reading from
host memory over PCIe. This is slower for the one-time reorder (~21 t/s
vs ~38 t/s on Intel Arc Pro B70), but the optimization is preserved for
all subsequent inference. If both device and host allocation fail, skip
the reorder and fall back to the unoptimized kernel path.

Also fixes a bug where opt_for_reorder() marked tensors as reordered
even when the reorder was skipped due to allocation failure. This caused
DMMV/MMVQ kernels to read the original AoS data as if it were SoA,
producing garbage output or NaN results.

Tested on Intel Arc Pro B70 (32GB) with Q8_0, Q4_K_M models. Coding was
AI-assisted (Claude), reviewed and tested on hardware by a human.

Fixes #20478

* SYCL: add RAII temp buffer class + macro guard for host fallback

Replace sycl_ext_malloc_with_fallback/sycl_ext_free_fallback free
functions with sycl_reorder_temp_buffer RAII class. The host_fallback
bool is now a private member, and cleanup happens automatically at
scope exit.

Add GGML_SYCL_HOST_MEM_FALLBACK cmake option (default ON) to guard
the host memory fallback code path. Device access to host memory
requires Linux kernel 6.8+ (Ubuntu 26.04+); users on older kernels
can set -DGGML_SYCL_HOST_MEM_FALLBACK=OFF to disable it.

Addresses arthw's review on PR #21638.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* SYCL: document GGML_SYCL_HOST_MEM_FALLBACK build option in SYCL.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* SYCL: add reorder-aware DMMV dequantizers for Q4_K and Q6_K

Q4_K and Q6_K had reorder support for MMVQ and GEMM paths but not
DMMV. When the DMMV path encountered reordered data it would abort.

Add DMMV kernels that read from the SOA reorder layout for both
types. Same math as the non-reorder versions, different memory
access pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 08:34:05 +03:00
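The RAII temp-buffer idea from the third bullet above can be sketched as follows; the real sycl_reorder_temp_buffer class certainly carries more state and error handling, so treat this as an illustration of the pattern only:

#include <sycl/sycl.hpp>

// Try device memory first; optionally fall back to host memory (slower,
// read over PCIe by the reorder kernel); cleanup happens at scope exit.
class reorder_temp_buffer {
    sycl::queue & q;
    void * ptr = nullptr;
    bool host_fallback = false;
public:
    reorder_temp_buffer(sycl::queue & q, size_t size) : q(q) {
        ptr = sycl::malloc_device(size, q);
#ifdef GGML_SYCL_HOST_MEM_FALLBACK
        if (!ptr) {
            ptr = sycl::malloc_host(size, q); // needs kernel 6.8+ for device access
            host_fallback = true;
        }
#endif
    }
    ~reorder_temp_buffer() {
        if (ptr) sycl::free(ptr, q);
    }
    void * get() const { return ptr; }       // nullptr => skip the reorder
    bool on_host() const { return host_fallback; }
    reorder_temp_buffer(const reorder_temp_buffer &) = delete;
    reorder_temp_buffer & operator=(const reorder_temp_buffer &) = delete;
};

If get() returns nullptr, the caller falls back to the unoptimized kernel path and, crucially, must not mark the tensor as reordered (the bug fixed in the second bullet).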
Xuan-Son Nguyen
408225bb1a server: use random media marker (#21962)
* server: use random media marker

* nits

* remove legacy <__image__> token

* revert special char in random
2026-04-15 23:52:22 +02:00
Ruben Ortlam
b3d758750a vulkan: optimize im2col (#21713)
* vulkan: improve im2col memory write layout

* cap workgroups

* minimal device tuning

* use vendor_id instead of subgroup size
2026-04-15 19:04:51 +02:00
Pasha Khosravi
7e72b38bc1 cuda: Q1_0 initial backend (#21629)
* [cuda] initial Q1_0 backend

* remove unused code, fix AMD MMA guard

* attempt to support dp4a

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-04-15 18:38:38 +02:00
Reese Levine
20d3bc2cc8 ggml-webgpu: Fix dequantization helpers to not pass in pointers (#21872)
* Fix dequantization helpers to not pass in pointers

* Increase XIELU precision
2026-04-15 09:14:40 -07:00
Johannes Gäßler
a6206958d2 CUDA: require explicit opt-in for P2P access (#21910) 2026-04-15 16:01:46 +02:00
Johannes Gäßler
014dca49d6 CUDA: manage NCCL communicators in context (#21891)
* CUDA: manage NCCL communicators in context

* add check that all backends are CUDA

* remove unused vector, limit init to > 1 GPUs

* fix warnings

* fix cuda device, cache allreduce
2026-04-15 15:58:40 +02:00
Valeriy Dubov
adb541a6ad rpc : add native RDMA transport for RPC backend (RoCEv2) (#20590) 2026-04-15 16:44:02 +03:00
Xuan-Son Nguyen
80d8770804 docs: more extensive RoPE documentation [no ci] (#21953)
* more extensive ggml_rope documentation

* add more docs

* nits
2026-04-15 14:45:16 +02:00
Ruben Ortlam
8dc530b86d ci: disable test-backend-ops on Vulkan llvmpipe run and restore default timeout (#21901) 2026-04-15 10:55:21 +02:00
Piotr Wilkin (ilintar)
e1a9a6dcbe autoparser: support case of JSON_NATIVE with per-call markers (test case: Reka-Edge) (#21892) 2026-04-15 10:51:50 +02:00
Matt
e39eba26f3 read n_ctx back after making llama_context (#21939) 2026-04-15 15:24:57 +08:00
Yiwei Shao
5d14e5d19b hexagon: optimization for HMX mat_mul (#21554)
* hexagon: add async HMX worker

Introduce hmx-worker (dedicated thread for HMX compute) to overlap HMX
matmul with HVX dequant/DMA stages in the pipeline path, replacing the
previous synchronous HMX calls that blocked the main thread.

* hexagon: cost-based VTCM chunk search for out-stationary matmul

* hexagon: fix futex race in hmx_worker_drain
Store the boolean in a local variable to avoid loading the atomic twice (see the sketch after this entry)

* hex-mm: hmx optimize scatter/transpose and use HMX intrinsics

* hex-vmem: drop vmem limit a touch under 3GB on v73

* hexagon: add fwd declaration of htp_context

* hex-hmx: replace hmx-worker with hmx-queue that mimics dma-queue interface

Simplifies the overall implementation and reduces thread wakeup roundtrips.

* hex-mm: add debug log to hmx work func called from hmx-queue

* Update hmx-queue.h

Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>

---------

Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
2026-04-14 14:09:03 -07:00
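The futex-race fix referenced above ("store the boolean in a local variable") is the classic read-once pattern; a hedged sketch, with names invented for illustration:

#include <atomic>

static std::atomic<bool> hmx_work_pending{false};

// Reading the flag exactly once removes the window where it could flip
// between a first load (the check) and a second load (the wait decision).
static bool hmx_drain_step() {
    const bool pending = hmx_work_pending.load(std::memory_order_acquire);
    if (!pending) {
        return false; // queue observed empty: stop draining
    }
    // ... process queued HMX work based on the snapshot, not a re-read ...
    return true;
}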
Xuan-Son Nguyen
fae3a28070 ggml : remove ggml-ext.h (#21869)
* ggml: correct placement of ggml-ext.h

* ggml : remove ggml-ext.h

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-14 17:32:58 +03:00
Georgi Gerganov
c0de6eda72 metal : fix FA support logic (#21898) 2026-04-14 17:32:29 +03:00
Xuan-Son Nguyen
707c0b7a6e mtmd: add mtmd_image_tokens_get_decoder_pos() API (#21851)
* mtmd: add mtmd_image_tokens_get_decoder_pos() API

* consistent naming

* fix build
2026-04-14 16:07:41 +02:00
Jeff Bolz
1f30ac0cea vulkan: Programmatically add RoundingModeRTE to all shaders when the device supports it (#21572)
* vulkan: Programmatically add RoundingModeRTE to all shaders when the device supports it

* use FetchContent to get SPIRV-Headers

* Fetch spirv-headers unconditionally

* remove fetchcontent, rely on installed headers

* fix ubuntu job

* Update docs/build.md
2026-04-14 15:17:45 +02:00
Georgi Gerganov
f4b5bf2f32 ci : re-enable mac workflows (#21894)
* ci : re-enable mac workflows

* vulkan : fix compile warning
2026-04-14 15:58:09 +03:00
Seyoung Jeong
aa0f1897b7 metal : add XIELU unary op (#20802) 2026-04-14 15:43:59 +03:00
Adrien Gallouët
be76dd0bb2 vendor : update BoringSSL to 0.20260413.0 (#21881)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-04-14 14:25:09 +03:00
Richard Davison
2e05f06ffb ggml : fix ARM NEON nvfp4 dot product on non-dotprod targets (#21559) 2026-04-14 14:23:45 +03:00
texasich
acc37a42ea cmake: fix CMP0194 warning on Windows with MSVC (#21630)
* cmake: fix CMP0194 warning on Windows with MSVC

Set CMP0194 policy to NEW before project() call in ggml/CMakeLists.txt to suppress the "MSVC is not an assembler for language ASM" warning introduced in CMake 4.1.

The ggml project enables ASM globally for Metal (macOS) and KleidiAI (ARM) backends. On Windows/MSVC, no assembler sources are used, but CMake 4.1+ warns because cl.exe is not a valid ASM compiler.

This follows the same pattern used in ggml-vulkan (CMP0114, CMP0147).

Closes ggml-org/llama.cpp#20311

* cmake: apply cisc's formatting suggestion

---------

Co-authored-by: texasich <texasich@users.noreply.github.com>
2026-04-14 13:47:56 +03:00
Reese Levine
5a23695d5a ggml-webgpu: Update register tiling matmul to use f32 accumulation (#21644)
* Update register tiling matmul to use f32 accumulation

* fix profiling code

* Fix register tiling matmul for chrome, i'm blaming dawn

* Update batch tuning value for iOS

* compile fix

* Fix use of new load function
2026-04-14 13:46:41 +03:00
Berk Idem
56666fa607 common: skip reasoning budget sampler when no budget is requested (#21870)
* common: skip reasoning budget sampler when no budget is requested

After I added thinking_start_tag / thinking_end_tag for gemma4 in #21697, the reasoning budget sampler gets unconditionally created even when no budget is configured (the default -1). The same applies to kimi_k2, lfm2, lfm2_5, and ministral_3 which also set these tags. The budget gets converted to INT_MAX, so the sampler never actually forces any tokens but still runs per-token checks (start tag matching in IDLE state, token-to-piece conversion + UTF-8 checks in COUNTING state).

More importantly, the mere existence of the sampler (non-null rbudget) disables backend sampling. Backend sampling lets the GPU select tokens directly, avoiding a full logits transfer from GPU to CPU every token. This could explain the 30% speed regression reported in #21784 (98 t/s to 70 t/s on Vulkan).

So I added a reasoning_budget_tokens >= 0 check to the sampler creation condition. When the budget is unlimited, the sampler is not created, backend sampling stays enabled, and no per-token overhead is added. When a budget is explicitly set (0, 128, 1024, etc.), the sampler is created and works as before.

* common: preserve rbudget when grammar is lazy

Following up on the review feedback on #21870: keep the reasoning budget sampler when grammar_lazy is true, so the thinking-block grammar suppression from #20970 still works when tools are in use. This way, we only skip the sampler when both no budget is set AND grammar is not lazy.
2026-04-14 12:43:06 +02:00
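A self-contained rendering of the creation condition described above; the type and function names here are stand-ins, only the condition mirrors the commit message:

#include <memory>

struct sampler {};
struct sampler_params {
    int  reasoning_budget_tokens = -1; // -1 = unlimited (the default)
    bool grammar_lazy            = false;
};

std::unique_ptr<sampler> maybe_make_budget_sampler(const sampler_params & p) {
    // Create the sampler only when a budget is explicitly set (0, 128, ...)
    // or a lazy grammar needs the thinking-block suppression. Otherwise
    // return null so backend (GPU) sampling stays enabled and no per-token
    // logits transfer to the CPU is forced.
    if (p.reasoning_budget_tokens >= 0 || p.grammar_lazy) {
        return std::make_unique<sampler>();
    }
    return nullptr;
}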
Jeff Bolz
6a6780a232 vulkan: Support GGML_TYPE_NVFP4 (#21455)
This adds nvfp4 support for get_rows, dequant, and mul_mat(_id). For
mul_mat, it does not add support for the dp4/q8_1 path, it's all via
fp16/fp32.
2026-04-14 11:34:23 +02:00
Xuan-Son Nguyen
e489a5ca0e server: support OAI /v1/audio/transcriptions API (#21863)
* server: support OAI /v1/audio/transcriptions API

* address autoreview comments

* correct default response_format value
2026-04-14 11:09:52 +02:00
Aldehir Rojas
e21cdc11a0 common/gemma4 : handle parsing edge cases (#21760) 2026-04-13 18:18:18 -05:00
Xuan-Son Nguyen
e974923698 docs: listing qwen3-asr and qwen3-omni as supported (#21857)
* docs: listing qwen3-asr and qwen3-omni as supported

* nits
2026-04-13 22:28:17 +02:00
Piotr Wilkin (ilintar)
1c0d9081fd chat: dedicated DeepSeek v3.2 parser + "official" template (#21785) 2026-04-13 22:23:53 +02:00
Christian Kastner
a8bad3842e ci: Also exempt 'security' tag from auto-close (#21844) 2026-04-14 01:18:44 +08:00
Ruben Ortlam
75f3bc94e6 vulkan: Flash Attention DP4A shader for quantized KV cache (#20797)
* use integer dot product for quantized KV flash attention

* small improvements

* fix SHMEM_STAGING indexing

* add missing KV type quants

* fixes

* add supported quants to FA tests

* readd fast paths for <8bit quants

* fix mmq gate and shmem checks
2026-04-13 14:21:31 +02:00
Adrien Gallouët
aa00911d12 common : add download cancellation and temp file cleanup (#21813)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-04-13 11:18:23 +02:00
Gaspard Petit
ce8fd4b1a6 server: Expose build_info in router mode (#21835) 2026-04-13 11:14:42 +02:00
Oliver Simons
9f5e1edb10 CUDA: Limit DeviceSegmentedSort to immediate mode (#21718)
* CUDA: Limit DeviceSegmentedSort to immediate mode

DeviceSegmentedSort is currently not capturable in a cuda graph. Hence,
we have to go for the slower DeviceSegmentedRadixSort in that case.

Perf numbers on RTX Pro 6000 Blackwell Max-Q:
DeviceSegmentedRadixSort in graph mode (i.e. CUDA Graphs)

  ARGSORT(type=f32,ne=[2048,512,1,1],order=1):                 12291 runs -   105.94 us/run -     8192 kB/run -   73.75 GB/s
  ARGSORT(type=f32,ne=[4096,512,1,1],order=1):                 10245 runs -   115.08 us/run -    16384 kB/run -  135.77 GB/s
  ARGSORT(type=f32,ne=[8192,512,1,1],order=1):                  5125 runs -   221.22 us/run -    32768 kB/run -  141.26 GB/s
  ARGSORT(type=f32,ne=[16384,512,1,1],order=1):                 2565 runs -   430.98 us/run -    65536 kB/run -  145.02 GB/s
  ARGSORT(type=f32,ne=[32768,512,1,1],order=1):                 1028 runs -  1185.83 us/run -   131072 kB/run -  105.41 GB/s
  ARGSORT(type=f32,ne=[65536,512,1,1],order=1):                  387 runs -  2748.62 us/run -   262144 kB/run -   90.95 GB/s

DeviceSegmentedSort in immediate mode

  ARGSORT(type=f32,ne=[2048,512,1,1],order=1):                 16388 runs -    71.17 us/run -     8192 kB/run -  109.78 GB/s
  ARGSORT(type=f32,ne=[4096,512,1,1],order=1):                 12294 runs -    81.38 us/run -    16384 kB/run -  192.00 GB/s
  ARGSORT(type=f32,ne=[8192,512,1,1],order=1):                  5125 runs -   240.81 us/run -    32768 kB/run -  129.77 GB/s
  ARGSORT(type=f32,ne=[16384,512,1,1],order=1):                 2565 runs -   406.60 us/run -    65536 kB/run -  153.71 GB/s
  ARGSORT(type=f32,ne=[32768,512,1,1],order=1):                 1285 runs -   873.23 us/run -   131072 kB/run -  143.15 GB/s
  ARGSORT(type=f32,ne=[65536,512,1,1],order=1):                  516 runs -  2288.46 us/run -   262144 kB/run -  109.24 GB/s

* Add test case for dispatch to DeviceSegmentedRadixSort

We currently lack a way to force graph mode in CUDA, patch callback to
invoke ggml_backend_compare_graph_backend twice to enforce each test to
run in graph mode
2026-04-13 11:14:06 +02:00
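The dispatch the commit describes boils down to choosing a CUB algorithm by capture mode; a sketch under the assumption that the caller already knows whether a CUDA graph is being recorded (the real code in ggml-cuda also handles temp-storage sizing and streams):

#include <cub/cub.cuh>

template <typename K, typename V>
static cudaError_t segmented_sort_pairs(
        void * d_temp, size_t & temp_bytes,
        const K * keys_in, K * keys_out,
        const V * vals_in, V * vals_out,
        int num_items, int num_segments,
        const int * seg_begin, const int * seg_end,
        bool capturing_cuda_graph) {
    if (capturing_cuda_graph) {
        // DeviceSegmentedSort is not graph-capturable, so fall back to the
        // slower but capturable segmented radix sort.
        return cub::DeviceSegmentedRadixSort::SortPairs(
            d_temp, temp_bytes, keys_in, keys_out, vals_in, vals_out,
            num_items, num_segments, seg_begin, seg_end);
    }
    // Immediate mode: the faster DeviceSegmentedSort is safe to use.
    return cub::DeviceSegmentedSort::SortPairs(
        d_temp, temp_bytes, keys_in, keys_out, vals_in, vals_out,
        num_items, num_segments, seg_begin, seg_end);
}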
Xuan-Son Nguyen
920b3e78cb mtmd: use causal attn for gemma 4 audio (#21824) 2026-04-13 09:47:55 +02:00
Rohan Jain
974c8c94cc webui: add setting for first-line chat titles (#21797)
* webui: add setting for first-line chat titles

Add an opt-in setting (`titleGenerationUseFirstLine`) to use the first
non-empty line of a prompt as the generated conversation title.

Previously, the complete multi-line prompt was being used, which created
long titles for complex queries. Coupled with
"Ask for confirmation before changing conversation title", the dialog
would overflow.

* Update tools/server/webui/src/lib/utils/text.ts

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* Update tools/server/webui/src/lib/utils/text.ts

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* webui: Run build to update the bundle

As requested in:
https://github.com/ggml-org/llama.cpp/pull/21797#pullrequestreview-4094935065

* webui: Fix missing import for NEWLINE_SEPARATOR

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-04-13 09:30:46 +02:00
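The title rule itself is simple: take the first non-empty line of the prompt. The actual helper lives in TypeScript (tools/server/webui/src/lib/utils/text.ts); for illustration only, the same logic in C++:

#include <sstream>
#include <string>

// Return the first line that contains a non-whitespace character;
// fall back to the whole prompt if every line is blank.
static std::string first_line_title(const std::string & prompt) {
    std::istringstream in(prompt);
    std::string line;
    while (std::getline(in, line)) {
        if (line.find_first_not_of(" \t\r") != std::string::npos) {
            return line;
        }
    }
    return prompt;
}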
Aleksander Grygier
227ed28e12 webui: MCP Diagnostics improvements (#21803)
* Add MCP Connection diagnostics and CORS hint to web-ui

* tidy up test

* webui: Refactor and improve MCP diagnostic logging

---------

Co-authored-by: evalstate <1936278+evalstate@users.noreply.github.com>
2026-04-13 07:58:38 +02:00
328 changed files with 12832 additions and 5695 deletions

View File

@@ -18,6 +18,7 @@
vulkan-loader,
openssl,
shaderc,
spirv-headers,
useBlas ?
builtins.all (x: !x) [
useCuda
@@ -145,6 +146,7 @@ effectiveStdenv.mkDerivation (finalAttrs: {
ninja
pkg-config
git
spirv-headers
]
++ optionals useCuda [
cudaPackages.cuda_nvcc

View File

@@ -7,7 +7,7 @@ RUN apt update && apt install -y git build-essential cmake wget xz-utils
# Install SSL and Vulkan SDK dependencies
RUN apt install -y libssl-dev curl \
libxcb-xinput0 libxcb-xinerama0 libxcb-cursor-dev libvulkan-dev glslc
libxcb-xinput0 libxcb-xinerama0 libxcb-cursor-dev libvulkan-dev glslc spirv-headers
# Build it
WORKDIR /app

View File

@@ -47,22 +47,10 @@ jobs:
steps:
- name: Install dependencies
run: |
sudo apt-get update
# Install necessary packages
sudo apt-get install -y libatomic1 libtsan2 gcc-14 g++-14 cmake build-essential wget git-lfs
# Set gcc-14 and g++-14 as the default compilers
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-14 100
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-14 100
if ! which rustc; then
# Install Rust stable version
sudo apt-get install -y rustup
rustup install stable
rustup default stable
fi
git lfs install
- name: GCC version check
@@ -74,12 +62,12 @@ jobs:
id: checkout
uses: actions/checkout@v6
# FIXME: Enable when ggml-org/ccache-action works on riscv64
# - name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: ubuntu-riscv64-native-sanitizer-${{ matrix.sanytizer }}-${{ matrix.build_type }}
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: ccache
uses: ggml-org/ccache-action@afde29e5b5422e5da23cb1f639e8baecadeadfc3 # https://github.com/ggml-org/ccache-action/pull/1
with:
key: ubuntu-riscv64-native-sanitizer-${{ matrix.sanitizer }}-${{ matrix.build_type }}
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build

View File

@@ -141,61 +141,59 @@ jobs:
# amd-smi static
# GG_BUILD_ROCM=1 GG_BUILD_AMDGPU_TARGETS="gfx1101" bash ./ci/run.sh ~/results/llama.cpp /mnt/llama.cpp
# TODO: sandbox Mac runners
# ggml-ci-mac-metal:
# runs-on: [self-hosted, macOS, ARM64]
#
# steps:
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
#
# - name: Test
# id: ggml-ci
# run: |
# GG_BUILD_METAL=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
#
# ggml-ci-mac-webgpu:
# runs-on: [self-hosted, macOS, ARM64]
#
# steps:
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
#
# - name: Dawn Dependency
# id: dawn-depends
# run: |
# DAWN_VERSION="v2.0.0"
# DAWN_OWNER="reeselevine"
# DAWN_REPO="dawn"
# DAWN_ASSET_NAME="Dawn-5e9a4865b1635796ccc77dd30057f2b4002a1355-macos-latest-Release"
# echo "Fetching release asset from https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.zip"
# curl -L -o artifact.zip \
# "https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.zip"
# mkdir dawn
# unzip artifact.zip
# tar -xvf ${DAWN_ASSET_NAME}.tar.gz -C dawn --strip-components=1
#
# - name: Test
# id: ggml-ci
# run: |
# GG_BUILD_WEBGPU=1 GG_BUILD_WEBGPU_DAWN_PREFIX="$GITHUB_WORKSPACE/dawn" \
# bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
#
# ggml-ci-mac-vulkan:
# runs-on: [self-hosted, macOS, ARM64]
#
# steps:
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
#
# - name: Test
# id: ggml-ci
# run: |
# vulkaninfo --summary
# GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-mac-metal:
runs-on: [self-hosted, macOS, ARM64]
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Test
id: ggml-ci
run: |
GG_BUILD_METAL=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-mac-webgpu:
runs-on: [self-hosted, macOS, ARM64]
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Dawn Dependency
id: dawn-depends
run: |
DAWN_VERSION="v20260317.182325"
DAWN_OWNER="google"
DAWN_REPO="dawn"
DAWN_ASSET_NAME="Dawn-18eb229ef5f707c1464cc581252e7603c73a3ef0-macos-latest-Release"
echo "Fetching release asset from https://github.com/google/dawn/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.tar.gz"
curl -L -o artifact.tar.gz \
"https://github.com/google/dawn/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.tar.gz"
mkdir dawn
tar -xvf artifact.tar.gz -C dawn --strip-components=1
- name: Test
id: ggml-ci
run: |
GG_BUILD_WEBGPU=1 GG_BUILD_WEBGPU_DAWN_PREFIX="$GITHUB_WORKSPACE/dawn" \
bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-mac-vulkan:
runs-on: [self-hosted, macOS, ARM64]
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Test
id: ggml-ci
run: |
vulkaninfo --summary
GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-linux-intel-vulkan:
runs-on: [self-hosted, Linux, Intel]

View File

@@ -93,4 +93,5 @@ jobs:
export GGML_VK_DISABLE_F16=1
export GGML_VK_DISABLE_COOPMAT=1
# This is using llvmpipe and runs slower than other backends
ctest -L main --verbose --timeout 4800
# test-backend-ops is too slow on llvmpipe, skip it
ctest -L main -E test-backend-ops --verbose --timeout 900

View File

@@ -318,7 +318,7 @@ jobs:
id: depends
run: |
sudo apt-get update
sudo apt-get install -y gcc-14 g++-14 build-essential glslc libvulkan-dev libssl-dev ninja-build
sudo apt-get install -y gcc-14 g++-14 build-essential glslc libvulkan-dev spirv-headers libssl-dev ninja-build
echo "CC=gcc-14" >> "$GITHUB_ENV"
echo "CXX=g++-14" >> "$GITHUB_ENV"
@@ -1001,22 +1001,14 @@ jobs:
steps:
- name: Install dependencies
run: |
sudo apt-get update
# Install necessary packages
sudo apt-get install -y libatomic1 libtsan2 gcc-14 g++-14 cmake build-essential libssl-dev wget git-lfs
sudo apt-get update
sudo apt-get install -y libssl-dev
# Set gcc-14 and g++-14 as the default compilers
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-14 100
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-14 100
if ! which rustc; then
# Install Rust stable version
sudo apt-get install -y rustup
rustup install stable
rustup default stable
fi
git lfs install
- name: Check environment
@@ -1032,13 +1024,12 @@ jobs:
id: checkout
uses: actions/checkout@v6
# FIXME: Enable when ggml-org/ccache-action works on riscv64
# - name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: ubuntu-cpu-riscv64-native
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: ccache
uses: ggml-org/ccache-action@afde29e5b5422e5da23cb1f639e8baecadeadfc3 # https://github.com/ggml-org/ccache-action/pull/1
with:
key: ubuntu-cpu-riscv64-native
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build

View File

@@ -17,7 +17,7 @@ jobs:
steps:
- uses: actions/stale@v10
with:
exempt-issue-labels: "refactoring,help wanted,good first issue,research 🔬,bug,roadmap"
exempt-issue-labels: "refactoring,help wanted,good first issue,research 🔬,bug,roadmap,security"
days-before-issue-stale: 30
days-before-issue-close: 14
stale-issue-label: "stale"

View File

@@ -202,7 +202,7 @@ jobs:
sudo apt-get install -y build-essential mesa-vulkan-drivers vulkan-sdk libssl-dev
else
sudo apt-get update -y
sudo apt-get install -y gcc-14 g++-14 build-essential glslc libvulkan-dev libssl-dev ninja-build
sudo apt-get install -y gcc-14 g++-14 build-essential glslc libvulkan-dev spirv-headers libssl-dev ninja-build
echo "CC=gcc-14" >> "$GITHUB_ENV"
echo "CXX=g++-14" >> "$GITHUB_ENV"
fi

View File

@@ -84,41 +84,42 @@ jobs:
export ${{ matrix.extra_args }}
pytest -v -x -m "not slow"
server-cuda:
runs-on: [self-hosted, llama-server, Linux, NVIDIA]
name: server-cuda (${{ matrix.wf_name }})
strategy:
matrix:
build_type: [Release]
wf_name: ["GPUx1"]
include:
- build_type: Release
extra_args: "LLAMA_ARG_BACKEND_SAMPLING=1"
wf_name: "GPUx1, backend-sampling"
fail-fast: false
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
with:
fetch-depth: 0
ref: ${{ github.event.inputs.sha || github.event.pull_request.head.sha || github.sha || github.head_ref || github.ref_name }}
- name: Build
id: cmake_build
run: |
cmake -B build -DGGML_SCHED_NO_REALLOC=ON
cmake --build build --config ${{ matrix.build_type }} -j $(sysctl -n hw.logicalcpu) --target llama-server
- name: Tests
id: server_integration_tests
if: ${{ (!matrix.disabled_on_pr || !github.event.pull_request) }}
run: |
cd tools/server/tests
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
export ${{ matrix.extra_args }}
pytest -v -x -m "not slow"
# TODO: provision CUDA runner
# server-cuda:
# runs-on: [self-hosted, llama-server, Linux, NVIDIA]
#
# name: server-cuda (${{ matrix.wf_name }})
# strategy:
# matrix:
# build_type: [Release]
# wf_name: ["GPUx1"]
# include:
# - build_type: Release
# extra_args: "LLAMA_ARG_BACKEND_SAMPLING=1"
# wf_name: "GPUx1, backend-sampling"
# fail-fast: false
#
# steps:
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
# with:
# fetch-depth: 0
# ref: ${{ github.event.inputs.sha || github.event.pull_request.head.sha || github.sha || github.head_ref || github.ref_name }}
#
# - name: Build
# id: cmake_build
# run: |
# cmake -B build -DGGML_SCHED_NO_REALLOC=ON
# cmake --build build --config ${{ matrix.build_type }} -j $(sysctl -n hw.logicalcpu) --target llama-server
#
# - name: Tests
# id: server_integration_tests
# if: ${{ (!matrix.disabled_on_pr || !github.event.pull_request) }}
# run: |
# cd tools/server/tests
# python3 -m venv venv
# source venv/bin/activate
# pip install -r requirements.txt
# export ${{ matrix.extra_args }}
# pytest -v -x -m "not slow"

View File

@@ -225,7 +225,7 @@ foreach(FILE_PATH ${EXTRA_LICENSES})
endforeach()
if (LLAMA_BUILD_COMMON)
license_generate(common)
license_generate(llama-common)
endif()
#
@@ -249,6 +249,10 @@ set_target_properties(llama
install(TARGETS llama LIBRARY PUBLIC_HEADER)
if (LLAMA_BUILD_COMMON)
install(TARGETS llama-common LIBRARY)
endif()
configure_package_config_file(
${CMAKE_CURRENT_SOURCE_DIR}/cmake/llama-config.cmake.in
${CMAKE_CURRENT_BINARY_DIR}/llama-config.cmake

View File

@@ -1,5 +1,21 @@
# collaborators can optionally add themselves here to indicate their availability for reviewing related PRs
# multiplie collaborators per item can be specified
# multiple collaborators per item can be specified
#
# ggml-org/ci : CISC, danbev, ggerganov, netrunnereve, ngxson, taronaeo
# ggml-org/ggml-cann : hipudding
# ggml-org/ggml-cuda : JohannesGaessler, am17an, IMbackK, ORippler
# ggml-org/ggml-hexagon : lhez, max-krasnyansky
# ggml-org/ggml-metal : ggerganov
# ggml-org/ggml-opencl : lhez, max-krasnyansky
# ggml-org/ggml-rpc : rgerganov
# ggml-org/ggml-sycl : arthw
# ggml-org/ggml-vulkan : 0cc4m, jeffbolznv
# ggml-org/ggml-webgpu : reeselevine
# ggml-org/ggml-zdnn : taronaeo
# ggml-org/llama-common : ggerganov, aldehir, angt, danbev, ngxson, pwilkin
# ggml-org/llama-mtmd : ngxson
# ggml-org/llama-server : ggerganov, ngxson, allozaur, angt, ServeurpersoCom
# ggml-org/llama-webui : allozaur
/.devops/*.Dockerfile @ngxson
/.github/actions/ @ggml-org/ci

View File

@@ -1,9 +1,11 @@
# common
find_package(Threads REQUIRED)
llama_add_compile_flags()
#
# llama-common-base
#
# Build info header
if(EXISTS "${PROJECT_SOURCE_DIR}/.git")
@@ -33,17 +35,25 @@ endif()
set(TEMPLATE_FILE "${CMAKE_CURRENT_SOURCE_DIR}/build-info.cpp.in")
set(OUTPUT_FILE "${CMAKE_CURRENT_BINARY_DIR}/build-info.cpp")
configure_file(${TEMPLATE_FILE} ${OUTPUT_FILE})
set(TARGET build_info)
add_library(${TARGET} OBJECT ${OUTPUT_FILE})
set(TARGET llama-common-base)
add_library(${TARGET} STATIC ${OUTPUT_FILE})
target_include_directories(${TARGET} PUBLIC .)
if (BUILD_SHARED_LIBS)
set_target_properties(${TARGET} PROPERTIES POSITION_INDEPENDENT_CODE ON)
endif()
set(TARGET common)
#
# llama-common
#
add_library(${TARGET} STATIC
set(TARGET llama-common)
add_library(${TARGET}
arg.cpp
arg.h
base64.hpp
@@ -106,17 +116,24 @@ add_library(${TARGET} STATIC
jinja/caps.h
)
set_target_properties(${TARGET} PROPERTIES
VERSION ${LLAMA_INSTALL_VERSION}
SOVERSION 0
MACHO_CURRENT_VERSION 0 # keep macOS linker from seeing oversized version number
)
target_include_directories(${TARGET} PUBLIC . ../vendor)
target_compile_features (${TARGET} PUBLIC cxx_std_17)
if (BUILD_SHARED_LIBS)
set_target_properties(${TARGET} PROPERTIES POSITION_INDEPENDENT_CODE ON)
# TODO: make fine-grained exports in the future
set_target_properties(${TARGET} PROPERTIES WINDOWS_EXPORT_ALL_SYMBOLS ON)
endif()
target_link_libraries(${TARGET} PRIVATE
build_info
cpp-httplib
)
target_link_libraries(${TARGET} PUBLIC llama-common-base)
target_link_libraries(${TARGET} PRIVATE cpp-httplib)
if (LLAMA_LLGUIDANCE)
include(ExternalProject)

View File

@@ -1,5 +1,6 @@
#include "arg.h"
#include "build-info.h"
#include "chat.h"
#include "common.h"
#include "download.h"
@@ -1044,8 +1045,8 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
{"--version"},
"show version and build info",
[](common_params &) {
fprintf(stderr, "version: %d (%s)\n", LLAMA_BUILD_NUMBER, LLAMA_COMMIT);
fprintf(stderr, "built with %s for %s\n", LLAMA_COMPILER, LLAMA_BUILD_TARGET);
fprintf(stderr, "version: %d (%s)\n", llama_build_number(), llama_commit());
fprintf(stderr, "built with %s for %s\n", llama_compiler(), llama_build_target());
exit(0);
}
));

View File

@@ -1,4 +1,35 @@
#include "build-info.h"
#include <cstdio>
#include <string>
int LLAMA_BUILD_NUMBER = @LLAMA_BUILD_NUMBER@;
char const *LLAMA_COMMIT = "@LLAMA_BUILD_COMMIT@";
char const *LLAMA_COMPILER = "@BUILD_COMPILER@";
char const *LLAMA_BUILD_TARGET = "@BUILD_TARGET@";
char const * LLAMA_COMMIT = "@LLAMA_BUILD_COMMIT@";
char const * LLAMA_COMPILER = "@BUILD_COMPILER@";
char const * LLAMA_BUILD_TARGET = "@BUILD_TARGET@";
int llama_build_number(void) {
return LLAMA_BUILD_NUMBER;
}
const char * llama_commit(void) {
return LLAMA_COMMIT;
}
const char * llama_compiler(void) {
return LLAMA_COMPILER;
}
const char * llama_build_target(void) {
return LLAMA_BUILD_TARGET;
}
const char * llama_build_info(void) {
static std::string s = "b" + std::to_string(LLAMA_BUILD_NUMBER) + "-" + LLAMA_COMMIT;
return s.c_str();
}
void llama_print_build_info(void) {
fprintf(stderr, "%s: build = %d (%s)\n", __func__, llama_build_number(), llama_commit());
fprintf(stderr, "%s: built with %s for %s\n", __func__, llama_compiler(), llama_build_target());
}

common/build-info.h (new file, +11 lines)
View File

@@ -0,0 +1,11 @@
#pragma once
int llama_build_number(void);
const char * llama_commit(void);
const char * llama_compiler(void);
const char * llama_build_target(void);
const char * llama_build_info(void);
void llama_print_build_info(void);

View File

@@ -198,10 +198,19 @@ common_peg_parser analyze_tools::build_tool_parser_json_native(parser_build_cont
args_field = format.function_field + "." + args_field;
}
auto tools_parser = p.standard_json_tools(
format.section_start, format.section_end, inputs.tools, inputs.parallel_tool_calls,
inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_REQUIRED, name_field, args_field, format.tools_array_wrapped,
format.fun_name_is_key, format.id_field, format.gen_id_field, format.parameter_order);
auto tools_parser = p.eps();
if (format.section_start.empty() && !format.per_call_start.empty()) {
auto single_tool_parser = p.standard_json_tools(
format.per_call_start, format.per_call_end, inputs.tools, inputs.parallel_tool_calls,
inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_REQUIRED, name_field, args_field, format.tools_array_wrapped,
format.fun_name_is_key, format.id_field, format.gen_id_field, format.parameter_order);
tools_parser = p.trigger_rule("tool-calls", p.one_or_more(single_tool_parser + p.space()));
} else {
tools_parser = p.standard_json_tools(
format.section_start, format.section_end, inputs.tools, inputs.parallel_tool_calls,
inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_REQUIRED, name_field, args_field, format.tools_array_wrapped,
format.fun_name_is_key, format.id_field, format.gen_id_field, format.parameter_order);
}
// Handle content wrappers if present
if (ctx.content && ctx.content->is_always_wrapped()) {

View File

@@ -308,19 +308,23 @@ struct analyze_tools : analyze_base {
private:
// Extract tool calling 'haystack' for further analysis and delegate further analysis based on format
void analyze_tool_calls(const analyze_reasoning & reasoning);
void analyze_tool_calls(const analyze_reasoning & reasoning, bool supports_parallel_tool_calls);
// Analyze format based on position of function and argument name in needle
void analyze_tool_call_format(const std::string & haystack,
const std::string & fun_name_needle,
const std::string & arg_name_needle,
const analyze_reasoning & reasoning);
const analyze_reasoning & reasoning,
bool supports_parallel_tool_calls);
// Analyze specifics of JSON native format (entire tool call is a JSON object)
void analyze_tool_call_format_json_native(const std::string & clean_haystack,
const std::string & fun_name_needle,
const std::string & arg_name_needle);
// Check if parallel calls in JSON native format array wrapped or tag wrapped
void analyze_json_native_parallel_calls();
// Analyze specifics of non-JSON native format (tags for function name or for function name and arguments)
void analyze_tool_call_format_non_json(const std::string & clean_haystack,
const std::string & fun_name_needle);

View File

@@ -558,7 +558,7 @@ analyze_tools::analyze_tools(const common_chat_template & tmpl,
: analyze_base(tmpl) {
LOG_DBG(ANSI_ORANGE "Phase 3: Tool call analysis\n" ANSI_RESET);
analyze_tool_calls(reasoning);
analyze_tool_calls(reasoning, caps.supports_parallel_tool_calls);
if (format.mode != tool_format::NONE && format.mode != tool_format::JSON_NATIVE) {
if (caps.supports_parallel_tool_calls) {
@@ -577,7 +577,7 @@ analyze_tools::analyze_tools(const common_chat_template & tmpl,
}
}
void analyze_tools::analyze_tool_calls(const analyze_reasoning & reasoning) {
void analyze_tools::analyze_tool_calls(const analyze_reasoning & reasoning, bool supports_parallel_tool_calls) {
json assistant_no_tools = json{
{ "role", "assistant" },
{ "content", ASSISTANT_MSG }
@@ -611,13 +611,14 @@ void analyze_tools::analyze_tool_calls(const analyze_reasoning & reasoning) {
return;
}
analyze_tool_call_format(tool_section, FUN_FIRST, ARG_FIRST, reasoning);
analyze_tool_call_format(tool_section, FUN_FIRST, ARG_FIRST, reasoning, supports_parallel_tool_calls);
}
void analyze_tools::analyze_tool_call_format(const std::string & haystack,
const std::string & fun_name_needle,
const std::string & arg_name_needle,
const analyze_reasoning & reasoning) {
const analyze_reasoning & reasoning,
bool supports_parallel_tool_calls) {
if (fun_name_needle.empty() || arg_name_needle.empty() || haystack.empty()) {
return;
}
@@ -660,6 +661,9 @@ void analyze_tools::analyze_tool_call_format(const std::string & haystack,
if (format.mode == tool_format::JSON_NATIVE) {
analyze_tool_call_format_json_native(clean_haystack, fun_name_needle, arg_name_needle);
if (supports_parallel_tool_calls) {
analyze_json_native_parallel_calls();
}
} else {
analyze_tool_call_format_non_json(clean_haystack, fun_name_needle);
}
@@ -668,6 +672,42 @@ void analyze_tools::analyze_tool_call_format(const std::string & haystack,
format.per_call_end = trim_whitespace(format.per_call_end);
}
void analyze_tools::analyze_json_native_parallel_calls() {
json assistant_one_tool = json{
{ "role", "assistant" },
{ "content", "" },
{ "tool_calls", json::array({ first_tool_call }) }
};
json assistant_two_tools = json{
{ "role", "assistant" },
{ "content", "" },
{ "tool_calls", json::array({ first_tool_call, second_tool_call }) }
};
template_params params;
params.messages = json::array({ user_msg, assistant_one_tool });
params.tools = tools;
params.add_generation_prompt = false;
params.enable_thinking = true;
auto comparison = compare_variants(
*tmpl, params, [&](template_params & p) { p.messages = json::array({ user_msg, assistant_two_tools }); });
if (!comparison) {
LOG_DBG(ANSI_ORANGE "%s: Template application failed\n" ANSI_RESET, __func__);
return;
}
std::string & second_call = comparison->diff.right;
if (!format.section_start.empty() && second_call.find(format.section_start) != std::string::npos) {
format.per_call_start = format.section_start;
format.per_call_end = format.section_end;
format.section_start.clear();
format.section_end.clear();
}
}
void analyze_tools::analyze_tool_call_format_json_native(const std::string & clean_haystack,
const std::string & fun_name_needle,
const std::string & arg_name_needle) {

View File

@@ -676,7 +676,7 @@ common_peg_parser common_chat_peg_builder::build_json_tools_nested_keys(
ordered_json params = function.contains("parameters") ? function.at("parameters") : ordered_json::object();
auto nested_name = literal("\"" + nested_name_field + "\"") + space() + literal(":") + space() +
literal("\"") + tool_name(literal(name)) + literal("\"");
atomic(literal("\"") + tool_name(literal(name)) + literal("\""));
auto nested_args = literal("\"" + nested_args_field + "\"") + space() + literal(":") + space() +
tool_args(schema(json(), "tool-" + name + "-schema", params));
@@ -744,7 +744,7 @@ common_peg_parser common_chat_peg_builder::build_json_tools_flat_keys(
ordered_json params = function.contains("parameters") ? function.at("parameters") : ordered_json::object();
auto tool_name_ = name_key_parser + space() + literal(":") + space() +
literal("\"") + tool_name(literal(name)) + literal("\"");
atomic(literal("\"") + tool_name(literal(name)) + literal("\""));
auto tool_args_ = args_key_parser + space() + literal(":") + space() +
tool_args(schema(json(), "tool-" + name + "-schema", params));

View File

@@ -1091,6 +1091,14 @@ static common_chat_params common_chat_params_init_gemma4(const common_chat_templ
common_chat_params data;
data.prompt = common_chat_template_direct_apply_impl(tmpl, inputs);
if (inputs.add_generation_prompt && string_ends_with(data.prompt, "<turn|>\n")) {
// This may happen if the model generates content + tool_call: the
// template does not add the model's next turn, which confuses the model
// and keeps it from emitting its proper reasoning token sequence.
data.prompt += "<|turn>model\n";
}
data.format = COMMON_CHAT_FORMAT_PEG_GEMMA4;
data.supports_thinking = true;
data.thinking_start_tag = "<|channel>thought";
@@ -1118,7 +1126,8 @@ static common_chat_params common_chat_params_init_gemma4(const common_chat_templ
p.rule("thought", p.content(p.literal("<|channel>thought") + p.space() + p.until("<channel|>") + p.literal("<channel|>")));
}
auto thought = (p.peek(p.literal("<|channel>")) + p.ref("thought")) | p.negate(p.literal("<|channel>"));
auto consume_empty_channels = p.gbnf(p.zero_or_more(p.literal("<|channel>") + p.negate(p.literal("thought"))), "");
auto thought = (p.peek(p.literal("<|channel>")) + consume_empty_channels + p.ref("thought")) | p.negate(p.literal("<|channel>"));
if (has_response_format) {
auto response_format = p.literal("```json") <<
@@ -1182,12 +1191,16 @@ static common_chat_params common_chat_params_init_gemma4(const common_chat_templ
/* max = */ inputs.parallel_tool_calls ? -1 : 1
));
auto content = p.rule("content", p.content(p.until_one_of({"<|channel>", "<|tool_call>"})));
auto scan_to_toolcall = p.rule("scan-to-toolcall", p.until("<|tool_call>"));
auto content = p.rule("content", p.content(p.until_one_of({"<|channel>", "<channel|>", "<|tool_call>"})));
auto message = p.rule("message", thought + content);
return start + p.zero_or_more(message) + tool_call;
return start + p.zero_or_more(message) + scan_to_toolcall + tool_call;
}
auto content = p.rule("content", p.content(p.until("<|channel>")));
// Gemma 4 may emit an extra <|channel>thought\n<channel|> at the end of the content. It may
// also emit a single trailing <channel|> token. Consume all complete reasoning blocks and
// then stop at the first unmatched <channel|> token.
auto content = p.rule("content", p.content(p.until_one_of({"<|channel>", "<channel|>"})));
auto message = p.rule("message", thought + content);
return start + p.one_or_more(message);
});
@@ -1656,6 +1669,173 @@ static common_chat_params common_chat_params_init_gigachat_v3(
return data;
}
static common_chat_params common_chat_params_init_deepseek_v3_2(const common_chat_template & tmpl,
const autoparser::generation_params & inputs) {
common_chat_params data;
data.prompt = common_chat_template_direct_apply_impl(tmpl, inputs);
data.format = COMMON_CHAT_FORMAT_PEG_NATIVE;
data.supports_thinking = true;
data.thinking_start_tag = "<think>";
data.thinking_end_tag = "</think>";
data.preserved_tokens = {
"DSML",
"<think>",
"</think>",
};
auto has_tools = inputs.tools.is_array() && !inputs.tools.empty();
auto has_response_format = !inputs.json_schema.is_null() && inputs.json_schema.is_object();
auto extract_reasoning = inputs.reasoning_format != COMMON_REASONING_FORMAT_NONE;
auto include_grammar = has_response_format || (has_tools && inputs.tool_choice != COMMON_CHAT_TOOL_CHOICE_NONE);
const std::string DSML = "DSML";
const std::string THINK_START = "<think>";
const std::string THINK_END = "</think>";
const std::string FC_START = "<" + DSML + "function_calls>";
const std::string FC_END = "</" + DSML + "function_calls>";
const std::string INVOKE_START = "<" + DSML + "invoke";
const std::string INVOKE_END = "</" + DSML + "invoke>";
const std::string PARAM_START = "<" + DSML + "parameter";
const std::string PARAM_END = "</" + DSML + "parameter>";
auto parser = build_chat_peg_parser([&](common_chat_peg_builder & p) {
auto generation_prompt = p.prefix(inputs.generation_prompt, THINK_START);
auto end = p.end();
auto reasoning = p.eps();
if (extract_reasoning && inputs.enable_thinking) {
reasoning = p.optional(THINK_START + p.reasoning(p.until(THINK_END)) + THINK_END);
} else if (extract_reasoning) {
// Thinking disabled but reasoning extraction requested: the generation prompt
// contains an empty <think></think> pair that must still be consumed.
reasoning = p.optional(p.literal(THINK_START) + p.until(THINK_END) + p.literal(THINK_END));
}
if (has_response_format) {
auto response_format = p.rule("response-format",
p.literal("```json") + p.space() +
p.content(p.schema(p.json(), "response-format-schema", inputs.json_schema)) +
p.space() + p.literal("```"));
return generation_prompt + reasoning + response_format + end;
}
if (!has_tools || inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_NONE) {
return generation_prompt + reasoning + p.content(p.rest()) + end;
}
auto tool_choice = p.choice();
foreach_function(inputs.tools, [&](const json & tool) {
const auto & function = tool.at("function");
std::string name = function.at("name");
auto params = function.contains("parameters") ? function.at("parameters") : json::object();
const auto & props = params.contains("properties") ? params.at("properties") : json::object();
std::set<std::string> required;
if (params.contains("required")) {
params.at("required").get_to(required);
}
auto schema_info = common_schema_info();
schema_info.resolve_refs(params);
std::vector<common_peg_parser> required_parsers;
std::vector<common_peg_parser> optional_parsers;
for (const auto & [param_name, param_schema] : props.items()) {
bool is_required = required.find(param_name) != required.end();
bool is_string = schema_info.resolves_to_string(param_schema);
auto arg = p.tool_arg(
p.tool_arg_open(
p.literal(PARAM_START + " name=\"") +
p.tool_arg_name(p.literal(param_name)) +
p.literal("\" string=\"" + std::string(is_string ? "true" : "false") + "\">")) +
(is_string
? p.tool_arg_string_value(p.until(PARAM_END))
: p.tool_arg_json_value(p.schema(p.json(),
"tool-" + name + "-arg-" + param_name + "-schema",
param_schema, false))) +
p.tool_arg_close(p.literal(PARAM_END)));
auto named_arg = p.rule("tool-" + name + "-arg-" + param_name, arg);
if (is_required) {
required_parsers.push_back(named_arg);
} else {
optional_parsers.push_back(named_arg);
}
}
common_peg_parser args_seq = p.eps();
for (size_t i = 0; i < required_parsers.size(); i++) {
if (i > 0) {
args_seq = args_seq + p.space();
}
args_seq = args_seq + required_parsers[i];
}
if (!optional_parsers.empty()) {
common_peg_parser any_opt = p.choice();
for (const auto & opt : optional_parsers) {
any_opt |= opt;
}
args_seq = args_seq + p.repeat(p.space() + any_opt, 0, -1);
}
common_peg_parser invoke_body = args_seq;
auto func_parser = p.tool(
p.tool_open(p.literal(INVOKE_START + " name=\"") +
p.tool_name(p.literal(name)) + p.literal("\">\n")) +
invoke_body + p.space() +
p.tool_close(p.literal(INVOKE_END)));
tool_choice |= p.rule("tool-" + name, func_parser);
});
auto require_tools = inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_REQUIRED;
common_peg_parser tool_calls = p.eps();
if (inputs.parallel_tool_calls) {
tool_calls = p.trigger_rule("tool-call",
p.literal(FC_START) + p.space() + tool_choice +
p.zero_or_more(p.space() + tool_choice) + p.space() + p.literal(FC_END));
} else {
tool_calls = p.trigger_rule("tool-call",
p.literal(FC_START) + p.space() + tool_choice + p.space() + p.literal(FC_END));
}
if (!require_tools) {
tool_calls = p.optional(tool_calls);
}
auto content_before_tools = p.content(p.until(FC_START));
return generation_prompt + reasoning + content_before_tools + tool_calls + end;
});
data.parser = parser.save();
if (include_grammar) {
data.grammar_lazy = !(has_response_format || (has_tools && inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_REQUIRED));
data.grammar = build_grammar([&](const common_grammar_builder & builder) {
foreach_function(inputs.tools, [&](const json & tool) {
const auto & function = tool.at("function");
auto schema = function.contains("parameters") ? function.at("parameters") : json::object();
builder.resolve_refs(schema);
});
if (has_response_format) {
auto schema = inputs.json_schema;
builder.resolve_refs(schema);
}
parser.build_grammar(builder, data.grammar_lazy);
});
data.grammar_triggers = {
{ COMMON_GRAMMAR_TRIGGER_TYPE_WORD, FC_START },
};
}
return data;
}
namespace workaround {
static void map_developer_role_to_system(json & messages) {
@@ -1927,6 +2107,15 @@ std::optional<common_chat_params> common_chat_try_specialized_template(
return common_chat_params_init_gigachat_v3(tmpl, params);
}
// DeepSeek V3.2 format detection: template defines dsml_token and uses it for tool calls.
// The template source contains the token as a variable assignment, not as a literal in markup.
if (src.find("dsml_token") != std::string::npos &&
src.find("function_calls") != std::string::npos &&
src.find("DSML") != std::string::npos) {
LOG_DBG("Using specialized template: DeepSeek V3.2\n");
return common_chat_params_init_deepseek_v3_2(tmpl, params);
}
// Gemma4 format detection
if (src.find("'<|tool_call>call:'") != std::string::npos) {
if (src.find("{#- OpenAI Chat Completions:") == std::string::npos) {

View File

@@ -1,6 +1,7 @@
#include "ggml.h"
#include "gguf.h"
#include "build-info.h"
#include "common.h"
#include "log.h"
#include "llama.h"
@@ -372,7 +373,7 @@ void common_init() {
const char * build_type = " (debug)";
#endif
LOG_DBG("build: %d (%s) with %s for %s%s\n", LLAMA_BUILD_NUMBER, LLAMA_COMMIT, LLAMA_COMPILER, LLAMA_BUILD_TARGET, build_type);
LOG_DBG("build: %d (%s) with %s for %s%s\n", llama_build_number(), llama_commit(), llama_compiler(), llama_build_target(), build_type);
}
std::string common_params_get_system_info(const common_params & params) {

View File

@@ -2,9 +2,10 @@
#pragma once
#include "llama-cpp.h"
#include "ggml-opt.h"
#include "ggml.h"
#include "llama-cpp.h"
#include <set>
#include <sstream>
@@ -27,11 +28,6 @@
#define die(msg) do { fputs("error: " msg "\n", stderr); exit(1); } while (0)
#define die_fmt(fmt, ...) do { fprintf(stderr, "error: " fmt "\n", __VA_ARGS__); exit(1); } while (0)
#define print_build_info() do { \
fprintf(stderr, "%s: build = %d (%s)\n", __func__, LLAMA_BUILD_NUMBER, LLAMA_COMMIT); \
fprintf(stderr, "%s: built with %s for %s\n", __func__, LLAMA_COMPILER, LLAMA_BUILD_TARGET); \
} while(0)
struct common_time_meas {
common_time_meas(int64_t & t_acc, bool disable = false);
~common_time_meas();
@@ -53,14 +49,6 @@ struct common_adapter_lora_info {
using llama_tokens = std::vector<llama_token>;
// build info
extern int LLAMA_BUILD_NUMBER;
extern const char * LLAMA_COMMIT;
extern const char * LLAMA_COMPILER;
extern const char * LLAMA_BUILD_TARGET;
const static std::string build_info("b" + std::to_string(LLAMA_BUILD_NUMBER) + "-" + LLAMA_COMMIT);
struct common_control_vector_load_info;
//

View File

@@ -1,5 +1,6 @@
#include "arg.h"
#include "build-info.h"
#include "common.h"
#include "log.h"
#include "download.h"
@@ -258,6 +259,9 @@ static bool common_pull_file(httplib::Client & cli,
if (progress_step >= p.total / 1000 || p.downloaded == p.total) {
if (callback) {
callback->on_update(p);
if (callback->is_cancelled()) {
return false;
}
}
progress_step = 0;
}
@@ -300,7 +304,7 @@ static int common_download_file_single_online(const std::string & url,
headers.emplace(h.first, h.second);
}
if (headers.find("User-Agent") == headers.end()) {
headers.emplace("User-Agent", "llama-cpp/" + build_info);
headers.emplace("User-Agent", "llama-cpp/" + std::string(llama_build_info()));
}
if (!opts.bearer_token.empty()) {
headers.emplace("Authorization", "Bearer " + opts.bearer_token);
@@ -373,6 +377,9 @@ static int common_download_file_single_online(const std::string & url,
}
for (int i = 0; i < max_attempts; ++i) {
if (opts.callback && opts.callback->is_cancelled()) {
break;
}
if (i) {
LOG_WRN("%s: retrying after %d seconds...\n", __func__, delay);
std::this_thread::sleep_for(std::chrono::seconds(delay));
@@ -412,6 +419,12 @@ static int common_download_file_single_online(const std::string & url,
if (opts.callback) {
opts.callback->on_done(p, success);
}
if (opts.callback && opts.callback->is_cancelled() &&
std::filesystem::exists(path_temporary)) {
if (remove(path_temporary.c_str()) != 0) {
LOG_ERR("%s: unable to delete temporary file: %s\n", __func__, path_temporary.c_str());
}
}
if (!success) {
LOG_ERR("%s: download failed after %d attempts\n", __func__, max_attempts);
return -1; // max attempts reached
@@ -429,7 +442,7 @@ std::pair<long, std::vector<char>> common_remote_get_content(const std::string
headers.emplace(h.first, h.second);
}
if (headers.find("User-Agent") == headers.end()) {
headers.emplace("User-Agent", "llama-cpp/" + build_info);
headers.emplace("User-Agent", "llama-cpp/" + std::string(llama_build_info()));
}
if (params.timeout > 0) {

View File

@@ -21,6 +21,7 @@ public:
virtual void on_start(const common_download_progress & p) = 0;
virtual void on_update(const common_download_progress & p) = 0;
virtual void on_done(const common_download_progress & p, bool ok) = 0;
virtual bool is_cancelled() const { return false; }
};
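
The default `is_cancelled()` keeps existing callbacks source-compatible; only callers that want cancellation override it. A minimal sketch, assuming the interface above is reachable as `common_download_callback` (that name, `stoppable_progress`, and the atomic flag are illustrative, not part of the patch):

```cpp
#include <atomic>

// Illustrative callback: polled between chunks by the download loop above;
// returning true from is_cancelled() aborts the transfer, and the partial
// temporary file is then removed by the caller.
struct stoppable_progress : common_download_callback {
    std::atomic<bool> stop{false};

    void on_start (const common_download_progress & p) override { (void) p; }
    void on_update(const common_download_progress & p) override { (void) p; }
    void on_done  (const common_download_progress & p, bool ok) override { (void) p; (void) ok; }

    bool is_cancelled() const override { return stop.load(); }
};
```

Another thread (e.g. a UI handler) can set `stop = true` to cancel an in-flight download.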
struct common_remote_params {

View File

@@ -1,5 +1,6 @@
#include "hf-cache.h"
#include "build-info.h"
#include "common.h"
#include "log.h"
#include "http.h"
@@ -200,7 +201,7 @@ static nl::json api_get(const std::string & url,
auto [cli, parts] = common_http_client(url);
httplib::Headers headers = {
{"User-Agent", "llama-cpp/" + build_info},
{"User-Agent", "llama-cpp/" + std::string(llama_build_info())},
{"Accept", "application/json"}
};

View File

@@ -23,6 +23,10 @@
int common_log_verbosity_thold = LOG_DEFAULT_LLAMA;
int common_log_get_verbosity_thold(void) {
return common_log_verbosity_thold;
}
void common_log_set_verbosity_thold(int verbosity) {
common_log_verbosity_thold = verbosity;
}

View File

@@ -38,7 +38,7 @@ enum log_colors {
// needed by the LOG_TMPL macro to avoid computing log arguments if the verbosity is below the threshold
// set via common_log_set_verbosity()
extern int common_log_verbosity_thold;
int common_log_get_verbosity_thold(void);
void common_log_set_verbosity_thold(int verbosity); // not thread-safe
@@ -98,7 +98,7 @@ void common_log_flush (struct common_log * log); // f
#define LOG_TMPL(level, verbosity, ...) \
do { \
if ((verbosity) <= common_log_verbosity_thold) { \
if ((verbosity) <= common_log_get_verbosity_thold()) { \
common_log_add(common_log_main(), (level), __VA_ARGS__); \
} \
} while (0)
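
Routing reads through the getter keeps the macro's behavior unchanged while letting other libraries query the threshold. A minimal sketch of the gating, assuming the usual common log levels where debug messages carry a verbosity above the default:

```cpp
#include "log.h"

int main() {
    common_log_set_verbosity_thold(0);  // keep only messages with verbosity <= 0
    LOG_INF("shown at the default level\n");
    LOG_DBG("suppressed: LOG_TMPL short-circuits before formatting the arguments\n");

    // the new getter exposes the same threshold without touching the global
    LOG_INF("current threshold: %d\n", common_log_get_verbosity_thold());
    return 0;
}
```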

View File

@@ -890,6 +890,10 @@ struct parser_executor {
}
return result;
}
common_peg_parse_result operator()(const common_peg_gbnf_parser & p) {
return arena.parse(p.child, ctx, start_pos);
}
};
common_peg_parse_result common_peg_arena::parse(common_peg_parse_context & ctx, size_t start) const {
@@ -957,7 +961,8 @@ void common_peg_arena::resolve_refs() {
std::is_same_v<T, common_peg_and_parser> ||
std::is_same_v<T, common_peg_not_parser> ||
std::is_same_v<T, common_peg_tag_parser> ||
std::is_same_v<T, common_peg_atomic_parser>) {
std::is_same_v<T, common_peg_atomic_parser> ||
std::is_same_v<T, common_peg_gbnf_parser>) {
p.child = resolve_ref(p.child);
} else if constexpr (std::is_same_v<T, common_peg_rule_parser>) {
p.child = resolve_ref(p.child);
@@ -1036,6 +1041,8 @@ std::string common_peg_arena::dump_impl(common_peg_parser_id
return "Not(" + dump_impl(p.child, visited) + ")";
} else if constexpr (std::is_same_v<T, common_peg_atomic_parser>) {
return "Atomic(" + dump_impl(p.child, visited) + ")";
} else if constexpr (std::is_same_v<T, common_peg_gbnf_parser>) {
return "Gbnf(" + p.grammar + ", " + dump_impl(p.child, visited) + ")";
} else if constexpr (std::is_same_v<T, common_peg_any_parser>) {
return "Any";
} else if constexpr (std::is_same_v<T, common_peg_space_parser>) {
@@ -1565,6 +1572,7 @@ static std::unordered_set<std::string> collect_reachable_rules(
std::is_same_v<T, common_peg_not_parser> ||
std::is_same_v<T, common_peg_tag_parser> ||
std::is_same_v<T, common_peg_atomic_parser> ||
std::is_same_v<T, common_peg_gbnf_parser> ||
std::is_same_v<T, common_peg_schema_parser>) {
visit(p.child);
} else if constexpr (std::is_same_v<T, common_peg_rule_parser>) {
@@ -1651,10 +1659,13 @@ void common_peg_arena::build_grammar(const common_grammar_builder & builder, boo
} else if constexpr (std::is_same_v<T, common_peg_sequence_parser>) {
std::string s;
for (const auto & child : p.children) {
auto child_gbnf = to_gbnf(child);
if (child_gbnf.empty()) {
continue;
}
if (!s.empty()) {
s += " ";
}
auto child_gbnf = to_gbnf(child);
const auto & child_parser = effective_parser(child);
if (std::holds_alternative<common_peg_choice_parser>(child_parser) ||
std::holds_alternative<common_peg_sequence_parser>(child_parser)) {
@@ -1754,6 +1765,8 @@ void common_peg_arena::build_grammar(const common_grammar_builder & builder, boo
return to_gbnf(p.child);
} else if constexpr (std::is_same_v<T, common_peg_atomic_parser>) {
return to_gbnf(p.child);
} else if constexpr (std::is_same_v<T, common_peg_gbnf_parser>) {
return p.grammar;
} else {
static_assert(is_always_false_v<T>);
}
@@ -1888,6 +1901,8 @@ static nlohmann::json serialize_parser_variant(const common_peg_parser_variant &
{"child", p.child},
{"tag", p.tag}
};
} else if constexpr (std::is_same_v<T, common_peg_gbnf_parser>) {
return json{{"type", "gbnf"}, {"child", p.child}, {"grammar", p.grammar}};
}
}, variant);
}
@@ -2050,6 +2065,16 @@ static common_peg_parser_variant deserialize_parser_variant(const nlohmann::json
};
}
if (type == "gbnf") {
if (!j.contains("child") || !j.contains("grammar")) {
throw std::runtime_error("gbnf parser missing required fields");
}
return common_peg_gbnf_parser{
j["child"].get<common_peg_parser_id>(),
j["grammar"].get<std::string>(),
};
}
throw std::runtime_error("Unknown parser type: " + type);
}
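
The serialized form of the new node mirrors the branch above: a `type` tag plus the child id and the raw grammar string. A minimal round-trip sketch using `nlohmann::json` (the id value is illustrative):

```cpp
#include <nlohmann/json.hpp>
#include <cassert>
#include <string>

int main() {
    // shape produced by serialize_parser_variant for a gbnf node
    nlohmann::json j = {{"type", "gbnf"}, {"child", 7}, {"grammar", "[0-9]+"}};

    // the checks the deserializer performs before constructing the node
    assert(j.contains("child") && j.contains("grammar"));
    const int         child   = j["child"].get<int>(); // common_peg_parser_id in the real code
    const std::string grammar = j["grammar"].get<std::string>();
    assert(child == 7 && grammar == "[0-9]+");
    return 0;
}
```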

View File

@@ -270,6 +270,11 @@ struct common_peg_tag_parser {
std::string tag;
};
struct common_peg_gbnf_parser {
common_peg_parser_id child;
std::string grammar;
};
// Variant holding all parser types
using common_peg_parser_variant = std::variant<
common_peg_epsilon_parser,
@@ -290,7 +295,8 @@ using common_peg_parser_variant = std::variant<
common_peg_rule_parser,
common_peg_ref_parser,
common_peg_atomic_parser,
common_peg_tag_parser
common_peg_tag_parser,
common_peg_gbnf_parser
>;
class common_peg_arena {
@@ -504,6 +510,10 @@ class common_peg_parser_builder {
// Unlike rules, you can tag multiple nodes with the same tag.
common_peg_parser tag(const std::string & tag, const common_peg_parser & p) { return add(common_peg_tag_parser{p.id(), tag}); }
// Wraps a child parser but emits a custom GBNF grammar string instead of
// the child's grammar. Parsing delegates entirely to the child.
common_peg_parser gbnf(const common_peg_parser & p, const std::string & grammar) { return add(common_peg_gbnf_parser{p, grammar}); }
void set_root(const common_peg_parser & p);
common_peg_arena build();

View File

@@ -287,8 +287,8 @@ struct common_sampler * common_sampler_init(const struct llama_model * model, st
}
}
// reasoning budget sampler
if (!params.reasoning_budget_start.empty() && !params.reasoning_budget_end.empty()) {
// reasoning budget sampler (skip when budget is unlimited unless a lazy grammar is active, which needs rbudget for thinking-block suppression)
if (!params.reasoning_budget_start.empty() && !params.reasoning_budget_end.empty() && (params.grammar_lazy || params.reasoning_budget_tokens >= 0)) {
rbudget = common_reasoning_budget_init(
vocab,
params.reasoning_budget_start,

View File

@@ -10893,7 +10893,64 @@ class NemotronHModel(GraniteHybridModel):
self.gguf_writer.add_moe_latent_size(latent_size)
def set_vocab(self):
super().set_vocab()
# The NemotronH config uses pattern characters (e.g. '-') that may not be
# supported by the installed transformers version. AutoTokenizer
# internally calls AutoConfig, which triggers this parsing failure.
# Passing trust_remote_code=True loads the model's own config class instead.
tokens: list[str] = []
toktypes: list[int] = []
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(self.dir_model, trust_remote_code=True)
# Pad vocab size (from Mamba2Model/GraniteHybridModel)
self.hparams["pad_vocab_size_multiple"] = 8 # Setting this here since GraniteHybridModel.set_vocab() isn't being invoked now.
# From Mamba2Model.set_vocab():
vocab_size = self.hparams["vocab_size"]
pad_vocab = self.hparams.get("pad_vocab_size_multiple", 16)
# ref: https://stackoverflow.com/a/17511341/22827863
vocab_size = -(vocab_size // -pad_vocab) * pad_vocab
self.hparams["vocab_size"] = vocab_size
assert max(tokenizer.vocab.values()) < vocab_size
tokpre = self.get_vocab_base_pre(tokenizer)
reverse_vocab = {id_: encoded_tok for encoded_tok, id_ in tokenizer.vocab.items()}
added_vocab = tokenizer.get_added_vocab()
added_tokens_decoder = tokenizer.added_tokens_decoder
for i in range(vocab_size):
if i not in reverse_vocab:
tokens.append(f"[PAD{i}]")
toktypes.append(gguf.TokenType.UNUSED)
else:
token: str = reverse_vocab[i]
if token in added_vocab:
if not added_tokens_decoder[i].normalized:
previous_token = token
token = tokenizer.decode(tokenizer.encode(token, add_special_tokens=False))
if previous_token != token:
logger.info(f"{repr(previous_token)} is encoded and decoded back to {repr(token)} using AutoTokenizer")
if added_tokens_decoder[i].special or self.does_token_look_special(token):
toktypes.append(gguf.TokenType.CONTROL)
else:
token = token.replace(b"\xe2\x96\x81".decode("utf-8"), " ") # pre-normalize user-defined spaces
toktypes.append(gguf.TokenType.USER_DEFINED)
else:
toktypes.append(gguf.TokenType.NORMAL)
tokens.append(token)
# From TextModel.set_vocab_gpt2():
self.gguf_writer.add_tokenizer_model("gpt2")
self.gguf_writer.add_tokenizer_pre(tokpre)
self.gguf_writer.add_token_list(tokens)
self.gguf_writer.add_token_types(toktypes)
special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True)
special_vocab.add_to_gguf(self.gguf_writer)
# The tokenizer _does_ add a BOS token (via post_processor type
# TemplateProcessing) but does not set add_bos_token to true in the

View File

@@ -689,6 +689,7 @@ use 1 SYCL GPUs: [0] with Max compute units:512
| GGML_SYCL_F16 | OFF *(default)* \|ON *(optional)* | Enable FP16 build with SYCL code path. (1.) |
| GGML_SYCL_GRAPH | OFF *(default)* \|ON *(Optional)* | Enable build with [SYCL Graph extension](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc). |
| GGML_SYCL_DNN | ON *(default)* \|OFF *(Optional)* | Enable build with oneDNN. |
| GGML_SYCL_HOST_MEM_FALLBACK | ON *(default)* \|OFF *(Optional)* | Allow host memory fallback when device memory is full during quantized weight reorder. Enables inference to continue at reduced speed (reading over PCIe) instead of failing. Requires Linux kernel 6.8+. |
| CMAKE_C_COMPILER | `icx` *(Linux)*, `icx/cl` *(Windows)* | Set `icx` compiler for SYCL code path. |
| CMAKE_CXX_COMPILER | `icpx` *(Linux)*, `icx` *(Windows)* | Set `icpx/icx` compiler for SYCL code path. |

View File

@@ -281,6 +281,12 @@ Use `GGML_CUDA_FORCE_CUBLAS_COMPUTE_16F` environment variable to force use FP16
The environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` can be used to enable unified memory in Linux. This allows swapping to system RAM instead of crashing when the GPU VRAM is exhausted. In Windows this setting is available in the NVIDIA control panel as `System Memory Fallback`.
### Peer Access
The environment variable `GGML_CUDA_P2P` can be set to enable peer-to-peer access between multiple GPUs, allowing them to transfer data directly rather than going through system memory.
Requires driver support (usually restricted to workstation/datacenter GPUs).
May cause crashes or corrupted outputs for some motherboards and BIOS settings (e.g. IOMMU).
### Performance Tuning
The following compilation options are also available to tweak performance:
@@ -456,7 +462,8 @@ pacman -S git \
mingw-w64-ucrt-x86_64-gcc \
mingw-w64-ucrt-x86_64-cmake \
mingw-w64-ucrt-x86_64-vulkan-devel \
mingw-w64-ucrt-x86_64-shaderc
mingw-w64-ucrt-x86_64-shaderc \
mingw-w64-ucrt-x86_64-spirv-headers
```
Switch into the `llama.cpp` directory and build using CMake.
@@ -490,9 +497,11 @@ First, follow the official LunarG instructions for the installation and setup of
On Debian / Ubuntu, you can install the required dependencies using:
```sh
sudo apt-get install libvulkan-dev glslc
sudo apt-get install libvulkan-dev glslc spirv-headers
```
SPIRV-Headers (`spirv/unified1/spirv.hpp`) are required for the Vulkan backend and are **not** always pulled in by the Vulkan loader dev package alone. Other distros use names such as `spirv-headers` (Ubuntu / Debian / Arch) or `spirv-headers-devel` (Fedora / openSUSE). On Windows, the LunarG Vulkan SDK's `Include` directory already contains these headers.
#### Common steps
Second, after verifying that you have followed all of the SDK installation/setup steps, use this command to make sure before proceeding:

View File

@@ -130,6 +130,23 @@ Note:
- Adding a model-specific API or CLI is an anti-pattern in `libmtmd`. The goal of `libmtmd` is to provide an easy-to-use, model-agnostic library for multimodal pipelines.
- In most cases, `llama-mtmd-cli` should not be modified. If a model requires a specific prompt, either let the user provide it or bake it into the Jinja chat template.
## Tips and tricks
### Working with ggml_rope_ext
PyTorch implementations usually prefer explicitly calculating `freq_cis`/`sin`/`cos` components. However, in llama.cpp, most RoPE operations can be handled via `ggml_rope_ext`, which does not require a sin/cos matrix. This saves memory while allowing the GGML RoPE kernel to be fused with other ops.
Since `ggml_rope_ext` only covers a subset of the RoPE variants that models use, converting models from PyTorch to llama.cpp may require some creative adaptation.
For more information about `ggml_rope_ext`, please refer to the in-code documentation in `ggml.h`.
Examples:
- `libmtmd` implements 2D RoPE with `GGML_ROPE_TYPE_NORMAL` ordering by splitting the input tensor in half, applying `ggml_rope_ext` separately to each half, then joining them back together using `ggml_concat` (see the sketch after this list).
- The [Kimi-K2.5](https://github.com/ggml-org/llama.cpp/pull/19170) vision encoder uses vision RoPE with interleaved frequencies. The weights must be permuted during conversion in order to reuse the `build_rope_2d()` function.
- [Gemma 4](https://github.com/ggml-org/llama.cpp/pull/21309) uses "proportional" RoPE. We employ a trick where `rope_freqs` is set to a very large value in the last dimensions to prevent those dimensions from being rotated. See the `Gemma4Model` class in `convert_hf_to_gguf.py`.
- Some models require scaling the input position. For example, `[0, 1, 2, ...]` becomes `[0, 0.5, 1, ...]`. In this case, you can provide the scaling via `freq_scale = 0.5f`.
- Some models use learned RoPE frequencies instead of relying on `powf(freq_base, -2.0 * i / n_dims)`. In this case, you can provide the learned frequencies via the `rope_freqs` tensor (corresponding to the `c` argument in `ggml_rope_ext`), then set `freq_base = 1.0f`. An important note is that `rope_freqs` in GGML is the **inverse** (`theta = pos[i] / rope_freqs`), so you may need to invert `rope_freqs` during conversion.
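
As referenced in the first example above, a minimal sketch of the split-half approach (function and variable names are illustrative; see `build_rope_2d()` in the mtmd sources for the real implementation):

```cpp
// 2D RoPE with NORMAL ordering: rotate the first half of each head with the
// y positions and the second half with the x positions, then re-join.
// cur: [n_dim, n_head, n_tokens]; pos_y/pos_x: int32 vectors of size n_tokens
static ggml_tensor * rope_2d_sketch(
        ggml_context * ctx, ggml_tensor * cur,
        ggml_tensor * pos_y, ggml_tensor * pos_x, float freq_base) {
    const int64_t n_dim = cur->ne[0];

    ggml_tensor * first = ggml_view_3d(ctx, cur,
        n_dim/2, cur->ne[1], cur->ne[2], cur->nb[1], cur->nb[2], 0);
    first = ggml_rope_ext(ctx, first, pos_y, NULL, n_dim/2,
        GGML_ROPE_TYPE_NORMAL, 0, freq_base, 1.0f, 0.0f, 1.0f, 0.0f, 0.0f);

    ggml_tensor * second = ggml_view_3d(ctx, cur,
        n_dim/2, cur->ne[1], cur->ne[2], cur->nb[1], cur->nb[2],
        n_dim/2 * ggml_element_size(cur));
    second = ggml_cont(ctx, second); // the rotated dim must be contiguous
    second = ggml_rope_ext(ctx, second, pos_x, NULL, n_dim/2,
        GGML_ROPE_TYPE_NORMAL, 0, freq_base, 1.0f, 0.0f, 1.0f, 0.0f, 0.0f);

    return ggml_concat(ctx, first, second, 0); // join the halves back on dim 0
}
```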
## GGUF specification
https://github.com/ggml-org/ggml/blob/master/docs/gguf.md

View File

@@ -114,6 +114,10 @@ NOTE: some models may require large context window, for example: `-c 8192`
# Mistral's Voxtral
(tool_name) -hf ggml-org/Voxtral-Mini-3B-2507-GGUF
# Qwen3-ASR
(tool_name) -hf ggml-org/Qwen3-ASR-0.6B-GGUF
(tool_name) -hf ggml-org/Qwen3-ASR-1.7B-GGUF
```
**Mixed modalities**:
@@ -124,6 +128,11 @@ NOTE: some models may require large context window, for example: `-c 8192`
(tool_name) -hf ggml-org/Qwen2.5-Omni-3B-GGUF
(tool_name) -hf ggml-org/Qwen2.5-Omni-7B-GGUF
# Qwen3 Omni
# Capabilities: audio input, vision input
(tool_name) -hf ggml-org/Qwen3-Omni-30B-A3B-Instruct-GGUF
(tool_name) -hf ggml-org/Qwen3-Omni-30B-A3B-Thinking-GGUF
# Gemma 4
# Capabilities: audio input, vision input
(tool_name) -hf ggml-org/gemma-4-E2B-it-GGUF

View File

@@ -22,13 +22,13 @@ Legend:
| ARANGE | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| ARGMAX | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ |
| ARGSORT | ❌ | ✅ | ✅ | ✅ | ✅ | 🟡 | 🟡 | ✅ | ✅ | ❌ | ❌ |
| CEIL | ❌ | ❌ | ✅ | 🟡 | | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| CEIL | ❌ | ❌ | ✅ | 🟡 | | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| CLAMP | ❌ | ✅ | ✅ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| CONCAT | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 | ✅ | ✅ | ✅ | ❌ | ❌ |
| CONT | ❌ | 🟡 | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | ❌ | ❌ |
| CONT | ❌ | 🟡 | ✅ | ✅ | | 🟡 | 🟡 | ✅ | 🟡 | ❌ | ❌ |
| CONV_2D | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ |
| CONV_2D_DW | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| CONV_3D | ❌ | ❌ | ✅ | ❌ | | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| CONV_3D | ❌ | ❌ | ✅ | ❌ | | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| CONV_TRANSPOSE_1D | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| CONV_TRANSPOSE_2D | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| COS | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | 🟡 | 🟡 | ✅ | ❌ | ❌ |
@@ -46,7 +46,7 @@ Legend:
| EXPM1 | ❌ | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ |
| FILL | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ |
| FLASH_ATTN_EXT | ❌ | 🟡 | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | ❌ | ❌ |
| FLOOR | ❌ | ❌ | ✅ | 🟡 | | ❌ | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| FLOOR | ❌ | ❌ | ✅ | 🟡 | | ❌ | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| GATED_DELTA_NET | ❌ | ❌ | ✅ | ❌ | 🟡 | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ |
| GATED_LINEAR_ATTN | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ |
| GEGLU | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ | ✅ | 🟡 | ✅ | ❌ | ❌ |
@@ -84,10 +84,10 @@ Legend:
| REPEAT_BACK | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| RMS_NORM | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| RMS_NORM_BACK | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| ROLL | ❌ | ❌ | ✅ | ✅ | | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| ROLL | ❌ | ❌ | ✅ | ✅ | | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| ROPE | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| ROPE_BACK | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| ROUND | ❌ | ❌ | ✅ | 🟡 | | ❌ | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| ROUND | ❌ | ❌ | ✅ | 🟡 | | ❌ | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| RWKV_WKV6 | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| RWKV_WKV7 | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| SCALE | ❌ | 🟡 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
@@ -116,6 +116,6 @@ Legend:
| TIMESTEP_EMBEDDING | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| TOP_K | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| TRI | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ |
| TRUNC | ❌ | ❌ | ✅ | 🟡 | | ❌ | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| TRUNC | ❌ | ❌ | ✅ | 🟡 | | ❌ | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| UPSCALE | ❌ | 🟡 | ✅ | ✅ | ✅ | 🟡 | ✅ | ✅ | ❌ | ❌ | ❌ |
| XIELU | ❌ | ❌ | ✅ | ❌ | | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ |
| XIELU | ❌ | ❌ | ✅ | ❌ | | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ |

File diff suppressed because it is too large

View File

@@ -1,5 +1,5 @@
set(TARGET llama-batched)
add_executable(${TARGET} batched.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(${TARGET} PRIVATE llama-common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)

View File

@@ -1,5 +1,5 @@
set(TARGET llama-convert-llama2c-to-ggml)
add_executable(${TARGET} convert-llama2c-to-ggml.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(${TARGET} PRIVATE llama-common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)

View File

@@ -1,5 +1,5 @@
set(TARGET llama-debug)
add_executable(${TARGET} debug.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(${TARGET} PRIVATE llama-common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)

View File

@@ -1,5 +1,5 @@
set(TARGET llama-diffusion-cli)
add_executable(${TARGET} diffusion-cli.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE llama common ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(${TARGET} PRIVATE llama llama-common ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)

View File

@@ -602,8 +602,8 @@ int main(int argc, char ** argv) {
int n_input = input_tokens.size();
if (n_input >= params.n_ctx) {
LOG_ERR("error: input too long (%d tokens), max context is %d\n", n_input, params.n_ctx);
if (static_cast<uint32_t>(n_input) >= llama_n_ctx(ctx)) {
LOG_ERR("error: input too long (%d tokens), max context is %d\n", n_input, llama_n_ctx(ctx));
llama_free(ctx);
llama_model_free(model);
return 1;

View File

@@ -1,5 +1,5 @@
set(TARGET llama-embedding)
add_executable(${TARGET} embedding.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(${TARGET} PRIVATE llama-common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)

View File

@@ -1,7 +1,7 @@
set(TARGET llama-eval-callback)
add_executable(${TARGET} eval-callback.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(${TARGET} PRIVATE llama-common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)
if(LLAMA_BUILD_TESTS)

View File

@@ -1,5 +1,5 @@
set(TARGET llama-gen-docs)
add_executable(${TARGET} gen-docs.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(${TARGET} PRIVATE llama-common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)

View File

@@ -1,5 +1,5 @@
set(TARGET llama-idle)
add_executable(${TARGET} idle.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE llama common ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(${TARGET} PRIVATE llama llama-common ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_11)

View File

@@ -1,5 +1,5 @@
set(TARGET llama-lookahead)
add_executable(${TARGET} lookahead.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(${TARGET} PRIVATE llama-common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)

View File

@@ -1,23 +1,23 @@
set(TARGET llama-lookup)
add_executable(${TARGET} lookup.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(${TARGET} PRIVATE llama-common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)
set(TARGET llama-lookup-create)
add_executable(${TARGET} lookup-create.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(${TARGET} PRIVATE llama-common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)
set(TARGET llama-lookup-merge)
add_executable(${TARGET} lookup-merge.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(${TARGET} PRIVATE llama-common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)
set(TARGET llama-lookup-stats)
add_executable(${TARGET} lookup-stats.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(${TARGET} PRIVATE llama-common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)

View File

@@ -1,5 +1,5 @@
set(TARGET llama-parallel)
add_executable(${TARGET} parallel.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(${TARGET} PRIVATE llama-common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)

View File

@@ -1,5 +1,5 @@
set(TARGET llama-passkey)
add_executable(${TARGET} passkey.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(${TARGET} PRIVATE llama-common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)

View File

@@ -1,5 +1,5 @@
set(TARGET llama-retrieval)
add_executable(${TARGET} retrieval.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(${TARGET} PRIVATE llama-common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)

View File

@@ -1,5 +1,5 @@
set(TARGET llama-save-load-state)
add_executable(${TARGET} save-load-state.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(${TARGET} PRIVATE llama-common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)

View File

@@ -1,5 +1,5 @@
set(TARGET llama-speculative-simple)
add_executable(${TARGET} speculative-simple.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(${TARGET} PRIVATE llama-common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)

View File

@@ -1,5 +1,5 @@
set(TARGET llama-speculative)
add_executable(${TARGET} speculative.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(${TARGET} PRIVATE llama-common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)

View File

@@ -5,5 +5,5 @@
set(TARGET llama-ls-sycl-device)
add_executable(${TARGET} ls-sycl-device.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(${TARGET} PRIVATE llama-common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)

View File

@@ -1,5 +1,5 @@
set(TARGET llama-finetune)
add_executable(${TARGET} finetune.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(${TARGET} PRIVATE llama-common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_11)

View File

@@ -1,4 +1,11 @@
cmake_minimum_required(VERSION 3.14...3.28) # for add_link_options and implicit target directories.
# ref: https://cmake.org/cmake/help/latest/policy/CMP0194.html
# MSVC is not a valid assembler for the ASM language.
# Set to NEW to avoid a warning on CMake 4.1+ with MSVC.
if (POLICY CMP0194)
cmake_policy(SET CMP0194 NEW)
endif()
project("ggml" C CXX ASM)
### GGML Version
@@ -247,6 +254,7 @@ option(GGML_RPC "ggml: use RPC"
option(GGML_SYCL "ggml: use SYCL" OFF)
option(GGML_SYCL_F16 "ggml: use 16 bit floats for sycl calculations" OFF)
option(GGML_SYCL_GRAPH "ggml: enable graphs in the SYCL backend" ON)
option(GGML_SYCL_HOST_MEM_FALLBACK "ggml: allow host memory fallback in SYCL reorder (requires kernel 6.8+)" ON)
option(GGML_SYCL_DNN "ggml: enable oneDNN in the SYCL backend" ON)
set (GGML_SYCL_TARGET "INTEL" CACHE STRING
"ggml: sycl target device")

View File

@@ -202,8 +202,11 @@ extern "C" {
// Common functions that may be obtained using ggml_backend_reg_get_proc_address
// AllReduce operation for tensor parallelism (meta backend)
typedef bool (*ggml_backend_allreduce_tensor_t)(ggml_backend_t * backends, struct ggml_tensor ** tensors, size_t n_backends);
// Context management and operations for faster communication between backends, used for tensor parallelism (meta backend)
typedef void * (*ggml_backend_comm_init_t)(ggml_backend_t * backends, size_t n_backends);
typedef void (*ggml_backend_comm_free_t)(void * comm_ctx);
typedef bool (*ggml_backend_comm_allreduce_tensor_t)(void * comm_ctx, struct ggml_tensor ** tensors);
// Split buffer type for tensor parallelism (old)
typedef ggml_backend_buffer_type_t (*ggml_backend_split_buffer_type_t)(int main_device, const float * tensor_split);
// Set the number of threads for the backend
@@ -348,6 +351,53 @@ extern "C" {
// Set a callback to be called for each resulting node during graph compute
GGML_API void ggml_backend_sched_set_eval_callback(ggml_backend_sched_t sched, ggml_backend_sched_eval_callback callback, void * user_data);
//
// Meta backend
//
#define GGML_BACKEND_META_MAX_DEVICES 16
enum ggml_backend_meta_split_axis {
// tensor split by tensor dimensions:
GGML_BACKEND_SPLIT_AXIS_0 = 0,
GGML_BACKEND_SPLIT_AXIS_1 = 1,
GGML_BACKEND_SPLIT_AXIS_2 = 2,
GGML_BACKEND_SPLIT_AXIS_3 = 3,
GGML_BACKEND_SPLIT_AXIS_MIRRORED = 10, // all values on all backends
GGML_BACKEND_SPLIT_AXIS_PARTIAL = 11, // each backend has a partial sum
// for internal bookkeeping only:
GGML_BACKEND_SPLIT_AXIS_NONE = 98,
GGML_BACKEND_SPLIT_AXIS_UNKNOWN = 99,
};
GGML_API const char * ggml_backend_meta_split_axis_name(enum ggml_backend_meta_split_axis split_axis);
struct ggml_backend_meta_split_state {
enum ggml_backend_meta_split_axis axis;
// for tensors with axis >= 0 && axis < GGML_MAX_DIMS:
// - each device has a slice of the tensor along the split axis
// - most tensors have n_segments == 1 and a contiguous slice of the tensor data
// - some tensors have an inhomogeneous data layout along the split axis,
// those tensors are divided into segments which are each individually split across devices
// - ne has one entry per segment and device that add up to ggml_tensor::ne for that axis,
// the outer/inner loops are over segments/devices like [seg0_dev0, seg0_dev1, seg1_dev0, seg1_dev1],
// - for example, a transformer may have a fused QKV matrix rather than 3 matrices, those would be 3 separate segments
// that each need to be split individually across devices so that each device gets a slice of Q, K, and V
int64_t ne[16*GGML_BACKEND_META_MAX_DEVICES];
uint32_t n_segments;
};
// function to assign split states for statically allocated tensors, compute tensor split states will be assigned to be compatible:
typedef struct ggml_backend_meta_split_state(*ggml_backend_meta_get_split_state_t)(const struct ggml_tensor * tensor, void * userdata);
// create a new meta device from "simple" devices, meta buffer type/buffer/backend is then derived from this:
// TODO: this looks a bit strange - a backend API creates a device. I think we should try
// to express this as a backend registry functionality instead
GGML_API ggml_backend_dev_t ggml_backend_meta_device(
ggml_backend_dev_t * devs, size_t n_devs, ggml_backend_meta_get_split_state_t get_split_state, void * get_split_state_ud);
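
A sketch of a `get_split_state` callback under the scheme above: mirror 1D tensors and slice 2D weights along axis 1, with a single segment per tensor (the policy and names are illustrative, not the strategy any backend ships):

```cpp
// Illustrative policy only; real integrations inspect tensor roles/names.
// userdata carries the device count for this sketch.
static struct ggml_backend_meta_split_state my_split_state(
        const struct ggml_tensor * tensor, void * userdata) {
    const size_t n_devs = *(const size_t *) userdata;

    struct ggml_backend_meta_split_state st = {};
    st.n_segments = 1; // one contiguous slice per device (no fused QKV here)

    if (ggml_n_dims(tensor) < 2) {
        st.axis = GGML_BACKEND_SPLIT_AXIS_MIRRORED; // all values on all backends
        return st;
    }

    st.axis = GGML_BACKEND_SPLIT_AXIS_1;
    // per-device slice sizes along axis 1 that add up to tensor->ne[1]
    const int64_t base = tensor->ne[1] / (int64_t) n_devs;
    const int64_t rem  = tensor->ne[1] % (int64_t) n_devs;
    for (size_t d = 0; d < n_devs; d++) {
        st.ne[d] = base + ((int64_t) d < rem ? 1 : 0);
    }
    return st;
}

// usage: ggml_backend_dev_t meta = ggml_backend_meta_device(devs, n_devs, my_split_state, &n_devs);
```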
//
// Utils
//

View File

@@ -6,9 +6,9 @@
extern "C" {
#endif
#define RPC_PROTO_MAJOR_VERSION 3
#define RPC_PROTO_MINOR_VERSION 6
#define RPC_PROTO_PATCH_VERSION 1
#define RPC_PROTO_MAJOR_VERSION 4
#define RPC_PROTO_MINOR_VERSION 0
#define RPC_PROTO_PATCH_VERSION 0
#ifdef __cplusplus
static_assert(GGML_OP_COUNT == 96, "GGML_OP_COUNT has changed - update RPC_PROTO_PATCH_VERSION");

View File

@@ -1773,8 +1773,32 @@ extern "C" {
int n_dims,
int mode);
// custom RoPE
// RoPE operations with extended options
// a is the input tensor to apply RoPE to, shape [n_embd, n_head, n_token]
// b is an int32 vector with size n_token
// c is optional freq factors (e.g. phi3-128k)
// mode can be GGML_ROPE_TYPE_NORMAL or NEOX; for MROPE and VISION mode, use ggml_rope_multi
//
// pseudo-code for computing theta:
// for i in [0, n_dims/2):
// theta[i] = b[i] * powf(freq_base, -2.0 * i / n_dims);
// theta[i] = theta[i] / c[i]; # if c is provided, divide theta by c
// theta[i] = rope_yarn(theta[i], ...); # note: theta = theta * freq_scale is applied here
//
// other params are used by YaRN RoPE scaling, these default values will disable YaRN:
// freq_scale = 1.0f
// ext_factor = 0.0f
// attn_factor = 1.0f
// beta_fast = 0.0f
// beta_slow = 0.0f
//
// example:
// (marking: c = cos, s = sin, 0 = unrotated)
// given a single head with size = 8 --> [00000000]
// GGML_ROPE_TYPE_NORMAL n_dims = 4 --> [cscs0000]
// GGML_ROPE_TYPE_NORMAL n_dims = 8 --> [cscscscs]
// GGML_ROPE_TYPE_NEOX n_dims = 4 --> [ccss0000]
// GGML_ROPE_TYPE_NEOX n_dims = 8 --> [ccccssss]
GGML_API struct ggml_tensor * ggml_rope_ext(
struct ggml_context * ctx,
struct ggml_tensor * a,
@@ -1790,6 +1814,36 @@ extern "C" {
float beta_fast,
float beta_slow);
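
A minimal call matching the YaRN-disabling defaults listed above, where `q` is a `[n_embd_head, n_head, n_tokens]` tensor and `pos` an int32 vector of size `n_tokens` (names are illustrative):

```cpp
// rotate the full head with standard interleaved ordering, YaRN disabled
struct ggml_tensor * q_rot = ggml_rope_ext(
    ctx, q, pos, /*c (freq factors) =*/ NULL,
    /*n_dims =*/ n_embd_head, GGML_ROPE_TYPE_NORMAL, /*n_ctx_orig =*/ 0,
    /*freq_base =*/ 10000.0f, /*freq_scale =*/ 1.0f,
    /*ext_factor =*/ 0.0f, /*attn_factor =*/ 1.0f,
    /*beta_fast =*/ 0.0f, /*beta_slow =*/ 0.0f);
```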
// multi-dimensional RoPE, for Qwen-VL and similar vision models
// mode can be either VISION, MROPE, IMROPE, cannot be combined with NORMAL or NEOX
// sections specify how many dimensions to rotate in each section:
// section length is equivalent to the number of cos/sin pairs, NOT the number of dims
// (i.e. the sum of the 4 sections is expected to be n_dims/2)
// trailing sections can be 0, meaning they are ignored
// all other options are identical to ggml_rope_ext
//
// important note:
// - NEOX ordering is automatically applied and cannot be disabled for MROPE and VISION
// if you need normal ordering, there are 2 methods:
// (1) split the tensor manually using ggml_view
// (2) permute the weight upon conversion
// - for VISION, n_dims must be head_size/2
//
// example M-RoPE:
// given sections = [t=4, y=2, x=2, 0]
// given a single head with size = 18 --> [000000000000000000]
// GGML_ROPE_TYPE_MROPE n_dims = 16 --> [ttttyyxxttttyyxx00] (cos/sin are applied in NEOX ordering)
// GGML_ROPE_TYPE_IMROPE n_dims = 16 --> [ttyxttyxttyxttyx00] (interleaved M-RoPE, still NEOX ordering)
// note: the theta for each dim is computed the same way as ggml_rope_ext, no matter the section
// in other words, idx used for theta: [0123456789... until n_dims/2], not reset for each section
//
// example vision RoPE:
// given sections = [y=4, x=4, 0, 0] (last 2 sections are ignored)
// given a single head with size = 8 --> [00000000]
// GGML_ROPE_TYPE_VISION n_dims = 4 --> [yyyyxxxx]
// other values of n_dims are untested and result in undefined behavior
// note: unlike MROPE, the theta for each dim is computed differently for each section
// in other words, idx used for theta: [0123] for y section, then [0123] for x section
GGML_API struct ggml_tensor * ggml_rope_multi(
struct ggml_context * ctx,
struct ggml_tensor * a,

View File

@@ -2,6 +2,7 @@
#include "ggml-backend-impl.h"
#include "ggml.h"
#include "ggml-impl.h"
#include <assert.h>
#include <limits.h>
#include <stdarg.h>

View File

@@ -5,9 +5,6 @@
#include "ggml-alloc.h"
#include "ggml-cpp.h"
// TODO: tmp
#include "ggml-ext.h"
#include <algorithm>
#include <cassert>
#include <cmath>
@@ -1422,22 +1419,48 @@ struct ggml_backend_meta_context {
size_t max_tmp_size = 0;
size_t max_subgraphs = 0;
void * comm_ctx = nullptr;
ggml_backend_comm_allreduce_tensor_t comm_allreduce = nullptr;
ggml_backend_meta_context(ggml_backend_dev_t meta_dev, const char * params) {
const size_t n_devs = ggml_backend_meta_dev_n_devs(meta_dev);
name = "Meta(";
std::vector<ggml_backend_t> simple_backends;
backend_configs.reserve(n_devs);
simple_backends.reserve(n_devs);
for (size_t i = 0; i < n_devs; i++) {
ggml_backend_dev_t simple_dev = ggml_backend_meta_dev_simple_dev(meta_dev, i);
if (i > 0) {
name += ",";
}
name += ggml_backend_dev_name(simple_dev);
backend_configs.emplace_back(ggml_backend_dev_init(simple_dev, params));
simple_backends.push_back(ggml_backend_dev_init(simple_dev, params));
backend_configs.emplace_back(simple_backends.back());
}
name += ")";
if (n_devs > 1) {
ggml_backend_comm_init_t comm_init = (ggml_backend_comm_init_t) ggml_backend_reg_get_proc_address(
ggml_backend_dev_backend_reg(ggml_backend_get_device(simple_backends[0])), "ggml_backend_comm_init");
if (comm_init != nullptr) {
comm_ctx = comm_init(simple_backends.data(), simple_backends.size());
}
}
if (comm_ctx != nullptr) {
comm_allreduce = (ggml_backend_comm_allreduce_tensor_t)
ggml_backend_reg_get_proc_address(ggml_backend_dev_backend_reg(
ggml_backend_get_device(simple_backends[0])), "ggml_backend_comm_allreduce_tensor");
GGML_ASSERT(comm_allreduce != nullptr);
}
}
~ggml_backend_meta_context() {
if (comm_ctx != nullptr) {
ggml_backend_comm_free_t comm_free = (ggml_backend_comm_free_t) ggml_backend_reg_get_proc_address(
ggml_backend_dev_backend_reg(ggml_backend_get_device(backend_configs[0].backend)), "ggml_backend_comm_free");
GGML_ASSERT(comm_free != nullptr);
comm_free(comm_ctx);
}
for (auto & bc : backend_configs) {
ggml_backend_free(bc.backend);
}
@@ -1848,20 +1871,15 @@ static enum ggml_status ggml_backend_meta_graph_compute(ggml_backend_t backend,
if (n_backends > 1 && i < n_subgraphs - 1) {
bool backend_allreduce_success = false;
ggml_backend_allreduce_tensor_t allreduce_tensor = (ggml_backend_allreduce_tensor_t) ggml_backend_reg_get_proc_address(
ggml_backend_dev_backend_reg(ggml_backend_get_device(backend_ctx->backend_configs[0].backend)), "ggml_backend_allreduce_tensor");
if (allreduce_tensor) {
std::vector<ggml_backend_t> backends;
backends.reserve(n_backends);
if (backend_ctx->comm_ctx) {
std::vector<ggml_tensor *> nodes;
nodes.reserve(n_backends);
for (size_t j = 0; j < n_backends; j++) {
auto & bcj = backend_ctx->backend_configs[j];
backends.push_back(bcj.backend);
ggml_cgraph * cgraph_ij = bcj.cgraphs[i].cgraph_main;
nodes.push_back(cgraph_ij->nodes[cgraph_ij->n_nodes-1]);
}
backend_allreduce_success = allreduce_tensor(backends.data(), nodes.data(), n_backends);
backend_allreduce_success = backend_ctx->comm_allreduce(backend_ctx->comm_ctx, nodes.data());
}
if (!backend_allreduce_success) {

View File

@@ -1030,6 +1030,8 @@ void ggml_backend_sched_split_graph(ggml_backend_sched_t sched, struct ggml_cgra
GGML_ABORT("%s: failed to initialize context\n", __func__);
}
graph->uid = ggml_graph_next_uid();
// pass 1: assign backends to ops with pre-allocated inputs
for (int i = 0; i < graph->n_leafs; i++) {
struct ggml_tensor * leaf = graph->leafs[i];
@@ -1477,6 +1479,11 @@ void ggml_backend_sched_split_graph(ggml_backend_sched_t sched, struct ggml_cgra
assert(graph_copy->size > graph_copy->n_leafs);
graph_copy->leafs[graph_copy->n_leafs++] = leaf;
}
// set ids for all splits
for (int i = 0; i < sched->n_splits; ++i) {
sched->splits[i].graph.uid = ggml_graph_next_uid();
}
}
static bool ggml_backend_sched_alloc_splits(ggml_backend_sched_t sched) {

View File

@@ -783,6 +783,7 @@ void ggml_vec_dot_nvfp4_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
const int8x16_t q4_lo_1 = ggml_vqtbl1q_s8(values, vandq_u8 (q4bits_1, m4b));
const int8x16_t q4_hi_1 = ggml_vqtbl1q_s8(values, vshrq_n_u8(q4bits_1, 4));
#if defined(__ARM_FEATURE_DOTPROD)
const int8x16_t q8_0a = vld1q_s8(y[2*ib].qs);
const int8x16_t q8_0b = vld1q_s8(y[2*ib].qs + 16);
const int8x16_t q8_lo_0 = vcombine_s8(vget_low_s8(q8_0a), vget_low_s8(q8_0b));
@@ -794,15 +795,40 @@ void ggml_vec_dot_nvfp4_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
const int8x16_t q8_hi_1 = vcombine_s8(vget_high_s8(q8_1a), vget_high_s8(q8_1b));
const int32x4_t p0 = vaddq_s32(
ggml_vdotq_s32(vdupq_n_s32(0), q4_lo_0, q8_lo_0),
ggml_vdotq_s32(vdupq_n_s32(0), q4_hi_0, q8_hi_0));
vdotq_s32(vdupq_n_s32(0), q4_lo_0, q8_lo_0),
vdotq_s32(vdupq_n_s32(0), q4_hi_0, q8_hi_0));
const int32x4_t p1 = vaddq_s32(
ggml_vdotq_s32(vdupq_n_s32(0), q4_lo_1, q8_lo_1),
ggml_vdotq_s32(vdupq_n_s32(0), q4_hi_1, q8_hi_1));
vdotq_s32(vdupq_n_s32(0), q4_lo_1, q8_lo_1),
vdotq_s32(vdupq_n_s32(0), q4_hi_1, q8_hi_1));
const int32x4_t sums = vpaddq_s32(p0, p1);
const int32x4_t sumi = vpaddq_s32(p0, p1);
#else
const int8x8_t q4_0_lo = vget_low_s8(q4_lo_0);
const int8x8_t q4_0_hi = vget_low_s8(q4_hi_0);
const int8x8_t q4_1_lo = vget_high_s8(q4_lo_0);
const int8x8_t q4_1_hi = vget_high_s8(q4_hi_0);
const int8x8_t q4_2_lo = vget_low_s8(q4_lo_1);
const int8x8_t q4_2_hi = vget_low_s8(q4_hi_1);
const int8x8_t q4_3_lo = vget_high_s8(q4_lo_1);
const int8x8_t q4_3_hi = vget_high_s8(q4_hi_1);
const int8x8_t q8_0_lo = vld1_s8(y[2*ib].qs);
const int8x8_t q8_0_hi = vld1_s8(y[2*ib].qs + 8);
const int8x8_t q8_1_lo = vld1_s8(y[2*ib].qs + 16);
const int8x8_t q8_1_hi = vld1_s8(y[2*ib].qs + 24);
const int8x8_t q8_2_lo = vld1_s8(y[2*ib+1].qs);
const int8x8_t q8_2_hi = vld1_s8(y[2*ib+1].qs + 8);
const int8x8_t q8_3_lo = vld1_s8(y[2*ib+1].qs + 16);
const int8x8_t q8_3_hi = vld1_s8(y[2*ib+1].qs + 24);
const int32x4_t sumi = (int32x4_t){
vaddvq_s32(ggml_nvfp4_dot8(q4_0_lo, q8_0_lo, q4_0_hi, q8_0_hi)),
vaddvq_s32(ggml_nvfp4_dot8(q4_1_lo, q8_1_lo, q4_1_hi, q8_1_hi)),
vaddvq_s32(ggml_nvfp4_dot8(q4_2_lo, q8_2_lo, q4_2_hi, q8_2_hi)),
vaddvq_s32(ggml_nvfp4_dot8(q4_3_lo, q8_3_lo, q4_3_hi, q8_3_hi)),
};
#endif
// Decode 4 UE4M3 scales to f32 and multiply with q8 scales
const float dy0 = GGML_CPU_FP16_TO_FP32(y[2*ib].d);
const float dy1 = GGML_CPU_FP16_TO_FP32(y[2*ib+1].d);
const float32x4_t nvsc = {
@@ -813,7 +839,7 @@ void ggml_vec_dot_nvfp4_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
};
const float32x4_t scales = vmulq_f32(nvsc, (float32x4_t){dy0, dy0, dy1, dy1});
acc = vfmaq_f32(acc, vcvtq_f32_s32(sums), scales);
acc = vfmaq_f32(acc, vcvtq_f32_s32(sumi), scales);
}
sumf = vaddvq_f32(acc);
#else

File diff suppressed because it is too large

View File

@@ -306,6 +306,7 @@ inline static uint8x16_t ggml_vqtbl1q_u8(uint8x16_t a, uint8x16_t b) {
#if !defined(__ARM_FEATURE_DOTPROD)
// NOTE: this fallback produces the same total sum as native vdotq_s32 but with different per-lane grouping — do not use when individual lane values matter.
inline static int32x4_t ggml_vdotq_s32(int32x4_t acc, int8x16_t a, int8x16_t b) {
const int16x8_t p0 = vmull_s8(vget_low_s8 (a), vget_low_s8 (b));
const int16x8_t p1 = vmull_s8(vget_high_s8(a), vget_high_s8(b));
@@ -319,6 +320,15 @@ inline static int32x4_t ggml_vdotq_s32(int32x4_t acc, int8x16_t a, int8x16_t b)
#endif // !defined(__ARM_FEATURE_DOTPROD)
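// Non-dotprod helper for the NVFP4 path below: widening 8x8->16-bit multiplies
// followed by pairwise long additions; each of the four 32-bit lanes holds the
// sum of four int8 products.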
static inline int32x4_t ggml_nvfp4_dot8(const int8x8_t q4_lo, const int8x8_t q8_lo,
const int8x8_t q4_hi, const int8x8_t q8_hi) {
const int16x8_t p_lo = vmull_s8(q4_lo, q8_lo);
const int16x8_t p_hi = vmull_s8(q4_hi, q8_hi);
const int32x4_t sum_lo = vpaddlq_s16(p_lo);
const int32x4_t sum_hi = vpaddlq_s16(p_hi);
return vaddq_s32(sum_lo, sum_hi);
}
#endif // defined(__ARM_NEON)
#ifdef __wasm_simd128__

View File

@@ -109,6 +109,96 @@ static void simd_gemm(
C += N;
}
}
#elif defined(GGML_SIMD) && defined(__riscv_v_intrinsic)
// RM accumulators + 1 B vector = RM + 1 <= 8 => RM <= 7
// Microkernel: C[RM x vl] += A[RM x K] * B[K x N]
template <int RM>
static inline void rvv_simd_gemm_ukernel(
float * GGML_RESTRICT C,
const float * GGML_RESTRICT A,
const float * GGML_RESTRICT B,
int K, int N, size_t vl)
{
static_assert(RM >= 1 && RM <= 7, "RM must be 1..7 for LMUL=4");
vfloat32m4_t acc_0 = __riscv_vle32_v_f32m4(C + 0 * N, vl);
vfloat32m4_t acc_1, acc_2, acc_3, acc_4, acc_5, acc_6;
if constexpr (RM > 1) acc_1 = __riscv_vle32_v_f32m4(C + 1 * N, vl);
if constexpr (RM > 2) acc_2 = __riscv_vle32_v_f32m4(C + 2 * N, vl);
if constexpr (RM > 3) acc_3 = __riscv_vle32_v_f32m4(C + 3 * N, vl);
if constexpr (RM > 4) acc_4 = __riscv_vle32_v_f32m4(C + 4 * N, vl);
if constexpr (RM > 5) acc_5 = __riscv_vle32_v_f32m4(C + 5 * N, vl);
if constexpr (RM > 6) acc_6 = __riscv_vle32_v_f32m4(C + 6 * N, vl);
for (int kk = 0; kk < K; kk++) {
vfloat32m4_t b_0 = __riscv_vle32_v_f32m4(B + kk * N, vl);
acc_0 = __riscv_vfmacc_vf_f32m4(acc_0, A[0 * K + kk], b_0, vl);
if constexpr (RM > 1) acc_1 = __riscv_vfmacc_vf_f32m4(acc_1, A[1 * K + kk], b_0, vl);
if constexpr (RM > 2) acc_2 = __riscv_vfmacc_vf_f32m4(acc_2, A[2 * K + kk], b_0, vl);
if constexpr (RM > 3) acc_3 = __riscv_vfmacc_vf_f32m4(acc_3, A[3 * K + kk], b_0, vl);
if constexpr (RM > 4) acc_4 = __riscv_vfmacc_vf_f32m4(acc_4, A[4 * K + kk], b_0, vl);
if constexpr (RM > 5) acc_5 = __riscv_vfmacc_vf_f32m4(acc_5, A[5 * K + kk], b_0, vl);
if constexpr (RM > 6) acc_6 = __riscv_vfmacc_vf_f32m4(acc_6, A[6 * K + kk], b_0, vl);
}
__riscv_vse32_v_f32m4(C + 0 * N, acc_0, vl);
if constexpr (RM > 1) __riscv_vse32_v_f32m4(C + 1 * N, acc_1, vl);
if constexpr (RM > 2) __riscv_vse32_v_f32m4(C + 2 * N, acc_2, vl);
if constexpr (RM > 3) __riscv_vse32_v_f32m4(C + 3 * N, acc_3, vl);
if constexpr (RM > 4) __riscv_vse32_v_f32m4(C + 4 * N, acc_4, vl);
if constexpr (RM > 5) __riscv_vse32_v_f32m4(C + 5 * N, acc_5, vl);
if constexpr (RM > 6) __riscv_vse32_v_f32m4(C + 6 * N, acc_6, vl);
}
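// Compile-time recursion: peels RM downward until it matches the runtime row
// remainder, so the tail rows reuse the same fixed-RM microkernel above.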
template <int RM>
static inline void rvv_simd_gemm_dispatch_tail(
float * GGML_RESTRICT C,
const float * GGML_RESTRICT A,
const float * GGML_RESTRICT B,
int K, int N, int KN, int remaining_rows)
{
if constexpr (RM > 0) {
if (remaining_rows == RM) {
int64_t jj = 0;
for (; jj + KN <= N; jj += KN) {
rvv_simd_gemm_ukernel<RM>(C + jj, A, B + jj, K, N, KN);
}
if (jj < N) {
rvv_simd_gemm_ukernel<RM>(C + jj, A, B + jj, K, N, N - jj);
}
} else {
rvv_simd_gemm_dispatch_tail<RM - 1>(C, A, B, K, N, KN, remaining_rows);
}
}
}
static constexpr int GEMM_RM = 7;
// C[M x N] += A[M x K] * B[K x N]
static void simd_gemm(
float * GGML_RESTRICT C,
const float * GGML_RESTRICT A,
const float * GGML_RESTRICT B,
int M, int K, int N)
{
const int KN = (int)__riscv_vlenb();
int64_t ii = 0;
for (; ii + GEMM_RM <= M; ii += GEMM_RM) {
int64_t jj = 0;
for (; jj + KN <= N; jj += KN) {
rvv_simd_gemm_ukernel<GEMM_RM>(C + jj, A, B + jj, K, N, KN);
}
if (jj < N) {
rvv_simd_gemm_ukernel<GEMM_RM>(C + jj, A, B + jj, K, N, N - jj);
}
A += GEMM_RM * K;
C += GEMM_RM * N;
}
int remaining_rows = M - ii;
rvv_simd_gemm_dispatch_tail<GEMM_RM - 1>(C, A, B, K, N, KN, remaining_rows);
}
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC diagnostic pop

View File

@@ -58,26 +58,48 @@ void argsort_f32_i32_cuda_cub(ggml_cuda_pool & pool,
size_t temp_storage_bytes = 0;
bool is_capturing = false;
#ifdef USE_CUDA_GRAPH
// Currently (confirmed for CCCL <= 3.2) DeviceSegmentedSort does not support stream capture, while DeviceSegmentedRadixSort does.
// See https://github.com/NVIDIA/cccl/issues/5661#issuecomment-3229037149
// TODO: constrain this to the CCCL versions that have this issue once it's resolved in a future CCCL release.
cudaStreamCaptureStatus capture_status;
CUDA_CHECK(cudaStreamIsCapturing(stream, &capture_status));
is_capturing = (capture_status != cudaStreamCaptureStatusNone);
#endif // USE_CUDA_GRAPH
if (order == GGML_SORT_ORDER_ASC) {
if (nrows == 1) {
CUDA_CHECK(DeviceRadixSort::SortPairs(nullptr, temp_storage_bytes, temp_keys, temp_keys, // keys (in-place)
temp_indices, dst, // values (indices)
ncols, 0, sizeof(float) * 8, stream));
temp_indices, dst, // values (indices)
ncols, 0, sizeof(float) * 8, stream));
} else if (is_capturing) {
CUDA_CHECK(DeviceSegmentedRadixSort::SortPairs(
nullptr, temp_storage_bytes, temp_keys, temp_keys, // keys (in-place)
temp_indices, dst, // values (indices)
ncols * nrows, nrows, // num items, num segments
offset_iterator, offset_iterator + 1, 0, sizeof(float) * 8, stream));
} else {
CUDA_CHECK(DeviceSegmentedSort::SortPairs(nullptr, temp_storage_bytes, temp_keys, temp_keys, // keys (in-place)
temp_indices, dst, // values (indices)
ncols * nrows, nrows, // num items, num segments
offset_iterator, offset_iterator + 1, stream));
CUDA_CHECK(DeviceSegmentedSort::SortPairs(nullptr, temp_storage_bytes, temp_keys,
temp_keys, // keys (in-place)
temp_indices, dst, // values (indices)
ncols * nrows, nrows, // num items, num segments
offset_iterator, offset_iterator + 1, stream));
}
} else {
if (nrows == 1) {
CUDA_CHECK(DeviceRadixSort::SortPairsDescending(nullptr, temp_storage_bytes, temp_keys, temp_keys, // keys (in-place)
temp_indices, dst, // values (indices)
ncols, 0, sizeof(float) * 8, stream));
CUDA_CHECK(DeviceRadixSort::SortPairsDescending(nullptr, temp_storage_bytes, temp_keys,
temp_keys, // keys (in-place)
temp_indices, dst, // values (indices)
ncols, 0, sizeof(float) * 8, stream));
} else if (is_capturing) {
CUDA_CHECK(DeviceSegmentedRadixSort::SortPairsDescending(
nullptr, temp_storage_bytes, temp_keys, temp_keys, temp_indices, dst, ncols * nrows, nrows,
offset_iterator, offset_iterator + 1, 0, sizeof(float) * 8, stream));
} else {
CUDA_CHECK(DeviceSegmentedSort::SortPairsDescending(nullptr, temp_storage_bytes, temp_keys, temp_keys, temp_indices,
dst, ncols * nrows, nrows, offset_iterator, offset_iterator + 1,
stream));
CUDA_CHECK(DeviceSegmentedSort::SortPairsDescending(nullptr, temp_storage_bytes, temp_keys, temp_keys,
temp_indices, dst, ncols * nrows, nrows,
offset_iterator, offset_iterator + 1, stream));
}
}
@@ -86,22 +108,33 @@ void argsort_f32_i32_cuda_cub(ggml_cuda_pool & pool,
if (order == GGML_SORT_ORDER_ASC) {
if (nrows == 1) {
CUDA_CHECK(DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes, temp_keys, temp_keys, // keys (in-place)
temp_indices, dst, // values (indices)
ncols, 0, sizeof(float) * 8, stream));
CUDA_CHECK(DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes, temp_keys,
temp_keys, // keys (in-place)
temp_indices, dst, // values (indices)
ncols, 0, sizeof(float) * 8, stream));
} else if (is_capturing) {
CUDA_CHECK(DeviceSegmentedRadixSort::SortPairs(d_temp_storage, temp_storage_bytes, temp_keys, temp_keys,
temp_indices, dst, ncols * nrows, nrows, offset_iterator,
offset_iterator + 1, 0, sizeof(float) * 8, stream));
} else {
CUDA_CHECK(DeviceSegmentedSort::SortPairs(d_temp_storage, temp_storage_bytes, temp_keys, temp_keys, temp_indices, dst,
ncols * nrows, nrows, offset_iterator, offset_iterator + 1, stream));
CUDA_CHECK(DeviceSegmentedSort::SortPairs(d_temp_storage, temp_storage_bytes, temp_keys, temp_keys,
temp_indices, dst, ncols * nrows, nrows, offset_iterator,
offset_iterator + 1, stream));
}
} else {
if (nrows == 1) {
CUDA_CHECK(DeviceRadixSort::SortPairsDescending(d_temp_storage, temp_storage_bytes, temp_keys, temp_keys, // keys (in-place)
temp_indices, dst, // values (indices)
ncols, 0, sizeof(float) * 8, stream));
CUDA_CHECK(DeviceRadixSort::SortPairsDescending(d_temp_storage, temp_storage_bytes, temp_keys,
temp_keys, // keys (in-place)
temp_indices, dst, // values (indices)
ncols, 0, sizeof(float) * 8, stream));
} else if (is_capturing) {
CUDA_CHECK(DeviceSegmentedRadixSort::SortPairsDescending(
d_temp_storage, temp_storage_bytes, temp_keys, temp_keys, temp_indices, dst, ncols * nrows, nrows,
offset_iterator, offset_iterator + 1, 0, sizeof(float) * 8, stream));
} else {
CUDA_CHECK(DeviceSegmentedSort::SortPairsDescending(d_temp_storage, temp_storage_bytes, temp_keys, temp_keys,
temp_indices, dst, ncols * nrows, nrows, offset_iterator,
offset_iterator + 1, stream));
CUDA_CHECK(DeviceSegmentedSort::SortPairsDescending(d_temp_storage, temp_storage_bytes, temp_keys,
temp_keys, temp_indices, dst, ncols * nrows, nrows,
offset_iterator, offset_iterator + 1, stream));
}
}
}

View File

@@ -924,6 +924,13 @@ struct ggml_cuda_type_traits<GGML_TYPE_F16> {
static constexpr int qr = 1;
};
template<>
struct ggml_cuda_type_traits<GGML_TYPE_Q1_0> {
static constexpr int qk = QK1_0;
static constexpr int qr = QR1_0;
static constexpr int qi = QI1_0;
};
template<>
struct ggml_cuda_type_traits<GGML_TYPE_Q4_0> {
static constexpr int qk = QK4_0;
@@ -1092,10 +1099,6 @@ struct ggml_cuda_device_info {
cuda_device_info devices[GGML_CUDA_MAX_DEVICES] = {};
std::array<float, GGML_CUDA_MAX_DEVICES> default_tensor_split = {};
#ifdef GGML_USE_NCCL
ncclComm_t comms[GGML_CUDA_MAX_DEVICES];
#endif // GGML_USE_NCCL
};
const ggml_cuda_device_info & ggml_cuda_info();
@@ -1183,6 +1186,7 @@ struct ggml_cuda_graph {
std::vector<cudaGraphNode_t> nodes;
bool disable_due_to_gpu_arch = false;
bool warmup_complete = false;
uint64_t uid = 0;
struct node_properties {
ggml_tensor node;
void * node_src_data_ptrs[GGML_MAX_SRC];

View File

@@ -711,6 +711,8 @@ to_bf16_cuda_t ggml_get_to_bf16_cuda(ggml_type type) {
to_fp16_cuda_t ggml_get_to_fp16_cuda(ggml_type type) {
switch (type) {
case GGML_TYPE_Q1_0:
return dequantize_block_cont_cuda<QK1_0, QR1_0, dequantize_q1_0>;
case GGML_TYPE_Q4_0:
return dequantize_row_q4_0_cuda;
case GGML_TYPE_Q4_1:
@@ -767,6 +769,8 @@ to_fp16_cuda_t ggml_get_to_fp16_cuda(ggml_type type) {
to_fp32_cuda_t ggml_get_to_fp32_cuda(ggml_type type) {
switch (type) {
case GGML_TYPE_Q1_0:
return dequantize_block_cont_cuda<QK1_0, QR1_0, dequantize_q1_0>;
case GGML_TYPE_Q4_0:
return dequantize_row_q4_0_cuda;
case GGML_TYPE_Q4_1:
@@ -822,6 +826,8 @@ to_fp16_nc_cuda_t ggml_get_to_fp16_nc_cuda(ggml_type type) {
switch (type) {
case GGML_TYPE_F32:
return convert_unary_cuda<float>;
case GGML_TYPE_Q1_0:
return dequantize_block_cuda<QK1_0, QR1_0, dequantize_q1_0>;
case GGML_TYPE_Q4_0:
return dequantize_block_cuda<QK4_0, QR4_0, dequantize_q4_0>;
case GGML_TYPE_Q4_1:
@@ -843,6 +849,8 @@ to_bf16_nc_cuda_t ggml_get_to_bf16_nc_cuda(ggml_type type) {
switch (type) {
case GGML_TYPE_F32:
return convert_unary_cuda<float, nv_bfloat16>;
case GGML_TYPE_Q1_0:
return dequantize_block_cuda<QK1_0, QR1_0, dequantize_q1_0>;
case GGML_TYPE_Q4_0:
return dequantize_block_cuda<QK4_0, QR4_0, dequantize_q4_0>;
case GGML_TYPE_Q4_1:
@@ -864,6 +872,8 @@ to_fp32_nc_cuda_t ggml_get_to_fp32_nc_cuda(ggml_type type) {
switch (type) {
case GGML_TYPE_F16:
return convert_unary_cuda<half, float>;
case GGML_TYPE_Q1_0:
return dequantize_block_cuda<QK1_0, QR1_0, dequantize_q1_0>;
case GGML_TYPE_Q4_0:
return dequantize_block_cuda<QK4_0, QR4_0, dequantize_q4_0>;
case GGML_TYPE_Q4_1:

View File

@@ -1,5 +1,27 @@
#include "common.cuh"
static __device__ __forceinline__ void dequantize_q1_0(const void * vx, const int64_t ib, const int iqs, float2 & v){
const block_q1_0 * x = (const block_q1_0 *) vx;
const float d = x[ib].d;
const int bit_index_0 = iqs;
const int bit_index_1 = iqs + 1;
const int byte_index_0 = bit_index_0 / 8;
const int bit_offset_0 = bit_index_0 % 8;
const int byte_index_1 = bit_index_1 / 8;
const int bit_offset_1 = bit_index_1 % 8;
// Extract bits: 1 = +d, 0 = -d (branchless)
const int bit_0 = (x[ib].qs[byte_index_0] >> bit_offset_0) & 1;
const int bit_1 = (x[ib].qs[byte_index_1] >> bit_offset_1) & 1;
v.x = (2*bit_0 - 1) * d;
v.y = (2*bit_1 - 1) * d;
}
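
For reference, the same sign-bit expansion on the host; a minimal sketch assuming the `block_q1_0` layout used above (one scale `d` plus a packed bitmap `qs`), not the CPU backend's actual kernel:

```cpp
// expand one block of QK1_0 sign bits to floats: bit 1 -> +d, bit 0 -> -d
static void dequantize_block_q1_0_ref(const block_q1_0 * x, float * y) {
    const float d = x->d; // use GGML_FP16_TO_FP32(x->d) if d is stored as ggml_half
    for (int i = 0; i < QK1_0; i++) {
        const int bit = (x->qs[i / 8] >> (i % 8)) & 1;
        y[i] = (2 * bit - 1) * d;
    }
}
```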
static __device__ __forceinline__ void dequantize_q4_0(const void * vx, const int64_t ib, const int iqs, float2 & v){
const block_q4_0 * x = (const block_q4_0 *) vx;

View File

@@ -179,6 +179,10 @@ static void ggml_cuda_get_rows_switch_src0_type(
get_rows_cuda_float((const nv_bfloat16 *) src0_d, src1_d, dst_d,
ne00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb1, nb2, nb3, stream);
break;
case GGML_TYPE_Q1_0:
get_rows_cuda_q<QK1_0, QR1_0, dequantize_q1_0>(src0_d, src1_d, dst_d,
ne00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb1, nb2, nb3, stream);
break;
case GGML_TYPE_Q4_0:
get_rows_cuda_q<QK4_0, QR4_0, dequantize_q4_0>(src0_d, src1_d, dst_d,
ne00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb1, nb2, nb3, stream);

View File

@@ -324,28 +324,22 @@ static ggml_cuda_device_info ggml_cuda_init() {
// configure logging to stdout
// CUBLAS_CHECK(cublasLoggerConfigure(1, 1, 0, nullptr));
for (int id = 0; id < info.device_count; ++id) {
ggml_cuda_set_device(id);
for (int id_other = 0; id_other < info.device_count; ++id_other) {
if (id == id_other) {
continue;
}
int can_access_peer;
CUDA_CHECK(cudaDeviceCanAccessPeer(&can_access_peer, id, id_other));
if (can_access_peer) {
CUDA_CHECK(cudaDeviceEnablePeerAccess(id_other, 0));
if (getenv("GGML_CUDA_P2P") != nullptr) {
for (int id = 0; id < info.device_count; ++id) {
ggml_cuda_set_device(id);
for (int id_other = 0; id_other < info.device_count; ++id_other) {
if (id == id_other) {
continue;
}
int can_access_peer;
CUDA_CHECK(cudaDeviceCanAccessPeer(&can_access_peer, id, id_other));
if (can_access_peer) {
CUDA_CHECK(cudaDeviceEnablePeerAccess(id_other, 0));
}
}
}
}
#ifdef GGML_USE_NCCL
int dev_ids[GGML_CUDA_MAX_DEVICES];
for (int id = 0; id < info.device_count; ++id) {
dev_ids[id] = id;
}
NCCL_CHECK(ncclCommInitAll(info.comms, info.device_count, dev_ids));
#endif // GGML_USE_NCCL
return info;
}
@@ -1125,66 +1119,51 @@ static const ggml_backend_buffer_type_i ggml_backend_cuda_split_buffer_type_inte
/* .is_host = */ ggml_backend_cuda_split_buffer_type_is_host,
};
bool ggml_backend_cuda_allreduce_tensor(ggml_backend_t * backends, struct ggml_tensor ** tensors, size_t n_backends) {
#ifdef GGML_USE_NCCL
const int64_t ne = ggml_nelements(tensors[0]);
// FIXME the input of llm_graph_context::build_in_out_ids can produce a tensor with 0 elements if n_outputs == 0
// This then causes a crash in this function
if (ne == 0) {
return true;
}
for (size_t i = 0; i < n_backends; ++i) {
GGML_ASSERT(tensors[i] != nullptr);
GGML_ASSERT(ggml_nelements(tensors[i]) == ne);
GGML_ASSERT(ggml_is_contiguously_allocated(tensors[i]));
}
struct ggml_backend_cuda_comm_context {
std::vector<ggml_backend_t> backends;
std::vector<ncclComm_t> comms;
const ggml_cuda_device_info info = ggml_cuda_info();
// For small tensors, simply reduce them as FP32.
// The following heuristic for how "small" a tensor should be is based on RTX 4090s connected via 16x PCIe 4.0.
if ((n_backends <= 2 && ne < 32768) || (n_backends == 3 && ne < 131072) || (n_backends >= 4 && ne < 262144)) {
NCCL_CHECK(ncclGroupStart());
for (size_t i = 0; i < n_backends; ++i) {
ggml_backend_cuda_context * cuda_ctx = (ggml_backend_cuda_context *) backends[i]->context;
NCCL_CHECK(ncclAllReduce(tensors[i]->data, tensors[i]->data, ne, ncclFloat, ncclSum, info.comms[cuda_ctx->device], cuda_ctx->stream()));
~ggml_backend_cuda_comm_context() {
for (ncclComm_t comm : comms) {
NCCL_CHECK(ncclCommDestroy(comm));
}
NCCL_CHECK(ncclGroupEnd());
return true;
}
};
#endif // GGML_USE_NCCL
// For large tensors it's faster to compress them to BF16 for the reduction:
to_bf16_cuda_t to_bf16 = ggml_get_to_bf16_cuda(GGML_TYPE_F32);
to_fp32_cuda_t to_fp32 = ggml_get_to_fp32_cuda(GGML_TYPE_BF16);
static void ggml_backend_cuda_comm_free(void * comm_ctx_v) {
#ifdef GGML_USE_NCCL
if (comm_ctx_v == nullptr) {
return;
}
ggml_backend_cuda_comm_context * comm_ctx = (ggml_backend_cuda_comm_context *) comm_ctx_v;
delete comm_ctx;
#else
GGML_UNUSED(comm_ctx_v);
#endif // GGML_USE_NCCL
}
ggml_cuda_pool_alloc<nv_bfloat16> tmp[GGML_CUDA_MAX_DEVICES];
for (size_t i = 0; i < n_backends; ++i) {
static void * ggml_backend_cuda_comm_init(ggml_backend_t * backends, size_t n_backends) {
#ifdef GGML_USE_NCCL
for (size_t i = 0; i < n_backends; i++) {
if (!ggml_backend_is_cuda(backends[i])) {
return nullptr;
}
}
ggml_backend_cuda_comm_context * ret = new ggml_backend_cuda_comm_context;
std::vector<int> dev_ids;
ret->backends.reserve(n_backends);
dev_ids.reserve(n_backends);
for (size_t i = 0; i < n_backends; i++) {
ret->backends.push_back(backends[i]);
ggml_backend_cuda_context * cuda_ctx = (ggml_backend_cuda_context *) backends[i]->context;
tmp[i].pool = &cuda_ctx->pool();
tmp[i].alloc(ne);
ggml_cuda_set_device(i);
to_bf16(tensors[i]->data, tmp[i].get(), ne, cuda_ctx->stream());
CUDA_CHECK(cudaGetLastError());
dev_ids.push_back(cuda_ctx->device);
}
NCCL_CHECK(ncclGroupStart());
for (size_t i = 0; i < n_backends; ++i) {
ggml_backend_cuda_context * cuda_ctx = (ggml_backend_cuda_context *) backends[i]->context;
NCCL_CHECK(ncclAllReduce(tmp[i].get(), tmp[i].get(), ne, ncclBfloat16, ncclSum, info.comms[cuda_ctx->device], cuda_ctx->stream()));
}
NCCL_CHECK(ncclGroupEnd());
for (size_t i = 0; i < n_backends; ++i) {
ggml_backend_cuda_context * cuda_ctx = (ggml_backend_cuda_context *) backends[i]->context;
ggml_cuda_set_device(i);
to_fp32(tmp[i].get(), (float *) tensors[i]->data, ne, cuda_ctx->stream());
CUDA_CHECK(cudaGetLastError());
}
return true;
ret->comms.resize(n_backends);
NCCL_CHECK(ncclCommInitAll(ret->comms.data(), n_backends, dev_ids.data()));
return ret;
#else
// If NCCL is installed it is used by default for optimal performance.
// However, NVIDIA does not distribute NCCL with CUDA so users may be unwittingly missing this package.
@@ -1197,7 +1176,76 @@ bool ggml_backend_cuda_allreduce_tensor(ggml_backend_t * backends, struct ggml_t
warning_printed = true;
}
#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
GGML_UNUSED_VARS(backends, tensors, n_backends);
GGML_UNUSED_VARS(backends, n_backends);
return nullptr;
#endif // GGML_USE_NCCL
}
static bool ggml_backend_cuda_comm_allreduce_tensor(void * comm_ctx_v, struct ggml_tensor ** tensors) {
#ifdef GGML_USE_NCCL
const int64_t ne = ggml_nelements(tensors[0]);
// FIXME the input of llm_graph_context::build_in_out_ids can produce a tensor with 0 elements if n_outputs == 0
// This then causes a crash in this function
if (ne == 0) {
return true;
}
GGML_ASSERT(comm_ctx_v != nullptr);
ggml_backend_cuda_comm_context * comm_ctx = (ggml_backend_cuda_comm_context *) comm_ctx_v;
const size_t n_backends = comm_ctx->backends.size();
for (size_t i = 0; i < n_backends; ++i) {
GGML_ASSERT(tensors[i] != nullptr);
GGML_ASSERT(ggml_nelements(tensors[i]) == ne);
GGML_ASSERT(ggml_is_contiguously_allocated(tensors[i]));
}
// For small tensors, simply reduce them as FP32.
// The following heuristic for how "small" a tensor should be is based on RTX 4090s connected via 16x PCIe 4.0.
if ((n_backends <= 2 && ne < 32768) || (n_backends == 3 && ne < 131072) || (n_backends >= 4 && ne < 262144)) {
NCCL_CHECK(ncclGroupStart());
for (size_t i = 0; i < n_backends; ++i) {
ggml_backend_cuda_context * cuda_ctx = (ggml_backend_cuda_context *) comm_ctx->backends[i]->context;
NCCL_CHECK(ncclAllReduce(tensors[i]->data, tensors[i]->data, ne, ncclFloat, ncclSum, comm_ctx->comms[i], cuda_ctx->stream()));
}
NCCL_CHECK(ncclGroupEnd());
return true;
}
// For large tensors it's faster to compress them to BF16 for the reduction:
to_bf16_cuda_t to_bf16 = ggml_get_to_bf16_cuda(GGML_TYPE_F32);
to_fp32_cuda_t to_fp32 = ggml_get_to_fp32_cuda(GGML_TYPE_BF16);
ggml_cuda_pool_alloc<nv_bfloat16> tmp[GGML_CUDA_MAX_DEVICES];
for (size_t i = 0; i < n_backends; ++i) {
ggml_backend_cuda_context * cuda_ctx = (ggml_backend_cuda_context *) comm_ctx->backends[i]->context;
tmp[i].pool = &cuda_ctx->pool();
tmp[i].alloc(ne);
ggml_cuda_set_device(cuda_ctx->device);
to_bf16(tensors[i]->data, tmp[i].get(), ne, cuda_ctx->stream());
CUDA_CHECK(cudaGetLastError());
}
NCCL_CHECK(ncclGroupStart());
for (size_t i = 0; i < n_backends; ++i) {
ggml_backend_cuda_context * cuda_ctx = (ggml_backend_cuda_context *) comm_ctx->backends[i]->context;
NCCL_CHECK(ncclAllReduce(tmp[i].get(), tmp[i].get(), ne, ncclBfloat16, ncclSum, comm_ctx->comms[i], cuda_ctx->stream()));
}
NCCL_CHECK(ncclGroupEnd());
for (size_t i = 0; i < n_backends; ++i) {
ggml_backend_cuda_context * cuda_ctx = (ggml_backend_cuda_context *) comm_ctx->backends[i]->context;
ggml_cuda_set_device(cuda_ctx->device);
to_fp32(tmp[i].get(), (float *) tensors[i]->data, ne, cuda_ctx->stream());
CUDA_CHECK(cudaGetLastError());
}
return true;
#else
GGML_UNUSED_VARS(comm_ctx_v, tensors);
return false;
#endif // GGML_USE_NCCL
}
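The cutoffs in the small-tensor branch above deserve a closer look: below the threshold, the FP32-to-BF16 round trip costs more than the bandwidth it saves, and the threshold grows with the device count because total traffic does. A minimal standalone sketch of that decision, assuming the illustrative names reduce_dtype and choose_allreduce_dtype (not part of the ggml API):

#include <cstddef>
#include <cstdint>

enum class reduce_dtype { FP32, BF16 };

// Thresholds copied from the diff above; tuned on RTX 4090s over 16x PCIe 4.0.
static reduce_dtype choose_allreduce_dtype(size_t n_backends, int64_t ne) {
    const bool small = (n_backends <= 2 && ne < 32768)  ||
                       (n_backends == 3 && ne < 131072) ||
                       (n_backends >= 4 && ne < 262144);
    return small ? reduce_dtype::FP32 : reduce_dtype::BF16;
}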
@@ -3060,6 +3108,15 @@ static bool ggml_cuda_graph_update_required(ggml_backend_cuda_context * cuda_ctx
const void * graph_key = ggml_cuda_graph_get_key(cgraph);
ggml_cuda_graph * graph = cuda_ctx->cuda_graph(graph_key);
if (cgraph->uid != 0 &&
cgraph->uid == graph->uid) {
GGML_LOG_DEBUG("CUDA Graph id %zu reused\n", cgraph->uid);
GGML_ASSERT((int)graph->node_props.size() == cgraph->n_nodes);
return false;
}
graph->uid = cgraph->uid;
// Check if the graph size has changed
if ((int)graph->node_props.size() != cgraph->n_nodes) {
res = true;
@@ -4783,6 +4840,7 @@ static bool ggml_backend_cuda_device_supports_op(ggml_backend_dev_t dev, const g
switch (a->type) {
case GGML_TYPE_F32:
case GGML_TYPE_F16:
case GGML_TYPE_Q1_0:
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
case GGML_TYPE_Q5_0:
@@ -4820,6 +4878,7 @@ static bool ggml_backend_cuda_device_supports_op(ggml_backend_dev_t dev, const g
case GGML_TYPE_F32:
case GGML_TYPE_BF16:
case GGML_TYPE_I32:
case GGML_TYPE_Q1_0:
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
case GGML_TYPE_Q5_0:
@@ -5220,8 +5279,14 @@ static ggml_backend_feature * ggml_backend_cuda_get_features(ggml_backend_reg_t
static void * ggml_backend_cuda_reg_get_proc_address(ggml_backend_reg_t reg, const char * name) {
GGML_UNUSED(reg);
if (strcmp(name, "ggml_backend_allreduce_tensor") == 0) {
return (void *)ggml_backend_cuda_allreduce_tensor;
if (strcmp(name, "ggml_backend_comm_init") == 0) {
return (void *)ggml_backend_cuda_comm_init;
}
if (strcmp(name, "ggml_backend_comm_free") == 0) {
return (void *)ggml_backend_cuda_comm_free;
}
if (strcmp(name, "ggml_backend_comm_allreduce_tensor") == 0) {
return (void *)ggml_backend_cuda_comm_allreduce_tensor;
}
if (strcmp(name, "ggml_backend_split_buffer_type") == 0) {
return (void *)ggml_backend_cuda_split_buffer_type;
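For context, here is a hedged usage sketch of the new proc-address trio. The string names and function signatures come from the diff; the wrapper function and typedefs are assumptions for illustration:

#include "ggml-backend.h"

typedef void * (*comm_init_t)(ggml_backend_t *, size_t);
typedef void   (*comm_free_t)(void *);
typedef bool   (*comm_allreduce_t)(void *, struct ggml_tensor **);

// Hypothetical caller: resolve the three entry points, then init/use/free.
static bool allreduce_via_reg(ggml_backend_reg_t reg, ggml_backend_t * backends, size_t n_backends,
                              struct ggml_tensor ** tensors) {
    comm_init_t      init = (comm_init_t)      ggml_backend_reg_get_proc_address(reg, "ggml_backend_comm_init");
    comm_allreduce_t ar   = (comm_allreduce_t) ggml_backend_reg_get_proc_address(reg, "ggml_backend_comm_allreduce_tensor");
    comm_free_t      fini = (comm_free_t)      ggml_backend_reg_get_proc_address(reg, "ggml_backend_comm_free");
    if (!init || !ar || !fini) {
        return false;
    }
    void * comm = init(backends, n_backends); // nullptr if any backend is not CUDA
    if (comm == nullptr) {
        return false;
    }
    const bool ok = ar(comm, tensors);        // sums tensors[i] across devices in place
    fini(comm);
    return ok;
}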

View File

@@ -5,6 +5,9 @@
static void ggml_cuda_mul_mat_q_switch_type(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
switch (args.type_x) {
case GGML_TYPE_Q1_0:
mul_mat_q_case<GGML_TYPE_Q1_0>(ctx, args, stream);
break;
case GGML_TYPE_Q4_0:
mul_mat_q_case<GGML_TYPE_Q4_0>(ctx, args, stream);
break;
@@ -270,6 +273,7 @@ bool ggml_cuda_should_use_mmq(enum ggml_type type, int cc, int64_t ne11, int64_t
bool mmq_supported;
switch (type) {
case GGML_TYPE_Q1_0:
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
case GGML_TYPE_Q5_0:

View File

@@ -57,6 +57,8 @@ static_assert(sizeof(block_fp4_mmq) == sizeof(block_q8_1_mmq), "Unexpected b
static mmq_q8_1_ds_layout mmq_get_q8_1_ds_layout(const ggml_type type_x) {
switch (type_x) {
case GGML_TYPE_Q1_0:
return MMQ_Q8_1_DS_LAYOUT_D4;
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
return MMQ_Q8_1_DS_LAYOUT_DS4;
@@ -185,6 +187,7 @@ static constexpr __device__ int get_mmq_y_device() {
static constexpr __host__ __device__ tile_x_sizes mmq_get_dp4a_tile_x_sizes(ggml_type type, int mmq_y) {
switch (type) {
case GGML_TYPE_Q1_0: return MMQ_DP4A_TXS_Q8_0;
case GGML_TYPE_Q4_0: return MMQ_DP4A_TXS_Q4_0;
case GGML_TYPE_Q4_1: return MMQ_DP4A_TXS_Q4_1;
case GGML_TYPE_Q5_0: return MMQ_DP4A_TXS_Q8_0;
@@ -229,6 +232,7 @@ static_assert(MMQ_MMA_TILE_X_K_NVFP4 % 8 == 4, "Wrong padding.");
static constexpr __host__ __device__ int mmq_get_mma_tile_x_k(ggml_type type) {
switch (type) {
case GGML_TYPE_Q1_0: return MMQ_MMA_TILE_X_K_Q8_0;
case GGML_TYPE_Q4_0: return MMQ_MMA_TILE_X_K_Q8_0;
case GGML_TYPE_Q4_1: return MMQ_MMA_TILE_X_K_Q8_1;
case GGML_TYPE_Q5_0: return MMQ_MMA_TILE_X_K_Q8_0;
@@ -302,6 +306,87 @@ static constexpr __device__ int mmq_get_nwarps_device() {
// ------------------------------------------------------------
template <int mmq_y, bool need_check> static __device__ __forceinline__ void load_tiles_q1_0(
const char * __restrict__ x, int * __restrict__ x_tile, const int kbx0, const int i_max, const int stride) {
constexpr int nwarps = mmq_get_nwarps_device();
constexpr int warp_size = ggml_cuda_get_physical_warp_size();
#if defined(AMD_MFMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE)
int * x_qs = (int *) x_tile;
float * x_df = (float *) (x_qs + 2*MMQ_TILE_NE_K);
#else
constexpr tile_x_sizes txs = mmq_get_dp4a_tile_x_sizes(GGML_TYPE_Q8_0, mmq_y);
int * x_qs = (int *) x_tile;
float * x_df = (float *) (x_qs + txs.qs);
#endif // defined(AMD_MFMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE)
constexpr int blocks_per_iter = MMQ_ITER_K / QK1_0;
constexpr int threads_per_row = blocks_per_iter * QI1_0;
constexpr int nrows = warp_size / threads_per_row;
constexpr int scale_entries_per_block = QK1_0 / QK8_1;
constexpr int scale_entries_per_row = blocks_per_iter * scale_entries_per_block;
const int txi = threadIdx.x % threads_per_row;
const int kbx = txi / QI1_0;
const int kqsx = txi % QI1_0;
#pragma unroll
for (int i0 = 0; i0 < mmq_y; i0 += nrows*nwarps) {
int i = i0 + threadIdx.y*nrows + threadIdx.x/threads_per_row;
if (need_check) {
i = min(i, i_max);
}
const block_q1_0 * bxi = (const block_q1_0 *) x + kbx0 + i*stride + kbx;
const int qs_offset = 4*kqsx;
const int qs0 = bxi->qs[qs_offset + 0] | (bxi->qs[qs_offset + 1] << 8) |
(bxi->qs[qs_offset + 2] << 16) | (bxi->qs[qs_offset + 3] << 24);
int unpacked_bytes[8];
#pragma unroll
for (int j = 0; j < 8; ++j) {
const int shift = j * 4;
const int bits4 = (qs0 >> shift) & 0x0F;
const int b0 = (bits4 & 0x01) ? 1 : -1;
const int b1 = (bits4 & 0x02) ? 1 : -1;
const int b2 = (bits4 & 0x04) ? 1 : -1;
const int b3 = (bits4 & 0x08) ? 1 : -1;
unpacked_bytes[j] = (b0 & 0xFF) | ((b1 & 0xFF) << 8) | ((b2 & 0xFF) << 16) | ((b3 & 0xFF) << 24);
}
const int dst_offset = kbx*(scale_entries_per_block*QI8_0) + kqsx*QI8_0;
#pragma unroll
for (int j = 0; j < 8; ++j) {
#if defined(AMD_MFMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE)
x_qs[i*MMQ_MMA_TILE_X_K_Q8_0 + dst_offset + j] = unpacked_bytes[j];
#else
x_qs[i*(2*MMQ_TILE_NE_K + 1) + dst_offset + j] = unpacked_bytes[j];
#endif // defined(AMD_MFMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE)
}
}
const int ksx = threadIdx.x % scale_entries_per_row;
const int scale_block = ksx / scale_entries_per_block;
#pragma unroll
for (int i0 = 0; i0 < mmq_y; i0 += nwarps) {
int i = i0 + threadIdx.y;
if (need_check) {
i = min(i, i_max);
}
const block_q1_0 * bxi = (const block_q1_0 *) x + kbx0 + i*stride + scale_block;
#if defined(AMD_MFMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE)
x_df[i*MMQ_MMA_TILE_X_K_Q8_0 + ksx] = bxi->d;
#else
x_df[i*(2*MMQ_TILE_NE_K/QI8_0) + i/(QI8_0/2) + ksx] = bxi->d;
#endif // defined(AMD_MFMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE)
}
}
template <int mmq_y, bool need_check> static __device__ __forceinline__ void load_tiles_q4_0(
const char * __restrict__ x, int * __restrict__ x_tile, const int kbx0, const int i_max, const int stride) {
constexpr int nwarps = mmq_get_nwarps_device();
@@ -3290,6 +3375,14 @@ static __device__ __forceinline__ void mmq_write_back_mma(
template <int mmq_x, int mmq_y, bool need_check, ggml_type type>
struct mmq_type_traits;
template <int mmq_x, int mmq_y, bool need_check>
struct mmq_type_traits<mmq_x, mmq_y, need_check, GGML_TYPE_Q1_0> {
static constexpr int vdr = VDR_Q1_0_Q8_1_MMQ;
static constexpr load_tiles_mmq_t load_tiles = load_tiles_q1_0<mmq_y, need_check>;
static constexpr vec_dot_mmq_t vec_dot_mma = vec_dot_q8_0_q8_1_mma<mmq_x, mmq_y, MMQ_Q8_1_DS_LAYOUT_D4>;
static constexpr vec_dot_mmq_t vec_dot_dp4a = vec_dot_q8_0_q8_1_dp4a<mmq_x, mmq_y>;
};
template <int mmq_x, int mmq_y, bool need_check>
struct mmq_type_traits<mmq_x, mmq_y, need_check, GGML_TYPE_Q4_0> {
static constexpr int vdr = VDR_Q4_0_Q8_1_MMQ;

View File

@@ -9,6 +9,7 @@ typedef float (*vec_dot_q_cuda_t)(const void * __restrict__ vbq, const block_q8_
static constexpr __device__ vec_dot_q_cuda_t get_vec_dot_q_cuda(ggml_type type) {
switch (type) {
case GGML_TYPE_Q1_0: return vec_dot_q1_0_q8_1;
case GGML_TYPE_Q4_0: return vec_dot_q4_0_q8_1;
case GGML_TYPE_Q4_1: return vec_dot_q4_1_q8_1;
case GGML_TYPE_Q5_0: return vec_dot_q5_0_q8_1;
@@ -36,6 +37,7 @@ static constexpr __device__ vec_dot_q_cuda_t get_vec_dot_q_cuda(ggml_type type)
static constexpr __host__ __device__ int get_vdr_mmvq(ggml_type type) {
switch (type) {
case GGML_TYPE_Q1_0: return VDR_Q1_0_Q8_1_MMVQ;
case GGML_TYPE_Q4_0: return VDR_Q4_0_Q8_1_MMVQ;
case GGML_TYPE_Q4_1: return VDR_Q4_1_Q8_1_MMVQ;
case GGML_TYPE_Q5_0: return VDR_Q5_0_Q8_1_MMVQ;
@@ -886,6 +888,12 @@ static void mul_mat_vec_q_switch_type(
const int nsamples_x, const int nsamples_dst, const int stride_sample_x, const int stride_sample_y, const int stride_sample_dst,
const int ids_stride, cudaStream_t stream) {
switch (type_x) {
case GGML_TYPE_Q1_0:
mul_mat_vec_q_switch_ncols_dst<GGML_TYPE_Q1_0>
(vx, vy, ids, fusion, dst, ncols_x, nrows_x, ncols_dst, stride_row_x, stride_col_y, stride_col_dst,
nchannels_x, nchannels_y, nchannels_dst, stride_channel_x, stride_channel_y, stride_channel_dst,
nsamples_x, nsamples_dst, stride_sample_x, stride_sample_y, stride_sample_dst, ids_stride, stream);
break;
case GGML_TYPE_Q4_0:
mul_mat_vec_q_switch_ncols_dst<GGML_TYPE_Q4_0>
(vx, vy, ids, fusion, dst, ncols_x, nrows_x, ncols_dst, stride_row_x, stride_col_y, stride_col_dst,

View File

@@ -32,6 +32,7 @@ SOURCE_FATTN_MMA_START = """// This file has been autogenerated by generate_cu_f
SOURCE_FATTN_MMA_CASE = "DECL_FATTN_MMA_F16_CASE({head_size_kq}, {head_size_v}, {ncols1}, {ncols2});\n"
TYPES_MMQ = [
"GGML_TYPE_Q1_0",
"GGML_TYPE_Q4_0", "GGML_TYPE_Q4_1", "GGML_TYPE_Q5_0", "GGML_TYPE_Q5_1", "GGML_TYPE_Q8_0",
"GGML_TYPE_Q2_K", "GGML_TYPE_Q3_K", "GGML_TYPE_Q4_K", "GGML_TYPE_Q5_K", "GGML_TYPE_Q6_K",
"GGML_TYPE_IQ2_XXS", "GGML_TYPE_IQ2_XS", "GGML_TYPE_IQ2_S", "GGML_TYPE_IQ3_XXS", "GGML_TYPE_IQ3_S",

View File

@@ -0,0 +1,5 @@
// This file has been autogenerated by generate_cu_files.py, do not edit manually.
#include "../mmq.cuh"
DECL_MMQ_CASE(GGML_TYPE_Q1_0);

View File

@@ -106,6 +106,9 @@ static __device__ __forceinline__ uint32_t unpack_ksigns(const uint8_t v) {
// VDR = vec dot ratio, how many contiguous integers each thread processes when the vec dot kernel is called
// MMVQ = mul_mat_vec_q, MMQ = mul_mat_q
#define VDR_Q1_0_Q8_1_MMVQ 1 // Process one 32-element chunk at a time for parallelism
#define VDR_Q1_0_Q8_1_MMQ 4 // Q1_0 has 128 bits (4 ints) per block
#define VDR_Q4_0_Q8_1_MMVQ 2
#define VDR_Q4_0_Q8_1_MMQ 4
@@ -669,6 +672,51 @@ static __device__ __forceinline__ float vec_dot_q6_K_q8_1_impl_mmq(
return d6 * sumf_d;
}
static __device__ __forceinline__ float vec_dot_q1_0_q8_1(
const void * __restrict__ vbq, const block_q8_1 * __restrict__ bq8_1, const int & kbx, const int & iqs) {
const block_q1_0 * bq1_0 = (const block_q1_0 *) vbq + kbx;
// Q1_0: 128 elements with ONE scale
// Q8_1: 32 elements per block with individual scales
// iqs selects which of the 4 chunks of 32 elements to process (0-3)
const float d1 = bq1_0->d;
// Process only the chunk specified by iqs
const block_q8_1 * bq8_1_chunk = bq8_1 + iqs;
// Load 32 bits (4 bytes) for this chunk from Q1_0
const int offset = iqs * 4;
const int v = bq1_0->qs[offset + 0] | (bq1_0->qs[offset + 1] << 8) |
(bq1_0->qs[offset + 2] << 16) | (bq1_0->qs[offset + 3] << 24);
// Unpack 32 bits into 32 signed values (-1 or +1)
int vi_bytes[8];
#pragma unroll
for (int j = 0; j < 8; ++j) {
const int shift = j * 4;
const int bits4 = (v >> shift) & 0x0F;
const int b0 = (bits4 & 0x01) ? 1 : -1;
const int b1 = (bits4 & 0x02) ? 1 : -1;
const int b2 = (bits4 & 0x04) ? 1 : -1;
const int b3 = (bits4 & 0x08) ? 1 : -1;
vi_bytes[j] = (b0 & 0xFF) | ((b1 & 0xFF) << 8) | ((b2 & 0xFF) << 16) | ((b3 & 0xFF) << 24);
}
// Compute dot product for this 32-element chunk
int sumi = 0;
#pragma unroll
for (int j = 0; j < 8; ++j) {
const int u = get_int_b4(bq8_1_chunk->qs, j);
sumi = ggml_cuda_dp4a(vi_bytes[j], u, sumi);
}
// Apply Q1_0's single scale and this chunk's Q8_1 scale
const float d8 = __low2float(bq8_1_chunk->ds);
return d1 * d8 * sumi;
}
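The nibble unpacking above is easiest to verify against a scalar reference: bit e of the 32-bit word selects +1 or -1 for element e, which pairs with the e-th Q8_1 quant of the same chunk. A hedged host-side sketch (illustrative only, not the CUDA kernel; the function name is made up):

#include <cstdint>

static float q1_0_q8_1_chunk_ref(const uint8_t qs1[4], float d1,
                                 const int8_t  qs8[32], float d8) {
    const uint32_t v = (uint32_t) qs1[0]        | ((uint32_t) qs1[1] << 8) |
                      ((uint32_t) qs1[2] << 16) | ((uint32_t) qs1[3] << 24);
    int sumi = 0;
    for (int e = 0; e < 32; ++e) {
        const int sign = ((v >> e) & 1) ? 1 : -1; // one sign bit per element
        sumi += sign * qs8[e];
    }
    return d1 * d8 * sumi; // single Q1_0 scale times this chunk's Q8_1 scale
}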
static __device__ __forceinline__ float vec_dot_q4_0_q8_1(
const void * __restrict__ vbq, const block_q8_1 * __restrict__ bq8_1, const int & kbx, const int & iqs) {

View File

@@ -1,56 +0,0 @@
#pragma once
#include "ggml.h"
#include "ggml-backend.h"
// This is a "staging" header for new ggml API
// It is not publicly available and it should not be used by 3rd party projects
//
// When the API matures enough, it will be moved to the official public API
//
// Meta backend
//
#define GGML_BACKEND_META_MAX_DEVICES 16
enum ggml_backend_meta_split_axis {
// tensor split by tensor dimensions:
GGML_BACKEND_SPLIT_AXIS_0 = 0,
GGML_BACKEND_SPLIT_AXIS_1 = 1,
GGML_BACKEND_SPLIT_AXIS_2 = 2,
GGML_BACKEND_SPLIT_AXIS_3 = 3,
GGML_BACKEND_SPLIT_AXIS_MIRRORED = 10, // all values on all backends
GGML_BACKEND_SPLIT_AXIS_PARTIAL = 11, // each backend has a partial sum
// for internal bookkeeping only:
GGML_BACKEND_SPLIT_AXIS_NONE = 98,
GGML_BACKEND_SPLIT_AXIS_UNKNOWN = 99,
};
GGML_API const char * ggml_backend_meta_split_axis_name(enum ggml_backend_meta_split_axis split_axis);
struct ggml_backend_meta_split_state {
enum ggml_backend_meta_split_axis axis;
// for tensors with axis >= 0 && axis < GGML_MAX_DIMS:
// - each device has a slice of the tensor along the split axis
// - most tensors have n_segments == 1 and a contiguous slice of the tensor data
// - some tensors have an inhomogeneous data layout along the split axis,
// those tensors are divided into segments which are each individually split across devices
// - ne has one entry per segment and device that add up to ggml_tensor::ne for that axis,
// the outer/inner loops are over segments/devices like [seg0_dev0, seg0_dev1, seg1_dev0, seg1_dev1],
// - for example, a transformer may have a fused QKV matrix rather than 3 matrices, those would be 3 separate segments
// that each need to be split individually across devices so that each device gets a slice of Q, K, and V
int64_t ne[16*GGML_BACKEND_META_MAX_DEVICES];
uint32_t n_segments;
};
// function to assign split states for statically allocated tensors, compute tensor split states will be assigned to be compatible:
typedef struct ggml_backend_meta_split_state(*ggml_backend_meta_get_split_state_t)(const struct ggml_tensor * tensor, void * userdata);
// create a new meta device from "simple" devices, meta buffer type/buffer/backend is then derived from this:
// TODO: this looks a bit strange - a backend API creates a device. I think we should try
// express this as a backend registry functionality instead
GGML_API ggml_backend_dev_t ggml_backend_meta_device(
ggml_backend_dev_t * devs, size_t n_devs, ggml_backend_meta_get_split_state_t get_split_state, void * get_split_state_ud);
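The segment layout documented in this (now removed) staging header is easiest to picture with its own QKV example. The concrete sizes below are assumptions; only the field semantics come from the comments above:

// A fused QKV matrix split across 2 devices along axis 1 becomes 3 segments
// (Q, K, V) with per-(segment, device) extents ordered
// [seg0_dev0, seg0_dev1, seg1_dev0, seg1_dev1, seg2_dev0, seg2_dev1].
// Hypothetical sizes: 4096 Q rows, 1024 K rows, 1024 V rows.
struct ggml_backend_meta_split_state st = {
    /* .axis       = */ GGML_BACKEND_SPLIT_AXIS_1,
    /* .ne         = */ { 2048, 2048,   // Q: half per device
                          512,  512,    // K
                          512,  512 },  // V
    /* .n_segments = */ 3,
};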

View File

@@ -47,6 +47,7 @@ list(FIND HTP_HMX_VERSIONS ${DSP_VERSION} _hmx_idx)
if (_hmx_idx GREATER_EQUAL 0)
target_sources(${HTP_LIB} PRIVATE
hmx-queue.c
hmx-matmul-ops.c
)

View File

@@ -31,6 +31,14 @@ static inline uint64_t hex_get_pktcnt() {
return pktcnt;
}
// round x up to the nearest power of two (x <= 1 returns 1)
static inline uint32_t hex_ceil_pow2(uint32_t x) {
if (x <= 1) { return 1; }
int p = 2;
x--;
while (x >>= 1) { p <<= 1; }
return p;
}
static inline size_t hmx_ceil_div(size_t num, size_t den) {
return (num + den - 1) / den;
}
@@ -73,8 +81,7 @@ static inline void hex_l2fetch(const void * p, uint32_t width, uint32_t stride,
#define HEX_L2_LINE_SIZE 64
#define HEX_L2_FLUSH_SIZE (128 * 1024)
static inline void hex_l2flush(void * addr, size_t size)
{
static inline void hex_l2flush(void * addr, size_t size) {
if (size > HEX_L2_FLUSH_SIZE) {
qurt_mem_cache_clean((qurt_addr_t) 0, 0, QURT_MEM_CACHE_FLUSH_INVALIDATE_ALL, QURT_MEM_DCACHE);
} else {
@@ -89,4 +96,8 @@ static inline void hex_l2flush(void * addr, size_t size)
}
}
static inline void hex_pause() {
asm volatile(" pause(#255)\n");
}
#endif /* HEX_UTILS_H */

View File

@@ -16,14 +16,16 @@
#include "ggml-common.h"
#include "hex-dma.h"
#include "worker-pool.h"
#include "hvx-utils.h"
#include "hvx-dump.h"
#include "worker-pool.h"
#include "htp-ctx.h"
#include "htp-ops.h"
#include "hmx-utils.h"
#include "hmx-ops.h"
#include "hmx-utils.h"
#include "hmx-queue.h"
#include "hmx-profile.h"
static const __fp16 q4_0_to_fp16_lut[64] __attribute__((aligned(VLEN))) = {
@@ -47,7 +49,8 @@ static const __fp16 iq4_nl_to_fp16_lut[64] __attribute__((aligned(VLEN))) = {
static const int32_t weight_transpose_scatter_offsets[32] __attribute__((aligned(VLEN))) = {
0*128, 1*128, 2*128, 3*128, 4*128, 5*128, 6*128, 7*128,
8*128, 9*128, 10*128, 11*128, 12*128, 13*128, 14*128, 15*128,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
16*128, 17*128, 18*128, 19*128, 20*128, 21*128, 22*128, 23*128,
24*128, 25*128, 26*128, 27*128, 28*128, 29*128, 30*128, 31*128
};
// Scales per x4x2 logical block: 8 × sizeof(__fp16) = 16 bytes
@@ -109,36 +112,45 @@ static inline bool hmx_add_overflow(size_t a, size_t b, size_t *out) {
return false;
}
// Search for optimal (mc, nc) chunk sizes that maximize mc * nc within VTCM budget.
// Search for optimal (mc, nc) chunk sizes within VTCM budget.
//
// Cost model: total = nc * per_n_cost + mc * per_m_cost + mc * nc * per_mn_cost + overhead
// per_n_cost: bytes per nc column (weight + scratch buffers)
// per_m_cost: bytes per mc row (activation)
// per_mn_cost: bytes per mc*nc element (output)
// overhead: fixed bytes (scales 256B, eye_tile 2048B, etc.)
// VTCM model: nc * per_n_cost + mc * per_m_cost + mc * nc * per_mn_cost + overhead
//
// Minimize ceil(m/mc) * m_block_cost + ceil(n/nc) * n_block_cost.
// All matmul paths repeat weight processing per M-block and activation loading
// per N-block, so discrete block counts drive total overhead.
// Tie-break: when cost is equal, prefer larger mc * nc.
//
// Caller-provided coefficients:
// m_block_cost: penalty per extra M-block (weight redundancy, scales with n).
// n_block_cost: penalty per extra N-block (activation redundancy, scales with m).
//
// Algorithm: nc sweeps from n_max down by 32, analytically solving for mc_max.
// Returns 0 on success, -1 if VTCM is insufficient.
static int hmx_compute_chunks(
size_t vtcm_total, size_t overhead,
size_t per_n_cost, size_t per_m_cost, size_t per_mn_cost,
int m, int n,
size_t *m_chunk_out, size_t *n_chunk_out,
size_t *total_out)
{
static int hmx_compute_chunks(size_t vtcm_total,
size_t overhead,
size_t per_n_cost,
size_t per_m_cost,
size_t per_mn_cost,
int m,
int n,
size_t m_block_cost,
size_t n_block_cost,
size_t * m_chunk_out,
size_t * n_chunk_out,
size_t * total_out) {
if (m <= 0 || n <= 0) return -1;
if (vtcm_total <= overhead) return -1;
if (per_n_cost == 0 || per_m_cost == 0 || per_mn_cost == 0) return -1;
const size_t usable = vtcm_total - overhead;
size_t best_mn = 0, best_m = 0, best_n = 0;
size_t best_cost = SIZE_MAX;
size_t best_mn = 0;
size_t best_m = 0, best_n = 0;
const size_t n_max = hex_align_down((size_t)n, HMX_FP16_TILE_N_COLS);
for (size_t nc = n_max; nc >= HMX_FP16_TILE_N_COLS; nc -= HMX_FP16_TILE_N_COLS) {
// Early exit: if nc * m_max cannot beat best, smaller nc won't either
if (nc * hex_align_down((size_t)m, HMX_FP16_TILE_N_ROWS) <= best_mn)
break;
size_t n_fixed = 0, ncmn = 0, mc_denom = 0;
if (hmx_mul_overflow(nc, per_n_cost, &n_fixed)) continue;
if (n_fixed >= usable) goto next_nc;
@@ -152,10 +164,19 @@ static int hmx_compute_chunks(
mc = hex_align_down(mc, HMX_FP16_TILE_N_ROWS);
mc = hex_smin(mc, (size_t)m);
if (mc > 0 && mc * nc > best_mn) {
best_mn = mc * nc;
best_m = mc;
best_n = nc;
if (mc == 0) {
goto next_nc;
}
size_t mblocks = ((size_t) m + mc - 1) / mc;
size_t nblocks = ((size_t) n + nc - 1) / nc;
size_t cost = mblocks * m_block_cost + nblocks * n_block_cost;
size_t mn = mc * nc;
if (cost < best_cost || (cost == best_cost && mn > best_mn)) {
best_cost = cost;
best_mn = mn;
best_m = mc;
best_n = nc;
}
}
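Stripped of the overflow checks and tile alignment, the search above reduces to a short loop. A minimal sketch under those simplifying assumptions (the function name is illustrative, not the production routine):

#include <cstddef>

static int chunk_search_sketch(size_t usable, size_t per_n, size_t per_m, size_t per_mn,
                               size_t m, size_t n, size_t m_block_cost, size_t n_block_cost,
                               size_t * mc_out, size_t * nc_out) {
    size_t best_cost = (size_t) -1, best_mn = 0;
    for (size_t nc = n; nc >= 1; --nc) {       // real code steps by 32-column tiles
        const size_t fixed = nc * per_n;
        if (fixed >= usable) continue;
        // usable >= nc*per_n + mc*per_m + mc*nc*per_mn  =>  solve for mc
        size_t mc = (usable - fixed) / (per_m + nc * per_mn); // assumes per_m > 0
        if (mc > m) mc = m;
        if (mc == 0) continue;
        const size_t cost = ((m + mc - 1) / mc) * m_block_cost +
                            ((n + nc - 1) / nc) * n_block_cost;
        if (cost < best_cost || (cost == best_cost && mc * nc > best_mn)) {
            best_cost = cost; best_mn = mc * nc;
            *mc_out = mc; *nc_out = nc;
        }
    }
    return best_cost == (size_t) -1 ? -1 : 0;
}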
@@ -233,7 +254,7 @@ static inline HVX_Vector dequantize_x4x2_q4_0_group_hvx(
const HVX_Vector mask_h4 = Q6_Vb_vsplat_R(0x0F);
HVX_Vector v_scales = hvx_vec_splat_f16(*scale);
// q4x4x2 stores two int4 values per byte. Keep only the selected nibble.
HVX_Vector v_quants = upper_nibbles ? Q6_Vub_vlsr_VubR(vq, 4) : vq;
HVX_Vector v_quants = Q6_Vub_vlsr_VubR(vq, 4 * upper_nibbles);
v_quants = Q6_V_vand_VV(v_quants, mask_h4);
// Shuffle before LUT
v_quants = Q6_Vb_vshuff_Vb(v_quants);
@@ -257,7 +278,7 @@ static inline void dequantize_x4x2_q4_0_x4groups_hvx(
// Load all 128 packed bytes (4 contiguous 32-byte groups)
HVX_Vector vq = hvx_vmemu(packed_128);
const HVX_Vector mask_h4 = Q6_Vb_vsplat_R(0x0F);
HVX_Vector v_quants = upper_nibbles ? Q6_Vub_vlsr_VubR(vq, 4) : vq;
HVX_Vector v_quants = Q6_Vub_vlsr_VubR(vq, 4 * upper_nibbles);
v_quants = Q6_V_vand_VV(v_quants, mask_h4);
// Shuffle before LUT
@@ -277,10 +298,8 @@ static inline void dequantize_x4x2_q4_0_x4groups_hvx(
v_hi = Q6_Vhf_equals_Vqf16(Q6_Vqf16_vmpy_VhfVhf(v_hi, v_sc23));
// Extract individual groups: scatter uses q_mask64 so only first 64 bytes matter
out[0] = v_lo; // group0 already in [0:63]
out[1] = Q6_V_vror_VR(v_lo, 64); // group1 rotated to [0:63]
out[2] = v_hi; // group2 already in [0:63]
out[3] = Q6_V_vror_VR(v_hi, 64); // group3 rotated to [0:63]
out[0] = v_lo; // group0 already in [0:63]
out[1] = v_hi; // group2 already in [0:63]
}
// Dequantize one x4x2 Q8_0 group (32 int8 quants) -> 32 FP16 in first 64 bytes.
@@ -384,8 +403,9 @@ static void dequantize_x4x2_weight_to_fp16_tiles_task(
size_t row_stride, int weight_type,
int start_tile, int end_tile) {
const int n_k_tiles = k_block / HMX_FP16_TILE_N_COLS;
const int qrow_size = (weight_type == HTP_TYPE_Q8_0) ? k_block : (k_block / 2);
const int n_k_tiles = (unsigned)k_block / HMX_FP16_TILE_N_COLS;
const bool is_q4 = (weight_type == HTP_TYPE_Q4_0 || weight_type == HTP_TYPE_IQ4_NL);
const int qrow_size = is_q4 ? ((unsigned)k_block / 2) : k_block;
const HVX_Vector vlut_cvt = (weight_type == HTP_TYPE_IQ4_NL) ? hvx_vmem(iq4_nl_to_fp16_lut) :
(weight_type == HTP_TYPE_MXFP4) ? hvx_vmem(mxfp4_to_fp16_lut) :
@@ -398,47 +418,46 @@ static void dequantize_x4x2_weight_to_fp16_tiles_task(
const HVX_Vector v_scat_step = Q6_V_vsplat_R(4); // 4 bytes = 1 column step
const HVX_VectorPred q_mask64 = Q6_Q_vsetq_R(64); // first 16 words (64 bytes)
for (int t = start_tile; t < end_tile; ) {
int ct = t / n_k_tiles; // column tile index
int kt = t % n_k_tiles; // K tile index
unsigned ct = (unsigned)start_tile / n_k_tiles; // column tile index
unsigned kt = (unsigned)start_tile % n_k_tiles; // K tile index
for (unsigned t = start_tile; t < end_tile; ) {
if (kt >= n_k_tiles) { kt = 0; ct++; }
// --- Batch-4 fast path for Q4_0/IQ4_NL: process 4 contiguous K-tiles with one vlut16 per row ---
if ((weight_type == HTP_TYPE_Q4_0 || weight_type == HTP_TYPE_IQ4_NL) && (kt % 4 == 0) && (t + 4 <= end_tile) &&
((t + 3) / n_k_tiles == ct)) {
int blk_idx = (kt * 32) / QK_Q4_0x4x2;
int sub_blk_base = ((kt * 32) % QK_Q4_0x4x2) / 32; // 0 or 4
bool upper = (sub_blk_base >= 4);
int packed_off = blk_idx * (QK_Q4_0x4x2 / 2); // 128 contiguous packed bytes
int scale_off = qrow_size + blk_idx * HMX_X4X2_DBLK_SIZE
+ sub_blk_base * (int)sizeof(__fp16); // 4 consecutive scales
// --- Batch-4 fast path for Q4: process 4 contiguous K-tiles with one vlut16 per row ---
if (is_q4 && (kt % 4 == 0) && (t + 4 <= end_tile) && ((t + 3) / n_k_tiles == ct)) {
unsigned blk_idx = (kt * 32) / QK_Q4_0x4x2;
unsigned sub_blk_base = ((kt * 32) % QK_Q4_0x4x2) / 32; // 0 or 4
bool upper = (sub_blk_base >= 4);
unsigned packed_off = blk_idx * (QK_Q4_0x4x2 / 2); // 128 contiguous packed bytes
unsigned scale_off = qrow_size + blk_idx * HMX_X4X2_DBLK_SIZE
+ sub_blk_base * (int)sizeof(__fp16); // 4 consecutive scales
__fp16 *tile_bases[4];
for (int g = 0; g < 4; g++) { tile_bases[g] = vtcm_dst + (t + g) * HMX_FP16_TILE_N_ELMS; }
for (unsigned g = 0; g < 4; g++) { tile_bases[g] = vtcm_dst + (t + g) * HMX_FP16_TILE_N_ELMS; }
HVX_Vector v_off = v_scat_base;
for (int r = 0; r < HMX_FP16_TILE_N_ROWS; r += 2) {
int row0 = ct * HMX_FP16_TILE_N_COLS + r;
int row1 = row0 + 1;
const uint8_t *r0 = vtcm_src + row0 * row_stride;
const uint8_t *r1 = vtcm_src + row1 * row_stride;
HVX_Vector v0[4], v1[4];
unsigned row_offset = ct * HMX_FP16_TILE_N_COLS * row_stride;
unsigned row1 = ct * HMX_FP16_TILE_N_COLS + 1;
for (int r = 0; r < HMX_FP16_TILE_N_ROWS; r += 2, row1 += 2) {
HVX_Vector v0[2];
const uint8_t *r0 = vtcm_src + row_offset; row_offset += row_stride;
dequantize_x4x2_q4_0_x4groups_hvx(r0 + packed_off, upper, (const __fp16 *)(r0 + scale_off), vlut_cvt, v0);
if (row1 < n_cols) {
dequantize_x4x2_q4_0_x4groups_hvx(r1 + packed_off, upper, (const __fp16 *)(r1 + scale_off), vlut_cvt, v1);
} else {
v1[0] = v1[1] = v1[2] = v1[3] = Q6_V_vzero();
}
for (int g = 0; g < 4; g++) { Q6_vscatter_QRMVwV(q_mask64, (size_t)tile_bases[g], HMX_FP16_TILE_SIZE - 1, v_off, v0[g]); }
Q6_vscatter_RMVwV((size_t)tile_bases[0], 2 * HMX_FP16_TILE_SIZE - 1, v_off, v0[0]);
Q6_vscatter_RMVwV((size_t)tile_bases[2], 2 * HMX_FP16_TILE_SIZE - 1, v_off, v0[1]);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
for (int g = 0; g < 4; g++) { Q6_vscatter_QRMVwV(q_mask64, (size_t)tile_bases[g], HMX_FP16_TILE_SIZE - 1, v_off, v1[g]); }
r0 = vtcm_src + row_offset; row_offset += row_stride;
dequantize_x4x2_q4_0_x4groups_hvx(r0 + packed_off, upper, (const __fp16 *)(r0 + scale_off), vlut_cvt, v0);
Q6_vscatter_RMVwV((size_t)tile_bases[0], 2 * HMX_FP16_TILE_SIZE - 1, v_off, v0[0]);
Q6_vscatter_RMVwV((size_t)tile_bases[2], 2 * HMX_FP16_TILE_SIZE - 1, v_off, v0[1]);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
}
for (int g = 0; g < 4; g++) { (void) *(volatile HVX_Vector *)(tile_bases[g]); }
t += 4;
t += 4; kt += 4;
continue;
}
@@ -495,20 +514,19 @@ static void dequantize_x4x2_weight_to_fp16_tiles_task(
// --- Single-tile fallback ---
__fp16 *tile_base = vtcm_dst + t * HMX_FP16_TILE_N_ELMS;
if (weight_type == HTP_TYPE_Q4_0 || weight_type == HTP_TYPE_IQ4_NL) {
int blk_idx = (kt * 32) / QK_Q4_0x4x2;
int sub_blk = ((kt * 32) % QK_Q4_0x4x2) / 32;
bool upper = (sub_blk >= 4);
int byte_off = blk_idx * (QK_Q4_0x4x2 / 2) + (upper ? (sub_blk - 4) : sub_blk) * 32;
int scale_off = qrow_size + blk_idx * HMX_X4X2_DBLK_SIZE + sub_blk * (int)sizeof(__fp16);
if (is_q4) {
unsigned blk_idx = (kt * 32) / QK_Q4_0x4x2;
unsigned sub_blk = ((kt * 32) % QK_Q4_0x4x2) / 32;
bool upper = (sub_blk >= 4);
unsigned byte_off = blk_idx * (QK_Q4_0x4x2 / 2) + (upper ? (sub_blk - 4) : sub_blk) * 32;
unsigned scale_off = qrow_size + blk_idx * HMX_X4X2_DBLK_SIZE + sub_blk * (int)sizeof(__fp16);
HVX_Vector v_off = v_scat_base; // reset to column 0
for (int r = 0; r < HMX_FP16_TILE_N_ROWS; r += 2) {
int row0 = ct * HMX_FP16_TILE_N_COLS + r;
int row1 = row0 + 1;
const uint8_t *r0 = vtcm_src + row0 * row_stride;
const uint8_t *r1 = vtcm_src + row1 * row_stride;
unsigned row_offset = ct * HMX_FP16_TILE_N_COLS * row_stride;
unsigned row1 = ct * HMX_FP16_TILE_N_COLS + 1;
for (int r = 0; r < HMX_FP16_TILE_N_ROWS; r += 2, row1 += 2) {
const uint8_t *r0 = vtcm_src + row_offset; row_offset += row_stride;
const uint8_t *r1 = vtcm_src + row_offset; row_offset += row_stride;
HVX_Vector v0 = dequantize_x4x2_q4_0_group_hvx(
r0 + byte_off, upper, (const __fp16 *)(r0 + scale_off), vlut_cvt);
@@ -585,7 +603,7 @@ static void dequantize_x4x2_weight_to_fp16_tiles_task(
}
(void) *(volatile HVX_Vector *)(tile_base);
}
++t;
++t; ++kt;
}
// Drain HVX scatter write buffer: a vmem load on the same HW thread retires
@@ -630,9 +648,9 @@ static void dequantize_x4x2_weight_chunk_to_fp16_tiles(
assert(n_cols % HMX_FP16_TILE_N_COLS == 0);
assert(k_block % HMX_FP16_TILE_N_COLS == 0);
int n_col_tiles = n_cols / HMX_FP16_TILE_N_COLS;
int n_k_tiles = k_block / HMX_FP16_TILE_N_COLS;
int n_tot_tiles = n_col_tiles * n_k_tiles;
size_t n_col_tiles = n_cols / HMX_FP16_TILE_N_COLS;
size_t n_k_tiles = k_block / HMX_FP16_TILE_N_COLS;
size_t n_tot_tiles = n_col_tiles * n_k_tiles;
size_t n_tiles_per_task = hmx_ceil_div(n_tot_tiles, ctx->n_threads);
@@ -653,49 +671,91 @@ static void dequantize_x4x2_weight_chunk_to_fp16_tiles(
// --- End x4x2 dequantizers ---
// requires external HMX lock
static void core_dot_chunk_fp16(__fp16 *output, const __fp16 *activation, const __fp16 *weight, const __fp16 *scales,
static void core_dot_chunk_fp16(__fp16 *restrict output, const __fp16 *restrict activation, const __fp16 *restrict weight, const __fp16 *restrict scales,
int n_row_tiles, int n_col_tiles, int n_dot_tiles) {
hmx_set_output_scales(scales);
__builtin_assume(n_row_tiles > 0);
__builtin_assume(n_col_tiles > 0);
__builtin_assume(n_dot_tiles > 0);
Q6_bias_mxmem2_A((void *)scales);
for (int r = 0; r < n_row_tiles; ++r) {
for (int c = 0; c < n_col_tiles; ++c) {
for (size_t c = 0; c < n_col_tiles; ++c) {
Q6_mxclracc_hf();
const __fp16 *row_tiles = activation + r * n_dot_tiles * HMX_FP16_TILE_N_ELMS;
const __fp16 *col_tiles = weight + c * n_dot_tiles * HMX_FP16_TILE_N_ELMS;
for (int k = 0; k < n_dot_tiles; ++k) {
int offset = k * HMX_FP16_TILE_N_ELMS;
hmx_load_tile_pair_fp16(row_tiles + offset, col_tiles + offset);
Q6_activation_hf_mxmem_RR((unsigned int)row_tiles, 2047);
Q6_weight_hf_mxmem_RR((unsigned int)col_tiles, 2047);
row_tiles += HMX_FP16_TILE_N_ELMS;
col_tiles += HMX_FP16_TILE_N_ELMS;
}
__fp16 *out_tile = output + (r * n_col_tiles + c) * HMX_FP16_TILE_N_ELMS;
hmx_consume_accumulator_fp16(out_tile);
Q6_mxmem_AR_after_hf(out_tile, 0);
}
}
}
// --- Async HMX matmul job (for pipeline overlap) ---
typedef struct {
__fp16 * output;
const __fp16 * activation;
const __fp16 * weight;
const __fp16 * scales;
uint32_t n_row_tiles;
uint32_t n_col_tiles;
uint32_t n_dot_tiles;
} hmx_matmul_job_t;
static void hmx_matmul_worker_fn(void * data) {
hmx_matmul_job_t * job = (hmx_matmul_job_t *) data;
FARF(HIGH, "hmx-mm-job: n_row_tiles %u n_col_tiles %u n_dot_tiles %u", job->n_row_tiles, job->n_col_tiles, job->n_dot_tiles);
core_dot_chunk_fp16(job->output, job->activation, job->weight, job->scales, job->n_row_tiles, job->n_col_tiles, job->n_dot_tiles);
}
static inline void hmx_matmul_job_init(hmx_matmul_job_t * job,
__fp16 * output,
const __fp16 * activation,
const __fp16 * weight,
const __fp16 * scales,
int n_row_tiles,
int n_col_tiles,
int n_dot_tiles) {
job->output = output;
job->activation = activation;
job->weight = weight;
job->scales = scales;
job->n_row_tiles = n_row_tiles;
job->n_col_tiles = n_col_tiles;
job->n_dot_tiles = n_dot_tiles;
}
// --- End async HMX matmul job ---
static void transfer_output_chunk_fp16_to_fp32(float *restrict dst, const __fp16 *restrict vtcm_src, int n_rows, int n_cols, int n) {
assert(n_cols % HMX_FP16_TILE_N_COLS == 0);
const int n_col_tiles = n_cols / HMX_FP16_TILE_N_COLS;
const size_t tile_row_stride = (n_cols / HMX_FP16_TILE_N_COLS) * HMX_FP16_TILE_N_ELMS;
const HVX_Vector one = hvx_vec_splat_f16(1.0);
for (int r = 0; r < n_rows; r += 2) {
int r0 = r / HMX_FP16_TILE_N_ROWS;
int r1 = r % HMX_FP16_TILE_N_ROWS;
for (size_t r = 0; r < n_rows; r += 2) {
const size_t r0 = r / HMX_FP16_TILE_N_ROWS;
const size_t r1 = (r % HMX_FP16_TILE_N_ROWS) / 2; // index of the row pair within the tile
const __fp16 *row_base = vtcm_src + r0 * tile_row_stride;
float *output_row_base = dst + r * n; // global memory row base for row r (and r+1)
#pragma unroll(4)
for (int c = 0; c < n_cols; c += HMX_FP16_TILE_N_COLS) {
int c0 = c / HMX_FP16_TILE_N_COLS;
const __fp16 *tile = vtcm_src + (r0 * n_col_tiles + c0) * HMX_FP16_TILE_N_ELMS;
HVX_Vector v = ((const HVX_Vector *) tile)[r1 / 2];
for (size_t c = 0; c < n_cols; c += HMX_FP16_TILE_N_COLS) {
const size_t c0 = c / HMX_FP16_TILE_N_COLS;
const __fp16 *tile = row_base + c0 * HMX_FP16_TILE_N_ELMS;
HVX_Vector v = ((const HVX_Vector *) tile)[r1];
HVX_VectorPair vp = Q6_Wqf32_vmpy_VhfVhf(v, one);
volatile HVX_Vector *pv_out0 = (volatile HVX_Vector *) (dst + (r * n + c + 0));
volatile HVX_Vector *pv_out1 = (volatile HVX_Vector *) (dst + (r * n + c + n)); // next row in global memory
volatile HVX_Vector *pv_out0 = (volatile HVX_Vector *) (output_row_base + c + 0);
volatile HVX_Vector *pv_out1 = (volatile HVX_Vector *) (output_row_base + c + n); // next row in global memory
*pv_out0 = Q6_Vsf_equals_Vqf32(Q6_V_lo_W(vp));
if (r + 1 < n_rows) {
@@ -733,7 +793,7 @@ static void transfer_output_chunk_threaded(struct htp_context *ctx, float *dst,
assert(n_cols % HMX_FP16_TILE_N_COLS == 0);
size_t n_tot_chunks = n_rows;
size_t n_chunks_per_task = 32; // must be multiple of HMX_FP16_TILE_N_ROWS (32)
size_t n_chunks_per_task = HMX_FP16_TILE_N_ROWS; // must be multiple of HMX_FP16_TILE_N_ROWS (32)
output_transfer_task_state_t state;
state.n_tasks = (n_tot_chunks + n_chunks_per_task - 1) / n_chunks_per_task;
@@ -832,12 +892,13 @@ int hmx_mat_mul_permuted_w16a32_batched(struct htp_context *ctx, const hmx_matmu
const size_t f32_scratch_per_m = use_dma_activation ? (size_t) params->k * sizeof(float) : 0;
size_t m_chunk_n_rows = 0, n_chunk_n_cols = 0, vtcm_used = 0;
// FP16 weight: interleave and activation load have similar per-element cost.
if (hmx_compute_chunks(vtcm_budget, /*overhead=*/256,
/*per_n=*/3 * vec_dot_size,
/*per_m=*/group_size * vec_dot_size + f32_scratch_per_m,
/*per_mn=*/sizeof(__fp16),
params->m, params->n,
&m_chunk_n_rows, &n_chunk_n_cols, &vtcm_used) != 0) {
/*per_n=*/3 * vec_dot_size,
/*per_m=*/group_size * vec_dot_size + f32_scratch_per_m,
/*per_mn=*/sizeof(__fp16), params->m, params->n,
/*m_block_cost=*/(size_t) params->n,
/*n_block_cost=*/(size_t) params->m, &m_chunk_n_rows, &n_chunk_n_cols, &vtcm_used) != 0) {
FARF(HIGH, "%s: grouped path does not fit VTCM, falling back to legacy batched loop", __func__);
return hmx_mat_mul_permuted_w16a32_batched_legacy(ctx, params);
}
@@ -864,7 +925,7 @@ int hmx_mat_mul_permuted_w16a32_batched(struct htp_context *ctx, const hmx_matmu
return hmx_mat_mul_permuted_w16a32_batched_legacy(ctx, params);
}
hmx_init_column_scales(vtcm_scales, Q6_V_vsplat_R(0x3c00)); // fp16: 1.0
hmx_init_column_scales(vtcm_scales, Q6_V_vsplat_R(0x3c00)); // scale: 1.0, bias: 0.0 in FP16
FARF(MEDIUM, "%s: grouped path m=%d k=%d n=%d group=%d streams=%d mc=%zu nc=%zu vtcm=%zu/%zu",
__func__, params->m, params->k, params->n, group_size, params->ne13,
@@ -882,12 +943,15 @@ int hmx_mat_mul_permuted_w16a32_batched(struct htp_context *ctx, const hmx_matmu
const size_t fp16_row_bytes = (size_t) params->k * sizeof(__fp16);
const size_t weight_row_bytes = (size_t) params->weight_stride * sizeof(__fp16);
HAP_compute_res_hmx_lock(ctx->vtcm_rctx);
for (int b3 = 0; b3 < params->ne13; ++b3) {
for (int b2_base = 0; b2_base < params->ne12; b2_base += group_size) {
const __fp16 *weight_group = hmx_matmul_weight_batch_ptr(params, b2_base, b3);
for (size_t mr = 0; mr < (size_t) params->m; mr += m_chunk_n_rows) {
const size_t n_rows = hex_smin((size_t) params->m - mr, m_chunk_n_rows);
const size_t n_row_tiles = hmx_ceil_div((int) n_rows, HMX_FP16_TILE_N_ROWS);
// Pre-load activations for all heads in the group (once per m_chunk).
// When the source is strided (permuted Q), use 2D DMA to gather
@@ -925,10 +989,9 @@ int hmx_mat_mul_permuted_w16a32_batched(struct htp_context *ctx, const hmx_matmu
fp16_row_bytes, weight_row_bytes, fp16_row_bytes, n_cols_first);
}
HAP_compute_res_hmx_lock(ctx->vtcm_rctx);
for (size_t nc = 0; nc < (size_t) params->n; nc += n_chunk_n_cols) {
const size_t n_cols = hex_smin((size_t) params->n - nc, n_chunk_n_cols);
const size_t n_col_tiles = hmx_ceil_div((int) n_cols, HMX_FP16_TILE_N_COLS);
TIMER_START(weight_load);
{
@@ -952,11 +1015,9 @@ int hmx_mat_mul_permuted_w16a32_batched(struct htp_context *ctx, const hmx_matmu
for (int g = 0; g < group_size; ++g) {
TIMER_START(hmx_core);
{
const __fp16 *vtcm_act_g = vtcm_activation + (size_t) g * act_head_stride;
const int n_row_tiles = hmx_ceil_div((int) n_rows, HMX_FP16_TILE_N_ROWS);
const int n_col_tiles = hmx_ceil_div((int) n_cols, HMX_FP16_TILE_N_COLS);
core_dot_chunk_fp16(vtcm_output, vtcm_act_g, vtcm_weight, vtcm_scales,
n_row_tiles, n_col_tiles, params->k / 32);
const __fp16 * vtcm_act_g = vtcm_activation + (size_t) g * act_head_stride;
core_dot_chunk_fp16(vtcm_output, vtcm_act_g, vtcm_weight, vtcm_scales, n_row_tiles, n_col_tiles,
params->k / 32);
}
TIMER_STOP(hmx_core);
@@ -968,12 +1029,12 @@ int hmx_mat_mul_permuted_w16a32_batched(struct htp_context *ctx, const hmx_matmu
TIMER_STOP(output_store);
}
}
HAP_compute_res_hmx_unlock(ctx->vtcm_rctx);
}
}
}
HAP_compute_res_hmx_unlock(ctx->vtcm_rctx);
TIMER_STOP(total);
#if defined(ENABLE_PROFILE_TIMERS)
@@ -1006,13 +1067,15 @@ int hmx_mat_mul_permuted_w16a32(struct htp_context *ctx, float *restrict dst, co
const size_t f32_scratch_per_m = use_dma_activation ? (size_t) k * sizeof(float) : 0;
size_t m_chunk_n_rows = 0, n_chunk_n_cols = 0, vtcm_used = 0;
// FP16 weight: interleave and activation load have similar per-element cost.
if (hmx_compute_chunks(vtcm_budget,
/*overhead=*/ 256,
/*per_n=*/ 3 * vec_dot_size, // W + S0 + S1
/*per_m=*/ vec_dot_size + f32_scratch_per_m, // A + optional F32 scratch
/*per_mn=*/ sizeof(__fp16), // O
m, n,
&m_chunk_n_rows, &n_chunk_n_cols, &vtcm_used) != 0) {
/*overhead=*/256,
/*per_n=*/3 * vec_dot_size, // W + S0 + S1
/*per_m=*/vec_dot_size + f32_scratch_per_m, // A + optional F32 scratch
/*per_mn=*/sizeof(__fp16), // O
m, n,
/*m_block_cost=*/(size_t) n,
/*n_block_cost=*/(size_t) m, &m_chunk_n_rows, &n_chunk_n_cols, &vtcm_used) != 0) {
FARF(HIGH, "%s: VTCM too small (m=%d k=%d n=%d budget=%zu)", __func__, m, k, n, vtcm_budget);
return -1;
}
@@ -1039,7 +1102,7 @@ int hmx_mat_mul_permuted_w16a32(struct htp_context *ctx, float *restrict dst, co
return -1;
}
hmx_init_column_scales(vtcm_scales, Q6_V_vsplat_R(0x3c00)); // fp16: 1.0
hmx_init_column_scales(vtcm_scales, Q6_V_vsplat_R(0x3c00)); // scale: 1.0, bias: 0.0 in FP16
FARF(MEDIUM, "%s: m=%d k=%d n=%d mc=%zu nc=%zu vtcm=%zu/%zu",
__func__, m, k, n, m_chunk_n_rows, n_chunk_n_cols,
@@ -1057,7 +1120,8 @@ int hmx_mat_mul_permuted_w16a32(struct htp_context *ctx, float *restrict dst, co
for (size_t mr = 0; mr < m; mr += m_chunk_n_rows) {
// transfer activation matrix chunk into VTCM
size_t n_rows = hex_smin(m - mr, m_chunk_n_rows);
const size_t n_rows = hex_smin(m - mr, m_chunk_n_rows);
const size_t n_row_tiles = hmx_ceil_div(n_rows, HMX_FP16_TILE_N_ROWS);
TIMER_START(activation_load);
{
@@ -1095,7 +1159,8 @@ int hmx_mat_mul_permuted_w16a32(struct htp_context *ctx, float *restrict dst, co
}
for (size_t nc = 0; nc < n; nc += n_chunk_n_cols) {
size_t n_cols = hex_smin(n - nc, n_chunk_n_cols);
const size_t n_cols = hex_smin(n - nc, n_chunk_n_cols);
const size_t n_col_tiles = hmx_ceil_div(n_cols, HMX_FP16_TILE_N_COLS);
TIMER_START(weight_load);
{
@@ -1120,8 +1185,6 @@ int hmx_mat_mul_permuted_w16a32(struct htp_context *ctx, float *restrict dst, co
TIMER_START(hmx_core);
{
const int n_row_tiles = hmx_ceil_div(n_rows, HMX_FP16_TILE_N_ROWS);
const int n_col_tiles = hmx_ceil_div(n_cols, HMX_FP16_TILE_N_COLS);
core_dot_chunk_fp16(vtcm_output, vtcm_activation, vtcm_weight, vtcm_scales, n_row_tiles, n_col_tiles, k / 32);
}
TIMER_STOP(hmx_core);
@@ -1157,6 +1220,8 @@ int hmx_mat_mul_permuted_w16a32(struct htp_context *ctx, float *restrict dst, co
int mat_mul_qk_0_d16a32_out_stationary(struct htp_context *ctx, float *restrict out, const float *restrict x, const uint8_t *restrict w, int m,
int k, int n, int w_type);
#define FALLBACK_TO_STANDARD 1
int hmx_mat_mul_permuted_qk_0_d16a32(struct htp_context *ctx, float *restrict dst, const float *restrict activation,
const uint8_t *restrict permuted_weight, int m, int k, int n,
int weight_type) {
@@ -1169,9 +1234,12 @@ int hmx_mat_mul_permuted_qk_0_d16a32(struct htp_context *ctx, float *restrict ds
// for large m, k (e.g. prefill FFN Down), use out-stationary version
if (m >= 128 && k > n && n > 1024) {
FARF(MEDIUM, "hmx_matmul_qk: OUT-STATIONARY path m=%d k=%d n=%d type=%d (K_BLOCK=512, %d K-iters with fp16 intermediate)",
m, k, n, weight_type, (k + 511) / 512);
return mat_mul_qk_0_d16a32_out_stationary(ctx, dst, activation, permuted_weight, m, k, n, weight_type);
int rc = mat_mul_qk_0_d16a32_out_stationary(ctx, dst, activation, permuted_weight, m, k, n, weight_type);
if (rc != FALLBACK_TO_STANDARD) {
return rc; // 0 success, -1 error
}
FARF(MEDIUM, "hmx_matmul_qk: out-stationary fallback to standard m=%d k=%d n=%d", m, k, n);
// fall through to standard path
}
size_t row_stride = get_x4x2_row_stride(weight_type, k);
@@ -1197,9 +1265,10 @@ int hmx_mat_mul_permuted_qk_0_d16a32(struct htp_context *ctx, float *restrict ds
}
size_t m_chunk_n_rows = 0, n_chunk_n_cols = 0, vtcm_used = 0;
if (hmx_compute_chunks(vtcm_budget, /*overhead=*/256,
per_n_cost, /*per_m=*/vec_dot_size, per_mn_cost,
m, n, &m_chunk_n_rows, &n_chunk_n_cols, &vtcm_used) != 0) {
// Quantized weight: dequant ~1.5x more expensive per element than activation load.
if (hmx_compute_chunks(vtcm_budget, /*overhead=*/256, per_n_cost, /*per_m=*/vec_dot_size, per_mn_cost, m, n,
/*m_block_cost=*/(size_t) n * 3,
/*n_block_cost=*/(size_t) m * 2, &m_chunk_n_rows, &n_chunk_n_cols, &vtcm_used) != 0) {
FARF(HIGH, "%s: VTCM too small (m=%d k=%d n=%d pipe=%d budget=%zu)",
__func__, m, k, n, use_pipeline, vtcm_budget);
return -1;
@@ -1237,7 +1306,7 @@ int hmx_mat_mul_permuted_qk_0_d16a32(struct htp_context *ctx, float *restrict ds
return -1;
}
hmx_init_column_scales(vtcm_scales, Q6_V_vsplat_R(0x3c00)); // fp16: 1.0
hmx_init_column_scales(vtcm_scales, Q6_V_vsplat_R(0x3c00)); // scale: 1.0, bias: 0.0 in FP16
FARF(MEDIUM, "%s: m=%d k=%d n=%d wtype=%d pipe=%d mc=%zu nc=%zu vtcm=%zu/%zu",
__func__, m, k, n, weight_type, use_pipeline,
@@ -1256,12 +1325,12 @@ int hmx_mat_mul_permuted_qk_0_d16a32(struct htp_context *ctx, float *restrict ds
use_pipeline ? "PIPELINE" : "SEQUENTIAL", m_chunk_n_rows, n_chunk_n_cols,
(size_t)(vtcm_ptr - (uint8_t *)ctx->vtcm_base), vtcm_budget);
HAP_compute_res_hmx_lock(ctx->vtcm_rctx);
if (!use_pipeline) {
HAP_compute_res_hmx_lock(ctx->vtcm_rctx);
for (size_t mr = 0; mr < m; mr += m_chunk_n_rows) {
// transfer activation matrix chunk into VTCM
size_t n_rows = hex_smin(m - mr, m_chunk_n_rows);
const size_t n_rows = hex_smin(m - mr, m_chunk_n_rows);
const size_t n_row_tiles = hmx_ceil_div(n_rows, HMX_FP16_TILE_N_ROWS);
TIMER_START(activation_load);
{
@@ -1279,7 +1348,8 @@ int hmx_mat_mul_permuted_qk_0_d16a32(struct htp_context *ctx, float *restrict ds
}
for (size_t nc = 0; nc < n; nc += n_chunk_n_cols) {
size_t n_cols = hex_smin(n - nc, n_chunk_n_cols);
const size_t n_cols = hex_smin(n - nc, n_chunk_n_cols);
const size_t n_col_tiles = hmx_ceil_div(n_cols, HMX_FP16_TILE_N_COLS);
TIMER_START(weight_load);
{
@@ -1304,8 +1374,6 @@ int hmx_mat_mul_permuted_qk_0_d16a32(struct htp_context *ctx, float *restrict ds
TIMER_START(hmx_core);
{
const int n_row_tiles = hmx_ceil_div(n_rows, HMX_FP16_TILE_N_ROWS);
const int n_col_tiles = hmx_ceil_div(n_cols, HMX_FP16_TILE_N_COLS);
core_dot_chunk_fp16(vtcm_output, vtcm_activation, vtcm_weight, vtcm_scales, n_row_tiles, n_col_tiles, k / 32);
}
TIMER_STOP(hmx_core);
@@ -1318,20 +1386,22 @@ int hmx_mat_mul_permuted_qk_0_d16a32(struct htp_context *ctx, float *restrict ds
TIMER_STOP(output_store);
}
}
HAP_compute_res_hmx_unlock(ctx->vtcm_rctx);
} else {
// 4-stage pipeline: DMA load (A), dequantize (B), HMX matmul (C), store (D)
// stages B and D (dequantize and store) are expected to be on the critical path
// HMX compute (C) runs on dedicated worker thread, overlapping with HVX stages (B, D).
// A --> B: vtcm_qweight, 1 buffer
// B --> C: vtcm_weight0/vtcm_weight1, 2 buffers
// C --> D: vtcm_output0/vtcm_output1, 2 buffers
//
// LD ||A3| | B3 ||
// MM || C2 ||
// ST || D1 | ||
// Async timeline (C overlaps B+D):
// main+HVX: [A0][Act][B0][A1][sub C0][B1‖C0][A2][wait,sub C1][D0+B2‖C1][wait,sub C2][D1‖C2][wait][D2]
// HMX queue: [████ C0 ████████][████ C1 ████████████][████ C2 ████████]
int n_chunk_cnt = hmx_ceil_div(n, n_chunk_n_cols);
hmx_matmul_job_t job_slots[2]; // persistent double-buffered job descriptors
for (size_t mr = 0; mr < m; mr += m_chunk_n_rows) {
const size_t n_rows = hex_smin(m - mr, m_chunk_n_rows);
@@ -1352,31 +1422,34 @@ int hmx_mat_mul_permuted_qk_0_d16a32(struct htp_context *ctx, float *restrict ds
transfer_activation_chunk_threaded(ctx, vtcm_activation, activation_chunk, n_rows, k, k);
}
// prologue: B0, A1, C0, B1
// prologue: B0, A1, submit C0 (async), B1 (overlaps C0)
{
// B0
// B0: wait for DMA, dequant weight chunk 0
dma_queue_pop(ctx->dma[0]);
dequantize_x4x2_weight_chunk_to_fp16_tiles(ctx, vtcm_weight_bufs[0], vtcm_qweight, n_cols_A0, k, row_stride, weight_type);
// A1
// A1: issue DMA for weight chunk 1
const size_t n_cols_A1 = hex_smin(n - 1 * n_chunk_n_cols, n_chunk_n_cols);
if (1 < n_chunk_cnt) {
const uint8_t *qweight_chunk_A1 = permuted_weight + n_chunk_n_cols * row_stride;
dma_queue_push(ctx->dma[0], dma_make_ptr(vtcm_qweight, qweight_chunk_A1), row_stride, row_stride, row_stride, n_cols_A1);
}
// C0
core_dot_chunk_fp16((__fp16 *) vtcm_output_bufs[0], (__fp16 *) vtcm_activation, (__fp16 *) vtcm_weight_bufs[0], vtcm_scales,
hmx_ceil_div(n_rows, HMX_FP16_TILE_N_ROWS), hmx_ceil_div(n_cols_A0, HMX_FP16_TILE_N_COLS), k / HMX_FP16_TILE_N_ROWS);
// submit C0 (non-blocking — HMX worker executes in parallel)
hmx_matmul_job_init(&job_slots[0], (__fp16 *) vtcm_output_bufs[0], (__fp16 *) vtcm_activation,
(__fp16 *) vtcm_weight_bufs[0], vtcm_scales,
hmx_ceil_div(n_rows, HMX_FP16_TILE_N_ROWS),
hmx_ceil_div(n_cols_A0, HMX_FP16_TILE_N_COLS), k / HMX_FP16_TILE_N_ROWS);
hmx_queue_push(ctx->hmx_queue, hmx_queue_make_desc(hmx_matmul_worker_fn, &job_slots[0]));
// B1
// B1: DMA pop + dequant (runs in parallel with C0 on HMX worker)
if (1 < n_chunk_cnt) {
dma_queue_pop(ctx->dma[0]);
dequantize_x4x2_weight_chunk_to_fp16_tiles(ctx, vtcm_weight_bufs[1], vtcm_qweight, n_cols_A1, k, row_stride, weight_type);
}
}
// main loop
// main loop: wait C_i → submit C_{i+1} → D_i + B_{i+2} (parallel with C_{i+1})
for (int i = 0; i < n_chunk_cnt; ++i) {
const size_t nc = i * n_chunk_n_cols;
const size_t nc_p1 = nc + 1 * n_chunk_n_cols;
@@ -1386,36 +1459,41 @@ int hmx_mat_mul_permuted_qk_0_d16a32(struct htp_context *ctx, float *restrict ds
const size_t n_cols_p1 = hex_smin(n - nc_p1, n_chunk_n_cols);
const size_t n_cols_p2 = hex_smin(n - nc_p2, n_chunk_n_cols);
// issue A_{i+2}
// issue A_{i+2}: DMA push (non-blocking)
if (i + 2 < n_chunk_cnt) {
const uint8_t *qweight_chunk_p2 = permuted_weight + nc_p2 * row_stride;
dma_queue_push(ctx->dma[0], dma_make_ptr(vtcm_qweight, qweight_chunk_p2), row_stride, row_stride, row_stride, n_cols_p2);
}
// wait for HMX (C_{i}) -- C_{i} is done
// wait C_i: block until prologue/previous C completes
hmx_queue_pop(ctx->hmx_queue);
// result of B_{i+1} (input of C_{i+1}) should be ready now
// issue C_{i+1}
// submit C_{i+1} (non-blocking, overlaps with D_i + B_{i+2} below)
// job_slots[(i+1)%2] is safe to reuse: it was last used by C_{i-1},
// which was waited on in the previous iteration before C_i was submitted,
// and C_i itself has just completed above.
if (i + 1 < n_chunk_cnt) {
core_dot_chunk_fp16((__fp16 *) vtcm_output_bufs[(i + 1) % 2], (__fp16 *) vtcm_activation, (__fp16 *) vtcm_weight_bufs[(i + 1) % 2], vtcm_scales,
hmx_ceil_div(n_rows, HMX_FP16_TILE_N_ROWS), hmx_ceil_div(n_cols_p1, HMX_FP16_TILE_N_COLS), k / HMX_FP16_TILE_N_ROWS);
hmx_matmul_job_init(&job_slots[(i + 1) % 2], (__fp16 *) vtcm_output_bufs[(i + 1) % 2],
(__fp16 *) vtcm_activation, (__fp16 *) vtcm_weight_bufs[(i + 1) % 2],
vtcm_scales, hmx_ceil_div(n_rows, HMX_FP16_TILE_N_ROWS),
hmx_ceil_div(n_cols_p1, HMX_FP16_TILE_N_COLS), k / HMX_FP16_TILE_N_ROWS);
hmx_queue_push(ctx->hmx_queue, hmx_queue_make_desc(hmx_matmul_worker_fn, &job_slots[(i + 1) % 2]));
}
// compute D_{i}
// D_i: store output (multi-thread HVX, parallel with C_{i+1})
float *output_chunk = dst + (mr * n + nc);
transfer_output_chunk_threaded(ctx, output_chunk, vtcm_output_bufs[i % 2], n_rows, n_cols, n);
// wait for DMA (A_{i+2}), compute B_{i+2}
// B_{i+2}: DMA pop + dequant (multi-thread HVX, parallel with C_{i+1})
if (i + 2 < n_chunk_cnt) {
dma_queue_pop(ctx->dma[0]);
dequantize_x4x2_weight_chunk_to_fp16_tiles(ctx, vtcm_weight_bufs[(i + 2) % 2], vtcm_qweight, n_cols_p2, k, row_stride, weight_type);
}
}
}
}
HAP_compute_res_hmx_unlock(ctx->vtcm_rctx);
hmx_queue_suspend(ctx->hmx_queue);
}
TIMER_STOP(total);
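The submit/wait choreography is compact enough to model with stubs. A runnable toy sketch of the slot rotation, assuming stub stage functions; the DMA A stages and the prologue activation load are elided:

#include <cstdio>

static void wait_hmx(int i)   { printf("wait   C%d\n", i); }                  // hmx_queue_pop
static void submit_hmx(int i) { printf("submit C%d (slot %d)\n", i, i % 2); } // hmx_queue_push
static void store_out(int i)  { printf("store  D%d\n", i); }                  // HVX output transfer
static void dequant(int i)    { printf("dequant B%d\n", i); }                 // HVX weight dequant

int main() {
    const int n_chunks = 4;
    dequant(0);
    submit_hmx(0);                          // prologue: C0 after B0
    if (n_chunks > 1) dequant(1);           // B1 overlaps C0
    for (int i = 0; i < n_chunks; ++i) {
        wait_hmx(i);                        // C_i complete; slot i%2 reusable
        if (i + 1 < n_chunks) submit_hmx(i + 1); // overlaps D_i and B_{i+2}
        store_out(i);                       // D_i
        if (i + 2 < n_chunks) dequant(i + 2);    // B_{i+2} feeds C_{i+2}
    }
    return 0;
}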
@@ -1434,29 +1512,36 @@ int hmx_mat_mul_permuted_qk_0_d16a32(struct htp_context *ctx, float *restrict ds
}
// C += AB
void core_mma_chunk_fp16(__fp16 *c, const __fp16 *a, const __fp16 *b, const __fp16 *col_scales, const __fp16 *eye_tile,
void core_mma_chunk_fp16(__fp16 *restrict c, const __fp16 *restrict a, const __fp16 *restrict b, const __fp16 *restrict col_scales, const __fp16 *restrict eye_tile,
int n_row_tiles, int n_col_tiles, int n_dot_tiles, bool zero_init) {
__builtin_assume(n_row_tiles > 0);
__builtin_assume(n_col_tiles > 0);
__builtin_assume(n_dot_tiles > 0);
hmx_set_output_scales(col_scales);
Q6_bias_mxmem2_A((void *)col_scales);
for (int i = 0; i < n_row_tiles; ++i) {
for (int j = 0; j < n_col_tiles; ++j) {
const size_t dot_tile_stride = n_dot_tiles * HMX_FP16_TILE_N_ELMS;
for (size_t i = 0; i < n_row_tiles; ++i) {
const __fp16 *row_base = a + i * dot_tile_stride;
__fp16 *res_base = c + i * n_col_tiles * HMX_FP16_TILE_N_ELMS;
for (size_t j = 0; j < n_col_tiles; ++j) {
Q6_mxclracc_hf();
const __fp16 *row_tiles = a + i * n_dot_tiles * HMX_FP16_TILE_N_ELMS;
const __fp16 *col_tiles = b + j * n_dot_tiles * HMX_FP16_TILE_N_ELMS;
__fp16 *accum_tile = c + (i * n_col_tiles + j) * HMX_FP16_TILE_N_ELMS;
const __fp16 *col_tiles = b + j * dot_tile_stride;
const __fp16 *row_tiles = row_base;
__fp16 *accum_tile = res_base + j * HMX_FP16_TILE_N_ELMS;
if (!zero_init) {
hmx_load_tile_pair_fp16(accum_tile, eye_tile);
Q6_activation_hf_mxmem_RR((unsigned int)accum_tile, 2047);
Q6_weight_hf_mxmem_RR((unsigned int)eye_tile, 2047);
}
for (int k = 0; k < n_dot_tiles; ++k) {
int offset = k * HMX_FP16_TILE_N_ELMS;
hmx_load_tile_pair_fp16(row_tiles + offset, col_tiles + offset);
Q6_activation_hf_mxmem_RR((unsigned int)row_tiles, 2047);
Q6_weight_hf_mxmem_RR((unsigned int)col_tiles, 2047);
row_tiles += HMX_FP16_TILE_N_ELMS;
col_tiles += HMX_FP16_TILE_N_ELMS;
}
hmx_consume_accumulator_fp16(accum_tile);
Q6_mxmem_AR_after_hf(accum_tile, 0);
}
}
}
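A scalar reference makes the eye-tile trick concrete: preloading the accumulator with C_tile x I is equivalent to starting the scalar accumulator from the existing output value. The sketch below assumes a plain row-major tile layout for readability; the real HMX tile format is interleaved:

#include <cstddef>

// C_tile += sum over kt of A_tile[kt] * B_tile[kt], with optional zero init.
static void mma_tile_ref(float * c, const float * a, const float * b,
                         size_t T, size_t n_dot_tiles, bool zero_init) {
    for (size_t r = 0; r < T; ++r) {
        for (size_t col = 0; col < T; ++col) {
            float acc = zero_init ? 0.0f : c[r * T + col]; // eye-preload equivalent
            for (size_t kt = 0; kt < n_dot_tiles; ++kt) {
                for (size_t k = 0; k < T; ++k) {
                    acc += a[kt * T * T + r * T + k] * b[kt * T * T + k * T + col];
                }
            }
            c[r * T + col] = acc;
        }
    }
}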
@@ -1540,12 +1625,41 @@ int mat_mul_qk_0_d16a32_out_stationary(struct htp_context *ctx, float *restrict
const size_t vtcm_budget = ctx->vtcm_size;
const size_t M_BLOCK_SIZE = 512;
const size_t N_BLOCK_SIZE = 512;
const size_t K_BLOCK_SIZE = 512;
const size_t K_BLOCK_SIZE = 1024;
// Compute precise buffer sizes
// Fallback: if k doesn't need K-blocking, out-stationary has no advantage
const size_t k_iters_check = (k + K_BLOCK_SIZE - 1) / K_BLOCK_SIZE;
if (k_iters_check <= 1) {
FARF(MEDIUM, "%s: K_BLK=%zu >= k=%d, fallback to standard path", __func__, K_BLOCK_SIZE, k);
return FALLBACK_TO_STANDARD;
}
// Dynamic M,N search via hmx_compute_chunks
const size_t sub_row_stride_alloc = get_x4x2_row_stride(weight_type, K_BLOCK_SIZE);
const size_t per_m = K_BLOCK_SIZE * sizeof(float) // scratch1: M×K×4 (act DMA staging F32)
+ K_BLOCK_SIZE * sizeof(__fp16); // activation: M×K×2 (F16 tiles)
const size_t per_n = sub_row_stride_alloc // scratch0: N×sub_row(K) (packed quant)
+ K_BLOCK_SIZE * sizeof(__fp16); // weight: N×K×2 (F16 tiles)
const size_t per_mn = sizeof(__fp16); // output: M×N×2 (out-stationary)
// Alignment margin: hex_align_up can add up to 2047 bytes per buffer;
// scratch1 (mc×6144) is naturally 2048-aligned, remaining 4 buffers need margin
const size_t align_margin = 4 * HMX_FP16_TILE_SIZE;
const size_t overhead = HMX_FP16_TILE_SIZE + 256 + align_margin; // eye_tile + scales + alignment
size_t M_BLOCK_SIZE, N_BLOCK_SIZE, vtcm_used;
// Cost-based search: minimize ceil(m/mc)*m_block_cost + ceil(n/nc)*n_block_cost.
// From profiling: wt_dequant per element ≈ 1.5× activation load per element.
// m_block_cost = n*3: each extra M-block re-dequants all N×K weight (expensive).
// n_block_cost = m*2: each extra N-block re-loads all M×K activation (cheaper).
const size_t m_block_cost = (size_t) n * 3;
const size_t n_block_cost = (size_t) m * 2;
if (hmx_compute_chunks(vtcm_budget, overhead, per_n, per_m, per_mn, m, n, m_block_cost, n_block_cost, &M_BLOCK_SIZE,
&N_BLOCK_SIZE, &vtcm_used) != 0) {
FARF(HIGH, "%s: VTCM too small (m=%d k=%d n=%d budget=%zu)", __func__, m, k, n, vtcm_budget);
return -1;
}
// Compute precise buffer sizes from searched M,N and fixed K
const size_t weight_size = hex_align_up(N_BLOCK_SIZE * K_BLOCK_SIZE * sizeof(__fp16), HMX_FP16_TILE_SIZE);
const size_t act_size = hex_align_up(M_BLOCK_SIZE * K_BLOCK_SIZE * sizeof(__fp16), HMX_FP16_TILE_SIZE);
const size_t out_size = hex_align_up(M_BLOCK_SIZE * N_BLOCK_SIZE * sizeof(__fp16), HMX_FP16_TILE_SIZE);
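As a sanity check on the search delegated to hmx_compute_chunks above, here is a hypothetical sketch of an equivalent exhaustive search over tile-granular candidates; the function name and the 32-wide candidate step are assumptions, it presumes m and n are at least one tile wide, and the real helper may prune differently.

#include <stddef.h>

// Minimize ceil(m/mc)*m_block_cost + ceil(n/nc)*n_block_cost subject to
// overhead + nc*per_n + mc*per_m + mc*nc*per_mn <= budget.
static int chunk_search_sketch(size_t budget, size_t overhead,
                               size_t per_n, size_t per_m, size_t per_mn,
                               size_t m, size_t n,
                               size_t m_block_cost, size_t n_block_cost,
                               size_t *mc_out, size_t *nc_out) {
    size_t best = (size_t) -1;
    for (size_t mc = 32; mc <= m; mc += 32) {
        for (size_t nc = 32; nc <= n; nc += 32) {
            const size_t need = overhead + nc * per_n + mc * per_m + mc * nc * per_mn;
            if (need > budget) {
                break; // need grows monotonically in nc
            }
            const size_t cost = ((m + mc - 1) / mc) * m_block_cost +
                                ((n + nc - 1) / nc) * n_block_cost;
            if (cost < best) {
                best = cost; *mc_out = mc; *nc_out = nc;
            }
        }
    }
    return best == (size_t) -1 ? -1 : 0; // -1: nothing fits the budget
}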
@@ -1554,7 +1668,8 @@ int mat_mul_qk_0_d16a32_out_stationary(struct htp_context *ctx, float *restrict
const size_t total_vtcm = weight_size + act_size + out_size + scratch0_sz + scratch1_sz + HMX_FP16_TILE_SIZE + 256;
if (total_vtcm > vtcm_budget) {
FARF(HIGH, "%s: VTCM too small: need %zu have %zu (m=%d k=%d n=%d)", __func__, total_vtcm, vtcm_budget, m, k, n);
FARF(HIGH, "%s: VTCM overflow after search: need %zu have %zu (M=%zu N=%zu K=%zu)", __func__, total_vtcm,
vtcm_budget, M_BLOCK_SIZE, N_BLOCK_SIZE, K_BLOCK_SIZE);
return -1;
}
@@ -1568,8 +1683,8 @@ int mat_mul_qk_0_d16a32_out_stationary(struct htp_context *ctx, float *restrict
__fp16 *vtcm_scales = (__fp16 *) vtcm_seq_alloc(&vtcm_ptr, 256);
assert((size_t)(vtcm_ptr - (uint8_t *)ctx->vtcm_base) <= vtcm_budget);
FARF(HIGH, "hmx-mm: m=%d k=%d n=%d wtype=%d block M=%zu N=%zu K=%zu vtcm=%zu/%zu", m, k, n, weight_type,
M_BLOCK_SIZE, N_BLOCK_SIZE, K_BLOCK_SIZE, (size_t) (vtcm_ptr - (uint8_t *) ctx->vtcm_base), vtcm_budget);
// initialize eye tile (32x32 identity matrix)
{
@@ -1583,7 +1698,7 @@ int mat_mul_qk_0_d16a32_out_stationary(struct htp_context *ctx, float *restrict
v = Q6_V_vror_VR(v, VLEN - 8);
}
}
hmx_init_column_scales(vtcm_scales, Q6_V_vsplat_R(0x3c00)); // scale: 1.0, bias: 0.0 in FP16
TIMER_DEFINE(fetch);
TIMER_DEFINE(act_load);
@@ -1601,7 +1716,7 @@ int mat_mul_qk_0_d16a32_out_stationary(struct htp_context *ctx, float *restrict
const int n_col_tiles = hmx_ceil_div(n_blk_sz, HMX_FP16_TILE_N_COLS);
for (size_t kk = 0; kk < k; kk += K_BLOCK_SIZE) {
const size_t k_blk_sz = hex_smin(k - kk, K_BLOCK_SIZE);
TIMER_START(fetch);
// fetch activation block into VTCM
@@ -1617,13 +1732,13 @@ int mat_mul_qk_0_d16a32_out_stationary(struct htp_context *ctx, float *restrict
}
// fetch weight block into VTCM (x4x2 sub-block: quants + scales)
const size_t sub_row_stride = get_x4x2_row_stride(weight_type, k_blk_sz);
{
qweight_fetch_task_state_t s;
const int blk_start = kk / QK_Q4_0x4x2;
const int nb_sub = (k_blk_sz + QK_Q4_0x4x2 - 1) / QK_Q4_0x4x2;
const int full_qrow = (weight_type == HTP_TYPE_Q8_0) ? k : (k / 2);
const int scale_blk_size =
(weight_type == HTP_TYPE_MXFP4) ? HMX_X4X2_MXFP4_EBLK_SIZE : HMX_X4X2_DBLK_SIZE;
@@ -1663,7 +1778,6 @@ int mat_mul_qk_0_d16a32_out_stationary(struct htp_context *ctx, float *restrict
dma_queue_pop(ctx->dma[0]);
// vtcm_scratch0 is used to store the qweight chunk
// worker_pool_run_func already returned, so fetch is done
dequantize_x4x2_weight_chunk_to_fp16_tiles(ctx, vtcm_weight, vtcm_scratch0,
n_blk_sz, k_blk_sz, sub_row_stride, weight_type);
}

View File

@@ -0,0 +1,158 @@
#pragma clang diagnostic ignored "-Wunused-function"
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <qurt_thread.h>
#include <qurt_futex.h>
#include <HAP_compute_res.h>
#include "hmx-queue.h"
#define QURT_LOWEST_PRIO (254)
static inline void hmx_lock(struct hmx_queue *q)
{
if (!q->hmx_locked) {
HAP_compute_res_hmx_lock(q->hap_rctx);
q->hmx_locked = true;
}
}
static inline void hmx_unlock(struct hmx_queue *q)
{
if (q->hmx_locked) {
HAP_compute_res_hmx_unlock(q->hap_rctx);
q->hmx_locked = false;
}
}
static inline void hmx_queue_process(struct hmx_queue *q, bool* killed) {
unsigned int ir = atomic_load(&q->idx_read);
while (ir != atomic_load(&q->idx_write)) {
struct hmx_queue_desc *d = &q->desc[ir];
if (!d->done) {
FARF(HIGH, "hmx-queue-process: ir %u func %p data %p", ir, d->func, d->data);
enum hmx_queue_signal sig = (enum hmx_queue_signal) (unsigned int) d->func;
switch (sig) {
case HMX_QUEUE_NOOP: /* noop */; break;
case HMX_QUEUE_KILL: *killed = true; break;
case HMX_QUEUE_SUSPEND: hmx_unlock(q); break;
default:
hmx_lock(q);
d->func(d->data);
break;
}
atomic_fetch_add(&d->done, 1);
}
ir = (ir + 1) & q->idx_mask;
atomic_store(&q->idx_read, ir);
}
}
static void hmx_queue_thread(void * arg) {
struct hmx_queue * q = (struct hmx_queue *) arg;
FARF(HIGH, "hmx-queue-thread: started");
bool killed = false;
unsigned int poll_cnt = HMX_QUEUE_POLL_COUNT;
unsigned int prev_seqn = 0;
while (!killed) {
unsigned int seqn = atomic_load(&q->seqn);
if (seqn == prev_seqn) {
if (--poll_cnt) { hex_pause(); continue; }
FARF(HIGH, "hmx-queue-thread: sleeping");
qurt_futex_wait(&q->seqn, prev_seqn);
poll_cnt = HMX_QUEUE_POLL_COUNT; // restart the poll window after a wake-up
continue;
}
prev_seqn = seqn;
poll_cnt = HMX_QUEUE_POLL_COUNT;
FARF(HIGH, "hmx-queue-thread: new work");
hmx_queue_process(q, &killed);
}
FARF(HIGH, "hmx-queue-thread: stopped");
}
struct hmx_queue * hmx_queue_create(size_t capacity, uint32_t hap_rctx) {
capacity = hex_ceil_pow2(capacity);
struct hmx_queue * q = (struct hmx_queue *) memalign(32, sizeof(struct hmx_queue));
if (q == NULL) {
FARF(ERROR, "%s: failed to allocate DMA queue\n", __FUNCTION__);
return NULL;
}
memset(q, 0, sizeof(struct hmx_queue));
q->capacity = capacity;
q->idx_mask = capacity - 1;
q->hap_rctx = hap_rctx;
q->desc = (struct hmx_queue_desc *) memalign(64, capacity * sizeof(struct hmx_queue_desc));
if (!q->desc) {
FARF(ERROR, "hmx-queue: failed to allocate HMX queue descriptors\n");
free(q);
return NULL;
}
memset(q->desc, 0, capacity * sizeof(struct hmx_queue_desc));
const size_t stack_size = HMX_QUEUE_THREAD_STACK_SIZE;
q->stack = (unsigned char *) memalign(64, stack_size);
if (!q->stack) {
FARF(ERROR, "hmx-queue: thread stack allocation failed (%zu bytes)", stack_size);
free(q->desc);
free(q);
return NULL;
}
memset(q->stack, 0, stack_size);
// Match caller thread priority (same pattern as worker-pool.c).
int prio = qurt_thread_get_priority(qurt_thread_get_id());
if (prio < 1) {
prio = 1;
}
if (prio > QURT_LOWEST_PRIO) {
prio = QURT_LOWEST_PRIO;
}
qurt_thread_attr_t attr;
qurt_thread_attr_init(&attr);
qurt_thread_attr_set_stack_addr(&attr, q->stack);
qurt_thread_attr_set_stack_size(&attr, stack_size);
qurt_thread_attr_set_priority(&attr, prio);
qurt_thread_attr_set_name(&attr, "hmx-queue");
int err = qurt_thread_create(&q->thread, &attr, hmx_queue_thread, q);
if (err) {
FARF(ERROR, "hmx-queue: thread create failed (%d)", err);
free(q->stack);
free(q->desc);
free(q);
return NULL;
}
FARF(HIGH, "hmx-queue: capacity %u\n", capacity);
return q;
}
void hmx_queue_delete(struct hmx_queue * q) {
if (!q) {
return;
}
// Tell the worker to exit.
hmx_queue_flush(q);
hmx_queue_signal(q, HMX_QUEUE_KILL);
hmx_queue_flush(q);
int status;
qurt_thread_join(q->thread, &status);
free(q->desc);
free(q->stack);
free(q);
}

View File

@@ -0,0 +1,134 @@
#ifndef HMX_QUEUE_H
#define HMX_QUEUE_H
#include <stdbool.h>
#include <stdint.h>
#include <stdatomic.h>
#include <hexagon_types.h>
#include <qurt_thread.h>
#include <qurt_futex.h>
#include <HAP_farf.h>
#include "hex-utils.h"
#ifdef __cplusplus
extern "C" {
#endif
#define HMX_QUEUE_THREAD_STACK_SIZE (16 * 1024)
#define HMX_QUEUE_POLL_COUNT 2000
typedef void (*hmx_queue_func)(void *);
// Dummy funcs used as signals
enum hmx_queue_signal {
HMX_QUEUE_NOOP = 0, // aka NULL
HMX_QUEUE_SUSPEND,
HMX_QUEUE_KILL
};
struct hmx_queue_desc {
hmx_queue_func func;
void * data;
atomic_uint done;
};
struct hmx_queue {
struct hmx_queue_desc * desc;
atomic_uint idx_write; // updated by producer (push)
atomic_uint idx_read; // updated by consumer (process)
unsigned int idx_pop; // updated by producer (pop)
uint32_t idx_mask;
uint32_t capacity;
atomic_uint seqn; // incremented for all pushes, used with futex
qurt_thread_t thread;
void * stack;
uint32_t hap_rctx;
bool hmx_locked;
};
struct hmx_queue * hmx_queue_create(size_t capacity, uint32_t hap_rctx);
void hmx_queue_delete(struct hmx_queue * q);
static inline struct hmx_queue_desc hmx_queue_make_desc(hmx_queue_func func, void * data) {
struct hmx_queue_desc d = { func, data };
return d;
}
static inline bool hmx_queue_push(struct hmx_queue * q, struct hmx_queue_desc d) {
unsigned int ir = atomic_load(&q->idx_read);
unsigned int iw = q->idx_write;
if (((iw + 1) & q->idx_mask) == ir) {
FARF(HIGH, "hmx-queue-push: queue is full\n");
return false;
}
atomic_store(&d.done, 0);
FARF(HIGH, "hmx-queue-push: iw %u func %p data %p\n", iw, d.func, d.data);
q->desc[iw] = d;
atomic_store(&q->idx_write, (iw + 1) & q->idx_mask);
// wake up our thread
atomic_fetch_add(&q->seqn, 1);
qurt_futex_wake(&q->seqn, 1);
return true;
}
static inline bool hmx_queue_signal(struct hmx_queue *q, enum hmx_queue_signal sig) {
return hmx_queue_push(q, hmx_queue_make_desc((hmx_queue_func) sig, NULL));
}
static inline bool hmx_queue_empty(struct hmx_queue * q) {
return q->idx_pop == q->idx_write;
}
static inline uint32_t hmx_queue_depth(struct hmx_queue * q) {
return (q->idx_write - q->idx_read) & q->idx_mask;
}
static inline uint32_t hmx_queue_capacity(struct hmx_queue * q) {
return q->capacity;
}
static inline struct hmx_queue_desc hmx_queue_pop(struct hmx_queue * q) {
unsigned int ip = q->idx_pop;
unsigned int iw = q->idx_write;
struct hmx_queue_desc rd = { NULL, NULL };
if (ip == iw) {
return rd;
}
// Wait for desc to complete
struct hmx_queue_desc * d = &q->desc[ip];
while (!atomic_load(&d->done)) {
FARF(HIGH, "hmx-queue-pop: waiting for HMX queue : %u\n", ip);
hex_pause();
}
rd = *d;
q->idx_pop = (ip + 1) & q->idx_mask;
FARF(HIGH, "hmx-queue-pop: ip %u func %p data %p\n", ip, rd.func, rd.data);
return rd;
}
static inline void hmx_queue_flush(struct hmx_queue * q) {
while (hmx_queue_pop(q).func != NULL) ;
}
static inline void hmx_queue_suspend(struct hmx_queue *q) {
hmx_queue_signal(q, HMX_QUEUE_SUSPEND);
hmx_queue_flush(q);
}
#ifdef __cplusplus
} // extern "C"
#endif
#endif /* HMX_QUEUE_H */
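To make the protocol concrete, a hypothetical producer-side use of this queue; run_tile and the bare hap_rctx value are illustrative stand-ins, not part of the sources.

#include "hmx-queue.h"

static void run_tile(void * data) {
    (void) data; // would issue HMX work for one tile here
}

static void hmx_queue_example(uint32_t hap_rctx) {
    struct hmx_queue * q = hmx_queue_create(16, hap_rctx);
    if (!q) {
        return;
    }
    hmx_queue_push(q, hmx_queue_make_desc(run_tile, NULL)); // worker takes the HMX lock lazily
    hmx_queue_push(q, hmx_queue_make_desc(run_tile, NULL));
    hmx_queue_flush(q);   // pop both descriptors, waiting until each is done
    hmx_queue_suspend(q); // release the HMX lock while idle
    hmx_queue_delete(q);  // KILL + join the worker thread
}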

View File

@@ -14,10 +14,6 @@
#define HMX_INLINE_ALWAYS inline __attribute__((unused, always_inline))
static HMX_INLINE_ALWAYS void hmx_set_output_scales(const void *scales) {
asm volatile("bias = mxmem2(%0)" :: "r"(scales));
}
// Initialise aligned 256-byte area with scale vector + zero padding.
static HMX_INLINE_ALWAYS void hmx_init_column_scales(void *out_scales, HVX_Vector v_scale) {
HVX_Vector *pv = (HVX_Vector *)out_scales;
@@ -25,58 +21,6 @@ static HMX_INLINE_ALWAYS void hmx_init_column_scales(void *out_scales, HVX_Vecto
*pv = Q6_V_vzero();
}
// Load multiple contiguous tiles with :deep streaming.
// Rt = total region size - 1; the hardware streams through [Rs, Rs + Rt].
// IMPORTANT: the tile region [Rs, Rs + Rt] must NOT cross a VTCM 4 MB bank
// boundary, otherwise the mxmem instruction will raise a precise bus error.
// Callers must ensure their VTCM layout satisfies this constraint.
static HMX_INLINE_ALWAYS void hmx_load_tiles_fp16(const __fp16 *row_tiles,
const __fp16 *col_tiles,
size_t n_tiles) {
size_t limit = n_tiles * HMX_FP16_TILE_SIZE - 1;
asm volatile(
"{ activation.hf = mxmem(%0, %1):deep\n"
"weight.hf = mxmem(%2, %3) }\n"
:: "r"(row_tiles), "r"(limit), "r"(col_tiles), "r"(limit)
: "memory");
}
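In debug builds the bank constraint above can be asserted cheaply. A minimal sketch, assuming a 4 MB bank-size constant that is not defined in these sources:

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VTCM_BANK_SIZE (4u * 1024 * 1024) // assumed 4 MB bank granule

// Verify that [base, base + n_tiles*tile_size) stays within one VTCM bank.
static inline void assert_no_bank_cross(const void *base, size_t n_tiles, size_t tile_size) {
    const uintptr_t lo = (uintptr_t) base;
    const uintptr_t hi = lo + n_tiles * tile_size - 1;
    assert(lo / VTCM_BANK_SIZE == hi / VTCM_BANK_SIZE);
}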
// Load a single activation+weight tile pair (no :deep streaming).
// Rt defines the accessible region [Rs, Rs+Rt]. Following the reference formula
// (limit = n_tiles * HMX_FP16_TILE_SIZE - 1), for a single tile Rt = 2047.
// The original code used Rt=0x7FFF (32 KB region); when dynamic VTCM allocation
// places a tile near a 4 MB bank boundary, the oversized region crosses it and
// triggers a precise bus error (0x2601). Rt=2047 confines accesses to exactly
// one 2048-byte tile while covering all 16 HVX vectors (offsets 0..2047).
static HMX_INLINE_ALWAYS void hmx_load_tile_pair_fp16(const __fp16 *act_tile,
const __fp16 *wt_tile) {
asm volatile(
"{ activation.hf = mxmem(%0, %1)\n"
"weight.hf = mxmem(%2, %3) }\n"
:: "r"(act_tile), "r"(2047),
"r"(wt_tile), "r"(2047)
: "memory");
}
static HMX_INLINE_ALWAYS void hmx_consume_accumulator_fp16(__fp16 *out) {
// Use the combined convert-and-store instruction (matches the reference
// Q6_mxmem_AR_after_hf intrinsic). The previous two-instruction sequence
// "cvt.hf = acc(2); mxmem = cvt" used an undocumented Rs=2 parameter.
asm volatile(
"mxmem(%0, %1):after.hf = acc\n"
:: "r"(out), "r"(0)
: "memory");
}
// Compute inner product of two vectors of tiles and store result.
static HMX_INLINE_ALWAYS void hmx_dot_fp16(__fp16 *out,
const __fp16 *row_tiles,
const __fp16 *col_tiles,
size_t n_tiles) {
hmx_load_tiles_fp16(row_tiles, col_tiles, n_tiles);
hmx_consume_accumulator_fp16(out);
}
// --- VTCM sequential allocator (from htp-ops-lib/include/dsp/vtcm_mgr.h) ---
static inline uint8_t *vtcm_seq_alloc(uint8_t **vtcm_ptr, size_t size) {

View File

@@ -2,6 +2,7 @@
#define HTP_CTX_H
#include "hex-dma.h"
#include "hmx-queue.h"
#include "htp-ops.h"
#include "worker-pool.h"
@@ -30,6 +31,8 @@ struct htp_spad {
uint32_t size_per_thread; // size per thread
};
struct htp_context;
// Context while processing an Op
// TODO: fold this into the main context
struct htp_ops_context {
@@ -72,6 +75,10 @@ struct htp_context {
atomic_bool vtcm_needs_release;
struct htp_ops_context octx;
#ifdef HTP_HAS_HMX
struct hmx_queue * hmx_queue; // Async HMX queue for pipeline overlap
#endif
};
int op_matmul(struct htp_ops_context * octx);

View File

@@ -91,7 +91,14 @@ enum htp_op_code {
#define HTP_OP_MAX_BUFS 8
#define HTP_OP_MAX_REQS 256
#define HTP_OP_MAX_TENSORS (HTP_OP_MAX_REQS * HTP_OP_MAX_INPUTS + HTP_OP_MAX_REQS)
#if __HVX_ARCH__ < 75
#define HTP_OP_MAX_VMEM (3167538380u)
#else
#define HTP_OP_MAX_VMEM (3221225472u)
#endif
#define HTP_MMAP_MAX_VMEM (2147483648u)
enum htp_tensor_flags {
HTP_TENSOR_COMPUTE = (1U << 0), // Tensor buffer temporal compute data (not weights)

View File

@@ -116,9 +116,14 @@ static inline HVX_VectorPred hvx_vec_is_nan_f16(HVX_Vector v) {
}
static inline HVX_Vector hvx_vec_f32_to_f16_shuff(HVX_Vector v0, HVX_Vector v1) {
#if __HVX_ARCH__ >= 81
HVX_Vector q0 = Q6_Vqf32_equals_Vsf(v0);
HVX_Vector q1 = Q6_Vqf32_equals_Vsf(v1);
#else
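// Pre-v81 targets have no direct sf -> qf32 cast; adding zero in qf32
// arithmetic performs the same format conversion.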
const HVX_Vector zero = Q6_V_vzero();
HVX_Vector q0 = Q6_Vqf32_vadd_VsfVsf(v0, zero);
HVX_Vector q1 = Q6_Vqf32_vadd_VsfVsf(v1, zero);
#endif
return Q6_Vhf_equals_Wqf32(Q6_W_vcombine_VV(q1, q0));
}

View File

@@ -18,8 +18,9 @@
#include <remote.h>
#include <string.h>
#include "hex-dma.h"
#include "hex-utils.h"
#include "hex-dma.h"
#include "hmx-queue.h"
#define GGML_COMMON_DECL_C
#include "ggml-common.h"
@@ -117,7 +118,11 @@ AEEResult htp_iface_close(remote_handle64 handle) {
// release the mmaps (if any)
for (uint32_t i=0; i<HTP_MAX_MMAPS; i++) {
if (ctx->mmap[i].size) {
#if __HVX_ARCH__ > 73
HAP_munmap2((void *) ctx->mmap[i].base, ctx->mmap[i].size);
#else
HAP_munmap((void *) ctx->mmap[i].base, ctx->mmap[i].size);
#endif
ctx->mmap[i].size = 0;
ctx->mmap[i].base = NULL;
ctx->mmap[i].fd = -1;
@@ -172,8 +177,16 @@ AEEResult htp_iface_mmap(remote_handle64 handle, int fd, uint32_t size, uint32_t
struct htp_mmap *m = &ctx->mmap[i];
if (!m->size) {
FARF(HIGH, "mmap : fd %u size %u pinned %u", fd, size, pinned);
#if __HVX_ARCH__ > 73
void *va = HAP_mmap2(NULL, size, HAP_PROT_READ | HAP_PROT_WRITE, 0, fd, 0);
#else
if (size > HTP_MMAP_MAX_VMEM) { // HAP_mmap has a size limit of 2GB
FARF(ERROR, "mmap failed : size %u exceeds 2GB limit for HAP_mmap", (uint32_t) size);
abort(); // can't do much else at this point
}
void *va = HAP_mmap(NULL, size, HAP_PROT_READ | HAP_PROT_WRITE, 0, fd, 0);
#endif
if (va == (void*)-1) {
FARF(ERROR, "mmap failed : va %p fd %u size %u", va, fd, (uint32_t) size);
return AEE_EFAILED;
@@ -201,7 +214,11 @@ AEEResult htp_iface_munmap(remote_handle64 handle, int fd) {
struct htp_mmap *m = &ctx->mmap[i];
if (fd < 0 || m->fd == fd) {
FARF(HIGH, "unmmap : base %p fd %u size %u", (void*) m->base, m->fd, (uint32_t) m->size);
#if __HVX_ARCH__ > 73
HAP_munmap2((void *) m->base, m->size);
#else
HAP_munmap((void *) m->base, m->size);
#endif
m->size = 0;
m->base = NULL;
m->fd = -1;
@@ -324,6 +341,14 @@ AEEResult htp_iface_start(remote_handle64 handle, uint32 sess_id, uint64 dsp_que
#ifdef HTP_HAS_HMX
ctx->hmx_enabled = use_hmx;
ctx->hmx_queue = NULL;
if (use_hmx) {
ctx->hmx_queue = hmx_queue_create(16, ctx->vtcm_rctx);
if (!ctx->hmx_queue) {
FARF(ERROR, "hmx-queue-create failed");
ctx->hmx_enabled = false;
}
}
FARF(HIGH, "HMX %s (use_hmx=%d)", ctx->hmx_enabled ? "enabled" : "disabled", use_hmx);
#endif
@@ -389,7 +414,11 @@ AEEResult htp_iface_stop(remote_handle64 handle) {
}
#ifdef HTP_HAS_HMX
if (ctx->hmx_queue) {
hmx_queue_delete(ctx->hmx_queue);
ctx->hmx_queue = NULL;
}
ctx->hmx_enabled = false;
#endif
vtcm_free(ctx);
@@ -513,7 +542,11 @@ static inline bool reuse_buf(struct htp_context *ctx, uint32_t *m_reuse, struct
static inline void drop_mmap(struct htp_context *ctx, struct htp_mmap *m) {
if (m->size && !m->pinned) {
FARF(HIGH, "unmap : fd %u base %p size %u pinned %u", m->fd, (void*) m->base, (uint32_t) m->size, m->pinned);
#if __HVX_ARCH__ > 73
HAP_munmap2((void *) m->base, m->size);
#else
HAP_munmap((void *) m->base, m->size);
#endif
m->size = 0;
m->base = 0;
m->fd = -1;
@@ -527,7 +560,16 @@ static inline void mmap_buf(struct htp_context *ctx, struct htp_buf_desc *b) {
for (uint32_t i=0; i < HTP_MAX_MMAPS; i++) {
struct htp_mmap *m = &ctx->mmap[i];
if (!m->size) {
#if __HVX_ARCH__ > 73
void *va = HAP_mmap2(NULL, b->size, HAP_PROT_READ | HAP_PROT_WRITE, 0, b->fd, 0);
#else
if (b->size > HTP_MMAP_MAX_VMEM) { // HAP_mmap has a size limit of 2GB
FARF(ERROR, "mmap failed : size %u exceeds 2GB limit for HAP_mmap", (uint32_t) b->size);
abort(); // can't do much else at this point
}
void *va = HAP_mmap(NULL, b->size, HAP_PROT_READ | HAP_PROT_WRITE, 0, b->fd, 0);
#endif
if (va == (void*)-1) {
FARF(ERROR, "mmap failed : va %p fd %u size %u", va, b->fd, (uint32_t) b->size);
abort(); // can't do much else at this point

View File

@@ -30,6 +30,8 @@ extern "C" {
void ggml_print_backtrace(void);
uint64_t ggml_graph_next_uid(void);
#ifndef MIN
# define MIN(a, b) ((a) < (b) ? (a) : (b))
#endif
@@ -338,6 +340,10 @@ struct ggml_cgraph {
struct ggml_hash_set visited_hash_set;
enum ggml_cgraph_eval_order order;
// an optional identifier that can be utilized to recognize same graphs if two non-zero values match
// a value of 0 means it is not set and should be ignored
uint64_t uid;
};
// returns a slice of cgraph with nodes [i0, i1)

View File

@@ -250,6 +250,7 @@ ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_unary(ggml_metal
case GGML_UNARY_OP_CEIL: op_num = OP_UNARY_NUM_CEIL; break;
case GGML_UNARY_OP_ROUND: op_num = OP_UNARY_NUM_ROUND; break;
case GGML_UNARY_OP_TRUNC: op_num = OP_UNARY_NUM_TRUNC; break;
case GGML_UNARY_OP_XIELU: op_num = OP_UNARY_NUM_XIELU; break;
default: GGML_ABORT("fatal error");
} break;
default: GGML_ABORT("fatal error");
@@ -1818,6 +1819,23 @@ ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_upscale(ggml_met
return res;
}
ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_roll(ggml_metal_library_t lib, const ggml_tensor * op) {
assert(op->op == GGML_OP_ROLL);
char base[256];
char name[256];
snprintf(base, 256, "kernel_roll_%s", ggml_type_name(op->src[0]->type));
snprintf(name, 256, "%s", base);
ggml_metal_pipeline_with_params res = ggml_metal_library_get_pipeline(lib, name);
if (!res.pipeline) {
res = ggml_metal_library_compile_pipeline(lib, base, name, nullptr);
}
return res;
}
ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_pad(ggml_metal_library_t lib, const ggml_tensor * op) {
assert(op->op == GGML_OP_PAD);

View File

@@ -152,6 +152,7 @@ struct ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_conv_3d
struct ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_upscale (ggml_metal_library_t lib, const struct ggml_tensor * op);
struct ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_pad (ggml_metal_library_t lib, const struct ggml_tensor * op);
struct ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_pad_reflect_1d (ggml_metal_library_t lib, const struct ggml_tensor * op);
struct ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_roll (ggml_metal_library_t lib, const struct ggml_tensor * op);
struct ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_arange (ggml_metal_library_t lib, const struct ggml_tensor * op);
struct ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_timestep_embedding(ggml_metal_library_t lib, const struct ggml_tensor * op);
struct ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_opt_step_adamw (ggml_metal_library_t lib, const struct ggml_tensor * op);

View File

@@ -1043,6 +1043,7 @@ bool ggml_metal_device_supports_op(ggml_metal_device_t dev, const struct ggml_te
case GGML_UNARY_OP_CEIL:
case GGML_UNARY_OP_ROUND:
case GGML_UNARY_OP_TRUNC:
case GGML_UNARY_OP_XIELU:
return ggml_is_contiguous_rows(op->src[0]) && (op->src[0]->type == GGML_TYPE_F32 || op->src[0]->type == GGML_TYPE_F16);
default:
return false;
@@ -1137,6 +1138,7 @@ bool ggml_metal_device_supports_op(ggml_metal_device_t dev, const struct ggml_te
case GGML_OP_ARGSORT:
case GGML_OP_TOP_K:
case GGML_OP_ARANGE:
case GGML_OP_ROLL:
return true;
case GGML_OP_FLASH_ATTN_EXT:
// for new head sizes, add checks here
@@ -1159,6 +1161,23 @@ bool ggml_metal_device_supports_op(ggml_metal_device_t dev, const struct ggml_te
if (op->src[1]->type != op->src[2]->type) {
return false;
}
switch (op->src[1]->type) {
case GGML_TYPE_F32:
case GGML_TYPE_F16:
case GGML_TYPE_Q8_0:
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
case GGML_TYPE_Q5_0:
case GGML_TYPE_Q5_1:
break;
case GGML_TYPE_BF16:
if (!has_bfloat) {
return false;
}
break;
default:
return false;
}
return has_simdgroup_mm; // TODO: over-restricted for vec-kernels
case GGML_OP_SSM_CONV:
case GGML_OP_SSM_SCAN:

View File

@@ -127,6 +127,7 @@
#define OP_UNARY_NUM_CEIL 118
#define OP_UNARY_NUM_ROUND 119
#define OP_UNARY_NUM_TRUNC 120
#define OP_UNARY_NUM_XIELU 121
#define OP_SUM_ROWS_NUM_SUM_ROWS 10
#define OP_SUM_ROWS_NUM_MEAN 11
@@ -1016,6 +1017,29 @@ typedef struct {
int32_t p1;
} ggml_metal_kargs_pad_reflect_1d;
typedef struct {
int64_t ne00;
int64_t ne01;
int64_t ne02;
int64_t ne03;
uint64_t nb00;
uint64_t nb01;
uint64_t nb02;
uint64_t nb03;
int64_t ne0;
int64_t ne1;
int64_t ne2;
int64_t ne3;
uint64_t nb0;
uint64_t nb1;
uint64_t nb2;
uint64_t nb3;
int32_t s0;
int32_t s1;
int32_t s2;
int32_t s3;
} ggml_metal_kargs_roll;
typedef struct {
uint64_t nb1;
int dim;

View File

@@ -410,6 +410,10 @@ static int ggml_metal_op_encode_impl(ggml_metal_op_t ctx, int idx) {
{
n_fuse = ggml_metal_op_pad_reflect_1d(ctx, idx);
} break;
case GGML_OP_ROLL:
{
n_fuse = ggml_metal_op_roll(ctx, idx);
} break;
case GGML_OP_ARANGE:
{
n_fuse = ggml_metal_op_arange(ctx, idx);
@@ -787,6 +791,13 @@ int ggml_metal_op_unary(ggml_metal_op_t ctx, int idx) {
args.max = ggml_get_op_params_f32(op, 1);
}
if (op->op == GGML_OP_UNARY && ggml_get_unary_op(op) == GGML_UNARY_OP_XIELU) {
args.slope = ggml_get_op_params_f32(op, 1); // alpha_n
args.scale = ggml_get_op_params_f32(op, 2); // alpha_p
args.bias = ggml_get_op_params_f32(op, 3); // beta
args.val = ggml_get_op_params_f32(op, 4); // eps
}
auto pipeline = ggml_metal_library_get_pipeline_unary(lib, op);
if (pipeline.c4) {
@@ -3938,6 +3949,59 @@ int ggml_metal_op_upscale(ggml_metal_op_t ctx, int idx) {
return 1;
}
int ggml_metal_op_roll(ggml_metal_op_t ctx, int idx) {
ggml_tensor * op = ctx->node(idx);
ggml_metal_library_t lib = ctx->lib;
ggml_metal_encoder_t enc = ctx->enc;
GGML_TENSOR_LOCALS( int32_t, ne0, op->src[0], ne);
GGML_TENSOR_LOCALS(uint64_t, nb0, op->src[0], nb);
GGML_TENSOR_LOCALS( int32_t, ne, op, ne);
GGML_TENSOR_LOCALS(uint64_t, nb, op, nb);
const int32_t s0 = ggml_get_op_params_i32(op, 0);
const int32_t s1 = ggml_get_op_params_i32(op, 1);
const int32_t s2 = ggml_get_op_params_i32(op, 2);
const int32_t s3 = ggml_get_op_params_i32(op, 3);
ggml_metal_kargs_roll args = {
/*.ne00 =*/ ne00,
/*.ne01 =*/ ne01,
/*.ne02 =*/ ne02,
/*.ne03 =*/ ne03,
/*.nb00 =*/ nb00,
/*.nb01 =*/ nb01,
/*.nb02 =*/ nb02,
/*.nb03 =*/ nb03,
/*.ne0 =*/ ne0,
/*.ne1 =*/ ne1,
/*.ne2 =*/ ne2,
/*.ne3 =*/ ne3,
/*.nb0 =*/ nb0,
/*.nb1 =*/ nb1,
/*.nb2 =*/ nb2,
/*.nb3 =*/ nb3,
/*.s0 =*/ s0,
/*.s1 =*/ s1,
/*.s2 =*/ s2,
/*.s3 =*/ s3
};
auto pipeline = ggml_metal_library_get_pipeline_roll(lib, op);
const int nth = std::min(1024, ne0);
ggml_metal_encoder_set_pipeline(enc, pipeline);
ggml_metal_encoder_set_bytes (enc, &args, sizeof(args), 0);
ggml_metal_encoder_set_buffer (enc, ggml_metal_get_buffer_id(op->src[0]), 1);
ggml_metal_encoder_set_buffer (enc, ggml_metal_get_buffer_id(op), 2);
ggml_metal_encoder_dispatch_threadgroups(enc, ne1, ne2, ne3, nth, 1, 1);
return 1;
}
int ggml_metal_op_pad(ggml_metal_op_t ctx, int idx) {
ggml_tensor * op = ctx->node(idx);

View File

@@ -81,6 +81,7 @@ int ggml_metal_op_conv_transpose_2d (ggml_metal_op_t ctx, int idx);
int ggml_metal_op_upscale (ggml_metal_op_t ctx, int idx);
int ggml_metal_op_pad (ggml_metal_op_t ctx, int idx);
int ggml_metal_op_pad_reflect_1d (ggml_metal_op_t ctx, int idx);
int ggml_metal_op_roll (ggml_metal_op_t ctx, int idx);
int ggml_metal_op_arange (ggml_metal_op_t ctx, int idx);
int ggml_metal_op_timestep_embedding(ggml_metal_op_t ctx, int idx);
int ggml_metal_op_argmax (ggml_metal_op_t ctx, int idx);

View File

@@ -1177,6 +1177,15 @@ kernel void kernel_unary_impl(
if (FC_OP == OP_UNARY_NUM_TRUNC) {
dst_ptr[i0] = (T) trunc(x);
}
if (FC_OP == OP_UNARY_NUM_XIELU) {
const TC xi = x;
const TC gate = TC(xi > TC(0.0f));
const TC clamped = fmin(xi, TC(args.val));
const TC y_pos = TC(args.scale) * xi * xi + TC(args.bias) * xi;
const TC y_neg = (exp(clamped) - TC(1.0f) - xi) * TC(args.slope) + TC(args.bias) * xi;
dst_ptr[i0] = (T) (gate * y_pos + (TC(1.0f) - gate) * y_neg);
}
}
#undef FC_OP
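For reference, a scalar C rendering of the same piecewise XIELU function, using the parameter mapping noted in the host code (alpha_n -> slope, alpha_p -> scale, beta -> bias, eps -> val); a checking aid, not part of the kernel sources.

#include <math.h>

static float xielu_ref(float x, float alpha_n, float alpha_p, float beta, float eps) {
    if (x > 0.0f) {
        return alpha_p * x * x + beta * x;   // positive branch: quadratic + linear
    }
    const float clamped = fminf(x, eps);     // clamp before exp, as in the kernel
    return (expf(clamped) - 1.0f - x) * alpha_n + beta * x;
}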
@@ -5238,6 +5247,40 @@ kernel void kernel_upscale_bicubic_f32(
}
}
kernel void kernel_roll_f32(
constant ggml_metal_kargs_roll & args,
device const char * src0,
device char * dst,
uint3 tgpig[[threadgroup_position_in_grid]],
uint3 tpitg[[thread_position_in_threadgroup]],
uint3 ntg[[threads_per_threadgroup]]) {
const int64_t i3 = tgpig.z;
const int64_t i2 = tgpig.y;
const int64_t i1 = tgpig.x;
device const float * src0_ptr = (device const float *) src0;
device float * dst_ptr = (device float *) dst;
for (int i0 = tpitg.x; i0 < args.ne0; i0 += ntg.x) {
// apply shifts and wrap around
int64_t i00 = i0 - args.s0;
int64_t i01 = i1 - args.s1;
int64_t i02 = i2 - args.s2;
int64_t i03 = i3 - args.s3;
if (i00 < 0) { i00 += args.ne00; } else if (i00 >= args.ne00) { i00 -= args.ne00; }
if (i01 < 0) { i01 += args.ne01; } else if (i01 >= args.ne01) { i01 -= args.ne01; }
if (i02 < 0) { i02 += args.ne02; } else if (i02 >= args.ne02) { i02 -= args.ne02; }
if (i03 < 0) { i03 += args.ne03; } else if (i03 >= args.ne03) { i03 -= args.ne03; }
int64_t src_idx = i03*args.ne02*args.ne01*args.ne00 + i02*args.ne01*args.ne00 + i01*args.ne00 + i00;
int64_t dst_idx = i3 *args.ne2 *args.ne1 *args.ne0 + i2 *args.ne1 *args.ne0 + i1 *args.ne0 + i0;
dst_ptr[dst_idx] = src0_ptr[src_idx];
}
}
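The same wrap-around indexing as a host-side C reference; it uses a full modulo, so it also covers shifts whose magnitude exceeds the axis size, which the single-wrap test in the kernel assumes has already been reduced.

#include <stdint.h>

static inline int64_t wrap_idx(int64_t i, int64_t n) {
    i %= n;
    return i < 0 ? i + n : i;
}

// dst and src share the same contiguous shape (ne0, ne1, ne2, ne3).
static void roll4d_ref(float *dst, const float *src,
                       int64_t ne0, int64_t ne1, int64_t ne2, int64_t ne3,
                       int s0, int s1, int s2, int s3) {
    for (int64_t i3 = 0; i3 < ne3; ++i3)
    for (int64_t i2 = 0; i2 < ne2; ++i2)
    for (int64_t i1 = 0; i1 < ne1; ++i1)
    for (int64_t i0 = 0; i0 < ne0; ++i0) {
        const int64_t j0 = wrap_idx(i0 - s0, ne0);
        const int64_t j1 = wrap_idx(i1 - s1, ne1);
        const int64_t j2 = wrap_idx(i2 - s2, ne2);
        const int64_t j3 = wrap_idx(i3 - s3, ne3);
        dst[((i3 * ne2 + i2) * ne1 + i1) * ne0 + i0] =
            src[((j3 * ne2 + j2) * ne1 + j1) * ne0 + j0];
    }
}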
kernel void kernel_pad_f32(
constant ggml_metal_kargs_pad & args,
device const char * src0,

View File

@@ -121,6 +121,8 @@ set(GGML_OPENCL_KERNELS
gemm_noshuffle_q4_k_f32
gemv_noshuffle_q6_k_f32
gemm_noshuffle_q6_k_f32
gemv_noshuffle_q5_k_f32
gemm_noshuffle_q5_k_f32
mul
neg
norm

View File

@@ -542,6 +542,8 @@ struct ggml_backend_opencl_context {
cl_kernel kernel_restore_block_q4_K_noshuffle;
cl_kernel kernel_convert_block_q4_K, kernel_restore_block_q4_K;
cl_kernel kernel_convert_block_q5_K, kernel_restore_block_q5_K;
cl_kernel kernel_convert_block_q5_K_noshuffle;
cl_kernel kernel_restore_block_q5_K_noshuffle;
cl_kernel kernel_convert_block_q6_K, kernel_restore_block_q6_K;
cl_kernel kernel_mul_mat_q4_0_f32_1d_8x_flat, kernel_mul_mat_q4_0_f32_1d_16x_flat;
cl_kernel kernel_mul_mv_q4_1_f32;
@@ -730,6 +732,8 @@ struct ggml_backend_opencl_context {
cl_kernel kernel_gemm_noshuffle_q4_k_f32;
cl_kernel kernel_gemv_noshuffle_q6_K_f32;
cl_kernel kernel_gemm_noshuffle_q6_K_f32;
cl_kernel kernel_gemv_noshuffle_q5_k_f32;
cl_kernel kernel_gemm_noshuffle_q5_k_f32;
#endif // GGML_OPENCL_USE_ADRENO_KERNELS
void free() {
@@ -944,6 +948,8 @@ static void load_cl_kernels(ggml_backend_opencl_context *backend_ctx, ggml_cl_ve
CL_CHECK((backend_ctx->kernel_restore_block_q4_K_noshuffle = clCreateKernel(backend_ctx->program_cvt, "kernel_restore_block_q4_K_noshuffle", &err), err));
CL_CHECK((backend_ctx->kernel_convert_block_q5_K = clCreateKernel(backend_ctx->program_cvt, "kernel_convert_block_q5_K", &err), err));
CL_CHECK((backend_ctx->kernel_restore_block_q5_K = clCreateKernel(backend_ctx->program_cvt, "kernel_restore_block_q5_K", &err), err));
CL_CHECK((backend_ctx->kernel_convert_block_q5_K_noshuffle = clCreateKernel(backend_ctx->program_cvt, "kernel_convert_block_q5_K_noshuffle", &err), err));
CL_CHECK((backend_ctx->kernel_restore_block_q5_K_noshuffle = clCreateKernel(backend_ctx->program_cvt, "kernel_restore_block_q5_K_noshuffle", &err), err));
CL_CHECK((backend_ctx->kernel_convert_block_q6_K = clCreateKernel(backend_ctx->program_cvt, "kernel_convert_block_q6_K", &err), err));
CL_CHECK((backend_ctx->kernel_restore_block_q6_K = clCreateKernel(backend_ctx->program_cvt, "kernel_restore_block_q6_K", &err), err));
CL_CHECK((backend_ctx->kernel_convert_block_q6_K_noshuffle = clCreateKernel(backend_ctx->program_cvt, "kernel_convert_block_q6_K_noshuffle", &err), err));
@@ -2794,6 +2800,45 @@ static void load_cl_kernels(ggml_backend_opencl_context *backend_ctx, ggml_cl_ve
CL_CHECK((backend_ctx->kernel_gemm_noshuffle_q6_K_f32 = clCreateKernel(prog, "kernel_gemm_noshuffle_q6_K_f32", &err), err));
GGML_LOG_CONT(".");
}
// gemv_noshuffle_q5_k_f32
{
std::string CL_gemv_compile_opts = std::string("-cl-std=") + opencl_c_std +
" -cl-mad-enable ";
if (backend_ctx->has_vector_subgroup_broadcast) {
CL_gemv_compile_opts += " -DVECTOR_SUB_GROUP_BROADCAST ";
}
#ifdef GGML_OPENCL_EMBED_KERNELS
const std::string kernel_src {
#include "gemv_noshuffle_q5_k_f32.cl.h"
};
#else
const std::string kernel_src = read_file("gemv_noshuffle_q5_k_f32.cl");
#endif
cl_program prog = build_program_from_source(
backend_ctx->context, backend_ctx->device, kernel_src.c_str(), CL_gemv_compile_opts);
CL_CHECK((backend_ctx->kernel_gemv_noshuffle_q5_k_f32 = clCreateKernel(prog, "kernel_gemv_noshuffle_q5_k_f32", &err), err));
CL_CHECK(clReleaseProgram(prog));
GGML_LOG_CONT(".");
}
// gemm_noshuffle_q5_k_f32
{
#ifdef GGML_OPENCL_EMBED_KERNELS
const std::string kernel_src {
#include "gemm_noshuffle_q5_k_f32.cl.h"
};
#else
const std::string kernel_src = read_file("gemm_noshuffle_q5_k_f32.cl");
#endif
cl_program prog = build_program_from_source(backend_ctx->context, backend_ctx->device, kernel_src.c_str(), compile_opts);
CL_CHECK((backend_ctx->kernel_gemm_noshuffle_q5_k_f32 = clCreateKernel(prog, "kernel_gemm_noshuffle_q5_k_f32", &err), err));
CL_CHECK(clReleaseProgram(prog));
GGML_LOG_CONT(".");
}
#endif // GGML_OPENCL_USE_ADRENO_KERNELS
GGML_LOG_CONT("\n");
}
@@ -5071,115 +5116,8 @@ static void ggml_backend_opencl_buffer_set_tensor(ggml_backend_buffer_t buffer,
GGML_ASSERT(tensor->ne[2] == 1);
GGML_ASSERT(tensor->ne[3] == 1);
// Transpose weights
size_t q_size_bytes = K * M / 4 * sizeof(float);
cl_buffer_region region;
region.origin = 0;
region.size = q_size_bytes;
cl_mem qT_d = clCreateSubBuffer(
backend_ctx->prealloc_quant_trans.buffer,
0,
CL_BUFFER_CREATE_TYPE_REGION,
&region,
&err);
CL_CHECK(err);
cl_mem q_d_image1D;
cl_mem qT_d_image1D;
cl_image_format img_fmt_1d;
cl_image_desc img_desc_1d;
img_fmt_1d = { CL_RGBA, CL_FLOAT };
memset(&img_desc_1d, 0, sizeof(img_desc_1d));
img_desc_1d.image_type = CL_MEM_OBJECT_IMAGE1D_BUFFER;
img_desc_1d.image_width = M * K / 4 / 4;
img_desc_1d.buffer = extra->q;
q_d_image1D = clCreateImage(context, 0, &img_fmt_1d, &img_desc_1d, NULL, &err);
CL_CHECK(err);
img_fmt_1d = { CL_RGBA, CL_FLOAT };
memset(&img_desc_1d, 0, sizeof(img_desc_1d));
img_desc_1d.image_type = CL_MEM_OBJECT_IMAGE1D_BUFFER;
img_desc_1d.image_width = M * K / 4 / 4;
img_desc_1d.buffer = qT_d;
qT_d_image1D = clCreateImage(context, 0, &img_fmt_1d, &img_desc_1d, NULL, &err);
CL_CHECK(err);
int height_q = M / 4;
int width_q = K / 4 / 4;
kernel = backend_ctx->kernel_transpose_32;
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &q_d_image1D));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_mem), &qT_d_image1D));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(int), &height_q));
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(int), &width_q));
size_t local_size_q[3] = {4, 16, 1};
size_t global_size_q[3] = {static_cast<size_t>(width_q), static_cast<size_t>(height_q), 1};
CL_CHECK(clEnqueueNDRangeKernel(queue, kernel, 3, NULL, global_size_q, local_size_q, 0, NULL, &evt));
CL_CHECK(clWaitForEvents(1, &evt));
// Transpose scales
size_t d_size_bytes = M * (K / 32) * 2;
region.origin = 0;
region.size = d_size_bytes;
cl_mem dT_d = clCreateSubBuffer(
backend_ctx->prealloc_scales_trans.buffer,
0,
CL_BUFFER_CREATE_TYPE_REGION,
&region,
&err);
CL_CHECK(err);
cl_mem d_d_image1D;
cl_mem dT_d_image1D;
memset(&img_desc_1d, 0, sizeof(img_desc_1d));
img_fmt_1d = { CL_R, CL_HALF_FLOAT };
img_desc_1d.image_width = M * K / 32;
img_desc_1d.image_type = CL_MEM_OBJECT_IMAGE1D_BUFFER;
img_desc_1d.buffer = extra->d;
d_d_image1D = clCreateImage(context, 0, &img_fmt_1d, &img_desc_1d, NULL, &err);
CL_CHECK(err);
img_fmt_1d = { CL_RGBA, CL_HALF_FLOAT };
memset(&img_desc_1d, 0, sizeof(img_desc_1d));
img_desc_1d.image_type = CL_MEM_OBJECT_IMAGE1D_BUFFER;
img_desc_1d.image_width = M * K / 32 / 4;
img_desc_1d.buffer = dT_d;
dT_d_image1D = clCreateImage(context, 0, &img_fmt_1d, &img_desc_1d, NULL, &err);
CL_CHECK(err);
int height_s = M / 4;
int width_s = K / 32;
kernel = backend_ctx->kernel_transpose_16_4x1;
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_d_image1D));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_mem), &dT_d_image1D));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(int), &height_s));
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(int), &width_s));
size_t local_size_s[3] = {4, 16, 1};
size_t global_size_s[3] = {static_cast<size_t>(width_s), static_cast<size_t>(height_s), 1};
CL_CHECK(clEnqueueNDRangeKernel(queue, kernel, 3, NULL, global_size_s, local_size_s, 0, NULL, &evt));
CL_CHECK(clWaitForEvents(1, &evt));
// copy transposed buffer contents to original buffers
CL_CHECK(clEnqueueCopyBuffer(queue, qT_d, extra->q, 0, 0, q_size_bytes, 0, NULL, &evt));
CL_CHECK(clWaitForEvents(1, &evt));
CL_CHECK(clEnqueueCopyBuffer(queue, dT_d, extra->d, 0, 0, d_size_bytes, 0, NULL, &evt));
CL_CHECK(clWaitForEvents(1, &evt));
CL_CHECK(clReleaseMemObject(qT_d));
CL_CHECK(clReleaseMemObject(dT_d));
CL_CHECK(clReleaseMemObject(q_d_image1D));
CL_CHECK(clReleaseMemObject(d_d_image1D));
CL_CHECK(clReleaseMemObject(qT_d_image1D));
CL_CHECK(clReleaseMemObject(dT_d_image1D));
transpose_2d_as_32b(backend_ctx, extra->q, extra->q, size_q, K/4, M);
transpose_2d_as_16b(backend_ctx, extra->d, extra->d, size_d, K/32, M);
} // end transpose
#endif // GGML_OPENCL_USE_ADRENO_KERNELS
@@ -5354,7 +5292,17 @@ static void ggml_backend_opencl_buffer_set_tensor(ggml_backend_buffer_t buffer,
CL_CHECK((extra->qh = clCreateSubBuffer(extra_orig->data_device, CL_MEM_READ_WRITE, CL_BUFFER_CREATE_TYPE_REGION, &region, &err), err));
CL_CHECK(err);
#ifdef GGML_OPENCL_USE_ADRENO_KERNELS
cl_kernel kernel = backend_ctx->kernel_convert_block_q5_K;
if (use_adreno_kernels(backend_ctx, tensor)) {
kernel = backend_ctx->kernel_convert_block_q5_K_noshuffle;
}
#else
cl_kernel kernel = backend_ctx->kernel_convert_block_q5_K;
#endif
cl_uchar mask_0F = 0x0F;
cl_uchar mask_F0 = 0xF0;
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &data_device));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_mem), &extra->q));
@@ -5362,6 +5310,8 @@ static void ggml_backend_opencl_buffer_set_tensor(ggml_backend_buffer_t buffer,
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(cl_mem), &extra->s));
CL_CHECK(clSetKernelArg(kernel, 4, sizeof(cl_mem), &extra->d));
CL_CHECK(clSetKernelArg(kernel, 5, sizeof(cl_mem), &extra->dm));
CL_CHECK(clSetKernelArg(kernel, 6, sizeof(cl_uchar), &mask_0F));
CL_CHECK(clSetKernelArg(kernel, 7, sizeof(cl_uchar), &mask_F0));
size_t global_work_size[] = {(size_t)ggml_nelements(tensor)/ggml_blck_size(tensor->type), 1, 1};
size_t local_work_size[] = {64, 1, 1};
@@ -5378,6 +5328,21 @@ static void ggml_backend_opencl_buffer_set_tensor(ggml_backend_buffer_t buffer,
extra->size_dm = size_dm;
tensor->extra = extra;
#ifdef GGML_OPENCL_USE_ADRENO_KERNELS
if (use_adreno_kernels(backend_ctx, tensor)) {
int M = tensor->ne[1];
int K = tensor->ne[0];
GGML_ASSERT(K % 32 == 0);
// Transpose q, d, dm as ushort, qh as uchar
transpose_2d_as_16b(backend_ctx, extra->q, extra->q, size_q, K/4, M);
transpose_2d_as_8b (backend_ctx, extra->qh, extra->qh, size_qh, K/8, M);
transpose_2d_as_16b(backend_ctx, extra->d, extra->d, size_d, K/256, M);
transpose_2d_as_16b(backend_ctx, extra->dm, extra->dm, size_dm, K/256, M);
}
#endif // GGML_OPENCL_USE_ADRENO_KERNELS
return;
}
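The transpose widths above follow from the Q5_K row layout (assuming K is a multiple of the 256-wide superblock); a small sketch with assumed helper naming:

// Per-row element counts behind the transpose calls above (Q5_K, superblock 256).
static void q5k_row_elems(int K, int *q_u16, int *qh_u8, int *d_f16, int *dm_f16) {
    *q_u16  = K / 4;   // 4-bit quants: K/2 bytes, moved as 16-bit elements
    *qh_u8  = K / 8;   // one high bit per weight: K/8 bytes, moved as bytes
    *d_f16  = K / 256; // one fp16 scale per superblock
    *dm_f16 = K / 256; // one fp16 min per superblock
}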
if (tensor->type == GGML_TYPE_Q6_K) {
@@ -5894,6 +5859,57 @@ static void ggml_backend_opencl_buffer_get_tensor(ggml_backend_buffer_t buffer,
ggml_nbytes(tensor), NULL, &err);
CL_CHECK(err);
cl_uchar mask_0F = 0x0F;
cl_uchar mask_F0 = 0xF0;
#ifdef GGML_OPENCL_USE_ADRENO_KERNELS
if (use_adreno_kernels(backend_ctx, tensor)) {
int M = tensor->ne[1];
int K = tensor->ne[0];
size_t size_q = extra->size_q;
size_t size_qh = extra->size_qh;
size_t size_d = extra->size_d;
size_t size_dm = extra->size_dm;
static ggml_cl_buffer buf_trans_q;
static ggml_cl_buffer buf_trans_qh;
static ggml_cl_buffer buf_trans_d;
static ggml_cl_buffer buf_trans_dm;
buf_trans_q.allocate(backend_ctx->context, size_q);
buf_trans_qh.allocate(backend_ctx->context, size_qh);
buf_trans_d.allocate(backend_ctx->context, size_d);
buf_trans_dm.allocate(backend_ctx->context, size_dm);
// Reverse transpose q, qh, d, dm
transpose_2d_as_16b(backend_ctx, extra->q, buf_trans_q.buffer, size_q, M, K/4);
transpose_2d_as_8b (backend_ctx, extra->qh, buf_trans_qh.buffer, size_qh, M, K/8);
transpose_2d_as_16b(backend_ctx, extra->d, buf_trans_d.buffer, size_d, M, K/256);
transpose_2d_as_16b(backend_ctx, extra->dm, buf_trans_dm.buffer, size_dm, M, K/256);
cl_kernel kernel = backend_ctx->kernel_restore_block_q5_K_noshuffle;
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf_trans_q.buffer));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf_trans_qh.buffer));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(cl_mem), &extra->s));
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(cl_mem), &buf_trans_d.buffer));
CL_CHECK(clSetKernelArg(kernel, 4, sizeof(cl_mem), &buf_trans_dm.buffer));
CL_CHECK(clSetKernelArg(kernel, 5, sizeof(cl_mem), &data_device));
CL_CHECK(clSetKernelArg(kernel, 6, sizeof(cl_uchar), &mask_0F));
CL_CHECK(clSetKernelArg(kernel, 7, sizeof(cl_uchar), &mask_F0));
size_t global_work_size[] = {(size_t)ggml_nelements(tensor)/ggml_blck_size(tensor->type), 1, 1};
size_t local_work_size[] = {1, 1, 1};
CL_CHECK(clEnqueueNDRangeKernel(queue, kernel, 3, NULL,
global_work_size, local_work_size, 0, NULL, NULL));
CL_CHECK(clEnqueueReadBuffer(queue, data_device, CL_TRUE, offset,
size, data, 0, NULL, NULL));
CL_CHECK(clReleaseMemObject(data_device));
return;
}
#endif // GGML_OPENCL_USE_ADRENO_KERNELS
cl_kernel kernel = backend_ctx->kernel_restore_block_q5_K;
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &extra->q));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_mem), &extra->qh));
@@ -5901,6 +5917,8 @@ static void ggml_backend_opencl_buffer_get_tensor(ggml_backend_buffer_t buffer,
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(cl_mem), &extra->d));
CL_CHECK(clSetKernelArg(kernel, 4, sizeof(cl_mem), &extra->dm));
CL_CHECK(clSetKernelArg(kernel, 5, sizeof(cl_mem), &data_device));
CL_CHECK(clSetKernelArg(kernel, 6, sizeof(cl_uchar), &mask_0F));
CL_CHECK(clSetKernelArg(kernel, 7, sizeof(cl_uchar), &mask_F0));
size_t global_work_size[] = {(size_t)ggml_nelements(tensor)/ggml_blck_size(tensor->type), 1, 1};
size_t local_work_size[] = {1, 1, 1};
@@ -9831,19 +9849,18 @@ static void ggml_cl_mul_mat_q8_0_f32_adreno(ggml_backend_t backend, const ggml_t
GGML_ASSERT(dst);
GGML_ASSERT(dst->extra);
GGML_ASSERT(src0->type == GGML_TYPE_Q8_0);
GGML_ASSERT(src1->type == GGML_TYPE_F32);
ggml_backend_opencl_context *backend_ctx = (ggml_backend_opencl_context *)backend->context;
ggml_tensor_extra_cl * extra1 = (ggml_tensor_extra_cl *)src1->extra;
ggml_tensor_extra_cl * extrad = (ggml_tensor_extra_cl *)dst->extra;
ggml_tensor_extra_cl_q8_0 * extra0_q8_0 = (ggml_tensor_extra_cl_q8_0 *)src0->extra;
cl_ulong offset1 = extra1->offset + src1->view_offs;
cl_ulong offsetd = extrad->offset + dst->view_offs;
GGML_ASSERT(src1->view_offs == 0);
GGML_ASSERT(dst->view_offs == 0);
@@ -9864,148 +9881,112 @@ static void ggml_cl_mul_mat_q8_0_f32_adreno(ggml_backend_t backend, const ggml_t
cl_context context = backend_ctx->context;
cl_kernel kernel;
// init CL objects
cl_int err;
cl_image_format img_fmt;
cl_image_desc img_desc;
cl_buffer_region region;
int M = ne01;
int N = ne1;
int K = ne00;
if (ne1 == 1) {
cl_mem q_img = nullptr;
cl_mem b_sub_buf = nullptr;
cl_mem b_img = nullptr;
// image for q
img_fmt = { CL_R, CL_UNSIGNED_INT32};
memset(&img_desc, 0, sizeof(img_desc));
img_desc.image_type = CL_MEM_OBJECT_IMAGE1D_BUFFER;
img_desc.image_width = M * K / 4;
img_desc.buffer = extra0_q8_0->q;
CL_CHECK((q_img = clCreateImage(context, CL_MEM_READ_ONLY, &img_fmt, &img_desc, NULL, &err), err));
// create a sub_buffer for B
region.origin = offset1;
region.size = K * N * sizeof(float);
CL_CHECK((b_sub_buf = clCreateSubBuffer((extra1->data_device), 0, CL_BUFFER_CREATE_TYPE_REGION, &region, &err), err));
// image for activations
img_fmt = {CL_RGBA, CL_FLOAT};
memset(&img_desc, 0, sizeof(img_desc));
img_desc.image_type = CL_MEM_OBJECT_IMAGE1D_BUFFER;
img_desc.image_width = K * N / 4;
img_desc.buffer = b_sub_buf;
CL_CHECK((b_img = clCreateImage(context, CL_MEM_READ_ONLY, &img_fmt, &img_desc, NULL, &err), err));
kernel = backend_ctx->CL_mul_mat_vec_q8_0_f32;
int r2 = 1;
int r3 = 1;
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &q_img));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_mem), &extra0_q8_0->d));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(cl_mem), &b_img));
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(cl_ulong), &extra1->offset));
CL_CHECK(clSetKernelArg(kernel, 4, sizeof(cl_mem), &extrad->data_device));
CL_CHECK(clSetKernelArg(kernel, 5, sizeof(cl_ulong), &extrad->offset));
CL_CHECK(clSetKernelArg(kernel, 6, sizeof(int), &ne00));
CL_CHECK(clSetKernelArg(kernel, 7, sizeof(int), &ne01));
CL_CHECK(clSetKernelArg(kernel, 8, sizeof(int), &ne02));
CL_CHECK(clSetKernelArg(kernel, 9, sizeof(int), &ne10));
CL_CHECK(clSetKernelArg(kernel, 10, sizeof(int), &ne12));
CL_CHECK(clSetKernelArg(kernel, 11, sizeof(int), &ne0));
CL_CHECK(clSetKernelArg(kernel, 12, sizeof(int), &ne1));
CL_CHECK(clSetKernelArg(kernel, 13, sizeof(int), &r2));
CL_CHECK(clSetKernelArg(kernel, 14, sizeof(int), &r3));
size_t wavesize = backend_ctx->adreno_wave_size;
size_t local_work_size[] = { wavesize, 4, 1 };
size_t global_work_size[] = { CEIL_DIV(M, wavesize)*wavesize, 4, 1 };
backend_ctx->enqueue_ndrange_kernel(kernel, 3, global_work_size, local_work_size, dst);
CL_CHECK(clReleaseMemObject(q_img));
CL_CHECK(clReleaseMemObject(b_img));
CL_CHECK(clReleaseMemObject(b_sub_buf));
} else {
cl_ulong offsetd = extrad->offset + dst->view_offs;
cl_mem b_sub_buf = nullptr;
cl_mem b_sub_buf_trans = nullptr;
cl_mem b_img = nullptr;
cl_mem b_img_trans = nullptr;
// subbuffer for activations
region.origin = offset1;
region.size = K * N * sizeof(float);
CL_CHECK((b_sub_buf = clCreateSubBuffer(extra1->data_device, 0, CL_BUFFER_CREATE_TYPE_REGION, &region, &err), err));
// image for activations
img_fmt = {CL_RGBA, CL_FLOAT};
memset(&img_desc, 0, sizeof(img_desc));
img_desc.image_type = CL_MEM_OBJECT_IMAGE1D_BUFFER;
img_desc.image_width = K * N / 4;
img_desc.buffer = b_sub_buf;
CL_CHECK((b_img = clCreateImage(context, CL_MEM_READ_ONLY, &img_fmt, &img_desc, NULL, &err), err));
// pad N to multiple of 8
int extra_elements = N % 8;
int padding = 0;
if (extra_elements > 0){
padding = 8 - extra_elements;
}
// subbuffer for transposed activations
region.origin = 0;
region.size = K * (N + padding) * sizeof(float)/2;
backend_ctx->prealloc_act_trans.allocate(context, region.size);
CL_CHECK((b_sub_buf_trans = clCreateSubBuffer(backend_ctx->prealloc_act_trans.buffer, 0, CL_BUFFER_CREATE_TYPE_REGION, &region, &err), err));
// image for transposed activations
img_fmt = {CL_RGBA, CL_HALF_FLOAT};
memset(&img_desc, 0, sizeof(img_desc));
img_desc.image_type = CL_MEM_OBJECT_IMAGE1D_BUFFER;
img_desc.image_width = K * (N + padding) / 4;
img_desc.buffer = b_sub_buf_trans;
CL_CHECK((b_img_trans = clCreateImage(context, 0, &img_fmt, &img_desc, NULL, &err), err));
// transpose activations
int height_B = N/4;
if (height_B == 0) {
height_B = 1;
@@ -10014,58 +9995,39 @@ static void ggml_cl_mul_mat_q8_0_f32_adreno(ggml_backend_t backend, const ggml_t
int padded_height_B = (N + padding)/4;
kernel = backend_ctx->kernel_transpose_32_16;
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &b_img));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_mem), &b_img_trans));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(int), &height_B));
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(int), &width_B));
CL_CHECK(clSetKernelArg(kernel, 4, sizeof(int), &padded_height_B));
size_t local_work_size_t[2] = { 1, 16 };
size_t global_work_size_t[2] = { (size_t)width_B, (size_t)padded_height_B };
backend_ctx->enqueue_ndrange_kernel(kernel, 2, global_work_size_t, local_work_size_t, dst);
// gemm
kernel = backend_ctx->kernel_mul_mm_q8_0_f32_8x4;
int padded_N = N + padding;
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &extra0_q8_0->q));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_mem), &extra0_q8_0->d));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(cl_mem), &B_image1d_trans));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(cl_mem), &b_img_trans));
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(cl_mem), &extrad->data_device));
CL_CHECK(clSetKernelArg(kernel, 4, sizeof(int), &K));
CL_CHECK(clSetKernelArg(kernel, 5, sizeof(int), &M));
CL_CHECK(clSetKernelArg(kernel, 6, sizeof(int), &padded_N));
CL_CHECK(clSetKernelArg(kernel, 7, sizeof(int), &N));
CL_CHECK(clSetKernelArg(kernel, 8, sizeof(cl_ulong), &offsetd));
size_t global_work_size[] = { (size_t)CEIL_DIV(N, 8), (size_t)CEIL_DIV(M, 4), 1 };
size_t local_work_size[] = { 2, 128, 1 };
backend_ctx->enqueue_ndrange_kernel(kernel, 3, global_work_size, local_work_size, dst);
// deallocate sub buffers and images
CL_CHECK(clReleaseMemObject(b_img_trans));
CL_CHECK(clReleaseMemObject(b_sub_buf_trans));
CL_CHECK(clReleaseMemObject(b_img));
CL_CHECK(clReleaseMemObject(b_sub_buf));
}
#else
GGML_UNUSED(backend);
@@ -10451,6 +10413,201 @@ static void ggml_cl_mul_mat_q6_K_f32_adreno(ggml_backend_t backend, const ggml_t
#endif
}
static void ggml_cl_mul_mat_q5_K_f32_adreno(ggml_backend_t backend, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
#ifdef GGML_OPENCL_USE_ADRENO_KERNELS
GGML_ASSERT(src0);
GGML_ASSERT(src0->extra);
GGML_ASSERT(src1);
GGML_ASSERT(src1->extra);
GGML_ASSERT(dst);
GGML_ASSERT(dst->extra);
ggml_backend_opencl_context *backend_ctx = (ggml_backend_opencl_context *)backend->context;
ggml_tensor_extra_cl * extra1 = (ggml_tensor_extra_cl *)src1->extra;
ggml_tensor_extra_cl * extrad = (ggml_tensor_extra_cl *)dst->extra;
ggml_tensor_extra_cl_q5_K * extra0_q5_k = (ggml_tensor_extra_cl_q5_K *)src0->extra;
cl_ulong offset1 = extra1->offset + src1->view_offs;
cl_ulong offsetd = extrad->offset + dst->view_offs;
const int ne00 = src0->ne[0];
const int ne01 = src0->ne[1];
const int ne1 = dst->ne[1];
GGML_ASSERT(ne00 % ggml_blck_size(src0->type) == 0);
cl_context context = backend_ctx->context;
cl_kernel kernel;
cl_int err;
cl_image_format img_fmt;
cl_image_desc img_desc;
cl_buffer_region region;
int M = ne01;
int N = ne1;
int K = ne00;
cl_uchar mask_d6 = 0x3F;
cl_uchar mask_d4 = 0x0F;
cl_uchar mask_hi2 = 0xC0;
if (ne1 == 1) {
cl_mem q_img = nullptr;
cl_mem qh_img = nullptr;
cl_mem b_sub_buf = nullptr;
cl_mem b_img = nullptr;
// image for q (CL_R, CL_UNSIGNED_INT32): width = M*K/2/4
img_fmt = {CL_R, CL_UNSIGNED_INT32};
memset(&img_desc, 0, sizeof(img_desc));
img_desc.image_type = CL_MEM_OBJECT_IMAGE1D_BUFFER;
img_desc.image_width = M * K / 2 / 4;
img_desc.buffer = extra0_q5_k->q;
CL_CHECK((q_img = clCreateImage(context, CL_MEM_READ_ONLY, &img_fmt, &img_desc, NULL, &err), err));
// image for qh (CL_R, CL_HALF_FLOAT): width = M*K/16
img_fmt = {CL_R, CL_HALF_FLOAT};
memset(&img_desc, 0, sizeof(img_desc));
img_desc.image_type = CL_MEM_OBJECT_IMAGE1D_BUFFER;
img_desc.image_width = M * K / 16;
img_desc.buffer = extra0_q5_k->qh;
CL_CHECK((qh_img = clCreateImage(context, CL_MEM_READ_ONLY, &img_fmt, &img_desc, NULL, &err), err));
// subbuffer for activations
region.origin = offset1;
region.size = K * N * sizeof(float);
CL_CHECK((b_sub_buf = clCreateSubBuffer(extra1->data_device, 0, CL_BUFFER_CREATE_TYPE_REGION, &region, &err), err));
// image for activations (CL_RGBA, CL_FLOAT): width = K*N/4
img_fmt = {CL_RGBA, CL_FLOAT};
memset(&img_desc, 0, sizeof(img_desc));
img_desc.image_type = CL_MEM_OBJECT_IMAGE1D_BUFFER;
img_desc.image_width = K * N / 4;
img_desc.buffer = b_sub_buf;
CL_CHECK((b_img = clCreateImage(context, CL_MEM_READ_ONLY, &img_fmt, &img_desc, NULL, &err), err));
kernel = backend_ctx->kernel_gemv_noshuffle_q5_k_f32;
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &q_img));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_mem), &qh_img));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(cl_mem), &extra0_q5_k->d));
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(cl_mem), &extra0_q5_k->dm));
CL_CHECK(clSetKernelArg(kernel, 4, sizeof(cl_mem), &extra0_q5_k->s));
CL_CHECK(clSetKernelArg(kernel, 5, sizeof(cl_mem), &b_img));
CL_CHECK(clSetKernelArg(kernel, 6, sizeof(cl_mem), &extrad->data_device));
CL_CHECK(clSetKernelArg(kernel, 7, sizeof(cl_ulong), &offsetd));
CL_CHECK(clSetKernelArg(kernel, 8, sizeof(cl_int), &ne00));
CL_CHECK(clSetKernelArg(kernel, 9, sizeof(cl_int), &ne01));
CL_CHECK(clSetKernelArg(kernel, 10, sizeof(cl_uchar), &mask_d6));
CL_CHECK(clSetKernelArg(kernel, 11, sizeof(cl_uchar), &mask_d4));
CL_CHECK(clSetKernelArg(kernel, 12, sizeof(cl_uchar), &mask_hi2));
size_t local_work_size[3] = {64, 4, 1};
size_t global_work_size[3] = {(size_t)CEIL_DIV(ne01/2, 64)*64, 4, 1};
backend_ctx->enqueue_ndrange_kernel(kernel, 3, global_work_size, local_work_size, dst);
CL_CHECK(clReleaseMemObject(q_img));
CL_CHECK(clReleaseMemObject(qh_img));
CL_CHECK(clReleaseMemObject(b_sub_buf));
CL_CHECK(clReleaseMemObject(b_img));
} else {
cl_mem b_sub_buf = nullptr;
cl_mem b_sub_buf_trans = nullptr;
cl_mem b_img = nullptr;
cl_mem b_img_trans = nullptr;
// subbuffer for activations
region.origin = offset1;
region.size = K * N * sizeof(float);
CL_CHECK((b_sub_buf = clCreateSubBuffer(extra1->data_device, 0, CL_BUFFER_CREATE_TYPE_REGION, &region, &err), err));
// image for activations
img_fmt = {CL_RGBA, CL_FLOAT};
memset(&img_desc, 0, sizeof(img_desc));
img_desc.image_type = CL_MEM_OBJECT_IMAGE1D_BUFFER;
img_desc.image_width = K * N / 4;
img_desc.buffer = b_sub_buf;
CL_CHECK((b_img = clCreateImage(context, CL_MEM_READ_ONLY, &img_fmt, &img_desc, NULL, &err), err));
// pad N to multiple of 8
int extra_elements = N % 8;
int padding = 0;
if (extra_elements > 0) {
padding = 8 - extra_elements;
}
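// (the gemm NDRange below advances 8 output columns per work-item, hence the multiple-of-8 padding)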
// subbuffer for transposed activations
region.origin = 0;
region.size = K * (N + padding) * sizeof(float) / 2;
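// fp16 elements, so half the size of the fp32 source region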
backend_ctx->prealloc_act_trans.allocate(context, region.size);
CL_CHECK((b_sub_buf_trans = clCreateSubBuffer(backend_ctx->prealloc_act_trans.buffer, 0, CL_BUFFER_CREATE_TYPE_REGION, &region, &err), err));
// image for transposed activations
img_fmt = {CL_RGBA, CL_HALF_FLOAT};
memset(&img_desc, 0, sizeof(img_desc));
img_desc.image_type = CL_MEM_OBJECT_IMAGE1D_BUFFER;
img_desc.image_width = K * (N + padding) / 4;
img_desc.buffer = b_sub_buf_trans;
CL_CHECK((b_img_trans = clCreateImage(context, 0, &img_fmt, &img_desc, NULL, &err), err));
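// flags = 0 (read-write): written by the transpose kernel below, then read by the gemm kernel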
// transpose activations
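// image texels pack 4 elements (RGBA), so the kernel dimensions are in units of 4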
int height_B = N / 4;
if (height_B == 0) height_B = 1;
int width_B = K / 4;
int padded_height_B = (N + padding) / 4;
kernel = backend_ctx->kernel_transpose_32_16;
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &b_img));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_mem), &b_img_trans));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(int), &height_B));
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(int), &width_B));
CL_CHECK(clSetKernelArg(kernel, 4, sizeof(int), &padded_height_B));
size_t local_work_size_t[2] = {1, 16};
size_t global_work_size_t[2] = {(size_t)width_B, (size_t)padded_height_B};
backend_ctx->enqueue_ndrange_kernel(kernel, 2, global_work_size_t, local_work_size_t, dst);
// gemm
kernel = backend_ctx->kernel_gemm_noshuffle_q5_k_f32;
int padded_N = N + padding;
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &extra0_q5_k->q));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_mem), &extra0_q5_k->qh));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(cl_mem), &extra0_q5_k->s));
CL_CHECK(clSetKernelArg(kernel, 3, sizeof(cl_mem), &extra0_q5_k->d));
CL_CHECK(clSetKernelArg(kernel, 4, sizeof(cl_mem), &extra0_q5_k->dm));
CL_CHECK(clSetKernelArg(kernel, 5, sizeof(cl_mem), &b_img_trans));
CL_CHECK(clSetKernelArg(kernel, 6, sizeof(cl_mem), &extrad->data_device));
CL_CHECK(clSetKernelArg(kernel, 7, sizeof(cl_ulong), &offsetd));
CL_CHECK(clSetKernelArg(kernel, 8, sizeof(cl_int), &ne01));
CL_CHECK(clSetKernelArg(kernel, 9, sizeof(cl_int), &padded_N));
CL_CHECK(clSetKernelArg(kernel, 10, sizeof(cl_int), &ne00));
CL_CHECK(clSetKernelArg(kernel, 11, sizeof(cl_int), &ne1));
CL_CHECK(clSetKernelArg(kernel, 12, sizeof(cl_uchar), &mask_d6));
CL_CHECK(clSetKernelArg(kernel, 13, sizeof(cl_uchar), &mask_d4));
CL_CHECK(clSetKernelArg(kernel, 14, sizeof(cl_uchar), &mask_hi2));
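// output tiling: one work-item per 8 columns x 4 rows, hence the CEIL_DIVs below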
size_t global_work_size[3] = {(size_t)CEIL_DIV(ne1, 8), (size_t)CEIL_DIV(ne01, 4), 1};
size_t local_work_size[3] = {1, 128, 1};
backend_ctx->enqueue_ndrange_kernel(kernel, 3, global_work_size, local_work_size, dst);
CL_CHECK(clReleaseMemObject(b_sub_buf));
CL_CHECK(clReleaseMemObject(b_sub_buf_trans));
CL_CHECK(clReleaseMemObject(b_img));
CL_CHECK(clReleaseMemObject(b_img_trans));
}
#else
GGML_UNUSED(backend);
GGML_UNUSED(src0);
GGML_UNUSED(src1);
GGML_UNUSED(dst);
#endif
}
static void ggml_cl_mul_mat(ggml_backend_t backend, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
GGML_ASSERT(src0);
GGML_ASSERT(src0->extra);
@@ -10600,6 +10757,12 @@ static void ggml_cl_mul_mat(ggml_backend_t backend, const ggml_tensor * src0, co
return;
}
// q5_K x fp32
if (src0t == GGML_TYPE_Q5_K && src1t == GGML_TYPE_F32) {
ggml_cl_mul_mat_q5_K_f32_adreno(backend, src0, src1, dst);
return;
}
// q4_0 x fp32
if (src0t == GGML_TYPE_Q4_0 && src1t == GGML_TYPE_F32) {
// TODO: remove duplicate definitions of image description + format -- move to top
