Updated Feature matrix (markdown)

Johannes Gäßler
2025-02-17 14:57:56 +01:00
parent 3180009927
commit ed24c4179b

@@ -1,18 +1,19 @@
-| | **CPU (AVX2)** | **CPU (ARM NEON)** | **Metal** | **cuBLAS** | **rocBLAS** | **SYCL** | **CLBlast** | **Vulkan** | **Kompute** |
-|:--------------------:|:--------------:|:------------------:|:---------:|:----------:|:----------------:|:--------:|:-----------:|:----------:|:-----------:|
-| **K-quants** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ 🐢⁵ | ✅ 🐢⁵ | 🚫 |
-| **I-quants** | ✅ 🐢⁴ | ✅ 🐢⁴ | ✅ 🐢⁴ | ✅ | ✅ | Partial¹ | 🚫 | 🚫 | 🚫 |
-| **Multi-GPU** | N/A | N/A | N/A | ✅ | | 🚫 | ❓ | | ❓ |
-| **K cache quants** | ✅ | ❓ | ✅ | ✅ 🐢³ | Partial⁶ 🐢³ | ❓ | ✅ | 🚫 | 🚫 |
-| **MoE architecture** | ✅ | ❓ | ✅ | ✅ | ✅ | ❓ | Partial² | 🚫 | 🚫 |
+| | **CPU (AVX2)** | **CPU (ARM NEON)** | **Metal** | **CUDA** | **ROCm** | **SYCL** | **CLBlast** | **Vulkan** | **Kompute** |
+|:-----------------------:|:--------------:|:------------------:|:---------:|:----------:|:----------------:|:--------:|:-----------:|:----------:|:-----------:|
+| **K-quants** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ 🐢⁵ | ✅ 🐢⁵ | 🚫 |
+| **I-quants** | ✅ 🐢⁴ | ✅ 🐢⁴ | ✅ 🐢⁴ | ✅ | ✅ | Partial¹ | 🚫 | 🚫 | 🚫 |
+| **Parallel Multi-GPU** | N/A | N/A | N/A | ✅ | | 🚫 | ❓ | | ❓ |
+| **K cache quants** | ✅ | ❓ | ✅ | ✅ 🐢³ | Partial⁶ 🐢³ | ❓ | ✅ | 🚫 | 🚫 |
+| **MoE architecture** | ✅ | ❓ | ✅ | ✅ | ✅ | ❓ | Partial² | 🚫 | 🚫 |
 * ✅: feature works
 * 🚫: feature does not work
-* ❓: unknown, please contribute if you can test it youself
+* ❓: unknown, please contribute if you can test it yourself
 * 🐢: feature is slow
 * ¹: IQ3_S and IQ1_S, see #5886
 * ²: Only with `-ngl 0`
 * ³: Inference is 50% slower
 * ⁴: Slower than K-quants of comparable size
 * ⁵: Slower than cuBLAS/rocBLAS on similar cards
+* ⁷: By default, all backends can utilize multiple devices by running them sequentially. The CUDA code (which is also used for ROCm via HIP) additionally supports running GPUs in parallel via `--split-mode row`. However, this path is relatively poorly optimized and is only faster when the interconnect is fast relative to the compute speed of a single GPU.
 * ⁶: Only q8_0 and iq4_nl
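The flags mentioned in the notes above are llama.cpp command-line options. A hypothetical invocation combining them might look like the sketch below; the binary name, model path, layer count, and prompt are placeholders, not values from this page:

```shell
# Illustrative sketch only; model.gguf and the layer count are placeholders.
# -ngl: number of layers to offload to the GPU (note ²: CLBlast MoE works only with -ngl 0)
# --split-mode row: run GPUs in parallel on CUDA/ROCm (faster only with a fast interconnect)
# -ctk: quantize the K cache, e.g. q8_0 (see the "K cache quants" row)
./llama-cli -m model.gguf -ngl 99 --split-mode row -ctk q8_0 -p "Hello"
```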