remove clblast and update vulkan and avx

Eve
2025-03-03 02:27:14 +00:00
parent 93aac9ac03
commit fe1b31c723

@@ -1,10 +1,10 @@
-| | **CPU (AVX2)** | **CPU (ARM NEON)** | **Metal** | **CUDA** | **ROCm** | **SYCL** | **CLBlast** | **Vulkan** | **Kompute** |
-|:-----------------------:|:--------------:|:------------------:|:---------:|:----------:|:----------------:|:--------:|:-----------:|:----------:|:-----------:|
-| **K-quants** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ 🐢⁵ | ✅ 🐢⁵ | 🚫 |
-| **I-quants** | ✅ 🐢⁴ | ✅ 🐢⁴ | ✅ 🐢⁴ | ✅ | ✅ | Partial¹ | 🚫 | 🚫 | 🚫 |
-| **Parallel Multi-GPU⁶** | N/A | N/A | N/A | ✅ | ✅ | 🚫 | ❓ | ❓ | ❓ |
-| **K cache quants** | ✅ | ❓ | ✅ | ✅ | ✅ | ❓ | ✅ | 🚫 | 🚫 |
-| **MoE architecture** | ✅ | ❓ | ✅ | ✅ | ✅ | ❓ | Partial² | 🚫 | 🚫 |
+| | **CPU (AVX/AVX2)** | **CPU (ARM NEON)** | **Metal** | **CUDA** | **ROCm** | **SYCL** | **Vulkan** | **Kompute** |
+|:-----------------------:|:--------------:|:------------------:|:---------:|:----------:|:----------------:|:----------:|:----------:|:-----------:|
+| **K-quants** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ 🐢⁵ | 🚫 |
+| **I-quants** | ✅ 🐢⁴ | ✅ 🐢⁴ | ✅ 🐢⁴ | ✅ | ✅ | Partial¹ | ✅ 🐢⁴ | 🚫 |
+| **Parallel Multi-GPU⁶** | N/A | N/A | N/A | ✅ | ✅ | Sequential only | Sequential only | ❓ |
+| **K cache quants** | ✅ | ❓ | ✅ | ✅ | ✅ | ❓ | | 🚫 |
+| **MoE architecture** | ✅ | ❓ | ✅ | ✅ | ✅ | ❓ | | 🚫 |
* ✅: feature works
* 🚫: feature does not work
@@ -14,6 +14,6 @@
* ²: Only with `-ngl 0`
* ³: Inference is 50% slower
* ⁴: Slower than K-quants of comparable size
-* ⁵: Slower than cuBLAS/rocBLAS on similar cards
-* ⁶: By default, all backends can utilize multiple devices by running them sequentially. The CUDA code (which is also used for ROCm via HIP) also has code for running GPUs in parallel via `--split-mode row`. However, this is optimized relatively poorly and is only faster if the interconnect speed is fast vs. the speed of a single GPU.
+* ⁵: Generally the CUDA or ROCm backends are faster, though there are cases where Vulkan has faster text generation. See #10879 for benchmarks.
+* ⁶: By default, all GPU backends can utilize multiple devices by running them sequentially. The CUDA code (which is also used for ROCm via HIP) also has code for running GPUs in parallel via `--split-mode row`. However, this is optimized relatively poorly and is only faster if the interconnect speed is fast vs. the speed of a single GPU.
+* ⁷: Only q8_0 and iq4_nl