mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-03-17 16:44:07 +00:00

Georgi Gerganov 9b342d0a9f benches : add Nemotron 3 Nano on DGX Spark (#20652)

[no ci]

2026-03-16 21:50:43 +02:00

NVIDIA DGX Spark

System info

uname --all
Linux spark-17ed 6.11.0-1016-nvidia #16-Ubuntu SMP PREEMPT_DYNAMIC Sun Sep 21 16:52:46 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux

g++ --version
g++ (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0

nvidia-smi
Fri Mar  6 11:39:45 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   52C    P0             13W /  N/A  | Not Supported          |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

ggml-org/Nemotron-3-Super-120B-GGUF

Model: https://huggingface.co/ggml-org/Nemotron-3-Super-120B-GGUF

llama-batched-bench

main: n_kv_max = 303104, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = 99, n_threads = 20, n_threads_batch = 20

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
512	32	1	544	1.094	468.05	1.621	19.74	2.715	200.37
512	32	2	1088	1.463	700.16	2.437	26.26	3.900	279.01
512	32	4	2176	2.647	773.76	4.043	31.66	6.689	325.29
512	32	8	4352	5.291	774.14	6.151	41.62	11.442	380.37
512	32	16	8704	10.603	772.62	10.385	49.30	20.987	414.72
512	32	32	17408	21.231	771.69	18.235	56.16	39.466	441.09
4096	32	1	4128	5.340	767.05	1.616	19.81	6.956	593.47
4096	32	2	8256	10.673	767.55	2.454	26.08	13.127	628.94
4096	32	4	16512	21.348	767.46	4.072	31.44	25.420	649.57
4096	32	8	33024	42.714	767.15	6.277	40.78	48.991	674.08
4096	32	16	66048	85.385	767.54	10.596	48.32	95.981	688.14
4096	32	32	132096	170.819	767.32	18.619	55.00	189.437	697.31
8192	32	1	8224	10.690	766.32	1.619	19.76	12.310	668.10
8192	32	2	16448	21.382	766.24	2.467	25.94	23.850	689.65
8192	32	4	32896	42.782	765.92	4.098	31.23	46.881	701.69
8192	32	8	65792	85.582	765.77	6.368	40.20	91.951	715.52
8192	32	16	131584	171.066	766.21	10.774	47.52	181.840	723.62
8192	32	32	263168	342.140	766.19	18.969	53.98	361.109	728.78
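
The throughput columns in the table above appear to follow directly from the measured times. The sketch below recomputes them for one row, assuming (based on `is_pp_shared = 0` and on the numbers themselves) that each of the `B` sequences carries its own `PP`-token prompt, so that `N_KV = B * (PP + TG)`. The function name `derived_columns` is hypothetical and is not part of llama.cpp.

```python
def derived_columns(pp, tg, b, t_pp, t_tg):
    """Recompute the derived llama-batched-bench columns from one row.

    Assumes the prompt is NOT shared between sequences (is_pp_shared = 0),
    so every sequence contributes pp + tg tokens to the KV cache.
    """
    n_kv = b * (pp + tg)          # total tokens held in the KV cache
    t_total = t_pp + t_tg         # total wall time in seconds
    return {
        "N_KV": n_kv,
        "S_PP": b * pp / t_pp,    # prompt-processing throughput, tokens/s
        "S_TG": b * tg / t_tg,    # aggregate generation throughput, tokens/s
        "S": n_kv / t_total,      # overall throughput, tokens/s
    }


# Row "512 32 8" from the Nemotron-3-Super-120B table above.
row = derived_columns(pp=512, tg=32, b=8, t_pp=5.291, t_tg=6.151)
for name, value in row.items():
    print(f"{name}: {value:.2f}")
```

The recomputed values agree with the table to within the rounding of the printed times. Note that `S_TG` is summed over all `B` sequences, so the per-sequence generation speed is `S_TG / B`.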

llama-bench

model	size	params	backend	n_ubatch	fa	test	t/s
nemotron 120B.A12B Q4_K	65.10 GiB	120.67 B	CUDA	2048	1	pp2048	768.84 ± 0.90
nemotron 120B.A12B Q4_K	65.10 GiB	120.67 B	CUDA	2048	1	tg32	19.94 ± 0.16
nemotron 120B.A12B Q4_K	65.10 GiB	120.67 B	CUDA	2048	1	pp2048 @ d4096	764.51 ± 0.50
nemotron 120B.A12B Q4_K	65.10 GiB	120.67 B	CUDA	2048	1	tg32 @ d4096	19.95 ± 0.18
nemotron 120B.A12B Q4_K	65.10 GiB	120.67 B	CUDA	2048	1	pp2048 @ d8192	759.53 ± 0.71
nemotron 120B.A12B Q4_K	65.10 GiB	120.67 B	CUDA	2048	1	tg32 @ d8192	19.83 ± 0.18
nemotron 120B.A12B Q4_K	65.10 GiB	120.67 B	CUDA	2048	1	pp2048 @ d16384	747.98 ± 1.58
nemotron 120B.A12B Q4_K	65.10 GiB	120.67 B	CUDA	2048	1	tg32 @ d16384	19.84 ± 0.18
nemotron 120B.A12B Q4_K	65.10 GiB	120.67 B	CUDA	2048	1	pp2048 @ d32768	724.40 ± 2.70
nemotron 120B.A12B Q4_K	65.10 GiB	120.67 B	CUDA	2048	1	tg32 @ d32768	19.45 ± 0.18

build: 04a65daab (8268)
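
To read the depth rows of the llama-bench table more easily, the sketch below expresses prompt-processing throughput at each context depth relative to the depth-0 run. The mean `pp2048` values are copied from the table above (the `±` spreads are ignored), and the variable names are illustrative only.

```python
# Mean pp2048 throughput (t/s) by context depth, copied from the table above.
pp_tps = {
    0: 768.84,
    4096: 764.51,
    8192: 759.53,
    16384: 747.98,
    32768: 724.40,
}

baseline = pp_tps[0]

# Fraction of the depth-0 throughput retained at each depth.
relative = {depth: tps / baseline for depth, tps in pp_tps.items()}

for depth, ratio in relative.items():
    print(f"d{depth}: {ratio:.3f} of baseline ({100 * (1 - ratio):.1f}% slower)")
```

On these numbers, prompt processing at a depth of 32768 tokens retains roughly 94% of the depth-0 throughput, while the `tg32` rows stay within about 0.5 t/s of each other across all depths.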

ggml-org/Nemotron-3-Nano-4B-GGUF

Model: https://huggingface.co/ggml-org/Nemotron-3-Nano-4B-GGUF

llama-batched-bench