llama.cpp/benches/dgx-spark/dgx-spark.md
Commit 15274c0c50: "benches : add eval results (#17139)" [no ci]
Georgi Gerganov, 2025-11-10 10:44:10 +02:00

## System info

```
uname --all
Linux spark-17ed 6.11.0-1016-nvidia #16-Ubuntu SMP PREEMPT_DYNAMIC Sun Sep 21 16:52:46 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux
```

```
g++ --version
g++ (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
```

```
nvidia-smi
Sun Nov  2 10:43:25 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   35C    P8              4W /  N/A  | Not Supported          |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
```

## ggml-org/gpt-oss-20b-GGUF

Model: https://huggingface.co/ggml-org/gpt-oss-20b-GGUF

- `llama-batched-bench`

```
main: n_kv_max = 270336, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 20, n_threads_batch = 20
```

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 512 | 32 | 1 | 544 | 0.374 | 1369.01 | 0.383 | 83.64 | 0.757 | 719.01 |
| 512 | 32 | 2 | 1088 | 0.274 | 3741.35 | 0.659 | 97.14 | 0.933 | 1166.66 |
| 512 | 32 | 4 | 2176 | 0.526 | 3896.47 | 0.817 | 156.73 | 1.342 | 1621.08 |
| 512 | 32 | 8 | 4352 | 1.044 | 3925.10 | 0.987 | 259.44 | 2.030 | 2143.56 |
| 512 | 32 | 16 | 8704 | 2.076 | 3945.84 | 1.248 | 410.32 | 3.324 | 2618.60 |
| 512 | 32 | 32 | 17408 | 4.170 | 3929.28 | 1.630 | 628.40 | 5.799 | 3001.76 |
| 4096 | 32 | 1 | 4128 | 1.083 | 3782.66 | 0.394 | 81.21 | 1.477 | 2795.13 |
| 4096 | 32 | 2 | 8256 | 2.166 | 3782.72 | 0.725 | 88.28 | 2.891 | 2856.14 |
| 4096 | 32 | 4 | 16512 | 4.333 | 3780.88 | 0.896 | 142.82 | 5.230 | 3157.38 |
| 4096 | 32 | 8 | 33024 | 8.618 | 3802.14 | 1.155 | 221.69 | 9.773 | 3379.08 |
| 4096 | 32 | 16 | 66048 | 17.330 | 3781.73 | 1.598 | 320.34 | 18.928 | 3489.45 |
| 4096 | 32 | 32 | 132096 | 34.671 | 3780.48 | 2.336 | 438.35 | 37.007 | 3569.51 |
| 8192 | 32 | 1 | 8224 | 2.233 | 3668.56 | 0.438 | 72.98 | 2.671 | 3078.44 |
| 8192 | 32 | 2 | 16448 | 4.425 | 3702.95 | 0.756 | 84.66 | 5.181 | 3174.95 |
| 8192 | 32 | 4 | 32896 | 8.859 | 3698.64 | 0.967 | 132.38 | 9.826 | 3347.72 |
| 8192 | 32 | 8 | 65792 | 17.714 | 3699.57 | 1.277 | 200.52 | 18.991 | 3464.35 |
| 8192 | 32 | 16 | 131584 | 35.494 | 3692.84 | 1.841 | 278.12 | 37.335 | 3524.46 |
| 8192 | 32 | 32 | 263168 | 70.949 | 3694.82 | 2.798 | 365.99 | 73.747 | 3568.53 |
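As a sanity check, the batched-bench columns are internally consistent: T s is the sum of the prompt and generation phases (T_PP + T_TG), and the aggregate throughput S t/s is the total token count B * (PP + TG) divided by T. A minimal check against the first row (values copied from the table above; small differences come from the times being rounded to milliseconds):

```python
# First row of the batched-bench table above:
# PP=512, TG=32, B=1, T_PP=0.374 s, T_TG=0.383 s, reported T=0.757 s, S=719.01 t/s
pp, tg, b = 512, 32, 1
t_pp, t_tg = 0.374, 0.383

t_total = t_pp + t_tg              # total wall time: matches the reported T s column
s_total = b * (pp + tg) / t_total  # aggregate throughput in tokens/s

print(round(t_total, 3))  # 0.757
print(round(s_total, 2))  # within rounding error of the reported 719.01
```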

- `llama-bench`

| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | ---: | ---: | ---: | ---: | --- | ---: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 | 3714.25 ± 20.36 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | 0 | tg32 | 86.58 ± 0.43 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 3445.17 ± 17.85 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 81.72 ± 0.53 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 3218.78 ± 11.34 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 74.86 ± 0.64 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 2732.83 ± 7.17 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 71.57 ± 0.51 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 2119.75 ± 12.81 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 62.33 ± 0.24 |

build: eeee367de (6989)
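For reference, results like the above are typically produced with invocations along the following lines. The model path is hypothetical, and the flag spellings should be checked against each tool's `--help` for the build in use; only the parameters echoed in the headers above (n_batch, n_ubatch, flash_attn, the PP/TG/B grid, and the depth sweep) are taken from this file.

```sh
# llama-batched-bench: PP x TG x B grid matching the tables above
# (model path is hypothetical)
llama-batched-bench -m gpt-oss-20b-mxfp4.gguf \
    -c 270336 -b 2048 -ub 2048 -fa 1 \
    -npp 512,4096,8192 -ntg 32 -npl 1,2,4,8,16,32

# llama-bench: pp2048/tg32 at increasing KV-cache depths, mmap disabled
llama-bench -m gpt-oss-20b-mxfp4.gguf \
    -ub 2048 -fa 1 -mmp 0 \
    -p 2048 -n 32 -d 0,4096,8192,16384,32768
```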

## ggml-org/gpt-oss-120b-GGUF

Model: https://huggingface.co/ggml-org/gpt-oss-120b-GGUF

- `llama-batched-bench`

```
main: n_kv_max = 270336, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 20, n_threads_batch = 20
```

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 512 | 32 | 1 | 544 | 0.571 | 897.18 | 0.543 | 58.96 | 1.113 | 488.60 |
| 512 | 32 | 2 | 1088 | 0.593 | 1725.37 | 1.041 | 61.45 | 1.635 | 665.48 |
| 512 | 32 | 4 | 2176 | 1.043 | 1963.15 | 1.334 | 95.95 | 2.377 | 915.36 |
| 512 | 32 | 8 | 4352 | 2.099 | 1951.63 | 1.717 | 149.07 | 3.816 | 1140.45 |
| 512 | 32 | 16 | 8704 | 4.207 | 1947.12 | 2.311 | 221.56 | 6.518 | 1335.35 |
| 512 | 32 | 32 | 17408 | 8.422 | 1945.36 | 3.298 | 310.46 | 11.720 | 1485.27 |
| 4096 | 32 | 1 | 4128 | 2.138 | 1915.88 | 0.571 | 56.09 | 2.708 | 1524.12 |
| 4096 | 32 | 2 | 8256 | 4.266 | 1920.25 | 1.137 | 56.27 | 5.404 | 1527.90 |
| 4096 | 32 | 4 | 16512 | 8.564 | 1913.02 | 1.471 | 86.99 | 10.036 | 1645.29 |
| 4096 | 32 | 8 | 33024 | 17.092 | 1917.19 | 1.979 | 129.33 | 19.071 | 1731.63 |
| 4096 | 32 | 16 | 66048 | 34.211 | 1915.65 | 2.850 | 179.66 | 37.061 | 1782.15 |
| 4096 | 32 | 32 | 132096 | 68.394 | 1916.44 | 4.381 | 233.72 | 72.775 | 1815.13 |
| 8192 | 32 | 1 | 8224 | 4.349 | 1883.45 | 0.620 | 51.65 | 4.969 | 1655.04 |
| 8192 | 32 | 2 | 16448 | 8.674 | 1888.83 | 1.178 | 54.33 | 9.852 | 1669.48 |
| 8192 | 32 | 4 | 32896 | 17.351 | 1888.55 | 1.580 | 81.01 | 18.931 | 1737.68 |
| 8192 | 32 | 8 | 65792 | 34.743 | 1886.31 | 2.173 | 117.80 | 36.916 | 1782.20 |
| 8192 | 32 | 16 | 131584 | 69.413 | 1888.29 | 3.297 | 155.28 | 72.710 | 1809.70 |
| 8192 | 32 | 32 | 263168 | 138.903 | 1887.24 | 5.004 | 204.63 | 143.907 | 1828.73 |

- `llama-bench`

| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | ---: | ---: | ---: | ---: | --- | ---: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 | 1919.36 ± 5.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 | 60.40 ± 0.30 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 1825.30 ± 6.37 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 56.94 ± 0.29 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 1739.19 ± 6.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 52.51 ± 0.42 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 1536.75 ± 4.27 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 49.33 ± 0.27 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 1255.85 ± 3.26 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 42.99 ± 0.18 |

build: eeee367de (6989)

## ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF

Model: https://huggingface.co/ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF

- `llama-batched-bench`

```
main: n_kv_max = 270336, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 20, n_threads_batch = 20
```

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 512 | 32 | 1 | 544 | 0.398 | 1285.90 | 0.530 | 60.41 | 0.928 | 586.27 |
| 512 | 32 | 2 | 1088 | 0.386 | 2651.65 | 0.948 | 67.50 | 1.334 | 815.38 |
| 512 | 32 | 4 | 2176 | 0.666 | 3076.37 | 1.209 | 105.87 | 1.875 | 1160.71 |
| 512 | 32 | 8 | 4352 | 1.325 | 3091.39 | 1.610 | 158.98 | 2.935 | 1482.65 |
| 512 | 32 | 16 | 8704 | 2.664 | 3075.58 | 2.150 | 238.19 | 4.813 | 1808.39 |
| 512 | 32 | 32 | 17408 | 5.336 | 3070.31 | 2.904 | 352.59 | 8.240 | 2112.50 |
| 4096 | 32 | 1 | 4128 | 1.444 | 2836.81 | 0.581 | 55.09 | 2.025 | 2038.81 |
| 4096 | 32 | 2 | 8256 | 2.872 | 2852.14 | 1.084 | 59.06 | 3.956 | 2086.99 |
| 4096 | 32 | 4 | 16512 | 5.744 | 2852.32 | 1.440 | 88.90 | 7.184 | 2298.47 |
| 4096 | 32 | 8 | 33024 | 11.463 | 2858.68 | 2.068 | 123.78 | 13.531 | 2440.65 |
| 4096 | 32 | 16 | 66048 | 22.915 | 2859.95 | 3.018 | 169.67 | 25.933 | 2546.90 |
| 4096 | 32 | 32 | 132096 | 45.956 | 2852.10 | 4.609 | 222.18 | 50.565 | 2612.39 |
| 8192 | 32 | 1 | 8224 | 3.063 | 2674.72 | 0.693 | 46.20 | 3.755 | 2189.92 |
| 8192 | 32 | 2 | 16448 | 6.109 | 2681.87 | 1.214 | 52.71 | 7.323 | 2245.98 |
| 8192 | 32 | 4 | 32896 | 12.197 | 2686.63 | 1.682 | 76.11 | 13.878 | 2370.30 |
| 8192 | 32 | 8 | 65792 | 24.409 | 2684.94 | 2.556 | 100.17 | 26.965 | 2439.95 |
| 8192 | 32 | 16 | 131584 | 48.753 | 2688.50 | 3.994 | 128.20 | 52.747 | 2494.64 |
| 8192 | 32 | 32 | 263168 | 97.508 | 2688.42 | 6.528 | 156.86 | 104.037 | 2529.57 |

- `llama-bench`

| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | ---: | ---: | ---: | ---: | --- | ---: |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 | 2925.55 ± 4.25 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | tg32 | 62.80 ± 0.27 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 2531.01 ± 6.79 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 55.86 ± 0.33 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 2244.39 ± 5.33 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 45.95 ± 0.33 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 1783.17 ± 3.68 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 39.07 ± 0.10 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 1241.90 ± 3.13 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 29.92 ± 0.06 |
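The depth sweep quantifies how decode speed degrades as the KV cache fills; for this model the drop is steep. A small calculation of the relative slowdown (tg32 numbers copied from the table above):

```python
# tg32 throughput (t/s) at each KV-cache depth, from the llama-bench table above
tg32_by_depth = {0: 62.80, 4096: 55.86, 8192: 45.95, 16384: 39.07, 32768: 29.92}

base = tg32_by_depth[0]
for depth, tps in tg32_by_depth.items():
    drop_pct = 100.0 * (1.0 - tps / base)
    print(f"d{depth}: {tps:.2f} t/s ({drop_pct:.1f}% slower than d0)")
```

At d32768 this works out to roughly a 52% slowdown relative to an empty context.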

build: eeee367de (6989)

## ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF

Model: https://huggingface.co/ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF

- `llama-batched-bench`

```
main: n_kv_max = 270336, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 20, n_threads_batch = 20
```

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 512 | 32 | 1 | 544 | 0.211 | 2421.57 | 1.055 | 30.33 | 1.266 | 429.57 |
| 512 | 32 | 2 | 1088 | 0.419 | 2441.34 | 1.130 | 56.65 | 1.549 | 702.32 |
| 512 | 32 | 4 | 2176 | 0.873 | 2345.54 | 1.174 | 108.99 | 2.048 | 1062.74 |
| 512 | 32 | 8 | 4352 | 1.727 | 2371.85 | 1.254 | 204.22 | 2.980 | 1460.19 |
| 512 | 32 | 16 | 8704 | 3.452 | 2373.22 | 1.492 | 343.16 | 4.944 | 1760.56 |
| 512 | 32 | 32 | 17408 | 6.916 | 2368.93 | 1.675 | 611.51 | 8.591 | 2026.36 |
| 4096 | 32 | 1 | 4128 | 1.799 | 2277.26 | 1.084 | 29.51 | 2.883 | 1431.91 |
| 4096 | 32 | 2 | 8256 | 3.577 | 2290.01 | 1.196 | 53.50 | 4.774 | 1729.51 |
| 4096 | 32 | 4 | 16512 | 7.172 | 2284.36 | 1.313 | 97.50 | 8.485 | 1946.00 |
| 4096 | 32 | 8 | 33024 | 14.341 | 2284.96 | 1.520 | 168.46 | 15.860 | 2082.18 |
| 4096 | 32 | 16 | 66048 | 28.675 | 2285.44 | 1.983 | 258.21 | 30.658 | 2154.33 |
| 4096 | 32 | 32 | 132096 | 57.354 | 2285.32 | 2.640 | 387.87 | 59.994 | 2201.82 |
| 8192 | 32 | 1 | 8224 | 3.701 | 2213.75 | 1.119 | 28.59 | 4.820 | 1706.34 |
| 8192 | 32 | 2 | 16448 | 7.410 | 2211.19 | 1.272 | 50.31 | 8.682 | 1894.56 |
| 8192 | 32 | 4 | 32896 | 14.802 | 2213.83 | 1.460 | 87.68 | 16.261 | 2022.96 |
| 8192 | 32 | 8 | 65792 | 29.609 | 2213.35 | 1.781 | 143.74 | 31.390 | 2095.93 |
| 8192 | 32 | 16 | 131584 | 59.229 | 2212.96 | 2.495 | 205.17 | 61.725 | 2131.79 |
| 8192 | 32 | 32 | 263168 | 118.449 | 2213.15 | 3.714 | 275.75 | 122.162 | 2154.25 |

- `llama-bench`

| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | ---: | ---: | ---: | ---: | --- | ---: |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 | 2272.74 ± 4.68 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | CUDA | 99 | 2048 | 1 | 0 | tg32 | 30.66 ± 0.02 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 2107.80 ± 9.55 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 29.71 ± 0.05 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 1937.80 ± 6.75 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 28.86 ± 0.04 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 1641.12 ± 1.78 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 27.24 ± 0.04 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 1296.02 ± 2.67 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 23.78 ± 0.03 |

build: eeee367de (6989)

## ggml-org/gemma-3-4b-it-qat-GGUF

Model: https://huggingface.co/ggml-org/gemma-3-4b-it-qat-GGUF

- `llama-batched-bench`

```
main: n_kv_max = 270336, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 20, n_threads_batch = 20
```

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 512 | 32 | 1 | 544 | 0.094 | 5434.73 | 0.394 | 81.21 | 0.488 | 1114.15 |
| 512 | 32 | 2 | 1088 | 0.168 | 6091.68 | 0.498 | 128.52 | 0.666 | 1633.41 |
| 512 | 32 | 4 | 2176 | 0.341 | 6010.68 | 0.542 | 236.37 | 0.882 | 2466.43 |
| 512 | 32 | 8 | 4352 | 0.665 | 6161.46 | 0.678 | 377.74 | 1.342 | 3241.72 |
| 512 | 32 | 16 | 8704 | 1.323 | 6193.19 | 0.902 | 567.41 | 2.225 | 3911.74 |
| 512 | 32 | 32 | 17408 | 2.642 | 6202.03 | 1.231 | 832.03 | 3.872 | 4495.36 |
| 4096 | 32 | 1 | 4128 | 0.701 | 5840.49 | 0.439 | 72.95 | 1.140 | 3621.23 |
| 4096 | 32 | 2 | 8256 | 1.387 | 5906.82 | 0.574 | 111.48 | 1.961 | 4210.12 |
| 4096 | 32 | 4 | 16512 | 2.758 | 5940.33 | 0.651 | 196.58 | 3.409 | 4843.33 |
| 4096 | 32 | 8 | 33024 | 5.491 | 5967.56 | 0.876 | 292.40 | 6.367 | 5187.12 |
| 4096 | 32 | 16 | 66048 | 10.978 | 5969.58 | 1.275 | 401.69 | 12.253 | 5390.38 |
| 4096 | 32 | 32 | 132096 | 21.944 | 5972.93 | 1.992 | 514.16 | 23.936 | 5518.73 |
| 8192 | 32 | 1 | 8224 | 1.402 | 5841.91 | 0.452 | 70.73 | 1.855 | 4434.12 |
| 8192 | 32 | 2 | 16448 | 2.793 | 5865.34 | 0.637 | 100.55 | 3.430 | 4795.51 |
| 8192 | 32 | 4 | 32896 | 5.564 | 5889.64 | 0.770 | 166.26 | 6.334 | 5193.95 |
| 8192 | 32 | 8 | 65792 | 11.114 | 5896.44 | 1.122 | 228.07 | 12.237 | 5376.51 |
| 8192 | 32 | 16 | 131584 | 22.210 | 5901.38 | 1.789 | 286.15 | 24.000 | 5482.74 |
| 8192 | 32 | 32 | 263168 | 44.382 | 5906.56 | 3.044 | 336.38 | 47.426 | 5549.02 |

- `llama-bench`

| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | ---: | ---: | ---: | ---: | --- | ---: |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 | 5810.04 ± 21.71 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | CUDA | 99 | 2048 | 1 | 0 | tg32 | 84.54 ± 0.18 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 5288.04 ± 3.54 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 78.82 ± 1.37 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 4960.43 ± 16.64 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 74.13 ± 0.30 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 4495.92 ± 31.11 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 72.37 ± 0.29 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | CUDA | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 3746.90 ± 40.01 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | CUDA | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 63.02 ± 0.20 |

build: eeee367de (6989)
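Pulling the zero-depth llama-bench numbers together (all values copied from the tables above) gives a quick cross-model summary of prompt-processing versus decode throughput:

```python
# (pp2048 t/s, tg32 t/s) at depth 0, copied from the llama-bench tables above
results = {
    "gpt-oss 20B MXFP4 MoE":  (3714.25, 86.58),
    "gpt-oss 120B MXFP4 MoE": (1919.36, 60.40),
    "qwen3moe 30B.A3B Q8_0":  (2925.55, 62.80),
    "qwen2 7B Q8_0":          (2272.74, 30.66),
    "gemma3 4B Q4_0":         (5810.04, 84.54),
}

# Sort by decode speed, the number that dominates interactive use
for name, (pp, tg) in sorted(results.items(), key=lambda kv: -kv[1][1]):
    print(f"{name:24s}  pp2048 {pp:8.2f} t/s   tg32 {tg:6.2f} t/s")
```

Note that the MoE models decode faster than the dense 7B despite far larger total parameter counts, since only a small subset of experts is active per token.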