mirror of
https://github.com/ggml-org/llama.cpp.git
synced 2026-05-08 01:54:10 +00:00
Argsort is used for top-k currently. WE optimize argsort by 2 things: 1. Use `DeviceRadixSort` for single-row/sequence to parallelize it across our SMs 2. Use `DeviceSegmentedSort` for multi-row/sequence as this is the correct entrypoint (the function chooses different execution paths, it contains `DeviceSegmentedRadixSort` as one of the paths and will choose the best one according to heuristics. https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceSegmentedSort.html#overview Some perf numbers for a RTX PRO 6000: On the kernel level, tested with `GGML_CUDA_DISABLE_GRAPHS=1 ./test-backend-ops -o ARGSORT perf` Before: ``` ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 359.24 us/run ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 8192 runs - 861.34 us/run ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 1020.01 us/run ``` After: ``` ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 312.41 us/run ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 16384 runs - 63.48 us/run ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 874.36 us/run ``` --- On the model level, tested with `llama-cli -m gpt-oss-20b-mxfp4.gguf -n 200 -p "What is the Capital of Sweden?" -no-cnv -fa 1 --backend-sampling` Before: ``` llama_perf_sampler_print: sampling time = 0.25 ms / 207 runs ( 0.00 ms per token, 824701.20 tokens per second) llama_perf_context_print: load time = 18215.58 ms llama_perf_context_print: prompt eval time = 28.20 ms / 7 tokens ( 4.03 ms per token, 248.19 tokens per second) llama_perf_context_print: eval time = 714.79 ms / 199 runs ( 3.59 ms per token, 278.40 tokens per second) llama_perf_context_print: total time = 857.62 ms / 206 tokens ``` After ``` llama_perf_sampler_print: sampling time = 0.25 ms / 207 runs ( 0.00 ms per token, 828000.00 tokens per second) llama_perf_context_print: load time = 18366.92 ms llama_perf_context_print: prompt eval time = 35.92 ms / 7 tokens ( 5.13 ms per token, 194.87 tokens per second) llama_perf_context_print: eval time = 532.79 ms / 199 runs ( 2.68 ms per token, 373.50 tokens per second) llama_perf_context_print: total time = 683.65 ms / 206 tokens ```