CUDA: add a fused top-K MoE kernel (#16130)

* CUDA: add a fused top-K MoE kernel

This kernel does the following:
1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce to pick the top-k (n_expert_used) logits
3. write weights + ids to global memory

It is intended as a fusion of the softmax->top-k->get_rows pipeline for MoE models; a rough sketch of the per-token work follows.
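
A minimal per-token sketch of the fused work, assuming one warp per token, n_experts <= 32, contiguous float logits, and illustrative names (this is not the kernel added by this commit):

```cuda
#include <cuda_runtime.h>
#include <cfloat>

// launch as: topk_moe_sketch<<<n_tokens, 32>>>(logits, weights, ids, n_experts, k);
__global__ void topk_moe_sketch(const float * logits,   // [n_tokens, n_experts]
                                float       * weights,  // [n_tokens, k]
                                int         * ids,      // [n_tokens, k]
                                int n_experts, int k) {
    const int token = blockIdx.x;
    const int lane  = threadIdx.x;  // one warp (32 lanes) per token

    // 1. each lane holds one expert's logit in a register (-inf for padding lanes)
    float x = lane < n_experts ? logits[token * n_experts + lane] : -FLT_MAX;

    // 2. softmax across the warp: max-reduce, exponentiate, sum-reduce
    float m = x;
    for (int off = 16; off > 0; off >>= 1) {
        m = fmaxf(m, __shfl_xor_sync(0xffffffff, m, off));
    }
    float e = lane < n_experts ? expf(x - m) : 0.0f;
    float s = e;
    for (int off = 16; off > 0; off >>= 1) {
        s += __shfl_xor_sync(0xffffffff, s, off);
    }
    float p = lane < n_experts ? e / s : -FLT_MAX;

    // 3. k rounds of warp argmax; lane 0 writes weight + expert id each round
    for (int i = 0; i < k; ++i) {
        float best    = p;
        int   best_id = lane;
        for (int off = 16; off > 0; off >>= 1) {
            const float ov  = __shfl_xor_sync(0xffffffff, best,    off);
            const int   oid = __shfl_xor_sync(0xffffffff, best_id, off);
            // break ties on the lower expert index so every lane agrees on the winner
            if (ov > best || (ov == best && oid < best_id)) {
                best    = ov;
                best_id = oid;
            }
        }
        if (lane == 0) {
            weights[token * k + i] = best;
            ids    [token * k + i] = best_id;
        }
        if (lane == best_id) {
            p = -FLT_MAX;  // exclude this expert from the next round
        }
    }
}
```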

* Refactor into ggml_cuda_should_use_topk_moe

* Review: Use better coalescing pattern, use WARP_SIZE, store logits in registers before the reduction
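
For context, the kind of coalesced, register-resident load pattern this refers to might look like the sketch below (assumptions: n_experts is a multiple of WARP_SIZE; ggml defines its own WARP_SIZE, the constant here is only for the sketch):

```cuda
constexpr int WARP_SIZE = 32;

// Each lane keeps n_experts / WARP_SIZE logits in registers; on every iteration
// the 32 lanes read 32 consecutive floats, i.e. one coalesced transaction.
template <int n_experts>
__device__ void load_logits_coalesced(const float * logits_row,
                                      float (&reg)[n_experts / WARP_SIZE]) {
    const int lane = threadIdx.x % WARP_SIZE;
#pragma unroll
    for (int i = 0; i < n_experts / WARP_SIZE; ++i) {
        reg[i] = logits_row[i * WARP_SIZE + lane];
    }
}
```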

* Review: format + micro-optimizations

* Fix bug: correct the tie-breaker handling

* Add optional norm + clean-up code
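
The optional norm presumably rescales the k selected weights so they sum to 1; a hedged sketch with illustrative names (not the commit's code):

```cuda
// Renormalize the k selected expert weights in place so that they sum to 1.
__device__ void normalize_topk_weights(float * w, int k) {
    float sum = 0.0f;
    for (int i = 0; i < k; ++i) {
        sum += w[i];
    }
    const float inv = 1.0f / sum;
    for (int i = 0; i < k; ++i) {
        w[i] *= inv;
    }
}
```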

* Use smem for final write

* Add bounds check

* Use better memory pattern for writeback
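
Taken together with the smem staging and the bounds check above, the writeback might be structured roughly as follows (a sketch with assumed names: one block handles rows_per_block tokens whose results were first placed in shared memory):

```cuda
// Assumes the block has already filled w_smem/id_smem and done a __syncthreads().
// The flattened [rows_per_block][k] tile is written out by the whole block so that
// consecutive threads touch consecutive addresses; the bounds check guards the
// final block when n_tokens is not a multiple of rows_per_block.
template <int rows_per_block, int k>
__device__ void write_back(const float (&w_smem)[rows_per_block][k],
                           const int   (&id_smem)[rows_per_block][k],
                           float * weights, int * ids,
                           int first_token, int n_tokens) {
    for (int idx = threadIdx.x; idx < rows_per_block * k; idx += blockDim.x) {
        const int row   = idx / k;
        const int col   = idx % k;
        const int token = first_token + row;
        if (token < n_tokens) {
            weights[token * k + col] = w_smem[row][col];
            ids    [token * k + col] = id_smem[row][col];
        }
    }
}
```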
Author: Aman Gupta
Date: 2025-09-25 22:35:05 +08:00
Committed by: GitHub
Parent: aa3ee0eb0b
Commit: 077c94d0ca
5 changed files with 381 additions and 0 deletions

@@ -932,6 +932,7 @@ ggml_tensor * llm_graph_context::build_moe_ffn(
ggml_reshape_3d(ctx0, probs, 1, n_expert, n_tokens), selected_experts); // [1, n_expert_used, n_tokens]
cb(weights, "ffn_moe_weights", il);
if (gating_op == LLAMA_EXPERT_GATING_FUNC_TYPE_SOFTMAX_WEIGHT) {
weights = ggml_reshape_2d(ctx0, weights, n_expert_used, n_tokens);
weights = ggml_soft_max(ctx0, weights); // [n_expert_used, n_tokens]
@@ -955,6 +956,9 @@ ggml_tensor * llm_graph_context::build_moe_ffn(
cb(weights, "ffn_moe_weights_scaled", il);
}
//call early so that topk-moe can be used
ggml_build_forward_expand(gf, weights);
cur = ggml_reshape_3d(ctx0, cur, n_embd, 1, n_tokens);
if (weight_before_ffn) {