[Common] Fix fused router for large top-K and expert counts by harryzhou2000 · Pull Request #2821 · NVIDIA/TransformerEngine

harryzhou2000 · 2026-04-01T14:41:29Z

Description

Fix fused router support for large topk and num_expert values. Any topk with num_expert <= 2304 is now supported with reasonable performance.

Current benchmarks show that the fused topk forward kernel is faster than PyTorch at topk=32, which makes it roughly 8x faster than the same kernel before this optimization.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Fix fused MoE router kernels to support large top-K values and large numbers of experts (1024+) by adding a warp-level radix-selection top-K algorithm (O(E), independent of K) alongside the existing naive O(K^2*E) implementation. The launcher dispatches to the radix path when topk >= 16 and to the naive path otherwise.
Expand dynamic shared memory allocation via cudaFuncSetAttribute in both the forward and backward kernel launchers to avoid silent failures when the shared memory needed for a large expert count exceeds the default 48 KB limit.
Rewrite apply_softmax_on_float to use a numerically stable online max+sum accumulation
(two-pass → single-pass) with NaN-safe warp reduction, eliminating shared-memory round-trips.
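
The online max+sum accumulation mentioned above can be illustrated with a small host-side sketch. This is not the kernel code from utils.h: the names `merge` and `online_softmax`, the simulated 32-lane layout, and the reading of "NaN-safe" as guarding the case where both partial maxima are -inf are all assumptions made for illustration.

```python
import math

def merge(a, b):
    """Combine two partial (running_max, running_sum) states into one."""
    (m_a, s_a), (m_b, s_b) = a, b
    m = max(m_a, m_b)
    if m == -math.inf:
        # Assumed NaN guard: both partials are empty, and
        # exp(-inf - (-inf)) would otherwise evaluate to NaN.
        return (-math.inf, 0.0)
    # Rescale each partial sum to the shared maximum before adding.
    return (m, s_a * math.exp(m_a - m) + s_b * math.exp(m_b - m))

def online_softmax(scores, lanes=32):
    """Softmax using one accumulation pass per simulated lane (lanes must be a power of two)."""
    partials = []
    for lane in range(lanes):
        state = (-math.inf, 0.0)
        # Each lane strides over the row and tracks max and sum together.
        for x in scores[lane::lanes]:
            state = merge(state, (x, 1.0))
        partials.append(state)
    # Butterfly (XOR) reduction: after log2(lanes) steps every lane holds
    # the combined state of the whole row.
    offset = lanes // 2
    while offset > 0:
        partials = [merge(partials[i], partials[i ^ offset]) for i in range(lanes)]
        offset //= 2
    row_max, row_sum = partials[0]
    # Final normalization pass.
    return [math.exp(x - row_max) / row_sum for x in scores]
```

Because every partial sum is rescaled to the current maximum before it is added, no exponent ever receives a positive argument, so large logits do not overflow.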

Details

Radix top-K selection (utils.h):
Implements a 4-bit radix selection algorithm (8 passes over float32) that finds the K-th largest
value in O(E/32) per warp, independent of K. Phase 1 narrows the bit pattern of the K-th value
via histogram counting; Phase 2 gathers elements into output arrays with deterministic tie-breaking
(value DESC, index ASC) matching torch.topk behavior.
Dispatch logic (fused_topk_with_score_function.cu, fused_score_for_moe_aux_loss.cu):
The TopkFuncType template parameter (Naive/Radix) is selected at launch time: Naive when
topk < 16, Radix otherwise. Both the forward and backward kernel launchers now call cudaFuncSetAttribute to
request the required dynamic shared memory size before launch.
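
The two phases of the radix selection described above can be sketched sequentially in Python. This is a single-threaded illustration, not the warp-level CUDA code: `radix_topk` is a hypothetical name, the body of `float_to_ordered_uint` is the usual sign-flip mapping and is assumed rather than copied from utils.cuh, and the final sort by value is added only to make the output easy to compare.

```python
import struct

def float_to_ordered_uint(x):
    """Map a float32 to a uint32 whose unsigned order matches the float order."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Negative floats: flip every bit. Non-negative floats: set the sign bit.
    return (~bits & 0xFFFFFFFF) if bits & 0x80000000 else bits | 0x80000000

def radix_topk(scores, k):
    """Return indices of the k largest scores (value DESC, index ASC)."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must satisfy 1 <= k <= len(scores)")
    keys = [float_to_ordered_uint(s) for s in scores]
    prefix, mask, remaining = 0, 0, k
    # Phase 1: 8 passes of 4 bits each, most significant digit first.
    for shift in range(28, -1, -4):
        hist = [0] * 16
        for key in keys:
            if (key & mask) == prefix:  # still a candidate for the k-th value
                hist[(key >> shift) & 0xF] += 1
        # Walk digits from largest to smallest until the k-th value is covered.
        for digit in range(15, -1, -1):
            if hist[digit] >= remaining:
                prefix |= digit << shift
                break
            remaining -= hist[digit]
        mask |= 0xF << shift
    kth_key = prefix  # full bit pattern of the k-th largest value
    # Phase 2a: gather everything strictly greater than the k-th value.
    greater = [i for i, key in enumerate(keys) if key > kth_key]
    # Phase 2b: fill the remaining slots with ties in ascending index order.
    ties = [i for i, key in enumerate(keys) if key == kth_key][: k - len(greater)]
    # Assumption: order the result by value for readability.
    return sorted(greater + ties, key=lambda i: (-keys[i], i))
```

Each pass touches every element once, so the work is proportional to the number of experts and does not grow with k, which is the property the description relies on.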
Tests (test_fused_router.py):

Add num_experts=1024 to all parametrized test cases.
Add _get_tolerances() helper that scales atol/rtol with expert count to account for
O(N * eps) accumulation divergence between fused and reference implementations.
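
A tolerance that grows with the expert count might look like the sketch below. It is purely hypothetical: the real `_get_tolerances()` in test_fused_router.py is not shown here, and the function name `scaled_tolerances`, the `safety` factor, and the linear `N * eps` formula are assumptions chosen only to illustrate the O(N * eps) argument.

```python
FP32_EPS = 2.0 ** -23  # machine epsilon for float32

def scaled_tolerances(num_experts, dtype="float32", safety=4.0):
    """Hypothetical helper: tolerances proportional to num_experts * eps."""
    if dtype != "float32":
        # The review summary notes the real helper raises on non-fp32 dtypes.
        raise ValueError(f"tolerances are only defined for float32, got {dtype}")
    # Summing N terms can accumulate rounding error on the order of N * eps,
    # so the allowed divergence grows linearly with the expert count.
    tol = safety * num_experts * FP32_EPS
    return {"atol": tol, "rtol": tol}
```

With these assumed constants, 1024 experts would give a tolerance of about 4.9e-4, compared with about 6.1e-5 for 128 experts.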

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

greptile-apps · 2026-04-01T14:49:49Z

Greptile Summary

This PR fixes fused MoE router kernels for large top-K values and expert counts by adding a warp-level radix selection algorithm (O(E), independent of K) alongside the existing O(K²E) naive implementation, dispatched at topk >= 16. It also expands dynamic shared memory via cudaFuncSetAttribute to handle expert counts above the default 48 KB limit, rewrites apply_softmax_on_float with a numerically stable single-pass online accumulation, and updates tests to cover the new Radix path (topk=16, 32) and num_experts=1024.

Previous feedback on unchecked cudaFuncSetAttribute return values, untested Radix paths, and the dead dtype guard in _get_tolerances has all been addressed in this revision.

Confidence Score: 5/5

Safe to merge; the radix algorithm, smem expansion, and softmax rewrite are all correct, and the only remaining finding is a minor per-launch performance suggestion.

All P0/P1 concerns from prior review rounds have been addressed: cudaFuncSetAttribute return values are now checked, Radix path is covered by topk=16,32 tests, and _get_tolerances now raises on non-fp32 dtypes. The sole new finding is a P2 performance note about redundant device attribute queries per kernel launch, which does not affect correctness.

transformer_engine/common/fused_router/utils.h — check_shared_memory_capacity_num_experts queries device attributes on every launch; worth caching.

Important Files Changed

Filename	Overview
transformer_engine/common/fused_router/utils.h	Core of the PR: adds radix_topk_and_mask (warp-level 4-bit radix selection, Phase 1 + Phase 2 gather), the topk_and_mask dispatch template, TopkFuncType enum, check_shared_memory_capacity_num_experts host helper, and a rewritten single-pass numerically-stable apply_softmax_on_float. Algorithm is correct; minor concern on per-launch device attribute queries.
transformer_engine/common/fused_router/fused_topk_with_score_function.cu	Forward and backward launchers updated: cudaFuncSetAttribute now wrapped with NVTE_CHECK_CUDA, Naive/Radix dispatch added at topk<16 boundary, check_shared_memory_capacity_num_experts called before launch. Looks correct.
transformer_engine/common/fused_router/fused_score_for_moe_aux_loss.cu	Same launcher pattern as the other .cu file: TopkFuncType template parameter added, Naive/Radix dispatch added, cudaFuncSetAttribute checked, backward launcher also updated with smem capacity check. Looks correct.
transformer_engine/common/utils.cuh	Adds float_to_ordered_uint, ordered_uint_to_float (unused, pre-existing concern), and warp_allreduce_sum — all correct standalone device utilities used by the new radix algorithm.
tests/pytorch/test_fused_router.py	Adds _get_tolerances helper with fp32 guard, topk=16,32 and num_experts=1024 to all parametrize sets, and pytest.skip guards for impossible (topk >= num_experts) configurations. Radix path is now exercised.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Kernel Launcher] --> B{topk lt 16?}
    B -->|Yes| C[TopkFuncType Naive]
    B -->|No| D[TopkFuncType Radix]
    C --> E[cudaFuncSetAttribute wrapped with NVTE_CHECK_CUDA]
    D --> E
    E --> F[check_shared_memory_capacity_num_experts]
    F --> G[Kernel Launch]
    G --> H[NVTE_CHECK_CUDA cudaGetLastError]
    subgraph radix_topk_and_mask
        I[Phase 1 - 8-pass radix selection] --> J[Phase 2a - Gather strictly greater elements]
        J --> K[Phase 2b - Fill ties in ascending index order]
    end
    subgraph apply_softmax_on_float
        L[Pass 1 - Online max and sum per lane] --> M[Warp butterfly reduction with NaN guard]
        M --> N[Pass 2 - Normalize in-place]
    end

denera

LGTM, pending rebase and clean CI

tdophung · 2026-04-14T21:03:35Z

/te-ci

tdophung · 2026-04-16T17:27:07Z

/te-ci

tdophung

approve pending CI. Seems like the issues seen in CI are not related to this PR

tdophung · 2026-04-16T22:09:45Z

retriggered CI and all passed: https://gitlab-master.nvidia.com/dl/transformerengine/transformerengine/-/pipelines/48710028

* fix: enabling fused _router to be able to handle large topk and number of experts
  - expanding shared memory when needed
  - switch to radix topk selection when topk is large
  - test_fused_router.py updated with large num experts and tolerances refined for different cases
* added topk>=16 in tests/pytorch/test_fused_router.py
  added return value check of cudaFuncSetAttribute in transformer_engine/common/fused_router/fused_topk_with_score_function.cu
  added dtype dependent eps in tests/pytorch/test_fused_router.py
  removed unneeded code in transformer_engine/common/fused_router/utils.h
* test_fused_router.py needs to skip topk >= num_experts case
  Signed-off-by: Harry Zhou <hhanyu@nvidia.com>
  cleaned up raw warp operations
  added comments
  added shared_memory check
  added return code check
* warning about dtype for tolerance in test_fused_router.py
  Signed-off-by: Harry Zhou <hhanyu@nvidia.com>
---------
Signed-off-by: Harry Zhou <hhanyu@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

harryzhou2000 marked this pull request as ready for review April 1, 2026 14:44

greptile-apps Bot reviewed Apr 1, 2026

Comment thread tests/pytorch/test_fused_router.py

Comment thread transformer_engine/common/fused_router/fused_topk_with_score_function.cu Outdated

Comment thread tests/pytorch/test_fused_router.py

Comment thread transformer_engine/common/fused_router/utils.h Outdated

harryzhou2000 marked this pull request as draft April 1, 2026 15:08

harryzhou2000 marked this pull request as ready for review April 1, 2026 15:14

harryzhou2000 changed the title ~~Fix fused router for large top-K and expert counts~~ [Common] Fix fused router for large top-K and expert counts Apr 2, 2026

tdophung reviewed Apr 2, 2026

Comment thread tests/pytorch/test_fused_router.py

tdophung reviewed Apr 2, 2026

harryzhou2000 marked this pull request as draft April 3, 2026 02:53

harryzhou2000 force-pushed the hhanyu/router_fix_p2 branch 2 times, most recently from ee33ea2 to fab73d1 Compare April 3, 2026 07:30

harryzhou2000 marked this pull request as ready for review April 3, 2026 07:31

harryzhou2000 force-pushed the hhanyu/router_fix_p2 branch from 09b6dfc to 14228cb Compare April 7, 2026 01:44

harryzhou2000 requested a review from tdophung April 7, 2026 01:46

ptrendx assigned denera Apr 7, 2026

denera self-requested a review April 7, 2026 21:29

harryzhou2000 force-pushed the hhanyu/router_fix_p2 branch from 14228cb to 08c2de9 Compare April 9, 2026 10:57

denera reviewed Apr 14, 2026

harryzhou2000 and others added 9 commits April 16, 2026 19:54

[pre-commit.ci] auto fixes from pre-commit.com hooks

cb14e1b

for more information, see https://pre-commit.ci Signed-off-by: Harry Zhou <hhanyu@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

10567a1

for more information, see https://pre-commit.ci Signed-off-by: Harry Zhou <hhanyu@nvidia.com>

test_fused_router.py needs to skip topk >= num_experts case

8a755ea

Signed-off-by: Harry Zhou <hhanyu@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

1cbe407

for more information, see https://pre-commit.ci Signed-off-by: Harry Zhou <hhanyu@nvidia.com>

applying suggesions

8f9e90b

cleaned up raw warp operations added comments added shared_memory check added return code check Signed-off-by: Harry Zhou <hhanyu@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

115eaaf

for more information, see https://pre-commit.ci Signed-off-by: Harry Zhou <hhanyu@nvidia.com>

warning about dtype for tolerance in test_fused_router.py

6c4886b

Signed-off-by: Harry Zhou <hhanyu@nvidia.com>

harryzhou2000 force-pushed the hhanyu/router_fix_p2 branch from 08c2de9 to 6c4886b Compare April 16, 2026 11:55

tdophung approved these changes Apr 16, 2026

tdophung merged commit 1e9e48c into NVIDIA:main Apr 16, 2026
28 of 33 checks passed

harryzhou2000 mentioned this pull request May 19, 2026

[Common] Optimize fused router forward/backward kernels #3012

Open
