21 commits
17e5126
Expert Parallelism: common C API + NCCL EP v0.1 backend
phu0ngng May 22, 2026
cef4b33
Expert Parallelism: persistent ncclEpHandle cache with allow_handle_m…
phu0ngng May 23, 2026
0086be4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 23, 2026
2dc9fe7
Build: NCCL_HOME discovery supports Debian/Ubuntu multiarch lib paths
phu0ngng May 27, 2026
c93387c
bump NCCL
phu0ngng May 27, 2026
ead4344
Expert Parallelism: require token_dtype in NVTEEpGroupConfig and enfo…
phu0ngng May 28, 2026
eb83342
Expert Parallelism: document ep_comm lifetime, v0.1 single-GPU scope,…
phu0ngng May 28, 2026
20b32f4
Expert Parallelism: drop version label from initialize scope note
phu0ngng May 28, 2026
78e17dc
Common: add NVTE_CHECK_NCCL macro and use it in EP tests
phu0ngng May 28, 2026
11c9a10
Expert Parallelism tests: replace TensorHandle shim with TensorWrapper
phu0ngng May 28, 2026
799b3bc
Expert Parallelism tests: consolidate into single test_ep.cu with ess…
phu0ngng May 28, 2026
2b5cd81
tests: use test::CudaPtr in DevBuf, check full token rows, simplify b…
phu0ngng May 28, 2026
2873ac0
tests: reword EPTensors comment to not imply C-API churn
phu0ngng May 28, 2026
319937f
EP group config: rename token_dtype to max_token_dtype and allow narr…
phu0ngng May 28, 2026
4b39d0b
tests: parameterize FullForwardBackward over bf16, fp16, fp32
phu0ngng May 28, 2026
edf871d
cmake: drop NO_CMAKE_SYSTEM_PATH on TE_LIB lookup and order nvrtc aft…
phu0ngng May 28, 2026
c596afa
tests: use MPI for EP distributed tests (bootstrap, build, run script)
phu0ngng May 28, 2026
370e6a4
ep.h: add TODO note about struct versioning
phu0ngng May 28, 2026
dbd1ef5
tests/cpp_distributed: fix test_ep build (helper ordering, TensorWrap…
phu0ngng May 29, 2026
2856674
tests/cpp_distributed: skip FP16/FP32 FullForwardBackward at runtime …
phu0ngng May 29, 2026
1e74f99
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 29, 2026
4 changes: 4 additions & 0 deletions .gitmodules
Original file line number Diff line number Diff line change
@@ -7,3 +7,7 @@
[submodule "3rdparty/cutlass"]
path = 3rdparty/cutlass
url = https://github.com/NVIDIA/cutlass.git
[submodule "3rdparty/nccl"]
path = 3rdparty/nccl
url = https://github.com/NVIDIA/nccl.git
branch = v2.30u1
1 change: 1 addition & 0 deletions 3rdparty/nccl
Submodule nccl added at 146496
1 change: 1 addition & 0 deletions memory/MEMORY.md
@@ -0,0 +1 @@
- [Commit message: no TE subsystem prefix](feedback_commit_message_no_te_subsystem_prefix.md) — don't prefix commit subjects with "Expert Parallelism:" or "EP:" in this repo
12 changes: 12 additions & 0 deletions memory/feedback_commit_message_no_te_subsystem_prefix.md
@@ -0,0 +1,12 @@
---
name: feedback-commit-message-no-te-subsystem-prefix
description: For this TE repo, commit messages should not prefix the subject with "Expert Parallelism:" or "EP:" — go straight to the change itself.
metadata:
type: feedback
---

When writing commit messages in this Transformer Engine repo, do not prefix the subject with subsystem labels like "Expert Parallelism:" or "EP:".

**Why:** User feedback during the EP reviewer-feedback session ("in your commit message, don't need to mention 'expert parallelism or EP'"). Subject lines should describe the change itself.

**How to apply:** Lead with the actual action (e.g. "require token_dtype in NVTEEpGroupConfig and enforce at dispatch", "consolidate EP tests into single test_ep.cu"). Path/file context implicitly identifies the subsystem.
3 changes: 3 additions & 0 deletions qa/L1_cpp_distributed/test.sh
@@ -14,4 +14,7 @@ if [[ $(nvidia-smi --list-gpus | wc -l) -ge 4 ]]; then
cmake -GNinja -S. -Bbuild
cmake --build build
mpirun --allow-run-as-root --np 4 --oversubscribe ./build/test_comm_gemm

# EP suites; runner self-skips on pre-Hopper GPUs.
bash ./run_test_ep.sh 4 ./build
fi
131 changes: 131 additions & 0 deletions setup.py
@@ -83,6 +83,34 @@ def setup_common_extension() -> CMakeExtension:
cusolvermp_dir = os.getenv("CUSOLVERMP_HOME", "/usr")
cmake_flags.append(f"-DCUSOLVERMP_DIR={cusolvermp_dir}")

    # NCCL EP: on by default; auto-disabled if no arch >= 90.
    # Set NVTE_BUILD_WITH_NCCL_EP=0/1 to force off/on.
    nccl_ep_env = os.getenv("NVTE_BUILD_WITH_NCCL_EP")
    explicit_nccl_ep = nccl_ep_env is not None
    build_with_nccl_ep = bool(int(nccl_ep_env)) if explicit_nccl_ep else True

    if build_with_nccl_ep:
        arch_tokens = [a.strip() for a in str(archs or "").split(";") if a.strip()]
        has_hopper_or_newer = any(t.lower() == "native" for t in arch_tokens) or any(
            int(t.rstrip("af")) >= 90 for t in arch_tokens if t.rstrip("af").isdigit()
        )
        if not has_hopper_or_newer:
            if explicit_nccl_ep:
                raise RuntimeError(
                    "NVTE_BUILD_WITH_NCCL_EP=1 requires at least one CUDA arch >= 90 in "
                    f"NVTE_CUDA_ARCHS (got '{archs}'). Add '90' or unset NVTE_BUILD_WITH_NCCL_EP."
                )
            print(
                "[NCCL EP] No CUDA arch >= 90 in NVTE_CUDA_ARCHS"
                f" ('{archs}'); auto-disabling NCCL EP (nvte_ep_* will throw at runtime)."
            )
            build_with_nccl_ep = False

    if build_with_nccl_ep:
        build_nccl_ep_submodule()
    else:
        cmake_flags.append("-DNVTE_WITH_NCCL_EP=OFF")

# Add custom CMake arguments from environment variable
nvte_cmake_extra_args = os.getenv("NVTE_CMAKE_EXTRA_ARGS")
if nvte_cmake_extra_args:
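
Review note: the arch gate in this hunk is easier to unit-test if it is factored into a pure function. The following is a minimal sketch, assuming the arch string keeps the `;`-separated format of `NVTE_CUDA_ARCHS`. The helper name `has_hopper_or_newer` is hypothetical and does not exist in `setup.py`.

```python
from typing import Optional


def has_hopper_or_newer(archs: Optional[str]) -> bool:
    """Return True if any token is 'native' or a numeric arch >= 90.

    Hypothetical helper that mirrors the check in setup_common_extension().
    Trailing 'a'/'f' suffixes (e.g. '90a', '100f') are stripped before the
    numeric comparison; non-numeric tokens other than 'native' are ignored.
    """
    tokens = [t.strip() for t in str(archs or "").split(";") if t.strip()]
    for token in tokens:
        if token.lower() == "native":
            return True
        digits = token.rstrip("af")
        if digits.isdigit() and int(digits) >= 90:
            return True
    return False


print(has_hopper_or_newer("80;90a"))  # True
print(has_hopper_or_newer("70;80"))   # False
```

With a helper of this shape, the explicit-versus-default branches (raise when `NVTE_BUILD_WITH_NCCL_EP=1` is set, print and disable otherwise) could be tested without touching the environment.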
@@ -128,6 +156,109 @@ def setup_requirements() -> Tuple[List[str], List[str]]:
return [remove_dups(reqs) for reqs in [install_reqs, test_reqs]]


def _discover_nccl_home() -> str:
    """Resolve NCCL_HOME: honor env var, else probe well-known prefixes, else ldconfig."""
    env_home = os.environ.get("NCCL_HOME")
    if env_home:
        if (Path(env_home) / "include" / "nccl.h").exists():
            return env_home
        print(
            f"[NCCL EP] WARNING: NCCL_HOME='{env_home}' is set but "
            f"'{env_home}/include/nccl.h' was not found; falling back to system probes."
        )

    lib_names = ("libnccl.so", "libnccl.so.2")
    # Include Debian/Ubuntu multiarch subdirs (e.g. lib/aarch64-linux-gnu).
    lib_subdirs = ("lib", "lib64", "lib/aarch64-linux-gnu", "lib/x86_64-linux-gnu")
    for cand in ("/opt/nvidia/nccl", "/usr/local/nccl", "/usr"):
        p = Path(cand)
        if (p / "include" / "nccl.h").exists() and any(
            (p / sub / name).exists() for sub in lib_subdirs for name in lib_names
        ):
            return str(p)

    try:
        out = subprocess.check_output(["ldconfig", "-p"], stderr=subprocess.DEVNULL).decode()
        for line in out.splitlines():
            if "libnccl.so" in line and "=>" in line:
                lib_path = Path(line.split("=>")[-1].strip())
                # Walk upward so multiarch layouts (.../lib/<triplet>/libnccl.so)
                # resolve to the prefix that contains include/nccl.h.
                for root in (lib_path.parent.parent, lib_path.parent.parent.parent):
                    if (root / "include" / "nccl.h").exists():
                        return str(root)
    except (subprocess.CalledProcessError, FileNotFoundError):
        pass

    raise RuntimeError(
        "Could not locate NCCL core (nccl.h + libnccl.so). Set NCCL_HOME to the install prefix."
    )
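
Review note: the `ldconfig -p` fallback is the hardest branch to exercise on a build machine, so here is a sketch that separates the parsing from the filesystem probe. It assumes `ldconfig -p` lines have the usual `name (flags) => /path/to/lib` shape. The function name `prefix_from_ldconfig` and the injected `has_header` predicate are hypothetical and exist only to make the logic testable.

```python
from pathlib import PurePosixPath
from typing import Callable, Iterable, Optional


def prefix_from_ldconfig(
    lines: Iterable[str],
    has_header: Callable[[PurePosixPath], bool],
) -> Optional[str]:
    """Return the first install prefix for which has_header(prefix) is true.

    `lines` is `ldconfig -p` output split into lines. `has_header` stands in
    for the `(root / "include" / "nccl.h").exists()` probe (an assumption
    made here so the sketch runs without a real NCCL install).
    """
    for line in lines:
        if "libnccl.so" not in line or "=>" not in line:
            continue
        lib_path = PurePosixPath(line.split("=>")[-1].strip())
        # Two levels up covers <prefix>/lib/libnccl.so; three levels up
        # covers multiarch layouts such as <prefix>/lib/<triplet>/libnccl.so.
        for root in (lib_path.parent.parent, lib_path.parent.parent.parent):
            if has_header(root):
                return str(root)
    return None


sample = ["\tlibnccl.so.2 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnccl.so.2"]
print(prefix_from_ldconfig(sample, lambda root: root == PurePosixPath("/usr")))
```

For the multiarch sample line, the two-level candidate is `/usr/lib` and the three-level candidate is `/usr`, which is why both are probed in order.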


def build_nccl_ep_submodule() -> str:
    """Build libnccl_ep.so from the 3rdparty/nccl submodule.

    NCCL EP is on by default; the system NCCL core (libnccl.so) supplies the
    headers and runtime symbols. Returns the submodule build directory.
    """
    nccl_root = current_file_path / "3rdparty" / "nccl"
    if not (nccl_root / "Makefile").exists():
        raise RuntimeError(
            f"NCCL submodule not found at {nccl_root}. "
            "Run `git submodule update --init --recursive`."
        )

    build_dir = nccl_root / "build"
    nccl_ep_lib = build_dir / "lib" / "libnccl_ep.so"

    archs = cuda_archs() or "90"
    arch_list = []
    for a in str(archs).split(";"):
        a = a.strip().rstrip("af")
        if a and a.isdigit() and int(a) >= 90:
            arch_list.append(a)
    if not arch_list:
        arch_list = ["90"]
    gencode = " ".join(f"-gencode=arch=compute_{a},code=sm_{a}" for a in arch_list)

    nproc = os.cpu_count() or 8
    env = os.environ.copy()
    env["NVCC_GENCODE"] = gencode
    # NCCL EP needs the core NCCL headers + libnccl.so; write NCCL EP build
    # outputs to the submodule's local build/ tree.
    nccl_home = _discover_nccl_home()
    env["NCCL_HOME"] = nccl_home
    env["NCCL_EP_BUILDDIR"] = str(build_dir)

    if not nccl_ep_lib.exists():
        print(f"[NCCL EP] Building libnccl_ep.so (gencode='{gencode}')")
        subprocess.check_call(
            ["make", "-j", str(nproc), "-C", "contrib/nccl_ep", "lib"],
            cwd=str(nccl_root),
            env=env,
        )

    # TE's CMake expects nccl.h under 3rdparty/nccl/build/include/ for its
    # version check. Mirror the top-level host headers from the system NCCL
    # install, but DON'T mirror nccl_device/ because the submodule ships its own
    # newer copy at src/include/nccl_device/ with device-side templates that
    # conflict with older system versions, and the JIT include path picks the
    # submodule's.
    nccl_include = build_dir / "include"
    nccl_include.mkdir(parents=True, exist_ok=True)
    for cand in (Path(nccl_home) / "include", Path("/usr/include")):
        p = Path(cand)
        if (p / "nccl.h").exists():
            for name in ("nccl.h", "nccl_net.h", "nccl_tuner.h"):
                src = p / name
                dst = nccl_include / name
                if src.exists() and not dst.exists():
                    dst.symlink_to(src)
            break

    return str(build_dir)


def git_check_submodules() -> None:
"""
Attempt to checkout git submodules automatically during setup.
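
Review note: the `NVCC_GENCODE` construction in `build_nccl_ep_submodule()` can be checked in isolation. This is a minimal sketch of the same filtering rule. The helper name `nccl_ep_gencode` and its `default_arch` parameter are hypothetical, introduced here only for illustration.

```python
from typing import List


def nccl_ep_gencode(archs: str, default_arch: str = "90") -> str:
    """Build an NVCC_GENCODE value from a ';'-separated arch list.

    Hypothetical helper mirroring build_nccl_ep_submodule(): keeps numeric
    archs >= 90 (after stripping trailing 'a'/'f' suffixes) and falls back
    to `default_arch` when no token qualifies.
    """
    selected: List[str] = []
    for token in str(archs).split(";"):
        arch = token.strip().rstrip("af")
        if arch.isdigit() and int(arch) >= 90:
            selected.append(arch)
    if not selected:
        selected = [default_arch]
    return " ".join(f"-gencode=arch=compute_{a},code=sm_{a}" for a in selected)


print(nccl_ep_gencode("80;90a;100"))
# -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_100,code=sm_100
print(nccl_ep_gencode("native"))
# -gencode=arch=compute_90,code=sm_90
```

Two behaviors worth confirming with the author: a `native` token passes the gate in `setup_common_extension()` but produces only the `sm_90` fallback here, and the suffix stripping turns `90a` into plain `sm_90`.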
74 changes: 73 additions & 1 deletion tests/cpp_distributed/CMakeLists.txt
@@ -46,12 +46,84 @@ add_executable(test_comm_gemm

find_package(OpenMP REQUIRED)
find_package(MPI REQUIRED)

# ── NCCL library ──────────────────────────────────────────────────────────────
# Search order: NCCL_HOME env → 3rdparty/nccl submodule build → system paths.
set(NCCL_SUBMODULE_BUILD "${CMAKE_CURRENT_SOURCE_DIR}/../../3rdparty/nccl/build")
find_library(NCCL_LIB
  NAMES nccl libnccl
- PATH_SUFFIXES lib
+ HINTS $ENV{NCCL_HOME}/lib ${NCCL_SUBMODULE_BUILD}/lib
+ PATH_SUFFIXES lib lib64
  REQUIRED)

# NCCL headers: prefer submodule build output (has the handle_init API),
# then submodule src, then $NCCL_HOME/include. If none applies, NCCL_INCLUDE_DIR
# stays unset and nccl.h must come from the default include paths.
set(NCCL_SUBMODULE_INCLUDE "${CMAKE_CURRENT_SOURCE_DIR}/../../3rdparty/nccl/build/include")
set(NCCL_SUBMODULE_SRC_INCLUDE "${CMAKE_CURRENT_SOURCE_DIR}/../../3rdparty/nccl/src/include")
if(EXISTS "${NCCL_SUBMODULE_INCLUDE}/nccl.h")
set(NCCL_INCLUDE_DIR "${NCCL_SUBMODULE_INCLUDE}")
elseif(EXISTS "${NCCL_SUBMODULE_SRC_INCLUDE}/nccl.h")
set(NCCL_INCLUDE_DIR "${NCCL_SUBMODULE_SRC_INCLUDE}")
elseif(DEFINED ENV{NCCL_HOME})
set(NCCL_INCLUDE_DIR "$ENV{NCCL_HOME}/include")
endif()
target_include_directories(test_comm_gemm PRIVATE ${MPI_CXX_INCLUDE_PATH} $ENV{CUBLASMP_HOME}/include)
target_link_libraries(test_comm_gemm PUBLIC CUDA::cuda_driver CUDA::cudart GTest::gtest ${TE_LIB} CUDA::nvrtc MPI::MPI_CXX ${NCCL_LIB} OpenMP::OpenMP_CXX)

include(GoogleTest)
gtest_discover_tests(test_comm_gemm DISCOVERY_TIMEOUT 600)

# ── EP distributed tests ──────────────────────────────────────────────────────
# Launched via mpirun; ncclUniqueId exchange uses MPI_Bcast (see test_ep_common.h).
# Headers + libs come from the in-tree 3rdparty/nccl submodule build.
set(NCCL_EP_SUBMODULE_ROOT
"${CMAKE_CURRENT_SOURCE_DIR}/../../3rdparty/nccl")
find_library(NCCL_EP_LIB
NAMES nccl_ep libnccl_ep
HINTS ${NCCL_EP_SUBMODULE_ROOT}/build/lib
NO_DEFAULT_PATH
REQUIRED)

set(NCCL_EP_INCLUDE_DIR "${NCCL_EP_SUBMODULE_ROOT}/contrib/nccl_ep/include")
if(NOT EXISTS "${NCCL_EP_INCLUDE_DIR}/nccl_ep.h")
message(FATAL_ERROR
"NCCL EP header not found at ${NCCL_EP_INCLUDE_DIR}/nccl_ep.h. "
"Run `git submodule update --init --recursive` to checkout 3rdparty/nccl.")
endif()
message(STATUS "EP test: NCCL EP headers: ${NCCL_EP_INCLUDE_DIR}")

# Collect NCCL include dirs shared by all EP test targets (nccl_ep.h + nccl.h).
set(EP_TEST_NCCL_INCLUDES ${NCCL_EP_INCLUDE_DIR})
if(DEFINED NCCL_INCLUDE_DIR)
list(APPEND EP_TEST_NCCL_INCLUDES ${NCCL_INCLUDE_DIR})
message(STATUS "EP test: NCCL headers: ${NCCL_INCLUDE_DIR}")
endif()

set(EP_TEST_COMMON_INCLUDES
${EP_TEST_NCCL_INCLUDES}
${MPI_CXX_INCLUDE_PATH}
../../transformer_engine/common/include
../../transformer_engine/common
${CMAKE_CURRENT_SOURCE_DIR})

# nvrtc must follow TE_LIB so symbols referenced from libtransformer_engine.so
# (loaded via dlopen in Python; not in its DT_NEEDED) resolve through nvrtc.
set(EP_TEST_COMMON_LIBS
CUDA::cuda_driver
CUDA::cudart
GTest::gtest
${TE_LIB}
CUDA::nvrtc
${NCCL_LIB}
${NCCL_EP_LIB}
MPI::MPI_CXX
OpenMP::OpenMP_CXX)

# ── EP distributed tests (per-op + full pipeline + zero-copy symm) ───────────
add_executable(test_ep test_ep.cu ../cpp/test_common.cu)
target_include_directories(test_ep PRIVATE ${EP_TEST_COMMON_INCLUDES})
target_link_libraries(test_ep PUBLIC ${EP_TEST_COMMON_LIBS})

# Do NOT use gtest_discover_tests — these binaries require multi-process
# launch via run_test_ep.sh, not direct single-process execution.
message(STATUS "EP distributed tests enabled: ${NCCL_EP_LIB}")
54 changes: 54 additions & 0 deletions tests/cpp_distributed/run_test_ep.sh
@@ -0,0 +1,54 @@
#!/usr/bin/env bash
# Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# See LICENSE for license information.
#
# Run TE EP distributed unit tests via mpirun. Each MPI rank pins to one GPU
# (rank % device_count) and exchanges ncclUniqueId through MPI_Bcast.
#
# Usage:
# bash run_test_ep.sh [num_gpus] [build_dir]
#
# Defaults:
# num_gpus = number of GPUs visible to nvidia-smi
# build_dir = <script_dir>/build
#
# Environment variables:
# GTEST_FILTER — forwarded to all processes (e.g., "EPPipelineTest.*")
# MPIRUN — override the mpirun binary (default: mpirun)
# MPIRUN_EXTRA — extra flags forwarded to mpirun

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
BUILD_DIR="${2:-${SCRIPT_DIR}/build}"
NUM_GPUS="${1:-$(nvidia-smi -L 2>/dev/null | wc -l)}"
MPIRUN="${MPIRUN:-mpirun}"

# Skip cleanly on pre-Hopper: NCCL EP requires SM>=90.
MIN_SM=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null \
  | awk -F. 'NR==1 || ($1*10+$2)<min { min=$1*10+$2 } END { print min+0 }')
if (( MIN_SM > 0 && MIN_SM < 90 )); then
  echo "NCCL EP requires SM>=90 (lowest visible GPU is SM${MIN_SM}); SKIPPING."
  exit 0
fi

TEST_BIN="${BUILD_DIR}/test_ep"
if [[ ! -x "${TEST_BIN}" ]]; then
  echo "ERROR: binary not found: ${TEST_BIN}"
  echo "Build: cd ${SCRIPT_DIR} && mkdir -p build && cd build && cmake .. && make"
  exit 1
fi

if (( NUM_GPUS < 2 )); then
  echo "EP Tests: requires at least 2 GPUs, found ${NUM_GPUS}. Skipping."
  exit 0
fi

GTEST_ARGS="${GTEST_FILTER:+--gtest_filter=${GTEST_FILTER}}"

echo "=== EP Tests ==="
echo " GPUs: ${NUM_GPUS} Binary: ${TEST_BIN}"
echo

"${MPIRUN}" -n "${NUM_GPUS}" ${MPIRUN_EXTRA:-} "${TEST_BIN}" ${GTEST_ARGS}
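
Review note: the pre-Hopper skip in `run_test_ep.sh` depends on a dense awk one-liner. The sketch below restates that rule in Python so the edge cases are easier to read. It assumes the input is the `compute_cap` column as printed by `nvidia-smi` (one `major.minor` value per line). The names `min_sm` and `should_skip` are hypothetical.

```python
from typing import Iterable, List


def min_sm(compute_caps: Iterable[str]) -> int:
    """Return the lowest compute capability as major*10 + minor.

    Follows the awk expression in run_test_ep.sh, with one deliberate
    difference: blank lines are skipped here, whereas awk would count them
    as 0. Returns 0 when there is no usable input, like `print min+0`.
    """
    values: List[int] = []
    for cap in compute_caps:
        cap = cap.strip()
        if not cap:
            continue
        major, _, minor = cap.partition(".")
        values.append(int(major) * 10 + int(minor or 0))
    return min(values, default=0)


def should_skip(compute_caps: Iterable[str]) -> bool:
    """True when the lowest visible GPU is pre-Hopper (0 < min SM < 90)."""
    lowest = min_sm(compute_caps)
    return 0 < lowest < 90


print(min_sm(["9.0", "8.6"]))       # 86
print(should_skip(["9.0", "8.6"]))  # True
print(should_skip([]))              # False
```

The empty-input case matters: when `nvidia-smi` prints nothing, the minimum is 0 and the script does not skip on the SM check, leaving the later GPU-count check to decide.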