Skip to content

add dynamic fa3#1334

Merged
hiworldwzj merged 1 commit into
mtp_optimizationfrom
dynamic-mtp-fa3
Jun 9, 2026
Merged

add dynamic fa3#1334
hiworldwzj merged 1 commit into
mtp_optimizationfrom
dynamic-mtp-fa3

Conversation

@shihaobai

Copy link
Copy Markdown
Collaborator

No description provided.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for dynamic Multi-Token Prediction (MTP) verification within the FlashAttention-3 (fa3) decode state, including the implementation of specialized Triton kernels for parameter building and compaction, along with corresponding unit tests and test scripts. Additionally, it refactors the cache tensor manager's recycling logic using a use-count bias. The code review highlights two important issues: first, a potential data corruption bug in cache_tensor_manager.py where a buffer node could be appended to the free list multiple times, which can be resolved by checking for prior existence in the list; second, a portability issue in the test script dynamic_fa3.sh due to hardcoded GPU indices, which should instead default to standard indices with environment overrides.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +103 to 105
free_use_count = t_buf_node.free_use_count_bias + 1 + len(t_buf_node.shape_to_tensor)
if self.use_count(ptr) <= free_use_count:
self.free_shape_dtype_to_bufs[t_buf_node.shape_key].append(t_buf_node)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If a temporary view of a cached tensor is deleted after the parent tensor has already been returned to the free list, it will trigger custom_del and add the pointer to changed_ptr again. Since the use count is still <= free_use_count, the same BufNode will be appended to free_shape_dtype_to_bufs multiple times. This leads to duplicate references in the free list, causing severe data corruption when the same buffer is allocated to multiple active tensors simultaneously. Adding a check to ensure the node is not already in the free list prevents this duplicate appending.

Suggested change
free_use_count = t_buf_node.free_use_count_bias + 1 + len(t_buf_node.shape_to_tensor)
if self.use_count(ptr) <= free_use_count:
self.free_shape_dtype_to_bufs[t_buf_node.shape_key].append(t_buf_node)
free_use_count = t_buf_node.free_use_count_bias + 1 + len(t_buf_node.shape_to_tensor)
if self.use_count(ptr) <= free_use_count and t_buf_node not in self.free_shape_dtype_to_bufs[t_buf_node.shape_key]:
self.free_shape_dtype_to_bufs[t_buf_node.shape_key].append(t_buf_node)

MAX_TOTAL_TOKEN_NUM=""
MAX_REQ_TOTAL_LEN=""
BATCH_MAX_TOKENS=""
export CUDA_VISIBLE_DEVICES=4,5

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Hardcoding specific GPU indices like 4,5 makes the script non-portable and prone to failing on environments with fewer GPUs or different GPU allocations. It is better to default to standard indices (e.g., 0,1) while allowing overrides from the environment.

Suggested change
export CUDA_VISIBLE_DEVICES=4,5
export CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,1}

@hiworldwzj hiworldwzj merged commit 9229d28 into mtp_optimization Jun 9, 2026
@hiworldwzj hiworldwzj deleted the dynamic-mtp-fa3 branch June 9, 2026 06:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants