Skip to content

[FLUSS-3483][coordinator] Reject lake snapshot commits for dropped tables#3494

Open
Jackeyzhe wants to merge 4 commits into
apache:mainfrom
Jackeyzhe:fluss-3483
Open

[FLUSS-3483][coordinator] Reject lake snapshot commits for dropped tables#3494
Jackeyzhe wants to merge 4 commits into
apache:mainfrom
Jackeyzhe:fluss-3483

Conversation

@Jackeyzhe

Copy link
Copy Markdown
Contributor

Purpose

Linked issue: close #3483

This PR fixes a race where a stale V2 lake snapshot commit can be accepted after a table is dropped and recreated. In that case, the recreated table may get a /laketable znode pointing to an offsets metadata file generated for the old table id, causing lake tiering to repeatedly fail when reading the stale offsets path.

Brief change log

  • Reject V2 lake snapshot commits when the target table id no longer exists or is queued for deletion.
  • Validate the V2 lake snapshot offsets metadata before writing it to ZK:
    • the offsets file must be readable;
    • the file-level tableId must match the commit table id;
    • every TableBucket in the offsets file must also belong to the commit table id.
  • Validate both tiered offsets and readable offsets metadata paths when they differ.
  • Add regression coverage for:
    • non-existent table id commits;
    • table queued for deletion;
    • recreated table id with stale offsets metadata from the old table id;
    • RPC-level V2 commit after table drop.

Tests

mvn -pl fluss-server -DskipITs -Dcheckstyle.skip -Drat.skip -Dspotless.check.skip=true -Dtest=CoordinatorEventProcessorTest#testCommitLakeTableSnapshotV2RejectsMismatchedOffsetsFileTableId+testCommitLakeTableSnapshotV2RejectsNonExistentTable+testCommitLakeTableSnapshotV2RejectsQueuedForDeletionTable test
mvn -pl fluss-server -DskipITs -Dcheckstyle.skip -Drat.skip -Dspotless.check.skip=true -Dtest=CommitLakeTableSnapshotITCase#testCommitLakeTableSnapshotV2RejectedAfterDropTable test

API and Format

no

Documentation

no

@Jackeyzhe Jackeyzhe changed the title [FLUSS-3483][coordinator] Validate lake snapshot offsets before commit [FLUSS-3483][coordinator] Reject lake snapshot commits for dropped tables Jun 24, 2026
@luoyuxia luoyuxia requested a review from Copilot June 24, 2026 02:03

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a coordinator race where stale V2 lake snapshot commits could be accepted after a table drop/recreate, potentially writing an offsets-metadata pointer for an old tableId and permanently breaking lake tiering for the recreated table.

Changes:

  • Add coordinator-side rejection for V2 lake snapshot commits targeting a tableId that no longer exists or is queued for deletion.
  • Add regression tests (unit + integration) covering non-existent/queued-for-deletion tableIds and an RPC-level V2 commit after a drop.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
fluss-server/src/main/java/org/apache/fluss/server/coordinator/CoordinatorEventProcessor.java Adds a guard to reject V2 snapshot commits for dropped / queued-for-deletion tables.
fluss-server/src/test/java/org/apache/fluss/server/coordinator/CoordinatorEventProcessorTest.java Adds unit tests verifying V2 commit rejection for non-existent and queued-for-deletion tables.
fluss-server/src/test/java/org/apache/fluss/server/replica/CommitLakeTableSnapshotITCase.java Adds an IT case verifying V2 snapshot commit is rejected after dropping a table.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 2113 to 2117
throw new FlussRuntimeException(
"Lake snapshot metadata is null for table " + tableId);
}
ensureTableCanAcceptLakeSnapshot(tableId);
lakeTableHelper.registerLakeTableSnapshotV2(

@luoyuxia luoyuxia left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jackeyzhe Thanks for the pr. LGTM overall. Left minor comments. PTAL

}

@Test
void testCommitLakeTableSnapshotV2RejectsNonExistentTable() throws Exception {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think we can remove the ut since testCommitLakeTableSnapshotV2RejectedAfterDropTable alreay cover it. I want to keep this test class short if possible

});
}

private void ensureTableCanAcceptLakeSnapshot(long tableId) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: rename to ensureTableNotDeleted? I can image maybe other event maywell need this method.

@luoyuxia

luoyuxia commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

@Jackeyzhe Thanks for working on this.

I wonder if we can make this fix simpler and close the race at the ZK write boundary instead. The current approach adds coordinator-side lifecycle checks before submitting the actual ZK write to ioExecutor, but that check can become stale before registerLakeTableSnapshotV2 runs. For example, the commit event may pass the context check while the table still exists, then dropTable can delete /tabletservers/tables/<tableId>, and the later IO task can still recreate /tabletservers/tables/<tableId>/laketable because upsertLakeTable uses creatingParentsIfNeeded().

Would it be enough to change the create branch in ZooKeeperClient.upsertLakeTable to avoid recreating the table parent?

// Do not recreate /tabletservers/tables/<tableId> for a late commit.
zkClient.create().forPath(zkPath, zkData);

Then a late commit after deleteTableAssignment(tableId) would fail with NoNodeException instead of resurrecting the old tableId znode. We can convert that to TableNotExistException for the commit response. The coordinator-side queued/deleted check can still be kept as an early rejection, but the ZK write itself should be what prevents the resurrection race.

@Jackeyzhe

Copy link
Copy Markdown
Contributor Author

@Jackeyzhe Thanks for working on this.

I wonder if we can make this fix simpler and close the race at the ZK write boundary instead. The current approach adds coordinator-side lifecycle checks before submitting the actual ZK write to ioExecutor, but that check can become stale before registerLakeTableSnapshotV2 runs. For example, the commit event may pass the context check while the table still exists, then dropTable can delete /tabletservers/tables/<tableId>, and the later IO task can still recreate /tabletservers/tables/<tableId>/laketable because upsertLakeTable uses creatingParentsIfNeeded().

Would it be enough to change the create branch in ZooKeeperClient.upsertLakeTable to avoid recreating the table parent?

// Do not recreate /tabletservers/tables/<tableId> for a late commit.
zkClient.create().forPath(zkPath, zkData);

Then a late commit after deleteTableAssignment(tableId) would fail with NoNodeException instead of resurrecting the old tableId znode. We can convert that to TableNotExistException for the commit response. The coordinator-side queued/deleted check can still be kept as an early rejection, but the ZK write itself should be what prevents the resurrection race.

Thanks for pointing this out. I agree that the coordinator-side check is only an early rejection and does not fully close the race before the actual ZK write.

I'll update the fix to make the ZK write boundary authoritative: for the create branch in ZooKeeperClient.upsertLakeTable, avoid creating missing parents and use create().forPath(...) directly. Then a late commit after the table assignment znode has been deleted will fail with NoNodeException instead of recreating /tabletservers/tables/. I'll convert that failure to TableNotExistException for the commit response.

I'll keep the coordinator-side queued/deleted check as an early rejection, but rely on the ZK create semantics to prevent resurrection.

@luoyuxia

luoyuxia commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

@Jackeyzhe Thanks for working on this.
I wonder if we can make this fix simpler and close the race at the ZK write boundary instead. The current approach adds coordinator-side lifecycle checks before submitting the actual ZK write to ioExecutor, but that check can become stale before registerLakeTableSnapshotV2 runs. For example, the commit event may pass the context check while the table still exists, then dropTable can delete /tabletservers/tables/<tableId>, and the later IO task can still recreate /tabletservers/tables/<tableId>/laketable because upsertLakeTable uses creatingParentsIfNeeded().
Would it be enough to change the create branch in ZooKeeperClient.upsertLakeTable to avoid recreating the table parent?

// Do not recreate /tabletservers/tables/<tableId> for a late commit.
zkClient.create().forPath(zkPath, zkData);

Then a late commit after deleteTableAssignment(tableId) would fail with NoNodeException instead of resurrecting the old tableId znode. We can convert that to TableNotExistException for the commit response. The coordinator-side queued/deleted check can still be kept as an early rejection, but the ZK write itself should be what prevents the resurrection race.

Thanks for pointing this out. I agree that the coordinator-side check is only an early rejection and does not fully close the race before the actual ZK write.

I'll update the fix to make the ZK write boundary authoritative: for the create branch in ZooKeeperClient.upsertLakeTable, avoid creating missing parents and use create().forPath(...) directly. Then a late commit after the table assignment znode has been deleted will fail with NoNodeException instead of recreating /tabletservers/tables/. I'll convert that failure to TableNotExistException for the commit response.

I'll keep the coordinator-side queued/deleted check as an early rejection, but rely on the ZK create semantics to prevent resurrection.

Can we remove the coordinator-side queued/deleted check to keep code simple?

@Jackeyzhe

Copy link
Copy Markdown
Contributor Author

@Jackeyzhe Thanks for working on this.
I wonder if we can make this fix simpler and close the race at the ZK write boundary instead. The current approach adds coordinator-side lifecycle checks before submitting the actual ZK write to ioExecutor, but that check can become stale before registerLakeTableSnapshotV2 runs. For example, the commit event may pass the context check while the table still exists, then dropTable can delete /tabletservers/tables/<tableId>, and the later IO task can still recreate /tabletservers/tables/<tableId>/laketable because upsertLakeTable uses creatingParentsIfNeeded().
Would it be enough to change the create branch in ZooKeeperClient.upsertLakeTable to avoid recreating the table parent?

// Do not recreate /tabletservers/tables/<tableId> for a late commit.
zkClient.create().forPath(zkPath, zkData);

Then a late commit after deleteTableAssignment(tableId) would fail with NoNodeException instead of resurrecting the old tableId znode. We can convert that to TableNotExistException for the commit response. The coordinator-side queued/deleted check can still be kept as an early rejection, but the ZK write itself should be what prevents the resurrection race.

Thanks for pointing this out. I agree that the coordinator-side check is only an early rejection and does not fully close the race before the actual ZK write.
I'll update the fix to make the ZK write boundary authoritative: for the create branch in ZooKeeperClient.upsertLakeTable, avoid creating missing parents and use create().forPath(...) directly. Then a late commit after the table assignment znode has been deleted will fail with NoNodeException instead of recreating /tabletservers/tables/. I'll convert that failure to TableNotExistException for the commit response.
I'll keep the coordinator-side queued/deleted check as an early rejection, but rely on the ZK create semantics to prevent resurrection.

Can we remove the coordinator-side queued/deleted check to keep code simple?

sure, I'll remove it

…ng dropped tables

Scope lake snapshot commit fix to dropped table race
@Jackeyzhe Jackeyzhe force-pushed the fluss-3483 branch 2 times, most recently from 17eb7f6 to 45acb33 Compare July 4, 2026 15:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

commitLakeTableSnapshot writes stale snapshot path after dropTable, leaving the tiering pipeline permanently stuck

3 participants