@rionmonster rionmonster commented Jan 20, 2026

Purpose

Linked issue: close #2420

Per issue #2420, this pull request addresses a potential bug in IcebergLakeCommitter that can arise when simultaneous operations (e.g., new data, rewrites, deletions) occur in the same commit: the commit blocks, which can ultimately result in perpetual retries until a timeout.

More specifically, rewrite operations call validateFromSnapshot(snapshotId) to ensure the files being replaced haven't changed since the rewrite was planned. If we commit data/delete files first (via AppendFiles or RowDelta), the table's current snapshot advances. The subsequent rewrite validation then fails because it's checking against a now-stale snapshot ID, triggering Iceberg's retry loop indefinitely.

Brief change log

Simply put, these changes update the order of operations in IcebergLakeCommitter.commit() so that rewrite operations are performed before any new data file or delete file operations rather than after them.
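To make the ordering problem concrete, here is a minimal toy model of the two commit orders. It does not use the Iceberg API: `ToyTable`, `tryCommitRewrite`, `commitDataThenRewrite`, and `commitRewriteThenData` are hypothetical names, and the model assumes (as a simplification) that any snapshot committed after the rewrite was planned counts as a conflict. The retry count is bounded here only so the sketch terminates.

```java
/**
 * Toy model of the commit-ordering issue. This is NOT the Iceberg API:
 * ToyTable and both commit methods are hypothetical names for illustration.
 */
final class CommitOrderSketch {

    /** Stand-in for a table that only tracks its current snapshot id. */
    static final class ToyTable {
        private long currentSnapshotId = 1L;

        long currentSnapshotId() {
            return currentSnapshotId;
        }

        /** Models an append or row-delta commit: always succeeds and advances the snapshot. */
        void commitDataOrDeletes() {
            currentSnapshotId++;
        }

        /**
         * Models a rewrite validated from the snapshot it was planned against.
         * Simplification: any snapshot committed since planning counts as a conflict.
         */
        boolean tryCommitRewrite(long plannedSnapshotId) {
            if (plannedSnapshotId != currentSnapshotId) {
                return false; // stale planning snapshot: validation fails
            }
            currentSnapshotId++;
            return true;
        }
    }

    /** Retries the rewrite a bounded number of times; retrying never refreshes the stale id. */
    static boolean retryRewrite(ToyTable table, long plannedSnapshotId, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (table.tryCommitRewrite(plannedSnapshotId)) {
                return true;
            }
        }
        return false;
    }

    /** Previous order: data/delete files first, then the rewrite. */
    static boolean commitDataThenRewrite(ToyTable table, long plannedSnapshotId, int maxAttempts) {
        table.commitDataOrDeletes();
        return retryRewrite(table, plannedSnapshotId, maxAttempts);
    }

    /** New order: the rewrite first, then data/delete files. */
    static boolean commitRewriteThenData(ToyTable table, long plannedSnapshotId, int maxAttempts) {
        if (!retryRewrite(table, plannedSnapshotId, maxAttempts)) {
            return false;
        }
        table.commitDataOrDeletes();
        return true;
    }

    public static void main(String[] args) {
        ToyTable oldOrder = new ToyTable();
        long plannedOld = oldOrder.currentSnapshotId();
        System.out.println("data first:    " + commitDataThenRewrite(oldOrder, plannedOld, 3));

        ToyTable newOrder = new ToyTable();
        long plannedNew = newOrder.currentSnapshotId();
        System.out.println("rewrite first: " + commitRewriteThenData(newOrder, plannedNew, 3));
    }
}
```

In this model the data-first order exhausts its retries because the planned snapshot ID never becomes current again, while the rewrite-first order commits both steps on the first attempt.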

Tests

A test suite was introduced as part of these changes to cover the various combinations of operations that could occur during a commit cycle (e.g., only data files, data files with deletions, rewrites, etc.) and to ensure that all of them work as expected. Additionally, a separate test case was added to reproduce the original issue (a cycle containing data files, rewrites, and deletions), which was later updated to confirm the fix.

The following table lists the new tests that were added and the operation types (per commit cycle) covered by each:

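| Test | Data Files | Delete Files | Rewrite |
| --- | --- | --- | --- |
| testCommitSucceeds | Yes | - | - |
| testCommitWithDeleteFilesSucceeds | Yes | Yes | - |
| testRewriteOnlyCommitSucceeds | - | - | Yes |
| testRewriteWithDataFilesSucceeds | Yes | - | Yes |
| testRewriteWithDeleteFilesInSameCycleSucceeds | Yes | Yes | Yes |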

Addressing this bug also required adjusting IcebergRewriteITCase.testPkTableCompactionWithConflict, as the conflict no longer occurs. The test has been renamed to testPkTableCompactionWithDeletedFiles to align with the new behavior (i.e., confirming that compaction works as expected and deletion files are present).

  • As part of this change, the FlinkIcebergTieringTestBase.checkFileStatusInIcebergTable helper function now checks for the presence of a deletion file, as opposed to checking each individual file, since it appears that deletion status may be mixed across files (e.g., some may have it and others may not).
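The difference between the two styles of check can be sketched as follows. This is an illustration only, not the actual helper: `FileStatus`, `everyFileHasDelete`, and `anyFileHasDelete` are hypothetical names, and the sketch assumes the previous check required the deletion status on every file.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: contrasts an "every file" check with an "any file" check. */
final class DeleteFileCheckSketch {

    /** Hypothetical per-file status: whether a data file has an associated deletion file. */
    static final class FileStatus {
        final String path;
        final boolean hasDeleteFile;

        FileStatus(String path, boolean hasDeleteFile) {
            this.path = path;
            this.hasDeleteFile = hasDeleteFile;
        }
    }

    /** Per-file style: passes only if every listed file has a deletion file. */
    static boolean everyFileHasDelete(List<FileStatus> files) {
        return !files.isEmpty() && files.stream().allMatch(f -> f.hasDeleteFile);
    }

    /** Presence style: passes if at least one file has a deletion file. */
    static boolean anyFileHasDelete(List<FileStatus> files) {
        return files.stream().anyMatch(f -> f.hasDeleteFile);
    }

    public static void main(String[] args) {
        // Mixed status across files, as described above.
        List<FileStatus> mixed = Arrays.asList(
                new FileStatus("data-1.parquet", true),
                new FileStatus("data-2.parquet", false));
        System.out.println("every file: " + everyFileHasDelete(mixed));
        System.out.println("any file:   " + anyFileHasDelete(mixed));
    }
}
```

With mixed status, the per-file check fails while the presence check passes, which is why a presence check is the more robust assertion when some files carry deletions and others do not.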

Documentation

N/A

Reviewers

@luoyuxia / @wuchong

@rionmonster rionmonster changed the title from "Fluss 2420" to "[FLUSS-2420][lake/iceberg] Address IcebergLakeCommitter Blocking During Simultaneous Rewrite Operations" Jan 20, 2026
[lake/iceberg] Update IcebergLakeCommiter commit operation ordering to avoid blocking behavior
@rionmonster rionmonster marked this pull request as ready for review January 21, 2026 04:15
Development

Successfully merging this pull request may close these issues.

[lake/iceberg] IcebergLakeCommitter Can Block Indefinitely During Simultaneous Rewrite Operations