[FLUSS-2420][lake/iceberg] Address IcebergLakeCommitter Blocking During Simultaneous Rewrite Operations #2421
+441
−40
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Purpose
Linked issue: close #2420
Per Issue #2420, this pull request addresses a potential bug within the
IcebergLakeCommitterthat can arise during simultaneous operations (e.g., new data, rewrites, deletions) in the form of blocking, which ultimately can result in perpetual retries until a timeout.More specifically — rewrite operations in call
validateFromSnapshot(snapshotId)to ensure the files being replaced haven't changed since the rewrite was planned. If we commit data/delete files first (viaAppendFilesorRowDelta), the table's current snapshot advances. The subsequent rewrite validation then fails because it's checking against a now-stale snapshot ID, triggering Iceberg's retry loop indefinitely.Brief change log
Simply put these changes update the order of operations that occur during the
IcebergLakeCommitter.commit()process by performing rewrite operations before any new file deletion operations as opposed to after (previously these were performed after).Tests
A test suite was introduced as part of these changes to cover the various combinations of operations that could occur during a commit cycle (e.g, only data files, data files with deletions, rewrites, etc.) to ensure all of those worked as expected. Additionally a separate test case was added to reproduce the original issue (cycle containing data files, rewrites, and deletions), which was later updated to confirm the fix.
The following table and combinations cover the new tests that were added and the operation types (per commit cycle) tested within them:
testCommitSucceedstestCommitWithDeleteFilesSucceedstestRewriteOnlyCommitSucceedstestRewriteWithDataFilesSucceedstestRewriteWithDeleteFilesInSameCycleSucceedsAdditionally addressing this bug also required adjusting the
IcebergRewriteITCase.testPkTableCompactionWithConflictas the conflict no longer occurs. The test has since been renamed totestPkTableCompactionWithDeletedFilesto align with the behavior (i.e. confirming that compaction works as expected and deletion files are present).FlinkIcebergTieringTestBase.checkFileStatusInIcebergTablehelper function now checks for the presence of a deletion file as opposed to checking each individual file since it appears that a deletion status may be mixed across files (e.g., some may have it and others do not).Documentation
N/A
Reviewers
@luoyuxia / @wuchong