-
Notifications
You must be signed in to change notification settings - Fork 594
HDDS-14432. SnapshotDeletingService incorrectly checks Ratis Buffer Limit #9656
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes incorrect Ratis buffer limit checking in SnapshotDeletingService. The old implementation would add all snapshot-related entries to a request builder before checking if the message size exceeded the buffer limit, then attempted to recover by truncating the lists. This approach was flawed as it could create oversized messages.
Changes:
- Replaced the flawed size-checking logic with progressive batch building that checks size limits before adding each entry
- Introduced proper batching across deletedKeys, renamedKeys, and deletedDirs with early termination on failures
- Improved error handling by returning the count of successfully submitted entries for accurate retry tracking
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
.../ozone-manager/src/main/java/org/apache/hadoop/ozone/om/service/SnapshotDeletingService.java
Outdated
Show resolved
Hide resolved
.../ozone-manager/src/main/java/org/apache/hadoop/ozone/om/service/SnapshotDeletingService.java
Show resolved
Hide resolved
.../ozone-manager/src/main/java/org/apache/hadoop/ozone/om/service/SnapshotDeletingService.java
Show resolved
Hide resolved
smengcl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. Could you add a test case for this fix? (which would have failed without this fix)
What changes were proposed in this pull request?
SnapshotDeletingServiceincorrectly checks the Ratis Buffer Limit when we invokesubmitSnapshotMoveDeletedKeys, it adds all the keys to the builder and then checks the size of the message. Later, it again adds moredeletedKeys,deletedDirsandrenamedKeyson it, The message size is already over the ratis buffer limit, instead of creating a new builder we re-use the same builder. Ideally we should stop iterating through the snapshot-related tables when we hit the buffer limit even before going tosubmitSnapshotMoveDeletedKeysWhat is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-14432
How was this patch tested?
Existing Tests