Skip to content

Conversation

@ivandika3
Copy link
Contributor

@ivandika3 ivandika3 commented Jan 28, 2026

What changes were proposed in this pull request?

Currently in OMRatisHelper#getOMResponseFromRaftClientReply, the OMResponse.leaderOMNodeId is set to the replierId (i.e. RaftClientReply.serverId) which is set to the Raft peer that is serving the request.

Previously, it was fine since all Ratis request all write requests that need to be served by the leader. However, now that Ozone support OM follower read, OM cannot use replierId as the OMResponse.leaderOMNodeId.

Hadoop27OmTransport and Hadoop3OmTransport submitRequest will set the next OM node ID to this OM node ID. If the request is incorrectly set to a non-leader OM, this might cause the subsequent failover to wrongly goes to a follower, which might add to the write latency.

Therefore, I decided to remove the failover logic from the Hadoop transport because it causes the client to failover twice

  1. The setNextOmProxy Hadoop3OMTransport#submitRequest will set the current proxy OM node ID
  2. However, selectNextOmProxy will be triggered in OmFailoverProxyProviderBase#getRetryPolicy
    • This will update the OM proxy from pointing to the leader to the next OM node
    • This will trigger long failover, especially since we need to wait longer client already goes through all OM nodes

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14426

How was this patch tested?

IT (Clean CI: https://github.com/ivandika3/ozone/actions/runs/21431935770)

Note: testOMProxyProviderFailoverToCurrentLeader is failing because of the removal of setNextOmProxy logic in Hadoop transport. I removed it because I believe the test is not correct.

The test assumes that if we trigger OmFailoverProxyProvider#performFailover, the next OM request will be sent to the new current OM node ID since FailoverProxyProvider#getProxy is already updated to the new OM node ID. However, this is incorrect since OM client wraps the RPC proxy with RetryProxy#create which uses RetryInvocationHandler . RetryInvocationHandler caches the current proxy in ProxyDescriptor.proxyInfo and only call FailoverProxyProvider#getProxy when there is a failover event which triggers ProxyDescriptor#failover. So even if the test changes the OM node ID to non-leader, the RetryInvocationHandler will keep sending to the leader until the leader steps down / down / disconnected.

We can technically make the test pass by removing the cached ProxyDescriptor.proxyInfo and make RetryInvocationHandler.ProxyDescriptor#getProxy calls FailoverProxyProvider#getProxy directly. However, this might affect other use Hadoop client use cases where FailoverProxyProvider#getProxy recreates the proxy every time.

@ivandika3 ivandika3 requested a review from szetszwo January 28, 2026 10:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant