
Conversation

@dbolduc
Member

@dbolduc dbolduc commented Jan 31, 2026

Fixes #4471

I have been running a benchmark for >1 hour with these changes, and the client reports no failed acks thus far.

@product-auto-label bot added the api: pubsub label (Issues related to the Pub/Sub API) on Jan 31, 2026

codecov bot commented Jan 31, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.03%. Comparing base (eb74294) to head (4bcaadf).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4475      +/-   ##
==========================================
- Coverage   95.04%   95.03%   -0.01%     
==========================================
  Files         195      195              
  Lines        7427     7433       +6     
==========================================
+ Hits         7059     7064       +5     
- Misses        368      369       +1     


@dbolduc dbolduc marked this pull request as ready for review January 31, 2026 04:22
@dbolduc dbolduc requested a review from a team as a code owner January 31, 2026 04:22

impl RetryPolicy for AtLeastOnceRetryPolicy {
    fn on_error(&self, state: &RetryState, error: Error) -> RetryResult {
        if state.attempt_count == 1 && error.is_transport() {
Collaborator

Fine, but an OnlyTransportErrors.with_attempt_limit(1) would give you more reusable components.

Do you want to retry auth errors too?
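The reusable shape suggested above might look roughly like this. This is a sketch, not the client's actual API: RetryState, RetryResult, and Error are simplified stand-ins for the library's types, and AttemptLimit is a hypothetical decorator standing in for a with_attempt_limit combinator.

```rust
// Simplified stand-ins for the client library's retry types (hypothetical).
#[derive(Debug, PartialEq)]
enum RetryResult {
    Continue,
    Exhausted,
}

struct RetryState {
    attempt_count: u32,
}

struct Error {
    transport: bool,
}

impl Error {
    fn is_transport(&self) -> bool {
        self.transport
    }
}

trait RetryPolicy {
    fn on_error(&self, state: &RetryState, error: Error) -> RetryResult;
}

// A policy that retries transport errors only, as in the PR.
struct OnlyTransportErrors;

impl RetryPolicy for OnlyTransportErrors {
    fn on_error(&self, _state: &RetryState, error: Error) -> RetryResult {
        if error.is_transport() {
            RetryResult::Continue
        } else {
            RetryResult::Exhausted
        }
    }
}

// A reusable decorator that caps how many attempts any inner policy allows,
// giving the `OnlyTransportErrors.with_attempt_limit(1)` shape from the review.
struct AttemptLimit<P> {
    inner: P,
    limit: u32,
}

impl<P: RetryPolicy> RetryPolicy for AttemptLimit<P> {
    fn on_error(&self, state: &RetryState, error: Error) -> RetryResult {
        if state.attempt_count > self.limit {
            return RetryResult::Exhausted;
        }
        self.inner.on_error(state, error)
    }
}
```

Composing the limit as a decorator keeps the error-classification logic and the attempt bookkeeping in separate, independently testable pieces.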

Member Author

No, because auth errors would warrant some kind of backoff.

The problem I am trying to solve is losing Acks when a channel closes, which reliably happens every hour.

From testing, this change reduces the lost acks, but not entirely. I missed that the same Ack requests could get retried on a different channel that is also closed, and receive another transport error.

So maybe I should retry these immediately up to N times, where N is the number of channels.

Or maybe I should manually reset the channels every 55 minutes, or something.
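That "retry immediately, up to the channel count" idea could be sketched like this. Again a sketch with simplified stand-in types, not the client's real API; num_channels is assumed to come from the client's channel-pool configuration.

```rust
// Simplified stand-ins for the client library's retry types (hypothetical).
#[derive(Debug, PartialEq)]
enum RetryResult {
    Continue,
    Exhausted,
}

struct RetryState {
    attempt_count: u32,
}

struct Error {
    transport: bool,
}

impl Error {
    fn is_transport(&self) -> bool {
        self.transport
    }
}

// Allows one immediate retry per channel in the pool, so a retried Ack has a
// chance to land on a channel that is not also mid-shutdown.
struct PerChannelRetryPolicy {
    num_channels: u32,
}

impl PerChannelRetryPolicy {
    fn on_error(&self, state: &RetryState, error: Error) -> RetryResult {
        if state.attempt_count <= self.num_channels && error.is_transport() {
            RetryResult::Continue
        } else {
            RetryResult::Exhausted
        }
    }
}
```

Note this still races with channels closing between attempts; it only shrinks the window rather than eliminating it.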

Collaborator

No, because auth errors would warrant some kind of backoff.

The problem I am trying to solve is losing Acks when a channel closes, which reliably happens every hour.

Right. I am arguing that maybe you should retry any request that never hit the network. 🤷

Or maybe I should manually reset the channels every 55 minutes, or something.

Maybe there is a way to keep the channels active. I think gRPC for C++ has configuration options to send keep-alive messages every X seconds.
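If the Rust client's transport is tonic, the equivalent knobs would look roughly like this. This is a configuration sketch under that assumption; the endpoint URL and the interval/timeout values are illustrative, not tuned or taken from the client.

```rust
use std::time::Duration;
use tonic::transport::Endpoint;

// Sketch: configure HTTP/2 keep-alive pings so idle channels are less likely
// to be silently closed by intermediaries.
fn keepalive_endpoint() -> Endpoint {
    Endpoint::from_static("https://pubsub.googleapis.com")
        // Send a keep-alive ping every 30 seconds.
        .http2_keep_alive_interval(Duration::from_secs(30))
        // Drop the connection if a ping goes unanswered for 10 seconds.
        .keep_alive_timeout(Duration::from_secs(10))
        // Ping even when no RPCs are in flight.
        .keep_alive_while_idle(true)
}
```

Keep-alives would address the symptom (idle channels being closed) rather than the retry classification itself, so the two approaches are complementary.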

Member Author

Hm, I didn't look there. I'll see if it helps. Thanks.


impl BackoffPolicy for AtLeastOnceBackoffPolicy {
    fn on_failure(&self, _state: &RetryState) -> Duration {
        Duration::ZERO
Collaborator

I am not sure why, but this gives me the heebie-jeebies. Works, but only because of that attempt limit.


Successfully merging this pull request may close these issues.

Retry at-least-once lease operations that fail when a connection is closed