Skip to content

Conversation

@easwars
Copy link
Contributor

@easwars easwars commented Nov 14, 2025

Fixes #7686

Current Behavior

  • When client exits IDLE and creates the name resolver, it stays in IDLE until the connectivity state is set by the LB policy.
  • When exiting IDLE mode (because of Connect being called or because of an RPC), if name resolver creation fails, we stay in IDLE.

New Behavior

  • When the client exits IDLE and creates the name resolver, it moves to CONNECTING. Moving forward, the connectivity state will be set by the LB policy.
  • When exiting IDLE mode (because of Connect being called or because of an RPC), we have already moved to CONNECTING (because of the previous bullet point). If name resolver creation fails, we will move to TRANSIENT_FAILURE and start the idle timer and move back to IDLE when the timer fires

Implementation details:

  • The client channel now treats resolver build errors encountered during exiting IDLE identically to resolver errors received prior to valid updates.
  • The idleness Manager now transitions out of IDLE even if the client channel's ExitIdleMode returns an error. Since the channel moves to TRANSIENT_FAILURE in this scenario, the Manager must correctly reflect this state and resume activity tracking.
  • OnFinish call options are now invoked even if stream creation fails during an RPC. This fulfills the guarantee for these options and ensures the idleness Manager’s activeCallsCount remains accurate.

RELEASE NOTES:

  • client: Change connectivity state to CONNECTING when creating the name resolver (as part of exiting IDLE).
  • client: Change connectivity state to TRANSIENT_FAILURE if name resolver creation fails (as part of exiting IDLE).
  • client: Change connectivity state to IDLE after idle timeout expires even when current state is TRANSIENT_FAILURE.
  • client: Fix a bug that resulted in OnFinish call option not being invoked for RPCs where stream creation failed.

@easwars easwars requested a review from dfawley November 14, 2025 21:48
@easwars easwars added Type: Bug Area: Client Includes Channel/Subchannel/Streams, Connectivity States, RPC Retries, Dial/Call Options and more. labels Nov 14, 2025
@easwars easwars added this to the 1.78 Release milestone Nov 14, 2025
@easwars easwars requested a review from arjan-bal November 14, 2025 21:48
@codecov
Copy link

codecov bot commented Nov 14, 2025

Codecov Report

❌ Patch coverage is 79.62963% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.13%. Comparing base (112ec12) to head (918c6f1).
⚠️ Report is 15 commits behind head on master.

Files with missing lines Patch % Lines
clientconn.go 76.47% 1 Missing and 3 partials ⚠️
internal/idle/idle.go 88.00% 0 Missing and 3 partials ⚠️
stream.go 72.72% 1 Missing and 2 partials ⚠️
resolver_wrapper.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8710      +/-   ##
==========================================
- Coverage   83.28%   82.13%   -1.16%     
==========================================
  Files         416      418       +2     
  Lines       32267    32434     +167     
==========================================
- Hits        26874    26640     -234     
- Misses       4019     4093      +74     
- Partials     1374     1701     +327     
Files with missing lines Coverage Δ
resolver_wrapper.go 72.47% <0.00%> (-19.98%) ⬇️
internal/idle/idle.go 79.12% <88.00%> (-10.04%) ⬇️
stream.go 61.85% <72.72%> (-19.99%) ⬇️
clientconn.go 72.00% <76.47%> (-18.14%) ⬇️

... and 39 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

dial_test.go Outdated

func (s stringerVal) String() string { return s.s }

const errResolverBuildercheme = "test-resolver-build-failure"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 84 to 92
// https://github.com/grpc/grpc/blob/master/doc/connectivity-semantics-and-api.md
// defines CONNECTING as follows:
// - The channel is trying to establish a connection and is waiting to
// make progress on one of the steps involved in name resolution, TCP
// connection establishment or TLS handshake. This may be used as the
// initial state for channels upon creation.
//
// We are starting the name resolver here as part of exiting IDLE, so
// transitioning to CONNECTING is the right thing to do.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO comments should be short and to the point.

Short comments make the code take up less space, which makes it easier to read and understand. Long comments make long functions extremely long and not fit on the page.

Honestly, I think a comment for this action isn't even necessary. But if you think we need one, this could be:

// Set state to CONNECTING before building the name resolver
// so the channel does not remain in IDLE.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 740 to 742
if state := cc.GetState(); state != connectivity.Idle {
t.Fatalf("Expected initial state to be IDLE, got %v", state)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AwaitState above already tested this IIUC

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// Ensure that the client is in IDLE before connecting.
ctx, cancel := context.WithTimeout(context.Background(), defaultTestTimeout)
defer cancel()
testutils.AwaitState(ctx, t, cc, connectivity.Idle)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't need an Await right? It should just check the current state, and never wait for changes, as we know it starts idle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true. Moved the check for the current state to here, and got rid of the Await.

// Tests the case where the resolver reports an error to the channel before
// reporting an update. Verifies that the channel eventually moves to
// TransientFailure and a subsequent RPC returns the error reported by the
// TransientFailure and a subsequent RPCs returns the error reported by the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/ a / /

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Authority: ccr.cc.authority,
MetricsRecorder: ccr.cc.metricsRecorderList,
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert this diff & file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@dfawley dfawley assigned easwars and unassigned dfawley Dec 1, 2025
@easwars
Copy link
Contributor Author

easwars commented Dec 2, 2025

@dfawley
I have the changes ready for the above two comments, but I want to hear from you before sending a commit for it. Thanks.

@easwars easwars assigned dfawley and unassigned easwars Dec 2, 2025
dopts := []grpc.DialOption{
grpc.WithTransportCredentials(insecure.NewCredentials()),
grpc.WithResolvers(&testResolverBuilder{logger: t, manualR: mr}),
grpc.WithIdleTimeout(time.Second),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid assuming that a 1s interval is sufficient for 2 attempts to fail?

One solution might be to split this into two tests:

  1. Test that setting a sufficiently large idle timeout keeps the channel in TF and doesn't trigger the resolver builder again.
  2. Set a defaultTestShortIdleTimeout and ensure the channel exits TF, calling the resolver builder again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had a way to enter IDLE mode from tests. But looks like that goes through the regular idleness logic. So, I added another one to forcefully put the channel in IDLE (for testing purposes), and changed the test to forcefully enter IDLE after it enters TF.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@easwars the way the test can flake is if there's a gap of more than 1s b/w the two RPC attempts. The sequence of events is:

  1. RPC 1 is made, results in testResolverBuilder.Build being called and failing. RPC fails with code UNAVAILABLE.
  2. 1s passes, the channel enters IDLE mode.
  3. RPC 2 is made, results in testResolverBuilder.Build being called. Since the builder only returns an error for the first call, the Build call succeeds this time. The RPC doesn't fail with the expected error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is a valid sequence of events. But there is nothing else happening in this test at all when that for loop making the two RPCs is running. And the RPCs don't even get far enough to create an LB policy, create subchannels, create transports, none of that good stuff is happening. While I agree it could possible take a second after the first RPC runs before the second one gets to run, I feel the probability is miniscule.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree 1s is sufficient, but it introduces a fixed latency—the test will always take the full second even if it could finish in milliseconds. It's not a blocker for this PR, but something to keep in mind.

@arjan-bal arjan-bal assigned easwars and unassigned arjan-bal Dec 2, 2025
Copy link
Member

@dfawley dfawley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great now, I think this is what we want. I think we should take about naming a bit (which is just textual changes), and I'm concerned about the ForceIdle thing. Otherwise LGTM!

Comment on lines 256 to 259
// MarkAsExitedIdleMode instructs the Manager to update its internal state to
// indicate that the channel has exited IDLE mode. This is only used by the gRPC
// client when it exits IDLE mode manually from Dial.
func (m *Manager) MarkAsExitedIdleMode() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably needs warnings on it. Generally you can't ever use it in a situation where you can possibly race with anything else calling ExitIdleMode or OnCallBegin, and you must know you are really idle.

Bikeshedding:

UnsafeSetNotIdle()? UnsafeSetActive()? InitializeAsActive() (to convey this is only to be used as part of creation)?

Aside but very related: Do we already say "active" is the opposite of "idle" anywhere? Or are the only states "in idle mode" and "not in idle mode". The latter feels a little awkward. Maybe we can rename things to use "active" vs "idle":

ExitIdleMode() -> Activate()
EnterIdleModeForTesting() -> GoIdle()  // Do we really need to ForTesting this? It seems safe to use.

Further bikeshedding that's unrelated: I never found the "Enforcer" name to be intuitive either. I've always thought of the manager as the thing doing the enforcing. The object it's controlling is just enacting the idle manager's commands. How about Actor or Agent or Worker or Channel instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably needs warnings on it

Improved the docstring to make it sound more scary.

Bikeshedding

Went with UnsafeSetNotIdle

Do we already say "active" is the opposite of "idle" anywhere?

Yeah, we have never used the term "active" to mean the opposite of "idle" anywhere so far.

Maybe we can rename things to use "active" vs "idle"

We probably could do that. Can we do it in a follow up though?

EnterIdleModeForTesting() -> GoIdle() // Do we really need to ForTesting this? It seems safe to use.

Yes, we need this to test the race between the channel going idle after idle_timeout firing and an RPC trying to keep it from not going idle.

I never found the "Enforcer" name to be intuitive either.

I liked Channel since it conveyed the meaning directly, but eventually decided to go with ClientConn to keep it consistent with other places where an interface's functionality is provided by the client channel.


// ForceEnterIdleModeForTesting bypasses the usual checks that happen before
// entering idle mode, and forcefully enter idle mode for testing purposes.
func (m *Manager) ForceEnterIdleModeForTesting() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this? It feels pretty dangerous to even have it. What are we doing that requires us to enter idle mode even with pending calls?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this so that we can force entering IDLE from tests after making RPCs (and completing those RPCs), but not having to set a small idle_timeout and be vulnerable to timing flakes.

Essentially instead of setting an idle_timeout of say 1s, making an RPC, and waiting for a second (or a little more) for the channel to enter idle, the test could make an RPC, and once the RPC is complete, it can force the channel to enter idle becuase it knows there are no pending calls.

The existing EnterIdleModeForTesting which I've renamed to TryEnterIdleModeForTesting works within the parameters of idleness and moves the channel to idle mode only if there has been no activity on the channel. We could have named it as FireIdleTimeoutForTesting because that is essentially what is happening as part of this method and is required for testing races between when the channel is trying to enter idle (as part of the idle timeout firing) and exit idle (as part an RPC).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coming back to this thread though: #8710 (comment)

Can we avoid assuming that a 1s interval is sufficient for 2 attempts to fail?

Actually the code wasn't assuming that a 1s interval was sufficient for 2 attempts to fail. The 1s interval just determined when the idle timeout would fire and everytime it fired, it would check if the channel had no activity since the last time it fired. I'm going to go back to the previous approach for the following reasons:

  • 1s should be plenty for these two attempts to fail because nothing happens in these two attempts, except the attempt to build the resolver which would fail, which would result in the RPC failing
  • Even if the two attempts took more than 1s, it wouldn't affect the correctness of the test as the idle timeout would fire and the idleness manager would see that there was activity on the channel and therefore would do nothing. And the next time it fires (or the next time), it would eventually find no activity on the channel and would move it to idle

This would allow me to get rid of this ForceEnterIdleModeForTesting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand. Why would we want to simulate the idle timeout firing? Anything operating at that low of a level should be internal to the idle package. Everything else should have a safe way to say "go idle now -- unless there are pending RPCs by the time this call gets in and the idle lock is taken".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you concerned more about the name or the functionality of that API itself? This is how it was before the change also. After reading your last comment, I agree that the new name FireIdleTimeoutForTesting seems very low level and can be changed back to EnterIdleModeForTesting. The problem though is that it is possible that it doesn't enter idle mode, because it uses the same logic that non-test code would use to decide if it should try entering idle mode or not. The problem with using separate logic in tests to enter idle mode is that we might not be testing the right set of things for example when we want to test the race between idle timeout firing (and trying to put the channel in idle) and an RPC happening at the same time (that is trying to keep the channel from entering idle).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment in the initial thread about how the 1s interval can cause the test to flake: #8710 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might not be testing the right set of things for example when we want to test the race between idle timeout firing (and trying to put the channel in idle) and an RPC happening at the same time (that is trying to keep the channel from entering idle).

We shouldn't need these kinds of tests in the channel, though, should we? From the channel's perspective, it needs to properly deal with the race between RPC start and a call to EnterIdle() from the idle manager. It doesn't know or care about the mechanics of the timer firing. Testing the idle manager might require such unit-style tests, in which case "simulating the timer firing" can be a non-exported function. Maybe I'm missing something though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't need these kinds of tests in the channel, though, should we?

Looks like we have a unit style test for exactly the same thing and it does exactly what you are saying: one goroutine trying to fire the idle timeout and another one trying to call OnCallBeing to start an RPC

So, I ended up removing the test that triggers the same from the channel.

@easwars easwars assigned arjan-bal and unassigned easwars Dec 4, 2025
Copy link
Contributor

@arjan-bal arjan-bal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, there's still a discussion about removing ForceEnterIdleModeForTesting. I'm fine with either outcome.

@arjan-bal arjan-bal removed their assignment Dec 5, 2025
m.tryEnterIdleMode()
// FireIdleTimeoutForTesting forcefully triggers the idle timeout to fire.
func (m *Manager) FireIdleTimeoutForTesting() {
m.handleIdleTimeout()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From your other comment:

Are you concerned more about the name or the functionality of that API itself? This is how it was before the change also.

But...this is a behavior change, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've brought back EnterIdleModeForTesting now, but the implementation takes inspiration from how to exit idle from Dial. So the implementation now calls the unexported enterIdleMode on the clientconn and calls a method on the idleness manager to let it know that we have entered idle (so go update your internal state).

This also means that I don't have to use the 1s idle_timeout from one of those tests, and also not directly cause the idle timeout to fire from the test, but instead use a higher level API to ask the clientconn to enter idle mode.

@easwars
Copy link
Contributor Author

easwars commented Dec 5, 2025

@dfawley
I'm hoping the PR now should have addressed all outstanding concerns. Please take another look.

//
// N.B. This method is intended only for testing purposes. The caller must
// ensure that there are no ongoing RPCs
func (m *Manager) UnsafeSetIdleForTesting() error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you prefer this over the old version? This is dangerous while the old one was safe. And the old one will work in all the same scenarios that the new one will work in (when you know for sure the channel is not able to perform RPCs), but not vice-versa.

cc.csMgr.updateState(connectivity.TransientFailure)
cc.mu.Lock()
cc.updateResolverStateAndUnlock(resolver.State{}, err)
return err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should augment this error to say something about the resolver building failing (or the wrapper should?) so that it's clear when we fail to exit idle mode that the subsequent text came from the name resolver.

errCh := make(chan error)
ccr.serializer.TrySchedule(func(ctx context.Context) {
if ctx.Err() != nil {
errCh <- ctx.Err()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, it seems like there should be a way to statically check something like this! 👀

Comment on lines +188 to +198
if channelz.IsOn() {
cc.incrCallsFailed()
}
// Invoke all the registered OnFinish call options explicitly. A
// non-nil error means that the stream wasn't created, and
// therefore these will be NOT be invoked as part of `cs.finish()`.
for _, o := range opts {
if o, ok := o.(OnFinishCallOption); ok {
o.OnFinish(err)
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's see if we can find a better way to unify these things?

Maybe we can have a shared endOfRPC function that is called by finish and from here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area: Client Includes Channel/Subchannel/Streams, Connectivity States, RPC Retries, Dial/Call Options and more. Type: Bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add a test for the initial channel state while waiting for the first name resolver update

3 participants