client: Change connectivity state to CONNECTING when creating the name resolver #8710

easwars · 2025-11-14T21:48:06Z

Fixes #7686

Current Behavior

When the client exits IDLE and creates the name resolver, it stays in IDLE until the connectivity state is set by the LB policy.
When exiting IDLE mode (because of Connect being called or because of an RPC), if name resolver creation fails, we stay in IDLE.

New Behavior

When the client exits IDLE and creates the name resolver, it moves to CONNECTING. Moving forward, the connectivity state will be set by the LB policy.
When exiting IDLE mode (because of Connect being called or because of an RPC), we have already moved to CONNECTING (because of the previous bullet point). If name resolver creation fails, we move to TRANSIENT_FAILURE, start the idle timer, and move back to IDLE when the timer fires.
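
The transitions described above can be sketched as a small state machine. This is only an illustrative model: the `channel` type, its `buildResolver` field, and the `exitIdle` and `onIdleTimeout` methods are hypothetical names and are not the actual grpc-go implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// State is a simplified stand-in for gRPC's connectivity states.
type State int

const (
	Idle State = iota
	Connecting
	TransientFailure
)

func (s State) String() string {
	return [...]string{"IDLE", "CONNECTING", "TRANSIENT_FAILURE"}[s]
}

// errBuild is a placeholder resolver build error used by the example.
var errBuild = errors.New("resolver build failed")

// channel is a hypothetical model of the client channel, not grpc.ClientConn.
type channel struct {
	state         State
	buildResolver func() error
	history       []State // every state the channel has moved to, in order
}

func (c *channel) setState(s State) {
	c.state = s
	c.history = append(c.history, s)
}

// exitIdle models leaving IDLE: the channel moves to CONNECTING before the
// resolver is built, and to TRANSIENT_FAILURE if the build fails.
func (c *channel) exitIdle() error {
	if c.state != Idle {
		return nil
	}
	c.setState(Connecting)
	if err := c.buildResolver(); err != nil {
		c.setState(TransientFailure)
		return err
	}
	return nil
}

// onIdleTimeout models the idle timer firing with no activity on the channel.
func (c *channel) onIdleTimeout() {
	c.setState(Idle)
}

func main() {
	c := &channel{buildResolver: func() error { return errBuild }}
	err := c.exitIdle()
	fmt.Println(c.history, err)
	c.onIdleTimeout()
	fmt.Println(c.state)
}
```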

Implementation details:

The client channel now treats resolver build errors encountered during exiting IDLE identically to resolver errors received prior to valid updates.
The idleness Manager now transitions out of IDLE even if the client channel's ExitIdleMode returns an error. Since the channel moves to TRANSIENT_FAILURE in this scenario, the Manager must correctly reflect this state and resume activity tracking.
OnFinish call options are now invoked even if stream creation fails during an RPC. This fulfills the guarantee for these options and ensures the idleness Manager’s activeCallsCount remains accurate.

RELEASE NOTES:

client: Change connectivity state to CONNECTING when creating the name resolver (as part of exiting IDLE).
client: Change connectivity state to TRANSIENT_FAILURE if name resolver creation fails (as part of exiting IDLE).
client: Change connectivity state to IDLE after idle timeout expires even when current state is TRANSIENT_FAILURE.
client: Fix a bug that resulted in OnFinish call option not being invoked for RPCs where stream creation failed.


codecov · 2025-11-14T21:51:17Z

Codecov Report

❌ Patch coverage is 79.62963% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.13%. Comparing base (112ec12) to head (918c6f1).
⚠️ Report is 15 commits behind head on master.

Files with missing lines	Patch %	Lines
clientconn.go	76.47%	1 Missing and 3 partials ⚠️
internal/idle/idle.go	88.00%	0 Missing and 3 partials ⚠️
stream.go	72.72%	1 Missing and 2 partials ⚠️
resolver_wrapper.go	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #8710      +/-   ##
==========================================
- Coverage   83.28%   82.13%   -1.16%     
==========================================
  Files         416      418       +2     
  Lines       32267    32434     +167     
==========================================
- Hits        26874    26640     -234     
- Misses       4019     4093      +74     
- Partials     1374     1701     +327

Files with missing lines	Coverage Δ
resolver_wrapper.go	`72.47% <0.00%> (-19.98%)`	⬇️
internal/idle/idle.go	`79.12% <88.00%> (-10.04%)`	⬇️
stream.go	`61.85% <72.72%> (-19.99%)`	⬇️
clientconn.go	`72.00% <76.47%> (-18.14%)`	⬇️

... and 39 files with indirect coverage changes


dfawley · 2025-11-20T22:11:20Z

dial_test.go


 func (s stringerVal) String() string { return s.s }
+
+const errResolverBuildercheme = "test-resolver-build-failure"


dfawley · 2025-11-20T22:27:38Z

resolver_wrapper.go

+		// https://github.com/grpc/grpc/blob/master/doc/connectivity-semantics-and-api.md
+		// defines CONNECTING as follows:
+		// - The channel is trying to establish a connection and is waiting to
+		//   make progress on one of the steps involved in name resolution, TCP
+		//   connection establishment or TLS handshake. This may be used as the
+		//   initial state for channels upon creation.
+		//
+		// We are starting the name resolver here as part of exiting IDLE, so
+		// transitioning to CONNECTING is the right thing to do.


IMO comments should be short and to the point.

Short comments make the code take up less space, which makes it easier to read and understand. Long comments make long functions extremely long and not fit on the page.

Honestly, I think a comment for this action isn't even necessary. But if you think we need one, this could be:

// Set state to CONNECTING before building the name resolver
// so the channel does not remain in IDLE.

dfawley · 2025-11-20T22:33:09Z

test/clientconn_state_transition_test.go

+			if state := cc.GetState(); state != connectivity.Idle {
+				t.Fatalf("Expected initial state to be IDLE, got %v", state)
+			}


The AwaitState above already tested this IIUC

dfawley · 2025-11-20T22:33:59Z

test/clientconn_state_transition_test.go

+			// Ensure that the client is in IDLE before connecting.
+			ctx, cancel := context.WithTimeout(context.Background(), defaultTestTimeout)
+			defer cancel()
+			testutils.AwaitState(ctx, t, cc, connectivity.Idle)


This doesn't need an Await right? It should just check the current state, and never wait for changes, as we know it starts idle.

That's true. Moved the check for the current state to here, and got rid of the Await.


dfawley · 2025-11-25T21:21:27Z

resolver_balancer_ext_test.go

 // Tests the case where the resolver reports an error to the channel before
 // reporting an update. Verifies that the channel eventually moves to
-// TransientFailure and a subsequent RPC returns the error reported by the
+// TransientFailure and a subsequent RPCs returns the error reported by the



dfawley · 2025-11-25T22:49:58Z

resolver_wrapper.go

 			Authority:            ccr.cc.authority,
 			MetricsRecorder:      ccr.cc.metricsRecorderList,
 		}
+


Please revert this diff & file.


easwars · 2025-12-02T07:53:12Z

@dfawley
I have the changes ready for the above two comments, but I want to hear from you before sending a commit for it. Thanks.


arjan-bal · 2025-12-02T08:21:50Z

test/clientconn_state_transition_test.go

+			dopts := []grpc.DialOption{
+				grpc.WithTransportCredentials(insecure.NewCredentials()),
+				grpc.WithResolvers(&testResolverBuilder{logger: t, manualR: mr}),
+				grpc.WithIdleTimeout(time.Second),


Can we avoid assuming that a 1s interval is sufficient for 2 attempts to fail?

One solution might be to split this into two tests:

Test that setting a sufficiently large idle timeout keeps the channel in TF and doesn't trigger the resolver builder again.

Set a defaultTestShortIdleTimeout and ensure the channel exits TF, calling the resolver builder again.

We had a way to enter IDLE mode from tests. But looks like that goes through the regular idleness logic. So, I added another one to forcefully put the channel in IDLE (for testing purposes), and changed the test to forcefully enter IDLE after it enters TF.

@easwars the way the test can flake is if there's a gap of more than 1s between the two RPC attempts. The sequence of events is:

RPC 1 is made, results in testResolverBuilder.Build being called and failing. RPC fails with code UNAVAILABLE.

1s passes, the channel enters IDLE mode.

RPC 2 is made, results in testResolverBuilder.Build being called. Since the builder only returns an error for the first call, the Build call succeeds this time. The RPC doesn't fail with the expected error.

Yes, that is a valid sequence of events. But there is nothing else happening in this test at all when that for loop making the two RPCs is running. And the RPCs don't even get far enough to create an LB policy, create subchannels, create transports; none of that good stuff is happening. While I agree it could possibly take a second after the first RPC runs before the second one gets to run, I feel the probability is minuscule.

I agree 1s is sufficient, but it introduces a fixed latency—the test will always take the full second even if it could finish in milliseconds. It's not a blocker for this PR, but something to keep in mind.


dfawley

This looks great now, I think this is what we want. I think we should talk about naming a bit (which is just textual changes), and I'm concerned about the ForceIdle thing. Otherwise LGTM!

dfawley · 2025-12-04T18:46:36Z

internal/idle/idle.go

+// MarkAsExitedIdleMode instructs the Manager to update its internal state to
+// indicate that the channel has exited IDLE mode. This is only used by the gRPC
+// client when it exits IDLE mode manually from Dial.
+func (m *Manager) MarkAsExitedIdleMode() {


This probably needs warnings on it. Generally you can't ever use it in a situation where you can possibly race with anything else calling ExitIdleMode or OnCallBegin, and you must know you are really idle.

Bikeshedding:

UnsafeSetNotIdle()? UnsafeSetActive()? InitializeAsActive() (to convey this is only to be used as part of creation)?

Aside but very related: Do we already say "active" is the opposite of "idle" anywhere? Or are the only states "in idle mode" and "not in idle mode". The latter feels a little awkward. Maybe we can rename things to use "active" vs "idle":

ExitIdleMode() -> Activate()
EnterIdleModeForTesting() -> GoIdle() // Do we really need to ForTesting this? It seems safe to use.

Further bikeshedding that's unrelated: I never found the "Enforcer" name to be intuitive either. I've always thought of the manager as the thing doing the enforcing. The object it's controlling is just enacting the idle manager's commands. How about Actor or Agent or Worker or Channel instead?

This probably needs warnings on it

Improved the docstring to make it sound more scary.

Bikeshedding

Went with UnsafeSetNotIdle

Do we already say "active" is the opposite of "idle" anywhere?

Yeah, we have never used the term "active" to mean the opposite of "idle" anywhere so far.

Maybe we can rename things to use "active" vs "idle"

We probably could do that. Can we do it in a follow up though?

EnterIdleModeForTesting() -> GoIdle() // Do we really need to ForTesting this? It seems safe to use.

Yes, we need this to test the race between the channel going idle after idle_timeout fires and an RPC trying to keep it from going idle.

I never found the "Enforcer" name to be intuitive either.

I liked Channel since it conveyed the meaning directly, but eventually decided to go with ClientConn to keep it consistent with other places where an interface's functionality is provided by the client channel.

dfawley · 2025-12-04T18:47:09Z

internal/idle/idle.go


+// ForceEnterIdleModeForTesting bypasses the usual checks that happen before
+// entering idle mode, and forcefully enter idle mode for testing purposes.
+func (m *Manager) ForceEnterIdleModeForTesting() {


Why do we need this? It feels pretty dangerous to even have it. What are we doing that requires us to enter idle mode even with pending calls?

I added this so that we can force entering IDLE from tests after making RPCs (and completing those RPCs), but not having to set a small idle_timeout and be vulnerable to timing flakes.

Essentially, instead of setting an idle_timeout of say 1s, making an RPC, and waiting for a second (or a little more) for the channel to enter idle, the test could make an RPC, and once the RPC is complete, it can force the channel to enter idle because it knows there are no pending calls.

The existing EnterIdleModeForTesting, which I've renamed to TryEnterIdleModeForTesting, works within the parameters of idleness and moves the channel to idle mode only if there has been no activity on the channel. We could have named it FireIdleTimeoutForTesting because that is essentially what is happening as part of this method, and it is required for testing races between when the channel is trying to enter idle (as part of the idle timeout firing) and exit idle (as part of an RPC).

Coming back to this thread though: #8710 (comment)

Can we avoid assuming that a 1s interval is sufficient for 2 attempts to fail?

Actually the code wasn't assuming that a 1s interval was sufficient for 2 attempts to fail. The 1s interval just determined when the idle timeout would fire, and every time it fired, it would check if the channel had no activity since the last time it fired. I'm going to go back to the previous approach for the following reasons:

1s should be plenty for these two attempts to fail because nothing happens in these two attempts, except the attempt to build the resolver, which would fail, which would result in the RPC failing.

Even if the two attempts took more than 1s, it wouldn't affect the correctness of the test, as the idle timeout would fire and the idleness manager would see that there was activity on the channel and therefore would do nothing. On a later firing (the next one, or the one after that), it would eventually find no activity on the channel and would move it to idle.

This would allow me to get rid of this ForceEnterIdleModeForTesting.
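
The "check for activity since the last firing" behavior described above can be sketched roughly as follows. This is a simplified, single-goroutine model under stated assumptions: the `manager` type and its fields are hypothetical, and the real idleness manager in `internal/idle` uses a timer and synchronization that are omitted here.

```go
package main

import "fmt"

// manager is a hypothetical, single-goroutine model of an idleness manager.
type manager struct {
	activeCalls           int
	activitySinceLastFire bool
	idle                  bool
}

// onCallBegin records the start of an RPC and takes the channel out of idle.
func (m *manager) onCallBegin() {
	m.idle = false
	m.activeCalls++
	m.activitySinceLastFire = true
}

// onCallEnd records the end of an RPC.
func (m *manager) onCallEnd() {
	m.activeCalls--
	m.activitySinceLastFire = true
}

// handleIdleTimeout models one firing of the idle timer. It reports whether
// the channel moved to idle. If there was any activity since the previous
// firing, it only clears the activity flag and waits for the next firing.
func (m *manager) handleIdleTimeout() bool {
	if m.activeCalls > 0 || m.activitySinceLastFire {
		m.activitySinceLastFire = false
		return false
	}
	m.idle = true
	return true
}

func main() {
	m := &manager{}
	m.onCallBegin()
	m.onCallEnd()
	fmt.Println("first firing went idle:", m.handleIdleTimeout())
	fmt.Println("second firing went idle:", m.handleIdleTimeout())
}
```

In this model, a slow pair of RPCs only delays the move to idle by one or more timer periods; it does not change the final outcome.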

I don't understand. Why would we want to simulate the idle timeout firing? Anything operating at that low of a level should be internal to the idle package. Everything else should have a safe way to say "go idle now -- unless there are pending RPCs by the time this call gets in and the idle lock is taken".

Are you concerned more about the name or the functionality of that API itself? This is how it was before the change also. After reading your last comment, I agree that the new name FireIdleTimeoutForTesting seems very low level and can be changed back to EnterIdleModeForTesting. The problem though is that it is possible that it doesn't enter idle mode, because it uses the same logic that non-test code would use to decide if it should try entering idle mode or not. The problem with using separate logic in tests to enter idle mode is that we might not be testing the right set of things for example when we want to test the race between idle timeout firing (and trying to put the channel in idle) and an RPC happening at the same time (that is trying to keep the channel from entering idle).

Added a comment in the initial thread about how the 1s interval can cause the test to flake: #8710 (comment)

we might not be testing the right set of things for example when we want to test the race between idle timeout firing (and trying to put the channel in idle) and an RPC happening at the same time (that is trying to keep the channel from entering idle).

We shouldn't need these kinds of tests in the channel, though, should we? From the channel's perspective, it needs to properly deal with the race between RPC start and a call to EnterIdle() from the idle manager. It doesn't know or care about the mechanics of the timer firing. Testing the idle manager might require such unit-style tests, in which case "simulating the timer firing" can be a non-exported function. Maybe I'm missing something though.

We shouldn't need these kinds of tests in the channel, though, should we?

Looks like we have a unit-style test for exactly the same thing, and it does exactly what you are saying: one goroutine trying to fire the idle timeout and another one trying to call OnCallBegin to start an RPC.

So, I ended up removing the test that triggers the same from the channel.
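
A unit-style race check of the kind discussed here might look roughly like the sketch below. It is a minimal model, assuming a mutex-protected `manager` with hypothetical `onCallBegin` and `fireIdleTimeout` methods; it is not the existing test in the idle package.

```go
package main

import (
	"fmt"
	"sync"
)

// manager is a hypothetical, mutex-protected model of an idleness manager.
type manager struct {
	mu          sync.Mutex
	idle        bool
	activeCalls int
}

// onCallBegin records a new RPC and exits idle if needed.
func (m *manager) onCallBegin() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.activeCalls++
	m.idle = false
}

// fireIdleTimeout enters idle only if no RPC is in flight.
func (m *manager) fireIdleTimeout() {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.activeCalls == 0 {
		m.idle = true
	}
}

// raceOnce runs both operations concurrently and returns the final state.
func raceOnce() (idle bool, activeCalls int) {
	m := &manager{}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); m.fireIdleTimeout() }()
	go func() { defer wg.Done(); m.onCallBegin() }()
	wg.Wait()
	return m.idle, m.activeCalls
}

func main() {
	// Whichever goroutine wins, the RPC must leave the channel not idle.
	for i := 0; i < 1000; i++ {
		if idle, calls := raceOnce(); idle || calls != 1 {
			panic("RPC started but channel ended up idle")
		}
	}
	fmt.Println("no interleaving left the channel idle")
}
```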


arjan-bal

LGTM, there's still a discussion about removing ForceEnterIdleModeForTesting. I'm fine with either outcome.

dfawley · 2025-12-05T18:41:55Z

internal/idle/idle.go

-	m.tryEnterIdleMode()
+// FireIdleTimeoutForTesting forcefully triggers the idle timeout to fire.
+func (m *Manager) FireIdleTimeoutForTesting() {
+	m.handleIdleTimeout()


From your other comment:

Are you concerned more about the name or the functionality of that API itself? This is how it was before the change also.

But...this is a behavior change, right?

I've brought back EnterIdleModeForTesting now, but the implementation takes inspiration from how to exit idle from Dial. So the implementation now calls the unexported enterIdleMode on the clientconn and calls a method on the idleness manager to let it know that we have entered idle (so go update your internal state).

This also means that I don't have to use the 1s idle_timeout from one of those tests, and also not directly cause the idle timeout to fire from the test, but instead use a higher level API to ask the clientconn to enter idle mode.

easwars · 2025-12-05T20:11:22Z

@dfawley
I'm hoping the PR now should have addressed all outstanding concerns. Please take another look.

dfawley · 2025-12-05T23:29:59Z

internal/idle/idle.go

+//
+// N.B. This method is intended only for testing purposes. The caller must
+// ensure that there are no ongoing RPCs
+func (m *Manager) UnsafeSetIdleForTesting() error {


Why do you prefer this over the old version? This is dangerous while the old one was safe. And the old one will work in all the same scenarios that the new one will work in (when you know for sure the channel is not able to perform RPCs), but not vice-versa.

dfawley · 2025-12-05T23:33:46Z

clientconn.go

+		cc.csMgr.updateState(connectivity.TransientFailure)
+		cc.mu.Lock()
+		cc.updateResolverStateAndUnlock(resolver.State{}, err)
 		return err


I think we should augment this error to say something about the resolver building failing (or the wrapper should?) so that it's clear when we fail to exit idle mode that the subsequent text came from the name resolver.
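
One way to augment the error, sketched with hypothetical names (`exitIdleMode`, `buildResolver`, and the message text are assumptions, not the code in this PR), is to wrap it with `%w` so the original resolver error stays inspectable:

```go
package main

import (
	"errors"
	"fmt"
)

// errBuild is a placeholder for whatever the resolver builder returns.
var errBuild = errors.New("unknown scheme")

// exitIdleMode is a hypothetical stand-in for the channel's exit-idle path.
// It prefixes the resolver build error so the caller can tell where the
// text that follows came from.
func exitIdleMode(buildResolver func() error) error {
	if err := buildResolver(); err != nil {
		return fmt.Errorf("failed to build name resolver: %w", err)
	}
	return nil
}

func main() {
	err := exitIdleMode(func() error { return errBuild })
	fmt.Println(err)
	// The wrapped error can still be matched by callers.
	fmt.Println(errors.Is(err, errBuild))
}
```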

dfawley · 2025-12-05T23:41:48Z

resolver_wrapper.go

 	errCh := make(chan error)
 	ccr.serializer.TrySchedule(func(ctx context.Context) {
 		if ctx.Err() != nil {
+			errCh <- ctx.Err()


Wow, it seems like there should be a way to statically check something like this! 👀
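
Short of a static check, one pattern that makes a missed send harder to write is to have the scheduled function return its error and to send on the channel from a single place. The sketch below assumes hypothetical helpers (`runAndWait`, `build`, `canceledContext`) and uses a plain goroutine instead of the serializer used in resolver_wrapper.go.

```go
package main

import (
	"context"
	"fmt"
)

// runAndWait runs fn on another goroutine and waits for its result. Because
// the only send on errCh is here, no return path inside fn can forget to
// report its outcome and leave the receiver blocked.
func runAndWait(ctx context.Context, fn func(context.Context) error) error {
	errCh := make(chan error, 1)
	go func() { errCh <- fn(ctx) }()
	return <-errCh
}

// build is a hypothetical stand-in for the resolver build step.
func build(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		return err // the early return still reaches the waiting caller
	}
	return nil
}

// canceledContext returns a context that is already canceled.
func canceledContext() context.Context {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return ctx
}

func main() {
	fmt.Println(runAndWait(canceledContext(), build))
}
```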

dfawley · 2025-12-05T23:43:33Z

stream.go

+			if channelz.IsOn() {
+				cc.incrCallsFailed()
+			}
+			// Invoke all the registered OnFinish call options explicitly. A
+			// non-nil error means that the stream wasn't created, and
+			// therefore these will be NOT be invoked as part of `cs.finish()`.
+			for _, o := range opts {
+				if o, ok := o.(OnFinishCallOption); ok {
+					o.OnFinish(err)
+				}
+			}


Let's see if we can find a better way to unify these things?

Maybe we can have a shared endOfRPC function that is called by finish and from here?
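
A rough sketch of that idea follows. The types here are simplified local stand-ins (`callOption`, `onFinishOption`), and `endOfRPC`, `newStream`, and `finish` are hypothetical; none of this is the grpc-go code itself.

```go
package main

import (
	"errors"
	"fmt"
)

// callOption and onFinishOption are simplified stand-ins for grpc.CallOption
// and grpc.OnFinishCallOption.
type callOption interface{}

type onFinishOption struct {
	OnFinish func(error)
}

// errStream is a placeholder stream creation error.
var errStream = errors.New("stream creation failed")

// endOfRPC is the hypothetical shared helper: it invokes every registered
// OnFinish callback exactly once with the final RPC error.
func endOfRPC(opts []callOption, err error) {
	for _, o := range opts {
		if o, ok := o.(onFinishOption); ok {
			o.OnFinish(err)
		}
	}
}

// newStream models stream creation. On failure the stream never exists, so
// finish will not run and endOfRPC has to be called here.
func newStream(opts []callOption, create func() error) error {
	if err := create(); err != nil {
		endOfRPC(opts, err)
		return err
	}
	return nil
}

// finish models the normal completion path of an established stream.
func finish(opts []callOption, err error) {
	endOfRPC(opts, err)
}

func main() {
	opts := []callOption{onFinishOption{OnFinish: func(err error) {
		fmt.Println("OnFinish:", err)
	}}}
	_ = newStream(opts, func() error { return errStream }) // failure path
	if err := newStream(opts, func() error { return nil }); err == nil {
		finish(opts, nil) // success path
	}
}
```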

client: move connectivity state to CONNECTING when creating the name …

1cb398d

…resolver

easwars requested a review from dfawley November 14, 2025 21:48

easwars assigned dfawley Nov 14, 2025

easwars added Type: Bug and Area: Client labels Nov 14, 2025

easwars added this to the 1.78 Release milestone Nov 14, 2025

easwars requested a review from arjan-bal November 14, 2025 21:48

easwars assigned arjan-bal Nov 14, 2025

make vet happy

86503d2

dfawley reviewed Nov 20, 2025


easwars added 8 commits November 21, 2025 07:10

fix typo

f496de6

shorten comment, move error handling and setting state to the channel

36f14e6

check for IDLE after channel creation without awaiting

0131893

ensure resolver build error is returned to RPC

bd0ee2b

add a test to verify the case where resolver reports error

a88158b

return Unavailable when resolver build fails

a7085d5

return Unavailable only in the RPC path

7c12dde

make more than one RPC

5273e8a

dfawley reviewed Nov 25, 2025


easwars added 3 commits November 26, 2025 20:40

ensure activeCallsCount value is correctly tracked when stream creati…

2c96437

…on fails for an RPC

ensure status error is returned on a failed ExitIdle only for the RPC…

8150866

… path

make multiple RPCs from the test

53aef00

dfawley reviewed Dec 1, 2025


dfawley assigned easwars and unassigned dfawley Dec 1, 2025

easwars assigned dfawley and unassigned easwars Dec 2, 2025

arjan-bal reviewed Dec 2, 2025


arjan-bal assigned easwars and unassigned arjan-bal Dec 2, 2025

easwars added 3 commits December 4, 2025 08:09

make ExitIdleMode on the idleness Manager infallible

7f5eef0

handle review comments from Arjan

e13bc58

make vet happy

6cbd782

dfawley reviewed Dec 4, 2025


easwars added 3 commits December 4, 2025 19:45

rename Enforcer to ClientConn

ee889ea

name the method UnsafeSetNotIdle and improve docstring

2da5357

remove ForceEnterIdleModeForTesting and rename TryEnterIdleModeForTes…

506ee4f

…ting

easwars assigned arjan-bal and unassigned easwars Dec 4, 2025

arjan-bal approved these changes Dec 5, 2025


arjan-bal removed their assignment Dec 5, 2025

dfawley reviewed Dec 5, 2025


easwars added 2 commits December 5, 2025 19:52

bring back EnterIdleModeForTesting

5030d4c

minor fixes

918c6f1

dfawley reviewed Dec 5, 2025


