-
Notifications
You must be signed in to change notification settings - Fork 4.6k
client: Change connectivity state to CONNECTING when creating the name resolver #8710
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
client: Change connectivity state to CONNECTING when creating the name resolver #8710
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #8710 +/- ##
==========================================
- Coverage 83.28% 82.13% -1.16%
==========================================
Files 416 418 +2
Lines 32267 32434 +167
==========================================
- Hits 26874 26640 -234
- Misses 4019 4093 +74
- Partials 1374 1701 +327
🚀 New features to boost your workflow:
|
dial_test.go
Outdated
|
|
||
| func (s stringerVal) String() string { return s.s } | ||
|
|
||
| const errResolverBuildercheme = "test-resolver-build-failure" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
resolver_wrapper.go
Outdated
| // https://github.com/grpc/grpc/blob/master/doc/connectivity-semantics-and-api.md | ||
| // defines CONNECTING as follows: | ||
| // - The channel is trying to establish a connection and is waiting to | ||
| // make progress on one of the steps involved in name resolution, TCP | ||
| // connection establishment or TLS handshake. This may be used as the | ||
| // initial state for channels upon creation. | ||
| // | ||
| // We are starting the name resolver here as part of exiting IDLE, so | ||
| // transitioning to CONNECTING is the right thing to do. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO comments should be short and to the point.
Short comments make the code take up less space, which makes it easier to read and understand. Long comments make long functions extremely long and not fit on the page.
Honestly, I think a comment for this action isn't even necessary. But if you think we need one, this could be:
// Set state to CONNECTING before building the name resolver
// so the channel does not remain in IDLE.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| if state := cc.GetState(); state != connectivity.Idle { | ||
| t.Fatalf("Expected initial state to be IDLE, got %v", state) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The AwaitState above already tested this IIUC
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| // Ensure that the client is in IDLE before connecting. | ||
| ctx, cancel := context.WithTimeout(context.Background(), defaultTestTimeout) | ||
| defer cancel() | ||
| testutils.AwaitState(ctx, t, cc, connectivity.Idle) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't need an Await right? It should just check the current state, and never wait for changes, as we know it starts idle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true. Moved the check for the current state to here, and got rid of the Await.
resolver_balancer_ext_test.go
Outdated
| // Tests the case where the resolver reports an error to the channel before | ||
| // reporting an update. Verifies that the channel eventually moves to | ||
| // TransientFailure and a subsequent RPC returns the error reported by the | ||
| // TransientFailure and a subsequent RPCs returns the error reported by the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/ a / /
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
resolver_wrapper.go
Outdated
| Authority: ccr.cc.authority, | ||
| MetricsRecorder: ccr.cc.metricsRecorderList, | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please revert this diff & file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
@dfawley |
| dopts := []grpc.DialOption{ | ||
| grpc.WithTransportCredentials(insecure.NewCredentials()), | ||
| grpc.WithResolvers(&testResolverBuilder{logger: t, manualR: mr}), | ||
| grpc.WithIdleTimeout(time.Second), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid assuming that a 1s interval is sufficient for 2 attempts to fail?
One solution might be to split this into two tests:
- Test that setting a sufficiently large idle timeout keeps the channel in TF and doesn't trigger the resolver builder again.
- Set a defaultTestShortIdleTimeout and ensure the channel exits TF, calling the resolver builder again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had a way to enter IDLE mode from tests. But looks like that goes through the regular idleness logic. So, I added another one to forcefully put the channel in IDLE (for testing purposes), and changed the test to forcefully enter IDLE after it enters TF.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@easwars the way the test can flake is if there's a gap of more than 1s b/w the two RPC attempts. The sequence of events is:
- RPC 1 is made, results in
testResolverBuilder.Buildbeing called and failing. RPC fails with code UNAVAILABLE. - 1s passes, the channel enters IDLE mode.
- RPC 2 is made, results in
testResolverBuilder.Buildbeing called. Since the builder only returns an error for the first call, theBuildcall succeeds this time. The RPC doesn't fail with the expected error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is a valid sequence of events. But there is nothing else happening in this test at all when that for loop making the two RPCs is running. And the RPCs don't even get far enough to create an LB policy, create subchannels, create transports, none of that good stuff is happening. While I agree it could possible take a second after the first RPC runs before the second one gets to run, I feel the probability is miniscule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree 1s is sufficient, but it introduces a fixed latency—the test will always take the full second even if it could finish in milliseconds. It's not a blocker for this PR, but something to keep in mind.
dfawley
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great now, I think this is what we want. I think we should take about naming a bit (which is just textual changes), and I'm concerned about the ForceIdle thing. Otherwise LGTM!
internal/idle/idle.go
Outdated
| // MarkAsExitedIdleMode instructs the Manager to update its internal state to | ||
| // indicate that the channel has exited IDLE mode. This is only used by the gRPC | ||
| // client when it exits IDLE mode manually from Dial. | ||
| func (m *Manager) MarkAsExitedIdleMode() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably needs warnings on it. Generally you can't ever use it in a situation where you can possibly race with anything else calling ExitIdleMode or OnCallBegin, and you must know you are really idle.
Bikeshedding:
UnsafeSetNotIdle()? UnsafeSetActive()? InitializeAsActive() (to convey this is only to be used as part of creation)?
Aside but very related: Do we already say "active" is the opposite of "idle" anywhere? Or are the only states "in idle mode" and "not in idle mode". The latter feels a little awkward. Maybe we can rename things to use "active" vs "idle":
ExitIdleMode() -> Activate()
EnterIdleModeForTesting() -> GoIdle() // Do we really need to ForTesting this? It seems safe to use.
Further bikeshedding that's unrelated: I never found the "Enforcer" name to be intuitive either. I've always thought of the manager as the thing doing the enforcing. The object it's controlling is just enacting the idle manager's commands. How about Actor or Agent or Worker or Channel instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably needs warnings on it
Improved the docstring to make it sound more scary.
Bikeshedding
Went with UnsafeSetNotIdle
Do we already say "active" is the opposite of "idle" anywhere?
Yeah, we have never used the term "active" to mean the opposite of "idle" anywhere so far.
Maybe we can rename things to use "active" vs "idle"
We probably could do that. Can we do it in a follow up though?
EnterIdleModeForTesting() -> GoIdle() // Do we really need to ForTesting this? It seems safe to use.
Yes, we need this to test the race between the channel going idle after idle_timeout firing and an RPC trying to keep it from not going idle.
I never found the "Enforcer" name to be intuitive either.
I liked Channel since it conveyed the meaning directly, but eventually decided to go with ClientConn to keep it consistent with other places where an interface's functionality is provided by the client channel.
internal/idle/idle.go
Outdated
|
|
||
| // ForceEnterIdleModeForTesting bypasses the usual checks that happen before | ||
| // entering idle mode, and forcefully enter idle mode for testing purposes. | ||
| func (m *Manager) ForceEnterIdleModeForTesting() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this? It feels pretty dangerous to even have it. What are we doing that requires us to enter idle mode even with pending calls?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this so that we can force entering IDLE from tests after making RPCs (and completing those RPCs), but not having to set a small idle_timeout and be vulnerable to timing flakes.
Essentially instead of setting an idle_timeout of say 1s, making an RPC, and waiting for a second (or a little more) for the channel to enter idle, the test could make an RPC, and once the RPC is complete, it can force the channel to enter idle becuase it knows there are no pending calls.
The existing EnterIdleModeForTesting which I've renamed to TryEnterIdleModeForTesting works within the parameters of idleness and moves the channel to idle mode only if there has been no activity on the channel. We could have named it as FireIdleTimeoutForTesting because that is essentially what is happening as part of this method and is required for testing races between when the channel is trying to enter idle (as part of the idle timeout firing) and exit idle (as part an RPC).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coming back to this thread though: #8710 (comment)
Can we avoid assuming that a 1s interval is sufficient for 2 attempts to fail?
Actually the code wasn't assuming that a 1s interval was sufficient for 2 attempts to fail. The 1s interval just determined when the idle timeout would fire and everytime it fired, it would check if the channel had no activity since the last time it fired. I'm going to go back to the previous approach for the following reasons:
1sshould be plenty for these two attempts to fail because nothing happens in these two attempts, except the attempt to build the resolver which would fail, which would result in the RPC failing- Even if the two attempts took more than
1s, it wouldn't affect the correctness of the test as the idle timeout would fire and the idleness manager would see that there was activity on the channel and therefore would do nothing. And the next time it fires (or the next time), it would eventually find no activity on the channel and would move it to idle
This would allow me to get rid of this ForceEnterIdleModeForTesting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand. Why would we want to simulate the idle timeout firing? Anything operating at that low of a level should be internal to the idle package. Everything else should have a safe way to say "go idle now -- unless there are pending RPCs by the time this call gets in and the idle lock is taken".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you concerned more about the name or the functionality of that API itself? This is how it was before the change also. After reading your last comment, I agree that the new name FireIdleTimeoutForTesting seems very low level and can be changed back to EnterIdleModeForTesting. The problem though is that it is possible that it doesn't enter idle mode, because it uses the same logic that non-test code would use to decide if it should try entering idle mode or not. The problem with using separate logic in tests to enter idle mode is that we might not be testing the right set of things for example when we want to test the race between idle timeout firing (and trying to put the channel in idle) and an RPC happening at the same time (that is trying to keep the channel from entering idle).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment in the initial thread about how the 1s interval can cause the test to flake: #8710 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we might not be testing the right set of things for example when we want to test the race between idle timeout firing (and trying to put the channel in idle) and an RPC happening at the same time (that is trying to keep the channel from entering idle).
We shouldn't need these kinds of tests in the channel, though, should we? From the channel's perspective, it needs to properly deal with the race between RPC start and a call to EnterIdle() from the idle manager. It doesn't know or care about the mechanics of the timer firing. Testing the idle manager might require such unit-style tests, in which case "simulating the timer firing" can be a non-exported function. Maybe I'm missing something though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't need these kinds of tests in the channel, though, should we?
Looks like we have a unit style test for exactly the same thing and it does exactly what you are saying: one goroutine trying to fire the idle timeout and another one trying to call OnCallBeing to start an RPC
So, I ended up removing the test that triggers the same from the channel.
arjan-bal
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, there's still a discussion about removing ForceEnterIdleModeForTesting. I'm fine with either outcome.
internal/idle/idle.go
Outdated
| m.tryEnterIdleMode() | ||
| // FireIdleTimeoutForTesting forcefully triggers the idle timeout to fire. | ||
| func (m *Manager) FireIdleTimeoutForTesting() { | ||
| m.handleIdleTimeout() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From your other comment:
Are you concerned more about the name or the functionality of that API itself? This is how it was before the change also.
But...this is a behavior change, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've brought back EnterIdleModeForTesting now, but the implementation takes inspiration from how to exit idle from Dial. So the implementation now calls the unexported enterIdleMode on the clientconn and calls a method on the idleness manager to let it know that we have entered idle (so go update your internal state).
This also means that I don't have to use the 1s idle_timeout from one of those tests, and also not directly cause the idle timeout to fire from the test, but instead use a higher level API to ask the clientconn to enter idle mode.
|
@dfawley |
| // | ||
| // N.B. This method is intended only for testing purposes. The caller must | ||
| // ensure that there are no ongoing RPCs | ||
| func (m *Manager) UnsafeSetIdleForTesting() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you prefer this over the old version? This is dangerous while the old one was safe. And the old one will work in all the same scenarios that the new one will work in (when you know for sure the channel is not able to perform RPCs), but not vice-versa.
| cc.csMgr.updateState(connectivity.TransientFailure) | ||
| cc.mu.Lock() | ||
| cc.updateResolverStateAndUnlock(resolver.State{}, err) | ||
| return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should augment this error to say something about the resolver building failing (or the wrapper should?) so that it's clear when we fail to exit idle mode that the subsequent text came from the name resolver.
| errCh := make(chan error) | ||
| ccr.serializer.TrySchedule(func(ctx context.Context) { | ||
| if ctx.Err() != nil { | ||
| errCh <- ctx.Err() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, it seems like there should be a way to statically check something like this! 👀
| if channelz.IsOn() { | ||
| cc.incrCallsFailed() | ||
| } | ||
| // Invoke all the registered OnFinish call options explicitly. A | ||
| // non-nil error means that the stream wasn't created, and | ||
| // therefore these will be NOT be invoked as part of `cs.finish()`. | ||
| for _, o := range opts { | ||
| if o, ok := o.(OnFinishCallOption); ok { | ||
| o.OnFinish(err) | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's see if we can find a better way to unify these things?
Maybe we can have a shared endOfRPC function that is called by finish and from here?
Fixes #7686
Current Behavior
Connectbeing called or because of an RPC), if name resolver creation fails, we stay in IDLE.New Behavior
Connectbeing called or because of an RPC), we have already moved to CONNECTING (because of the previous bullet point). If name resolver creation fails, we will move to TRANSIENT_FAILURE and start the idle timer and move back to IDLE when the timer firesImplementation details:
ExitIdleModereturns an error. Since the channel moves to TRANSIENT_FAILURE in this scenario, the Manager must correctly reflect this state and resume activity tracking.OnFinishcall options are now invoked even if stream creation fails during an RPC. This fulfills the guarantee for these options and ensures the idleness Manager’sactiveCallsCountremains accurate.RELEASE NOTES:
OnFinishcall option not being invoked for RPCs where stream creation failed.