Skip to content

Conversation

@mbissa
Copy link
Contributor

@mbissa mbissa commented Dec 2, 2025

Addresses : https://github.com/grpc/proposal/blob/master/A94-subchannel-otel-metrics.md

this PR adds sub channel metrics with applicable labels as per the RFC proposal.

RELEASE NOTES:

  • stats/otel: add subchannel metrics to eventually replace the pickfirst metrics

@mbissa mbissa added this to the 1.78 Release milestone Dec 2, 2025
@mbissa mbissa requested a review from easwars December 2, 2025 19:58
@mbissa mbissa added Type: Feature New features or improvements in behavior Area: Observability Includes Stats, Tracing, Channelz, Healthz, Binlog, Reflection, Admin, GCP Observability labels Dec 2, 2025
@codecov
Copy link

codecov bot commented Dec 2, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.35%. Comparing base (432bda3) to head (0dceb8f).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8738      +/-   ##
==========================================
+ Coverage   83.22%   83.35%   +0.13%     
==========================================
  Files         419      419              
  Lines       32454    32504      +50     
==========================================
+ Hits        27009    27095      +86     
+ Misses       4057     4020      -37     
- Partials     1388     1389       +1     
Files with missing lines Coverage Δ
clientconn.go 90.71% <100.00%> (+0.20%) ⬆️
internal/internal.go 100.00% <ø> (ø)
internal/transport/http2_client.go 92.23% <100.00%> (-0.39%) ⬇️
internal/transport/transport.go 90.86% <ø> (+2.15%) ⬆️
internal/xds/xds.go 81.08% <100.00%> (+6.08%) ⬆️

... and 25 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@easwars
Copy link
Contributor

easwars commented Dec 2, 2025

@arjan-bal

A94 states the following:

If we end up discarding connection attempts as we do with the “happy eyeballs” algorithm
(as per A61), we should not record the connection attempt or the disconnection.

How are we currently handling this in the pickfirst metrics?

@easwars
Copy link
Contributor

easwars commented Dec 2, 2025

@mbissa

A94 states the following:

Implementations that have already implemented the pick-first metrics should give 
enough time for users to transition to the new metrics. For example, implementations 
should report both the old pick-first metrics and the new subchannel metrics for
2 releases, and then remove the old pick-first metrics.

Can you please ensure that we have an issue filed to track the removal of the old metrics and that it captures the correct release where it needs to be removed.

@arjan-bal
Copy link
Contributor

@arjan-bal

A94 states the following:

If we end up discarding connection attempts as we do with the “happy eyeballs” algorithm
(as per A61), we should not record the connection attempt or the disconnection.

How are we currently handling this in the pickfirst metrics?

Here is how pickfirst handles this:

  1. When a subchannel connects, all remaining subchannels are removed from pickfirst's subchannel map.
    if newState.ConnectivityState == connectivity.Ready {
    connectionAttemptsSucceededMetric.Record(b.metricsRecorder, 1, b.target)
    b.shutdownRemainingLocked(sd)
  2. The updateSubConnState method ignores any updates from subchannels that are not in the subchannel map.
    // Previously relevant SubConns can still callback with state updates.
    // To prevent pickers from returning these obsolete SubConns, this logic
    // is included to check if the current list of active SubConns includes this
    // SubConn.
    if !b.isActiveSCData(sd) {
    return
    }

Description: "EXPERIMENTAL. Number of open connections.",
Unit: "{attempt}",
Labels: []string{"grpc.target"},
OptionalLabels: []string{"grpc.lb.backend_service", "grpc.security_level", "grpc.lb.locality"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if the order of these labels matters, but the order here is not consistent with the order specified in the gRFC.

Comment on lines +55 to +58
labels := make(map[string]string)
labels["grpc.lb.locality"] = locality
labels["grpc.lb.backend_service"] = cluster
return labels
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this can be shorted using the literal initialization syntax in Go for maps:

	return map[string]string {
	    "grpc.lb.locality": locality,
	    "grpc.lb.backend_service": cluster,
    }

// address atrributes.
func AddressToTelemetryLabels(addr resolver.Address) map[string]string {
cluster, _ := GetXDSHandshakeClusterName(addr.Attributes)
locality := LocalityString(GetLocalityID(addr))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who sets the locality attribute on the address? I see that there is a SetLocality function, but there are no callers of it.

Comment on lines +1375 to +1382
var locality, backendService string
labelsFromAddress, ok := internal.AddressToTelemetryLabels.(func(resolver.Address) map[string]string)
if len(ac.addrs) > 0 && internal.AddressToTelemetryLabels != nil && ok {
labels := labelsFromAddress(ac.addrs[0])
locality = labels["grpc.lb.locality"]
backendService = labels["grpc.lb.backend_service"]
}
return locality, backendService
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could also be shorted as follows:

Suggested change
var locality, backendService string
labelsFromAddress, ok := internal.AddressToTelemetryLabels.(func(resolver.Address) map[string]string)
if len(ac.addrs) > 0 && internal.AddressToTelemetryLabels != nil && ok {
labels := labelsFromAddress(ac.addrs[0])
locality = labels["grpc.lb.locality"]
backendService = labels["grpc.lb.backend_service"]
}
return locality, backendService
labelsFunc, ok := internal.AddressToTelemetryLabels.(func(resolver.Address) map[string]string)
if !ok || len(ac.addrs) == 0 {
return "", ""
}
labels := labelsFunc(ac.addrs[0])
return labels["grpc.lb.locality"], labels["grpc.lb.backend_service"]

Follows from the principle of handling errors before proceeding to the rest of the code: See: go/go-style/decisions#indent-error-flow

func (ac *addrConn) securityLevel() string {
var secLevel string
if ac.transport == nil {
secLevel, _ = ac.curAddr.Attributes.Value(securityLevelKey{}).(string)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a valid security_level when the connection is established (and the upDown counter is being incremented), and there is no valid security_level when the connection is disconnected (and the upDown counter is being decremented), does it affect the metric in anyway since the labels are going to be different on the two calls?

Also, an empty string is not a valid value for the security level as per the gRFC:

Allowed values - "none", "integrity_only" and "privacy_and_integrity".

So, would it make sense to drop the nil check here and instead pass none from where the disconnections are handled? Or if the change in the optional labels matters, would we have to store the security_level in the addrConn when the connection is established so that we can use the same value when the connection is terminated?

}
locality, backendService := fetchLabels(ac)
if ac.state == connectivity.Ready || (ac.state == connectivity.Connecting && s == connectivity.Idle) {
disconnectionsMetric.Record(ac.cc.metricsRecorderList, 1, ac.cc.target, backendService, locality, "unknown")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the disconnect_error label set to unknown here? The gRFC talks about a set of possible values for this label.

if ac.state == s {
return
}
locality, backendService := fetchLabels(ac)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of fetching the labels everytime the connectivity state changes, would it make sense to fetch them when the addrConn is created as part of handling NewSubConn and updating the labels when UpdateAddresses is invoked?

@easwars easwars removed their assignment Dec 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area: Observability Includes Stats, Tracing, Channelz, Healthz, Binlog, Reflection, Admin, GCP Observability Type: Feature New features or improvements in behavior

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants