Fix issue where origins could be unintentionally marked as down #12729
Problem
#9181 introduced an issue where an origin server was marked as down even though a connection had been successfully established.
This issue occurs under the following conditions:
proxy.config.http.server_session_sharing.match is set to a value other than none (i.e., server session reuse is enabled).
proxy.config.http.connect_attempts_rr_retries.
The issue has been confirmed in the following branches/versions (other versions not tested):
Cause
When ATS begins processing an origin connection, it executes t_state.set_connect_fail(EIO) to tentatively set connect_result to EIO:
trafficserver/src/proxy/http/HttpSM.cc
Line 8054 in 90dbc21
trafficserver/include/proxy/http/HttpTransact.h
Line 932 in 90dbc21
If server session reuse is not possible, connect_result is cleared once the connection is established:
trafficserver/src/proxy/http/HttpSM.cc
Line 1860 in 90dbc21
However, when a server session is reused, connect_result is not cleared and remains set to EIO.
This regression was triggered by the change introduced in #9181.
Before the PR was merged, t_state.set_connect_fail(EIO) was not executed when a server session was reused.
After the PR, it is executed regardless of whether a server session is reused.
With connect_result incorrectly left as EIO, if the connection is closed after sending a request to the origin, the following call chain leads to execution of HttpSM::mark_host_failure, causing the fail_count to be incremented:
trafficserver/src/proxy/http/HttpTransact.cc
Line 3466 in 90dbc21
trafficserver/src/proxy/http/HttpTransact.cc
Line 3786 in 90dbc21
trafficserver/src/proxy/http/HttpTransact.cc
Line 3884 in 90dbc21
trafficserver/src/proxy/http/HttpSM.cc
Line 4630 in 90dbc21
trafficserver/src/proxy/http/HttpSM.cc
Line 5876 in 90dbc21
If this happens repeatedly and reaches the threshold defined by proxy.config.http.connect_attempts_rr_retries, the origin server is incorrectly marked as down:
trafficserver/src/proxy/http/HttpSM.cc
Lines 5876 to 5885 in 90dbc21
Since the connection to the origin is actually successful, marking it as down is incorrect.
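The failure mode described above can be sketched as a small standalone model. This is not ATS code: ConnectState, OriginHealth, and the three functions are hypothetical stand-ins, and the "fail_count >= rr_retries" comparison is an assumption about how the threshold is applied.

```cpp
#include <cerrno>

// Hypothetical stand-in for the per-transaction connect state (not the real ATS type).
struct ConnectState {
  int connect_result = 0; // 0 means no connect failure is recorded
  void set_connect_fail(int e) { connect_result = e; }
  void clear_connect_fail() { connect_result = 0; }
};

// Hypothetical per-origin bookkeeping.
struct OriginHealth {
  int fail_count = 0;
  bool down = false;
};

// Models the regressed behavior: EIO is set tentatively for every origin
// request, but it is cleared only on the new-connection path.
inline void begin_origin_request_buggy(ConnectState &s, bool session_reused) {
  s.set_connect_fail(EIO);
  if (!session_reused) {
    s.clear_connect_fail(); // a new connection was established successfully
  }
}

// Models the close-after-send path: a stale connect_result is counted as a
// connect failure. The ">=" threshold check is an assumption.
inline void on_close_after_send(const ConnectState &s, OriginHealth &h, int rr_retries) {
  if (s.connect_result != 0) {
    if (++h.fail_count >= rr_retries) {
      h.down = true;
    }
  }
}

// Runs `transactions` requests that all connect successfully and are then
// closed after the request is sent.
inline OriginHealth simulate_buggy(bool session_reused, int transactions, int rr_retries) {
  OriginHealth h;
  for (int i = 0; i < transactions; ++i) {
    ConnectState s;
    begin_origin_request_buggy(s, session_reused);
    on_close_after_send(s, h, rr_retries);
  }
  return h;
}
```

In this model, simulate_buggy(true, 3, 3) ends with the origin marked down even though every connection succeeded, while simulate_buggy(false, 3, 3) leaves fail_count at 0.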
Fix
Update the logic so that t_state.set_connect_fail(EIO) is executed only when establishing a new connection to the origin (i.e., when a server session is not reused), and ensure that connect_result is cleared once the connection succeeds.
Additionally, when multiplexed_origin is true, connect_result was also not being cleared after a successful connection.
In this case, although t_state.set_connect_fail(EIO) is executed (see below), the lack of a corresponding clear operation results in connect_result remaining EIO:
trafficserver/src/proxy/http/HttpSM.cc
Lines 5706 to 5723 in 90dbc21
This patch ensures that connect_result is cleared whenever the connection succeeds, regardless of whether multiplexed_origin is enabled.
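As a rough illustration of the before/after behavior, the sketch below contrasts the two policies for a connection that succeeds. It is a simplified model rather than the actual patch: TxnState and both functions are hypothetical names, and the pre-fix clearing condition is an assumption derived from the description above.

```cpp
#include <cerrno>

// Hypothetical stand-in for the transaction state (not the real ATS type).
struct TxnState {
  int connect_result = 0; // 0 means no connect failure is recorded
  void set_connect_fail(int e) { connect_result = e; }
  void clear_connect_fail() { connect_result = 0; }
};

// Pre-fix model: EIO is always set tentatively, and it is cleared only when
// a new, non-multiplexed connection is established (assumed condition).
inline int connect_result_before_fix(bool session_reused, bool multiplexed_origin) {
  TxnState s;
  s.set_connect_fail(EIO);
  if (!session_reused && !multiplexed_origin) {
    s.clear_connect_fail();
  }
  return s.connect_result;
}

// Post-fix model: EIO is set only when a new connection is being opened, and
// it is cleared on success regardless of multiplexed_origin.
inline int connect_result_after_fix(bool session_reused, bool multiplexed_origin) {
  TxnState s;
  if (!session_reused) {
    s.set_connect_fail(EIO); // tentative failure for the new connection only
  }
  (void)multiplexed_origin;  // the clear below no longer depends on this flag
  s.clear_connect_fail();    // the connection succeeded
  return s.connect_result;
}
```

Under these assumptions, connect_result_before_fix returns EIO for a reused session or a multiplexed origin, while connect_result_after_fix returns 0 for every combination of the two flags.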