@yknoya (Contributor) commented Dec 3, 2025

Problem

#9181 introduced an issue where an origin server was marked as down even though a connection had been successfully established.

This issue occurs under the following conditions (an example configuration is sketched after the list):

  1. proxy.config.http.server_session_sharing.match is set to a value other than none (i.e., server session reuse is enabled).
  2. A server session is reused when connecting to the origin.
  3. The connection is closed after sending a request to the origin.
  4. Condition 3 occurs repeatedly until the accumulated failure count reaches the threshold defined by proxy.config.http.connect_attempts_rr_retries.
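
As an illustration, the two settings involved might look like this (a records.config-style sketch; the values are examples only, since any non-none match value and a small retry threshold reproduce the problem):

    CONFIG proxy.config.http.server_session_sharing.match STRING both
    CONFIG proxy.config.http.connect_attempts_rr_retries INT 3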

The issue has been confirmed in the following branches/versions (other versions not tested):

  • master
  • 10.1.0
  • 9.2.11

Cause

When ATS begins processing an origin connection, it executes t_state.set_connect_fail(EIO) to tentatively set connect_result to EIO:

    t_state.set_connect_fail(EIO);

Internally, set_connect_fail() stores the given errno value in connect_result:

    this->current.server->connect_result = e;

If server session reuse is not possible, connect_result is cleared once the connection is established:

    t_state.current.server->clear_connect_fail();

However, when a server session is reused, connect_result is never cleared and remains set to EIO.
This regression was introduced by the change in #9181.

Before the PR was merged, t_state.set_connect_fail(EIO) was not executed when a server session was reused.
After the PR, it is executed regardless of whether a server session is reused or not.
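
To make the regression concrete, here is a minimal self-contained model of the post-#9181 flow. It is illustrative only; the types and control flow are simplified stand-ins, not ATS source:

    // Simplified model of the post-#9181 flow; not ATS source code.
    #include <cerrno>
    #include <iostream>

    struct Server {
      int connect_result = 0;
      void set_connect_fail(int e) { connect_result = e; }
      void clear_connect_fail() { connect_result = 0; }
    };

    void open_origin(Server &s, bool session_reused) {
      s.set_connect_fail(EIO); // tentative failure, set for every attempt
      if (!session_reused) {
        // ... a new TCP/TLS connection is established here ...
        s.clear_connect_fail(); // only this path clears the tentative EIO
      }
      // Reused session: nothing clears connect_result, so it stays EIO.
    }

    int main() {
      Server s;
      open_origin(s, /*session_reused=*/true);
      std::cout << "connect_result=" << s.connect_result << "\n"; // EIO, not 0
    }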

With connect_result incorrectly left as EIO, if the connection is closed after sending a request to the origin, the following call chain reaches HttpSM::mark_host_failure and increments fail_count (a simplified model of the threshold check follows the list):

  1. handle_response_from_server(s);
  2. handle_server_connection_not_open(s);
  3. s->state_machine->do_hostdb_update_if_necessary();
  4. this->mark_host_failure(&t_state.dns_info, ts_clock::from_time_t(t_state.client_request_time));
  5. if (auto [down, fail_count] = info->active->increment_fail_count(time_down, t_state.txn_conf->connect_attempts_rr_retries);
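
Step 5 is where the threshold applies. As a hedged illustration, the check can be modeled with a hypothetical HostRecord type (the name and the simplified signature are assumed here; the real increment_fail_count also takes a time-down argument):

    // Hypothetical model of the rr-retries threshold; not the real HostDB API.
    #include <iostream>
    #include <utility>

    struct HostRecord {
      int fail_count = 0;
      // Returns {down, fail_count}; down becomes true once the count
      // reaches the configured connect_attempts_rr_retries threshold.
      std::pair<bool, int> increment_fail_count(int threshold) {
        ++fail_count;
        return {fail_count >= threshold, fail_count};
      }
    };

    int main() {
      HostRecord rec;
      const int rr_retries = 3; // proxy.config.http.connect_attempts_rr_retries
      for (int i = 0; i < rr_retries; ++i) {
        auto [down, count] = rec.increment_fail_count(rr_retries);
        std::cout << "fail_count=" << count << " down=" << down << "\n";
      }
      // With the bug, every reused-session request whose connection closes
      // after the request takes this path, so down eventually flips to true
      // even though every connect actually succeeded.
    }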

If this happens repeatedly and reaches the threshold defined by proxy.config.http.connect_attempts_rr_retries, the origin server is incorrectly marked as down:

    if (auto [down, fail_count] = info->active->increment_fail_count(time_down, t_state.txn_conf->connect_attempts_rr_retries);
        down) {
      char *url_str = t_state.hdr_info.client_request.url_string_get_ref(nullptr);
      std::string_view host_name{t_state.unmapped_url.host_get()};
      swoc::bwprint(error_bw_buffer, "CONNECT : {::s} connecting to {} for host='{}' url='{}' fail_count='{}' marking down",
                    swoc::bwf::Errno(t_state.current.server->connect_result), t_state.current.server->dst_addr, host_name,
                    swoc::bwf::FirstOf(url_str, "<none>"), fail_count);
      Log::error("%s", error_bw_buffer.c_str());
      SMDbg(dbg_ctl_http, "hostdb update marking IP: %s as down", addrbuf);
      ATS_PROBE2(hostdb_mark_ip_as_down, sm_id, addrbuf);

Since the connection to the origin is actually successful, marking it as down is incorrect.

Fix

Update the logic so that t_state.set_connect_fail(EIO) is executed only when establishing a new connection to the origin (i.e., when a server session is not reused), and ensure that connect_result is cleared once the connection succeeds.

Additionally, when multiplexed_origin is true, connect_result was also not cleared after a successful connection.
In this case, t_state.set_connect_fail(EIO) is executed (see below), but the lack of a corresponding clear operation leaves connect_result set to EIO:

    if (multiplexed_origin) {
      EThread *ethread = this_ethread();
      if (nullptr != ethread->connecting_pool) {
        SMDbg(dbg_ctl_http_ss, "Queue multiplexed request");
        new_entry = new ConnectingEntry();
        new_entry->mutex = this->mutex;
        new_entry->ua_txn = _ua.get_txn();
        new_entry->handler = (ContinuationHandler)&ConnectingEntry::state_http_server_open;
        new_entry->ipaddr.assign(&t_state.current.server->dst_addr.sa);
        new_entry->hostname = t_state.current.server->name;
        new_entry->sni = this->get_outbound_sni();
        new_entry->cert_name = this->get_outbound_cert();
        new_entry->is_no_plugin_tunnel = plugin_tunnel_type == HttpPluginTunnel_t::NONE;
        this->t_state.set_connect_fail(EIO);
        new_entry->connect_sms.insert(this);
        ethread->connecting_pool->m_ip_pool.insert(std::make_pair(new_entry->ipaddr, new_entry));
      }
    }

This patch ensures that connect_result is cleared whenever the connection succeeds, regardless of whether multiplexed_origin is enabled.
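
Putting both cases together, the intended post-fix behavior can be sketched as a simplified model (the same illustrative types as above, not the actual patch):

    // Simplified model of the fixed flow; not the actual patch.
    #include <cerrno>
    #include <iostream>

    struct Server {
      int connect_result = 0;
      void set_connect_fail(int e) { connect_result = e; }
      void clear_connect_fail() { connect_result = 0; }
    };

    void open_origin_fixed(Server &s, bool session_reused) {
      if (!session_reused) {
        s.set_connect_fail(EIO); // tentative, only for a genuinely new connect
      }
      // ... the connection is established: new, reused, or handed over by
      // the multiplexed ConnectingEntry path ...
      s.clear_connect_fail(); // every success path clears the tentative EIO
    }

    int main() {
      for (bool reused : {false, true}) {
        Server s;
        open_origin_fixed(s, reused);
        std::cout << "reused=" << reused
                  << " connect_result=" << s.connect_result << "\n"; // always 0
      }
    }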

@yknoya yknoya marked this pull request as ready for review December 4, 2025 00:55
@bryancall bryancall requested a review from Copilot December 4, 2025 02:49
@bryancall bryancall added this to the 10.2.0 milestone Dec 4, 2025
Copilot finished reviewing on behalf of bryancall December 4, 2025 02:52
Copilot AI left a comment


Pull request overview

This PR fixes a critical bug where origin servers were incorrectly marked as down when server session reuse was enabled. The issue was introduced in PR #9181 and affected versions 9.2.11, 10.1.0, and master.

Key Changes:

  • Moved the pre-emptive set_connect_fail(EIO) call from the general ORIGIN_SERVER_OPEN state action to the specific NET_EVENT_OPEN handler, ensuring it's only set when establishing a new connection
  • Added clear_connect_fail() call in the CONNECT_EVENT_TXN handler to properly clear connection failures for session reuse and multiplexed origin scenarios
  • Preserved existing clear_connect_fail() behavior for successful connection handshake events


