@yknoya (Contributor) commented Dec 3, 2025

Problem

#9181 introduced an issue where an origin server was marked as down even though a connection had been successfully established.

This issue occurs under the following conditions (an example configuration is sketched after the list):

  1. proxy.config.http.server_session_sharing.match is set to a value other than none (i.e., server session reuse is enabled).
  2. A server session is reused when connecting to the origin.
  3. The connection is closed after sending a request to the origin.
  4. Condition 3 occurs repeatedly until the accumulated failure count reaches the threshold defined by proxy.config.http.connect_attempts_rr_retries.
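
As an illustration, the two settings involved might look like this (a records.config-style sketch; the values are examples only, since any non-none match value and a small retry threshold reproduce the problem):

    CONFIG proxy.config.http.server_session_sharing.match STRING both
    CONFIG proxy.config.http.connect_attempts_rr_retries INT 3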

The issue has been confirmed in the following branches/versions (other versions not tested):

  • master
  • 10.1.0
  • 9.2.11

Cause

When ATS begins processing an origin connection, it executes t_state.set_connect_fail(EIO) to tentatively set connect_result to EIO:

    t_state.set_connect_fail(EIO);

Internally, set_connect_fail() stores the given errno value in connect_result:

    this->current.server->connect_result = e;

If server session reuse is not possible, connect_result is cleared once the connection is established:

    t_state.current.server->clear_connect_fail();

However, when a server session is reused, connect_result is never cleared and remains set to EIO.
This regression was introduced by the change in #9181.

Before the PR was merged, t_state.set_connect_fail(EIO) was not executed when a server session was reused.
After the PR, it is executed regardless of whether a server session is reused or not.
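
To make the regression concrete, here is a minimal self-contained model of the post-#9181 flow. It is illustrative only; the types and control flow are simplified stand-ins, not ATS source:

    // Simplified model of the post-#9181 flow; not ATS source code.
    #include <cerrno>
    #include <iostream>

    struct Server {
      int connect_result = 0;
      void set_connect_fail(int e) { connect_result = e; }
      void clear_connect_fail() { connect_result = 0; }
    };

    void open_origin(Server &s, bool session_reused) {
      s.set_connect_fail(EIO); // tentative failure, set for every attempt
      if (!session_reused) {
        // ... a new TCP/TLS connection is established here ...
        s.clear_connect_fail(); // only this path clears the tentative EIO
      }
      // Reused session: nothing clears connect_result, so it stays EIO.
    }

    int main() {
      Server s;
      open_origin(s, /*session_reused=*/true);
      std::cout << "connect_result=" << s.connect_result << "\n"; // EIO, not 0
    }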

With connect_result incorrectly left as EIO, if the connection is closed after sending a request to the origin, the following call chain reaches HttpSM::mark_host_failure and increments fail_count (a simplified model of the threshold check follows the list):

  1. handle_response_from_server(s);
  2. handle_server_connection_not_open(s);
  3. s->state_machine->do_hostdb_update_if_necessary();
  4. this->mark_host_failure(&t_state.dns_info, ts_clock::from_time_t(t_state.client_request_time));
  5. if (auto [down, fail_count] = info->active->increment_fail_count(time_down, t_state.txn_conf->connect_attempts_rr_retries);
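
Step 5 is where the threshold applies. As a hedged illustration, the check can be modeled with a hypothetical HostRecord type (the name and the simplified signature are assumed here; the real increment_fail_count also takes a time-down argument):

    // Hypothetical model of the rr-retries threshold; not the real HostDB API.
    #include <iostream>
    #include <utility>

    struct HostRecord {
      int fail_count = 0;
      // Returns {down, fail_count}; down becomes true once the count
      // reaches the configured connect_attempts_rr_retries threshold.
      std::pair<bool, int> increment_fail_count(int threshold) {
        ++fail_count;
        return {fail_count >= threshold, fail_count};
      }
    };

    int main() {
      HostRecord rec;
      const int rr_retries = 3; // proxy.config.http.connect_attempts_rr_retries
      for (int i = 0; i < rr_retries; ++i) {
        auto [down, count] = rec.increment_fail_count(rr_retries);
        std::cout << "fail_count=" << count << " down=" << down << "\n";
      }
      // With the bug, every reused-session request whose connection closes
      // after the request takes this path, so down eventually flips to true
      // even though every connect actually succeeded.
    }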

If this happens repeatedly and reaches the threshold defined by proxy.config.http.connect_attempts_rr_retries, the origin server is incorrectly marked as down:

    if (auto [down, fail_count] = info->active->increment_fail_count(time_down, t_state.txn_conf->connect_attempts_rr_retries);
        down) {
      char *url_str = t_state.hdr_info.client_request.url_string_get_ref(nullptr);
      std::string_view host_name{t_state.unmapped_url.host_get()};
      swoc::bwprint(error_bw_buffer, "CONNECT : {::s} connecting to {} for host='{}' url='{}' fail_count='{}' marking down",
                    swoc::bwf::Errno(t_state.current.server->connect_result), t_state.current.server->dst_addr, host_name,
                    swoc::bwf::FirstOf(url_str, "<none>"), fail_count);
      Log::error("%s", error_bw_buffer.c_str());
      SMDbg(dbg_ctl_http, "hostdb update marking IP: %s as down", addrbuf);
      ATS_PROBE2(hostdb_mark_ip_as_down, sm_id, addrbuf);

Since the connection to the origin is actually successful, marking it as down is incorrect.

Fix

Update the logic so that t_state.set_connect_fail(EIO) is executed only when establishing a new connection to the origin (i.e., when a server session is not reused), and ensure that connect_result is cleared once the connection succeeds.

Additionally, when multiplexed_origin is true, connect_result was also not cleared after a successful connection.
In this case, t_state.set_connect_fail(EIO) is executed (see below), but the lack of a corresponding clear operation leaves connect_result set to EIO:

    if (multiplexed_origin) {
      EThread *ethread = this_ethread();
      if (nullptr != ethread->connecting_pool) {
        SMDbg(dbg_ctl_http_ss, "Queue multiplexed request");
        new_entry = new ConnectingEntry();
        new_entry->mutex = this->mutex;
        new_entry->ua_txn = _ua.get_txn();
        new_entry->handler = (ContinuationHandler)&ConnectingEntry::state_http_server_open;
        new_entry->ipaddr.assign(&t_state.current.server->dst_addr.sa);
        new_entry->hostname = t_state.current.server->name;
        new_entry->sni = this->get_outbound_sni();
        new_entry->cert_name = this->get_outbound_cert();
        new_entry->is_no_plugin_tunnel = plugin_tunnel_type == HttpPluginTunnel_t::NONE;
        this->t_state.set_connect_fail(EIO);
        new_entry->connect_sms.insert(this);
        ethread->connecting_pool->m_ip_pool.insert(std::make_pair(new_entry->ipaddr, new_entry));
      }
    }

This patch ensures that connect_result is cleared whenever the connection succeeds, regardless of whether multiplexed_origin is enabled.
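
Putting both cases together, the intended post-fix behavior can be sketched as a simplified model (the same illustrative types as above, not the actual patch):

    // Simplified model of the fixed flow; not the actual patch.
    #include <cerrno>
    #include <iostream>

    struct Server {
      int connect_result = 0;
      void set_connect_fail(int e) { connect_result = e; }
      void clear_connect_fail() { connect_result = 0; }
    };

    void open_origin_fixed(Server &s, bool session_reused) {
      if (!session_reused) {
        s.set_connect_fail(EIO); // tentative, only for a genuinely new connect
      }
      // ... the connection is established: new, reused, or handed over by
      // the multiplexed ConnectingEntry path ...
      s.clear_connect_fail(); // every success path clears the tentative EIO
    }

    int main() {
      for (bool reused : {false, true}) {
        Server s;
        open_origin_fixed(s, reused);
        std::cout << "reused=" << reused
                  << " connect_result=" << s.connect_result << "\n"; // always 0
      }
    }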

@yknoya yknoya marked this pull request as ready for review December 4, 2025 00:55
@bryancall bryancall requested a review from Copilot December 4, 2025 02:49
@bryancall bryancall added this to the 10.2.0 milestone Dec 4, 2025
Copilot finished reviewing on behalf of bryancall December 4, 2025 02:52
Copilot AI left a comment


Pull request overview

This PR fixes a critical bug where origin servers were incorrectly marked as down when server session reuse was enabled. The issue was introduced in PR #9181 and affected versions 9.2.11, 10.1.0, and master.

Key Changes:

  • Moved the pre-emptive set_connect_fail(EIO) call from the general ORIGIN_SERVER_OPEN state action to the specific NET_EVENT_OPEN handler, ensuring it's only set when establishing a new connection
  • Added clear_connect_fail() call in the CONNECT_EVENT_TXN handler to properly clear connection failures for session reuse and multiplexed origin scenarios
  • Preserved existing clear_connect_fail() behavior for successful connection handshake events


