Skip to content

Conversation

@masaori335
Copy link
Contributor

@masaori335 masaori335 commented Dec 1, 2025

We observed odd crashes, one of core file captured infinite recursive call of CacheVC::openReadStartEarliest. I tried to reproduce this with unit test, but haven't succeeded yet. However, we can detect the loop and return error.

#0  0x000055cec642c084 in RamCacheLRU::get (this=0x7fe2a3e04230, key=0x7fe0b396ec48, ret_data=0x7fe0b396ed68, auxkey=229618572) at /src/iocore/cache/RamCacheLRU.cc:136
#1  0x000055cec6416d75 in CacheVC::load_from_ram_cache (this=0x7fe0b396ec00) at /src/iocore/cache/CacheVC.cc:561
#2  CacheVC::handleRead (this=0x7fe0b396ec00) at /src/iocore/cache/CacheVC.cc:505
#3  0x000055cec6412544 in CacheVC::do_read_call (this=0x7fe0b396ec00, akey=0x7fe0b396ec48) at /src/iocore/cache/P_CacheInternal.h:279
#4  CacheVC::openReadStartEarliest (this=0x7fe0b396ec00) at /src/iocore/cache/CacheRead.cc:949
#5  0x000055cec64125a1 in Continuation::handleEvent (this=<optimized out>, event=3900, data=0x0) at /include/iocore/eventsystem/Continuation.h:236
#6  CacheVC::openReadStartEarliest (this=<optimized out>) at /src/iocore/cache/CacheRead.cc:1020
#7  0x000055cec64125a1 in Continuation::handleEvent (this=<optimized out>, event=3900, data=0x0) at /include/iocore/eventsystem/Continuation.h:236
#8  CacheVC::openReadStartEarliest (this=<optimized out>) at /src/iocore/cache/CacheRead.cc:1020
#9  0x000055cec64125a1 in Continuation::handleEvent (this=<optimized out>, event=3900, data=0x0) at /include/iocore/eventsystem/Continuation.h:236
#10 CacheVC::openReadStartEarliest (this=<optimized out>) at /src/iocore/cache/CacheRead.cc:1020
#11 0x000055cec64125a1 in Continuation::handleEvent (this=<optimized out>, event=3900, data=0x0) at /include/iocore/eventsystem/Continuation.h:236
#12 CacheVC::openReadStartEarliest (this=<optimized out>) at /src/iocore/cache/CacheRead.cc:1020
#13 0x000055cec64125a1 in Continuation::handleEvent (this=<optimized out>, event=3900, data=0x0) at /include/iocore/eventsystem/Continuation.h:236
#14 CacheVC::openReadStartEarliest (this=<optimized out>) at /src/iocore/cache/CacheRead.cc:1020
#15 0x000055cec64125a1 in Continuation::handleEvent (this=<optimized out>, event=3900, data=0x0) at /include/iocore/eventsystem/Continuation.h:236
#16 CacheVC::openReadStartEarliest (this=<optimized out>) at /src/iocore/cache/CacheRead.cc:1020
#17 0x000055cec64125a1 in Continuation::handleEvent (this=<optimized out>, event=3900, data=0x0) at /include/iocore/eventsystem/Continuation.h:236
#18 CacheVC::openReadStartEarliest (this=<optimized out>) at /src/iocore/cache/CacheRead.cc:1020
#19 0x000055cec64125a1 in Continuation::handleEvent (this=<optimized out>, event=3900, data=0x0) at /include/iocore/eventsystem/Continuation.h:236
...
#18445 0x000055cec65488da in EThread::execute_regular (this=0x7fe2b0980e00) at /src/iocore/eventsystem/UnixEThread.cc:299
#18446 0x000055cec6548e84 in EThread::execute (this=0x7fe2b0980e00) at /src/iocore/eventsystem/UnixEThread.cc:348
#18447 0x000055cec6546cf7 in spawn_thread_internal (a=0x7fe2b2c85920) at /src/iocore/eventsystem/Thread.cc:75
#18448 0x00007fe2b30897e2 in start_thread () from /lib64/libc.so.6
#18449 0x00007fe2b310e800 in clone3 () from /lib64/libc.so.6   

FWIW, warning logs of truncated document are observed before the crash.

[ET_NET 19] WARNING: Document DBE12FF7 truncated at 118480952 of -833896214, missing fragment 5622DCEC
[ET_NET 38] WARNING: Document DBE12FF7 truncated at 16145624 of -833896214, missing fragment 56483DE3
[ET_NET 36] WARNING: Document DBE12FF7 truncated at 19555584 of -833896214, missing fragment 25673E0F
[ET_NET 33] WARNING: Document DBE12FF7 truncated at 0 of -833896214, missing fragment 4B78A4D6
[ET_NET 20] WARNING: Document DBE12FF7 truncated at 4187464 of -833896214, missing fragment FF88126F
[ET_NET 25] WARNING: Document DBE12FF7 truncated at 4673944 of -833896214, missing fragment 2CD72334
[ET_NET 35] WARNING: Document DBE12FF7 truncated at 46973400 of -833896214, missing fragment 703DE205
[ET_NET 28] WARNING: Document DBE12FF7 truncated at 7038352 of -833896214, missing fragment 7C490333
[ET_NET 20] WARNING: Document DBE12FF7 truncated at 35631856 of -833896214, missing fragment 4FC3577
[ET_NET 27] WARNING: Document DBE12FF7 truncated at 98559376 of -833896214, missing fragment 7C490333

The DBE12FF7 is the first key in this CacheVC, so the warning log must be related.

(gdb) p/x this->first_key.u32[1] ///< first_key.slice(1)
$12 = 0xdbe12ff7

@masaori335 masaori335 added this to the 10.2.0 milestone Dec 1, 2025
@masaori335 masaori335 self-assigned this Dec 1, 2025
@bryancall bryancall self-requested a review December 1, 2025 22:35
bryancall
bryancall previously approved these changes Dec 2, 2025
Copy link
Contributor

@bryancall bryancall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ LGTM - Approve

This is a solid, conservative fix that acts as a circuit breaker for abnormal cache conditions. The approach is minimal, low-risk, and follows existing patterns in the codebase.

What This Fixes

Prevents stack overflow when openReadStartEarliest recursively calls itself via handleEvent(AIO_EVENT_DONE) at Lcallreturn. This occurs when:

  • Multiple directory probes are needed (hash collisions, invalid entries, corrupt docs)
  • All reads hit RAM cache/aggregation buffer (synchronous EVENT_RETURN)
  • Each synchronous return causes immediate re-entry before returning to event loop

Why The Limit Is Important

The limit of 10 isn't just preventing stack overflow - it's detecting abnormal conditions. If we're recursing 10+ times, something is wrong (cache corruption, pathological hash collisions, etc.). The limit acts as a circuit breaker to:

  • Stop gracefully rather than looping forever through bad cache data
  • Log the error with the problematic cache key for debugging
  • Return CACHE_EVENT_OPEN_READ_FAILED rather than hanging or crashing

An iteration-based approach would be dangerous here - it would convert stack overflow into potential infinite loops on corrupted cache data.

Why This Approach Works

  1. Reuses existing infrastructure - The recursive counter already exists and is used throughout CacheVC
  2. Minimal changes - Only 18 lines added, very localized
  3. Safe failure mode - Fails gracefully with error logging
  4. Tested - All tests pass with ASan enabled

Testing

Verified all 109 unit tests pass with ASan enabled.

Recommendation: Merge. This is the right approach - detecting and stopping abnormal behavior rather than trying to handle unlimited recursion.

@bryancall bryancall requested a review from Copilot December 2, 2025 20:45
Copilot finished reviewing on behalf of bryancall December 2, 2025 20:50
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request addresses a critical production issue where infinite recursive calls in CacheVC::openReadStartEarliest cause crashes. The fix adds recursion detection with a counter and error handling to break the infinite loop after 10 recursive calls. The issue appears to be related to cache corruption (truncated documents with missing fragments), which causes the function to repeatedly call itself through do_read_call()handleRead()handleEvent(AIO_EVENT_DONE)openReadStartEarliest().

Key Changes:

  • Added recursion depth tracking using the existing recursive member variable
  • Implemented a limit of 10 recursive calls with error logging when exceeded
  • Modified return path to properly manage the recursion counter

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@masaori335
Copy link
Contributor Author

@bryancall Please take another look. I addressed comments from Copilot.

@masaori335
Copy link
Contributor Author

[approve ci ubuntu]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

2 participants