-
Notifications
You must be signed in to change notification settings - Fork 841
Break recursive call of CacheVC::openReadStartEarliest #12710
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
bryancall
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅ LGTM - Approve
This is a solid, conservative fix that acts as a circuit breaker for abnormal cache conditions. The approach is minimal, low-risk, and follows existing patterns in the codebase.
What This Fixes
Prevents stack overflow when openReadStartEarliest recursively calls itself via handleEvent(AIO_EVENT_DONE) at Lcallreturn. This occurs when:
- Multiple directory probes are needed (hash collisions, invalid entries, corrupt docs)
- All reads hit RAM cache/aggregation buffer (synchronous
EVENT_RETURN) - Each synchronous return causes immediate re-entry before returning to event loop
Why The Limit Is Important
The limit of 10 isn't just preventing stack overflow - it's detecting abnormal conditions. If we're recursing 10+ times, something is wrong (cache corruption, pathological hash collisions, etc.). The limit acts as a circuit breaker to:
- Stop gracefully rather than looping forever through bad cache data
- Log the error with the problematic cache key for debugging
- Return
CACHE_EVENT_OPEN_READ_FAILEDrather than hanging or crashing
An iteration-based approach would be dangerous here - it would convert stack overflow into potential infinite loops on corrupted cache data.
Why This Approach Works
- Reuses existing infrastructure - The
recursivecounter already exists and is used throughoutCacheVC - Minimal changes - Only 18 lines added, very localized
- Safe failure mode - Fails gracefully with error logging
- Tested - All tests pass with ASan enabled
Testing
Verified all 109 unit tests pass with ASan enabled.
Recommendation: Merge. This is the right approach - detecting and stopping abnormal behavior rather than trying to handle unlimited recursion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request addresses a critical production issue where infinite recursive calls in CacheVC::openReadStartEarliest cause crashes. The fix adds recursion detection with a counter and error handling to break the infinite loop after 10 recursive calls. The issue appears to be related to cache corruption (truncated documents with missing fragments), which causes the function to repeatedly call itself through do_read_call() → handleRead() → handleEvent(AIO_EVENT_DONE) → openReadStartEarliest().
Key Changes:
- Added recursion depth tracking using the existing
recursivemember variable - Implemented a limit of 10 recursive calls with error logging when exceeded
- Modified return path to properly manage the recursion counter
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@bryancall Please take another look. I addressed comments from Copilot. |
|
[approve ci ubuntu] |
We observed odd crashes, one of core file captured infinite recursive call of
CacheVC::openReadStartEarliest. I tried to reproduce this with unit test, but haven't succeeded yet. However, we can detect the loop and return error.FWIW, warning logs of truncated document are observed before the crash.
The
DBE12FF7is the first key in thisCacheVC, so the warning log must be related.