[fix](checker) Avoid false-positive leaked delete bitmaps for unexpired job tmp rowsets#64313
[fix](checker) Avoid false-positive leaked delete bitmaps for unexpired job tmp rowsets#64313mymeiyi wants to merge 2 commits into
Conversation
…red job tmp rowsets The delete bitmap inverted check reported a delete bitmap as leaked when its owning rowset was not found among the committed rowsets / pending delete bitmaps of the tablet. However, a compaction/SC job writes delete bitmaps keyed by its tmp output rowset id before the job commits and makes the rowset visible. During that window the delete bitmap exists while the rowset is not yet in the committed list, so it was wrongly flagged as leaked. Add InstanceChecker::collect_unexpired_job_tmp_rowsets() to gather, per tablet, the rowset ids of job tmp rowsets that are still alive from the recycler's point of view, and skip them during the inverted check. The same calculate_tmp_rowset_expired_time() threshold as the recycler is reused (now exposed via recycler.h) so a delete bitmap is never reported as leaked while its tmp rowset is still alive from the recycler's view. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
There was a problem hiding this comment.
Review conclusion: request changes.
Critical checkpoints:
- Goal/test coverage: The PR aims to avoid false leaked-delete-bitmap reports for still-live job tmp rowsets and adds focused unit coverage for unexpired, expired, and non-job tmp rowsets. The main path is covered, but the new overflow path is not safely handled.
- Scope/focus: The change is small and localized to the Cloud recycler checker.
- Concurrency/lifecycle: No new shared mutable state or long-lived lifecycle issue was found; this is a synchronous checker scan.
- Configuration/compatibility: No new config or storage/protocol compatibility change.
- Parallel paths: The checker now reuses the recycler tmp-rowset expiration helper, which is appropriate.
- Error handling/data correctness: One issue: when the collection cap is reached, the checker continues with incomplete tmp-rowset state and can report live delete bitmaps as leaked.
- Testing: Positive/negative tests were added, but the cap/overflow case that causes false positives is missing.
- Observability/performance: Logging exists for the cap, but logging does not prevent an incorrect checker result.
User focus: No additional user-provided focus points were listed.
| << ", num_expired=" << num_expired | ||
| << ", limit=" << max_unexpired_tmp_rowsets; | ||
| return 0; | ||
| } |
There was a problem hiding this comment.
Returning success here lets do_delete_bitmap_inverted_check() continue with an intentionally incomplete unexpired_tmp_rowsets map. In a cluster with more than 1000 unexpired compaction/schema-change tmp rowsets, any live tmp rowset after the cap is not considered by the later belongs_to_unexpired_tmp_rowset check, so its delete bitmap is reported as leaked and the checker returns failure even though the rowset is still within the same retention window used by the recycler. Please either keep collecting all live tmp rowsets or fail/skip this check when the cap is reached, rather than producing a known false positive.
|
PR approved by at least one committer and no changes requested. |
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
The delete bitmap inverted check reported a delete bitmap as leaked when its owning rowset was not found among the committed rowsets / pending delete bitmaps of the tablet. However, a compaction job writes delete bitmaps keyed by its tmp output rowset id before the job commits and makes the rowset visible. During that window the delete bitmap exists while the rowset is not yet in the committed list, so it was wrongly flagged as leaked.
Add InstanceChecker::collect_unexpired_job_tmp_rowsets() to gather, per tablet, the rowset ids of job tmp rowsets that are still alive from the recycler's point of view, and skip them during the inverted check. The same calculate_tmp_rowset_expired_time() threshold as the recycler is reused (now exposed via recycler.h) so a delete bitmap is never reported as leaked while its tmp rowset is still alive from the recycler's view.