Skip to content

[fix](checker) Avoid false-positive leaked delete bitmaps for unexpired job tmp rowsets#64313

Open
mymeiyi wants to merge 2 commits into
apache:masterfrom
mymeiyi:fix-dbm-checker
Open

[fix](checker) Avoid false-positive leaked delete bitmaps for unexpired job tmp rowsets#64313
mymeiyi wants to merge 2 commits into
apache:masterfrom
mymeiyi:fix-dbm-checker

Conversation

@mymeiyi

@mymeiyi mymeiyi commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

The delete bitmap inverted check reported a delete bitmap as leaked when its owning rowset was not found among the committed rowsets / pending delete bitmaps of the tablet. However, a compaction job writes delete bitmaps keyed by its tmp output rowset id before the job commits and makes the rowset visible. During that window the delete bitmap exists while the rowset is not yet in the committed list, so it was wrongly flagged as leaked.

Add InstanceChecker::collect_unexpired_job_tmp_rowsets() to gather, per tablet, the rowset ids of job tmp rowsets that are still alive from the recycler's point of view, and skip them during the inverted check. The same calculate_tmp_rowset_expired_time() threshold as the recycler is reused (now exposed via recycler.h) so a delete bitmap is never reported as leaked while its tmp rowset is still alive from the recycler's view.

…red job tmp rowsets

The delete bitmap inverted check reported a delete bitmap as leaked when its
owning rowset was not found among the committed rowsets / pending delete
bitmaps of the tablet. However, a compaction/SC job writes delete bitmaps
keyed by its tmp output rowset id before the job commits and makes the rowset
visible. During that window the delete bitmap exists while the rowset is not
yet in the committed list, so it was wrongly flagged as leaked.

Add InstanceChecker::collect_unexpired_job_tmp_rowsets() to gather, per tablet,
the rowset ids of job tmp rowsets that are still alive from the recycler's
point of view, and skip them during the inverted check. The same
calculate_tmp_rowset_expired_time() threshold as the recycler is reused (now
exposed via recycler.h) so a delete bitmap is never reported as leaked while
its tmp rowset is still alive from the recycler's view.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 9, 2026 10:28
@hello-stephen

Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment thread cloud/src/recycler/recycler.h
Comment thread cloud/src/recycler/checker.cpp
@mymeiyi

mymeiyi commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

/review

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review conclusion: request changes.

Critical checkpoints:

  • Goal/test coverage: The PR aims to avoid false leaked-delete-bitmap reports for still-live job tmp rowsets and adds focused unit coverage for unexpired, expired, and non-job tmp rowsets. The main path is covered, but the new overflow path is not safely handled.
  • Scope/focus: The change is small and localized to the Cloud recycler checker.
  • Concurrency/lifecycle: No new shared mutable state or long-lived lifecycle issue was found; this is a synchronous checker scan.
  • Configuration/compatibility: No new config or storage/protocol compatibility change.
  • Parallel paths: The checker now reuses the recycler tmp-rowset expiration helper, which is appropriate.
  • Error handling/data correctness: One issue: when the collection cap is reached, the checker continues with incomplete tmp-rowset state and can report live delete bitmaps as leaked.
  • Testing: Positive/negative tests were added, but the cap/overflow case that causes false positives is missing.
  • Observability/performance: Logging exists for the cap, but logging does not prevent an incorrect checker result.

User focus: No additional user-provided focus points were listed.

<< ", num_expired=" << num_expired
<< ", limit=" << max_unexpired_tmp_rowsets;
return 0;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning success here lets do_delete_bitmap_inverted_check() continue with an intentionally incomplete unexpired_tmp_rowsets map. In a cluster with more than 1000 unexpired compaction/schema-change tmp rowsets, any live tmp rowset after the cap is not considered by the later belongs_to_unexpired_tmp_rowset check, so its delete bitmap is reported as leaked and the checker returns failure even though the rowset is still within the same retention window used by the recycler. Please either keep collecting all live tmp rowsets or fail/skip this check when the cap is reached, rather than producing a known false positive.

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label Jun 9, 2026
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

PR approved by at least one committer and no changes requested.

@mymeiyi

mymeiyi commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen

Copy link
Copy Markdown
Contributor

Cloud UT Coverage Report

Increment line coverage 71.82% (79/110) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.40% (1916/2444)
Line Coverage 64.89% (34282/52828)
Region Coverage 65.32% (17636/26998)
Branch Coverage 54.00% (9372/17354)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants