Skip to content

fix: eliminate race condition in crash_after_dequeue_before_append_completion test#20

Open
pinodeca wants to merge 4 commits into
mainfrom
pinodeca/fix-race
Open

fix: eliminate race condition in crash_after_dequeue_before_append_completion test#20
pinodeca wants to merge 4 commits into
mainfrom
pinodeca/fix-race

Conversation

@pinodeca
Copy link
Copy Markdown

@pinodeca pinodeca commented May 18, 2026

This PR eliminates flaky external duplicate event tests by splitting the logic into two distinct, deterministic test cases: one for same-scoop behavior and one for different-scoop-after-terminal behavior.

Problem

Previously, duplicate external events could be fetched in either a single dispatcher scoop or across separate scoops. The old live-runtime enqueue test was non-deterministic because it could observe either 1 or 2 ExternalEvents depending on timing.

Approach

  1. external_duplicate_workitems_same_scoop_materializes_both: This test forces the same scoop by waiting for the subscription to be active, shutting down the runtime, enqueueing both duplicate events while it is down, and then restarting. It expects 2 ExternalEvents to be materialized.
  2. crash_after_dequeue_before_append_completion_different_scoop_after_terminal: This case simulates a race where the first event is fetched/locked but not yet acknowledged. It waits for the lock to expire, restarts the runtime with only the recovered row, waits for completion, enqueues a late duplicate after the terminal state is reached, waits for the queue to drain, and expects only 1 ExternalEvent.

Tests

Verified with:

cargo nt --test reliability_tests

…mpletion test

The test had a timing race: the runtime could dequeue and process the
first ExternalRaised (completing the orchestration) before the second
was enqueued. When the second arrived in a subsequent batch, the
orchestration was already terminal, so the dispatcher acked it without
materializing — resulting in 1 ExternalEvent instead of the expected 2.

Fix: shut down the runtime after the subscription is confirmed, enqueue
both duplicates while the dispatcher is stopped, then restart. This
guarantees both land in the same batch, faithfully simulating the real
crash window (where lock expiry makes both copies visible simultaneously).
Copy link
Copy Markdown
Author

@pinodeca pinodeca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is GPT-5.5 assessment of the 1st commit/attempt to fix the race condition (by Opus 4.6). Subsequently, GPT-5.5 was used for the second attempt (2nd commit in this PR).

The current fix makes the test deterministic, but it no longer exercises the crash window named by the test. By shutting the runtime down before either duplicate ExternalRaised message is enqueued, the test only verifies that two never-dequeued visible messages are fetched in the same batch. That overlaps with the existing duplicate external work-item coverage and would not catch a regression in recovery of a previously dequeued-but-unacked row after lock expiry.

The test should instead stop the runtime after the subscription is recorded, enqueue one ExternalRaised message, directly fetch it through the provider with a short orchestration lock timeout without acking it, wait for the lock to expire, enqueue the duplicate/new copy, then restart the runtime. That models crash-after-dequeue-before-append: one expired locked row plus one fresh visible row must be fetched together and both materialized before replay delivers only one to the pending subscription.

@pinodeca pinodeca force-pushed the pinodeca/fix-race branch from 2bfd447 to a554c31 Compare May 18, 2026 22:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant