fix: eliminate race condition in crash_after_dequeue_before_append_completion test#20
fix: eliminate race condition in crash_after_dequeue_before_append_completion test#20pinodeca wants to merge 4 commits into
Conversation
…mpletion test The test had a timing race: the runtime could dequeue and process the first ExternalRaised (completing the orchestration) before the second was enqueued. When the second arrived in a subsequent batch, the orchestration was already terminal, so the dispatcher acked it without materializing — resulting in 1 ExternalEvent instead of the expected 2. Fix: shut down the runtime after the subscription is confirmed, enqueue both duplicates while the dispatcher is stopped, then restart. This guarantees both land in the same batch, faithfully simulating the real crash window (where lock expiry makes both copies visible simultaneously).
There was a problem hiding this comment.
Here is GPT-5.5 assessment of the 1st commit/attempt to fix the race condition (by Opus 4.6). Subsequently, GPT-5.5 was used for the second attempt (2nd commit in this PR).
The current fix makes the test deterministic, but it no longer exercises the crash window named by the test. By shutting the runtime down before either duplicate ExternalRaised message is enqueued, the test only verifies that two never-dequeued visible messages are fetched in the same batch. That overlaps with the existing duplicate external work-item coverage and would not catch a regression in recovery of a previously dequeued-but-unacked row after lock expiry.
The test should instead stop the runtime after the subscription is recorded, enqueue one ExternalRaised message, directly fetch it through the provider with a short orchestration lock timeout without acking it, wait for the lock to expire, enqueue the duplicate/new copy, then restart the runtime. That models crash-after-dequeue-before-append: one expired locked row plus one fresh visible row must be fetched together and both materialized before replay delivers only one to the pending subscription.
2bfd447 to
a554c31
Compare
This PR eliminates flaky external duplicate event tests by splitting the logic into two distinct, deterministic test cases: one for same-scoop behavior and one for different-scoop-after-terminal behavior.
Problem
Previously, duplicate external events could be fetched in either a single dispatcher scoop or across separate scoops. The old live-runtime enqueue test was non-deterministic because it could observe either 1 or 2
ExternalEventsdepending on timing.Approach
external_duplicate_workitems_same_scoop_materializes_both: This test forces the same scoop by waiting for the subscription to be active, shutting down the runtime, enqueueing both duplicate events while it is down, and then restarting. It expects 2ExternalEventsto be materialized.crash_after_dequeue_before_append_completion_different_scoop_after_terminal: This case simulates a race where the first event is fetched/locked but not yet acknowledged. It waits for the lock to expire, restarts the runtime with only the recovered row, waits for completion, enqueues a late duplicate after the terminal state is reached, waits for the queue to drain, and expects only 1ExternalEvent.Tests
Verified with: