Make scheduled outerloop builds succeed when only Helix tests fail #129049

Open
mmitche wants to merge 4 commits into dotnet:release/8.0 from mmitche:dev/scheduled-outerloop-helix-continueonerror

@mmitche mmitche commented Jun 5, 2026

Note

This pull request was authored with the assistance of GitHub Copilot.

Problem

Several scheduled outerloop pipelines (the outerloop.yml family: runtime-libraries-coreclr outerloop and its -windows/-linux/-osx variants) use an always: false scheduled trigger. With always: false, AzDO only starts a new scheduled run if the source changed since the last successful scheduled run.
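For orientation, a scheduled trigger of this kind looks roughly like the sketch below. Only `always: false` comes from the description above; the cron expression, display name, and branch filter are placeholders and are not copied from outerloop.yml.

```yaml
# Sketch of an Azure Pipelines scheduled trigger (placeholder values).
schedules:
- cron: "0 11 * * *"                 # assumed time; the real schedule may differ
  displayName: Daily outerloop run   # hypothetical name
  branches:
    include:
    - release/8.0                    # assumed branch filter
  always: false                      # run only if the source changed since the last successful scheduled run
```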

Because the repo has many flaky outerloop tests, the Helix test work items virtually always have at least one failure, which fails the "Send to Helix" step and therefore the whole build. The build never reaches a succeeded state, so AzDO re-queues the same, unchanged commit day after day, submitting more and more Helix work for no benefit. (Empirically confirmed: a single commit was re-run and failed for 19 consecutive days; once a sibling definition produced a genuinely successful run, the same-SHA re-queue stopped.)

Why continueOnError is not enough

continueOnError: true only downgrades the build to partiallySucceeded, which AzDO's always: false scheduler still does not treat as successful — so the same commit keeps getting re-queued. The Helix step must end fully successful (exit 0).
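A generic illustration of the behavior described here, not the actual helix.yml step. The script name is hypothetical and stands in for the Helix submission.

```yaml
# Generic step with continueOnError (illustrative only).
steps:
- script: ./send-to-helix.sh      # hypothetical script; assume it exits non-zero when tests fail
  displayName: Send to Helix
  continueOnError: true           # a failure is reported as "succeeded with issues" and later steps still run
# When this step fails, the run result is partiallySucceeded, not succeeded,
# so an always: false schedule still sees no successful run for the commit.
```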

Fix

Make the "Send to Helix" step actually succeed on scheduled runs by disabling the two Arcade Microsoft.DotNet.Helix.Sdk properties that fail the build (both default to true):

  • FailOnWorkItemFailure: CheckHelixJobStatus errors when a work item exits non-zero.
  • FailOnTestFailure: CheckAzurePipelinesTestResults errors when any published test failed.

Setting both to false lets the msbuild step exit 0, producing a fully succeeded build. Failed tests are still published and visible in the test results tab; AzDO does not auto-degrade a build to partiallySucceeded just because a published test run contains failures — only a failing task would.

Changes

  • eng/pipelines/libraries/helix.yml: Added a failOnTestFailures parameter (default true, preserving today's behavior) wired to /p:FailOnWorkItemFailure and /p:FailOnTestFailure on the Send to Helix msbuild invocation.
  • eng/pipelines/libraries/outerloop.yml: Passes failOnTestFailures: false only on scheduled runs (Build.Reason == 'Schedule') for all three matrix legs (Release, Debug, NET48).
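The two changes above could be wired together roughly as follows. This is a sketch under assumptions, not the PR's diff: the `$(_msbuildCommand)` variable, the project path, the platform-matrix.yml template path, and the `jobParameters` nesting are guesses at the surrounding structure. Only the parameter name, its default, the two `/p:` properties, and the `Build.Reason` check come from the description.

```yaml
# eng/pipelines/libraries/helix.yml (sketch; other parameters and steps omitted)
parameters:
  failOnTestFailures: true    # default preserves today's behavior

steps:
- script: >-
    $(_msbuildCommand) src/libraries/sendtohelix.proj
    /p:FailOnWorkItemFailure=${{ parameters.failOnTestFailures }}
    /p:FailOnTestFailure=${{ parameters.failOnTestFailures }}
  displayName: Send to Helix   # command variable and project path above are assumed
---
# eng/pipelines/libraries/outerloop.yml (sketch; one of the three matrix legs)
- template: /eng/pipelines/common/platform-matrix.yml   # assumed template path
  parameters:
    jobParameters:                                       # assumed nesting
      # Evaluates to false only on scheduled runs; every other reason keeps the default of true.
      failOnTestFailures: ${{ ne(variables['Build.Reason'], 'Schedule') }}
```

Keeping a single boolean parameter means both Helix SDK properties move together, which matches the description's intent that scheduled runs ignore both work-item and test failures.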

Behavior preservation

The new parameter defaults to true, so all other helix.yml callers are unaffected (none set WaitForWorkItemCompletion or these properties on this path, so they already resolve to true). Only scheduled outerloop runs change behavior. PR / rolling / manual outerloop runs continue to fail on Helix failures exactly as before. Build/compile breaks still fail scheduled runs (this only affects the Helix step).

Tradeoff

On scheduled runs, FailOnWorkItemFailure=false also masks work-item crashes/timeouts/infra failures, not just test-assertion failures. This is an accepted tradeoff for the goal of stopping the wasteful daily re-queue of unchanged commits; results remain visible in the Helix/test reporting.

The libraries outerloop pipeline runs on a daily schedule with always:false,
meaning AzDO only re-queues a commit if there were changes since the last
successful scheduled run. Because flaky outerloop tests cause the 'Send to
Helix' task to fail on essentially every scheduled run, the build never
succeeds, so AzDO re-queues the same commit every day and submits ever more
Helix work for an unchanged sha.

Set shouldContinueOnError on the Send to Helix step for scheduled builds only
(Build.Reason == 'Schedule'), so Helix work item failures no longer fail the
build. Compile/build breaks still fail the build, and PR/CI/manual runs are
unaffected.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @dotnet/area-infrastructure-libraries
See info in area-owners.md if you want to be subscribed.

Copilot AI left a comment

Pull request overview

This PR updates the libraries outerloop Azure DevOps pipeline to avoid failing scheduled runs due to Helix work item/test failures, with the intent of preventing always: false schedules from repeatedly re-queuing the same commit and submitting duplicate Helix work.

Changes:

  • Pass shouldContinueOnError: ${{ eq(variables['Build.Reason'], 'Schedule') }} into the three platform-matrix.yml invocations in outerloop.yml.
  • Add inline YAML comments explaining the rationale (avoid same-SHA daily re-queues and wasted Helix capacity).

Comment thread eng/pipelines/libraries/outerloop.yml Outdated
Comment on lines +26 to +29
# Don't fail scheduled builds on Helix work item failures. Otherwise a perpetually
# failing scheduled build (flaky outerloop tests) causes AzDO to re-queue the same
# commit every day, wasting Helix resources. See always:false on the schedule above.
shouldContinueOnError: ${{ eq(variables['Build.Reason'], 'Schedule') }}
mmitche commented Jun 5, 2026

Bleh, it's right. partiallySucceeded won't stop AzDO from re-scheduling the same commit.

continueOnError only marks the build partiallySucceeded, which AzDO's
always:false scheduler still treats as not-successful, so the same commit
keeps getting re-queued daily.

Instead, for scheduled builds, tell the Helix SDK not to fail the build on
work item / test failures by passing FailOnWorkItemFailure=false and
FailOnTestFailure=false. The Send to Helix step then fully succeeds, so a
perpetually-flaky scheduled run no longer causes AzDO to re-queue the same sha.

- helix.yml: add failOnTestFailures parameter (default true = current behavior)
  wired to the FailOnWorkItemFailure/FailOnTestFailure Helix SDK properties.
- outerloop.yml: pass failOnTestFailures=false only for scheduled builds
  (Build.Reason == 'Schedule'); replaces the earlier shouldContinueOnError
  approach.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
mmitche changed the title from "Don't fail scheduled outerloop builds on Helix work item failures" to "Make scheduled outerloop builds succeed when only Helix tests fail" on Jun 5, 2026
…will revert)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 5, 2026 18:16
Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mmitche mmitche requested review from akoeplinger and lewing June 5, 2026 20:55
mmitche commented Jun 5, 2026

If this looks reasonable, we should backport it to 9.0 and 10.0 for outerloop.

lewing commented Jun 6, 2026

/azp list

@azure-pipelines
CI/CD Pipelines for this repository:
