Skip to content

fix(ssl): detect and rebuild stale ACME orders on verify/retry#493

Open
mrrobot47 wants to merge 1 commit into
EasyEngine:developfrom
mrrobot47:fix/ssl-stale-order-self-heal
Open

fix(ssl): detect and rebuild stale ACME orders on verify/retry#493
mrrobot47 wants to merge 1 commit into
EasyEngine:developfrom
mrrobot47:fix/ssl-stale-order-self-heal

Conversation

@mrrobot47

Copy link
Copy Markdown
Member

Problem

A one-time DNS hiccup or timeout could permanently break SSL for a site. On issuance failure the ACME order/challenge files are left on disk; the user fixes DNS and re-runs ee site ssl-verifycheck() blindly reloads the on-disk order and reuses it. Within ~7 days while the order is still pending this works (and is preserved). But once Let's Encrypt expires the pending order, the reloaded order is dead, validation fails forever, and the site shows a generic "Cannot validate challenge" on every retry — because the retry path never requests a fresh order (only init_le calls authorize()).

Fix

check() now detects a stale stored order and rebuilds it before validating. isCertificateOrderStale() reloads each domain's authorization from Let's Encrypt and treats the order as stale when an authorization's live status is not pending/valid (expired/deactivated/revoked/invalid), or when the stored order doesn't cover a requested domain. On staleness it removes the dead order + challenge, calls authorize() for a fresh order, and continues.

Safeguards (validated by review):

  • A transient Let's Encrypt error (5xx/429/timeout) during the reload is treated as inconclusive — the order is preserved and the run fails gracefully for a later retry — rather than destroying a healthy in-flight order and burning a rate-limited newOrder slot.
  • The creation path (fresh order) and the normal "DNS not ready yet, retry later" path (pending order) do not trigger a rebuild — behavior there is unchanged.
  • For manual DNS-01 (no Cloudflare API key) a rebuild issues a new TXT value, so the command warns the user to publish the new record and re-run, instead of failing validation confusingly. HTTP-01 and Cloudflare-automated DNS continue to validation as before.

Known limitation

If Let's Encrypt fully purges an expired authorization (the reload returns 404 rather than status expired), the order is not auto-rebuilt — the conservative choice that avoids the transient-error regression above. The common expiry case (status expired on a successful reload) is handled.

Testing

Manual (needs the ACME/Docker harness): simulate a stale order (let a pending order expire, or deactivate the authorization server-side), point DNS correctly, then run ee site ssl-verify → it requests a fresh order instead of looping. Negative test: with the order still pending, confirm no rebuild and the normal retry path.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants