worker: stop receiving new messages on SIGTERM#1165
Open
arpitjain099 wants to merge 1 commit into
Open
Conversation
Kubernetes sends the analysis worker a SIGTERM and then a SIGKILL about 30 seconds later. Previously the worker had no signal handling, so it could pull a fresh pubsub message right before being killed and leave that analysis unfinished. Install a SIGTERM/SIGINT handler via signal.NotifyContext in main and have the receive loop check for a requested shutdown before pulling the next message, returning cleanly instead of treating the cancelled context as an error. The decision is factored into a small testable shutdownRequested helper. This is part 1 of ossf#376 (stop grabbing new work). Cancelling an in-flight analysis through context after a grace period is a deliberate follow-up and is not included here. Signed-off-by: Arpit Jain <arpitjain099@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
When the analysis worker runs under Kubernetes it receives a SIGTERM and then a SIGKILL roughly 30 seconds later. Today the worker installs no signal handler, so it can pull a fresh pubsub message right before the SIGKILL and leave that analysis half-finished.
This is part 1 of #376: stop grabbing new work once a SIGTERM has been received.
What this changes
mainnow derives its context fromsignal.NotifyContext(..., syscall.SIGTERM, syscall.SIGINT), so the context is cancelled when the worker is asked to shut down (defer stop()restores default signal handling on exit). SIGINT is included so a localCtrl-Cbehaves the same way as a Kubernetes SIGTERM.sub.Receive, and returns cleanly instead of pulling another message. If the shutdown arrives while the loop is already blocked inReceive, the resulting cancelled-context error is now treated as a clean stop rather than a failure (a genuine receive error is still returned as before).shutdownRequested(ctx)helper so it can be unit tested without standing up pubsub.By the time the loop reaches the top, any message that was already being handled has finished, so this gives an in-flight analysis the chance to complete within the grace period before the loop exits.
Out of scope (deliberate follow-up)
Per @calebbrown's breakdown on the issue, this PR only covers item 1 (do not start new work after SIGTERM). Threading context through
internal/analysisand wiring up a grace-period timer that cancels an in-flight analysis (items 2 and 3) is intentionally left for a separate change so this slice stays small and reviewable.Testing
Added
cmd/worker/main_test.gocoveringshutdownRequestedfor a live context, a cancelled context, and a real SIGTERM delivered to the process viasignal.NotifyContext.internal/packetcapturedoes not build natively on macOS (libpcap is Linux-only), so I ran the authoritative checks for Linux:GOOS=linux go build ./cmd/worker/andGOOS=linux go vet ./cmd/worker/: both clean.go test ./cmd/worker/run inside agolang:1.26Linux container: all three tests pass.gofmtclean.Part of #376.