Fix goroutine leak in SSH DialContext#3882
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Elvand-Lie The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @Elvand-Lie. Thanks for your PR. I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with Tip We noticed you've done this a few times! Consider joining the org to skip this step and gain Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Codecov Report✅ All modified and coverable lines are covered by tests.
Additional details and impacted files@@ Coverage Diff @@
## main #3882 +/- ##
==========================================
- Coverage 54.03% 48.63% -5.41%
==========================================
Files 200 200
Lines 23567 23558 -9
==========================================
- Hits 12734 11457 -1277
- Misses 9604 11062 +1458
+ Partials 1229 1039 -190
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
Looking at the code I am not sure whether even the old version is correct: https://pkg.go.dev/net#Dialer.DialContext
I guess we should simply ignore the ctx and spawn no goroutine at all. |
There was a problem hiding this comment.
Pull request overview
Fixes a goroutine leak in pkg/ssh’s DialContext by ensuring the background context-monitoring goroutine can also exit when the returned connection is manually closed (not only when the context is cancelled). This aligns the implementation with the lifecycle patterns described in Issue #3881 for long-lived contexts (e.g., context.Background()).
Changes:
- Introduced a
connWithDonewrapper that signals adonechannel onClose(). - Updated the monitoring goroutine to
selecton bothctx.Done()and the connection’sdonesignal. - Returned the wrapped connection from
DialContextto ensure caller-initiatedClose()triggers goroutine cleanup.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| select { | ||
| case <-ctx.Done(): | ||
| conn.Close() | ||
| case <-wrapped.done: | ||
| // Connection was closed by the caller; exit cleanly. |
| go func() { | ||
| if ctx != nil { | ||
| <-ctx.Done() | ||
| conn.Close() | ||
| select { | ||
| case <-ctx.Done(): | ||
| conn.Close() |
The monitoring goroutine in DialContext violated net.Dialer.DialContext semantics: context should only govern connection establishment, not connection lifetime. The goroutine also leaked when connections were closed manually while the context remained open. Remove the goroutine and connWithDone wrapper entirely. Delegate to ssh.Client.DialContext which already handles context-cancellable dialing correctly per the Go contract.
|
@matejvasek I've removed the goroutine entirely instead of patching it since the goroutine is a bit redundant. x/crypto/ssh already has Client.DialContext that handles context cancellation during dial correctly, so I'm just delegating to it now. Net -30 lines. Should we have a regression test for this, or is the deletion self-evident enough? |
Changes
Fixes #3881
Why is this needed?
The previous
DialContextimplementation spawned a background goroutine that watchedctx.Done()and force-closed the connection when the context expired. This had two problems:Goroutine leak: If the caller closed the connection manually while the context was still alive (e.g.,
context.Background()), the goroutine blocked forever on<-ctx.Done(), leaking 1 goroutine per connection.Violated
net.Dialer.DialContextsemantics: Per the Go docs: "Once successfully connected, any expiration of the context will not affect the connection." The old code did the opposite it killed established connections when the context expired. The k8s dialer in this same repo already documents this contract correctly (pkg/k8s/dialer.goL80-84).The fix
The
golang.org/x/crypto/sshlibrary already providesClient.DialContextwhich handles context-cancellable dialing correctly:So the fix simply delegates to
d.sshClient.DialContext(ctx, d.network, d.addr)instead of callingd.Dial()+ spawning a broken monitoring goroutine. TheconnWithDonewrapper andsyncimport are removed as dead code.Testing performed
go test ./pkg/ssh/...passes.