Podman: fix compose-devcontainer health + multi-service checkpoint/restore#102
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughTwo features are added: (1) an ChangesCheckpoint / Restore fixes
Compose orchestrator self-probed health (Podman)
Sequence Diagram(s)sequenceDiagram
actor Caller
participant Engine
participant Orchestrator
participant PodmanRuntime
rect rgba(135, 206, 235, 0.5)
Note over Caller,PodmanRuntime: Restore flow with IgnoreVolumes
Caller->>Engine: RestoreProject(opts.IgnoreVolumes=true)
Engine->>PodmanRuntime: CreateNetwork(<project>_default)
PodmanRuntime-->>Engine: networkID
Engine->>PodmanRuntime: InspectContainer(<project>-<svc>-1)
alt stale stopped container exists
Engine->>PodmanRuntime: RemoveContainer(force=true)
end
Engine->>PodmanRuntime: Restore(RestoreSpec{IgnoreVolumes: true})
PodmanRuntime-->>Engine: restored Container
end
rect rgba(144, 238, 144, 0.5)
Note over Caller,PodmanRuntime: Self-probe health gate flow
Caller->>Orchestrator: up (service_healthy dependency)
Orchestrator->>PodmanRuntime: RunContainer(RunSpec{HealthCheck: {Disable: true}})
Orchestrator->>Orchestrator: gateLevel → waitFor(healthCheckSpec)
loop poll until exit 0 or timeout
Orchestrator->>PodmanRuntime: ExecContainer(healthProbeCmd)
PodmanRuntime-->>Orchestrator: ExecResult
end
Orchestrator-->>Caller: gate satisfied (or HealthTimeoutError)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (3)
design/checkpoint-restore-fixes.md (1)
66-68: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdd language identifier to fenced code block.
The code block is missing a language specifier. Mark it as shell/bash output to pass linting and clarify intent.
📝 Proposed fix
-``` +```sh criu/files-reg.c: File var/lib/postgresql/data has bad mode 040755 (expect 040700) -``` +```🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@design/checkpoint-restore-fixes.md` around lines 66 - 68, The fenced code block containing the criu error message starting with "criu/files-reg.c: File var/lib/postgresql/data has bad mode 040755" is missing a language identifier in the opening fence. Add the language specifier "sh" to the opening triple backticks (change ``` to ```sh) to properly mark this as shell output, which will satisfy linting requirements and make the intent clear.Source: Linters/SAST tools
design/compose-native-health.md (2)
20-22: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueAdd language specifier to code block.
The fenced code block should declare a language for proper syntax highlighting and linting compliance.
📝 Proposed fix
-``` +```shell Error when reading /var/lib/rabbitmq/.erlang.cookie: eacces -``` +```🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@design/compose-native-health.md` around lines 20 - 22, The fenced code block containing the error message "Error when reading /var/lib/rabbitmq/.erlang.cookie: eacces" is missing a language specifier. Add the language identifier "shell" immediately after the opening triple backticks (```shell) to enable proper syntax highlighting and meet linting compliance requirements.
92-104: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueUse more formal phrasing in probe semantics section.
The prose on lines 104 ("Exit 0 → healthy → gate satisfied. Non-zero → keep polling until deadline") uses informal arrow notation. Consider a more structured phrasing for consistency with the rest of the technical specification.
Suggested revision:
Exit code 0 indicates the service is healthy and the gate is satisfied. Non-zero exit codes trigger continued polling until the deadline is reached.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@design/compose-native-health.md` around lines 92 - 104, Replace the informal arrow notation in the probe semantics section (the final bullet point starting with "Exit 0") with more formal phrasing. Instead of using arrows to connect concepts, rephrase this to use complete sentences with explicit structure, such as stating that exit code 0 indicates the service is healthy and satisfies the gate, while non-zero exit codes result in continued polling until the deadline is reached. This will maintain consistency with the technical specification style used throughout the document.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@checkpoint_project.go`:
- Around line 240-247: The InspectContainer call in the checkpoint restore logic
currently ignores errors other than nil. When InspectContainer fails with a real
error (daemon/API failure, permission issue), the code should explicitly handle
it instead of silently proceeding. After calling e.runtime.InspectContainer, add
a check for when ierr is not nil to determine if the error is a "not found"
error (which allows restore to proceed) or a genuine failure that should be
returned immediately with the actual error context. Only proceed with the
container removal logic in the RemoveContainer call when either InspectContainer
succeeds or the error is specifically a "not found" type error.
In `@compose/orchestrator.go`:
- Around line 506-511: In the orchestrator.go file within the self-probe mode
conditional block (if o.selfProbe), setting spec.HealthCheck to nil does not
actually disable the native healthcheck but instead inherits the image's
baked-in HEALTHCHECK. To fix this, replace the nil assignment with an explicit
disabled healthcheck configuration that prevents native healthchecks from
running, rather than relying on nil inheritance semantics.
---
Nitpick comments:
In `@design/checkpoint-restore-fixes.md`:
- Around line 66-68: The fenced code block containing the criu error message
starting with "criu/files-reg.c: File var/lib/postgresql/data has bad mode
040755" is missing a language identifier in the opening fence. Add the language
specifier "sh" to the opening triple backticks (change ``` to ```sh) to properly
mark this as shell output, which will satisfy linting requirements and make the
intent clear.
In `@design/compose-native-health.md`:
- Around line 20-22: The fenced code block containing the error message "Error
when reading /var/lib/rabbitmq/.erlang.cookie: eacces" is missing a language
specifier. Add the language identifier "shell" immediately after the opening
triple backticks (```shell) to enable proper syntax highlighting and meet
linting compliance requirements.
- Around line 92-104: Replace the informal arrow notation in the probe semantics
section (the final bullet point starting with "Exit 0") with more formal
phrasing. Instead of using arrows to connect concepts, rephrase this to use
complete sentences with explicit structure, such as stating that exit code 0
indicates the service is healthy and satisfies the gate, while non-zero exit
codes result in continued polling until the deadline is reached. This will
maintain consistency with the technical specification style used throughout the
document.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 813a4d00-2246-44b2-8a51-37ddca545343
📒 Files selected for processing (11)
checkpoint.gocheckpoint_project.gocheckpoint_project_test.gocompose/orchestrator.gocompose/orchestrator_health_test.gocompose/orchestrator_test.godesign/checkpoint-restore-fixes.mddesign/compose-native-health.mdruntime/podman/checkpoint.goruntime/podman/podman.goruntime/runtime.go
Podman runs a container's HEALTHCHECK as root and fires the first probe immediately at start (it ignores start_period for the initial run, and exposes no per-restore flag to defer it — confirmed on 5.7.0). For privilege-dropping images this breaks the service: rabbitmq's `rabbitmq-diagnostics` probe, run as root, creates a root-owned /var/lib/rabbitmq/.erlang.cookie that the gosu-dropped uid-999 server then can't read (eacces), so rabbitmq exits 1. Docker is unaffected because it defers the first probe past the interval. Have the compose orchestrator probe health itself on backends that opt in via the new selfHealthProber interface (Podman returns true; Docker and Apple unchanged). When set: omit RunSpec.HealthCheck so the backend runs no native healthcheck, and evaluate service_healthy gates by exec'ing the compose test command in waitFor — which defers the first probe until after the service has initialized, matching Docker. See design/compose-native-health.md (incl. the two-path divergence and the out-of-scope rootless-gosu issue). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Validating CheckpointProject/RestoreProject against a real multi-service
compose devcontainer (8 services) surfaced restore failures the
single-container alpine spike never hit. Fixes:
- RestoreProject recreates the <project>_default network before
restoring containers. A cross-node restore lands on a fresh store with
no network, and libpod restore fails with "network not found". Mirrors
how compose.Orchestrator.Up names + labels it; idempotent, so same-node
is unaffected. (Custom/extra networks aren't recorded yet — follow-up.)
- RestoreProject clears a stale same-named container before restore.
CheckpointProject(StopAfter) stops but does not remove containers, so a
same-node restore-in-place collides ("ID already in use"). Inspect the
deterministic <project>-<service>-1 name and remove it only when NOT
running (state is in the archive; RemoveVolumes stays false so the data
volume survives); refuse to clobber a running container.
- Add IgnoreVolumes to ProjectRestoreOptions/RestoreOptions ->
runtime.RestoreSpec -> Podman restore ignoreVolumes=true, so a
same-node restore-in-place reuses existing volumes instead of colliding
("volume already exists"). Cross-node leaves it false (content comes
from the archive).
Folded the design decisions + two node prerequisites (CRIU
skip-file-rwx-check; base images on the destination) into
design/checkpoint-restore.md §3.3/§7. Verified end-to-end incl. true
cross-node migration with state preserved.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
83aad2f to
0c62dbe
Compare
Two fixes from PR review:
- compose: in self-probe mode set RunSpec.HealthCheck to {Disable:true}
rather than nil. nil means "inherit from image" (toHealthcheck), so an
image-baked HEALTHCHECK would still run natively on Podman and could
reintroduce the eager-root-probe failure. Disable emits the NONE
sentinel that actually suppresses it. Adds a regression test.
- checkpoint: RestoreProject now distinguishes a not-found container
(normal fresh/cross-node case → proceed) from any other InspectContainer
error (daemon/API/permission), which is surfaced instead of masked
behind a downstream restore error.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Document everything merged since v0.3.0: Podman backend + checkpoint/restore (#98), compose feature security metadata + entrypoint chaining (#103), Podman compose health probing (#102), deps bump (#101), and the prebuild dev-environment / CI tooling (#88–#93). Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Validates the Podman backend against a real multi-service compose devcontainer (8 services) and fixes the issues that blocked it. Two independent fixes, one per commit — happy to split into two PRs if preferred.
1. Orchestrator-driven health probing on Podman
Problem: Podman runs a container's
HEALTHCHECKas root and fires the first probe immediately at start (it ignoresstart_periodfor the initial run and exposes no per-restore flag to defer it — confirmed on 5.7.0). For privilege-dropping images this breaks startup: rabbitmq'srabbitmq-diagnosticsprobe, run as root, creates a root-owned/var/lib/rabbitmq/.erlang.cookiethat thegosu-dropped uid-999 server then can't read (eacces), so rabbitmq exits 1. Docker is unaffected because it defers the first probe past the interval. Reproduces on barepodman run --health-cmd …, so it's a Podman behavior, not our translation.Fix: The compose orchestrator probes health itself on backends that opt in via a new
selfHealthProberinterface (Podman returns true; Docker/Apple unchanged). When set: omitRunSpec.HealthCheck(no native eager-root probe), and evaluateservice_healthygates by exec'ing the composetestcommand inwaitFor— which defers the first probe until after the service initializes, matching Docker.See
design/compose-native-health.md(incl. the deliberate two-path divergence and the out-of-scope rootless-gosuissue).2. Multi-service project checkpoint/restore
Validating
CheckpointProject/RestoreProjectagainst a real 8-service compose project surfaced restore failures the single-container alpine spike never hit. Design decisions folded intodesign/checkpoint-restore.md(§3.3 / §7):<project>_defaultnetwork and libpod fails withnetwork not found. Idempotent; same-node unaffected.CheckpointProject(StopAfter)stops but doesn't remove, so same-node restore-in-place collides (ID already in use). Removes the deterministic<project>-<service>-1name only when not running (state is in the archive; volumes preserved); refuses to clobber a running container (replace a dead husk, never duplicate).IgnoreVolumesoption (ProjectRestoreOptions/RestoreOptions→runtime.RestoreSpec→ PodmanignoreVolumes=true) so same-node restore-in-place reuses existing volumes instead of colliding (volume already exists). Cross-node leaves it false (content comes from the archive).Verified end-to-end including true cross-node migration with state preserved (a file in the app container and a Postgres row written pre-checkpoint were present post-restore on a separate, never-ran-this Podman store).
Two node prerequisites documented (node-provisioning, not library code):
skip-file-rwx-checkin/etc/criu/default.conf(Postgres data-dir is 0700 at checkpoint vs the 0755 restored mountpoint — trips CRIU's file-mode check; Podman has no per-restore flag and the engine talks to a remote socket).docker://…if absent).Timings (8-service project, ~580 MB)
--ignore-volumes; up to ~20–52s when re-extracting volume content off slow storage.Testing
runtime/applecontainertest failures are environmental (ApplecontainerJSON decode) and unrelated.🤖 Generated with Claude Code
Summary by CodeRabbit
IgnoreVolumesoption to skip restoring volume contents and reuse existing volumes.service_healthygating.