Skip to content

runtime: checkpoint/restore (CheckpointRuntime) + Podman backend#98

Merged
bilby91 merged 4 commits into
mainfrom
feat/checkpoint-restore-contract
Jun 22, 2026
Merged

runtime: checkpoint/restore (CheckpointRuntime) + Podman backend#98
bilby91 merged 4 commits into
mainfrom
feat/checkpoint-restore-contract

Conversation

@bilby91

@bilby91 bilby91 commented Jun 21, 2026

Copy link
Copy Markdown
Member

Why

Docker's checkpoint/restore is broken on current engines — start --checkpoint fails on the netns bind-mount (open upstream containerd#12141 / moby#37344). Podman does the full round trip (checkpoint --export / restore --import: process + memory + writable rootfs in a portable, node-independent archive). So this adds an optional CheckpointRuntime sub-interface and a Podman backend that implements it — for migrating spot-evicted devcontainers without losing in-memory work.

See design/checkpoint-restore.md (primitive + empirical runtime matrix + why-not-docker) and design/podman-backend.md (backend impl).

What's in it

Contract (backend-agnostic)

  • runtime.CheckpointRuntime + CheckpointSpec/RestoreSpec/CheckpointRef, Capabilities.Checkpoint, and typed errors (ErrCheckpointUnsupported, CheckpointFailedError, RestoreFailedError).
  • Engine.Checkpoint / Engine.Restore — gate on the capability bit; Restore returns a fully reattached *Workspace (re-inspect + rebuild config via a reattachWorkspace helper shared with Attach).
  • Engine.CheckpointProject / RestoreProject — a thin orchestrator over the per-container primitive for multi-service projects: enumerate by com.docker.compose.project label, one archive per service + a manifest, reattach the devcontainer service as Primary. Decoupled from the compose machinery (label-based only).

Backend (runtime/podman, Linux)

  • Embeds *docker.Runtime over Podman's docker-compatible socket for the standard surface; drives checkpoint/restore + buildah build through the libpod REST API on the same socket via a thin stdlib net/http client. No CLI shell-out, no pkg/bindings (rejected: deps 76→384 + cgo/gpgme). Zero new modules; cross-compiles clean.
  • Capabilities().Checkpoint gates on libpod reachability plus an optional Options.CheckpointProbe — the REST transport has no criu check equivalent, so the deployer (who runs the service) asserts CRIU.

Testing

  • Unit: libpod request/response shapes (httptest), engine wrappers + project orchestrator (incl. partial-failure → no manifest, RestoreFailedError propagation), capability-probe gating, reattach metadata merge.
  • Integration (PODMAN_SOCKET-gated; skip without a live Podman+CRIU host): single-container reattach, multi-service project round trip, and a two-phase cross-node test (restore on a fresh store that never pulled the image — proves the archive is self-contained, and pins the requirement that the workspace bind source must exist on the destination).
  • Validated end-to-end on real Podman + CRIU — all integration tests pass.
  • CI: new test-integration-podman job installs podman+criu+crun+iptables and runs the gated tests for real (continue-on-error until the hosted-runner CRIU capability is confirmed).

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features
    • Added a Podman runtime backend option.
    • Added CRIU-based container checkpoint/restore support (including TCP-establishment support).
    • Added multi-service project checkpoint/restore with primary workspace reattachment on restore.
  • Bug Fixes
    • Improved capability detection and clearer unsupported/failure outcomes for checkpoint/restore.
  • Documentation
    • Updated README with Podman setup and prerequisites.
    • Added/expanded design documentation for checkpoint/restore and the Podman backend.
  • Tests
    • Added unit and Podman-focused integration tests, plus CI integration coverage with guarded execution in Podman.

Docker's checkpoint/restore is broken on current engines (the netns
bind-mount on restore — open upstream containerd#12141 / moby#37344).
Podman does the full round trip (`checkpoint --export` /
`restore --import`: process + memory + writable rootfs in a portable,
node-independent archive). So expose an optional CheckpointRuntime
sub-interface and add a Podman backend that implements it.

Contract (backend-agnostic):
- runtime.CheckpointRuntime + CheckpointSpec/RestoreSpec/CheckpointRef
  (export/import model), Capabilities.Checkpoint, and typed errors
  (ErrCheckpointUnsupported, CheckpointFailedError, RestoreFailedError).
- Engine.Checkpoint / Engine.Restore wrappers (checkpoint.go) that
  type-assert the sub-interface and gate on the capability bit. Restore
  returns a fully reattached *Workspace (re-inspect, rebuild config) via a
  shared reattachWorkspace helper factored out of Attach.
- Engine.CheckpointProject / RestoreProject (checkpoint_project.go): a thin
  orchestrator over the per-container primitive for multi-service compose
  projects — enumerate by com.docker.compose.project label, checkpoint each
  service to its own archive + a manifest, restore each and reattach the
  devcontainer service as the Primary workspace. Decoupled from the
  compose-go / `docker compose` machinery (label-based only).

Backend (runtime/podman, Linux):
- Embeds *docker.Runtime over Podman's docker-compatible socket for the
  standard surface; drives checkpoint/restore + buildah build through the
  libpod REST API on the same socket via a thin stdlib net/http client.
  No CLI shell-out, no pkg/bindings (that spike took deps 76->384 +
  cgo/gpgme), zero new modules, cross-compiles clean.
- Capabilities().Checkpoint gates on libpod reachability plus an optional
  Options.CheckpointProbe: the REST transport has no `criu check`
  equivalent, so the deployer (who runs the service) asserts CRIU.

Tests:
- Unit: httptest-based libpod request/response shapes, fake runtimes for the
  engine wrappers + project orchestrator (incl. partial-failure → no
  manifest, RestoreFailedError propagation), criu-probe gating, and the
  reattach image-metadata merge.
- Integration (PODMAN_SOCKET-gated; skip without a live Podman+CRIU host):
  single-container reattach, multi-service project round trip, and a
  two-phase cross-node test (checkpoint on one store, restore on a fresh
  store that never pulled the image — proving the archive is self-contained;
  it also pins the workspace-bind-source-must-exist-on-the-destination
  requirement).
- CI: a test-integration-podman job installs podman+criu+crun+iptables,
  starts the socket, and runs the gated tests for real (continue-on-error
  until the hosted-runner CRIU capability is confirmed).

Validated end-to-end on real Podman+CRIU (all integration tests pass).
Design records: design/checkpoint-restore.md (primitive + project
orchestrator §3.3 + empirical runtime matrix) and design/podman-backend.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 21, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Introduces a runtime/podman backend that embeds the existing Docker runtime for standard operations and adds CRIU-based container checkpoint/restore via Podman's libpod REST API. New CheckpointRuntime interface, error types, and Capabilities.Checkpoint gate are added to the runtime package. Engine-level Checkpoint/Restore and CheckpointProject/RestoreProject orchestrators are implemented, along with reattachWorkspace refactoring in attach.go. Unit and integration tests plus CI, docs, and README updates are included.

Changes

Podman Checkpoint/Restore Feature

Layer / File(s) Summary
Runtime checkpoint contracts, capabilities, and errors
runtime/runtime.go, runtime/compose_primitives.go, runtime/errors.go
Defines CheckpointRuntime interface with Checkpoint/Restore method signatures, CheckpointSpec/RestoreSpec/CheckpointRef types, Capabilities.Checkpoint boolean gate, and ErrCheckpointUnsupported sentinel plus CheckpointFailedError/RestoreFailedError structs with Unwrap() support.
reattachWorkspace helper extracted from AttachWith
attach.go
Extracts shared reattachWorkspace helper from Engine.AttachWith to reconstruct a *Workspace from already-inspected container details, stamp DevcontainerID, merge image metadata label, and probe userEnv; used by both attach and restore paths.
Engine checkpoint/restore primitives + unit tests
checkpoint.go, checkpoint_test.go
Adds CheckpointOptions/RestoreOptions structs and Engine.Checkpoint/Engine.Restore methods that validate inputs, capability-gate via type assertion, call the backend, and reattach the workspace post-restore via inspectStable + reattachWorkspace; unit tests cover happy paths, unsupported backends, capability false gate, error propagation, image metadata merging, and input validation.
Engine project checkpoint/restore orchestration + unit tests
checkpoint_project.go, checkpoint_project_test.go
Adds CheckpointProject (enumerates compose containers, checkpoints each to per-service archive, writes project.json manifest atomically on success) and RestoreProject (reads manifest, restores each service, reattaches primary workspace); tests cover round-trip, validation, partial-failure atomicity, error propagation, and unsupported backend.
Podman libpod HTTP client and runtime core + unit tests
runtime/podman/libpod.go, runtime/podman/podman.go, runtime/podman/podman_test.go
Introduces libpodClient (stdlib Unix-socket HTTP client with ping, do, errorBody), Runtime embedding *docker.Runtime with checkpointOK cache, Options with Socket and CheckpointProbe, New constructor running probeCheckpoint, and Capabilities() returning the checkpoint gate; unit tests validate capability gating, probeCheckpoint short-circuit, and compile-time interface assertions.
Podman checkpoint/restore implementations + unit tests
runtime/podman/checkpoint.go, runtime/podman/podman_test.go
Implements Runtime.Checkpoint (POST libpod endpoint, stream body to archive with 0600 perms, return CheckpointRef with size, wrap as CheckpointFailedError) and Runtime.Restore (open archive, POST endpoint as application/x-tar, parse container ID, wrap as RestoreFailedError); unit tests validate HTTP request shape, query parameters, error wrapping, and archive persistence.
Podman BuildImage implementation + unit tests
runtime/podman/build.go, runtime/podman/podman_test.go
Implements Runtime.BuildImage (tar context directory, POST /build with spec params, parse streaming JSON for image ID and tags, emit BuildEventLog events for log messages, handle error detection); includes buildQuery param encoding, tarDir context streaming, and parseBuildStream JSON parsing; unit tests validate query encoding, log streaming, image ID extraction, and error paths.
Integration tests: runtime, engine, cross-node, multi-service
runtime/podman/integration_test.go, test/integration/podman_checkpoint_restore_test.go, test/integration/podman_crossnode_test.go, test/integration/podman_project_checkpoint_test.go
Live-Podman integration tests gated on PODMAN_SOCKET: runtime-level counter resume after checkpoint/delete/restore and BuildImage log streaming; engine-level single-container workspace reattach with rootfs marker persistence; cross-node source/destination phases sharing archive over filesystem; multi-service compose project checkpoint/restore with DNS link and counter resume verification.
CI job, design docs, README, and .gitignore
.github/workflows/ci.yml, .github/scripts/podman-cr.sh, README.md, design/checkpoint-restore.md, design/podman-backend.md, design/README.md, .gitignore
Adds test-integration-podman CI job compiling integration binaries and running them in a privileged Podman container with CRIU capability probing; includes podman-cr.sh shell script for service setup and test gating; adds two comprehensive design documents detailing checkpoint/restore semantics and Podman backend architecture; extends README with Podman backend description and prerequisites; updates design index and .gitignore.

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant Engine
  participant podman_Runtime as runtime/podman.Runtime
  participant libpodClient
  participant Filesystem

  Note over Caller,Filesystem: Checkpoint flow
  Caller->>Engine: Checkpoint(ctx, ws, CheckpointOptions)
  Engine->>Engine: validate + type-assert CheckpointRuntime
  Engine->>podman_Runtime: Checkpoint(ctx, containerID, CheckpointSpec)
  podman_Runtime->>libpodClient: POST /containers/{id}/checkpoint?export=true&tcpestablished=...
  libpodClient-->>podman_Runtime: HTTP 200 (archive stream)
  podman_Runtime->>Filesystem: write archive to ArchivePath
  podman_Runtime-->>Engine: CheckpointRef{ArchivePath, Size}
  Engine-->>Caller: CheckpointRef

  Note over Caller,Filesystem: Restore flow
  Caller->>Engine: Restore(ctx, RestoreOptions)
  Engine->>Engine: validate + type-assert CheckpointRuntime
  Engine->>podman_Runtime: Restore(ctx, RestoreSpec)
  podman_Runtime->>Filesystem: open archive from ArchivePath
  podman_Runtime->>libpodClient: POST /containers/import/restore (archive body)
  libpodClient-->>podman_Runtime: HTTP 200 {Id: newContainerID}
  podman_Runtime-->>Engine: *Container{State: Running}
  Engine->>Engine: inspectStable → extract WorkspaceID from label
  Engine->>Engine: reattachWorkspace(ctx, details, id, LocalEnv)
  Engine-->>Caller: *Workspace
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • crunchloop/devcontainer#21: Modifies the same Engine.AttachWith reattach path to parse devcontainer.metadata image labels and merge feature metadata into the reconstructed workspace config, which is the same code path refactored into reattachWorkspace in this PR.
  • crunchloop/devcontainer#64: Both PRs modify runtime/compose_primitives.go type Capabilities, with this PR adding the Checkpoint boolean field for checkpoint/restore support.

Poem

🐇✨ Hop, hop, checkpoint the state!
CRIU freezes containers — isn't that great?
libpod whispers through a Unix socket's door,
tar archives carry memory to a distant shore.
Restore from the archive, reattach with care —
the workspace leaps back like a bunny through air! 🐾

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 38.78% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'runtime: checkpoint/restore (CheckpointRuntime) + Podman backend' accurately and concisely summarizes the main change: adding checkpoint/restore functionality with a CheckpointRuntime interface and implementing a Podman backend for it.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/checkpoint-restore-contract

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

🧹 Nitpick comments (5)
runtime/podman/integration_test.go (2)

68-69: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Replace fixed sleeps with deadline-based polling to reduce flakes.

The hard sleeps make this test timing-sensitive on slower runners; polling for observable state (counter file available/increasing) is more stable.

Also applies to: 87-88, 94-95

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@runtime/podman/integration_test.go` around lines 68 - 69, Replace the fixed
time.Sleep calls at multiple locations in the test (including the time.Sleep(6 *
time.Second) on line 68-69, and the additional sleep calls at lines 87-88 and
94-95) with deadline-based polling loops that check for observable state
changes, such as verifying that the counter file exists or that its value has
increased. This approach should use a timeout context and repeatedly check the
actual condition with small intervals rather than waiting for a fixed duration,
making the test more reliable across different execution environments.

36-37: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Use bounded contexts for live Podman integration calls.

Both tests run external socket operations with context.Background(), so a stuck Podman API can hang until the global test timeout instead of failing at the operation boundary.

Suggested patch
-	ctx := context.Background()
+	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
+	defer cancel()
 	rt, err := New(ctx, Options{Socket: socket})
-	ctx := context.Background()
+	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
+	defer cancel()
 	rt, err := New(ctx, Options{Socket: socket})

Also applies to: 130-131

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@runtime/podman/integration_test.go` around lines 36 - 37, Replace the use of
context.Background() with a bounded context that has a timeout to prevent
indefinite hangs during Podman API operations. In the integration tests where
context.Background() is passed to the New function (both at the initial creation
and in the other test location), use context.WithTimeout or context.WithDeadline
to create a context with an appropriate timeout duration. This ensures that if
the Podman socket operations get stuck, they will fail at the operation boundary
rather than hanging until the global test timeout.
test/integration/podman_project_checkpoint_test.go (1)

184-193: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

readCount hides the failure mode by collapsing all errors to zero.

Returning 0 for exec and parse failures makes diagnostics ambiguous when the test fails; surfacing the error will speed triage.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/integration/podman_project_checkpoint_test.go` around lines 184 - 193,
The readCount function currently silences all errors by returning 0 when
ExecContainer fails or when parsing the stdout fails. Instead of returning 0 on
errors, use the testing.T helper (t.Fatalf) to fail the test with the actual
error details. Specifically, when ExecContainer returns an error or non-zero
exit code, report that error using t.Fatalf with the error message. Similarly,
when strconv.Atoi fails to parse the count, report the parse error using
t.Fatalf. This way, test failures will show the actual underlying problem rather
than just silently returning 0, making debugging much faster.
design/checkpoint-restore.md (2)

116-116: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Minor language polish: "exactly" is redundant.

Line 116: "That is exactly why it works end-to-end..." could become "That is why it works end-to-end..." The emphasis is already strong from the preceding sentence.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@design/checkpoint-restore.md` at line 116, The word "exactly" in the sentence
"That is exactly why it works end-to-end" is redundant and weakens the emphasis
already conveyed by the preceding sentence. Remove the word "exactly" from this
sentence so it reads "That is why it works end-to-end..." to improve the
language flow and eliminate the unnecessary emphasis.

300-300: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Minor language polish: "very" is a weak intensifier.

Line 300: "pulls in the very large containers/podman module..." Consider "pulls in the massive / bulky containers/podman module (cgo, storage build tags, ~300 added modules)" to make the cost concrete rather than vague.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@design/checkpoint-restore.md` at line 300, The phrase "pulls in the very
large" uses "very" as a weak intensifier that lacks concrete information about
why this is a problem. Replace "very" with a more descriptive term like
"massive" or "bulky" and add specific details about the actual cost of including
the containers/podman module, such as mentioning cgo requirements, storage build
tags, and the approximate number of added modules (around 300) to make the
impact concrete and easier to understand.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/ci.yml:
- Around line 145-146: The GitHub Actions references for actions/checkout and
actions/setup-go on lines 145-146 are using version tags (`@v6`) without being
pinned to specific commit hashes, which is a security risk. Replace the `@v6`
references with exact commit hashes (e.g., the full SHA like
`@a05bd36ab36d3e7f7a1506cbb51ccad719569de1`) for both actions/checkout@v6 and
actions/setup-go@v6 to pin them to their current release commits and ensure
supply-chain integrity.

In `@checkpoint_project.go`:
- Around line 140-153: The code currently allows multiple containers to resolve
to the same service name through serviceName(c), which causes archive file
overwrites since archive := svc + ".tar" will be identical for duplicates.
Before the checkpoint loop iterates through containers, add validation to detect
and reject duplicate service names. Track seen service names in a map or set as
you iterate through the containers, and return an error immediately if a
duplicate service name is encountered, preventing silent data loss during the
checkpoint operation.
- Around line 236-249: The readProjectManifest function does not validate the
archive paths in the services array before using them, creating a path traversal
vulnerability. After unmarshaling the JSON into the ref variable, add validation
logic to ensure each services[].archive path does not contain path traversal
sequences like ../ and is not an absolute path. Check these constraints for all
archive entries and return an error if any archive path violates these security
requirements before assigning ref.ArchiveDir and returning the ref.
- Around line 228-234: The writeProjectManifest function uses os.WriteFile
directly which is not atomic and can leave a truncated file if interrupted,
breaking the completion-marker guarantee. Modify the function to write the JSON
manifest data to a temporary file first (using a temp file pattern), then
atomically rename that temporary file to the final destination using os.Rename,
ensuring that the project manifest file only appears complete or not at all.

In `@checkpoint.go`:
- Around line 134-135: The Engine.Restore method currently allows restoration
without validating that the required devcontainer ID label exists in
details.Labels, which can result in an empty WorkspaceID being returned. Before
calling reattachWorkspace, add a check to verify that details.Labels contains
the LabelDevcontainerID key. If the label is missing, return an error instead of
proceeding with the reattach operation. This ensures the method fails explicitly
rather than returning a workspace with an empty ID that violates the fully
reattached contract.

In `@design/podman-backend.md`:
- Around line 132-141: The design document contains conflicting descriptions of
how the container build and runtime operations are implemented. The sections
discussing podman build behavior (around the BuildImage implementation) claim a
CLI-based approach that shells out to podman, while other sections in the same
document describe a REST-based implementation. Review the sections describing
the build and container runtime behavior, identify which approach (CLI-based or
REST-based) actually matches the current code implementation, and update all
conflicting sections to consistently describe the same implementation path.
Ensure sections around line 132-141 and 192-203 align with each other and with
the actual implementation.

In `@runtime/podman/build.go`:
- Around line 105-111: The tarDir goroutine spawned to write to the pipe writer
(pw) can become blocked if the r.lp.do() request call fails before consuming the
full request body. When an error occurs on line 108, the function returns early
without consuming data from the pipe reader (pr), causing the goroutine to hang
indefinitely. Close the pipe reader (pr) on error by adding a defer statement or
explicit close in the error handling path to signal EOF to the tarDir goroutine
and allow it to finish gracefully and avoid goroutine leaks.

In `@runtime/podman/checkpoint.go`:
- Around line 84-89: The code in the Decode block does not validate that the
restored container ID is populated before returning success. After the
json.NewDecoder successfully decodes the response into the rep variable (which
is of type restoreReport), add a validation check to ensure rep.ID is not empty.
If rep.ID is empty, return an appropriate error (similar to the existing
RestoreFailedError) to prevent returning a Container with an empty ID that
causes issues downstream.
- Around line 35-38: The os.Create call used to create the checkpoint archive
file at spec.ArchivePath may result in files with overly permissive access
permissions depending on the system umask. Replace os.Create with os.OpenFile,
specifying explicit flags (O_CREATE, O_WRONLY, O_TRUNC) and restrictive file
permissions (0600) to ensure the checkpoint archive file is only readable and
writable by the owner, protecting the sensitive process memory contents.
- Around line 39-46: When either copyErr or closeErr is encountered in the
checkpoint archive writing process, a partial or truncated file is left at
ArchivePath that could be mistaken for valid data. In both error handling
branches (the if copyErr != nil block and the if closeErr != nil block), add a
file deletion operation to remove the archive file at ArchivePath before
returning the CheckpointFailedError, ensuring no partial checkpoint artifacts
remain that could cause issues during subsequent operations.

In `@test/integration/podman_project_checkpoint_test.go`:
- Around line 58-64: The CreateNetwork call is not idempotent across interrupted
reruns because if a network with the same netName already exists from a previous
failed run, the creation will fail. To fix this, before calling CreateNetwork
with the netName, first attempt to remove any existing network with that name
using rt.RemoveNetwork (or similar method), ignoring errors if the network
doesn't exist, so that each test run starts with a clean state and the network
creation succeeds regardless of prior run failures.

---

Nitpick comments:
In `@design/checkpoint-restore.md`:
- Line 116: The word "exactly" in the sentence "That is exactly why it works
end-to-end" is redundant and weakens the emphasis already conveyed by the
preceding sentence. Remove the word "exactly" from this sentence so it reads
"That is why it works end-to-end..." to improve the language flow and eliminate
the unnecessary emphasis.
- Line 300: The phrase "pulls in the very large" uses "very" as a weak
intensifier that lacks concrete information about why this is a problem. Replace
"very" with a more descriptive term like "massive" or "bulky" and add specific
details about the actual cost of including the containers/podman module, such as
mentioning cgo requirements, storage build tags, and the approximate number of
added modules (around 300) to make the impact concrete and easier to understand.

In `@runtime/podman/integration_test.go`:
- Around line 68-69: Replace the fixed time.Sleep calls at multiple locations in
the test (including the time.Sleep(6 * time.Second) on line 68-69, and the
additional sleep calls at lines 87-88 and 94-95) with deadline-based polling
loops that check for observable state changes, such as verifying that the
counter file exists or that its value has increased. This approach should use a
timeout context and repeatedly check the actual condition with small intervals
rather than waiting for a fixed duration, making the test more reliable across
different execution environments.
- Around line 36-37: Replace the use of context.Background() with a bounded
context that has a timeout to prevent indefinite hangs during Podman API
operations. In the integration tests where context.Background() is passed to the
New function (both at the initial creation and in the other test location), use
context.WithTimeout or context.WithDeadline to create a context with an
appropriate timeout duration. This ensures that if the Podman socket operations
get stuck, they will fail at the operation boundary rather than hanging until
the global test timeout.

In `@test/integration/podman_project_checkpoint_test.go`:
- Around line 184-193: The readCount function currently silences all errors by
returning 0 when ExecContainer fails or when parsing the stdout fails. Instead
of returning 0 on errors, use the testing.T helper (t.Fatalf) to fail the test
with the actual error details. Specifically, when ExecContainer returns an error
or non-zero exit code, report that error using t.Fatalf with the error message.
Similarly, when strconv.Atoi fails to parse the count, report the parse error
using t.Fatalf. This way, test failures will show the actual underlying problem
rather than just silently returning 0, making debugging much faster.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 83a254fe-5d7e-4146-8e93-79120c50801b

📥 Commits

Reviewing files that changed from the base of the PR and between 3e81722 and f5f58d5.

📒 Files selected for processing (23)
  • .github/workflows/ci.yml
  • .gitignore
  • README.md
  • attach.go
  • checkpoint.go
  • checkpoint_project.go
  • checkpoint_project_test.go
  • checkpoint_test.go
  • design/README.md
  • design/checkpoint-restore.md
  • design/podman-backend.md
  • runtime/compose_primitives.go
  • runtime/errors.go
  • runtime/podman/build.go
  • runtime/podman/checkpoint.go
  • runtime/podman/integration_test.go
  • runtime/podman/libpod.go
  • runtime/podman/podman.go
  • runtime/podman/podman_test.go
  • runtime/runtime.go
  • test/integration/podman_checkpoint_restore_test.go
  • test/integration/podman_crossnode_test.go
  • test/integration/podman_project_checkpoint_test.go

Comment thread .github/workflows/ci.yml
Comment thread checkpoint_project.go
Comment thread checkpoint_project.go
Comment thread checkpoint_project.go
Comment thread checkpoint.go
Comment thread runtime/podman/build.go
Comment thread runtime/podman/checkpoint.go Outdated
Comment thread runtime/podman/checkpoint.go
Comment thread runtime/podman/checkpoint.go
Comment thread test/integration/podman_project_checkpoint_test.go Outdated
bilby91 and others added 3 commits June 21, 2026 21:36
CodeRabbit review on #98. Substantive fixes:

- runtime/podman/checkpoint.go: write the archive 0600 (it carries process
  memory) instead of os.Create's umask default; delete a partial archive
  if the copy/close fails; reject an empty container id from libpod restore.
- runtime/podman/build.go: CloseWithError the pipe reader when the /build
  request errors, so the tarDir goroutine doesn't leak parked on the pipe.
- checkpoint.go: Engine.Restore now errors if the restored container has no
  dev.containers.id label (was silently returning an empty Workspace.ID).
- checkpoint_project.go: reject duplicate service names (would overwrite
  archives); write project.json atomically via temp+rename so a present
  manifest is always complete; validate manifest archive entries are plain
  basenames (block path traversal on restore).
- ci.yml: criu was dropped from Ubuntu 24.04 (the install failed with "no
  installation candidate"); pin ubuntu-22.04 (jammy ships criu in universe)
  and have the setup step skip the job green when criu can't be installed
  or `criu check` fails, so it's a ready harness rather than a red X.
- design/podman-backend.md: §5.1/§5.2 still presented the CLI plan as final;
  marked superseded by the libpod-REST implementation.
- multi-service integration test: clear a leftover network before
  CreateNetwork so reruns after an interrupted run are idempotent.

Skipped: the unpinned-actions flag — matches the repo-wide @vn convention
(every existing job uses it); out of scope for this PR.

Re-validated on real Podman+CRIU (OrbStack): all integration tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… runners

The test-integration-podman job failed for real reasons, not flakiness:
Ubuntu 24.04 dropped criu; 22.04 ships podman 3.4.4 whose OCI runtime
reports "does not support checkpoint/restore" (and 3.x predates the libpod
v5 API this backend targets). criu check passing wasn't a sufficient gate.

Now the job always compiles the gated tests (build step), and only runs
them when the host is genuinely capable (criu + criu check + podman>=5),
skipping green with a warning otherwise. Dropped continue-on-error so a
failure on a capable (self-hosted) runner is honest. Real C/R is validated
locally on podman 5.x + criu (OrbStack).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The hosted runner's apt podman can't do C/R (24.04 has no criu; 22.04's
podman 3.4.4 can't checkpoint and predates the libpod v5 API), so the
previous job could only skip. Instead bring a modern stack in a container:
build the gated tests on the runner (static, CGO_ENABLED=0) for compile
coverage, then run them inside quay.io/podman/stable (podman 5.x + crun +
criu) launched --privileged --cgroupns=host so CRIU can drive the runner's
kernel. .github/scripts/podman-cr.sh smoke-tests an actual checkpoint and
skips green with the real reason if the runner can't (e.g. cgroup-freezer
perms), so a capable runner runs for real and an incapable one stays green.

Validated in the equivalent topology (Linux VM → Docker → modern-podman
container → CRIU) on a criu-capable kernel: all four integration tests pass
(checkpoint/restore, buildah build, engine reattach, multi-service project).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
.github/workflows/ci.yml (1)

143-166: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Add an explicit timeout to the Podman integration job.

This job runs privileged containerized tests and can hang on host/kernel edge cases; set timeout-minutes to bound CI resource consumption and improve failure recovery.

Suggested diff
   test-integration-podman:
     runs-on: ubuntu-latest
+    timeout-minutes: 30
     needs: [lint, test-linux]
     steps:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ci.yml around lines 143 - 166, The test-integration-podman
job lacks an explicit timeout which can cause the CI pipeline to hang when the
privileged containerized tests encounter kernel edge cases. Add a
timeout-minutes property to the test-integration-podman job at the same
indentation level as the runs-on, needs, and steps properties to set a
reasonable timeout limit that bounds resource consumption and improves failure
recovery.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In @.github/workflows/ci.yml:
- Around line 143-166: The test-integration-podman job lacks an explicit timeout
which can cause the CI pipeline to hang when the privileged containerized tests
encounter kernel edge cases. Add a timeout-minutes property to the
test-integration-podman job at the same indentation level as the runs-on, needs,
and steps properties to set a reasonable timeout limit that bounds resource
consumption and improves failure recovery.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 51e096f4-f990-4c68-8ce3-6fa94f03404c

📥 Commits

Reviewing files that changed from the base of the PR and between 5a8dd70 and fa74dcc.

📒 Files selected for processing (2)
  • .github/scripts/podman-cr.sh
  • .github/workflows/ci.yml
✅ Files skipped from review due to trivial changes (1)
  • .github/scripts/podman-cr.sh

@bilby91 bilby91 merged commit 2c0c14c into main Jun 22, 2026
19 checks passed
bilby91 added a commit that referenced this pull request Jun 24, 2026
Document everything merged since v0.3.0: Podman backend +
checkpoint/restore (#98), compose feature security metadata + entrypoint
chaining (#103), Podman compose health probing (#102), deps bump (#101),
and the prebuild dev-environment / CI tooling (#88#93).

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant