Skip to content

OSAC-55: Add VM Snapshots enhancement proposal#48

Open
ygalblum wants to merge 3 commits into
osac-project:mainfrom
ygalblum:enhancement/vm-snapshots
Open

OSAC-55: Add VM Snapshots enhancement proposal#48
ygalblum wants to merge 3 commits into
osac-project:mainfrom
ygalblum:enhancement/vm-snapshots

Conversation

@ygalblum
Copy link
Copy Markdown
Contributor

@ygalblum ygalblum commented Jun 1, 2026

Introduce a Snapshot API for OSAC ComputeInstance resources, enabling tenants to create, list, get, delete, and restore from point-in-time snapshots of their VMs through the fulfillment-service API.

Key design decisions:

  • New Snapshot resource (proto + CRD + dual controller)
  • Both online and offline snapshot modes via KubeVirt
  • In-place restore via declarative signal on ComputeInstance
  • Cascade deletion when parent ComputeInstance is deleted

Assisted-by: Claude Code noreply@anthropic.com

Summary by CodeRabbit

  • Documentation
    • Added comprehensive VM Snapshots enhancement docs: tenants can create/list/get/delete point-in-time VM snapshots and request in-place restores; describes online vs offline snapshot semantics, required downtime for in-place restores, restore progress/conditions, cascade deletion behavior, and tenant isolation expectations. Also documents failure modes, prerequisites for snapshot support, and testing/graduation and recovery procedures.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 1, 2026

Warning

Rate limit exceeded

@ygalblum has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 9 minutes and 33 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: osac-project/coderabbit/.coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 10ff5836-3363-4482-969d-ccdcbb5f7c9c

📥 Commits

Reviewing files that changed from the base of the PR and between a497052 and 81a7838.

📒 Files selected for processing (1)
  • enhancements/vm-snapshots/README.md

Walkthrough

Adds a VM Snapshots enhancement document defining a ComputeInstanceSnapshot API, end-to-end snapshot/restore lifecycle, operator CRDs/reconciliation with KubeVirt, fulfillment-service persistence and RPCs, tenant isolation, failure modes, and operational/test guidance. Risk severity: medium; impact: snapshot data exposure and restore failures.

Changes

VM Snapshots Enhancement Specification

Layer / File(s) Summary
Feature overview and motivation
enhancements/vm-snapshots/README.md
Document header, terminology, user stories, goals, and non-goals.
Snapshot & restore workflows
enhancements/vm-snapshots/README.md
Tenant workflows for create/list/get/delete snapshots, online vs offline semantics, restore signalling and error cases, and cascade deletion ordering.
Fulfillment-service API & ComputeInstance changes
enhancements/vm-snapshots/README.md
Defines ComputeInstanceSnapshot proto/contract, public/private RPCs, and ComputeInstance restore spec/status/condition additions.
osac-operator CRD & reconciliation
enhancements/vm-snapshots/README.md
CRD spec/status shapes; operator reconcilers provision KubeVirt VirtualMachineSnapshot/VirtualMachineRestore, validate VM stopped state for restores, and sync statuses back to fulfillment-service.
Persistence, risks, tests & operations
enhancements/vm-snapshots/README.md
Adds compute_instance_snapshots DB schema, integration notes, failure modes, test plan (unit/integration/E2E), graduation criteria, upgrade/downgrade guidance, and infra prerequisites (KubeVirt snapshot + CSI VolumeSnapshot).

Sequence Diagram(s):

sequenceDiagram
  participant Tenant
  participant FulfillmentService
  participant OsacOperator
  participant KubeVirtCluster
  Tenant->>FulfillmentService: CreateSnapshot RPC (ComputeInstanceSnapshot)
  FulfillmentService->>OsacOperator: Create ComputeInstanceSnapshot CR
  OsacOperator->>KubeVirtCluster: Create VirtualMachineSnapshot
  KubeVirtCluster->>OsacOperator: Snapshot phase/conditions
  OsacOperator->>FulfillmentService: Update snapshot status
  Tenant->>FulfillmentService: Request in-place restore (spec.restore_from_snapshot + spec.restore_requested_at)
  FulfillmentService->>OsacOperator: Mark ComputeInstance restore signal
  OsacOperator->>KubeVirtCluster: Validate snapshot readiness & VM stopped -> Create VirtualMachineRestore
  KubeVirtCluster->>OsacOperator: Restore completion/failure
  OsacOperator->>FulfillmentService: Sync restore status (last_restored_at, conditions)
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

  • avishayt
  • mhrivnak

Poem

Snap the VM, a frozen line,
Metadata keeps the moment fine,
Restore by signal, guarded gate,
Tenant scope and careful state,
Docs penned neat — ops calibrate.

🚥 Pre-merge checks | ✅ 11
✅ Passed checks (11 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically identifies the main change: adding a VM Snapshots enhancement proposal (OSAC-55) to define a snapshot/restore API for ComputeInstance resources.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
No-Hardcoded-Secrets ✅ Passed PR contains only a documentation file (README.md) with no hardcoded secrets, API keys, credentials, tokens, or sensitive data. All code examples use placeholders instead of real values.
No-Weak-Crypto ✅ Passed Documentation-only PR with no cryptographic implementations. No MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB, custom crypto, or timing-sensitive comparisons detected.
No-Injection-Vectors ✅ Passed PR adds documentation only (README.md). No executable code with SQL concatenation, shell injection, eval/exec, pickle, unsafe YAML, os.system, or innerHTML vulnerabilities detected.
Container-Privileges ✅ Passed PR contains only documentation (README.md); no container manifests or K8s configurations present to evaluate for privileged settings.
No-Sensitive-Data-In-Logs ✅ Passed Documentation does not expose passwords, tokens, API keys, PII, or sensitive data in logging. Error messages reference KubeVirt failures generically without exposing customer data.
Ai-Attribution ✅ Passed AI tool usage (Claude Code) is properly attributed with "Assisted-by" trailer in commit message, following Red Hat attribution standards. No misattribution via Co-Authored-By detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@enhancements/vm-snapshots/README.md`:
- Around line 523-531: The README currently pins concrete proto field numbers
(e.g., ComputeInstance fields restore_from_snapshot = 16 and
restore_requested_at = 17) without instructing validation or reserving retired
numbers; update the doc to require verifying the target proto's current field
numbering and reserved ranges before assigning new numbers, and mandate adding
any conflicting/retired numbers to the reserved section of the ComputeInstance
proto (or equivalent upstream message) to avoid wire incompatibility.
Specifically: call out validating existing definitions/enums, checking upstream
repos/descriptor sets for collisions, reserve any retired or conflicting
numbers, and update the README examples (including references around the other
affected lines) to state these validation and reservation steps as a
prerequisite to implementation.
- Around line 832-833: The README currently contradicts itself by calling
ComputeInstanceSnapshot additive/backward-compatible (noting restore fields are
optional) while elsewhere labeling this as a breaking API change with "Not
applicable" upgrade/version-skew; reconcile these sections by choosing one
compatibility model and making upgrade expectations explicit: update the
paragraphs referencing ComputeInstanceSnapshot, ComputeInstance restore fields,
osac-operator, and fulfillment-service so they consistently state whether this
is backward-compatible or breaking, and if backward-compatible provide the
recommended upgrade order (osac-operator first, then fulfillment-service) and
allowable version skew; if breaking, mark upgrade/version-skew requirements
accordingly and remove the additive/backward-compatible wording.
- Around line 188-191: Pick a single authoritative contract for the "VM must be
stopped" precondition and make both the API/fulfillment path and the
ComputeInstance controller follow it: either (A) enforce synchronously in the
fulfillment-service handler that creates/updates
spec.restore_requested_at/spec.restore_from_snapshot (return an API error when
the VM is running) and remove the controller's synchronous RestoreFailed setting
for that same precondition, or (B) accept the request in fulfillment-service and
have the ComputeInstance controller be the single place that validates VM state
and sets the RestoreFailed condition. Implement the chosen approach by updating
the fulfillment-service validation logic (the code that currently returns
"Error: ... must be stopped") and the ComputeInstance controller code that
detects spec.restore_requested_at > status.last_restored_at and sets the
RestoreFailed condition so they are consistent (either both perform the check
and return an API error, or the controller alone performs the check and sets
RestoreFailed); ensure the RestoreFailed condition logic (and any API error
messages) reflect only the selected contract.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: osac-project/coderabbit/.coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: eb3f6df5-ce01-46ec-a525-100e2a549f69

📥 Commits

Reviewing files that changed from the base of the PR and between 5fdd806 and 80cf33e.

📒 Files selected for processing (1)
  • enhancements/vm-snapshots/README.md

Comment thread enhancements/vm-snapshots/README.md Outdated
Comment thread enhancements/vm-snapshots/README.md
Comment thread enhancements/vm-snapshots/README.md Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
enhancements/vm-snapshots/README.md (2)

231-236: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

High risk (major): avoid ownership-disclosing restore errors that can leak snapshot existence across scopes.

"does not belong to compute instance" reveals that the snapshot ID exists. For security-hardening, restore should use the same non-enumerable response pattern documented elsewhere (not found) whenever caller cannot access that snapshot in the current scope.
Impact: tenant/resource enumeration signal in error surface.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@enhancements/vm-snapshots/README.md` around lines 231 - 236, The restore flow
for the CLI flag --restore-from-snapshot (used by the osac update
computeinstance command / computeinstance restore handler) currently returns a
distinct "does not belong to compute instance" error that leaks the snapshot's
existence; change the error handling so any case where the caller cannot access
the snapshot (not found, wrong scope, or ownership mismatch) returns the same
non-enumerable "not found" style response used elsewhere (e.g., "compute
instance snapshot \"<id>\" not found") rather than revealing ownership, updating
the error text/error code in the computeinstance restore logic that produces the
current message.

1000-1004: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Medium risk (minor): infrastructure requirement statement is internally inconsistent.

Line 1000 says no additional infrastructure is required, but Lines 1002-1004 define mandatory cluster/storage prerequisites. Please reword to avoid operational misreads during rollout.
Impact: incorrect deployment expectations and failed enablement planning.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@enhancements/vm-snapshots/README.md` around lines 1000 - 1004, The opening
sentence ("No additional infrastructure is required...") is contradictory with
the following bullet points; update the README section so the first sentence
clearly states that cluster and storage prerequisites are required rather than
“no additional infrastructure.” Specifically, replace that sentence with a brief
statement that KubeVirt snapshot support (KubeVirt snapshot feature gate enabled
and a CSI driver supporting VolumeSnapshots) and a storage backend with CSI
VolumeSnapshot support are required, and keep the existing bullets ("KubeVirt
snapshot support" and "Storage backend with CSI VolumeSnapshot support") as the
explicit prerequisites to avoid operational misreads.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@enhancements/vm-snapshots/README.md`:
- Around line 231-236: The restore flow for the CLI flag --restore-from-snapshot
(used by the osac update computeinstance command / computeinstance restore
handler) currently returns a distinct "does not belong to compute instance"
error that leaks the snapshot's existence; change the error handling so any case
where the caller cannot access the snapshot (not found, wrong scope, or
ownership mismatch) returns the same non-enumerable "not found" style response
used elsewhere (e.g., "compute instance snapshot \"<id>\" not found") rather
than revealing ownership, updating the error text/error code in the
computeinstance restore logic that produces the current message.
- Around line 1000-1004: The opening sentence ("No additional infrastructure is
required...") is contradictory with the following bullet points; update the
README section so the first sentence clearly states that cluster and storage
prerequisites are required rather than “no additional infrastructure.”
Specifically, replace that sentence with a brief statement that KubeVirt
snapshot support (KubeVirt snapshot feature gate enabled and a CSI driver
supporting VolumeSnapshots) and a storage backend with CSI VolumeSnapshot
support are required, and keep the existing bullets ("KubeVirt snapshot support"
and "Storage backend with CSI VolumeSnapshot support") as the explicit
prerequisites to avoid operational misreads.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: osac-project/coderabbit/.coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 8b233a0e-7d78-490e-b1ab-ca37b6229183

📥 Commits

Reviewing files that changed from the base of the PR and between 80cf33e and ff844ae.

📒 Files selected for processing (1)
  • enhancements/vm-snapshots/README.md

@ygalblum ygalblum requested a review from avishayt June 2, 2026 21:36
Introduce a Snapshot API for OSAC ComputeInstance resources, enabling
tenants to create, list, get, delete, and restore from point-in-time
snapshots of their VMs through the fulfillment-service API.

Key design decisions:
- New Snapshot resource (proto + CRD + dual controller)
- Both online and offline snapshot modes via KubeVirt
- In-place restore via declarative signal on ComputeInstance
- Cascade deletion when parent ComputeInstance is deleted

Assisted-by: Claude Code <noreply@anthropic.com>
Signed-off-by: Ygal Blum <ygal.blum@gmail.com>
@ygalblum ygalblum force-pushed the enhancement/vm-snapshots branch from 5881c3f to a497052 Compare June 3, 2026 10:30
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
enhancements/vm-snapshots/README.md (1)

950-956: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Upgrade/Downgrade strategy incorrectly claims breaking API change.

The proposal introduces a new resource (ComputeInstanceSnapshot) and adds optional fields to an existing resource (ComputeInstance restore fields). Neither of these changes is wire-breaking or backward-incompatible:

  • New resources do not affect existing clients that don't use them
  • Optional proto fields (restore_from_snapshot, restore_requested_at) can be safely ignored by older clients
  • New enum values (RESTORE_IN_PROGRESS, RESTORE_FAILED conditions) are backward-compatible per protobuf evolution rules

Risk severity: Minor (documentation/process issue). Incorrectly marking this as breaking could trigger unnecessary release coordination or client migration work.

Correction: Replace "Not applicable - OSAC is pre-GA. This is a breaking API change." with proper upgrade guidance:

  • Upgrade strategy: Deploy osac-operator first (to handle new CRD and restore logic), then fulfillment-service (to expose new APIs). Existing ComputeInstance workloads are unaffected.
  • Downgrade strategy: If downgrading, existing ComputeInstanceSnapshot resources will become unmanaged orphans. Recommend deleting all snapshots before downgrade to avoid storage leaks.
  • Version skew: fulfillment-service can be up to 1 minor version ahead or behind osac-operator. Snapshot APIs will return 404/NotImplemented if osac-operator does not support snapshots.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@enhancements/vm-snapshots/README.md` around lines 950 - 956, The README
currently incorrectly marks the change as a breaking API; update the "Upgrade /
Downgrade Strategy" and "Version Skew Strategy" sections to document a proper
upgrade path: state that operators should deploy osac-operator first (to add the
ComputeInstanceSnapshot CRD and restore logic) then rollout fulfillment-service
(which exposes the new snapshot APIs), note that existing ComputeInstance
workloads are unaffected, and explain downgrade behavior (existing
ComputeInstanceSnapshot resources become unmanaged/orphaned—recommend deleting
snapshots before downgrade to avoid storage leaks); also update version skew
guidance to say fulfillment-service may be one minor version ahead/behind
osac-operator and that snapshot endpoints may return 404/NotImplemented if
osac-operator lacks snapshot support; keep mention that new optional fields
restore_from_snapshot and restore_requested_at and new enum values
RESTORE_IN_PROGRESS/RESTORE_FAILED are protobuf-backward-compatible.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@enhancements/vm-snapshots/README.md`:
- Around line 806-823: Add a new high-severity risk entry under the "Risks and
Mitigations" section describing snapshot data exposure via storage-backend
compromise (KubeVirt VolumeSnapshots / CSI VolumeSnapshot referencing provider
snapshots), and include the recommended mitigations: isolate storage backend
authentication per workload cluster, enable encryption at rest, restrict network
access to storage backends via network policies, and treat storage backend
credentials as critical secrets in operational docs; also mention this
dependency when describing the ComputeInstanceSnapshot controller and
prerequisites so operators are aware to enforce these controls.
- Around line 806-823: The restore flow has a TOCTOU: after initial validation
the controller immediately creates a VirtualMachineRestore and can fail if the
referenced ComputeInstanceSnapshot was deleted or changed; update the
ComputeInstance controller's restore logic (the code path that handles
restore_requested_at and calls to create VirtualMachineRestore) to re-fetch the
referenced ComputeInstanceSnapshot CR and verify snapshot.status.phase ===
"Ready" (and that the CR still exists) immediately before creating the
VirtualMachineRestore, and if the check fails abort the restore and update the
ComputeInstance status/condition to RestoreFailed with a clear message like
"Snapshot was deleted or is no longer ready"; do not attempt to create the
VirtualMachineRestore when this re-validation fails.
- Around line 758-765: Ensure the ComputeInstance controller's cascade deletion
handler explicitly verifies tenant ownership when deleting snapshots: when
matching child ComputeInstanceSnapshot CRs for deletion (the cascade deletion
logic referenced in the ComputeInstance controller), require both the
owner/reference to the parent ComputeInstance and that the
osac.openshift.io/tenant annotation equals the parent ComputeInstance's tenant
annotation before performing deletion; update the cascade deletion description
to state this dual-filter requirement so snapshots are only deletable by the
same tenant.

---

Duplicate comments:
In `@enhancements/vm-snapshots/README.md`:
- Around line 950-956: The README currently incorrectly marks the change as a
breaking API; update the "Upgrade / Downgrade Strategy" and "Version Skew
Strategy" sections to document a proper upgrade path: state that operators
should deploy osac-operator first (to add the ComputeInstanceSnapshot CRD and
restore logic) then rollout fulfillment-service (which exposes the new snapshot
APIs), note that existing ComputeInstance workloads are unaffected, and explain
downgrade behavior (existing ComputeInstanceSnapshot resources become
unmanaged/orphaned—recommend deleting snapshots before downgrade to avoid
storage leaks); also update version skew guidance to say fulfillment-service may
be one minor version ahead/behind osac-operator and that snapshot endpoints may
return 404/NotImplemented if osac-operator lacks snapshot support; keep mention
that new optional fields restore_from_snapshot and restore_requested_at and new
enum values RESTORE_IN_PROGRESS/RESTORE_FAILED are protobuf-backward-compatible.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: osac-project/coderabbit/.coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 3efebbac-0087-4c1f-b6a7-18f100333e22

📥 Commits

Reviewing files that changed from the base of the PR and between ff844ae and a497052.

📒 Files selected for processing (1)
  • enhancements/vm-snapshots/README.md

Comment thread enhancements/vm-snapshots/README.md
Comment thread enhancements/vm-snapshots/README.md
ygalblum added 2 commits June 3, 2026 07:10
Signed-off-by: Ygal Blum <ygal.blum@gmail.com>
Signed-off-by: Ygal Blum <ygal.blum@gmail.com>
@ygalblum ygalblum requested a review from mhrivnak June 3, 2026 13:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants