OSAC-55: Add VM Snapshots enhancement proposal#48
Conversation
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Repository: osac-project/coderabbit/.coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (1)
WalkthroughAdds a VM Snapshots enhancement document defining a ComputeInstanceSnapshot API, end-to-end snapshot/restore lifecycle, operator CRDs/reconciliation with KubeVirt, fulfillment-service persistence and RPCs, tenant isolation, failure modes, and operational/test guidance. Risk severity: medium; impact: snapshot data exposure and restore failures. ChangesVM Snapshots Enhancement Specification
Sequence Diagram(s): sequenceDiagram
participant Tenant
participant FulfillmentService
participant OsacOperator
participant KubeVirtCluster
Tenant->>FulfillmentService: CreateSnapshot RPC (ComputeInstanceSnapshot)
FulfillmentService->>OsacOperator: Create ComputeInstanceSnapshot CR
OsacOperator->>KubeVirtCluster: Create VirtualMachineSnapshot
KubeVirtCluster->>OsacOperator: Snapshot phase/conditions
OsacOperator->>FulfillmentService: Update snapshot status
Tenant->>FulfillmentService: Request in-place restore (spec.restore_from_snapshot + spec.restore_requested_at)
FulfillmentService->>OsacOperator: Mark ComputeInstance restore signal
OsacOperator->>KubeVirtCluster: Validate snapshot readiness & VM stopped -> Create VirtualMachineRestore
KubeVirtCluster->>OsacOperator: Restore completion/failure
OsacOperator->>FulfillmentService: Sync restore status (last_restored_at, conditions)
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 11✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@enhancements/vm-snapshots/README.md`:
- Around line 523-531: The README currently pins concrete proto field numbers
(e.g., ComputeInstance fields restore_from_snapshot = 16 and
restore_requested_at = 17) without instructing validation or reserving retired
numbers; update the doc to require verifying the target proto's current field
numbering and reserved ranges before assigning new numbers, and mandate adding
any conflicting/retired numbers to the reserved section of the ComputeInstance
proto (or equivalent upstream message) to avoid wire incompatibility.
Specifically: call out validating existing definitions/enums, checking upstream
repos/descriptor sets for collisions, reserve any retired or conflicting
numbers, and update the README examples (including references around the other
affected lines) to state these validation and reservation steps as a
prerequisite to implementation.
- Around line 832-833: The README currently contradicts itself by calling
ComputeInstanceSnapshot additive/backward-compatible (noting restore fields are
optional) while elsewhere labeling this as a breaking API change with "Not
applicable" upgrade/version-skew; reconcile these sections by choosing one
compatibility model and making upgrade expectations explicit: update the
paragraphs referencing ComputeInstanceSnapshot, ComputeInstance restore fields,
osac-operator, and fulfillment-service so they consistently state whether this
is backward-compatible or breaking, and if backward-compatible provide the
recommended upgrade order (osac-operator first, then fulfillment-service) and
allowable version skew; if breaking, mark upgrade/version-skew requirements
accordingly and remove the additive/backward-compatible wording.
- Around line 188-191: Pick a single authoritative contract for the "VM must be
stopped" precondition and make both the API/fulfillment path and the
ComputeInstance controller follow it: either (A) enforce synchronously in the
fulfillment-service handler that creates/updates
spec.restore_requested_at/spec.restore_from_snapshot (return an API error when
the VM is running) and remove the controller's synchronous RestoreFailed setting
for that same precondition, or (B) accept the request in fulfillment-service and
have the ComputeInstance controller be the single place that validates VM state
and sets the RestoreFailed condition. Implement the chosen approach by updating
the fulfillment-service validation logic (the code that currently returns
"Error: ... must be stopped") and the ComputeInstance controller code that
detects spec.restore_requested_at > status.last_restored_at and sets the
RestoreFailed condition so they are consistent (either both perform the check
and return an API error, or the controller alone performs the check and sets
RestoreFailed); ensure the RestoreFailed condition logic (and any API error
messages) reflect only the selected contract.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: osac-project/coderabbit/.coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: eb3f6df5-ce01-46ec-a525-100e2a549f69
📒 Files selected for processing (1)
enhancements/vm-snapshots/README.md
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
enhancements/vm-snapshots/README.md (2)
231-236:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHigh risk (major): avoid ownership-disclosing restore errors that can leak snapshot existence across scopes.
"does not belong to compute instance"reveals that the snapshot ID exists. For security-hardening, restore should use the same non-enumerable response pattern documented elsewhere (not found) whenever caller cannot access that snapshot in the current scope.
Impact: tenant/resource enumeration signal in error surface.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@enhancements/vm-snapshots/README.md` around lines 231 - 236, The restore flow for the CLI flag --restore-from-snapshot (used by the osac update computeinstance command / computeinstance restore handler) currently returns a distinct "does not belong to compute instance" error that leaks the snapshot's existence; change the error handling so any case where the caller cannot access the snapshot (not found, wrong scope, or ownership mismatch) returns the same non-enumerable "not found" style response used elsewhere (e.g., "compute instance snapshot \"<id>\" not found") rather than revealing ownership, updating the error text/error code in the computeinstance restore logic that produces the current message.
1000-1004:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winMedium risk (minor): infrastructure requirement statement is internally inconsistent.
Line 1000 says no additional infrastructure is required, but Lines 1002-1004 define mandatory cluster/storage prerequisites. Please reword to avoid operational misreads during rollout.
Impact: incorrect deployment expectations and failed enablement planning.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@enhancements/vm-snapshots/README.md` around lines 1000 - 1004, The opening sentence ("No additional infrastructure is required...") is contradictory with the following bullet points; update the README section so the first sentence clearly states that cluster and storage prerequisites are required rather than “no additional infrastructure.” Specifically, replace that sentence with a brief statement that KubeVirt snapshot support (KubeVirt snapshot feature gate enabled and a CSI driver supporting VolumeSnapshots) and a storage backend with CSI VolumeSnapshot support are required, and keep the existing bullets ("KubeVirt snapshot support" and "Storage backend with CSI VolumeSnapshot support") as the explicit prerequisites to avoid operational misreads.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@enhancements/vm-snapshots/README.md`:
- Around line 231-236: The restore flow for the CLI flag --restore-from-snapshot
(used by the osac update computeinstance command / computeinstance restore
handler) currently returns a distinct "does not belong to compute instance"
error that leaks the snapshot's existence; change the error handling so any case
where the caller cannot access the snapshot (not found, wrong scope, or
ownership mismatch) returns the same non-enumerable "not found" style response
used elsewhere (e.g., "compute instance snapshot \"<id>\" not found") rather
than revealing ownership, updating the error text/error code in the
computeinstance restore logic that produces the current message.
- Around line 1000-1004: The opening sentence ("No additional infrastructure is
required...") is contradictory with the following bullet points; update the
README section so the first sentence clearly states that cluster and storage
prerequisites are required rather than “no additional infrastructure.”
Specifically, replace that sentence with a brief statement that KubeVirt
snapshot support (KubeVirt snapshot feature gate enabled and a CSI driver
supporting VolumeSnapshots) and a storage backend with CSI VolumeSnapshot
support are required, and keep the existing bullets ("KubeVirt snapshot support"
and "Storage backend with CSI VolumeSnapshot support") as the explicit
prerequisites to avoid operational misreads.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: osac-project/coderabbit/.coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 8b233a0e-7d78-490e-b1ab-ca37b6229183
📒 Files selected for processing (1)
enhancements/vm-snapshots/README.md
Introduce a Snapshot API for OSAC ComputeInstance resources, enabling tenants to create, list, get, delete, and restore from point-in-time snapshots of their VMs through the fulfillment-service API. Key design decisions: - New Snapshot resource (proto + CRD + dual controller) - Both online and offline snapshot modes via KubeVirt - In-place restore via declarative signal on ComputeInstance - Cascade deletion when parent ComputeInstance is deleted Assisted-by: Claude Code <noreply@anthropic.com> Signed-off-by: Ygal Blum <ygal.blum@gmail.com>
5881c3f to
a497052
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (1)
enhancements/vm-snapshots/README.md (1)
950-956:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUpgrade/Downgrade strategy incorrectly claims breaking API change.
The proposal introduces a new resource (ComputeInstanceSnapshot) and adds optional fields to an existing resource (ComputeInstance restore fields). Neither of these changes is wire-breaking or backward-incompatible:
- New resources do not affect existing clients that don't use them
- Optional proto fields (restore_from_snapshot, restore_requested_at) can be safely ignored by older clients
- New enum values (RESTORE_IN_PROGRESS, RESTORE_FAILED conditions) are backward-compatible per protobuf evolution rules
Risk severity: Minor (documentation/process issue). Incorrectly marking this as breaking could trigger unnecessary release coordination or client migration work.
Correction: Replace "Not applicable - OSAC is pre-GA. This is a breaking API change." with proper upgrade guidance:
- Upgrade strategy: Deploy osac-operator first (to handle new CRD and restore logic), then fulfillment-service (to expose new APIs). Existing ComputeInstance workloads are unaffected.
- Downgrade strategy: If downgrading, existing ComputeInstanceSnapshot resources will become unmanaged orphans. Recommend deleting all snapshots before downgrade to avoid storage leaks.
- Version skew: fulfillment-service can be up to 1 minor version ahead or behind osac-operator. Snapshot APIs will return 404/NotImplemented if osac-operator does not support snapshots.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@enhancements/vm-snapshots/README.md` around lines 950 - 956, The README currently incorrectly marks the change as a breaking API; update the "Upgrade / Downgrade Strategy" and "Version Skew Strategy" sections to document a proper upgrade path: state that operators should deploy osac-operator first (to add the ComputeInstanceSnapshot CRD and restore logic) then rollout fulfillment-service (which exposes the new snapshot APIs), note that existing ComputeInstance workloads are unaffected, and explain downgrade behavior (existing ComputeInstanceSnapshot resources become unmanaged/orphaned—recommend deleting snapshots before downgrade to avoid storage leaks); also update version skew guidance to say fulfillment-service may be one minor version ahead/behind osac-operator and that snapshot endpoints may return 404/NotImplemented if osac-operator lacks snapshot support; keep mention that new optional fields restore_from_snapshot and restore_requested_at and new enum values RESTORE_IN_PROGRESS/RESTORE_FAILED are protobuf-backward-compatible.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@enhancements/vm-snapshots/README.md`:
- Around line 806-823: Add a new high-severity risk entry under the "Risks and
Mitigations" section describing snapshot data exposure via storage-backend
compromise (KubeVirt VolumeSnapshots / CSI VolumeSnapshot referencing provider
snapshots), and include the recommended mitigations: isolate storage backend
authentication per workload cluster, enable encryption at rest, restrict network
access to storage backends via network policies, and treat storage backend
credentials as critical secrets in operational docs; also mention this
dependency when describing the ComputeInstanceSnapshot controller and
prerequisites so operators are aware to enforce these controls.
- Around line 806-823: The restore flow has a TOCTOU: after initial validation
the controller immediately creates a VirtualMachineRestore and can fail if the
referenced ComputeInstanceSnapshot was deleted or changed; update the
ComputeInstance controller's restore logic (the code path that handles
restore_requested_at and calls to create VirtualMachineRestore) to re-fetch the
referenced ComputeInstanceSnapshot CR and verify snapshot.status.phase ===
"Ready" (and that the CR still exists) immediately before creating the
VirtualMachineRestore, and if the check fails abort the restore and update the
ComputeInstance status/condition to RestoreFailed with a clear message like
"Snapshot was deleted or is no longer ready"; do not attempt to create the
VirtualMachineRestore when this re-validation fails.
- Around line 758-765: Ensure the ComputeInstance controller's cascade deletion
handler explicitly verifies tenant ownership when deleting snapshots: when
matching child ComputeInstanceSnapshot CRs for deletion (the cascade deletion
logic referenced in the ComputeInstance controller), require both the
owner/reference to the parent ComputeInstance and that the
osac.openshift.io/tenant annotation equals the parent ComputeInstance's tenant
annotation before performing deletion; update the cascade deletion description
to state this dual-filter requirement so snapshots are only deletable by the
same tenant.
---
Duplicate comments:
In `@enhancements/vm-snapshots/README.md`:
- Around line 950-956: The README currently incorrectly marks the change as a
breaking API; update the "Upgrade / Downgrade Strategy" and "Version Skew
Strategy" sections to document a proper upgrade path: state that operators
should deploy osac-operator first (to add the ComputeInstanceSnapshot CRD and
restore logic) then rollout fulfillment-service (which exposes the new snapshot
APIs), note that existing ComputeInstance workloads are unaffected, and explain
downgrade behavior (existing ComputeInstanceSnapshot resources become
unmanaged/orphaned—recommend deleting snapshots before downgrade to avoid
storage leaks); also update version skew guidance to say fulfillment-service may
be one minor version ahead/behind osac-operator and that snapshot endpoints may
return 404/NotImplemented if osac-operator lacks snapshot support; keep mention
that new optional fields restore_from_snapshot and restore_requested_at and new
enum values RESTORE_IN_PROGRESS/RESTORE_FAILED are protobuf-backward-compatible.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: osac-project/coderabbit/.coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 3efebbac-0087-4c1f-b6a7-18f100333e22
📒 Files selected for processing (1)
enhancements/vm-snapshots/README.md
Signed-off-by: Ygal Blum <ygal.blum@gmail.com>
Signed-off-by: Ygal Blum <ygal.blum@gmail.com>
Introduce a Snapshot API for OSAC ComputeInstance resources, enabling tenants to create, list, get, delete, and restore from point-in-time snapshots of their VMs through the fulfillment-service API.
Key design decisions:
Assisted-by: Claude Code noreply@anthropic.com
Summary by CodeRabbit