Advanced VM Lifecycle Management
This document describes the advanced VM lifecycle features in VirtRigaud, including reconfiguration, snapshots, cloning, multi-VM sets, and placement policies.
Overview
VirtRigaud Stage E introduces comprehensive VM lifecycle management capabilities that go beyond basic create/delete operations:
- VM Reconfiguration: Modify CPU, memory, and disk resources of running VMs
- Snapshot Management: Create, delete, and revert VM snapshots
- VM Cloning: Create new VMs from existing ones with linked clone support
- Multi-VM Sets: Manage groups of VMs with rolling updates
- Placement Policies: Advanced placement rules and anti-affinity constraints
- Image Preparation: Automated image import and preparation workflows
VM Reconfiguration
Online vs Offline Reconfiguration
VirtRigaud supports both online (hot) and offline reconfiguration depending on provider capabilities:
- vSphere: Supports online CPU/memory changes and hot disk expansion
- Libvirt: Typically requires a power cycle for resource changes
Example: CPU/Memory Upgrade
# Original VM with 2 CPU, 4GB RAM
apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: web-server
spec:
  resources:
    cpu: 2
    memoryMiB: 4096
# Patch to upgrade resources
# kubectl patch vm web-server --type merge -p '{"spec":{"resources":{"cpu":4,"memoryMiB":8192}}}'
The controller will:
- Detect resource changes in VM spec
- Attempt online reconfiguration if supported
- If offline reconfiguration is required, orchestrate a graceful power cycle:
  - Set condition ReconfigurePendingPowerCycle=True
  - Power off VM gracefully
  - Apply reconfiguration
  - Power on VM
- Update status.lastReconfigureTime
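The sequence above can be sketched as follows. This is a minimal illustration under stated assumptions, not VirtRigaud's controller code: FakeProvider, set_condition, reconcile_resources, and the appliedResources field are hypothetical names. Only the ReconfigurePendingPowerCycle condition and status.lastReconfigureTime come from this document, and resetting the condition after the power cycle is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class FakeProvider:
    """Hypothetical provider stub that records the calls it receives."""
    supports_online: bool
    calls: list = field(default_factory=list)

    def power_off(self, name):
        self.calls.append("power_off")

    def power_on(self, name):
        self.calls.append("power_on")

    def reconfigure(self, name, resources):
        self.calls.append("reconfigure")


def set_condition(status, cond_type, value):
    """Insert or update one entry in status['conditions']."""
    conditions = status.setdefault("conditions", [])
    for cond in conditions:
        if cond["type"] == cond_type:
            cond["status"] = value
            return
    conditions.append({"type": cond_type, "status": value})


def reconcile_resources(vm, provider):
    """Apply spec.resources; return True if a reconfiguration was performed."""
    desired = vm["spec"]["resources"]
    status = vm.setdefault("status", {})
    if status.get("appliedResources") == desired:
        return False  # spec and applied state already match

    name = vm["metadata"]["name"]
    if provider.supports_online:
        provider.reconfigure(name, desired)
    else:
        set_condition(status, "ReconfigurePendingPowerCycle", "True")
        provider.power_off(name)
        provider.reconfigure(name, desired)
        provider.power_on(name)
        # Assumption: the condition is reset once the power cycle completes.
        set_condition(status, "ReconfigurePendingPowerCycle", "False")

    status["appliedResources"] = dict(desired)  # hypothetical bookkeeping field
    status["lastReconfigureTime"] = datetime.now(timezone.utc).isoformat()
    return True
```

Calling reconcile_resources a second time with an unchanged spec returns False, which mirrors the "detect resource changes" step: no provider call is made unless the spec differs from what was last applied.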
Disk Expansion
spec:
  disks:
    - name: data
      sizeGiB: 100  # Expanded from 50 GiB
      expandPolicy: "Online"  # Try online first
Snapshot Management
Creating Snapshots
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMSnapshot
metadata:
  name: pre-maintenance-backup
spec:
  vmRef:
    name: web-server
  nameHint: "maintenance-backup"
  memory: true  # Include memory state
  description: "Backup before maintenance"
  retentionPolicy:
    maxAge: "7d"
    deleteOnVMDelete: true
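To make the retention policy concrete, here is a hedged sketch of how a maxAge value such as "7d" could be evaluated. The accepted duration grammar is not specified in this document, so the single number-plus-unit format (s, m, h, d) and the helper names parse_max_age and is_expired are assumptions for illustration only.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed unit suffixes; the real grammar may differ.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_max_age(value):
    """Parse strings such as "7d" or "12h" into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", value)
    if match is None:
        raise ValueError(f"unsupported maxAge value: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def is_expired(created_at, max_age, now=None):
    """Return True once the snapshot is older than max_age."""
    now = now or datetime.now(timezone.utc)
    return now - created_at > parse_max_age(max_age)
```

A cleanup loop could call is_expired with each snapshot's creation timestamp and delete those that return True.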
Snapshot Lifecycle
- Creating: Snapshot creation in progress
- Ready: Snapshot available for use
- Deleting: Snapshot being removed
- Failed: Snapshot operation failed
Reverting to Snapshots
# Patch VM to revert to snapshot
spec:
  snapshot:
    revertToRef:
      name: pre-maintenance-backup
The controller will:
- Power off VM if running
- Call provider’s SnapshotRevert RPC
- Power on VM
- Clear revertToRef when complete
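The revert steps can be expressed as a small planning function. This is a sketch only: plan_revert and the status.powerState field are hypothetical, and the step labels are illustrative strings rather than real RPC names.

```python
def plan_revert(vm):
    """Return the ordered steps for reverting a VM to spec.snapshot.revertToRef."""
    ref = vm.get("spec", {}).get("snapshot", {}).get("revertToRef")
    if not ref:
        return []  # no revert requested

    steps = []
    # Assumption: the current power state is exposed as status.powerState.
    if vm.get("status", {}).get("powerState") == "On":
        steps.append("power_off")
    steps.append(f"snapshot_revert:{ref['name']}")
    steps.append("power_on")
    steps.append("clear_revertToRef")
    return steps
```

Clearing revertToRef as the final step keeps the operation from repeating on the next reconcile, since an empty reference yields an empty plan.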
VM Cloning
Basic Cloning
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClone
metadata:
  name: web-server-clone
spec:
  sourceRef:
    name: web-server
  target:
    name: web-server-test
    classRef:
      name: test-class
  linked: true  # Faster, space-efficient
  powerOn: true
Clone Customization
spec:
  customization:
    hostname: web-server-test
    networks:
      - name: primary
        ipAddress: "192.168.1.100"
        gateway: "192.168.1.1"
        dns: ["8.8.8.8"]
    userData:
      cloudInit:
        inline: |
          #cloud-config
          runcmd:
            - echo "Test environment" > /etc/motd
Multi-VM Sets (VMSet)
VMSets provide declarative management of multiple VMs with rolling updates.
Basic VMSet
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMSet
metadata:
  name: web-tier
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-server
  template:
    metadata:
      labels:
        app: web-server
    spec:
      providerRef:
        name: vsphere-prod
      classRef:
        name: web-class
      imageRef:
        name: nginx-image
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
Rolling Updates
When you update the template spec, VMSet will:
- Create new VMs with updated configuration
- Wait for new VMs to be ready
- Delete old VMs respecting maxUnavailable
- Continue until all replicas are updated
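The interaction between maxSurge and maxUnavailable is easier to see in a small simulation. The sketch below assumes semantics similar to a Kubernetes Deployment (total VMs never exceed replicas + maxSurge, and at least replicas - maxUnavailable VMs stay in place); this document does not confirm that VMSet uses exactly these rules. The function name rolling_update_steps is hypothetical, and new VMs are treated as Ready immediately.

```python
def rolling_update_steps(replicas, max_surge=1, max_unavailable=1):
    """Return the (action, count) batches needed to replace every old VM."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("max_surge and max_unavailable cannot both be 0")

    old, new = replicas, 0
    steps = []
    while old > 0 or new < replicas:
        # Create new VMs without exceeding replicas + max_surge in total.
        create = min(replicas - new, replicas + max_surge - (old + new))
        if create > 0:
            new += create
            steps.append(("create", create))
        # New VMs are assumed Ready here. Delete old VMs while keeping
        # at least replicas - max_unavailable VMs in place.
        delete = min(old, old + new - (replicas - max_unavailable))
        if delete > 0:
            old -= delete
            steps.append(("delete", delete))
    return steps
```

For the web-tier example (3 replicas, maxSurge 1, maxUnavailable 1), this model creates one VM, deletes two old ones, creates two more, and deletes the last old VM. Setting maxSurge to 0 forces a delete-before-create pattern, which saves capacity at the cost of temporarily running fewer VMs.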
Placement Policies
Advanced Placement Rules
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMPlacementPolicy
metadata:
  name: production-policy
spec:
  hard:
    clusters: ["prod-cluster-1", "prod-cluster-2"]
    datastores: ["ssd-datastore-1", "ssd-datastore-2"]
    hosts: ["esxi-01", "esxi-02", "esxi-03"]
  soft:
    folders: ["/Production/WebServers"]
    zones: ["zone-a", "zone-b"]
  antiAffinity:
    hostAntiAffinity: true  # Spread across hosts
    clusterAntiAffinity: false
    datastoreAntiAffinity: true  # Spread across datastores
Using Placement Policies
spec:
  placementRef:
    name: production-policy
The provider will attempt to satisfy:
- Hard constraints: Must be satisfied
- Soft constraints: Best effort
- Anti-affinity rules: Avoid co-location
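One way to read these three rule types is "filter, then rank". The sketch below is an illustration under assumptions, not the provider's actual algorithm: choose_host, the candidate dictionary shape, and the scoring weights are hypothetical, and anti-affinity is modelled here as a preference rather than a strict requirement.

```python
def choose_host(candidates, policy, occupied_hosts):
    """Pick a placement: hard constraints filter, soft rules and anti-affinity rank."""
    hard = policy.get("hard", {})
    soft = policy.get("soft", {})
    anti = policy.get("antiAffinity", {})

    def allowed(candidate):
        # Every hard list that is present must contain the candidate's value.
        pairs = (("cluster", "clusters"), ("datastore", "datastores"), ("host", "hosts"))
        return all(candidate[key] in hard[plural] for key, plural in pairs if hard.get(plural))

    def score(candidate):
        points = 0
        if candidate.get("zone") in soft.get("zones", []):
            points += 1  # soft constraint: best effort
        if anti.get("hostAntiAffinity") and candidate["host"] not in occupied_hosts:
            points += 2  # prefer hosts that do not already run a sibling VM
        return points

    eligible = [c for c in candidates if allowed(c)]
    if not eligible:
        return None  # hard constraints cannot be satisfied
    return max(eligible, key=score)
```

Returning None when no candidate passes the hard constraints corresponds to a placement policy violation, which is the case to look for first when a VM stays unscheduled.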
Image Preparation
Automated Image Import
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMImage
metadata:
  name: ubuntu-22-04
spec:
  vsphere:
    ovaURL: "https://releases.ubuntu.com/22.04/ubuntu-22.04-server.ova"
    checksum: "sha256:abcd1234..."
  libvirt:
    url: "https://cloud-images.ubuntu.com/22.04/ubuntu-22.04-server.img"
    format: "qcow2"
  prepare:
    onMissing: "Import"  # Auto-import if missing
    validateChecksum: true
    timeout: "30m"
    retries: 3
    storage:
      vsphere:
        datastore: "images-datastore"
        folder: "/Templates"
        thinProvisioned: true
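The checksum and validateChecksum fields imply a verification step after download. A minimal sketch of such a check, assuming the checksum string always has the form "algorithm:hexdigest" as in the example above, might look like this. The function name verify_checksum is hypothetical and is not part of VirtRigaud.

```python
import hashlib
import io


def verify_checksum(stream, expected, chunk_size=1024 * 1024):
    """Return True if the stream's digest matches an "algorithm:hexdigest" string."""
    algorithm, sep, digest = expected.partition(":")
    if not sep or not digest:
        raise ValueError(f"expected 'algorithm:hexdigest', got {expected!r}")
    hasher = hashlib.new(algorithm)  # raises ValueError for unknown algorithms
    # Read in chunks so large images are never loaded into memory at once.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest() == digest.lower()


if __name__ == "__main__":
    payload = b"example image bytes"
    good = "sha256:" + hashlib.sha256(payload).hexdigest()
    print(verify_checksum(io.BytesIO(payload), good))
```

In practice the stream would be the downloaded image file opened in binary mode; a failed check is a natural point to count against the retries setting before the image moves to Failed.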
Image Preparation Phases
- Pending: Waiting to start preparation
- Importing: Downloading/importing image
- Preparing: Processing image (conversion, etc.)
- Ready: Image ready for use
- Failed: Preparation failed
Provider Capabilities
Different providers support different features. Query capabilities:
# Example capabilities response
apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
status:
  capabilities:
    supportsReconfigureOnline: true  # vSphere: true, Libvirt: false
    supportsDiskExpansionOnline: true  # vSphere: true, Libvirt: false
    supportsSnapshots: true  # Both: true
    supportsMemorySnapshots: true  # vSphere: true, Libvirt: varies
    supportsLinkedClones: true  # Both: true
    supportsImageImport: true  # Both: true
    supportedDiskTypes: ["thin", "thick"]
    supportedNetworkTypes: ["VMXNET3", "E1000"]
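Client tooling can use these flags to reject an operation before submitting it. The sketch below assumes the capabilities have already been read into a dictionary (for example from the Provider's status). The feature names on the left of REQUIRED_CAPABILITY and the function unsupported_features are hypothetical; only the supports* keys come from the example above.

```python
# Hypothetical feature names mapped to the capability flags shown above.
REQUIRED_CAPABILITY = {
    "online_reconfigure": "supportsReconfigureOnline",
    "online_disk_expansion": "supportsDiskExpansionOnline",
    "snapshot": "supportsSnapshots",
    "memory_snapshot": "supportsMemorySnapshots",
    "linked_clone": "supportsLinkedClones",
    "image_import": "supportsImageImport",
}


def unsupported_features(requested, capabilities):
    """Return the requested features the provider does not advertise."""
    return [
        feature
        for feature in requested
        if not capabilities.get(REQUIRED_CAPABILITY[feature], False)
    ]
```

A missing flag is treated as unsupported, which is the conservative choice when a provider reports only part of the capability set.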
Observability
Metrics
New metrics for advanced lifecycle operations:
virtrigaud_vm_reconfigure_total{provider_type,outcome}
virtrigaud_vm_snapshot_total{action,provider_type,outcome}
virtrigaud_vm_clone_total{linked,provider_type,outcome}
virtrigaud_vm_image_prepare_total{provider_type,outcome}
Events
Detailed events for lifecycle operations:
Normal SnapshotCreating Started snapshot creation
Normal SnapshotReady Snapshot created successfully
Normal ReconfigureStarted Started VM reconfiguration
Warning ReconfigurePowerCycle Reconfiguration requires power cycle
Normal CloneCompleted VM clone created successfully
Conditions
Comprehensive condition reporting:
VM Conditions:
- Ready: VM is ready for use
- Provisioning: VM is being created
- Reconfiguring: VM is being reconfigured
- ReconfigurePendingPowerCycle: Needs power cycle for changes
Snapshot Conditions:
- Ready: Snapshot is ready
- Creating: Snapshot being created
- Deleting: Snapshot being deleted
Clone Conditions:
- Ready: Clone completed successfully
- Cloning: Clone operation in progress
- Customizing: Applying customizations
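Scripts that wait on these conditions usually need a small lookup helper. The sketch below assumes conditions are stored in the usual Kubernetes shape, a list under status.conditions whose entries carry type and status fields. That layout is an assumption here, and condition_is_true is a hypothetical helper name.

```python
def condition_is_true(resource, condition_type):
    """Return True if the named condition exists and its status is "True"."""
    for cond in resource.get("status", {}).get("conditions", []):
        if cond.get("type") == condition_type:
            return cond.get("status") == "True"
    return False  # a missing condition is treated as not true
```

For example, a VM object parsed from kubectl get vm web-server -o json could be passed in with "ReconfigurePendingPowerCycle" to decide whether a maintenance window is still needed.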
Best Practices
Snapshot Management
- Retention Policies: Always set appropriate retention policies
- Memory Snapshots: Use sparingly due to storage overhead
- Cleanup: Implement automated cleanup for old snapshots
- Testing: Test snapshot revert procedures regularly
VM Reconfiguration
- Gradual Changes: Make incremental resource changes
- Monitoring: Monitor VM performance after changes
- Rollback Plan: Have snapshots before major changes
- Capacity Planning: Ensure host resources before scaling up
Placement Policies
- Start Simple: Begin with basic constraints
- Test Anti-Affinity: Verify rules work as expected
- Monitor Placement: Check actual VM placement matches policy
- Balance Performance: Don’t over-constrain placement
Multi-VM Operations
- Rolling Updates: Use appropriate maxUnavailable settings
- Health Checks: Implement proper readiness checks
- Monitoring: Monitor rollout progress
- Rollback Strategy: Plan for rollback scenarios
Troubleshooting
Common Issues
Reconfiguration Fails:
- Check provider capabilities
- Verify resource availability on host
- Check for VM tools/agent issues
Snapshot Operations Fail:
- Verify storage backend supports snapshots
- Check available storage space
- Ensure VM is not in transitional state
Clone Customization Issues:
- Verify network configuration
- Check cloud-init/guest tools
- Validate IP address availability
Placement Policy Violations:
- Check resource availability in target locations
- Verify anti-affinity rules aren’t too restrictive
- Review cluster resource distribution
Debugging
# Check VM reconfiguration status
kubectl describe vm web-server
# Monitor snapshot progress
kubectl get vmsnapshots -w
# Check clone status
kubectl describe vmclone web-server-clone
# Review placement policy usage
kubectl describe vmplacementpolicy production-policy
# Check VMSet rollout
kubectl describe vmset web-tier
Migration from Basic VMs
Existing VMs can be enhanced with advanced features:
- Add Placement Policy: Update VM spec with placementRef
- Enable Reconfiguration: Add resource overrides
- Create Snapshots: Deploy VMSnapshot resources
- Scale with VMSets: Migrate to VMSet for multi-instance workloads
The controller maintains backward compatibility with existing VM definitions.