Advanced VM Lifecycle Management

This document describes the advanced VM lifecycle features in VirtRigaud: reconfiguration, snapshots, cloning, multi-VM sets, placement policies, and image preparation.

Overview

VirtRigaud Stage E introduces comprehensive VM lifecycle management capabilities that go beyond basic create/delete operations:

VM Reconfiguration: Modify CPU, memory, and disk resources of existing VMs, online where the provider supports it
Snapshot Management: Create, delete, and revert VM snapshots
VM Cloning: Create new VMs from existing ones with linked clone support
Multi-VM Sets: Manage groups of VMs with rolling updates
Placement Policies: Advanced placement rules and anti-affinity constraints
Image Preparation: Automated image import and preparation workflows

VM Reconfiguration

Online vs Offline Reconfiguration

VirtRigaud supports both online (hot) and offline reconfiguration depending on provider capabilities:

vSphere: Supports online CPU/memory changes and hot disk expansion
Libvirt: Typically requires a power cycle for resource changes

Example: CPU/Memory Upgrade

# Original VM with 2 CPUs and 4 GiB RAM
apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: web-server
spec:
  resources:
    cpu: 2
    memoryMiB: 4096

# Patch to upgrade resources
# kubectl patch vm web-server --type merge -p '{"spec":{"resources":{"cpu":4,"memoryMiB":8192}}}'

The controller will:

Detect resource changes in VM spec
Attempt online reconfiguration if supported
If offline required, orchestrate graceful power cycle:
- Set condition ReconfigurePendingPowerCycle=True
- Power off VM gracefully
- Apply reconfiguration
- Power on VM
- Update status.lastReconfigureTime
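
The decision flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the controller's actual code: the `Resources` type, the `plan_reconfigure` function, and the step names are hypothetical, and the online path is reduced to a single step because the text does not spell it out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Hypothetical mirror of spec.resources (cpu, memoryMiB)."""
    cpu: int
    memory_mib: int


def plan_reconfigure(current: Resources, desired: Resources,
                     supports_online: bool) -> list:
    """Return the ordered steps a controller might take (illustrative names)."""
    if current == desired:
        return []  # No resource change detected in the spec.
    if supports_online:
        # Assumption: a hot change needs no power cycle.
        return ["apply_reconfiguration_online"]
    # Offline path, following the documented power-cycle sequence.
    return [
        "set_condition:ReconfigurePendingPowerCycle=True",
        "power_off_gracefully",
        "apply_reconfiguration",
        "power_on",
        "update_status:lastReconfigureTime",
    ]
```

For the web-server example, `plan_reconfigure(Resources(2, 4096), Resources(4, 8192), supports_online=False)` yields the five offline steps in order.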

Disk Expansion

spec:
  disks:
    - name: data
      sizeGiB: 100  # Expanded from 50 GiB
      expandPolicy: "Online"  # Try online first

Snapshot Management

Creating Snapshots

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMSnapshot
metadata:
  name: pre-maintenance-backup
spec:
  vmRef:
    name: web-server
  nameHint: "maintenance-backup"
  memory: true  # Include memory state
  description: "Backup before maintenance"
  retentionPolicy:
    maxAge: "7d"
    deleteOnVMDelete: true
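
To make the retention policy concrete, here is a minimal sketch of how a `maxAge` value such as "7d" could be evaluated. It assumes the value is an integer followed by one of the units s, m, h, or d; the accepted grammar is not stated in this document, and `parse_max_age` and `is_expired` are hypothetical helper names.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed unit suffixes; the real maxAge grammar may differ.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_max_age(value: str) -> timedelta:
    """Parse strings like '7d' or '12h' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", value)
    if match is None:
        raise ValueError(f"unsupported maxAge: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def is_expired(created_at: datetime, max_age: str, now: datetime) -> bool:
    """A snapshot is expired once it is strictly older than maxAge."""
    return now - created_at > parse_max_age(max_age)
```

A cleanup loop would call `is_expired` with the snapshot's creation timestamp and the current UTC time, then delete the snapshots for which it returns `True`.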

Snapshot Lifecycle

Creating: Snapshot creation in progress
Ready: Snapshot available for use
Deleting: Snapshot being removed
Failed: Snapshot operation failed

Reverting to Snapshots

# Patch VM to revert to snapshot
spec:
  snapshot:
    revertToRef:
      name: pre-maintenance-backup

The controller will:

Power off VM if running
Call provider’s SnapshotRevert RPC
Power on VM
Clear revertToRef when complete
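
The revert sequence above can be illustrated with a small sketch. `RecordingProvider` is a hypothetical stand-in for a provider client that only records the calls it receives, and `reconcile_revert` is an assumed function name; the real controller and RPC signatures may look different.

```python
class RecordingProvider:
    """Hypothetical provider stub that records calls in order."""

    def __init__(self, running: bool):
        self.running = running
        self.calls = []

    def power_off(self, vm_name: str) -> None:
        self.calls.append("PowerOff")
        self.running = False

    def snapshot_revert(self, vm_name: str, snapshot_name: str) -> None:
        self.calls.append(f"SnapshotRevert:{snapshot_name}")

    def power_on(self, vm_name: str) -> None:
        self.calls.append("PowerOn")
        self.running = True


def reconcile_revert(provider: RecordingProvider, vm_name: str, spec: dict) -> dict:
    """Apply the documented revert steps and return the updated spec."""
    revert_ref = spec.get("snapshot", {}).get("revertToRef")
    if not revert_ref:
        return spec  # Nothing to revert.
    if provider.running:
        provider.power_off(vm_name)
    provider.snapshot_revert(vm_name, revert_ref["name"])
    provider.power_on(vm_name)
    # Clear revertToRef so the revert is not repeated on the next reconcile.
    new_spec = dict(spec)
    new_spec["snapshot"] = {
        key: value for key, value in spec["snapshot"].items()
        if key != "revertToRef"
    }
    return new_spec
```

Clearing `revertToRef` at the end is what makes the operation one-shot: a later reconcile sees no reference and does nothing.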

VM Cloning

Basic Cloning

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClone
metadata:
  name: web-server-clone
spec:
  sourceRef:
    name: web-server
  target:
    name: web-server-test
    classRef:
      name: test-class
  linked: true  # Faster, space-efficient
  powerOn: true

Clone Customization

spec:
  customization:
    hostname: web-server-test
    networks:
      - name: primary
        ipAddress: "192.168.1.100"
        gateway: "192.168.1.1"
        dns: ["8.8.8.8"]
    userData:
      cloudInit:
        inline: |
          #cloud-config
          runcmd:
            - echo "Test environment" > /etc/motd

Multi-VM Sets (VMSet)

VMSets provide declarative management of multiple VMs with rolling updates.

Basic VMSet

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMSet
metadata:
  name: web-tier
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-server
  template:
    metadata:
      labels:
        app: web-server
    spec:
      providerRef:
        name: vsphere-prod
      classRef:
        name: web-class
      imageRef:
        name: nginx-image
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1

Rolling Updates

When you update the template spec, VMSet will:

Create new VMs with updated configuration
Wait for new VMs to be ready
Delete old VMs respecting maxUnavailable
Continue until all replicas are updated
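
The interplay of `maxSurge` and `maxUnavailable` is easier to see in a simulation. The sketch below is a simplified model, not the VMSet controller: it assumes new VMs become ready immediately, that the total VM count never exceeds `replicas + maxSurge`, and that it never drops below `replicas - maxUnavailable`. The function name is hypothetical.

```python
def simulate_rolling_update(replicas: int, max_surge: int,
                            max_unavailable: int) -> list:
    """Return (old_count, new_count) after each round of the rollout."""
    if min(replicas, max_surge, max_unavailable) < 0:
        raise ValueError("arguments must be non-negative")
    if replicas > 0 and max_surge == 0 and max_unavailable == 0:
        raise ValueError("rollout cannot progress with both limits at 0")
    old, new = replicas, 0
    history = []
    while old > 0 or new < replicas:
        # Create new VMs without exceeding replicas + max_surge in total.
        create = min(replicas + max_surge - (old + new), replicas - new)
        new += max(0, create)
        # Delete old VMs while keeping at least replicas - max_unavailable.
        delete = min(old, (old + new) - (replicas - max_unavailable))
        old -= max(0, delete)
        history.append((old, new))
    return history
```

With the values from the example (`replicas: 3`, `maxSurge: 1`, `maxUnavailable: 1`), this model finishes in two rounds; setting `maxUnavailable` to 0 trades speed for capacity and takes three.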

Placement Policies

Advanced Placement Rules

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMPlacementPolicy
metadata:
  name: production-policy
spec:
  hard:
    clusters: ["prod-cluster-1", "prod-cluster-2"]
    datastores: ["ssd-datastore-1", "ssd-datastore-2"]
    hosts: ["esxi-01", "esxi-02", "esxi-03"]
  soft:
    folders: ["/Production/WebServers"]
    zones: ["zone-a", "zone-b"]
  antiAffinity:
    hostAntiAffinity: true      # Spread across hosts
    clusterAntiAffinity: false
    datastoreAntiAffinity: true # Spread across datastores

Using Placement Policies

spec:
  placementRef:
    name: production-policy

The provider will attempt to satisfy:

Hard constraints: Must be satisfied
Soft constraints: Best effort
Anti-affinity rules: Avoid co-location
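
One way to picture how the three kinds of rules combine is the sketch below. It is an assumption-laden illustration rather than the provider's algorithm: hard constraints filter candidates, anti-affinity is modeled as a preference that outranks soft constraints, and the candidate dictionaries with singular keys (`cluster`, `host`, `zone`) are a hypothetical data shape.

```python
def choose_placement(candidates, hard, soft, used_hosts):
    """Pick a placement target, or None if no candidate meets the hard rules.

    candidates: list of dicts describing possible targets (hypothetical shape).
    hard, soft: dicts mapping a candidate key to a set of allowed values.
    used_hosts: hosts already running peer VMs (for host anti-affinity).
    """
    # Hard constraints must all be satisfied.
    feasible = [
        c for c in candidates
        if all(c.get(key) in allowed for key, allowed in hard.items())
    ]
    if not feasible:
        return None

    def score(candidate):
        soft_hits = sum(
            1 for key, preferred in soft.items()
            if candidate.get(key) in preferred
        )
        co_located = 1 if candidate.get("host") in used_hosts else 0
        # Lower is better: avoid co-location first, then maximize soft hits.
        return (co_located, -soft_hits)

    return min(feasible, key=score)
```

Ranking anti-affinity ahead of soft constraints is a design choice made for this sketch; a policy that is over-constrained on the hard side simply returns `None`, which corresponds to a VM that cannot be placed.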

Image Preparation

Automated Image Import

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMImage
metadata:
  name: ubuntu-22-04
spec:
  vsphere:
    ovaURL: "https://releases.ubuntu.com/22.04/ubuntu-22.04-server.ova"
    checksum: "sha256:abcd1234..."
  libvirt:
    url: "https://cloud-images.ubuntu.com/22.04/ubuntu-22.04-server.img"
    format: "qcow2"
  prepare:
    onMissing: "Import"  # Auto-import if missing
    validateChecksum: true
    timeout: "30m"
    retries: 3
    storage:
      vsphere:
        datastore: "images-datastore"
        folder: "/Templates"
        thinProvisioned: true

Image Preparation Phases

Pending: Waiting to start preparation
Importing: Downloading/importing image
Preparing: Processing image (conversion, etc.)
Ready: Image ready for use
Failed: Preparation failed
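
The phases can be read as a small state machine. The transition table below is an assumption inferred from the phase descriptions (the document does not list the allowed transitions), and `advance` is a hypothetical helper.

```python
# Assumed transitions; Ready and Failed are treated as terminal here.
ALLOWED_TRANSITIONS = {
    "Pending": {"Importing"},
    "Importing": {"Preparing", "Failed"},
    "Preparing": {"Ready", "Failed"},
    "Ready": set(),
    "Failed": set(),
}


def advance(current: str, target: str) -> str:
    """Return target if the phase change is allowed, else raise ValueError."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"invalid phase transition: {current} -> {target}")
    return target
```

A successful import walks `Pending`, `Importing`, `Preparing`, `Ready` in order. How the `retries` setting re-enters the flow after a failure is not modeled in this sketch.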

Provider Capabilities

Feature support varies by provider. Each provider reports its capabilities in its status:

# Example capabilities response
apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
status:
  capabilities:
    supportsReconfigureOnline: true      # vSphere: true, Libvirt: false
    supportsDiskExpansionOnline: true    # vSphere: true, Libvirt: false
    supportsSnapshots: true              # Both: true
    supportsMemorySnapshots: true        # vSphere: true, Libvirt: varies
    supportsLinkedClones: true           # Both: true
    supportsImageImport: true            # Both: true
    supportedDiskTypes: ["thin", "thick"]
    supportedNetworkTypes: ["VMXNET3", "E1000"]
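
Before submitting an operation, a client can compare the flags it needs against the reported capabilities. The sketch below assumes the `status.capabilities` map has already been loaded into a Python dictionary; `missing_capabilities` is a hypothetical helper, and treating an absent flag as unsupported is a choice made for this example.

```python
def missing_capabilities(capabilities: dict, required: list) -> list:
    """Return the required boolean flags that are absent or false."""
    return [name for name in required if not capabilities.get(name, False)]


# Example: a snapshot with `memory: true` plausibly needs both flags.
libvirt_like = {
    "supportsSnapshots": True,
    "supportsMemorySnapshots": False,
    "supportsLinkedClones": True,
}
needed = ["supportsSnapshots", "supportsMemorySnapshots"]
unsupported = missing_capabilities(libvirt_like, needed)
```

Here `unsupported` contains `supportsMemorySnapshots`, so the caller could fall back to a snapshot without memory state instead of failing at the provider.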

Observability

Metrics

New metrics for advanced lifecycle operations:

virtrigaud_vm_reconfigure_total{provider_type,outcome}
virtrigaud_vm_snapshot_total{action,provider_type,outcome}
virtrigaud_vm_clone_total{linked,provider_type,outcome}
virtrigaud_vm_image_prepare_total{provider_type,outcome}
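
Because each counter carries an `outcome` label, a failure ratio can be derived per metric. The sketch below works on already-scraped samples represented as `(name, labels, value)` tuples; that tuple shape and the label value "success" are assumptions, since the document does not list the possible `outcome` values.

```python
def failure_ratio(samples, metric: str) -> float:
    """Fraction of counted operations whose outcome is not 'success'."""
    total = 0.0
    failed = 0.0
    for name, labels, value in samples:
        if name != metric:
            continue
        total += value
        # Assumption: any outcome other than "success" counts as a failure.
        if labels.get("outcome") != "success":
            failed += value
    return failed / total if total else 0.0
```

The same aggregation can be grouped by `provider_type` to compare providers, for example to spot a backend where snapshot operations fail more often.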

Events

Detailed events for lifecycle operations:

Normal   SnapshotCreating    Started snapshot creation
Normal   SnapshotReady       Snapshot created successfully
Normal   ReconfigureStarted  Started VM reconfiguration
Warning  ReconfigurePowerCycle  Reconfiguration requires power cycle
Normal   CloneCompleted      VM clone created successfully

Conditions

Comprehensive condition reporting:

VM Conditions:

Ready: VM is ready for use
Provisioning: VM is being created
Reconfiguring: VM is being reconfigured
ReconfigurePendingPowerCycle: Needs power cycle for changes

Snapshot Conditions:

Ready: Snapshot is ready
Creating: Snapshot being created
Deleting: Snapshot being deleted

Clone Conditions:

Ready: Clone completed successfully
Cloning: Clone operation in progress
Customizing: Applying customizations

Best Practices

Snapshot Management

Retention Policies: Always set appropriate retention policies
Memory Snapshots: Use sparingly due to storage overhead
Cleanup: Implement automated cleanup for old snapshots
Testing: Test snapshot revert procedures regularly

VM Reconfiguration

Gradual Changes: Make incremental resource changes
Monitoring: Monitor VM performance after changes
Rollback Plan: Have snapshots before major changes
Capacity Planning: Confirm the host has sufficient free resources before scaling up

Placement Policies

Start Simple: Begin with basic constraints
Test Anti-Affinity: Verify rules work as expected
Monitor Placement: Check actual VM placement matches policy
Balance Performance: Don’t over-constrain placement

Multi-VM Operations

Rolling Updates: Use appropriate maxUnavailable settings
Health Checks: Implement proper readiness checks
Monitoring: Monitor rollout progress
Rollback Strategy: Plan for rollback scenarios

Troubleshooting

Common Issues

Reconfiguration Fails:

Check provider capabilities
Verify resource availability on host
Check for VM tools/agent issues

Snapshot Operations Fail:

Verify storage backend supports snapshots
Check available storage space
Ensure VM is not in transitional state

Clone Customization Issues:

Verify network configuration
Check cloud-init/guest tools
Validate IP address availability

Placement Policy Violations:

Check resource availability in target locations
Verify anti-affinity rules aren’t too restrictive
Review cluster resource distribution

Debugging

# Check VM reconfiguration status
kubectl describe vm web-server

# Monitor snapshot progress
kubectl get vmsnapshots -w

# Check clone status
kubectl describe vmclone web-server-clone

# Review placement policy usage
kubectl describe vmplacementpolicy production-policy

# Check VMSet rollout
kubectl describe vmset web-tier

Migration from Basic VMs

Existing VMs can be enhanced with advanced features:

Add Placement Policy: Update VM spec with placementRef
Enable Reconfiguration: Add resource overrides
Create Snapshots: Deploy VMSnapshot resources
Scale with VMSets: Migrate to VMSet for multi-instance workloads

The controller maintains backward compatibility with existing VM definitions.
