VirtRigaud Documentation

Welcome to the VirtRigaud documentation. VirtRigaud is a Kubernetes operator for managing virtual machines across multiple hypervisors including vSphere, Libvirt/KVM, and Proxmox VE.

Quick Navigation

Getting Started

Core Documentation

Provider-Specific Guides

Advanced Features

Operations & Administration

Security Configuration

API Reference

Development

Examples Directory

Version Information

This documentation covers VirtRigaud v0.2.3.

Recent Changes

  • v0.2.3: Provider feature parity - Reconfigure, Clone, TaskStatus, ConsoleURL
  • v0.2.2: Nested virtualization, TPM support, snapshot management
  • v0.2.1: Critical fixes and documentation updates
  • v0.2.0: Production-ready vSphere and Libvirt providers

See CHANGELOG.md for complete version history.

Provider Status

| Provider | Status | Maturity | Documentation |
|---|---|---|---|
| vSphere | Production Ready | Stable | Guide |
| Libvirt/KVM | Production Ready | Stable | Guide |
| Proxmox VE | Production Ready | Beta | Guide |
| Mock | Complete | Testing | PROVIDERS.md |

Support

15-Minute Quickstart

This guide will get you up and running with VirtRigaud in 15 minutes using either the vSphere or the Libvirt provider.

Prerequisites

  • Kubernetes cluster (1.24+)
  • kubectl configured
  • Helm 3.x
  • Access to a vSphere environment (optional)
  • Access to a Libvirt/KVM host (optional)

API Support

Default API: v1beta1 - The recommended stable API for all new deployments.

Legacy API: v1alpha1 - Served for compatibility but deprecated. See the upgrade guide for migration instructions.

All resources support seamless conversion between API versions via webhooks.

Step 1: Install VirtRigaud

# Add the VirtRigaud Helm repository
helm repo add virtrigaud https://projectbeskar.github.io/virtrigaud
helm repo update

# Install with default settings (CRDs included automatically)
helm install virtrigaud virtrigaud/virtrigaud \
  --namespace virtrigaud-system \
  --create-namespace

# Or install with specific providers enabled
helm install virtrigaud virtrigaud/virtrigaud \
  --namespace virtrigaud-system \
  --create-namespace \
  --set providers.vsphere.enabled=true \
  --set providers.libvirt.enabled=true

# To skip CRDs if already installed separately
helm install virtrigaud virtrigaud/virtrigaud \
  --namespace virtrigaud-system \
  --create-namespace \
  --skip-crds

Using Kustomize

# Clone the repository
git clone https://github.com/projectbeskar/virtrigaud.git
cd virtrigaud

# Apply base installation
kubectl apply -k deploy/kustomize/base

# Or apply with overlays
kubectl apply -k deploy/kustomize/overlays/standard

Step 2: Verify Installation

# Check that the manager is running
kubectl get pods -n virtrigaud-system

# Check CRDs are installed
kubectl get crds | grep virtrigaud

# Verify API conversion is working (v1alpha1 <-> v1beta1)
kubectl get crd virtualmachines.infra.virtrigaud.io -o yaml | yq '.spec.conversion'

# Check manager logs
kubectl logs -n virtrigaud-system deployment/virtrigaud-manager

Step 3: Configure a Provider

Option A: vSphere Provider

Create a secret with vSphere credentials:

kubectl create secret generic vsphere-credentials \
  --namespace default \
  --from-literal=endpoint=https://vcenter.example.com \
  --from-literal=username=administrator@vsphere.local \
  --from-literal=password=your-password \
  --from-literal=insecure=false

Create a vSphere provider:

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: vsphere-prod
  namespace: default
spec:
  type: vsphere
  endpoint: https://vcenter.example.com
  credentialSecretRef:
    name: vsphere-credentials
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-vsphere:v0.2.3"
    service:
      port: 9090
  defaults:
    datastore: "datastore1"
    cluster: "cluster1"
    folder: "virtrigaud-vms"

Option B: Libvirt Provider

Create a secret with Libvirt connection details:

kubectl create secret generic libvirt-credentials \
  --namespace default \
  --from-literal=uri=qemu+ssh://root@libvirt-host.example.com/system \
  --from-literal=username=root \
  --from-literal=privateKey="$(cat ~/.ssh/id_rsa)"

Create a Libvirt provider:

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: libvirt-lab
  namespace: default
spec:
  type: libvirt
  endpoint: qemu+ssh://root@libvirt-host.example.com/system
  credentialSecretRef:
    name: libvirt-credentials
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-libvirt:v0.2.3"
    service:
      port: 9090
  defaults:
    defaultStoragePool: "default"
    defaultNetwork: "default"

Apply the provider configuration:

kubectl apply -f provider.yaml

πŸ’‘ Behind the scenes: VirtRigaud automatically converts your Provider resource into the appropriate command-line arguments, environment variables, and secret mounts for the provider pod. See the configuration flow documentation for complete details.
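As a rough sketch, the Provider above results in a provider pod wired up along these lines (the exact Deployment is generated by the provider controller; names in this fragment are illustrative):

```yaml
# Illustrative fragment of the generated provider pod spec.
containers:
  - name: provider
    image: ghcr.io/projectbeskar/virtrigaud/provider-vsphere:v0.2.3  # spec.runtime.image
    ports:
      - containerPort: 9090              # spec.runtime.service.port
    env:
      - name: PROVIDER_ENDPOINT          # derived from spec.endpoint
        value: https://vcenter.example.com
    volumeMounts:
      - name: credentials                # from spec.credentialSecretRef
        mountPath: /etc/virtrigaud/credentials
        readOnly: true
volumes:
  - name: credentials
    secret:
      secretName: vsphere-credentials
```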

Step 4: Create a VM Class

Define resource templates for your VMs:

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: small
  namespace: default
spec:
  cpu: 2
  memoryMiB: 2048
  disks:
  - name: root
    sizeGiB: 20
    type: thin
  networks:
  - name: default
    type: "VM Network"  # vSphere network name

Apply the VM class:

kubectl apply -f vmclass.yaml

Step 5: Create a VM Image

Define the base image for your VMs:

vSphere Image (OVA)

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMImage
metadata:
  name: ubuntu-20-04
  namespace: default
spec:
  source:
    vsphere:
      ovaURL: "https://cloud-images.ubuntu.com/releases/20.04/ubuntu-20.04-server-cloudimg-amd64.ova"
      checksum: "sha256:abc123..."
      datastore: "datastore1"
      folder: "vm-templates"
  prepare:
    onMissing: Import
    timeout: "30m"

Libvirt Image (qcow2)

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMImage
metadata:
  name: ubuntu-20-04
  namespace: default
spec:
  source:
    libvirt:
      qcow2URL: "https://cloud-images.ubuntu.com/releases/20.04/ubuntu-20.04-server-cloudimg-amd64.img"
      checksum: "sha256:def456..."
      storagePool: "default"
  prepare:
    onMissing: Import
    timeout: "30m"

Apply the image:

kubectl apply -f vmimage.yaml

Step 6: Create Your First VM

apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: my-first-vm
  namespace: default
spec:
  providerRef:
    name: vsphere-prod  # or libvirt-lab
    namespace: default
  classRef:
    name: small
    namespace: default
  imageRef:
    name: ubuntu-20-04
    namespace: default
  powerState: "On"
  userData:
    cloudInit:
      inline: |
        #cloud-config
        users:
          - name: ubuntu
            sudo: ALL=(ALL) NOPASSWD:ALL
            ssh_authorized_keys:
              - ssh-rsa AAAAB3... your-public-key
        packages:
          - curl
          - vim
  networks:
  - name: default
    networkRef:
      name: default-network
      namespace: default

Apply the manifest:

kubectl apply -f vm.yaml

Step 7: Monitor VM Creation

# Watch VM status
kubectl get vm my-first-vm -w

# Check detailed status
kubectl describe vm my-first-vm

# View events
kubectl get events --field-selector involvedObject.name=my-first-vm

# Check provider logs
kubectl logs -n virtrigaud-system deployment/virtrigaud-provider-vsphere

Step 8: Access Your VM

# Get VM IP address
kubectl get vm my-first-vm -o jsonpath='{.status.ips[0]}'

# Get console URL (if supported)
kubectl get vm my-first-vm -o jsonpath='{.status.consoleURL}'

# SSH to the VM (once it has an IP)
ssh ubuntu@<vm-ip>

Step 9: Try Advanced Operations

Create a Snapshot

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMSnapshot
metadata:
  name: my-vm-snapshot
  namespace: default
spec:
  vmRef:
    name: my-first-vm
  nameHint: "pre-update-snapshot"
  memory: true

Clone the VM

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClone
metadata:
  name: my-vm-clone
  namespace: default
spec:
  sourceRef:
    name: my-first-vm
  target:
    name: cloned-vm
    classRef:
      name: small
      namespace: default
  linked: true

Scale with VMSet

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMSet
metadata:
  name: web-servers
  namespace: default
spec:
  replicas: 3
  template:
    spec:
      providerRef:
        name: vsphere-prod
        namespace: default
      classRef:
        name: small
        namespace: default
      imageRef:
        name: ubuntu-20-04
        namespace: default
      powerState: "On"

Step 10: Clean Up

# Delete VM
kubectl delete vm my-first-vm

# Delete snapshots and clones
kubectl delete vmsnapshot my-vm-snapshot
kubectl delete vmclone my-vm-clone
kubectl delete vmset web-servers

# Uninstall VirtRigaud (optional)
helm uninstall virtrigaud -n virtrigaud-system
kubectl delete namespace virtrigaud-system

Next Steps

Troubleshooting

If you encounter issues:

  1. Check the Troubleshooting Guide
  2. Verify your provider credentials and connectivity
  3. Check the manager and provider logs
  4. Ensure your Kubernetes cluster meets the requirements
  5. File an issue on GitHub

Helm-only Installation & Verify Conversion

This guide covers installing virtrigaud using only Helm (without pre-applying CRDs via Kustomize) and verifying that API conversion is working correctly.

Helm-only Install

VirtRigaud can be installed using only Helm, which will automatically install all required CRDs including conversion webhook configuration.

Prerequisites

  • Kubernetes cluster (1.26+)
  • Helm 3.8+
  • kubectl configured to access your cluster

Installation

# Add the virtrigaud Helm repository
helm repo add virtrigaud https://projectbeskar.github.io/virtrigaud
helm repo update

# Or install directly from source
git clone https://github.com/projectbeskar/virtrigaud.git
cd virtrigaud

# Install virtrigaud with CRDs
helm install virtrigaud charts/virtrigaud \
  --namespace virtrigaud \
  --create-namespace \
  --wait \
  --timeout 10m



Skip CRDs (if already installed)

If you need to install the chart without CRDs (e.g., they're managed separately):

helm install virtrigaud charts/virtrigaud \
  --namespace virtrigaud \
  --create-namespace \
  --skip-crds \
  --wait

Verify Conversion

After installation, verify that API conversion is working correctly.

Check CRD Conversion Configuration

# Verify all CRDs have conversion webhook configuration
kubectl get crd virtualmachines.infra.virtrigaud.io -o yaml | yq '.spec.conversion'

Expected output:

strategy: Webhook
webhook:
  clientConfig:
    service:
      name: virtrigaud-webhook
      namespace: virtrigaud
      path: /convert
  conversionReviewVersions:
  - v1

Check API Versions

Verify that both v1alpha1 and v1beta1 versions are available:

# Check available versions for VirtualMachine CRD
kubectl get crd virtualmachines.infra.virtrigaud.io -o jsonpath='{.spec.versions[*].name}' | tr ' ' '\n'

Expected output:

v1alpha1
v1beta1

Verify Storage Version

Confirm that v1beta1 is set as the storage version:

# Check storage version
kubectl get crd virtualmachines.infra.virtrigaud.io -o jsonpath='{.spec.versions[?(@.storage==true)].name}'

Expected output:

v1beta1

Test Conversion

Create resources using different API versions and verify conversion works:

# Create a VM using the legacy v1alpha1 API
cat <<EOF | kubectl apply -f -
apiVersion: infra.virtrigaud.io/v1alpha1
kind: VirtualMachine
metadata:
  name: test-vm-alpha
  namespace: default
spec:
  providerRef:
    name: test-provider
  classRef:
    name: small
  imageRef:
    name: ubuntu-22
  powerState: "On"
EOF

# Read it back as v1beta1
kubectl get vm test-vm-alpha -o yaml | grep "apiVersion:"
# Should show: apiVersion: infra.virtrigaud.io/v1beta1

# Create a VM using v1beta1 API
cat <<EOF | kubectl apply -f -
apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: test-vm-beta
  namespace: default
spec:
  providerRef:
    name: test-provider
  classRef:
    name: small
  imageRef:
    name: ubuntu-22
  powerState: "On"
EOF

# Clean up test resources
kubectl delete vm test-vm-alpha test-vm-beta

Troubleshooting

Conversion Webhook Missing

If the conversion webhook is missing or not configured:

# Check if webhook service exists
kubectl get svc virtrigaud-webhook -n virtrigaud

# Check webhook pod logs
kubectl logs -l app.kubernetes.io/name=virtrigaud -n virtrigaud

# Verify webhook certificate
kubectl get secret virtrigaud-webhook-certs -n virtrigaud

Conversion Webhook Failing

If conversion is failing:

# Check conversion webhook logs
kubectl logs -l app.kubernetes.io/name=virtrigaud -n virtrigaud | grep conversion

# Test webhook connectivity
kubectl get --raw "/api/v1/namespaces/virtrigaud/services/virtrigaud-webhook:webhook/proxy/convert"

# Check webhook certificate validity
kubectl get secret virtrigaud-webhook-certs -n virtrigaud -o yaml

API Version Issues

If certain API versions aren’t working:

# List all available APIs
kubectl api-resources | grep virtrigaud

# Check specific CRD status
kubectl describe crd virtualmachines.infra.virtrigaud.io

# Verify controller is running
kubectl get pods -l app.kubernetes.io/name=virtrigaud -n virtrigaud

Integration with GitOps

ArgoCD

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: virtrigaud
spec:
  source:
    chart: virtrigaud
    repoURL: https://projectbeskar.github.io/virtrigaud
    targetRevision: "0.2.3"
    helm:
      values: |
        manager:
          image:
            repository: ghcr.io/projectbeskar/virtrigaud/manager
            tag: v0.2.3

Flux

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: virtrigaud
spec:
  chart:
    spec:
      chart: virtrigaud
      sourceRef:
        kind: HelmRepository
        name: virtrigaud
      version: "0.2.3"
  values:
    manager:
      image:
        repository: ghcr.io/projectbeskar/virtrigaud/manager
        tag: v0.2.3

Migration from Kustomize to Helm

If you’re currently using Kustomize for CRD management and want to switch to Helm:

  1. Backup existing resources:

    kubectl get vms,providers,vmclasses -A -o yaml > virtrigaud-backup.yaml
    
  2. Uninstall Kustomize-managed CRDs (optional):

    kubectl delete -k deploy/kustomize/base
    
  3. Install via Helm:

    helm install virtrigaud charts/virtrigaud --namespace virtrigaud --create-namespace
    
  4. Restore resources:

    kubectl apply -f virtrigaud-backup.yaml
    

The conversion webhook will handle any necessary API version transformations automatically.

Automatic CRD Upgrades in VirtRigaud Helm Chart

Overview

The VirtRigaud Helm chart supports automatic CRD upgrades during helm upgrade. This eliminates manual CRD management and provides a seamless upgrade experience.

The Problem

By default, Helm has a limitation:

  • CRDs are installed during helm install
  • CRDs are NOT upgraded during helm upgrade

This means users had to manually apply CRD updates before upgrading, which was:

  • Error-prone
  • Easy to forget
  • Breaks GitOps workflows
  • Causes version drift between chart and CRDs

The Solution

VirtRigaud uses Helm Hooks with a Kubernetes Job to automatically apply CRDs during both install and upgrade:

kubectl Image

VirtRigaud builds and publishes its own kubectl image as part of the release process. This image:

  • Based on Alpine Linux for minimal size (~50MB)
  • Includes kubectl 1.32.0 binary from official Kubernetes releases
  • Includes bash and shell for scripting support
  • Runs as non-root user (UID 65532)
  • Verified with SHA256 checksums
  • Signed with Cosign and includes SBOM
  • Security scanned but uses official kubectl binary (vulnerabilities tracked upstream)

The image is automatically built and tagged to match each VirtRigaud release version, ensuring version consistency across all components.

Image Location: ghcr.io/projectbeskar/virtrigaud/kubectl:<version>

How It Works

  1. Pre-Upgrade Hook: Before the main upgrade starts, a Job is created
  2. CRD Application: The Job applies all CRDs using kubectl apply --server-side
  3. Safe Upgrades: Server-side apply handles conflicts gracefully
  4. Automatic Cleanup: Job is deleted after successful completion
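Conceptually, the hook Job the chart creates boils down to a manifest like this (names and field values are illustrative; the real template ships with the chart):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: virtrigaud-crd-upgrade            # illustrative name
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "0"            # runs after the -10/-5 hooks
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      serviceAccountName: virtrigaud-crd-upgrade
      containers:
        - name: kubectl
          image: ghcr.io/projectbeskar/virtrigaud/kubectl:v0.2.3
          command: ["kubectl", "apply", "--server-side", "-f", "/crds/"]
          volumeMounts:
            - name: crds                  # ConfigMap created by the -10 hook
              mountPath: /crds
      volumes:
        - name: crds
          configMap:
            name: virtrigaud-crds         # illustrative name
```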

Architecture

helm upgrade virtrigaud
    ↓
[Pre-Upgrade Hook -10]
    ↓
ConfigMap with CRDs created
    ↓
[Pre-Upgrade Hook -5]
    ↓
ServiceAccount + RBAC created
    ↓
[Pre-Upgrade Hook 0]
    ↓
Job applies CRDs via kubectl
    ↓
[Standard Helm Resources]
    ↓
Manager & Providers deployed
    ↓
[Hook Cleanup]
    ↓
Job & Hook resources deleted

Features

Enabled by Default

No configuration needed - just works:

helm upgrade virtrigaud virtrigaud/virtrigaud -n virtrigaud-system

Server-Side Apply

Uses kubectl apply --server-side for:

  • Safe conflict resolution
  • Field management
  • No ownership conflicts

GitOps Compatible

Works seamlessly with:

  • ArgoCD: Helm hooks execute properly
  • Flux: Compatible with HelmRelease CRD upgrades
  • Terraform: Helm provider handles hooks

Configurable

Customize the upgrade behavior:

crdUpgrade:
  enabled: true  # Enable/disable automatic upgrades
  
  image:
    repository: ghcr.io/projectbeskar/virtrigaud/kubectl  # VirtRigaud kubectl image
    tag: "v0.2.0"  # Auto-updated to match release version
  
  backoffLimit: 3
  ttlSecondsAfterFinished: 300
  waitSeconds: 5
  
  resources:
    limits:
      cpu: 100m
      memory: 128Mi

Usage Examples

Standard Upgrade (Automatic CRDs)

# CRDs are automatically upgraded
helm upgrade virtrigaud virtrigaud/virtrigaud \
  -n virtrigaud-system

Disable Automatic CRD Upgrade

# Disable if you manage CRDs separately
helm upgrade virtrigaud virtrigaud/virtrigaud \
  -n virtrigaud-system \
  --set crdUpgrade.enabled=false

Manual CRD Management

# Apply CRDs manually before upgrade
kubectl apply -f charts/virtrigaud/crds/

# Then upgrade without CRD management
helm upgrade virtrigaud virtrigaud/virtrigaud \
  -n virtrigaud-system \
  --set crdUpgrade.enabled=false

Skip CRDs Entirely

# Skip CRDs during upgrade (for external CRD management)
helm upgrade virtrigaud virtrigaud/virtrigaud \
  -n virtrigaud-system \
  --skip-crds \
  --set crdUpgrade.enabled=false

GitOps Integration

ArgoCD

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: virtrigaud
spec:
  source:
    chart: virtrigaud
    targetRevision: 0.2.2
    helm:
      values: |
        crdUpgrade:
          enabled: true  # Automatic upgrades work!
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Note: ArgoCD executes Helm hooks properly, so CRDs will be upgraded automatically.

Flux

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: virtrigaud
spec:
  chart:
    spec:
      chart: virtrigaud
      version: 0.2.2
  values:
    crdUpgrade:
      enabled: true  # Automatic upgrades work!
  install:
    crds: CreateReplace
  upgrade:
    crds: CreateReplace

Note: Flux’s crds: CreateReplace works alongside our hook-based upgrades for maximum compatibility.

Troubleshooting

Check CRD Upgrade Job

# View job status
kubectl get jobs -n virtrigaud-system -l app.kubernetes.io/component=crd-upgrade

# View job logs
kubectl logs -n virtrigaud-system -l app.kubernetes.io/component=crd-upgrade

# View job details
kubectl describe job -n virtrigaud-system -l app.kubernetes.io/component=crd-upgrade

Common Issues

1. RBAC Permissions

Symptom: Job fails with β€œforbidden” errors

Solution: Ensure the ServiceAccount has CRD permissions:

kubectl get clusterrole -l app.kubernetes.io/component=crd-upgrade
kubectl describe clusterrole <role-name>

2. Image Pull Failures

Symptom: Job fails to start, ImagePullBackOff

Solution: Check image configuration:

crdUpgrade:
  image:
    repository: ghcr.io/projectbeskar/virtrigaud/kubectl
    tag: "v0.2.3"  # Use matching VirtRigaud version
    pullPolicy: IfNotPresent

3. CRD Conflicts

Symptom: Apply errors about field conflicts

Solution: Server-side apply handles this automatically, but you can force:

kubectl apply --server-side=true --force-conflicts -f charts/virtrigaud/crds/

4. Job Not Cleaning Up

Symptom: Old jobs remain after upgrade

Solution: Adjust TTL or manually clean:

kubectl delete jobs -n virtrigaud-system -l app.kubernetes.io/component=crd-upgrade

Debug Mode

Enable verbose logging:

helm upgrade virtrigaud virtrigaud/virtrigaud \
  -n virtrigaud-system \
  --debug

Migration Guide

Migrating from Manual CRD Management

If you were previously managing CRDs manually:

  1. Enable automatic upgrades:

    helm upgrade virtrigaud virtrigaud/virtrigaud \
      -n virtrigaud-system \
      --set crdUpgrade.enabled=true
    
  2. Verify CRDs are upgraded:

    kubectl get crd -l app.kubernetes.io/name=virtrigaud
    
  3. Remove manual steps from your upgrade process

Migrating to External CRD Management

If you want to manage CRDs externally (e.g., separate Helm chart):

  1. Disable automatic upgrades:

    crdUpgrade:
      enabled: false
    
  2. Extract CRDs:

    helm show crds virtrigaud/virtrigaud > my-crds.yaml
    
  3. Manage CRDs separately:

    kubectl apply -f my-crds.yaml
    

Technical Details

Hook Weights

The upgrade process uses weighted hooks for proper ordering:

| Weight | Resource | Purpose |
|---|---|---|
| -10 | ConfigMap | Store CRD content |
| -5 | RBAC | Create permissions |
| 0 | Job | Apply CRDs |

Resource Requirements

The CRD upgrade job is lightweight:

resources:
  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 50m
    memory: 64Mi

Security

  • Runs as non-root user (65532)
  • Read-only root filesystem
  • No privilege escalation
  • Minimal RBAC (only CRD permissions)
  • Automatic cleanup after completion

Compatibility

  • Kubernetes: 1.25+
  • Helm: 3.8+
  • kubectl: 1.24+ (in Job image)

Best Practices

  1. Use Automatic Upgrades: Enable by default for best UX
  2. Monitor Job Logs: Check logs during first upgrade
  3. Test in Dev First: Verify upgrades in non-production
  4. Backup CRDs: Keep backups before major upgrades
  5. Review Changelogs: Check for breaking CRD changes

FAQ

Q: Will this delete my existing resources?

A: No. CRD upgrades are additive and preserve existing Custom Resources.

Q: What happens if the job fails?

A: Helm upgrade will fail, leaving your cluster in the previous state. Fix the issue and retry.

Q: Can I use this with ArgoCD?

A: Yes! ArgoCD properly executes Helm hooks.

Q: Does this work with Flux?

A: Yes! Flux HelmRelease handles hooks correctly.

Q: How do I roll back?

A: Use helm rollback. CRDs are not rolled back (Kubernetes limitation).

Q: Can I customize the kubectl image?

A: Yes, via crdUpgrade.image.repository and crdUpgrade.image.tag. The default is VirtRigaud's own kubectl image (ghcr.io/projectbeskar/virtrigaud/kubectl), tagged to match the release.

References

Custom Resource Definitions (CRDs)

This document describes all the Custom Resource Definitions (CRDs) provided by virtrigaud.

VirtualMachine

The VirtualMachine CRD represents a virtual machine instance.

Spec

| Field | Type | Required | Description |
|---|---|---|---|
| providerRef | ObjectRef | Yes | Reference to the Provider resource |
| classRef | ObjectRef | Yes | Reference to the VMClass resource |
| imageRef | ObjectRef | Yes | Reference to the VMImage resource |
| networks | []VMNetworkRef | No | Network attachments |
| disks | []DiskSpec | No | Additional disks |
| userData | UserData | No | Cloud-init configuration |
| metaData | MetaData | No | Cloud-init metadata configuration |
| placement | Placement | No | Placement hints |
| powerState | string | No | Desired power state (On/Off) |
| tags | []string | No | Tags for organization |

Status

| Field | Type | Description |
|---|---|---|
| id | string | Provider-specific VM identifier |
| powerState | string | Current power state |
| ips | []string | Assigned IP addresses |
| consoleURL | string | Console access URL |
| conditions | []Condition | Status conditions |
| observedGeneration | int64 | Last observed generation |
| lastTaskRef | string | Reference to last async task |
| provider | map[string]string | Provider-specific details |

Example

apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: demo-web-01
spec:
  providerRef:
    name: vsphere-prod
  classRef:
    name: small
  imageRef:
    name: ubuntu-22-template
  networks:
    - name: app-net
      ipPolicy: dhcp
  powerState: "On"

VMClass

The VMClass CRD defines resource allocation for virtual machines.

Spec

| Field | Type | Required | Description |
|---|---|---|---|
| cpu | int32 | Yes | Number of virtual CPUs |
| memoryMiB | int32 | Yes | Memory in MiB |
| firmware | string | No | Firmware type (BIOS/UEFI) |
| diskDefaults | DiskDefaults | No | Default disk settings |
| guestToolsPolicy | string | No | Guest tools policy |
| extraConfig | map[string]string | No | Provider-specific configuration |

Example

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: small
spec:
  cpu: 2
  memoryMiB: 4096
  firmware: UEFI
  diskDefaults:
    type: thin
    sizeGiB: 40

VMImage

The VMImage CRD defines base templates/images for virtual machines.

Spec

| Field | Type | Required | Description |
|---|---|---|---|
| vsphere | VSphereImageSpec | No | vSphere-specific configuration |
| libvirt | LibvirtImageSpec | No | Libvirt-specific configuration |
| prepare | ImagePrepare | No | Image preparation options |

Example

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMImage
metadata:
  name: ubuntu-22-template
spec:
  vsphere:
    templateName: "tmpl-ubuntu-22.04-cloudimg"
  libvirt:
    url: "https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img"
    format: qcow2

VMNetworkAttachment

The VMNetworkAttachment CRD defines network configurations.

Spec

| Field | Type | Required | Description |
|---|---|---|---|
| vsphere | VSphereNetworkSpec | No | vSphere-specific network config |
| libvirt | LibvirtNetworkSpec | No | Libvirt-specific network config |
| ipPolicy | string | No | IP assignment policy |
| macAddress | string | No | Static MAC address |

Example

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMNetworkAttachment
metadata:
  name: app-net
spec:
  vsphere:
    portgroup: "PG-App"
  ipPolicy: dhcp

Provider

The Provider CRD configures hypervisor connection details.

Spec

| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Provider type (vsphere/libvirt/etc) |
| endpoint | string | Yes | Provider endpoint URI |
| credentialSecretRef | ObjectRef | Yes | Secret containing credentials |
| insecureSkipVerify | bool | No | Skip TLS verification |
| defaults | ProviderDefaults | No | Default placement settings |
| rateLimit | RateLimit | No | API rate limiting |

Example

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: vsphere-prod
spec:
  type: vsphere
  endpoint: https://vcenter.example.com
  credentialSecretRef:
    name: vsphere-creds
  defaults:
    datastore: datastore1
    cluster: compute-cluster-a

Common Types

ObjectRef

| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Object name |
| namespace | string | No | Object namespace |

DiskSpec

| Field | Type | Required | Description |
|---|---|---|---|
| sizeGiB | int32 | Yes | Disk size in GiB |
| type | string | No | Disk type |
| name | string | No | Disk name |

UserData

| Field | Type | Required | Description |
|---|---|---|---|
| cloudInit | CloudInitConfig | No | Cloud-init configuration |

MetaData

| Field | Type | Required | Description |
|---|---|---|---|
| inline | string | No | Inline cloud-init metadata in YAML format |
| secretRef | ObjectRef | No | Secret containing cloud-init metadata |

CloudInitConfig

| Field | Type | Required | Description |
|---|---|---|---|
| secretRef | ObjectRef | No | Secret containing cloud-init data |
| inline | string | No | Inline cloud-init configuration |

Examples

This document provides practical examples for using VirtRigaud with the Remote provider architecture.

Quick Start Examples

All VirtRigaud providers now run as Remote providers. Here are the essential examples to get started:

Basic Provider Setup

Complete Working Examples

Individual Resource Examples

Advanced Examples

Example Directory Structure

docs/examples/
β”œβ”€β”€ provider-*.yaml          # Provider configurations
β”œβ”€β”€ complete-example.yaml    # Full working setup
β”œβ”€β”€ *-advanced-example.yaml  # Production configurations
β”œβ”€β”€ vm*.yaml                 # Individual resource definitions
β”œβ”€β”€ advanced/                # Advanced operations
β”œβ”€β”€ security/                # Security configurations
└── secrets/                 # Credential examples

Key Changes from Previous Versions

Remote-Only Architecture

All providers now run as separate pods with the Remote runtime:

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: my-provider
spec:
  type: vsphere  # or libvirt, proxmox
  endpoint: https://vcenter.example.com
  credentialSecretRef:
    name: provider-creds
  runtime:
    mode: Remote              # Required - only mode supported
    image: "ghcr.io/projectbeskar/virtrigaud/provider-vsphere:v0.2.3"
    service:
      port: 9090

Current API Schema (v0.2.3)

  • VMClass: Standard Kubernetes resource quantities (cpus: 4, memory: "4Gi")
  • VMImage: Provider-specific source configurations
  • VMNetworkAttachment: Network provider abstractions
  • VirtualMachine: Declarative power state management

Configuration Management

Providers receive configuration through:

  • Endpoint: Environment variable PROVIDER_ENDPOINT
  • Credentials: Mounted secret files in /etc/virtrigaud/credentials/
  • Runtime: Managed automatically by the provider controller

Getting Started

  1. Choose your provider from the basic examples above
  2. Create credentials secret (see examples/secrets/)
  3. Apply provider configuration with required runtime section
  4. Define VM resources (VMClass, VMImage, VMNetworkAttachment)
  5. Create VirtualMachine referencing your resources

For detailed setup instructions, see:

Need Help?

Provider Development Guide

This document explains how to implement a new provider for VirtRigaud.

Overview

Providers are responsible for implementing VM lifecycle operations on specific hypervisor platforms. VirtRigaud uses a Remote Provider architecture where each provider runs as an independent gRPC service, communicating with the manager controller.

Provider Interface

All providers must implement the contracts.Provider interface:

type Provider interface {
    // Validate ensures the provider session/credentials are healthy
    Validate(ctx context.Context) error

    // Create creates a new VM if it doesn't exist (idempotent)
    Create(ctx context.Context, req CreateRequest) (CreateResponse, error)

    // Delete removes a VM (idempotent)
    Delete(ctx context.Context, id string) (taskRef string, err error)

    // Power performs a power operation on the VM
    Power(ctx context.Context, id string, op PowerOp) (taskRef string, err error)

    // Reconfigure modifies VM resources
    Reconfigure(ctx context.Context, id string, desired CreateRequest) (taskRef string, err error)

    // Describe returns the current state of the VM
    Describe(ctx context.Context, id string) (DescribeResponse, error)

    // IsTaskComplete checks if an async task is complete
    IsTaskComplete(ctx context.Context, taskRef string) (done bool, err error)
}

Implementation Steps

1. Create Provider Package

Create a new package under internal/providers/ for your provider:

internal/providers/yourprovider/
β”œβ”€β”€ provider.go      # Main provider implementation
β”œβ”€β”€ session.go       # Connection/session management
β”œβ”€β”€ tasks.go         # Async task handling
β”œβ”€β”€ converter.go     # Type conversions
β”œβ”€β”€ network.go       # Network operations
└── storage.go       # Storage operations

2. Implement the Provider

package yourprovider

import (
    "context"
    "github.com/projectbeskar/virtrigaud/api/v1beta1"
    "github.com/projectbeskar/virtrigaud/internal/providers/contracts"
)

type Provider struct {
    config   *v1beta1.Provider
    client   YourProviderClient
}

func NewProvider(ctx context.Context, provider *v1beta1.Provider) (contracts.Provider, error) {
    // Initialize your provider client, parse credentials from the
    // referenced secret, and establish a connection.
    client, err := newYourProviderClient(ctx, provider)
    if err != nil {
        return nil, err
    }
    return &Provider{
        config: provider,
        client: client,
    }, nil
}

func (p *Provider) Validate(ctx context.Context) error {
    // Check connection health
    // Validate credentials
    return nil
}

// Implement other interface methods...

3. Create Provider gRPC Server

Create a gRPC server for your provider:

// cmd/provider-yourprovider/main.go
package main

import (
    "context"
    "log"
    "net"
    
    "google.golang.org/grpc"
    "github.com/projectbeskar/virtrigaud/pkg/grpc/provider"
    "github.com/projectbeskar/virtrigaud/internal/providers/yourprovider"
)

func main() {
    lis, err := net.Listen("tcp", ":9090")
    if err != nil {
        log.Fatal(err)
    }
    
    s := grpc.NewServer()
    provider.RegisterProviderServer(s, &yourprovider.GRPCServer{})
    
    log.Println("Provider server listening on :9090")
    if err := s.Serve(lis); err != nil {
        log.Fatal(err)
    }
}

4. Handle Credentials

Providers should read credentials from Kubernetes secrets. Common credential fields:

  • username / password: Basic authentication
  • token: API token authentication
  • tls.crt / tls.key: TLS client certificates

Example:

func (p *Provider) getCredentials(ctx context.Context) (*Credentials, error) {
    secret := &corev1.Secret{}
    err := p.client.Get(ctx, types.NamespacedName{
        Name:      p.config.Spec.CredentialSecretRef.Name,
        Namespace: p.config.Namespace,
    }, secret)
    if err != nil {
        return nil, err
    }

    return &Credentials{
        Username: string(secret.Data["username"]),
        Password: string(secret.Data["password"]),
    }, nil
}

Error Handling

Use the provided error types for consistent error handling:

import "github.com/projectbeskar/virtrigaud/internal/providers/contracts"

// For not found errors
return contracts.NewNotFoundError("VM not found", err)

// For retryable errors
return contracts.NewRetryableError("Connection timeout", err)

// For validation errors
return contracts.NewInvalidSpecError("Invalid CPU count", nil)

Asynchronous Operations

For long-running operations, return a task reference:

func (p *Provider) Create(ctx context.Context, req CreateRequest) (CreateResponse, error) {
    taskID, err := p.client.CreateVMAsync(...)
    if err != nil {
        return CreateResponse{}, err
    }

    return CreateResponse{
        ID:      vmID,
        TaskRef: taskID,
    }, nil
}

func (p *Provider) IsTaskComplete(ctx context.Context, taskRef string) (bool, error) {
    task, err := p.client.GetTask(taskRef)
    if err != nil {
        return false, err
    }
    return task.IsComplete(), nil
}

Type Conversions

Convert between CRD types and provider-specific types:

func (p *Provider) convertVMClass(class contracts.VMClass) YourProviderVMSpec {
    return YourProviderVMSpec{
        CPUs:   class.CPU,
        Memory: class.MemoryMiB * 1024 * 1024, // Convert to bytes
        // ... other conversions
    }
}

Testing

Create unit tests for your provider:

func TestProvider_Create(t *testing.T) {
    provider := &Provider{
        client: &mockClient{},
    }

    req := contracts.CreateRequest{
        Name: "test-vm",
        // ... populate request
    }

    resp, err := provider.Create(context.Background(), req)
    assert.NoError(t, err)
    assert.NotEmpty(t, resp.ID)
}

Provider-Specific CRD Fields

Update the CRD types to include provider-specific fields:

// In VMImage types
type YourProviderImageSpec struct {
    ImageID   string `json:"imageId,omitempty"`
    Checksum  string `json:"checksum,omitempty"`
}

// In VMNetworkAttachment types
type YourProviderNetworkSpec struct {
    NetworkID string `json:"networkId,omitempty"`
    VLAN      int32  `json:"vlan,omitempty"`
}

Best Practices

  1. Idempotency: All operations should be idempotent
  2. Error Classification: Use appropriate error types
  3. Resource Cleanup: Ensure proper cleanup in Delete operations
  4. Logging: Use structured logging with context
  5. Timeouts: Respect context timeouts
  6. Rate Limiting: Implement client-side rate limiting
  7. Retry Logic: Handle transient failures gracefully

Examples

See the existing providers for reference:

  • internal/providers/vsphere/ - vSphere implementation
  • internal/providers/libvirt/ - Libvirt implementation (production ready)

Provider Configuration

Each provider type should support these configuration options:

  • Connection endpoints
  • Authentication credentials
  • Default placement settings
  • Rate limiting configuration
  • Provider-specific options

Example Provider spec:

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: my-provider
spec:
  type: yourprovider
  endpoint: https://api.yourprovider.com
  credentialSecretRef:
    name: provider-creds
  defaults:
    region: us-west-2
    zone: us-west-2a
  rateLimit:
    qps: 10
    burst: 20

Provider Capabilities Matrix

This document provides a comprehensive overview of VirtRigaud provider capabilities as of v0.2.3.

Overview

VirtRigaud supports multiple hypervisor platforms through a provider architecture. Each provider implements the core VirtRigaud API while supporting platform-specific features and capabilities.

Core Provider Interface

All providers implement these core operations:

  • Validate: Test provider connectivity and credentials
  • Create: Create new virtual machines
  • Delete: Remove virtual machines and cleanup resources
  • Power: Control VM power state (On/Off/Reboot)
  • Describe: Query VM state and properties
  • GetCapabilities: Report provider-specific capabilities

Provider Status

| Provider | Status | Implementation | Maturity |
|---|---|---|---|
| vSphere | ✅ Production Ready | govmomi-based | Stable |
| Libvirt/KVM | ✅ Production Ready | virsh-based | Stable |
| Proxmox VE | ✅ Production Ready | REST API-based | Beta |
| Mock | ✅ Complete | In-memory simulation | Testing |

Comprehensive Capability Matrix

Core Operations

| Capability | vSphere | Libvirt | Proxmox | Mock | Notes |
|---|---|---|---|---|---|
| VM Create | ✅ | ✅ | ✅ | ✅ | All providers support VM creation |
| VM Delete | ✅ | ✅ | ✅ | ✅ | With resource cleanup |
| Power On/Off | ✅ | ✅ | ✅ | ✅ | Basic power management |
| Reboot | ✅ | ✅ | ✅ | ✅ | Graceful and forced restart |
| Suspend | ✅ | ❌ | ✅ | ✅ | Memory state preservation |
| Describe | ✅ | ✅ | ✅ | ✅ | VM state and properties |
| Reconfigure | ✅ | ⚠️ | ✅ | ✅ | CPU/Memory/Disk changes (Libvirt requires restart) |
| TaskStatus | ✅ | N/A | ✅ | ✅ | Async operation tracking |
| ConsoleURL | ✅ | ✅ | ⚠️ | ✅ | Remote console access (Proxmox planned) |

Resource Management

| Capability | vSphere | Libvirt | Proxmox | Mock | Notes |
|---|---|---|---|---|---|
| CPU Configuration | ✅ | ✅ | ✅ | ✅ | Cores, sockets, threading |
| Memory Allocation | ✅ | ✅ | ✅ | ✅ | Static memory sizing |
| Hot CPU Add | ✅ | ❌ | ✅ | ✅ | Online CPU expansion |
| Hot Memory Add | ✅ | ❌ | ✅ | ✅ | Online memory expansion |
| Resource Reservations | ✅ | ❌ | ✅ | ✅ | Guaranteed resources |
| Resource Limits | ✅ | ❌ | ✅ | ✅ | Resource capping |

Storage Operations

| Capability | vSphere | Libvirt | Proxmox | Mock | Notes |
|---|---|---|---|---|---|
| Disk Creation | ✅ | ✅ | ✅ | ✅ | Virtual disk provisioning |
| Disk Expansion | ✅ | ✅ | ✅ | ✅ | Online disk growth |
| Multiple Disks | ✅ | ✅ | ✅ | ✅ | Multi-disk VMs |
| Thin Provisioning | ✅ | ✅ | ✅ | ✅ | Space-efficient disks |
| Thick Provisioning | ✅ | ✅ | ✅ | ✅ | Pre-allocated storage |
| Storage Policies | ✅ | ❌ | ✅ | ✅ | Policy-based placement |
| Storage Pools | ✅ | ✅ | ✅ | ✅ | Organized storage management |

Network Configuration

| Capability | vSphere | Libvirt | Proxmox | Mock | Notes |
|---|---|---|---|---|---|
| Basic Networking | ✅ | ✅ | ✅ | ✅ | Single network interface |
| Multiple NICs | ✅ | ✅ | ✅ | ✅ | Multi-interface VMs |
| VLAN Support | ✅ | ✅ | ✅ | ✅ | Network segmentation |
| Static IP | ✅ | ✅ | ✅ | ✅ | Fixed IP assignment |
| DHCP | ✅ | ✅ | ✅ | ✅ | Dynamic IP assignment |
| Bridge Networks | ❌ | ✅ | ✅ | ✅ | Direct host bridging |
| Distributed Switches | ✅ | ❌ | ❌ | ✅ | Advanced vSphere networking |

VM Lifecycle

| Capability | vSphere | Libvirt | Proxmox | Mock | Notes |
|---|---|---|---|---|---|
| Template Deployment | ✅ | ✅ | ✅ | ✅ | Deploy from templates |
| Clone Operations | ✅ Complete | ✅ | ✅ | ✅ | Full VM duplication with snapshot support |
| Linked Clones | ✅ | ❌ | ✅ | ✅ | COW-based clones with automatic snapshot creation |
| Full Clones | ✅ | ✅ | ✅ | ✅ | Independent copies |
| VM Reconfiguration | ✅ Complete | ⚠️ Restart Required | ✅ | ✅ | Online resource modification |

Snapshot Operations

| Capability | vSphere | Libvirt | Proxmox | Mock | Notes |
|---|---|---|---|---|---|
| Create Snapshots | ✅ | ✅ | ✅ | ✅ | Point-in-time captures |
| Delete Snapshots | ✅ | ✅ | ✅ | ✅ | Snapshot cleanup |
| Revert Snapshots | ✅ | ✅ | ✅ | ✅ | Restore VM state |
| Memory Snapshots | ✅ | ❌ | ✅ | ✅ | Include RAM state |
| Quiesced Snapshots | ✅ | ❌ | ✅ | ✅ | Consistent filesystem |
| Snapshot Trees | ✅ | ✅ | ✅ | ✅ | Hierarchical snapshots |

Image Management

| Capability | vSphere | Libvirt | Proxmox | Mock | Notes |
|---|---|---|---|---|---|
| OVA/OVF Import | ✅ | ❌ | ✅ | ✅ | Standard VM formats |
| Cloud Image Download | ❌ | ✅ | ✅ | ✅ | Remote image fetch |
| Content Libraries | ✅ | ❌ | ❌ | ✅ | Centralized image management |
| Image Conversion | ❌ | ✅ | ✅ | ✅ | Format transformation |
| Image Caching | ✅ | ✅ | ✅ | ✅ | Performance optimization |

Guest Operating System

| Capability | vSphere | Libvirt | Proxmox | Mock | Notes |
|---|---|---|---|---|---|
| Cloud-Init | ✅ | ✅ | ✅ | ✅ | Guest initialization |
| Guest Tools | ✅ | ✅ | ✅ | ✅ | Enhanced guest integration |
| Guest Agent | ✅ | ✅ | ✅ | ✅ | Runtime guest communication |
| Guest Customization | ✅ | ✅ | ✅ | ✅ | OS-specific customization |
| Guest Monitoring | ✅ | ✅ | ✅ | ✅ | Resource usage tracking |

Advanced Features

| Capability | vSphere | Libvirt | Proxmox | Mock | Notes |
|---|---|---|---|---|---|
| High Availability | ✅ | ❌ | ✅ | ✅ | Automatic failover |
| DRS/Load Balancing | ✅ | ❌ | ❌ | ✅ | Resource optimization |
| Fault Tolerance | ✅ | ❌ | ❌ | ✅ | Zero-downtime protection |
| vMotion/Migration | ✅ | ❌ | ✅ | ✅ | Live VM migration |
| Resource Pools | ✅ | ❌ | ✅ | ✅ | Hierarchical resource mgmt |
| Affinity Rules | ✅ | ❌ | ✅ | ✅ | VM placement policies |

Monitoring & Observability

| Capability | vSphere | Libvirt | Proxmox | Mock | Notes |
|---|---|---|---|---|---|
| Performance Metrics | ✅ | ✅ | ✅ | ✅ | CPU, memory, disk, network |
| Event Logging | ✅ | ✅ | ✅ | ✅ | Operation audit trail |
| Health Checks | ✅ | ✅ | ✅ | ✅ | VM and guest health |
| Alerting | ✅ | ❌ | ✅ | ✅ | Threshold-based notifications |
| Historical Data | ✅ | ❌ | ✅ | ✅ | Performance history |
| Console URL Generation | ✅ | ✅ | ⚠️ | ✅ | Web/VNC console access (Proxmox planned) |
| Guest Agent Integration | ✅ | ✅ | ✅ Complete | ✅ | IP detection and guest info |

Provider-Specific Features

vSphere Exclusive

  • vCenter Integration: Full vCenter Server and ESXi support
  • Content Library: Centralized template and ISO management
  • Distributed Resource Scheduler (DRS): Automatic load balancing
  • vMotion: Live migration between hosts
  • High Availability (HA): Automatic VM restart on host failure
  • Fault Tolerance: Zero-downtime VM protection
  • Storage vMotion: Live storage migration
  • vSAN Integration: Hyper-converged storage
  • NSX Integration: Software-defined networking
  • Hot Reconfiguration: Online CPU/memory/disk changes with hot-add support
  • TaskStatus Tracking: Real-time async operation monitoring via govmomi
  • Clone Operations: Full and linked clones with automatic snapshot handling
  • Web Console URLs: Direct vSphere web client console access

Libvirt/KVM Exclusive

  • Virsh Integration: Command-line management
  • QEMU Guest Agent: Advanced guest OS integration
  • KVM Optimization: Native Linux virtualization
  • Bridge Networking: Direct host network bridging
  • Storage Pool Flexibility: Multiple storage backend support
  • Cloud Image Support: Direct cloud image deployment
  • Host Device Passthrough: Hardware device assignment
  • Reconfiguration Support: CPU/memory/disk changes via virsh (restart required)
  • VNC Console Access: Direct VNC console URL generation for remote viewers

Proxmox VE Exclusive

  • Web UI Integration: Built-in management interface
  • Container Support: LXC container management
  • Backup Integration: Built-in backup and restore
  • Cluster Management: Multi-node cluster support
  • ZFS Integration: Advanced filesystem features
  • Ceph Integration: Distributed storage
  • Guest Agent IP Detection: Accurate IP address extraction via QEMU guest agent
  • Hot-plug Reconfiguration: Online CPU/memory/disk modifications
  • Complete CRD Integration: Full Kubernetes custom resource support

Mock Provider Features

  • Testing Scenarios: Configurable failure modes
  • Performance Simulation: Controllable operation delays
  • Sample Data: Pre-populated demonstration VMs
  • Development Support: Full API coverage for testing

Supported Disk Types

| Provider | Disk Formats | Notes |
|---|---|---|
| vSphere | thin, thick, eagerZeroedThick | vSphere native formats |
| Libvirt | qcow2, raw, vmdk | QEMU-supported formats |
| Proxmox | qcow2, raw, vmdk | Proxmox storage formats |
| Mock | thin, thick, raw, qcow2 | Simulated formats |

Supported Network Types

| Provider | Network Types | Notes |
|---|---|---|
| vSphere | distributed, standard, vlan | vSphere networking |
| Libvirt | virtio, e1000, rtl8139 | QEMU network adapters |
| Proxmox | virtio, e1000, rtl8139 | Proxmox network models |
| Mock | bridge, nat, distributed | Simulated network types |

Provider Images

All provider images are available from the GitHub Container Registry:

  • vSphere: ghcr.io/projectbeskar/virtrigaud/provider-vsphere:v0.2.3
  • Libvirt: ghcr.io/projectbeskar/virtrigaud/provider-libvirt:v0.2.3
  • Proxmox: ghcr.io/projectbeskar/virtrigaud/provider-proxmox:v0.2.3
  • Mock: ghcr.io/projectbeskar/virtrigaud/provider-mock:v0.2.3

Choosing a Provider

Use vSphere When:

  • You have existing VMware infrastructure
  • You need enterprise features (HA, DRS, vMotion)
  • You require advanced networking (NSX, distributed switches)
  • You need centralized management (vCenter)

Use Libvirt/KVM When:

  • You want open-source virtualization
  • You're running on Linux hosts
  • You need cost-effective virtualization
  • You want direct host integration

Use Proxmox VE When:

  • You need both VMs and containers
  • You want integrated backup solutions
  • You need cluster management
  • You want web-based management

Use Mock Provider When:

  • You're developing or testing VirtRigaud
  • You need to simulate VM operations
  • You're creating demos or training materials
  • You're testing VirtRigaud without hypervisors

Performance Considerations

vSphere

  • Best for: Large-scale enterprise deployments
  • Scalability: Hundreds to thousands of VMs
  • Overhead: Higher due to feature richness
  • Resource Efficiency: Excellent with DRS

Libvirt/KVM

  • Best for: Linux-based deployments
  • Scalability: Moderate to large deployments
  • Overhead: Low, near-native performance
  • Resource Efficiency: Good with proper tuning

Proxmox VE

  • Best for: SMB and mixed workloads
  • Scalability: Small to medium deployments
  • Overhead: Moderate
  • Resource Efficiency: Good with clustering

Future Roadmap

Planned Enhancements

vSphere

  • vSphere 8.0 support
  • Enhanced NSX integration
  • GPU passthrough support
  • vSAN policy automation

Libvirt

  • Live migration support
  • SR-IOV networking
  • NUMA topology optimization
  • Enhanced performance monitoring

Proxmox

  • HA configuration
  • Storage replication
  • Advanced networking
  • Performance optimizations

Support Matrix

| Feature Category | vSphere | Libvirt | Proxmox | Mock |
|---|---|---|---|---|
| Production Ready | ✅ | ✅ | ✅ Beta | ✅ Testing |
| Documentation | Complete | Complete | Complete | Complete |
| Community Support | Active | Active | Growing | N/A |
| Enterprise Support | Available | Available | Available | N/A |

Version History

  • v0.2.3: Provider feature parity - Reconfigure, Clone, TaskStatus, ConsoleURL
  • v0.2.2: Nested virtualization, TPM support, comprehensive snapshot management
  • v0.2.1: Critical fixes, documentation updates, VMClass disk settings
  • v0.2.0: Production-ready vSphere and Libvirt providers
  • v0.1.0: Initial provider framework and mock implementation

This document reflects VirtRigaud v0.2.3 capabilities. For the latest updates, see the VirtRigaud documentation.

vSphere Provider

The vSphere provider enables VirtRigaud to manage virtual machines on VMware vSphere environments, including vCenter Server and standalone ESXi hosts. This provider is designed for enterprise production environments with comprehensive support for vSphere features.

Overview

This provider implements the VirtRigaud provider interface to manage VM lifecycle operations on VMware vSphere:

  • Create: Create VMs from templates, content libraries, or OVF/OVA files
  • Delete: Remove VMs and associated storage (with configurable retention)
  • Power: Start, stop, restart, and suspend virtual machines
  • Describe: Query VM state, resource usage, guest info, and vSphere properties
  • Reconfigure: Hot-add CPU/memory, resize disks, modify network adapters (v0.2.3+)
  • Clone: Create full or linked clones from existing VMs or templates (v0.2.3+)
  • Snapshot: Create, delete, and revert VM snapshots with memory state
  • TaskStatus: Track asynchronous operations with progress monitoring (v0.2.3+)
  • ConsoleURL: Generate vSphere web client console URLs (v0.2.3+)
  • ImagePrepare: Import OVF/OVA, deploy from content library, or ensure template existence

Prerequisites

⚠️ IMPORTANT: Active vSphere Environment Required

The vSphere provider connects to VMware vSphere infrastructure and requires active vCenter Server or ESXi hosts.

Requirements:

  • vCenter Server 7.0+ or ESXi 7.0+ (running and accessible)
  • User account with appropriate privileges for VM management
  • Network connectivity from VirtRigaud to vCenter/ESXi (HTTPS/443)
  • vSphere infrastructure:
    • Configured datacenters, clusters, and hosts
    • Storage (datastores) for VM files
    • Networks (port groups) for VM connectivity
    • Resource pools for VM placement (optional)

Testing/Development:

For development environments:

  • Use VMware vSphere Hypervisor (ESXi) free version
  • vCenter Server Appliance evaluation license
  • VMware Workstation/Fusion with nested ESXi
  • EVE-NG or GNS3 with vSphere emulation

Authentication

The vSphere provider supports multiple authentication methods:

Username/Password Authentication (Common)

Standard vSphere user authentication:

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: vsphere-prod
  namespace: default
spec:
  type: vsphere
  endpoint: https://vcenter.example.com/sdk
  credentialSecretRef:
    name: vsphere-credentials
  # Optional: Skip TLS verification (development only)
  insecureSkipVerify: false
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-vsphere:v0.2.3"
    service:
      port: 9090

Create credentials secret:

apiVersion: v1
kind: Secret
metadata:
  name: vsphere-credentials
  namespace: default
type: Opaque
stringData:
  username: "virtrigaud@vsphere.local"
  password: "SecurePassword123!"

Session Token Authentication (Advanced)

For environments using external authentication:

apiVersion: v1
kind: Secret
metadata:
  name: vsphere-token
  namespace: default
type: Opaque
stringData:
  token: "vmware-api-session-id:abcd1234..."

Create a dedicated service account with minimal required privileges:

# vSphere privileges for VirtRigaud service account:
# - Datastore: Allocate space, Browse datastore, Low level file operations
# - Network: Assign network  
# - Resource: Assign virtual machine to resource pool
# - Virtual machine: All privileges (or subset based on requirements)
# - Global: Enable methods, Disable methods, Licenses

Configuration

Connection Endpoints

| Endpoint Type | Format | Use Case |
|---|---|---|
| vCenter Server | https://vcenter.example.com/sdk | Multi-host management (recommended) |
| vCenter FQDN | https://vcenter.corp.local/sdk | Internal domain environments |
| vCenter IP | https://192.168.1.10/sdk | Direct IP access |
| ESXi Host | https://esxi-host.example.com | Single host environments |

Deployment Configuration

Using Helm Values

# values.yaml
providers:
  vsphere:
    enabled: true
    endpoint: "https://vcenter.example.com/sdk"
    insecureSkipVerify: false  # Set to true for self-signed certificates
    credentialSecretRef:
      name: vsphere-credentials
      namespace: virtrigaud-system

Production Configuration with TLS

# Create secret with credentials and TLS certificates
apiVersion: v1
kind: Secret
metadata:
  name: vsphere-secure-credentials
  namespace: virtrigaud-system
type: Opaque
stringData:
  username: "svc-virtrigaud@vsphere.local"
  password: "SecurePassword123!"
  # Optional: Custom CA certificate for vCenter
  ca.crt: |
    -----BEGIN CERTIFICATE-----
    # Your vCenter CA certificate here
    -----END CERTIFICATE-----

---
apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: vsphere-production
  namespace: virtrigaud-system
spec:
  type: vsphere
  endpoint: https://vcenter.prod.example.com/sdk
  credentialSecretRef:
    name: vsphere-secure-credentials
  insecureSkipVerify: false

Development Configuration

# For development with self-signed certificates
providers:
  vsphere:
    enabled: true
    endpoint: "https://esxi-dev.local"
    insecureSkipVerify: true  # Only for development!
    credentialSecretRef:
      name: vsphere-dev-credentials

Multi-vCenter Configuration

# Deploy multiple providers for different vCenters
apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: vsphere-datacenter-a
spec:
  type: vsphere
  endpoint: https://vcenter-a.example.com/sdk
  credentialSecretRef:
    name: vsphere-credentials-a

---
apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: vsphere-datacenter-b
spec:
  type: vsphere
  endpoint: https://vcenter-b.example.com/sdk
  credentialSecretRef:
    name: vsphere-credentials-b

vSphere Infrastructure Setup

Required vSphere Objects

The provider expects the following vSphere infrastructure to be configured:

Datacenters and Clusters

# Example vSphere hierarchy:
Datacenter: "Production"
├── Cluster: "Compute-Cluster"
│   ├── ESXi Host: esxi-01.example.com
│   ├── ESXi Host: esxi-02.example.com
│   └── ESXi Host: esxi-03.example.com
├── Datastores:
│   ├── "datastore-ssd"     # High-performance storage
│   ├── "datastore-hdd"     # Standard storage
│   └── "datastore-backup"  # Backup storage
└── Networks:
    ├── "VM Network"        # Default VM network
    ├── "DMZ-Network"       # DMZ port group
    └── "Management"        # Management network

Resource Pools (Optional)

# Create resource pools for workload isolation
Datacenter: "Production"
└── Cluster: "Compute-Cluster"
    └── Resource Pools:
        ├── "Development"    # Dev workloads (lower priority)
        ├── "Production"     # Prod workloads (high priority)
        └── "Testing"        # Test workloads (medium priority)

VM Configuration

VMClass Specification

Define CPU, memory, and vSphere-specific settings:

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: standard-vm
spec:
  cpus: 4
  memory: "8Gi"
  # vSphere-specific configuration
  spec:
    # VM hardware settings
    hardware:
      version: "vmx-19"              # Hardware version
      firmware: "efi"                # BIOS or EFI
      secureBoot: true               # Secure boot (EFI only)
      enableCpuHotAdd: true          # Hot-add CPU
      enableMemoryHotAdd: true       # Hot-add memory
    
    # CPU configuration
    cpu:
      coresPerSocket: 2              # CPU topology
      enableVirtualization: false    # Nested virtualization
      reservationMHz: 1000           # CPU reservation
      limitMHz: 4000                 # CPU limit
    
    # Memory configuration  
    memory:
      reservationMB: 2048            # Memory reservation
      limitMB: 8192                  # Memory limit
      shareLevel: "normal"           # Memory shares (low/normal/high)
    
    # Storage configuration
    storage:
      diskFormat: "thin"             # thick/thin/eagerZeroedThick
      storagePolicy: "VM Storage Policy - SSD"  # vSAN storage policy
    
    # vSphere placement
    placement:
      datacenter: "Production"       # Target datacenter
      cluster: "Compute-Cluster"     # Target cluster  
      resourcePool: "Production"     # Target resource pool
      datastore: "datastore-ssd"     # Preferred datastore
      folder: "/vm/virtrigaud"       # VM folder

VMImage Specification

Reference vSphere templates, content library items, or OVF files:

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMImage
metadata:
  name: ubuntu-22-04-template
spec:
  # Template from vSphere inventory
  source:
    template: "ubuntu-22.04-template"
    datacenter: "Production"
    folder: "/vm/templates"
  
  # Or from content library
  # source:
  #   contentLibrary: "OS Templates"
  #   item: "ubuntu-22.04-cloud"
  
  # Or from OVF/OVA URL
  # source:
  #   ovf: "https://releases.ubuntu.com/22.04/ubuntu-22.04-server-cloudimg-amd64.ova"
  
  # Guest OS identification
  guestOS: "ubuntu64Guest"
  
  # Customization specification
  customization:
    type: "cloudInit"              # cloudInit, sysprep, or linux
    spec: "ubuntu-cloud-init"      # Reference to customization spec

Complete VM Example

apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: web-application
spec:
  providerRef:
    name: vsphere-prod
  classRef:
    name: standard-vm
  imageRef:
    name: ubuntu-22-04-template
  powerState: On
  
  # Disk configuration
  disks:
    - name: root
      size: "100Gi"
      storageClass: "ssd-storage"
      # vSphere-specific disk options
      spec:
        diskMode: "persistent"       # persistent, independent_persistent, independent_nonpersistent
        diskFormat: "thin"           # thick, thin, eagerZeroedThick
        controllerType: "scsi"       # scsi, ide, nvme
        unitNumber: 0                # SCSI unit number
    
    - name: data
      size: "500Gi" 
      storageClass: "hdd-storage"
      spec:
        diskFormat: "thick"
        controllerType: "scsi"
        unitNumber: 1
  
  # Network configuration
  networks:
    # Primary application network
    - name: app-network
      portGroup: "VM Network"
      # Optional: Static IP assignment
      staticIP:
        address: "192.168.100.50/24"
        gateway: "192.168.100.1"
        dns: ["192.168.1.10", "8.8.8.8"]
    
    # Management network
    - name: mgmt-network
      portGroup: "Management"
      # DHCP assignment (default)
  
  # vSphere-specific placement
  placement:
    datacenter: "Production"
    cluster: "Compute-Cluster"
    resourcePool: "Production"
    folder: "/vm/applications"
    datastore: "datastore-ssd"      # Override class default
    host: "esxi-01.example.com"      # Pin to specific host (optional)
  
  # Guest customization
  userData:
    cloudInit:
      inline: |
        #cloud-config
        hostname: web-application
        users:
          - name: ubuntu
            sudo: ALL=(ALL) NOPASSWD:ALL
            ssh_authorized_keys:
              - "ssh-ed25519 AAAA..."
        packages:
          - nginx
          - docker.io
          - open-vm-tools          # VMware tools for guest integration
        runcmd:
          - systemctl enable nginx
          - systemctl enable docker
          - systemctl enable open-vm-tools

Advanced Features

VM Reconfiguration (v0.2.3+)

The vSphere provider supports online VM reconfiguration for CPU, memory, and disk resources:

# Reconfigure VM resources
apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: web-server
spec:
  vmClassRef: medium  # Change from small to medium
  powerState: "On"

Capabilities:

  • Online CPU Changes: Hot-add CPUs to running VMs (requires guest OS support)
  • Online Memory Changes: Hot-add memory to running VMs (requires guest OS support)
  • Disk Resizing: Expand disks online (shrinking not supported for safety)
  • Automatic Fallback: Falls back to offline changes if hot-add not supported
  • Intelligent Detection: Only applies changes when needed

Memory Format Support:

  • Standard units: 2Gi, 4096Mi, 2048MiB, 2GiB
  • Parser handles multiple memory unit formats

Limitations:

  • Disk shrinking prevented to avoid data loss
  • Some guest operating systems require special configuration for hot-add
  • BIOS firmware VMs have limited hot-add support (use EFI firmware)

VM Cloning (v0.2.3+)

Create full or linked clones of existing VMs and templates:

# Clone from existing VM
apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: web-server-02
spec:
  vmClassRef: small
  vmImageRef: web-server-01  # Source VM
  cloneType: linked  # or "full"

Clone Types:

  • Full Clone: Independent copy with separate storage
  • Linked Clone: Space-efficient copy using snapshots
    • Automatically creates snapshot if none exists
    • Requires less storage and faster creation
    • Parent VM must remain available

Use Cases:

  • Rapid test environment provisioning
  • Development environment duplication
  • Template-based deployments
  • Disaster recovery scenarios

Task Status Tracking (v0.2.3+)

Monitor asynchronous vSphere operations in real-time:

# VirtRigaud automatically tracks long-running operations
# No manual configuration needed

# Task tracking provides:
# - Real-time task state (queued, running, success, error)
# - Progress percentage
# - Error messages for failed tasks
# - Integration with vSphere task manager

Features:

  • Automatic tracking of all async operations
  • Progress monitoring via govmomi task manager
  • Detailed error reporting
  • Task history visibility in vCenter

Console Access (v0.2.3+)

Generate direct vSphere web client console URLs:

# Access provided in VM status
kubectl get vm web-server -o yaml

status:
  consoleURL: "https://vcenter.example.com/ui/app/vm;nav=h/urn:vmomi:VirtualMachine:vm-123:xxxxx/summary"
  phase: Running

Features:

  • Direct browser-based VM console access
  • No additional tools required
  • Works with vSphere web client
  • Includes VM instance UUID for reliable identification
  • Generated automatically in Describe operations

Template Management

Creating Templates

# Convert existing VM to template
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMTemplate
metadata:
  name: create-ubuntu-template
spec:
  sourceVM: "ubuntu-base-vm"
  datacenter: "Production"
  targetFolder: "/vm/templates"
  templateName: "ubuntu-22.04-template"
  
  # Template metadata
  annotation: |
    Ubuntu 22.04 LTS Template
    Created: 2024-01-15
    Includes: cloud-init, open-vm-tools
  
  # Template customization
  powerOff: true                   # Power off before conversion
  removeSnapshots: true           # Clean up snapshots
  updateTools: true               # Update VMware tools

Content Library Integration

# Deploy from content library
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMImage  
metadata:
  name: centos-stream-9
spec:
  source:
    contentLibrary: "OS Templates"
    item: "CentOS-Stream-9"
    datacenter: "Production"
  
  # Content library item properties
  properties:
    version: "9.0"
    provider: "CentOS"
    osType: "linux"

Storage Policies

# VMClass with vSAN storage policy
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: high-performance
spec:
  cpus: 8
  memory: "32Gi"
  spec:
    storage:
      # vSAN storage policies
      homePolicy: "VM Storage Policy - Performance"    # VM home/config files
      diskPolicy: "VM Storage Policy - SSD Only"       # Virtual disks
      swapPolicy: "VM Storage Policy - Standard"        # Swap files
      
      # Traditional storage
      datastoreCluster: "DatastoreCluster-SSD"         # Datastore cluster
      antiAffinityRules: true                          # VM anti-affinity

Network Advanced Configuration

# Advanced networking with distributed switches
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMNetworkAttachment
metadata:
  name: advanced-networking
spec:
  networks:
    # Distributed port group
    - name: frontend
      portGroup: "DPG-Frontend-VLAN100"
      distributedSwitch: "DSwitch-Production"
      vlan: 100
      
    # NSX-T logical switch
    - name: backend  
      portGroup: "LS-Backend-App"
      nsx: true
      securityPolicy: "Backend-Security-Policy"
      
    # SR-IOV for high performance
    - name: storage
      portGroup: "DPG-Storage-VLAN200"
      sriov: true
      bandwidth:
        reservation: 1000  # Mbps
        limit: 10000      # Mbps
        shares: 100       # Priority

High Availability

# VM with HA/DRS settings
apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: critical-application
spec:
  providerRef:
    name: vsphere-prod
  # ... other config ...
  
  # High availability configuration
  availability:
    # HA restart priority
    restartPriority: "high"          # disabled, low, medium, high
    isolationResponse: "powerOff"    # none, powerOff, shutdown
    vmMonitoring: "vmMonitoringOnly" # vmMonitoringDisabled, vmMonitoringOnly, vmAndAppMonitoring
    
    # DRS configuration
    drsAutomationLevel: "fullyAutomated"  # manual, partiallyAutomated, fullyAutomated
    drsVmBehavior: "fullyAutomated"       # manual, partiallyAutomated, fullyAutomated
    
    # Anti-affinity rules
    antiAffinityGroups: ["web-tier", "database-tier"]
    
    # Host affinity (pin to specific hosts)
    hostAffinityGroups: ["production-hosts"]

Snapshot Management

# Advanced snapshot configuration
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMSnapshot
metadata:
  name: pre-upgrade-snapshot
spec:
  vmRef:
    name: web-application
  
  # Snapshot settings
  name: "Pre-upgrade snapshot"
  description: "Snapshot before application upgrade"
  memory: true                    # Include memory state
  quiesce: true                   # Quiesce guest filesystem
  
  # Retention policy
  retention:
    maxSnapshots: 3               # Keep max 3 snapshots
    maxAge: "7d"                  # Delete after 7 days
    
  # Schedule (optional)
  schedule: "0 2 * * 0"          # Weekly at 2 AM Sunday
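For reference, the maxAge: "7d" window above can be translated into a concrete cutoff timestamp with GNU date (an illustrative calculation, not something VirtRigaud runs for you):

```shell
# Snapshots older than this UTC timestamp would fall outside a 7-day window
date -u -d '7 days ago' '+%Y-%m-%dT%H:%M:%SZ'
```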

Troubleshooting

Common Issues

❌ Connection Failed

Symptom: failed to connect to vSphere: connection refused

Causes & Solutions:

  1. Network connectivity:

    # Test connectivity to vCenter
    telnet vcenter.example.com 443
    
    # Test from Kubernetes pod
    kubectl run debug --rm -i --tty --image=curlimages/curl -- \
      curl -k https://vcenter.example.com
    
  2. DNS resolution:

    # Test DNS resolution
    nslookup vcenter.example.com
    
    # Use IP address if DNS fails
    
  3. Firewall rules: Ensure port 443 is accessible from Kubernetes cluster

❌ Authentication Failed

Symptom: Login failed: incorrect user name or password

Solutions:

  1. Verify credentials:

    # Test credentials manually
    kubectl get secret vsphere-credentials -o yaml
    
    # Decode and verify
    echo "base64-password" | base64 -d
    
  2. Check user permissions:

    • Verify user exists in vCenter
    • Check assigned roles and privileges
    • Ensure user is not locked out
  3. Test login via vSphere Client: Verify credentials work in the GUI
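As a quick sanity check of the decode step in option 1, here is a round-trip on a made-up sample value (s3cret is purely illustrative):

```shell
# Encode a sample password, then decode it back
echo -n 's3cret' | base64       # prints: czNjcmV0
echo 'czNjcmV0' | base64 -d     # prints: s3cret
```

Note that `echo -n` matters when encoding: a trailing newline baked into the secret is a classic cause of authentication failures.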

❌ Insufficient Privileges

Symptom: operation requires privilege 'VirtualMachine.Interact.PowerOn'

Solution: Grant required privileges to the service account:

# Required privileges for VirtRigaud:
# - Datastore privileges:
#   * Datastore.AllocateSpace
#   * Datastore.Browse  
#   * Datastore.FileManagement
# - Network privileges:
#   * Network.Assign
# - Resource privileges:
#   * Resource.AssignVMToPool
# - Virtual machine privileges:
#   * VirtualMachine.* (all) or specific subset
# - Global privileges:
#   * Global.EnableMethods
#   * Global.DisableMethods

❌ Template Not Found

Symptom: template 'ubuntu-template' not found

Solutions:

# List available templates
govc ls /datacenter/vm/templates/

# Check template path and permissions
govc object.collect -s vm/templates/ubuntu-template summary.config.name

# Verify template is properly marked as template
govc object.collect -s vm/templates/ubuntu-template config.template

❌ Datastore Issues

Symptom: insufficient disk space or datastore not accessible

Solutions:

# Check datastore capacity
govc datastore.info datastore-name

# List accessible datastores
govc datastore.ls

# Check datastore cluster configuration
govc datastore.cluster.info

❌ Network Configuration

Symptom: network 'VM Network' not found

Solutions:

# List available networks
govc ls /datacenter/network/

# Check distributed port groups
govc dvs.portgroup.info

# List networks with details to verify accessibility
govc ls -l /datacenter/network/

Validation Commands

Test your vSphere setup before deploying:

# 1. Install and configure govc CLI tool
export GOVC_URL='https://vcenter.example.com'
export GOVC_USERNAME='administrator@vsphere.local'
export GOVC_PASSWORD='password'
export GOVC_INSECURE=1  # for self-signed certificates

# 2. Test connectivity
govc about

# 3. List datacenters
govc ls

# 4. List clusters and hosts
govc ls /datacenter/host/

# 5. List datastores
govc ls /datacenter/datastore/

# 6. List networks
govc ls /datacenter/network/

# 7. List templates
govc ls /datacenter/vm/templates/

# 8. Test VM creation (creates a minimal throwaway VM, then removes it)
govc vm.create -c 1 -m 1024 -g ubuntu64Guest -net "VM Network" test-vm
govc vm.destroy test-vm

Debug Logging

Enable verbose logging for the vSphere provider:

providers:
  vsphere:
    env:
      - name: LOG_LEVEL
        value: "debug"
      - name: GOVMOMI_DEBUG
        value: "true"
    endpoint: "https://vcenter.example.com"

Monitor vSphere tasks:

# Monitor recent tasks in vCenter
govc task.ls

# Get details of specific task
govc task.info task-123

Performance Optimization

Resource Allocation

# High-performance VMClass
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: performance-optimized
spec:
  cpus: 16
  memory: "64Gi"
  spec:
    cpu:
      coresPerSocket: 8            # Match physical CPU topology
      reservationMHz: 8000         # Guarantee CPU resources
      shares: 2000                 # High priority (normal=1000)
      enableVirtualization: false  # Disable if not needed for performance
    
    memory:
      reservationMB: 65536         # Guarantee memory
      shares: 2000                 # High priority
      shareLevel: "high"           # Alternative to shares value
    
    hardware:
      enableCpuHotAdd: false       # Better performance when disabled
      enableMemoryHotAdd: false    # Better performance when disabled
      
    # NUMA configuration for large VMs
    numa:
      enabled: true
      coresPerSocket: 8            # Align with NUMA topology

Storage Optimization

# Storage-optimized configuration
spec:
  storage:
    diskFormat: "eagerZeroedThick"  # Best performance, more space usage
    controllerType: "pvscsi"        # Paravirtual SCSI for better performance
    multiwriter: false              # Disable unless needed
    
    # vSAN optimization
    storagePolicy: "Performance-Tier"
    cachingPolicy: "writethrough"   # or "writeback" for better performance
    
    # Multiple controllers for high IOPS
    scsiControllers:
      - type: "pvscsi"
        busNumber: 0
        maxDevices: 15
      - type: "pvscsi" 
        busNumber: 1
        maxDevices: 15

Network Optimization

# High-performance networking
networks:
  - name: high-performance
    portGroup: "DPG-HighPerf-SR-IOV"
    adapter: "vmxnet3"             # Best performance adapter
    sriov: true                    # SR-IOV for near-native performance
    bandwidth:
      reservation: 1000            # Guaranteed bandwidth (Mbps)
      limit: 10000                 # Maximum bandwidth (Mbps)
      shares: 100                  # Priority level

API Reference

For complete API reference, see the Provider API Documentation.

Contributing

To contribute to the vSphere provider:

  1. See the Provider Development Guide
  2. Check the GitHub repository
  3. Review open issues

Support

LibVirt/KVM Provider

The LibVirt provider enables VirtRigaud to manage virtual machines on KVM/QEMU hypervisors using the LibVirt API. This provider runs as a dedicated pod that communicates with LibVirt daemons locally or remotely, making it ideal for development, on-premises deployments, and cloud environments.

Overview

This provider implements the VirtRigaud provider interface to manage VM lifecycle operations on LibVirt/KVM:

  • Create: Create VMs from cloud images with comprehensive cloud-init support
  • Delete: Remove VMs and associated storage volumes (with cleanup)
  • Power: Start, stop, and reboot virtual machines
  • Describe: Query VM state, resource usage, guest agent information, and network details
  • Reconfigure: Modify VM resources (v0.2.3+ - requires VM restart)
  • Clone: Create new VMs based on existing VM configurations
  • Snapshot: Create, delete, and revert VM snapshots (storage-dependent)
  • ConsoleURL: Generate VNC console URLs for remote access (v0.2.3+)
  • ImagePrepare: Download and prepare cloud images from URLs
  • Storage Management: Advanced storage pool and volume operations
  • Cloud-Init: Full NoCloud datasource support with ISO generation
  • QEMU Guest Agent: Integration for enhanced guest OS monitoring
  • Network Configuration: Support for various network types and bridges

Prerequisites

The LibVirt provider connects to a LibVirt daemon (libvirtd) which can run locally or remotely. This makes it flexible for both development and production environments.

Connection Options:

  • Local LibVirt: Connects to local libvirtd via qemu:///system (ideal for development)
  • Remote LibVirt: Connects to remote libvirtd over SSH/TLS (production)
  • Container LibVirt: Works with containerized libvirt or KubeVirt

Requirements:

  • LibVirt daemon (libvirtd) running locally or accessible remotely
  • KVM/QEMU hypervisor support (hardware virtualization recommended)
  • Storage pools configured for VM disk storage
  • Network bridges or interfaces for VM networking
  • Appropriate permissions for VM management operations

Development Setup:

For local development, you can:

  • Linux: Install libvirt-daemon-system and qemu-kvm packages
  • macOS/Windows: Use remote LibVirt or nested virtualization
  • Testing: The provider can connect to local libvirtd without complex infrastructure

Authentication & Connection

The LibVirt provider supports multiple connection methods:

Local LibVirt Connection

For connecting to a LibVirt daemon on the same host as the provider pod:

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: libvirt-local
  namespace: default
spec:
  type: libvirt
  endpoint: "qemu:///system"  # Local system connection
  credentialSecretRef:
    name: libvirt-local-credentials
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-libvirt:v0.2.3"
    service:
      port: 9090

Note: When using local connections, ensure the provider pod has appropriate permissions to access the LibVirt socket.

Remote Connection with SSH

For remote LibVirt over SSH:

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: libvirt-remote
  namespace: default
spec:
  type: libvirt
  endpoint: "qemu+ssh://user@libvirt-host/system"
  credentialSecretRef:
    name: libvirt-ssh-credentials
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-libvirt:v0.2.3"
    service:
      port: 9090

Create SSH credentials secret:

apiVersion: v1
kind: Secret
metadata:
  name: libvirt-ssh-credentials
  namespace: default
type: Opaque
stringData:
  username: "libvirt-user"
  # For key-based auth (recommended):
  tls.key: |
    -----BEGIN PRIVATE KEY-----
    # Your SSH private key here
    -----END PRIVATE KEY-----
  # For password auth (less secure):
  password: "your-password"

Remote Connection with TLS

For remote LibVirt over TLS:

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: libvirt-tls
  namespace: default
spec:
  type: libvirt
  endpoint: "qemu+tls://libvirt-host:16514/system"
  credentialSecretRef:
    name: libvirt-tls-credentials
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-libvirt:v0.2.3"
    service:
      port: 9090

Create TLS credentials secret:

apiVersion: v1
kind: Secret
metadata:
  name: libvirt-tls-credentials
  namespace: default
type: kubernetes.io/tls
data:
  tls.crt: # Base64 encoded client certificate
  tls.key: # Base64 encoded client private key
  ca.crt:  # Base64 encoded CA certificate

Configuration

Connection URIs

The LibVirt provider supports standard LibVirt connection URIs:

URI Format                      Description                Use Case
qemu:///system                  Local system connection    Development, single-host
qemu+ssh://user@host/system     SSH connection             Remote access with SSH
qemu+tls://host:16514/system    TLS connection             Secure remote access
qemu+tcp://host:16509/system    TCP connection             Insecure remote (testing only)

⚠️ Note: All LibVirt URI schemes are now supported in the CRD validation pattern.

Deployment Configuration

Using Helm Values

# values.yaml
providers:
  libvirt:
    enabled: true
    endpoint: "qemu:///system"  # Adjust for your environment
    # For remote connections:
    # endpoint: "qemu+ssh://user@libvirt-host/system"
    credentialSecretRef:
      name: libvirt-credentials  # Optional for local connections

Development Configuration

# For local development with LibVirt
providers:
  libvirt:
    enabled: true
    endpoint: "qemu:///system"
    runtime:
      # Mount host libvirt socket (for local access)
      volumes:
      - name: libvirt-sock
        hostPath:
          path: /var/run/libvirt/libvirt-sock
      volumeMounts:
      - name: libvirt-sock
        mountPath: /var/run/libvirt/libvirt-sock

Production Configuration

# For production with remote LibVirt
apiVersion: v1
kind: Secret
metadata:
  name: libvirt-credentials
  namespace: virtrigaud-system
type: Opaque
stringData:
  username: "virtrigaud-service"
  tls.crt: |
    -----BEGIN CERTIFICATE-----
    # Client certificate for TLS authentication
    -----END CERTIFICATE-----
  tls.key: |
    -----BEGIN PRIVATE KEY-----
    # Client private key
    -----END PRIVATE KEY-----
  ca.crt: |
    -----BEGIN CERTIFICATE-----
    # CA certificate
    -----END CERTIFICATE-----

---
apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: libvirt-production
  namespace: virtrigaud-system
spec:
  type: libvirt
  endpoint: "qemu+tls://libvirt.example.com:16514/system"
  credentialSecretRef:
    name: libvirt-credentials

Storage Configuration

Storage Pools

LibVirt requires storage pools for VM disks. Common configurations:

# Create directory-based storage pool
virsh pool-define-as default dir --target /var/lib/libvirt/images
virsh pool-build default
virsh pool-start default
virsh pool-autostart default

# Create LVM-based storage pool (performance)
virsh pool-define-as lvm-pool logical --source-name vg-libvirt --target /dev/vg-libvirt
virsh pool-start lvm-pool
virsh pool-autostart lvm-pool

VMClass Storage Specification

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: standard
spec:
  cpus: 2
  memory: "4Gi"
  # LibVirt-specific storage settings
  spec:
    storage:
      pool: "default"        # Storage pool name
      format: "qcow2"        # Disk format (qcow2, raw)
      cache: "writethrough"  # Cache mode
      io: "threads"          # I/O mode

Network Configuration

Network Setup

Configure LibVirt networks for VM connectivity:

# Create NAT network (default)
virsh net-define /usr/share/libvirt/networks/default.xml
virsh net-start default
virsh net-autostart default

# Create bridge network (for external access)
cat > /tmp/bridge-network.xml << EOF
<network>
  <name>br0</name>
  <forward mode='bridge'/>
  <bridge name='br0'/>
</network>
EOF
virsh net-define /tmp/bridge-network.xml
virsh net-start br0

Network Bridge Mapping

Network Name    LibVirt Network    Use Case
default, nat    default            NAT networking
bridge, br0     br0                Bridged networking
isolated        isolated           Host-only networking

VM Network Configuration

apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: web-server
spec:
  providerRef:
    name: libvirt-local
  networks:
    # Use default NAT network
    - name: default
    # Use bridged network for external access
    - name: bridge
      bridge: br0
      mac: "52:54:00:12:34:56"  # Optional MAC address
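If you hand-pick MAC addresses like the one above, staying inside the 52:54:00 prefix conventionally used for KVM guests avoids collisions with physical hardware. A throwaway generator (a sketch, not part of VirtRigaud):

```shell
# Generate a random MAC address in the KVM 52:54:00 prefix
suffix=$(od -An -N3 -tx1 /dev/urandom | tr ' ' ':' | sed 's/^://')
echo "52:54:00:${suffix}"
```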

VM Configuration

VMClass Specification

Define hardware resources and LibVirt-specific settings:

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: development
spec:
  cpus: 2
  memory: "4Gi"
  # LibVirt-specific configuration
  spec:
    machine: "pc-i440fx-2.12"  # Machine type
    cpu:
      mode: "host-model"       # CPU mode (host-model, host-passthrough)
      topology:
        sockets: 1
        cores: 2
        threads: 1
    features:
      acpi: true
      apic: true
      pae: true
    clock:
      offset: "utc"
      timers:
        rtc: "catchup"
        pit: "delay"
        hpet: false

VMImage Specification

Reference existing disk images or templates:

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMImage
metadata:
  name: ubuntu-22-04
spec:
  source:
    # Path to existing image in storage pool
    disk: "/var/lib/libvirt/images/ubuntu-22.04-base.qcow2"
    # Or reference by pool and volume
    # pool: "default"
    # volume: "ubuntu-22.04-base"
  format: "qcow2"
  
  # Cloud-init preparation
  cloudInit:
    enabled: true
    userDataTemplate: |
      #cloud-config
      hostname: {{ .Name }}
      users:
        - name: ubuntu
          sudo: ALL=(ALL) NOPASSWD:ALL
          ssh_authorized_keys:
            - {{ .SSHPublicKey }}

Complete VM Example

apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: dev-workstation
spec:
  providerRef:
    name: libvirt-local
  classRef:
    name: development
  imageRef:
    name: ubuntu-22-04
  powerState: On
  
  # Disk configuration
  disks:
    - name: root
      size: "50Gi"
      storageClass: "fast-ssd"  # Maps to LibVirt storage pool
  
  # Network configuration  
  networks:
    - name: default  # NAT network for internet
    - name: bridge   # Bridge for LAN access
      staticIP:
        address: "192.168.1.100/24"
        gateway: "192.168.1.1"
        dns: ["8.8.8.8", "1.1.1.1"]
  
  # Cloud-init user data
  userData:
    cloudInit:
      inline: |
        #cloud-config
        hostname: dev-workstation
        users:
          - name: developer
            sudo: ALL=(ALL) NOPASSWD:ALL
            shell: /bin/bash
            ssh_authorized_keys:
              - "ssh-ed25519 AAAA..."
        packages:
          - build-essential
          - docker.io
          - code
        runcmd:
          - systemctl enable docker
          - usermod -aG docker developer

Cloud-Init Integration

Automatic Configuration

The LibVirt provider automatically handles cloud-init setup:

  • ISO Generation: Creates cloud-init ISO with user-data and meta-data
  • Attachment: Attaches ISO as CD-ROM device to VM
  • Network Config: Generates network configuration from VM spec
  • User Data: Renders templates with VM-specific values

Advanced Cloud-Init

userData:
  cloudInit:
    inline: |
      #cloud-config
      hostname: {{ .Name }}
      
      # Network configuration (if not using DHCP)
      network:
        version: 2
        ethernets:
          ens3:
            addresses: [192.168.1.100/24]
            gateway4: 192.168.1.1
            nameservers:
              addresses: [8.8.8.8, 1.1.1.1]
      
      # Storage configuration
      disk_setup:
        /dev/vdb:
          table_type: gpt
          layout: true
      
      fs_setup:
        - device: /dev/vdb1
          filesystem: ext4
          label: data
      
      mounts:
        - [/dev/vdb1, /data, ext4, defaults]
      
      # Package installation
      packages:
        - qemu-guest-agent  # Enable guest agent
        - cloud-init
        - curl
      
      # Enable services
      runcmd:
        - systemctl enable qemu-guest-agent
        - systemctl start qemu-guest-agent

Performance Optimization

KVM Optimization

# VMClass with performance optimizations
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: high-performance
spec:
  cpus: 8
  memory: "16Gi"
  spec:
    cpu:
      mode: "host-passthrough"  # Best performance
      topology:
        sockets: 1
        cores: 8
        threads: 1
    # NUMA topology for large VMs
    numa:
      cells:
        - id: 0
          cpus: "0-7"
          memory: "16"
    
    # Virtio devices for performance
    devices:
      disk:
        bus: "virtio"
        cache: "none"
        io: "native"
      network:
        model: "virtio"
      video:
        model: "virtio"

Storage Performance

# Create high-performance storage pool
virsh pool-define-as ssd-pool logical --source-name vg-ssd --target /dev/vg-ssd
virsh pool-start ssd-pool

# Use raw format for better performance (larger disk usage)
virsh vol-create-as ssd-pool vm-disk 100G --format raw

# Enable native AIO and disable cache for direct I/O
# (configured automatically by provider based on VMClass)

Troubleshooting

Common Issues

❌ Connection Failed

Symptom: failed to connect to Libvirt: <error>

Causes & Solutions:

  1. Local connection issues:

    # Check libvirtd status
    sudo systemctl status libvirtd
    
    # Start if not running
    sudo systemctl start libvirtd
    sudo systemctl enable libvirtd
    
    # Test connection
    virsh -c qemu:///system list
    
  2. Remote SSH connection:

    # Test SSH connectivity
    ssh user@libvirt-host virsh list
    
    # Check SSH key permissions
    chmod 600 ~/.ssh/id_rsa
    
  3. Remote TLS connection:

    # Verify certificates
    openssl x509 -in client-cert.pem -text -noout
    
    # Test TLS connection
    virsh -c qemu+tls://host:16514/system list
    

❌ Permission Denied

Symptom: authentication failed or permission denied

Solutions:

# Add user to libvirt group
sudo usermod -a -G libvirt $USER

# Check libvirt group membership
groups $USER

# Verify permissions on libvirt socket
ls -la /var/run/libvirt/libvirt-sock

# For containerized providers, ensure socket is mounted

❌ Storage Pool Not Found

Symptom: storage pool 'default' not found

Solution:

# List available pools
virsh pool-list --all

# Create default pool if missing
virsh pool-define-as default dir --target /var/lib/libvirt/images
virsh pool-build default
virsh pool-start default
virsh pool-autostart default

# Verify pool is active
virsh pool-info default

❌ Network Not Available

Symptom: network 'default' not found

Solution:

# List networks
virsh net-list --all

# Start default network
virsh net-start default
virsh net-autostart default

# Create bridge network if needed
virsh net-define /usr/share/libvirt/networks/default.xml

❌ KVM Not Available

Symptom: KVM is not available or hardware acceleration not available

Solutions:

  1. Check virtualization support:

    # Check CPU virtualization features
    egrep -c '(vmx|svm)' /proc/cpuinfo
    
    # Check KVM modules
    lsmod | grep kvm
    
    # Load KVM modules if missing
    sudo modprobe kvm
    sudo modprobe kvm_intel  # or kvm_amd
    
  2. BIOS/UEFI settings: Enable Intel VT-x or AMD-V

  3. Nested virtualization: If running in a VM, enable nested virtualization

Validation Commands

Test your LibVirt setup before deploying:

# 1. Test LibVirt connection
virsh -c qemu:///system list

# 2. Check storage pools
virsh pool-list --all

# 3. Check networks
virsh net-list --all

# 4. Test VM creation (simple test)
virt-install --name test-vm --memory 512 --vcpus 1 \
  --disk size=1 --network network=default \
  --boot cdrom --noautoconsole --dry-run

# 5. From within Kubernetes pod
kubectl run debug --rm -i --tty --image=ubuntu:22.04 -- bash
# Then test virsh commands if socket is mounted

Debug Logging

Enable verbose logging for the LibVirt provider:

providers:
  libvirt:
    env:
      - name: LOG_LEVEL
        value: "debug"
      - name: LIBVIRT_DEBUG
        value: "1"
    endpoint: "qemu:///system"

Advanced Features

VM Reconfiguration (v0.2.3+)

The Libvirt provider supports VM reconfiguration for CPU, memory, and disk resources:

# Reconfigure VM resources
apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: web-server
spec:
  classRef:
    name: medium  # Change from small to medium
  powerState: "On"

Capabilities:

  • Online CPU Changes: Modify CPU count using virsh setvcpus --live for running VMs
  • Online Memory Changes: Modify memory using virsh setmem --live for running VMs
  • Disk Resizing: Expand disk volumes via storage provider integration
  • Offline Configuration: Updates persistent config for stopped VMs via --config flag

Important Notes:

  • Most changes require VM restart for full effect
  • Online changes apply to running VM but may need restart for persistence
  • Disk shrinking not supported for safety
  • Memory format parsing supports bytes, KiB, MiB, GiB
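That unit handling matches the IEC binary suffixes used throughout the VMClass examples; with GNU coreutils you can reproduce the conversion to bytes (illustrative, not the provider's actual parser):

```shell
# Convert IEC memory strings, as used in VMClass specs, to bytes
numfmt --from=iec-i 4Gi     # prints: 4294967296
numfmt --from=iec-i 512Mi   # prints: 536870912
```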

Implementation Details:

  • Uses virsh setvcpus --live --config for CPU changes
  • Uses virsh setmem --live --config for memory changes
  • Parses current VM configuration with virsh dominfo
  • Integrates with storage provider for volume resizing

VNC Console Access (v0.2.3+)

Generate VNC console URLs for direct VM access:

# Access provided in VM status
kubectl get vm web-server -o yaml

status:
  consoleURL: "vnc://libvirt-host.example.com:5900"
  phase: Running

Features:

  • Automatic VNC port extraction from domain XML
  • Direct connection URLs for VNC clients
  • Support for standard VNC viewers (TigerVNC, RealVNC, etc.)
  • Web-based VNC viewers compatible (noVNC)

VNC Client Usage:

# Using a desktop VNC client (TigerVNC, RealVNC, etc.)
vncviewer libvirt-host.example.com:5900

# Web browser (with noVNC)
# Access through a web-based VNC proxy

Configuration: VNC is automatically configured during VM creation. The provider:

  1. Extracts VNC configuration from domain XML using virsh dumpxml
  2. Parses the graphics port number
  3. Constructs the VNC URL with host and port
  4. Returns URL in Describe operations
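Steps 1-3 can be approximated with standard shell tools. The graphics element below is a hand-written sample of what virsh dumpxml emits, so the snippet runs without a live hypervisor:

```shell
# Sample <graphics> element from a domain XML dump
xml="<graphics type='vnc' port='5900' autoport='yes' listen='0.0.0.0'/>"

# Extract the VNC port and assemble the console URL
port=$(printf '%s\n' "$xml" | sed -n "s/.*port='\([0-9]*\)'.*/\1/p")
echo "vnc://libvirt-host.example.com:${port}"   # prints: vnc://libvirt-host.example.com:5900
```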

Advanced Configuration

High Availability Setup

# Multiple LibVirt hosts for HA
apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: libvirt-cluster
spec:
  type: libvirt
  # Use load balancer or failover endpoint
  endpoint: "qemu+tls://libvirt-cluster.example.com:16514/system"
  runtime:
    replicas: 2  # Multiple provider instances
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: libvirt-provider
            topologyKey: kubernetes.io/hostname

GPU Passthrough

# VMClass with GPU passthrough
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: gpu-workstation
spec:
  cpus: 8
  memory: "32Gi"
  spec:
    devices:
      hostdev:
        - type: "pci"
          source:
            address:
              domain: "0x0000"
              bus: "0x01"
              slot: "0x00"
              function: "0x0"
          managed: true

API Reference

For complete API reference, see the Provider API Documentation.

Contributing

To contribute to the LibVirt provider:

  1. See the Provider Development Guide
  2. Check the GitHub repository
  3. Review open issues

Support

Proxmox VE Provider

The Proxmox VE provider enables VirtRigaud to manage virtual machines on Proxmox Virtual Environment (PVE) clusters using the native Proxmox API.

Overview

This provider implements the VirtRigaud provider interface to manage VM lifecycle operations on Proxmox VE:

  • Create: Create VMs from templates or ISO images with cloud-init support
  • Delete: Remove VMs and associated resources
  • Power: Start, stop, and reboot virtual machines
  • Describe: Query VM state, IPs, and console access
  • Guest Agent Integration: Enhanced IP detection via QEMU guest agent (v0.2.3+)
  • Reconfigure: Hot-plug CPU/memory changes, disk expansion
  • Clone: Create linked or full clones of existing VMs
  • Snapshot: Create, delete, and revert VM snapshots with memory state
  • ImagePrepare: Import and prepare VM templates from URLs or ensure existence

Prerequisites

⚠️ IMPORTANT: Active Proxmox VE Server Required

The Proxmox provider requires a running Proxmox VE server to function. Unlike some providers that can operate in simulation mode, this provider performs actual API calls to Proxmox VE during startup and operation.

Requirements:

  • Proxmox VE 7.0 or later (running and accessible)
  • API token or user account with appropriate privileges
  • Network connectivity from VirtRigaud to Proxmox API (port 8006/HTTPS)
  • Valid TLS configuration (production) or skip verification (development)

Testing/Development:

If you don’t have a Proxmox VE server available:

  • Use Proxmox VE in a VM for testing
  • Consider alternative providers (libvirt, vSphere) for local development
  • The provider will fail startup validation without a reachable Proxmox endpoint

Authentication

The Proxmox provider supports two authentication methods:

API Token Authentication (Recommended)

API tokens provide secure, scope-limited access without exposing user passwords.

  1. Create API Token in Proxmox:

    # In Proxmox web UI: Datacenter -> Permissions -> API Tokens
    # Or via CLI:
    pveum user token add <USER@REALM> <TOKENID> --privsep 0
    
  2. Configure Provider:

    apiVersion: infra.virtrigaud.io/v1beta1
    kind: Provider
    metadata:
      name: proxmox-prod
      namespace: default
    spec:
      type: proxmox
      endpoint: https://pve.example.com:8006
      credentialSecretRef:
        name: pve-credentials
      runtime:
        mode: Remote
        image: "ghcr.io/projectbeskar/virtrigaud/provider-proxmox:v0.2.3"
        service:
          port: 9090
    
  3. Create Credentials Secret:

    apiVersion: v1
    kind: Secret
    metadata:
      name: pve-credentials
      namespace: default
    type: Opaque
    stringData:
      token_id: "virtrigaud@pve!vrtg-token"
      token_secret: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
    

Username/Password Authentication

For environments that cannot use API tokens:

apiVersion: v1
kind: Secret
metadata:
  name: pve-credentials
  namespace: default
type: Opaque
stringData:
  username: "virtrigaud@pve"
  password: "secure-password"
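Malformed credential values are a common cause of the authentication errors listed later in this guide. Token IDs in particular must follow the USER@REALM!TOKENID shape; a quick format check (illustrative only, using the token ID from the earlier example):

```shell
# Verify a Proxmox API token ID matches USER@REALM!TOKENID
token_id='virtrigaud@pve!vrtg-token'
echo "$token_id" | grep -Eq '^[^@!]+@[^@!]+![A-Za-z0-9._-]+$' && echo "format ok"
```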

Deployment Configuration

Required Environment Variables

The Proxmox provider requires environment variables to connect to your Proxmox VE server. Configure these variables in your Helm values file:

Variable                  Required     Description                      Example
PVE_ENDPOINT              ✅ Yes       Proxmox VE API endpoint URL      https://pve.example.com:8006
PVE_USERNAME              ✅ Yes*      Username for password auth       root@pam or user@realm
PVE_PASSWORD              ✅ Yes*      Password for username            secure-password
PVE_TOKEN_ID              ✅ Yes**     API token ID (alternative)       user@realm!tokenid
PVE_TOKEN_SECRET          ✅ Yes**     API token secret (alternative)   xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
PVE_INSECURE_SKIP_VERIFY  🔵 Optional  Skip TLS verification            true (dev only)

* Either username/password OR token authentication is required
** API token authentication is recommended for production

Helm Configuration Examples

Username/Password Authentication

# values.yaml
providers:
  proxmox:
    enabled: true
    env:
      - name: PVE_ENDPOINT
        value: "https://your-proxmox-server.example.com:8006"
      - name: PVE_USERNAME
        value: "root@pam"
      - name: PVE_PASSWORD
        value: "your-secure-password"

API Token Authentication

# values.yaml
providers:
  proxmox:
    enabled: true
    env:
      - name: PVE_ENDPOINT
        value: "https://your-proxmox-server.example.com:8006"
      - name: PVE_TOKEN_ID
        value: "virtrigaud@pve!automation"
      - name: PVE_TOKEN_SECRET
        value: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

Using Kubernetes Secrets (Production)

For production environments, use Kubernetes secrets:

# Create secret first
apiVersion: v1
kind: Secret
metadata:
  name: proxmox-credentials
type: Opaque
stringData:
  PVE_ENDPOINT: "https://your-proxmox-server.example.com:8006"
  PVE_TOKEN_ID: "virtrigaud@pve!automation"  
  PVE_TOKEN_SECRET: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

---
# values.yaml - Reference the secret
providers:
  proxmox:
    enabled: true
    env:
      - name: PVE_ENDPOINT
        valueFrom:
          secretKeyRef:
            name: proxmox-credentials
            key: PVE_ENDPOINT
      - name: PVE_TOKEN_ID
        valueFrom:
          secretKeyRef:
            name: proxmox-credentials
            key: PVE_TOKEN_ID
      - name: PVE_TOKEN_SECRET
        valueFrom:
          secretKeyRef:
            name: proxmox-credentials
            key: PVE_TOKEN_SECRET

Configuration Validation

The provider validates configuration at startup and will fail to start if:

  • PVE_ENDPOINT is missing or invalid
  • Neither username/password nor token credentials are provided
  • The Proxmox server is unreachable
  • Authentication fails

Error Examples

# Missing endpoint
ERROR Failed to create PVE client error="endpoint is required"

# Invalid endpoint format  
ERROR Failed to create PVE client error="invalid endpoint URL"

# Authentication failure
ERROR Failed to authenticate error="authentication failed: invalid credentials"

# Connection failure
ERROR Failed to connect error="dial tcp: no route to host"

Development vs Production

| Environment | Endpoint | Authentication | TLS | Notes |
|---|---|---|---|---|
| Development | https://pve-test.local:8006 | Username/Password | Skip verify | Use PVE_INSECURE_SKIP_VERIFY=true |
| Staging | https://pve-staging.company.com:8006 | API Token | Custom CA | Configure CA bundle |
| Production | https://pve.company.com:8006 | API Token | Valid cert | Use Kubernetes secrets |

TLS Configuration

Self-Signed Certificates (Development)

For test environments with self-signed certificates:

spec:
  runtime:
    env:
      - name: PVE_INSECURE_SKIP_VERIFY
        value: "true"

Custom CA Certificate (Production)

For production with custom CA:

apiVersion: v1
kind: Secret
metadata:
  name: pve-credentials
type: Opaque
stringData:
  ca.crt: |
    -----BEGIN CERTIFICATE-----
    MIIDXTCCAkWgAwIBAgIJAL...
    -----END CERTIFICATE-----

Reconfiguration Support

Online Reconfiguration

The Proxmox provider supports online (hot-plug) reconfiguration for:

  • CPU: Add/remove vCPUs while VM is running (guest OS support required)
  • Memory: Increase memory using balloon driver (guest tools required)
  • Disk Expansion: Expand disks online (disk shrinking not supported)

Reconfigure Matrix

| Operation | Online Support | Requirements | Notes |
|---|---|---|---|
| CPU increase | βœ… Yes | Guest OS support | Most modern Linux/Windows |
| CPU decrease | βœ… Yes | Guest OS support | May require guest cooperation |
| Memory increase | βœ… Yes | Balloon driver | Install qemu-guest-agent |
| Memory decrease | ⚠️ Limited | Balloon driver + guest | May require power cycle |
| Disk expand | βœ… Yes | Online resize support | Filesystem resize separate |
| Disk shrink | ❌ No | Not supported | Security/data protection |

Example Reconfiguration

# Scale up VM resources
apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: web-server
spec:
  # ... existing spec ...
  classRef:
    name: large  # Changed from 'small'
---
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: large
spec:
  cpus: 8        # Increased from 2
  memory: "16Gi" # Increased from 4Gi

Snapshot Management

Snapshot Features

  • Memory Snapshots: Include VM memory state for consistent restore
  • Crash-Consistent: Without memory for faster snapshots
  • Snapshot Trees: Nested snapshots with parent-child relationships
  • Metadata: Description and timestamp tracking

Snapshot Operations

# Create snapshot with memory
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMSnapshot
metadata:
  name: before-upgrade
spec:
  vmRef:
    name: web-server
  description: "Pre-maintenance snapshot"
  includeMemory: true  # Include running memory state

# Create snapshot via kubectl
kubectl create vmsnapshot before-upgrade \
  --vm=web-server \
  --description="Before major upgrade" \
  --include-memory=true

Multi-NIC Networking

Network Configuration

The provider supports multiple network interfaces with:

  • Bridge Assignment: Map to Proxmox bridges (vmbr0, vmbr1, etc.)
  • VLAN Tagging: 802.1Q VLAN support
  • Static IPs: Cloud-init integration for network configuration
  • MAC Addresses: Custom MAC assignment

Example Multi-NIC VM

apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: multi-nic-vm
spec:
  providerRef:
    name: proxmox-prod
  classRef:
    name: medium
  imageRef:
    name: ubuntu-22
  networks:
    # Primary LAN interface
    - name: lan
      bridge: vmbr0
      staticIP:
        address: "192.168.1.100/24"
        gateway: "192.168.1.1"
        dns: ["8.8.8.8", "1.1.1.1"]
    
    # DMZ interface with VLAN
    - name: dmz
      bridge: vmbr1
      vlan: 100
      staticIP:
        address: "10.0.100.50/24"
    
    # Management interface
    - name: mgmt
      bridge: vmbr2
      mac: "02:00:00:aa:bb:cc"

Network Bridge Mapping

| Network Name | Default Bridge | Use Case |
|---|---|---|
| lan, default | vmbr0 | General LAN connectivity |
| dmz | vmbr1 | DMZ/public services |
| mgmt, management | vmbr2 | Management network |
| vmbr* | Same name | Direct bridge reference |

Configuration

Required Environment Variables

⚠️ The provider requires environment variables to connect to Proxmox VE:

| Variable | Description | Required | Default | Example |
|---|---|---|---|---|
| PVE_ENDPOINT | Proxmox API endpoint URL | Yes | - | https://pve.example.com:8006/api2 |
| PVE_TOKEN_ID | API token identifier | Yes* | - | virtrigaud@pve!vrtg-token |
| PVE_TOKEN_SECRET | API token secret | Yes* | - | xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx |
| PVE_USERNAME | Username for session auth | Yes* | - | virtrigaud@pve |
| PVE_PASSWORD | Password for session auth | Yes* | - | secure-password |
| PVE_NODE_SELECTOR | Preferred nodes (comma-separated) | No | Auto-detect | pve-node-1,pve-node-2 |
| PVE_INSECURE_SKIP_VERIFY | Skip TLS verification | No | false | true |
| PVE_CA_BUNDLE | Custom CA certificate | No | - | -----BEGIN CERTIFICATE-----... |

* Either token (PVE_TOKEN_ID + PVE_TOKEN_SECRET) or username/password (PVE_USERNAME + PVE_PASSWORD) is required

Deployment Configuration

The provider needs environment variables to connect to Proxmox. Here are complete deployment examples:

Using Helm Values

# values.yaml
providers:
  proxmox:
    enabled: true
    env:
      - name: PVE_ENDPOINT
        value: "https://pve.example.com:8006/api2"
      - name: PVE_TOKEN_ID
        value: "virtrigaud@pve!vrtg-token"
      - name: PVE_TOKEN_SECRET
        value: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
      - name: PVE_INSECURE_SKIP_VERIFY
        value: "true"  # Only for development!
      - name: PVE_NODE_SELECTOR
        value: "pve-node-1,pve-node-2"  # Optional

Using a Kubernetes Secret

# Create secret with credentials
apiVersion: v1
kind: Secret
metadata:
  name: proxmox-credentials
  namespace: virtrigaud-system
type: Opaque
stringData:
  PVE_ENDPOINT: "https://pve.example.com:8006/api2"
  PVE_TOKEN_ID: "virtrigaud@pve!vrtg-token"
  PVE_TOKEN_SECRET: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
  PVE_INSECURE_SKIP_VERIFY: "false"

---
# Reference secret in deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: virtrigaud-provider-proxmox
spec:
  template:
    spec:
      containers:
      - name: provider-proxmox
        image: ghcr.io/projectbeskar/virtrigaud/provider-proxmox:v0.2.3
        envFrom:
        - secretRef:
            name: proxmox-credentials

Development/Testing Configuration

# For development with a local Proxmox VE instance
providers:
  proxmox:
    enabled: true
    env:
      - name: PVE_ENDPOINT
        value: "https://192.168.1.100:8006/api2"
      - name: PVE_USERNAME
        value: "root@pam"
      - name: PVE_PASSWORD
        value: "your-password"
      - name: PVE_INSECURE_SKIP_VERIFY
        value: "true"

Node Selection

The provider can be configured to prefer specific nodes:

env:
  - name: PVE_NODE_SELECTOR
    value: "pve-node-1,pve-node-2"

If not specified, the provider will automatically select nodes based on availability.
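A comma-separated selector like the one above is straightforward to parse; the following is an illustrative sketch (parseNodeSelector is a hypothetical helper, not the provider's actual code):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// parseNodeSelector splits PVE_NODE_SELECTOR into preferred node names.
// An empty value means the provider auto-selects nodes by availability.
func parseNodeSelector(v string) []string {
	if v == "" {
		return nil
	}
	var nodes []string
	for _, n := range strings.Split(v, ",") {
		if n = strings.TrimSpace(n); n != "" {
			nodes = append(nodes, n)
		}
	}
	return nodes
}

func main() {
	os.Setenv("PVE_NODE_SELECTOR", "pve-node-1, pve-node-2")
	fmt.Println(parseNodeSelector(os.Getenv("PVE_NODE_SELECTOR"))) // [pve-node-1 pve-node-2]
}
```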

VM Configuration

VMClass Specification

Define CPU and memory resources:

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: small
spec:
  cpus: 2
  memory: "4Gi"
  # Proxmox-specific settings
  spec:
    machine: "q35"
    bios: "uefi"

VMImage Specification

Reference Proxmox templates:

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMImage
metadata:
  name: ubuntu-22
spec:
  source: "ubuntu-22-template"  # Template name in Proxmox
  # Or clone from existing VM:
  # source: "9000"  # VMID to clone from

VirtualMachine Example

apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: web-server
spec:
  providerRef:
    name: proxmox-prod
  classRef:
    name: small
  imageRef:
    name: ubuntu-22
  powerState: On
  networks:
    - name: lan
      # Maps to Proxmox bridge or VLAN configuration
  disks:
    - name: root
      size: "40Gi"
  userData:
    cloudInit:
      inline: |
        #cloud-config
        hostname: web-server
        users:
          - name: ubuntu
            ssh_authorized_keys:
              - "ssh-ed25519 AAAA..."
        packages:
          - nginx

Cloud-Init Integration

The provider automatically configures cloud-init for supported VMs:

Automatic Configuration

  • IDE2 Device: Attached as cloudinit drive
  • User Data: Rendered from VirtualMachine spec
  • Network Config: Generated from network specifications
  • SSH Keys: Extracted from userData or secrets

Static IP Configuration

Configure static IPs using cloud-init:

userData:
  cloudInit:
    inline: |
      #cloud-config
      write_files:
        - path: /etc/netplan/01-static.yaml
          content: |
            network:
              version: 2
              ethernets:
                ens18:
                  addresses: [192.168.1.100/24]
                  gateway4: 192.168.1.1
                  nameservers:
                    addresses: [8.8.8.8, 1.1.1.1]

Alternatively, omit the netplan file and rely on Proxmox's native IP configuration: the provider applies the staticIP fields of the network specifications internally when it builds the VM.

Guest Agent Integration (v0.2.3+)

The Proxmox provider now integrates with the QEMU Guest Agent for enhanced VM monitoring:

IP Address Detection

When a VM is running, the provider automatically queries the QEMU guest agent to retrieve accurate IP addresses:

# IP addresses are automatically populated in VM status
kubectl get vm my-vm -o yaml

status:
  phase: Running
  ipAddresses:
    - 192.168.1.100
    - fd00::1234:5678:9abc:def0

Features

  • Automatic IP Detection: Retrieves all network interface IPs from running VMs
  • IPv4 and IPv6 Support: Reports both address families
  • Smart Filtering: Excludes loopback (127.0.0.1, ::1) and link-local (169.254.x.x, fe80::) addresses
  • Real-time Updates: Information updated during Describe operations
  • Graceful Degradation: Falls back gracefully when guest agent is not available

Requirements

For guest agent integration to work, the VM must have:

  1. QEMU Guest Agent Installed:

    # Ubuntu/Debian
    apt-get install qemu-guest-agent
    
    # CentOS/RHEL
    yum install qemu-guest-agent
    
    # Enable and start the service
    systemctl enable --now qemu-guest-agent
    
  2. VM Configuration: Guest agent is automatically enabled during VM creation

Implementation Details

The provider:

  1. Checks if VM is in running state
  2. Makes API call to /api2/json/nodes/{node}/qemu/{vmid}/agent/network-get-interfaces
  3. Parses network interface details from guest agent response
  4. Filters out irrelevant addresses (loopback, link-local)
  5. Populates status.ipAddresses field

Troubleshooting

If IP addresses are not appearing:

  • Verify guest agent is installed: systemctl status qemu-guest-agent
  • Check Proxmox VM options: qm config <vmid> | grep agent
  • Ensure VM has network connectivity
  • Check provider logs for guest agent errors

Cloning Behavior

Linked Clones (Default)

Efficient space usage, faster creation:

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClone
metadata:
  name: web-clone
spec:
  sourceVMRef:
    name: template-vm
  linkedClone: true  # Default

Full Clones

Independent copies, slower creation:

spec:
  linkedClone: false

Snapshots

Create and manage VM snapshots:

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMSnapshot
metadata:
  name: before-upgrade
spec:
  vmRef:
    name: web-server
  description: "Snapshot before system upgrade"

Troubleshooting

Common Issues

Authentication Failures

Error: failed to connect to Proxmox VE: authentication failed

Solutions:

  • Verify API token permissions
  • Check token expiration
  • Ensure user has VM.* privileges

TLS Certificate Errors

Error: x509: certificate signed by unknown authority

Solutions:

  • Add custom CA certificate to credentials secret
  • Use PVE_INSECURE_SKIP_VERIFY=true for testing
  • Verify certificate chain

VM Creation Failures

Error: create VM failed with status 400: storage 'local-lvm' does not exist

Solutions:

  • Verify storage configuration in Proxmox
  • Check node availability
  • Ensure sufficient resources

Debug Logging

Enable debug logging for troubleshooting:

env:
  - name: LOG_LEVEL
    value: "debug"

Health Checks

Monitor provider health:

# Check provider pod logs
kubectl logs -n virtrigaud-system deployment/provider-proxmox

# Test connectivity
kubectl exec -n virtrigaud-system deployment/provider-proxmox -- \
  curl -k https://pve.example.com:8006/api2/json/version

Performance Considerations

Resource Allocation

For production environments:

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

Concurrent Operations

The provider handles concurrent VM operations efficiently but consider:

  • Node capacity limits
  • Storage I/O constraints
  • Network bandwidth

Task Polling

Task completion is polled every 2 seconds with a 5-minute timeout. These can be tuned via environment variables if needed.

Minimal Proxmox VE Permissions

Required API Token Permissions

Create an API token with these minimal privileges:

# Create user for VirtRigaud
pveum user add virtrigaud@pve --comment "VirtRigaud Provider"

# Create API token
pveum user token add virtrigaud@pve vrtg-token --privsep 1

# Grant minimal required permissions
pveum acl modify / --users virtrigaud@pve --roles PVEVMAdmin,PVEDatastoreUser

# Custom role with minimal permissions (alternative)
pveum role add VirtRigaud --privs "VM.Allocate,VM.Audit,VM.Config.CPU,VM.Config.Memory,VM.Config.Disk,VM.Config.Network,VM.Config.Options,VM.Monitor,VM.PowerMgmt,VM.Snapshot,VM.Clone,Datastore.Allocate,Datastore.AllocateSpace,Pool.Allocate"
pveum acl modify / --users virtrigaud@pve --roles VirtRigaud

Permission Details

| Permission | Usage | Required |
|---|---|---|
| VM.Allocate | Create new VMs | βœ… Core |
| VM.Audit | Read VM configuration | βœ… Core |
| VM.Config.* | Modify VM settings | βœ… Reconfigure |
| VM.Monitor | VM status monitoring | βœ… Core |
| VM.PowerMgmt | Power operations | βœ… Core |
| VM.Snapshot | Snapshot operations | ⚠️ Optional |
| VM.Clone | VM cloning | ⚠️ Optional |
| Datastore.Allocate | Create VM disks | βœ… Core |
| Pool.Allocate | Resource pool usage | ⚠️ Optional |

Token Rotation Procedure

# 1. Create new token
NEW_TOKEN=$(pveum user token add virtrigaud@pve vrtg-token-2 --privsep 1 --output-format json | jq -r '.value')

# 2. Update Kubernetes secret
kubectl patch secret pve-credentials -n virtrigaud-system --type='merge' -p='{"stringData":{"token_id":"virtrigaud@pve!vrtg-token-2","token_secret":"'$NEW_TOKEN'"}}'

# 3. Restart provider to use new token
kubectl rollout restart deployment provider-proxmox -n virtrigaud-system

# 4. Verify new token works
kubectl logs deployment/provider-proxmox -n virtrigaud-system

# 5. Remove old token
pveum user token remove virtrigaud@pve vrtg-token

NetworkPolicy Examples

Production NetworkPolicy

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: provider-proxmox-netpol
  namespace: virtrigaud-system
spec:
  podSelector:
    matchLabels:
      app: provider-proxmox
  policyTypes: [Ingress, Egress]

  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: virtrigaud-manager
    ports:
    - port: 9443
    - port: 8080

  egress:
  # DNS resolution
  - to: []
    ports:
    - port: 53
      protocol: UDP
    - port: 53
      protocol: TCP

  # Proxmox VE API
  - to:
    - ipBlock:
        cidr: 192.168.1.0/24  # Your PVE network
    ports:
    - port: 8006

Development NetworkPolicy

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: provider-proxmox-dev-netpol
  namespace: virtrigaud-system
spec:
  podSelector:
    matchLabels:
      app: provider-proxmox
      environment: development
  egress:
  - to: []  # Allow all egress for development

Storage and Placement

Storage Class Mapping

Configure storage placement for different workloads:

# High-performance storage
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: high-performance
spec:
  cpus: 8
  memory: "32Gi"
  storage:
    class: "nvme-storage"  # Maps to PVE storage
    type: "thin"           # Thin provisioning
    
# Standard storage
apiVersion: infra.virtrigaud.io/v1beta1  
kind: VMClass
metadata:
  name: standard
spec:
  cpus: 4
  memory: "8Gi"
  storage:
    class: "ssd-storage"
    type: "thick"          # Thick provisioning

Placement Policies

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMPlacementPolicy
metadata:
  name: production-placement
spec:
  nodeSelector:
    - "pve-node-1"
    - "pve-node-2"
  antiAffinity:
    - key: "vm.type"
      operator: "In"
      values: ["database"]
  constraints:
    maxVMsPerNode: 10
    minFreeMemory: "4Gi"

Performance Testing

Load Test Results

Performance benchmarks using virtrigaud-loadgen against fake PVE server:

| Operation | P50 Latency | P95 Latency | Throughput | Notes |
|---|---|---|---|---|
| Create VM | 2.3s | 4.1s | 12 ops/min | Including cloud-init |
| Power On | 800ms | 1.2s | 45 ops/min | Async operation |
| Power Off | 650ms | 1.1s | 50 ops/min | Graceful shutdown |
| Describe | 120ms | 200ms | 200 ops/min | Status query |
| Reconfigure CPU | 1.8s | 3.2s | 15 ops/min | Online hot-plug |
| Snapshot Create | 3.5s | 6.8s | 8 ops/min | With memory |
| Clone (Linked) | 1.9s | 3.4s | 12 ops/min | Fast COW clone |

Running Performance Tests

# Deploy fake PVE server for testing
kubectl apply -f test/performance/proxmox-loadtest.yaml

# Run performance test
kubectl create job proxmox-perf-test --from=cronjob/proxmox-performance-test

# View results
kubectl logs job/proxmox-perf-test -f

Security Best Practices

  1. Use API Tokens: Prefer API tokens over username/password
  2. Least Privilege: Grant minimal required permissions (see above)
  3. TLS Verification: Always verify certificates in production
  4. Secret Management: Use Kubernetes secrets with proper RBAC
  5. Network Policies: Restrict provider network access (see examples)
  6. Regular Rotation: Rotate API tokens quarterly
  7. Audit Logging: Enable PVE audit logs for provider actions
  8. Resource Quotas: Limit provider resource consumption

Examples

Multi-Node Setup

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: proxmox-cluster
spec:
  type: proxmox
  endpoint: https://pve-cluster.example.com:8006
  runtime:
    env:
      - name: PVE_NODE_SELECTOR
        value: "pve-1,pve-2,pve-3"

High-Availability Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: provider-proxmox
spec:
  replicas: 2
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: provider-proxmox
              topologyKey: kubernetes.io/hostname

Troubleshooting

Common Issues

❌ "endpoint is required" Error

Symptom: Provider pod crashes with ERROR Failed to create PVE client error="endpoint is required"

Cause: Missing or empty PVE_ENDPOINT environment variable

Solution:

# Ensure PVE_ENDPOINT is set in deployment
env:
  - name: PVE_ENDPOINT
    value: "https://your-proxmox.example.com:8006/api2"

❌ Connection Timeout/Refused

Symptom: Provider fails with connection timeouts or "connection refused"

Cause: Network connectivity issues or wrong endpoint URL

Solutions:

  1. Verify endpoint: Test from a pod in the cluster:

    kubectl run test-curl --rm -i --tty --image=curlimages/curl -- \
      curl -k https://your-proxmox.example.com:8006/api2/json/version
    
  2. Check firewall: Ensure port 8006 is accessible from Kubernetes cluster

  3. Verify URL format: Should be https://hostname:8006/api2 (note the /api2 path)
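These format checks can be automated before deploying; a minimal pre-flight sketch (validateEndpoint is a hypothetical helper, not part of the provider):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// validateEndpoint applies the sanity checks described above:
// https scheme, a host, and the /api2 path.
func validateEndpoint(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("invalid endpoint URL: %w", err)
	}
	if u.Scheme != "https" {
		return fmt.Errorf("endpoint must use https, got %q", u.Scheme)
	}
	if u.Host == "" {
		return fmt.Errorf("endpoint is missing a host")
	}
	if !strings.HasPrefix(u.Path, "/api2") {
		return fmt.Errorf("endpoint should include the /api2 path")
	}
	return nil
}

func main() {
	fmt.Println(validateEndpoint("https://pve.example.com:8006/api2")) // <nil>
	fmt.Println(validateEndpoint("http://pve.example.com:8006/api2"))  // scheme error
}
```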

❌ TLS Certificate Errors

Symptom: x509: certificate signed by unknown authority

Solutions:

  • Development: Set PVE_INSECURE_SKIP_VERIFY=true (not for production!)
  • Production: Provide valid TLS certificates or CA bundle

❌ Authentication Failures

Symptom: 401 Unauthorized or authentication failure

Solutions:

  1. Verify token permissions:

    # Test API token manually
    curl -k "https://pve.example.com:8006/api2/json/version" \
      -H "Authorization: PVEAPIToken=USER@REALM!TOKENID=SECRET"
    
  2. Check user privileges: Ensure user has VM management permissions

  3. Verify token format: Should be user@realm!tokenid (note the !)

❌ Provider Not Starting

Symptom: Pod in CrashLoopBackOff or 0/1 Ready

Diagnostic Steps:

# Check pod logs
kubectl logs -n virtrigaud-system deployment/virtrigaud-provider-proxmox

# Check environment variables
kubectl describe pod -n virtrigaud-system -l app.kubernetes.io/component=provider-proxmox

# Verify configuration
kubectl get secret proxmox-credentials -o yaml

Validation Commands

Test your Proxmox connection before deploying:

# 1. Test network connectivity
telnet your-proxmox.example.com 8006

# 2. Test API endpoint
curl -k https://your-proxmox.example.com:8006/api2/json/version

# 3. Test authentication
curl -k "https://your-proxmox.example.com:8006/api2/json/nodes" \
  -H "Authorization: PVEAPIToken=USER@REALM!TOKENID=SECRET"

# 4. Test from within cluster
kubectl run debug --rm -i --tty --image=curlimages/curl -- sh
# Then run curl commands from inside the pod

Debug Logging

Enable verbose logging for the provider:

providers:
  proxmox:
    env:
      - name: LOG_LEVEL
        value: "debug"
      - name: PVE_ENDPOINT
        value: "https://pve.example.com:8006/api2"

API Reference

For complete API reference, see the Provider API Documentation.

Contributing

To contribute to the Proxmox provider:

  1. See the Provider Development Guide
  2. Check the GitHub repository
  3. Review open issues

Support

Provider Developer Tutorial

This comprehensive tutorial walks you through creating a complete VirtRigaud provider from scratch. By the end, you'll have a fully functional provider that can create, manage, and delete virtual machines.

Prerequisites

Before starting this tutorial, ensure you have:

  • Go 1.23 or later installed
  • Docker installed for containerization
  • kubectl and a Kubernetes cluster (Kind/minikube for local development)
  • Helm 3.x installed
  • Basic understanding of gRPC and protobuf

Tutorial Overview

We'll build a File Provider that manages "virtual machines" as JSON files on disk. While not practical for production, this provider demonstrates all the core concepts without requiring actual hypervisor access.

What we'll build:

  • A complete provider implementation using the VirtRigaud SDK
  • Conformance tests that pass VCTS core profile
  • A Helm chart for deployment
  • CI/CD integration
  • Publication to the provider catalog

Step 1: Initialize Your Provider Project

1.1 Create Project Structure

# Create project directory
mkdir virtrigaud-provider-file
cd virtrigaud-provider-file

# Initialize the provider project
vrtg-provider init file

The vrtg-provider init command creates the following structure:

virtrigaud-provider-file/
β”œβ”€β”€ cmd/
β”‚   └── provider-file/
β”‚       β”œβ”€β”€ main.go
β”‚       └── Dockerfile
β”œβ”€β”€ internal/
β”‚   └── provider/
β”‚       β”œβ”€β”€ provider.go
β”‚       β”œβ”€β”€ capabilities.go
β”‚       └── provider_test.go
β”œβ”€β”€ charts/
β”‚   └── provider-file/
β”‚       β”œβ”€β”€ Chart.yaml
β”‚       β”œβ”€β”€ values.yaml
β”‚       └── templates/
β”œβ”€β”€ .github/
β”‚   └── workflows/
β”‚       └── ci.yml
β”œβ”€β”€ Makefile
β”œβ”€β”€ go.mod
β”œβ”€β”€ go.sum
β”œβ”€β”€ .gitignore
└── README.md

1.2 Examine Generated Files

main.go - Entry point that sets up the gRPC server:

package main

import (
    "log"
    
    "github.com/projectbeskar/virtrigaud/sdk/provider/server"
    "github.com/projectbeskar/virtrigaud/proto/rpc/provider/v1"
    "virtrigaud-provider-file/internal/provider"
)

func main() {
    // Create provider instance
    p, err := provider.New()
    if err != nil {
        log.Fatalf("Failed to create provider: %v", err)
    }
    
    // Configure server
    config := &server.Config{
        Port:        9443,
        HealthPort:  8080,
        EnableTLS:   false,
    }
    
    srv, err := server.New(config)
    if err != nil {
        log.Fatalf("Failed to create server: %v", err)
    }
    
    // Register provider service
    providerv1.RegisterProviderServiceServer(srv.GRPCServer(), p)
    
    // Start server
    log.Println("Starting file provider on port 9443...")
    if err := srv.Serve(); err != nil {
        log.Fatalf("Server failed: %v", err)
    }
}

go.mod - Module definition with SDK dependency:

module virtrigaud-provider-file

go 1.23

require (
    github.com/projectbeskar/virtrigaud/sdk v0.1.0
    github.com/projectbeskar/virtrigaud/proto v0.1.0
)

Step 2: Implement the Core Provider

2.1 Design the File Provider

Our file provider will:

  • Store VM metadata as JSON files in /var/lib/virtrigaud/vms/
  • Use filename as VM ID
  • Simulate power operations with state files
  • Support basic CRUD operations

2.2 Define the VM Model

Create internal/provider/vm.go:

package provider

import (
    "encoding/json"
    "fmt"
    "os"
    "path/filepath"
    "time"
    
    "github.com/projectbeskar/virtrigaud/proto/rpc/provider/v1"
)

type VirtualMachine struct {
    ID          string                 `json:"id"`
    Name        string                 `json:"name"`
    Spec        *providerv1.VMSpec     `json:"spec"`
    Status      *providerv1.VMStatus   `json:"status"`
    CreatedAt   time.Time              `json:"created_at"`
    UpdatedAt   time.Time              `json:"updated_at"`
}

type FileStore struct {
    baseDir string
}

func NewFileStore(baseDir string) *FileStore {
    return &FileStore{baseDir: baseDir}
}

func (fs *FileStore) Save(vm *VirtualMachine) error {
    if err := os.MkdirAll(fs.baseDir, 0755); err != nil {
        return fmt.Errorf("failed to create directory: %w", err)
    }
    
    vm.UpdatedAt = time.Now()
    data, err := json.MarshalIndent(vm, "", "  ")
    if err != nil {
        return fmt.Errorf("failed to marshal VM: %w", err)
    }
    
    filename := filepath.Join(fs.baseDir, vm.ID+".json")
    return os.WriteFile(filename, data, 0644)
}

func (fs *FileStore) Load(id string) (*VirtualMachine, error) {
    filename := filepath.Join(fs.baseDir, id+".json")
    data, err := os.ReadFile(filename)
    if err != nil {
        if os.IsNotExist(err) {
            return nil, fmt.Errorf("VM not found: %s", id)
        }
        return nil, fmt.Errorf("failed to read VM file: %w", err)
    }
    
    var vm VirtualMachine
    if err := json.Unmarshal(data, &vm); err != nil {
        return nil, fmt.Errorf("failed to unmarshal VM: %w", err)
    }
    
    return &vm, nil
}

func (fs *FileStore) Delete(id string) error {
    filename := filepath.Join(fs.baseDir, id+".json")
    if err := os.Remove(filename); err != nil && !os.IsNotExist(err) {
        return fmt.Errorf("failed to delete VM file: %w", err)
    }
    return nil
}

func (fs *FileStore) List() ([]*VirtualMachine, error) {
    files, err := os.ReadDir(fs.baseDir)
    if err != nil {
        if os.IsNotExist(err) {
            return []*VirtualMachine{}, nil
        }
        return nil, fmt.Errorf("failed to read directory: %w", err)
    }
    
    var vms []*VirtualMachine
    for _, file := range files {
        if !file.IsDir() && filepath.Ext(file.Name()) == ".json" {
            id := file.Name()[:len(file.Name())-5] // Remove .json extension
            vm, err := fs.Load(id)
            if err != nil {
                continue // Skip invalid files
            }
            vms = append(vms, vm)
        }
    }
    
    return vms, nil
}

2.3 Implement the Provider Interface

Update internal/provider/provider.go:

package provider

import (
    "context"
    "fmt"
    "os"
    "path/filepath"
    "time"
    
    "github.com/google/uuid"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
    
    "github.com/projectbeskar/virtrigaud/proto/rpc/provider/v1"
    "github.com/projectbeskar/virtrigaud/sdk/provider/capabilities"
    "github.com/projectbeskar/virtrigaud/sdk/provider/errors"
)

type Provider struct {
    store *FileStore
    caps  *capabilities.ProviderCapabilities
}

func New() (*Provider, error) {
    // Get storage directory from environment or use default
    baseDir := os.Getenv("PROVIDER_STORAGE_DIR")
    if baseDir == "" {
        baseDir = "/var/lib/virtrigaud/vms"
    }
    
    // Create capabilities
    caps := &capabilities.ProviderCapabilities{
        ProviderInfo: &providerv1.ProviderInfo{
            Name:        "file",
            Version:     "0.1.0",
            Description: "File-based virtual machine provider for development and testing",
        },
        SupportedCapabilities: []capabilities.Capability{
            capabilities.CapabilityCore,
            capabilities.CapabilitySnapshot,
            capabilities.CapabilityClone,
        },
    }
    
    return &Provider{
        store: NewFileStore(baseDir),
        caps:  caps,
    }, nil
}

// GetCapabilities returns provider capabilities
func (p *Provider) GetCapabilities(ctx context.Context, req *providerv1.GetCapabilitiesRequest) (*providerv1.GetCapabilitiesResponse, error) {
    return &providerv1.GetCapabilitiesResponse{
        ProviderId: "file-provider",
        Capabilities: []*providerv1.Capability{
            {
                Name:        "vm.create",
                Supported:   true,
                Description: "Create virtual machines",
            },
            {
                Name:        "vm.read",
                Supported:   true,
                Description: "Read virtual machine information",
            },
            {
                Name:        "vm.update",
                Supported:   true,
                Description: "Update virtual machine configuration",
            },
            {
                Name:        "vm.delete",
                Supported:   true,
                Description: "Delete virtual machines",
            },
            {
                Name:        "vm.power",
                Supported:   true,
                Description: "Control virtual machine power state",
            },
            {
                Name:        "vm.snapshot",
                Supported:   true,
                Description: "Create and manage VM snapshots",
            },
            {
                Name:        "vm.clone",
                Supported:   true,
                Description: "Clone virtual machines",
            },
        },
    }, nil
}

// CreateVM creates a new virtual machine
func (p *Provider) CreateVM(ctx context.Context, req *providerv1.CreateVMRequest) (*providerv1.CreateVMResponse, error) {
    // Validate request
    if req.Name == "" {
        return nil, errors.NewInvalidSpec("VM name is required")
    }
    
    if req.Spec == nil {
        return nil, errors.NewInvalidSpec("VM spec is required")
    }
    
    // Generate unique ID
    vmID := uuid.New().String()
    
    // Create VM object
    vm := &VirtualMachine{
        ID:   vmID,
        Name: req.Name,
        Spec: req.Spec,
        Status: &providerv1.VMStatus{
            State:   "Creating",
            Message: "VM is being created",
        },
        CreatedAt: time.Now(),
        UpdatedAt: time.Now(),
    }
    
    // Save to store
    if err := p.store.Save(vm); err != nil {
        return nil, status.Errorf(codes.Internal, "failed to save VM: %v", err)
    }
    
    // Simulate asynchronous creation (a real provider would guard the
    // store against concurrent access here)
    go func() {
        time.Sleep(2 * time.Second)
        vm.Status.State = "Running"
        vm.Status.Message = "VM is running"
        p.store.Save(vm)
    }()
    
    return &providerv1.CreateVMResponse{
        VmId:   vmID,
        Status: vm.Status,
    }, nil
}

// GetVM retrieves virtual machine information
func (p *Provider) GetVM(ctx context.Context, req *providerv1.GetVMRequest) (*providerv1.GetVMResponse, error) {
    if req.VmId == "" {
        return nil, errors.NewInvalidSpec("VM ID is required")
    }
    
    vm, err := p.store.Load(req.VmId)
    if err != nil {
        return nil, errors.NewNotFound("VM not found: %s", req.VmId)
    }
    
    return &providerv1.GetVMResponse{
        VmId:   vm.ID,
        Name:   vm.Name,
        Spec:   vm.Spec,
        Status: vm.Status,
    }, nil
}

// UpdateVM updates virtual machine configuration
func (p *Provider) UpdateVM(ctx context.Context, req *providerv1.UpdateVMRequest) (*providerv1.UpdateVMResponse, error) {
    if req.VmId == "" {
        return nil, errors.NewInvalidSpec("VM ID is required")
    }
    
    vm, err := p.store.Load(req.VmId)
    if err != nil {
        return nil, errors.NewNotFound("VM not found: %s", req.VmId)
    }
    
    // Update spec if provided
    if req.Spec != nil {
        vm.Spec = req.Spec
        vm.Status.Message = "VM configuration updated"
        
        if err := p.store.Save(vm); err != nil {
            return nil, status.Errorf(codes.Internal, "failed to save VM: %v", err)
        }
    }
    
    return &providerv1.UpdateVMResponse{
        Status: vm.Status,
    }, nil
}

// DeleteVM deletes a virtual machine
func (p *Provider) DeleteVM(ctx context.Context, req *providerv1.DeleteVMRequest) (*providerv1.DeleteVMResponse, error) {
    if req.VmId == "" {
        return nil, errors.NewInvalidSpec("VM ID is required")
    }
    
    // Check if VM exists
    _, err := p.store.Load(req.VmId)
    if err != nil {
        return nil, errors.NewNotFound("VM not found: %s", req.VmId)
    }
    
    // Delete VM
    if err := p.store.Delete(req.VmId); err != nil {
        return nil, status.Errorf(codes.Internal, "failed to delete VM: %v", err)
    }
    
    return &providerv1.DeleteVMResponse{
        Success: true,
        Message: "VM deleted successfully",
    }, nil
}

// PowerVM controls virtual machine power state
func (p *Provider) PowerVM(ctx context.Context, req *providerv1.PowerVMRequest) (*providerv1.PowerVMResponse, error) {
    if req.VmId == "" {
        return nil, errors.NewInvalidSpec("VM ID is required")
    }
    
    vm, err := p.store.Load(req.VmId)
    if err != nil {
        return nil, errors.NewNotFound("VM not found: %s", req.VmId)
    }
    
    // Update power state based on operation
    switch req.PowerOp {
    case providerv1.PowerOp_POWER_OP_ON:
        vm.Status.State = "Running"
        vm.Status.Message = "VM is running"
    case providerv1.PowerOp_POWER_OP_OFF:
        vm.Status.State = "Stopped"
        vm.Status.Message = "VM is stopped"
    case providerv1.PowerOp_POWER_OP_REBOOT:
        vm.Status.State = "Rebooting"
        vm.Status.Message = "VM is rebooting"
        // Simulate reboot
        go func() {
            time.Sleep(3 * time.Second)
            vm.Status.State = "Running"
            vm.Status.Message = "VM is running"
            p.store.Save(vm)
        }()
    default:
        return nil, errors.NewInvalidSpec("unsupported power operation: %v", req.PowerOp)
    }
    
    if err := p.store.Save(vm); err != nil {
        return nil, status.Errorf(codes.Internal, "failed to save VM: %v", err)
    }
    
    return &providerv1.PowerVMResponse{
        Status: vm.Status,
    }, nil
}

// ListVMs lists all virtual machines
func (p *Provider) ListVMs(ctx context.Context, req *providerv1.ListVMsRequest) (*providerv1.ListVMsResponse, error) {
    vms, err := p.store.List()
    if err != nil {
        return nil, status.Errorf(codes.Internal, "failed to list VMs: %v", err)
    }
    
    var vmInfos []*providerv1.VMInfo
    for _, vm := range vms {
        vmInfos = append(vmInfos, &providerv1.VMInfo{
            VmId:   vm.ID,
            Name:   vm.Name,
            Status: vm.Status,
        })
    }
    
    return &providerv1.ListVMsResponse{
        Vms: vmInfos,
    }, nil
}

// CreateSnapshot creates a VM snapshot
func (p *Provider) CreateSnapshot(ctx context.Context, req *providerv1.CreateSnapshotRequest) (*providerv1.CreateSnapshotResponse, error) {
    if req.VmId == "" {
        return nil, errors.NewInvalidSpec("VM ID is required")
    }
    
    vm, err := p.store.Load(req.VmId)
    if err != nil {
        return nil, errors.NewNotFound("VM not found: %s", req.VmId)
    }
    
    // Create snapshot (simulate by copying VM file)
    snapshotID := uuid.New().String()
    snapshotPath := filepath.Join(filepath.Dir(p.store.baseDir), "snapshots")
    
    if err := os.MkdirAll(snapshotPath, 0755); err != nil {
        return nil, status.Errorf(codes.Internal, "failed to create snapshot directory: %v", err)
    }
    
    // Copy VM data to snapshot
    snapshotVM := *vm
    snapshotVM.ID = snapshotID
    snapshotStore := NewFileStore(snapshotPath)
    
    if err := snapshotStore.Save(&snapshotVM); err != nil {
        return nil, status.Errorf(codes.Internal, "failed to save snapshot: %v", err)
    }
    
    return &providerv1.CreateSnapshotResponse{
        SnapshotId: snapshotID,
        Status: &providerv1.TaskStatus{
            State:   "Completed",
            Message: "Snapshot created successfully",
        },
    }, nil
}

// CloneVM clones a virtual machine
func (p *Provider) CloneVM(ctx context.Context, req *providerv1.CloneVMRequest) (*providerv1.CloneVMResponse, error) {
    if req.SourceVmId == "" {
        return nil, errors.NewInvalidSpec("Source VM ID is required")
    }
    
    if req.CloneName == "" {
        return nil, errors.NewInvalidSpec("Clone name is required")
    }
    
    // Load source VM
    sourceVM, err := p.store.Load(req.SourceVmId)
    if err != nil {
        return nil, errors.NewNotFound("Source VM not found: %s", req.SourceVmId)
    }
    
    // Create clone
    cloneID := uuid.New().String()
    cloneVM := &VirtualMachine{
        ID:   cloneID,
        Name: req.CloneName,
        Spec: sourceVM.Spec, // Copy spec from source
        Status: &providerv1.VMStatus{
            State:   "Stopped",
            Message: "Clone created successfully",
        },
        CreatedAt: time.Now(),
        UpdatedAt: time.Now(),
    }
    
    if err := p.store.Save(cloneVM); err != nil {
        return nil, status.Errorf(codes.Internal, "failed to save clone: %v", err)
    }
    
    return &providerv1.CloneVMResponse{
        CloneVmId: cloneID,
        Status: &providerv1.TaskStatus{
            State:   "Completed",
            Message: "VM cloned successfully",
        },
    }, nil
}

Step 3: Add Tests and Validation

3.1 Create Unit Tests

Create internal/provider/provider_test.go:

package provider

import (
    "context"
    "os"
    "path/filepath"
    "testing"
    "time"
    
    "github.com/stretchr/testify/assert"
    "github.com/stretchr/testify/require"
    
    "github.com/projectbeskar/virtrigaud/proto/rpc/provider/v1"
)

func TestProvider_CreateVM(t *testing.T) {
    // Create temporary directory for testing
    tmpDir, err := os.MkdirTemp("", "file-provider-test")
    require.NoError(t, err)
    defer os.RemoveAll(tmpDir)
    
    // Set storage directory
    os.Setenv("PROVIDER_STORAGE_DIR", tmpDir)
    defer os.Unsetenv("PROVIDER_STORAGE_DIR")
    
    // Create provider
    p, err := New()
    require.NoError(t, err)
    
    // Test VM creation
    req := &providerv1.CreateVMRequest{
        Name: "test-vm",
        Spec: &providerv1.VMSpec{
            Cpu:    2,
            Memory: 4096,
            Image:  "ubuntu:20.04",
        },
    }
    
    resp, err := p.CreateVM(context.Background(), req)
    require.NoError(t, err)
    assert.NotEmpty(t, resp.VmId)
    assert.Equal(t, "Creating", resp.Status.State)
    
    // Verify VM file was created
    vmFile := filepath.Join(tmpDir, resp.VmId+".json")
    assert.FileExists(t, vmFile)
}

func TestProvider_GetVM(t *testing.T) {
    tmpDir, err := os.MkdirTemp("", "file-provider-test")
    require.NoError(t, err)
    defer os.RemoveAll(tmpDir)
    
    os.Setenv("PROVIDER_STORAGE_DIR", tmpDir)
    defer os.Unsetenv("PROVIDER_STORAGE_DIR")
    
    p, err := New()
    require.NoError(t, err)
    
    // Create VM first
    createReq := &providerv1.CreateVMRequest{
        Name: "test-vm",
        Spec: &providerv1.VMSpec{
            Cpu:    2,
            Memory: 4096,
        },
    }
    
    createResp, err := p.CreateVM(context.Background(), createReq)
    require.NoError(t, err)
    
    // Get VM
    getReq := &providerv1.GetVMRequest{
        VmId: createResp.VmId,
    }
    
    getResp, err := p.GetVM(context.Background(), getReq)
    require.NoError(t, err)
    assert.Equal(t, createResp.VmId, getResp.VmId)
    assert.Equal(t, "test-vm", getResp.Name)
    assert.Equal(t, int32(2), getResp.Spec.Cpu)
}

func TestProvider_PowerVM(t *testing.T) {
    tmpDir, err := os.MkdirTemp("", "file-provider-test")
    require.NoError(t, err)
    defer os.RemoveAll(tmpDir)
    
    os.Setenv("PROVIDER_STORAGE_DIR", tmpDir)
    defer os.Unsetenv("PROVIDER_STORAGE_DIR")
    
    p, err := New()
    require.NoError(t, err)
    
    // Create VM
    createReq := &providerv1.CreateVMRequest{
        Name: "test-vm",
        Spec: &providerv1.VMSpec{Cpu: 1, Memory: 1024},
    }
    
    createResp, err := p.CreateVM(context.Background(), createReq)
    require.NoError(t, err)
    
    // Power off VM
    powerReq := &providerv1.PowerVMRequest{
        VmId:    createResp.VmId,
        PowerOp: providerv1.PowerOp_POWER_OP_OFF,
    }
    
    powerResp, err := p.PowerVM(context.Background(), powerReq)
    require.NoError(t, err)
    assert.Equal(t, "Stopped", powerResp.Status.State)
    
    // Power on VM
    powerReq.PowerOp = providerv1.PowerOp_POWER_OP_ON
    powerResp, err = p.PowerVM(context.Background(), powerReq)
    require.NoError(t, err)
    assert.Equal(t, "Running", powerResp.Status.State)
}

func TestProvider_GetCapabilities(t *testing.T) {
    p, err := New()
    require.NoError(t, err)
    
    req := &providerv1.GetCapabilitiesRequest{}
    resp, err := p.GetCapabilities(context.Background(), req)
    require.NoError(t, err)
    
    assert.Equal(t, "file-provider", resp.ProviderId)
    assert.NotEmpty(t, resp.Capabilities)
    
    // Check for core capabilities
    capNames := make(map[string]bool)
    for _, cap := range resp.Capabilities {
        capNames[cap.Name] = cap.Supported
    }
    
    assert.True(t, capNames["vm.create"])
    assert.True(t, capNames["vm.read"])
    assert.True(t, capNames["vm.delete"])
    assert.True(t, capNames["vm.power"])
}

func TestProvider_CloneVM(t *testing.T) {
    tmpDir, err := os.MkdirTemp("", "file-provider-test")
    require.NoError(t, err)
    defer os.RemoveAll(tmpDir)
    
    os.Setenv("PROVIDER_STORAGE_DIR", tmpDir)
    defer os.Unsetenv("PROVIDER_STORAGE_DIR")
    
    p, err := New()
    require.NoError(t, err)
    
    // Create source VM
    createReq := &providerv1.CreateVMRequest{
        Name: "source-vm",
        Spec: &providerv1.VMSpec{
            Cpu:    4,
            Memory: 8192,
            Image:  "centos:8",
        },
    }
    
    createResp, err := p.CreateVM(context.Background(), createReq)
    require.NoError(t, err)
    
    // Clone VM
    cloneReq := &providerv1.CloneVMRequest{
        SourceVmId: createResp.VmId,
        CloneName:  "cloned-vm",
    }
    
    cloneResp, err := p.CloneVM(context.Background(), cloneReq)
    require.NoError(t, err)
    assert.NotEmpty(t, cloneResp.CloneVmId)
    assert.NotEqual(t, createResp.VmId, cloneResp.CloneVmId)
    
    // Verify clone has same specs as source
    getReq := &providerv1.GetVMRequest{
        VmId: cloneResp.CloneVmId,
    }
    
    getResp, err := p.GetVM(context.Background(), getReq)
    require.NoError(t, err)
    assert.Equal(t, "cloned-vm", getResp.Name)
    assert.Equal(t, int32(4), getResp.Spec.Cpu)
    assert.Equal(t, int32(8192), getResp.Spec.Memory)
    assert.Equal(t, "centos:8", getResp.Spec.Image)
}

3.2 Add Build and Test Targets

Update the Makefile:

# File Provider Makefile

.PHONY: help build test lint clean run docker-build docker-push

help: ## Show this help message
	@echo 'Usage: make [target]'
	@echo ''
	@echo 'Targets:'
	@awk 'BEGIN {FS = ":.*?## "} /^[a-zA-Z_-]+:.*?## / {printf "  %-15s %s\n", $$1, $$2}' $(MAKEFILE_LIST)

build: ## Build the provider binary
	go build -o bin/provider-file ./cmd/provider-file

test: ## Run tests
	go test -v ./...

test-coverage: ## Run tests with coverage
	go test -v -coverprofile=coverage.out ./...
	go tool cover -html=coverage.out -o coverage.html

lint: ## Run linters
	golangci-lint run ./...

clean: ## Clean build artifacts
	rm -rf bin/
	rm -f coverage.out coverage.html

run: build ## Run the provider locally
	PROVIDER_STORAGE_DIR=/tmp/virtrigaud-file ./bin/provider-file

docker-build: ## Build Docker image
	docker build -f cmd/provider-file/Dockerfile -t provider-file:latest .

docker-push: docker-build ## Build and push Docker image
	docker tag provider-file:latest ghcr.io/yourorg/provider-file:latest
	docker push ghcr.io/yourorg/provider-file:latest

# Development targets
dev-setup: ## Set up development environment
	go mod download
	go install github.com/golangci/golangci-lint/cmd/golangci-lint@latest

integration-test: build ## Run integration tests
	./scripts/integration-test.sh

Step 4: Test with VCTS (VirtRigaud Conformance Test Suite)

4.1 Install VCTS

# Build VCTS from the main repository
go install github.com/projectbeskar/virtrigaud/cmd/vcts@latest

4.2 Create VCTS Configuration

Create vcts-config.yaml:

provider:
  name: "file"
  endpoint: "localhost:9443"
  tls: false
  
profiles:
  core:
    enabled: true
    vm_specs:
      - name: "basic"
        cpu: 1
        memory: 1024
        image: "test:latest"
      - name: "medium"
        cpu: 2
        memory: 4096
        image: "ubuntu:20.04"
        
  snapshot:
    enabled: true
    
  clone:
    enabled: true

tests:
  timeout: "30s"
  parallel: false
  cleanup: true

4.3 Run Conformance Tests

# Start the provider
make run &
PROVIDER_PID=$!

# Wait for provider to start
sleep 3

# Run VCTS core profile
vcts run --config vcts-config.yaml --profile core

# Run all enabled profiles
vcts run --config vcts-config.yaml --profile all

# Stop the provider
kill $PROVIDER_PID

Expected output:

✅ Core Profile Tests
  ✅ Provider.GetCapabilities
  ✅ Provider.CreateVM
  ✅ Provider.GetVM
  ✅ Provider.UpdateVM
  ✅ Provider.DeleteVM
  ✅ Provider.PowerVM
  ✅ Provider.ListVMs

✅ Snapshot Profile Tests
  ✅ Provider.CreateSnapshot

✅ Clone Profile Tests
  ✅ Provider.CloneVM

🎉 All tests passed! Provider is conformant.

Step 5: Create Helm Chart for Deployment

5.1 Chart Structure

The generated chart in charts/provider-file/ includes:

charts/provider-file/
├── Chart.yaml
├── values.yaml
├── templates/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── serviceaccount.yaml
│   ├── rbac.yaml
│   └── _helpers.tpl
└── examples/
    └── values-development.yaml

5.2 Customize Chart Values

Update charts/provider-file/values.yaml:

# Default values for provider-file

replicaCount: 1

image:
  repository: ghcr.io/yourorg/provider-file
  pullPolicy: IfNotPresent
  tag: "0.1.0"

nameOverride: ""
fullnameOverride: ""

serviceAccount:
  create: true
  annotations: {}
  name: ""

podAnnotations: {}

podSecurityContext:
  fsGroup: 2000
  runAsNonRoot: true
  runAsUser: 1000

securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
    - ALL
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000

service:
  type: ClusterIP
  port: 9443
  healthPort: 8080

resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 100m
    memory: 128Mi

nodeSelector: {}

tolerations: []

affinity: {}

# Provider-specific configuration
provider:
  storageDir: "/var/lib/virtrigaud/vms"
  logLevel: "info"

# Persistent storage for VM data
persistence:
  enabled: true
  accessMode: ReadWriteOnce
  size: 10Gi
  storageClass: ""

5.3 Test Helm Chart

# Lint the chart
helm lint charts/provider-file/

# Template the chart
helm template provider-file charts/provider-file/ \
  --values charts/provider-file/values.yaml

# Install to local cluster
helm install provider-file charts/provider-file/ \
  --namespace provider-file \
  --create-namespace \
  --values charts/provider-file/examples/values-development.yaml

Step 6: Set Up CI/CD

6.1 GitHub Actions Workflow

The generated .github/workflows/ci.yml includes:

name: CI

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main, develop ]

env:
  GO_VERSION: '1.23'

jobs:
  test:
    name: Test
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    
    - name: Set up Go
      uses: actions/setup-go@v4
      with:
        go-version: ${{ env.GO_VERSION }}
    
    - name: Run tests
      run: make test

    - name: Run linting
      run: make lint

  build:
    name: Build
    runs-on: ubuntu-latest
    needs: test
    steps:
    - uses: actions/checkout@v4
    
    - name: Set up Go
      uses: actions/setup-go@v4
      with:
        go-version: ${{ env.GO_VERSION }}
    
    - name: Build binary
      run: make build

    - name: Build Docker image
      run: make docker-build

  conformance:
    name: Conformance Tests
    runs-on: ubuntu-latest
    needs: build
    steps:
    - uses: actions/checkout@v4
    
    - name: Set up Go
      uses: actions/setup-go@v4
      with:
        go-version: ${{ env.GO_VERSION }}
    
    - name: Build provider
      run: make build

    - name: Install VCTS
      run: go install github.com/projectbeskar/virtrigaud/cmd/vcts@latest

    - name: Run conformance tests
      run: |
        # Start provider in background
        PROVIDER_STORAGE_DIR=/tmp/vcts-test ./bin/provider-file &
        PROVIDER_PID=$!
        
        # Wait for startup
        sleep 5
        
        # Run VCTS
        vcts run --config vcts-config.yaml --profile core
        
        # Clean up
        kill $PROVIDER_PID

  release:
    name: Release
    runs-on: ubuntu-latest
    needs: [test, build, conformance]
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    steps:
    - uses: actions/checkout@v4
    
    - name: Build and push Docker image
      run: |
        echo ${{ secrets.GITHUB_TOKEN }} | docker login ghcr.io -u ${{ github.actor }} --password-stdin
        make docker-push

    - name: Package Helm chart
      run: |
        helm package charts/provider-file/ -d dist/
        
    - name: Upload artifacts
      uses: actions/upload-artifact@v4
      with:
        name: release-artifacts
        path: |
          bin/
          dist/

Step 7: Publish to Provider Catalog

7.1 Run Provider Verification

# Verify the provider meets all requirements
vrtg-provider verify --profile all

7.2 Publish to Catalog

# Publish to the VirtRigaud provider catalog
vrtg-provider publish \
  --name file \
  --image ghcr.io/yourorg/provider-file \
  --tag 0.1.0 \
  --repo https://github.com/yourorg/virtrigaud-provider-file \
  --maintainer your-email@example.com \
  --license Apache-2.0

This command will:

  1. Run VCTS conformance tests
  2. Generate a provider badge
  3. Create a catalog entry
  4. Open a pull request to the main VirtRigaud repository

7.3 Example Catalog Entry

The generated catalog entry will look like:

- name: file
  displayName: "File Provider"
  description: "File-based virtual machine provider for development and testing"
  repo: "https://github.com/yourorg/virtrigaud-provider-file"
  image: "ghcr.io/yourorg/provider-file"
  tag: "0.1.0"
  capabilities:
    - core
    - snapshot
    - clone
  conformance:
    profiles:
      core: pass
      snapshot: pass
      clone: pass
      image-prepare: skip
      advanced: skip
    report_url: "https://github.com/yourorg/virtrigaud-provider-file/actions"
    badge_url: "https://img.shields.io/badge/conformance-pass-green"
    last_tested: "2025-08-26T15:00:00Z"
  maintainer: "your-email@example.com"
  license: "Apache-2.0"
  maturity: "beta"
  tags:
    - file
    - development
    - testing
  documentation: "https://github.com/yourorg/virtrigaud-provider-file/blob/main/README.md"

Step 8: Production Considerations

8.1 Security Hardening

# Production values.yaml
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
    - ALL
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 65534

podSecurityContext:
  fsGroup: 65534
  runAsNonRoot: true
  runAsUser: 65534
  seccompProfile:
    type: RuntimeDefault

networkPolicy:
  enabled: true
  ingress:
    fromNamespaces:
      - virtrigaud-system
  egress:
    - to: []
      ports:
        - protocol: UDP
          port: 53

8.2 Observability

Add monitoring and logging:

// Add to provider.go
import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    vmOperations = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "file_provider_vm_operations_total",
            Help: "Total number of VM operations",
        },
        []string{"operation", "status"},
    )
    
    vmOperationDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "file_provider_vm_operation_duration_seconds",
            Help: "Duration of VM operations",
        },
        []string{"operation"},
    )
)

func (p *Provider) CreateVM(ctx context.Context, req *providerv1.CreateVMRequest) (*providerv1.CreateVMResponse, error) {
    start := time.Now()
    defer func() {
        vmOperationDuration.WithLabelValues("create").Observe(time.Since(start).Seconds())
    }()
    
    // ... existing implementation ...
    
    vmOperations.WithLabelValues("create", "success").Inc()
    return resp, nil
}

8.3 Performance Optimization

  • Add connection pooling for gRPC clients
  • Implement caching for frequently accessed VMs
  • Use background workers for long-running operations
  • Add rate limiting and request validation

8.4 Error Handling and Resilience

  • Implement circuit breakers for external dependencies
  • Add retry logic with exponential backoff
  • Use structured logging with correlation IDs
  • Implement graceful shutdown handling

Conclusion

You've successfully created a complete VirtRigaud provider! This tutorial covered:

✅ Provider Implementation - Full gRPC service with all core operations
✅ SDK Integration - Using the VirtRigaud SDK for server setup and utilities
✅ Testing - Unit tests and VCTS conformance validation
✅ Containerization - Docker images and Helm charts
✅ CI/CD - Automated testing and publishing
✅ Catalog Integration - Publishing to the provider ecosystem

Next Steps

  1. Explore Advanced Features:

    • Add image management capabilities
    • Implement networking configuration
    • Add storage volume management
  2. Integration Examples:

    • Connect to real hypervisors (libvirt, vSphere, etc.)
    • Add authentication and authorization
    • Implement backup and disaster recovery
  3. Community Contribution:

    • Submit your provider to the catalog
    • Contribute improvements to the SDK
    • Help other developers with provider development
  4. Production Deployment:

    • Set up monitoring and alerting
    • Implement proper security measures
    • Plan for scaling and high availability

For more information, visit the VirtRigaud documentation or join our community discussions.

Versioning & Breaking Changes

This document outlines VirtRigaud's approach to versioning, compatibility, and managing breaking changes across the provider ecosystem.

Overview

VirtRigaud follows semantic versioning (SemVer) principles and maintains backward compatibility through careful API design and migration strategies. The system has multiple versioning dimensions:

  • VirtRigaud Core - The main platform (API server, manager, CRDs)
  • Provider SDK - Go SDK for building providers
  • Proto Contracts - gRPC/protobuf API definitions
  • Individual Providers - Each provider has independent versioning

Semantic Versioning

All VirtRigaud components follow Semantic Versioning 2.0.0:

Version Format: MAJOR.MINOR.PATCH

  • MAJOR (X.0.0): Breaking changes that require user action
  • MINOR (0.X.0): New features that are backward compatible
  • PATCH (0.0.X): Bug fixes and security updates

Examples

1.0.0 → 1.0.1  # Patch: Bug fixes only
1.0.1 → 1.1.0  # Minor: New features, backward compatible
1.1.0 → 2.0.0  # Major: Breaking changes

Component Versioning Strategy

VirtRigaud Core APIs

Kubernetes-style API versioning with multiple supported versions:

# Supported API versions
apiVersion: infra.virtrigaud.io/v1alpha1  # Development/preview
apiVersion: infra.virtrigaud.io/v1beta1   # Pre-release/testing
apiVersion: infra.virtrigaud.io/v1        # Stable/production

Stability Levels:

  • Alpha (v1alpha1): Experimental, may change or be removed
  • Beta (v1beta1): Well-tested, minimal changes expected
  • Stable (v1): Production-ready, strong backward compatibility

Support Windows:

  • Alpha: Best effort, no guarantees
  • Beta: Supported for 2 minor releases after stable equivalent
  • Stable: Supported for 12 months after deprecation

Provider SDK Versioning

SDK versions are independent of core VirtRigaud versions:

// Go module versioning
module github.com/projectbeskar/virtrigaud/sdk

// Version tags
sdk/v0.1.0    # Initial release
sdk/v0.2.0    # New features
sdk/v1.0.0    # First stable release
sdk/v2.0.0    # Breaking changes (new module path: sdk/v2)

SDK Compatibility Matrix:

SDK Version | VirtRigaud Core | Go Version | Status
v0.1.x      | 0.1.0 - 0.2.x   | 1.23+      | Beta
v1.0.x      | 0.2.0 - 1.0.x   | 1.23+      | Stable
v1.1.x      | 0.3.0 - 1.1.x   | 1.23+      | Stable
v2.0.x      | 1.0.0+          | 1.24+      | Future

Proto Contract Versioning

Protobuf APIs use both module versions and service versions:

// Service versioning in proto files
package provider.v1;
service ProviderService {
  // API methods
}

// Module versioning
module github.com/projectbeskar/virtrigaud/proto

Proto Evolution Rules:

  • ✅ Add new fields (with proper defaults)
  • ✅ Add new RPC methods
  • ✅ Add new enum values
  • ❌ Remove fields or methods
  • ❌ Change field types or semantics
  • ❌ Remove enum values
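In practice these rules mean messages evolve additively: new fields get fresh field numbers, and retired numbers are reserved rather than reused. The message and field names below are illustrative, not the actual VirtRigaud contract:

```protobuf
// Safe, additive evolution of an existing message.
message CreateVMRequest {
  string name = 1;
  VMSpec spec = 2;

  // NEW in a minor release: optional hint; absence means "no preference",
  // so old clients are unaffected.
  string placement_hint = 3;

  // If a field must go away, reserve its number and name instead of
  // deleting it, so they can never be reused with different semantics.
  reserved 4;
  reserved "legacy_flag";
}
```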

Provider Versioning

Each provider maintains independent versioning:

# Provider catalog entry
name: vsphere
tag: "1.2.3"      # Provider version
sdk_version: "v1.0.0"  # SDK dependency
proto_version: "v0.1.0"  # Proto dependency

Breaking Change Policy

What Constitutes a Breaking Change

API Breaking Changes:

  • Removing or renaming API fields
  • Changing field types or semantics
  • Removing API endpoints or methods
  • Changing required vs optional fields
  • Modifying default behaviors
  • Changing error codes or messages that clients depend on

SDK Breaking Changes:

  • Removing public functions, types, or methods
  • Changing function signatures
  • Modifying struct fields (without proper backward compatibility)
  • Changing package import paths
  • Removing or renaming configuration options
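The compatible alternative to most of these is to add new surface beside the old: keep the exported signature intact as a thin wrapper and introduce a richer entry point alongside it. A sketch with hypothetical names:

```go
package main

import "fmt"

// Options is a hypothetical settings struct. Adding fields to it is
// backward compatible because the zero value preserves old behavior.
type Options struct {
	Retries int // NEW field: 0 means "no retries", matching old behavior
}

// CreateVM is the original exported function; its signature must not change.
func CreateVM(name string) string {
	return CreateVMWithOptions(name, Options{})
}

// CreateVMWithOptions is the new entry point, added alongside the old
// one rather than replacing it.
func CreateVMWithOptions(name string, opts Options) string {
	return fmt.Sprintf("vm=%s retries=%d", name, opts.Retries)
}

func main() {
	fmt.Println(CreateVM("demo"))                                 // old callers unaffected
	fmt.Println(CreateVMWithOptions("demo", Options{Retries: 3})) // new behavior opt-in
}
```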

Proto Breaking Changes:

  • Removing fields or RPC methods
  • Changing field numbers or types
  • Removing enum values
  • Modifying service or method names

Breaking Change Process

1. Proposal Phase

# Breaking Change Proposal: [Title]

## Summary
Brief description of the change and motivation.

## Motivation  
Why is this change necessary? What problems does it solve?

## Proposed Changes
Detailed description of the changes.

## Migration Path
How will users migrate from old to new behavior?

## Timeline
- Deprecation announcement: v1.1.0
- Breaking change implementation: v2.0.0
- Legacy support removal: v3.0.0

## Alternatives Considered
What other approaches were considered?

2. Deprecation Phase

// Deprecated functions include clear migration guidance
// Deprecated: Use NewCreateVMRequest instead. Will be removed in v2.0.0.
func CreateVM(name string) *VMRequest {
    return &VMRequest{Name: name}
}

// New recommended approach
func NewCreateVMRequest(spec *VMSpec) *CreateVMRequest {
    return &CreateVMRequest{Spec: spec}
}

3. Migration Tools

# Migration command examples
vrtg-provider migrate --from v1 --to v2
vrtg-provider check-compatibility --target-version v2.0.0

4. Communication

  • Release notes with migration guide
  • Blog posts for major changes
  • Community discussions and Q&A
  • Updated documentation

Compatibility Testing

Automated Compatibility Checks

# .github/workflows/compatibility.yml
name: Compatibility Check

jobs:
  compatibility-matrix:
    strategy:
      matrix:
        sdk_version: [v1.0.0, v1.1.0, current]
        provider_version: [v1.0.0, v1.1.0, current]
    
    steps:
    - name: Test SDK ${{ matrix.sdk_version }} with Provider ${{ matrix.provider_version }}
      run: |
        # Build provider with specific SDK version
        # Run conformance tests
        # Report compatibility results

Buf Proto Compatibility

# proto/buf.yaml
version: v1
breaking:
  use:
    # Prevent breaking changes
    - FILE_NO_DELETE
    - FIELD_NO_DELETE
    - FIELD_SAME_TYPE
    - ENUM_VALUE_NO_DELETE
    - RPC_NO_DELETE
    - SERVICE_NO_DELETE
  ignore:
    # Allowed changes during alpha/beta
    - "provider/v1alpha1"

# Check for breaking changes
buf breaking --against 'https://github.com/projectbeskar/virtrigaud.git#branch=main'

Provider Compatibility Testing

# Test provider against multiple VirtRigaud versions
vcts run --provider ./provider --virtrigaud-version 0.1.0
vcts run --provider ./provider --virtrigaud-version 0.2.0
vcts run --provider ./provider --virtrigaud-version 1.0.0

Migration Strategies

API Version Migration

Example: VirtualMachine v1alpha1 → v1beta1

// Conversion webhook approach
func (src *v1alpha1.VirtualMachine) ConvertTo(dst *v1beta1.VirtualMachine) error {
    // Convert common fields
    dst.ObjectMeta = src.ObjectMeta
    
    // Handle field migrations
    if src.Spec.PowerState == "On" {
        dst.Spec.PowerState = v1beta1.PowerStateOn
    }
    
    // Set new fields with appropriate defaults
    if dst.Spec.Phase == "" {
        dst.Spec.Phase = v1beta1.PhaseUnknown
    }
    
    return nil
}

Gradual Migration Process

# Phase 1: Dual support (both versions work)
kubectl apply -f vm-v1alpha1.yaml  # Still works
kubectl apply -f vm-v1beta1.yaml   # Also works

# Phase 2: Deprecation warning
kubectl apply -f vm-v1alpha1.yaml
# Warning: v1alpha1 is deprecated, use v1beta1

# Phase 3: Conversion only (internal storage uses v1beta1)
kubectl apply -f vm-v1alpha1.yaml  # Automatically converted

# Phase 4: Removal (after support window)
kubectl apply -f vm-v1alpha1.yaml  # Error: version not supported

Provider SDK Migration

Example: SDK v1 → v2

SDK v1 (deprecated):

// Old SDK pattern
func NewProvider(config Config) *Provider {
    return &Provider{config: config}
}

func (p *Provider) CreateVM(name string, cpu int, memory int) error {
    // Implementation
}

SDK v2 (new):

// New SDK pattern with better types
func NewProvider(config *Config) (*Provider, error) {
    if err := config.Validate(); err != nil {
        return nil, err
    }
    return &Provider{config: config}, nil
}

func (p *Provider) CreateVM(ctx context.Context, req *CreateVMRequest) (*CreateVMResponse, error) {
    // Implementation with proper context and structured types
}

Migration Bridge:

// sdk/v2/compat/v1.go - Compatibility layer
package compat

import (
    v1 "github.com/projectbeskar/virtrigaud/sdk/provider"
    v2 "github.com/projectbeskar/virtrigaud/sdk/v2/provider"
)

// Bridge for gradual migration
func AdaptV1Provider(v1Provider v1.Provider) v2.Provider {
    return &v1ProviderAdapter{old: v1Provider}
}

type v1ProviderAdapter struct {
    old v1.Provider
}

func (a *v1ProviderAdapter) CreateVM(ctx context.Context, req *v2.CreateVMRequest) (*v2.CreateVMResponse, error) {
    // Convert v2 request to v1 format
    err := a.old.CreateVM(req.Name, int(req.Spec.CPU), int(req.Spec.Memory))
    
    // Convert v1 response to v2 format
    if err != nil {
        return nil, err
    }
    
    return &v2.CreateVMResponse{
        Status: "Created",
    }, nil
}

Configuration Migration

Example: Configuration Schema Changes

v1 Configuration:

# provider-config-v1.yaml
provider:
  type: "vsphere"
  server: "vcenter.example.com"
  username: "admin"
  password: "secret"

v2 Configuration:

# provider-config-v2.yaml
apiVersion: config.virtrigaud.io/v2
kind: ProviderConfig
metadata:
  name: vsphere-config
spec:
  type: "vsphere"
  connection:
    endpoint: "vcenter.example.com"
    authentication:
      method: "basic"
      secretRef:
        name: "vsphere-credentials"
  features:
    snapshots: true
    cloning: true

Migration Command:

# Automatic migration tool
vrtg-provider config migrate \
  --from provider-config-v1.yaml \
  --to provider-config-v2.yaml \
  --create-secret vsphere-credentials

Release Planning

Release Cadence

  • Patch releases: As needed for critical bugs/security
  • Minor releases: Every 2-3 months
  • Major releases: Every 12-18 months

Feature Lifecycle

Experimental β†’ Alpha β†’ Beta β†’ Stable β†’ Deprecated β†’ Removed
     |          |       |       |         |          |
     |          |       |       |         |          +-- After support window
     |          |       |       |         +-- 2 releases notice
     |          |       |       +-- Production ready
     |          |       +-- Pre-release testing
     |          +-- Public preview
     +-- Internal/development only

Release Branch Strategy

main                    # Current development
β”œβ”€β”€ release-0.1        # Patch releases for v0.1.x
β”œβ”€β”€ release-0.2        # Patch releases for v0.2.x
└── release-1.0        # Patch releases for v1.0.x

Support Matrix

Version  Status      Support Level  End of Life
1.0.x    Stable      Full support   2026-01-01
0.2.x    Stable      Security only  2025-06-01
0.1.x    Deprecated  None           2025-01-01

Best Practices

For Provider Developers

  1. Version Dependencies Carefully

    // Use specific versions, not floating
    require github.com/projectbeskar/virtrigaud/sdk v1.2.3
    
  2. Test Compatibility Early

    # Test against multiple SDK versions
    go mod edit -require=github.com/projectbeskar/virtrigaud/sdk@v1.1.0
    go test ./...
    go mod edit -require=github.com/projectbeskar/virtrigaud/sdk@v1.2.0
    go test ./...
    
  3. Handle Deprecations Gracefully

    // Check for deprecated features
    if provider.IsDeprecated("vm.legacy-create") {
        log.Warn("Using deprecated API, migrate to vm.create")
    }
    
  4. Document Breaking Changes

    # CHANGELOG.md
    ## [2.0.0] - 2025-01-15
    ### BREAKING CHANGES
    - Removed deprecated `CreateVM` method, use `CreateVMRequest` instead
    - Changed configuration format, see migration guide
    
    ### Migration Guide
    Old: `provider.CreateVM("vm1", 2, 4096)`
    New: `provider.CreateVM(ctx, &CreateVMRequest{...})`
    

For Users

  1. Pin Versions in Production

    # Helm values
    image:
      tag: "1.2.3"  # Not "latest"
    
  2. Test Upgrades in Staging

    # Upgrade strategy
    helm upgrade provider-test virtrigaud/provider \
      --version 1.3.0 \
      --namespace staging
    
  3. Monitor Deprecation Warnings

    # Check for deprecation warnings
    kubectl logs -l app=provider | grep -i deprecat
    
  4. Plan Migration Windows

    # Schedule upgrades during maintenance windows
    # Have rollback plans ready
    # Test compatibility thoroughly
    

Future Considerations

Long-term Compatibility

  • 10-year Support Goal: Core APIs should remain usable for 10 years
  • Gradual Evolution: Prefer gradual evolution over revolutionary changes
  • Ecosystem Stability: Consider impact on the entire provider ecosystem

Emerging Standards

  • OCI Compliance: Align with OCI runtime and image standards
  • CNCF Integration: Follow CNCF project graduation requirements
  • Industry Standards: Adopt relevant industry standards as they emerge

Technology Evolution

  • Go Version Support: Support 2-3 latest Go versions
  • Kubernetes Compatibility: Support 3-4 latest Kubernetes versions
  • gRPC Evolution: Adapt to gRPC and protobuf improvements

This versioning strategy ensures VirtRigaud can evolve while maintaining stability and compatibility for the provider ecosystem.

Advanced VM Lifecycle Management

This document describes the advanced VM lifecycle features in VirtRigaud, including reconfiguration, snapshots, cloning, multi-VM sets, and placement policies.

Overview

VirtRigaud Stage E introduces comprehensive VM lifecycle management capabilities that go beyond basic create/delete operations:

  • VM Reconfiguration: Modify CPU, memory, and disk resources of running VMs
  • Snapshot Management: Create, delete, and revert VM snapshots
  • VM Cloning: Create new VMs from existing ones with linked clone support
  • Multi-VM Sets: Manage groups of VMs with rolling updates
  • Placement Policies: Advanced placement rules and anti-affinity constraints
  • Image Preparation: Automated image import and preparation workflows

VM Reconfiguration

Online vs Offline Reconfiguration

VirtRigaud supports both online (hot) and offline reconfiguration depending on provider capabilities:

  β€’ vSphere: Supports online CPU/memory changes and hot disk expansion
  β€’ Libvirt: Typically requires a power cycle for resource changes

Example: CPU/Memory Upgrade

# Original VM with 2 CPU, 4GB RAM
apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: web-server
spec:
  resources:
    cpu: 2
    memoryMiB: 4096

# Patch to upgrade resources
# kubectl patch vm web-server --type merge -p '{"spec":{"resources":{"cpu":4,"memoryMiB":8192}}}'

The controller will:

  1. Detect resource changes in VM spec
  2. Attempt online reconfiguration if supported
  3. If offline required, orchestrate graceful power cycle:
    • Set condition ReconfigurePendingPowerCycle=True
    • Power off VM gracefully
    • Apply reconfiguration
    • Power on VM
    • Update status.lastReconfigureTime

Disk Expansion

spec:
  disks:
    - name: data
      sizeGiB: 100  # Expanded from 50 GiB
      expandPolicy: "Online"  # Try online first

Snapshot Management

Creating Snapshots

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMSnapshot
metadata:
  name: pre-maintenance-backup
spec:
  vmRef:
    name: web-server
  nameHint: "maintenance-backup"
  memory: true  # Include memory state
  description: "Backup before maintenance"
  retentionPolicy:
    maxAge: "7d"
    deleteOnVMDelete: true

Snapshot Lifecycle

  1. Creating: Snapshot creation in progress
  2. Ready: Snapshot available for use
  3. Deleting: Snapshot being removed
  4. Failed: Snapshot operation failed

Reverting to Snapshots

# Patch VM to revert to snapshot
spec:
  snapshot:
    revertToRef:
      name: pre-maintenance-backup

The controller will:

  1. Power off VM if running
  2. Call provider’s SnapshotRevert RPC
  3. Power on VM
  4. Clear revertToRef when complete

VM Cloning

Basic Cloning

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClone
metadata:
  name: web-server-clone
spec:
  sourceRef:
    name: web-server
  target:
    name: web-server-test
    classRef:
      name: test-class
  linked: true  # Faster, space-efficient
  powerOn: true

Clone Customization

spec:
  customization:
    hostname: web-server-test
    networks:
      - name: primary
        ipAddress: "192.168.1.100"
        gateway: "192.168.1.1"
        dns: ["8.8.8.8"]
    userData:
      cloudInit:
        inline: |
          #cloud-config
          runcmd:
            - echo "Test environment" > /etc/motd

Multi-VM Sets (VMSet)

VMSets provide declarative management of multiple VMs with rolling updates.

Basic VMSet

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMSet
metadata:
  name: web-tier
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-server
  template:
    metadata:
      labels:
        app: web-server
    spec:
      providerRef:
        name: vsphere-prod
      classRef:
        name: web-class
      imageRef:
        name: nginx-image
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1

Rolling Updates

When you update the template spec, VMSet will:

  1. Create new VMs with updated configuration
  2. Wait for new VMs to be ready
  3. Delete old VMs respecting maxUnavailable
  4. Continue until all replicas are updated
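The per-pass bookkeeping behind those steps can be sketched as follows; the types are illustrative, not the controller's actual implementation:

```go
package main

import "fmt"

// rolloutState is an illustrative snapshot of a VMSet rollout.
type rolloutState struct {
    replicas       int // desired replica count
    readyOld       int // old-template VMs that are Ready
    readyNew       int // new-template VMs that are Ready
    maxUnavailable int
    maxSurge       int
}

// step returns how many new VMs to create and how many old VMs to delete in
// one reconcile pass while honoring maxSurge and maxUnavailable.
func step(s rolloutState) (create, del int) {
    total := s.readyOld + s.readyNew
    // Surge: up to replicas+maxSurge VMs may exist at once.
    if want := s.replicas + s.maxSurge - total; want > 0 {
        create = minInt(want, s.replicas-s.readyNew)
    }
    // Availability: keep at least replicas-maxUnavailable VMs Ready.
    if spare := total - (s.replicas - s.maxUnavailable); spare > 0 {
        del = minInt(spare, s.readyOld)
    }
    return create, del
}

func minInt(a, b int) int {
    if a < b {
        return a
    }
    return b
}

func main() {
    // 3 replicas, all on the old template, maxUnavailable=1, maxSurge=1
    // (as in the web-tier manifest above).
    c, d := step(rolloutState{replicas: 3, readyOld: 3, maxUnavailable: 1, maxSurge: 1})
    fmt.Println(c, d)
}
```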

Placement Policies

Advanced Placement Rules

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMPlacementPolicy
metadata:
  name: production-policy
spec:
  hard:
    clusters: ["prod-cluster-1", "prod-cluster-2"]
    datastores: ["ssd-datastore-1", "ssd-datastore-2"]
    hosts: ["esxi-01", "esxi-02", "esxi-03"]
  soft:
    folders: ["/Production/WebServers"]
    zones: ["zone-a", "zone-b"]
  antiAffinity:
    hostAntiAffinity: true      # Spread across hosts
    clusterAntiAffinity: false
    datastoreAntiAffinity: true # Spread across datastores

Using Placement Policies

spec:
  placementRef:
    name: production-policy

The provider will attempt to satisfy:

  1. Hard constraints: Must be satisfied
  2. Soft constraints: Best effort
  3. Anti-affinity rules: Avoid co-location
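The hard-then-soft evaluation can be sketched as a filter-and-prefer pass (anti-affinity omitted for brevity; types are illustrative, not the provider's actual placement code):

```go
package main

import "fmt"

// candidate is an illustrative placement target.
type candidate struct {
    Cluster, Host string
}

// policy carries a hard cluster constraint and a soft host preference.
type policy struct {
    hardClusters []string // must match
    softHosts    []string // best effort
}

func contains(xs []string, s string) bool {
    for _, x := range xs {
        if x == s {
            return true
        }
    }
    return false
}

// place filters candidates by hard constraints, then prefers those that also
// satisfy soft constraints; returns nil when no candidate is feasible.
func place(p policy, cs []candidate) *candidate {
    var feasible []candidate
    for _, c := range cs {
        if len(p.hardClusters) == 0 || contains(p.hardClusters, c.Cluster) {
            feasible = append(feasible, c)
        }
    }
    if len(feasible) == 0 {
        return nil // hard constraints unsatisfiable
    }
    for i, c := range feasible {
        if contains(p.softHosts, c.Host) {
            return &feasible[i] // soft preference satisfied
        }
    }
    return &feasible[0] // fall back to any feasible candidate
}

func main() {
    p := policy{hardClusters: []string{"prod-cluster-1"}, softHosts: []string{"esxi-02"}}
    got := place(p, []candidate{
        {"dev-cluster", "esxi-09"},
        {"prod-cluster-1", "esxi-01"},
        {"prod-cluster-1", "esxi-02"},
    })
    fmt.Println(got.Host)
}
```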

Image Preparation

Automated Image Import

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMImage
metadata:
  name: ubuntu-22-04
spec:
  vsphere:
    ovaURL: "https://releases.ubuntu.com/22.04/ubuntu-22.04-server.ova"
    checksum: "sha256:abcd1234..."
  libvirt:
    url: "https://cloud-images.ubuntu.com/22.04/ubuntu-22.04-server.img"
    format: "qcow2"
  prepare:
    onMissing: "Import"  # Auto-import if missing
    validateChecksum: true
    timeout: "30m"
    retries: 3
    storage:
      vsphere:
        datastore: "images-datastore"
        folder: "/Templates"
        thinProvisioned: true

Image Preparation Phases

  1. Pending: Waiting to start preparation
  2. Importing: Downloading/importing image
  3. Preparing: Processing image (conversion, etc.)
  4. Ready: Image ready for use
  5. Failed: Preparation failed

Provider Capabilities

Different providers support different features. Query capabilities:

# Example capabilities response
apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
status:
  capabilities:
    supportsReconfigureOnline: true      # vSphere: true, Libvirt: false
    supportsDiskExpansionOnline: true    # vSphere: true, Libvirt: false
    supportsSnapshots: true              # Both: true
    supportsMemorySnapshots: true        # vSphere: true, Libvirt: varies
    supportsLinkedClones: true           # Both: true
    supportsImageImport: true            # Both: true
    supportedDiskTypes: ["thin", "thick"]
    supportedNetworkTypes: ["VMXNET3", "E1000"]

Observability

Metrics

New metrics for advanced lifecycle operations:

virtrigaud_vm_reconfigure_total{provider_type,outcome}
virtrigaud_vm_snapshot_total{action,provider_type,outcome}
virtrigaud_vm_clone_total{linked,provider_type,outcome}
virtrigaud_vm_image_prepare_total{provider_type,outcome}

Events

Detailed events for lifecycle operations:

Normal   SnapshotCreating    Started snapshot creation
Normal   SnapshotReady       Snapshot created successfully
Normal   ReconfigureStarted  Started VM reconfiguration
Warning  ReconfigurePowerCycle  Reconfiguration requires power cycle
Normal   CloneCompleted      VM clone created successfully

Conditions

Comprehensive condition reporting:

VM Conditions:

  • Ready: VM is ready for use
  • Provisioning: VM is being created
  • Reconfiguring: VM is being reconfigured
  • ReconfigurePendingPowerCycle: Needs power cycle for changes

Snapshot Conditions:

  • Ready: Snapshot is ready
  • Creating: Snapshot being created
  • Deleting: Snapshot being deleted

Clone Conditions:

  • Ready: Clone completed successfully
  • Cloning: Clone operation in progress
  • Customizing: Applying customizations

Best Practices

Snapshot Management

  1. Retention Policies: Always set appropriate retention policies
  2. Memory Snapshots: Use sparingly due to storage overhead
  3. Cleanup: Implement automated cleanup for old snapshots
  4. Testing: Test snapshot revert procedures regularly

VM Reconfiguration

  1. Gradual Changes: Make incremental resource changes
  2. Monitoring: Monitor VM performance after changes
  3. Rollback Plan: Have snapshots before major changes
  4. Capacity Planning: Ensure host resources before scaling up

Placement Policies

  1. Start Simple: Begin with basic constraints
  2. Test Anti-Affinity: Verify rules work as expected
  3. Monitor Placement: Check actual VM placement matches policy
  4. Balance Performance: Don’t over-constrain placement

Multi-VM Operations

  1. Rolling Updates: Use appropriate maxUnavailable settings
  2. Health Checks: Implement proper readiness checks
  3. Monitoring: Monitor rollout progress
  4. Rollback Strategy: Plan for rollback scenarios

Troubleshooting

Common Issues

Reconfiguration Fails:

  • Check provider capabilities
  • Verify resource availability on host
  • Check for VM tools/agent issues

Snapshot Operations Fail:

  • Verify storage backend supports snapshots
  • Check available storage space
  • Ensure VM is not in transitional state

Clone Customization Issues:

  • Verify network configuration
  • Check cloud-init/guest tools
  • Validate IP address availability

Placement Policy Violations:

  • Check resource availability in target locations
  • Verify anti-affinity rules aren’t too restrictive
  • Review cluster resource distribution

Debugging

# Check VM reconfiguration status
kubectl describe vm web-server

# Monitor snapshot progress
kubectl get vmsnapshots -w

# Check clone status
kubectl describe vmclone web-server-clone

# Review placement policy usage
kubectl describe vmplacementpolicy production-policy

# Check VMSet rollout
kubectl describe vmset web-tier

Migration from Basic VMs

Existing VMs can be enhanced with advanced features:

  1. Add Placement Policy: Update VM spec with placementRef
  2. Enable Reconfiguration: Add resource overrides
  3. Create Snapshots: Deploy VMSnapshot resources
  4. Scale with VMSets: Migrate to VMSet for multi-instance workloads

The controller maintains backward compatibility with existing VM definitions.

Nested Virtualization Support

This document describes how to enable and configure nested virtualization in VirtRigaud virtual machines across different hypervisor providers.

Overview

Nested virtualization allows virtual machines to run hypervisors and create their own virtual machines. This is useful for:

  • Development and testing of virtualization software
  • Running container orchestration platforms like Kubernetes
  • Creating nested lab environments
  • Educational purposes for learning virtualization concepts

VirtRigaud supports nested virtualization through the PerformanceProfile configuration in VMClass resources.

Prerequisites

vSphere Provider

  • ESXi 6.0 or later
  • VM hardware version 9 or later (recommended: version 14+)
  • ESXi host must have VT-x/AMD-V enabled in BIOS
  • Sufficient CPU and memory resources on the ESXi host

LibVirt Provider

  • QEMU/KVM hypervisor
  • Host CPU with VT-x (Intel) or AMD-V (AMD) support
  • Nested virtualization enabled in host kernel modules
  • libvirt 1.2.13 or later

Proxmox Provider

  • Proxmox VE 6.0 or later
  • Host CPU with nested virtualization support
  • Nested virtualization enabled in Proxmox configuration

Enabling Nested Virtualization

Nested virtualization is configured at the VMClass level using the PerformanceProfile section:

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: nested-vm-class
  namespace: virtrigaud-system
spec:
  cpu: 4
  memory: 8Gi
  firmware: UEFI  # Recommended for modern features
  
  # Enable nested virtualization
  performanceProfile:
    nestedVirtualization: true
    # Optional: Enable additional features
    virtualizationBasedSecurity: true
    cpuHotAddEnabled: true
    memoryHotAddEnabled: true
  
  # Optional: Security features that work well with nested virtualization
  securityProfile:
    secureBoot: false  # May interfere with some nested hypervisors
    tpmEnabled: false  # Optional, depending on nested OS requirements
    vtdEnabled: true   # Enable VT-d/AMD-Vi for better performance
  
  diskDefaults:
    type: thin
    size: 100Gi  # Larger disk for nested VMs

Complete Example

Here’s a complete example showing how to create a VM with nested virtualization support:

---
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: hypervisor-class
  namespace: default
spec:
  cpu: 8
  memory: 16Gi
  firmware: UEFI
  
  performanceProfile:
    nestedVirtualization: true
    virtualizationBasedSecurity: false  # May conflict with nested hypervisors
    cpuHotAddEnabled: true
    memoryHotAddEnabled: true
    latencySensitivity: low  # Better performance for nested VMs
    hyperThreadingPolicy: prefer
  
  securityProfile:
    secureBoot: false  # Disable for compatibility
    tpmEnabled: false
    vtdEnabled: true   # Enable for better I/O performance
  
  resourceLimits:
    cpuReservation: 4000  # Reserve 4GHz for nested VMs
    memoryReservation: 8Gi
  
  diskDefaults:
    type: thin
    size: 200Gi
    storageClass: fast-ssd

---
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMImage
metadata:
  name: ubuntu-server-22-04
  namespace: default
spec:
  source:
    libvirt:
      url: "https://cloud-images.ubuntu.com/releases/22.04/release/ubuntu-22.04-server-cloudimg-amd64.img"
      checksum: "sha256:de5e632e17b8965f2baf4ea6d2b824788e154d9a65df4fd419ec4019898e15cd"

---
apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: nested-hypervisor
  namespace: default
spec:
  providerRef:
    name: my-provider
  classRef:
    name: hypervisor-class
  imageRef:
    name: ubuntu-server-22-04
  
  userData:
    cloudInit:
      inline: |
        #cloud-config
        hostname: nested-hypervisor
        users:
          - name: ubuntu
            sudo: ALL=(ALL) NOPASSWD:ALL
            ssh_authorized_keys:
              - ssh-rsa AAAAB3NzaC1yc2E... # Your SSH key
        
        packages:
          - qemu-kvm
          - libvirt-daemon-system
          - libvirt-clients
          - bridge-utils
          - virt-manager
        
        runcmd:
          # Verify nested virtualization support
          - echo "Checking nested virtualization support..."
          - cat /proc/cpuinfo | grep -E "(vmx|svm)"
          - ls -la /dev/kvm
          
          # Configure libvirt
          - systemctl enable libvirtd
          - systemctl start libvirtd
          - usermod -aG libvirt ubuntu
          
          # Verify nested KVM support
          - modprobe kvm_intel nested=1 || modprobe kvm_amd nested=1
          - echo "Nested virtualization setup complete"
  
  powerState: On

Provider-Specific Configuration

vSphere Provider

For vSphere, nested virtualization is enabled using the following VM configuration:

  • vhv.enable = TRUE - Enables hardware-assisted virtualization
  • vhv.allowNestedPageTables = TRUE - Improves nested VM performance
  • Hardware version 14+ recommended for best compatibility

Additional considerations:

  • Use UEFI firmware for modern guest operating systems
  • Ensure sufficient CPU and memory allocation
  • Consider enabling VT-d for better I/O performance

LibVirt Provider

For LibVirt/KVM, nested virtualization requires:

  • Host kernel modules: kvm_intel nested=1 or kvm_amd nested=1
  • CPU features: vmx (Intel) or svm (AMD) passed through to guest
  • QEMU machine type: q35 recommended for modern features

The LibVirt provider automatically configures the CPU feature matching the host vendor (vmx on Intel, svm on AMD):

<cpu mode='host-model' check='partial'>
  <feature policy='require' name='vmx'/>  <!-- Intel -->
  <feature policy='require' name='svm'/>  <!-- AMD -->
</cpu>

Proxmox Provider

For Proxmox VE, nested virtualization is configured through:

  • CPU type: host or kvm64 with nested features
  • Enable nested virtualization in VM CPU configuration
  • Ensure host has nested virtualization enabled

Verification

After creating a VM with nested virtualization enabled, verify the setup:

On Linux Guests

# Check for virtualization extensions
grep -E "(vmx|svm)" /proc/cpuinfo

# Verify KVM device availability
ls -la /dev/kvm

# Check nested virtualization status
cat /sys/module/kvm_intel/parameters/nested  # Intel
cat /sys/module/kvm_amd/parameters/nested    # AMD

# Test with a simple nested VM
virt-host-validate

On Windows Guests

# Check Hyper-V compatibility
systeminfo | findstr /i hyper

# Verify virtualization extensions
Get-ComputerInfo | Select-Object HyperV*

Performance Considerations

CPU Allocation

  • Allocate sufficient CPU cores (minimum 4, recommended 8+)
  • Consider CPU reservation for consistent performance
  • Enable CPU hot-add for flexibility

Memory Configuration

  • Allocate generous memory (minimum 8GB, recommended 16GB+)
  • Consider memory reservation for nested VMs
  • Enable memory hot-add for dynamic scaling

Storage

  • Use fast storage (SSD/NVMe) for better nested VM performance
  • Allocate sufficient disk space for multiple nested VMs
  • Consider thin provisioning for efficient space usage

Network

  • Configure appropriate network topology
  • Consider SR-IOV for high-performance networking
  • Plan IP address allocation for nested environments

Troubleshooting

Common Issues

  1. Nested virtualization not working

    • Verify host CPU supports VT-x/AMD-V
    • Check host BIOS settings
    • Ensure hypervisor nested virtualization is enabled
  2. Poor performance in nested VMs

    • Increase CPU and memory allocation
    • Enable CPU/memory reservations
    • Use faster storage
    • Verify nested page tables are enabled
  3. Guest OS doesn’t detect virtualization extensions

    • Check VM hardware version (vSphere)
    • Verify CPU feature passthrough (LibVirt)
    • Ensure proper CPU type configuration (Proxmox)

Debugging Commands

# Check virtualization support on host
lscpu | grep Virtualization

# Verify KVM nested support
cat /sys/module/kvm_*/parameters/nested

# Check VM CPU features (inside guest)
lscpu | grep -E "(vmx|svm|Virtualization)"

# Test nested VM creation
virt-install --name test-nested --memory 1024 --vcpus 1 --disk size=10 --cdrom /path/to/iso

Security Considerations

Isolation

  • Nested VMs add additional attack surface
  • Consider network isolation for nested environments
  • Implement proper access controls

Resource Limits

  • Set appropriate resource limits to prevent resource exhaustion
  • Monitor nested VM resource usage
  • Implement quotas for nested environments

Updates and Patches

  • Keep host hypervisor updated
  • Maintain guest hypervisor software
  • Apply security patches to nested VMs

Best Practices

  1. Planning

    • Design nested architecture carefully
    • Plan resource allocation in advance
    • Consider network topology requirements
  2. Configuration

    • Use UEFI firmware for modern features
    • Enable VT-d/AMD-Vi for better performance
    • Configure appropriate CPU and memory reservations
  3. Monitoring

    • Monitor resource usage at all levels
    • Set up alerting for resource exhaustion
    • Track performance metrics
  4. Maintenance

    • Regular backup of nested environments
    • Plan for hypervisor updates
    • Test disaster recovery procedures

Limitations

vSphere Provider

  • Requires ESXi 6.0+ and hardware version 9+
  • Performance overhead of 10-20% typical
  • Some advanced features may not be available in nested VMs

LibVirt Provider

  • Requires host kernel support
  • Performance depends on host CPU features
  • Limited to x86_64 architecture

Proxmox Provider

  • Requires Proxmox VE 6.0+
  • Performance overhead varies by workload
  • Some clustering features may not work in nested environments

Support Matrix

Provider  Min Version  Nested Support  Performance  Security Features
vSphere   ESXi 6.0     Full            Good         TPM, Secure Boot
LibVirt   1.2.13       Full            Good         TPM, Secure Boot
Proxmox   PVE 6.0      Planned         Good         Limited

For more information, see the provider-specific documentation in the docs/providers/ directory.

Graceful Shutdown Feature

The virtrigaud VM management platform now supports graceful shutdown of virtual machines to prevent data corruption and ensure proper cleanup of running processes.

Overview

Graceful shutdown uses VM guest tools (VMware Tools, QEMU Guest Agent, etc.) to properly shut down the operating system before powering off the virtual machine. This prevents data corruption and allows applications to save their state properly.

Power States

virtrigaud supports three power states:

  • On: Power on the VM
  • Off: Hard power off (immediate shutdown without guest OS notification)
  • OffGraceful: Graceful shutdown using guest tools with automatic fallback to hard power off

Configuration

Basic Usage

apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: my-vm
spec:
  powerState: OffGraceful  # Use graceful shutdown
  # ... other configuration

Advanced Configuration with Lifecycle Hooks

apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: my-vm
spec:
  powerState: OffGraceful
  
  lifecycle:
    # Timeout for graceful shutdown (default: 60s)
    gracefulShutdownTimeout: "120s"
    
    # Pre-stop hook runs before shutdown
    preStop:
      exec:
        command:
          - "/bin/bash"
          - "-c"
          - |
            # Save application state
            systemctl stop my-application
            # Sync filesystem
            sync

How It Works

vSphere Provider

  1. Guest Tools Check: Verifies VMware Tools is installed and running
  2. Graceful Shutdown: Calls vm.ShutdownGuest() to initiate OS shutdown
  3. Monitoring: Polls VM power state every 2 seconds
  4. Timeout Handling: Falls back to hard power off if timeout is reached
  5. Fallback: Uses vm.PowerOff() if graceful shutdown fails

Libvirt Provider

  1. Graceful Attempt: Uses virsh shutdown command
  2. Fallback: Falls back to virsh destroy if shutdown fails
  3. Guest Agent: Requires QEMU Guest Agent for best results

Proxmox Provider

  1. API Call: Uses Proxmox shutdown API endpoint
  2. Built-in Timeout: Proxmox handles timeout and fallback internally

Default Timeouts

  • vSphere: 60 seconds (configurable via gRPC request)
  • Libvirt: Immediate fallback if virsh shutdown fails
  • Proxmox: Managed by Proxmox server configuration

Requirements

VMware vSphere

  • VMware Tools must be installed and running in the guest OS
  • Guest OS must support ACPI shutdown signals

Libvirt/KVM

  • QEMU Guest Agent recommended for reliable graceful shutdown
  • Guest OS must support ACPI shutdown signals

Proxmox

  • QEMU Guest Agent recommended
  • Guest OS must support ACPI shutdown signals

Best Practices

  1. Always Install Guest Tools: Ensure VMware Tools or QEMU Guest Agent is installed
  2. Test Graceful Shutdown: Verify your VMs respond properly to shutdown signals
  3. Set Appropriate Timeouts: Allow enough time for applications to shut down gracefully
  4. Use Lifecycle Hooks: Implement pre-stop hooks for critical applications
  5. Monitor Logs: Check provider logs to verify graceful shutdown is working

Troubleshooting

Graceful Shutdown Not Working

  1. Check Guest Tools Status:

    # For VMware
    vmware-toolbox-cmd stat running
    
    # For QEMU/KVM
    systemctl status qemu-guest-agent
    
  2. Verify ACPI Support:

    # Check if ACPI shutdown is supported
    cat /proc/acpi/button/power/*/info
    
  3. Test Manual Shutdown:

    # Test graceful shutdown manually
    sudo shutdown -h now
    

Timeout Issues

If VMs consistently hit the graceful shutdown timeout:

  1. Increase Timeout: Set a longer gracefulShutdownTimeout
  2. Optimize Applications: Ensure applications shut down quickly
  3. Check System Resources: Verify the system isn’t under heavy load

Fallback to Hard Power Off

The provider will automatically fall back to hard power off if:

  • Guest tools are not available
  • Graceful shutdown times out
  • Guest tools command fails

This ensures VMs are always powered off even if graceful shutdown isn’t possible.

Examples

See examples/graceful-shutdown-vm.yaml for complete examples of using graceful shutdown with various configurations.

Provider Architecture

This document describes the provider architecture in VirtRigaud.

Overview

VirtRigaud uses a Remote Provider architecture where providers run as independent pods, communicating with the manager controller via gRPC. This design provides scalability, security, and reliability benefits.

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  VirtualMachine β”‚    β”‚     Provider      β”‚    β”‚ Provider Runtimeβ”‚
β”‚      CRD        β”‚    β”‚       CRD         β”‚    β”‚   Deployment    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                        β”‚                        β”‚
         β”‚                        β”‚                        β”‚
         v                        v                        β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”‚
β”‚    Manager      β”‚    β”‚ Provider          β”‚              β”‚
β”‚   Controller    β”‚    β”‚ Controller        β”‚              β”‚
β”‚                 β”‚    β”‚                   β”‚              β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚ - Creates Deploy  β”‚              β”‚
β”‚   β”‚ VM Reconcileβ”‚    β”‚ - Creates Service β”‚              β”‚
β”‚   β”‚             β”‚    β”‚ - Updates Status  β”‚              β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚                   β”‚              β”‚
β”‚                 β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                       β”‚
β”‚   β”‚ gRPC Client β”‚β—„β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚   β”‚             β”‚        gRPC Connection
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        Port 9090
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Provider Components

1. Provider Runtime Deployments

Each Provider resource automatically creates:

  • Deployment: Runs provider-specific containers
  • Service: ClusterIP service for gRPC communication
  • ConfigMaps: Provider configuration
  • Secret mounts: Credentials for hypervisor access

Configuration Flow: Provider Resource β†’ Provider Pod

The VirtRigaud Provider Controller automatically translates your Provider resource configuration into the appropriate command-line arguments and environment variables for the provider pod.

Command-Line Arguments

The controller generates these arguments from your Provider spec:

Provider Field             Generated Argument    Example
spec.type                  --provider-type       --provider-type=vsphere
spec.endpoint              --provider-endpoint   --provider-endpoint=https://vcenter.example.com
spec.runtime.service.port  --grpc-addr           --grpc-addr=:9090
(hardcoded)                --metrics-addr        --metrics-addr=:8080
(optional)                 --tls-enabled         --tls-enabled=false

Environment Variables

The controller also sets these environment variables:

Provider Field      Environment Variable  Example
spec.type           PROVIDER_TYPE         vsphere
spec.endpoint       PROVIDER_ENDPOINT     https://vcenter.example.com
metadata.namespace  PROVIDER_NAMESPACE    default
metadata.name       PROVIDER_NAME         vsphere-datacenter
(optional)          TLS_ENABLED           false

Secret Volume Mounts

Credentials from spec.credentialSecretRef are automatically mounted at:

  • Mount Path: /etc/virtrigaud/credentials/
  • Files Created: Each secret key becomes a file
    • username β†’ /etc/virtrigaud/credentials/username
    • password β†’ /etc/virtrigaud/credentials/password
    • token β†’ /etc/virtrigaud/credentials/token

Complete Example

When you create this Provider resource:

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: vsphere-datacenter
  namespace: default
spec:
  type: vsphere
  endpoint: "https://vcenter.example.com:443"
  credentialSecretRef:
    name: vsphere-credentials
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-vsphere:v0.2.0"
    service:
      port: 9090

The controller automatically creates a deployment with:

Command-line arguments:

/provider-vsphere \
  --grpc-addr=:9090 \
  --metrics-addr=:8080 \
  --provider-type=vsphere \
  --provider-endpoint=https://vcenter.example.com:443 \
  --tls-enabled=false

Environment variables:

PROVIDER_TYPE=vsphere
PROVIDER_ENDPOINT=https://vcenter.example.com:443
PROVIDER_NAMESPACE=default
PROVIDER_NAME=vsphere-datacenter
TLS_ENABLED=false

Volume mounts:

/etc/virtrigaud/credentials/username  # Contains: admin@vsphere.local
/etc/virtrigaud/credentials/password  # Contains: your-password

✅ Key Point: You Don’t Configure This Manually

The beauty of VirtRigaud’s Remote Provider architecture is that you never need to manually configure command-line arguments or environment variables. Simply create the Provider resource, and the controller handles all the deployment details automatically!

2. Provider Images

Specialized images for each provider type:

  • ghcr.io/projectbeskar/virtrigaud/provider-vsphere: vSphere provider with govmomi
  • ghcr.io/projectbeskar/virtrigaud/provider-libvirt: LibVirt provider via virsh commands
  • ghcr.io/projectbeskar/virtrigaud/provider-proxmox: Proxmox VE provider
  • ghcr.io/projectbeskar/virtrigaud/provider-mock: Mock provider for testing

3. gRPC Communication

  • Protocol: gRPC with protocol buffers
  • Security: Secure communication over TLS (optional)
  • Health: Built-in health checks and graceful shutdown
  • Metrics: Prometheus metrics on port 8080

Provider Configuration

Basic Provider Setup

apiVersion: v1
kind: Secret
metadata:
  name: vsphere-credentials
  namespace: default
type: Opaque
stringData:
  username: "admin@vsphere.local"
  password: "your-password"

---
apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: vsphere-datacenter
  namespace: default
spec:
  type: vsphere
  endpoint: "https://vcenter.example.com:443"
  credentialSecretRef:
    name: vsphere-credentials
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-vsphere:v0.2.0"
    service:
      port: 9090

Advanced Configuration

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: libvirt-cluster
  namespace: production
spec:
  type: libvirt
  endpoint: "qemu+ssh://admin@kvm.example.com/system"
  credentialSecretRef:
    name: libvirt-credentials
  defaults:
    cluster: production
  rateLimit:
    qps: 20
    burst: 50
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-libvirt:v0.2.0"
    replicas: 3
    
    service:
      port: 9090
      
    resources:
      requests:
        cpu: "200m"
        memory: "256Mi"
      limits:
        cpu: "2"
        memory: "2Gi"
        
    # High availability setup
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/instance: libvirt-cluster
          topologyKey: kubernetes.io/hostname
          
    # Node placement
    nodeSelector:
      workload-type: compute
      
    tolerations:
    - key: "compute-dedicated"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
      
    # Environment variables
    env:
    - name: LIBVIRT_DEBUG
      value: "1"
    - name: PROVIDER_TIMEOUT
      value: "300s"

Security Model

Pod Security

  • Non-root execution: All containers run as non-root users
  • Read-only filesystem: Immutable container filesystem
  • Minimal capabilities: Reduced Linux capabilities
  • Security contexts: Enforced via deployment templates

Credential Isolation

  • Separated secrets: Each provider has dedicated credential secrets
  • Scoped access: Providers only access their own hypervisor credentials
  • RBAC isolation: Fine-grained RBAC per provider namespace

Network Security

  • Service mesh ready: Compatible with Istio/Linkerd
  • Network policies: Optional traffic restrictions
  • TLS support: Secure gRPC communication (configurable)

Communication Protocol

gRPC Service Definition

service Provider {
  rpc Validate(ValidateRequest) returns (ValidateResponse);
  rpc Create(CreateRequest) returns (CreateResponse);
  rpc Delete(DeleteRequest) returns (TaskResponse);
  rpc Power(PowerRequest) returns (TaskResponse);
  rpc Reconfigure(ReconfigureRequest) returns (TaskResponse);
  rpc Describe(DescribeRequest) returns (DescribeResponse);
  rpc TaskStatus(TaskStatusRequest) returns (TaskStatusResponse);
  rpc ListCapabilities(CapabilitiesRequest) returns (CapabilitiesResponse);
}

Error Handling

  • Retry logic: Exponential backoff for transient failures
  • Circuit breakers: Prevent cascade failures
  • Timeout controls: Configurable per-operation timeouts
  • Status reporting: Conditions reflected in Kubernetes status

Observability

Metrics

Provider pods expose Prometheus metrics on port 8080:

# Request metrics
provider_grpc_requests_total{method="Create",status="success"} 42
provider_grpc_request_duration_seconds_bucket{method="Create",le="5"} 40

# VM metrics  
provider_vms_total{state="running"} 15
provider_vms_total{state="stopped"} 3

# Health metrics
provider_health_status{provider="vsphere-datacenter"} 1
provider_hypervisor_connection_status{endpoint="vcenter.example.com"} 1

Logging

  • Structured logs: JSON format with correlation IDs
  • Log levels: Configurable verbosity (debug, info, warn, error)
  • Request tracing: Context propagation across gRPC calls

Health Checks

  • Kubernetes probes: Liveness and readiness probes
  • gRPC health protocol: Standard health check implementation
  • Hypervisor connectivity: Validates connection to external systems

Deployment Patterns

Single Provider Setup

# Simple development setup
apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: dev-vsphere
spec:
  type: vsphere
  endpoint: "https://vcenter-dev.example.com:443"
  credentialSecretRef:
    name: dev-credentials
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-vsphere:v0.2.0"

High Availability Setup

# Production HA setup
apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: prod-vsphere
spec:
  type: vsphere
  endpoint: "https://vcenter-prod.example.com:443"
  credentialSecretRef:
    name: prod-credentials
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-vsphere:v0.2.0"
    replicas: 3
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/instance: prod-vsphere
          topologyKey: kubernetes.io/hostname

Multi-Environment Setup

# Development environment
apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: dev-libvirt
  namespace: development
spec:
  type: libvirt
  endpoint: "qemu+ssh://dev@libvirt-dev.example.com/system"
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-libvirt:v0.2.0"
    resources:
      requests:
        cpu: "100m"
        memory: "128Mi"

---
# Production environment  
apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: prod-libvirt
  namespace: production
spec:
  type: libvirt
  endpoint: "qemu+ssh://prod@libvirt-prod.example.com/system"
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-libvirt:v0.2.0"
    replicas: 2
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "2"
        memory: "2Gi"

Benefits

Scalability

  • Horizontal scaling: Multiple provider replicas per hypervisor
  • Resource isolation: Independent resource allocation per provider
  • Load distribution: gRPC load balancing across provider instances

Security

  • Credential isolation: Hypervisor credentials isolated to provider pods
  • Network segmentation: Providers can run in separate namespaces
  • Least privilege: Manager runs without direct hypervisor access

Reliability

  • Fault isolation: Provider failures don’t affect the manager
  • Independent updates: Provider images updated separately
  • Circuit breaking: Automatic failure detection and recovery

Operational Excellence

  • Rolling updates: Zero-downtime provider updates
  • Health monitoring: Built-in health checks and metrics
  • Debugging: Isolated provider logs and observability

Troubleshooting

Common Issues

  1. Image Pull Failures

    # Check image availability
    docker pull ghcr.io/projectbeskar/virtrigaud/provider-vsphere:v0.2.0
    
    # Verify imagePullSecrets if using private registry
    kubectl get secret regcred -o yaml
    
  2. Network Connectivity

    # Test provider service
kubectl get svc -l app.kubernetes.io/name=virtrigaud-provider
    
    # Check provider pod logs
    kubectl logs -l app.kubernetes.io/name=virtrigaud-provider
    
  3. Credential Issues

    # Verify secret exists and is mounted
    kubectl get secret vsphere-credentials
kubectl describe pod -l app.kubernetes.io/name=virtrigaud-provider
    

Debugging Commands

# Check provider status
kubectl describe provider vsphere-datacenter

# Check provider deployment
kubectl get deployment -l app.kubernetes.io/instance=vsphere-datacenter

# Check provider pods
kubectl get pods -l app.kubernetes.io/instance=vsphere-datacenter

# View provider logs
kubectl logs -l app.kubernetes.io/instance=vsphere-datacenter -f

# Check provider metrics
kubectl port-forward svc/virtrigaud-provider-vsphere-datacenter 8080:8080
curl http://localhost:8080/metrics

Performance Tuning

# Optimize for high-volume workloads
spec:
  rateLimit:
    qps: 100        # Increase API rate limit
    burst: 200      # Allow burst capacity
  runtime:
    replicas: 5     # Scale out for throughput
    resources:
      requests:
        cpu: "1"    # Guarantee CPU resources
        memory: "1Gi"
      limits:
        cpu: "4"    # Allow burst CPU
        memory: "4Gi"

Best Practices

Resource Management

  • Right-sizing: Start with small requests, monitor and adjust
  • Limits: Always set memory limits to prevent OOM kills
  • QoS: Use Guaranteed QoS for production workloads

Security

  • Secrets rotation: Implement regular credential rotation
  • Network policies: Restrict provider-to-hypervisor traffic
  • RBAC: Use dedicated service accounts per provider

Monitoring

  • Alerting: Set up alerts on provider health metrics
  • Dashboards: Create Grafana dashboards for provider metrics
  • Log aggregation: Centralize logs for debugging and auditing

Migration and Upgrades

Provider Image Updates

# Update provider image
kubectl patch provider vsphere-datacenter --type=merge -p '
{
  "spec": {
    "runtime": {
      "image": "ghcr.io/projectbeskar/virtrigaud/provider-vsphere:v0.2.0"
    }
  }
}'

# Monitor rollout
kubectl rollout status deployment virtrigaud-provider-vsphere-datacenter

Configuration Changes

# Update provider configuration
kubectl edit provider vsphere-datacenter

# Verify changes applied
kubectl describe provider vsphere-datacenter

VirtRigaud Observability Guide

This document describes the comprehensive observability features of VirtRigaud, including structured logging, metrics, tracing, and monitoring.

Overview

VirtRigaud provides production-grade observability through:

  • Structured JSON Logging with correlation IDs and automatic secret redaction
  • Comprehensive Prometheus Metrics for all components and operations
  • OpenTelemetry Tracing with gRPC instrumentation
  • Health Endpoints for liveness and readiness probes
  • Grafana Dashboards for visualization
  • Prometheus Alerts for proactive monitoring

Logging

Configuration

Configure logging via environment variables:

LOG_LEVEL=info              # debug, info, warn, error
LOG_FORMAT=json             # json or console
LOG_SAMPLING=true           # Enable log sampling
LOG_DEVELOPMENT=false       # Development mode

Correlation IDs

All log entries include correlation fields:

{
  "level": "info",
  "ts": "2025-01-27T10:30:45.123Z",
  "msg": "VM operation started",
  "correlationID": "req-12345",
  "vm": "default/web-server-1",
  "provider": "default/vsphere-prod",
  "providerType": "vsphere",
  "taskRef": "task-67890",
  "reconcile": "uuid-abcdef"
}

Secret Redaction

Sensitive information is automatically redacted:

{
  "msg": "Connecting to provider",
  "endpoint": "vcenter://user:[REDACTED]@vc.example.com/Datacenter",
  "userData": "[REDACTED]"
}
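A minimal sketch of how URL-embedded credentials can be masked before logging (this is an illustration of the idea, not VirtRigaud's actual redaction code; the regex and helper name are assumptions):

```go
package main

import (
	"fmt"
	"regexp"
)

// userinfoPassword matches the password portion of user:password@host URLs.
var userinfoPassword = regexp.MustCompile(`(://[^:/@]+:)[^@]+@`)

// redactEndpoint masks embedded credentials before an endpoint is logged.
func redactEndpoint(endpoint string) string {
	return userinfoPassword.ReplaceAllString(endpoint, "${1}[REDACTED]@")
}

func main() {
	fmt.Println(redactEndpoint("vcenter://user:s3cret@vc.example.com/Datacenter"))
	// → vcenter://user:[REDACTED]@vc.example.com/Datacenter
}
```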

Metrics Catalog

Manager Metrics

| Metric | Type | Description | Labels |
|---|---|---|---|
| virtrigaud_manager_reconcile_total | Counter | Total reconcile operations | kind, outcome |
| virtrigaud_manager_reconcile_duration_seconds | Histogram | Reconcile duration | kind |
| virtrigaud_queue_depth | Gauge | Work queue depth | kind |

Provider Metrics

| Metric | Type | Description | Labels |
|---|---|---|---|
| virtrigaud_provider_rpc_requests_total | Counter | RPC requests | provider_type, method, code |
| virtrigaud_provider_rpc_latency_seconds | Histogram | RPC latency | provider_type, method |
| virtrigaud_provider_tasks_inflight | Gauge | Inflight tasks | provider_type, provider |

VM Operation Metrics

| Metric | Type | Description | Labels |
|---|---|---|---|
| virtrigaud_vm_operations_total | Counter | VM operations | operation, provider_type, provider, outcome |
| virtrigaud_ip_discovery_duration_seconds | Histogram | IP discovery time | provider_type |

Circuit Breaker Metrics

| Metric | Type | Description | Labels |
|---|---|---|---|
| virtrigaud_circuit_breaker_state | Gauge | CB state (0=closed, 1=half-open, 2=open) | provider_type, provider |
| virtrigaud_circuit_breaker_failures_total | Counter | CB failures | provider_type, provider |

Error Metrics

| Metric | Type | Description | Labels |
|---|---|---|---|
| virtrigaud_errors_total | Counter | Errors by reason | reason, component |

Tracing

Configuration

Enable OpenTelemetry tracing:

VIRTRIGAUD_TRACING_ENABLED=true
VIRTRIGAUD_TRACING_ENDPOINT=http://jaeger:14268/api/traces
VIRTRIGAUD_TRACING_SAMPLING_RATIO=0.1
VIRTRIGAUD_TRACING_INSECURE=true

Span Structure

Key spans include:

  • vm.reconcile - Full VM reconciliation
  • vm.create - VM creation operation
  • provider.validate - Provider validation
  • rpc.Create - gRPC calls to providers

Trace Attributes

Standard attributes:

vm.namespace = "default"
vm.name = "web-server-1"
provider.type = "vsphere"
operation = "Create"
task.ref = "task-12345"

Health Endpoints

HTTP Endpoints

All components expose health endpoints on port 8080:

  • GET /healthz - Liveness probe (always returns 200)
  • GET /readyz - Readiness probe (checks dependencies)
  • GET /health - Detailed health status (JSON)

gRPC Health

Providers implement grpc.health.v1.Health service for health checks.

Grafana Dashboards

Manager Dashboard

  • Reconcile rates and duration
  • Queue depth monitoring
  • Error rate tracking
  • Resource usage (CPU/memory)

Provider Dashboard

  • RPC latency and error rates
  • Task monitoring
  • Circuit breaker status
  • Provider-specific metrics

VM Lifecycle Dashboard

  • Creation success rates
  • IP discovery times
  • Failure analysis
  • Provider comparison

Prometheus Alerts

Critical Alerts

  • VirtrigaudProviderDown - Provider unavailable
  • VirtrigaudManagerDown - Manager unavailable

Warning Alerts

  • VirtrigaudProviderErrorRateHigh - High error rate (>50%)
  • VirtrigaudReconcileStuck - Slow reconciles (>5min)
  • VirtrigaudQueueBackedUp - Queue depth >100
  • VirtrigaudCircuitBreakerOpen - CB protection active
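As an illustration, the high-error-rate alert could be expressed as a PrometheusRule like the one below. This is a sketch based on the metrics catalog above (the `code!="OK"` matcher is an assumption about the label's values); the shipped rules live in deploy/observability/prometheus/alerts.yaml.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: virtrigaud-provider-alerts
spec:
  groups:
  - name: virtrigaud.provider
    rules:
    - alert: VirtrigaudProviderErrorRateHigh
      expr: |
        sum(rate(virtrigaud_provider_rpc_requests_total{code!="OK"}[5m])) by (provider)
          / sum(rate(virtrigaud_provider_rpc_requests_total[5m])) by (provider) > 0.5
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Provider {{ $labels.provider }} error rate above 50%"
```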

Configuration Reference

Complete Environment Variables

# Logging
LOG_LEVEL=info
LOG_FORMAT=json
LOG_SAMPLING=true
LOG_DEVELOPMENT=false

# Tracing
VIRTRIGAUD_TRACING_ENABLED=false
VIRTRIGAUD_TRACING_ENDPOINT=""
VIRTRIGAUD_TRACING_SAMPLING_RATIO=0.1
VIRTRIGAUD_TRACING_INSECURE=true

# RPC Timeouts
RPC_TIMEOUT_DESCRIBE=30s
RPC_TIMEOUT_MUTATING=4m
RPC_TIMEOUT_VALIDATE=10s
RPC_TIMEOUT_TASK_STATUS=10s

# Retry Configuration
RETRY_MAX_ATTEMPTS=5
RETRY_BASE_DELAY=500ms
RETRY_MAX_DELAY=30s
RETRY_MULTIPLIER=2.0
RETRY_JITTER=true

# Circuit Breaker
CB_FAILURE_THRESHOLD=10
CB_RESET_SECONDS=60s
CB_HALF_OPEN_MAX_CALLS=3

# Rate Limiting
RATE_LIMIT_QPS=10
RATE_LIMIT_BURST=20

# Workers
WORKERS_PER_KIND=2
MAX_INFLIGHT_TASKS=100

# Feature Gates
FEATURE_GATES=""

# Performance
VIRTRIGAUD_PPROF_ENABLED=false
VIRTRIGAUD_PPROF_ADDR=:6060

Deployment

ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: virtrigaud-manager
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: virtrigaud
  endpoints:
  - port: metrics
    interval: 30s

PrometheusRule

Deploy alerts:

kubectl apply -f deploy/observability/prometheus/alerts.yaml

Grafana Dashboards

Import dashboards from deploy/observability/grafana/

Troubleshooting

High Error Rates

  1. Check provider health: kubectl get providers
  2. Review error metrics: virtrigaud_errors_total
  3. Check circuit breaker state
  4. Review provider logs

Slow Operations

  1. Check RPC latency metrics
  2. Review reconcile duration
  3. Check resource constraints
  4. Monitor task queue depth

Memory Issues

  1. Monitor process_resident_memory_bytes
  2. Check for goroutine leaks: go_goroutines
  3. Review heap usage: go_memstats_heap_inuse_bytes

Security Policy

Supported Versions

We actively support the following versions of VirtRigaud with security updates:

| Version | Supported |
|---|---|
| 0.1.x | :white_check_mark: |
| < 0.1 | :x: |

Reporting a Vulnerability

The VirtRigaud team takes security vulnerabilities seriously. We appreciate your efforts to responsibly disclose your findings, and will make every effort to acknowledge your contributions.

How to Report

Please do not report security vulnerabilities through public GitHub issues.

Instead, please send an email to security@virtrigaud.io with the following information:

  • A description of the vulnerability
  • Steps to reproduce the issue
  • Potential impact
  • Any possible mitigations you’ve identified

You should receive a response within 48 hours. If for some reason you do not, please follow up via email to ensure we received your original message.

What to Expect

  • Acknowledgment: We will acknowledge receipt of your vulnerability report within 48 hours.
  • Assessment: We will assess the vulnerability and determine its severity within 5 business days.
  • Mitigation: For confirmed vulnerabilities, we will work on a fix and coordinate disclosure timeline with you.
  • Recognition: We will credit you in our security advisory and release notes (unless you prefer to remain anonymous).

Disclosure Policy

  • We ask that you do not publicly disclose the vulnerability until we have had a chance to address it.
  • We will coordinate with you on an appropriate disclosure timeline.
  • We typically aim to disclose within 90 days of initial report.

Security Considerations

General Security

  • VirtRigaud runs with minimal privileges and follows security best practices
  • All communications with providers use TLS encryption
  • Sensitive data (credentials, user data) is properly handled and never logged
  • RBAC is enforced to limit access to resources

Supply Chain Security

  • All container images are signed with Cosign
  • Software Bill of Materials (SBOM) is provided for all releases
  • Container images are scanned for vulnerabilities
  • Dependencies are regularly updated

Network Security

  • Network policies are provided to restrict traffic
  • mTLS is supported for provider communications
  • No unnecessary ports are exposed

Access Control

  • RBAC roles follow principle of least privilege
  • Service accounts are properly scoped
  • Admission webhooks enforce security policies

Vulnerability Management

Scanning

We regularly scan our codebase and dependencies for known vulnerabilities using:

  • GitHub Security Advisories
  • Trivy for container scanning
  • Go vulnerability database
  • OWASP dependency checking

Response Process

  1. Detection: Vulnerability discovered through scanning or reporting
  2. Assessment: Determine severity and impact
  3. Patching: Develop and test fix
  4. Release: Create security release with patch
  5. Notification: Inform users through security advisory

Severity Classification

We use the following severity levels:

  • Critical: Immediate action required, patch within 24 hours
  • High: Patch within 7 days
  • Medium: Patch within 30 days
  • Low: Patch in next regular release

Security Features

Authentication and Authorization

  • Integration with Kubernetes RBAC
  • Support for external identity providers
  • Service account token projection
  • Webhook authentication

Encryption

  • TLS 1.2+ for all communications
  • Certificate rotation and management
  • Support for custom CA certificates
  • Secrets encryption at rest (Kubernetes level)

Audit and Monitoring

  • Comprehensive audit logging
  • Security event monitoring
  • Metrics for security-relevant events
  • Integration with security monitoring tools

Best Practices for Users

Deployment Security

  1. Use namespace isolation: Deploy in dedicated namespace
  2. Apply network policies: Restrict network access
  3. Enable Pod Security Standards: Use strict or baseline profiles
  4. Regular updates: Keep VirtRigaud and dependencies updated
  5. Monitor security advisories: Subscribe to security notifications

Credential Management

  1. Use external secret management: HashiCorp Vault, External Secrets Operator
  2. Rotate credentials regularly: Implement credential rotation
  3. Principle of least privilege: Grant minimal required permissions
  4. Secure storage: Never store credentials in Git or plain text

Network Security

  1. Enable TLS: Use TLS for all communications
  2. Network segmentation: Isolate provider networks
  3. Firewall rules: Restrict hypervisor access
  4. VPN access: Use VPN for remote hypervisor access

Monitoring and Alerting

  1. Security monitoring: Monitor for security events
  2. Failed authentication alerts: Alert on authentication failures
  3. Unusual activity: Monitor for unexpected behavior
  4. Compliance scanning: Regular security scans

Compliance

VirtRigaud is designed to support compliance with various security frameworks:

  • SOC 2: Control implementation guidance available
  • ISO 27001: Security control mapping provided
  • CIS Kubernetes Benchmark: Alignment with security benchmarks
  • NIST Cybersecurity Framework: Control implementation guidance

Security Tools and Integrations

Supported Security Tools

  • Falco: Runtime security monitoring
  • OPA Gatekeeper: Policy enforcement
  • Twistlock/Prisma: Container security scanning
  • Aqua Security: Container and runtime security
  • Cilium: Network security and observability

Security Configurations

Example security-hardened configurations are provided in:

  • examples/security/strict-rbac.yaml
  • examples/security/network-policies.yaml
  • examples/security/pod-security-policies.yaml
  • examples/security/external-secrets.yaml

Contact

For security-related questions that are not vulnerabilities, you can:

  • Open a GitHub Discussion in the Security category
  • Email security@virtrigaud.io
  • Join the #virtrigaud-security channel on Kubernetes Slack

Recognition

We maintain a security hall of fame for researchers who have helped improve VirtRigaud security.

Thank you to all the security researchers who have contributed to making VirtRigaud more secure!

VirtRigaud Resilience Guide

This document describes the resilience patterns and error handling mechanisms in VirtRigaud.

Overview

VirtRigaud implements comprehensive resilience patterns:

  • Error Taxonomy - Structured error classification
  • Circuit Breakers - Protection against cascading failures
  • Exponential Backoff - Intelligent retry strategies
  • Timeout Policies - Prevent resource exhaustion
  • Rate Limiting - Provider protection

Error Taxonomy

Error Types

VirtRigaud classifies all errors into specific categories:

| Type | Retryable | Description | Example |
|---|---|---|---|
| NotFound | No | Resource doesn’t exist | VM not found |
| InvalidSpec | No | Invalid configuration | Malformed VM spec |
| Unauthorized | No | Authentication failed | Invalid credentials |
| NotSupported | No | Unsupported operation | Feature not available |
| Retryable | Yes | Transient error | Network timeout |
| Unavailable | Yes | Service unavailable | Provider down |
| RateLimit | Yes | Rate limited | API quota exceeded |
| Timeout | Yes | Operation timeout | Long-running task |
| QuotaExceeded | No | Resource quota hit | Storage full |
| Conflict | No | Resource conflict | Duplicate name |

Error Creation

import (
    "errors"

    "github.com/projectbeskar/virtrigaud/internal/providers/contracts"
)

// Create specific error types
err := contracts.NewNotFoundError("VM not found", originalErr)
err = contracts.NewRetryableError("Network timeout", originalErr)
err = contracts.NewUnavailableError("Provider unavailable", originalErr)

// Check whether an error is retryable (errors.As also matches wrapped errors)
var providerErr *contracts.ProviderError
if errors.As(err, &providerErr) && providerErr.IsRetryable() {
    // Retry the operation
}

Circuit Breaker Pattern

Configuration

import "github.com/projectbeskar/virtrigaud/internal/resilience"

config := &resilience.Config{
    FailureThreshold: 10,              // Open after 10 failures
    ResetTimeout:     60 * time.Second, // Try again after 60s
    HalfOpenMaxCalls: 3,               // Allow 3 test calls
}

cb := resilience.NewCircuitBreaker("provider-vsphere", "vsphere", "prod", config)

Usage

err := cb.Call(ctx, func(ctx context.Context) error {
    // Call the potentially failing operation
    return provider.Create(ctx, request)
})

if err != nil {
    // Handle error (may be circuit breaker protection)
    log.Error(err, "Operation failed")
}

States

  1. Closed - Normal operation, failures are counted
  2. Open - Fast-fail mode, requests are rejected immediately
  3. Half-Open - Testing mode, limited requests allowed

Metrics

Circuit breaker state is exposed via metrics:

virtrigaud_circuit_breaker_state{provider_type="vsphere",provider="prod"} 0
virtrigaud_circuit_breaker_failures_total{provider_type="vsphere",provider="prod"} 5

Retry Strategies

Exponential Backoff

import "github.com/projectbeskar/virtrigaud/internal/resilience"

config := &resilience.RetryConfig{
    MaxAttempts: 5,
    BaseDelay:   500 * time.Millisecond,
    MaxDelay:    30 * time.Second,
    Multiplier:  2.0,
    Jitter:      true,
}

err := resilience.Retry(ctx, config, func(ctx context.Context, attempt int) error {
    return provider.Describe(ctx, vmID)
})

Backoff Calculation

For attempt n:

delay = BaseDelay × Multiplier^n
delay = min(delay, MaxDelay)
if Jitter:
    delay += random(0, delay * 0.1)

Example delays with BaseDelay=500ms, Multiplier=2.0:

  • Attempt 0: 500ms
  • Attempt 1: 1s
  • Attempt 2: 2s
  • Attempt 3: 4s
  • Attempt 4: 8s

Predefined Configurations

// For frequent, low-latency operations
aggressive := resilience.AggressiveRetryConfig()
// MaxAttempts: 10, BaseDelay: 100ms, Multiplier: 1.5

// For expensive operations
conservative := resilience.ConservativeRetryConfig()
// MaxAttempts: 3, BaseDelay: 1s, Multiplier: 3.0

// Disable retries
none := resilience.NoRetryConfig()
// MaxAttempts: 1

Combined Resilience Policies

Policy Builder

policy := resilience.NewPolicyBuilder("vm-operations").
    WithRetry(resilience.DefaultRetryConfig()).
    WithCircuitBreaker(circuitBreaker).
    Build()

err := policy.Execute(ctx, func(ctx context.Context) error {
    return provider.Create(ctx, request)
})

Integration Example

// In VirtualMachine controller
func (r *VirtualMachineReconciler) createVM(ctx context.Context, vm *v1beta1.VirtualMachine) error {
    // provider is the Provider resource resolved for this VM (lookup omitted for brevity)
    cb := r.CircuitBreakerRegistry.GetOrCreate(
        "vm-operations",
        provider.Spec.Type,
        provider.Name,
    )
    
    // Create resilience policy
    policy := resilience.NewPolicyBuilder("create-vm").
        WithRetry(&resilience.RetryConfig{
            MaxAttempts: 3,
            BaseDelay:   1 * time.Second,
            MaxDelay:    30 * time.Second,
            Multiplier:  2.0,
            Jitter:      true,
        }).
        WithCircuitBreaker(cb).
        Build()
    
    // Execute with resilience
    return policy.Execute(ctx, func(ctx context.Context) error {
        resp, err := provider.Create(ctx, createReq)
        if err != nil {
            return err
        }
        
        vm.Status.ID = resp.ID
        vm.Status.TaskRef = resp.TaskRef
        return nil
    })
}

Timeout Policies

RPC Timeouts

Different operations have different timeout requirements:

// Operation-specific timeouts
rpcCfg := &config.RPCConfig{
    TimeoutDescribe:   30 * time.Second,  // Quick status check
    TimeoutMutating:   4 * time.Minute,   // Create/Delete/Power
    TimeoutValidate:   10 * time.Second,  // Provider validation
    TimeoutTaskStatus: 10 * time.Second,  // Task polling
}

// Usage in gRPC client
timeout := rpcCfg.GetRPCTimeout("Create")
ctx, cancel := context.WithTimeout(ctx, timeout)
defer cancel()

resp, err := client.Create(ctx, request)

Context Propagation

Always respect context deadlines:

func (p *Provider) Create(ctx context.Context, req CreateRequest) error {
    // Check if context is already cancelled
    select {
    case <-ctx.Done():
        return ctx.Err()
    default:
    }
    
    // Perform operation with context
    return p.performCreate(ctx, req)
}

Rate Limiting

Provider Protection

import "golang.org/x/time/rate"

// Configure rate limiter
limiter := rate.NewLimiter(
    rate.Limit(config.RateLimit.QPS),    // 10 requests per second
    config.RateLimit.Burst,              // Allow bursts of 20
)

// Check rate limit before operation
if !limiter.Allow() {
    return contracts.NewRateLimitError("Rate limit exceeded", nil)
}

// Proceed with operation
return provider.Create(ctx, request)

Per-Provider Limits

Each provider instance has its own rate limiter:

type ProviderManager struct {
    mu       sync.Mutex // guards limiters; reconciles may run concurrently
    limiters map[string]*rate.Limiter
}

func (pm *ProviderManager) getLimiter(providerType, provider string) *rate.Limiter {
    pm.mu.Lock()
    defer pm.mu.Unlock()

    key := fmt.Sprintf("%s:%s", providerType, provider)
    if limiter, exists := pm.limiters[key]; exists {
        return limiter
    }

    // Create a new limiter for this provider (10 QPS, burst of 20)
    limiter := rate.NewLimiter(rate.Limit(10), 20)
    pm.limiters[key] = limiter
    return limiter
}

Condition Mapping

VM Conditions

VirtRigaud sets standard conditions based on operations:

| Condition | Status | Reason | Description |
|---|---|---|---|
| Ready | True | VMReady | VM is ready for use |
| Ready | False | ProviderError | Provider operation failed |
| Ready | False | ValidationError | Spec validation failed |
| Provisioning | True | Creating | VM creation in progress |
| Provisioning | False | CreateFailed | VM creation failed |

Provider Conditions

| Condition | Status | Reason | Description |
|---|---|---|---|
| ProviderRuntimeReady | True | DeploymentReady | Remote runtime ready |
| ProviderRuntimeReady | False | DeploymentError | Deployment failed |
| ProviderAvailable | True | HealthCheckPassed | Provider healthy |
| ProviderAvailable | False | HealthCheckFailed | Provider unhealthy |

Error to Condition Mapping

func mapErrorToCondition(err error) metav1.Condition {
    if providerErr, ok := err.(*contracts.ProviderError); ok {
        switch providerErr.Type {
        case contracts.ErrorTypeNotFound:
            return metav1.Condition{
                Type:    "Ready",
                Status:  metav1.ConditionFalse,
                Reason:  "ResourceNotFound",
                Message: providerErr.Message,
            }
        case contracts.ErrorTypeUnauthorized:
            return metav1.Condition{
                Type:    "Ready", 
                Status:  metav1.ConditionFalse,
                Reason:  "AuthenticationFailed",
                Message: providerErr.Message,
            }
        case contracts.ErrorTypeUnavailable:
            return metav1.Condition{
                Type:    "Ready",
                Status:  metav1.ConditionFalse,
                Reason:  "ProviderUnavailable", 
                Message: providerErr.Message,
            }
        }
    }
    
    // Default error condition
    return metav1.Condition{
        Type:    "Ready",
        Status:  metav1.ConditionFalse,
        Reason:  "InternalError",
        Message: err.Error(),
    }
}

Best Practices

Error Handling

  1. Always classify errors - Use appropriate error types
  2. Preserve context - Wrap errors with additional context
  3. Avoid retrying non-retryable errors - Check error type first
  4. Set meaningful conditions - Help users understand state

Circuit Breakers

  1. Per-provider instances - Isolate failures
  2. Appropriate thresholds - Balance protection vs availability
  3. Monitor state changes - Alert on circuit breaker trips
  4. Manual override - Provide way to reset if needed

Timeouts

  1. Operation-appropriate - Different timeouts for different ops
  2. Propagate context - Always pass context through
  3. Handle cancellation - Check context.Done() regularly
  4. Resource cleanup - Ensure resources are freed on timeout

Rate Limiting

  1. Provider protection - Prevent overwhelming providers
  2. Burst handling - Allow reasonable bursts
  3. Back-pressure - Surface rate limits to users
  4. Fair sharing - Consider tenant isolation
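For a feel of what rate.NewLimiter(qps, burst) in getLimiter enforces, here is a minimal stdlib-only token bucket with the same steady-rate-plus-burst semantics (a sketch, not the golang.org/x/time/rate implementation):

```go
package main

import (
	"sync"
	"time"
)

// bucket is a minimal token bucket: capacity `burst` tokens, refilled at
// `qps` tokens per second, mirroring rate.NewLimiter(qps, burst).
type bucket struct {
	mu     sync.Mutex
	qps    float64
	burst  float64
	tokens float64
	last   time.Time
}

func newBucket(qps float64, burst int) *bucket {
	return &bucket{qps: qps, burst: float64(burst), tokens: float64(burst), last: time.Now()}
}

// Allow consumes one token if available. Returning false surfaces
// back-pressure to the caller instead of queueing requests silently.
func (b *bucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.qps // refill since last call
	if b.tokens > b.burst {
		b.tokens = b.burst // never exceed burst capacity
	}
	b.last = now
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}
```

Per-provider instances of such a bucket (as in getLimiter's map) give fair sharing across tenants.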

Configuration Examples

Development Environment

apiVersion: v1
kind: ConfigMap
metadata:
  name: virtrigaud-config
data:
  # Relaxed timeouts for development
  RPC_TIMEOUT_MUTATING: "10m"
  
  # Aggressive retries for flaky dev environments  
  RETRY_MAX_ATTEMPTS: "10"
  RETRY_BASE_DELAY: "100ms"
  
  # Lower circuit breaker threshold
  CB_FAILURE_THRESHOLD: "5"
  CB_RESET_SECONDS: "30s"

Production Environment

apiVersion: v1
kind: ConfigMap
metadata:
  name: virtrigaud-config
data:
  # Strict timeouts
  RPC_TIMEOUT_MUTATING: "4m"
  RPC_TIMEOUT_DESCRIBE: "30s"
  
  # Conservative retries
  RETRY_MAX_ATTEMPTS: "3"
  RETRY_BASE_DELAY: "1s"
  RETRY_MAX_DELAY: "60s"
  
  # Higher circuit breaker threshold
  CB_FAILURE_THRESHOLD: "15" 
  CB_RESET_SECONDS: "120s"
  
  # Rate limiting
  RATE_LIMIT_QPS: "20"
  RATE_LIMIT_BURST: "50"

VirtRigaud Upgrade Guide

This guide covers upgrading VirtRigaud installations, including CRD updates and breaking changes.

Quick Upgrade

# 1. Update Helm repository
helm repo update

# 2. Check for breaking changes (requires the helm-diff plugin)
helm diff upgrade virtrigaud virtrigaud/virtrigaud --version v0.2.1

# 3. Upgrade CRDs first (required for schema changes)
helm pull virtrigaud/virtrigaud --version v0.2.1 --untar
kubectl apply -f virtrigaud/crds/

# 4. Upgrade VirtRigaud
helm upgrade virtrigaud virtrigaud/virtrigaud \
  --namespace virtrigaud-system \
  --version v0.2.1

Alternative: Direct CRD Download

# Download and apply CRDs from release
curl -L "https://github.com/projectbeskar/virtrigaud/releases/download/v0.2.1/virtrigaud-crds.yaml" | kubectl apply -f -

# Upgrade application
helm upgrade virtrigaud virtrigaud/virtrigaud --version v0.2.1

Version-Specific Upgrade Notes

v0.2.0 β†’ v0.2.1

Breaking Changes:

  • βœ… PowerState validation fixed (OffGraceful now supported)
  • βœ… Hardware version management added (vSphere only)
  • βœ… Disk size configuration respected

Required Actions:

  1. CRD Update Required: New powerState validation and schema changes
  2. Provider Image Update: Ensure providers use v0.2.1+ images for new features
  3. Field Testing: Verify OffGraceful, hardware version, and disk sizing work correctly

Upgrade Steps:

# 1. Backup existing resources
kubectl get virtualmachines,vmclasses,providers -A -o yaml > virtrigaud-backup-v021.yaml

# 2. Update CRDs (fixes OffGraceful validation)
kubectl apply -f https://github.com/projectbeskar/virtrigaud/releases/download/v0.2.1/virtrigaud-crds.yaml

# 3. Upgrade VirtRigaud
helm upgrade virtrigaud virtrigaud/virtrigaud --version v0.2.1

# 4. Verify OffGraceful works
kubectl patch virtualmachine <vm-name> --type='merge' -p='{"spec":{"powerState":"OffGraceful"}}'

Rollback Procedures

Rollback to Previous Version

# 1. Rollback application
helm rollback virtrigaud <revision>

# 2. Rollback CRDs (if schema breaking changes)
kubectl apply -f https://github.com/projectbeskar/virtrigaud/releases/download/v0.2.0/virtrigaud-crds.yaml

# 3. Verify resources still work
kubectl get virtualmachines -A

Emergency Recovery

# 1. Restore from backup
kubectl apply -f virtrigaud-backup-v021.yaml

# 2. Check controller logs
kubectl logs -n virtrigaud-system deployment/virtrigaud-manager

# 3. Force reconciliation
kubectl annotate virtualmachine <vm-name> virtrigaud.io/force-sync="$(date)"

Automated Upgrade with GitOps

ArgoCD

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: virtrigaud
spec:
  source:
    chart: virtrigaud
    repoURL: https://projectbeskar.github.io/virtrigaud
    targetRevision: "0.2.1"
    helm:
      parameters:
      - name: manager.image.tag
        value: "v0.2.1"
  syncPolicy:
    syncOptions:
    - CreateNamespace=true
    - Replace=true  # Required for CRD updates

Flux

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: virtrigaud
spec:
  chart:
    spec:
      chart: virtrigaud
      version: "0.2.1"
      sourceRef:
        kind: HelmRepository
        name: virtrigaud
  upgrade:
    crds: CreateReplace  # Ensure CRDs are updated

Troubleshooting Upgrades

CRD Validation Errors

# Check CRD status
kubectl get crd virtualmachines.infra.virtrigaud.io -o yaml

# Fix validation conflicts
kubectl patch crd virtualmachines.infra.virtrigaud.io --type='json' -p='[{"op": "remove", "path": "/spec/versions/0/schema/openAPIV3Schema/properties/spec/properties/powerState/allOf"}]'

Provider Image Mismatch

# Check provider images
kubectl get providers -o jsonpath='{.items[*].spec.runtime.image}'

# Update provider image
kubectl patch provider <provider-name> --type='merge' -p='{"spec":{"runtime":{"image":"ghcr.io/projectbeskar/virtrigaud/provider-vsphere:v0.2.1"}}}'

Resource Conflicts

# Check for resource conflicts
kubectl get events --sort-by=.metadata.creationTimestamp

# Force resource refresh
kubectl delete pod -l app.kubernetes.io/name=virtrigaud -n virtrigaud-system

Best Practices

Pre-Upgrade Checklist

  • Backup all VirtRigaud resources
  • Check for breaking changes in release notes
  • Test upgrade in staging environment
  • Verify provider connectivity
  • Plan rollback strategy

Post-Upgrade Verification

  • All CRDs updated successfully
  • Controller manager running
  • Providers healthy and responsive
  • Existing VMs still manageable
  • New features working (OffGraceful, hardware version, etc.)

Monitoring During Upgrade

# Watch controller logs
kubectl logs -n virtrigaud-system deployment/virtrigaud-manager -f

# Monitor VM status
kubectl get virtualmachines -A --watch

# Check provider health
kubectl get providers -o custom-columns=NAME:.metadata.name,STATUS:.status.conditions[0].type,MESSAGE:.status.conditions[0].message

Support and Recovery

If you encounter issues during upgrade:

  1. Check Release Notes: https://github.com/projectbeskar/virtrigaud/releases
  2. Review Logs: Controller and provider logs for error details
  3. Community Support: GitHub issues and discussions
  4. Emergency Rollback: Use documented rollback procedures

Remember: Always test upgrades in non-production environments first!

Development Workflow (v0.2.1+)

CRD Management

Starting with v0.2.1+, VirtRigaud uses a single-source-of-truth approach for CRDs:

  • Code is the source of truth (API types in api/infra.virtrigaud.io/v1beta1)
  • config/crd/bases/ contains generated CRDs for local development and is checked into git
  • charts/virtrigaud/crds/ CRDs are generated during Helm chart packaging and are NOT checked into git

For Developers

# Generate CRDs for local development
make gen-crds

# Generate CRDs for Helm chart packaging
make gen-helm-crds

# Package Helm chart with generated CRDs
make helm-package

Pre-commit Hooks

Install pre-commit hooks to automatically generate CRDs:

# Install pre-commit
pip install pre-commit

# Install hooks
pre-commit install

# CRDs will now be generated automatically on commits that modify:
# - api/**.go files

CI/CD Integration

The CI/CD pipeline automatically:

  1. Generates CRDs from code during builds
  2. Includes CRDs in release artifacts for users to download
  3. Generates Helm chart CRDs during packaging

This ensures CRDs are always up-to-date and not duplicated in the repository.

Repository Workflow

# 1. Make API changes
vim api/infra.virtrigaud.io/v1beta1/virtualmachine_types.go

# 2. Generate CRDs (automated by pre-commit)
make gen-crds

# 3. Commit changes
git add .
git commit -m "feat: add new VM power states"

# 4. CI validates and builds with generated CRDs
git push origin feature-branch

vSphere Hardware Version Management

This document describes how to configure and upgrade VM hardware compatibility versions in VMware vSphere environments using virtrigaud.

Overview

VMware vSphere virtual machines have a hardware compatibility version (also called virtual hardware version) that determines which features and capabilities are available to the VM. Higher hardware versions provide access to newer features but require compatible ESXi hosts.

Note: Hardware version management is specific to VMware vSphere and is not available for other providers (LibVirt, Proxmox, etc.).

Hardware Version Numbers

Common hardware versions and their corresponding VMware products:

| Hardware Version | vSphere/ESXi Version | Key Features |
|------------------|----------------------|--------------|
| 10 | ESXi 5.5 | Legacy baseline |
| 11 | ESXi 6.0 | Enhanced graphics, larger VM memory |
| 13 | ESXi 6.5 | Enhanced security, more CPU/memory |
| 14 | ESXi 6.7 | Persistent memory, enhanced security |
| 15 | ESXi 6.7 U2 | Enhanced graphics, more vCPU |
| 17 | ESXi 7.0 | TPM 2.0, enhanced security |
| 18 | ESXi 7.0 U1 | Enhanced networking |
| 19 | ESXi 7.0 U2 | Precision time protocol |
| 20 | ESXi 7.0 U3 | Enhanced graphics, more memory |
| 21 | ESXi 8.0 | Latest features, DPU support |

Setting Hardware Version During VM Creation

Configure the hardware version in the VMClass using the extraConfig field:

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: modern-vm-class
  namespace: virtrigaud-system
spec:
  cpu: 4
  memory: 8Gi
  firmware: UEFI
  
  # vSphere-specific hardware version configuration
  extraConfig:
    vsphere.hardwareVersion: "21"  # Use latest hardware version
  
  diskDefaults:
    type: thin
    sizeGiB: 50
---
apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: modern-vm
  namespace: default
spec:
  providerRef:
    name: vsphere-provider
    namespace: virtrigaud-system
  
  classRef:
    name: modern-vm-class  # Uses hardware version 21
    namespace: virtrigaud-system
  
  imageRef:
    name: ubuntu-22-04
    namespace: virtrigaud-system

Upgrading Hardware Version for Existing VMs

You can upgrade the hardware version of existing VMs using the dedicated hardware upgrade API:

Using kubectl with Raw gRPC

# First, ensure the VM is powered off
kubectl patch vm my-vm --type='merge' -p='{"spec":{"powerState":"Off"}}'

# Wait for VM to be powered off, then upgrade hardware version
# Note: This requires direct access to the provider gRPC endpoint
# A kubectl plugin or controller extension would be needed for this operation

Programmatic Upgrade (Example Go Code)

package main

import (
    "context"
    "fmt"
    "log"
    
    providerv1 "github.com/projectbeskar/virtrigaud/proto/rpc/provider/v1"
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
)

func upgradeVMHardwareVersion(vmID string, targetVersion int32) error {
    // Connect to the vSphere provider's gRPC endpoint.
    // Plaintext transport is for illustration only; use TLS in production.
    conn, err := grpc.Dial("vsphere-provider:9090", grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        return fmt.Errorf("failed to connect: %w", err)
    }
    defer conn.Close()
    
    client := providerv1.NewProviderClient(conn)
    
    // Upgrade hardware version
    req := &providerv1.HardwareUpgradeRequest{
        Id:            vmID,
        TargetVersion: targetVersion,
    }
    
    resp, err := client.HardwareUpgrade(context.Background(), req)
    if err != nil {
        return fmt.Errorf("hardware upgrade failed: %w", err)
    }
    
    log.Printf("Hardware upgrade completed: %+v", resp)
    return nil
}

Requirements and Limitations

Prerequisites

  1. VM Must Be Powered Off: Hardware version upgrades require the VM to be completely powered off
  2. ESXi Host Compatibility: Target hardware version must be supported by the ESXi host
  3. VMware Tools: For best results, ensure VMware Tools is installed and up-to-date
  4. Backup Recommended: Take a snapshot before upgrading hardware version

Limitations

  1. One-Way Operation: Hardware version upgrades cannot be downgraded
  2. vSphere Only: This feature is not available for LibVirt, Proxmox, or other providers
  3. Host Requirements: Upgrading to newer versions may prevent VM from running on older ESXi hosts
  4. Compatibility: Some older guest operating systems may not support newer hardware versions

Best Practices

Choosing Hardware Version

  1. Match ESXi Version: Use the hardware version that matches your ESXi environment
  2. Conservative Approach: Don't always use the latest version unless you need specific features
  3. Test First: Test hardware version upgrades in development before production

Upgrade Process

  1. Plan Maintenance Window: VMs must be powered off during upgrade
  2. Backup First: Always take a snapshot before upgrading
  3. Batch Operations: Group VMs by hardware requirements for efficient upgrades
  4. Verify Compatibility: Ensure all ESXi hosts in your cluster support the target version

Example VMClass Configurations

Legacy Environment (ESXi 6.5)

extraConfig:
  vsphere.hardwareVersion: "13"

Modern Environment (ESXi 7.0)

extraConfig:
  vsphere.hardwareVersion: "17"

Latest Features (ESXi 8.0)

extraConfig:
  vsphere.hardwareVersion: "21"

Troubleshooting

Common Issues

  1. VM Not Powered Off

    Error: VM must be powered off for hardware upgrade, current state: poweredOn
    

    Solution: Power off the VM first using powerState: Off

  2. Unsupported Hardware Version

    Error: target version vmx-21 is not supported by ESXi host
    

    Solution: Check ESXi host compatibility and use a supported version

  3. Version Not Newer

    Error: target version vmx-15 is not newer than current version vmx-17
    

    Solution: Hardware versions can only be upgraded, not downgraded

Validation

After upgrading, verify the hardware version:

# Check the provider-reported VM details from the VirtualMachine status
kubectl get vm my-vm -o jsonpath='{.status.provider}'

Integration Examples

Complete VM Lifecycle with Hardware Version

# 1. Create VMClass with specific hardware version
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: production-vm-class
spec:
  cpu: 8
  memory: 16Gi
  firmware: UEFI
  extraConfig:
    vsphere.hardwareVersion: "19"  # ESXi 7.0 U2 compatible

---
# 2. Create VM using the class
apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: production-vm
spec:
  powerState: On
  providerRef:
    name: vsphere-provider
    namespace: virtrigaud-system
  classRef:
    name: production-vm-class
    namespace: virtrigaud-system
  imageRef:
    name: ubuntu-22-04
    namespace: virtrigaud-system

---
# 3. Update to newer hardware version (requires separate upgrade operation)
# This would typically be done through a controller or manual gRPC call
# after powering off the VM

This vSphere-specific feature provides fine-grained control over VM hardware capabilities while maintaining compatibility with your ESXi infrastructure.

vSphere Datastore Cluster (StoragePod) Support

This document describes how to use vSphere Datastore Clusters (also known as StoragePods) for automatic datastore selection when provisioning virtual machines with virtrigaud.

Note: StoragePod support is specific to the vSphere provider and is not available for Libvirt or Proxmox.

Overview

A vSphere Datastore Cluster (internally called a StoragePod) is a logical grouping of datastores managed together as a single unit. When you specify a Datastore Cluster instead of an individual datastore, virtrigaud automatically selects the datastore within the cluster that has the most available free space at provisioning time.

This simplifies VM placement in environments with multiple datastores: instead of tracking which individual datastore has capacity, you point to the cluster and let virtrigaud choose.

Datastore Selection Strategy

virtrigaud uses a simple, predictable strategy: pick the datastore with the most free space. This distributes VMs across the cluster over time as datastores fill up.

vSphere Storage DRS is not required to be enabled on the cluster. virtrigaud queries datastore summaries directly via the vSphere API.
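The strategy is small enough to sketch directly. The datastore type below is a stand-in for the summary fields virtrigaud reads via the vSphere property collector (names illustrative):

```go
package main

// datastore stands in for a Datastore Cluster member's summary data
// (name and summary.freeSpace) as read from the vSphere API.
type datastore struct {
	Name      string
	FreeBytes int64
}

// pickDatastore implements the documented strategy: the member with the
// most free space wins. It returns "" for an empty cluster, which the
// caller should treat as an error ("StoragePod contains no datastores").
func pickDatastore(members []datastore) string {
	best := ""
	var bestFree int64 = -1
	for _, ds := range members {
		if ds.FreeBytes > bestFree {
			best, bestFree = ds.Name, ds.FreeBytes
		}
	}
	return best
}
```

Because the comparison is on free space alone, placement naturally spreads VMs across the cluster as individual datastores fill up.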

Configuration

Per-VM Placement (VirtualMachine spec)

Specify storagePod inside spec.placement on a VirtualMachine resource:

apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: my-vm
  namespace: virtrigaud-system
spec:
  providerRef:
    name: vsphere-prod
  classRef:
    name: standard-2cpu-4gb
  imageRef:
    name: ubuntu-24-04
  placement:
    cluster: prod-cluster
    storagePod: "Production-DS-Cluster"   # Datastore Cluster name
    folder: /prod/vms

virtrigaud will inspect every datastore in Production-DS-Cluster and clone the VM onto the one with the most free space.

Provider-Level Default

Set spec.defaults.storagePod on the Provider resource to apply a Datastore Cluster as the default for all VMs that do not specify their own placement:

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: vsphere-prod
  namespace: virtrigaud-system
spec:
  type: vsphere
  endpoint: https://vcenter.example.com
  credentialSecretRef:
    name: vsphere-credentials
  defaults:
    cluster: prod-cluster
    storagePod: "Production-DS-Cluster"   # cluster-wide default
    folder: /prod/vms
  runtime:
    image: ghcr.io/projectbeskar/virtrigaud-provider-vsphere:latest

Alternatively, pass the default through the provider pod's environment by adding it to spec.runtime.env:

spec:
  runtime:
    env:
      - name: PROVIDER_DEFAULT_STORAGE_POD
        value: "Production-DS-Cluster"

Precedence Rules

When multiple sources specify storage placement, virtrigaud applies the following priority (highest to lowest):

| Priority | Source | Field |
|----------|--------|-------|
| 1 | VM spec (explicit datastore) | spec.placement.datastore |
| 2 | VM spec (StoragePod) | spec.placement.storagePod |
| 3 | Provider default (StoragePod) | spec.defaults.storagePod / PROVIDER_DEFAULT_STORAGE_POD |
| 4 | Provider default (datastore) | spec.defaults.datastore / PROVIDER_DEFAULT_DATASTORE |

An explicit datastore always wins. storagePod is only consulted when no explicit datastore is set.
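The precedence rules can be sketched as a single resolution function (field and type names are illustrative, not virtrigaud's internals):

```go
package main

// placement holds the two storage fields a VM spec (or the provider
// defaults) may carry; an empty string means "not set".
type placement struct {
	Datastore  string
	StoragePod string
}

// resolveStorage applies the precedence table: an explicit VM datastore
// always wins, then the VM's StoragePod, then the provider defaults.
// The bool result reports whether the chosen name is a StoragePod.
func resolveStorage(vm, providerDefaults placement) (name string, isPod bool) {
	switch {
	case vm.Datastore != "":
		return vm.Datastore, false // priority 1
	case vm.StoragePod != "":
		return vm.StoragePod, true // priority 2
	case providerDefaults.StoragePod != "":
		return providerDefaults.StoragePod, true // priority 3
	default:
		return providerDefaults.Datastore, false // priority 4
	}
}
```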

Examples

apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: web-server-01
  namespace: virtrigaud-system
spec:
  providerRef:
    name: vsphere-prod
  classRef:
    name: web-4cpu-8gb
  imageRef:
    name: ubuntu-24-04
  placement:
    cluster: prod-cluster
    storagePod: "SSD-Datastore-Cluster"

Override with an explicit datastore (e.g. for compliance)

When you need a specific datastore, such as one dedicated to regulated workloads, set datastore; any storagePod value is then ignored:

  placement:
    cluster: prod-cluster
    datastore: "regulated-ds-01"    # StoragePod is ignored when this is set
    storagePod: "SSD-Datastore-Cluster"

Use different clusters for different teams via a shared Provider

Combine a provider-level StoragePod default with per-VM overrides:

# Provider default: route most VMs to the general-purpose cluster
spec:
  defaults:
    cluster: general-cluster
    storagePod: "General-DS-Cluster"
# High-performance VM overrides both cluster and StoragePod
spec:
  placement:
    cluster: nvme-cluster
    storagePod: "NVMe-DS-Cluster"

How it Works Internally

When a VM is created and a StoragePod is resolved:

  1. virtrigaud creates a container view scoped to the vSphere root folder and searches for StoragePod managed objects.
  2. It matches the named StoragePod and reads its childEntity list, the set of datastores it contains.
  3. For each child datastore the summary.freeSpace property is retrieved via the property collector.
  4. The datastore with the highest freeSpace is selected and used as the target in the clone specification (VirtualMachineRelocateSpec.Datastore).

The selection happens at provisioning time; it is not re-evaluated on subsequent reconciliations or reboots.

Troubleshooting

"StoragePod 'X' not found"

  • Verify the name exactly matches the Datastore Cluster name in vCenter (case-sensitive).
  • Confirm the vCenter user account has Datastore.Browse privilege on the Datastore Cluster.
  • Check provider pod logs for the container view query.

"StoragePod 'X' contains no datastores"

The Datastore Cluster exists but is empty (no datastores are members). Add datastores to the cluster in vCenter.

"failed to retrieve datastores from StoragePod"

The provider account lacks permission to read datastore summary properties. Grant Datastore.Browse on the individual datastores within the cluster.

VM is always placed on the same datastore

This is expected when one datastore consistently has significantly more free space. It is not a bug.

Checking which datastore was selected

The provider logs an INFO message at VM creation time:

INFO  Selected datastore from StoragePod  storagePod=Production-DS-Cluster  datastore=vsanDatastore-02  freeSpaceGiB=812

Check the provider pod logs to see the selection for any specific VM.

Limitations

  • Free-space only: virtrigaud does not use vSphere Storage DRS policies, IOPS limits, or storage tags when selecting a datastore. Only free space is considered.
  • Point-in-time selection: The datastore is chosen once at clone time. Subsequent Storage vMotion by Storage DRS is not prevented.
  • No rebalancing: virtrigaud does not rebalance existing VMs when free space changes.
  • vSphere only: This feature has no equivalent for Libvirt or Proxmox providers.

Bearer Token Authentication

This guide covers how to configure bearer token authentication for VirtRigaud providers using JWT tokens and RBAC.

Overview

Bearer token authentication provides a stateless, scalable authentication mechanism using JSON Web Tokens (JWT). This approach is suitable for:

  • Multi-tenant environments: Different tokens for different tenants
  • API-based access: External systems accessing provider services
  • Short-lived sessions: Tokens with configurable expiration
  • Fine-grained permissions: Token-based RBAC

JWT Token Structure

Token Claims

{
  "iss": "virtrigaud-manager",
  "sub": "provider-client",
  "aud": "virtrigaud-provider",
  "exp": 1640995200,
  "iat": 1640908800,
  "nbf": 1640908800,
  "scope": "vm:create vm:read vm:update vm:delete",
  "tenant": "default",
  "provider": "vsphere",
  "jti": "unique-token-id"
}

Scopes Definition

| Scope | Description |
|-------|-------------|
| vm:create | Create virtual machines |
| vm:read | Read virtual machine information |
| vm:update | Update virtual machine configuration |
| vm:delete | Delete virtual machines |
| vm:power | Control virtual machine power state |
| vm:snapshot | Create and manage snapshots |
| vm:clone | Clone virtual machines |
| admin | Full administrative access |
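Checking a token's space-separated scope claim against this table is a simple membership test with an admin short-circuit, as the RBAC manager later in this guide does. A minimal sketch (hasScope is illustrative, not part of VirtRigaud's API):

```go
package main

import "strings"

// hasScope reports whether the space-separated scope claim grants
// `required`; the admin scope implies every other scope.
func hasScope(scopeClaim, required string) bool {
	for _, s := range strings.Fields(scopeClaim) {
		if s == "admin" || s == required {
			return true
		}
	}
	return false
}
```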

Token Generation

JWT Signing Key

# Generate RS256 private key
openssl genrsa -out jwt-private-key.pem 2048

# Extract public key
openssl rsa -in jwt-private-key.pem -pubout -out jwt-public-key.pem

# Store as Kubernetes secret
kubectl create secret generic jwt-keys \
  --from-file=private-key=jwt-private-key.pem \
  --from-file=public-key=jwt-public-key.pem \
  --namespace=virtrigaud-system

Token Generation Service

package auth

import (
    "crypto/rsa"
    "fmt"
    "strings"
    "time"

    "github.com/golang-jwt/jwt/v4"
)

// TokenClaims embeds the standard registered claims (iss, sub, aud, exp,
// iat, nbf, jti) and adds VirtRigaud-specific ones. Duplicating the
// registered claims as separate fields would shadow the embedded ones
// and bypass jwt's expiry validation.
type TokenClaims struct {
    Scope    string `json:"scope"`
    Tenant   string `json:"tenant"`
    Provider string `json:"provider"`
    jwt.RegisteredClaims
}

type TokenService struct {
    privateKey *rsa.PrivateKey
    publicKey  *rsa.PublicKey
    issuer     string
}

func NewTokenService(privateKey *rsa.PrivateKey, publicKey *rsa.PublicKey, issuer string) *TokenService {
    return &TokenService{
        privateKey: privateKey,
        publicKey:  publicKey,
        issuer:     issuer,
    }
}

func (ts *TokenService) GenerateToken(subject, tenant, provider string, scopes []string, duration time.Duration) (string, error) {
    now := time.Now()
    claims := &TokenClaims{
        Scope:    strings.Join(scopes, " "),
        Tenant:   tenant,
        Provider: provider,
        RegisteredClaims: jwt.RegisteredClaims{
            Issuer:    ts.issuer,
            Subject:   subject,
            Audience:  jwt.ClaimStrings{"virtrigaud-provider"},
            ExpiresAt: jwt.NewNumericDate(now.Add(duration)),
            IssuedAt:  jwt.NewNumericDate(now),
            NotBefore: jwt.NewNumericDate(now),
            ID:        generateJTI(),
        },
    }

    token := jwt.NewWithClaims(jwt.SigningMethodRS256, claims)
    return token.SignedString(ts.privateKey)
}

func (ts *TokenService) ValidateToken(tokenString string) (*TokenClaims, error) {
    token, err := jwt.ParseWithClaims(tokenString, &TokenClaims{}, func(token *jwt.Token) (interface{}, error) {
        if _, ok := token.Method.(*jwt.SigningMethodRSA); !ok {
            return nil, fmt.Errorf("unexpected signing method: %v", token.Header["alg"])
        }
        return ts.publicKey, nil
    })
    
    if err != nil {
        return nil, err
    }
    
    if claims, ok := token.Claims.(*TokenClaims); ok && token.Valid {
        return claims, nil
    }
    
    return nil, fmt.Errorf("invalid token")
}

// generateJTI returns a unique token ID (uses github.com/google/uuid).
func generateJTI() string {
    return uuid.New().String()
}

Provider Authentication Interceptor

gRPC Interceptor

package middleware

import (
    "context"
    "fmt"
    "strings"

    "google.golang.org/grpc"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/metadata"
    "google.golang.org/grpc/status"
    // plus the auth package defining TokenService (project-internal path omitted)
)

type AuthInterceptor struct {
    tokenService *auth.TokenService
    rbac         *RBACManager
}

func NewAuthInterceptor(tokenService *auth.TokenService, rbac *RBACManager) *AuthInterceptor {
    return &AuthInterceptor{
        tokenService: tokenService,
        rbac:         rbac,
    }
}

func (ai *AuthInterceptor) Unary() grpc.UnaryServerInterceptor {
    return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
        // Skip authentication for health checks
        if strings.HasSuffix(info.FullMethod, "/Health/Check") {
            return handler(ctx, req)
        }
        
        token, err := ai.extractToken(ctx)
        if err != nil {
            return nil, status.Errorf(codes.Unauthenticated, "missing or invalid token: %v", err)
        }
        
        claims, err := ai.tokenService.ValidateToken(token)
        if err != nil {
            return nil, status.Errorf(codes.Unauthenticated, "invalid token: %v", err)
        }
        
        // Check authorization
        if !ai.rbac.IsAuthorized(claims, info.FullMethod) {
            return nil, status.Errorf(codes.PermissionDenied, "insufficient permissions")
        }
        
        // Attach the validated claims to the request context for handlers.
        // (Production code should use an unexported key type, not a raw string.)
        ctx = context.WithValue(ctx, "claims", claims)
        
        return handler(ctx, req)
    }
}

func (ai *AuthInterceptor) extractToken(ctx context.Context) (string, error) {
    md, ok := metadata.FromIncomingContext(ctx)
    if !ok {
        return "", fmt.Errorf("missing metadata")
    }
    
    authHeaders := md.Get("authorization")
    if len(authHeaders) == 0 {
        return "", fmt.Errorf("missing authorization header")
    }
    
    authHeader := authHeaders[0]
    if !strings.HasPrefix(authHeader, "Bearer ") {
        return "", fmt.Errorf("invalid authorization header format")
    }
    
    return strings.TrimPrefix(authHeader, "Bearer "), nil
}

RBAC Manager

package middleware

import (
    "strings"
    // plus the auth package defining TokenClaims (project-internal path omitted)
)

type Permission struct {
    Resource string
    Action   string
}

type RBACManager struct {
    permissions map[string][]Permission
}

func NewRBACManager() *RBACManager {
    return &RBACManager{
        permissions: map[string][]Permission{
            // RPC method to required permissions mapping
            "/provider.v1.ProviderService/CreateVM": {
                {Resource: "vm", Action: "create"},
            },
            "/provider.v1.ProviderService/GetVM": {
                {Resource: "vm", Action: "read"},
            },
            "/provider.v1.ProviderService/UpdateVM": {
                {Resource: "vm", Action: "update"},
            },
            "/provider.v1.ProviderService/DeleteVM": {
                {Resource: "vm", Action: "delete"},
            },
            "/provider.v1.ProviderService/PowerVM": {
                {Resource: "vm", Action: "power"},
            },
            "/provider.v1.ProviderService/CreateSnapshot": {
                {Resource: "vm", Action: "snapshot"},
            },
            "/provider.v1.ProviderService/CloneVM": {
                {Resource: "vm", Action: "clone"},
            },
        },
    }
}

func (rbac *RBACManager) IsAuthorized(claims *auth.TokenClaims, method string) bool {
    requiredPerms, exists := rbac.permissions[method]
    if !exists {
        // Allow if no specific permissions required
        return true
    }
    
    userScopes := strings.Split(claims.Scope, " ")
    
    // Check if user has admin scope
    for _, scope := range userScopes {
        if scope == "admin" {
            return true
        }
    }
    
    // Check specific permissions
    for _, requiredPerm := range requiredPerms {
        requiredScope := requiredPerm.Resource + ":" + requiredPerm.Action
        
        hasPermission := false
        for _, userScope := range userScopes {
            if userScope == requiredScope {
                hasPermission = true
                break
            }
        }
        
        if !hasPermission {
            return false
        }
    }
    
    return true
}

Kubernetes RBAC Integration

ServiceAccount and ClusterRole

apiVersion: v1
kind: ServiceAccount
metadata:
  name: virtrigaud-token-manager
  namespace: virtrigaud-system

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: virtrigaud-token-manager
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: virtrigaud-token-manager
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: virtrigaud-token-manager
subjects:
  - kind: ServiceAccount
    name: virtrigaud-token-manager
    namespace: virtrigaud-system

Token Management ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: token-config
  namespace: virtrigaud-system
data:
  config.yaml: |
    tokenService:
      issuer: "virtrigaud-manager"
      defaultDuration: "1h"
      maxDuration: "24h"
      
    scopes:
      - name: "vm:create"
        description: "Create virtual machines"
      - name: "vm:read"
        description: "Read virtual machine information"
      - name: "vm:update"
        description: "Update virtual machine configuration"
      - name: "vm:delete"
        description: "Delete virtual machines"
      - name: "vm:power"
        description: "Control virtual machine power state"
      - name: "vm:snapshot"
        description: "Create and manage snapshots"
      - name: "vm:clone"
        description: "Clone virtual machines"
      - name: "admin"
        description: "Full administrative access"
        
    tenants:
      - name: "default"
        description: "Default tenant"
        allowedScopes: ["vm:create", "vm:read", "vm:update", "vm:delete", "vm:power"]
      - name: "development"
        description: "Development environment"
        allowedScopes: ["vm:create", "vm:read", "vm:update", "vm:delete", "vm:power", "vm:snapshot", "vm:clone"]
      - name: "production"
        description: "Production environment"
        allowedScopes: ["vm:read", "vm:power"]

Client Configuration

Manager Client Setup

package client

import (
    "context"
    
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    "google.golang.org/grpc/metadata"
)

type AuthenticatedClient struct {
    client providerv1.ProviderServiceClient
    token  string
}

func NewAuthenticatedClient(endpoint, token string) (*AuthenticatedClient, error) {
    // Plaintext transport for brevity; pair this with mTLS in production.
    conn, err := grpc.Dial(endpoint, grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        return nil, err
    }
    
    return &AuthenticatedClient{
        client: providerv1.NewProviderServiceClient(conn),
        token:  token,
    }, nil
}

func (ac *AuthenticatedClient) CreateVM(ctx context.Context, req *providerv1.CreateVMRequest) (*providerv1.CreateVMResponse, error) {
    ctx = ac.addAuthHeader(ctx)
    return ac.client.CreateVM(ctx, req)
}

func (ac *AuthenticatedClient) addAuthHeader(ctx context.Context) context.Context {
    md := metadata.Pairs("authorization", "Bearer "+ac.token)
    return metadata.NewOutgoingContext(ctx, md)
}

Token Refresh

package auth

import (
    "sync"
    "time"
)

type TokenManager struct {
    tokenService *TokenService
    currentToken string
    expiresAt    time.Time
    mutex        sync.RWMutex
    
    subject  string
    tenant   string
    provider string
    scopes   []string
}

func NewTokenManager(tokenService *TokenService, subject, tenant, provider string, scopes []string) *TokenManager {
    return &TokenManager{
        tokenService: tokenService,
        subject:      subject,
        tenant:       tenant,
        provider:     provider,
        scopes:       scopes,
    }
}

func (tm *TokenManager) GetToken() (string, error) {
    tm.mutex.RLock()
    if tm.currentToken != "" && time.Now().Before(tm.expiresAt.Add(-5*time.Minute)) {
        token := tm.currentToken
        tm.mutex.RUnlock()
        return token, nil
    }
    tm.mutex.RUnlock()
    
    return tm.refreshToken()
}

func (tm *TokenManager) refreshToken() (string, error) {
    tm.mutex.Lock()
    defer tm.mutex.Unlock()
    
    // Double-check after acquiring write lock
    if tm.currentToken != "" && time.Now().Before(tm.expiresAt.Add(-5*time.Minute)) {
        return tm.currentToken, nil
    }
    
    token, err := tm.tokenService.GenerateToken(tm.subject, tm.tenant, tm.provider, tm.scopes, time.Hour)
    if err != nil {
        return "", err
    }
    
    tm.currentToken = token
    tm.expiresAt = time.Now().Add(time.Hour)
    
    return token, nil
}

Helm Chart Integration

Provider Runtime with Bearer Token Auth

# values-bearer-auth.yaml
auth:
  type: "bearer"
  jwt:
    publicKeySecret: "jwt-keys"
    publicKeyKey: "public-key"
    issuer: "virtrigaud-manager"
    audience: "virtrigaud-provider"

# Environment variables for authentication
env:
  - name: AUTH_TYPE
    value: "bearer"
  - name: JWT_PUBLIC_KEY_PATH
    value: "/etc/jwt/public-key"
  - name: JWT_ISSUER
    value: "virtrigaud-manager"
  - name: JWT_AUDIENCE
    value: "virtrigaud-provider"

# Mount JWT public key
volumes:
  - name: jwt-public-key
    secret:
      secretName: jwt-keys

volumeMounts:
  - name: jwt-public-key
    mountPath: /etc/jwt
    readOnly: true

Monitoring and Logging

Authentication Metrics

package metrics

import (
    "time"
    
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    authenticationAttempts = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "virtrigaud_authentication_attempts_total",
            Help: "Total number of authentication attempts",
        },
        []string{"method", "result", "tenant"},
    )
    
    authenticationDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "virtrigaud_authentication_duration_seconds",
            Help: "Duration of authentication operations",
        },
        []string{"method", "result"},
    )
    
    activeTokens = promauto.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "virtrigaud_active_tokens",
            Help: "Number of active tokens by tenant",
        },
        []string{"tenant", "provider"},
    )
)

func RecordAuthAttempt(method, result, tenant string) {
    authenticationAttempts.WithLabelValues(method, result, tenant).Inc()
}

func RecordAuthDuration(method, result string, duration time.Duration) {
    authenticationDuration.WithLabelValues(method, result).Observe(duration.Seconds())
}

Audit Logging

package audit

import (
    "context"
    "encoding/json"
    "time"
    
    "go.uber.org/zap"
)

type AuditEvent struct {
    Timestamp time.Time `json:"timestamp"`
    EventType string    `json:"event_type"`
    Subject   string    `json:"subject"`
    Tenant    string    `json:"tenant"`
    Provider  string    `json:"provider"`
    Resource  string    `json:"resource"`
    Action    string    `json:"action"`
    Result    string    `json:"result"`
    Error     string    `json:"error,omitempty"`
    Metadata  map[string]interface{} `json:"metadata,omitempty"`
}

type AuditLogger struct {
    logger *zap.Logger
}

func NewAuditLogger(logger *zap.Logger) *AuditLogger {
    return &AuditLogger{logger: logger}
}

func (al *AuditLogger) LogAuthEvent(ctx context.Context, eventType, subject, tenant, provider, result string, err error) {
    event := AuditEvent{
        Timestamp: time.Now(),
        EventType: eventType,
        Subject:   subject,
        Tenant:    tenant,
        Provider:  provider,
        Result:    result,
    }
    
    if err != nil {
        event.Error = err.Error()
    }
    
    eventJSON, err := json.Marshal(event)
    if err != nil {
        al.logger.Error("failed to marshal audit event", zap.Error(err))
        return
    }
    al.logger.Info("audit_event", zap.String("event", string(eventJSON)))
}

Security Best Practices

1. Token Validation

// Always validate all token claims
func validateTokenClaims(claims *TokenClaims) error {
    now := time.Now()
    
    // Check expiration
    if claims.ExpiresAt < now.Unix() {
        return fmt.Errorf("token expired")
    }
    
    // Check not before
    if claims.NotBefore > now.Unix() {
        return fmt.Errorf("token not yet valid")
    }
    
    // Check issuer
    if claims.Issuer != expectedIssuer {
        return fmt.Errorf("invalid issuer")
    }
    
    // Check audience
    if claims.Audience != expectedAudience {
        return fmt.Errorf("invalid audience")
    }
    
    return nil
}

2. Rate Limiting

// Implement rate limiting for token generation
type RateLimiter struct {
    requests map[string][]time.Time
    mutex    sync.RWMutex
    limit    int
    window   time.Duration
}

func (rl *RateLimiter) Allow(key string) bool {
    rl.mutex.Lock()
    defer rl.mutex.Unlock()
    
    now := time.Now()
    requests := rl.requests[key]
    
    // Remove old requests outside the window
    var validRequests []time.Time
    for _, req := range requests {
        if now.Sub(req) < rl.window {
            validRequests = append(validRequests, req)
        }
    }
    
    // Check if we've exceeded the limit
    if len(validRequests) >= rl.limit {
        return false
    }
    
    // Add the current request
    validRequests = append(validRequests, now)
    rl.requests[key] = validRequests
    
    return true
}

3. Token Blacklisting

// Implement token blacklisting for revoked tokens
type TokenBlacklist struct {
    blacklistedTokens map[string]time.Time
    mutex             sync.RWMutex
}

func (tb *TokenBlacklist) IsBlacklisted(jti string) bool {
    // A write lock is required because expired entries are deleted below;
    // deleting under a read lock would be a data race.
    tb.mutex.Lock()
    defer tb.mutex.Unlock()
    
    expiresAt, exists := tb.blacklistedTokens[jti]
    if !exists {
        return false
    }
    
    // Remove expired entries
    if time.Now().After(expiresAt) {
        delete(tb.blacklistedTokens, jti)
        return false
    }
    
    return true
}

func (tb *TokenBlacklist) BlacklistToken(jti string, expiresAt time.Time) {
    tb.mutex.Lock()
    defer tb.mutex.Unlock()
    tb.blacklistedTokens[jti] = expiresAt
}

mTLS Security Configuration

This guide covers how to configure mutual TLS (mTLS) authentication between VirtRigaud managers and providers.

Overview

mTLS provides strong authentication and encryption for gRPC communication between the VirtRigaud manager and provider services. It ensures:

  • Authentication: Both client and server verify each other’s certificates
  • Encryption: All traffic is encrypted in transit
  • Certificate Pinning: Specific certificate authorities are trusted
  • Certificate Rotation: Automated certificate renewal

Certificate Management

1. Generate CA Certificate

# Create CA private key
openssl genrsa -out ca-key.pem 4096

# Create CA certificate
openssl req -new -x509 -key ca-key.pem -out ca-cert.pem -days 365 \
  -subj "/C=US/ST=CA/L=San Francisco/O=VirtRigaud/CN=VirtRigaud CA"

2. Generate Server Certificate (Provider)

# Create server private key
openssl genrsa -out server-key.pem 4096

# Create server certificate signing request
openssl req -new -key server-key.pem -out server-csr.pem \
  -subj "/C=US/ST=CA/L=San Francisco/O=VirtRigaud/CN=provider-service"

# Sign server certificate
openssl x509 -req -in server-csr.pem -CA ca-cert.pem -CAkey ca-key.pem \
  -CAcreateserial -out server-cert.pem -days 365 \
  -extensions v3_req -extfile <(cat <<EOF
[v3_req]
keyUsage = digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @alt_names
[alt_names]
DNS.1 = provider-service
DNS.2 = provider-service.default.svc.cluster.local
DNS.3 = localhost
IP.1 = 127.0.0.1
EOF
)

3. Generate Client Certificate (Manager)

# Create client private key
openssl genrsa -out client-key.pem 4096

# Create client certificate signing request
openssl req -new -key client-key.pem -out client-csr.pem \
  -subj "/C=US/ST=CA/L=San Francisco/O=VirtRigaud/CN=manager-client"

# Sign client certificate
openssl x509 -req -in client-csr.pem -CA ca-cert.pem -CAkey ca-key.pem \
  -CAcreateserial -out client-cert.pem -days 365 \
  -extensions v3_req -extfile <(cat <<EOF
[v3_req]
keyUsage = digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth
EOF
)

Kubernetes Secret Configuration

Provider TLS Secret

apiVersion: v1
kind: Secret
metadata:
  name: provider-tls
  namespace: default
type: kubernetes.io/tls
data:
  tls.crt: # base64 encoded server-cert.pem
  tls.key: # base64 encoded server-key.pem
  ca.crt: # base64 encoded ca-cert.pem

Manager TLS Secret

apiVersion: v1
kind: Secret
metadata:
  name: manager-tls
  namespace: virtrigaud-system
type: kubernetes.io/tls
data:
  tls.crt: # base64 encoded client-cert.pem
  tls.key: # base64 encoded client-key.pem
  ca.crt: # base64 encoded ca-cert.pem
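
With the PEM files from the steps above on disk, both secrets can be created directly with kubectl. Note that `create secret generic` produces type `Opaque`; apply the manifests above with base64-encoded data instead if the `kubernetes.io/tls` type matters to your tooling:

```shell
# Provider (server) secret
kubectl create secret generic provider-tls \
  --from-file=tls.crt=server-cert.pem \
  --from-file=tls.key=server-key.pem \
  --from-file=ca.crt=ca-cert.pem \
  --namespace=default

# Manager (client) secret
kubectl create secret generic manager-tls \
  --from-file=tls.crt=client-cert.pem \
  --from-file=tls.key=client-key.pem \
  --from-file=ca.crt=ca-cert.pem \
  --namespace=virtrigaud-system
```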

Provider Configuration

SDK Server Configuration

package main

import (
    "crypto/tls"
    "crypto/x509"
    "fmt"
    "os"
    
    "github.com/projectbeskar/virtrigaud/sdk/provider/server"
)

func main() {
    // Load certificates
    cert, err := tls.LoadX509KeyPair("/etc/tls/tls.crt", "/etc/tls/tls.key")
    if err != nil {
        panic(fmt.Sprintf("Failed to load server certificates: %v", err))
    }
    
    // Load CA certificate for client verification
    caCert, err := os.ReadFile("/etc/tls/ca.crt")
    if err != nil {
        panic(fmt.Sprintf("Failed to load CA certificate: %v", err))
    }
    
    caCertPool := x509.NewCertPool()
    if !caCertPool.AppendCertsFromPEM(caCert) {
        panic("Failed to parse CA certificate")
    }
    
    // Configure TLS
    tlsConfig := &tls.Config{
        Certificates: []tls.Certificate{cert},
        ClientAuth:   tls.RequireAndVerifyClientCert,
        ClientCAs:    caCertPool,
        MinVersion:   tls.VersionTLS12,
        CipherSuites: []uint16{
            tls.TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,
            tls.TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,
            tls.TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,
            tls.TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,
        },
    }
    
    // Create server with mTLS
    srv, err := server.New(&server.Config{
        Port:      9443,
        TLS:       tlsConfig,
        EnableTLS: true,
    })
    if err != nil {
        panic(fmt.Sprintf("Failed to create server: %v", err))
    }
    
    // Register your provider implementation here
    // providerv1.RegisterProviderServiceServer(srv.GRPCServer(), &YourProvider{})
    
    if err := srv.Serve(); err != nil {
        panic(fmt.Sprintf("Server failed: %v", err))
    }
}

Helm Chart Values (Provider Runtime)

# values-mtls.yaml
tls:
  enabled: true
  secretName: provider-tls

# Mount TLS certificates
volumes:
  - name: tls-certs
    secret:
      secretName: provider-tls

volumeMounts:
  - name: tls-certs
    mountPath: /etc/tls
    readOnly: true

# Environment variables for TLS
env:
  - name: TLS_ENABLED
    value: "true"
  - name: TLS_CERT_PATH
    value: "/etc/tls/tls.crt"
  - name: TLS_KEY_PATH
    value: "/etc/tls/tls.key"
  - name: TLS_CA_PATH
    value: "/etc/tls/ca.crt"

Manager Configuration

Client TLS Configuration

// In manager code
func createProviderClient(endpoint string) (providerv1.ProviderServiceClient, error) {
    // Load client certificates
    cert, err := tls.LoadX509KeyPair("/etc/manager-tls/tls.crt", "/etc/manager-tls/tls.key")
    if err != nil {
        return nil, fmt.Errorf("failed to load client certificates: %w", err)
    }
    
    // Load CA certificate for server verification
    caCert, err := os.ReadFile("/etc/manager-tls/ca.crt")
    if err != nil {
        return nil, fmt.Errorf("failed to load CA certificate: %w", err)
    }
    
    caCertPool := x509.NewCertPool()
    if !caCertPool.AppendCertsFromPEM(caCert) {
        return nil, fmt.Errorf("failed to parse CA certificate")
    }
    
    // Configure TLS
    tlsConfig := &tls.Config{
        Certificates: []tls.Certificate{cert},
        RootCAs:      caCertPool,
        ServerName:   "provider-service", // Must match server certificate CN/SAN
        MinVersion:   tls.VersionTLS12,
    }
    
    // Create gRPC connection with mTLS
    conn, err := grpc.Dial(endpoint,
        grpc.WithTransportCredentials(credentials.NewTLS(tlsConfig)),
    )
    if err != nil {
        return nil, fmt.Errorf("failed to connect: %w", err)
    }
    
    return providerv1.NewProviderServiceClient(conn), nil
}

Certificate Rotation

Using cert-manager

# Install cert-manager first
# kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.12.0/cert-manager.yaml

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: virtrigaud-ca-issuer
spec:
  ca:
    secretName: virtrigaud-ca-secret

---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: provider-tls
  namespace: default
spec:
  secretName: provider-tls
  issuerRef:
    name: virtrigaud-ca-issuer
    kind: ClusterIssuer
  commonName: provider-service
  dnsNames:
    - provider-service
    - provider-service.default.svc.cluster.local
  duration: 8760h # 1 year
  renewBefore: 720h # 30 days before expiry

Manual Rotation Script

#!/bin/bash
# rotate-certs.sh

NAMESPACE=${1:-default}
SECRET_NAME=${2:-provider-tls}

echo "Rotating certificates for $SECRET_NAME in namespace $NAMESPACE"

# Generate new certificates (using the same process as above)
# ...

# Update Kubernetes secret
kubectl create secret tls $SECRET_NAME \
  --cert=server-cert.pem \
  --key=server-key.pem \
  --namespace=$NAMESPACE \
  --dry-run=client -o yaml | kubectl apply -f -

# Add CA certificate to the secret
kubectl patch secret $SECRET_NAME -n $NAMESPACE \
  --patch="$(cat <<EOF
data:
  ca.crt: $(base64 -w 0 ca-cert.pem)
EOF
)"

# Restart provider deployment to pick up new certificates
kubectl rollout restart deployment/provider-deployment -n $NAMESPACE

echo "Certificate rotation completed"

Security Best Practices

1. Certificate Validation

// Always validate certificate chains
func validateCertificate(cert *x509.Certificate, caCert *x509.Certificate) error {
    roots := x509.NewCertPool()
    roots.AddCert(caCert)
    
    opts := x509.VerifyOptions{
        Roots: roots,
        KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
    }
    
    _, err := cert.Verify(opts)
    return err
}

2. Certificate Pinning

// Pin specific certificate or CA
func createTLSConfigWithPinning(expectedCertFingerprint string) *tls.Config {
    return &tls.Config{
        VerifyPeerCertificate: func(rawCerts [][]byte, verifiedChains [][]*x509.Certificate) error {
            if len(rawCerts) == 0 {
                return fmt.Errorf("no certificates provided")
            }
            
            cert, err := x509.ParseCertificate(rawCerts[0])
            if err != nil {
                return err
            }
            
            fingerprint := sha256.Sum256(cert.Raw)
            if hex.EncodeToString(fingerprint[:]) != expectedCertFingerprint {
                return fmt.Errorf("certificate fingerprint mismatch")
            }
            
            return nil
        },
    }
}

3. Monitoring and Alerting

# Prometheus AlertManager rules
groups:
  - name: virtrigaud.certificates
    rules:
      - alert: CertificateExpiringSoon
        expr: (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate expiring soon"
          description: "Certificate {{ $labels.name }} expires in less than 30 days"
      
      - alert: CertificateExpired
        expr: certmanager_certificate_expiration_timestamp_seconds < time()
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Certificate expired"
          description: "Certificate {{ $labels.name }} has expired"

Troubleshooting

Common Issues

  1. Certificate chain issues

    # Verify certificate chain
    openssl verify -CAfile ca-cert.pem server-cert.pem
    
  2. SAN mismatch

    # Check certificate SAN entries
    openssl x509 -in server-cert.pem -text -noout | grep -A1 "Subject Alternative Name"
    
  3. TLS handshake failures

    # Test TLS connection
    openssl s_client -connect provider-service:9443 -cert client-cert.pem -key client-key.pem -CAfile ca-cert.pem
    
  4. Clock skew issues

    # Ensure time synchronization (ntpdate is deprecated; use chrony or systemd-timesyncd)
    timedatectl set-ntp true
    

Debug Commands

# Check certificate validity
kubectl get secret provider-tls -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -text -noout

# Monitor certificate expiration
kubectl get certificates

# Check provider logs for TLS errors
kubectl logs deployment/provider-deployment | grep -i tls

External Secrets Management

This guide covers integrating VirtRigaud providers with external secret management systems using ExternalSecrets operators and best practices for credential security.

Overview

External secret management provides secure, centralized credential storage and automatic secret rotation. Supported systems include:

  • HashiCorp Vault: Enterprise secret management with dynamic secrets
  • AWS Secrets Manager: Cloud-native secret storage with automatic rotation
  • Azure Key Vault: Azure-integrated secret management
  • Google Secret Manager: GCP secret storage service
  • Kubernetes External Secrets: Generic external secret integration

External Secrets Operator Setup

Installation

# Install External Secrets Operator
helm repo add external-secrets https://charts.external-secrets.io
helm repo update

helm install external-secrets external-secrets/external-secrets \
  --namespace external-secrets-system \
  --create-namespace \
  --set installCRDs=true
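
After installation, it is worth confirming the operator is running and its CRDs are registered before creating any SecretStore resources (assumes kubectl points at the same cluster):

```shell
# Operator pods should be Running
kubectl get pods -n external-secrets-system

# CRDs such as secretstores.external-secrets.io should be listed
kubectl get crds | grep external-secrets.io
```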

Basic Configuration

# ServiceAccount for External Secrets Operator
apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-secrets
  namespace: virtrigaud-system
  annotations:
    # For AWS IRSA (IAM Roles for Service Accounts)
    eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/external-secrets-role

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: external-secrets
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["create", "update", "patch", "delete", "get", "list", "watch"]
  - apiGroups: ["external-secrets.io"]
    resources: ["*"]
    verbs: ["*"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: external-secrets
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: external-secrets
subjects:
  - kind: ServiceAccount
    name: external-secrets
    namespace: virtrigaud-system

HashiCorp Vault Integration

Vault SecretStore

apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: vault-secret-store
  namespace: virtrigaud-system
spec:
  provider:
    vault:
      server: "https://vault.example.com:8200"
      path: "secret"
      version: "v2"
      auth:
        # Use Kubernetes service account for authentication
        kubernetes:
          mountPath: "kubernetes"
          role: "virtrigaud-role"
          serviceAccountRef:
            name: "external-secrets"

---
# For multi-namespace access
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault-cluster-store
spec:
  provider:
    vault:
      server: "https://vault.example.com:8200"
      path: "secret"
      version: "v2"
      auth:
        kubernetes:
          mountPath: "kubernetes"
          role: "virtrigaud-cluster-role"
          serviceAccountRef:
            name: "external-secrets"
            namespace: "virtrigaud-system"

Vault Policy Configuration

# Vault policy for VirtRigaud secrets
path "secret/data/virtrigaud/*" {
  capabilities = ["read"]
}

path "secret/data/providers/*" {
  capabilities = ["read"]
}

# Dynamic database credentials
path "database/creds/readonly" {
  capabilities = ["read"]
}

# PKI for TLS certificates
path "pki/issue/virtrigaud" {
  capabilities = ["create", "update"]
}
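
Assuming the policy above is saved as `virtrigaud-policy.hcl`, it can be written to Vault and bound to the Kubernetes auth role referenced by the SecretStore (the role and service account names below match the manifests in this guide):

```shell
# Register the policy
vault policy write virtrigaud-policy virtrigaud-policy.hcl

# Bind the policy to the Kubernetes auth role used by the SecretStore
vault write auth/kubernetes/role/virtrigaud-role \
  bound_service_account_names=external-secrets \
  bound_service_account_namespaces=virtrigaud-system \
  policies=virtrigaud-policy \
  ttl=1h
```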

vSphere Credentials from Vault

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: vsphere-credentials
  namespace: vsphere-providers
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-secret-store
    kind: SecretStore
  target:
    name: vsphere-credentials
    creationPolicy: Owner
    template:
      type: Opaque
      data:
        username: "{{ .username }}"
        password: "{{ .password }}"
        server: "{{ .server }}"
        # Optional: TLS certificate
        ca.crt: "{{ .ca_cert | b64dec }}"
  data:
    - secretKey: username
      remoteRef:
        key: secret/data/providers/vsphere
        property: username
    - secretKey: password
      remoteRef:
        key: secret/data/providers/vsphere
        property: password
    - secretKey: server
      remoteRef:
        key: secret/data/providers/vsphere
        property: server
    - secretKey: ca_cert
      remoteRef:
        key: secret/data/providers/vsphere
        property: ca_cert

AWS Secrets Manager Integration

AWS SecretStore with IRSA

apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets-manager
  namespace: virtrigaud-system
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-west-2
      auth:
        # Use IAM Roles for Service Accounts (IRSA)
        serviceAccount:
          name: external-secrets
          namespace: virtrigaud-system

---
# IAM policy for the IRSA role (a JSON policy document managed in AWS IAM,
# not a Kubernetes resource)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue",
        "secretsmanager:DescribeSecret"
      ],
      "Resource": [
        "arn:aws:secretsmanager:us-west-2:ACCOUNT:secret:virtrigaud/*"
      ]
    }
  ]
}

AWS Secret Configuration

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: aws-provider-credentials
  namespace: provider-namespace
spec:
  refreshInterval: 15m
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: provider-credentials
    creationPolicy: Owner
  data:
    - secretKey: credentials.json
      remoteRef:
        key: "virtrigaud/provider-credentials"
        property: "credentials"
    - secretKey: api-key
      remoteRef:
        key: "virtrigaud/api-keys"
        property: "provider-api-key"

Azure Key Vault Integration

Azure SecretStore

apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: azure-key-vault
  namespace: virtrigaud-system
spec:
  provider:
    azurekv:
      vaultUrl: "https://virtrigaud-vault.vault.azure.net/"
      authType: "ManagedIdentity"
      # Or use Service Principal:
      # authType: "ServicePrincipal"
      # authSecretRef:
      #   clientId:
      #     name: azure-secret
      #     key: client-id
      #   clientSecret:
      #     name: azure-secret
      #     key: client-secret
      tenantId: "tenant-id-here"

---
# Managed Identity setup (ARM template or Terraform)
apiVersion: v1
kind: Secret
metadata:
  name: azure-config
  namespace: virtrigaud-system
type: Opaque
data:
  # Base64 encoded values
  tenant-id: dGVuYW50LWlkLWhlcmU=
  client-id: Y2xpZW50LWlkLWhlcmU=
  client-secret: Y2xpZW50LXNlY3JldC1oZXJl

Azure Key Vault Secret

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: azure-provider-secrets
  namespace: provider-namespace
spec:
  refreshInterval: 30m
  secretStoreRef:
    name: azure-key-vault
    kind: SecretStore
  target:
    name: provider-secrets
    creationPolicy: Owner
  data:
    - secretKey: subscription-id
      remoteRef:
        key: "azure-subscription-id"
    - secretKey: resource-group
      remoteRef:
        key: "azure-resource-group"
    - secretKey: client-certificate
      remoteRef:
        key: "azure-client-cert"

Google Secret Manager Integration

GCP SecretStore

apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: gcp-secret-manager
  namespace: virtrigaud-system
spec:
  provider:
    gcpsm:
      projectId: "your-gcp-project"
      auth:
        # Use Workload Identity
        workloadIdentity:
          clusterLocation: us-central1
          clusterName: virtrigaud-cluster
          serviceAccountRef:
            name: external-secrets
            namespace: virtrigaud-system

---
# Workload Identity binding
apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-secrets
  namespace: virtrigaud-system
  annotations:
    iam.gke.io/gcp-service-account: virtrigaud-secrets@PROJECT.iam.gserviceaccount.com

GCP Secret Configuration

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: gcp-provider-secrets
  namespace: provider-namespace
spec:
  refreshInterval: 20m
  secretStoreRef:
    name: gcp-secret-manager
    kind: SecretStore
  target:
    name: gcp-provider-credentials
    creationPolicy: Owner
  data:
    - secretKey: service-account.json
      remoteRef:
        key: "virtrigaud-service-account"
        version: "latest"
    - secretKey: project-id
      remoteRef:
        key: "gcp-project-id"
        version: "latest"

Provider-Specific Configurations

vSphere Provider with Dynamic Credentials

# Vault configuration for vSphere dynamic credentials
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: vsphere-dynamic-credentials
  namespace: vsphere-providers
spec:
  refreshInterval: 15m  # Short refresh for dynamic credentials
  secretStoreRef:
    name: vault-secret-store
    kind: SecretStore
  target:
    name: vsphere-dynamic-creds
    creationPolicy: Owner
    template:
      type: Opaque
      data:
        username: "{{ .username }}"
        password: "{{ .password }}"
        server: "{{ .server }}"
        session_ttl: "{{ .lease_duration }}"
  data:
    - secretKey: username
      remoteRef:
        key: "vsphere/creds/dynamic-role"
        property: "username"
    - secretKey: password
      remoteRef:
        key: "vsphere/creds/dynamic-role"
        property: "password"
    - secretKey: server
      remoteRef:
        key: "secret/data/vsphere/static"
        property: "server"
    - secretKey: lease_duration
      remoteRef:
        key: "vsphere/creds/dynamic-role"
        property: "lease_duration"

---
# Provider deployment using dynamic credentials
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vsphere-provider
  namespace: vsphere-providers
spec:
  template:
    spec:
      containers:
        - name: provider
          env:
            - name: VSPHERE_USERNAME
              valueFrom:
                secretKeyRef:
                  name: vsphere-dynamic-creds
                  key: username
            - name: VSPHERE_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: vsphere-dynamic-creds
                  key: password
            - name: VSPHERE_SERVER
              valueFrom:
                secretKeyRef:
                  name: vsphere-dynamic-creds
                  key: server

Libvirt Provider with SSH Keys

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: libvirt-ssh-keys
  namespace: libvirt-providers
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-secret-store
    kind: SecretStore
  target:
    name: libvirt-ssh-credentials
    creationPolicy: Owner
    template:
      type: kubernetes.io/ssh-auth
      data:
        ssh-privatekey: "{{ .private_key }}"
        ssh-publickey: "{{ .public_key }}"
        known_hosts: "{{ .known_hosts }}"
  data:
    - secretKey: private_key
      remoteRef:
        key: "secret/data/libvirt/ssh"
        property: "private_key"
    - secretKey: public_key
      remoteRef:
        key: "secret/data/libvirt/ssh"
        property: "public_key"
    - secretKey: known_hosts
      remoteRef:
        key: "secret/data/libvirt/ssh"
        property: "known_hosts"

---
# Mount SSH keys in provider
apiVersion: apps/v1
kind: Deployment
metadata:
  name: libvirt-provider
spec:
  template:
    spec:
      containers:
        - name: provider
          volumeMounts:
            - name: ssh-keys
              mountPath: /home/provider/.ssh
              readOnly: true
          env:
            - name: SSH_AUTH_SOCK
              value: "/tmp/ssh-agent.sock"
      volumes:
        - name: ssh-keys
          secret:
            secretName: libvirt-ssh-credentials
            defaultMode: 0600

TLS Certificate Management

Automatic TLS with External Secrets

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: provider-tls-certs
  namespace: provider-namespace
spec:
  refreshInterval: 24h
  secretStoreRef:
    name: vault-secret-store
    kind: SecretStore
  target:
    name: provider-tls
    creationPolicy: Owner
    template:
      type: kubernetes.io/tls
      data:
        tls.crt: "{{ .certificate }}"
        tls.key: "{{ .private_key }}"
        ca.crt: "{{ .ca_certificate }}"
  data:
    - secretKey: certificate
      remoteRef:
        key: "pki/issue/virtrigaud"
        property: "certificate"
    - secretKey: private_key
      remoteRef:
        key: "pki/issue/virtrigaud"
        property: "private_key"
    - secretKey: ca_certificate
      remoteRef:
        key: "pki/issue/virtrigaud"
        property: "issuing_ca"

---
# Vault PKI configuration (run in Vault)
# vault write pki/roles/virtrigaud \
#   allowed_domains="virtrigaud.local,provider-service" \
#   allow_subdomains=true \
#   max_ttl="8760h" \
#   generate_lease=true

Monitoring and Alerting

ExternalSecret Monitoring

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: external-secrets-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: external-secrets
  endpoints:
    - port: metrics
      interval: 30s

---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: external-secrets-alerts
  namespace: monitoring
spec:
  groups:
    - name: external-secrets.rules
      rules:
        - alert: ExternalSecretSyncFailure
          expr: increase(external_secrets_sync_calls_error[5m]) > 0
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "External secret sync failure"
            description: "ExternalSecret {{ $labels.name }} in namespace {{ $labels.namespace }} failed to sync"
        
        - alert: ExternalSecretStale
          expr: increase(external_secrets_sync_calls_total[1h]) == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "External secret not refreshed"
            description: "ExternalSecret {{ $labels.name }} has not been refreshed for over 1 hour"

Custom Monitoring

package monitoring

import (
    "context"
    "time"
    
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

var (
    secretAge = promauto.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "virtrigaud_secret_age_seconds",
            Help: "Age of provider secrets in seconds",
        },
        []string{"secret_name", "namespace", "provider"},
    )
    
    secretRotationCount = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "virtrigaud_secret_rotations_total",
            Help: "Total number of secret rotations",
        },
        []string{"secret_name", "namespace", "provider"},
    )
)

type SecretMonitor struct {
    client kubernetes.Interface
}

func (sm *SecretMonitor) MonitorSecrets(ctx context.Context) {
    ticker := time.NewTicker(60 * time.Second)
    defer ticker.Stop()
    
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            sm.updateSecretMetrics(ctx)
        }
    }
}

func (sm *SecretMonitor) updateSecretMetrics(ctx context.Context) {
    secrets, err := sm.client.CoreV1().Secrets("").List(ctx, metav1.ListOptions{
        LabelSelector: "app.kubernetes.io/managed-by=external-secrets",
    })
    if err != nil {
        return
    }
    
    for _, secret := range secrets.Items {
        provider := secret.Labels["provider"]
        if provider == "" {
            continue
        }
        
        age := time.Since(secret.CreationTimestamp.Time).Seconds()
        secretAge.WithLabelValues(secret.Name, secret.Namespace, provider).Set(age)
    }
}

Security Best Practices

1. Least Privilege Access

# Minimal Vault policy for a specific provider.
# Replace AUTH_ACCESSOR with the accessor of your Kubernetes auth mount
# (see `vault auth list -format=json`); identity templates require a
# literal accessor and do not support wildcards.
path "secret/data/providers/vsphere/{{identity.entity.aliases.AUTH_ACCESSOR.metadata.service_account_namespace}}" {
  capabilities = ["read"]
}

# Time-bound secrets
path "vsphere/creds/readonly" {
  capabilities = ["read"]
  allowed_parameters = {
    "ttl" = ["15m", "30m", "1h"]
  }
}

2. Secret Rotation Automation

apiVersion: batch/v1
kind: CronJob
metadata:
  name: rotate-provider-secrets
  namespace: virtrigaud-system
spec:
  schedule: "0 2 * * 0"  # Weekly on Sunday at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: secret-rotator
              image: virtrigaud/secret-rotator:latest
              command:
                - /bin/sh
                - -c
                - |
                  # Force refresh of all external secrets
                  # (--overwrite is required on repeat runs, since the
                  # annotation already exists after the first rotation)
                  kubectl annotate externalsecret --all \
                    --overwrite \
                    force-sync="$(date +%s)" \
                    --namespace=vsphere-providers
                  
                  # Restart provider deployments to pick up new secrets
                  kubectl rollout restart deployment \
                    --selector=app.kubernetes.io/name=virtrigaud-provider-runtime \
                    --namespace=vsphere-providers
          restartPolicy: OnFailure
          serviceAccountName: secret-rotator

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: secret-rotator
rules:
  - apiGroups: ["external-secrets.io"]
    resources: ["externalsecrets"]
    verbs: ["get", "list", "patch"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "patch"]
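
The CronJob sets `serviceAccountName: secret-rotator`, so the ClusterRole above also needs a ServiceAccount and a binding; a minimal sketch:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: secret-rotator
  namespace: virtrigaud-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: secret-rotator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: secret-rotator
subjects:
  - kind: ServiceAccount
    name: secret-rotator
    namespace: virtrigaud-system
```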

3. Audit Logging

# Vault audit configuration
vault audit enable file file_path=/vault/logs/audit.log

# Example audit log entry structure
{
  "time": "2023-12-01T10:30:00Z",
  "type": "request",
  "auth": {
    "client_token": "hvs.xxx",
    "accessor": "hmac-sha256:xxx",
    "display_name": "kubernetes-virtrigaud-system-external-secrets",
    "policies": ["virtrigaud-policy"],
    "metadata": {
      "service_account_name": "external-secrets",
      "service_account_namespace": "virtrigaud-system"
    }
  },
  "request": {
    "id": "request-id",
    "operation": "read",
    "path": "secret/data/providers/vsphere",
    "data": null,
    "remote_address": "10.0.0.100"
  }
}

4. Emergency Procedures

#!/bin/bash
# emergency-secret-rotation.sh

echo "=== Emergency Secret Rotation ==="

# 1. Revoke all active leases for a provider
vault lease revoke -prefix vsphere/creds/

# 2. Force refresh all external secrets
kubectl get externalsecret --all-namespaces -o name | \
  xargs -I {} kubectl annotate {} force-sync="$(date +%s)"

# 3. Restart all provider deployments
kubectl get deployments --all-namespaces \
  -l app.kubernetes.io/name=virtrigaud-provider-runtime \
  -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' | \
  while IFS=/ read -r namespace name; do
    kubectl rollout restart deployment "$name" --namespace "$namespace"
  done

# 4. Monitor rollout status
kubectl get deployments --all-namespaces \
  -l app.kubernetes.io/name=virtrigaud-provider-runtime \
  -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' | \
  while IFS=/ read -r namespace name; do
    kubectl rollout status deployment "$name" --namespace "$namespace" --timeout=300s
  done

echo "Emergency rotation completed"

5. Secret Validation

package validation

import (
    "crypto/tls"
    "crypto/x509"
    "encoding/pem"
    "fmt"
    "time"
)

func ValidateSecret(secretData map[string][]byte, secretType string) error {
    switch secretType {
    case "tls":
        return validateTLSSecret(secretData)
    case "ssh":
        return validateSSHSecret(secretData)
    case "credential":
        return validateCredentialSecret(secretData)
    }
    return nil
}

func validateTLSSecret(data map[string][]byte) error {
    cert, ok := data["tls.crt"]
    if !ok {
        return fmt.Errorf("missing tls.crt")
    }
    
    key, ok := data["tls.key"]
    if !ok {
        return fmt.Errorf("missing tls.key")
    }
    
    // Parse certificate
    block, _ := pem.Decode(cert)
    if block == nil {
        return fmt.Errorf("failed to parse certificate PEM")
    }
    
    parsedCert, err := x509.ParseCertificate(block.Bytes)
    if err != nil {
        return fmt.Errorf("failed to parse certificate: %w", err)
    }
    
    // Check expiration
    if time.Now().After(parsedCert.NotAfter) {
        return fmt.Errorf("certificate expired on %v", parsedCert.NotAfter)
    }
    
    if time.Now().Add(24*time.Hour).After(parsedCert.NotAfter) {
        return fmt.Errorf("certificate expires soon on %v", parsedCert.NotAfter)
    }
    
    // Validate that the private key parses and matches the certificate
    if _, err := tls.X509KeyPair(cert, key); err != nil {
        return fmt.Errorf("invalid certificate/key pair: %w", err)
    }
    
    return nil
}
}

Network Policies for Provider Security

This guide covers Kubernetes NetworkPolicy configurations to secure communication between VirtRigaud components and provider services.

Overview

NetworkPolicies provide network-level security by controlling traffic flow between pods, namespaces, and external endpoints. For VirtRigaud providers, this includes:

  • Ingress Control: Restricting which services can communicate with providers
  • Egress Control: Limiting provider access to external hypervisor endpoints
  • Namespace Isolation: Preventing cross-tenant communication
  • External Access: Controlling access to hypervisor management interfaces

Basic NetworkPolicy Template

Provider Ingress Policy

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: provider-ingress
  namespace: provider-namespace
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: virtrigaud-provider
  policyTypes:
    - Ingress
  ingress:
    # Allow from VirtRigaud manager
    - from:
        - namespaceSelector:
            matchLabels:
              name: virtrigaud-system
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: virtrigaud-manager
      ports:
        - protocol: TCP
          port: 9443  # gRPC provider port
    
    # Allow health checks from monitoring
    - from:
        - namespaceSelector:
            matchLabels:
              name: monitoring
        - podSelector:
            matchLabels:
              app: prometheus
      ports:
        - protocol: TCP
          port: 8080  # Health/metrics port
    
    # Allow from same namespace (for debugging)
    - from:
        - podSelector: {}
      ports:
        - protocol: TCP
          port: 8080

Provider Egress Policy

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: provider-egress
  namespace: provider-namespace
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: virtrigaud-provider
  policyTypes:
    - Egress
  egress:
    # Allow DNS resolution
    - to: []
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    
    # Allow HTTPS to Kubernetes API
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
      ports:
        - protocol: TCP
          port: 443
    
    # Allow access to hypervisor management interfaces
    - to: []
      ports:
        - protocol: TCP
          port: 443  # vCenter HTTPS
        - protocol: TCP
          port: 80   # vCenter HTTP (if needed)
    
    # For libvirt providers - allow access to hypervisor hosts.
    # Hypervisor hosts sit outside the pod network, so target them with an
    # ipBlock (a podSelector matches pods, not nodes); adjust the CIDR to
    # your hypervisor management network.
    - to:
        - ipBlock:
            cidr: 10.1.0.0/24  # example hypervisor management network
      ports:
        - protocol: TCP
          port: 16509  # libvirt daemon

Environment-Specific Policies

vSphere Provider

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vsphere-provider-policy
  namespace: vsphere-providers
  labels:
    provider: vsphere
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: virtrigaud-provider-runtime
      provider: vsphere
  policyTypes:
    - Ingress
    - Egress
  
  ingress:
    # Manager access
    - from:
        - namespaceSelector:
            matchLabels:
              name: virtrigaud-system
      ports:
        - protocol: TCP
          port: 9443
    
    # Monitoring access
    - from:
        - namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - protocol: TCP
          port: 8080

  egress:
    # DNS
    - to: []
      ports:
        - protocol: UDP
          port: 53
    
    # vCenter access (specific IP ranges)
    - to:
        - ipBlock:
            cidr: 10.0.0.0/8
            except:
              - 10.244.0.0/16  # Exclude pod network
      ports:
        - protocol: TCP
          port: 443
    
    - to:
        - ipBlock:
            cidr: 192.168.0.0/16
      ports:
        - protocol: TCP
          port: 443
    
    # ESXi host access for direct operations
    - to:
        - ipBlock:
            cidr: 10.1.0.0/24  # ESXi management network
      ports:
        - protocol: TCP
          port: 443
        - protocol: TCP
          port: 902   # vCenter agent

Libvirt Provider

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: libvirt-provider-policy
  namespace: libvirt-providers
  labels:
    provider: libvirt
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: virtrigaud-provider-runtime
      provider: libvirt
  policyTypes:
    - Ingress
    - Egress
  
  ingress:
    # Manager access
    - from:
        - namespaceSelector:
            matchLabels:
              name: virtrigaud-system
      ports:
        - protocol: TCP
          port: 9443
    
    # Monitoring access
    - from:
        - namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - protocol: TCP
          port: 8080

  egress:
    # DNS
    - to: []
      ports:
        - protocol: UDP
          port: 53
    
    # Access to hypervisor nodes
    - to: []
      ports:
        - protocol: TCP
          port: 16509  # libvirt daemon
        - protocol: TCP
          port: 22     # SSH for remote libvirt
    
    # Access to shared storage (NFS, iSCSI, etc.)
    - to:
        - ipBlock:
            cidr: 10.2.0.0/24  # Storage network
      ports:
        - protocol: TCP
          port: 2049  # NFS
        - protocol: TCP
          port: 3260  # iSCSI
        - protocol: UDP
          port: 111   # RPC portmapper

Mock Provider (Development)

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: mock-provider-policy
  namespace: development
  labels:
    provider: mock
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: virtrigaud-provider-runtime
      provider: mock
  policyTypes:
    - Ingress
    - Egress
  
  ingress:
    # Allow from manager and other development pods
    - from:
        - namespaceSelector:
            matchLabels:
              environment: development
      ports:
        - protocol: TCP
          port: 9443
        - protocol: TCP
          port: 8080

  egress:
    # Allow all egress for development environment
    - to: []

Multi-Tenant Isolation

Tenant Namespace Policies

# Template for tenant-specific policies
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-isolation
  namespace: tenant-{{TENANT_NAME}}
  labels:
    tenant: "{{TENANT_NAME}}"
spec:
  podSelector: {}  # Apply to all pods in namespace
  policyTypes:
    - Ingress
    - Egress
  
  ingress:
    # Allow from same tenant namespace
    - from:
        - namespaceSelector:
            matchLabels:
              tenant: "{{TENANT_NAME}}"
    
    # Allow from VirtRigaud system namespace
    - from:
        - namespaceSelector:
            matchLabels:
              name: virtrigaud-system
    
    # Allow from monitoring namespace
    - from:
        - namespaceSelector:
            matchLabels:
              name: monitoring

  egress:
    # Allow to same tenant namespace
    - to:
        - namespaceSelector:
            matchLabels:
              tenant: "{{TENANT_NAME}}"
    
    # Allow to VirtRigaud system namespace
    - to:
        - namespaceSelector:
            matchLabels:
              name: virtrigaud-system
    
    # DNS resolution
    - to: []
      ports:
        - protocol: UDP
          port: 53
    
    # External hypervisor access (tenant-specific IP ranges)
    - to:
        - ipBlock:
            cidr: "{{TENANT_HYPERVISOR_CIDR}}"
      ports:
        - protocol: TCP
          port: 443

Cross-Tenant Communication Prevention

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-cross-tenant
  namespace: tenant-production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  
  ingress:
    # NetworkPolicies are allow-lists: listing only these sources implicitly
    # denies ingress from every other tenant namespace
    - from:
        - namespaceSelector:
            matchLabels:
              tenant: production
        - namespaceSelector:
            matchLabels:
              name: virtrigaud-system
        - namespaceSelector:
            matchLabels:
              name: monitoring
  
  egress:
    # Likewise, egress is limited to the same tenant plus system namespaces;
    # access to all other namespaces is denied by omission
    - to:
        - namespaceSelector:
            matchLabels:
              name: virtrigaud-system
    - to:
        - namespaceSelector:
            matchLabels:
              name: monitoring
    - to:
        - namespaceSelector:
            matchLabels:
              tenant: production

Advanced Policies

Time-Based Access Control

# Use external controllers like OPA Gatekeeper for time-based policies
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: timerestriction
spec:
  crd:
    spec:
      names:
        kind: TimeRestriction
      validation:
        type: object
        properties:
          allowedHours:
            type: array
            items:
              type: integer
            description: "Allowed hours (0-23) for network access"
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package timerestriction
        
        import future.keywords.in
        
        violation[{"msg": msg}] {
          current_hour := floor(time.now_ns() / 1000000000 / 3600) % 24
          not current_hour in input.parameters.allowedHours
          msg := sprintf("Network access not allowed at hour %v", [current_hour])
        }

---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: TimeRestriction
metadata:
  name: business-hours-only
spec:
  match:
    kinds:
      - apiGroups: ["networking.k8s.io"]
        kinds: ["NetworkPolicy"]
    namespaces: ["production"]
  parameters:
    allowedHours: [8, 9, 10, 11, 12, 13, 14, 15, 16, 17]  # 8 AM - 5 PM

Dynamic IP Allow-listing

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: dynamic-hypervisor-access
  namespace: provider-namespace
  annotations:
    # Use external controllers to update IP blocks dynamically
    network-policy-controller/update-interval: "300s"
    network-policy-controller/ip-source: "configmap:hypervisor-ips"
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: virtrigaud-provider
  policyTypes:
    - Egress
  egress:
    # Will be dynamically updated by controller
    - to:
        - ipBlock:
            cidr: 10.0.0.0/8
    # Static rules remain
    - to: []
      ports:
        - protocol: UDP
          port: 53
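
The annotation above points at a `configmap:hypervisor-ips` source. A sketch of what such a ConfigMap might look like; the data format is whatever your controller expects, and one CIDR per line is assumed here:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hypervisor-ips
  namespace: provider-namespace
data:
  cidrs: |
    10.1.0.0/24
    192.168.10.0/28
```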

Monitoring and Troubleshooting

NetworkPolicy Monitoring

# ServiceMonitor for network policy violations
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: networkpolicy-monitoring
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: networkpolicy-exporter
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

---
# Example alerts for network policy violations
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: networkpolicy-alerts
  namespace: monitoring
spec:
  groups:
    - name: networkpolicy.rules
      rules:
        - alert: NetworkPolicyDeniedConnections
          expr: increase(networkpolicy_denied_connections_total[5m]) > 10
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "High number of denied network connections"
            description: "{{ $labels.source_namespace }}/{{ $labels.source_pod }} had {{ $value }} denied connections to {{ $labels.dest_namespace }}/{{ $labels.dest_pod }}"

Debug NetworkPolicies

#!/bin/bash
# debug-networkpolicy.sh

NAMESPACE=${1:-default}
POD_NAME=${2}

echo "=== NetworkPolicy Debug for $NAMESPACE/$POD_NAME ==="

# List all NetworkPolicies in namespace
echo "NetworkPolicies in namespace $NAMESPACE:"
kubectl get networkpolicy -n $NAMESPACE

# Show specific NetworkPolicy details
echo -e "\nNetworkPolicy details:"
kubectl get networkpolicy -n $NAMESPACE -o yaml

# Test connectivity
if [ -n "$POD_NAME" ]; then
    echo -e "\nTesting connectivity from $POD_NAME:"
    
    # Test DNS resolution
    kubectl exec -n "$NAMESPACE" "$POD_NAME" -- nslookup kubernetes.default.svc.cluster.local
    
    # Test internal connectivity
    kubectl exec -n "$NAMESPACE" "$POD_NAME" -- wget -qO- --timeout=5 http://kubernetes.default.svc.cluster.local/api
    
    # Test external connectivity (adjust as needed)
    kubectl exec -n "$NAMESPACE" "$POD_NAME" -- wget -qO- --timeout=5 https://google.com
fi

# Check iptables rules (if accessible)
echo -e "\nIPTables rules (if accessible):"
kubectl get nodes -o wide
echo "Run the following on a node to see iptables:"
echo "sudo iptables -L -n | grep -E '(KUBE|Chain)'"

CNI-Specific Troubleshooting

Calico

# Check Calico network policies
kubectl get networkpolicy --all-namespaces
kubectl get globalnetworkpolicy

# Check Kubernetes service endpoints
kubectl get endpoints --all-namespaces

# Debug Calico connectivity
kubectl exec -it -n kube-system <calico-node-pod> -- /bin/sh
calicoctl get wep --all-namespaces
calicoctl get netpol --all-namespaces

Cilium

# Check Cilium network policies
kubectl get cnp --all-namespaces  # Cilium Network Policies
kubectl get ccnp --all-namespaces # Cilium Cluster Network Policies

# Debug Cilium connectivity
kubectl exec -it -n kube-system <cilium-pod> -- cilium endpoint list
kubectl exec -it -n kube-system <cilium-pod> -- cilium policy get

Security Best Practices

1. Principle of Least Privilege

# Example: Minimal egress for a provider
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: minimal-egress-example
spec:
  podSelector:
    matchLabels:
      app: provider
  policyTypes:
    - Egress
  egress:
    # Only allow what's absolutely necessary
    - to: []
      ports:
        - protocol: UDP
          port: 53  # DNS only
    - to:
        - ipBlock:
            cidr: 10.1.1.100/32  # Specific vCenter IP only
      ports:
        - protocol: TCP
          port: 443  # HTTPS only

2. Default Deny Policies

# Apply default deny to all namespaces
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  # Empty ingress/egress rules = deny all

3. Regular Policy Auditing

#!/bin/bash
# audit-networkpolicies.sh

echo "=== NetworkPolicy Audit Report ==="
echo "Generated: $(date)"
echo

# Check for namespaces without NetworkPolicies
echo "Namespaces without NetworkPolicies:"
for ns in $(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}'); do
    if [ $(kubectl get networkpolicy -n $ns --no-headers 2>/dev/null | wc -l) -eq 0 ]; then
        echo "  - $ns (WARNING: No network policies)"
    fi
done

echo

# Check for overly permissive policies
echo "Potentially overly permissive policies:"
kubectl get networkpolicy --all-namespaces -o json | jq -r '
  .items[] |
  select(
    any(.spec.egress[]?; (.to // []) | length == 0) or
    any(.spec.ingress[]?; (.from // []) | length == 0)
  ) |
  "\(.metadata.namespace)/\(.metadata.name) - Check for overly broad rules"
'

echo

# Check for unused NetworkPolicies
echo "NetworkPolicies with no matching pods:"
kubectl get networkpolicy --all-namespaces -o json | jq -r '
  .items[] as $np |
  $np.metadata.namespace as $ns |
  $np.spec.podSelector as $selector |
  if ($selector | keys | length) == 0 then
    "\($ns)/\($np.metadata.name) - Applies to all pods in namespace"
  else
    "\($ns)/\($np.metadata.name) - Check if pods match selector"
  end
'

4. Integration with Service Mesh

# Example: Istio integration with NetworkPolicies
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: istio-compatible-policy
spec:
  podSelector:
    matchLabels:
      app: provider
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Allow Istio sidecar communication
    - from:
        - podSelector:
            matchLabels:
              app: istio-proxy
      ports:
        - protocol: TCP
          port: 15090  # Envoy Prometheus metrics
    # Your application ports
    - from:
        - namespaceSelector:
            matchLabels:
              name: virtrigaud-system
      ports:
        - protocol: TCP
          port: 9443
  egress:
    # Allow Istio control plane
    - to:
        - namespaceSelector:
            matchLabels:
              name: istio-system
      ports:
        - protocol: TCP
          port: 15010  # Pilot
        - protocol: TCP
          port: 15011  # Pilot secure

CLI Tools Reference

VirtRigaud provides a comprehensive set of command-line tools for managing virtual machines, developing providers, running conformance tests, and performing load testing. This guide covers all available CLI tools and their usage.

Overview

Tool                  Purpose                                     Target Users
vrtg                  Main CLI for VM management and operations   End users, DevOps teams, System administrators
vcts                  Conformance testing suite                   Provider developers, QA teams, CI/CD pipelines
vrtg-provider         Provider development toolkit                Provider developers, Contributors
virtrigaud-loadgen    Load testing and benchmarking               Performance engineers, SREs

Installation

From GitHub Releases

# Download the latest release
export VIRTRIGAUD_VERSION="v0.2.3"
export PLATFORM="linux-amd64"  # or darwin-amd64, windows-amd64

# Install main CLI tool
curl -L "https://github.com/projectbeskar/virtrigaud/releases/download/${VIRTRIGAUD_VERSION}/vrtg-${PLATFORM}" -o vrtg
chmod +x vrtg
sudo mv vrtg /usr/local/bin/

# Install all CLI tools
curl -L "https://github.com/projectbeskar/virtrigaud/releases/download/${VIRTRIGAUD_VERSION}/virtrigaud-cli-${PLATFORM}.tar.gz" | tar xz
sudo mv vrtg vcts vrtg-provider virtrigaud-loadgen /usr/local/bin/

From Source

git clone https://github.com/projectbeskar/virtrigaud.git
cd virtrigaud

# Build all CLI tools
make build-cli

# Install to /usr/local/bin
sudo make install-cli

# Or install to a custom location
make install-cli PREFIX=$HOME/.local

Using Go

# Install specific version
go install github.com/projectbeskar/virtrigaud/cmd/vrtg@v0.2.3
go install github.com/projectbeskar/virtrigaud/cmd/vcts@v0.2.3
go install github.com/projectbeskar/virtrigaud/cmd/vrtg-provider@v0.2.3
go install github.com/projectbeskar/virtrigaud/cmd/virtrigaud-loadgen@v0.2.3

# Install latest
go install github.com/projectbeskar/virtrigaud/cmd/vrtg@latest

Completion

Enable shell completion for enhanced productivity:

# Bash
vrtg completion bash > /etc/bash_completion.d/vrtg
source /etc/bash_completion.d/vrtg

# Zsh
vrtg completion zsh > "${fpath[1]}/_vrtg"

# Fish
vrtg completion fish > ~/.config/fish/completions/vrtg.fish

# PowerShell
vrtg completion powershell | Out-String | Invoke-Expression

vrtg

The main CLI tool for managing VirtRigaud resources and virtual machines.

Global Flags

--kubeconfig string   Path to kubeconfig file (default: $KUBECONFIG or ~/.kube/config)
--namespace string    Kubernetes namespace (default: "default")
--output string       Output format: table, json, yaml (default: "table")
--timeout duration    Operation timeout (default: 30s)
--verbose             Enable verbose output
-h, --help           Help for vrtg

Commands

vm - Virtual Machine Management

Manage virtual machines with comprehensive lifecycle operations.

# List virtual machines
vrtg vm list [flags]

# Describe a virtual machine
vrtg vm describe <name> [flags]

# Show VM events
vrtg vm events <name> [flags]

# Get VM console URL
vrtg vm console-url <name> [flags]

Flags:

  • --all-namespaces: List VMs across all namespaces
  • --label-selector: Filter by labels (e.g., app=web,env=prod)
  • --field-selector: Filter by fields (e.g., spec.powerState=On)
  • --sort-by: Sort output by column (name, namespace, powerState, provider)
  • --watch: Watch for changes

Examples:

# List all VMs in table format
vrtg vm list

# List VMs with custom output format
vrtg vm list --output json --namespace production

# List VMs across all namespaces
vrtg vm list --all-namespaces

# Filter VMs by labels
vrtg vm list --label-selector environment=production,tier=web

# Watch VM status changes
vrtg vm list --watch

# Get detailed VM information
vrtg vm describe my-vm --output yaml

# Get VM console URL
vrtg vm console-url my-vm

# Show recent VM events
vrtg vm events my-vm

provider - Provider Management

Manage provider configurations and monitor their health.

# List providers
vrtg provider list [flags]

# Show provider status
vrtg provider status <name> [flags]

# Show provider logs
vrtg provider logs <name> [flags]

Flags:

  • --follow: Follow log output (for logs command)
  • --tail: Number of lines to show from end of logs (default: 100)
  • --since: Show logs since timestamp (e.g., 1h, 30m)

Examples:

# List all providers
vrtg provider list

# Check provider status
vrtg provider status vsphere-provider

# View provider logs
vrtg provider logs vsphere-provider --tail 50

# Follow provider logs in real-time
vrtg provider logs vsphere-provider --follow

# Show logs from last hour
vrtg provider logs vsphere-provider --since 1h

snapshot - Snapshot Management

Manage VM snapshots for backup and recovery.

# Create a VM snapshot
vrtg snapshot create <vm-name> <snapshot-name> [flags]

# List snapshots
vrtg snapshot list [vm-name] [flags]

# Revert VM to snapshot
vrtg snapshot revert <vm-name> <snapshot-name> [flags]

Flags for create:

  • --description: Snapshot description
  • --include-memory: Include memory state in snapshot

Examples:

# Create a simple snapshot
vrtg snapshot create my-vm pre-upgrade

# Create snapshot with description and memory
vrtg snapshot create my-vm pre-maintenance \
  --description "Before maintenance window" \
  --include-memory

# List all snapshots
vrtg snapshot list

# List snapshots for specific VM
vrtg snapshot list my-vm

# Revert to a snapshot
vrtg snapshot revert my-vm pre-upgrade

clone - VM Cloning

Clone virtual machines for rapid provisioning.

# Clone a virtual machine
vrtg clone run <source-vm> <target-vm> [flags]

# List clone operations
vrtg clone list [flags]

Flags for run:

  • --linked: Create linked clone (faster, space-efficient)
  • --target-namespace: Namespace for target VM
  • --customize: Apply customization during clone

Examples:

# Simple VM clone
vrtg clone run template-vm new-vm

# Linked clone for development
vrtg clone run production-vm dev-vm --linked

# Clone to different namespace
vrtg clone run template-vm test-vm --target-namespace testing

# List clone operations
vrtg clone list

conformance - Provider Testing

Run conformance tests against providers.

# Run conformance tests
vrtg conformance run <provider> [flags]

Flags:

  • --output-dir: Directory for test results
  • --skip-tests: Comma-separated list of tests to skip
  • --timeout: Test timeout (default: 30m)

Examples:

# Run conformance tests
vrtg conformance run vsphere-provider

# Run tests with custom timeout
vrtg conformance run vsphere-provider --timeout 1h

# Skip specific tests
vrtg conformance run vsphere-provider --skip-tests "test-large-vms,test-network"

diag - Diagnostics

Diagnostic tools for troubleshooting.

# Create diagnostic bundle
vrtg diag bundle [flags]

Flags:

  • --output: Output file path (default: virtrigaud-diag-<timestamp>.tar.gz)
  • --include-logs: Include provider logs in bundle
  • --since: Collect logs since timestamp

Examples:

# Create diagnostic bundle
vrtg diag bundle

# Create bundle with logs from last 2 hours
vrtg diag bundle --include-logs --since 2h

# Custom output location
vrtg diag bundle --output /tmp/debug-bundle.tar.gz

init - Installation

Initialize VirtRigaud in a Kubernetes cluster.

# Initialize virtrigaud
vrtg init [flags]

Flags:

  • --chart-version: Helm chart version to install
  • --namespace: Installation namespace (default: virtrigaud-system)
  • --values: Values file for Helm chart
  • --dry-run: Show what would be installed

Examples:

# Basic installation
vrtg init

# Install specific version
vrtg init --chart-version v0.2.1

# Install with custom values
vrtg init --values custom-values.yaml

# Dry run to see what would be installed
vrtg init --dry-run

vcts

VirtRigaud Conformance Test Suite - runs standardized tests against providers.

Usage

vcts [command] [flags]

Global Flags

--kubeconfig string   Path to kubeconfig file
--namespace string    Kubernetes namespace (default: "virtrigaud-system")
--provider string     Provider name to test
--output-dir string   Output directory for test results (default: "./conformance-results")
--skip-tests strings  Comma-separated list of tests to skip
--timeout duration    Test timeout (default: 30m)
--parallel int        Number of parallel test executions (default: 1)
--verbose             Enable verbose output

Commands

run - Execute Tests

# Run all conformance tests
vcts run --provider vsphere-provider

# Run with custom settings
vcts run --provider vsphere-provider \
  --timeout 1h \
  --parallel 3 \
  --output-dir /tmp/test-results

# Skip specific tests
vcts run --provider libvirt-provider \
  --skip-tests "test-snapshots,test-linked-clones"

# Verbose output for debugging
vcts run --provider proxmox-provider --verbose

list - List Available Tests

# List all available tests
vcts list

# List tests for specific capability
vcts list --capability snapshots

validate - Validate Provider

# Validate provider configuration
vcts validate --provider vsphere-provider

# Check provider connectivity
vcts validate --provider vsphere-provider --check-connectivity

Test Categories

  1. Basic Operations: VM creation, deletion, power operations
  2. Lifecycle Management: Start, stop, restart, suspend operations
  3. Resource Management: CPU, memory, disk operations
  4. Networking: Network configuration and connectivity
  5. Storage: Disk operations, resizing, multiple disks
  6. Snapshots: Create, list, revert, delete snapshots
  7. Cloning: VM cloning and linked clones
  8. Error Handling: Provider error scenarios
  9. Performance: Basic performance benchmarks

Output Formats

Test results are available in multiple formats:

  • JUnit XML: For CI/CD integration
  • JSON: Machine-readable format
  • HTML: Human-readable report
  • TAP: Test Anything Protocol

vrtg-provider

Provider development toolkit for creating and managing VirtRigaud providers.

Usage

vrtg-provider [command] [flags]

Global Flags

--verbose     Enable verbose output
--help        Help for vrtg-provider

Commands

init - Initialize Provider

Bootstrap a new provider project with scaffolding.

vrtg-provider init <provider-name> [flags]

Flags:

  • --template: Template to use (grpc, rest, hybrid)
  • --output-dir: Output directory (default: current directory)
  • --module: Go module name
  • --author: Author name for generated files

Examples:

# Create basic gRPC provider
vrtg-provider init my-provider --template grpc

# Create with custom module
vrtg-provider init my-provider \
  --template grpc \
  --module github.com/myorg/my-provider \
  --author "John Doe <john@example.com>"

# Create in specific directory
vrtg-provider init my-provider \
  --output-dir /path/to/providers \
  --template grpc

generate - Code Generation

Generate boilerplate code for provider implementation.

vrtg-provider generate [type] [flags]

Types:

  • client: Generate client code
  • server: Generate server implementation
  • tests: Generate test scaffolding
  • docs: Generate documentation templates

Examples:

# Generate client code
vrtg-provider generate client --provider my-provider

# Generate test scaffolding
vrtg-provider generate tests --provider my-provider

# Generate documentation
vrtg-provider generate docs --provider my-provider

verify - Verification

Verify provider implementation and compliance.

vrtg-provider verify [flags]

Flags:

  • --provider-dir: Provider directory to verify
  • --check-interface: Verify interface compliance
  • --check-docs: Verify documentation completeness
  • --check-tests: Verify test coverage

Examples:

# Basic verification
vrtg-provider verify --provider-dir ./my-provider

# Comprehensive check
vrtg-provider verify \
  --provider-dir ./my-provider \
  --check-interface \
  --check-docs \
  --check-tests

publish - Publishing

Prepare provider for publishing and distribution.

vrtg-provider publish [flags]

Flags:

  • --provider-dir: Provider directory
  • --version: Version to publish
  • --registry: Container registry
  • --chart-repo: Helm chart repository

Examples:

# Publish provider
vrtg-provider publish \
  --provider-dir ./my-provider \
  --version v1.0.0 \
  --registry ghcr.io/myorg

# Publish with Helm chart
vrtg-provider publish \
  --provider-dir ./my-provider \
  --version v1.0.0 \
  --registry ghcr.io/myorg \
  --chart-repo https://charts.myorg.com

Provider Template Structure

my-provider/
β”œβ”€β”€ cmd/
β”‚   └── provider/
β”‚       └── main.go              # Provider entry point
β”œβ”€β”€ internal/
β”‚   β”œβ”€β”€ provider/
β”‚   β”‚   β”œβ”€β”€ server.go           # gRPC server implementation
β”‚   β”‚   β”œβ”€β”€ client.go           # Provider client
β”‚   β”‚   └── types.go            # Provider-specific types
β”‚   └── config/
β”‚       └── config.go           # Configuration management
β”œβ”€β”€ pkg/
β”‚   └── api/                    # Public API interfaces
β”œβ”€β”€ test/
β”‚   β”œβ”€β”€ conformance/            # Conformance tests
β”‚   └── integration/            # Integration tests
β”œβ”€β”€ deploy/
β”‚   β”œβ”€β”€ helm/                   # Helm charts
β”‚   └── k8s/                    # Kubernetes manifests
β”œβ”€β”€ docs/                       # Documentation
β”œβ”€β”€ Dockerfile                  # Container image
β”œβ”€β”€ Makefile                    # Build automation
└── README.md                   # Provider documentation

virtrigaud-loadgen

Load testing and benchmarking tool for VirtRigaud deployments.

Usage

virtrigaud-loadgen [command] [flags]

Global Flags

--kubeconfig string   Path to kubeconfig file
--namespace string    Kubernetes namespace (default: "default")  
--output-dir string   Output directory for results (default: "./loadgen-results")
--config-file string  Load generation configuration file
--dry-run            Show what would be executed without running
--verbose            Enable verbose output

Commands

run - Execute Load Test

virtrigaud-loadgen run [flags]

Flags:

  • --vms: Number of VMs to create (default: 10)
  • --duration: Test duration (default: 10m)
  • --ramp-up: Ramp-up time (default: 2m)
  • --workers: Number of concurrent workers (default: 5)
  • --provider: Provider to test against
  • --vm-class: VMClass to use for test VMs
  • --vm-image: VMImage to use for test VMs

Examples:

# Basic load test
virtrigaud-loadgen run --vms 50 --duration 15m

# Comprehensive load test
virtrigaud-loadgen run \
  --vms 100 \
  --duration 30m \
  --ramp-up 5m \
  --workers 10 \
  --provider vsphere-provider

# Test with specific configuration
virtrigaud-loadgen run --config-file loadtest-config.yaml

config - Configuration Management

# Generate sample configuration
virtrigaud-loadgen config generate --output sample-config.yaml

# Validate configuration
virtrigaud-loadgen config validate --config-file my-config.yaml

Configuration File

# loadtest-config.yaml
metadata:
  name: "production-load-test"
  description: "Load test for production environment"

spec:
  # Test parameters
  vms: 100
  duration: "30m"
  rampUp: "5m"
  workers: 10
  
  # Target configuration
  provider: "vsphere-provider"
  namespace: "loadtest"
  
  # VM configuration
  vmClass: "standard-vm"
  vmImage: "ubuntu-22-04"
  
  # Test scenarios
  scenarios:
    - name: "vm-lifecycle"
      weight: 70
      operations:
        - create
        - start
        - stop
        - delete
    
    - name: "vm-operations"
      weight: 20
      operations:
        - snapshot
        - clone
        - reconfigure
    
    - name: "provider-stress"
      weight: 10
      operations:
        - rapid-create-delete
        - concurrent-operations

  # Reporting
  reporting:
    formats: ["json", "html", "csv"]
    metrics:
      - response-time
      - throughput
      - error-rate
      - resource-usage

Metrics and Reporting

Load test results include:

  • Performance Metrics: Response times, throughput, latency percentiles
  • Error Analysis: Error rates, failure patterns, error categorization
  • Resource Usage: CPU, memory, network utilization
  • Provider Metrics: Provider-specific performance indicators
  • Trend Analysis: Performance over time, bottleneck identification

Output Formats

  • JSON: Machine-readable results for automation
  • HTML: Interactive dashboard with charts and graphs
  • CSV: Raw data for further analysis
  • Prometheus: Metrics export for monitoring systems

Advanced Usage

Automation and Scripting

Bash Integration

#!/bin/bash
# VM management script

# Function to check VM status
check_vm_status() {
  local vm_name=$1
  vrtg vm describe "$vm_name" --output json | jq -r '.status.powerState'
}

# Wait for VM to be ready
wait_for_vm() {
  local vm_name=$1
  local timeout=300
  local count=0
  
  while [ $count -lt $timeout ]; do
    status=$(check_vm_status "$vm_name")
    if [ "$status" = "On" ]; then
      echo "VM $vm_name is ready"
      return 0
    fi
    sleep 5
    count=$((count + 5))
  done
  
  echo "Timeout waiting for VM $vm_name"
  return 1
}

# Create and wait for VM
vrtg vm create --file vm-config.yaml
wait_for_vm "my-vm"

CI/CD Integration

# .github/workflows/vm-test.yml
name: VM Integration Test

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Install VirtRigaud CLI tools
        run: |
          curl -L "https://github.com/projectbeskar/virtrigaud/releases/latest/download/virtrigaud-cli-linux-amd64.tar.gz" | tar xz
          sudo mv vrtg vcts /usr/local/bin/
      
      - name: Setup kubeconfig
        run: |
          mkdir -p ~/.kube
          echo "${{ secrets.KUBECONFIG }}" | base64 -d > ~/.kube/config
      
      - name: Run conformance tests
        run: vcts run --provider test-provider --output-dir test-results
      
      - name: Upload test results
        uses: actions/upload-artifact@v3
        with:
          name: conformance-results
          path: test-results/

Configuration Management

Environment-specific Configurations

# Development environment
export VRTG_KUBECONFIG=~/.kube/dev-config
export VRTG_NAMESPACE=development
export VRTG_OUTPUT=yaml

# Production environment  
export VRTG_KUBECONFIG=~/.kube/prod-config
export VRTG_NAMESPACE=production
export VRTG_OUTPUT=json

# Use environment-specific settings
vrtg vm list  # Uses environment variables

Configuration Files

Create ~/.vrtg/config.yaml:

contexts:
  development:
    kubeconfig: ~/.kube/dev-config
    namespace: development
    output: yaml
    timeout: 30s
  
  production:
    kubeconfig: ~/.kube/prod-config
    namespace: production
    output: json
    timeout: 60s

current-context: development

aliases:
  ls: vm list
  get: vm describe
  logs: provider logs

Troubleshooting

Common Issues

  1. Connection Issues
# Check cluster connectivity
vrtg provider list

# Validate kubeconfig
kubectl cluster-info

# Check provider logs
vrtg provider logs <provider-name> --tail 100
  2. Permission Issues
# Check RBAC permissions
kubectl auth can-i create virtualmachines

# Get current user context
kubectl auth whoami
  3. Provider Issues
# Check provider status
vrtg provider status <provider-name>

# Run diagnostics
vrtg diag bundle --include-logs

Debug Mode

Enable debug output:

# Global debug flag
vrtg --verbose vm list

# Provider-specific debugging
vrtg provider logs <provider-name> --follow --verbose

# Conformance test debugging
vcts run --provider <provider-name> --verbose

See Also

CLI Reference

VirtRigaud provides several command-line tools for managing virtual machines, testing providers, and developing new providers. All tools are available as part of VirtRigaud v0.2.0.

Overview

| Tool | Purpose | Target Users |
|------|---------|--------------|
| vrtg | Main CLI for VM management | End users, DevOps teams |
| vcts | Conformance testing suite | Provider developers, QA teams |
| vrtg-provider | Provider development toolkit | Provider developers |
| virtrigaud-loadgen | Load testing and benchmarking | Performance engineers |

Installation

From GitHub Releases

# Download the latest release
curl -L "https://github.com/projectbeskar/virtrigaud/releases/download/v0.2.0/vrtg-linux-amd64" -o vrtg
chmod +x vrtg
sudo mv vrtg /usr/local/bin/

# Install all CLI tools
curl -L "https://github.com/projectbeskar/virtrigaud/releases/download/v0.2.0/virtrigaud-cli-linux-amd64.tar.gz" | tar xz
sudo mv vrtg vcts vrtg-provider virtrigaud-loadgen /usr/local/bin/

From Source

git clone https://github.com/projectbeskar/virtrigaud.git
cd virtrigaud

# Build all CLI tools
make build-cli

# Install to /usr/local/bin
sudo make install-cli

Using Go

go install github.com/projectbeskar/virtrigaud/cmd/vrtg@v0.2.0
go install github.com/projectbeskar/virtrigaud/cmd/vcts@v0.2.0
go install github.com/projectbeskar/virtrigaud/cmd/vrtg-provider@v0.2.0
go install github.com/projectbeskar/virtrigaud/cmd/virtrigaud-loadgen@v0.2.0

vrtg

The main CLI tool for managing VirtRigaud resources and virtual machines.

Global Flags

--kubeconfig string   Path to kubeconfig file (default: $KUBECONFIG or ~/.kube/config)
--namespace string    Kubernetes namespace (default: "default")
--output string       Output format: table, json, yaml (default: "table")
--timeout duration    Operation timeout (default: 5m0s)
-h, --help           Help for vrtg

Commands

vm

Manage virtual machines.

# List all VMs
vrtg vm list

# Get detailed VM information
vrtg vm get <vm-name>

# Create a VM from configuration
vrtg vm create --file vm.yaml

# Delete a VM
vrtg vm delete <vm-name>

# Power operations
vrtg vm start <vm-name>
vrtg vm stop <vm-name>
vrtg vm restart <vm-name>

# Scale VMSet
vrtg vm scale <vmset-name> --replicas 5

# Get VM console URL
vrtg vm console <vm-name>

# Watch VM status changes
vrtg vm watch <vm-name>

Examples:

# List VMs with custom output
vrtg vm list --output json --namespace production

# Create VM with timeout
vrtg vm create --file my-vm.yaml --timeout 10m

# Power on all VMs in namespace
vrtg vm list --output json | jq -r '.items[].metadata.name' | xargs -I {} vrtg vm start {}

provider

Manage provider configurations.

# List providers
vrtg provider list

# Get provider details
vrtg provider get <provider-name>

# Check provider connectivity
vrtg provider validate <provider-name>

# Get provider capabilities
vrtg provider capabilities <provider-name>

# View provider logs
vrtg provider logs <provider-name>

# Test provider functionality
vrtg provider test <provider-name>

Examples:

# Validate all providers
vrtg provider list --output json | jq -r '.items[].metadata.name' | xargs -I {} vrtg provider validate {}

# Get detailed provider status
vrtg provider get vsphere-prod --output yaml

image

Manage VM images and templates.

# List available images
vrtg image list

# Get image details
vrtg image get <image-name>

# Prepare an image
vrtg image prepare <image-name>

# Delete an image
vrtg image delete <image-name>

snapshot

Manage VM snapshots.

# List snapshots for a VM
vrtg snapshot list --vm <vm-name>

# Create a snapshot
vrtg snapshot create <vm-name> --name "pre-upgrade"

# Restore from snapshot
vrtg snapshot restore <vm-name> <snapshot-name>

# Delete a snapshot
vrtg snapshot delete <vm-name> <snapshot-name>

completion

Generate shell completion scripts.

# Bash
vrtg completion bash > /etc/bash_completion.d/vrtg

# Zsh
vrtg completion zsh > "${fpath[1]}/_vrtg"

# Fish
vrtg completion fish > ~/.config/fish/completions/vrtg.fish

# PowerShell
vrtg completion powershell > vrtg.ps1

Configuration

vrtg uses the same kubeconfig as kubectl. Configuration precedence:

  1. --kubeconfig flag
  2. KUBECONFIG environment variable
  3. ~/.kube/config

Config File

Create ~/.vrtg/config.yaml for default settings:

defaults:
  namespace: "virtrigaud-system"
  timeout: "10m"
  output: "table"
providers:
  preferred: "vsphere-prod"
output:
  colors: true
  timestamps: true

vcts

VirtRigaud Conformance Test Suite for validating provider implementations.

Global Flags

--kubeconfig string   Path to kubeconfig file
--namespace string    Test namespace (default: "vcts")
--provider string     Provider to test
--output-dir string   Directory for test results
--timeout duration    Test timeout (default: 30m)
--parallel int        Number of parallel tests (default: 1)
--skip strings        Tests to skip (comma-separated)
--verbose             Verbose output
-h, --help           Help for vcts

Commands

run

Run conformance tests against a provider.

# Run all tests
vcts run --provider vsphere-prod

# Run specific test suites
vcts run --provider vsphere-prod --suites core,storage

# Run with custom configuration
vcts run --provider libvirt-test --config test-config.yaml

# Skip specific tests
vcts run --provider vsphere-prod --skip "test-large-vm,test-snapshot-memory"

# Generate detailed report
vcts run --provider vsphere-prod --output-dir ./test-results --verbose

list

List available test suites and tests.

# List all test suites
vcts list suites

# List tests in a suite
vcts list tests --suite core

# List supported providers
vcts list providers

validate

Validate test configuration.

# Validate configuration file
vcts validate --config test-config.yaml

# Validate provider setup
vcts validate --provider vsphere-prod

Test Suites

Core Suite

  • Basic VM lifecycle (create, start, stop, delete)
  • Provider connectivity and authentication
  • Resource allocation and management

Storage Suite

  • Disk creation and attachment
  • Volume expansion operations
  • Storage pool management

Network Suite

  • Network interface management
  • IP address allocation
  • Network connectivity tests

Snapshot Suite

  • Snapshot creation and deletion
  • Snapshot restoration
  • Memory state preservation

Performance Suite

  • VM creation performance
  • Resource utilization benchmarks
  • Concurrent operation handling

Test Configuration

Create test-config.yaml:

provider:
  name: "vsphere-prod"
  type: "vsphere"
  
tests:
  core:
    enabled: true
    timeout: "15m"
  storage:
    enabled: true
    testDiskSize: "10Gi"
  network:
    enabled: false  # Skip network tests
    
resources:
  vmClass: "test-small"
  vmImage: "ubuntu-22-04"
  
cleanup:
  enabled: true
  timeout: "10m"

vrtg-provider

Development toolkit for creating and maintaining VirtRigaud providers.

Global Flags

--verbose            Enable verbose output
-h, --help          Help for vrtg-provider

Commands

init

Initialize a new provider project.

# Create a new provider
vrtg-provider init --name hyperv --type hyperv --output ./hyperv-provider

# Create with custom options
vrtg-provider init --name aws-ec2 --type aws \
  --capabilities snapshots,linked-clones \
  --output ./aws-provider

Options:

  • --name: Provider name
  • --type: Provider type
  • --capabilities: Comma-separated capabilities list
  • --output: Output directory
  • --remote: Generate remote provider (default: true)

generate

Generate code for provider components.

# Generate API types
vrtg-provider generate api --provider-type vsphere

# Generate client code
vrtg-provider generate client --provider-type vsphere --api-version v1

# Generate test suite
vrtg-provider generate tests --provider-type vsphere

# Generate documentation
vrtg-provider generate docs --provider-type vsphere

verify

Verify provider implementation.

# Verify provider structure
vrtg-provider verify structure --path ./my-provider

# Verify capabilities
vrtg-provider verify capabilities --path ./my-provider

# Verify API compatibility
vrtg-provider verify api --path ./my-provider --api-version v1beta1

publish

Publish provider artifacts.

# Build and publish provider image
vrtg-provider publish --path ./my-provider --registry ghcr.io/myorg

# Publish with specific tag
vrtg-provider publish --path ./my-provider --tag v1.0.0

# Dry run publication
vrtg-provider publish --path ./my-provider --dry-run

Provider Structure

my-provider/
β”œβ”€β”€ cmd/
β”‚   └── provider-mytype/
β”‚       β”œβ”€β”€ Dockerfile
β”‚       └── main.go
β”œβ”€β”€ internal/
β”‚   └── provider/
β”‚       β”œβ”€β”€ provider.go
β”‚       β”œβ”€β”€ capabilities.go
β”‚       └── provider_test.go
β”œβ”€β”€ deploy/
β”‚   β”œβ”€β”€ provider.yaml
β”‚   β”œβ”€β”€ service.yaml
β”‚   └── deployment.yaml
β”œβ”€β”€ docs/
β”‚   └── README.md
β”œβ”€β”€ go.mod
β”œβ”€β”€ go.sum
└── Makefile

virtrigaud-loadgen

Load testing and performance benchmarking tool for VirtRigaud providers.

Global Flags

--kubeconfig string   Path to kubeconfig file
--namespace string    Test namespace (default: "loadgen")
--output-dir string   Output directory for results
--config-file string  Load generation configuration file
--dry-run            Show what would be created without executing
--verbose            Verbose output
-h, --help          Help for virtrigaud-loadgen

Commands

run

Execute load generation scenarios.

# Run default load test
virtrigaud-loadgen run --config loadtest.yaml

# Run with custom settings
virtrigaud-loadgen run --config loadtest.yaml --workers 50 --duration 10m

# Run specific scenario
virtrigaud-loadgen run --scenario vm-creation --vms 100

# Generate performance report
virtrigaud-loadgen run --config loadtest.yaml --output-dir ./perf-results

scenarios

Manage load testing scenarios.

# List available scenarios
virtrigaud-loadgen scenarios list

# Show scenario details
virtrigaud-loadgen scenarios get vm-lifecycle

# Validate scenario configuration
virtrigaud-loadgen scenarios validate --config custom-scenario.yaml

analyze

Analyze load test results.

# Generate performance report
virtrigaud-loadgen analyze --input ./perf-results

# Compare test runs
virtrigaud-loadgen analyze --compare run1.csv,run2.csv

# Generate charts
virtrigaud-loadgen analyze --input ./perf-results --charts

Load Test Configuration

Create loadtest.yaml:

metadata:
  name: "vm-creation-load-test"
  description: "Test VM creation performance"

scenarios:
  - name: "vm-creation"
    type: "vm-lifecycle"
    workers: 20
    duration: "5m"
    resources:
      vmClass: "small"
      vmImage: "ubuntu-22-04"
      provider: "vsphere-prod"
    
  - name: "vm-scaling"
    type: "vmset-scaling"
    workers: 5
    iterations: 10
    scaling:
      min: 1
      max: 50
      step: 5

providers:
  - name: "vsphere-prod"
    type: "vsphere"
  - name: "libvirt-test"
    type: "libvirt"

output:
  format: ["csv", "json"]
  metrics: ["latency", "throughput", "errors"]
  
cleanup:
  enabled: true
  timeout: "15m"

Performance Scenarios

VM Lifecycle

  • Create, start, stop, delete operations
  • Measures end-to-end VM management performance

Burst Creation

  • Rapid VM creation under load
  • Tests provider scaling capabilities

VMSet Scaling

  • Scale VMSets up and down
  • Measures horizontal scaling performance

Provider Stress

  • High concurrent operations
  • Tests provider reliability under stress

Results Analysis

Load test results include:

  • Latency metrics: P50, P95, P99 response times
  • Throughput: Operations per second
  • Error rates: Failed operations percentage
  • Resource usage: CPU, memory, network utilization
  • Provider metrics: API call statistics

Example output:

timestamp,scenario,operation,latency_ms,status,provider
2025-01-15T10:00:01Z,vm-creation,create,2500,success,vsphere-prod
2025-01-15T10:00:03Z,vm-creation,create,2800,success,vsphere-prod
2025-01-15T10:00:05Z,vm-creation,create,,timeout,vsphere-prod
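
A results CSV with this column layout can be summarized directly with standard shell tools; the snippet below (illustrative, using sample rows rather than real test output) computes the mean latency of successful operations and the overall error rate.

```shell
# Sample loadgen results using the column layout shown above.
cat > results.csv <<'EOF'
timestamp,scenario,operation,latency_ms,status,provider
2025-01-15T10:00:01Z,vm-creation,create,2500,success,vsphere-prod
2025-01-15T10:00:03Z,vm-creation,create,2800,success,vsphere-prod
2025-01-15T10:00:05Z,vm-creation,create,,timeout,vsphere-prod
EOF

# Column 4 is latency_ms, column 5 is status.
awk -F, 'NR > 1 {
  total++
  if ($5 == "success") { ok++; sum += $4 }
} END {
  printf "mean_latency_ms=%.0f\n", sum / ok
  printf "error_rate=%.2f\n", (total - ok) / total
}' results.csv
# mean_latency_ms=2650
# error_rate=0.33
```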

Best Practices

Using vrtg

  1. Use namespaces to organize resources
  2. Set timeouts appropriately for your environment
  3. Use dry-run options for validation before execution
  4. Monitor operations with watch commands

Testing with vcts

  1. Run core tests first to validate basic functionality
  2. Use separate namespaces for different test runs
  3. Clean up resources after testing
  4. Document test results for compliance tracking

Developing with vrtg-provider

  1. Start with init to create proper structure
  2. Implement core capabilities before advanced features
  3. Test thoroughly with vcts before publishing
  4. Follow naming conventions for consistency

Load Testing with virtrigaud-loadgen

  1. Start small and gradually increase load
  2. Monitor system resources during tests
  3. Use realistic scenarios that match production workloads
  4. Analyze results to identify bottlenecks

Support

Version Information

This documentation covers VirtRigaud CLI tools v0.2.0.

For older versions, see the releases page.

Metrics Catalog

VirtRigaud exposes comprehensive metrics for monitoring and observability. All metrics are available at the /metrics endpoint on port 8080.

Manager Metrics

Reconciliation Metrics

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| virtrigaud_manager_reconcile_total | Counter | kind, outcome | Total number of reconcile loops |
| virtrigaud_manager_reconcile_duration_seconds | Histogram | kind | Time spent in reconcile loops |
| virtrigaud_queue_depth | Gauge | kind | Current queue depth for each resource kind |

VM Operation Metrics

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| virtrigaud_vm_operations_total | Counter | operation, provider_type, provider, outcome | Total VM operations |
| virtrigaud_vm_reconfigure_total | Counter | provider_type, outcome | Total VM reconfiguration operations |
| virtrigaud_vm_snapshot_total | Counter | action, provider_type, outcome | Total VM snapshot operations |
| virtrigaud_vm_clone_total | Counter | linked, provider_type, outcome | Total VM clone operations |
| virtrigaud_vm_image_prepare_total | Counter | provider_type, outcome | Total VM image preparation operations |

Build Information

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| virtrigaud_build_info | Gauge | version, git_sha, go_version | Build information |

Provider Metrics

gRPC Metrics

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| virtrigaud_provider_rpc_requests_total | Counter | provider_type, method, code | Total gRPC requests |
| virtrigaud_provider_rpc_latency_seconds | Histogram | provider_type, method | gRPC request latency |
| virtrigaud_provider_tasks_inflight | Gauge | provider_type, provider | Number of inflight tasks |

Provider-Specific Metrics

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| virtrigaud_ip_discovery_duration_seconds | Histogram | provider_type | Time to discover VM IP addresses |

Error Metrics

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| virtrigaud_errors_total | Counter | reason, component | Total errors by reason and component |

Label Definitions

Common Labels

  • provider_type: The type of provider (vsphere, libvirt)
  • provider: The name of the provider instance
  • outcome: The result of an operation (success, failure, error)
  • kind: The Kubernetes resource kind (VirtualMachine, VMClass, etc.)
  • component: The component generating the metric (manager, provider)

Operation-Specific Labels

  • operation: Type of VM operation (Create, Delete, Power, Describe, Reconfigure)
  • method: gRPC method name (CreateVM, DeleteVM, PowerVM, etc.)
  • code: gRPC status code (OK, INVALID_ARGUMENT, DEADLINE_EXCEEDED, etc.)
  • action: Snapshot action (create, delete, revert)
  • linked: Whether a clone is linked (true, false)
  • reason: Error reason (ConnectionFailed, AuthenticationError, etc.)

Histogram Buckets

Duration histograms use the following buckets (in seconds):

0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300

Example Queries

Prometheus Queries

Error Rate

# Overall error rate
rate(virtrigaud_vm_operations_total{outcome="failure"}[5m]) /
rate(virtrigaud_vm_operations_total[5m])

# Provider-specific error rate
rate(virtrigaud_provider_rpc_requests_total{code!="OK"}[5m]) /
rate(virtrigaud_provider_rpc_requests_total[5m])

Latency

# 95th percentile VM creation time
histogram_quantile(0.95, 
  rate(virtrigaud_vm_operations_duration_seconds_bucket{operation="Create"}[5m])
)

# 95th percentile gRPC request latency by method
histogram_quantile(0.95,
  sum by (le, method) (rate(virtrigaud_provider_rpc_latency_seconds_bucket[5m]))
)

Throughput

# VM operations per second
rate(virtrigaud_vm_operations_total[5m])

# Operations by provider
sum by (provider_type, provider) (rate(virtrigaud_vm_operations_total[5m]))

Queue Depth

# Current queue depth
virtrigaud_queue_depth

# Average queue depth over time
avg_over_time(virtrigaud_queue_depth[5m])

Inflight Tasks

# Current inflight tasks
virtrigaud_provider_tasks_inflight

# Inflight tasks by provider
sum by (provider_type, provider) (virtrigaud_provider_tasks_inflight)

Grafana Dashboard Queries

VM Creation Success Rate Panel

sum(rate(virtrigaud_vm_operations_total{operation="Create",outcome="success"}[5m])) /
sum(rate(virtrigaud_vm_operations_total{operation="Create"}[5m])) * 100

Provider Health Panel

up{job="virtrigaud-provider"}

Error Rate by Provider Panel

sum(rate(virtrigaud_errors_total[5m])) by (component, provider_type)

ServiceMonitor Configuration

Example ServiceMonitor for Prometheus Operator:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: virtrigaud-manager
  namespace: virtrigaud-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: virtrigaud
      app.kubernetes.io/component: manager
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: virtrigaud-providers
  namespace: virtrigaud-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: virtrigaud
      app.kubernetes.io/component: provider
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics

Alert Rules

Example PrometheusRule for common alerts:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: virtrigaud-alerts
  namespace: virtrigaud-system
spec:
  groups:
  - name: virtrigaud.rules
    rules:
    - alert: VirtrigaudProviderDown
      expr: up{job="virtrigaud-provider"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Virtrigaud provider is down"
        description: "Provider {{ $labels.instance }} has been down for more than 5 minutes"

    - alert: VirtrigaudHighErrorRate
      expr: |
        rate(virtrigaud_vm_operations_total{outcome="failure"}[5m]) /
        rate(virtrigaud_vm_operations_total[5m]) > 0.1
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High error rate in VM operations"
        description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.provider }}"

    - alert: VirtrigaudSlowVMCreation
      expr: |
        histogram_quantile(0.95,
          rate(virtrigaud_vm_operations_duration_seconds_bucket{operation="Create"}[5m])
        ) > 600
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Slow VM creation times"
        description: "95th percentile VM creation time is {{ $value }}s"

    - alert: VirtrigaudQueueBacklog
      expr: virtrigaud_queue_depth > 100
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Queue backlog detected"
        description: "Queue depth for {{ $labels.kind }} is {{ $value }}"

Custom Metrics

Providers can expose additional custom metrics specific to their implementation:

vSphere Provider Metrics

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| virtrigaud_vsphere_sessions_total | Counter | datacenter | Total vSphere sessions created |
| virtrigaud_vsphere_api_calls_total | Counter | method, datacenter | Total vSphere API calls |

Libvirt Provider Metrics

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| virtrigaud_libvirt_connections_total | Counter | host | Total Libvirt connections |
| virtrigaud_libvirt_domains_total | Gauge | host, state | Current number of domains by state |

Metric Collection Best Practices

  1. Scrape Interval: Use 30s interval for most metrics
  2. Retention: Keep metrics for at least 30 days for trending
  3. High Cardinality: Be careful with VM names and IDs in labels
  4. Aggregation: Use recording rules for frequently queried metrics
  5. Alerting: Set up alerts for SLI/SLO violations
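
Point 4 above can be implemented with a Prometheus recording rule that precomputes the per-provider operation rate; a sketch for the Prometheus Operator (the rule name `virtrigaud:vm_operations:rate5m` is illustrative, not shipped with VirtRigaud):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: virtrigaud-recording-rules
  namespace: virtrigaud-system
spec:
  groups:
  - name: virtrigaud.recording
    interval: 30s
    rules:
    # Precomputed per-provider VM operation rate for dashboards and alerts.
    - record: virtrigaud:vm_operations:rate5m
      expr: sum by (provider_type, provider) (rate(virtrigaud_vm_operations_total[5m]))
```

Dashboards and alert expressions can then query `virtrigaud:vm_operations:rate5m` directly instead of re-evaluating the `rate()` on every panel refresh.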

Provider Catalog

Last updated: 2025-08-26T14:30:00Z

The VirtRigaud Provider Catalog lists all verified and community providers available for the VirtRigaud virtualization management platform. All providers in this catalog have been tested for conformance and compatibility.

Provider Overview

| Provider | Description | Capabilities | Conformance | Maintainer | License |
|----------|-------------|--------------|-------------|------------|---------|
| Mock Provider | A mock provider for testing and demonstrations | core, snapshot, clone, image-prepare, advanced | Conformance | virtrigaud@projectbeskar.com | Apache-2.0 |
| vSphere Provider | VMware vSphere provider for VirtRigaud | core, snapshot, clone, advanced | Conformance | virtrigaud@projectbeskar.com | Apache-2.0 |
| Libvirt Provider | Libvirt/KVM provider for VirtRigaud | core, snapshot, clone | Conformance | virtrigaud@projectbeskar.com | Apache-2.0 |

Quick Start

Installing a Provider

To install a provider in your Kubernetes cluster, use the VirtRigaud provider runtime Helm chart:

# Add the VirtRigaud Helm repository
helm repo add virtrigaud https://projectbeskar.github.io/virtrigaud
helm repo update

# Install a provider using the runtime chart
helm install my-vsphere-provider virtrigaud/virtrigaud-provider-runtime \
  --namespace vsphere-providers \
  --create-namespace \
  --set image.repository=ghcr.io/projectbeskar/virtrigaud/provider-vsphere \
  --set image.tag=0.1.1 \
  --set env[0].name=VSPHERE_SERVER \
  --set env[0].value=vcenter.example.com
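
The same configuration can be kept in a values file instead of repeated --set flags. This sketch mirrors the keys used by the flags above; the file name is illustrative.

```yaml
# my-vsphere-values.yaml -- mirrors the --set flags shown above
image:
  repository: ghcr.io/projectbeskar/virtrigaud/provider-vsphere
  tag: "0.1.1"
env:
  - name: VSPHERE_SERVER
    value: vcenter.example.com
```

Install with helm install my-vsphere-provider virtrigaud/virtrigaud-provider-runtime -f my-vsphere-values.yaml --namespace vsphere-providers --create-namespace.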

Provider Discovery

Once installed, providers automatically register with the VirtRigaud manager. You can list available providers:

kubectl get providers -n virtrigaud-system

Provider Details

Mock Provider

The mock provider is perfect for:

  • Testing VirtRigaud functionality
  • Development and CI/CD pipelines
  • Learning provider concepts
  • Demonstrating VirtRigaud capabilities

Installation:

helm install mock-provider virtrigaud/virtrigaud-provider-runtime \
  --namespace development \
  --create-namespace \
  --set image.repository=ghcr.io/projectbeskar/virtrigaud/provider-mock \
  --set image.tag=0.1.1 \
  --set env[0].name=LOG_LEVEL \
  --set env[0].value=debug

vSphere Provider

The vSphere provider enables VirtRigaud to manage VMware vSphere environments, including:

  • VM lifecycle management (create, update, delete)
  • Power operations (on, off, restart, suspend)
  • Snapshot management
  • VM cloning and templates
  • Resource allocation and configuration

Prerequisites:

  • VMware vSphere 6.7 or later
  • vCenter Server credentials
  • Network connectivity to vCenter API

Installation:

# Create secret for vSphere credentials
kubectl create secret generic vsphere-credentials \
  --namespace vsphere-providers \
  --from-literal=username=your-username \
  --from-literal=password=your-password

# Install provider
helm install vsphere-provider virtrigaud/virtrigaud-provider-runtime \
  --namespace vsphere-providers \
  --create-namespace \
  --set image.repository=ghcr.io/projectbeskar/virtrigaud/provider-vsphere \
  --set image.tag=0.1.1 \
  --set env[0].name=VSPHERE_SERVER \
  --set env[0].value=vcenter.example.com \
  --set env[1].name=VSPHERE_USERNAME \
  --set env[1].valueFrom.secretKeyRef.name=vsphere-credentials \
  --set env[1].valueFrom.secretKeyRef.key=username \
  --set env[2].name=VSPHERE_PASSWORD \
  --set env[2].valueFrom.secretKeyRef.name=vsphere-credentials \
  --set env[2].valueFrom.secretKeyRef.key=password

Libvirt Provider

The libvirt provider manages KVM/QEMU virtual machines through libvirt, supporting:

  • VM lifecycle management
  • Power state control
  • Snapshot operations
  • Basic cloning capabilities
  • Local and remote libvirt connections

Prerequisites:

  • Libvirt daemon running on target hosts
  • SSH access for remote connections
  • Shared storage for multi-host deployments

Installation:

helm install libvirt-provider virtrigaud/virtrigaud-provider-runtime \
  --namespace libvirt-providers \
  --create-namespace \
  --set image.repository=ghcr.io/projectbeskar/virtrigaud/provider-libvirt \
  --set image.tag=0.1.1 \
  --set env[0].name=LIBVIRT_URI \
  --set env[0].value=qemu:///system \
  --set securityContext.runAsUser=0 \
  --set podSecurityContext.runAsUser=0

Capability Profiles

VirtRigaud defines several capability profiles that providers can implement:

Core Profile

Required for all providers

  • vm.create - Create virtual machines
  • vm.read - Get virtual machine information
  • vm.update - Update virtual machine configuration
  • vm.delete - Delete virtual machines
  • vm.power - Control power state (on/off/restart)
  • vm.list - List virtual machines

Snapshot Profile

Optional - for providers supporting VM snapshots

  • vm.snapshot.create - Create VM snapshots
  • vm.snapshot.list - List VM snapshots
  • vm.snapshot.delete - Delete VM snapshots
  • vm.snapshot.restore - Restore VM from snapshot

Clone Profile

Optional - for providers supporting VM cloning

  • vm.clone - Clone virtual machines
  • vm.template - Create and manage VM templates

Image Prepare Profile

Optional - for providers with image management

  • image.prepare - Prepare VM images
  • image.list - List available images
  • image.upload - Upload custom images

Advanced Profile

Optional - for advanced provider features

  • vm.migrate - Live migrate VMs between hosts
  • vm.resize - Dynamic resource allocation
  • vm.backup - Backup and restore operations
  • vm.monitoring - Advanced monitoring and metrics

Contributing a Provider

Want to add your provider to the catalog? Follow these steps:

1. Develop Your Provider

Use the Provider Developer Tutorial to create your provider using the VirtRigaud SDK.

2. Ensure Conformance

Run the VirtRigaud Conformance Test Suite (VCTS) to verify your provider meets the requirements:

# Install the VCTS tool
go install github.com/projectbeskar/virtrigaud/cmd/vcts@latest

# Run conformance tests
vcts run --provider-endpoint=localhost:9443 --profile=core

3. Publish to Catalog

Use the vrtg-provider publish command to submit your provider:

vrtg-provider publish \
  --name your-provider \
  --image ghcr.io/yourorg/your-provider \
  --tag v1.0.0 \
  --repo https://github.com/yourorg/your-provider \
  --maintainer your-email@example.com \
  --license Apache-2.0

This will:

  1. Run conformance tests
  2. Generate provider badges
  3. Create a catalog entry
  4. Open a pull request to add your provider

4. Catalog Requirements

To be included in the catalog, providers must:

  • βœ… Pass VCTS core profile tests
  • βœ… Include comprehensive documentation
  • βœ… Provide Helm chart for deployment
  • βœ… Follow security best practices
  • βœ… Include proper error handling
  • βœ… Support health checks and metrics
  • βœ… Have active maintenance and support

Provider Support Matrix

| Provider | Kubernetes | VirtRigaud | Go Version | Platforms |
|---|---|---|---|---|
| Mock | 1.25+ | 0.1.0+ | 1.23+ | linux/amd64, linux/arm64 |
| vSphere | 1.25+ | 0.1.0+ | 1.23+ | linux/amd64, linux/arm64 |
| Libvirt | 1.25+ | 0.1.0+ | 1.23+ | linux/amd64 |

Community and Support

Versioning and Compatibility

Providers follow semantic versioning (SemVer) and maintain compatibility with VirtRigaud versions:

  • Major versions (1.0.0 β†’ 2.0.0): Breaking changes to APIs or behavior
  • Minor versions (1.0.0 β†’ 1.1.0): New features, backward compatible
  • Patch versions (1.0.0 β†’ 1.0.1): Bug fixes, security updates

Compatibility Policy:

  • Current VirtRigaud version supports providers from current major version
  • Providers should support at least 2 minor versions of VirtRigaud
  • Breaking changes require migration documentation
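
The "same major version" rule above can be reduced to a one-line check. A minimal sketch (helper names are illustrative, not part of any VirtRigaud package):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// majorOf parses the major component of a SemVer string such as
// "1.2.3"; an optional leading "v" is tolerated.
func majorOf(version string) int {
	version = strings.TrimPrefix(version, "v")
	major, _ := strconv.Atoi(strings.SplitN(version, ".", 2)[0])
	return major
}

// compatible applies the policy: a provider is supported when it
// shares a major version with the manager.
func compatible(manager, provider string) bool {
	return majorOf(manager) == majorOf(provider)
}

func main() {
	fmt.Println(compatible("v1.4.0", "1.1.2")) // true: same major
	fmt.Println(compatible("2.0.0", "1.9.0"))  // false: breaking major bump
}
```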

All providers in this catalog are open source and follow the licensing terms specified in their individual repositories. The catalog itself is maintained under the Apache 2.0 license.

Trademark Notice: VMware and vSphere are trademarks of VMware, Inc. KVM and QEMU are trademarks of their respective owners. All trademarks are the property of their respective owners.

Testing GitHub Actions Workflows Locally

This guide explains how to test VirtRigaud’s GitHub Actions workflows locally before pushing to save on GitHub Actions costs and catch issues early.

Overview

We provide several scripts to test workflows locally:

| Script | Purpose | Dependencies | Use Case |
|---|---|---|---|
| hack/test-workflows-locally.sh | Main orchestrator using act | act, docker | Full GitHub Actions simulation |
| hack/test-lint-locally.sh | Lint workflow replica | go, golangci-lint | Quick lint testing |
| hack/test-ci-locally.sh | CI workflow replica | go, helm, system deps | Comprehensive CI testing |
| hack/test-release-locally.sh | Release workflow simulation | docker, helm, go | Release preparation testing |
| hack/test-helm-locally.sh | Helm charts testing | helm, kind, kubectl | Chart validation and deployment |

Quick Start

1. Setup (First Time)

# Install dependencies and configure act
./hack/test-workflows-locally.sh setup

This will:

  • Install act (GitHub Actions runner)
  • Create .actrc configuration
  • Create .env.local with environment variables
  • Create .secrets file (update with real values if needed)

2. Quick Validation

# Fast syntax check of all workflows
./hack/test-workflows-locally.sh smoke

3. Test Individual Workflows

# Test lint workflow (fastest)
./hack/test-lint-locally.sh

# Test CI workflow (comprehensive)
./hack/test-ci-locally.sh

# Test Helm charts
./hack/test-helm-locally.sh

# Test release workflow (requires Docker)
./hack/test-release-locally.sh v0.2.0-test

Detailed Usage

Lint Testing (test-lint-locally.sh)

Replicates the lint.yml workflow:

# Quick lint test
./hack/test-lint-locally.sh

What it tests:

  • Go version compatibility
  • golangci-lint installation and execution
  • Comprehensive code linting (matching CI exactly)

Requirements:

  • Go 1.23+ (matching CI)
  • Internet access (to download golangci-lint if needed)

CI Testing (test-ci-locally.sh)

Replicates the ci.yml workflow jobs:

# Interactive mode (asks about optional jobs)
./hack/test-ci-locally.sh

# Quick essential tests only
./hack/test-ci-locally.sh quick

# Full CI replication including security scans
./hack/test-ci-locally.sh full

Jobs tested:

  • test: Go tests and coverage
  • lint: Code linting with golangci-lint
  • generate: Code and manifest generation
  • build: Binary compilation
  • build-tools: CLI tools compilation
  • helm: Helm chart validation
  • security: Security scanning (optional)

Requirements:

  • Go 1.23+
  • Helm 3.12+
  • System dependencies (libvirt-dev on Linux)
  • Python 3 (for YAML validation)

Release Testing (test-release-locally.sh)

Simulates the release.yml workflow:

# Test with default tag
./hack/test-release-locally.sh

# Test with specific tag
./hack/test-release-locally.sh v0.3.0-rc.1

# Skip image building (faster)
./hack/test-release-locally.sh --no-images

What it tests:

  • Container image building and pushing (to local registry)
  • Helm chart packaging with version updates
  • CLI tools building for multiple platforms
  • Changelog generation
  • Checksum creation
  • Container image smoke testing

Requirements:

  • Docker
  • Go 1.23+
  • Helm 3.12+
  • Local Docker registry (started automatically)

Helm Testing (test-helm-locally.sh)

Tests Helm charts with real Kubernetes:

# Full helm test suite
./hack/test-helm-locally.sh

# Individual test types
./hack/test-helm-locally.sh lint     # Chart linting only
./hack/test-helm-locally.sh template # Template rendering only
./hack/test-helm-locally.sh crd      # CRD installation only
./hack/test-helm-locally.sh main     # Main chart installation
./hack/test-helm-locally.sh runtime  # Runtime chart installation

# Cleanup after testing
./hack/test-helm-locally.sh cleanup

What it tests:

  • Helm chart linting (helm lint)
  • Template rendering with various value files
  • CRD installation and functionality
  • Chart installation in Kind cluster
  • Pod readiness and basic functionality

Requirements:

  • Helm 3.12+
  • Kind (Kubernetes in Docker)
  • kubectl
  • Docker

Act-Based Testing (test-workflows-locally.sh)

Uses act to run actual GitHub Actions workflows:

# Setup first time
./hack/test-workflows-locally.sh setup

# Test individual workflows
./hack/test-workflows-locally.sh lint
./hack/test-workflows-locally.sh ci
./hack/test-workflows-locally.sh runtime

# Test all workflows (interactive)
./hack/test-workflows-locally.sh all

# Cleanup
./hack/test-workflows-locally.sh cleanup

Advanced usage:

  • Supports secrets from .secrets file
  • Uses reusable containers for speed
  • Artifact handling with local storage
  • Environment variable injection

Configuration Files

.actrc

# Act configuration for GitHub Actions simulation
-P ubuntu-latest=catthehacker/ubuntu:act-22.04
-P ubuntu-22.04=catthehacker/ubuntu:act-22.04  
-P ubuntu-24.04=catthehacker/ubuntu:act-22.04
--container-daemon-socket /var/run/docker.sock
--reuse
--rm

.env.local

# Local environment variables
GO_VERSION=1.23
GOLANGCI_LINT_VERSION=v1.64.8
REGISTRY=localhost:5000
IMAGE_NAME_PREFIX=virtrigaud
GITHUB_ACTOR=local-user
GITHUB_REPOSITORY=projectbeskar/virtrigaud
# ... more environment variables

.secrets (optional)

# GitHub token for release workflows
GITHUB_TOKEN=your_github_token_here
REGISTRY=localhost:5000

Workflow-Specific Notes

Lint Workflow (lint.yml)

  • Fast: Usually completes in 1-2 minutes
  • Requirements: Minimal (Go + golangci-lint)
  • Run before: Every commit
  • Catches: Code style, syntax, and simple errors

CI Workflow (ci.yml)

  • Comprehensive: Tests building, testing, security
  • Duration: 10-20 minutes for full run
  • Platform deps: LibVirt requires Linux for full testing
  • Run before: Pull requests and major changes

Release Workflow (release.yml)

  • Complex: Multi-platform builds, signing, publishing
  • Duration: 20-30 minutes
  • Local only: Uses local registry, no real publishing
  • Run before: Creating releases

Runtime Chart Workflow (runtime-chart.yml)

  • Kubernetes focused: Tests provider runtime charts
  • Requirements: Kind cluster
  • Duration: 5-10 minutes
  • Run before: Chart changes

Best Practices

Daily Development Workflow

# Before committing
./hack/test-lint-locally.sh

# Before pushing feature branch
./hack/test-ci-locally.sh quick

# Before creating PR
./hack/test-ci-locally.sh full

Pre-Release Workflow

# Test release preparation
./hack/test-release-locally.sh v0.2.0-rc.1

# Test chart deployment (full suite runs by default)
./hack/test-helm-locally.sh

# Test with act for full simulation
./hack/test-workflows-locally.sh all

Troubleshooting

Common Issues

  1. Docker permission denied

    sudo usermod -aG docker $USER
    # Then logout/login
    
  2. LibVirt dependencies missing

    # Ubuntu/Debian
    sudo apt-get install libvirt-dev pkg-config
    
    # Skip libvirt tests on non-Linux
    ./hack/test-ci-locally.sh quick
    
  3. Kind cluster creation fails

    # Clean up and retry
    kind delete cluster --name virtrigaud-test
    ./hack/test-helm-locally.sh
    
  4. Act fails with container errors

    # Clean up act containers
    docker ps -a | grep "act-" | awk '{print $1}' | xargs docker rm -f
    
    # Rebuild without cache
    ./hack/test-workflows-locally.sh cleanup
    ./hack/test-workflows-locally.sh setup
    

Debugging Tips

  • Check logs: All scripts provide detailed logging
  • Check options: most scripts support --help to list available modes
  • Incremental testing: Start with lint, then ci quick, then full tests
  • Docker cleanup: Regular docker system prune helps with space

Performance Tips

  1. Use quick modes for daily development
  2. Skip expensive jobs like security scans during iteration
  3. Reuse Kind clusters with ./hack/test-helm-locally.sh
  4. Use local registry for container testing
  5. Run parallel tests when possible

Integration with Development

Git Hooks

Add to .git/hooks/pre-push (and make it executable with chmod +x .git/hooks/pre-push):

#!/bin/bash
echo "Running local lint check before push..."
exec ./hack/test-lint-locally.sh

Using exec makes the hook exit with the lint script's status, so a failing lint blocks the push.

IDE Integration

Many IDEs can run these scripts as build tasks:

VS Code (.vscode/tasks.json):

{
  "version": "2.0.0",
  "tasks": [
    {
      "label": "Test Lint Locally",
      "type": "shell", 
      "command": "./hack/test-lint-locally.sh",
      "group": "test",
      "presentation": {
        "echo": true,
        "reveal": "always",
        "focus": false,
        "panel": "shared"
      }
    }
  ]
}

CI Cost Optimization

By testing locally first:

  • Reduce failed CI runs by ~80%
  • Save GitHub Actions minutes
  • Faster feedback (local runs are often faster)
  • Better debugging (local environment is easier to inspect)

Conclusion

These local testing scripts allow you to:

βœ… Catch issues early before they reach GitHub Actions
βœ… Save costs by reducing failed CI runs
βœ… Debug faster with local environment access
βœ… Test thoroughly with multiple approaches
βœ… Iterate quickly during development

Start with the lint script for daily use, and gradually incorporate the full test suite for comprehensive validation before releases.

Contributing to VirtRigaud

Thank you for your interest in contributing to VirtRigaud! This document provides guidelines and information for contributors.

Development Setup

Prerequisites

  • Go 1.23+
  • Docker
  • Kubernetes cluster (kind, k3s, or remote)
  • kubectl
  • Helm 3.x
  • make

Clone and Setup

git clone https://github.com/projectbeskar/virtrigaud.git
cd virtrigaud

# Install development dependencies
make dev-setup

# Install pre-commit hooks (optional but recommended)
pip install pre-commit
pre-commit install

Development Workflow

1. Making Changes

API Changes

When modifying API types:

# Edit API types
vim api/infra.virtrigaud.io/v1beta1/virtualmachine_types.go

# Generate code and CRDs
make generate
make gen-crds

Code Changes

For other code changes:

# Run tests
make test

# Lint code
make lint

# Format code
make fmt

2. CRD Management

Important: CRDs are generated from code (the source of truth) and are not duplicated in git.

  • config/crd/bases/ - CRDs for local development and releases (checked into git)
  • charts/virtrigaud/crds/ - CRDs for Helm charts (generated during packaging, not checked into git)
# After API changes, generate CRDs
make gen-crds

# For Helm chart development/packaging
make gen-helm-crds

3. Testing

# Unit tests
make test

# Integration tests (requires cluster)
make test-integration

# End-to-end tests
make test-e2e

# Test specific provider
make test-provider-vsphere

4. Local Development

# Deploy to local cluster
make dev-deploy

# Watch for changes and auto-reload
make dev-watch

# Clean up
make dev-clean

Contribution Guidelines

Pull Request Process

  1. Fork and branch: Create a feature branch from main
  2. Make changes: Follow the development workflow above
  3. Test thoroughly: Run all relevant tests
  4. Update docs: Update documentation if needed
  5. CRD sync: Ensure CRDs are synchronized (CI will verify)
  6. Submit PR: Create a pull request with clear description

PR Requirements

  • All tests pass
  • CRDs are in sync (verified by CI)
  • Code is formatted (make fmt)
  • Code is linted (make lint)
  • Documentation updated if needed
  • Changelog entry added (for user-facing changes)

Commit Message Format

Use conventional commit format:

<type>(<scope>): <description>

[optional body]

[optional footer(s)]

Types:

  • feat: New feature
  • fix: Bug fix
  • docs: Documentation changes
  • style: Code style changes
  • refactor: Code refactoring
  • test: Test changes
  • chore: Maintenance tasks

Examples:

feat(vsphere): add graceful shutdown support
fix(crd): resolve powerState validation conflict
docs(upgrade): add CRD synchronization guide

Code Style

Go Code

  • Follow standard Go conventions
  • Use gofmt and golangci-lint
  • Add meaningful comments for exported functions
  • Include unit tests for new functionality

YAML/Kubernetes

  • Use 2-space indentation
  • Follow Kubernetes API conventions
  • Add descriptions to CRD fields
  • Include examples in documentation
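
For instance, a CRD schema fragment following these conventions might look like the following (an illustrative field, not the actual VirtualMachine schema):

```yaml
# Illustrative CRD schema fragment: 2-space indentation,
# a described field, and an explicit enum.
properties:
  powerState:
    type: string
    description: Desired power state of the virtual machine.
    enum: ["On", "Off", "OffGraceful"]
```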

Documentation

  • Use clear, concise language
  • Include code examples
  • Update both API docs and user guides
  • Test documentation examples

Testing

Unit Tests

# Run all unit tests
make test

# Run tests for specific package
go test ./internal/controller/...

# Run with coverage
make test-coverage

Integration Tests

# Requires running Kubernetes cluster
export KUBECONFIG=~/.kube/config
make test-integration

Provider Tests

# Test specific provider (requires infrastructure)
make test-provider-vsphere
make test-provider-libvirt
make test-provider-proxmox

Release Process

For Maintainers

  1. Prepare release:

    # Generate CRDs for config directory (will be in release artifacts)
    make gen-crds
    
    # Update version in charts
    vim charts/virtrigaud/Chart.yaml
    
    # Update changelog
    vim CHANGELOG.md
    
  2. Create release:

    git tag v0.2.1
    git push origin v0.2.1
    
  3. CI handles:

    • Building and pushing images
    • Creating GitHub release
    • Publishing Helm charts
    • Generating CLI binaries

Common Issues

CRD Generation Issues

If you need to regenerate CRDs:

# For local development and config directory
make gen-crds

# For Helm chart packaging
make gen-helm-crds

Note: CRDs in charts/virtrigaud/crds/ are generated during packaging and should not be committed to git.

Test Failures

# Clean and retry
make clean
make test

# For libvirt-related failures
export SKIP_LIBVIRT_TESTS=true
make test

Development Environment

# Reset development environment
make dev-clean
make dev-deploy

# Check logs
kubectl logs -n virtrigaud-system deployment/virtrigaud-manager

Getting Help

  • GitHub Issues: Bug reports and feature requests
  • GitHub Discussions: Questions and community support
  • Documentation: Check docs/ directory
  • Code Review: Maintainers will provide feedback on PRs

Recognition

Contributors are recognized in:

  • CHANGELOG.md for significant contributions
  • README.md contributors section
  • GitHub contributor graphs

Thank you for contributing to VirtRigaud! πŸš€

VirtRigaud Examples

This directory contains comprehensive examples for VirtRigaud v0.2.3+, showcasing all features and capabilities.

Quick Start Examples

Basic Examples

Provider Examples

Resource Examples

v0.2.1 Feature Examples

New in v0.2.1

Advanced Provider Examples

Multi-Provider Examples

v0.2.3 Feature Summary

πŸ”§ VM Reconfiguration (vSphere, Libvirt, Proxmox)

# Online resource changes (vSphere, Proxmox)
# Offline changes (Libvirt - requires restart)
spec:
  vmClassRef: medium  # Change from small to medium
  powerState: "On"

πŸ“‹ Async Task Tracking (vSphere, Proxmox)

# Automatic tracking of long-running operations
# Real-time progress and error reporting

πŸ–₯️ Console Access (vSphere, Libvirt)

# Web console URLs automatically generated
status:
  consoleURL: "https://vcenter.example.com/ui/app/vm..."  # vSphere
  # or
  consoleURL: "vnc://libvirt-host:5900"  # Libvirt VNC

🌐 Guest Agent Integration (Proxmox)

# Accurate IP detection via QEMU guest agent
status:
  ipAddresses:
    - 192.168.1.100
    - fd00::1234:5678:9abc:def0

πŸ“¦ VM Cloning (vSphere)

# Full and linked clones with automatic snapshot handling
spec:
  vmImageRef: source-vm
  cloneType: linked  # or "full"

πŸ”„ Previous Features (v0.2.1)

  • Graceful Shutdown: OffGraceful power state with VMware Tools
  • Hardware Version Management: vSphere hardware version control
  • Proper Disk Sizing: Correct disk allocation across providers
  • Enhanced Lifecycle Management: postStart/preStop hooks

Usage Patterns

Testing v0.2.3 Features

  1. Test VM reconfiguration:

    # Change VM class to trigger reconfiguration
    kubectl patch virtualmachine my-vm --type='merge' \
      -p='{"spec":{"vmClassRef":"medium"}}'
    
    # Watch the reconfiguration process
    kubectl get vm my-vm -w
    
  2. Access VM console:

    # Get console URL from VM status
    kubectl get vm my-vm -o jsonpath='{.status.consoleURL}'
    
    # For VNC (Libvirt): Use any VNC client
    vncviewer $(kubectl get vm my-vm -o jsonpath='{.status.consoleURL}' | sed 's/vnc:\/\///')
    
  3. Monitor async tasks (vSphere, Proxmox):

    # Watch task progress in provider logs
    kubectl logs -f deployment/virtrigaud-provider-vsphere
    
  4. Verify guest agent (Proxmox):

    # Check IP addresses from guest agent
    kubectl get vm my-vm -o jsonpath='{.status.ipAddresses}'
    
  5. Test VM cloning (vSphere):

    # Create a clone of existing VM
    kubectl apply -f - <<EOF
    apiVersion: infra.virtrigaud.io/v1beta1
    kind: VirtualMachine
    metadata:
      name: web-server-clone
    spec:
      vmClassRef: small
      vmImageRef: web-server-01
      cloneType: linked
    EOF
    

Development Workflow

  1. Choose base example based on your use case
  2. Customize provider, class, and VM specifications
  3. Test locally with your infrastructure
  4. Iterate based on your requirements

Production Deployment

  1. Start with complete-example.yaml
  2. Add security configurations from security/ subdirectory
  3. Configure secrets from secrets/ subdirectory
  4. Apply advanced patterns from advanced/ subdirectory

File Organization

docs/examples/
β”œβ”€β”€ README.md                          # This file
β”œβ”€β”€ complete-example.yaml             # Complete setup guide
β”œβ”€β”€ v021-feature-showcase.yaml        # 🌟 v0.2.1 comprehensive demo
β”œβ”€β”€ vm-ubuntu-small.yaml             # Simple VM example
β”œβ”€β”€ vmclass-small.yaml               # Basic VMClass
β”œβ”€β”€ provider-*.yaml                  # Provider configurations
β”œβ”€β”€ graceful-shutdown-examples.yaml  # OffGraceful demos
β”œβ”€β”€ vsphere-hardware-versions.yaml   # Hardware version examples
β”œβ”€β”€ disk-sizing-examples.yaml        # Disk sizing tests
β”œβ”€β”€ advanced/                        # Complex scenarios
β”œβ”€β”€ secrets/                         # Secret management
└── security/                        # Security configurations

Version Compatibility

  • v0.2.3+: All examples with v0.2.3 features (Reconfigure, Clone, TaskStatus, ConsoleURL, Guest Agent)
  • v0.2.2: Nested virtualization, TPM support, snapshot management
  • v0.2.1: Graceful shutdown, hardware version, disk sizing fixes
  • v0.2.0: Initial production-ready providers
  • v0.1.x: Legacy examples in git history

Need Help?


Pro Tip: Start with v021-feature-showcase.yaml to see all v0.2.1 capabilities in action! πŸš€