VirtRigaud Documentation

Welcome to the VirtRigaud documentation. VirtRigaud is a Kubernetes operator for managing virtual machines across multiple hypervisors including vSphere, Libvirt/KVM, and Proxmox VE.

Quick Navigation

Getting Started

Core Documentation

Provider-Specific Guides

Advanced Features

Operations & Administration

Security Configuration

API Reference

Development

Examples Directory

Version Information

This documentation covers VirtRigaud v0.2.3.

Recent Changes

  • v0.2.3: Provider feature parity - Reconfigure, Clone, TaskStatus, ConsoleURL
  • v0.2.2: Nested virtualization, TPM support, snapshot management
  • v0.2.1: Critical fixes and documentation updates
  • v0.2.0: Production-ready vSphere and Libvirt providers

See CHANGELOG.md for complete version history.

Provider Status

| Provider | Status | Maturity | Documentation |
|---|---|---|---|
| vSphere | Production Ready | Stable | Guide |
| Libvirt/KVM | Production Ready | Stable | Guide |
| Proxmox VE | Production Ready | Beta | Guide |
| Mock | Complete | Testing | PROVIDERS.md |

Support

15-Minute Quickstart

This guide will get you up and running with VirtRigaud in 15 minutes using either the vSphere or the Libvirt provider.

Prerequisites

  • Kubernetes cluster (1.24+)
  • kubectl configured
  • Helm 3.x
  • Access to a vSphere environment (optional)
  • Access to a Libvirt/KVM host (optional)

API Support

Default API: v1beta1 - The recommended stable API for all new deployments.

Legacy API: v1alpha1 - Served for compatibility but deprecated. See the upgrade guide for migration instructions.

All resources support seamless conversion between API versions via webhooks.

Step 1: Install VirtRigaud

# Add the VirtRigaud Helm repository
helm repo add virtrigaud https://projectbeskar.github.io/virtrigaud
helm repo update

# Install with default settings (CRDs included automatically)
helm install virtrigaud virtrigaud/virtrigaud \
  --namespace virtrigaud-system \
  --create-namespace

# Or install with specific providers enabled
helm install virtrigaud virtrigaud/virtrigaud \
  --namespace virtrigaud-system \
  --create-namespace \
  --set providers.vsphere.enabled=true \
  --set providers.libvirt.enabled=true

# To skip CRDs if already installed separately
helm install virtrigaud virtrigaud/virtrigaud \
  --namespace virtrigaud-system \
  --create-namespace \
  --skip-crds

Using Kustomize

# Clone the repository
git clone https://github.com/projectbeskar/virtrigaud.git
cd virtrigaud

# Apply base installation
kubectl apply -k deploy/kustomize/base

# Or apply with overlays
kubectl apply -k deploy/kustomize/overlays/standard

Step 2: Verify Installation

# Check that the manager is running
kubectl get pods -n virtrigaud-system

# Check CRDs are installed
kubectl get crds | grep virtrigaud

# Verify API conversion is working (v1alpha1 <-> v1beta1)
kubectl get crd virtualmachines.infra.virtrigaud.io -o yaml | yq '.spec.conversion'

# Check manager logs
kubectl logs -n virtrigaud-system deployment/virtrigaud-manager

Step 3: Configure a Provider

Option A: vSphere Provider

Create a secret with vSphere credentials:

kubectl create secret generic vsphere-credentials \
  --namespace default \
  --from-literal=endpoint=https://vcenter.example.com \
  --from-literal=username=administrator@vsphere.local \
  --from-literal=password=your-password \
  --from-literal=insecure=false

Create a vSphere provider:

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: vsphere-prod
  namespace: default
spec:
  type: vsphere
  endpoint: https://vcenter.example.com
  credentialSecretRef:
    name: vsphere-credentials
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-vsphere:v0.2.3"
    service:
      port: 9090
  defaults:
    datastore: "datastore1"
    cluster: "cluster1"
    folder: "virtrigaud-vms"

Option B: Libvirt Provider

Create a secret with Libvirt connection details:

kubectl create secret generic libvirt-credentials \
  --namespace default \
  --from-literal=uri=qemu+ssh://root@libvirt-host.example.com/system \
  --from-literal=username=root \
  --from-literal=privateKey="$(cat ~/.ssh/id_rsa)"

Create a Libvirt provider:

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: libvirt-lab
  namespace: default
spec:
  type: libvirt
  endpoint: qemu+ssh://root@libvirt-host.example.com/system
  credentialSecretRef:
    name: libvirt-credentials
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-libvirt:v0.2.3"
    service:
      port: 9090
  defaults:
    defaultStoragePool: "default"
    defaultNetwork: "default"

Apply the provider configuration:

kubectl apply -f provider.yaml

πŸ’‘ Behind the scenes: VirtRigaud automatically converts your Provider resource into the appropriate command-line arguments, environment variables, and secret mounts for the provider pod. See the configuration flow documentation for complete details.
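As a rough sketch, the Provider above results in a provider pod wired up along these lines (the exact Deployment is generated by the provider controller; names in this fragment are illustrative):

```yaml
# Illustrative fragment of the generated provider pod spec.
containers:
  - name: provider
    image: ghcr.io/projectbeskar/virtrigaud/provider-vsphere:v0.2.3  # spec.runtime.image
    ports:
      - containerPort: 9090              # spec.runtime.service.port
    env:
      - name: PROVIDER_ENDPOINT          # derived from spec.endpoint
        value: https://vcenter.example.com
    volumeMounts:
      - name: credentials                # from spec.credentialSecretRef
        mountPath: /etc/virtrigaud/credentials
        readOnly: true
volumes:
  - name: credentials
    secret:
      secretName: vsphere-credentials
```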

Step 4: Create a VM Class

Define resource templates for your VMs:

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: small
  namespace: default
spec:
  cpu: 2
  memoryMiB: 2048
  disks:
  - name: root
    sizeGiB: 20
    type: thin
  networks:
  - name: default
    type: "VM Network"  # vSphere network name

Apply the VM class:

kubectl apply -f vmclass.yaml

Step 5: Create a VM Image

Define the base image for your VMs:

vSphere Image (OVA)

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMImage
metadata:
  name: ubuntu-20-04
  namespace: default
spec:
  source:
    vsphere:
      ovaURL: "https://cloud-images.ubuntu.com/releases/20.04/ubuntu-20.04-server-cloudimg-amd64.ova"
      checksum: "sha256:abc123..."
      datastore: "datastore1"
      folder: "vm-templates"
  prepare:
    onMissing: Import
    timeout: "30m"

Libvirt Image (qcow2)

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMImage
metadata:
  name: ubuntu-20-04
  namespace: default
spec:
  source:
    libvirt:
      qcow2URL: "https://cloud-images.ubuntu.com/releases/20.04/ubuntu-20.04-server-cloudimg-amd64.img"
      checksum: "sha256:def456..."
      storagePool: "default"
  prepare:
    onMissing: Import
    timeout: "30m"

Apply the image:

kubectl apply -f vmimage.yaml

Step 6: Create Your First VM

apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: my-first-vm
  namespace: default
spec:
  providerRef:
    name: vsphere-prod  # or libvirt-lab
    namespace: default
  classRef:
    name: small
    namespace: default
  imageRef:
    name: ubuntu-20-04
    namespace: default
  powerState: "On"
  userData:
    cloudInit:
      inline: |
        #cloud-config
        users:
          - name: ubuntu
            sudo: ALL=(ALL) NOPASSWD:ALL
            ssh_authorized_keys:
              - ssh-rsa AAAAB3... your-public-key
        packages:
          - curl
          - vim
  networks:
  - name: default
    networkRef:
      name: default-network
      namespace: default

Apply the manifest:

kubectl apply -f vm.yaml

Step 7: Monitor VM Creation

# Watch VM status
kubectl get vm my-first-vm -w

# Check detailed status
kubectl describe vm my-first-vm

# View events
kubectl get events --field-selector involvedObject.name=my-first-vm

# Check provider logs
kubectl logs -n virtrigaud-system deployment/virtrigaud-provider-vsphere

Step 8: Access Your VM

# Get VM IP address
kubectl get vm my-first-vm -o jsonpath='{.status.ips[0]}'

# Get console URL (if supported)
kubectl get vm my-first-vm -o jsonpath='{.status.consoleURL}'

# SSH to the VM (once it has an IP)
ssh ubuntu@<vm-ip>

Step 9: Try Advanced Operations

Create a Snapshot

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMSnapshot
metadata:
  name: my-vm-snapshot
  namespace: default
spec:
  vmRef:
    name: my-first-vm
  nameHint: "pre-update-snapshot"
  memory: true

Clone the VM

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClone
metadata:
  name: my-vm-clone
  namespace: default
spec:
  sourceRef:
    name: my-first-vm
  target:
    name: cloned-vm
    classRef:
      name: small
      namespace: default
  linked: true

Scale with VMSet

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMSet
metadata:
  name: web-servers
  namespace: default
spec:
  replicas: 3
  template:
    spec:
      providerRef:
        name: vsphere-prod
        namespace: default
      classRef:
        name: small
        namespace: default
      imageRef:
        name: ubuntu-20-04
        namespace: default
      powerState: "On"

Step 10: Clean Up

# Delete VM
kubectl delete vm my-first-vm

# Delete snapshots and clones
kubectl delete vmsnapshot my-vm-snapshot
kubectl delete vmclone my-vm-clone
kubectl delete vmset web-servers

# Uninstall VirtRigaud (optional)
helm uninstall virtrigaud -n virtrigaud-system
kubectl delete namespace virtrigaud-system

Next Steps

Troubleshooting

If you encounter issues:

  1. Check the Troubleshooting Guide
  2. Verify your provider credentials and connectivity
  3. Check the manager and provider logs
  4. Ensure your Kubernetes cluster meets the requirements
  5. File an issue on GitHub

Helm-only Installation & Verify Conversion

This guide covers installing virtrigaud using only Helm (without pre-applying CRDs via Kustomize) and verifying that API conversion is working correctly.

Helm-only Install

VirtRigaud can be installed using only Helm, which will automatically install all required CRDs including conversion webhook configuration.

Prerequisites

  • Kubernetes cluster (1.26+)
  • Helm 3.8+
  • kubectl configured to access your cluster

Installation

# Add the virtrigaud Helm repository
helm repo add virtrigaud https://projectbeskar.github.io/virtrigaud
helm repo update

# Or install directly from source
git clone https://github.com/projectbeskar/virtrigaud.git
cd virtrigaud

# Install virtrigaud with CRDs
helm install virtrigaud charts/virtrigaud \
  --namespace virtrigaud \
  --create-namespace \
  --wait \
  --timeout 10m



Skip CRDs (if already installed)

If you need to install the chart without CRDs (e.g., they're managed separately):

helm install virtrigaud charts/virtrigaud \
  --namespace virtrigaud \
  --create-namespace \
  --skip-crds \
  --wait

Verify Conversion

After installation, verify that API conversion is working correctly.

Check CRD Conversion Configuration

# Verify all CRDs have conversion webhook configuration
kubectl get crd virtualmachines.infra.virtrigaud.io -o yaml | yq '.spec.conversion'

Expected output:

strategy: Webhook
webhook:
  clientConfig:
    service:
      name: virtrigaud-webhook
      namespace: virtrigaud
      path: /convert
  conversionReviewVersions:
  - v1

Check API Versions

Verify that both v1alpha1 and v1beta1 versions are available:

# Check available versions for VirtualMachine CRD
kubectl get crd virtualmachines.infra.virtrigaud.io -o jsonpath='{.spec.versions[*].name}' | tr ' ' '\n'

Expected output:

v1alpha1
v1beta1

Verify Storage Version

Confirm that v1beta1 is set as the storage version:

# Check storage version
kubectl get crd virtualmachines.infra.virtrigaud.io -o jsonpath='{.spec.versions[?(@.storage==true)].name}'

Expected output:

v1beta1

Test Conversion

Create resources using different API versions and verify conversion works:

# Create a VM using the legacy v1alpha1 API
cat <<EOF | kubectl apply -f -
apiVersion: infra.virtrigaud.io/v1alpha1
kind: VirtualMachine
metadata:
  name: test-vm-alpha
  namespace: default
spec:
  providerRef:
    name: test-provider
  classRef:
    name: small
  imageRef:
    name: ubuntu-22
  powerState: "On"
EOF

# Read it back as v1beta1
kubectl get vm test-vm-alpha -o yaml | grep "apiVersion:"
# Should show: apiVersion: infra.virtrigaud.io/v1beta1

# Create a VM using v1beta1 API
cat <<EOF | kubectl apply -f -
apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: test-vm-beta
  namespace: default
spec:
  providerRef:
    name: test-provider
  classRef:
    name: small
  imageRef:
    name: ubuntu-22
  powerState: "On"
EOF

# Clean up test resources
kubectl delete vm test-vm-alpha test-vm-beta

Troubleshooting

Conversion Webhook Missing

If the conversion webhook is missing or not configured:

# Check if webhook service exists
kubectl get svc virtrigaud-webhook -n virtrigaud

# Check webhook pod logs
kubectl logs -l app.kubernetes.io/name=virtrigaud -n virtrigaud

# Verify webhook certificate
kubectl get secret virtrigaud-webhook-certs -n virtrigaud

Conversion Webhook Failing

If conversion is failing:

# Check conversion webhook logs
kubectl logs -l app.kubernetes.io/name=virtrigaud -n virtrigaud | grep conversion

# Test webhook connectivity
kubectl get --raw "/api/v1/namespaces/virtrigaud/services/virtrigaud-webhook:webhook/proxy/convert"

# Check webhook certificate validity
kubectl get secret virtrigaud-webhook-certs -n virtrigaud -o yaml

API Version Issues

If certain API versions aren’t working:

# List all available APIs
kubectl api-resources | grep virtrigaud

# Check specific CRD status
kubectl describe crd virtualmachines.infra.virtrigaud.io

# Verify controller is running
kubectl get pods -l app.kubernetes.io/name=virtrigaud -n virtrigaud

Integration with GitOps

ArgoCD

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: virtrigaud
spec:
  source:
    chart: virtrigaud
    repoURL: https://projectbeskar.github.io/virtrigaud
    targetRevision: "0.2.3"
    helm:
      values: |
        manager:
          image:
            repository: ghcr.io/projectbeskar/virtrigaud/manager
            tag: v0.2.3

Flux

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: virtrigaud
spec:
  chart:
    spec:
      chart: virtrigaud
      sourceRef:
        kind: HelmRepository
        name: virtrigaud
      version: "0.2.3"
  values:
    manager:
      image:
        repository: ghcr.io/projectbeskar/virtrigaud/manager
        tag: v0.2.3

Migration from Kustomize to Helm

If you’re currently using Kustomize for CRD management and want to switch to Helm:

  1. Backup existing resources:

    kubectl get vms,providers,vmclasses -A -o yaml > virtrigaud-backup.yaml
    
  2. Uninstall Kustomize-managed CRDs (optional):

    kubectl delete -k deploy/kustomize/base
    
  3. Install via Helm:

    helm install virtrigaud charts/virtrigaud --namespace virtrigaud --create-namespace
    
  4. Restore resources:

    kubectl apply -f virtrigaud-backup.yaml
    

The conversion webhook will handle any necessary API version transformations automatically.

Automatic CRD Upgrades in VirtRigaud Helm Chart

Overview

The VirtRigaud Helm chart supports automatic CRD upgrades during helm upgrade. This eliminates manual CRD management and provides a seamless upgrade experience.

The Problem

By default, Helm has a limitation:

  • CRDs are installed during helm install
  • CRDs are NOT upgraded during helm upgrade

This means users had to manually apply CRD updates before upgrading, which was:

  • Error-prone
  • Easy to forget
  • Breaks GitOps workflows
  • Causes version drift between chart and CRDs

The Solution

VirtRigaud uses Helm Hooks with a Kubernetes Job to automatically apply CRDs during both install and upgrade:

kubectl Image

VirtRigaud builds and publishes its own kubectl image as part of the release process. This image:

  • Based on Alpine Linux for minimal size (~50MB)
  • Includes kubectl 1.32.0 binary from official Kubernetes releases
  • Includes bash and shell for scripting support
  • Runs as non-root user (UID 65532)
  • Verified with SHA256 checksums
  • Signed with Cosign and includes SBOM
  • Security scanned but uses official kubectl binary (vulnerabilities tracked upstream)

The image is automatically built and tagged to match each VirtRigaud release version, ensuring version consistency across all components.

Image Location: ghcr.io/projectbeskar/virtrigaud/kubectl:<version>

How It Works

  1. Pre-Upgrade Hook: Before the main upgrade starts, a Job is created
  2. CRD Application: The Job applies all CRDs using kubectl apply --server-side
  3. Safe Upgrades: Server-side apply handles conflicts gracefully
  4. Automatic Cleanup: Job is deleted after successful completion
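Conceptually, the hook Job the chart creates boils down to a manifest like this (names and field values are illustrative; the real template ships with the chart):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: virtrigaud-crd-upgrade            # illustrative name
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "0"            # runs after the -10/-5 hooks
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      serviceAccountName: virtrigaud-crd-upgrade
      containers:
        - name: kubectl
          image: ghcr.io/projectbeskar/virtrigaud/kubectl:v0.2.3
          command: ["kubectl", "apply", "--server-side", "-f", "/crds/"]
          volumeMounts:
            - name: crds                  # ConfigMap created by the -10 hook
              mountPath: /crds
      volumes:
        - name: crds
          configMap:
            name: virtrigaud-crds         # illustrative name
```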

Architecture

helm upgrade virtrigaud
    ↓
[Pre-Upgrade Hook -10]
    ↓
ConfigMap with CRDs created
    ↓
[Pre-Upgrade Hook -5]
    ↓
ServiceAccount + RBAC created
    ↓
[Pre-Upgrade Hook 0]
    ↓
Job applies CRDs via kubectl
    ↓
[Standard Helm Resources]
    ↓
Manager & Providers deployed
    ↓
[Hook Cleanup]
    ↓
Job & Hook resources deleted

Features

Enabled by Default

No configuration needed - just works:

helm upgrade virtrigaud virtrigaud/virtrigaud -n virtrigaud-system

Server-Side Apply

Uses kubectl apply --server-side for:

  • Safe conflict resolution
  • Field management
  • No ownership conflicts

GitOps Compatible

Works seamlessly with:

  • ArgoCD: Helm hooks execute properly
  • Flux: Compatible with HelmRelease CRD upgrades
  • Terraform: Helm provider handles hooks

Configurable

Customize the upgrade behavior:

crdUpgrade:
  enabled: true  # Enable/disable automatic upgrades
  
  image:
    repository: ghcr.io/projectbeskar/virtrigaud/kubectl  # VirtRigaud kubectl image
    tag: "v0.2.0"  # Auto-updated to match release version
  
  backoffLimit: 3
  ttlSecondsAfterFinished: 300
  waitSeconds: 5
  
  resources:
    limits:
      cpu: 100m
      memory: 128Mi

Usage Examples

Standard Upgrade (Automatic CRDs)

# CRDs are automatically upgraded
helm upgrade virtrigaud virtrigaud/virtrigaud \
  -n virtrigaud-system

Disable Automatic CRD Upgrade

# Disable if you manage CRDs separately
helm upgrade virtrigaud virtrigaud/virtrigaud \
  -n virtrigaud-system \
  --set crdUpgrade.enabled=false

Manual CRD Management

# Apply CRDs manually before upgrade
kubectl apply -f charts/virtrigaud/crds/

# Then upgrade without CRD management
helm upgrade virtrigaud virtrigaud/virtrigaud \
  -n virtrigaud-system \
  --set crdUpgrade.enabled=false

Skip CRDs Entirely

# Skip CRDs during upgrade (for external CRD management)
helm upgrade virtrigaud virtrigaud/virtrigaud \
  -n virtrigaud-system \
  --skip-crds \
  --set crdUpgrade.enabled=false

GitOps Integration

ArgoCD

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: virtrigaud
spec:
  source:
    chart: virtrigaud
    targetRevision: 0.2.2
    helm:
      values: |
        crdUpgrade:
          enabled: true  # Automatic upgrades work!
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Note: ArgoCD executes Helm hooks properly, so CRDs will be upgraded automatically.

Flux

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: virtrigaud
spec:
  chart:
    spec:
      chart: virtrigaud
      version: 0.2.2
  values:
    crdUpgrade:
      enabled: true  # Automatic upgrades work!
  install:
    crds: CreateReplace
  upgrade:
    crds: CreateReplace

Note: Flux’s crds: CreateReplace works alongside our hook-based upgrades for maximum compatibility.

Troubleshooting

Check CRD Upgrade Job

# View job status
kubectl get jobs -n virtrigaud-system -l app.kubernetes.io/component=crd-upgrade

# View job logs
kubectl logs -n virtrigaud-system -l app.kubernetes.io/component=crd-upgrade

# View job details
kubectl describe job -n virtrigaud-system -l app.kubernetes.io/component=crd-upgrade

Common Issues

1. RBAC Permissions

Symptom: Job fails with β€œforbidden” errors

Solution: Ensure the ServiceAccount has CRD permissions:

kubectl get clusterrole -l app.kubernetes.io/component=crd-upgrade
kubectl describe clusterrole <role-name>

2. Image Pull Failures

Symptom: Job fails to start, ImagePullBackOff

Solution: Check image configuration:

crdUpgrade:
  image:
    repository: ghcr.io/projectbeskar/virtrigaud/kubectl
    tag: "v0.2.3"  # Use matching VirtRigaud version
    pullPolicy: IfNotPresent

3. CRD Conflicts

Symptom: Apply errors about field conflicts

Solution: Server-side apply handles this automatically, but you can force:

kubectl apply --server-side=true --force-conflicts -f charts/virtrigaud/crds/

4. Job Not Cleaning Up

Symptom: Old jobs remain after upgrade

Solution: Adjust TTL or manually clean:

kubectl delete jobs -n virtrigaud-system -l app.kubernetes.io/component=crd-upgrade

Debug Mode

Enable verbose logging:

helm upgrade virtrigaud virtrigaud/virtrigaud \
  -n virtrigaud-system \
  --debug

Migration Guide

Migrating from Manual CRD Management

If you were previously managing CRDs manually:

  1. Enable automatic upgrades:

    helm upgrade virtrigaud virtrigaud/virtrigaud \
      -n virtrigaud-system \
      --set crdUpgrade.enabled=true
    
  2. Verify CRDs are upgraded:

    kubectl get crd -l app.kubernetes.io/name=virtrigaud
    
  3. Remove manual steps from your upgrade process

Migrating to External CRD Management

If you want to manage CRDs externally (e.g., separate Helm chart):

  1. Disable automatic upgrades:

    crdUpgrade:
      enabled: false
    
  2. Extract CRDs:

    helm show crds virtrigaud/virtrigaud > my-crds.yaml
    
  3. Manage CRDs separately:

    kubectl apply -f my-crds.yaml
    

Technical Details

Hook Weights

The upgrade process uses weighted hooks for proper ordering:

| Weight | Resource | Purpose |
|---|---|---|
| -10 | ConfigMap | Store CRD content |
| -5 | RBAC | Create permissions |
| 0 | Job | Apply CRDs |

Resource Requirements

The CRD upgrade job is lightweight:

resources:
  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 50m
    memory: 64Mi

Security

  • Runs as non-root user (65532)
  • Read-only root filesystem
  • No privilege escalation
  • Minimal RBAC (only CRD permissions)
  • Automatic cleanup after completion

Compatibility

  • Kubernetes: 1.25+
  • Helm: 3.8+
  • kubectl: 1.24+ (in Job image)

Best Practices

  1. Use Automatic Upgrades: Enable by default for best UX
  2. Monitor Job Logs: Check logs during first upgrade
  3. Test in Dev First: Verify upgrades in non-production
  4. Backup CRDs: Keep backups before major upgrades
  5. Review Changelogs: Check for breaking CRD changes

FAQ

Q: Will this delete my existing resources?

A: No. CRD upgrades are additive and preserve existing Custom Resources.

Q: What happens if the job fails?

A: Helm upgrade will fail, leaving your cluster in the previous state. Fix the issue and retry.

Q: Can I use this with ArgoCD?

A: Yes! ArgoCD properly executes Helm hooks.

Q: Does this work with Flux?

A: Yes! Flux HelmRelease handles hooks correctly.

Q: How do I roll back?

A: Use helm rollback. CRDs are not rolled back (Kubernetes limitation).

Q: Can I customize the kubectl image?

A: Yes, via crdUpgrade.image.repository and crdUpgrade.image.tag. The default is VirtRigaud's own kubectl image (ghcr.io/projectbeskar/virtrigaud/kubectl), tagged to match the release.

References

Custom Resource Definitions (CRDs)

This document describes all the Custom Resource Definitions (CRDs) provided by virtrigaud.

VirtualMachine

The VirtualMachine CRD represents a virtual machine instance.

Spec

| Field | Type | Required | Description |
|---|---|---|---|
| providerRef | ObjectRef | Yes | Reference to the Provider resource |
| classRef | ObjectRef | Yes | Reference to the VMClass resource |
| imageRef | ObjectRef | Yes | Reference to the VMImage resource |
| networks | []VMNetworkRef | No | Network attachments |
| disks | []DiskSpec | No | Additional disks |
| userData | UserData | No | Cloud-init configuration |
| metaData | MetaData | No | Cloud-init metadata configuration |
| placement | Placement | No | Placement hints |
| powerState | string | No | Desired power state (On/Off) |
| tags | []string | No | Tags for organization |

Status

| Field | Type | Description |
|---|---|---|
| id | string | Provider-specific VM identifier |
| powerState | string | Current power state |
| ips | []string | Assigned IP addresses |
| consoleURL | string | Console access URL |
| conditions | []Condition | Status conditions |
| observedGeneration | int64 | Last observed generation |
| lastTaskRef | string | Reference to last async task |
| provider | map[string]string | Provider-specific details |

Example

apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: demo-web-01
spec:
  providerRef:
    name: vsphere-prod
  classRef:
    name: small
  imageRef:
    name: ubuntu-22-template
  networks:
    - name: app-net
      ipPolicy: dhcp
  powerState: "On"

VMClass

The VMClass CRD defines resource allocation for virtual machines.

Spec

| Field | Type | Required | Description |
|---|---|---|---|
| cpu | int32 | Yes | Number of virtual CPUs |
| memoryMiB | int32 | Yes | Memory in MiB |
| firmware | string | No | Firmware type (BIOS/UEFI) |
| diskDefaults | DiskDefaults | No | Default disk settings |
| guestToolsPolicy | string | No | Guest tools policy |
| extraConfig | map[string]string | No | Provider-specific configuration |

Example

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: small
spec:
  cpu: 2
  memoryMiB: 4096
  firmware: UEFI
  diskDefaults:
    type: thin
    sizeGiB: 40

VMImage

The VMImage CRD defines base templates/images for virtual machines.

Spec

| Field | Type | Required | Description |
|---|---|---|---|
| vsphere | VSphereImageSpec | No | vSphere-specific configuration |
| libvirt | LibvirtImageSpec | No | Libvirt-specific configuration |
| prepare | ImagePrepare | No | Image preparation options |

Example

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMImage
metadata:
  name: ubuntu-22-template
spec:
  vsphere:
    templateName: "tmpl-ubuntu-22.04-cloudimg"
  libvirt:
    url: "https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img"
    format: qcow2

VMNetworkAttachment

The VMNetworkAttachment CRD defines network configurations.

Spec

| Field | Type | Required | Description |
|---|---|---|---|
| vsphere | VSphereNetworkSpec | No | vSphere-specific network config |
| libvirt | LibvirtNetworkSpec | No | Libvirt-specific network config |
| ipPolicy | string | No | IP assignment policy |
| macAddress | string | No | Static MAC address |

Example

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMNetworkAttachment
metadata:
  name: app-net
spec:
  vsphere:
    portgroup: "PG-App"
  ipPolicy: dhcp

Provider

The Provider CRD configures hypervisor connection details.

Spec

| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Provider type (vsphere/libvirt/etc) |
| endpoint | string | Yes | Provider endpoint URI |
| credentialSecretRef | ObjectRef | Yes | Secret containing credentials |
| insecureSkipVerify | bool | No | Skip TLS verification |
| defaults | ProviderDefaults | No | Default placement settings |
| rateLimit | RateLimit | No | API rate limiting |

Example

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: vsphere-prod
spec:
  type: vsphere
  endpoint: https://vcenter.example.com
  credentialSecretRef:
    name: vsphere-creds
  defaults:
    datastore: datastore1
    cluster: compute-cluster-a

Common Types

ObjectRef

| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Object name |
| namespace | string | No | Object namespace |

DiskSpec

| Field | Type | Required | Description |
|---|---|---|---|
| sizeGiB | int32 | Yes | Disk size in GiB |
| type | string | No | Disk type |
| name | string | No | Disk name |

UserData

| Field | Type | Required | Description |
|---|---|---|---|
| cloudInit | CloudInitConfig | No | Cloud-init configuration |

MetaData

| Field | Type | Required | Description |
|---|---|---|---|
| inline | string | No | Inline cloud-init metadata in YAML format |
| secretRef | ObjectRef | No | Secret containing cloud-init metadata |

CloudInitConfig

| Field | Type | Required | Description |
|---|---|---|---|
| secretRef | ObjectRef | No | Secret containing cloud-init data |
| inline | string | No | Inline cloud-init configuration |

Examples

This document provides practical examples for using VirtRigaud with the Remote provider architecture.

Quick Start Examples

All VirtRigaud providers now run as Remote providers. Here are the essential examples to get started:

Basic Provider Setup

Complete Working Examples

Individual Resource Examples

Advanced Examples

Example Directory Structure

docs/examples/
β”œβ”€β”€ provider-*.yaml          # Provider configurations
β”œβ”€β”€ complete-example.yaml    # Full working setup
β”œβ”€β”€ *-advanced-example.yaml  # Production configurations
β”œβ”€β”€ vm*.yaml                 # Individual resource definitions
β”œβ”€β”€ advanced/                # Advanced operations
β”œβ”€β”€ security/                # Security configurations
└── secrets/                 # Credential examples

Key Changes from Previous Versions

Remote-Only Architecture

All providers now run as separate pods with the Remote runtime:

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: my-provider
spec:
  type: vsphere  # or libvirt, proxmox
  endpoint: https://vcenter.example.com
  credentialSecretRef:
    name: provider-creds
  runtime:
    mode: Remote              # Required - only mode supported
    image: "ghcr.io/projectbeskar/virtrigaud/provider-vsphere:v0.2.3"
    service:
      port: 9090

Current API Schema (v0.2.3)

  • VMClass: Standard Kubernetes resource quantities (cpus: 4, memory: "4Gi")
  • VMImage: Provider-specific source configurations
  • VMNetworkAttachment: Network provider abstractions
  • VirtualMachine: Declarative power state management

Configuration Management

Providers receive configuration through:

  • Endpoint: Environment variable PROVIDER_ENDPOINT
  • Credentials: Mounted secret files in /etc/virtrigaud/credentials/
  • Runtime: Managed automatically by the provider controller

Getting Started

  1. Choose your provider from the basic examples above
  2. Create credentials secret (see examples/secrets/)
  3. Apply provider configuration with required runtime section
  4. Define VM resources (VMClass, VMImage, VMNetworkAttachment)
  5. Create VirtualMachine referencing your resources

For detailed setup instructions, see:

Need Help?

Provider Development Guide

This document explains how to implement a new provider for VirtRigaud.

Overview

Providers are responsible for implementing VM lifecycle operations on specific hypervisor platforms. VirtRigaud uses a Remote Provider architecture where each provider runs as an independent gRPC service, communicating with the manager controller.

Provider Interface

All providers must implement the contracts.Provider interface:

type Provider interface {
    // Validate ensures the provider session/credentials are healthy
    Validate(ctx context.Context) error

    // Create creates a new VM if it doesn't exist (idempotent)
    Create(ctx context.Context, req CreateRequest) (CreateResponse, error)

    // Delete removes a VM (idempotent)
    Delete(ctx context.Context, id string) (taskRef string, err error)

    // Power performs a power operation on the VM
    Power(ctx context.Context, id string, op PowerOp) (taskRef string, err error)

    // Reconfigure modifies VM resources
    Reconfigure(ctx context.Context, id string, desired CreateRequest) (taskRef string, err error)

    // Describe returns the current state of the VM
    Describe(ctx context.Context, id string) (DescribeResponse, error)

    // IsTaskComplete checks if an async task is complete
    IsTaskComplete(ctx context.Context, taskRef string) (done bool, err error)
}

Implementation Steps

1. Create Provider Package

Create a new package under internal/providers/ for your provider:

internal/providers/yourprovider/
β”œβ”€β”€ provider.go      # Main provider implementation
β”œβ”€β”€ session.go       # Connection/session management
β”œβ”€β”€ tasks.go         # Async task handling
β”œβ”€β”€ converter.go     # Type conversions
β”œβ”€β”€ network.go       # Network operations
└── storage.go       # Storage operations

2. Implement the Provider

package yourprovider

import (
    "context"
    "github.com/projectbeskar/virtrigaud/api/v1beta1"
    "github.com/projectbeskar/virtrigaud/internal/providers/contracts"
)

type Provider struct {
    config   *v1beta1.Provider
    client   YourProviderClient
}

func NewProvider(ctx context.Context, provider *v1beta1.Provider) (contracts.Provider, error) {
    // Initialize your provider client, parse credentials from the
    // referenced secret, and establish a connection.
    client, err := newYourProviderClient(ctx, provider)
    if err != nil {
        return nil, err
    }
    return &Provider{
        config: provider,
        client: client,
    }, nil
}

func (p *Provider) Validate(ctx context.Context) error {
    // Check connection health
    // Validate credentials
    return nil
}

// Implement other interface methods...

3. Create Provider gRPC Server

Create a gRPC server for your provider:

// cmd/provider-yourprovider/main.go
package main

import (
    "context"
    "log"
    "net"
    
    "google.golang.org/grpc"
    "github.com/projectbeskar/virtrigaud/pkg/grpc/provider"
    "github.com/projectbeskar/virtrigaud/internal/providers/yourprovider"
)

func main() {
    lis, err := net.Listen("tcp", ":9090")
    if err != nil {
        log.Fatal(err)
    }
    
    s := grpc.NewServer()
    provider.RegisterProviderServer(s, &yourprovider.GRPCServer{})
    
    log.Println("Provider server listening on :9090")
    if err := s.Serve(lis); err != nil {
        log.Fatal(err)
    }
}

4. Handle Credentials

Providers should read credentials from Kubernetes secrets. Common credential fields:

  • username / password: Basic authentication
  • token: API token authentication
  • tls.crt / tls.key: TLS client certificates

Example:

func (p *Provider) getCredentials(ctx context.Context) (*Credentials, error) {
    secret := &corev1.Secret{}
    err := p.client.Get(ctx, types.NamespacedName{
        Name:      p.config.Spec.CredentialSecretRef.Name,
        Namespace: p.config.Namespace,
    }, secret)
    if err != nil {
        return nil, err
    }

    return &Credentials{
        Username: string(secret.Data["username"]),
        Password: string(secret.Data["password"]),
    }, nil
}

Error Handling

Use the provided error types for consistent error handling:

import "github.com/projectbeskar/virtrigaud/internal/providers/contracts"

// For not found errors
return contracts.NewNotFoundError("VM not found", err)

// For retryable errors
return contracts.NewRetryableError("Connection timeout", err)

// For validation errors
return contracts.NewInvalidSpecError("Invalid CPU count", nil)

Asynchronous Operations

For long-running operations, return a task reference:

func (p *Provider) Create(ctx context.Context, req CreateRequest) (CreateResponse, error) {
    taskID, err := p.client.CreateVMAsync(...)
    if err != nil {
        return CreateResponse{}, err
    }

    return CreateResponse{
        ID:      vmID,
        TaskRef: taskID,
    }, nil
}

func (p *Provider) IsTaskComplete(ctx context.Context, taskRef string) (bool, error) {
    task, err := p.client.GetTask(taskRef)
    if err != nil {
        return false, err
    }
    return task.IsComplete(), nil
}

Type Conversions

Convert between CRD types and provider-specific types:

func (p *Provider) convertVMClass(class contracts.VMClass) YourProviderVMSpec {
    return YourProviderVMSpec{
        CPUs:   class.CPU,
        Memory: class.MemoryMiB * 1024 * 1024, // Convert to bytes
        // ... other conversions
    }
}

Testing

Create unit tests for your provider:

func TestProvider_Create(t *testing.T) {
    provider := &Provider{
        client: &mockClient{},
    }

    req := contracts.CreateRequest{
        Name: "test-vm",
        // ... populate request
    }

    resp, err := provider.Create(context.Background(), req)
    assert.NoError(t, err)
    assert.NotEmpty(t, resp.ID)
}

Provider-Specific CRD Fields

Update the CRD types to include provider-specific fields:

// In VMImage types
type YourProviderImageSpec struct {
    ImageID   string `json:"imageId,omitempty"`
    Checksum  string `json:"checksum,omitempty"`
}

// In VMNetworkAttachment types
type YourProviderNetworkSpec struct {
    NetworkID string `json:"networkId,omitempty"`
    VLAN      int32  `json:"vlan,omitempty"`
}

Best Practices

  1. Idempotency: All operations should be idempotent
  2. Error Classification: Use appropriate error types
  3. Resource Cleanup: Ensure proper cleanup in Delete operations
  4. Logging: Use structured logging with context
  5. Timeouts: Respect context timeouts
  6. Rate Limiting: Implement client-side rate limiting
  7. Retry Logic: Handle transient failures gracefully

Examples

See the existing providers for reference:

  • internal/providers/vsphere/ - vSphere implementation
  • internal/providers/libvirt/ - Libvirt implementation (production ready)

Provider Configuration

Each provider type should support these configuration options:

  • Connection endpoints
  • Authentication credentials
  • Default placement settings
  • Rate limiting configuration
  • Provider-specific options

Example Provider spec:

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: my-provider
spec:
  type: yourprovider
  endpoint: https://api.yourprovider.com
  credentialSecretRef:
    name: provider-creds
  defaults:
    region: us-west-2
    zone: us-west-2a
  rateLimit:
    qps: 10
    burst: 20

Provider Capabilities Matrix

This document provides a comprehensive overview of VirtRigaud provider capabilities as of v0.2.3.

Overview

VirtRigaud supports multiple hypervisor platforms through a provider architecture. Each provider implements the core VirtRigaud API while supporting platform-specific features and capabilities.

Core Provider Interface

All providers implement these core operations:

  • Validate: Test provider connectivity and credentials
  • Create: Create new virtual machines
  • Delete: Remove virtual machines and cleanup resources
  • Power: Control VM power state (On/Off/Reboot)
  • Describe: Query VM state and properties
  • GetCapabilities: Report provider-specific capabilities

Provider Status

| Provider | Status | Implementation | Maturity |
|---|---|---|---|
| vSphere | ✅ Production Ready | govmomi-based | Stable |
| Libvirt/KVM | ✅ Production Ready | virsh-based | Stable |
| Proxmox VE | ✅ Production Ready | REST API-based | Beta |
| Mock | ✅ Complete | In-memory simulation | Testing |

Comprehensive Capability Matrix

Core Operations

| Capability | vSphere | Libvirt | Proxmox | Mock | Notes |
|---|---|---|---|---|---|
| VM Create | ✅ | ✅ | ✅ | ✅ | All providers support VM creation |
| VM Delete | ✅ | ✅ | ✅ | ✅ | With resource cleanup |
| Power On/Off | ✅ | ✅ | ✅ | ✅ | Basic power management |
| Reboot | ✅ | ✅ | ✅ | ✅ | Graceful and forced restart |
| Suspend | ✅ | ❌ | ✅ | ✅ | Memory state preservation |
| Describe | ✅ | ✅ | ✅ | ✅ | VM state and properties |
| Reconfigure | ✅ | ⚠️ | ✅ | ✅ | CPU/Memory/Disk changes (Libvirt requires restart) |
| TaskStatus | ✅ | N/A | ✅ | ✅ | Async operation tracking |
| ConsoleURL | ✅ | ✅ | ⚠️ | ✅ | Remote console access (Proxmox planned) |

Resource Management

| Capability | vSphere | Libvirt | Proxmox | Mock | Notes |
|---|---|---|---|---|---|
| CPU Configuration | ✅ | ✅ | ✅ | ✅ | Cores, sockets, threading |
| Memory Allocation | ✅ | ✅ | ✅ | ✅ | Static memory sizing |
| Hot CPU Add | ✅ | ❌ | ✅ | ✅ | Online CPU expansion |
| Hot Memory Add | ✅ | ❌ | ✅ | ✅ | Online memory expansion |
| Resource Reservations | ✅ | ❌ | ✅ | ✅ | Guaranteed resources |
| Resource Limits | ✅ | ❌ | ✅ | ✅ | Resource capping |

Storage Operations

| Capability | vSphere | Libvirt | Proxmox | Mock | Notes |
|---|---|---|---|---|---|
| Disk Creation | ✅ | ✅ | ✅ | ✅ | Virtual disk provisioning |
| Disk Expansion | ✅ | ✅ | ✅ | ✅ | Online disk growth |
| Multiple Disks | ✅ | ✅ | ✅ | ✅ | Multi-disk VMs |
| Thin Provisioning | ✅ | ✅ | ✅ | ✅ | Space-efficient disks |
| Thick Provisioning | ✅ | ✅ | ✅ | ✅ | Pre-allocated storage |
| Storage Policies | ✅ | ❌ | ✅ | ✅ | Policy-based placement |
| Storage Pools | ✅ | ✅ | ✅ | ✅ | Organized storage management |

Network Configuration

| Capability | vSphere | Libvirt | Proxmox | Mock | Notes |
|---|---|---|---|---|---|
| Basic Networking | ✅ | ✅ | ✅ | ✅ | Single network interface |
| Multiple NICs | ✅ | ✅ | ✅ | ✅ | Multi-interface VMs |
| VLAN Support | ✅ | ✅ | ✅ | ✅ | Network segmentation |
| Static IP | ✅ | ✅ | ✅ | ✅ | Fixed IP assignment |
| DHCP | ✅ | ✅ | ✅ | ✅ | Dynamic IP assignment |
| Bridge Networks | ❌ | ✅ | ✅ | ✅ | Direct host bridging |
| Distributed Switches | ✅ | ❌ | ❌ | ✅ | Advanced vSphere networking |

VM Lifecycle

| Capability | vSphere | Libvirt | Proxmox | Mock | Notes |
|---|---|---|---|---|---|
| Template Deployment | ✅ | ✅ | ✅ | ✅ | Deploy from templates |
| Clone Operations | ✅ Complete | ✅ | ✅ | ✅ | Full VM duplication with snapshot support |
| Linked Clones | ✅ | ❌ | ✅ | ✅ | COW-based clones with automatic snapshot creation |
| Full Clones | ✅ | ✅ | ✅ | ✅ | Independent copies |
| VM Reconfiguration | ✅ Complete | ⚠️ Restart Required | ✅ | ✅ | Online resource modification |

Snapshot Operations

| Capability | vSphere | Libvirt | Proxmox | Mock | Notes |
|---|---|---|---|---|---|
| Create Snapshots | ✅ | ✅ | ✅ | ✅ | Point-in-time captures |
| Delete Snapshots | ✅ | ✅ | ✅ | ✅ | Snapshot cleanup |
| Revert Snapshots | ✅ | ✅ | ✅ | ✅ | Restore VM state |
| Memory Snapshots | ✅ | ❌ | ✅ | ✅ | Include RAM state |
| Quiesced Snapshots | ✅ | ❌ | ✅ | ✅ | Consistent filesystem |
| Snapshot Trees | ✅ | ✅ | ✅ | ✅ | Hierarchical snapshots |

Image Management

| Capability | vSphere | Libvirt | Proxmox | Mock | Notes |
|---|---|---|---|---|---|
| OVA/OVF Import | ✅ | ❌ | ✅ | ✅ | Standard VM formats |
| Cloud Image Download | ❌ | ✅ | ✅ | ✅ | Remote image fetch |
| Content Libraries | ✅ | ❌ | ❌ | ✅ | Centralized image management |
| Image Conversion | ❌ | ✅ | ✅ | ✅ | Format transformation |
| Image Caching | ✅ | ✅ | ✅ | ✅ | Performance optimization |

Guest Operating System

| Capability | vSphere | Libvirt | Proxmox | Mock | Notes |
|---|---|---|---|---|---|
| Cloud-Init | ✅ | ✅ | ✅ | ✅ | Guest initialization |
| Guest Tools | ✅ | ✅ | ✅ | ✅ | Enhanced guest integration |
| Guest Agent | ✅ | ✅ | ✅ | ✅ | Runtime guest communication |
| Guest Customization | ✅ | ✅ | ✅ | ✅ | OS-specific customization |
| Guest Monitoring | ✅ | ✅ | ✅ | ✅ | Resource usage tracking |

Advanced Features

| Capability | vSphere | Libvirt | Proxmox | Mock | Notes |
|---|---|---|---|---|---|
| High Availability | ✅ | ❌ | ✅ | ✅ | Automatic failover |
| DRS/Load Balancing | ✅ | ❌ | ❌ | ✅ | Resource optimization |
| Fault Tolerance | ✅ | ❌ | ❌ | ✅ | Zero-downtime protection |
| vMotion/Migration | ✅ | ❌ | ✅ | ✅ | Live VM migration |
| Resource Pools | ✅ | ❌ | ✅ | ✅ | Hierarchical resource mgmt |
| Affinity Rules | ✅ | ❌ | ✅ | ✅ | VM placement policies |

Monitoring & Observability

| Capability | vSphere | Libvirt | Proxmox | Mock | Notes |
|---|---|---|---|---|---|
| Performance Metrics | ✅ | ✅ | ✅ | ✅ | CPU, memory, disk, network |
| Event Logging | ✅ | ✅ | ✅ | ✅ | Operation audit trail |
| Health Checks | ✅ | ✅ | ✅ | ✅ | VM and guest health |
| Alerting | ✅ | ❌ | ✅ | ✅ | Threshold-based notifications |
| Historical Data | ✅ | ❌ | ✅ | ✅ | Performance history |
| Console URL Generation | ✅ | ✅ | ⚠️ | ✅ | Web/VNC console access (Proxmox planned) |
| Guest Agent Integration | ✅ | ✅ | ✅ Complete | ✅ | IP detection and guest info |

Provider-Specific Features

vSphere Exclusive

  • vCenter Integration: Full vCenter Server and ESXi support
  • Content Library: Centralized template and ISO management
  • Distributed Resource Scheduler (DRS): Automatic load balancing
  • vMotion: Live migration between hosts
  • High Availability (HA): Automatic VM restart on host failure
  • Fault Tolerance: Zero-downtime VM protection
  • Storage vMotion: Live storage migration
  • vSAN Integration: Hyper-converged storage
  • NSX Integration: Software-defined networking
  • Hot Reconfiguration: Online CPU/memory/disk changes with hot-add support
  • TaskStatus Tracking: Real-time async operation monitoring via govmomi
  • Clone Operations: Full and linked clones with automatic snapshot handling
  • Web Console URLs: Direct vSphere web client console access

Libvirt/KVM Exclusive

  • Virsh Integration: Command-line management
  • QEMU Guest Agent: Advanced guest OS integration
  • KVM Optimization: Native Linux virtualization
  • Bridge Networking: Direct host network bridging
  • Storage Pool Flexibility: Multiple storage backend support
  • Cloud Image Support: Direct cloud image deployment
  • Host Device Passthrough: Hardware device assignment
  • Reconfiguration Support: CPU/memory/disk changes via virsh (restart required)
  • VNC Console Access: Direct VNC console URL generation for remote viewers

Proxmox VE Exclusive

  • Web UI Integration: Built-in management interface
  • Container Support: LXC container management
  • Backup Integration: Built-in backup and restore
  • Cluster Management: Multi-node cluster support
  • ZFS Integration: Advanced filesystem features
  • Ceph Integration: Distributed storage
  • Guest Agent IP Detection: Accurate IP address extraction via QEMU guest agent
  • Hot-plug Reconfiguration: Online CPU/memory/disk modifications
  • Complete CRD Integration: Full Kubernetes custom resource support

Mock Provider Features

  • Testing Scenarios: Configurable failure modes
  • Performance Simulation: Controllable operation delays
  • Sample Data: Pre-populated demonstration VMs
  • Development Support: Full API coverage for testing

Supported Disk Types

| Provider | Disk Formats | Notes |
|---|---|---|
| vSphere | thin, thick, eagerZeroedThick | vSphere native formats |
| Libvirt | qcow2, raw, vmdk | QEMU-supported formats |
| Proxmox | qcow2, raw, vmdk | Proxmox storage formats |
| Mock | thin, thick, raw, qcow2 | Simulated formats |

Supported Network Types

| Provider | Network Types | Notes |
|---|---|---|
| vSphere | distributed, standard, vlan | vSphere networking |
| Libvirt | virtio, e1000, rtl8139 | QEMU network adapters |
| Proxmox | virtio, e1000, rtl8139 | Proxmox network models |
| Mock | bridge, nat, distributed | Simulated network types |

Provider Images

All provider images are available from the GitHub Container Registry:

  • vSphere: ghcr.io/projectbeskar/virtrigaud/provider-vsphere:v0.2.3
  • Libvirt: ghcr.io/projectbeskar/virtrigaud/provider-libvirt:v0.2.3
  • Proxmox: ghcr.io/projectbeskar/virtrigaud/provider-proxmox:v0.2.3
  • Mock: ghcr.io/projectbeskar/virtrigaud/provider-mock:v0.2.3

Choosing a Provider

Use vSphere When:

  • You have existing VMware infrastructure
  • You need enterprise features (HA, DRS, vMotion)
  • You require advanced networking (NSX, distributed switches)
  • You need centralized management (vCenter)

Use Libvirt/KVM When:

  • You want open-source virtualization
  • You're running on Linux hosts
  • You need cost-effective virtualization
  • You want direct host integration

Use Proxmox VE When:

  • You need both VMs and containers
  • You want integrated backup solutions
  • You need cluster management
  • You want web-based management

Use Mock Provider When:

  • You're developing or testing VirtRigaud
  • You need to simulate VM operations
  • You're creating demos or training materials
  • You're testing VirtRigaud without hypervisors

Performance Considerations

vSphere

  • Best for: Large-scale enterprise deployments
  • Scalability: Hundreds to thousands of VMs
  • Overhead: Higher due to feature richness
  • Resource Efficiency: Excellent with DRS

Libvirt/KVM

  • Best for: Linux-based deployments
  • Scalability: Moderate to large deployments
  • Overhead: Low, near-native performance
  • Resource Efficiency: Good with proper tuning

Proxmox VE

  • Best for: SMB and mixed workloads
  • Scalability: Small to medium deployments
  • Overhead: Moderate
  • Resource Efficiency: Good with clustering

Future Roadmap

Planned Enhancements

vSphere

  • vSphere 8.0 support
  • Enhanced NSX integration
  • GPU passthrough support
  • vSAN policy automation

Libvirt

  • Live migration support
  • SR-IOV networking
  • NUMA topology optimization
  • Enhanced performance monitoring

Proxmox

  • HA configuration
  • Storage replication
  • Advanced networking
  • Performance optimizations

Support Matrix

| Feature Category | vSphere | Libvirt | Proxmox | Mock |
|---|---|---|---|---|
| Production Ready | ✅ | ✅ | ✅ Beta | ✅ Testing |
| Documentation | Complete | Complete | Complete | Complete |
| Community Support | Active | Active | Growing | N/A |
| Enterprise Support | Available | Available | Available | N/A |

Version History

  • v0.2.3: Provider feature parity - Reconfigure, Clone, TaskStatus, ConsoleURL
  • v0.2.2: Nested virtualization, TPM support, comprehensive snapshot management
  • v0.2.1: Critical fixes, documentation updates, VMClass disk settings
  • v0.2.0: Production-ready vSphere and Libvirt providers
  • v0.1.0: Initial provider framework and mock implementation

This document reflects VirtRigaud v0.2.3 capabilities. For the latest updates, see the VirtRigaud documentation.

vSphere Provider

The vSphere provider enables VirtRigaud to manage virtual machines on VMware vSphere environments, including vCenter Server and standalone ESXi hosts. This provider is designed for enterprise production environments with comprehensive support for vSphere features.

Overview

This provider implements the VirtRigaud provider interface to manage VM lifecycle operations on VMware vSphere:

  • Create: Create VMs from templates, content libraries, or OVF/OVA files
  • Delete: Remove VMs and associated storage (with configurable retention)
  • Power: Start, stop, restart, and suspend virtual machines
  • Describe: Query VM state, resource usage, guest info, and vSphere properties
  • Reconfigure: Hot-add CPU/memory, resize disks, modify network adapters (v0.2.3+)
  • Clone: Create full or linked clones from existing VMs or templates (v0.2.3+)
  • Snapshot: Create, delete, and revert VM snapshots with memory state
  • TaskStatus: Track asynchronous operations with progress monitoring (v0.2.3+)
  • ConsoleURL: Generate vSphere web client console URLs (v0.2.3+)
  • ImagePrepare: Import OVF/OVA, deploy from content library, or ensure template existence

Prerequisites

⚠️ IMPORTANT: Active vSphere Environment Required

The vSphere provider connects to VMware vSphere infrastructure and requires active vCenter Server or ESXi hosts.

Requirements:

  • vCenter Server 7.0+ or ESXi 7.0+ (running and accessible)
  • User account with appropriate privileges for VM management
  • Network connectivity from VirtRigaud to vCenter/ESXi (HTTPS/443)
  • vSphere infrastructure:
    • Configured datacenters, clusters, and hosts
    • Storage (datastores) for VM files
    • Networks (port groups) for VM connectivity
    • Resource pools for VM placement (optional)

Testing/Development:

For development environments:

  • Use VMware vSphere Hypervisor (ESXi) free version
  • vCenter Server Appliance evaluation license
  • VMware Workstation/Fusion with nested ESXi
  • EVE-NG or GNS3 with vSphere emulation

Authentication

The vSphere provider supports multiple authentication methods:

Username/Password Authentication (Common)

Standard vSphere user authentication:

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: vsphere-prod
  namespace: default
spec:
  type: vsphere
  endpoint: https://vcenter.example.com/sdk
  credentialSecretRef:
    name: vsphere-credentials
  # Optional: Skip TLS verification (development only)
  insecureSkipVerify: false
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-vsphere:v0.2.3"
    service:
      port: 9090

Create credentials secret:

apiVersion: v1
kind: Secret
metadata:
  name: vsphere-credentials
  namespace: default
type: Opaque
stringData:
  username: "virtrigaud@vsphere.local"
  password: "SecurePassword123!"

Session Token Authentication (Advanced)

For environments using external authentication:

apiVersion: v1
kind: Secret
metadata:
  name: vsphere-token
  namespace: default
type: Opaque
stringData:
  token: "vmware-api-session-id:abcd1234..."

Create a dedicated service account with minimal required privileges:

# vSphere privileges for VirtRigaud service account:
# - Datastore: Allocate space, Browse datastore, Low level file operations
# - Network: Assign network  
# - Resource: Assign virtual machine to resource pool
# - Virtual machine: All privileges (or subset based on requirements)
# - Global: Enable methods, Disable methods, Licenses

Configuration

Connection Endpoints

| Endpoint Type | Format | Use Case |
|---|---|---|
| vCenter Server | https://vcenter.example.com/sdk | Multi-host management (recommended) |
| vCenter FQDN | https://vcenter.corp.local/sdk | Internal domain environments |
| vCenter IP | https://192.168.1.10/sdk | Direct IP access |
| ESXi Host | https://esxi-host.example.com | Single host environments |

Deployment Configuration

Using Helm Values

# values.yaml
providers:
  vsphere:
    enabled: true
    endpoint: "https://vcenter.example.com/sdk"
    insecureSkipVerify: false  # Set to true for self-signed certificates
    credentialSecretRef:
      name: vsphere-credentials
      namespace: virtrigaud-system

Production Configuration with TLS

# Create secret with credentials and TLS certificates
apiVersion: v1
kind: Secret
metadata:
  name: vsphere-secure-credentials
  namespace: virtrigaud-system
type: Opaque
stringData:
  username: "svc-virtrigaud@vsphere.local"
  password: "SecurePassword123!"
  # Optional: Custom CA certificate for vCenter
  ca.crt: |
    -----BEGIN CERTIFICATE-----
    # Your vCenter CA certificate here
    -----END CERTIFICATE-----

---
apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: vsphere-production
  namespace: virtrigaud-system
spec:
  type: vsphere
  endpoint: https://vcenter.prod.example.com/sdk
  credentialSecretRef:
    name: vsphere-secure-credentials
  insecureSkipVerify: false

Development Configuration

# For development with self-signed certificates
providers:
  vsphere:
    enabled: true
    endpoint: "https://esxi-dev.local"
    insecureSkipVerify: true  # Only for development!
    credentialSecretRef:
      name: vsphere-dev-credentials

Multi-vCenter Configuration

# Deploy multiple providers for different vCenters
apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: vsphere-datacenter-a
spec:
  type: vsphere
  endpoint: https://vcenter-a.example.com/sdk
  credentialSecretRef:
    name: vsphere-credentials-a

---
apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: vsphere-datacenter-b
spec:
  type: vsphere
  endpoint: https://vcenter-b.example.com/sdk
  credentialSecretRef:
    name: vsphere-credentials-b

vSphere Infrastructure Setup

Required vSphere Objects

The provider expects the following vSphere infrastructure to be configured:

Datacenters and Clusters

# Example vSphere hierarchy:
Datacenter: "Production"
├── Cluster: "Compute-Cluster"
│   ├── ESXi Host: esxi-01.example.com
│   ├── ESXi Host: esxi-02.example.com
│   └── ESXi Host: esxi-03.example.com
├── Datastores:
│   ├── "datastore-ssd"     # High-performance storage
│   ├── "datastore-hdd"     # Standard storage
│   └── "datastore-backup"  # Backup storage
└── Networks:
    ├── "VM Network"        # Default VM network
    ├── "DMZ-Network"       # DMZ port group
    └── "Management"        # Management network

Resource Pools (Optional)

# Create resource pools for workload isolation
Datacenter: "Production"
└── Cluster: "Compute-Cluster"
    └── Resource Pools:
        ├── "Development"    # Dev workloads (lower priority)
        ├── "Production"     # Prod workloads (high priority)
        └── "Testing"        # Test workloads (medium priority)

VM Configuration

VMClass Specification

Define CPU, memory, and vSphere-specific settings:

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: standard-vm
spec:
  cpus: 4
  memory: "8Gi"
  # vSphere-specific configuration
  spec:
    # VM hardware settings
    hardware:
      version: "vmx-19"              # Hardware version
      firmware: "efi"                # BIOS or EFI
      secureBoot: true               # Secure boot (EFI only)
      enableCpuHotAdd: true          # Hot-add CPU
      enableMemoryHotAdd: true       # Hot-add memory
    
    # CPU configuration
    cpu:
      coresPerSocket: 2              # CPU topology
      enableVirtualization: false    # Nested virtualization
      reservationMHz: 1000           # CPU reservation
      limitMHz: 4000                 # CPU limit
    
    # Memory configuration  
    memory:
      reservationMB: 2048            # Memory reservation
      limitMB: 8192                  # Memory limit
      shareLevel: "normal"           # Memory shares (low/normal/high)
    
    # Storage configuration
    storage:
      diskFormat: "thin"             # thick/thin/eagerZeroedThick
      storagePolicy: "VM Storage Policy - SSD"  # vSAN storage policy
    
    # vSphere placement
    placement:
      datacenter: "Production"       # Target datacenter
      cluster: "Compute-Cluster"     # Target cluster  
      resourcePool: "Production"     # Target resource pool
      datastore: "datastore-ssd"     # Preferred datastore
      folder: "/vm/virtrigaud"       # VM folder

VMImage Specification

Reference vSphere templates, content library items, or OVF files:

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMImage
metadata:
  name: ubuntu-22-04-template
spec:
  # Template from vSphere inventory
  source:
    template: "ubuntu-22.04-template"
    datacenter: "Production"
    folder: "/vm/templates"
  
  # Or from content library
  # source:
  #   contentLibrary: "OS Templates"
  #   item: "ubuntu-22.04-cloud"
  
  # Or from OVF/OVA URL
  # source:
  #   ovf: "https://releases.ubuntu.com/22.04/ubuntu-22.04-server-cloudimg-amd64.ova"
  
  # Guest OS identification
  guestOS: "ubuntu64Guest"
  
  # Customization specification
  customization:
    type: "cloudInit"              # cloudInit, sysprep, or linux
    spec: "ubuntu-cloud-init"      # Reference to customization spec

Complete VM Example

apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: web-application
spec:
  providerRef:
    name: vsphere-prod
  classRef:
    name: standard-vm
  imageRef:
    name: ubuntu-22-04-template
  powerState: On
  
  # Disk configuration
  disks:
    - name: root
      size: "100Gi"
      storageClass: "ssd-storage"
      # vSphere-specific disk options
      spec:
        diskMode: "persistent"       # persistent, independent_persistent, independent_nonpersistent
        diskFormat: "thin"           # thick, thin, eagerZeroedThick
        controllerType: "scsi"       # scsi, ide, nvme
        unitNumber: 0                # SCSI unit number
    
    - name: data
      size: "500Gi" 
      storageClass: "hdd-storage"
      spec:
        diskFormat: "thick"
        controllerType: "scsi"
        unitNumber: 1
  
  # Network configuration
  networks:
    # Primary application network
    - name: app-network
      portGroup: "VM Network"
      # Optional: Static IP assignment
      staticIP:
        address: "192.168.100.50/24"
        gateway: "192.168.100.1"
        dns: ["192.168.1.10", "8.8.8.8"]
    
    # Management network
    - name: mgmt-network
      portGroup: "Management"
      # DHCP assignment (default)
  
  # vSphere-specific placement
  placement:
    datacenter: "Production"
    cluster: "Compute-Cluster"
    resourcePool: "Production"
    folder: "/vm/applications"
    datastore: "datastore-ssd"      # Override class default
    host: "esxi-01.example.com"      # Pin to specific host (optional)
  
  # Guest customization
  userData:
    cloudInit:
      inline: |
        #cloud-config
        hostname: web-application
        users:
          - name: ubuntu
            sudo: ALL=(ALL) NOPASSWD:ALL
            ssh_authorized_keys:
              - "ssh-ed25519 AAAA..."
        packages:
          - nginx
          - docker.io
          - open-vm-tools          # VMware tools for guest integration
        runcmd:
          - systemctl enable nginx
          - systemctl enable docker
          - systemctl enable open-vm-tools

Advanced Features

VM Reconfiguration (v0.2.3+)

The vSphere provider supports online VM reconfiguration for CPU, memory, and disk resources:

# Reconfigure VM resources
apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: web-server
spec:
  vmClassRef: medium  # Change from small to medium
  powerState: "On"

Capabilities:

  • Online CPU Changes: Hot-add CPUs to running VMs (requires guest OS support)
  • Online Memory Changes: Hot-add memory to running VMs (requires guest OS support)
  • Disk Resizing: Expand disks online (shrinking not supported for safety)
  • Automatic Fallback: Falls back to offline changes if hot-add not supported
  • Intelligent Detection: Only applies changes when needed

Memory Format Support:

  • Standard units: 2Gi, 4096Mi, 2048MiB, 2GiB
  • Parser handles multiple memory unit formats

Limitations:

  • Disk shrinking prevented to avoid data loss
  • Some guest operating systems require special configuration for hot-add
  • BIOS firmware VMs have limited hot-add support (use EFI firmware)

VM Cloning (v0.2.3+)

Create full or linked clones of existing VMs and templates:

# Clone from existing VM
apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: web-server-02
spec:
  vmClassRef: small
  vmImageRef: web-server-01  # Source VM
  cloneType: linked  # or "full"

Clone Types:

  • Full Clone: Independent copy with separate storage
  • Linked Clone: Space-efficient copy using snapshots
    • Automatically creates snapshot if none exists
    • Requires less storage and faster creation
    • Parent VM must remain available

Use Cases:

  • Rapid test environment provisioning
  • Development environment duplication
  • Template-based deployments
  • Disaster recovery scenarios

Task Status Tracking (v0.2.3+)

Monitor asynchronous vSphere operations in real-time:

# VirtRigaud automatically tracks long-running operations
# No manual configuration needed

# Task tracking provides:
# - Real-time task state (queued, running, success, error)
# - Progress percentage
# - Error messages for failed tasks
# - Integration with vSphere task manager

Features:

  • Automatic tracking of all async operations
  • Progress monitoring via govmomi task manager
  • Detailed error reporting
  • Task history visibility in vCenter

Console Access (v0.2.3+)

Generate direct vSphere web client console URLs:

# Access provided in VM status
kubectl get vm web-server -o yaml

status:
  consoleURL: "https://vcenter.example.com/ui/app/vm;nav=h/urn:vmomi:VirtualMachine:vm-123:xxxxx/summary"
  phase: Running

Features:

  • Direct browser-based VM console access
  • No additional tools required
  • Works with vSphere web client
  • Includes VM instance UUID for reliable identification
  • Generated automatically in Describe operations

Template Management

Creating Templates

# Convert existing VM to template
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMTemplate
metadata:
  name: create-ubuntu-template
spec:
  sourceVM: "ubuntu-base-vm"
  datacenter: "Production"
  targetFolder: "/vm/templates"
  templateName: "ubuntu-22.04-template"
  
  # Template metadata
  annotation: |
    Ubuntu 22.04 LTS Template
    Created: 2024-01-15
    Includes: cloud-init, open-vm-tools
  
  # Template customization
  powerOff: true                   # Power off before conversion
  removeSnapshots: true           # Clean up snapshots
  updateTools: true               # Update VMware tools

Content Library Integration

# Deploy from content library
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMImage  
metadata:
  name: centos-stream-9
spec:
  source:
    contentLibrary: "OS Templates"
    item: "CentOS-Stream-9"
    datacenter: "Production"
  
  # Content library item properties
  properties:
    version: "9.0"
    provider: "CentOS"
    osType: "linux"

Storage Policies

# VMClass with vSAN storage policy
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: high-performance
spec:
  cpus: 8
  memory: "32Gi"
  spec:
    storage:
      # vSAN storage policies
      homePolicy: "VM Storage Policy - Performance"    # VM home/config files
      diskPolicy: "VM Storage Policy - SSD Only"       # Virtual disks
      swapPolicy: "VM Storage Policy - Standard"        # Swap files
      
      # Traditional storage
      datastoreCluster: "DatastoreCluster-SSD"         # Datastore cluster
      antiAffinityRules: true                          # VM anti-affinity

Network Advanced Configuration

# Advanced networking with distributed switches
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMNetworkAttachment
metadata:
  name: advanced-networking
spec:
  networks:
    # Distributed port group
    - name: frontend
      portGroup: "DPG-Frontend-VLAN100"
      distributedSwitch: "DSwitch-Production"
      vlan: 100
      
    # NSX-T logical switch
    - name: backend  
      portGroup: "LS-Backend-App"
      nsx: true
      securityPolicy: "Backend-Security-Policy"
      
    # SR-IOV for high performance
    - name: storage
      portGroup: "DPG-Storage-VLAN200"
      sriov: true
      bandwidth:
        reservation: 1000  # Mbps
        limit: 10000      # Mbps
        shares: 100       # Priority

High Availability

# VM with HA/DRS settings
apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: critical-application
spec:
  providerRef:
    name: vsphere-prod
  # ... other config ...
  
  # High availability configuration
  availability:
    # HA restart priority
    restartPriority: "high"          # disabled, low, medium, high
    isolationResponse: "powerOff"    # none, powerOff, shutdown
    vmMonitoring: "vmMonitoringOnly" # vmMonitoringDisabled, vmMonitoringOnly, vmAndAppMonitoring
    
    # DRS configuration
    drsAutomationLevel: "fullyAutomated"  # manual, partiallyAutomated, fullyAutomated
    drsVmBehavior: "fullyAutomated"       # manual, partiallyAutomated, fullyAutomated
    
    # Anti-affinity rules
    antiAffinityGroups: ["web-tier", "database-tier"]
    
    # Host affinity (pin to specific hosts)
    hostAffinityGroups: ["production-hosts"]

Snapshot Management

# Advanced snapshot configuration
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMSnapshot
metadata:
  name: pre-upgrade-snapshot
spec:
  vmRef:
    name: web-application
  
  # Snapshot settings
  name: "Pre-upgrade snapshot"
  description: "Snapshot before application upgrade"
  memory: true                    # Include memory state
  quiesce: true                   # Quiesce guest filesystem
  
  # Retention policy
  retention:
    maxSnapshots: 3               # Keep max 3 snapshots
    maxAge: "7d"                  # Delete after 7 days
    
  # Schedule (optional)
  schedule: "0 2 * * 0"          # Weekly at 2 AM Sunday
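For reference, the maxAge: "7d" window above can be translated into a concrete cutoff timestamp with GNU date (an illustrative calculation, not something VirtRigaud runs for you):

```shell
# Snapshots older than this UTC timestamp would fall outside a 7-day window
date -u -d '7 days ago' '+%Y-%m-%dT%H:%M:%SZ'
```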

Troubleshooting

Common Issues

❌ Connection Failed

Symptom: failed to connect to vSphere: connection refused

Causes & Solutions:

  1. Network connectivity:

    # Test connectivity to vCenter
    telnet vcenter.example.com 443
    
    # Test from Kubernetes pod
    kubectl run debug --rm -i --tty --image=curlimages/curl -- \
      curl -k https://vcenter.example.com
    
  2. DNS resolution:

    # Test DNS resolution
    nslookup vcenter.example.com
    
    # Use IP address if DNS fails
    
  3. Firewall rules: Ensure port 443 is accessible from Kubernetes cluster

❌ Authentication Failed

Symptom: Login failed: incorrect user name or password

Solutions:

  1. Verify credentials:

    # Test credentials manually
    kubectl get secret vsphere-credentials -o yaml
    
    # Decode and verify
    echo "base64-password" | base64 -d
    
  2. Check user permissions:

    • Verify user exists in vCenter
    • Check assigned roles and privileges
    • Ensure user is not locked out
  3. Test login via vSphere Client: Verify credentials work in the GUI
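As a quick sanity check of the decode step in option 1, here is a round-trip on a made-up sample value (s3cret is purely illustrative):

```shell
# Encode a sample password, then decode it back
echo -n 's3cret' | base64       # prints: czNjcmV0
echo 'czNjcmV0' | base64 -d     # prints: s3cret
```

Note that `echo -n` matters when encoding: a trailing newline baked into the secret is a classic cause of authentication failures.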

❌ Insufficient Privileges

Symptom: operation requires privilege 'VirtualMachine.Interact.PowerOn'

Solution: Grant required privileges to the service account:

# Required privileges for VirtRigaud:
# - Datastore privileges:
#   * Datastore.AllocateSpace
#   * Datastore.Browse  
#   * Datastore.FileManagement
# - Network privileges:
#   * Network.Assign
# - Resource privileges:
#   * Resource.AssignVMToPool
# - Virtual machine privileges:
#   * VirtualMachine.* (all) or specific subset
# - Global privileges:
#   * Global.EnableMethods
#   * Global.DisableMethods

❌ Template Not Found

Symptom: template 'ubuntu-template' not found

Solutions:

# List available templates
govc ls /datacenter/vm/templates/

# Check template path and permissions
govc object.collect -s vm/templates/ubuntu-template summary.config.name

# Verify template is properly marked as template
govc object.collect -s vm/templates/ubuntu-template config.template

❌ Datastore Issues

Symptom: insufficient disk space or datastore not accessible

Solutions:

# Check datastore capacity
govc datastore.info datastore-name

# List accessible datastores
govc datastore.ls

# Check datastore cluster configuration
govc datastore.cluster.info

❌ Network Configuration

Symptom: network 'VM Network' not found

Solutions:

# List available networks
govc ls /datacenter/network/

# Check distributed port groups
govc dvs.portgroup.info

# List networks with details to verify accessibility
govc ls -l /datacenter/network/

Validation Commands

Test your vSphere setup before deploying:

# 1. Install and configure govc CLI tool
export GOVC_URL='https://vcenter.example.com'
export GOVC_USERNAME='administrator@vsphere.local'
export GOVC_PASSWORD='password'
export GOVC_INSECURE=1  # for self-signed certificates

# 2. Test connectivity
govc about

# 3. List datacenters
govc ls

# 4. List clusters and hosts
govc ls /datacenter/host/

# 5. List datastores
govc ls /datacenter/datastore/

# 6. List networks
govc ls /datacenter/network/

# 7. List templates
govc ls /datacenter/vm/templates/

# 8. Test VM creation (creates a minimal throwaway VM, then removes it)
govc vm.create -c 1 -m 1024 -g ubuntu64Guest -net "VM Network" test-vm
govc vm.destroy test-vm

Debug Logging

Enable verbose logging for the vSphere provider:

providers:
  vsphere:
    env:
      - name: LOG_LEVEL
        value: "debug"
      - name: GOVMOMI_DEBUG
        value: "true"
    endpoint: "https://vcenter.example.com"

Monitor vSphere tasks:

# Monitor recent tasks in vCenter
govc task.ls

# Get details of specific task
govc task.info task-123

Performance Optimization

Resource Allocation

# High-performance VMClass
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: performance-optimized
spec:
  cpus: 16
  memory: "64Gi"
  spec:
    cpu:
      coresPerSocket: 8            # Match physical CPU topology
      reservationMHz: 8000         # Guarantee CPU resources
      shares: 2000                 # High priority (normal=1000)
      enableVirtualization: false  # Disable if not needed for performance
    
    memory:
      reservationMB: 65536         # Guarantee memory
      shares: 2000                 # High priority
      shareLevel: "high"           # Alternative to shares value
    
    hardware:
      enableCpuHotAdd: false       # Better performance when disabled
      enableMemoryHotAdd: false    # Better performance when disabled
      
    # NUMA configuration for large VMs
    numa:
      enabled: true
      coresPerSocket: 8            # Align with NUMA topology

Storage Optimization

# Storage-optimized configuration
spec:
  storage:
    diskFormat: "eagerZeroedThick"  # Best performance, more space usage
    controllerType: "pvscsi"        # Paravirtual SCSI for better performance
    multiwriter: false              # Disable unless needed
    
    # vSAN optimization
    storagePolicy: "Performance-Tier"
    cachingPolicy: "writethrough"   # or "writeback" for better performance
    
    # Multiple controllers for high IOPS
    scsiControllers:
      - type: "pvscsi"
        busNumber: 0
        maxDevices: 15
      - type: "pvscsi" 
        busNumber: 1
        maxDevices: 15

Network Optimization

# High-performance networking
networks:
  - name: high-performance
    portGroup: "DPG-HighPerf-SR-IOV"
    adapter: "vmxnet3"             # Best performance adapter
    sriov: true                    # SR-IOV for near-native performance
    bandwidth:
      reservation: 1000            # Guaranteed bandwidth (Mbps)
      limit: 10000                 # Maximum bandwidth (Mbps)
      shares: 100                  # Priority level

API Reference

For complete API reference, see the Provider API Documentation.

Contributing

To contribute to the vSphere provider:

  1. See the Provider Development Guide
  2. Check the GitHub repository
  3. Review open issues

Support

LibVirt/KVM Provider

The LibVirt provider enables VirtRigaud to manage virtual machines on KVM/QEMU hypervisors using the LibVirt API. This provider runs as a dedicated pod that communicates with LibVirt daemons locally or remotely, making it ideal for development, on-premises deployments, and cloud environments.

Overview

This provider implements the VirtRigaud provider interface to manage VM lifecycle operations on LibVirt/KVM:

  • Create: Create VMs from cloud images with comprehensive cloud-init support
  • Delete: Remove VMs and associated storage volumes (with cleanup)
  • Power: Start, stop, and reboot virtual machines
  • Describe: Query VM state, resource usage, guest agent information, and network details
  • Reconfigure: Modify VM resources (v0.2.3+ - requires VM restart)
  • Clone: Create new VMs based on existing VM configurations
  • Snapshot: Create, delete, and revert VM snapshots (storage-dependent)
  • ConsoleURL: Generate VNC console URLs for remote access (v0.2.3+)
  • ImagePrepare: Download and prepare cloud images from URLs
  • Storage Management: Advanced storage pool and volume operations
  • Cloud-Init: Full NoCloud datasource support with ISO generation
  • QEMU Guest Agent: Integration for enhanced guest OS monitoring
  • Network Configuration: Support for various network types and bridges

Prerequisites

The LibVirt provider connects to a LibVirt daemon (libvirtd) which can run locally or remotely. This makes it flexible for both development and production environments.

Connection Options:

  • Local LibVirt: Connects to local libvirtd via qemu:///system (ideal for development)
  • Remote LibVirt: Connects to remote libvirtd over SSH/TLS (production)
  • Container LibVirt: Works with containerized libvirt or KubeVirt

Requirements:

  • LibVirt daemon (libvirtd) running locally or accessible remotely
  • KVM/QEMU hypervisor support (hardware virtualization recommended)
  • Storage pools configured for VM disk storage
  • Network bridges or interfaces for VM networking
  • Appropriate permissions for VM management operations

Development Setup:

For local development, you can:

  • Linux: Install libvirt-daemon-system and qemu-kvm packages
  • macOS/Windows: Use remote LibVirt or nested virtualization
  • Testing: The provider can connect to local libvirtd without complex infrastructure

Authentication & Connection

The LibVirt provider supports multiple connection methods:

Local LibVirt Connection

For connecting to a LibVirt daemon on the same host as the provider pod:

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: libvirt-local
  namespace: default
spec:
  type: libvirt
  endpoint: "qemu:///system"  # Local system connection
  credentialSecretRef:
    name: libvirt-local-credentials
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-libvirt:v0.2.3"
    service:
      port: 9090

Note: When using local connections, ensure the provider pod has appropriate permissions to access the LibVirt socket.

Remote Connection with SSH

For remote LibVirt over SSH:

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: libvirt-remote
  namespace: default
spec:
  type: libvirt
  endpoint: "qemu+ssh://user@libvirt-host/system"
  credentialSecretRef:
    name: libvirt-ssh-credentials
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-libvirt:v0.2.3"
    service:
      port: 9090

Create SSH credentials secret:

apiVersion: v1
kind: Secret
metadata:
  name: libvirt-ssh-credentials
  namespace: default
type: Opaque
stringData:
  username: "libvirt-user"
  # For key-based auth (recommended):
  tls.key: |
    -----BEGIN PRIVATE KEY-----
    # Your SSH private key here
    -----END PRIVATE KEY-----
  # For password auth (less secure):
  password: "your-password"

Remote Connection with TLS

For remote LibVirt over TLS:

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: libvirt-tls
  namespace: default
spec:
  type: libvirt
  endpoint: "qemu+tls://libvirt-host:16514/system"
  credentialSecretRef:
    name: libvirt-tls-credentials
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-libvirt:v0.2.3"
    service:
      port: 9090

Create TLS credentials secret:

apiVersion: v1
kind: Secret
metadata:
  name: libvirt-tls-credentials
  namespace: default
type: kubernetes.io/tls
data:
  tls.crt: # Base64 encoded client certificate
  tls.key: # Base64 encoded client private key
  ca.crt:  # Base64 encoded CA certificate

Configuration

Connection URIs

The LibVirt provider supports standard LibVirt connection URIs:

URI Format                      Description                Use Case
qemu:///system                  Local system connection    Development, single-host
qemu+ssh://user@host/system     SSH connection             Remote access with SSH
qemu+tls://host:16514/system    TLS connection             Secure remote access
qemu+tcp://host:16509/system    TCP connection             Insecure remote (testing only)

⚠️ Note: All LibVirt URI schemes are now supported in the CRD validation pattern.

Deployment Configuration

Using Helm Values

# values.yaml
providers:
  libvirt:
    enabled: true
    endpoint: "qemu:///system"  # Adjust for your environment
    # For remote connections:
    # endpoint: "qemu+ssh://user@libvirt-host/system"
    credentialSecretRef:
      name: libvirt-credentials  # Optional for local connections

Development Configuration

# For local development with LibVirt
providers:
  libvirt:
    enabled: true
    endpoint: "qemu:///system"
    runtime:
      # Mount host libvirt socket (for local access)
      volumes:
      - name: libvirt-sock
        hostPath:
          path: /var/run/libvirt/libvirt-sock
      volumeMounts:
      - name: libvirt-sock
        mountPath: /var/run/libvirt/libvirt-sock

Production Configuration

# For production with remote LibVirt
apiVersion: v1
kind: Secret
metadata:
  name: libvirt-credentials
  namespace: virtrigaud-system
type: Opaque
stringData:
  username: "virtrigaud-service"
  tls.crt: |
    -----BEGIN CERTIFICATE-----
    # Client certificate for TLS authentication
    -----END CERTIFICATE-----
  tls.key: |
    -----BEGIN PRIVATE KEY-----
    # Client private key
    -----END PRIVATE KEY-----
  ca.crt: |
    -----BEGIN CERTIFICATE-----
    # CA certificate
    -----END CERTIFICATE-----

---
apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: libvirt-production
  namespace: virtrigaud-system
spec:
  type: libvirt
  endpoint: "qemu+tls://libvirt.example.com:16514/system"
  credentialSecretRef:
    name: libvirt-credentials

Storage Configuration

Storage Pools

LibVirt requires storage pools for VM disks. Common configurations:

# Create directory-based storage pool
virsh pool-define-as default dir --target /var/lib/libvirt/images
virsh pool-build default
virsh pool-start default
virsh pool-autostart default

# Create LVM-based storage pool (performance)
virsh pool-define-as lvm-pool logical --source-name vg-libvirt --target /dev/vg-libvirt
virsh pool-start lvm-pool
virsh pool-autostart lvm-pool

VMClass Storage Specification

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: standard
spec:
  cpus: 2
  memory: "4Gi"
  # LibVirt-specific storage settings
  spec:
    storage:
      pool: "default"        # Storage pool name
      format: "qcow2"        # Disk format (qcow2, raw)
      cache: "writethrough"  # Cache mode
      io: "threads"          # I/O mode

Network Configuration

Network Setup

Configure LibVirt networks for VM connectivity:

# Create NAT network (default)
virsh net-define /usr/share/libvirt/networks/default.xml
virsh net-start default
virsh net-autostart default

# Create bridge network (for external access)
cat > /tmp/bridge-network.xml << EOF
<network>
  <name>br0</name>
  <forward mode='bridge'/>
  <bridge name='br0'/>
</network>
EOF
virsh net-define /tmp/bridge-network.xml
virsh net-start br0

Network Bridge Mapping

Network Name    LibVirt Network    Use Case
default, nat    default            NAT networking
bridge, br0     br0                Bridged networking
isolated        isolated           Host-only networking

VM Network Configuration

apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: web-server
spec:
  providerRef:
    name: libvirt-local
  networks:
    # Use default NAT network
    - name: default
    # Use bridged network for external access
    - name: bridge
      bridge: br0
      mac: "52:54:00:12:34:56"  # Optional MAC address
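If you hand-pick MAC addresses like the one above, staying inside the 52:54:00 prefix conventionally used for KVM guests avoids collisions with physical hardware. A throwaway generator (a sketch, not part of VirtRigaud):

```shell
# Generate a random MAC address in the KVM 52:54:00 prefix
suffix=$(od -An -N3 -tx1 /dev/urandom | tr ' ' ':' | sed 's/^://')
echo "52:54:00:${suffix}"
```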

VM Configuration

VMClass Specification

Define hardware resources and LibVirt-specific settings:

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: development
spec:
  cpus: 2
  memory: "4Gi"
  # LibVirt-specific configuration
  spec:
    machine: "pc-i440fx-2.12"  # Machine type
    cpu:
      mode: "host-model"       # CPU mode (host-model, host-passthrough)
      topology:
        sockets: 1
        cores: 2
        threads: 1
    features:
      acpi: true
      apic: true
      pae: true
    clock:
      offset: "utc"
      timers:
        rtc: "catchup"
        pit: "delay"
        hpet: false

VMImage Specification

Reference existing disk images or templates:

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMImage
metadata:
  name: ubuntu-22-04
spec:
  source:
    # Path to existing image in storage pool
    disk: "/var/lib/libvirt/images/ubuntu-22.04-base.qcow2"
    # Or reference by pool and volume
    # pool: "default"
    # volume: "ubuntu-22.04-base"
  format: "qcow2"
  
  # Cloud-init preparation
  cloudInit:
    enabled: true
    userDataTemplate: |
      #cloud-config
      hostname: {{ .Name }}
      users:
        - name: ubuntu
          sudo: ALL=(ALL) NOPASSWD:ALL
          ssh_authorized_keys:
            - {{ .SSHPublicKey }}

Complete VM Example

apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: dev-workstation
spec:
  providerRef:
    name: libvirt-local
  classRef:
    name: development
  imageRef:
    name: ubuntu-22-04
  powerState: On
  
  # Disk configuration
  disks:
    - name: root
      size: "50Gi"
      storageClass: "fast-ssd"  # Maps to LibVirt storage pool
  
  # Network configuration  
  networks:
    - name: default  # NAT network for internet
    - name: bridge   # Bridge for LAN access
      staticIP:
        address: "192.168.1.100/24"
        gateway: "192.168.1.1"
        dns: ["8.8.8.8", "1.1.1.1"]
  
  # Cloud-init user data
  userData:
    cloudInit:
      inline: |
        #cloud-config
        hostname: dev-workstation
        users:
          - name: developer
            sudo: ALL=(ALL) NOPASSWD:ALL
            shell: /bin/bash
            ssh_authorized_keys:
              - "ssh-ed25519 AAAA..."
        packages:
          - build-essential
          - docker.io
          - code
        runcmd:
          - systemctl enable docker
          - usermod -aG docker developer

Cloud-Init Integration

Automatic Configuration

The LibVirt provider automatically handles cloud-init setup:

  • ISO Generation: Creates cloud-init ISO with user-data and meta-data
  • Attachment: Attaches ISO as CD-ROM device to VM
  • Network Config: Generates network configuration from VM spec
  • User Data: Renders templates with VM-specific values

Advanced Cloud-Init

userData:
  cloudInit:
    inline: |
      #cloud-config
      hostname: {{ .Name }}
      
      # Network configuration (if not using DHCP)
      network:
        version: 2
        ethernets:
          ens3:
            addresses: [192.168.1.100/24]
            gateway4: 192.168.1.1
            nameservers:
              addresses: [8.8.8.8, 1.1.1.1]
      
      # Storage configuration
      disk_setup:
        /dev/vdb:
          table_type: gpt
          layout: true
      
      fs_setup:
        - device: /dev/vdb1
          filesystem: ext4
          label: data
      
      mounts:
        - [/dev/vdb1, /data, ext4, defaults]
      
      # Package installation
      packages:
        - qemu-guest-agent  # Enable guest agent
        - cloud-init
        - curl
      
      # Enable services
      runcmd:
        - systemctl enable qemu-guest-agent
        - systemctl start qemu-guest-agent

Performance Optimization

KVM Optimization

# VMClass with performance optimizations
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: high-performance
spec:
  cpus: 8
  memory: "16Gi"
  spec:
    cpu:
      mode: "host-passthrough"  # Best performance
      topology:
        sockets: 1
        cores: 8
        threads: 1
    # NUMA topology for large VMs
    numa:
      cells:
        - id: 0
          cpus: "0-7"
          memory: "16"
    
    # Virtio devices for performance
    devices:
      disk:
        bus: "virtio"
        cache: "none"
        io: "native"
      network:
        model: "virtio"
      video:
        model: "virtio"

Storage Performance

# Create high-performance storage pool
virsh pool-define-as ssd-pool logical --source-name vg-ssd --target /dev/vg-ssd
virsh pool-start ssd-pool

# Use raw format for better performance (larger disk usage)
virsh vol-create-as ssd-pool vm-disk 100G --format raw

# Enable native AIO and disable cache for direct I/O
# (configured automatically by provider based on VMClass)

Troubleshooting

Common Issues

❌ Connection Failed

Symptom: failed to connect to Libvirt: <error>

Causes & Solutions:

  1. Local connection issues:

    # Check libvirtd status
    sudo systemctl status libvirtd
    
    # Start if not running
    sudo systemctl start libvirtd
    sudo systemctl enable libvirtd
    
    # Test connection
    virsh -c qemu:///system list
    
  2. Remote SSH connection:

    # Test SSH connectivity
    ssh user@libvirt-host virsh list
    
    # Check SSH key permissions
    chmod 600 ~/.ssh/id_rsa
    
  3. Remote TLS connection:

    # Verify certificates
    openssl x509 -in client-cert.pem -text -noout
    
    # Test TLS connection
    virsh -c qemu+tls://host:16514/system list
    

❌ Permission Denied

Symptom: authentication failed or permission denied

Solutions:

# Add user to libvirt group
sudo usermod -a -G libvirt $USER

# Check libvirt group membership
groups $USER

# Verify permissions on libvirt socket
ls -la /var/run/libvirt/libvirt-sock

# For containerized providers, ensure socket is mounted

❌ Storage Pool Not Found

Symptom: storage pool 'default' not found

Solution:

# List available pools
virsh pool-list --all

# Create default pool if missing
virsh pool-define-as default dir --target /var/lib/libvirt/images
virsh pool-build default
virsh pool-start default
virsh pool-autostart default

# Verify pool is active
virsh pool-info default

❌ Network Not Available

Symptom: network 'default' not found

Solution:

# List networks
virsh net-list --all

# Start default network
virsh net-start default
virsh net-autostart default

# Create bridge network if needed
virsh net-define /usr/share/libvirt/networks/default.xml

❌ KVM Not Available

Symptom: KVM is not available or hardware acceleration not available

Solutions:

  1. Check virtualization support:

    # Check CPU virtualization features
    egrep -c '(vmx|svm)' /proc/cpuinfo
    
    # Check KVM modules
    lsmod | grep kvm
    
    # Load KVM modules if missing
    sudo modprobe kvm
    sudo modprobe kvm_intel  # or kvm_amd
    
  2. BIOS/UEFI settings: Enable Intel VT-x or AMD-V

  3. Nested virtualization: If running in a VM, enable nested virtualization

Validation Commands

Test your LibVirt setup before deploying:

# 1. Test LibVirt connection
virsh -c qemu:///system list

# 2. Check storage pools
virsh pool-list --all

# 3. Check networks
virsh net-list --all

# 4. Test VM creation (simple test)
virt-install --name test-vm --memory 512 --vcpus 1 \
  --disk size=1 --network network=default \
  --boot cdrom --noautoconsole --dry-run

# 5. From within Kubernetes pod
kubectl run debug --rm -i --tty --image=ubuntu:22.04 -- bash
# Then test virsh commands if socket is mounted

Debug Logging

Enable verbose logging for the LibVirt provider:

providers:
  libvirt:
    env:
      - name: LOG_LEVEL
        value: "debug"
      - name: LIBVIRT_DEBUG
        value: "1"
    endpoint: "qemu:///system"

Advanced Features

VM Reconfiguration (v0.2.3+)

The Libvirt provider supports VM reconfiguration for CPU, memory, and disk resources:

# Reconfigure VM resources
apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: web-server
spec:
  classRef:
    name: medium  # Change from small to medium
  powerState: "On"

Capabilities:

  • Online CPU Changes: Modify CPU count using virsh setvcpus --live for running VMs
  • Online Memory Changes: Modify memory using virsh setmem --live for running VMs
  • Disk Resizing: Expand disk volumes via storage provider integration
  • Offline Configuration: Updates persistent config for stopped VMs via --config flag

Important Notes:

  • Most changes require VM restart for full effect
  • Online changes apply to running VM but may need restart for persistence
  • Disk shrinking not supported for safety
  • Memory format parsing supports bytes, KiB, MiB, GiB
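That unit handling matches the IEC binary suffixes used throughout the VMClass examples; with GNU coreutils you can reproduce the conversion to bytes (illustrative, not the provider's actual parser):

```shell
# Convert IEC memory strings, as used in VMClass specs, to bytes
numfmt --from=iec-i 4Gi     # prints: 4294967296
numfmt --from=iec-i 512Mi   # prints: 536870912
```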

Implementation Details:

  • Uses virsh setvcpus --live --config for CPU changes
  • Uses virsh setmem --live --config for memory changes
  • Parses current VM configuration with virsh dominfo
  • Integrates with storage provider for volume resizing

VNC Console Access (v0.2.3+)

Generate VNC console URLs for direct VM access:

# Access provided in VM status
kubectl get vm web-server -o yaml

status:
  consoleURL: "vnc://libvirt-host.example.com:5900"
  phase: Running

Features:

  • Automatic VNC port extraction from domain XML
  • Direct connection URLs for VNC clients
  • Support for standard VNC viewers (TigerVNC, RealVNC, etc.)
  • Web-based VNC viewers compatible (noVNC)

VNC Client Usage:

# Using a desktop VNC client (TigerVNC, RealVNC, etc.)
vncviewer libvirt-host.example.com:5900

# Web browser (with noVNC)
# Access through a web-based VNC proxy

Configuration: VNC is automatically configured during VM creation. The provider:

  1. Extracts VNC configuration from domain XML using virsh dumpxml
  2. Parses the graphics port number
  3. Constructs the VNC URL with host and port
  4. Returns URL in Describe operations
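Steps 1-3 can be approximated with standard shell tools. The graphics element below is a hand-written sample of what virsh dumpxml emits, so the snippet runs without a live hypervisor:

```shell
# Sample <graphics> element from a domain XML dump
xml="<graphics type='vnc' port='5900' autoport='yes' listen='0.0.0.0'/>"

# Extract the VNC port and assemble the console URL
port=$(printf '%s\n' "$xml" | sed -n "s/.*port='\([0-9]*\)'.*/\1/p")
echo "vnc://libvirt-host.example.com:${port}"   # prints: vnc://libvirt-host.example.com:5900
```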

Advanced Configuration

High Availability Setup

# Multiple LibVirt hosts for HA
apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: libvirt-cluster
spec:
  type: libvirt
  # Use load balancer or failover endpoint
  endpoint: "qemu+tls://libvirt-cluster.example.com:16514/system"
  runtime:
    replicas: 2  # Multiple provider instances
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: libvirt-provider
            topologyKey: kubernetes.io/hostname

GPU Passthrough

# VMClass with GPU passthrough
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: gpu-workstation
spec:
  cpus: 8
  memory: "32Gi"
  spec:
    devices:
      hostdev:
        - type: "pci"
          source:
            address:
              domain: "0x0000"
              bus: "0x01"
              slot: "0x00"
              function: "0x0"
          managed: true

API Reference

For complete API reference, see the Provider API Documentation.

Contributing

To contribute to the LibVirt provider:

  1. See the Provider Development Guide
  2. Check the GitHub repository
  3. Review open issues

Support

Proxmox VE Provider

The Proxmox VE provider enables VirtRigaud to manage virtual machines on Proxmox Virtual Environment (PVE) clusters using the native Proxmox API.

Overview

This provider implements the VirtRigaud provider interface to manage VM lifecycle operations on Proxmox VE:

  • Create: Create VMs from templates or ISO images with cloud-init support
  • Delete: Remove VMs and associated resources
  • Power: Start, stop, and reboot virtual machines
  • Describe: Query VM state, IPs, and console access
  • Guest Agent Integration: Enhanced IP detection via QEMU guest agent (v0.2.3+)
  • Reconfigure: Hot-plug CPU/memory changes, disk expansion
  • Clone: Create linked or full clones of existing VMs
  • Snapshot: Create, delete, and revert VM snapshots with memory state
  • ImagePrepare: Import and prepare VM templates from URLs or ensure existence

Prerequisites

⚠️ IMPORTANT: Active Proxmox VE Server Required

The Proxmox provider requires a running Proxmox VE server to function. Unlike some providers that can operate in simulation mode, this provider performs actual API calls to Proxmox VE during startup and operation.

Requirements:

  • Proxmox VE 7.0 or later (running and accessible)
  • API token or user account with appropriate privileges
  • Network connectivity from VirtRigaud to Proxmox API (port 8006/HTTPS)
  • Valid TLS configuration (production) or skip verification (development)

Testing/Development:

If you don’t have a Proxmox VE server available:

  • Use Proxmox VE in a VM for testing
  • Consider alternative providers (libvirt, vSphere) for local development
  • The provider will fail startup validation without a reachable Proxmox endpoint

Authentication

The Proxmox provider supports two authentication methods:

API Token Authentication (Recommended)

API tokens provide secure, scope-limited access without exposing user passwords.

  1. Create API Token in Proxmox:

    # In Proxmox web UI: Datacenter -> Permissions -> API Tokens
    # Or via CLI:
    pveum user token add <USER@REALM> <TOKENID> --privsep 0
    
  2. Configure Provider:

    apiVersion: infra.virtrigaud.io/v1beta1
    kind: Provider
    metadata:
      name: proxmox-prod
      namespace: default
    spec:
      type: proxmox
      endpoint: https://pve.example.com:8006
      credentialSecretRef:
        name: pve-credentials
      runtime:
        mode: Remote
        image: "ghcr.io/projectbeskar/virtrigaud/provider-proxmox:v0.2.3"
        service:
          port: 9090
    
  3. Create Credentials Secret:

    apiVersion: v1
    kind: Secret
    metadata:
      name: pve-credentials
      namespace: default
    type: Opaque
    stringData:
      token_id: "virtrigaud@pve!vrtg-token"
      token_secret: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
    

Username/Password Authentication

For environments that cannot use API tokens:

apiVersion: v1
kind: Secret
metadata:
  name: pve-credentials
  namespace: default
type: Opaque
stringData:
  username: "virtrigaud@pve"
  password: "secure-password"
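Malformed credential values are a common cause of the authentication errors listed later in this guide. Token IDs in particular must follow the USER@REALM!TOKENID shape; a quick format check (illustrative only, using the token ID from the earlier example):

```shell
# Verify a Proxmox API token ID matches USER@REALM!TOKENID
token_id='virtrigaud@pve!vrtg-token'
echo "$token_id" | grep -Eq '^[^@!]+@[^@!]+![A-Za-z0-9._-]+$' && echo "format ok"
```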

Deployment Configuration

Required Environment Variables

The Proxmox provider requires environment variables to connect to your Proxmox VE server. Configure these variables in your Helm values file:

Variable                  Required     Description                      Example
PVE_ENDPOINT              ✅ Yes       Proxmox VE API endpoint URL      https://pve.example.com:8006
PVE_USERNAME              ✅ Yes*      Username for password auth       root@pam or user@realm
PVE_PASSWORD              ✅ Yes*      Password for username            secure-password
PVE_TOKEN_ID              ✅ Yes**     API token ID (alternative)       user@realm!tokenid
PVE_TOKEN_SECRET          ✅ Yes**     API token secret (alternative)   xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
PVE_INSECURE_SKIP_VERIFY  🔵 Optional  Skip TLS verification            true (dev only)

* Either username/password OR token authentication is required
** API token authentication is recommended for production

Helm Configuration Examples

Username/Password Authentication

# values.yaml
providers:
  proxmox:
    enabled: true
    env:
      - name: PVE_ENDPOINT
        value: "https://your-proxmox-server.example.com:8006"
      - name: PVE_USERNAME
        value: "root@pam"
      - name: PVE_PASSWORD
        value: "your-secure-password"

API Token Authentication

# values.yaml
providers:
  proxmox:
    enabled: true
    env:
      - name: PVE_ENDPOINT
        value: "https://your-proxmox-server.example.com:8006"
      - name: PVE_TOKEN_ID
        value: "virtrigaud@pve!automation"
      - name: PVE_TOKEN_SECRET
        value: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

Using Kubernetes Secrets (Production)

For production environments, use Kubernetes secrets:

# Create secret first
apiVersion: v1
kind: Secret
metadata:
  name: proxmox-credentials
type: Opaque
stringData:
  PVE_ENDPOINT: "https://your-proxmox-server.example.com:8006"
  PVE_TOKEN_ID: "virtrigaud@pve!automation"  
  PVE_TOKEN_SECRET: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

---
# values.yaml - Reference the secret
providers:
  proxmox:
    enabled: true
    env:
      - name: PVE_ENDPOINT
        valueFrom:
          secretKeyRef:
            name: proxmox-credentials
            key: PVE_ENDPOINT
      - name: PVE_TOKEN_ID
        valueFrom:
          secretKeyRef:
            name: proxmox-credentials
            key: PVE_TOKEN_ID
      - name: PVE_TOKEN_SECRET
        valueFrom:
          secretKeyRef:
            name: proxmox-credentials
            key: PVE_TOKEN_SECRET

Configuration Validation

The provider validates configuration at startup and will fail to start if:

  • PVE_ENDPOINT is missing or invalid
  • Neither username/password nor token credentials are provided
  • The Proxmox server is unreachable
  • Authentication fails

Error Examples

# Missing endpoint
ERROR Failed to create PVE client error="endpoint is required"

# Invalid endpoint format  
ERROR Failed to create PVE client error="invalid endpoint URL"

# Authentication failure
ERROR Failed to authenticate error="authentication failed: invalid credentials"

# Connection failure
ERROR Failed to connect error="dial tcp: no route to host"

Development vs Production

| Environment | Endpoint | Authentication | TLS | Notes |
|---|---|---|---|---|
| Development | https://pve-test.local:8006 | Username/Password | Skip verify | Use PVE_INSECURE_SKIP_VERIFY=true |
| Staging | https://pve-staging.company.com:8006 | API Token | Custom CA | Configure CA bundle |
| Production | https://pve.company.com:8006 | API Token | Valid cert | Use Kubernetes secrets |

TLS Configuration

Self-Signed Certificates (Development)

For test environments with self-signed certificates:

spec:
  runtime:
    env:
      - name: PVE_INSECURE_SKIP_VERIFY
        value: "true"

Custom CA Certificate (Production)

For production with custom CA:

apiVersion: v1
kind: Secret
metadata:
  name: pve-credentials
type: Opaque
stringData:
  ca.crt: |
    -----BEGIN CERTIFICATE-----
    MIIDXTCCAkWgAwIBAgIJAL...
    -----END CERTIFICATE-----

Reconfiguration Support

Online Reconfiguration

The Proxmox provider supports online (hot-plug) reconfiguration for:

  • CPU: Add/remove vCPUs while VM is running (guest OS support required)
  • Memory: Increase memory using balloon driver (guest tools required)
  • Disk Expansion: Expand disks online (disk shrinking not supported)

Reconfigure Matrix

| Operation | Online Support | Requirements | Notes |
|---|---|---|---|
| CPU increase | βœ… Yes | Guest OS support | Most modern Linux/Windows |
| CPU decrease | βœ… Yes | Guest OS support | May require guest cooperation |
| Memory increase | βœ… Yes | Balloon driver | Install qemu-guest-agent |
| Memory decrease | ⚠️ Limited | Balloon driver + guest | May require power cycle |
| Disk expand | βœ… Yes | Online resize support | Filesystem resize separate |
| Disk shrink | ❌ No | Not supported | Security/data protection |

Example Reconfiguration

# Scale up VM resources
apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: web-server
spec:
  # ... existing spec ...
  classRef:
    name: large  # Changed from 'small'
---
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: large
spec:
  cpus: 8        # Increased from 2
  memory: "16Gi" # Increased from 4Gi

Snapshot Management

Snapshot Features

  • Memory Snapshots: Include VM memory state for consistent restore
  • Crash-Consistent: Without memory for faster snapshots
  • Snapshot Trees: Nested snapshots with parent-child relationships
  • Metadata: Description and timestamp tracking

Snapshot Operations

# Create snapshot with memory
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMSnapshot
metadata:
  name: before-upgrade
spec:
  vmRef:
    name: web-server
  description: "Pre-maintenance snapshot"
  includeMemory: true  # Include running memory state

# Create snapshot via kubectl
kubectl create vmsnapshot before-upgrade \
  --vm=web-server \
  --description="Before major upgrade" \
  --include-memory=true

Multi-NIC Networking

Network Configuration

The provider supports multiple network interfaces with:

  • Bridge Assignment: Map to Proxmox bridges (vmbr0, vmbr1, etc.)
  • VLAN Tagging: 802.1Q VLAN support
  • Static IPs: Cloud-init integration for network configuration
  • MAC Addresses: Custom MAC assignment

Example Multi-NIC VM

apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: multi-nic-vm
spec:
  providerRef:
    name: proxmox-prod
  classRef:
    name: medium
  imageRef:
    name: ubuntu-22
  networks:
    # Primary LAN interface
    - name: lan
      bridge: vmbr0
      staticIP:
        address: "192.168.1.100/24"
        gateway: "192.168.1.1"
        dns: ["8.8.8.8", "1.1.1.1"]
    
    # DMZ interface with VLAN
    - name: dmz
      bridge: vmbr1
      vlan: 100
      staticIP:
        address: "10.0.100.50/24"
    
    # Management interface
    - name: mgmt
      bridge: vmbr2
      mac: "02:00:00:aa:bb:cc"

Network Bridge Mapping

| Network Name | Default Bridge | Use Case |
|---|---|---|
| lan, default | vmbr0 | General LAN connectivity |
| dmz | vmbr1 | DMZ/public services |
| mgmt, management | vmbr2 | Management network |
| vmbr* | Same name | Direct bridge reference |

Configuration

Required Environment Variables

⚠️ The provider requires environment variables to connect to Proxmox VE:

| Variable | Description | Required | Default | Example |
|---|---|---|---|---|
| PVE_ENDPOINT | Proxmox API endpoint URL | Yes | - | https://pve.example.com:8006/api2 |
| PVE_TOKEN_ID | API token identifier | Yes* | - | virtrigaud@pve!vrtg-token |
| PVE_TOKEN_SECRET | API token secret | Yes* | - | xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx |
| PVE_USERNAME | Username for session auth | Yes* | - | virtrigaud@pve |
| PVE_PASSWORD | Password for session auth | Yes* | - | secure-password |
| PVE_NODE_SELECTOR | Preferred nodes (comma-separated) | No | Auto-detect | pve-node-1,pve-node-2 |
| PVE_INSECURE_SKIP_VERIFY | Skip TLS verification | No | false | true |
| PVE_CA_BUNDLE | Custom CA certificate | No | - | -----BEGIN CERTIFICATE-----... |

* Either token (PVE_TOKEN_ID + PVE_TOKEN_SECRET) or username/password (PVE_USERNAME + PVE_PASSWORD) is required

Deployment Configuration

The provider needs environment variables to connect to Proxmox. Here are complete deployment examples:

Using Helm Values

# values.yaml
providers:
  proxmox:
    enabled: true
    env:
      - name: PVE_ENDPOINT
        value: "https://pve.example.com:8006/api2"
      - name: PVE_TOKEN_ID
        value: "virtrigaud@pve!vrtg-token"
      - name: PVE_TOKEN_SECRET
        value: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
      - name: PVE_INSECURE_SKIP_VERIFY
        value: "true"  # Only for development!
      - name: PVE_NODE_SELECTOR
        value: "pve-node-1,pve-node-2"  # Optional

Using a Kubernetes Secret

# Create secret with credentials
apiVersion: v1
kind: Secret
metadata:
  name: proxmox-credentials
  namespace: virtrigaud-system
type: Opaque
stringData:
  PVE_ENDPOINT: "https://pve.example.com:8006/api2"
  PVE_TOKEN_ID: "virtrigaud@pve!vrtg-token"
  PVE_TOKEN_SECRET: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
  PVE_INSECURE_SKIP_VERIFY: "false"

---
# Reference secret in deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: virtrigaud-provider-proxmox
spec:
  template:
    spec:
      containers:
      - name: provider-proxmox
        image: ghcr.io/projectbeskar/virtrigaud/provider-proxmox:v0.2.3
        envFrom:
        - secretRef:
            name: proxmox-credentials

Development/Testing Configuration

# For development with a local Proxmox VE instance
providers:
  proxmox:
    enabled: true
    env:
      - name: PVE_ENDPOINT
        value: "https://192.168.1.100:8006/api2"
      - name: PVE_USERNAME
        value: "root@pam"
      - name: PVE_PASSWORD
        value: "your-password"
      - name: PVE_INSECURE_SKIP_VERIFY
        value: "true"

Node Selection

The provider can be configured to prefer specific nodes:

env:
  - name: PVE_NODE_SELECTOR
    value: "pve-node-1,pve-node-2"

If not specified, the provider will automatically select nodes based on availability.
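A comma-separated selector like the one above is straightforward to parse; the following is an illustrative sketch (parseNodeSelector is a hypothetical helper, not the provider's actual code):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// parseNodeSelector splits PVE_NODE_SELECTOR into preferred node names.
// An empty value means the provider auto-selects nodes by availability.
func parseNodeSelector(v string) []string {
	if v == "" {
		return nil
	}
	var nodes []string
	for _, n := range strings.Split(v, ",") {
		if n = strings.TrimSpace(n); n != "" {
			nodes = append(nodes, n)
		}
	}
	return nodes
}

func main() {
	os.Setenv("PVE_NODE_SELECTOR", "pve-node-1, pve-node-2")
	fmt.Println(parseNodeSelector(os.Getenv("PVE_NODE_SELECTOR"))) // [pve-node-1 pve-node-2]
}
```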

VM Configuration

VMClass Specification

Define CPU and memory resources:

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: small
spec:
  cpus: 2
  memory: "4Gi"
  # Proxmox-specific settings
  spec:
    machine: "q35"
    bios: "uefi"

VMImage Specification

Reference Proxmox templates:

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMImage
metadata:
  name: ubuntu-22
spec:
  source: "ubuntu-22-template"  # Template name in Proxmox
  # Or clone from existing VM:
  # source: "9000"  # VMID to clone from

VirtualMachine Example

apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: web-server
spec:
  providerRef:
    name: proxmox-prod
  classRef:
    name: small
  imageRef:
    name: ubuntu-22
  powerState: On
  networks:
    - name: lan
      # Maps to Proxmox bridge or VLAN configuration
  disks:
    - name: root
      size: "40Gi"
  userData:
    cloudInit:
      inline: |
        #cloud-config
        hostname: web-server
        users:
          - name: ubuntu
            ssh_authorized_keys:
              - "ssh-ed25519 AAAA..."
        packages:
          - nginx

Cloud-Init Integration

The provider automatically configures cloud-init for supported VMs:

Automatic Configuration

  • IDE2 Device: Attached as cloudinit drive
  • User Data: Rendered from VirtualMachine spec
  • Network Config: Generated from network specifications
  • SSH Keys: Extracted from userData or secrets

Static IP Configuration

Configure static IPs using cloud-init:

userData:
  cloudInit:
    inline: |
      #cloud-config
      write_files:
        - path: /etc/netplan/01-static.yaml
          content: |
            network:
              version: 2
              ethernets:
                ens18:
                  addresses: [192.168.1.100/24]
                  gateway4: 192.168.1.1
                  nameservers:
                    addresses: [8.8.8.8, 1.1.1.1]

Alternatively, omit the netplan file and rely on Proxmox's native IP configuration: the provider applies the staticIP fields of the network specifications internally when it builds the VM.

Guest Agent Integration (v0.2.3+)

The Proxmox provider now integrates with the QEMU Guest Agent for enhanced VM monitoring:

IP Address Detection

When a VM is running, the provider automatically queries the QEMU guest agent to retrieve accurate IP addresses:

# IP addresses are automatically populated in VM status
kubectl get vm my-vm -o yaml

status:
  phase: Running
  ipAddresses:
    - 192.168.1.100
    - fd00::1234:5678:9abc:def0

Features

  • Automatic IP Detection: Retrieves all network interface IPs from running VMs
  • IPv4 and IPv6 Support: Reports both address families
  • Smart Filtering: Excludes loopback (127.0.0.1, ::1) and link-local (169.254.x.x, fe80::) addresses
  • Real-time Updates: Information updated during Describe operations
  • Graceful Degradation: Falls back gracefully when guest agent is not available

Requirements

For guest agent integration to work, the VM must have:

  1. QEMU Guest Agent Installed:

    # Ubuntu/Debian
    apt-get install qemu-guest-agent
    
    # CentOS/RHEL
    yum install qemu-guest-agent
    
    # Enable and start the service
    systemctl enable --now qemu-guest-agent
    
  2. VM Configuration: Guest agent is automatically enabled during VM creation

Implementation Details

The provider:

  1. Checks if VM is in running state
  2. Makes API call to /api2/json/nodes/{node}/qemu/{vmid}/agent/network-get-interfaces
  3. Parses network interface details from guest agent response
  4. Filters out irrelevant addresses (loopback, link-local)
  5. Populates status.ipAddresses field

Troubleshooting

If IP addresses are not appearing:

  • Verify guest agent is installed: systemctl status qemu-guest-agent
  • Check Proxmox VM options: qm config <vmid> | grep agent
  • Ensure VM has network connectivity
  • Check provider logs for guest agent errors

Cloning Behavior

Linked Clones (Default)

Efficient space usage, faster creation:

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClone
metadata:
  name: web-clone
spec:
  sourceVMRef:
    name: template-vm
  linkedClone: true  # Default

Full Clones

Independent copies, slower creation:

spec:
  linkedClone: false

Snapshots

Create and manage VM snapshots:

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMSnapshot
metadata:
  name: before-upgrade
spec:
  vmRef:
    name: web-server
  description: "Snapshot before system upgrade"

Troubleshooting

Common Issues

Authentication Failures

Error: failed to connect to Proxmox VE: authentication failed

Solutions:

  • Verify API token permissions
  • Check token expiration
  • Ensure user has VM.* privileges

TLS Certificate Errors

Error: x509: certificate signed by unknown authority

Solutions:

  • Add custom CA certificate to credentials secret
  • Use PVE_INSECURE_SKIP_VERIFY=true for testing
  • Verify certificate chain

VM Creation Failures

Error: create VM failed with status 400: storage 'local-lvm' does not exist

Solutions:

  • Verify storage configuration in Proxmox
  • Check node availability
  • Ensure sufficient resources

Debug Logging

Enable debug logging for troubleshooting:

env:
  - name: LOG_LEVEL
    value: "debug"

Health Checks

Monitor provider health:

# Check provider pod logs
kubectl logs -n virtrigaud-system deployment/provider-proxmox

# Test connectivity
kubectl exec -n virtrigaud-system deployment/provider-proxmox -- \
  curl -k https://pve.example.com:8006/api2/json/version

Performance Considerations

Resource Allocation

For production environments:

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

Concurrent Operations

The provider handles concurrent VM operations efficiently but consider:

  • Node capacity limits
  • Storage I/O constraints
  • Network bandwidth

Task Polling

Task completion is polled every 2 seconds with a 5-minute timeout. These can be tuned via environment variables if needed.

Minimal Proxmox VE Permissions

Required API Token Permissions

Create an API token with these minimal privileges:

# Create user for VirtRigaud
pveum user add virtrigaud@pve --comment "VirtRigaud Provider"

# Create API token
pveum user token add virtrigaud@pve vrtg-token --privsep 1

# Grant minimal required permissions
pveum acl modify / --users virtrigaud@pve --roles PVEVMAdmin,PVEDatastoreUser

# Custom role with minimal permissions (alternative)
pveum role add VirtRigaud --privs "VM.Allocate,VM.Audit,VM.Config.CPU,VM.Config.Memory,VM.Config.Disk,VM.Config.Network,VM.Config.Options,VM.Monitor,VM.PowerMgmt,VM.Snapshot,VM.Clone,Datastore.Allocate,Datastore.AllocateSpace,Pool.Allocate"
pveum acl modify / --users virtrigaud@pve --roles VirtRigaud

Permission Details

| Permission | Usage | Required |
|---|---|---|
| VM.Allocate | Create new VMs | βœ… Core |
| VM.Audit | Read VM configuration | βœ… Core |
| VM.Config.* | Modify VM settings | βœ… Reconfigure |
| VM.Monitor | VM status monitoring | βœ… Core |
| VM.PowerMgmt | Power operations | βœ… Core |
| VM.Snapshot | Snapshot operations | ⚠️ Optional |
| VM.Clone | VM cloning | ⚠️ Optional |
| Datastore.Allocate | Create VM disks | βœ… Core |
| Pool.Allocate | Resource pool usage | ⚠️ Optional |

Token Rotation Procedure

# 1. Create new token
NEW_TOKEN=$(pveum user token add virtrigaud@pve vrtg-token-2 --privsep 1 --output-format json | jq -r '.value')

# 2. Update Kubernetes secret
kubectl patch secret pve-credentials -n virtrigaud-system --type='merge' -p='{"stringData":{"token_id":"virtrigaud@pve!vrtg-token-2","token_secret":"'$NEW_TOKEN'"}}'

# 3. Restart provider to use new token
kubectl rollout restart deployment provider-proxmox -n virtrigaud-system

# 4. Verify new token works
kubectl logs deployment/provider-proxmox -n virtrigaud-system

# 5. Remove old token
pveum user token remove virtrigaud@pve vrtg-token

NetworkPolicy Examples

Production NetworkPolicy

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: provider-proxmox-netpol
  namespace: virtrigaud-system
spec:
  podSelector:
    matchLabels:
      app: provider-proxmox
  policyTypes: [Ingress, Egress]

  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: virtrigaud-manager
    ports:
    - port: 9443
    - port: 8080

  egress:
  # DNS resolution
  - to: []
    ports:
    - port: 53
      protocol: UDP
    - port: 53
      protocol: TCP

  # Proxmox VE API
  - to:
    - ipBlock:
        cidr: 192.168.1.0/24  # Your PVE network
    ports:
    - port: 8006

Development NetworkPolicy

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: provider-proxmox-dev-netpol
  namespace: virtrigaud-system
spec:
  podSelector:
    matchLabels:
      app: provider-proxmox
      environment: development
  egress:
  - to: []  # Allow all egress for development

Storage and Placement

Storage Class Mapping

Configure storage placement for different workloads:

# High-performance storage
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: high-performance
spec:
  cpus: 8
  memory: "32Gi"
  storage:
    class: "nvme-storage"  # Maps to PVE storage
    type: "thin"           # Thin provisioning
    
# Standard storage
apiVersion: infra.virtrigaud.io/v1beta1  
kind: VMClass
metadata:
  name: standard
spec:
  cpus: 4
  memory: "8Gi"
  storage:
    class: "ssd-storage"
    type: "thick"          # Thick provisioning

Placement Policies

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMPlacementPolicy
metadata:
  name: production-placement
spec:
  nodeSelector:
    - "pve-node-1"
    - "pve-node-2"
  antiAffinity:
    - key: "vm.type"
      operator: "In"
      values: ["database"]
  constraints:
    maxVMsPerNode: 10
    minFreeMemory: "4Gi"

Performance Testing

Load Test Results

Performance benchmarks using virtrigaud-loadgen against fake PVE server:

| Operation | P50 Latency | P95 Latency | Throughput | Notes |
|---|---|---|---|---|
| Create VM | 2.3s | 4.1s | 12 ops/min | Including cloud-init |
| Power On | 800ms | 1.2s | 45 ops/min | Async operation |
| Power Off | 650ms | 1.1s | 50 ops/min | Graceful shutdown |
| Describe | 120ms | 200ms | 200 ops/min | Status query |
| Reconfigure CPU | 1.8s | 3.2s | 15 ops/min | Online hot-plug |
| Snapshot Create | 3.5s | 6.8s | 8 ops/min | With memory |
| Clone (Linked) | 1.9s | 3.4s | 12 ops/min | Fast COW clone |

Running Performance Tests

# Deploy fake PVE server for testing
kubectl apply -f test/performance/proxmox-loadtest.yaml

# Run performance test
kubectl create job proxmox-perf-test --from=cronjob/proxmox-performance-test

# View results
kubectl logs job/proxmox-perf-test -f

Security Best Practices

  1. Use API Tokens: Prefer API tokens over username/password
  2. Least Privilege: Grant minimal required permissions (see above)
  3. TLS Verification: Always verify certificates in production
  4. Secret Management: Use Kubernetes secrets with proper RBAC
  5. Network Policies: Restrict provider network access (see examples)
  6. Regular Rotation: Rotate API tokens quarterly
  7. Audit Logging: Enable PVE audit logs for provider actions
  8. Resource Quotas: Limit provider resource consumption

Examples

Multi-Node Setup

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: proxmox-cluster
spec:
  type: proxmox
  endpoint: https://pve-cluster.example.com:8006
  runtime:
    env:
      - name: PVE_NODE_SELECTOR
        value: "pve-1,pve-2,pve-3"

High-Availability Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: provider-proxmox
spec:
  replicas: 2
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: provider-proxmox
              topologyKey: kubernetes.io/hostname

Troubleshooting

Common Issues

❌ "endpoint is required" Error

Symptom: Provider pod crashes with ERROR Failed to create PVE client error="endpoint is required"

Cause: Missing or empty PVE_ENDPOINT environment variable

Solution:

# Ensure PVE_ENDPOINT is set in deployment
env:
  - name: PVE_ENDPOINT
    value: "https://your-proxmox.example.com:8006/api2"

❌ Connection Timeout/Refused

Symptom: Provider fails with connection timeouts or "connection refused"

Cause: Network connectivity issues or wrong endpoint URL

Solutions:

  1. Verify endpoint: Test from a pod in the cluster:

    kubectl run test-curl --rm -i --tty --image=curlimages/curl -- \
      curl -k https://your-proxmox.example.com:8006/api2/json/version
    
  2. Check firewall: Ensure port 8006 is accessible from Kubernetes cluster

  3. Verify URL format: Should be https://hostname:8006/api2 (note the /api2 path)
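These format checks can be automated before deploying; a minimal pre-flight sketch (validateEndpoint is a hypothetical helper, not part of the provider):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// validateEndpoint applies the sanity checks described above:
// https scheme, a host, and the /api2 path.
func validateEndpoint(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("invalid endpoint URL: %w", err)
	}
	if u.Scheme != "https" {
		return fmt.Errorf("endpoint must use https, got %q", u.Scheme)
	}
	if u.Host == "" {
		return fmt.Errorf("endpoint is missing a host")
	}
	if !strings.HasPrefix(u.Path, "/api2") {
		return fmt.Errorf("endpoint should include the /api2 path")
	}
	return nil
}

func main() {
	fmt.Println(validateEndpoint("https://pve.example.com:8006/api2")) // <nil>
	fmt.Println(validateEndpoint("http://pve.example.com:8006/api2"))  // scheme error
}
```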

❌ TLS Certificate Errors

Symptom: x509: certificate signed by unknown authority

Solutions:

  • Development: Set PVE_INSECURE_SKIP_VERIFY=true (not for production!)
  • Production: Provide valid TLS certificates or CA bundle

❌ Authentication Failures

Symptom: 401 Unauthorized or authentication failure

Solutions:

  1. Verify token permissions:

    # Test API token manually
    curl -k "https://pve.example.com:8006/api2/json/version" \
      -H "Authorization: PVEAPIToken=USER@REALM!TOKENID=SECRET"
    
  2. Check user privileges: Ensure user has VM management permissions

  3. Verify token format: Should be user@realm!tokenid (note the !)

❌ Provider Not Starting

Symptom: Pod in CrashLoopBackOff or 0/1 Ready

Diagnostic Steps:

# Check pod logs
kubectl logs -n virtrigaud-system deployment/virtrigaud-provider-proxmox

# Check environment variables
kubectl describe pod -n virtrigaud-system -l app.kubernetes.io/component=provider-proxmox

# Verify configuration
kubectl get secret proxmox-credentials -o yaml

Validation Commands

Test your Proxmox connection before deploying:

# 1. Test network connectivity
telnet your-proxmox.example.com 8006

# 2. Test API endpoint
curl -k https://your-proxmox.example.com:8006/api2/json/version

# 3. Test authentication
curl -k "https://your-proxmox.example.com:8006/api2/json/nodes" \
  -H "Authorization: PVEAPIToken=USER@REALM!TOKENID=SECRET"

# 4. Test from within cluster
kubectl run debug --rm -i --tty --image=curlimages/curl -- sh
# Then run curl commands from inside the pod

Debug Logging

Enable verbose logging for the provider:

providers:
  proxmox:
    env:
      - name: LOG_LEVEL
        value: "debug"
      - name: PVE_ENDPOINT
        value: "https://pve.example.com:8006/api2"

API Reference

For complete API reference, see the Provider API Documentation.

Contributing

To contribute to the Proxmox provider:

  1. See the Provider Development Guide
  2. Check the GitHub repository
  3. Review open issues

Support

Provider Developer Tutorial

This comprehensive tutorial walks you through creating a complete VirtRigaud provider from scratch. By the end, you'll have a fully functional provider that can create, manage, and delete virtual machines.

Prerequisites

Before starting this tutorial, ensure you have:

  • Go 1.23 or later installed
  • Docker installed for containerization
  • kubectl and a Kubernetes cluster (Kind/minikube for local development)
  • Helm 3.x installed
  • Basic understanding of gRPC and protobuf

Tutorial Overview

We'll build a File Provider that manages "virtual machines" as JSON files on disk. While not practical for production, this provider demonstrates all the core concepts without requiring actual hypervisor access.

What we'll build:

  • A complete provider implementation using the VirtRigaud SDK
  • Conformance tests that pass VCTS core profile
  • A Helm chart for deployment
  • CI/CD integration
  • Publication to the provider catalog

Step 1: Initialize Your Provider Project

1.1 Create Project Structure

# Create project directory
mkdir virtrigaud-provider-file
cd virtrigaud-provider-file

# Initialize the provider project
vrtg-provider init file

The vrtg-provider init command creates the following structure:

virtrigaud-provider-file/
β”œβ”€β”€ cmd/
β”‚   └── provider-file/
β”‚       β”œβ”€β”€ main.go
β”‚       └── Dockerfile
β”œβ”€β”€ internal/
β”‚   └── provider/
β”‚       β”œβ”€β”€ provider.go
β”‚       β”œβ”€β”€ capabilities.go
β”‚       └── provider_test.go
β”œβ”€β”€ charts/
β”‚   └── provider-file/
β”‚       β”œβ”€β”€ Chart.yaml
β”‚       β”œβ”€β”€ values.yaml
β”‚       └── templates/
β”œβ”€β”€ .github/
β”‚   └── workflows/
β”‚       └── ci.yml
β”œβ”€β”€ Makefile
β”œβ”€β”€ go.mod
β”œβ”€β”€ go.sum
β”œβ”€β”€ .gitignore
└── README.md

1.2 Examine Generated Files

main.go - Entry point that sets up the gRPC server:

package main

import (
    "log"
    
    "github.com/projectbeskar/virtrigaud/sdk/provider/server"
    "github.com/projectbeskar/virtrigaud/proto/rpc/provider/v1"
    "virtrigaud-provider-file/internal/provider"
)

func main() {
    // Create provider instance
    p, err := provider.New()
    if err != nil {
        log.Fatalf("Failed to create provider: %v", err)
    }
    
    // Configure server
    config := &server.Config{
        Port:        9443,
        HealthPort:  8080,
        EnableTLS:   false,
    }
    
    srv, err := server.New(config)
    if err != nil {
        log.Fatalf("Failed to create server: %v", err)
    }
    
    // Register provider service
    providerv1.RegisterProviderServiceServer(srv.GRPCServer(), p)
    
    // Start server
    log.Println("Starting file provider on port 9443...")
    if err := srv.Serve(); err != nil {
        log.Fatalf("Server failed: %v", err)
    }
}

go.mod - Module definition with SDK dependency:

module virtrigaud-provider-file

go 1.23

require (
    github.com/projectbeskar/virtrigaud/sdk v0.1.0
    github.com/projectbeskar/virtrigaud/proto v0.1.0
)

Step 2: Implement the Core Provider

2.1 Design the File Provider

Our file provider will:

  • Store VM metadata as JSON files in /var/lib/virtrigaud/vms/
  • Use filename as VM ID
  • Simulate power operations with state files
  • Support basic CRUD operations

2.2 Define the VM Model

Create internal/provider/vm.go:

package provider

import (
    "encoding/json"
    "fmt"
    "os"
    "path/filepath"
    "time"
    
    "github.com/projectbeskar/virtrigaud/proto/rpc/provider/v1"
)

type VirtualMachine struct {
    ID          string                 `json:"id"`
    Name        string                 `json:"name"`
    Spec        *providerv1.VMSpec     `json:"spec"`
    Status      *providerv1.VMStatus   `json:"status"`
    CreatedAt   time.Time              `json:"created_at"`
    UpdatedAt   time.Time              `json:"updated_at"`
}

type FileStore struct {
    baseDir string
}

func NewFileStore(baseDir string) *FileStore {
    return &FileStore{baseDir: baseDir}
}

func (fs *FileStore) Save(vm *VirtualMachine) error {
    if err := os.MkdirAll(fs.baseDir, 0755); err != nil {
        return fmt.Errorf("failed to create directory: %w", err)
    }
    
    vm.UpdatedAt = time.Now()
    data, err := json.MarshalIndent(vm, "", "  ")
    if err != nil {
        return fmt.Errorf("failed to marshal VM: %w", err)
    }
    
    filename := filepath.Join(fs.baseDir, vm.ID+".json")
    return os.WriteFile(filename, data, 0644)
}

func (fs *FileStore) Load(id string) (*VirtualMachine, error) {
    filename := filepath.Join(fs.baseDir, id+".json")
    data, err := os.ReadFile(filename)
    if err != nil {
        if os.IsNotExist(err) {
            return nil, fmt.Errorf("VM not found: %s", id)
        }
        return nil, fmt.Errorf("failed to read VM file: %w", err)
    }
    
    var vm VirtualMachine
    if err := json.Unmarshal(data, &vm); err != nil {
        return nil, fmt.Errorf("failed to unmarshal VM: %w", err)
    }
    
    return &vm, nil
}

func (fs *FileStore) Delete(id string) error {
    filename := filepath.Join(fs.baseDir, id+".json")
    if err := os.Remove(filename); err != nil && !os.IsNotExist(err) {
        return fmt.Errorf("failed to delete VM file: %w", err)
    }
    return nil
}

func (fs *FileStore) List() ([]*VirtualMachine, error) {
    files, err := os.ReadDir(fs.baseDir)
    if err != nil {
        if os.IsNotExist(err) {
            return []*VirtualMachine{}, nil
        }
        return nil, fmt.Errorf("failed to read directory: %w", err)
    }
    
    var vms []*VirtualMachine
    for _, file := range files {
        if !file.IsDir() && filepath.Ext(file.Name()) == ".json" {
            id := file.Name()[:len(file.Name())-5] // Remove .json extension
            vm, err := fs.Load(id)
            if err != nil {
                continue // Skip invalid files
            }
            vms = append(vms, vm)
        }
    }
    
    return vms, nil
}

2.3 Implement the Provider Interface

Update internal/provider/provider.go:

package provider

import (
    "context"
    "fmt"
    "os"
    "path/filepath"
    "time"
    
    "github.com/google/uuid"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
    
    "github.com/projectbeskar/virtrigaud/proto/rpc/provider/v1"
    "github.com/projectbeskar/virtrigaud/sdk/provider/capabilities"
    "github.com/projectbeskar/virtrigaud/sdk/provider/errors"
)

type Provider struct {
    store *FileStore
    caps  *capabilities.ProviderCapabilities
}

func New() (*Provider, error) {
    // Get storage directory from environment or use default
    baseDir := os.Getenv("PROVIDER_STORAGE_DIR")
    if baseDir == "" {
        baseDir = "/var/lib/virtrigaud/vms"
    }
    
    // Create capabilities
    caps := &capabilities.ProviderCapabilities{
        ProviderInfo: &providerv1.ProviderInfo{
            Name:        "file",
            Version:     "0.1.0",
            Description: "File-based virtual machine provider for development and testing",
        },
        SupportedCapabilities: []capabilities.Capability{
            capabilities.CapabilityCore,
            capabilities.CapabilitySnapshot,
            capabilities.CapabilityClone,
        },
    }
    
    return &Provider{
        store: NewFileStore(baseDir),
        caps:  caps,
    }, nil
}

// GetCapabilities returns provider capabilities
func (p *Provider) GetCapabilities(ctx context.Context, req *providerv1.GetCapabilitiesRequest) (*providerv1.GetCapabilitiesResponse, error) {
    return &providerv1.GetCapabilitiesResponse{
        ProviderId: "file-provider",
        Capabilities: []*providerv1.Capability{
            {
                Name:        "vm.create",
                Supported:   true,
                Description: "Create virtual machines",
            },
            {
                Name:        "vm.read",
                Supported:   true,
                Description: "Read virtual machine information",
            },
            {
                Name:        "vm.update",
                Supported:   true,
                Description: "Update virtual machine configuration",
            },
            {
                Name:        "vm.delete",
                Supported:   true,
                Description: "Delete virtual machines",
            },
            {
                Name:        "vm.power",
                Supported:   true,
                Description: "Control virtual machine power state",
            },
            {
                Name:        "vm.snapshot",
                Supported:   true,
                Description: "Create and manage VM snapshots",
            },
            {
                Name:        "vm.clone",
                Supported:   true,
                Description: "Clone virtual machines",
            },
        },
    }, nil
}

// CreateVM creates a new virtual machine
func (p *Provider) CreateVM(ctx context.Context, req *providerv1.CreateVMRequest) (*providerv1.CreateVMResponse, error) {
    // Validate request
    if req.Name == "" {
        return nil, errors.NewInvalidSpec("VM name is required")
    }
    
    if req.Spec == nil {
        return nil, errors.NewInvalidSpec("VM spec is required")
    }
    
    // Generate unique ID
    vmID := uuid.New().String()
    
    // Create VM object
    vm := &VirtualMachine{
        ID:   vmID,
        Name: req.Name,
        Spec: req.Spec,
        Status: &providerv1.VMStatus{
            State:   "Creating",
            Message: "VM is being created",
        },
        CreatedAt: time.Now(),
        UpdatedAt: time.Now(),
    }
    
    // Save to store
    if err := p.store.Save(vm); err != nil {
        return nil, status.Errorf(codes.Internal, "failed to save VM: %v", err)
    }
    
    // Simulate asynchronous creation (a real provider would guard the
    // store against concurrent access here)
    go func() {
        time.Sleep(2 * time.Second)
        vm.Status.State = "Running"
        vm.Status.Message = "VM is running"
        p.store.Save(vm)
    }()
    
    return &providerv1.CreateVMResponse{
        VmId:   vmID,
        Status: vm.Status,
    }, nil
}

// GetVM retrieves virtual machine information
func (p *Provider) GetVM(ctx context.Context, req *providerv1.GetVMRequest) (*providerv1.GetVMResponse, error) {
    if req.VmId == "" {
        return nil, errors.NewInvalidSpec("VM ID is required")
    }
    
    vm, err := p.store.Load(req.VmId)
    if err != nil {
        return nil, errors.NewNotFound("VM not found: %s", req.VmId)
    }
    
    return &providerv1.GetVMResponse{
        VmId:   vm.ID,
        Name:   vm.Name,
        Spec:   vm.Spec,
        Status: vm.Status,
    }, nil
}

// UpdateVM updates virtual machine configuration
func (p *Provider) UpdateVM(ctx context.Context, req *providerv1.UpdateVMRequest) (*providerv1.UpdateVMResponse, error) {
    if req.VmId == "" {
        return nil, errors.NewInvalidSpec("VM ID is required")
    }
    
    vm, err := p.store.Load(req.VmId)
    if err != nil {
        return nil, errors.NewNotFound("VM not found: %s", req.VmId)
    }
    
    // Update spec if provided
    if req.Spec != nil {
        vm.Spec = req.Spec
        vm.Status.Message = "VM configuration updated"
        
        if err := p.store.Save(vm); err != nil {
            return nil, status.Errorf(codes.Internal, "failed to save VM: %v", err)
        }
    }
    
    return &providerv1.UpdateVMResponse{
        Status: vm.Status,
    }, nil
}

// DeleteVM deletes a virtual machine
func (p *Provider) DeleteVM(ctx context.Context, req *providerv1.DeleteVMRequest) (*providerv1.DeleteVMResponse, error) {
    if req.VmId == "" {
        return nil, errors.NewInvalidSpec("VM ID is required")
    }
    
    // Check if VM exists
    _, err := p.store.Load(req.VmId)
    if err != nil {
        return nil, errors.NewNotFound("VM not found: %s", req.VmId)
    }
    
    // Delete VM
    if err := p.store.Delete(req.VmId); err != nil {
        return nil, status.Errorf(codes.Internal, "failed to delete VM: %v", err)
    }
    
    return &providerv1.DeleteVMResponse{
        Success: true,
        Message: "VM deleted successfully",
    }, nil
}

// PowerVM controls virtual machine power state
func (p *Provider) PowerVM(ctx context.Context, req *providerv1.PowerVMRequest) (*providerv1.PowerVMResponse, error) {
    if req.VmId == "" {
        return nil, errors.NewInvalidSpec("VM ID is required")
    }
    
    vm, err := p.store.Load(req.VmId)
    if err != nil {
        return nil, errors.NewNotFound("VM not found: %s", req.VmId)
    }
    
    // Update power state based on operation
    switch req.PowerOp {
    case providerv1.PowerOp_POWER_OP_ON:
        vm.Status.State = "Running"
        vm.Status.Message = "VM is running"
    case providerv1.PowerOp_POWER_OP_OFF:
        vm.Status.State = "Stopped"
        vm.Status.Message = "VM is stopped"
    case providerv1.PowerOp_POWER_OP_REBOOT:
        vm.Status.State = "Rebooting"
        vm.Status.Message = "VM is rebooting"
        // Simulate reboot
        go func() {
            time.Sleep(3 * time.Second)
            vm.Status.State = "Running"
            vm.Status.Message = "VM is running"
            p.store.Save(vm)
        }()
    default:
        return nil, errors.NewInvalidSpec("unsupported power operation: %v", req.PowerOp)
    }
    
    if err := p.store.Save(vm); err != nil {
        return nil, status.Errorf(codes.Internal, "failed to save VM: %v", err)
    }
    
    return &providerv1.PowerVMResponse{
        Status: vm.Status,
    }, nil
}

// ListVMs lists all virtual machines
func (p *Provider) ListVMs(ctx context.Context, req *providerv1.ListVMsRequest) (*providerv1.ListVMsResponse, error) {
    vms, err := p.store.List()
    if err != nil {
        return nil, status.Errorf(codes.Internal, "failed to list VMs: %v", err)
    }
    
    var vmInfos []*providerv1.VMInfo
    for _, vm := range vms {
        vmInfos = append(vmInfos, &providerv1.VMInfo{
            VmId:   vm.ID,
            Name:   vm.Name,
            Status: vm.Status,
        })
    }
    
    return &providerv1.ListVMsResponse{
        Vms: vmInfos,
    }, nil
}

// CreateSnapshot creates a VM snapshot
func (p *Provider) CreateSnapshot(ctx context.Context, req *providerv1.CreateSnapshotRequest) (*providerv1.CreateSnapshotResponse, error) {
    if req.VmId == "" {
        return nil, errors.NewInvalidSpec("VM ID is required")
    }
    
    vm, err := p.store.Load(req.VmId)
    if err != nil {
        return nil, errors.NewNotFound("VM not found: %s", req.VmId)
    }
    
    // Create snapshot (simulate by copying VM file)
    snapshotID := uuid.New().String()
    snapshotPath := filepath.Join(filepath.Dir(p.store.baseDir), "snapshots")
    
    if err := os.MkdirAll(snapshotPath, 0755); err != nil {
        return nil, status.Errorf(codes.Internal, "failed to create snapshot directory: %v", err)
    }
    
    // Copy VM data to snapshot
    snapshotVM := *vm
    snapshotVM.ID = snapshotID
    snapshotStore := NewFileStore(snapshotPath)
    
    if err := snapshotStore.Save(&snapshotVM); err != nil {
        return nil, status.Errorf(codes.Internal, "failed to save snapshot: %v", err)
    }
    
    return &providerv1.CreateSnapshotResponse{
        SnapshotId: snapshotID,
        Status: &providerv1.TaskStatus{
            State:   "Completed",
            Message: "Snapshot created successfully",
        },
    }, nil
}

// CloneVM clones a virtual machine
func (p *Provider) CloneVM(ctx context.Context, req *providerv1.CloneVMRequest) (*providerv1.CloneVMResponse, error) {
    if req.SourceVmId == "" {
        return nil, errors.NewInvalidSpec("Source VM ID is required")
    }
    
    if req.CloneName == "" {
        return nil, errors.NewInvalidSpec("Clone name is required")
    }
    
    // Load source VM
    sourceVM, err := p.store.Load(req.SourceVmId)
    if err != nil {
        return nil, errors.NewNotFound("Source VM not found: %s", req.SourceVmId)
    }
    
    // Create clone
    cloneID := uuid.New().String()
    cloneVM := &VirtualMachine{
        ID:   cloneID,
        Name: req.CloneName,
        Spec: sourceVM.Spec, // Copy spec from source
        Status: &providerv1.VMStatus{
            State:   "Stopped",
            Message: "Clone created successfully",
        },
        CreatedAt: time.Now(),
        UpdatedAt: time.Now(),
    }
    
    if err := p.store.Save(cloneVM); err != nil {
        return nil, status.Errorf(codes.Internal, "failed to save clone: %v", err)
    }
    
    return &providerv1.CloneVMResponse{
        CloneVmId: cloneID,
        Status: &providerv1.TaskStatus{
            State:   "Completed",
            Message: "VM cloned successfully",
        },
    }, nil
}

Step 3: Add Tests and Validation

3.1 Create Unit Tests

Create internal/provider/provider_test.go:

package provider

import (
    "context"
    "os"
    "path/filepath"
    "testing"
    "time"
    
    "github.com/stretchr/testify/assert"
    "github.com/stretchr/testify/require"
    
    "github.com/projectbeskar/virtrigaud/proto/rpc/provider/v1"
)

func TestProvider_CreateVM(t *testing.T) {
    // Create temporary directory for testing
    tmpDir, err := os.MkdirTemp("", "file-provider-test")
    require.NoError(t, err)
    defer os.RemoveAll(tmpDir)
    
    // Set storage directory
    os.Setenv("PROVIDER_STORAGE_DIR", tmpDir)
    defer os.Unsetenv("PROVIDER_STORAGE_DIR")
    
    // Create provider
    p, err := New()
    require.NoError(t, err)
    
    // Test VM creation
    req := &providerv1.CreateVMRequest{
        Name: "test-vm",
        Spec: &providerv1.VMSpec{
            Cpu:    2,
            Memory: 4096,
            Image:  "ubuntu:20.04",
        },
    }
    
    resp, err := p.CreateVM(context.Background(), req)
    require.NoError(t, err)
    assert.NotEmpty(t, resp.VmId)
    assert.Equal(t, "Creating", resp.Status.State)
    
    // Verify VM file was created
    vmFile := filepath.Join(tmpDir, resp.VmId+".json")
    assert.FileExists(t, vmFile)
}

func TestProvider_GetVM(t *testing.T) {
    tmpDir, err := os.MkdirTemp("", "file-provider-test")
    require.NoError(t, err)
    defer os.RemoveAll(tmpDir)
    
    os.Setenv("PROVIDER_STORAGE_DIR", tmpDir)
    defer os.Unsetenv("PROVIDER_STORAGE_DIR")
    
    p, err := New()
    require.NoError(t, err)
    
    // Create VM first
    createReq := &providerv1.CreateVMRequest{
        Name: "test-vm",
        Spec: &providerv1.VMSpec{
            Cpu:    2,
            Memory: 4096,
        },
    }
    
    createResp, err := p.CreateVM(context.Background(), createReq)
    require.NoError(t, err)
    
    // Get VM
    getReq := &providerv1.GetVMRequest{
        VmId: createResp.VmId,
    }
    
    getResp, err := p.GetVM(context.Background(), getReq)
    require.NoError(t, err)
    assert.Equal(t, createResp.VmId, getResp.VmId)
    assert.Equal(t, "test-vm", getResp.Name)
    assert.Equal(t, int32(2), getResp.Spec.Cpu)
}

func TestProvider_PowerVM(t *testing.T) {
    tmpDir, err := os.MkdirTemp("", "file-provider-test")
    require.NoError(t, err)
    defer os.RemoveAll(tmpDir)
    
    os.Setenv("PROVIDER_STORAGE_DIR", tmpDir)
    defer os.Unsetenv("PROVIDER_STORAGE_DIR")
    
    p, err := New()
    require.NoError(t, err)
    
    // Create VM
    createReq := &providerv1.CreateVMRequest{
        Name: "test-vm",
        Spec: &providerv1.VMSpec{Cpu: 1, Memory: 1024},
    }
    
    createResp, err := p.CreateVM(context.Background(), createReq)
    require.NoError(t, err)
    
    // Power off VM
    powerReq := &providerv1.PowerVMRequest{
        VmId:    createResp.VmId,
        PowerOp: providerv1.PowerOp_POWER_OP_OFF,
    }
    
    powerResp, err := p.PowerVM(context.Background(), powerReq)
    require.NoError(t, err)
    assert.Equal(t, "Stopped", powerResp.Status.State)
    
    // Power on VM
    powerReq.PowerOp = providerv1.PowerOp_POWER_OP_ON
    powerResp, err = p.PowerVM(context.Background(), powerReq)
    require.NoError(t, err)
    assert.Equal(t, "Running", powerResp.Status.State)
}

func TestProvider_GetCapabilities(t *testing.T) {
    p, err := New()
    require.NoError(t, err)
    
    req := &providerv1.GetCapabilitiesRequest{}
    resp, err := p.GetCapabilities(context.Background(), req)
    require.NoError(t, err)
    
    assert.Equal(t, "file-provider", resp.ProviderId)
    assert.NotEmpty(t, resp.Capabilities)
    
    // Check for core capabilities
    capNames := make(map[string]bool)
    for _, cap := range resp.Capabilities {
        capNames[cap.Name] = cap.Supported
    }
    
    assert.True(t, capNames["vm.create"])
    assert.True(t, capNames["vm.read"])
    assert.True(t, capNames["vm.delete"])
    assert.True(t, capNames["vm.power"])
}

func TestProvider_CloneVM(t *testing.T) {
    tmpDir, err := os.MkdirTemp("", "file-provider-test")
    require.NoError(t, err)
    defer os.RemoveAll(tmpDir)
    
    os.Setenv("PROVIDER_STORAGE_DIR", tmpDir)
    defer os.Unsetenv("PROVIDER_STORAGE_DIR")
    
    p, err := New()
    require.NoError(t, err)
    
    // Create source VM
    createReq := &providerv1.CreateVMRequest{
        Name: "source-vm",
        Spec: &providerv1.VMSpec{
            Cpu:    4,
            Memory: 8192,
            Image:  "centos:8",
        },
    }
    
    createResp, err := p.CreateVM(context.Background(), createReq)
    require.NoError(t, err)
    
    // Clone VM
    cloneReq := &providerv1.CloneVMRequest{
        SourceVmId: createResp.VmId,
        CloneName:  "cloned-vm",
    }
    
    cloneResp, err := p.CloneVM(context.Background(), cloneReq)
    require.NoError(t, err)
    assert.NotEmpty(t, cloneResp.CloneVmId)
    assert.NotEqual(t, createResp.VmId, cloneResp.CloneVmId)
    
    // Verify clone has same specs as source
    getReq := &providerv1.GetVMRequest{
        VmId: cloneResp.CloneVmId,
    }
    
    getResp, err := p.GetVM(context.Background(), getReq)
    require.NoError(t, err)
    assert.Equal(t, "cloned-vm", getResp.Name)
    assert.Equal(t, int32(4), getResp.Spec.Cpu)
    assert.Equal(t, int32(8192), getResp.Spec.Memory)
    assert.Equal(t, "centos:8", getResp.Spec.Image)
}

3.2 Add Build and Test Targets

Update the Makefile:

# File Provider Makefile

.PHONY: help build test lint clean run docker-build docker-push

help: ## Show this help message
	@echo 'Usage: make [target]'
	@echo ''
	@echo 'Targets:'
	@awk 'BEGIN {FS = ":.*?## "} /^[a-zA-Z_-]+:.*?## / {printf "  %-15s %s\n", $$1, $$2}' $(MAKEFILE_LIST)

build: ## Build the provider binary
	go build -o bin/provider-file ./cmd/provider-file

test: ## Run tests
	go test -v ./...

test-coverage: ## Run tests with coverage
	go test -v -coverprofile=coverage.out ./...
	go tool cover -html=coverage.out -o coverage.html

lint: ## Run linters
	golangci-lint run ./...

clean: ## Clean build artifacts
	rm -rf bin/
	rm -f coverage.out coverage.html

run: build ## Run the provider locally
	PROVIDER_STORAGE_DIR=/tmp/virtrigaud-file ./bin/provider-file

docker-build: ## Build Docker image
	docker build -f cmd/provider-file/Dockerfile -t provider-file:latest .

docker-push: docker-build ## Build and push Docker image
	docker tag provider-file:latest ghcr.io/yourorg/provider-file:latest
	docker push ghcr.io/yourorg/provider-file:latest

# Development targets
dev-setup: ## Set up development environment
	go mod download
	go install github.com/golangci/golangci-lint/cmd/golangci-lint@latest

integration-test: build ## Run integration tests
	./scripts/integration-test.sh

Step 4: Test with VCTS (VirtRigaud Conformance Test Suite)

4.1 Install VCTS

# Build VCTS from the main repository
go install github.com/projectbeskar/virtrigaud/cmd/vcts@latest

4.2 Create VCTS Configuration

Create vcts-config.yaml:

provider:
  name: "file"
  endpoint: "localhost:9443"
  tls: false
  
profiles:
  core:
    enabled: true
    vm_specs:
      - name: "basic"
        cpu: 1
        memory: 1024
        image: "test:latest"
      - name: "medium"
        cpu: 2
        memory: 4096
        image: "ubuntu:20.04"
        
  snapshot:
    enabled: true
    
  clone:
    enabled: true

tests:
  timeout: "30s"
  parallel: false
  cleanup: true

4.3 Run Conformance Tests

# Start the provider
make run &
PROVIDER_PID=$!

# Wait for provider to start
sleep 3

# Run VCTS core profile
vcts run --config vcts-config.yaml --profile core

# Run all enabled profiles
vcts run --config vcts-config.yaml --profile all

# Stop the provider
kill $PROVIDER_PID

Expected output:

✅ Core Profile Tests
  ✅ Provider.GetCapabilities
  ✅ Provider.CreateVM
  ✅ Provider.GetVM
  ✅ Provider.UpdateVM
  ✅ Provider.DeleteVM
  ✅ Provider.PowerVM
  ✅ Provider.ListVMs

✅ Snapshot Profile Tests
  ✅ Provider.CreateSnapshot

✅ Clone Profile Tests
  ✅ Provider.CloneVM

🎉 All tests passed! Provider is conformant.

Step 5: Create Helm Chart for Deployment

5.1 Chart Structure

The generated chart in charts/provider-file/ includes:

charts/provider-file/
├── Chart.yaml
├── values.yaml
├── templates/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── serviceaccount.yaml
│   ├── rbac.yaml
│   └── _helpers.tpl
└── examples/
    └── values-development.yaml

5.2 Customize Chart Values

Update charts/provider-file/values.yaml:

# Default values for provider-file

replicaCount: 1

image:
  repository: ghcr.io/yourorg/provider-file
  pullPolicy: IfNotPresent
  tag: "0.1.0"

nameOverride: ""
fullnameOverride: ""

serviceAccount:
  create: true
  annotations: {}
  name: ""

podAnnotations: {}

podSecurityContext:
  fsGroup: 2000
  runAsNonRoot: true
  runAsUser: 1000

securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
    - ALL
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000

service:
  type: ClusterIP
  port: 9443
  healthPort: 8080

resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 100m
    memory: 128Mi

nodeSelector: {}

tolerations: []

affinity: {}

# Provider-specific configuration
provider:
  storageDir: "/var/lib/virtrigaud/vms"
  logLevel: "info"

# Persistent storage for VM data
persistence:
  enabled: true
  accessMode: ReadWriteOnce
  size: 10Gi
  storageClass: ""

5.3 Test Helm Chart

# Lint the chart
helm lint charts/provider-file/

# Template the chart
helm template provider-file charts/provider-file/ \
  --values charts/provider-file/values.yaml

# Install to local cluster
helm install provider-file charts/provider-file/ \
  --namespace provider-file \
  --create-namespace \
  --values charts/provider-file/examples/values-development.yaml

Step 6: Set Up CI/CD

6.1 GitHub Actions Workflow

The generated .github/workflows/ci.yml includes:

name: CI

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main, develop ]

env:
  GO_VERSION: '1.23'

jobs:
  test:
    name: Test
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    
    - name: Set up Go
      uses: actions/setup-go@v4
      with:
        go-version: ${{ env.GO_VERSION }}
    
    - name: Run tests
      run: make test

    - name: Run linting
      run: make lint

  build:
    name: Build
    runs-on: ubuntu-latest
    needs: test
    steps:
    - uses: actions/checkout@v4
    
    - name: Set up Go
      uses: actions/setup-go@v4
      with:
        go-version: ${{ env.GO_VERSION }}
    
    - name: Build binary
      run: make build

    - name: Build Docker image
      run: make docker-build

  conformance:
    name: Conformance Tests
    runs-on: ubuntu-latest
    needs: build
    steps:
    - uses: actions/checkout@v4
    
    - name: Set up Go
      uses: actions/setup-go@v4
      with:
        go-version: ${{ env.GO_VERSION }}
    
    - name: Build provider
      run: make build

    - name: Install VCTS
      run: go install github.com/projectbeskar/virtrigaud/cmd/vcts@latest

    - name: Run conformance tests
      run: |
        # Start provider in background
        PROVIDER_STORAGE_DIR=/tmp/vcts-test ./bin/provider-file &
        PROVIDER_PID=$!
        
        # Wait for startup
        sleep 5
        
        # Run VCTS
        vcts run --config vcts-config.yaml --profile core
        
        # Clean up
        kill $PROVIDER_PID

  release:
    name: Release
    runs-on: ubuntu-latest
    needs: [test, build, conformance]
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    steps:
    - uses: actions/checkout@v4
    
    - name: Build and push Docker image
      run: |
        echo ${{ secrets.GITHUB_TOKEN }} | docker login ghcr.io -u ${{ github.actor }} --password-stdin
        make docker-push

    - name: Package Helm chart
      run: |
        helm package charts/provider-file/ -d dist/
        
    - name: Upload artifacts
      uses: actions/upload-artifact@v4
      with:
        name: release-artifacts
        path: |
          bin/
          dist/

Step 7: Publish to Provider Catalog

7.1 Run Provider Verification

# Verify the provider meets all requirements
vrtg-provider verify --profile all

7.2 Publish to Catalog

# Publish to the VirtRigaud provider catalog
vrtg-provider publish \
  --name file \
  --image ghcr.io/yourorg/provider-file \
  --tag 0.1.0 \
  --repo https://github.com/yourorg/virtrigaud-provider-file \
  --maintainer your-email@example.com \
  --license Apache-2.0

This command will:

  1. Run VCTS conformance tests
  2. Generate a provider badge
  3. Create a catalog entry
  4. Open a pull request to the main VirtRigaud repository

7.3 Example Catalog Entry

The generated catalog entry will look like:

- name: file
  displayName: "File Provider"
  description: "File-based virtual machine provider for development and testing"
  repo: "https://github.com/yourorg/virtrigaud-provider-file"
  image: "ghcr.io/yourorg/provider-file"
  tag: "0.1.0"
  capabilities:
    - core
    - snapshot
    - clone
  conformance:
    profiles:
      core: pass
      snapshot: pass
      clone: pass
      image-prepare: skip
      advanced: skip
    report_url: "https://github.com/yourorg/virtrigaud-provider-file/actions"
    badge_url: "https://img.shields.io/badge/conformance-pass-green"
    last_tested: "2025-08-26T15:00:00Z"
  maintainer: "your-email@example.com"
  license: "Apache-2.0"
  maturity: "beta"
  tags:
    - file
    - development
    - testing
  documentation: "https://github.com/yourorg/virtrigaud-provider-file/blob/main/README.md"

Step 8: Production Considerations

8.1 Security Hardening

# Production values.yaml
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
    - ALL
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 65534

podSecurityContext:
  fsGroup: 65534
  runAsNonRoot: true
  runAsUser: 65534
  seccompProfile:
    type: RuntimeDefault

networkPolicy:
  enabled: true
  ingress:
    fromNamespaces:
      - virtrigaud-system
  egress:
    - to: []
      ports:
        - protocol: UDP
          port: 53

8.2 Observability

Add monitoring and logging:

// Add to provider.go
import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    vmOperations = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "file_provider_vm_operations_total",
            Help: "Total number of VM operations",
        },
        []string{"operation", "status"},
    )
    
    vmOperationDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "file_provider_vm_operation_duration_seconds",
            Help: "Duration of VM operations",
        },
        []string{"operation"},
    )
)

func (p *Provider) CreateVM(ctx context.Context, req *providerv1.CreateVMRequest) (*providerv1.CreateVMResponse, error) {
    start := time.Now()
    defer func() {
        vmOperationDuration.WithLabelValues("create").Observe(time.Since(start).Seconds())
    }()
    
    // ... existing implementation ...
    
    vmOperations.WithLabelValues("create", "success").Inc()
    return resp, nil
}

8.3 Performance Optimization

  • Add connection pooling for gRPC clients
  • Implement caching for frequently accessed VMs
  • Use background workers for long-running operations
  • Add rate limiting and request validation

8.4 Error Handling and Resilience

  • Implement circuit breakers for external dependencies
  • Add retry logic with exponential backoff
  • Use structured logging with correlation IDs
  • Implement graceful shutdown handling

Conclusion

You've successfully created a complete VirtRigaud provider! This tutorial covered:

✅ Provider Implementation - Full gRPC service with all core operations
✅ SDK Integration - Using the VirtRigaud SDK for server setup and utilities
✅ Testing - Unit tests and VCTS conformance validation
✅ Containerization - Docker images and Helm charts
✅ CI/CD - Automated testing and publishing
✅ Catalog Integration - Publishing to the provider ecosystem

Next Steps

  1. Explore Advanced Features:

    • Add image management capabilities
    • Implement networking configuration
    • Add storage volume management
  2. Integration Examples:

    • Connect to real hypervisors (libvirt, vSphere, etc.)
    • Add authentication and authorization
    • Implement backup and disaster recovery
  3. Community Contribution:

    • Submit your provider to the catalog
    • Contribute improvements to the SDK
    • Help other developers with provider development
  4. Production Deployment:

    • Set up monitoring and alerting
    • Implement proper security measures
    • Plan for scaling and high availability

For more information, visit the VirtRigaud documentation or join our community discussions.

Versioning & Breaking Changes

This document outlines VirtRigaud's approach to versioning, compatibility, and managing breaking changes across the provider ecosystem.

Overview

VirtRigaud follows semantic versioning (SemVer) principles and maintains backward compatibility through careful API design and migration strategies. The system has multiple versioning dimensions:

  • VirtRigaud Core - The main platform (API server, manager, CRDs)
  • Provider SDK - Go SDK for building providers
  • Proto Contracts - gRPC/protobuf API definitions
  • Individual Providers - Each provider has independent versioning

Semantic Versioning

All VirtRigaud components follow Semantic Versioning 2.0.0:

Version Format: MAJOR.MINOR.PATCH

  • MAJOR (X.0.0): Breaking changes that require user action
  • MINOR (0.X.0): New features that are backward compatible
  • PATCH (0.0.X): Bug fixes and security updates

Examples

1.0.0 → 1.0.1  # Patch: Bug fixes only
1.0.1 → 1.1.0  # Minor: New features, backward compatible
1.1.0 → 2.0.0  # Major: Breaking changes

Component Versioning Strategy

VirtRigaud Core APIs

Kubernetes-style API versioning with multiple supported versions:

# Supported API versions
apiVersion: infra.virtrigaud.io/v1alpha1  # Development/preview
apiVersion: infra.virtrigaud.io/v1beta1   # Pre-release/testing
apiVersion: infra.virtrigaud.io/v1        # Stable/production

Stability Levels:

  • Alpha (v1alpha1): Experimental, may change or be removed
  • Beta (v1beta1): Well-tested, minimal changes expected
  • Stable (v1): Production-ready, strong backward compatibility

Support Windows:

  • Alpha: Best effort, no guarantees
  • Beta: Supported for 2 minor releases after stable equivalent
  • Stable: Supported for 12 months after deprecation

Provider SDK Versioning

SDK versions are independent of core VirtRigaud versions:

// Go module versioning
module github.com/projectbeskar/virtrigaud/sdk

// Version tags
sdk/v0.1.0    # Initial release
sdk/v0.2.0    # New features
sdk/v1.0.0    # First stable release
sdk/v2.0.0    # Breaking changes (new module path: sdk/v2)

SDK Compatibility Matrix:

SDK Version | VirtRigaud Core | Go Version | Status
v0.1.x      | 0.1.0 - 0.2.x   | 1.23+      | Beta
v1.0.x      | 0.2.0 - 1.0.x   | 1.23+      | Stable
v1.1.x      | 0.3.0 - 1.1.x   | 1.23+      | Stable
v2.0.x      | 1.0.0+          | 1.24+      | Future

Proto Contract Versioning

Protobuf APIs use both module versions and service versions:

// Service versioning in proto files
package provider.v1;
service ProviderService {
  // API methods
}

// Module versioning
module github.com/projectbeskar/virtrigaud/proto

Proto Evolution Rules:

  • ✅ Add new fields (with proper defaults)
  • ✅ Add new RPC methods
  • ✅ Add new enum values
  • ❌ Remove fields or methods
  • ❌ Change field types or semantics
  • ❌ Remove enum values
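In practice these rules mean messages evolve additively: new fields get fresh field numbers, and retired numbers are reserved rather than reused. The message and field names below are illustrative, not the actual VirtRigaud contract:

```protobuf
// Safe, additive evolution of an existing message.
message CreateVMRequest {
  string name = 1;
  VMSpec spec = 2;

  // NEW in a minor release: optional hint; absence means "no preference",
  // so old clients are unaffected.
  string placement_hint = 3;

  // If a field must go away, reserve its number and name instead of
  // deleting it, so they can never be reused with different semantics.
  reserved 4;
  reserved "legacy_flag";
}
```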

Provider Versioning

Each provider maintains independent versioning:

# Provider catalog entry
name: vsphere
tag: "1.2.3"      # Provider version
sdk_version: "v1.0.0"  # SDK dependency
proto_version: "v0.1.0"  # Proto dependency

Breaking Change Policy

What Constitutes a Breaking Change

API Breaking Changes:

  • Removing or renaming API fields
  • Changing field types or semantics
  • Removing API endpoints or methods
  • Changing required vs optional fields
  • Modifying default behaviors
  • Changing error codes or messages that clients depend on

SDK Breaking Changes:

  • Removing public functions, types, or methods
  • Changing function signatures
  • Modifying struct fields (without proper backward compatibility)
  • Changing package import paths
  • Removing or renaming configuration options
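The compatible alternative to most of these is to add new surface beside the old: keep the exported signature intact as a thin wrapper and introduce a richer entry point alongside it. A sketch with hypothetical names:

```go
package main

import "fmt"

// Options is a hypothetical settings struct. Adding fields to it is
// backward compatible because the zero value preserves old behavior.
type Options struct {
	Retries int // NEW field: 0 means "no retries", matching old behavior
}

// CreateVM is the original exported function; its signature must not change.
func CreateVM(name string) string {
	return CreateVMWithOptions(name, Options{})
}

// CreateVMWithOptions is the new entry point, added alongside the old
// one rather than replacing it.
func CreateVMWithOptions(name string, opts Options) string {
	return fmt.Sprintf("vm=%s retries=%d", name, opts.Retries)
}

func main() {
	fmt.Println(CreateVM("demo"))                                 // old callers unaffected
	fmt.Println(CreateVMWithOptions("demo", Options{Retries: 3})) // new behavior opt-in
}
```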

Proto Breaking Changes:

  • Removing fields or RPC methods
  • Changing field numbers or types
  • Removing enum values
  • Modifying service or method names

Breaking Change Process

1. Proposal Phase

# Breaking Change Proposal: [Title]

## Summary
Brief description of the change and motivation.

## Motivation  
Why is this change necessary? What problems does it solve?

## Proposed Changes
Detailed description of the changes.

## Migration Path
How will users migrate from old to new behavior?

## Timeline
- Deprecation announcement: v1.1.0
- Breaking change implementation: v2.0.0
- Legacy support removal: v3.0.0

## Alternatives Considered
What other approaches were considered?

2. Deprecation Phase

// Deprecated functions include clear migration guidance
// Deprecated: Use NewCreateVMRequest instead. Will be removed in v2.0.0.
func CreateVM(name string) *VMRequest {
    return &VMRequest{Name: name}
}

// New recommended approach
func NewCreateVMRequest(spec *VMSpec) *CreateVMRequest {
    return &CreateVMRequest{Spec: spec}
}

3. Migration Tools

# Migration command examples
vrtg-provider migrate --from v1 --to v2
vrtg-provider check-compatibility --target-version v2.0.0

4. Communication

  • Release notes with migration guide
  • Blog posts for major changes
  • Community discussions and Q&A
  • Updated documentation

Compatibility Testing

Automated Compatibility Checks

# .github/workflows/compatibility.yml
name: Compatibility Check

jobs:
  compatibility-matrix:
    strategy:
      matrix:
        sdk_version: [v1.0.0, v1.1.0, current]
        provider_version: [v1.0.0, v1.1.0, current]
    
    steps:
    - name: Test SDK ${{ matrix.sdk_version }} with Provider ${{ matrix.provider_version }}
      run: |
        # Build provider with specific SDK version
        # Run conformance tests
        # Report compatibility results

Buf Proto Compatibility

# proto/buf.yaml
version: v1
breaking:
  use:
    # Prevent breaking changes
    - FILE_NO_DELETE
    - FIELD_NO_DELETE
    - FIELD_SAME_TYPE
    - ENUM_VALUE_NO_DELETE
    - RPC_NO_DELETE
    - SERVICE_NO_DELETE
  ignore:
    # Allowed changes during alpha/beta
    - "provider/v1alpha1"

# Check for breaking changes
buf breaking --against 'https://github.com/projectbeskar/virtrigaud.git#branch=main'

Provider Compatibility Testing

# Test provider against multiple VirtRigaud versions
vcts run --provider ./provider --virtrigaud-version 0.1.0
vcts run --provider ./provider --virtrigaud-version 0.2.0
vcts run --provider ./provider --virtrigaud-version 1.0.0

Migration Strategies

API Version Migration

Example: VirtualMachine v1alpha1 → v1beta1

// Conversion webhook approach
func (src *v1alpha1.VirtualMachine) ConvertTo(dst *v1beta1.VirtualMachine) error {
    // Convert common fields
    dst.ObjectMeta = src.ObjectMeta
    
    // Handle field migrations
    if src.Spec.PowerState == "On" {
        dst.Spec.PowerState = v1beta1.PowerStateOn
    }
    
    // Set new fields with appropriate defaults
    if dst.Spec.Phase == "" {
        dst.Spec.Phase = v1beta1.PhaseUnknown
    }
    
    return nil
}

Gradual Migration Process

# Phase 1: Dual support (both versions work)
kubectl apply -f vm-v1alpha1.yaml  # Still works
kubectl apply -f vm-v1beta1.yaml   # Also works

# Phase 2: Deprecation warning
kubectl apply -f vm-v1alpha1.yaml
# Warning: v1alpha1 is deprecated, use v1beta1

# Phase 3: Conversion only (internal storage uses v1beta1)
kubectl apply -f vm-v1alpha1.yaml  # Automatically converted

# Phase 4: Removal (after support window)
kubectl apply -f vm-v1alpha1.yaml  # Error: version not supported

Provider SDK Migration

Example: SDK v1 → v2

SDK v1 (deprecated):

// Old SDK pattern
func NewProvider(config Config) *Provider {
    return &Provider{config: config}
}

func (p *Provider) CreateVM(name string, cpu int, memory int) error {
    // Implementation
}

SDK v2 (new):

// New SDK pattern with better types
func NewProvider(config *Config) (*Provider, error) {
    if err := config.Validate(); err != nil {
        return nil, err
    }
    return &Provider{config: config}, nil
}

func (p *Provider) CreateVM(ctx context.Context, req *CreateVMRequest) (*CreateVMResponse, error) {
    // Implementation with proper context and structured types
}

Migration Bridge:

// sdk/v2/compat/v1.go - Compatibility layer
package compat

import (
    v1 "github.com/projectbeskar/virtrigaud/sdk/provider"
    v2 "github.com/projectbeskar/virtrigaud/sdk/v2/provider"
)

// Bridge for gradual migration
func AdaptV1Provider(v1Provider v1.Provider) v2.Provider {
    return &v1ProviderAdapter{old: v1Provider}
}

type v1ProviderAdapter struct {
    old v1.Provider
}

func (a *v1ProviderAdapter) CreateVM(ctx context.Context, req *v2.CreateVMRequest) (*v2.CreateVMResponse, error) {
    // Convert v2 request to v1 format
    err := a.old.CreateVM(req.Name, int(req.Spec.CPU), int(req.Spec.Memory))
    
    // Convert v1 response to v2 format
    if err != nil {
        return nil, err
    }
    
    return &v2.CreateVMResponse{
        Status: "Created",
    }, nil
}

Configuration Migration

Example: Configuration Schema Changes

v1 Configuration:

# provider-config-v1.yaml
provider:
  type: "vsphere"
  server: "vcenter.example.com"
  username: "admin"
  password: "secret"

v2 Configuration:

# provider-config-v2.yaml
apiVersion: config.virtrigaud.io/v2
kind: ProviderConfig
metadata:
  name: vsphere-config
spec:
  type: "vsphere"
  connection:
    endpoint: "vcenter.example.com"
    authentication:
      method: "basic"
      secretRef:
        name: "vsphere-credentials"
  features:
    snapshots: true
    cloning: true

Migration Command:

# Automatic migration tool
vrtg-provider config migrate \
  --from provider-config-v1.yaml \
  --to provider-config-v2.yaml \
  --create-secret vsphere-credentials

Release Planning

Release Cadence

  • Patch releases: As needed for critical bugs/security
  • Minor releases: Every 2-3 months
  • Major releases: Every 12-18 months

Feature Lifecycle

Experimental β†’ Alpha β†’ Beta β†’ Stable β†’ Deprecated β†’ Removed
     |          |       |       |         |          |
     |          |       |       |         |          +-- After support window
     |          |       |       |         +-- 2 releases notice
     |          |       |       +-- Production ready
     |          |       +-- Pre-release testing
     |          +-- Public preview
     +-- Internal/development only

Release Branch Strategy

main                    # Current development
β”œβ”€β”€ release-0.1        # Patch releases for v0.1.x
β”œβ”€β”€ release-0.2        # Patch releases for v0.2.x
└── release-1.0        # Patch releases for v1.0.x

Support Matrix

Version  Status      Support Level  End of Life
1.0.x    Stable      Full support   2026-01-01
0.2.x    Stable      Security only  2025-06-01
0.1.x    Deprecated  None           2025-01-01

Best Practices

For Provider Developers

  1. Version Dependencies Carefully

    // Use specific versions, not floating
    require github.com/projectbeskar/virtrigaud/sdk v1.2.3
    
  2. Test Compatibility Early

    # Test against multiple SDK versions
    go mod edit -require=github.com/projectbeskar/virtrigaud/sdk@v1.1.0
    go test ./...
    go mod edit -require=github.com/projectbeskar/virtrigaud/sdk@v1.2.0
    go test ./...
    
  3. Handle Deprecations Gracefully

    // Check for deprecated features
    if provider.IsDeprecated("vm.legacy-create") {
        log.Warn("Using deprecated API, migrate to vm.create")
    }
    
  4. Document Breaking Changes

    # CHANGELOG.md
    ## [2.0.0] - 2025-01-15
    ### BREAKING CHANGES
    - Removed deprecated `CreateVM` method, use `CreateVMRequest` instead
    - Changed configuration format, see migration guide
    
    ### Migration Guide
    Old: `provider.CreateVM("vm1", 2, 4096)`
    New: `provider.CreateVM(ctx, &CreateVMRequest{...})`
    

For Users

  1. Pin Versions in Production

    # Helm values
    image:
      tag: "1.2.3"  # Not "latest"
    
  2. Test Upgrades in Staging

    # Upgrade strategy
    helm upgrade provider-test virtrigaud/provider \
      --version 1.3.0 \
      --namespace staging
    
  3. Monitor Deprecation Warnings

    # Check for deprecation warnings
    kubectl logs -l app=provider | grep -i deprecat
    
  4. Plan Migration Windows

    # Schedule upgrades during maintenance windows
    # Have rollback plans ready
    # Test compatibility thoroughly
    

Future Considerations

Long-term Compatibility

  • 10-year Support Goal: Core APIs should remain usable for 10 years
  • Gradual Evolution: Prefer gradual evolution over revolutionary changes
  • Ecosystem Stability: Consider impact on the entire provider ecosystem

Emerging Standards

  • OCI Compliance: Align with OCI runtime and image standards
  • CNCF Integration: Follow CNCF project graduation requirements
  • Industry Standards: Adopt relevant industry standards as they emerge

Technology Evolution

  • Go Version Support: Support 2-3 latest Go versions
  • Kubernetes Compatibility: Support 3-4 latest Kubernetes versions
  • gRPC Evolution: Adapt to gRPC and protobuf improvements

This versioning strategy ensures VirtRigaud can evolve while maintaining stability and compatibility for the provider ecosystem.

Advanced VM Lifecycle Management

This document describes the advanced VM lifecycle features in VirtRigaud, including reconfiguration, snapshots, cloning, multi-VM sets, and placement policies.

Overview

VirtRigaud Stage E introduces comprehensive VM lifecycle management capabilities that go beyond basic create/delete operations:

  • VM Reconfiguration: Modify CPU, memory, and disk resources of running VMs
  • Snapshot Management: Create, delete, and revert VM snapshots
  • VM Cloning: Create new VMs from existing ones with linked clone support
  • Multi-VM Sets: Manage groups of VMs with rolling updates
  • Placement Policies: Advanced placement rules and anti-affinity constraints
  • Image Preparation: Automated image import and preparation workflows

VM Reconfiguration

Online vs Offline Reconfiguration

VirtRigaud supports both online (hot) and offline reconfiguration depending on provider capabilities:

  β€’ vSphere: Supports online CPU/memory changes and hot disk expansion
  β€’ Libvirt: Typically requires a power cycle for resource changes

Example: CPU/Memory Upgrade

# Original VM with 2 CPU, 4GB RAM
apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: web-server
spec:
  resources:
    cpu: 2
    memoryMiB: 4096

# Patch to upgrade resources
# kubectl patch vm web-server --type merge -p '{"spec":{"resources":{"cpu":4,"memoryMiB":8192}}}'

The controller will:

  1. Detect resource changes in VM spec
  2. Attempt online reconfiguration if supported
  3. If offline required, orchestrate graceful power cycle:
    • Set condition ReconfigurePendingPowerCycle=True
    • Power off VM gracefully
    • Apply reconfiguration
    • Power on VM
    • Update status.lastReconfigureTime

Disk Expansion

spec:
  disks:
    - name: data
      sizeGiB: 100  # Expanded from 50 GiB
      expandPolicy: "Online"  # Try online first

Snapshot Management

Creating Snapshots

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMSnapshot
metadata:
  name: pre-maintenance-backup
spec:
  vmRef:
    name: web-server
  nameHint: "maintenance-backup"
  memory: true  # Include memory state
  description: "Backup before maintenance"
  retentionPolicy:
    maxAge: "7d"
    deleteOnVMDelete: true

Snapshot Lifecycle

  1. Creating: Snapshot creation in progress
  2. Ready: Snapshot available for use
  3. Deleting: Snapshot being removed
  4. Failed: Snapshot operation failed

Reverting to Snapshots

# Patch VM to revert to snapshot
spec:
  snapshot:
    revertToRef:
      name: pre-maintenance-backup

The controller will:

  1. Power off VM if running
  2. Call provider’s SnapshotRevert RPC
  3. Power on VM
  4. Clear revertToRef when complete

VM Cloning

Basic Cloning

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClone
metadata:
  name: web-server-clone
spec:
  sourceRef:
    name: web-server
  target:
    name: web-server-test
    classRef:
      name: test-class
  linked: true  # Faster, space-efficient
  powerOn: true

Clone Customization

spec:
  customization:
    hostname: web-server-test
    networks:
      - name: primary
        ipAddress: "192.168.1.100"
        gateway: "192.168.1.1"
        dns: ["8.8.8.8"]
    userData:
      cloudInit:
        inline: |
          #cloud-config
          runcmd:
            - echo "Test environment" > /etc/motd

Multi-VM Sets (VMSet)

VMSets provide declarative management of multiple VMs with rolling updates.

Basic VMSet

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMSet
metadata:
  name: web-tier
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-server
  template:
    metadata:
      labels:
        app: web-server
    spec:
      providerRef:
        name: vsphere-prod
      classRef:
        name: web-class
      imageRef:
        name: nginx-image
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1

Rolling Updates

When you update the template spec, VMSet will:

  1. Create new VMs with updated configuration
  2. Wait for new VMs to be ready
  3. Delete old VMs respecting maxUnavailable
  4. Continue until all replicas are updated
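The per-pass bookkeeping behind those steps can be sketched as follows; the types are illustrative, not the controller's actual implementation:

```go
package main

import "fmt"

// rolloutState is an illustrative snapshot of a VMSet rollout.
type rolloutState struct {
    replicas       int // desired replica count
    readyOld       int // old-template VMs that are Ready
    readyNew       int // new-template VMs that are Ready
    maxUnavailable int
    maxSurge       int
}

// step returns how many new VMs to create and how many old VMs to delete in
// one reconcile pass while honoring maxSurge and maxUnavailable.
func step(s rolloutState) (create, del int) {
    total := s.readyOld + s.readyNew
    // Surge: up to replicas+maxSurge VMs may exist at once.
    if want := s.replicas + s.maxSurge - total; want > 0 {
        create = minInt(want, s.replicas-s.readyNew)
    }
    // Availability: keep at least replicas-maxUnavailable VMs Ready.
    if spare := total - (s.replicas - s.maxUnavailable); spare > 0 {
        del = minInt(spare, s.readyOld)
    }
    return create, del
}

func minInt(a, b int) int {
    if a < b {
        return a
    }
    return b
}

func main() {
    // 3 replicas, all on the old template, maxUnavailable=1, maxSurge=1
    // (as in the web-tier manifest above).
    c, d := step(rolloutState{replicas: 3, readyOld: 3, maxUnavailable: 1, maxSurge: 1})
    fmt.Println(c, d)
}
```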

Placement Policies

Advanced Placement Rules

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMPlacementPolicy
metadata:
  name: production-policy
spec:
  hard:
    clusters: ["prod-cluster-1", "prod-cluster-2"]
    datastores: ["ssd-datastore-1", "ssd-datastore-2"]
    hosts: ["esxi-01", "esxi-02", "esxi-03"]
  soft:
    folders: ["/Production/WebServers"]
    zones: ["zone-a", "zone-b"]
  antiAffinity:
    hostAntiAffinity: true      # Spread across hosts
    clusterAntiAffinity: false
    datastoreAntiAffinity: true # Spread across datastores

Using Placement Policies

spec:
  placementRef:
    name: production-policy

The provider will attempt to satisfy:

  1. Hard constraints: Must be satisfied
  2. Soft constraints: Best effort
  3. Anti-affinity rules: Avoid co-location
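The hard-then-soft evaluation can be sketched as a filter-and-prefer pass (anti-affinity omitted for brevity; types are illustrative, not the provider's actual placement code):

```go
package main

import "fmt"

// candidate is an illustrative placement target.
type candidate struct {
    Cluster, Host string
}

// policy carries a hard cluster constraint and a soft host preference.
type policy struct {
    hardClusters []string // must match
    softHosts    []string // best effort
}

func contains(xs []string, s string) bool {
    for _, x := range xs {
        if x == s {
            return true
        }
    }
    return false
}

// place filters candidates by hard constraints, then prefers those that also
// satisfy soft constraints; returns nil when no candidate is feasible.
func place(p policy, cs []candidate) *candidate {
    var feasible []candidate
    for _, c := range cs {
        if len(p.hardClusters) == 0 || contains(p.hardClusters, c.Cluster) {
            feasible = append(feasible, c)
        }
    }
    if len(feasible) == 0 {
        return nil // hard constraints unsatisfiable
    }
    for i, c := range feasible {
        if contains(p.softHosts, c.Host) {
            return &feasible[i] // soft preference satisfied
        }
    }
    return &feasible[0] // fall back to any feasible candidate
}

func main() {
    p := policy{hardClusters: []string{"prod-cluster-1"}, softHosts: []string{"esxi-02"}}
    got := place(p, []candidate{
        {"dev-cluster", "esxi-09"},
        {"prod-cluster-1", "esxi-01"},
        {"prod-cluster-1", "esxi-02"},
    })
    fmt.Println(got.Host)
}
```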

Image Preparation

Automated Image Import

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMImage
metadata:
  name: ubuntu-22-04
spec:
  vsphere:
    ovaURL: "https://releases.ubuntu.com/22.04/ubuntu-22.04-server.ova"
    checksum: "sha256:abcd1234..."
  libvirt:
    url: "https://cloud-images.ubuntu.com/22.04/ubuntu-22.04-server.img"
    format: "qcow2"
  prepare:
    onMissing: "Import"  # Auto-import if missing
    validateChecksum: true
    timeout: "30m"
    retries: 3
    storage:
      vsphere:
        datastore: "images-datastore"
        folder: "/Templates"
        thinProvisioned: true

Image Preparation Phases

  1. Pending: Waiting to start preparation
  2. Importing: Downloading/importing image
  3. Preparing: Processing image (conversion, etc.)
  4. Ready: Image ready for use
  5. Failed: Preparation failed

Provider Capabilities

Different providers support different features. Query capabilities:

# Example capabilities response
apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
status:
  capabilities:
    supportsReconfigureOnline: true      # vSphere: true, Libvirt: false
    supportsDiskExpansionOnline: true    # vSphere: true, Libvirt: false
    supportsSnapshots: true              # Both: true
    supportsMemorySnapshots: true        # vSphere: true, Libvirt: varies
    supportsLinkedClones: true           # Both: true
    supportsImageImport: true            # Both: true
    supportedDiskTypes: ["thin", "thick"]
    supportedNetworkTypes: ["VMXNET3", "E1000"]

Observability

Metrics

New metrics for advanced lifecycle operations:

virtrigaud_vm_reconfigure_total{provider_type,outcome}
virtrigaud_vm_snapshot_total{action,provider_type,outcome}
virtrigaud_vm_clone_total{linked,provider_type,outcome}
virtrigaud_vm_image_prepare_total{provider_type,outcome}

Events

Detailed events for lifecycle operations:

Normal   SnapshotCreating    Started snapshot creation
Normal   SnapshotReady       Snapshot created successfully
Normal   ReconfigureStarted  Started VM reconfiguration
Warning  ReconfigurePowerCycle  Reconfiguration requires power cycle
Normal   CloneCompleted      VM clone created successfully

Conditions

Comprehensive condition reporting:

VM Conditions:

  • Ready: VM is ready for use
  • Provisioning: VM is being created
  • Reconfiguring: VM is being reconfigured
  • ReconfigurePendingPowerCycle: Needs power cycle for changes

Snapshot Conditions:

  • Ready: Snapshot is ready
  • Creating: Snapshot being created
  • Deleting: Snapshot being deleted

Clone Conditions:

  • Ready: Clone completed successfully
  • Cloning: Clone operation in progress
  • Customizing: Applying customizations

Best Practices

Snapshot Management

  1. Retention Policies: Always set appropriate retention policies
  2. Memory Snapshots: Use sparingly due to storage overhead
  3. Cleanup: Implement automated cleanup for old snapshots
  4. Testing: Test snapshot revert procedures regularly

VM Reconfiguration

  1. Gradual Changes: Make incremental resource changes
  2. Monitoring: Monitor VM performance after changes
  3. Rollback Plan: Have snapshots before major changes
  4. Capacity Planning: Ensure host resources before scaling up

Placement Policies

  1. Start Simple: Begin with basic constraints
  2. Test Anti-Affinity: Verify rules work as expected
  3. Monitor Placement: Check actual VM placement matches policy
  4. Balance Performance: Don’t over-constrain placement

Multi-VM Operations

  1. Rolling Updates: Use appropriate maxUnavailable settings
  2. Health Checks: Implement proper readiness checks
  3. Monitoring: Monitor rollout progress
  4. Rollback Strategy: Plan for rollback scenarios

Troubleshooting

Common Issues

Reconfiguration Fails:

  • Check provider capabilities
  • Verify resource availability on host
  • Check for VM tools/agent issues

Snapshot Operations Fail:

  • Verify storage backend supports snapshots
  • Check available storage space
  • Ensure VM is not in transitional state

Clone Customization Issues:

  • Verify network configuration
  • Check cloud-init/guest tools
  • Validate IP address availability

Placement Policy Violations:

  • Check resource availability in target locations
  • Verify anti-affinity rules aren’t too restrictive
  • Review cluster resource distribution

Debugging

# Check VM reconfiguration status
kubectl describe vm web-server

# Monitor snapshot progress
kubectl get vmsnapshots -w

# Check clone status
kubectl describe vmclone web-server-clone

# Review placement policy usage
kubectl describe vmplacementpolicy production-policy

# Check VMSet rollout
kubectl describe vmset web-tier

Migration from Basic VMs

Existing VMs can be enhanced with advanced features:

  1. Add Placement Policy: Update VM spec with placementRef
  2. Enable Reconfiguration: Add resource overrides
  3. Create Snapshots: Deploy VMSnapshot resources
  4. Scale with VMSets: Migrate to VMSet for multi-instance workloads

The controller maintains backward compatibility with existing VM definitions.

Nested Virtualization Support

This document describes how to enable and configure nested virtualization in VirtRigaud virtual machines across different hypervisor providers.

Overview

Nested virtualization allows virtual machines to run hypervisors and create their own virtual machines. This is useful for:

  • Development and testing of virtualization software
  • Running container orchestration platforms like Kubernetes
  • Creating nested lab environments
  • Educational purposes for learning virtualization concepts

VirtRigaud supports nested virtualization through the PerformanceProfile configuration in VMClass resources.

Prerequisites

vSphere Provider

  • ESXi 6.0 or later
  • VM hardware version 9 or later (recommended: version 14+)
  • ESXi host must have VT-x/AMD-V enabled in BIOS
  • Sufficient CPU and memory resources on the ESXi host

LibVirt Provider

  • QEMU/KVM hypervisor
  • Host CPU with VT-x (Intel) or AMD-V (AMD) support
  • Nested virtualization enabled in host kernel modules
  • libvirt 1.2.13 or later

Proxmox Provider

  • Proxmox VE 6.0 or later
  • Host CPU with nested virtualization support
  • Nested virtualization enabled in Proxmox configuration

Enabling Nested Virtualization

Nested virtualization is configured at the VMClass level using the PerformanceProfile section:

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: nested-vm-class
  namespace: virtrigaud-system
spec:
  cpu: 4
  memory: 8Gi
  firmware: UEFI  # Recommended for modern features
  
  # Enable nested virtualization
  performanceProfile:
    nestedVirtualization: true
    # Optional: Enable additional features
    virtualizationBasedSecurity: true
    cpuHotAddEnabled: true
    memoryHotAddEnabled: true
  
  # Optional: Security features that work well with nested virtualization
  securityProfile:
    secureBoot: false  # May interfere with some nested hypervisors
    tpmEnabled: false  # Optional, depending on nested OS requirements
    vtdEnabled: true   # Enable VT-d/AMD-Vi for better performance
  
  diskDefaults:
    type: thin
    size: 100Gi  # Larger disk for nested VMs

Complete Example

Here’s a complete example showing how to create a VM with nested virtualization support:

---
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: hypervisor-class
  namespace: default
spec:
  cpu: 8
  memory: 16Gi
  firmware: UEFI
  
  performanceProfile:
    nestedVirtualization: true
    virtualizationBasedSecurity: false  # May conflict with nested hypervisors
    cpuHotAddEnabled: true
    memoryHotAddEnabled: true
    latencySensitivity: low  # Better performance for nested VMs
    hyperThreadingPolicy: prefer
  
  securityProfile:
    secureBoot: false  # Disable for compatibility
    tpmEnabled: false
    vtdEnabled: true   # Enable for better I/O performance
  
  resourceLimits:
    cpuReservation: 4000  # Reserve 4GHz for nested VMs
    memoryReservation: 8Gi
  
  diskDefaults:
    type: thin
    size: 200Gi
    storageClass: fast-ssd

---
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMImage
metadata:
  name: ubuntu-server-22-04
  namespace: default
spec:
  source:
    libvirt:
      url: "https://cloud-images.ubuntu.com/releases/22.04/release/ubuntu-22.04-server-cloudimg-amd64.img"
      checksum: "sha256:de5e632e17b8965f2baf4ea6d2b824788e154d9a65df4fd419ec4019898e15cd"

---
apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: nested-hypervisor
  namespace: default
spec:
  providerRef:
    name: my-provider
  classRef:
    name: hypervisor-class
  imageRef:
    name: ubuntu-server-22-04
  
  userData:
    cloudInit:
      inline: |
        #cloud-config
        hostname: nested-hypervisor
        users:
          - name: ubuntu
            sudo: ALL=(ALL) NOPASSWD:ALL
            ssh_authorized_keys:
              - ssh-rsa AAAAB3NzaC1yc2E... # Your SSH key
        
        packages:
          - qemu-kvm
          - libvirt-daemon-system
          - libvirt-clients
          - bridge-utils
          - virt-manager
        
        runcmd:
          # Verify nested virtualization support
          - echo "Checking nested virtualization support..."
          - cat /proc/cpuinfo | grep -E "(vmx|svm)"
          - ls -la /dev/kvm
          
          # Configure libvirt
          - systemctl enable libvirtd
          - systemctl start libvirtd
          - usermod -aG libvirt ubuntu
          
          # Verify nested KVM support
          - modprobe kvm_intel nested=1 || modprobe kvm_amd nested=1
          - echo "Nested virtualization setup complete"
  
  powerState: On

Provider-Specific Configuration

vSphere Provider

For vSphere, nested virtualization is enabled using the following VM configuration:

  • vhv.enable = TRUE - Enables hardware-assisted virtualization
  • vhv.allowNestedPageTables = TRUE - Improves nested VM performance
  • Hardware version 14+ recommended for best compatibility

Additional considerations:

  • Use UEFI firmware for modern guest operating systems
  • Ensure sufficient CPU and memory allocation
  • Consider enabling VT-d for better I/O performance

LibVirt Provider

For LibVirt/KVM, nested virtualization requires:

  • Host kernel modules: kvm_intel nested=1 or kvm_amd nested=1
  • CPU features: vmx (Intel) or svm (AMD) passed through to guest
  • QEMU machine type: q35 recommended for modern features

The LibVirt provider automatically configures the CPU feature matching the host vendor (vmx on Intel, svm on AMD):

<cpu mode='host-model' check='partial'>
  <feature policy='require' name='vmx'/>  <!-- Intel -->
  <feature policy='require' name='svm'/>  <!-- AMD -->
</cpu>

Proxmox Provider

For Proxmox VE, nested virtualization is configured through:

  • CPU type: host or kvm64 with nested features
  • Enable nested virtualization in VM CPU configuration
  • Ensure host has nested virtualization enabled

Verification

After creating a VM with nested virtualization enabled, verify the setup:

On Linux Guests

# Check for virtualization extensions
grep -E "(vmx|svm)" /proc/cpuinfo

# Verify KVM device availability
ls -la /dev/kvm

# Check nested virtualization status
cat /sys/module/kvm_intel/parameters/nested  # Intel
cat /sys/module/kvm_amd/parameters/nested    # AMD

# Test with a simple nested VM
virt-host-validate

On Windows Guests

# Check Hyper-V compatibility
systeminfo | findstr /i hyper

# Verify virtualization extensions
Get-ComputerInfo | Select-Object HyperV*

Performance Considerations

CPU Allocation

  • Allocate sufficient CPU cores (minimum 4, recommended 8+)
  • Consider CPU reservation for consistent performance
  • Enable CPU hot-add for flexibility

Memory Configuration

  • Allocate generous memory (minimum 8GB, recommended 16GB+)
  • Consider memory reservation for nested VMs
  • Enable memory hot-add for dynamic scaling

Storage

  • Use fast storage (SSD/NVMe) for better nested VM performance
  • Allocate sufficient disk space for multiple nested VMs
  • Consider thin provisioning for efficient space usage

Network

  • Configure appropriate network topology
  • Consider SR-IOV for high-performance networking
  • Plan IP address allocation for nested environments

Troubleshooting

Common Issues

  1. Nested virtualization not working

    • Verify host CPU supports VT-x/AMD-V
    • Check host BIOS settings
    • Ensure hypervisor nested virtualization is enabled
  2. Poor performance in nested VMs

    • Increase CPU and memory allocation
    • Enable CPU/memory reservations
    • Use faster storage
    • Verify nested page tables are enabled
  3. Guest OS doesn’t detect virtualization extensions

    • Check VM hardware version (vSphere)
    • Verify CPU feature passthrough (LibVirt)
    • Ensure proper CPU type configuration (Proxmox)

Debugging Commands

# Check virtualization support on host
lscpu | grep Virtualization

# Verify KVM nested support
cat /sys/module/kvm_*/parameters/nested

# Check VM CPU features (inside guest)
lscpu | grep -E "(vmx|svm|Virtualization)"

# Test nested VM creation
virt-install --name test-nested --memory 1024 --vcpus 1 --disk size=10 --cdrom /path/to/iso

Security Considerations

Isolation

  • Nested VMs add additional attack surface
  • Consider network isolation for nested environments
  • Implement proper access controls

Resource Limits

  • Set appropriate resource limits to prevent resource exhaustion
  • Monitor nested VM resource usage
  • Implement quotas for nested environments

Updates and Patches

  • Keep host hypervisor updated
  • Maintain guest hypervisor software
  • Apply security patches to nested VMs

Best Practices

  1. Planning

    • Design nested architecture carefully
    • Plan resource allocation in advance
    • Consider network topology requirements
  2. Configuration

    • Use UEFI firmware for modern features
    • Enable VT-d/AMD-Vi for better performance
    • Configure appropriate CPU and memory reservations
  3. Monitoring

    • Monitor resource usage at all levels
    • Set up alerting for resource exhaustion
    • Track performance metrics
  4. Maintenance

    • Regular backup of nested environments
    • Plan for hypervisor updates
    • Test disaster recovery procedures

Limitations

vSphere Provider

  • Requires ESXi 6.0+ and hardware version 9+
  • Performance overhead of 10-20% typical
  • Some advanced features may not be available in nested VMs

LibVirt Provider

  • Requires host kernel support
  • Performance depends on host CPU features
  • Limited to x86_64 architecture

Proxmox Provider

  • Requires Proxmox VE 6.0+
  • Performance overhead varies by workload
  • Some clustering features may not work in nested environments

Support Matrix

Provider  Min Version  Nested Support  Performance  Security Features
vSphere   ESXi 6.0     Full            Good         TPM, Secure Boot
LibVirt   1.2.13       Full            Good         TPM, Secure Boot
Proxmox   PVE 6.0      Planned         Good         Limited

For more information, see the provider-specific documentation in the docs/providers/ directory.

Graceful Shutdown Feature

The virtrigaud VM management platform now supports graceful shutdown of virtual machines to prevent data corruption and ensure proper cleanup of running processes.

Overview

Graceful shutdown uses VM guest tools (VMware Tools, QEMU Guest Agent, etc.) to properly shut down the operating system before powering off the virtual machine. This prevents data corruption and allows applications to save their state properly.

Power States

virtrigaud supports three power states:

  • On: Power on the VM
  • Off: Hard power off (immediate shutdown without guest OS notification)
  • OffGraceful: Graceful shutdown using guest tools with automatic fallback to hard power off

Configuration

Basic Usage

apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: my-vm
spec:
  powerState: OffGraceful  # Use graceful shutdown
  # ... other configuration

Advanced Configuration with Lifecycle Hooks

apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: my-vm
spec:
  powerState: OffGraceful
  
  lifecycle:
    # Timeout for graceful shutdown (default: 60s)
    gracefulShutdownTimeout: "120s"
    
    # Pre-stop hook runs before shutdown
    preStop:
      exec:
        command:
          - "/bin/bash"
          - "-c"
          - |
            # Save application state
            systemctl stop my-application
            # Sync filesystem
            sync

How It Works

vSphere Provider

  1. Guest Tools Check: Verifies VMware Tools is installed and running
  2. Graceful Shutdown: Calls vm.ShutdownGuest() to initiate OS shutdown
  3. Monitoring: Polls VM power state every 2 seconds
  4. Timeout Handling: Falls back to hard power off if timeout is reached
  5. Fallback: Uses vm.PowerOff() if graceful shutdown fails

Libvirt Provider

  1. Graceful Attempt: Uses virsh shutdown command
  2. Fallback: Falls back to virsh destroy if shutdown fails
  3. Guest Agent: Requires QEMU Guest Agent for best results

Proxmox Provider

  1. API Call: Uses Proxmox shutdown API endpoint
  2. Built-in Timeout: Proxmox handles timeout and fallback internally

Default Timeouts

  • vSphere: 60 seconds (configurable via gRPC request)
  • Libvirt: Immediate fallback if virsh shutdown fails
  • Proxmox: Managed by Proxmox server configuration

Requirements

VMware vSphere

  • VMware Tools must be installed and running in the guest OS
  • Guest OS must support ACPI shutdown signals

Libvirt/KVM

  • QEMU Guest Agent recommended for reliable graceful shutdown
  • Guest OS must support ACPI shutdown signals

Proxmox

  • QEMU Guest Agent recommended
  • Guest OS must support ACPI shutdown signals

Best Practices

  1. Always Install Guest Tools: Ensure VMware Tools or QEMU Guest Agent is installed
  2. Test Graceful Shutdown: Verify your VMs respond properly to shutdown signals
  3. Set Appropriate Timeouts: Allow enough time for applications to shut down gracefully
  4. Use Lifecycle Hooks: Implement pre-stop hooks for critical applications
  5. Monitor Logs: Check provider logs to verify graceful shutdown is working

Troubleshooting

Graceful Shutdown Not Working

  1. Check Guest Tools Status:

    # For VMware
    vmware-toolbox-cmd stat running
    
    # For QEMU/KVM
    systemctl status qemu-guest-agent
    
  2. Verify ACPI Support:

    # Check if ACPI shutdown is supported
    cat /proc/acpi/button/power/*/info
    
  3. Test Manual Shutdown:

    # Test graceful shutdown manually
    sudo shutdown -h now
    

Timeout Issues

If VMs consistently hit the graceful shutdown timeout:

  1. Increase Timeout: Set a longer gracefulShutdownTimeout
  2. Optimize Applications: Ensure applications shut down quickly
  3. Check System Resources: Verify the system isn’t under heavy load

Fallback to Hard Power Off

The provider will automatically fall back to hard power off if:

  • Guest tools are not available
  • Graceful shutdown times out
  • Guest tools command fails

This ensures VMs are always powered off even if graceful shutdown isn’t possible.

Examples

See examples/graceful-shutdown-vm.yaml for complete examples of using graceful shutdown with various configurations.

Provider Architecture

This document describes the provider architecture in VirtRigaud.

Overview

VirtRigaud uses a Remote Provider architecture where providers run as independent pods, communicating with the manager controller via gRPC. This design provides scalability, security, and reliability benefits.

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  VirtualMachine β”‚    β”‚     Provider      β”‚    β”‚ Provider Runtimeβ”‚
β”‚      CRD        β”‚    β”‚       CRD         β”‚    β”‚   Deployment    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                        β”‚                        β”‚
         β”‚                        β”‚                        β”‚
         v                        v                        β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”‚
β”‚    Manager      β”‚    β”‚ Provider          β”‚              β”‚
β”‚   Controller    β”‚    β”‚ Controller        β”‚              β”‚
β”‚                 β”‚    β”‚                   β”‚              β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚ - Creates Deploy  β”‚              β”‚
β”‚   β”‚ VM Reconcileβ”‚    β”‚ - Creates Service β”‚              β”‚
β”‚   β”‚             β”‚    β”‚ - Updates Status  β”‚              β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚                   β”‚              β”‚
β”‚                 β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                       β”‚
β”‚   β”‚ gRPC Client β”‚β—„β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚   β”‚             β”‚        gRPC Connection
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        Port 9090
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Provider Components

1. Provider Runtime Deployments

Each Provider resource automatically creates:

  • Deployment: Runs provider-specific containers
  • Service: ClusterIP service for gRPC communication
  • ConfigMaps: Provider configuration
  • Secret mounts: Credentials for hypervisor access

Configuration Flow: Provider Resource β†’ Provider Pod

The VirtRigaud Provider Controller automatically translates your Provider resource configuration into the appropriate command-line arguments and environment variables for the provider pod.

Command-Line Arguments

The controller generates these arguments from your Provider spec:

Provider Field             Generated Argument    Example
spec.type                  --provider-type       --provider-type=vsphere
spec.endpoint              --provider-endpoint   --provider-endpoint=https://vcenter.example.com
spec.runtime.service.port  --grpc-addr           --grpc-addr=:9090
(hardcoded)                --metrics-addr        --metrics-addr=:8080
(optional)                 --tls-enabled         --tls-enabled=false

Environment Variables

The controller also sets these environment variables:

Provider Field      Environment Variable  Example
spec.type           PROVIDER_TYPE         vsphere
spec.endpoint       PROVIDER_ENDPOINT     https://vcenter.example.com
metadata.namespace  PROVIDER_NAMESPACE    default
metadata.name       PROVIDER_NAME         vsphere-datacenter
(optional)          TLS_ENABLED           false

Secret Volume Mounts

Credentials from spec.credentialSecretRef are automatically mounted at:

  • Mount Path: /etc/virtrigaud/credentials/
  • Files Created: Each secret key becomes a file
    • username β†’ /etc/virtrigaud/credentials/username
    • password β†’ /etc/virtrigaud/credentials/password
    • token β†’ /etc/virtrigaud/credentials/token

Complete Example

When you create this Provider resource:

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: vsphere-datacenter
  namespace: default
spec:
  type: vsphere
  endpoint: "https://vcenter.example.com:443"
  credentialSecretRef:
    name: vsphere-credentials
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-vsphere:v0.2.0"
    service:
      port: 9090

The controller automatically creates a deployment with:

Command-line arguments:

/provider-vsphere \
  --grpc-addr=:9090 \
  --metrics-addr=:8080 \
  --provider-type=vsphere \
  --provider-endpoint=https://vcenter.example.com:443 \
  --tls-enabled=false

Environment variables:

PROVIDER_TYPE=vsphere
PROVIDER_ENDPOINT=https://vcenter.example.com:443
PROVIDER_NAMESPACE=default
PROVIDER_NAME=vsphere-datacenter
TLS_ENABLED=false

Volume mounts:

/etc/virtrigaud/credentials/username  # Contains: admin@vsphere.local
/etc/virtrigaud/credentials/password  # Contains: your-password

✅ Key Point: You Don’t Configure This Manually

The beauty of VirtRigaud’s Remote Provider architecture is that you never need to manually configure command-line arguments or environment variables. Simply create the Provider resource, and the controller handles all the deployment details automatically!

2. Provider Images

Specialized images for each provider type:

  • ghcr.io/projectbeskar/virtrigaud/provider-vsphere: vSphere provider with govmomi
  • ghcr.io/projectbeskar/virtrigaud/provider-libvirt: LibVirt provider via virsh commands
  • ghcr.io/projectbeskar/virtrigaud/provider-proxmox: Proxmox VE provider
  • ghcr.io/projectbeskar/virtrigaud/provider-mock: Mock provider for testing

3. gRPC Communication

  • Protocol: gRPC with protocol buffers
  • Security: Secure communication over TLS (optional)
  • Health: Built-in health checks and graceful shutdown
  • Metrics: Prometheus metrics on port 8080

Provider Configuration

Basic Provider Setup

apiVersion: v1
kind: Secret
metadata:
  name: vsphere-credentials
  namespace: default
type: Opaque
stringData:
  username: "admin@vsphere.local"
  password: "your-password"

---
apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: vsphere-datacenter
  namespace: default
spec:
  type: vsphere
  endpoint: "https://vcenter.example.com:443"
  credentialSecretRef:
    name: vsphere-credentials
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-vsphere:v0.2.0"
    service:
      port: 9090

Advanced Configuration

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: libvirt-cluster
  namespace: production
spec:
  type: libvirt
  endpoint: "qemu+ssh://admin@kvm.example.com/system"
  credentialSecretRef:
    name: libvirt-credentials
  defaults:
    cluster: production
  rateLimit:
    qps: 20
    burst: 50
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-libvirt:v0.2.0"
    replicas: 3
    
    service:
      port: 9090
      
    resources:
      requests:
        cpu: "200m"
        memory: "256Mi"
      limits:
        cpu: "2"
        memory: "2Gi"
        
    # High availability setup
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/instance: libvirt-cluster
          topologyKey: kubernetes.io/hostname
          
    # Node placement
    nodeSelector:
      workload-type: compute
      
    tolerations:
    - key: "compute-dedicated"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
      
    # Environment variables
    env:
    - name: LIBVIRT_DEBUG
      value: "1"
    - name: PROVIDER_TIMEOUT
      value: "300s"

Security Model

Pod Security

  • Non-root execution: All containers run as non-root users
  • Read-only filesystem: Immutable container filesystem
  • Minimal capabilities: Reduced Linux capabilities
  • Security contexts: Enforced via deployment templates

Credential Isolation

  • Separated secrets: Each provider has dedicated credential secrets
  • Scoped access: Providers only access their own hypervisor credentials
  • RBAC isolation: Fine-grained RBAC per provider namespace

Network Security

  • Service mesh ready: Compatible with Istio/Linkerd
  • Network policies: Optional traffic restrictions
  • TLS support: Secure gRPC communication (configurable)

Communication Protocol

gRPC Service Definition

service Provider {
  rpc Validate(ValidateRequest) returns (ValidateResponse);
  rpc Create(CreateRequest) returns (CreateResponse);
  rpc Delete(DeleteRequest) returns (TaskResponse);
  rpc Power(PowerRequest) returns (TaskResponse);
  rpc Reconfigure(ReconfigureRequest) returns (TaskResponse);
  rpc Describe(DescribeRequest) returns (DescribeResponse);
  rpc TaskStatus(TaskStatusRequest) returns (TaskStatusResponse);
  rpc ListCapabilities(CapabilitiesRequest) returns (CapabilitiesResponse);
}

Error Handling

  • Retry logic: Exponential backoff for transient failures
  • Circuit breakers: Prevent cascade failures
  • Timeout controls: Configurable per-operation timeouts
  • Status reporting: Conditions reflected in Kubernetes status

Observability

Metrics

Provider pods expose Prometheus metrics on port 8080:

# Request metrics
provider_grpc_requests_total{method="Create",status="success"} 42
provider_grpc_request_duration_seconds_bucket{method="Create",le="5"} 40

# VM metrics  
provider_vms_total{state="running"} 15
provider_vms_total{state="stopped"} 3

# Health metrics
provider_health_status{provider="vsphere-datacenter"} 1
provider_hypervisor_connection_status{endpoint="vcenter.example.com"} 1

Logging

  • Structured logs: JSON format with correlation IDs
  • Log levels: Configurable verbosity (debug, info, warn, error)
  • Request tracing: Context propagation across gRPC calls

Health Checks

  • Kubernetes probes: Liveness and readiness probes
  • gRPC health protocol: Standard health check implementation
  • Hypervisor connectivity: Validates connection to external systems

Deployment Patterns

Single Provider Setup

# Simple development setup
apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: dev-vsphere
spec:
  type: vsphere
  endpoint: "https://vcenter-dev.example.com:443"
  credentialSecretRef:
    name: dev-credentials
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-vsphere:v0.2.0"

High Availability Setup

# Production HA setup
apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: prod-vsphere
spec:
  type: vsphere
  endpoint: "https://vcenter-prod.example.com:443"
  credentialSecretRef:
    name: prod-credentials
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-vsphere:v0.2.0"
    replicas: 3
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/instance: prod-vsphere
          topologyKey: kubernetes.io/hostname

Multi-Environment Setup

# Development environment
apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: dev-libvirt
  namespace: development
spec:
  type: libvirt
  endpoint: "qemu+ssh://dev@libvirt-dev.example.com/system"
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-libvirt:v0.2.0"
    resources:
      requests:
        cpu: "100m"
        memory: "128Mi"

---
# Production environment  
apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: prod-libvirt
  namespace: production
spec:
  type: libvirt
  endpoint: "qemu+ssh://prod@libvirt-prod.example.com/system"
  runtime:
    mode: Remote
    image: "ghcr.io/projectbeskar/virtrigaud/provider-libvirt:v0.2.0"
    replicas: 2
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "2"
        memory: "2Gi"

Benefits

Scalability

  • Horizontal scaling: Multiple provider replicas per hypervisor
  • Resource isolation: Independent resource allocation per provider
  • Load distribution: gRPC load balancing across provider instances

Security

  • Credential isolation: Hypervisor credentials isolated to provider pods
  • Network segmentation: Providers can run in separate namespaces
  • Least privilege: Manager runs without direct hypervisor access

Reliability

  • Fault isolation: Provider failures don’t affect the manager
  • Independent updates: Provider images updated separately
  • Circuit breaking: Automatic failure detection and recovery

Operational Excellence

  • Rolling updates: Zero-downtime provider updates
  • Health monitoring: Built-in health checks and metrics
  • Debugging: Isolated provider logs and observability

Troubleshooting

Common Issues

  1. Image Pull Failures

    # Check image availability
    docker pull ghcr.io/projectbeskar/virtrigaud/provider-vsphere:v0.2.0
    
    # Verify imagePullSecrets if using private registry
    kubectl get secret regcred -o yaml
    
  2. Network Connectivity

    # Test provider service
kubectl get svc -l app.kubernetes.io/name=virtrigaud-provider
    
    # Check provider pod logs
    kubectl logs -l app.kubernetes.io/name=virtrigaud-provider
    
  3. Credential Issues

    # Verify secret exists and is mounted
    kubectl get secret vsphere-credentials
kubectl describe pod -l app.kubernetes.io/name=virtrigaud-provider
    

Debugging Commands

# Check provider status
kubectl describe provider vsphere-datacenter

# Check provider deployment
kubectl get deployment -l app.kubernetes.io/instance=vsphere-datacenter

# Check provider pods
kubectl get pods -l app.kubernetes.io/instance=vsphere-datacenter

# View provider logs
kubectl logs -l app.kubernetes.io/instance=vsphere-datacenter -f

# Check provider metrics
kubectl port-forward svc/virtrigaud-provider-vsphere-datacenter 8080:8080
curl http://localhost:8080/metrics

Performance Tuning

# Optimize for high-volume workloads
spec:
  rateLimit:
    qps: 100        # Increase API rate limit
    burst: 200      # Allow burst capacity
  runtime:
    replicas: 5     # Scale out for throughput
    resources:
      requests:
        cpu: "1"    # Guarantee CPU resources
        memory: "1Gi"
      limits:
        cpu: "4"    # Allow burst CPU
        memory: "4Gi"

Best Practices

Resource Management

  • Right-sizing: Start with small requests, monitor and adjust
  • Limits: Always set memory limits to prevent OOM kills
  • QoS: Use Guaranteed QoS for production workloads

Security

  • Secrets rotation: Implement regular credential rotation
  • Network policies: Restrict provider-to-hypervisor traffic
  • RBAC: Use dedicated service accounts per provider

Monitoring

  • Alerting: Set up alerts on provider health metrics
  • Dashboards: Create Grafana dashboards for provider metrics
  • Log aggregation: Centralize logs for debugging and auditing

Migration and Upgrades

Provider Image Updates

# Update provider image
kubectl patch provider vsphere-datacenter --type=merge -p '
{
  "spec": {
    "runtime": {
      "image": "ghcr.io/projectbeskar/virtrigaud/provider-vsphere:v0.2.0"
    }
  }
}'

# Monitor rollout
kubectl rollout status deployment virtrigaud-provider-vsphere-datacenter

Configuration Changes

# Update provider configuration
kubectl edit provider vsphere-datacenter

# Verify changes applied
kubectl describe provider vsphere-datacenter

VirtRigaud Observability Guide

This document describes the comprehensive observability features of VirtRigaud, including structured logging, metrics, tracing, and monitoring.

Overview

VirtRigaud provides production-grade observability through:

  • Structured JSON Logging with correlation IDs and automatic secret redaction
  • Comprehensive Prometheus Metrics for all components and operations
  • OpenTelemetry Tracing with gRPC instrumentation
  • Health Endpoints for liveness and readiness probes
  • Grafana Dashboards for visualization
  • Prometheus Alerts for proactive monitoring

Logging

Configuration

Configure logging via environment variables:

LOG_LEVEL=info              # debug, info, warn, error
LOG_FORMAT=json             # json or console
LOG_SAMPLING=true           # Enable log sampling
LOG_DEVELOPMENT=false       # Development mode

Correlation IDs

All log entries include correlation fields:

{
  "level": "info",
  "ts": "2025-01-27T10:30:45.123Z",
  "msg": "VM operation started",
  "correlationID": "req-12345",
  "vm": "default/web-server-1",
  "provider": "default/vsphere-prod",
  "providerType": "vsphere",
  "taskRef": "task-67890",
  "reconcile": "uuid-abcdef"
}

Secret Redaction

Sensitive information is automatically redacted:

{
  "msg": "Connecting to provider",
  "endpoint": "vcenter://user:[REDACTED]@vc.example.com/Datacenter",
  "userData": "[REDACTED]"
}
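A minimal sketch of how URL-embedded credentials can be masked before logging (this is an illustration of the idea, not VirtRigaud's actual redaction code; the regex and helper name are assumptions):

```go
package main

import (
	"fmt"
	"regexp"
)

// userinfoPassword matches the password portion of user:password@host URLs.
var userinfoPassword = regexp.MustCompile(`(://[^:/@]+:)[^@]+@`)

// redactEndpoint masks embedded credentials before an endpoint is logged.
func redactEndpoint(endpoint string) string {
	return userinfoPassword.ReplaceAllString(endpoint, "${1}[REDACTED]@")
}

func main() {
	fmt.Println(redactEndpoint("vcenter://user:s3cret@vc.example.com/Datacenter"))
	// → vcenter://user:[REDACTED]@vc.example.com/Datacenter
}
```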

Metrics Catalog

Manager Metrics

| Metric | Type | Description | Labels |
|---|---|---|---|
| virtrigaud_manager_reconcile_total | Counter | Total reconcile operations | kind, outcome |
| virtrigaud_manager_reconcile_duration_seconds | Histogram | Reconcile duration | kind |
| virtrigaud_queue_depth | Gauge | Work queue depth | kind |

Provider Metrics

| Metric | Type | Description | Labels |
|---|---|---|---|
| virtrigaud_provider_rpc_requests_total | Counter | RPC requests | provider_type, method, code |
| virtrigaud_provider_rpc_latency_seconds | Histogram | RPC latency | provider_type, method |
| virtrigaud_provider_tasks_inflight | Gauge | Inflight tasks | provider_type, provider |

VM Operation Metrics

| Metric | Type | Description | Labels |
|---|---|---|---|
| virtrigaud_vm_operations_total | Counter | VM operations | operation, provider_type, provider, outcome |
| virtrigaud_ip_discovery_duration_seconds | Histogram | IP discovery time | provider_type |

Circuit Breaker Metrics

| Metric | Type | Description | Labels |
|---|---|---|---|
| virtrigaud_circuit_breaker_state | Gauge | CB state (0=closed, 1=half-open, 2=open) | provider_type, provider |
| virtrigaud_circuit_breaker_failures_total | Counter | CB failures | provider_type, provider |

Error Metrics

| Metric | Type | Description | Labels |
|---|---|---|---|
| virtrigaud_errors_total | Counter | Errors by reason | reason, component |

Tracing

Configuration

Enable OpenTelemetry tracing:

VIRTRIGAUD_TRACING_ENABLED=true
VIRTRIGAUD_TRACING_ENDPOINT=http://jaeger:14268/api/traces
VIRTRIGAUD_TRACING_SAMPLING_RATIO=0.1
VIRTRIGAUD_TRACING_INSECURE=true

Span Structure

Key spans include:

  • vm.reconcile - Full VM reconciliation
  • vm.create - VM creation operation
  • provider.validate - Provider validation
  • rpc.Create - gRPC calls to providers

Trace Attributes

Standard attributes:

vm.namespace = "default"
vm.name = "web-server-1"
provider.type = "vsphere"
operation = "Create"
task.ref = "task-12345"

Health Endpoints

HTTP Endpoints

All components expose health endpoints on port 8080:

  • GET /healthz - Liveness probe (always returns 200)
  • GET /readyz - Readiness probe (checks dependencies)
  • GET /health - Detailed health status (JSON)

gRPC Health

Providers implement grpc.health.v1.Health service for health checks.

Grafana Dashboards

Manager Dashboard

  • Reconcile rates and duration
  • Queue depth monitoring
  • Error rate tracking
  • Resource usage (CPU/memory)

Provider Dashboard

  • RPC latency and error rates
  • Task monitoring
  • Circuit breaker status
  • Provider-specific metrics

VM Lifecycle Dashboard

  • Creation success rates
  • IP discovery times
  • Failure analysis
  • Provider comparison

Prometheus Alerts

Critical Alerts

  • VirtrigaudProviderDown - Provider unavailable
  • VirtrigaudManagerDown - Manager unavailable

Warning Alerts

  • VirtrigaudProviderErrorRateHigh - High error rate (>50%)
  • VirtrigaudReconcileStuck - Slow reconciles (>5min)
  • VirtrigaudQueueBackedUp - Queue depth >100
  • VirtrigaudCircuitBreakerOpen - CB protection active
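As an illustration, the high-error-rate alert could be expressed as a PrometheusRule like the one below. This is a sketch based on the metrics catalog above (the `code!="OK"` matcher is an assumption about the label's values); the shipped rules live in deploy/observability/prometheus/alerts.yaml.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: virtrigaud-provider-alerts
spec:
  groups:
  - name: virtrigaud.provider
    rules:
    - alert: VirtrigaudProviderErrorRateHigh
      expr: |
        sum(rate(virtrigaud_provider_rpc_requests_total{code!="OK"}[5m])) by (provider)
          / sum(rate(virtrigaud_provider_rpc_requests_total[5m])) by (provider) > 0.5
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Provider {{ $labels.provider }} error rate above 50%"
```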

Configuration Reference

Complete Environment Variables

# Logging
LOG_LEVEL=info
LOG_FORMAT=json
LOG_SAMPLING=true
LOG_DEVELOPMENT=false

# Tracing
VIRTRIGAUD_TRACING_ENABLED=false
VIRTRIGAUD_TRACING_ENDPOINT=""
VIRTRIGAUD_TRACING_SAMPLING_RATIO=0.1
VIRTRIGAUD_TRACING_INSECURE=true

# RPC Timeouts
RPC_TIMEOUT_DESCRIBE=30s
RPC_TIMEOUT_MUTATING=4m
RPC_TIMEOUT_VALIDATE=10s
RPC_TIMEOUT_TASK_STATUS=10s

# Retry Configuration
RETRY_MAX_ATTEMPTS=5
RETRY_BASE_DELAY=500ms
RETRY_MAX_DELAY=30s
RETRY_MULTIPLIER=2.0
RETRY_JITTER=true

# Circuit Breaker
CB_FAILURE_THRESHOLD=10
CB_RESET_SECONDS=60s
CB_HALF_OPEN_MAX_CALLS=3

# Rate Limiting
RATE_LIMIT_QPS=10
RATE_LIMIT_BURST=20

# Workers
WORKERS_PER_KIND=2
MAX_INFLIGHT_TASKS=100

# Feature Gates
FEATURE_GATES=""

# Performance
VIRTRIGAUD_PPROF_ENABLED=false
VIRTRIGAUD_PPROF_ADDR=:6060

Deployment

ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: virtrigaud-manager
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: virtrigaud
  endpoints:
  - port: metrics
    interval: 30s

PrometheusRule

Deploy alerts:

kubectl apply -f deploy/observability/prometheus/alerts.yaml

Grafana Dashboards

Import dashboards from deploy/observability/grafana/

Troubleshooting

High Error Rates

  1. Check provider health: kubectl get providers
  2. Review error metrics: virtrigaud_errors_total
  3. Check circuit breaker state
  4. Review provider logs

Slow Operations

  1. Check RPC latency metrics
  2. Review reconcile duration
  3. Check resource constraints
  4. Monitor task queue depth

Memory Issues

  1. Monitor process_resident_memory_bytes
  2. Check for goroutine leaks: go_goroutines
  3. Review heap usage: go_memstats_heap_inuse_bytes

Security Policy

Supported Versions

We actively support the following versions of VirtRigaud with security updates:

| Version | Supported |
|---|---|
| 0.1.x | :white_check_mark: |
| < 0.1 | :x: |

Reporting a Vulnerability

The VirtRigaud team takes security vulnerabilities seriously. We appreciate your efforts to responsibly disclose your findings, and will make every effort to acknowledge your contributions.

How to Report

Please do not report security vulnerabilities through public GitHub issues.

Instead, please send an email to security@virtrigaud.io with the following information:

  • A description of the vulnerability
  • Steps to reproduce the issue
  • Potential impact
  • Any possible mitigations you’ve identified

You should receive a response within 48 hours. If for some reason you do not, please follow up via email to ensure we received your original message.

What to Expect

  • Acknowledgment: We will acknowledge receipt of your vulnerability report within 48 hours.
  • Assessment: We will assess the vulnerability and determine its severity within 5 business days.
  • Mitigation: For confirmed vulnerabilities, we will work on a fix and coordinate disclosure timeline with you.
  • Recognition: We will credit you in our security advisory and release notes (unless you prefer to remain anonymous).

Disclosure Policy

  • We ask that you do not publicly disclose the vulnerability until we have had a chance to address it.
  • We will coordinate with you on an appropriate disclosure timeline.
  • We typically aim to disclose within 90 days of initial report.

Security Considerations

General Security

  • VirtRigaud runs with minimal privileges and follows security best practices
  • All communications with providers use TLS encryption
  • Sensitive data (credentials, user data) is properly handled and never logged
  • RBAC is enforced to limit access to resources

Supply Chain Security

  • All container images are signed with Cosign
  • Software Bill of Materials (SBOM) is provided for all releases
  • Container images are scanned for vulnerabilities
  • Dependencies are regularly updated

Network Security

  • Network policies are provided to restrict traffic
  • mTLS is supported for provider communications
  • No unnecessary ports are exposed

Access Control

  • RBAC roles follow principle of least privilege
  • Service accounts are properly scoped
  • Admission webhooks enforce security policies

Vulnerability Management

Scanning

We regularly scan our codebase and dependencies for known vulnerabilities using:

  • GitHub Security Advisories
  • Trivy for container scanning
  • Go vulnerability database
  • OWASP dependency checking

Response Process

  1. Detection: Vulnerability discovered through scanning or reporting
  2. Assessment: Determine severity and impact
  3. Patching: Develop and test fix
  4. Release: Create security release with patch
  5. Notification: Inform users through security advisory

Severity Classification

We use the following severity levels:

  • Critical: Immediate action required, patch within 24 hours
  • High: Patch within 7 days
  • Medium: Patch within 30 days
  • Low: Patch in next regular release

Security Features

Authentication and Authorization

  • Integration with Kubernetes RBAC
  • Support for external identity providers
  • Service account token projection
  • Webhook authentication

Encryption

  • TLS 1.2+ for all communications
  • Certificate rotation and management
  • Support for custom CA certificates
  • Secrets encryption at rest (Kubernetes level)

Audit and Monitoring

  • Comprehensive audit logging
  • Security event monitoring
  • Metrics for security-relevant events
  • Integration with security monitoring tools

Best Practices for Users

Deployment Security

  1. Use namespace isolation: Deploy in dedicated namespace
  2. Apply network policies: Restrict network access
  3. Enable Pod Security Standards: Use strict or baseline profiles
  4. Regular updates: Keep VirtRigaud and dependencies updated
  5. Monitor security advisories: Subscribe to security notifications

Credential Management

  1. Use external secret management: HashiCorp Vault, External Secrets Operator
  2. Rotate credentials regularly: Implement credential rotation
  3. Principle of least privilege: Grant minimal required permissions
  4. Secure storage: Never store credentials in Git or plain text

Network Security

  1. Enable TLS: Use TLS for all communications
  2. Network segmentation: Isolate provider networks
  3. Firewall rules: Restrict hypervisor access
  4. VPN access: Use VPN for remote hypervisor access

Monitoring and Alerting

  1. Security monitoring: Monitor for security events
  2. Failed authentication alerts: Alert on authentication failures
  3. Unusual activity: Monitor for unexpected behavior
  4. Compliance scanning: Regular security scans

Compliance

VirtRigaud is designed to support compliance with various security frameworks:

  • SOC 2: Control implementation guidance available
  • ISO 27001: Security control mapping provided
  • CIS Kubernetes Benchmark: Alignment with security benchmarks
  • NIST Cybersecurity Framework: Control implementation guidance

Security Tools and Integrations

Supported Security Tools

  • Falco: Runtime security monitoring
  • OPA Gatekeeper: Policy enforcement
  • Twistlock/Prisma: Container security scanning
  • Aqua Security: Container and runtime security
  • Cilium: Network security and observability

Security Configurations

Example security-hardened configurations are provided in:

  • examples/security/strict-rbac.yaml
  • examples/security/network-policies.yaml
  • examples/security/pod-security-policies.yaml
  • examples/security/external-secrets.yaml

Contact

For security-related questions that are not vulnerabilities, you can:

  • Open a GitHub Discussion in the Security category
  • Email security@virtrigaud.io
  • Join the #virtrigaud-security channel on Kubernetes Slack

Recognition

We maintain a security hall of fame for researchers who have helped improve VirtRigaud security.

Thank you to all the security researchers who have contributed to making VirtRigaud more secure!

VirtRigaud Resilience Guide

This document describes the resilience patterns and error handling mechanisms in VirtRigaud.

Overview

VirtRigaud implements comprehensive resilience patterns:

  • Error Taxonomy - Structured error classification
  • Circuit Breakers - Protection against cascading failures
  • Exponential Backoff - Intelligent retry strategies
  • Timeout Policies - Prevent resource exhaustion
  • Rate Limiting - Provider protection

Error Taxonomy

Error Types

VirtRigaud classifies all errors into specific categories:

| Type | Retryable | Description | Example |
|---|---|---|---|
| NotFound | No | Resource doesn’t exist | VM not found |
| InvalidSpec | No | Invalid configuration | Malformed VM spec |
| Unauthorized | No | Authentication failed | Invalid credentials |
| NotSupported | No | Unsupported operation | Feature not available |
| Retryable | Yes | Transient error | Network timeout |
| Unavailable | Yes | Service unavailable | Provider down |
| RateLimit | Yes | Rate limited | API quota exceeded |
| Timeout | Yes | Operation timeout | Long-running task |
| QuotaExceeded | No | Resource quota hit | Storage full |
| Conflict | No | Resource conflict | Duplicate name |

Error Creation

import (
    "errors"

    "github.com/projectbeskar/virtrigaud/internal/providers/contracts"
)

// Create specific error types
err := contracts.NewNotFoundError("VM not found", originalErr)
err = contracts.NewRetryableError("Network timeout", originalErr)
err = contracts.NewUnavailableError("Provider unavailable", originalErr)

// Check whether an error is retryable (errors.As also matches wrapped errors)
var providerErr *contracts.ProviderError
if errors.As(err, &providerErr) && providerErr.IsRetryable() {
    // Retry the operation
}

Circuit Breaker Pattern

Configuration

import "github.com/projectbeskar/virtrigaud/internal/resilience"

config := &resilience.Config{
    FailureThreshold: 10,              // Open after 10 failures
    ResetTimeout:     60 * time.Second, // Try again after 60s
    HalfOpenMaxCalls: 3,               // Allow 3 test calls
}

cb := resilience.NewCircuitBreaker("provider-vsphere", "vsphere", "prod", config)

Usage

err := cb.Call(ctx, func(ctx context.Context) error {
    // Call the potentially failing operation
    return provider.Create(ctx, request)
})

if err != nil {
    // Handle error (may be circuit breaker protection)
    log.Error(err, "Operation failed")
}

States

  1. Closed - Normal operation, failures are counted
  2. Open - Fast-fail mode, requests are rejected immediately
  3. Half-Open - Testing mode, limited requests allowed

Metrics

Circuit breaker state is exposed via metrics:

virtrigaud_circuit_breaker_state{provider_type="vsphere",provider="prod"} 0
virtrigaud_circuit_breaker_failures_total{provider_type="vsphere",provider="prod"} 5

Retry Strategies

Exponential Backoff

import "github.com/projectbeskar/virtrigaud/internal/resilience"

config := &resilience.RetryConfig{
    MaxAttempts: 5,
    BaseDelay:   500 * time.Millisecond,
    MaxDelay:    30 * time.Second,
    Multiplier:  2.0,
    Jitter:      true,
}

err := resilience.Retry(ctx, config, func(ctx context.Context, attempt int) error {
    return provider.Describe(ctx, vmID)
})

Backoff Calculation

For attempt n:

delay = BaseDelay × Multiplier^n
delay = min(delay, MaxDelay)
if Jitter:
    delay += random(0, delay * 0.1)

Example delays with BaseDelay=500ms, Multiplier=2.0:

  • Attempt 0: 500ms
  • Attempt 1: 1s
  • Attempt 2: 2s
  • Attempt 3: 4s
  • Attempt 4: 8s

Predefined Configurations

// For frequent, low-latency operations
aggressive := resilience.AggressiveRetryConfig()
// MaxAttempts: 10, BaseDelay: 100ms, Multiplier: 1.5

// For expensive operations
conservative := resilience.ConservativeRetryConfig()
// MaxAttempts: 3, BaseDelay: 1s, Multiplier: 3.0

// Disable retries
none := resilience.NoRetryConfig()
// MaxAttempts: 1

Combined Resilience Policies

Policy Builder

policy := resilience.NewPolicyBuilder("vm-operations").
    WithRetry(resilience.DefaultRetryConfig()).
    WithCircuitBreaker(circuitBreaker).
    Build()

err := policy.Execute(ctx, func(ctx context.Context) error {
    return provider.Create(ctx, request)
})

Integration Example

// In VirtualMachine controller
func (r *VirtualMachineReconciler) createVM(ctx context.Context, vm *v1beta1.VirtualMachine) error {
    // provider is the Provider resource resolved for this VM (lookup omitted for brevity)
    cb := r.CircuitBreakerRegistry.GetOrCreate(
        "vm-operations",
        provider.Spec.Type,
        provider.Name,
    )
    
    // Create resilience policy
    policy := resilience.NewPolicyBuilder("create-vm").
        WithRetry(&resilience.RetryConfig{
            MaxAttempts: 3,
            BaseDelay:   1 * time.Second,
            MaxDelay:    30 * time.Second,
            Multiplier:  2.0,
            Jitter:      true,
        }).
        WithCircuitBreaker(cb).
        Build()
    
    // Execute with resilience
    return policy.Execute(ctx, func(ctx context.Context) error {
        resp, err := provider.Create(ctx, createReq)
        if err != nil {
            return err
        }
        
        vm.Status.ID = resp.ID
        vm.Status.TaskRef = resp.TaskRef
        return nil
    })
}

Timeout Policies

RPC Timeouts

Different operations have different timeout requirements:

// Operation-specific timeouts
rpcCfg := &config.RPCConfig{
    TimeoutDescribe:   30 * time.Second,  // Quick status check
    TimeoutMutating:   4 * time.Minute,   // Create/Delete/Power
    TimeoutValidate:   10 * time.Second,  // Provider validation
    TimeoutTaskStatus: 10 * time.Second,  // Task polling
}

// Usage in gRPC client
timeout := rpcCfg.GetRPCTimeout("Create")
ctx, cancel := context.WithTimeout(ctx, timeout)
defer cancel()

resp, err := client.Create(ctx, request)

Context Propagation

Always respect context deadlines:

func (p *Provider) Create(ctx context.Context, req CreateRequest) error {
    // Check if context is already cancelled
    select {
    case <-ctx.Done():
        return ctx.Err()
    default:
    }
    
    // Perform operation with context
    return p.performCreate(ctx, req)
}

Rate Limiting

Provider Protection

import "golang.org/x/time/rate"

// Configure rate limiter
limiter := rate.NewLimiter(
    rate.Limit(config.RateLimit.QPS),    // 10 requests per second
    config.RateLimit.Burst,              // Allow bursts of 20
)

// Check rate limit before operation
if !limiter.Allow() {
    return contracts.NewRateLimitError("Rate limit exceeded", nil)
}

// Proceed with operation
return provider.Create(ctx, request)

Per-Provider Limits

Each provider instance has its own rate limiter:

type ProviderManager struct {
    mu       sync.Mutex // guards limiters; reconciles may run concurrently
    limiters map[string]*rate.Limiter
}

func (pm *ProviderManager) getLimiter(providerType, provider string) *rate.Limiter {
    pm.mu.Lock()
    defer pm.mu.Unlock()

    key := fmt.Sprintf("%s:%s", providerType, provider)
    if limiter, exists := pm.limiters[key]; exists {
        return limiter
    }

    // Create a new limiter for this provider (10 QPS, burst of 20)
    limiter := rate.NewLimiter(rate.Limit(10), 20)
    pm.limiters[key] = limiter
    return limiter
}

Condition Mapping

VM Conditions

VirtRigaud sets standard conditions based on operations:

| Condition | Status | Reason | Description |
|---|---|---|---|
| Ready | True | VMReady | VM is ready for use |
| Ready | False | ProviderError | Provider operation failed |
| Ready | False | ValidationError | Spec validation failed |
| Provisioning | True | Creating | VM creation in progress |
| Provisioning | False | CreateFailed | VM creation failed |

Provider Conditions

| Condition | Status | Reason | Description |
|---|---|---|---|
| ProviderRuntimeReady | True | DeploymentReady | Remote runtime ready |
| ProviderRuntimeReady | False | DeploymentError | Deployment failed |
| ProviderAvailable | True | HealthCheckPassed | Provider healthy |
| ProviderAvailable | False | HealthCheckFailed | Provider unhealthy |

Error to Condition Mapping

func mapErrorToCondition(err error) metav1.Condition {
    if providerErr, ok := err.(*contracts.ProviderError); ok {
        switch providerErr.Type {
        case contracts.ErrorTypeNotFound:
            return metav1.Condition{
                Type:    "Ready",
                Status:  metav1.ConditionFalse,
                Reason:  "ResourceNotFound",
                Message: providerErr.Message,
            }
        case contracts.ErrorTypeUnauthorized:
            return metav1.Condition{
                Type:    "Ready", 
                Status:  metav1.ConditionFalse,
                Reason:  "AuthenticationFailed",
                Message: providerErr.Message,
            }
        case contracts.ErrorTypeUnavailable:
            return metav1.Condition{
                Type:    "Ready",
                Status:  metav1.ConditionFalse,
                Reason:  "ProviderUnavailable", 
                Message: providerErr.Message,
            }
        }
    }
    
    // Default error condition
    return metav1.Condition{
        Type:    "Ready",
        Status:  metav1.ConditionFalse,
        Reason:  "InternalError",
        Message: err.Error(),
    }
}

Best Practices

Error Handling

  1. Always classify errors - Use appropriate error types
  2. Preserve context - Wrap errors with additional context
  3. Avoid retrying non-retryable errors - Check error type first
  4. Set meaningful conditions - Help users understand state

Circuit Breakers

  1. Per-provider instances - Isolate failures
  2. Appropriate thresholds - Balance protection vs availability
  3. Monitor state changes - Alert on circuit breaker trips
  4. Manual override - Provide way to reset if needed

Timeouts

  1. Operation-appropriate - Different timeouts for different ops
  2. Propagate context - Always pass context through
  3. Handle cancellation - Check context.Done() regularly
  4. Resource cleanup - Ensure resources are freed on timeout

Rate Limiting

  1. Provider protection - Prevent overwhelming providers
  2. Burst handling - Allow reasonable bursts
  3. Back-pressure - Surface rate limits to users
  4. Fair sharing - Consider tenant isolation
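For a feel of what rate.NewLimiter(qps, burst) in getLimiter enforces, here is a minimal stdlib-only token bucket with the same steady-rate-plus-burst semantics (a sketch, not the golang.org/x/time/rate implementation):

```go
package main

import (
	"sync"
	"time"
)

// bucket is a minimal token bucket: capacity `burst` tokens, refilled at
// `qps` tokens per second, mirroring rate.NewLimiter(qps, burst).
type bucket struct {
	mu     sync.Mutex
	qps    float64
	burst  float64
	tokens float64
	last   time.Time
}

func newBucket(qps float64, burst int) *bucket {
	return &bucket{qps: qps, burst: float64(burst), tokens: float64(burst), last: time.Now()}
}

// Allow consumes one token if available. Returning false surfaces
// back-pressure to the caller instead of queueing requests silently.
func (b *bucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.qps // refill since last call
	if b.tokens > b.burst {
		b.tokens = b.burst // never exceed burst capacity
	}
	b.last = now
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}
```

Per-provider instances of such a bucket (as in getLimiter's map) give fair sharing across tenants.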

Configuration Examples

Development Environment

apiVersion: v1
kind: ConfigMap
metadata:
  name: virtrigaud-config
data:
  # Relaxed timeouts for development
  RPC_TIMEOUT_MUTATING: "10m"
  
  # Aggressive retries for flaky dev environments  
  RETRY_MAX_ATTEMPTS: "10"
  RETRY_BASE_DELAY: "100ms"
  
  # Lower circuit breaker threshold
  CB_FAILURE_THRESHOLD: "5"
  CB_RESET_SECONDS: "30s"

Production Environment

apiVersion: v1
kind: ConfigMap
metadata:
  name: virtrigaud-config
data:
  # Strict timeouts
  RPC_TIMEOUT_MUTATING: "4m"
  RPC_TIMEOUT_DESCRIBE: "30s"
  
  # Conservative retries
  RETRY_MAX_ATTEMPTS: "3"
  RETRY_BASE_DELAY: "1s"
  RETRY_MAX_DELAY: "60s"
  
  # Higher circuit breaker threshold
  CB_FAILURE_THRESHOLD: "15" 
  CB_RESET_SECONDS: "120s"
  
  # Rate limiting
  RATE_LIMIT_QPS: "20"
  RATE_LIMIT_BURST: "50"

VirtRigaud Upgrade Guide

This guide covers upgrading VirtRigaud installations, including CRD updates and breaking changes.

Quick Upgrade

# 1. Update Helm repository
helm repo update

# 2. Check for breaking changes (requires the helm-diff plugin)
helm diff upgrade virtrigaud virtrigaud/virtrigaud --version v0.2.1

# 3. Upgrade CRDs first (required for schema changes)
helm pull virtrigaud/virtrigaud --version v0.2.1 --untar
kubectl apply -f virtrigaud/crds/

# 4. Upgrade VirtRigaud
helm upgrade virtrigaud virtrigaud/virtrigaud \
  --namespace virtrigaud-system \
  --version v0.2.1

Alternative: Direct CRD Download

# Download and apply CRDs from release
curl -L "https://github.com/projectbeskar/virtrigaud/releases/download/v0.2.1/virtrigaud-crds.yaml" | kubectl apply -f -

# Upgrade application
helm upgrade virtrigaud virtrigaud/virtrigaud --version v0.2.1

Version-Specific Upgrade Notes

v0.2.0 β†’ v0.2.1

Breaking Changes:

  • βœ… PowerState validation fixed (OffGraceful now supported)
  • βœ… Hardware version management added (vSphere only)
  • βœ… Disk size configuration respected

Required Actions:

  1. CRD Update Required: New powerState validation and schema changes
  2. Provider Image Update: Ensure providers use v0.2.1+ images for new features
  3. Field Testing: Verify OffGraceful, hardware version, and disk sizing work correctly

Upgrade Steps:

# 1. Backup existing resources
kubectl get virtualmachines,vmclasses,providers -A -o yaml > virtrigaud-backup-v021.yaml

# 2. Update CRDs (fixes OffGraceful validation)
kubectl apply -f https://github.com/projectbeskar/virtrigaud/releases/download/v0.2.1/virtrigaud-crds.yaml

# 3. Upgrade VirtRigaud
helm upgrade virtrigaud virtrigaud/virtrigaud --version v0.2.1

# 4. Verify OffGraceful works
kubectl patch virtualmachine <vm-name> --type='merge' -p='{"spec":{"powerState":"OffGraceful"}}'

Rollback Procedures

Rollback to Previous Version

# 1. Rollback application
helm rollback virtrigaud <revision>

# 2. Rollback CRDs (if schema breaking changes)
kubectl apply -f https://github.com/projectbeskar/virtrigaud/releases/download/v0.2.0/virtrigaud-crds.yaml

# 3. Verify resources still work
kubectl get virtualmachines -A

Emergency Recovery

# 1. Restore from backup
kubectl apply -f virtrigaud-backup-v021.yaml

# 2. Check controller logs
kubectl logs -n virtrigaud-system deployment/virtrigaud-manager

# 3. Force reconciliation
kubectl annotate virtualmachine <vm-name> virtrigaud.io/force-sync="$(date)"

Automated Upgrade with GitOps

ArgoCD

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: virtrigaud
spec:
  source:
    chart: virtrigaud
    repoURL: https://projectbeskar.github.io/virtrigaud
    targetRevision: "0.2.1"
    helm:
      parameters:
      - name: manager.image.tag
        value: "v0.2.1"
  syncPolicy:
    syncOptions:
    - CreateNamespace=true
    - Replace=true  # Required for CRD updates

Flux

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: virtrigaud
spec:
  chart:
    spec:
      chart: virtrigaud
      version: "0.2.1"
      sourceRef:
        kind: HelmRepository
        name: virtrigaud
  upgrade:
    crds: CreateReplace  # Ensure CRDs are updated

Troubleshooting Upgrades

CRD Validation Errors

# Check CRD status
kubectl get crd virtualmachines.infra.virtrigaud.io -o yaml

# Fix validation conflicts
kubectl patch crd virtualmachines.infra.virtrigaud.io --type='json' -p='[{"op": "remove", "path": "/spec/versions/0/schema/openAPIV3Schema/properties/spec/properties/powerState/allOf"}]'

Provider Image Mismatch

# Check provider images
kubectl get providers -o jsonpath='{.items[*].spec.runtime.image}'

# Update provider image
kubectl patch provider <provider-name> --type='merge' -p='{"spec":{"runtime":{"image":"ghcr.io/projectbeskar/virtrigaud/provider-vsphere:v0.2.1"}}}'

Resource Conflicts

# Check for resource conflicts
kubectl get events --sort-by=.metadata.creationTimestamp

# Force resource refresh
kubectl delete pod -l app.kubernetes.io/name=virtrigaud -n virtrigaud-system

Best Practices

Pre-Upgrade Checklist

  • Backup all VirtRigaud resources
  • Check for breaking changes in release notes
  • Test upgrade in staging environment
  • Verify provider connectivity
  • Plan rollback strategy

Post-Upgrade Verification

  • All CRDs updated successfully
  • Controller manager running
  • Providers healthy and responsive
  • Existing VMs still manageable
  • New features working (OffGraceful, hardware version, etc.)

Monitoring During Upgrade

# Watch controller logs
kubectl logs -n virtrigaud-system deployment/virtrigaud-manager -f

# Monitor VM status
kubectl get virtualmachines -A --watch

# Check provider health
kubectl get providers -o custom-columns=NAME:.metadata.name,STATUS:.status.conditions[0].type,MESSAGE:.status.conditions[0].message

Support and Recovery

If you encounter issues during upgrade:

  1. Check Release Notes: https://github.com/projectbeskar/virtrigaud/releases
  2. Review Logs: Controller and provider logs for error details
  3. Community Support: GitHub issues and discussions
  4. Emergency Rollback: Use documented rollback procedures

Remember: Always test upgrades in non-production environments first!

Development Workflow (v0.2.1+)

CRD Management

Starting with v0.2.1+, VirtRigaud uses a single-source-of-truth approach for CRDs:

  • Code is the source of truth (API types in api/infra.virtrigaud.io/v1beta1)
  • config/crd/bases/ contains generated CRDs for local development and is checked into git
  • charts/virtrigaud/crds/ CRDs are generated during Helm chart packaging and are NOT checked into git

For Developers

# Generate CRDs for local development
make gen-crds

# Generate CRDs for Helm chart packaging
make gen-helm-crds

# Package Helm chart with generated CRDs
make helm-package

Pre-commit Hooks

Install pre-commit hooks to automatically generate CRDs:

# Install pre-commit
pip install pre-commit

# Install hooks
pre-commit install

# CRDs will now be generated automatically on commits that modify:
# - api/**.go files

CI/CD Integration

The CI/CD pipeline automatically:

  1. Generates CRDs from code during builds
  2. Includes CRDs in release artifacts for users to download
  3. Generates Helm chart CRDs during packaging

This ensures CRDs are always up-to-date and not duplicated in the repository.

Repository Workflow

# 1. Make API changes
vim api/infra.virtrigaud.io/v1beta1/virtualmachine_types.go

# 2. Generate CRDs (automated by pre-commit)
make gen-crds

# 3. Commit changes
git add .
git commit -m "feat: add new VM power states"

# 4. CI validates and builds with generated CRDs
git push origin feature-branch

vSphere Hardware Version Management

This document describes how to configure and upgrade VM hardware compatibility versions in VMware vSphere environments using virtrigaud.

Overview

VMware vSphere virtual machines have a hardware compatibility version (also called virtual hardware version) that determines which features and capabilities are available to the VM. Higher hardware versions provide access to newer features but require compatible ESXi hosts.

Note: Hardware version management is specific to VMware vSphere and is not available for other providers (LibVirt, Proxmox, etc.).

Hardware Version Numbers

Common hardware versions and their corresponding VMware products:

| Hardware Version | vSphere/ESXi Version | Key Features |
|------------------|----------------------|--------------|
| 10 | ESXi 5.5 | Legacy baseline |
| 11 | ESXi 6.0 | Enhanced graphics, larger VM memory |
| 13 | ESXi 6.5 | Enhanced security, more CPU/memory |
| 14 | ESXi 6.7 | Persistent memory, enhanced security |
| 15 | ESXi 6.7 U2 | Enhanced graphics, more vCPU |
| 17 | ESXi 7.0 | TPM 2.0, enhanced security |
| 18 | ESXi 7.0 U1 | Enhanced networking |
| 19 | ESXi 7.0 U2 | Precision time protocol |
| 20 | ESXi 7.0 U3 | Enhanced graphics, more memory |
| 21 | ESXi 8.0 | Latest features, DPU support |

Setting Hardware Version During VM Creation

Configure the hardware version in the VMClass using the extraConfig field:

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: modern-vm-class
  namespace: virtrigaud-system
spec:
  cpu: 4
  memory: 8Gi
  firmware: UEFI
  
  # vSphere-specific hardware version configuration
  extraConfig:
    vsphere.hardwareVersion: "21"  # Use latest hardware version
  
  diskDefaults:
    type: thin
    sizeGiB: 50
---
apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: modern-vm
  namespace: default
spec:
  providerRef:
    name: vsphere-provider
    namespace: virtrigaud-system
  
  classRef:
    name: modern-vm-class  # Uses hardware version 21
    namespace: virtrigaud-system
  
  imageRef:
    name: ubuntu-22-04
    namespace: virtrigaud-system

Upgrading Hardware Version for Existing VMs

You can upgrade the hardware version of existing VMs using the dedicated hardware upgrade API:

Using kubectl with Raw gRPC

# First, ensure the VM is powered off
kubectl patch vm my-vm --type='merge' -p='{"spec":{"powerState":"Off"}}'

# Wait for VM to be powered off, then upgrade hardware version
# Note: This requires direct access to the provider gRPC endpoint
# A kubectl plugin or controller extension would be needed for this operation

Programmatic Upgrade (Example Go Code)

package main

import (
    "context"
    "fmt"
    "log"
    
    providerv1 "github.com/projectbeskar/virtrigaud/proto/rpc/provider/v1"
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
)

func upgradeVMHardwareVersion(vmID string, targetVersion int32) error {
    // Connect to the vSphere provider's gRPC endpoint.
    // Plaintext transport is for illustration only; use TLS in production.
    conn, err := grpc.Dial("vsphere-provider:9090", grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        return fmt.Errorf("failed to connect: %w", err)
    }
    defer conn.Close()
    
    client := providerv1.NewProviderClient(conn)
    
    // Upgrade hardware version
    req := &providerv1.HardwareUpgradeRequest{
        Id:            vmID,
        TargetVersion: targetVersion,
    }
    
    resp, err := client.HardwareUpgrade(context.Background(), req)
    if err != nil {
        return fmt.Errorf("hardware upgrade failed: %w", err)
    }
    
    log.Printf("Hardware upgrade completed: %+v", resp)
    return nil
}

Requirements and Limitations

Prerequisites

  1. VM Must Be Powered Off: Hardware version upgrades require the VM to be completely powered off
  2. ESXi Host Compatibility: Target hardware version must be supported by the ESXi host
  3. VMware Tools: For best results, ensure VMware Tools is installed and up-to-date
  4. Backup Recommended: Take a snapshot before upgrading hardware version

Limitations

  1. One-Way Operation: Hardware version upgrades cannot be downgraded
  2. vSphere Only: This feature is not available for LibVirt, Proxmox, or other providers
  3. Host Requirements: Upgrading to newer versions may prevent VM from running on older ESXi hosts
  4. Compatibility: Some older guest operating systems may not support newer hardware versions

Best Practices

Choosing Hardware Version

  1. Match ESXi Version: Use the hardware version that matches your ESXi environment
  2. Conservative Approach: Don't always use the latest version unless you need specific features
  3. Test First: Test hardware version upgrades in development before production

Upgrade Process

  1. Plan Maintenance Window: VMs must be powered off during upgrade
  2. Backup First: Always take a snapshot before upgrading
  3. Batch Operations: Group VMs by hardware requirements for efficient upgrades
  4. Verify Compatibility: Ensure all ESXi hosts in your cluster support the target version

Example VMClass Configurations

Legacy Environment (ESXi 6.5)

extraConfig:
  vsphere.hardwareVersion: "13"

Modern Environment (ESXi 7.0)

extraConfig:
  vsphere.hardwareVersion: "17"

Latest Features (ESXi 8.0)

extraConfig:
  vsphere.hardwareVersion: "21"

Troubleshooting

Common Issues

  1. VM Not Powered Off

    Error: VM must be powered off for hardware upgrade, current state: poweredOn
    

    Solution: Power off the VM first using powerState: Off

  2. Unsupported Hardware Version

    Error: target version vmx-21 is not supported by ESXi host
    

    Solution: Check ESXi host compatibility and use a supported version

  3. Version Not Newer

    Error: target version vmx-15 is not newer than current version vmx-17
    

    Solution: Hardware versions can only be upgraded, not downgraded

Validation

After upgrading, verify the hardware version:

# Check the provider-reported VM details from the VirtualMachine status
kubectl get vm my-vm -o jsonpath='{.status.provider}'

Integration Examples

Complete VM Lifecycle with Hardware Version

# 1. Create VMClass with specific hardware version
apiVersion: infra.virtrigaud.io/v1beta1
kind: VMClass
metadata:
  name: production-vm-class
spec:
  cpu: 8
  memory: 16Gi
  firmware: UEFI
  extraConfig:
    vsphere.hardwareVersion: "19"  # ESXi 7.0 U2 compatible

---
# 2. Create VM using the class
apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: production-vm
spec:
  powerState: On
  providerRef:
    name: vsphere-provider
    namespace: virtrigaud-system
  classRef:
    name: production-vm-class
    namespace: virtrigaud-system
  imageRef:
    name: ubuntu-22-04
    namespace: virtrigaud-system

---
# 3. Update to newer hardware version (requires separate upgrade operation)
# This would typically be done through a controller or manual gRPC call
# after powering off the VM

This vSphere-specific feature provides fine-grained control over VM hardware capabilities while maintaining compatibility with your ESXi infrastructure.

vSphere Datastore Cluster (StoragePod) Support

This document describes how to use vSphere Datastore Clusters (also known as StoragePods) for automatic datastore selection when provisioning virtual machines with virtrigaud.

Note: StoragePod support is specific to the vSphere provider and is not available for Libvirt or Proxmox.

Overview

A vSphere Datastore Cluster (internally called a StoragePod) is a logical grouping of datastores managed together as a single unit. When you specify a Datastore Cluster instead of an individual datastore, virtrigaud automatically selects the datastore within the cluster that has the most available free space at provisioning time.

This simplifies VM placement in environments with multiple datastores: instead of tracking which individual datastore has capacity, you point to the cluster and let virtrigaud choose.

Datastore Selection Strategy

virtrigaud uses a simple, predictable strategy: pick the datastore with the most free space. This distributes VMs across the cluster over time as datastores fill up.

vSphere Storage DRS is not required to be enabled on the cluster. virtrigaud queries datastore summaries directly via the vSphere API.
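The strategy is small enough to sketch directly. The datastore type below is a stand-in for the summary fields virtrigaud reads via the vSphere property collector (names illustrative):

```go
package main

// datastore stands in for a Datastore Cluster member's summary data
// (name and summary.freeSpace) as read from the vSphere API.
type datastore struct {
	Name      string
	FreeBytes int64
}

// pickDatastore implements the documented strategy: the member with the
// most free space wins. It returns "" for an empty cluster, which the
// caller should treat as an error ("StoragePod contains no datastores").
func pickDatastore(members []datastore) string {
	best := ""
	var bestFree int64 = -1
	for _, ds := range members {
		if ds.FreeBytes > bestFree {
			best, bestFree = ds.Name, ds.FreeBytes
		}
	}
	return best
}
```

Because the comparison is on free space alone, placement naturally spreads VMs across the cluster as individual datastores fill up.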

Configuration

Per-VM Placement (VirtualMachine spec)

Specify storagePod inside spec.placement on a VirtualMachine resource:

apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: my-vm
  namespace: virtrigaud-system
spec:
  providerRef:
    name: vsphere-prod
  classRef:
    name: standard-2cpu-4gb
  imageRef:
    name: ubuntu-24-04
  placement:
    cluster: prod-cluster
    storagePod: "Production-DS-Cluster"   # Datastore Cluster name
    folder: /prod/vms

virtrigaud will inspect every datastore in Production-DS-Cluster and clone the VM onto the one with the most free space.

Provider-Level Default

Set spec.defaults.storagePod on the Provider resource to apply a Datastore Cluster as the default for all VMs that do not specify their own placement:

apiVersion: infra.virtrigaud.io/v1beta1
kind: Provider
metadata:
  name: vsphere-prod
  namespace: virtrigaud-system
spec:
  type: vsphere
  endpoint: https://vcenter.example.com
  credentialSecretRef:
    name: vsphere-credentials
  defaults:
    cluster: prod-cluster
    storagePod: "Production-DS-Cluster"   # cluster-wide default
    folder: /prod/vms
  runtime:
    image: ghcr.io/projectbeskar/virtrigaud-provider-vsphere:latest

Alternatively, pass the default through the provider pod's environment by adding it to spec.runtime.env:

spec:
  runtime:
    env:
      - name: PROVIDER_DEFAULT_STORAGE_POD
        value: "Production-DS-Cluster"

Precedence Rules

When multiple sources specify storage placement, virtrigaud applies the following priority (highest to lowest):

| Priority | Source | Field |
|----------|--------|-------|
| 1 | VM spec (explicit datastore) | spec.placement.datastore |
| 2 | VM spec (StoragePod) | spec.placement.storagePod |
| 3 | Provider default (StoragePod) | spec.defaults.storagePod / PROVIDER_DEFAULT_STORAGE_POD |
| 4 | Provider default (datastore) | spec.defaults.datastore / PROVIDER_DEFAULT_DATASTORE |

An explicit datastore always wins. storagePod is only consulted when no explicit datastore is set.
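The precedence rules can be sketched as a single resolution function (field and type names are illustrative, not virtrigaud's internals):

```go
package main

// placement holds the two storage fields a VM spec (or the provider
// defaults) may carry; an empty string means "not set".
type placement struct {
	Datastore  string
	StoragePod string
}

// resolveStorage applies the precedence table: an explicit VM datastore
// always wins, then the VM's StoragePod, then the provider defaults.
// The bool result reports whether the chosen name is a StoragePod.
func resolveStorage(vm, providerDefaults placement) (name string, isPod bool) {
	switch {
	case vm.Datastore != "":
		return vm.Datastore, false // priority 1
	case vm.StoragePod != "":
		return vm.StoragePod, true // priority 2
	case providerDefaults.StoragePod != "":
		return providerDefaults.StoragePod, true // priority 3
	default:
		return providerDefaults.Datastore, false // priority 4
	}
}
```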

Examples

apiVersion: infra.virtrigaud.io/v1beta1
kind: VirtualMachine
metadata:
  name: web-server-01
  namespace: virtrigaud-system
spec:
  providerRef:
    name: vsphere-prod
  classRef:
    name: web-4cpu-8gb
  imageRef:
    name: ubuntu-24-04
  placement:
    cluster: prod-cluster
    storagePod: "SSD-Datastore-Cluster"

Override with an explicit datastore (e.g. for compliance)

When you need a specific datastore, such as one dedicated to regulated workloads, set datastore; any storagePod value is then ignored:

  placement:
    cluster: prod-cluster
    datastore: "regulated-ds-01"    # StoragePod is ignored when this is set
    storagePod: "SSD-Datastore-Cluster"

Use different clusters for different teams via a shared Provider

Combine a provider-level StoragePod default with per-VM overrides:

# Provider default: route most VMs to the general-purpose cluster
spec:
  defaults:
    cluster: general-cluster
    storagePod: "General-DS-Cluster"
# High-performance VM overrides both cluster and StoragePod
spec:
  placement:
    cluster: nvme-cluster
    storagePod: "NVMe-DS-Cluster"

How it Works Internally

When a VM is created and a StoragePod is resolved:

  1. virtrigaud creates a container view scoped to the vSphere root folder and searches for StoragePod managed objects.
  2. It matches the named StoragePod and reads its childEntity list, the set of datastores it contains.
  3. For each child datastore the summary.freeSpace property is retrieved via the property collector.
  4. The datastore with the highest freeSpace is selected and used as the target in the clone specification (VirtualMachineRelocateSpec.Datastore).

The selection happens at provisioning time; it is not re-evaluated on subsequent reconciliations or reboots.

Troubleshooting

"StoragePod 'X' not found"

  • Verify the name exactly matches the Datastore Cluster name in vCenter (case-sensitive).
  • Confirm the vCenter user account has Datastore.Browse privilege on the Datastore Cluster.
  • Check provider pod logs for the container view query.

"StoragePod 'X' contains no datastores"

The Datastore Cluster exists but is empty (no datastores are members). Add datastores to the cluster in vCenter.

"failed to retrieve datastores from StoragePod"

The provider account lacks permission to read datastore summary properties. Grant Datastore.Browse on the individual datastores within the cluster.

VM is always placed on the same datastore

This is expected when one datastore consistently has significantly more free space. It is not a bug.

Checking which datastore was selected

The provider logs an INFO message at VM creation time:

INFO  Selected datastore from StoragePod  storagePod=Production-DS-Cluster  datastore=vsanDatastore-02  freeSpaceGiB=812

Check the provider pod logs to see the selection for any specific VM.

Limitations

  • Free-space only: virtrigaud does not use vSphere Storage DRS policies, IOPS limits, or storage tags when selecting a datastore. Only free space is considered.
  • Point-in-time selection: The datastore is chosen once at clone time. Subsequent Storage vMotion by Storage DRS is not prevented.
  • No rebalancing: virtrigaud does not rebalance existing VMs when free space changes.
  • vSphere only: This feature has no equivalent for Libvirt or Proxmox providers.

Bearer Token Authentication

This guide covers how to configure bearer token authentication for VirtRigaud providers using JWT tokens and RBAC.

Overview

Bearer token authentication provides a stateless, scalable authentication mechanism using JSON Web Tokens (JWT). This approach is suitable for:

  • Multi-tenant environments: Different tokens for different tenants
  • API-based access: External systems accessing provider services
  • Short-lived sessions: Tokens with configurable expiration
  • Fine-grained permissions: Token-based RBAC

JWT Token Structure

Token Claims

{
  "iss": "virtrigaud-manager",
  "sub": "provider-client",
  "aud": "virtrigaud-provider",
  "exp": 1640995200,
  "iat": 1640908800,
  "nbf": 1640908800,
  "scope": "vm:create vm:read vm:update vm:delete",
  "tenant": "default",
  "provider": "vsphere",
  "jti": "unique-token-id"
}

Scopes Definition

| Scope | Description |
|-------|-------------|
| vm:create | Create virtual machines |
| vm:read | Read virtual machine information |
| vm:update | Update virtual machine configuration |
| vm:delete | Delete virtual machines |
| vm:power | Control virtual machine power state |
| vm:snapshot | Create and manage snapshots |
| vm:clone | Clone virtual machines |
| admin | Full administrative access |
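Checking a token's space-separated scope claim against this table is a simple membership test with an admin short-circuit, as the RBAC manager later in this guide does. A minimal sketch (hasScope is illustrative, not part of VirtRigaud's API):

```go
package main

import "strings"

// hasScope reports whether the space-separated scope claim grants
// `required`; the admin scope implies every other scope.
func hasScope(scopeClaim, required string) bool {
	for _, s := range strings.Fields(scopeClaim) {
		if s == "admin" || s == required {
			return true
		}
	}
	return false
}
```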

Token Generation

JWT Signing Key

# Generate RS256 private key
openssl genrsa -out jwt-private-key.pem 2048

# Extract public key
openssl rsa -in jwt-private-key.pem -pubout -out jwt-public-key.pem

# Store as Kubernetes secret
kubectl create secret generic jwt-keys \
  --from-file=private-key=jwt-private-key.pem \
  --from-file=public-key=jwt-public-key.pem \
  --namespace=virtrigaud-system

Token Generation Service

package auth

import (
    "crypto/rsa"
    "fmt"
    "strings"
    "time"

    "github.com/golang-jwt/jwt/v4"
)

// TokenClaims embeds the standard registered claims (iss, sub, aud, exp,
// iat, nbf, jti) and adds VirtRigaud-specific ones. Duplicating the
// registered claims as separate fields would shadow the embedded ones
// and bypass jwt's expiry validation.
type TokenClaims struct {
    Scope    string `json:"scope"`
    Tenant   string `json:"tenant"`
    Provider string `json:"provider"`
    jwt.RegisteredClaims
}

type TokenService struct {
    privateKey *rsa.PrivateKey
    publicKey  *rsa.PublicKey
    issuer     string
}

func NewTokenService(privateKey *rsa.PrivateKey, publicKey *rsa.PublicKey, issuer string) *TokenService {
    return &TokenService{
        privateKey: privateKey,
        publicKey:  publicKey,
        issuer:     issuer,
    }
}

func (ts *TokenService) GenerateToken(subject, tenant, provider string, scopes []string, duration time.Duration) (string, error) {
    now := time.Now()
    claims := &TokenClaims{
        Scope:    strings.Join(scopes, " "),
        Tenant:   tenant,
        Provider: provider,
        RegisteredClaims: jwt.RegisteredClaims{
            Issuer:    ts.issuer,
            Subject:   subject,
            Audience:  jwt.ClaimStrings{"virtrigaud-provider"},
            ExpiresAt: jwt.NewNumericDate(now.Add(duration)),
            IssuedAt:  jwt.NewNumericDate(now),
            NotBefore: jwt.NewNumericDate(now),
            ID:        generateJTI(),
        },
    }

    token := jwt.NewWithClaims(jwt.SigningMethodRS256, claims)
    return token.SignedString(ts.privateKey)
}

func (ts *TokenService) ValidateToken(tokenString string) (*TokenClaims, error) {
    token, err := jwt.ParseWithClaims(tokenString, &TokenClaims{}, func(token *jwt.Token) (interface{}, error) {
        if _, ok := token.Method.(*jwt.SigningMethodRSA); !ok {
            return nil, fmt.Errorf("unexpected signing method: %v", token.Header["alg"])
        }
        return ts.publicKey, nil
    })
    
    if err != nil {
        return nil, err
    }
    
    if claims, ok := token.Claims.(*TokenClaims); ok && token.Valid {
        return claims, nil
    }
    
    return nil, fmt.Errorf("invalid token")
}

// generateJTI returns a unique token ID (uses github.com/google/uuid).
func generateJTI() string {
    return uuid.New().String()
}

Provider Authentication Interceptor

gRPC Interceptor

package middleware

import (
    "context"
    "fmt"
    "strings"

    "google.golang.org/grpc"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/metadata"
    "google.golang.org/grpc/status"
    // plus the auth package defining TokenService (project-internal path omitted)
)

type AuthInterceptor struct {
    tokenService *auth.TokenService
    rbac         *RBACManager
}

func NewAuthInterceptor(tokenService *auth.TokenService, rbac *RBACManager) *AuthInterceptor {
    return &AuthInterceptor{
        tokenService: tokenService,
        rbac:         rbac,
    }
}

func (ai *AuthInterceptor) Unary() grpc.UnaryServerInterceptor {
    return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
        // Skip authentication for health checks
        if strings.HasSuffix(info.FullMethod, "/Health/Check") {
            return handler(ctx, req)
        }
        
        token, err := ai.extractToken(ctx)
        if err != nil {
            return nil, status.Errorf(codes.Unauthenticated, "missing or invalid token: %v", err)
        }
        
        claims, err := ai.tokenService.ValidateToken(token)
        if err != nil {
            return nil, status.Errorf(codes.Unauthenticated, "invalid token: %v", err)
        }
        
        // Check authorization
        if !ai.rbac.IsAuthorized(claims, info.FullMethod) {
            return nil, status.Errorf(codes.PermissionDenied, "insufficient permissions")
        }
        
        // Attach the validated claims to the request context for handlers.
        // (Production code should use an unexported key type, not a raw string.)
        ctx = context.WithValue(ctx, "claims", claims)
        
        return handler(ctx, req)
    }
}

func (ai *AuthInterceptor) extractToken(ctx context.Context) (string, error) {
    md, ok := metadata.FromIncomingContext(ctx)
    if !ok {
        return "", fmt.Errorf("missing metadata")
    }
    
    authHeaders := md.Get("authorization")
    if len(authHeaders) == 0 {
        return "", fmt.Errorf("missing authorization header")
    }
    
    authHeader := authHeaders[0]
    if !strings.HasPrefix(authHeader, "Bearer ") {
        return "", fmt.Errorf("invalid authorization header format")
    }
    
    return strings.TrimPrefix(authHeader, "Bearer "), nil
}

RBAC Manager

package middleware

import (
    "strings"
    // plus the auth package defining TokenClaims (project-internal path omitted)
)

type Permission struct {
    Resource string
    Action   string
}

type RBACManager struct {
    permissions map[string][]Permission
}

func NewRBACManager() *RBACManager {
    return &RBACManager{
        permissions: map[string][]Permission{
            // RPC method to required permissions mapping
            "/provider.v1.ProviderService/CreateVM": {
                {Resource: "vm", Action: "create"},
            },
            "/provider.v1.ProviderService/GetVM": {
                {Resource: "vm", Action: "read"},
            },
            "/provider.v1.ProviderService/UpdateVM": {
                {Resource: "vm", Action: "update"},
            },
            "/provider.v1.ProviderService/DeleteVM": {
                {Resource: "vm", Action: "delete"},
            },
            "/provider.v1.ProviderService/PowerVM": {
                {Resource: "vm", Action: "power"},
            },
            "/provider.v1.ProviderService/CreateSnapshot": {
                {Resource: "vm", Action: "snapshot"},
            },
            "/provider.v1.ProviderService/CloneVM": {
                {Resource: "vm", Action: "clone"},
            },
        },
    }
}

func (rbac *RBACManager) IsAuthorized(claims *auth.TokenClaims, method string) bool {
    requiredPerms, exists := rbac.permissions[method]
    if !exists {
        // Allow if no specific permissions required
        return true
    }
    
    userScopes := strings.Split(claims.Scope, " ")
    
    // Check if user has admin scope
    for _, scope := range userScopes {
        if scope == "admin" {
            return true
        }
    }
    
    // Check specific permissions
    for _, requiredPerm := range requiredPerms {
        requiredScope := requiredPerm.Resource + ":" + requiredPerm.Action
        
        hasPermission := false
        for _, userScope := range userScopes {
            if userScope == requiredScope {
                hasPermission = true
                break
            }
        }
        
        if !hasPermission {
            return false
        }
    }
    
    return true
}

Kubernetes RBAC Integration

ServiceAccount and ClusterRole

apiVersion: v1
kind: ServiceAccount
metadata:
  name: virtrigaud-token-manager
  namespace: virtrigaud-system

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: virtrigaud-token-manager
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: virtrigaud-token-manager
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: virtrigaud-token-manager
subjects:
  - kind: ServiceAccount
    name: virtrigaud-token-manager
    namespace: virtrigaud-system

Token Management ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: token-config
  namespace: virtrigaud-system
data:
  config.yaml: |
    tokenService:
      issuer: "virtrigaud-manager"
      defaultDuration: "1h"
      maxDuration: "24h"
      
    scopes:
      - name: "vm:create"
        description: "Create virtual machines"
      - name: "vm:read"
        description: "Read virtual machine information"
      - name: "vm:update"
        description: "Update virtual machine configuration"
      - name: "vm:delete"
        description: "Delete virtual machines"
      - name: "vm:power"
        description: "Control virtual machine power state"
      - name: "vm:snapshot"
        description: "Create and manage snapshots"
      - name: "vm:clone"
        description: "Clone virtual machines"
      - name: "admin"
        description: "Full administrative access"
        
    tenants:
      - name: "default"
        description: "Default tenant"
        allowedScopes: ["vm:create", "vm:read", "vm:update", "vm:delete", "vm:power"]
      - name: "development"
        description: "Development environment"
        allowedScopes: ["vm:create", "vm:read", "vm:update", "vm:delete", "vm:power", "vm:snapshot", "vm:clone"]
      - name: "production"
        description: "Production environment"
        allowedScopes: ["vm:read", "vm:power"]

Client Configuration

Manager Client Setup

package client

import (
    "context"
    
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    "google.golang.org/grpc/metadata"
)

type AuthenticatedClient struct {
    client providerv1.ProviderServiceClient
    token  string
}

func NewAuthenticatedClient(endpoint, token string) (*AuthenticatedClient, error) {
    // Plaintext transport for brevity; pair this with mTLS in production.
    conn, err := grpc.Dial(endpoint, grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        return nil, err
    }
    
    return &AuthenticatedClient{
        client: providerv1.NewProviderServiceClient(conn),
        token:  token,
    }, nil
}

func (ac *AuthenticatedClient) CreateVM(ctx context.Context, req *providerv1.CreateVMRequest) (*providerv1.CreateVMResponse, error) {
    ctx = ac.addAuthHeader(ctx)
    return ac.client.CreateVM(ctx, req)
}

func (ac *AuthenticatedClient) addAuthHeader(ctx context.Context) context.Context {
    md := metadata.Pairs("authorization", "Bearer "+ac.token)
    return metadata.NewOutgoingContext(ctx, md)
}

Token Refresh

package auth

import (
    "sync"
    "time"
)

type TokenManager struct {
    tokenService *TokenService
    currentToken string
    expiresAt    time.Time
    mutex        sync.RWMutex
    
    subject  string
    tenant   string
    provider string
    scopes   []string
}

func NewTokenManager(tokenService *TokenService, subject, tenant, provider string, scopes []string) *TokenManager {
    return &TokenManager{
        tokenService: tokenService,
        subject:      subject,
        tenant:       tenant,
        provider:     provider,
        scopes:       scopes,
    }
}

func (tm *TokenManager) GetToken() (string, error) {
    tm.mutex.RLock()
    if tm.currentToken != "" && time.Now().Before(tm.expiresAt.Add(-5*time.Minute)) {
        token := tm.currentToken
        tm.mutex.RUnlock()
        return token, nil
    }
    tm.mutex.RUnlock()
    
    return tm.refreshToken()
}

func (tm *TokenManager) refreshToken() (string, error) {
    tm.mutex.Lock()
    defer tm.mutex.Unlock()
    
    // Double-check after acquiring write lock
    if tm.currentToken != "" && time.Now().Before(tm.expiresAt.Add(-5*time.Minute)) {
        return tm.currentToken, nil
    }
    
    token, err := tm.tokenService.GenerateToken(tm.subject, tm.tenant, tm.provider, tm.scopes, time.Hour)
    if err != nil {
        return "", err
    }
    
    tm.currentToken = token
    tm.expiresAt = time.Now().Add(time.Hour)
    
    return token, nil
}

Helm Chart Integration

Provider Runtime with Bearer Token Auth

# values-bearer-auth.yaml
auth:
  type: "bearer"
  jwt:
    publicKeySecret: "jwt-keys"
    publicKeyKey: "public-key"
    issuer: "virtrigaud-manager"
    audience: "virtrigaud-provider"

# Environment variables for authentication
env:
  - name: AUTH_TYPE
    value: "bearer"
  - name: JWT_PUBLIC_KEY_PATH
    value: "/etc/jwt/public-key"
  - name: JWT_ISSUER
    value: "virtrigaud-manager"
  - name: JWT_AUDIENCE
    value: "virtrigaud-provider"

# Mount JWT public key
volumes:
  - name: jwt-public-key
    secret:
      secretName: jwt-keys

volumeMounts:
  - name: jwt-public-key
    mountPath: /etc/jwt
    readOnly: true

Monitoring and Logging

Authentication Metrics

package metrics

import (
    "time"
    
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    authenticationAttempts = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "virtrigaud_authentication_attempts_total",
            Help: "Total number of authentication attempts",
        },
        []string{"method", "result", "tenant"},
    )
    
    authenticationDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "virtrigaud_authentication_duration_seconds",
            Help: "Duration of authentication operations",
        },
        []string{"method", "result"},
    )
    
    activeTokens = promauto.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "virtrigaud_active_tokens",
            Help: "Number of active tokens by tenant",
        },
        []string{"tenant", "provider"},
    )
)

func RecordAuthAttempt(method, result, tenant string) {
    authenticationAttempts.WithLabelValues(method, result, tenant).Inc()
}

func RecordAuthDuration(method, result string, duration time.Duration) {
    authenticationDuration.WithLabelValues(method, result).Observe(duration.Seconds())
}

Audit Logging

package audit

import (
    "context"
    "encoding/json"
    "time"
    
    "go.uber.org/zap"
)

type AuditEvent struct {
    Timestamp time.Time `json:"timestamp"`
    EventType string    `json:"event_type"`
    Subject   string    `json:"subject"`
    Tenant    string    `json:"tenant"`
    Provider  string    `json:"provider"`
    Resource  string    `json:"resource"`
    Action    string    `json:"action"`
    Result    string    `json:"result"`
    Error     string    `json:"error,omitempty"`
    Metadata  map[string]interface{} `json:"metadata,omitempty"`
}

type AuditLogger struct {
    logger *zap.Logger
}

func NewAuditLogger(logger *zap.Logger) *AuditLogger {
    return &AuditLogger{logger: logger}
}

func (al *AuditLogger) LogAuthEvent(ctx context.Context, eventType, subject, tenant, provider, result string, err error) {
    event := AuditEvent{
        Timestamp: time.Now(),
        EventType: eventType,
        Subject:   subject,
        Tenant:    tenant,
        Provider:  provider,
        Result:    result,
    }
    
    if err != nil {
        event.Error = err.Error()
    }
    
    eventJSON, err := json.Marshal(event)
    if err != nil {
        al.logger.Error("failed to marshal audit event", zap.Error(err))
        return
    }
    al.logger.Info("audit_event", zap.String("event", string(eventJSON)))
}

Security Best Practices

1. Token Validation

// Always validate all token claims
func validateTokenClaims(claims *TokenClaims) error {
    now := time.Now()
    
    // Check expiration
    if claims.ExpiresAt < now.Unix() {
        return fmt.Errorf("token expired")
    }
    
    // Check not before
    if claims.NotBefore > now.Unix() {
        return fmt.Errorf("token not yet valid")
    }
    
    // Check issuer
    if claims.Issuer != expectedIssuer {
        return fmt.Errorf("invalid issuer")
    }
    
    // Check audience
    if claims.Audience != expectedAudience {
        return fmt.Errorf("invalid audience")
    }
    
    return nil
}

2. Rate Limiting

// Implement rate limiting for token generation
type RateLimiter struct {
    requests map[string][]time.Time
    mutex    sync.RWMutex
    limit    int
    window   time.Duration
}

func (rl *RateLimiter) Allow(key string) bool {
    rl.mutex.Lock()
    defer rl.mutex.Unlock()
    
    now := time.Now()
    requests := rl.requests[key]
    
    // Remove old requests outside the window
    var validRequests []time.Time
    for _, req := range requests {
        if now.Sub(req) < rl.window {
            validRequests = append(validRequests, req)
        }
    }
    
    // Check if we've exceeded the limit
    if len(validRequests) >= rl.limit {
        return false
    }
    
    // Add the current request
    validRequests = append(validRequests, now)
    rl.requests[key] = validRequests
    
    return true
}

3. Token Blacklisting

// Implement token blacklisting for revoked tokens
type TokenBlacklist struct {
    blacklistedTokens map[string]time.Time
    mutex             sync.RWMutex
}

func (tb *TokenBlacklist) IsBlacklisted(jti string) bool {
    // A write lock is required because expired entries are deleted below;
    // deleting under a read lock would be a data race.
    tb.mutex.Lock()
    defer tb.mutex.Unlock()
    
    expiresAt, exists := tb.blacklistedTokens[jti]
    if !exists {
        return false
    }
    
    // Remove expired entries
    if time.Now().After(expiresAt) {
        delete(tb.blacklistedTokens, jti)
        return false
    }
    
    return true
}

func (tb *TokenBlacklist) BlacklistToken(jti string, expiresAt time.Time) {
    tb.mutex.Lock()
    defer tb.mutex.Unlock()
    tb.blacklistedTokens[jti] = expiresAt
}

mTLS Security Configuration

This guide covers how to configure mutual TLS (mTLS) authentication between VirtRigaud managers and providers.

Overview

mTLS provides strong authentication and encryption for gRPC communication between the VirtRigaud manager and provider services. It ensures:

  • Authentication: Both client and server verify each other’s certificates
  • Encryption: All traffic is encrypted in transit
  • Certificate Pinning: Specific certificate authorities are trusted
  • Certificate Rotation: Automated certificate renewal

Certificate Management

1. Generate CA Certificate

# Create CA private key
openssl genrsa -out ca-key.pem 4096

# Create CA certificate
openssl req -new -x509 -key ca-key.pem -out ca-cert.pem -days 365 \
  -subj "/C=US/ST=CA/L=San Francisco/O=VirtRigaud/CN=VirtRigaud CA"

2. Generate Server Certificate (Provider)

# Create server private key
openssl genrsa -out server-key.pem 4096

# Create server certificate signing request
openssl req -new -key server-key.pem -out server-csr.pem \
  -subj "/C=US/ST=CA/L=San Francisco/O=VirtRigaud/CN=provider-service"

# Sign server certificate
openssl x509 -req -in server-csr.pem -CA ca-cert.pem -CAkey ca-key.pem \
  -CAcreateserial -out server-cert.pem -days 365 \
  -extensions v3_req -extfile <(cat <<EOF
[v3_req]
keyUsage = digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @alt_names
[alt_names]
DNS.1 = provider-service
DNS.2 = provider-service.default.svc.cluster.local
DNS.3 = localhost
IP.1 = 127.0.0.1
EOF
)

3. Generate Client Certificate (Manager)

# Create client private key
openssl genrsa -out client-key.pem 4096

# Create client certificate signing request
openssl req -new -key client-key.pem -out client-csr.pem \
  -subj "/C=US/ST=CA/L=San Francisco/O=VirtRigaud/CN=manager-client"

# Sign client certificate
openssl x509 -req -in client-csr.pem -CA ca-cert.pem -CAkey ca-key.pem \
  -CAcreateserial -out client-cert.pem -days 365 \
  -extensions v3_req -extfile <(cat <<EOF
[v3_req]
keyUsage = digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth
EOF
)

Kubernetes Secret Configuration

Provider TLS Secret

apiVersion: v1
kind: Secret
metadata:
  name: provider-tls
  namespace: default
type: kubernetes.io/tls
data:
  tls.crt: # base64 encoded server-cert.pem
  tls.key: # base64 encoded server-key.pem
  ca.crt: # base64 encoded ca-cert.pem

Manager TLS Secret

apiVersion: v1
kind: Secret
metadata:
  name: manager-tls
  namespace: virtrigaud-system
type: kubernetes.io/tls
data:
  tls.crt: # base64 encoded client-cert.pem
  tls.key: # base64 encoded client-key.pem
  ca.crt: # base64 encoded ca-cert.pem
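
With the PEM files from the steps above on disk, both secrets can be created directly with kubectl. Note that `create secret generic` produces type `Opaque`; apply the manifests above with base64-encoded data instead if the `kubernetes.io/tls` type matters to your tooling:

```shell
# Provider (server) secret
kubectl create secret generic provider-tls \
  --from-file=tls.crt=server-cert.pem \
  --from-file=tls.key=server-key.pem \
  --from-file=ca.crt=ca-cert.pem \
  --namespace=default

# Manager (client) secret
kubectl create secret generic manager-tls \
  --from-file=tls.crt=client-cert.pem \
  --from-file=tls.key=client-key.pem \
  --from-file=ca.crt=ca-cert.pem \
  --namespace=virtrigaud-system
```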

Provider Configuration

SDK Server Configuration

package main

import (
    "crypto/tls"
    "crypto/x509"
    "fmt"
    "os"
    
    "github.com/projectbeskar/virtrigaud/sdk/provider/server"
)

func main() {
    // Load certificates
    cert, err := tls.LoadX509KeyPair("/etc/tls/tls.crt", "/etc/tls/tls.key")
    if err != nil {
        panic(fmt.Sprintf("Failed to load server certificates: %v", err))
    }
    
    // Load CA certificate for client verification
    caCert, err := os.ReadFile("/etc/tls/ca.crt")
    if err != nil {
        panic(fmt.Sprintf("Failed to load CA certificate: %v", err))
    }
    
    caCertPool := x509.NewCertPool()
    if !caCertPool.AppendCertsFromPEM(caCert) {
        panic("Failed to parse CA certificate")
    }
    
    // Configure TLS
    tlsConfig := &tls.Config{
        Certificates: []tls.Certificate{cert},
        ClientAuth:   tls.RequireAndVerifyClientCert,
        ClientCAs:    caCertPool,
        MinVersion:   tls.VersionTLS12,
        CipherSuites: []uint16{
            tls.TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,
            tls.TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,
            tls.TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,
            tls.TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,
        },
    }
    
    // Create server with mTLS
    srv, err := server.New(&server.Config{
        Port:      9443,
        TLS:       tlsConfig,
        EnableTLS: true,
    })
    if err != nil {
        panic(fmt.Sprintf("Failed to create server: %v", err))
    }
    
    // Register your provider implementation here
    // providerv1.RegisterProviderServiceServer(srv.GRPCServer(), &YourProvider{})
    
    if err := srv.Serve(); err != nil {
        panic(fmt.Sprintf("Server failed: %v", err))
    }
}

Helm Chart Values (Provider Runtime)

# values-mtls.yaml
tls:
  enabled: true
  secretName: provider-tls

# Mount TLS certificates
volumes:
  - name: tls-certs
    secret:
      secretName: provider-tls

volumeMounts:
  - name: tls-certs
    mountPath: /etc/tls
    readOnly: true

# Environment variables for TLS
env:
  - name: TLS_ENABLED
    value: "true"
  - name: TLS_CERT_PATH
    value: "/etc/tls/tls.crt"
  - name: TLS_KEY_PATH
    value: "/etc/tls/tls.key"
  - name: TLS_CA_PATH
    value: "/etc/tls/ca.crt"

Manager Configuration

Client TLS Configuration

// In manager code
func createProviderClient(endpoint string) (providerv1.ProviderServiceClient, error) {
    // Load client certificates
    cert, err := tls.LoadX509KeyPair("/etc/manager-tls/tls.crt", "/etc/manager-tls/tls.key")
    if err != nil {
        return nil, fmt.Errorf("failed to load client certificates: %w", err)
    }
    
    // Load CA certificate for server verification
    caCert, err := os.ReadFile("/etc/manager-tls/ca.crt")
    if err != nil {
        return nil, fmt.Errorf("failed to load CA certificate: %w", err)
    }
    
    caCertPool := x509.NewCertPool()
    if !caCertPool.AppendCertsFromPEM(caCert) {
        return nil, fmt.Errorf("failed to parse CA certificate")
    }
    
    // Configure TLS
    tlsConfig := &tls.Config{
        Certificates: []tls.Certificate{cert},
        RootCAs:      caCertPool,
        ServerName:   "provider-service", // Must match server certificate CN/SAN
        MinVersion:   tls.VersionTLS12,
    }
    
    // Create gRPC connection with mTLS
    conn, err := grpc.Dial(endpoint,
        grpc.WithTransportCredentials(credentials.NewTLS(tlsConfig)),
    )
    if err != nil {
        return nil, fmt.Errorf("failed to connect: %w", err)
    }
    
    return providerv1.NewProviderServiceClient(conn), nil
}

Certificate Rotation

Using cert-manager

# Install cert-manager first
# kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.12.0/cert-manager.yaml

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: virtrigaud-ca-issuer
spec:
  ca:
    secretName: virtrigaud-ca-secret

---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: provider-tls
  namespace: default
spec:
  secretName: provider-tls
  issuerRef:
    name: virtrigaud-ca-issuer
    kind: ClusterIssuer
  commonName: provider-service
  dnsNames:
    - provider-service
    - provider-service.default.svc.cluster.local
  duration: 8760h # 1 year
  renewBefore: 720h # 30 days before expiry

Manual Rotation Script

#!/bin/bash
# rotate-certs.sh

NAMESPACE=${1:-default}
SECRET_NAME=${2:-provider-tls}

echo "Rotating certificates for $SECRET_NAME in namespace $NAMESPACE"

# Generate new certificates (using the same process as above)
# ...

# Update Kubernetes secret
kubectl create secret tls $SECRET_NAME \
  --cert=server-cert.pem \
  --key=server-key.pem \
  --namespace=$NAMESPACE \
  --dry-run=client -o yaml | kubectl apply -f -

# Add CA certificate to the secret
kubectl patch secret $SECRET_NAME -n $NAMESPACE \
  --patch="$(cat <<EOF
data:
  ca.crt: $(base64 -w 0 ca-cert.pem)
EOF
)"

# Restart provider deployment to pick up new certificates
kubectl rollout restart deployment/provider-deployment -n $NAMESPACE

echo "Certificate rotation completed"

Security Best Practices

1. Certificate Validation

// Always validate certificate chains
func validateCertificate(cert *x509.Certificate, caCert *x509.Certificate) error {
    roots := x509.NewCertPool()
    roots.AddCert(caCert)
    
    opts := x509.VerifyOptions{
        Roots: roots,
        KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
    }
    
    _, err := cert.Verify(opts)
    return err
}

2. Certificate Pinning

// Pin specific certificate or CA
func createTLSConfigWithPinning(expectedCertFingerprint string) *tls.Config {
    return &tls.Config{
        VerifyPeerCertificate: func(rawCerts [][]byte, verifiedChains [][]*x509.Certificate) error {
            if len(rawCerts) == 0 {
                return fmt.Errorf("no certificates provided")
            }
            
            cert, err := x509.ParseCertificate(rawCerts[0])
            if err != nil {
                return err
            }
            
            fingerprint := sha256.Sum256(cert.Raw)
            if hex.EncodeToString(fingerprint[:]) != expectedCertFingerprint {
                return fmt.Errorf("certificate fingerprint mismatch")
            }
            
            return nil
        },
    }
}

3. Monitoring and Alerting

# Prometheus AlertManager rules
groups:
  - name: virtrigaud.certificates
    rules:
      - alert: CertificateExpiringSoon
        expr: (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate expiring soon"
          description: "Certificate {{ $labels.name }} expires in less than 30 days"
      
      - alert: CertificateExpired
        expr: certmanager_certificate_expiration_timestamp_seconds < time()
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Certificate expired"
          description: "Certificate {{ $labels.name }} has expired"

Troubleshooting

Common Issues

  1. Certificate chain issues

    # Verify certificate chain
    openssl verify -CAfile ca-cert.pem server-cert.pem
    
  2. SAN mismatch

    # Check certificate SAN entries
    openssl x509 -in server-cert.pem -text -noout | grep -A1 "Subject Alternative Name"
    
  3. TLS handshake failures

    # Test TLS connection
    openssl s_client -connect provider-service:9443 -cert client-cert.pem -key client-key.pem -CAfile ca-cert.pem
    
  4. Clock skew issues

    # Ensure time synchronization (ntpdate is deprecated; use chrony or systemd-timesyncd)
    timedatectl set-ntp true
    

Debug Commands

# Check certificate validity
kubectl get secret provider-tls -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -text -noout

# Monitor certificate expiration
kubectl get certificates

# Check provider logs for TLS errors
kubectl logs deployment/provider-deployment | grep -i tls

External Secrets Management

This guide covers integrating VirtRigaud providers with external secret management systems using ExternalSecrets operators and best practices for credential security.

Overview

External secret management provides secure, centralized credential storage and automatic secret rotation. Supported systems include:

  • HashiCorp Vault: Enterprise secret management with dynamic secrets
  • AWS Secrets Manager: Cloud-native secret storage with automatic rotation
  • Azure Key Vault: Azure-integrated secret management
  • Google Secret Manager: GCP secret storage service
  • Kubernetes External Secrets: Generic external secret integration

External Secrets Operator Setup

Installation

# Install External Secrets Operator
helm repo add external-secrets https://charts.external-secrets.io
helm repo update

helm install external-secrets external-secrets/external-secrets \
  --namespace external-secrets-system \
  --create-namespace \
  --set installCRDs=true
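
After installation, it is worth confirming the operator is running and its CRDs are registered before creating any SecretStore resources (assumes kubectl points at the same cluster):

```shell
# Operator pods should be Running
kubectl get pods -n external-secrets-system

# CRDs such as secretstores.external-secrets.io should be listed
kubectl get crds | grep external-secrets.io
```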

Basic Configuration

# ServiceAccount for External Secrets Operator
apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-secrets
  namespace: virtrigaud-system
  annotations:
    # For AWS IRSA (IAM Roles for Service Accounts)
    eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/external-secrets-role

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: external-secrets
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["create", "update", "patch", "delete", "get", "list", "watch"]
  - apiGroups: ["external-secrets.io"]
    resources: ["*"]
    verbs: ["*"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: external-secrets
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: external-secrets
subjects:
  - kind: ServiceAccount
    name: external-secrets
    namespace: virtrigaud-system

HashiCorp Vault Integration

Vault SecretStore

apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: vault-secret-store
  namespace: virtrigaud-system
spec:
  provider:
    vault:
      server: "https://vault.example.com:8200"
      path: "secret"
      version: "v2"
      auth:
        # Use Kubernetes service account for authentication
        kubernetes:
          mountPath: "kubernetes"
          role: "virtrigaud-role"
          serviceAccountRef:
            name: "external-secrets"

---
# For multi-namespace access
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault-cluster-store
spec:
  provider:
    vault:
      server: "https://vault.example.com:8200"
      path: "secret"
      version: "v2"
      auth:
        kubernetes:
          mountPath: "kubernetes"
          role: "virtrigaud-cluster-role"
          serviceAccountRef:
            name: "external-secrets"
            namespace: "virtrigaud-system"

Vault Policy Configuration

# Vault policy for VirtRigaud secrets
path "secret/data/virtrigaud/*" {
  capabilities = ["read"]
}

path "secret/data/providers/*" {
  capabilities = ["read"]
}

# Dynamic database credentials
path "database/creds/readonly" {
  capabilities = ["read"]
}

# PKI for TLS certificates
path "pki/issue/virtrigaud" {
  capabilities = ["create", "update"]
}
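
Assuming the policy above is saved as `virtrigaud-policy.hcl`, it can be written to Vault and bound to the Kubernetes auth role referenced by the SecretStore (the role and service account names below match the manifests in this guide):

```shell
# Register the policy
vault policy write virtrigaud-policy virtrigaud-policy.hcl

# Bind the policy to the Kubernetes auth role used by the SecretStore
vault write auth/kubernetes/role/virtrigaud-role \
  bound_service_account_names=external-secrets \
  bound_service_account_namespaces=virtrigaud-system \
  policies=virtrigaud-policy \
  ttl=1h
```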

vSphere Credentials from Vault

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: vsphere-credentials
  namespace: vsphere-providers
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-secret-store
    kind: SecretStore
  target:
    name: vsphere-credentials
    creationPolicy: Owner
    template:
      type: Opaque
      data:
        username: "{{ .username }}"
        password: "{{ .password }}"
        server: "{{ .server }}"
        # Optional: TLS certificate
        ca.crt: "{{ .ca_cert | b64dec }}"
  data:
    - secretKey: username
      remoteRef:
        key: secret/data/providers/vsphere
        property: username
    - secretKey: password
      remoteRef:
        key: secret/data/providers/vsphere
        property: password
    - secretKey: server
      remoteRef:
        key: secret/data/providers/vsphere
        property: server
    - secretKey: ca_cert
      remoteRef:
        key: secret/data/providers/vsphere
        property: ca_cert

AWS Secrets Manager Integration

AWS SecretStore with IRSA

apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets-manager
  namespace: virtrigaud-system
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-west-2
      auth:
        # Use IAM Roles for Service Accounts (IRSA)
        serviceAccount:
          name: external-secrets
          namespace: virtrigaud-system

---
# IAM policy for the IRSA role (a JSON policy document managed in AWS IAM,
# not a Kubernetes resource)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue",
        "secretsmanager:DescribeSecret"
      ],
      "Resource": [
        "arn:aws:secretsmanager:us-west-2:ACCOUNT:secret:virtrigaud/*"
      ]
    }
  ]
}

AWS Secret Configuration

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: aws-provider-credentials
  namespace: provider-namespace
spec:
  refreshInterval: 15m
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: provider-credentials
    creationPolicy: Owner
  data:
    - secretKey: credentials.json
      remoteRef:
        key: "virtrigaud/provider-credentials"
        property: "credentials"
    - secretKey: api-key
      remoteRef:
        key: "virtrigaud/api-keys"
        property: "provider-api-key"

Azure Key Vault Integration

Azure SecretStore

apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: azure-key-vault
  namespace: virtrigaud-system
spec:
  provider:
    azurekv:
      vaultUrl: "https://virtrigaud-vault.vault.azure.net/"
      authType: "ManagedIdentity"
      # Or use Service Principal:
      # authType: "ServicePrincipal"
      # authSecretRef:
      #   clientId:
      #     name: azure-secret
      #     key: client-id
      #   clientSecret:
      #     name: azure-secret
      #     key: client-secret
      tenantId: "tenant-id-here"

---
# Managed Identity setup (ARM template or Terraform)
apiVersion: v1
kind: Secret
metadata:
  name: azure-config
  namespace: virtrigaud-system
type: Opaque
data:
  # Base64 encoded values
  tenant-id: dGVuYW50LWlkLWhlcmU=
  client-id: Y2xpZW50LWlkLWhlcmU=
  client-secret: Y2xpZW50LXNlY3JldC1oZXJl

Azure Key Vault Secret

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: azure-provider-secrets
  namespace: provider-namespace
spec:
  refreshInterval: 30m
  secretStoreRef:
    name: azure-key-vault
    kind: SecretStore
  target:
    name: provider-secrets
    creationPolicy: Owner
  data:
    - secretKey: subscription-id
      remoteRef:
        key: "azure-subscription-id"
    - secretKey: resource-group
      remoteRef:
        key: "azure-resource-group"
    - secretKey: client-certificate
      remoteRef:
        key: "azure-client-cert"

Google Secret Manager Integration

GCP SecretStore

apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: gcp-secret-manager
  namespace: virtrigaud-system
spec:
  provider:
    gcpsm:
      projectId: "your-gcp-project"
      auth:
        # Use Workload Identity
        workloadIdentity:
          clusterLocation: us-central1
          clusterName: virtrigaud-cluster
          serviceAccountRef:
            name: external-secrets
            namespace: virtrigaud-system

---
# Workload Identity binding
apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-secrets
  namespace: virtrigaud-system
  annotations:
    iam.gke.io/gcp-service-account: virtrigaud-secrets@PROJECT.iam.gserviceaccount.com

GCP Secret Configuration

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: gcp-provider-secrets
  namespace: provider-namespace
spec:
  refreshInterval: 20m
  secretStoreRef:
    name: gcp-secret-manager
    kind: SecretStore
  target:
    name: gcp-provider-credentials
    creationPolicy: Owner
  data:
    - secretKey: service-account.json
      remoteRef:
        key: "virtrigaud-service-account"
        version: "latest"
    - secretKey: project-id
      remoteRef:
        key: "gcp-project-id"
        version: "latest"

Provider-Specific Configurations

vSphere Provider with Dynamic Credentials

# Vault configuration for vSphere dynamic credentials
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: vsphere-dynamic-credentials
  namespace: vsphere-providers
spec:
  refreshInterval: 15m  # Short refresh for dynamic credentials
  secretStoreRef:
    name: vault-secret-store
    kind: SecretStore
  target:
    name: vsphere-dynamic-creds
    creationPolicy: Owner
    template:
      type: Opaque
      data:
        username: "{{ .username }}"
        password: "{{ .password }}"
        server: "{{ .server }}"
        session_ttl: "{{ .lease_duration }}"
  data:
    - secretKey: username
      remoteRef:
        key: "vsphere/creds/dynamic-role"
        property: "username"
    - secretKey: password
      remoteRef:
        key: "vsphere/creds/dynamic-role"
        property: "password"
    - secretKey: server
      remoteRef:
        key: "secret/data/vsphere/static"
        property: "server"
    - secretKey: lease_duration
      remoteRef:
        key: "vsphere/creds/dynamic-role"
        property: "lease_duration"

---
# Provider deployment using dynamic credentials
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vsphere-provider
  namespace: vsphere-providers
spec:
  template:
    spec:
      containers:
        - name: provider
          env:
            - name: VSPHERE_USERNAME
              valueFrom:
                secretKeyRef:
                  name: vsphere-dynamic-creds
                  key: username
            - name: VSPHERE_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: vsphere-dynamic-creds
                  key: password
            - name: VSPHERE_SERVER
              valueFrom:
                secretKeyRef:
                  name: vsphere-dynamic-creds
                  key: server

Libvirt Provider with SSH Keys

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: libvirt-ssh-keys
  namespace: libvirt-providers
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-secret-store
    kind: SecretStore
  target:
    name: libvirt-ssh-credentials
    creationPolicy: Owner
    template:
      type: kubernetes.io/ssh-auth
      data:
        ssh-privatekey: "{{ .private_key }}"
        ssh-publickey: "{{ .public_key }}"
        known_hosts: "{{ .known_hosts }}"
  data:
    - secretKey: private_key
      remoteRef:
        key: "secret/data/libvirt/ssh"
        property: "private_key"
    - secretKey: public_key
      remoteRef:
        key: "secret/data/libvirt/ssh"
        property: "public_key"
    - secretKey: known_hosts
      remoteRef:
        key: "secret/data/libvirt/ssh"
        property: "known_hosts"

---
# Mount SSH keys in provider
apiVersion: apps/v1
kind: Deployment
metadata:
  name: libvirt-provider
spec:
  template:
    spec:
      containers:
        - name: provider
          volumeMounts:
            - name: ssh-keys
              mountPath: /home/provider/.ssh
              readOnly: true
          env:
            - name: SSH_AUTH_SOCK
              value: "/tmp/ssh-agent.sock"
      volumes:
        - name: ssh-keys
          secret:
            secretName: libvirt-ssh-credentials
            defaultMode: 0600

TLS Certificate Management

Automatic TLS with External Secrets

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: provider-tls-certs
  namespace: provider-namespace
spec:
  refreshInterval: 24h
  secretStoreRef:
    name: vault-secret-store
    kind: SecretStore
  target:
    name: provider-tls
    creationPolicy: Owner
    template:
      type: kubernetes.io/tls
      data:
        tls.crt: "{{ .certificate }}"
        tls.key: "{{ .private_key }}"
        ca.crt: "{{ .ca_certificate }}"
  data:
    - secretKey: certificate
      remoteRef:
        key: "pki/issue/virtrigaud"
        property: "certificate"
    - secretKey: private_key
      remoteRef:
        key: "pki/issue/virtrigaud"
        property: "private_key"
    - secretKey: ca_certificate
      remoteRef:
        key: "pki/issue/virtrigaud"
        property: "issuing_ca"

---
# Vault PKI configuration (run in Vault)
# vault write pki/roles/virtrigaud \
#   allowed_domains="virtrigaud.local,provider-service" \
#   allow_subdomains=true \
#   max_ttl="8760h" \
#   generate_lease=true

Monitoring and Alerting

ExternalSecret Monitoring

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: external-secrets-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: external-secrets
  endpoints:
    - port: metrics
      interval: 30s

---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: external-secrets-alerts
  namespace: monitoring
spec:
  groups:
    - name: external-secrets.rules
      rules:
        - alert: ExternalSecretSyncFailure
          expr: increase(external_secrets_sync_calls_error[5m]) > 0
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "External secret sync failure"
            description: "ExternalSecret {{ $labels.name }} in namespace {{ $labels.namespace }} failed to sync"
        
        - alert: ExternalSecretStale
          expr: increase(external_secrets_sync_calls_total[1h]) == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "External secret not refreshed"
            description: "ExternalSecret {{ $labels.name }} has not been refreshed for over 1 hour"

Custom Monitoring

package monitoring

import (
    "context"
    "time"
    
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

var (
    secretAge = promauto.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "virtrigaud_secret_age_seconds",
            Help: "Age of provider secrets in seconds",
        },
        []string{"secret_name", "namespace", "provider"},
    )
    
    secretRotationCount = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "virtrigaud_secret_rotations_total",
            Help: "Total number of secret rotations",
        },
        []string{"secret_name", "namespace", "provider"},
    )
)

type SecretMonitor struct {
    client kubernetes.Interface
}

func (sm *SecretMonitor) MonitorSecrets(ctx context.Context) {
    ticker := time.NewTicker(60 * time.Second)
    defer ticker.Stop()
    
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            sm.updateSecretMetrics(ctx)
        }
    }
}

func (sm *SecretMonitor) updateSecretMetrics(ctx context.Context) {
    secrets, err := sm.client.CoreV1().Secrets("").List(ctx, metav1.ListOptions{
        LabelSelector: "app.kubernetes.io/managed-by=external-secrets",
    })
    if err != nil {
        return
    }
    
    for _, secret := range secrets.Items {
        provider := secret.Labels["provider"]
        if provider == "" {
            continue
        }
        
        age := time.Since(secret.CreationTimestamp.Time).Seconds()
        secretAge.WithLabelValues(secret.Name, secret.Namespace, provider).Set(age)
    }
}

Security Best Practices

1. Least Privilege Access

# Minimal Vault policy for a specific provider.
# Replace AUTH_ACCESSOR with the accessor of your Kubernetes auth mount
# (see `vault auth list -format=json`); identity templates require a
# literal accessor and do not support wildcards.
path "secret/data/providers/vsphere/{{identity.entity.aliases.AUTH_ACCESSOR.metadata.service_account_namespace}}" {
  capabilities = ["read"]
}

# Time-bound secrets
path "vsphere/creds/readonly" {
  capabilities = ["read"]
  allowed_parameters = {
    "ttl" = ["15m", "30m", "1h"]
  }
}

2. Secret Rotation Automation

apiVersion: batch/v1
kind: CronJob
metadata:
  name: rotate-provider-secrets
  namespace: virtrigaud-system
spec:
  schedule: "0 2 * * 0"  # Weekly on Sunday at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: secret-rotator
              image: virtrigaud/secret-rotator:latest
              command:
                - /bin/sh
                - -c
                - |
                  # Force refresh of all external secrets
                  # (--overwrite is required on repeat runs, since the
                  # annotation already exists after the first rotation)
                  kubectl annotate externalsecret --all \
                    --overwrite \
                    force-sync="$(date +%s)" \
                    --namespace=vsphere-providers
                  
                  # Restart provider deployments to pick up new secrets
                  kubectl rollout restart deployment \
                    --selector=app.kubernetes.io/name=virtrigaud-provider-runtime \
                    --namespace=vsphere-providers
          restartPolicy: OnFailure
          serviceAccountName: secret-rotator

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: secret-rotator
rules:
  - apiGroups: ["external-secrets.io"]
    resources: ["externalsecrets"]
    verbs: ["get", "list", "patch"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "patch"]
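
The CronJob sets `serviceAccountName: secret-rotator`, so the ClusterRole above also needs a ServiceAccount and a binding; a minimal sketch:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: secret-rotator
  namespace: virtrigaud-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: secret-rotator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: secret-rotator
subjects:
  - kind: ServiceAccount
    name: secret-rotator
    namespace: virtrigaud-system
```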

3. Audit Logging

# Vault audit configuration
vault audit enable file file_path=/vault/logs/audit.log

# Example audit log entry structure
{
  "time": "2023-12-01T10:30:00Z",
  "type": "request",
  "auth": {
    "client_token": "hvs.xxx",
    "accessor": "hmac-sha256:xxx",
    "display_name": "kubernetes-virtrigaud-system-external-secrets",
    "policies": ["virtrigaud-policy"],
    "metadata": {
      "service_account_name": "external-secrets",
      "service_account_namespace": "virtrigaud-system"
    }
  },
  "request": {
    "id": "request-id",
    "operation": "read",
    "path": "secret/data/providers/vsphere",
    "data": null,
    "remote_address": "10.0.0.100"
  }
}

4. Emergency Procedures

#!/bin/bash
# emergency-secret-rotation.sh

echo "=== Emergency Secret Rotation ==="

# 1. Revoke all active leases for a provider
vault lease revoke -prefix vsphere/creds/

# 2. Force refresh all external secrets
kubectl get externalsecret --all-namespaces -o name | \
  xargs -I {} kubectl annotate {} force-sync="$(date +%s)"

# 3. Restart all provider deployments
kubectl get deployments --all-namespaces \
  -l app.kubernetes.io/name=virtrigaud-provider-runtime \
  -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' | \
  while IFS=/ read -r namespace name; do
    kubectl rollout restart deployment "$name" --namespace "$namespace"
  done

# 4. Monitor rollout status
kubectl get deployments --all-namespaces \
  -l app.kubernetes.io/name=virtrigaud-provider-runtime \
  -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}' | \
  while IFS=/ read -r namespace name; do
    kubectl rollout status deployment "$name" --namespace "$namespace" --timeout=300s
  done

echo "Emergency rotation completed"

5. Secret Validation

package validation

import (
    "crypto/tls"
    "crypto/x509"
    "encoding/pem"
    "fmt"
    "time"
)

func ValidateSecret(secretData map[string][]byte, secretType string) error {
    switch secretType {
    case "tls":
        return validateTLSSecret(secretData)
    case "ssh":
        return validateSSHSecret(secretData)
    case "credential":
        return validateCredentialSecret(secretData)
    }
    return nil
}

func validateTLSSecret(data map[string][]byte) error {
    cert, ok := data["tls.crt"]
    if !ok {
        return fmt.Errorf("missing tls.crt")
    }
    
    key, ok := data["tls.key"]
    if !ok {
        return fmt.Errorf("missing tls.key")
    }
    
    // Parse certificate
    block, _ := pem.Decode(cert)
    if block == nil {
        return fmt.Errorf("failed to parse certificate PEM")
    }
    
    parsedCert, err := x509.ParseCertificate(block.Bytes)
    if err != nil {
        return fmt.Errorf("failed to parse certificate: %w", err)
    }
    
    // Check expiration
    if time.Now().After(parsedCert.NotAfter) {
        return fmt.Errorf("certificate expired on %v", parsedCert.NotAfter)
    }
    
    if time.Now().Add(24*time.Hour).After(parsedCert.NotAfter) {
        return fmt.Errorf("certificate expires soon on %v", parsedCert.NotAfter)
    }
    
    // Validate that the private key parses and matches the certificate
    if _, err := tls.X509KeyPair(cert, key); err != nil {
        return fmt.Errorf("invalid certificate/key pair: %w", err)
    }
    
    return nil
}
}

Network Policies for Provider Security

This guide covers Kubernetes NetworkPolicy configurations to secure communication between VirtRigaud components and provider services.

Overview

NetworkPolicies provide network-level security by controlling traffic flow between pods, namespaces, and external endpoints. For VirtRigaud providers, this includes:

  • Ingress Control: Restricting which services can communicate with providers
  • Egress Control: Limiting provider access to external hypervisor endpoints
  • Namespace Isolation: Preventing cross-tenant communication
  • External Access: Controlling access to hypervisor management interfaces

Basic NetworkPolicy Template

Provider Ingress Policy

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: provider-ingress
  namespace: provider-namespace
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: virtrigaud-provider
  policyTypes:
    - Ingress
  ingress:
    # Allow from VirtRigaud manager
    - from:
        - namespaceSelector:
            matchLabels:
              name: virtrigaud-system
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: virtrigaud-manager
      ports:
        - protocol: TCP
          port: 9443  # gRPC provider port
    
    # Allow health checks from monitoring
    - from:
        - namespaceSelector:
            matchLabels:
              name: monitoring
        - podSelector:
            matchLabels:
              app: prometheus
      ports:
        - protocol: TCP
          port: 8080  # Health/metrics port
    
    # Allow from same namespace (for debugging)
    - from:
        - podSelector: {}
      ports:
        - protocol: TCP
          port: 8080

Provider Egress Policy

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: provider-egress
  namespace: provider-namespace
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: virtrigaud-provider
  policyTypes:
    - Egress
  egress:
    # Allow DNS resolution
    - to: []
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    
    # Allow HTTPS to Kubernetes API
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
      ports:
        - protocol: TCP
          port: 443
    
    # Allow access to hypervisor management interfaces
    - to: []
      ports:
        - protocol: TCP
          port: 443  # vCenter HTTPS
        - protocol: TCP
          port: 80   # vCenter HTTP (if needed)
    
    # For libvirt providers - allow access to hypervisor hosts.
    # Hypervisor hosts sit outside the pod network, so target them with an
    # ipBlock (a podSelector matches pods, not nodes); adjust the CIDR to
    # your hypervisor management network.
    - to:
        - ipBlock:
            cidr: 10.1.0.0/24  # example hypervisor management network
      ports:
        - protocol: TCP
          port: 16509  # libvirt daemon

Environment-Specific Policies

vSphere Provider

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vsphere-provider-policy
  namespace: vsphere-providers
  labels:
    provider: vsphere
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: virtrigaud-provider-runtime
      provider: vsphere
  policyTypes:
    - Ingress
    - Egress
  
  ingress:
    # Manager access
    - from:
        - namespaceSelector:
            matchLabels:
              name: virtrigaud-system
      ports:
        - protocol: TCP
          port: 9443
    
    # Monitoring access
    - from:
        - namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - protocol: TCP
          port: 8080

  egress:
    # DNS
    - to: []
      ports:
        - protocol: UDP
          port: 53
    
    # vCenter access (specific IP ranges)
    - to:
        - ipBlock:
            cidr: 10.0.0.0/8
            except:
              - 10.244.0.0/16  # Exclude pod network
      ports:
        - protocol: TCP
          port: 443
    
    - to:
        - ipBlock:
            cidr: 192.168.0.0/16
      ports:
        - protocol: TCP
          port: 443
    
    # ESXi host access for direct operations
    - to:
        - ipBlock:
            cidr: 10.1.0.0/24  # ESXi management network
      ports:
        - protocol: TCP
          port: 443
        - protocol: TCP
          port: 902   # vCenter agent

Libvirt Provider

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: libvirt-provider-policy
  namespace: libvirt-providers
  labels:
    provider: libvirt
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: virtrigaud-provider-runtime
      provider: libvirt
  policyTypes:
    - Ingress
    - Egress
  
  ingress:
    # Manager access
    - from:
        - namespaceSelector:
            matchLabels:
              name: virtrigaud-system
      ports:
        - protocol: TCP
          port: 9443
    
    # Monitoring access
    - from:
        - namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - protocol: TCP
          port: 8080

  egress:
    # DNS
    - to: []
      ports:
        - protocol: UDP
          port: 53
    
    # Access to hypervisor nodes
    - to: []
      ports:
        - protocol: TCP
          port: 16509  # libvirt daemon
        - protocol: TCP
          port: 22     # SSH for remote libvirt
    
    # Access to shared storage (NFS, iSCSI, etc.)
    - to:
        - ipBlock:
            cidr: 10.2.0.0/24  # Storage network
      ports:
        - protocol: TCP
          port: 2049  # NFS
        - protocol: TCP
          port: 3260  # iSCSI
        - protocol: UDP
          port: 111   # RPC portmapper

Mock Provider (Development)

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: mock-provider-policy
  namespace: development
  labels:
    provider: mock
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: virtrigaud-provider-runtime
      provider: mock
  policyTypes:
    - Ingress
    - Egress
  
  ingress:
    # Allow from manager and other development pods
    - from:
        - namespaceSelector:
            matchLabels:
              environment: development
      ports:
        - protocol: TCP
          port: 9443
        - protocol: TCP
          port: 8080

  egress:
    # Allow all egress for development environment
    - to: []

Multi-Tenant Isolation

Tenant Namespace Policies

# Template for tenant-specific policies
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-isolation
  namespace: tenant-{{TENANT_NAME}}
  labels:
    tenant: "{{TENANT_NAME}}"
spec:
  podSelector: {}  # Apply to all pods in namespace
  policyTypes:
    - Ingress
    - Egress
  
  ingress:
    # Allow from same tenant namespace
    - from:
        - namespaceSelector:
            matchLabels:
              tenant: "{{TENANT_NAME}}"
    
    # Allow from VirtRigaud system namespace
    - from:
        - namespaceSelector:
            matchLabels:
              name: virtrigaud-system
    
    # Allow from monitoring namespace
    - from:
        - namespaceSelector:
            matchLabels:
              name: monitoring

  egress:
    # Allow to same tenant namespace
    - to:
        - namespaceSelector:
            matchLabels:
              tenant: "{{TENANT_NAME}}"
    
    # Allow to VirtRigaud system namespace
    - to:
        - namespaceSelector:
            matchLabels:
              name: virtrigaud-system
    
    # DNS resolution
    - to: []
      ports:
        - protocol: UDP
          port: 53
    
    # External hypervisor access (tenant-specific IP ranges)
    - to:
        - ipBlock:
            cidr: "{{TENANT_HYPERVISOR_CIDR}}"
      ports:
        - protocol: TCP
          port: 443

Cross-Tenant Communication Prevention

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-cross-tenant
  namespace: tenant-production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  
  ingress:
    # NetworkPolicies are allow-lists: listing only these sources implicitly
    # denies ingress from every other tenant namespace
    - from:
        - namespaceSelector:
            matchLabels:
              tenant: production
        - namespaceSelector:
            matchLabels:
              name: virtrigaud-system
        - namespaceSelector:
            matchLabels:
              name: monitoring
  
  egress:
    # Likewise, egress is limited to the same tenant plus system namespaces;
    # access to all other namespaces is denied by omission
    - to:
        - namespaceSelector:
            matchLabels:
              name: virtrigaud-system
    - to:
        - namespaceSelector:
            matchLabels:
              name: monitoring
    - to:
        - namespaceSelector:
            matchLabels:
              tenant: production

Advanced Policies

Time-Based Access Control

# Use external controllers like OPA Gatekeeper for time-based policies
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: timerestriction
spec:
  crd:
    spec:
      names:
        kind: TimeRestriction
      validation:
        type: object
        properties:
          allowedHours:
            type: array
            items:
              type: integer
            description: "Allowed hours (0-23) for network access"
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package timerestriction
        
        import future.keywords.in
        
        violation[{"msg": msg}] {
          current_hour := floor(time.now_ns() / 1000000000 / 3600) % 24
          not current_hour in input.parameters.allowedHours
          msg := sprintf("Network access not allowed at hour %v", [current_hour])
        }

---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: TimeRestriction
metadata:
  name: business-hours-only
spec:
  match:
    kinds:
      - apiGroups: ["networking.k8s.io"]
        kinds: ["NetworkPolicy"]
    namespaces: ["production"]
  parameters:
    allowedHours: [8, 9, 10, 11, 12, 13, 14, 15, 16, 17]  # 8 AM - 5 PM

Dynamic IP Allow-listing

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: dynamic-hypervisor-access
  namespace: provider-namespace
  annotations:
    # Use external controllers to update IP blocks dynamically
    network-policy-controller/update-interval: "300s"
    network-policy-controller/ip-source: "configmap:hypervisor-ips"
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: virtrigaud-provider
  policyTypes:
    - Egress
  egress:
    # Will be dynamically updated by controller
    - to:
        - ipBlock:
            cidr: 10.0.0.0/8
    # Static rules remain
    - to: []
      ports:
        - protocol: UDP
          port: 53
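
The annotation above points at a `configmap:hypervisor-ips` source. A sketch of what such a ConfigMap might look like; the data format is whatever your controller expects, and one CIDR per line is assumed here:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hypervisor-ips
  namespace: provider-namespace
data:
  cidrs: |
    10.1.0.0/24
    192.168.10.0/28
```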

Monitoring and Troubleshooting

NetworkPolicy Monitoring

# ServiceMonitor for network policy violations
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: networkpolicy-monitoring
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: networkpolicy-exporter
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

---
# Example alerts for network policy violations
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: networkpolicy-alerts
  namespace: monitoring
spec:
  groups:
    - name: networkpolicy.rules
      rules:
        - alert: NetworkPolicyDeniedConnections
          expr: increase(networkpolicy_denied_connections_total[5m]) > 10
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "High number of denied network connections"
            description: "{{ $labels.source_namespace }}/{{ $labels.source_pod }} had {{ $value }} denied connections to {{ $labels.dest_namespace }}/{{ $labels.dest_pod }}"

Debug NetworkPolicies

#!/bin/bash
# debug-networkpolicy.sh

NAMESPACE=${1:-default}
POD_NAME=${2}

echo "=== NetworkPolicy Debug for $NAMESPACE/$POD_NAME ==="

# List all NetworkPolicies in namespace
echo "NetworkPolicies in namespace $NAMESPACE:"
kubectl get networkpolicy -n $NAMESPACE

# Show specific NetworkPolicy details
echo -e "\nNetworkPolicy details:"
kubectl get networkpolicy -n $NAMESPACE -o yaml

# Test connectivity
if [ -n "$POD_NAME" ]; then
    echo -e "\nTesting connectivity from $POD_NAME:"
    
    # Test DNS resolution
    kubectl exec -n "$NAMESPACE" "$POD_NAME" -- nslookup kubernetes.default.svc.cluster.local
    
    # Test internal connectivity
    kubectl exec -n "$NAMESPACE" "$POD_NAME" -- wget -qO- --timeout=5 http://kubernetes.default.svc.cluster.local/api
    
    # Test external connectivity (adjust as needed)
    kubectl exec -n "$NAMESPACE" "$POD_NAME" -- wget -qO- --timeout=5 https://google.com
fi

# Check iptables rules (if accessible)
echo -e "\nIPTables rules (if accessible):"
kubectl get nodes -o wide
echo "Run the following on a node to see iptables:"
echo "sudo iptables -L -n | grep -E '(KUBE|Chain)'"

CNI-Specific Troubleshooting

Calico

# Check Calico network policies
kubectl get networkpolicy --all-namespaces
kubectl get globalnetworkpolicy

# Check Kubernetes service endpoints
kubectl get endpoints --all-namespaces

# Debug Calico connectivity
kubectl exec -it -n kube-system <calico-node-pod> -- /bin/sh
calicoctl get wep --all-namespaces
calicoctl get netpol --all-namespaces

Cilium

# Check Cilium network policies
kubectl get cnp --all-namespaces  # Cilium Network Policies
kubectl get ccnp --all-namespaces # Cilium Cluster Network Policies

# Debug Cilium connectivity
kubectl exec -it -n kube-system <cilium-pod> -- cilium endpoint list
kubectl exec -it -n kube-system <cilium-pod> -- cilium policy get

Security Best Practices

1. Principle of Least Privilege

# Example: Minimal egress for a provider
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: minimal-egress-example
spec:
  podSelector:
    matchLabels:
      app: provider
  policyTypes:
    - Egress
  egress:
    # Only allow what's absolutely necessary
    - to: []
      ports:
        - protocol: UDP
          port: 53  # DNS only
    - to:
        - ipBlock:
            cidr: 10.1.1.100/32  # Specific vCenter IP only
      ports:
        - protocol: TCP
          port: 443  # HTTPS only

2. Default Deny Policies

# Apply default deny to all namespaces
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  # Empty ingress/egress rules = deny all

3. Regular Policy Auditing

#!/bin/bash
# audit-networkpolicies.sh

echo "=== NetworkPolicy Audit Report ==="
echo "Generated: $(date)"
echo

# Check for namespaces without NetworkPolicies
echo "Namespaces without NetworkPolicies:"
for ns in $(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}'); do
    if [ $(kubectl get networkpolicy -n $ns --no-headers 2>/dev/null | wc -l) -eq 0 ]; then
        echo "  - $ns (WARNING: No network policies)"
    fi
done

echo

# Check for overly permissive policies
echo "Potentially overly permissive policies:"
kubectl get networkpolicy --all-namespaces -o json | jq -r '
  .items[] |
  select(
    any(.spec.egress[]?; (.to // []) | length == 0) or
    any(.spec.ingress[]?; (.from // []) | length == 0)
  ) |
  "\(.metadata.namespace)/\(.metadata.name) - Check for overly broad rules"
'

echo

# Check for unused NetworkPolicies
echo "NetworkPolicies with no matching pods:"
kubectl get networkpolicy --all-namespaces -o json | jq -r '
  .items[] as $np |
  $np.metadata.namespace as $ns |
  $np.spec.podSelector as $selector |
  if ($selector | keys | length) == 0 then
    "\($ns)/\($np.metadata.name) - Applies to all pods in namespace"
  else
    "\($ns)/\($np.metadata.name) - Check if pods match selector"
  end
'

4. Integration with Service Mesh

# Example: Istio integration with NetworkPolicies
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: istio-compatible-policy
spec:
  podSelector:
    matchLabels:
      app: provider
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Allow Istio sidecar communication
    - from:
        - podSelector:
            matchLabels:
              app: istio-proxy
      ports:
        - protocol: TCP
          port: 15090  # Envoy Prometheus metrics
    # Your application ports
    - from:
        - namespaceSelector:
            matchLabels:
              name: virtrigaud-system
      ports:
        - protocol: TCP
          port: 9443
  egress:
    # Allow Istio control plane
    - to:
        - namespaceSelector:
            matchLabels:
              name: istio-system
      ports:
        - protocol: TCP
          port: 15010  # Pilot
        - protocol: TCP
          port: 15011  # Pilot secure

CLI Tools Reference

VirtRigaud provides a comprehensive set of command-line tools for managing virtual machines, developing providers, running conformance tests, and performing load testing. This guide covers all available CLI tools and their usage.

Overview

Tool                  Purpose                                     Target Users
vrtg                  Main CLI for VM management and operations   End users, DevOps teams, System administrators
vcts                  Conformance testing suite                   Provider developers, QA teams, CI/CD pipelines
vrtg-provider         Provider development toolkit                Provider developers, Contributors
virtrigaud-loadgen    Load testing and benchmarking               Performance engineers, SREs

Installation

From GitHub Releases

# Download the latest release
export VIRTRIGAUD_VERSION="v0.2.3"
export PLATFORM="linux-amd64"  # or darwin-amd64, windows-amd64

# Install main CLI tool
curl -L "https://github.com/projectbeskar/virtrigaud/releases/download/${VIRTRIGAUD_VERSION}/vrtg-${PLATFORM}" -o vrtg
chmod +x vrtg
sudo mv vrtg /usr/local/bin/

# Install all CLI tools
curl -L "https://github.com/projectbeskar/virtrigaud/releases/download/${VIRTRIGAUD_VERSION}/virtrigaud-cli-${PLATFORM}.tar.gz" | tar xz
sudo mv vrtg vcts vrtg-provider virtrigaud-loadgen /usr/local/bin/

From Source

git clone https://github.com/projectbeskar/virtrigaud.git
cd virtrigaud

# Build all CLI tools
make build-cli

# Install to /usr/local/bin
sudo make install-cli

# Or install to a custom location
make install-cli PREFIX=$HOME/.local

Using Go

# Install specific version
go install github.com/projectbeskar/virtrigaud/cmd/vrtg@v0.2.3
go install github.com/projectbeskar/virtrigaud/cmd/vcts@v0.2.3
go install github.com/projectbeskar/virtrigaud/cmd/vrtg-provider@v0.2.3
go install github.com/projectbeskar/virtrigaud/cmd/virtrigaud-loadgen@v0.2.3

# Install latest
go install github.com/projectbeskar/virtrigaud/cmd/vrtg@latest

Completion

Enable shell completion for enhanced productivity:

# Bash
vrtg completion bash > /etc/bash_completion.d/vrtg
source /etc/bash_completion.d/vrtg

# Zsh
vrtg completion zsh > "${fpath[1]}/_vrtg"

# Fish
vrtg completion fish > ~/.config/fish/completions/vrtg.fish

# PowerShell
vrtg completion powershell | Out-String | Invoke-Expression

vrtg

The main CLI tool for managing VirtRigaud resources and virtual machines.

Global Flags

--kubeconfig string   Path to kubeconfig file (default: $KUBECONFIG or ~/.kube/config)
--namespace string    Kubernetes namespace (default: "default")
--output string       Output format: table, json, yaml (default: "table")
--timeout duration    Operation timeout (default: 30s)
--verbose             Enable verbose output
-h, --help           Help for vrtg

Commands

vm - Virtual Machine Management

Manage virtual machines with comprehensive lifecycle operations.

# List virtual machines
vrtg vm list [flags]

# Describe a virtual machine
vrtg vm describe <name> [flags]

# Show VM events
vrtg vm events <name> [flags]

# Get VM console URL
vrtg vm console-url <name> [flags]

Flags:

  • --all-namespaces: List VMs across all namespaces
  • --label-selector: Filter by labels (e.g., app=web,env=prod)
  • --field-selector: Filter by fields (e.g., spec.powerState=On)
  • --sort-by: Sort output by column (name, namespace, powerState, provider)
  • --watch: Watch for changes

Examples:

# List all VMs in table format
vrtg vm list

# List VMs with custom output format
vrtg vm list --output json --namespace production

# List VMs across all namespaces
vrtg vm list --all-namespaces

# Filter VMs by labels
vrtg vm list --label-selector environment=production,tier=web

# Watch VM status changes
vrtg vm list --watch

# Get detailed VM information
vrtg vm describe my-vm --output yaml

# Get VM console URL
vrtg vm console-url my-vm

# Show recent VM events
vrtg vm events my-vm

provider - Provider Management

Manage provider configurations and monitor their health.

# List providers
vrtg provider list [flags]

# Show provider status
vrtg provider status <name> [flags]

# Show provider logs
vrtg provider logs <name> [flags]

Flags:

  • --follow: Follow log output (for logs command)
  • --tail: Number of lines to show from end of logs (default: 100)
  • --since: Show logs since timestamp (e.g., 1h, 30m)

Examples:

# List all providers
vrtg provider list

# Check provider status
vrtg provider status vsphere-provider

# View provider logs
vrtg provider logs vsphere-provider --tail 50

# Follow provider logs in real-time
vrtg provider logs vsphere-provider --follow

# Show logs from last hour
vrtg provider logs vsphere-provider --since 1h

snapshot - Snapshot Management

Manage VM snapshots for backup and recovery.

# Create a VM snapshot
vrtg snapshot create <vm-name> <snapshot-name> [flags]

# List snapshots
vrtg snapshot list [vm-name] [flags]

# Revert VM to snapshot
vrtg snapshot revert <vm-name> <snapshot-name> [flags]

Flags for create:

  • --description: Snapshot description
  • --include-memory: Include memory state in snapshot

Examples:

# Create a simple snapshot
vrtg snapshot create my-vm pre-upgrade

# Create snapshot with description and memory
vrtg snapshot create my-vm pre-maintenance \
  --description "Before maintenance window" \
  --include-memory

# List all snapshots
vrtg snapshot list

# List snapshots for specific VM
vrtg snapshot list my-vm

# Revert to a snapshot
vrtg snapshot revert my-vm pre-upgrade

clone - VM Cloning

Clone virtual machines for rapid provisioning.

# Clone a virtual machine
vrtg clone run <source-vm> <target-vm> [flags]

# List clone operations
vrtg clone list [flags]

Flags for run:

  • --linked: Create linked clone (faster, space-efficient)
  • --target-namespace: Namespace for target VM
  • --customize: Apply customization during clone

Examples:

# Simple VM clone
vrtg clone run template-vm new-vm

# Linked clone for development
vrtg clone run production-vm dev-vm --linked

# Clone to different namespace
vrtg clone run template-vm test-vm --target-namespace testing

# List clone operations
vrtg clone list

conformance - Provider Testing

Run conformance tests against providers.

# Run conformance tests
vrtg conformance run <provider> [flags]

Flags:

  • --output-dir: Directory for test results
  • --skip-tests: Comma-separated list of tests to skip
  • --timeout: Test timeout (default: 30m)

Examples:

# Run conformance tests
vrtg conformance run vsphere-provider

# Run tests with custom timeout
vrtg conformance run vsphere-provider --timeout 1h

# Skip specific tests
vrtg conformance run vsphere-provider --skip-tests "test-large-vms,test-network"

diag - Diagnostics

Diagnostic tools for troubleshooting.

# Create diagnostic bundle
vrtg diag bundle [flags]

Flags:

  • --output: Output file path (default: virtrigaud-diag-<timestamp>.tar.gz)
  • --include-logs: Include provider logs in bundle
  • --since: Collect logs since timestamp

Examples:

# Create diagnostic bundle
vrtg diag bundle

# Create bundle with logs from last 2 hours
vrtg diag bundle --include-logs --since 2h

# Custom output location
vrtg diag bundle --output /tmp/debug-bundle.tar.gz

init - Installation

Initialize VirtRigaud in a Kubernetes cluster.

# Initialize virtrigaud
vrtg init [flags]

Flags:

  • --chart-version: Helm chart version to install
  • --namespace: Installation namespace (default: virtrigaud-system)
  • --values: Values file for Helm chart
  • --dry-run: Show what would be installed

Examples:

# Basic installation
vrtg init

# Install specific version
vrtg init --chart-version v0.2.1

# Install with custom values
vrtg init --values custom-values.yaml

# Dry run to see what would be installed
vrtg init --dry-run

vcts

VirtRigaud Conformance Test Suite - runs standardized tests against providers.

Usage

vcts [command] [flags]

Global Flags

--kubeconfig string   Path to kubeconfig file
--namespace string    Kubernetes namespace (default: "virtrigaud-system")
--provider string     Provider name to test
--output-dir string   Output directory for test results (default: "./conformance-results")
--skip-tests strings  Comma-separated list of tests to skip
--timeout duration    Test timeout (default: 30m)
--parallel int        Number of parallel test executions (default: 1)
--verbose             Enable verbose output

Commands

run - Execute Tests

# Run all conformance tests
vcts run --provider vsphere-provider

# Run with custom settings
vcts run --provider vsphere-provider \
  --timeout 1h \
  --parallel 3 \
  --output-dir /tmp/test-results

# Skip specific tests
vcts run --provider libvirt-provider \
  --skip-tests "test-snapshots,test-linked-clones"

# Verbose output for debugging
vcts run --provider proxmox-provider --verbose

list - List Available Tests

# List all available tests
vcts list

# List tests for specific capability
vcts list --capability snapshots

validate - Validate Provider

# Validate provider configuration
vcts validate --provider vsphere-provider

# Check provider connectivity
vcts validate --provider vsphere-provider --check-connectivity

Test Categories

  1. Basic Operations: VM creation, deletion, power operations
  2. Lifecycle Management: Start, stop, restart, suspend operations
  3. Resource Management: CPU, memory, disk operations
  4. Networking: Network configuration and connectivity
  5. Storage: Disk operations, resizing, multiple disks
  6. Snapshots: Create, list, revert, delete snapshots
  7. Cloning: VM cloning and linked clones
  8. Error Handling: Provider error scenarios
  9. Performance: Basic performance benchmarks

Output Formats

Test results are available in multiple formats:

  • JUnit XML: For CI/CD integration
  • JSON: Machine-readable format
  • HTML: Human-readable report
  • TAP: Test Anything Protocol

vrtg-provider

Provider development toolkit for creating and managing VirtRigaud providers.

Usage

vrtg-provider [command] [flags]

Global Flags

--verbose     Enable verbose output
--help        Help for vrtg-provider

Commands

init - Initialize Provider

Bootstrap a new provider project with scaffolding.

vrtg-provider init <provider-name> [flags]

Flags:

  • --template: Template to use (grpc, rest, hybrid)
  • --output-dir: Output directory (default: current directory)
  • --module: Go module name
  • --author: Author name for generated files

Examples:

# Create basic gRPC provider
vrtg-provider init my-provider --template grpc

# Create with custom module
vrtg-provider init my-provider \
  --template grpc \
  --module github.com/myorg/my-provider \
  --author "John Doe <john@example.com>"

# Create in specific directory
vrtg-provider init my-provider \
  --output-dir /path/to/providers \
  --template grpc

generate - Code Generation

Generate boilerplate code for provider implementation.

vrtg-provider generate [type] [flags]

Types:

  • client: Generate client code
  • server: Generate server implementation
  • tests: Generate test scaffolding
  • docs: Generate documentation templates

Examples:

# Generate client code
vrtg-provider generate client --provider my-provider

# Generate test scaffolding
vrtg-provider generate tests --provider my-provider

# Generate documentation
vrtg-provider generate docs --provider my-provider

verify - Verification

Verify provider implementation and compliance.

vrtg-provider verify [flags]

Flags:

  • --provider-dir: Provider directory to verify
  • --check-interface: Verify interface compliance
  • --check-docs: Verify documentation completeness
  • --check-tests: Verify test coverage

Examples:

# Basic verification
vrtg-provider verify --provider-dir ./my-provider

# Comprehensive check
vrtg-provider verify \
  --provider-dir ./my-provider \
  --check-interface \
  --check-docs \
  --check-tests

publish - Publishing

Prepare provider for publishing and distribution.

vrtg-provider publish [flags]

Flags:

  • --provider-dir: Provider directory
  • --version: Version to publish
  • --registry: Container registry
  • --chart-repo: Helm chart repository

Examples:

# Publish provider
vrtg-provider publish \
  --provider-dir ./my-provider \
  --version v1.0.0 \
  --registry ghcr.io/myorg

# Publish with Helm chart
vrtg-provider publish \
  --provider-dir ./my-provider \
  --version v1.0.0 \
  --registry ghcr.io/myorg \
  --chart-repo https://charts.myorg.com

Provider Template Structure

my-provider/
β”œβ”€β”€ cmd/
β”‚   └── provider/
β”‚       └── main.go              # Provider entry point
β”œβ”€β”€ internal/
β”‚   β”œβ”€β”€ provider/
β”‚   β”‚   β”œβ”€β”€ server.go           # gRPC server implementation
β”‚   β”‚   β”œβ”€β”€ client.go           # Provider client
β”‚   β”‚   └── types.go            # Provider-specific types
β”‚   └── config/
β”‚       └── config.go           # Configuration management
β”œβ”€β”€ pkg/
β”‚   └── api/                    # Public API interfaces
β”œβ”€β”€ test/
β”‚   β”œβ”€β”€ conformance/            # Conformance tests
β”‚   └── integration/            # Integration tests
β”œβ”€β”€ deploy/
β”‚   β”œβ”€β”€ helm/                   # Helm charts
β”‚   └── k8s/                    # Kubernetes manifests
β”œβ”€β”€ docs/                       # Documentation
β”œβ”€β”€ Dockerfile                  # Container image
β”œβ”€β”€ Makefile                    # Build automation
└── README.md                   # Provider documentation

virtrigaud-loadgen

Load testing and benchmarking tool for VirtRigaud deployments.

Usage

virtrigaud-loadgen [command] [flags]

Global Flags

--kubeconfig string   Path to kubeconfig file
--namespace string    Kubernetes namespace (default: "default")  
--output-dir string   Output directory for results (default: "./loadgen-results")
--config-file string  Load generation configuration file
--dry-run            Show what would be executed without running
--verbose            Enable verbose output

Commands

run - Execute Load Test

virtrigaud-loadgen run [flags]

Flags:

  • --vms: Number of VMs to create (default: 10)
  • --duration: Test duration (default: 10m)
  • --ramp-up: Ramp-up time (default: 2m)
  • --workers: Number of concurrent workers (default: 5)
  • --provider: Provider to test against
  • --vm-class: VMClass to use for test VMs
  • --vm-image: VMImage to use for test VMs

Examples:

# Basic load test
virtrigaud-loadgen run --vms 50 --duration 15m

# Comprehensive load test
virtrigaud-loadgen run \
  --vms 100 \
  --duration 30m \
  --ramp-up 5m \
  --workers 10 \
  --provider vsphere-provider

# Test with specific configuration
virtrigaud-loadgen run --config-file loadtest-config.yaml

config - Configuration Management

# Generate sample configuration
virtrigaud-loadgen config generate --output sample-config.yaml

# Validate configuration
virtrigaud-loadgen config validate --config-file my-config.yaml

Configuration File

# loadtest-config.yaml
metadata:
  name: "production-load-test"
  description: "Load test for production environment"

spec:
  # Test parameters
  vms: 100
  duration: "30m"
  rampUp: "5m"
  workers: 10
  
  # Target configuration
  provider: "vsphere-provider"
  namespace: "loadtest"
  
  # VM configuration
  vmClass: "standard-vm"
  vmImage: "ubuntu-22-04"
  
  # Test scenarios
  scenarios:
    - name: "vm-lifecycle"
      weight: 70
      operations:
        - create
        - start
        - stop
        - delete
    
    - name: "vm-operations"
      weight: 20
      operations:
        - snapshot
        - clone
        - reconfigure
    
    - name: "provider-stress"
      weight: 10
      operations:
        - rapid-create-delete
        - concurrent-operations

  # Reporting
  reporting:
    formats: ["json", "html", "csv"]
    metrics:
      - response-time
      - throughput
      - error-rate
      - resource-usage

Metrics and Reporting

Load test results include:

  • Performance Metrics: Response times, throughput, latency percentiles
  • Error Analysis: Error rates, failure patterns, error categorization
  • Resource Usage: CPU, memory, network utilization
  • Provider Metrics: Provider-specific performance indicators
  • Trend Analysis: Performance over time, bottleneck identification

Output Formats

  • JSON: Machine-readable results for automation
  • HTML: Interactive dashboard with charts and graphs
  • CSV: Raw data for further analysis
  • Prometheus: Metrics export for monitoring systems

Advanced Usage

Automation and Scripting

Bash Integration

#!/bin/bash
# VM management script

# Function to check VM status
check_vm_status() {
  local vm_name=$1
  vrtg vm describe "$vm_name" --output json | jq -r '.status.powerState'
}

# Wait for VM to be ready
wait_for_vm() {
  local vm_name=$1
  local timeout=300
  local count=0
  
  while [ $count -lt $timeout ]; do
    status=$(check_vm_status "$vm_name")
    if [ "$status" = "On" ]; then
      echo "VM $vm_name is ready"
      return 0
    fi
    sleep 5
    count=$((count + 5))
  done
  
  echo "Timeout waiting for VM $vm_name"
  return 1
}

# Create and wait for VM
vrtg vm create --file vm-config.yaml
wait_for_vm "my-vm"

CI/CD Integration

# .github/workflows/vm-test.yml
name: VM Integration Test

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Install VirtRigaud CLI tools
        run: |
          curl -L "https://github.com/projectbeskar/virtrigaud/releases/latest/download/virtrigaud-cli-linux-amd64.tar.gz" | tar xz
          sudo mv vrtg vcts /usr/local/bin/
      
      - name: Setup kubeconfig
        run: |
          mkdir -p ~/.kube
          echo "${{ secrets.KUBECONFIG }}" | base64 -d > ~/.kube/config
      
      - name: Run conformance tests
        run: vcts run --provider test-provider --output-dir test-results
      
      - name: Upload test results
        uses: actions/upload-artifact@v3
        with:
          name: conformance-results
          path: test-results/

Configuration Management

Environment-specific Configurations

# Development environment
export VRTG_KUBECONFIG=~/.kube/dev-config
export VRTG_NAMESPACE=development
export VRTG_OUTPUT=yaml

# Production environment  
export VRTG_KUBECONFIG=~/.kube/prod-config
export VRTG_NAMESPACE=production
export VRTG_OUTPUT=json

# Use environment-specific settings
vrtg vm list  # Uses environment variables

Configuration Files

Create ~/.vrtg/config.yaml:

contexts:
  development:
    kubeconfig: ~/.kube/dev-config
    namespace: development
    output: yaml
    timeout: 30s
  
  production:
    kubeconfig: ~/.kube/prod-config
    namespace: production
    output: json
    timeout: 60s

current-context: development

aliases:
  ls: vm list
  get: vm describe
  logs: provider logs

Troubleshooting

Common Issues

  1. Connection Issues
# Check cluster connectivity
vrtg provider list

# Validate kubeconfig
kubectl cluster-info

# Check provider logs
vrtg provider logs <provider-name> --tail 100
  2. Permission Issues
# Check RBAC permissions
kubectl auth can-i create virtualmachines

# Get current user context
kubectl auth whoami
  3. Provider Issues
# Check provider status
vrtg provider status <provider-name>

# Run diagnostics
vrtg diag bundle --include-logs

Debug Mode

Enable debug output:

# Global debug flag
vrtg --verbose vm list

# Provider-specific debugging
vrtg provider logs <provider-name> --follow --verbose

# Conformance test debugging
vcts run --provider <provider-name> --verbose

See Also

CLI Reference

VirtRigaud provides several command-line tools for managing virtual machines, testing providers, and developing new providers. All tools are available as part of VirtRigaud v0.2.0.

Overview

| Tool | Purpose | Target Users |
|------|---------|--------------|
| vrtg | Main CLI for VM management | End users, DevOps teams |
| vcts | Conformance testing suite | Provider developers, QA teams |
| vrtg-provider | Provider development toolkit | Provider developers |
| virtrigaud-loadgen | Load testing and benchmarking | Performance engineers |

Installation

From GitHub Releases

# Download the latest release
curl -L "https://github.com/projectbeskar/virtrigaud/releases/download/v0.2.0/vrtg-linux-amd64" -o vrtg
chmod +x vrtg
sudo mv vrtg /usr/local/bin/

# Install all CLI tools
curl -L "https://github.com/projectbeskar/virtrigaud/releases/download/v0.2.0/virtrigaud-cli-linux-amd64.tar.gz" | tar xz
sudo mv vrtg vcts vrtg-provider virtrigaud-loadgen /usr/local/bin/

From Source

git clone https://github.com/projectbeskar/virtrigaud.git
cd virtrigaud

# Build all CLI tools
make build-cli

# Install to /usr/local/bin
sudo make install-cli

Using Go

go install github.com/projectbeskar/virtrigaud/cmd/vrtg@v0.2.0
go install github.com/projectbeskar/virtrigaud/cmd/vcts@v0.2.0
go install github.com/projectbeskar/virtrigaud/cmd/vrtg-provider@v0.2.0
go install github.com/projectbeskar/virtrigaud/cmd/virtrigaud-loadgen@v0.2.0

vrtg

The main CLI tool for managing VirtRigaud resources and virtual machines.

Global Flags

--kubeconfig string   Path to kubeconfig file (default: $KUBECONFIG or ~/.kube/config)
--namespace string    Kubernetes namespace (default: "default")
--output string       Output format: table, json, yaml (default: "table")
--timeout duration    Operation timeout (default: 5m0s)
-h, --help           Help for vrtg

Commands

vm

Manage virtual machines.

# List all VMs
vrtg vm list

# Get detailed VM information
vrtg vm get <vm-name>

# Create a VM from configuration
vrtg vm create --file vm.yaml

# Delete a VM
vrtg vm delete <vm-name>

# Power operations
vrtg vm start <vm-name>
vrtg vm stop <vm-name>
vrtg vm restart <vm-name>

# Scale VMSet
vrtg vm scale <vmset-name> --replicas 5

# Get VM console URL
vrtg vm console <vm-name>

# Watch VM status changes
vrtg vm watch <vm-name>

Examples:

# List VMs with custom output
vrtg vm list --output json --namespace production

# Create VM with timeout
vrtg vm create --file my-vm.yaml --timeout 10m

# Power on all VMs in namespace
vrtg vm list --output json | jq -r '.items[].metadata.name' | xargs -I {} vrtg vm start {}

provider

Manage provider configurations.

# List providers
vrtg provider list

# Get provider details
vrtg provider get <provider-name>

# Check provider connectivity
vrtg provider validate <provider-name>

# Get provider capabilities
vrtg provider capabilities <provider-name>

# View provider logs
vrtg provider logs <provider-name>

# Test provider functionality
vrtg provider test <provider-name>

Examples:

# Validate all providers
vrtg provider list --output json | jq -r '.items[].metadata.name' | xargs -I {} vrtg provider validate {}

# Get detailed provider status
vrtg provider get vsphere-prod --output yaml

image

Manage VM images and templates.

# List available images
vrtg image list

# Get image details
vrtg image get <image-name>

# Prepare an image
vrtg image prepare <image-name>

# Delete an image
vrtg image delete <image-name>

snapshot

Manage VM snapshots.

# List snapshots for a VM
vrtg snapshot list --vm <vm-name>

# Create a snapshot
vrtg snapshot create <vm-name> --name "pre-upgrade"

# Restore from snapshot
vrtg snapshot restore <vm-name> <snapshot-name>

# Delete a snapshot
vrtg snapshot delete <vm-name> <snapshot-name>

completion

Generate shell completion scripts.

# Bash
vrtg completion bash > /etc/bash_completion.d/vrtg

# Zsh
vrtg completion zsh > "${fpath[1]}/_vrtg"

# Fish
vrtg completion fish > ~/.config/fish/completions/vrtg.fish

# PowerShell
vrtg completion powershell > vrtg.ps1

Configuration

vrtg uses the same kubeconfig as kubectl. Configuration precedence:

  1. --kubeconfig flag
  2. KUBECONFIG environment variable
  3. ~/.kube/config

Config File

Create ~/.vrtg/config.yaml for default settings:

defaults:
  namespace: "virtrigaud-system"
  timeout: "10m"
  output: "table"
providers:
  preferred: "vsphere-prod"
output:
  colors: true
  timestamps: true

vcts

VirtRigaud Conformance Test Suite for validating provider implementations.

Global Flags

--kubeconfig string   Path to kubeconfig file
--namespace string    Test namespace (default: "vcts")
--provider string     Provider to test
--output-dir string   Directory for test results
--timeout duration    Test timeout (default: 30m)
--parallel int        Number of parallel tests (default: 1)
--skip strings        Tests to skip (comma-separated)
--verbose             Verbose output
-h, --help           Help for vcts

Commands

run

Run conformance tests against a provider.

# Run all tests
vcts run --provider vsphere-prod

# Run specific test suites
vcts run --provider vsphere-prod --suites core,storage

# Run with custom configuration
vcts run --provider libvirt-test --config test-config.yaml

# Skip specific tests
vcts run --provider vsphere-prod --skip "test-large-vm,test-snapshot-memory"

# Generate detailed report
vcts run --provider vsphere-prod --output-dir ./test-results --verbose

list

List available test suites and tests.

# List all test suites
vcts list suites

# List tests in a suite
vcts list tests --suite core

# List supported providers
vcts list providers

validate

Validate test configuration.

# Validate configuration file
vcts validate --config test-config.yaml

# Validate provider setup
vcts validate --provider vsphere-prod

Test Suites

Core Suite

  • Basic VM lifecycle (create, start, stop, delete)
  • Provider connectivity and authentication
  • Resource allocation and management

Storage Suite

  • Disk creation and attachment
  • Volume expansion operations
  • Storage pool management

Network Suite

  • Network interface management
  • IP address allocation
  • Network connectivity tests

Snapshot Suite

  • Snapshot creation and deletion
  • Snapshot restoration
  • Memory state preservation

Performance Suite

  • VM creation performance
  • Resource utilization benchmarks
  • Concurrent operation handling

Test Configuration

Create test-config.yaml:

provider:
  name: "vsphere-prod"
  type: "vsphere"
  
tests:
  core:
    enabled: true
    timeout: "15m"
  storage:
    enabled: true
    testDiskSize: "10Gi"
  network:
    enabled: false  # Skip network tests
    
resources:
  vmClass: "test-small"
  vmImage: "ubuntu-22-04"
  
cleanup:
  enabled: true
  timeout: "10m"

vrtg-provider

Development toolkit for creating and maintaining VirtRigaud providers.

Global Flags

--verbose            Enable verbose output
-h, --help          Help for vrtg-provider

Commands

init

Initialize a new provider project.

# Create a new provider
vrtg-provider init --name hyperv --type hyperv --output ./hyperv-provider

# Create with custom options
vrtg-provider init --name aws-ec2 --type aws \
  --capabilities snapshots,linked-clones \
  --output ./aws-provider

Options:

  • --name: Provider name
  • --type: Provider type
  • --capabilities: Comma-separated capabilities list
  • --output: Output directory
  • --remote: Generate remote provider (default: true)

generate

Generate code for provider components.

# Generate API types
vrtg-provider generate api --provider-type vsphere

# Generate client code
vrtg-provider generate client --provider-type vsphere --api-version v1

# Generate test suite
vrtg-provider generate tests --provider-type vsphere

# Generate documentation
vrtg-provider generate docs --provider-type vsphere

verify

Verify provider implementation.

# Verify provider structure
vrtg-provider verify structure --path ./my-provider

# Verify capabilities
vrtg-provider verify capabilities --path ./my-provider

# Verify API compatibility
vrtg-provider verify api --path ./my-provider --api-version v1beta1

publish

Publish provider artifacts.

# Build and publish provider image
vrtg-provider publish --path ./my-provider --registry ghcr.io/myorg

# Publish with specific tag
vrtg-provider publish --path ./my-provider --tag v1.0.0

# Dry run publication
vrtg-provider publish --path ./my-provider --dry-run

Provider Structure

my-provider/
β”œβ”€β”€ cmd/
β”‚   └── provider-mytype/
β”‚       β”œβ”€β”€ Dockerfile
β”‚       └── main.go
β”œβ”€β”€ internal/
β”‚   └── provider/
β”‚       β”œβ”€β”€ provider.go
β”‚       β”œβ”€β”€ capabilities.go
β”‚       └── provider_test.go
β”œβ”€β”€ deploy/
β”‚   β”œβ”€β”€ provider.yaml
β”‚   β”œβ”€β”€ service.yaml
β”‚   └── deployment.yaml
β”œβ”€β”€ docs/
β”‚   └── README.md
β”œβ”€β”€ go.mod
β”œβ”€β”€ go.sum
└── Makefile

virtrigaud-loadgen

Load testing and performance benchmarking tool for VirtRigaud providers.

Global Flags

--kubeconfig string   Path to kubeconfig file
--namespace string    Test namespace (default: "loadgen")
--output-dir string   Output directory for results
--config-file string  Load generation configuration file
--dry-run            Show what would be created without executing
--verbose            Verbose output
-h, --help          Help for virtrigaud-loadgen

Commands

run

Execute load generation scenarios.

# Run default load test
virtrigaud-loadgen run --config loadtest.yaml

# Run with custom settings
virtrigaud-loadgen run --config loadtest.yaml --workers 50 --duration 10m

# Run specific scenario
virtrigaud-loadgen run --scenario vm-creation --vms 100

# Generate performance report
virtrigaud-loadgen run --config loadtest.yaml --output-dir ./perf-results

scenarios

Manage load testing scenarios.

# List available scenarios
virtrigaud-loadgen scenarios list

# Show scenario details
virtrigaud-loadgen scenarios get vm-lifecycle

# Validate scenario configuration
virtrigaud-loadgen scenarios validate --config custom-scenario.yaml

analyze

Analyze load test results.

# Generate performance report
virtrigaud-loadgen analyze --input ./perf-results

# Compare test runs
virtrigaud-loadgen analyze --compare run1.csv,run2.csv

# Generate charts
virtrigaud-loadgen analyze --input ./perf-results --charts

Load Test Configuration

Create loadtest.yaml:

metadata:
  name: "vm-creation-load-test"
  description: "Test VM creation performance"

scenarios:
  - name: "vm-creation"
    type: "vm-lifecycle"
    workers: 20
    duration: "5m"
    resources:
      vmClass: "small"
      vmImage: "ubuntu-22-04"
      provider: "vsphere-prod"
    
  - name: "vm-scaling"
    type: "vmset-scaling"
    workers: 5
    iterations: 10
    scaling:
      min: 1
      max: 50
      step: 5

providers:
  - name: "vsphere-prod"
    type: "vsphere"
  - name: "libvirt-test"
    type: "libvirt"

output:
  format: ["csv", "json"]
  metrics: ["latency", "throughput", "errors"]
  
cleanup:
  enabled: true
  timeout: "15m"

Performance Scenarios

VM Lifecycle

  • Create, start, stop, delete operations
  • Measures end-to-end VM management performance

Burst Creation

  • Rapid VM creation under load
  • Tests provider scaling capabilities

VMSet Scaling

  • Scale VMSets up and down
  • Measures horizontal scaling performance

Provider Stress

  • High concurrent operations
  • Tests provider reliability under stress

Results Analysis

Load test results include:

  • Latency metrics: P50, P95, P99 response times
  • Throughput: Operations per second
  • Error rates: Failed operations percentage
  • Resource usage: CPU, memory, network utilization
  • Provider metrics: API call statistics

Example output:

timestamp,scenario,operation,latency_ms,status,provider
2025-01-15T10:00:01Z,vm-creation,create,2500,success,vsphere-prod
2025-01-15T10:00:03Z,vm-creation,create,2800,success,vsphere-prod
2025-01-15T10:00:05Z,vm-creation,create,,timeout,vsphere-prod
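
A results CSV with this column layout can be summarized directly with standard shell tools; the snippet below (illustrative, using sample rows rather than real test output) computes the mean latency of successful operations and the overall error rate.

```shell
# Sample loadgen results using the column layout shown above.
cat > results.csv <<'EOF'
timestamp,scenario,operation,latency_ms,status,provider
2025-01-15T10:00:01Z,vm-creation,create,2500,success,vsphere-prod
2025-01-15T10:00:03Z,vm-creation,create,2800,success,vsphere-prod
2025-01-15T10:00:05Z,vm-creation,create,,timeout,vsphere-prod
EOF

# Column 4 is latency_ms, column 5 is status.
awk -F, 'NR > 1 {
  total++
  if ($5 == "success") { ok++; sum += $4 }
} END {
  printf "mean_latency_ms=%.0f\n", sum / ok
  printf "error_rate=%.2f\n", (total - ok) / total
}' results.csv
# mean_latency_ms=2650
# error_rate=0.33
```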

Best Practices

Using vrtg

  1. Use namespaces to organize resources
  2. Set timeouts appropriately for your environment
  3. Use dry-run options for validation before execution
  4. Monitor operations with watch commands

Testing with vcts

  1. Run core tests first to validate basic functionality
  2. Use separate namespaces for different test runs
  3. Clean up resources after testing
  4. Document test results for compliance tracking

Developing with vrtg-provider

  1. Start with init to create proper structure
  2. Implement core capabilities before advanced features
  3. Test thoroughly with vcts before publishing
  4. Follow naming conventions for consistency

Load Testing with virtrigaud-loadgen

  1. Start small and gradually increase load
  2. Monitor system resources during tests
  3. Use realistic scenarios that match production workloads
  4. Analyze results to identify bottlenecks

Support

Version Information

This documentation covers VirtRigaud CLI tools v0.2.0.

For older versions, see the releases page.

Metrics Catalog

VirtRigaud exposes comprehensive metrics for monitoring and observability. All metrics are available at the /metrics endpoint on port 8080.

Manager Metrics

Reconciliation Metrics

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| virtrigaud_manager_reconcile_total | Counter | kind, outcome | Total number of reconcile loops |
| virtrigaud_manager_reconcile_duration_seconds | Histogram | kind | Time spent in reconcile loops |
| virtrigaud_queue_depth | Gauge | kind | Current queue depth for each resource kind |

VM Operation Metrics

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| virtrigaud_vm_operations_total | Counter | operation, provider_type, provider, outcome | Total VM operations |
| virtrigaud_vm_reconfigure_total | Counter | provider_type, outcome | Total VM reconfiguration operations |
| virtrigaud_vm_snapshot_total | Counter | action, provider_type, outcome | Total VM snapshot operations |
| virtrigaud_vm_clone_total | Counter | linked, provider_type, outcome | Total VM clone operations |
| virtrigaud_vm_image_prepare_total | Counter | provider_type, outcome | Total VM image preparation operations |

Build Information

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| virtrigaud_build_info | Gauge | version, git_sha, go_version | Build information |

Provider Metrics

gRPC Metrics

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| virtrigaud_provider_rpc_requests_total | Counter | provider_type, method, code | Total gRPC requests |
| virtrigaud_provider_rpc_latency_seconds | Histogram | provider_type, method | gRPC request latency |
| virtrigaud_provider_tasks_inflight | Gauge | provider_type, provider | Number of inflight tasks |

Provider-Specific Metrics

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| virtrigaud_ip_discovery_duration_seconds | Histogram | provider_type | Time to discover VM IP addresses |

Error Metrics

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| virtrigaud_errors_total | Counter | reason, component | Total errors by reason and component |

Label Definitions

Common Labels

  • provider_type: The type of provider (vsphere, libvirt)
  • provider: The name of the provider instance
  • outcome: The result of an operation (success, failure, error)
  • kind: The Kubernetes resource kind (VirtualMachine, VMClass, etc.)
  • component: The component generating the metric (manager, provider)

Operation-Specific Labels

  • operation: Type of VM operation (Create, Delete, Power, Describe, Reconfigure)
  • method: gRPC method name (CreateVM, DeleteVM, PowerVM, etc.)
  • code: gRPC status code (OK, INVALID_ARGUMENT, DEADLINE_EXCEEDED, etc.)
  • action: Snapshot action (create, delete, revert)
  • linked: Whether a clone is linked (true, false)
  • reason: Error reason (ConnectionFailed, AuthenticationError, etc.)

Histogram Buckets

Duration histograms use the following buckets (in seconds):

0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300

Example Queries

Prometheus Queries

Error Rate

# Overall error rate
rate(virtrigaud_vm_operations_total{outcome="failure"}[5m]) /
rate(virtrigaud_vm_operations_total[5m])

# Provider-specific error rate
rate(virtrigaud_provider_rpc_requests_total{code!="OK"}[5m]) /
rate(virtrigaud_provider_rpc_requests_total[5m])

Latency

# 95th percentile VM creation time
histogram_quantile(0.95, 
  rate(virtrigaud_vm_operations_duration_seconds_bucket{operation="Create"}[5m])
)

# 95th percentile gRPC request latency by method
histogram_quantile(0.95,
  sum by (le, method) (rate(virtrigaud_provider_rpc_latency_seconds_bucket[5m]))
)

Throughput

# VM operations per second
rate(virtrigaud_vm_operations_total[5m])

# Operations by provider
sum by (provider_type, provider) (rate(virtrigaud_vm_operations_total[5m]))

Queue Depth

# Current queue depth
virtrigaud_queue_depth

# Average queue depth over time
avg_over_time(virtrigaud_queue_depth[5m])

Inflight Tasks

# Current inflight tasks
virtrigaud_provider_tasks_inflight

# Inflight tasks by provider
sum by (provider_type, provider) (virtrigaud_provider_tasks_inflight)

Grafana Dashboard Queries

VM Creation Success Rate Panel

sum(rate(virtrigaud_vm_operations_total{operation="Create",outcome="success"}[5m])) /
sum(rate(virtrigaud_vm_operations_total{operation="Create"}[5m])) * 100

Provider Health Panel

up{job="virtrigaud-provider"}

Error Rate by Provider Panel

sum(rate(virtrigaud_errors_total[5m])) by (component, provider_type)

ServiceMonitor Configuration

Example ServiceMonitor for Prometheus Operator:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: virtrigaud-manager
  namespace: virtrigaud-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: virtrigaud
      app.kubernetes.io/component: manager
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: virtrigaud-providers
  namespace: virtrigaud-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: virtrigaud
      app.kubernetes.io/component: provider
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics

Alert Rules

Example PrometheusRule for common alerts:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: virtrigaud-alerts
  namespace: virtrigaud-system
spec:
  groups:
  - name: virtrigaud.rules
    rules:
    - alert: VirtrigaudProviderDown
      expr: up{job="virtrigaud-provider"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Virtrigaud provider is down"
        description: "Provider {{ $labels.instance }} has been down for more than 5 minutes"

    - alert: VirtrigaudHighErrorRate
      expr: |
        rate(virtrigaud_vm_operations_total{outcome="failure"}[5m]) /
        rate(virtrigaud_vm_operations_total[5m]) > 0.1
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High error rate in VM operations"
        description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.provider }}"

    - alert: VirtrigaudSlowVMCreation
      expr: |
        histogram_quantile(0.95,
          rate(virtrigaud_vm_operations_duration_seconds_bucket{operation="Create"}[5m])
        ) > 600
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Slow VM creation times"
        description: "95th percentile VM creation time is {{ $value }}s"

    - alert: VirtrigaudQueueBacklog
      expr: virtrigaud_queue_depth > 100
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Queue backlog detected"
        description: "Queue depth for {{ $labels.kind }} is {{ $value }}"

Custom Metrics

Providers can expose additional custom metrics specific to their implementation:

vSphere Provider Metrics

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| virtrigaud_vsphere_sessions_total | Counter | datacenter | Total vSphere sessions created |
| virtrigaud_vsphere_api_calls_total | Counter | method, datacenter | Total vSphere API calls |

Libvirt Provider Metrics

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| virtrigaud_libvirt_connections_total | Counter | host | Total Libvirt connections |
| virtrigaud_libvirt_domains_total | Gauge | host, state | Current number of domains by state |

Metric Collection Best Practices

  1. Scrape Interval: Use 30s interval for most metrics
  2. Retention: Keep metrics for at least 30 days for trending
  3. High Cardinality: Be careful with VM names and IDs in labels
  4. Aggregation: Use recording rules for frequently queried metrics
  5. Alerting: Set up alerts for SLI/SLO violations
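
Point 4 above can be implemented with a Prometheus recording rule that precomputes the per-provider operation rate; a sketch for the Prometheus Operator (the rule name `virtrigaud:vm_operations:rate5m` is illustrative, not shipped with VirtRigaud):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: virtrigaud-recording-rules
  namespace: virtrigaud-system
spec:
  groups:
  - name: virtrigaud.recording
    interval: 30s
    rules:
    # Precomputed per-provider VM operation rate for dashboards and alerts.
    - record: virtrigaud:vm_operations:rate5m
      expr: sum by (provider_type, provider) (rate(virtrigaud_vm_operations_total[5m]))
```

Dashboards and alert expressions can then query `virtrigaud:vm_operations:rate5m` directly instead of re-evaluating the `rate()` on every panel refresh.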

Provider Catalog

Last updated: 2025-08-26T14:30:00Z

The VirtRigaud Provider Catalog lists all verified and community providers available for the VirtRigaud virtualization management platform. All providers in this catalog have been tested for conformance and compatibility.

Provider Overview

| Provider | Description | Capabilities | Conformance | Maintainer | License |
|----------|-------------|--------------|-------------|------------|---------|
| Mock Provider | A mock provider for testing and demonstrations | core, snapshot, clone, image-prepare, advanced | Conformance | virtrigaud@projectbeskar.com | Apache-2.0 |
| vSphere Provider | VMware vSphere provider for VirtRigaud | core, snapshot, clone, advanced | Conformance | virtrigaud@projectbeskar.com | Apache-2.0 |
| Libvirt Provider | Libvirt/KVM provider for VirtRigaud | core, snapshot, clone | Conformance | virtrigaud@projectbeskar.com | Apache-2.0 |

Quick Start

Installing a Provider

To install a provider in your Kubernetes cluster, use the VirtRigaud provider runtime Helm chart:

# Add the VirtRigaud Helm repository
helm repo add virtrigaud https://projectbeskar.github.io/virtrigaud
helm repo update

# Install a provider using the runtime chart
helm install my-vsphere-provider virtrigaud/virtrigaud-provider-runtime \
  --namespace vsphere-providers \
  --create-namespace \
  --set image.repository=ghcr.io/projectbeskar/virtrigaud/provider-vsphere \
  --set image.tag=0.1.1 \
  --set env[0].name=VSPHERE_SERVER \
  --set env[0].value=vcenter.example.com
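
The same configuration can be kept in a values file instead of repeated --set flags. This sketch mirrors the keys used by the flags above; the file name is illustrative.

```yaml
# my-vsphere-values.yaml -- mirrors the --set flags shown above
image:
  repository: ghcr.io/projectbeskar/virtrigaud/provider-vsphere
  tag: "0.1.1"
env:
  - name: VSPHERE_SERVER
    value: vcenter.example.com
```

Install with helm install my-vsphere-provider virtrigaud/virtrigaud-provider-runtime -f my-vsphere-values.yaml --namespace vsphere-providers --create-namespace.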

Provider Discovery

Once installed, providers automatically register with the VirtRigaud manager. You can list available providers:

kubectl get providers -n virtrigaud-system

Provider Details

Mock Provider

The mock provider is perfect for:

  • Testing VirtRigaud functionality
  • Development and CI/CD pipelines
  • Learning provider concepts
  • Demonstrating VirtRigaud capabilities

Installation:

helm install mock-provider virtrigaud/virtrigaud-provider-runtime \
  --namespace development \
  --create-namespace \
  --set image.repository=ghcr.io/projectbeskar/virtrigaud/provider-mock \
  --set image.tag=0.1.1 \
  --set env[0].name=LOG_LEVEL \
  --set env[0].value=debug

vSphere Provider

The vSphere provider enables VirtRigaud to manage VMware vSphere environments, including:

  • VM lifecycle management (create, update, delete)
  • Power operations (on, off, restart, suspend)
  • Snapshot management
  • VM cloning and templates
  • Resource allocation and configuration

Prerequisites:

  • VMware vSphere 6.7 or later
  • vCenter Server credentials
  • Network connectivity to vCenter API

Installation:

# Create secret for vSphere credentials
kubectl create secret generic vsphere-credentials \
  --namespace vsphere-providers \
  --from-literal=username=your-username \
  --from-literal=password=your-password

# Install provider
helm install vsphere-provider virtrigaud/virtrigaud-provider-runtime \
  --namespace vsphere-providers \
  --create-namespace \
  --set image.repository=ghcr.io/projectbeskar/virtrigaud/provider-vsphere \
  --set image.tag=0.1.1 \
  --set env[0].name=VSPHERE_SERVER \
  --set env[0].value=vcenter.example.com \
  --set env[1].name=VSPHERE_USERNAME \
  --set env[1].valueFrom.secretKeyRef.name=vsphere-credentials \
  --set env[1].valueFrom.secretKeyRef.key=username \
  --set env[2].name=VSPHERE_PASSWORD \
  --set env[2].valueFrom.secretKeyRef.name=vsphere-credentials \
  --set env[2].valueFrom.secretKeyRef.key=password

Libvirt Provider

The libvirt provider manages KVM/QEMU virtual machines through libvirt, supporting:

  • VM lifecycle management
  • Power state control
  • Snapshot operations
  • Basic cloning capabilities
  • Local and remote libvirt connections

Prerequisites:

  • Libvirt daemon running on target hosts
  • SSH access for remote connections
  • Shared storage for multi-host deployments

Installation:

helm install libvirt-provider virtrigaud/virtrigaud-provider-runtime \
  --namespace libvirt-providers \
  --create-namespace \
  --set image.repository=ghcr.io/projectbeskar/virtrigaud/provider-libvirt \
  --set image.tag=0.1.1 \
  --set env[0].name=LIBVIRT_URI \
  --set env[0].value=qemu:///system \
  --set securityContext.runAsUser=0 \
  --set podSecurityContext.runAsUser=0

Capability Profiles

VirtRigaud defines several capability profiles that providers can implement:

Core Profile

Required for all providers

  • vm.create - Create virtual machines
  • vm.read - Get virtual machine information
  • vm.update - Update virtual machine configuration
  • vm.delete - Delete virtual machines
  • vm.power - Control power state (on/off/restart)
  • vm.list - List virtual machines

Snapshot Profile

Optional - for providers supporting VM snapshots

  • vm.snapshot.create - Create VM snapshots
  • vm.snapshot.list - List VM snapshots
  • vm.snapshot.delete - Delete VM snapshots
  • vm.snapshot.restore - Restore VM from snapshot

Clone Profile

Optional - for providers supporting VM cloning

  • vm.clone - Clone virtual machines
  • vm.template - Create and manage VM templates

Image Prepare Profile

Optional - for providers with image management

  • image.prepare - Prepare VM images
  • image.list - List available images
  • image.upload - Upload custom images

Advanced Profile

Optional - for advanced provider features

  • vm.migrate - Live migrate VMs between hosts
  • vm.resize - Dynamic resource allocation
  • vm.backup - Backup and restore operations
  • vm.monitoring - Advanced monitoring and metrics

Contributing a Provider

Want to add your provider to the catalog? Follow these steps:

1. Develop Your Provider

Use the Provider Developer Tutorial to create your provider using the VirtRigaud SDK.

2. Ensure Conformance

Run the VirtRigaud Conformance Test Suite (VCTS) to verify your provider meets the requirements:

# Install the VCTS tool
go install github.com/projectbeskar/virtrigaud/cmd/vcts@latest

# Run conformance tests
vcts run --provider-endpoint=localhost:9443 --profile=core

3. Publish to Catalog

Use the vrtg-provider publish command to submit your provider:

vrtg-provider publish \
  --name your-provider \
  --image ghcr.io/yourorg/your-provider \
  --tag v1.0.0 \
  --repo https://github.com/yourorg/your-provider \
  --maintainer your-email@example.com \
  --license Apache-2.0

This will:

  1. Run conformance tests
  2. Generate provider badges
  3. Create a catalog entry
  4. Open a pull request to add your provider

4. Catalog Requirements

To be included in the catalog, providers must:

  • βœ… Pass VCTS core profile tests
  • βœ… Include comprehensive documentation
  • βœ… Provide Helm chart for deployment
  • βœ… Follow security best practices
  • βœ… Include proper error handling
  • βœ… Support health checks and metrics
  • βœ… Have active maintenance and support

Provider Support Matrix

| Provider | Kubernetes | VirtRigaud | Go Version | Platforms |
|---|---|---|---|---|
| Mock | 1.25+ | 0.1.0+ | 1.23+ | linux/amd64, linux/arm64 |
| vSphere | 1.25+ | 0.1.0+ | 1.23+ | linux/amd64, linux/arm64 |
| Libvirt | 1.25+ | 0.1.0+ | 1.23+ | linux/amd64 |

Community and Support

Versioning and Compatibility

Providers follow semantic versioning (SemVer) and maintain compatibility with VirtRigaud versions:

  • Major versions (1.0.0 β†’ 2.0.0): Breaking changes to APIs or behavior
  • Minor versions (1.0.0 β†’ 1.1.0): New features, backward compatible
  • Patch versions (1.0.0 β†’ 1.0.1): Bug fixes, security updates

Compatibility Policy:

  • Current VirtRigaud version supports providers from current major version
  • Providers should support at least 2 minor versions of VirtRigaud
  • Breaking changes require migration documentation
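
The "same major version" rule above can be reduced to a one-line check. A minimal sketch (helper names are illustrative, not part of any VirtRigaud package):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// majorOf parses the major component of a SemVer string such as
// "1.2.3"; an optional leading "v" is tolerated.
func majorOf(version string) int {
	version = strings.TrimPrefix(version, "v")
	major, _ := strconv.Atoi(strings.SplitN(version, ".", 2)[0])
	return major
}

// compatible applies the policy: a provider is supported when it
// shares a major version with the manager.
func compatible(manager, provider string) bool {
	return majorOf(manager) == majorOf(provider)
}

func main() {
	fmt.Println(compatible("v1.4.0", "1.1.2")) // true: same major
	fmt.Println(compatible("2.0.0", "1.9.0"))  // false: breaking major bump
}
```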

All providers in this catalog are open source and follow the licensing terms specified in their individual repositories. The catalog itself is maintained under the Apache 2.0 license.

Trademark Notice: VMware and vSphere are trademarks of VMware, Inc. KVM and QEMU are trademarks of their respective owners. All trademarks are the property of their respective owners.

Testing GitHub Actions Workflows Locally

This guide explains how to test VirtRigaud’s GitHub Actions workflows locally before pushing to save on GitHub Actions costs and catch issues early.

Overview

We provide several scripts to test workflows locally:

| Script | Purpose | Dependencies | Use Case |
|---|---|---|---|
| hack/test-workflows-locally.sh | Main orchestrator using act | act, docker | Full GitHub Actions simulation |
| hack/test-lint-locally.sh | Lint workflow replica | go, golangci-lint | Quick lint testing |
| hack/test-ci-locally.sh | CI workflow replica | go, helm, system deps | Comprehensive CI testing |
| hack/test-release-locally.sh | Release workflow simulation | docker, helm, go | Release preparation testing |
| hack/test-helm-locally.sh | Helm charts testing | helm, kind, kubectl | Chart validation and deployment |

Quick Start

1. Setup (First Time)

# Install dependencies and configure act
./hack/test-workflows-locally.sh setup

This will:

  • Install act (GitHub Actions runner)
  • Create .actrc configuration
  • Create .env.local with environment variables
  • Create .secrets file (update with real values if needed)

2. Quick Validation

# Fast syntax check of all workflows
./hack/test-workflows-locally.sh smoke

3. Test Individual Workflows

# Test lint workflow (fastest)
./hack/test-lint-locally.sh

# Test CI workflow (comprehensive)
./hack/test-ci-locally.sh

# Test Helm charts
./hack/test-helm-locally.sh

# Test release workflow (requires Docker)
./hack/test-release-locally.sh v0.2.0-test

Detailed Usage

Lint Testing (test-lint-locally.sh)

Replicates the lint.yml workflow:

# Quick lint test
./hack/test-lint-locally.sh

What it tests:

  • Go version compatibility
  • golangci-lint installation and execution
  • Comprehensive code linting (matching CI exactly)

Requirements:

  • Go 1.23+ (matching CI)
  • Internet access (to download golangci-lint if needed)

CI Testing (test-ci-locally.sh)

Replicates the ci.yml workflow jobs:

# Interactive mode (asks about optional jobs)
./hack/test-ci-locally.sh

# Quick essential tests only
./hack/test-ci-locally.sh quick

# Full CI replication including security scans
./hack/test-ci-locally.sh full

Jobs tested:

  • test: Go tests and coverage
  • lint: Code linting with golangci-lint
  • generate: Code and manifest generation
  • build: Binary compilation
  • build-tools: CLI tools compilation
  • helm: Helm chart validation
  • security: Security scanning (optional)

Requirements:

  • Go 1.23+
  • Helm 3.12+
  • System dependencies (libvirt-dev on Linux)
  • Python 3 (for YAML validation)

Release Testing (test-release-locally.sh)

Simulates the release.yml workflow:

# Test with default tag
./hack/test-release-locally.sh

# Test with specific tag
./hack/test-release-locally.sh v0.3.0-rc.1

# Skip image building (faster)
./hack/test-release-locally.sh --no-images

What it tests:

  • Container image building and pushing (to local registry)
  • Helm chart packaging with version updates
  • CLI tools building for multiple platforms
  • Changelog generation
  • Checksum creation
  • Container image smoke testing

Requirements:

  • Docker
  • Go 1.23+
  • Helm 3.12+
  • Local Docker registry (started automatically)

Helm Testing (test-helm-locally.sh)

Tests Helm charts with real Kubernetes:

# Full helm test suite
./hack/test-helm-locally.sh

# Individual test types
./hack/test-helm-locally.sh lint     # Chart linting only
./hack/test-helm-locally.sh template # Template rendering only
./hack/test-helm-locally.sh crd      # CRD installation only
./hack/test-helm-locally.sh main     # Main chart installation
./hack/test-helm-locally.sh runtime  # Runtime chart installation

# Cleanup after testing
./hack/test-helm-locally.sh cleanup

What it tests:

  • Helm chart linting (helm lint)
  • Template rendering with various value files
  • CRD installation and functionality
  • Chart installation in Kind cluster
  • Pod readiness and basic functionality

Requirements:

  • Helm 3.12+
  • Kind (Kubernetes in Docker)
  • kubectl
  • Docker

Act-Based Testing (test-workflows-locally.sh)

Uses act to run actual GitHub Actions workflows:

# Setup first time
./hack/test-workflows-locally.sh setup

# Test individual workflows
./hack/test-workflows-locally.sh lint
./hack/test-workflows-locally.sh ci
./hack/test-workflows-locally.sh runtime

# Test all workflows (interactive)
./hack/test-workflows-locally.sh all

# Cleanup
./hack/test-workflows-locally.sh cleanup

Advanced usage:

  • Supports secrets from .secrets file
  • Uses reusable containers for speed
  • Artifact handling with local storage
  • Environment variable injection

Configuration Files

.actrc

# Act configuration for GitHub Actions simulation
-P ubuntu-latest=catthehacker/ubuntu:act-22.04
-P ubuntu-22.04=catthehacker/ubuntu:act-22.04  
-P ubuntu-24.04=catthehacker/ubuntu:act-22.04
--container-daemon-socket /var/run/docker.sock
--reuse
--rm

.env.local

# Local environment variables
GO_VERSION=1.23
GOLANGCI_LINT_VERSION=v1.64.8
REGISTRY=localhost:5000
IMAGE_NAME_PREFIX=virtrigaud
GITHUB_ACTOR=local-user
GITHUB_REPOSITORY=projectbeskar/virtrigaud
# ... more environment variables

.secrets (optional)

# GitHub token for release workflows
GITHUB_TOKEN=your_github_token_here
REGISTRY=localhost:5000

Workflow-Specific Notes

Lint Workflow (lint.yml)

  • Fast: Usually completes in 1-2 minutes
  • Requirements: Minimal (Go + golangci-lint)
  • Run before: Every commit
  • Catches: Code style, syntax, and simple errors

CI Workflow (ci.yml)

  • Comprehensive: Tests building, testing, security
  • Duration: 10-20 minutes for full run
  • Platform deps: LibVirt requires Linux for full testing
  • Run before: Pull requests and major changes

Release Workflow (release.yml)

  • Complex: Multi-platform builds, signing, publishing
  • Duration: 20-30 minutes
  • Local only: Uses local registry, no real publishing
  • Run before: Creating releases

Runtime Chart Workflow (runtime-chart.yml)

  • Kubernetes focused: Tests provider runtime charts
  • Requirements: Kind cluster
  • Duration: 5-10 minutes
  • Run before: Chart changes

Best Practices

Daily Development Workflow

# Before committing
./hack/test-lint-locally.sh

# Before pushing feature branch
./hack/test-ci-locally.sh quick

# Before creating PR
./hack/test-ci-locally.sh full

Pre-Release Workflow

# Test release preparation
./hack/test-release-locally.sh v0.2.0-rc.1

# Test chart deployment (full suite runs by default)
./hack/test-helm-locally.sh

# Test with act for full simulation
./hack/test-workflows-locally.sh all

Troubleshooting

Common Issues

  1. Docker permission denied

    sudo usermod -aG docker $USER
    # Then logout/login
    
  2. LibVirt dependencies missing

    # Ubuntu/Debian
    sudo apt-get install libvirt-dev pkg-config
    
    # Skip libvirt tests on non-Linux
    ./hack/test-ci-locally.sh quick
    
  3. Kind cluster creation fails

    # Clean up and retry
    kind delete cluster --name virtrigaud-test
    ./hack/test-helm-locally.sh
    
  4. Act fails with container errors

    # Clean up act containers
    docker ps -a | grep "act-" | awk '{print $1}' | xargs docker rm -f
    
    # Rebuild without cache
    ./hack/test-workflows-locally.sh cleanup
    ./hack/test-workflows-locally.sh setup
    

Debugging Tips

  • Check logs: All scripts provide detailed logging
  • Check options: most scripts support --help to list available modes
  • Incremental testing: Start with lint, then ci quick, then full tests
  • Docker cleanup: Regular docker system prune helps with space

Performance Tips

  1. Use quick modes for daily development
  2. Skip expensive jobs like security scans during iteration
  3. Reuse Kind clusters with ./hack/test-helm-locally.sh
  4. Use local registry for container testing
  5. Run parallel tests when possible

Integration with Development

Git Hooks

Add to .git/hooks/pre-push (and make it executable with chmod +x .git/hooks/pre-push):

#!/bin/bash
echo "Running local lint check before push..."
exec ./hack/test-lint-locally.sh

Using exec makes the hook exit with the lint script's status, so a failing lint blocks the push.

IDE Integration

Many IDEs can run these scripts as build tasks:

VS Code (.vscode/tasks.json):

{
  "version": "2.0.0",
  "tasks": [
    {
      "label": "Test Lint Locally",
      "type": "shell", 
      "command": "./hack/test-lint-locally.sh",
      "group": "test",
      "presentation": {
        "echo": true,
        "reveal": "always",
        "focus": false,
        "panel": "shared"
      }
    }
  ]
}

CI Cost Optimization

By testing locally first:

  • Reduce failed CI runs by ~80%
  • Save GitHub Actions minutes
  • Faster feedback (local runs are often faster)
  • Better debugging (local environment is easier to inspect)

Conclusion

These local testing scripts allow you to:

βœ… Catch issues early before they reach GitHub Actions
βœ… Save costs by reducing failed CI runs
βœ… Debug faster with local environment access
βœ… Test thoroughly with multiple approaches
βœ… Iterate quickly during development

Start with the lint script for daily use, and gradually incorporate the full test suite for comprehensive validation before releases.

Contributing to VirtRigaud

Thank you for your interest in contributing to VirtRigaud! This document provides guidelines and information for contributors.

Development Setup

Prerequisites

  • Go 1.23+
  • Docker
  • Kubernetes cluster (kind, k3s, or remote)
  • kubectl
  • Helm 3.x
  • make

Clone and Setup

git clone https://github.com/projectbeskar/virtrigaud.git
cd virtrigaud

# Install development dependencies
make dev-setup

# Install pre-commit hooks (optional but recommended)
pip install pre-commit
pre-commit install

Development Workflow

1. Making Changes

API Changes

When modifying API types:

# Edit API types
vim api/infra.virtrigaud.io/v1beta1/virtualmachine_types.go

# Generate code and CRDs
make generate
make gen-crds

Code Changes

For other code changes:

# Run tests
make test

# Lint code
make lint

# Format code
make fmt

2. CRD Management

Important: CRDs are generated from code (the source of truth) and are not duplicated in git.

  • config/crd/bases/ - CRDs for local development and releases (checked into git)
  • charts/virtrigaud/crds/ - CRDs for Helm charts (generated during packaging, not checked into git)
# After API changes, generate CRDs
make gen-crds

# For Helm chart development/packaging
make gen-helm-crds

3. Testing

# Unit tests
make test

# Integration tests (requires cluster)
make test-integration

# End-to-end tests
make test-e2e

# Test specific provider
make test-provider-vsphere

4. Local Development

# Deploy to local cluster
make dev-deploy

# Watch for changes and auto-reload
make dev-watch

# Clean up
make dev-clean

Contribution Guidelines

Pull Request Process

  1. Fork and branch: Create a feature branch from main
  2. Make changes: Follow the development workflow above
  3. Test thoroughly: Run all relevant tests
  4. Update docs: Update documentation if needed
  5. CRD sync: Ensure CRDs are synchronized (CI will verify)
  6. Submit PR: Create a pull request with clear description

PR Requirements

  • All tests pass
  • CRDs are in sync (verified by CI)
  • Code is formatted (make fmt)
  • Code is linted (make lint)
  • Documentation updated if needed
  • Changelog entry added (for user-facing changes)

Commit Message Format

Use conventional commit format:

<type>(<scope>): <description>

[optional body]

[optional footer(s)]

Types:

  • feat: New feature
  • fix: Bug fix
  • docs: Documentation changes
  • style: Code style changes
  • refactor: Code refactoring
  • test: Test changes
  • chore: Maintenance tasks

Examples:

feat(vsphere): add graceful shutdown support
fix(crd): resolve powerState validation conflict
docs(upgrade): add CRD synchronization guide

Code Style

Go Code

  • Follow standard Go conventions
  • Use gofmt and golangci-lint
  • Add meaningful comments for exported functions
  • Include unit tests for new functionality

YAML/Kubernetes

  • Use 2-space indentation
  • Follow Kubernetes API conventions
  • Add descriptions to CRD fields
  • Include examples in documentation
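
For instance, a CRD schema fragment following these conventions might look like the following (an illustrative field, not the actual VirtualMachine schema):

```yaml
# Illustrative CRD schema fragment: 2-space indentation,
# a described field, and an explicit enum.
properties:
  powerState:
    type: string
    description: Desired power state of the virtual machine.
    enum: ["On", "Off", "OffGraceful"]
```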

Documentation

  • Use clear, concise language
  • Include code examples
  • Update both API docs and user guides
  • Test documentation examples

Testing

Unit Tests

# Run all unit tests
make test

# Run tests for specific package
go test ./internal/controller/...

# Run with coverage
make test-coverage

Integration Tests

# Requires running Kubernetes cluster
export KUBECONFIG=~/.kube/config
make test-integration

Provider Tests

# Test specific provider (requires infrastructure)
make test-provider-vsphere
make test-provider-libvirt
make test-provider-proxmox

Release Process

For Maintainers

  1. Prepare release:

    # Generate CRDs for config directory (will be in release artifacts)
    make gen-crds
    
    # Update version in charts
    vim charts/virtrigaud/Chart.yaml
    
    # Update changelog
    vim CHANGELOG.md
    
  2. Create release:

    git tag v0.2.1
    git push origin v0.2.1
    
  3. CI handles:

    • Building and pushing images
    • Creating GitHub release
    • Publishing Helm charts
    • Generating CLI binaries

Common Issues

CRD Generation Issues

If you need to regenerate CRDs:

# For local development and config directory
make gen-crds

# For Helm chart packaging
make gen-helm-crds

Note: CRDs in charts/virtrigaud/crds/ are generated during packaging and should not be committed to git.

Test Failures

# Clean and retry
make clean
make test

# For libvirt-related failures
export SKIP_LIBVIRT_TESTS=true
make test

Development Environment

# Reset development environment
make dev-clean
make dev-deploy

# Check logs
kubectl logs -n virtrigaud-system deployment/virtrigaud-manager

Getting Help

  • GitHub Issues: Bug reports and feature requests
  • GitHub Discussions: Questions and community support
  • Documentation: Check docs/ directory
  • Code Review: Maintainers will provide feedback on PRs

Recognition

Contributors are recognized in:

  • CHANGELOG.md for significant contributions
  • README.md contributors section
  • GitHub contributor graphs

Thank you for contributing to VirtRigaud! πŸš€

VirtRigaud Examples

This directory contains comprehensive examples for VirtRigaud v0.2.3+, showcasing all features and capabilities.

Quick Start Examples

Basic Examples

Provider Examples

Resource Examples

v0.2.1 Feature Examples

New in v0.2.1

Advanced Provider Examples

Multi-Provider Examples

v0.2.3 Feature Summary

πŸ”§ VM Reconfiguration (vSphere, Libvirt, Proxmox)

# Online resource changes (vSphere, Proxmox)
# Offline changes (Libvirt - requires restart)
spec:
  vmClassRef: medium  # Change from small to medium
  powerState: "On"

πŸ“‹ Async Task Tracking (vSphere, Proxmox)

# Automatic tracking of long-running operations
# Real-time progress and error reporting

πŸ–₯️ Console Access (vSphere, Libvirt)

# Web console URLs automatically generated
status:
  consoleURL: "https://vcenter.example.com/ui/app/vm..."  # vSphere
  # or
  consoleURL: "vnc://libvirt-host:5900"  # Libvirt VNC

🌐 Guest Agent Integration (Proxmox)

# Accurate IP detection via QEMU guest agent
status:
  ipAddresses:
    - 192.168.1.100
    - fd00::1234:5678:9abc:def0

πŸ“¦ VM Cloning (vSphere)

# Full and linked clones with automatic snapshot handling
spec:
  vmImageRef: source-vm
  cloneType: linked  # or "full"

πŸ”„ Previous Features (v0.2.1)

  • Graceful Shutdown: OffGraceful power state with VMware Tools
  • Hardware Version Management: vSphere hardware version control
  • Proper Disk Sizing: Correct disk allocation across providers
  • Enhanced Lifecycle Management: postStart/preStop hooks

Usage Patterns

Testing v0.2.3 Features

  1. Test VM reconfiguration:

    # Change VM class to trigger reconfiguration
    kubectl patch virtualmachine my-vm --type='merge' \
      -p='{"spec":{"vmClassRef":"medium"}}'
    
    # Watch the reconfiguration process
    kubectl get vm my-vm -w
    
  2. Access VM console:

    # Get console URL from VM status
    kubectl get vm my-vm -o jsonpath='{.status.consoleURL}'
    
    # For VNC (Libvirt): Use any VNC client
    vncviewer $(kubectl get vm my-vm -o jsonpath='{.status.consoleURL}' | sed 's/vnc:\/\///')
    
  3. Monitor async tasks (vSphere, Proxmox):

    # Watch task progress in provider logs
    kubectl logs -f deployment/virtrigaud-provider-vsphere
    
  4. Verify guest agent (Proxmox):

    # Check IP addresses from guest agent
    kubectl get vm my-vm -o jsonpath='{.status.ipAddresses}'
    
  5. Test VM cloning (vSphere):

    # Create a clone of existing VM
    kubectl apply -f - <<EOF
    apiVersion: infra.virtrigaud.io/v1beta1
    kind: VirtualMachine
    metadata:
      name: web-server-clone
    spec:
      vmClassRef: small
      vmImageRef: web-server-01
      cloneType: linked
    EOF
    

Development Workflow

  1. Choose base example based on your use case
  2. Customize provider, class, and VM specifications
  3. Test locally with your infrastructure
  4. Iterate based on your requirements

Production Deployment

  1. Start with complete-example.yaml
  2. Add security configurations from security/ subdirectory
  3. Configure secrets from secrets/ subdirectory
  4. Apply advanced patterns from advanced/ subdirectory

File Organization

docs/examples/
β”œβ”€β”€ README.md                          # This file
β”œβ”€β”€ complete-example.yaml             # Complete setup guide
β”œβ”€β”€ v021-feature-showcase.yaml        # 🌟 v0.2.1 comprehensive demo
β”œβ”€β”€ vm-ubuntu-small.yaml             # Simple VM example
β”œβ”€β”€ vmclass-small.yaml               # Basic VMClass
β”œβ”€β”€ provider-*.yaml                  # Provider configurations
β”œβ”€β”€ graceful-shutdown-examples.yaml  # OffGraceful demos
β”œβ”€β”€ vsphere-hardware-versions.yaml   # Hardware version examples
β”œβ”€β”€ disk-sizing-examples.yaml        # Disk sizing tests
β”œβ”€β”€ advanced/                        # Complex scenarios
β”œβ”€β”€ secrets/                         # Secret management
└── security/                        # Security configurations

Version Compatibility

  • v0.2.3+: All examples with v0.2.3 features (Reconfigure, Clone, TaskStatus, ConsoleURL, Guest Agent)
  • v0.2.2: Nested virtualization, TPM support, snapshot management
  • v0.2.1: Graceful shutdown, hardware version, disk sizing fixes
  • v0.2.0: Initial production-ready providers
  • v0.1.x: Legacy examples in git history

Need Help?


Pro Tip: Start with v021-feature-showcase.yaml to see all v0.2.1 capabilities in action! πŸš€