VirtRigaud exposes Prometheus metrics for the manager and its providers. All metrics are served at the /metrics endpoint on port 8080.
| Metric Name | Type | Labels | Description |
|---|---|---|---|
| `virtrigaud_manager_reconcile_total` | Counter | `kind`, `outcome` | Total number of reconcile loops |
| `virtrigaud_manager_reconcile_duration_seconds` | Histogram | `kind` | Time spent in reconcile loops |
| `virtrigaud_queue_depth` | Gauge | `kind` | Current queue depth for each resource kind |

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| `virtrigaud_vm_operations_total` | Counter | `operation`, `provider_type`, `provider`, `outcome` | Total VM operations |
| `virtrigaud_vm_reconfigure_total` | Counter | `provider_type`, `outcome` | Total VM reconfiguration operations |
| `virtrigaud_vm_snapshot_total` | Counter | `action`, `provider_type`, `outcome` | Total VM snapshot operations |
| `virtrigaud_vm_clone_total` | Counter | `linked`, `provider_type`, `outcome` | Total VM clone operations |
| `virtrigaud_vm_image_prepare_total` | Counter | `provider_type`, `outcome` | Total VM image preparation operations |

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| `virtrigaud_build_info` | Gauge | `version`, `git_sha`, `go_version` | Build information |

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| `virtrigaud_provider_rpc_requests_total` | Counter | `provider_type`, `method`, `code` | Total gRPC requests |
| `virtrigaud_provider_rpc_latency_seconds` | Histogram | `provider_type`, `method` | gRPC request latency |
| `virtrigaud_provider_tasks_inflight` | Gauge | `provider_type`, `provider` | Number of inflight tasks |

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| `virtrigaud_ip_discovery_duration_seconds` | Histogram | `provider_type` | Time to discover VM IP addresses |

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| `virtrigaud_errors_total` | Counter | `reason`, `component` | Total errors by reason and component |
provider_type: The type of provider (vsphere, libvirt)
provider: The name of the provider instance
outcome: The result of an operation (success, failure, error)
kind: The Kubernetes resource kind (VirtualMachine, VMClass, etc.)
component: The component generating the metric (manager, provider)
operation: Type of VM operation (Create, Delete, Power, Describe, Reconfigure)
method: gRPC method name (CreateVM, DeleteVM, PowerVM, etc.)
code: gRPC status code (OK, INVALID_ARGUMENT, DEADLINE_EXCEEDED, etc.)
action: Snapshot action (create, delete, revert)
linked: Whether a clone is linked (true, false)
reason: Error reason (ConnectionFailed, AuthenticationError, etc.)
Duration histograms use the following buckets (in seconds):
0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300
# Overall error rate
sum(rate(virtrigaud_vm_operations_total{outcome="failure"}[5m])) /
sum(rate(virtrigaud_vm_operations_total[5m]))
# Provider-specific error rate
sum by (provider_type) (rate(virtrigaud_provider_rpc_requests_total{code!="OK"}[5m])) /
sum by (provider_type) (rate(virtrigaud_provider_rpc_requests_total[5m]))
# 95th percentile VM creation time
histogram_quantile(0.95,
  sum by (le) (rate(virtrigaud_vm_operations_duration_seconds_bucket{operation="Create"}[5m]))
)
# 95th percentile gRPC request latency by method
histogram_quantile(0.95,
  sum by (le, method) (rate(virtrigaud_provider_rpc_latency_seconds_bucket[5m]))
)
# VM operations per second
rate(virtrigaud_vm_operations_total[5m])
# Operations by provider
sum by (provider_type, provider) (rate(virtrigaud_vm_operations_total[5m]))
# Current queue depth
virtrigaud_queue_depth
# Average queue depth over time
avg_over_time(virtrigaud_queue_depth[5m])
# Current inflight tasks
virtrigaud_provider_tasks_inflight
# Inflight tasks by provider
sum by (provider_type, provider) (virtrigaud_provider_tasks_inflight)
# VM creation success rate (percent)
sum(rate(virtrigaud_vm_operations_total{operation="Create",outcome="success"}[5m])) /
sum(rate(virtrigaud_vm_operations_total{operation="Create"}[5m])) * 100
# Provider availability
up{job="virtrigaud-provider"}
# Error rate by component and reason
sum by (component, reason) (rate(virtrigaud_errors_total[5m]))
Example ServiceMonitor for Prometheus Operator:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: virtrigaud-manager
  namespace: virtrigaud-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: virtrigaud
      app.kubernetes.io/component: manager
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: virtrigaud-providers
  namespace: virtrigaud-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: virtrigaud
      app.kubernetes.io/component: provider
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
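A ServiceMonitor selects Service objects by label and scrapes the Service port whose name matches `port`, so the examples above only work if a matching Service exists. The following is a minimal sketch of such a Service for the provider pods. The Service name and the pod selector are assumptions rather than part of VirtRigaud's shipped manifests; port 8080 is the metrics port stated earlier.

```yaml
# Hypothetical Service for the provider pods. The name and the pod selector
# are assumptions and must match your actual deployment.
apiVersion: v1
kind: Service
metadata:
  name: virtrigaud-provider
  namespace: virtrigaud-system
  labels:
    # These labels are what the ServiceMonitor's matchLabels select on.
    app.kubernetes.io/name: virtrigaud
    app.kubernetes.io/component: provider
spec:
  selector:
    # Assumed pod labels; adjust to the labels your provider pods carry.
    app.kubernetes.io/name: virtrigaud
    app.kubernetes.io/component: provider
  ports:
    - name: metrics      # referenced by "port: metrics" in the ServiceMonitor
      port: 8080
      targetPort: 8080
```

When a ServiceMonitor does not set `jobLabel`, Prometheus Operator uses the Service name as the `job` label, so naming the Service `virtrigaud-provider` is what makes a selector such as `up{job="virtrigaud-provider"}` match.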
Example PrometheusRule for common alerts:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: virtrigaud-alerts
  namespace: virtrigaud-system
spec:
  groups:
    - name: virtrigaud.rules
      rules:
        - alert: VirtrigaudProviderDown
          expr: up{job="virtrigaud-provider"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Virtrigaud provider is down"
            description: "Provider {{ $labels.instance }} has been down for more than 5 minutes"
        - alert: VirtrigaudHighErrorRate
          expr: |
            sum by (provider) (rate(virtrigaud_vm_operations_total{outcome="failure"}[5m])) /
            sum by (provider) (rate(virtrigaud_vm_operations_total[5m])) > 0.1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High error rate in VM operations"
            description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.provider }}"
        - alert: VirtrigaudSlowVMCreation
          # The largest finite histogram bucket is 300s, so the estimated
          # quantile never exceeds 300; the threshold must stay below that.
          expr: |
            histogram_quantile(0.95,
              sum by (le) (rate(virtrigaud_vm_operations_duration_seconds_bucket{operation="Create"}[5m]))
            ) > 120
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Slow VM creation times"
            description: "95th percentile VM creation time is {{ $value }}s"
        - alert: VirtrigaudQueueBacklog
          expr: virtrigaud_queue_depth > 100
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Queue backlog detected"
            description: "Queue depth for {{ $labels.kind }} is {{ $value }}"
Providers can expose additional custom metrics specific to their implementation:
| Metric Name | Type | Labels | Description |
|---|---|---|---|
| `virtrigaud_vsphere_sessions_total` | Counter | `datacenter` | Total vSphere sessions created |
| `virtrigaud_vsphere_api_calls_total` | Counter | `method`, `datacenter` | Total vSphere API calls |

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| `virtrigaud_libvirt_connections_total` | Counter | `host` | Total Libvirt connections |
| `virtrigaud_libvirt_domains_total` | Gauge | `host`, `state` | Current number of domains by state |
Scrape Interval: Use a 30s interval for most metrics
Retention: Keep metrics for at least 30 days for trending
High Cardinality: Avoid VM names and IDs in labels; each distinct value creates a separate time series
Aggregation: Use recording rules for frequently queried metrics
Alerting: Set up alerts for SLI/SLO violations
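As an illustration of the aggregation advice, the sketch below precomputes two frequently used expressions as recording rules, so dashboards and alerts can read the stored series instead of re-evaluating the full query. The rule group name, the resource name, and the recorded series names are hypothetical; they follow the common `level:metric:operation` naming pattern and are not defined by VirtRigaud.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: virtrigaud-recording-rules   # hypothetical name
  namespace: virtrigaud-system
spec:
  groups:
    - name: virtrigaud.recording     # hypothetical group name
      interval: 30s                  # matches the suggested scrape interval
      rules:
        # Per-second VM operation rate, aggregated by provider type and outcome.
        - record: provider_type_outcome:virtrigaud_vm_operations:rate5m
          expr: |
            sum by (provider_type, outcome) (rate(virtrigaud_vm_operations_total[5m]))
        # 95th percentile gRPC latency per method.
        - record: method:virtrigaud_provider_rpc_latency_seconds:p95_5m
          expr: |
            histogram_quantile(0.95,
              sum by (le, method) (rate(virtrigaud_provider_rpc_latency_seconds_bucket[5m]))
            )
```

A dashboard panel could then query `provider_type_outcome:virtrigaud_vm_operations:rate5m{outcome="failure"}` directly, which is cheaper than evaluating the underlying `rate()` on every refresh.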