What key metrics do you provide for monitoring performance and availability of the data and control planes?

Asked and answered by muratkars, July 17, 2025

Monitoring data and control plane performance is critical for maintaining storage availability and performance. Understanding the key metrics and alert thresholds ensures proactive system management.

This question covers:

  • Essential data plane metrics
  • Control plane health monitoring
  • Recommended alert thresholds
  • Replication-specific metrics

Answer

MinIO provides comprehensive metrics for monitoring both data and control plane performance, with specific recommendations for alert thresholds to ensure optimal availability and performance.
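All of the metric paths referenced below are served from MinIO's Prometheus-compatible metrics API. As a sketch only, a minimal scrape job for one of these endpoints might look like the following (the job name, host, and metrics path here are placeholders; MinIO's `mc admin prometheus generate <alias>` command can emit a ready-made scrape config, including the bearer token that authenticated deployments require):

```yaml
scrape_configs:
  - job_name: minio-cluster          # placeholder job name
    metrics_path: /minio/metrics/v3/cluster/health
    scheme: http
    static_configs:
      - targets: ["minio.example.net:9000"]   # placeholder host
```

One scrape job per metric path keeps the collection aligned with the endpoint groupings used in the sections below.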

Erasure Set Health Metrics

Critical Erasure Set Monitoring: From /cluster/erasure-set:

# Erasure set overall health - CRITICAL
minio_cluster_erasure_set_health
# Alert: if not 1 (indicates degraded erasure set)
# Drives currently healing - WARNING
minio_cluster_erasure_set_healing_drives_count
# Alert: if greater than 0 (healing in progress)
# Read capability health - CRITICAL
minio_cluster_erasure_set_read_health
# Alert: if not 1 (read operations impacted)
# Write capability health - CRITICAL
minio_cluster_erasure_set_write_health
# Alert: if not 1 (write operations impacted)

Alert Configuration Example:

# Erasure set health alerts
- alert: MinIOErasureSetDegraded
  expr: minio_cluster_erasure_set_health != 1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "MinIO erasure set degraded"

- alert: MinIOErasureSetHealing
  expr: minio_cluster_erasure_set_healing_drives_count > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "MinIO erasure set healing in progress"

Cluster Health Metrics

Overall Cluster Status: From /cluster/health:

# Offline drives count - CRITICAL
minio_cluster_health_drives_offline_count
# Alert: if not 0 (drives unavailable)
# Offline nodes count - CRITICAL
minio_cluster_health_nodes_offline_count
# Alert: if not 0 (nodes unavailable)
# Cluster capacity utilization - WARNING
(minio_cluster_health_capacity_usable_free_bytes / minio_cluster_health_capacity_usable_total_bytes)
# Alert: if free/total < 0.3, i.e. usage above 70% (approaching capacity limits)

Cluster Health Alerts:

# Cluster availability alerts
- alert: MinIODrivesOffline
  expr: minio_cluster_health_drives_offline_count > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "MinIO has offline drives"

- alert: MinIONodesOffline
  expr: minio_cluster_health_nodes_offline_count > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "MinIO has offline nodes"

- alert: MinIOHighCapacityUsage
  expr: (minio_cluster_health_capacity_usable_free_bytes / minio_cluster_health_capacity_usable_total_bytes) < 0.3
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "MinIO cluster capacity usage above 70%"

System Drive Metrics

Individual Drive Monitoring: From /system/drive:

# Drive space utilization - WARNING
(minio_system_drive_used_bytes/minio_system_drive_total_bytes) > 0.70
# Alert: if >70% (drive approaching full)
# Inode utilization - WARNING
(minio_system_drive_used_inodes/minio_system_drive_total_inodes) > 0.70
# Alert: if >70% (inode exhaustion risk)
# Drive timeout errors - CRITICAL
minio_system_drive_timeout_errors_total > 100
# Alert: if >100 (hardware issues)
# Drive I/O errors - CRITICAL
minio_system_drive_io_errors_total > 100
# Alert: if >100 (drive failure indicators)
# Offline drive count - CRITICAL
minio_system_drive_offline_count > 0
# Alert: if >0 (drive unavailable)
# Drive health status - CRITICAL
minio_system_drive_health != 1
# Alert: if not 1 (drive unhealthy)
# Drive API latency - WARNING
minio_system_drive_api_latency_micros > 1000
# Alert: if >1000μs (performance degradation)

Drive Health Alerts:

# Drive-specific alerts
- alert: MinIODriveHighUtilization
  expr: (minio_system_drive_used_bytes / minio_system_drive_total_bytes) > 0.70
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "MinIO drive utilization above 70%"

- alert: MinIODriveErrors
  expr: increase(minio_system_drive_io_errors_total[1h]) > 100
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "MinIO drive experiencing I/O errors"

- alert: MinIODriveHighLatency
  expr: minio_system_drive_api_latency_micros > 1000
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "MinIO drive latency degraded"

System Memory Metrics

Memory Utilization: From /system/memory:

# Memory utilization - WARNING
(minio_system_memory_used/minio_system_memory_total) > 0.70
# Alert: if >70% (memory pressure)

Memory Alert:

- alert: MinIOHighMemoryUsage
  expr: (minio_system_memory_used / minio_system_memory_total) > 0.70
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "MinIO memory usage above 70%"

Replication Metrics (Replication Deployments Only)

Replication Performance: From /replication:

# Data transfer rate - WARNING
minio_replication_average_transfer_rate
# Alert: if < threshold (X MB/s - environment specific)
# Queue backlog - CRITICAL
minio_replication_last_minute_queued_count
# Alert: if > 8K (approaching 10K queue maximum)

Per-Bucket Replication: From /bucket/replication:

# Hourly failure count - WARNING
minio_bucket_replication_last_hour_failed_count
# Alert: if > threshold (environment specific)
# Minute failure count - CRITICAL
minio_bucket_replication_last_minute_failed_count
# Alert: if > threshold (immediate failures)
# Replication latency - WARNING
minio_bucket_replication_latency_ms
# Alert: if >1000ms (performance degradation)
# Total failure count - WARNING
minio_bucket_replication_total_failed_count
# Alert: if > threshold (accumulated failures)

Replication Alerts:

# Replication monitoring
- alert: MinIOReplicationQueueBacklog
  expr: minio_replication_last_minute_queued_count > 8000
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "MinIO replication queue approaching limit"

- alert: MinIOReplicationHighLatency
  expr: minio_bucket_replication_latency_ms > 1000
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "MinIO replication latency degraded"

- alert: MinIOReplicationFailures
  expr: increase(minio_bucket_replication_last_minute_failed_count[5m]) > 10
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "MinIO replication failures detected"

Optional Bucket Usage Metrics

Bucket-Level Monitoring: From /cluster/usage/buckets:

# Per-bucket size monitoring - INFORMATIONAL
minio_cluster_usage_buckets_total_bytes
# Alert: if greater than threshold for specific bucket
# Object version count - WARNING
minio_cluster_usage_buckets_versions_count
# Alert: if greater than threshold (version explosion)

Bucket Usage Alerts:

# Optional bucket monitoring
- alert: MinIOBucketSizeThreshold
  expr: minio_cluster_usage_buckets_total_bytes{bucket="critical-bucket"} > 1e12  # 1TB
  for: 1h
  labels:
    severity: info
  annotations:
    summary: "Bucket size exceeds threshold"

- alert: MinIOExcessiveVersions
  expr: minio_cluster_usage_buckets_versions_count > 1000000
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Bucket has excessive object versions"

Complete Monitoring Dashboard

Essential Metrics Summary:

Metric Category | Critical Alerts               | Warning Alerts
Erasure Sets    | Health != 1, Read/Write != 1  | Healing > 0
Cluster Health  | Offline drives/nodes > 0      | Capacity > 70%
Drive Health    | Errors > 100, Offline > 0     | Utilization > 70%, Latency > 1ms
Memory          | -                             | Usage > 70%
Replication     | Queue > 8K, Recent failures   | Transfer rate low, Latency > 1s

Prometheus Query Examples

Quick Health Check:

# Overall cluster health score
min(minio_cluster_erasure_set_health) *
min(minio_cluster_erasure_set_read_health) *
min(minio_cluster_erasure_set_write_health)
# Result: 1 = healthy, 0 = degraded
# Capacity utilization percentage
100 - (minio_cluster_health_capacity_usable_free_bytes / minio_cluster_health_capacity_usable_total_bytes * 100)
# Average drive latency
avg(minio_system_drive_api_latency_micros)
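As a sanity check, the health-score and capacity arithmetic in these queries can be reproduced outside Prometheus. The sketch below runs the same calculations on sample values (all numbers are hypothetical, not taken from a live cluster):

```python
# Sketch of the dashboard arithmetic above, run on sample metric values.

def cluster_health_score(set_health, read_health, write_health):
    # Mirrors: min(set) * min(read) * min(write) -> 1 = healthy, 0 = degraded
    return min(set_health) * min(read_health) * min(write_health)

def capacity_used_percent(free_bytes, total_bytes):
    # Mirrors: 100 - (free / total * 100)
    return 100 - (free_bytes / total_bytes * 100)

# One erasure set has lost write capability -> overall score drops to 0
print(cluster_health_score([1, 1], [1, 1], [1, 0]))          # 0
# 300 GiB free out of 1 TiB usable -> ~70.7% used
print(round(capacity_used_percent(300 * 2**30, 2**40), 1))   # 70.7
```

Because the health score multiplies the three minima, any single degraded erasure set drives the whole score to zero, which is exactly the behavior wanted for a top-level dashboard indicator.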

Key Monitoring Principles

  1. Critical vs Warning: Distinguish between service-impacting (critical) and performance-degrading (warning) conditions
  2. Threshold Tuning: Adjust alert thresholds based on environment and SLA requirements
  3. Alert Fatigue: Balance sensitivity with practicality to avoid excessive alerts
  4. Dependency Awareness: Consider metric relationships (e.g., healing drives affect performance)
  5. Historical Context: Use rate() and increase() functions for trending analysis
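Applying point 5 in practice: counter-style metrics such as the error and failure totals are most useful when windowed with increase() or rate() rather than compared as raw lifetime values (the thresholds below are illustrative, not recommendations):

```
# I/O errors accumulated per drive over the last hour
increase(minio_system_drive_io_errors_total[1h]) > 100

# Per-second growth of total replication failures over a 5-minute window
rate(minio_bucket_replication_total_failed_count[5m]) > 0
```

Windowed queries like these ignore errors accumulated long ago, so alerts fire on current trends instead of historical totals.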

Best Practices

  • Monitor all erasure set health metrics - These directly impact data availability
  • Set capacity alerts at 70% - Provides time for expansion planning
  • Track drive errors closely - Early indicators of hardware failure
  • Monitor replication queues - Prevent backlog accumulation
  • Use rate-based alerts - Focus on trends rather than absolute values
  • Implement escalation - Different alert severities for different response times

This comprehensive metric set provides complete visibility into MinIO’s data and control plane performance, enabling proactive monitoring and rapid issue resolution.
