Monitoring data and control plane performance is critical for maintaining storage availability and performance. Understanding the key metrics and alert thresholds ensures proactive system management.
This question covers:
- Essential data plane metrics
- Control plane health monitoring
- Recommended alert thresholds
- Replication-specific metrics
Answer
MinIO provides comprehensive metrics for monitoring both data and control plane performance, with specific recommendations for alert thresholds to ensure optimal availability and performance.
Erasure Set Health Metrics
Critical Erasure Set Monitoring:
From /cluster/erasure-set:
```
# Erasure set overall health - CRITICAL
minio_cluster_erasure_set_health
# Alert: if not 1 (indicates degraded erasure set)

# Drives currently healing - WARNING
minio_cluster_erasure_set_healing_drives_count
# Alert: if greater than 0 (healing in progress)

# Read capability health - CRITICAL
minio_cluster_erasure_set_read_health
# Alert: if not 1 (read operations impacted)

# Write capability health - CRITICAL
minio_cluster_erasure_set_write_health
# Alert: if not 1 (write operations impacted)
```
Alert Configuration Example:
```
# Erasure set health alerts
- alert: MinIOErasureSetDegraded
  expr: minio_cluster_erasure_set_health != 1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "MinIO erasure set degraded"

- alert: MinIOErasureSetHealing
  expr: minio_cluster_erasure_set_healing_drives_count > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "MinIO erasure set healing in progress"
```
Cluster Health Metrics
Overall Cluster Status:
From /cluster/health:
```
# Offline drives count - CRITICAL
minio_cluster_health_drives_offline_count
# Alert: if not 0 (drives unavailable)

# Offline nodes count - CRITICAL
minio_cluster_health_nodes_offline_count
# Alert: if not 0 (nodes unavailable)

# Cluster capacity utilization - WARNING
(minio_cluster_health_capacity_usable_free_bytes / minio_cluster_health_capacity_usable_total_bytes)
# Alert: if free fraction < 0.3, i.e. usage above 70% (approaching capacity limits)
```
Cluster Health Alerts:
```
# Cluster availability alerts
- alert: MinIODrivesOffline
  expr: minio_cluster_health_drives_offline_count > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "MinIO has offline drives"

- alert: MinIONodesOffline
  expr: minio_cluster_health_nodes_offline_count > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "MinIO has offline nodes"

- alert: MinIOHighCapacityUsage
  expr: (minio_cluster_health_capacity_usable_free_bytes / minio_cluster_health_capacity_usable_total_bytes) < 0.3
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "MinIO cluster capacity usage above 70%"
```
System Drive Metrics
Individual Drive Monitoring:
From /system/drive:
```
# Drive space utilization - WARNING
(minio_system_drive_used_bytes / minio_system_drive_total_bytes) > 0.70
# Alert: if >70% (drive approaching full)

# Inode utilization - WARNING
(minio_system_drive_used_inodes / minio_system_drive_total_inodes) > 0.70
# Alert: if >70% (inode exhaustion risk)

# Drive timeout errors - CRITICAL
minio_system_drive_timeout_errors_total > 100
# Alert: if >100 (hardware issues)

# Drive I/O errors - CRITICAL
minio_system_drive_io_errors_total > 100
# Alert: if >100 (drive failure indicators)

# Offline drive count - CRITICAL
minio_system_drive_offline_count > 0
# Alert: if >0 (drive unavailable)

# Drive health status - CRITICAL
minio_system_drive_health != 1
# Alert: if not 1 (drive unhealthy)

# Drive API latency - WARNING
minio_system_drive_api_latency_micros > 1000
# Alert: if >1000μs (performance degradation)
```
Drive Health Alerts:
```
# Drive-specific alerts
- alert: MinIODriveHighUtilization
  expr: (minio_system_drive_used_bytes / minio_system_drive_total_bytes) > 0.70
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "MinIO drive utilization above 70%"

- alert: MinIODriveErrors
  expr: increase(minio_system_drive_io_errors_total[1h]) > 100
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "MinIO drive experiencing I/O errors"

- alert: MinIODriveHighLatency
  expr: minio_system_drive_api_latency_micros > 1000
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "MinIO drive latency degraded"
```
System Memory Metrics
Memory Utilization:
From /system/memory:
```
# Memory utilization - WARNING
(minio_system_memory_used / minio_system_memory_total) > 0.70
# Alert: if >70% (memory pressure)
```
Memory Alert:
```
- alert: MinIOHighMemoryUsage
  expr: (minio_system_memory_used / minio_system_memory_total) > 0.70
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "MinIO memory usage above 70%"
```
Replication Metrics (Replication Deployments Only)
Replication Performance:
From /replication:
```
# Data transfer rate - WARNING
minio_replication_average_transfer_rate
# Alert: if < threshold (X MB/s - environment specific)

# Queue backlog - CRITICAL
minio_replication_last_minute_queued_count
# Alert: if > 8K (approaching 10K queue maximum)
```
Per-Bucket Replication:
From /bucket/replication:
```
# Hourly failure count - WARNING
minio_bucket_replication_last_hour_failed_count
# Alert: if > threshold (environment specific)

# Minute failure count - CRITICAL
minio_bucket_replication_last_minute_failed_count
# Alert: if > threshold (immediate failures)

# Replication latency - WARNING
minio_bucket_replication_latency_ms
# Alert: if >1000ms (performance degradation)

# Total failure count - WARNING
minio_bucket_replication_total_failed_count
# Alert: if > threshold (accumulated failures)
```
Replication Alerts:
```
# Replication monitoring
- alert: MinIOReplicationQueueBacklog
  expr: minio_replication_last_minute_queued_count > 8000
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "MinIO replication queue approaching limit"

- alert: MinIOReplicationHighLatency
  expr: minio_bucket_replication_latency_ms > 1000
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "MinIO replication latency degraded"

- alert: MinIOReplicationFailures
  expr: increase(minio_bucket_replication_last_minute_failed_count[5m]) > 10
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "MinIO replication failures detected"
```
Optional Bucket Usage Metrics
Bucket-Level Monitoring:
From /cluster/usage/buckets:
```
# Per-bucket size monitoring - INFORMATIONAL
minio_cluster_usage_buckets_total_bytes
# Alert: if greater than threshold for specific bucket

# Object version count - WARNING
minio_cluster_usage_buckets_versions_count
# Alert: if greater than threshold (version explosion)
```
Bucket Usage Alerts:
```
# Optional bucket monitoring
- alert: MinIOBucketSizeThreshold
  expr: minio_cluster_usage_buckets_total_bytes{bucket="critical-bucket"} > 1e12  # 1TB
  for: 1h
  labels:
    severity: info
  annotations:
    summary: "Bucket size exceeds threshold"

- alert: MinIOExcessiveVersions
  expr: minio_cluster_usage_buckets_versions_count > 1000000
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Bucket has excessive object versions"
```
Complete Monitoring Dashboard
Essential Metrics Summary:
| Metric Category | Critical Alerts | Warning Alerts |
|---|---|---|
| Erasure Sets | Health != 1, Read/Write != 1 | Healing > 0 |
| Cluster Health | Offline drives/nodes > 0 | Capacity > 70% |
| Drive Health | Errors > 100, Offline > 0 | Utilization > 70%, Latency > 1ms |
| Memory | - | Usage > 70% |
| Replication | Queue > 8K, Recent failures | Transfer rate low, Latency > 1s |
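Before any of these alerts can fire, Prometheus needs to scrape the metric paths referenced above. The following is a minimal sketch of a scrape configuration, assuming the MinIO metrics v3 API (base path `/minio/metrics/v3`) and `MINIO_PROMETHEUS_AUTH_TYPE=public`; the target host and port are placeholders for your deployment:

```
# Hypothetical scrape config sketch - adjust targets, auth, and paths
# to match your deployment and MinIO version.
scrape_configs:
  - job_name: minio-cluster-health
    metrics_path: /minio/metrics/v3/cluster/health
    static_configs:
      - targets: ["minio.example.net:9000"]

  - job_name: minio-erasure-set
    metrics_path: /minio/metrics/v3/cluster/erasure-set
    static_configs:
      - targets: ["minio.example.net:9000"]

  - job_name: minio-system-drive
    metrics_path: /minio/metrics/v3/system/drive
    static_configs:
      - targets: ["minio.example.net:9000"]
```

If your deployment requires authenticated scraping, a bearer-token credential would be added to each job instead of relying on public access.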
Prometheus Query Examples
Quick Health Check:
```
# Overall cluster health score
min(minio_cluster_erasure_set_health) *
min(minio_cluster_erasure_set_read_health) *
min(minio_cluster_erasure_set_write_health)
# Result: 1 = healthy, 0 = degraded

# Capacity utilization percentage
100 - (minio_cluster_health_capacity_usable_free_bytes / minio_cluster_health_capacity_usable_total_bytes * 100)

# Average drive latency
avg(minio_system_drive_api_latency_micros)
```
Key Monitoring Principles
- Critical vs Warning: Distinguish between service-impacting (critical) and performance-degrading (warning) conditions
- Threshold Tuning: Adjust alert thresholds based on environment and SLA requirements
- Alert Fatigue: Balance sensitivity with practicality to avoid excessive alerts
- Dependency Awareness: Consider metric relationships (e.g., healing drives affect performance)
- Historical Context: Use rate() and increase() functions for trending analysis
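As a sketch of that historical-context principle, the cumulative counters above can be alerted on as recent trends rather than lifetime totals (the window sizes and thresholds here are illustrative, not prescriptive):

```
# I/O errors accumulated in the last hour, per drive
# (trend, not lifetime total)
increase(minio_system_drive_io_errors_total[1h]) > 100

# Any replication failures accumulated in the last hour
increase(minio_bucket_replication_total_failed_count[1h]) > 0
```

Trend-based expressions like these avoid permanently firing alerts on counters that recorded errors long ago but are currently stable.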
Best Practices
- Monitor all erasure set health metrics - These directly impact data availability
- Set capacity alerts at 70% - Provides time for expansion planning
- Track drive errors closely - Early indicators of hardware failure
- Monitor replication queues - Prevent backlog accumulation
- Use rate-based alerts - Focus on trends rather than absolute values
- Implement escalation - Different alert severities for different response times
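One way to apply these practices is to precompute the composite health score from the query examples above as a Prometheus recording rule, so dashboards and alerts share a single expression. This is a sketch; the rule group and recorded metric names are illustrative:

```
groups:
  - name: minio-health  # illustrative group name
    rules:
      - record: minio:cluster_health_score
        expr: >
          min(minio_cluster_erasure_set_health) *
          min(minio_cluster_erasure_set_read_health) *
          min(minio_cluster_erasure_set_write_health)
      - alert: MinIOClusterDegraded
        expr: minio:cluster_health_score < 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "MinIO composite cluster health score degraded"
```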
This comprehensive metric set provides complete visibility into MinIO’s data and control plane performance, enabling proactive monitoring and rapid issue resolution.