Understanding MinIO AIStor’s telemetry and diagnostics capabilities is essential for monitoring deployments, troubleshooting issues, and integrating with observability platforms.
Answer
MinIO provides 38+ metric collectors[1] with Prometheus compatibility and distributed tracing via OpenTelemetry. The comprehensive observability stack includes metrics, health endpoints, distributed tracing, audit logging, and subsystem-specific logging for complete operational visibility.
Metrics V3 Architecture
MinIO’s metrics system exposes detailed operational data organized by category.
Metrics Endpoints
| Path | Purpose | Key Metrics |
|---|---|---|
| /api/requests | S3 API request metrics | Latency, throughput, error rates |
| /bucket/replication | Per-bucket replication stats | Lag, queue size, failures |
| /cluster/health | Drive/node/capacity health | Online status, capacity |
| /system/drive | Disk I/O, health, latency | IOPS, latency percentiles |
| /system/cpu | CPU usage metrics | Utilization, load average |
| /system/memory | Memory statistics | Heap, RSS, GC stats |
| /debug/heal | Healing progress | Objects healed, pending |
| /scanner | Background scanner stats | Objects scanned, rate |
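These endpoints emit Prometheus text-format samples. As an illustration (not MinIO code), here is a minimal Python sketch that parses one `name{labels} value` sample line; real scrapers should use a Prometheus client library, since this naive version does not handle commas or braces inside label values:

```python
import re

# Minimal parser for a single Prometheus text-exposition sample line.
# Illustrative only: label values containing ',' or '}' would break it.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)

def parse_sample(line):
    """Parse one sample line into (metric name, labels dict, float value)."""
    m = SAMPLE_RE.match(line.strip())
    if not m:
        return None
    labels = {}
    if m.group('labels'):
        for pair in m.group('labels').split(','):
            key, val = pair.split('=', 1)
            labels[key.strip()] = val.strip().strip('"')
    return m.group('name'), labels, float(m.group('value'))

line = 'minio_api_requests_total{api="PutObject",bucket="mybucket"} 1024'
print(parse_sample(line))
```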
Metrics Architecture
```
38+ Metric Collectors
  │
  ▼
Metric Categories
├── API Metrics (requests, latency, errors)
├── Bucket Metrics (replication, ILM)
├── Cluster Metrics (health, capacity)
├── System Metrics (CPU, memory, disk)
└── Debug Metrics (healing, scanner)
  │
  ▼
Prometheus-Compatible Export
└── /minio/v2/metrics/cluster
```

Key Metric Categories
API Request Metrics
```
minio_api_requests_total
├── Labels: api, bucket, method
└── Purpose: Request count by API operation

minio_api_requests_latency_seconds
├── Labels: api, bucket
└── Purpose: Request latency histogram

minio_api_requests_errors_total
├── Labels: api, bucket, error_code
└── Purpose: Error count by type
```

Replication Metrics

```
minio_bucket_replication_sent_bytes
├── Labels: bucket, target_arn
└── Purpose: Bytes replicated to target

minio_bucket_replication_failed_operations
├── Labels: bucket, target_arn
└── Purpose: Failed replication count

minio_bucket_replication_pending_count
├── Labels: bucket
└── Purpose: Objects pending replication
```

System Metrics

```
minio_system_drive_used_bytes
├── Labels: drive, pool, set
└── Purpose: Drive space usage

minio_system_drive_latency_seconds
├── Labels: drive, api (read/write)
└── Purpose: Disk I/O latency

minio_system_cpu_usage_percent
└── Purpose: CPU utilization
```

Health Endpoints
MinIO provides dedicated health endpoints for orchestration integration.
Health Endpoint Overview[3]
| Endpoint | Purpose | Use Case |
|---|---|---|
| /minio/health/live | Liveness check | Kubernetes liveness probe |
| /minio/health/ready | Readiness check | Kubernetes readiness probe |
| /minio/health/cluster | Write quorum check | Maintenance checks |
| /minio/health/cluster/read | Read quorum check | Read availability |
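These probes amount to a status-code contract: 200 means healthy, and anything else calls for the action in the table above. A small illustrative sketch of that mapping (the `on_probe` function and the action strings are my own, not a MinIO API):

```python
# Illustrative mapping from a failed health probe to a follow-up action,
# following the endpoint semantics in the table above (hypothetical helper).
ACTIONS = {
    '/minio/health/live':         'restart pod',
    '/minio/health/ready':        'remove pod from service',
    '/minio/health/cluster':      'defer maintenance (write quorum lost)',
    '/minio/health/cluster/read': 'alert: read quorum lost',
}

def on_probe(endpoint, status_code):
    """Return the follow-up action for a probe result, or None if healthy."""
    if status_code == 200:
        return None
    return ACTIONS.get(endpoint, 'unknown endpoint')

print(on_probe('/minio/health/live', 200))   # healthy, prints None
print(on_probe('/minio/health/ready', 503))  # pod leaves the service
```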
Health Check Details
```
/minio/health/live
├── Returns: 200 OK if process is running
├── Use: Kubernetes liveness probe
└── Failure: Triggers pod restart

/minio/health/ready
├── Returns: 200 OK if ready to serve requests
├── Use: Kubernetes readiness probe
└── Failure: Removes pod from service

/minio/health/cluster
├── Returns: 200 OK if write quorum available
├── Use: Pre-maintenance checks
└── Checks: All erasure sets have write quorum

/minio/health/cluster/read
├── Returns: 200 OK if read quorum available
├── Use: Read availability verification
└── Checks: All erasure sets have read quorum
```

Kubernetes Integration Example
```yaml
livenessProbe:
  httpGet:
    path: /minio/health/live
    port: 9000
  initialDelaySeconds: 30
  periodSeconds: 30

readinessProbe:
  httpGet:
    path: /minio/health/ready
    port: 9000
  initialDelaySeconds: 5
  periodSeconds: 15
```

Distributed Tracing
MinIO supports OpenTelemetry for distributed tracing.
OpenTelemetry Configuration
| Parameter | Value | Description |
|---|---|---|
| Export Protocol | OTLP | OpenTelemetry Protocol |
| Sampling | Parent-based | Follows parent span decision |
| Batch Timeout | 1 second[2] | Max wait before export |
| Max Batch Size | 512 spans | Maximum spans per batch (SDK default) |
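To make the batching rule concrete, here is a toy Python model of a batch processor that flushes when the buffer reaches `max_batch` spans or when `timeout` seconds have elapsed; the class and its structure are illustrative, not the actual OpenTelemetry SDK (which flushes on a background timer rather than on each add):

```python
import time

class BatchProcessor:
    """Toy model of OTel batch export: flush at max_batch spans or after
    timeout seconds (defaults match the values in the table above)."""
    def __init__(self, export, max_batch=512, timeout=1.0, clock=time.monotonic):
        self.export, self.max_batch, self.timeout = export, max_batch, timeout
        self.clock = clock
        self.buf, self.last_flush = [], clock()

    def add(self, span):
        self.buf.append(span)
        if (len(self.buf) >= self.max_batch
                or self.clock() - self.last_flush >= self.timeout):
            self.flush()

    def flush(self):
        if self.buf:
            self.export(list(self.buf))  # hand a full batch to the exporter
            self.buf.clear()
        self.last_flush = self.clock()

batches = []
bp = BatchProcessor(batches.append, max_batch=3, timeout=60)
for s in ['span1', 'span2', 'span3', 'span4']:
    bp.add(s)
# The first three spans flush as one batch; span4 waits for the next trigger.
print(batches)
```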
Tracing Architecture
```
Request Flow
  │
  ▼
Span Creation
├── Service name
├── Node name
├── Port
└── Operation details
  │
  ▼
Batch Processor
├── Collect spans (up to 512)
├── Timeout after 1 second
└── Export via OTLP
  │
  ▼
OTLP Exporter
└── Send to Jaeger/Tempo/other backend
```

Resource Attributes
Each span includes resource attributes for identification:
```
Resource Attributes:
├── service.name: "minio"
├── service.instance.id: <node-name>
├── service.version: <minio-version>
├── host.name: <hostname>
└── net.host.port: <port>
```

Enabling Tracing
```sh
# Set OTLP endpoint
export MINIO_TRACING_OTLP_ENDPOINT=http://jaeger:4317

# Enable tracing
mc admin config set ALIAS tracing endpoint=http://jaeger:4317
mc admin service restart ALIAS
```

Audit Logging
MinIO provides comprehensive audit logging for security and compliance.
Audit Log Contents
| Field | Description | Example |
|---|---|---|
| Timestamp | Event time | 2025-01-05T10:30:00Z |
| Headers | Request headers | Content-Type, Authorization |
| Claims | Identity claims | User, groups, policies |
| Status | Response status | 200, 403, 500 |
| Duration | Request duration | 45ms |
| Errors | Error details | Access denied reason |
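A log consumer can pull these fields out of each entry with a few dictionary lookups. A hedged sketch follows: `summarize_audit` is a hypothetical helper of my own, and the entry layout mirrors the example audit structure in this document:

```python
import json

def summarize_audit(raw):
    """Extract the fields from the table above out of one audit entry.

    Hypothetical helper; the key layout follows this document's example
    audit entry, not a formally specified schema.
    """
    e = json.loads(raw)
    api = e['event']['api']
    return {
        'time': e['time'],
        'api': api['name'],
        'bucket': api.get('bucket'),
        'status': api['statusCode'],
        'duration': api['timeToResponse'],
        'access_key': e['event'].get('requestClaims', {}).get('accessKey'),
    }

entry = '''{"version": "1", "time": "2025-01-05T10:30:00.000Z",
 "event": {"api": {"name": "PutObject", "bucket": "mybucket",
   "statusCode": 200, "timeToResponse": "45ms"},
  "requestClaims": {"accessKey": "myaccesskey"}}}'''
print(summarize_audit(entry))
```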
Audit Log Structure
```json
{
  "version": "1",
  "deploymentid": "xxxx-xxxx",
  "time": "2025-01-05T10:30:00.000Z",
  "event": {
    "api": {
      "name": "PutObject",
      "bucket": "mybucket",
      "object": "mykey",
      "status": "OK",
      "statusCode": 200,
      "timeToResponse": "45ms"
    },
    "remotehost": "192.168.1.100",
    "requestID": "xxxx",
    "userAgent": "aws-sdk-go/1.0",
    "requestClaims": {
      "accessKey": "myaccesskey",
      "sub": "user123"
    },
    "requestHeader": {
      "Content-Type": "application/octet-stream"
    },
    "responseHeader": {
      "x-amz-request-id": "xxxx"
    }
  }
}
```

Audit Log Targets
```
Supported Targets:
├── HTTP Webhook  → External logging service
├── Kafka         → Stream processing
├── PostgreSQL    → Database storage
└── Elasticsearch → Search and analytics
```

Configuration:

```sh
mc admin config set ALIAS audit_webhook \
  endpoint=http://audit-service:8080
```

Subsystem Logging
MinIO provides subsystem-specific loggers for detailed debugging.
Logging Subsystems
| Subsystem | Purpose | Log Contents |
|---|---|---|
| Internal | Core operations | Object I/O, metadata ops |
| Replication | Replication events | Queue status, failures |
| Healing | Healing operations | Objects healed, errors |
| Scanner | Background scanner | Scan progress, findings |
| IAM | IAM operations | Auth events, policy changes |
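Because subsystem log entries are structured JSON with `subsystem` and `level` fields (see the log output format in this document), they filter cleanly in any language. An illustrative Python sketch, with function name and level ordering of my own:

```python
# Filter structured log entries by subsystem and minimum severity.
# Field names follow this document's structured log-entry example.
LEVELS = ['DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL']  # most to least verbose

def filter_logs(entries, subsystem=None, min_level='INFO'):
    """Keep entries at or above min_level, optionally for one subsystem."""
    threshold = LEVELS.index(min_level)
    return [e for e in entries
            if (subsystem is None or e['subsystem'] == subsystem)
            and LEVELS.index(e['level'].upper()) >= threshold]

entries = [
    {'level': 'debug', 'subsystem': 'replication', 'message': 'queue drained'},
    {'level': 'error', 'subsystem': 'replication', 'message': 'target unreachable'},
    {'level': 'info',  'subsystem': 'healing',     'message': 'object healed'},
]
# Only the replication error survives a WARN threshold on that subsystem.
print(filter_logs(entries, subsystem='replication', min_level='WARN'))
```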
Log Levels
```
Log Levels (most to least verbose):
├── DEBUG → Detailed debugging information
├── INFO  → Normal operational messages
├── WARN  → Warning conditions
├── ERROR → Error conditions
└── FATAL → Critical errors (process exit)
```

Subsystem Log Configuration
```sh
# Enable debug logging for replication
mc admin config set ALIAS log replication=debug

# Enable debug logging for healing
mc admin config set ALIAS log healing=debug

# View current log configuration
mc admin config get ALIAS log
```

Log Output Format
Structured log entry:

```json
{
  "level": "info",
  "time": "2025-01-05T10:30:00Z",
  "subsystem": "replication",
  "message": "Object replicated successfully",
  "bucket": "mybucket",
  "object": "mykey",
  "target": "site-b",
  "duration": "120ms"
}
```

Prometheus Integration
Scrape Configuration
```yaml
scrape_configs:
  - job_name: 'minio'
    metrics_path: /minio/v2/metrics/cluster
    scheme: http
    static_configs:
      - targets: ['minio1:9000', 'minio2:9000']
    bearer_token: <your-bearer-token>
```

Key Dashboards
| Dashboard | Purpose | Key Panels |
|---|---|---|
| Overview | Cluster health | Nodes online, capacity, requests |
| API | Request analysis | Latency, throughput, errors |
| Replication | Replication health | Lag, queue, failures |
| Resources | System resources | CPU, memory, disk I/O |
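The request and error panels on these dashboards reduce to rate arithmetic over scraped counter deltas. A sketch of that calculation (counter names are abbreviated for illustration; the real metrics carry the `minio_api_` prefix and labels):

```python
# Compute request and error rates from two scrapes of cumulative counters,
# the arithmetic behind a typical API dashboard panel.
def rates(prev, curr, interval_s):
    """prev/curr: dicts with 'requests_total' and 'errors_total' counters
    sampled interval_s seconds apart."""
    dreq = curr['requests_total'] - prev['requests_total']
    derr = curr['errors_total'] - prev['errors_total']
    return {
        'req_per_s': dreq / interval_s,
        'err_per_s': derr / interval_s,
        'error_ratio': (derr / dreq) if dreq else 0.0,
    }

prev = {'requests_total': 10_000, 'errors_total': 40}
curr = {'requests_total': 10_900, 'errors_total': 49}
# 900 requests and 9 errors over a 30 s scrape interval: 30 req/s, 1% errors.
print(rates(prev, curr, interval_s=30))
```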
Best Practices
- Scrape interval: 15-30 seconds for Prometheus
- Retention: Keep metrics for trend analysis (30+ days)
- Alerting: Set alerts for queue growth, error rates, latency
- Tracing sampling: Use parent-based sampling in production
- Log rotation: Configure log rotation to prevent disk fill
- Audit compliance: Route audit logs to immutable storage
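As one way to act on the queue-growth alerting practice above, a sketch that flags a sustained increase in a replication pending-count series; the function and its window size are illustrative, not recommended thresholds:

```python
# Flag sustained growth in a pending-count series, the condition the
# "queue growth" alerting practice targets. Illustrative heuristic only.
def queue_growing(samples, min_points=3):
    """True if the last min_points samples are strictly increasing."""
    if len(samples) < min_points:
        return False
    tail = samples[-min_points:]
    return all(b > a for a, b in zip(tail, tail[1:]))

print(queue_growing([0, 0, 120, 450, 900]))  # sustained growth
print(queue_growing([500, 300, 150, 40]))    # queue draining
```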
Source Code References
- `cmd/metrics-v3.go:32-74`: 38 metric collector paths defined
- `cmd/opentelemetry.go:78-80`: `sdktrace.WithBatcher(..., sdktrace.WithBatchTimeout(time.Second))`
- `cmd/healthcheck-router.go:22-27`: Health endpoint path constants