How does MinIO AIStor handle telemetry and diagnostics internally?

Asked by muratkars Answered by muratkars January 4, 2026

Understanding MinIO AIStor’s telemetry and diagnostics capabilities is essential for monitoring deployments, troubleshooting issues, and integrating with observability platforms.

Answer

MinIO provides 38+ metric collectors[1] with Prometheus compatibility and distributed tracing via OpenTelemetry. The comprehensive observability stack includes metrics, health endpoints, distributed tracing, audit logging, and subsystem-specific logging for complete operational visibility.


Metrics V3 Architecture

MinIO’s metrics system exposes detailed operational data organized by category.

Metrics Endpoints

| Path | Purpose | Key Metrics |
|------|---------|-------------|
| /api/requests | S3 API request metrics | Latency, throughput, error rates |
| /bucket/replication | Per-bucket replication stats | Lag, queue size, failures |
| /cluster/health | Drive/node/capacity health | Online status, capacity |
| /system/drive | Disk I/O, health, latency | IOPS, latency percentiles |
| /system/cpu | CPU usage metrics | Utilization, load average |
| /system/memory | Memory statistics | Heap, RSS, GC stats |
| /debug/heal | Healing progress | Objects healed, pending |
| /scanner | Background scanner stats | Objects scanned, rate |

Metrics Architecture

┌─────────────────────────────────────────────────────────┐
│ Metrics V3 Architecture │
├─────────────────────────────────────────────────────────┤
│ │
│ 38+ Metric Collectors │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Metric Categories │ │
│ │ ├── API Metrics (requests, latency, errors) │ │
│ │ ├── Bucket Metrics (replication, ILM) │ │
│ │ ├── Cluster Metrics (health, capacity) │ │
│ │ ├── System Metrics (CPU, memory, disk) │ │
│ │ └── Debug Metrics (healing, scanner) │ │
│ └─────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Prometheus-Compatible Export │ │
│ │ └── /minio/v2/metrics/cluster │ │
│ └─────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────┘

Key Metric Categories

API Request Metrics

minio_api_requests_total
├── Labels: api, bucket, method
└── Purpose: Request count by API operation
minio_api_requests_latency_seconds
├── Labels: api, bucket
└── Purpose: Request latency histogram
minio_api_requests_errors_total
├── Labels: api, bucket, error_code
└── Purpose: Error count by type
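These series are served in the standard Prometheus text exposition format. As a rough sketch of consuming them outside Prometheus, the snippet below parses metric name, labels, and value from exposition lines; the sample payload is illustrative, not actual MinIO output.

```python
# Minimal parser sketch for Prometheus text-exposition lines such as the
# minio_api_requests_* series above. Sample payload is illustrative only.
import re

def parse_exposition(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = re.match(r'^(\w+)(?:\{(.*)\})?\s+(\S+)$', line)
        if not m:
            continue
        name, raw_labels, value = m.groups()
        labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples

payload = """
# TYPE minio_api_requests_total counter
minio_api_requests_total{api="PutObject",bucket="mybucket",method="PUT"} 1042
minio_api_requests_errors_total{api="PutObject",bucket="mybucket",error_code="AccessDenied"} 3
"""
samples = parse_exposition(payload)
```

A full-featured client would use an existing Prometheus parsing library; this sketch only shows the shape of the data the endpoints return.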

Replication Metrics

minio_bucket_replication_sent_bytes
├── Labels: bucket, target_arn
└── Purpose: Bytes replicated to target
minio_bucket_replication_failed_operations
├── Labels: bucket, target_arn
└── Purpose: Failed replication count
minio_bucket_replication_pending_count
├── Labels: bucket
└── Purpose: Objects pending replication

System Metrics

minio_system_drive_used_bytes
├── Labels: drive, pool, set
└── Purpose: Drive space usage
minio_system_drive_latency_seconds
├── Labels: drive, api (read/write)
└── Purpose: Disk I/O latency
minio_system_cpu_usage_percent
└── Purpose: CPU utilization
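Because drive metrics carry `pool` and `set` labels, they can be rolled up to higher-level views. A small sketch, aggregating `minio_system_drive_used_bytes`-style samples by pool (the sample values and drive paths are made up for illustration):

```python
# Aggregate per-drive usage samples by their "pool" label to get
# per-pool usage. Labels and byte counts below are illustrative.
from collections import defaultdict

samples = [
    ({"drive": "/mnt/d1", "pool": "0", "set": "0"}, 4_000_000_000),
    ({"drive": "/mnt/d2", "pool": "0", "set": "0"}, 6_000_000_000),
    ({"drive": "/mnt/d3", "pool": "1", "set": "0"}, 2_000_000_000),
]

by_pool = defaultdict(int)
for labels, used_bytes in samples:
    by_pool[labels["pool"]] += used_bytes
```

In practice this aggregation is usually done in PromQL (`sum by (pool)(...)`) rather than client-side.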

Health Endpoints

MinIO provides dedicated health endpoints for orchestration integration.

Health Endpoint Overview[3]

| Endpoint | Purpose | Use Case |
|----------|---------|----------|
| /minio/health/live | Liveness check | Kubernetes liveness probe |
| /minio/health/ready | Readiness check | Kubernetes readiness probe |
| /minio/health/cluster | Write quorum check | Maintenance checks |
| /minio/health/cluster/read | Read quorum check | Read availability |

Health Check Details

┌─────────────────────────────────────────────────────────┐
│ Health Endpoints │
├─────────────────────────────────────────────────────────┤
│ │
│ /minio/health/live │
│ ├── Returns: 200 OK if process is running │
│ ├── Use: Kubernetes liveness probe │
│ └── Failure: Triggers pod restart │
│ │
│ /minio/health/ready │
│ ├── Returns: 200 OK if ready to serve requests │
│ ├── Use: Kubernetes readiness probe │
│ └── Failure: Removes pod from service │
│ │
│ /minio/health/cluster │
│ ├── Returns: 200 OK if write quorum available │
│ ├── Use: Pre-maintenance checks │
│ └── Checks: All erasure sets have write quorum │
│ │
│ /minio/health/cluster/read │
│ ├── Returns: 200 OK if read quorum available │
│ ├── Use: Read availability verification │
│ └── Checks: All erasure sets have read quorum │
│ │
└─────────────────────────────────────────────────────────┘
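As a sketch of how an operator tool might consume these endpoints before taking a node down, the helper below issues the health request and gates maintenance on write quorum. The function names and canned status values are illustrative, not part of MinIO; only the endpoint paths come from the table above.

```python
# Hedged sketch: pre-maintenance check against the health endpoints above.
# check_health() would hit a real deployment; the example uses canned statuses.
from urllib import request, error

def check_health(base_url, path, timeout=2):
    """Return the HTTP status for a MinIO health endpoint (200 = healthy)."""
    try:
        with request.urlopen(base_url + path, timeout=timeout) as resp:
            return resp.status
    except error.HTTPError as e:
        return e.code
    except OSError:
        return None  # node unreachable

def safe_for_maintenance(statuses):
    """Maintenance is safe only if the write-quorum check returns 200."""
    return statuses.get("/minio/health/cluster") == 200

# Example with canned statuses rather than a live cluster:
statuses = {"/minio/health/live": 200, "/minio/health/cluster": 503}
decision = safe_for_maintenance(statuses)  # 503 here means write quorum is lost
```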

Kubernetes Integration Example

livenessProbe:
  httpGet:
    path: /minio/health/live
    port: 9000
  initialDelaySeconds: 30
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /minio/health/ready
    port: 9000
  initialDelaySeconds: 5
  periodSeconds: 15

Distributed Tracing

MinIO supports OpenTelemetry for distributed tracing.

OpenTelemetry Configuration

| Parameter | Value | Description |
|-----------|-------|-------------|
| Export Protocol | OTLP | OpenTelemetry Protocol |
| Sampling | Parent-based | Follows parent span decision |
| Batch Timeout | 1 second[2] | Max wait before export |
| Max Batch Size | 512 spans | Maximum spans per batch (SDK default) |

Tracing Architecture

┌─────────────────────────────────────────────────────────┐
│ OpenTelemetry Tracing │
├─────────────────────────────────────────────────────────┤
│ │
│ Request Flow │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Span Creation │ │
│ │ ├── Service name │ │
│ │ ├── Node name │ │
│ │ ├── Port │ │
│ │ └── Operation details │ │
│ └─────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Batch Processor │ │
│ │ ├── Collect spans (up to 512) │ │
│ │ ├── Timeout after 1 second │ │
│ │ └── Export via OTLP │ │
│ └─────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ OTLP Exporter │ │
│ │ └── Send to Jaeger/Tempo/other backend │ │
│ └─────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────┘
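The batch-processor stage can be modeled in a few lines: buffer spans, then export when either the batch fills (512 spans) or the timeout (1 second) elapses. This is a simplified stand-in for illustration, not MinIO's actual exporter, which uses the OpenTelemetry Go SDK's batcher[2].

```python
# Illustrative model of batch span export: flush on max batch size or timeout.
import time

class BatchProcessor:
    def __init__(self, export_fn, max_batch=512, timeout_s=1.0):
        self.export_fn = export_fn
        self.max_batch = max_batch
        self.timeout_s = timeout_s
        self.buffer = []
        self.last_flush = time.monotonic()

    def on_span(self, span):
        self.buffer.append(span)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def tick(self):
        """Call periodically; flushes if the batch timeout has elapsed."""
        if self.buffer and time.monotonic() - self.last_flush >= self.timeout_s:
            self.flush()

    def flush(self):
        self.export_fn(self.buffer)
        self.buffer = []
        self.last_flush = time.monotonic()

# Small max_batch so the size-triggered flush is visible:
exported = []
bp = BatchProcessor(exported.append, max_batch=3, timeout_s=0.01)
for i in range(3):
    bp.on_span({"name": f"span-{i}"})
```

The real SDK runs the timeout flush on a background goroutine; `tick()` stands in for that loop here.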

Resource Attributes

Each span includes resource attributes for identification:

Resource Attributes:
├── service.name: "minio"
├── service.instance.id: <node-name>
├── service.version: <minio-version>
├── host.name: <hostname>
└── net.host.port: <port>

Enabling Tracing

# Set OTLP endpoint
export MINIO_TRACING_OTLP_ENDPOINT=http://jaeger:4317
# Enable tracing
mc admin config set ALIAS tracing endpoint=http://jaeger:4317
mc admin service restart ALIAS

Audit Logging

MinIO provides comprehensive audit logging for security and compliance.

Audit Log Contents

| Field | Description | Example |
|-------|-------------|---------|
| Timestamp | Event time | 2025-01-05T10:30:00Z |
| Headers | Request headers | Content-Type, Authorization |
| Claims | Identity claims | User, groups, policies |
| Status | Response status | 200, 403, 500 |
| Duration | Request duration | 45ms |
| Errors | Error details | Access denied reason |

Audit Log Structure

{
  "version": "1",
  "deploymentid": "xxxx-xxxx",
  "time": "2025-01-05T10:30:00.000Z",
  "event": {
    "api": {
      "name": "PutObject",
      "bucket": "mybucket",
      "object": "mykey",
      "status": "OK",
      "statusCode": 200,
      "timeToResponse": "45ms"
    },
    "remotehost": "192.168.1.100",
    "requestID": "xxxx",
    "userAgent": "aws-sdk-go/1.0",
    "requestClaims": {
      "accessKey": "myaccesskey",
      "sub": "user123"
    },
    "requestHeader": {
      "Content-Type": "application/octet-stream"
    },
    "responseHeader": {
      "x-amz-request-id": "xxxx"
    }
  }
}
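A webhook receiver consuming records of this shape might reduce each one to "who did what, and did it succeed". The sketch below follows the documented structure; the `summarize` helper and its field choices are illustrative, not a MinIO API.

```python
# Sketch: summarizing an audit record shaped like the structure above.
import json

record = json.loads("""
{
  "version": "1",
  "time": "2025-01-05T10:30:00.000Z",
  "event": {
    "api": {"name": "PutObject", "bucket": "mybucket",
            "status": "OK", "statusCode": 200, "timeToResponse": "45ms"},
    "requestClaims": {"accessKey": "myaccesskey", "sub": "user123"}
  }
}
""")

def summarize(rec):
    """Reduce an audit record to actor, action, outcome, and latency."""
    api = rec["event"]["api"]
    return {
        "who": rec["event"]["requestClaims"]["sub"],
        "what": f'{api["name"]} {api.get("bucket", "")}',
        "ok": api["statusCode"] < 400,
        "latency": api["timeToResponse"],
    }

summary = summarize(record)
```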

Audit Log Targets

┌─────────────────────────────────────────────────────────┐
│ Audit Log Targets │
├─────────────────────────────────────────────────────────┤
│ │
│ Supported Targets: │
│ ├── HTTP Webhook → External logging service │
│ ├── Kafka → Stream processing │
│ ├── PostgreSQL → Database storage │
│ └── Elasticsearch → Search and analytics │
│ │
│ Configuration: │
│ mc admin config set ALIAS audit_webhook \ │
│ endpoint=http://audit-service:8080 │
│ │
└─────────────────────────────────────────────────────────┘

Subsystem Logging

MinIO provides subsystem-specific loggers for detailed debugging.

Logging Subsystems

| Subsystem | Purpose | Log Contents |
|-----------|---------|--------------|
| Internal | Core operations | Object I/O, metadata ops |
| Replication | Replication events | Queue status, failures |
| Healing | Healing operations | Objects healed, errors |
| Scanner | Background scanner | Scan progress, findings |
| IAM | IAM operations | Auth events, policy changes |

Log Levels

Log Levels (most to least verbose):
├── DEBUG → Detailed debugging information
├── INFO → Normal operational messages
├── WARN → Warning conditions
├── ERROR → Error conditions
└── FATAL → Critical errors (process exit)
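A per-subsystem threshold works like any severity filter: an entry is kept only if its level is at or above the subsystem's configured level. A minimal sketch, reusing the stdlib `logging` constants for the ordering above (the entry/config shapes are illustrative, not MinIO's internal types):

```python
# Sketch: per-subsystem severity filtering over structured log entries.
import logging

LEVELS = {"debug": logging.DEBUG, "info": logging.INFO, "warn": logging.WARNING,
          "error": logging.ERROR, "fatal": logging.CRITICAL}

def keep(entry, subsystem_levels, default="info"):
    """Keep an entry if its level meets the subsystem's threshold."""
    threshold = subsystem_levels.get(entry.get("subsystem"), default)
    return LEVELS[entry["level"]] >= LEVELS[threshold]

config = {"replication": "debug", "healing": "warn"}
entries = [
    {"subsystem": "replication", "level": "debug", "message": "queued"},
    {"subsystem": "healing", "level": "info", "message": "scan start"},
]
visible = [e for e in entries if keep(e, config)]
```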

Subsystem Log Configuration

# Enable debug logging for replication
mc admin config set ALIAS log replication=debug
# Enable debug logging for healing
mc admin config set ALIAS log healing=debug
# View current log configuration
mc admin config get ALIAS log

Log Output Format

┌─────────────────────────────────────────────────────────┐
│ Log Entry Format │
├─────────────────────────────────────────────────────────┤
│ │
│ Structured Log Entry: │
│ { │
│ "level": "info", │
│ "time": "2025-01-05T10:30:00Z", │
│ "subsystem": "replication", │
│ "message": "Object replicated successfully", │
│ "bucket": "mybucket", │
│ "object": "mykey", │
│ "target": "site-b", │
│ "duration": "120ms" │
│ } │
│ │
└─────────────────────────────────────────────────────────┘

Prometheus Integration

Scrape Configuration

prometheus.yml
scrape_configs:
  - job_name: 'minio'
    metrics_path: /minio/v2/metrics/cluster
    scheme: http
    static_configs:
      - targets: ['minio1:9000', 'minio2:9000']
    bearer_token: <your-bearer-token>

Key Dashboards

| Dashboard | Purpose | Key Panels |
|-----------|---------|------------|
| Overview | Cluster health | Nodes online, capacity, requests |
| API | Request analysis | Latency, throughput, errors |
| Replication | Replication health | Lag, queue, failures |
| Resources | System resources | CPU, memory, disk I/O |

Best Practices

  1. Scrape interval: 15-30 seconds for Prometheus
  2. Retention: Keep metrics for trend analysis (30+ days)
  3. Alerting: Set alerts for queue growth, error rates, latency
  4. Tracing sampling: Use parent-based sampling in production
  5. Log rotation: Configure log rotation to prevent disk fill
  6. Audit compliance: Route audit logs to immutable storage
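The alerting practice above amounts to computing a per-second rate from successive counter scrapes and comparing it to a threshold, which is what a Prometheus `rate()`-based rule does. A tiny numeric sketch (the sample values and 1 error/second threshold are illustrative):

```python
# Sketch: error-rate alert from two scrapes of a monotonic counter,
# e.g. minio_api_requests_errors_total, taken one interval apart.
def counter_rate(prev, curr, interval_s):
    """Per-second increase of a monotonic counter over one scrape interval."""
    return max(curr - prev, 0) / interval_s

# Two scrapes 30 s apart (illustrative values):
rate = counter_rate(prev=120, curr=180, interval_s=30)
alert = rate > 1.0  # fire above 1 error/second
```

In production this lives in a Prometheus alerting rule, e.g. `rate(minio_api_requests_errors_total[5m]) > 1`, rather than client code.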

Source Code References
  1. cmd/metrics-v3.go:32-74 - 38 metric collector paths defined
  2. cmd/opentelemetry.go:78-80 - sdktrace.WithBatcher(..., sdktrace.WithBatchTimeout(time.Second))
  3. cmd/healthcheck-router.go:22-27 - Health endpoint path constants