How does MinIO AIStor handle telemetry and diagnostics internally?

Asked by muratkars Answered by muratkars January 4, 2026

Understanding MinIO AIStor’s telemetry and diagnostics capabilities is essential for monitoring deployments, troubleshooting issues, and integrating with observability platforms.

Answer

MinIO provides 38+ metric collectors[1] with Prometheus compatibility and distributed tracing via OpenTelemetry. The comprehensive observability stack includes metrics, health endpoints, distributed tracing, audit logging, and subsystem-specific logging for complete operational visibility.


Metrics V3 Architecture

MinIO’s metrics system exposes detailed operational data organized by category.

Metrics Endpoints

| Path | Purpose | Key Metrics |
|------|---------|-------------|
| /api/requests | S3 API request metrics | Latency, throughput, error rates |
| /bucket/replication | Per-bucket replication stats | Lag, queue size, failures |
| /cluster/health | Drive/node/capacity health | Online status, capacity |
| /system/drive | Disk I/O, health, latency | IOPS, latency percentiles |
| /system/cpu | CPU usage metrics | Utilization, load average |
| /system/memory | Memory statistics | Heap, RSS, GC stats |
| /debug/heal | Healing progress | Objects healed, pending |
| /scanner | Background scanner stats | Objects scanned, rate |

Metrics Architecture

┌─────────────────────────────────────────────────────────┐
│ Metrics V3 Architecture │
├─────────────────────────────────────────────────────────┤
│ │
│ 38+ Metric Collectors │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Metric Categories │ │
│ │ ├── API Metrics (requests, latency, errors) │ │
│ │ ├── Bucket Metrics (replication, ILM) │ │
│ │ ├── Cluster Metrics (health, capacity) │ │
│ │ ├── System Metrics (CPU, memory, disk) │ │
│ │ └── Debug Metrics (healing, scanner) │ │
│ └─────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Prometheus-Compatible Export │ │
│ │ └── /minio/v2/metrics/cluster │ │
│ └─────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────┘

Key Metric Categories

API Request Metrics

minio_api_requests_total
├── Labels: api, bucket, method
└── Purpose: Request count by API operation
minio_api_requests_latency_seconds
├── Labels: api, bucket
└── Purpose: Request latency histogram
minio_api_requests_errors_total
├── Labels: api, bucket, error_code
└── Purpose: Error count by type
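These series are served in the standard Prometheus text exposition format. As a rough sketch of consuming them outside Prometheus, the snippet below parses metric name, labels, and value from exposition lines; the sample payload is illustrative, not actual MinIO output.

```python
# Minimal parser sketch for Prometheus text-exposition lines such as the
# minio_api_requests_* series above. Sample payload is illustrative only.
import re

def parse_exposition(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = re.match(r'^(\w+)(?:\{(.*)\})?\s+(\S+)$', line)
        if not m:
            continue
        name, raw_labels, value = m.groups()
        labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples

payload = """
# TYPE minio_api_requests_total counter
minio_api_requests_total{api="PutObject",bucket="mybucket",method="PUT"} 1042
minio_api_requests_errors_total{api="PutObject",bucket="mybucket",error_code="AccessDenied"} 3
"""
samples = parse_exposition(payload)
```

A full-featured client would use an existing Prometheus parsing library; this sketch only shows the shape of the data the endpoints return.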

Replication Metrics

minio_bucket_replication_sent_bytes
├── Labels: bucket, target_arn
└── Purpose: Bytes replicated to target
minio_bucket_replication_failed_operations
├── Labels: bucket, target_arn
└── Purpose: Failed replication count
minio_bucket_replication_pending_count
├── Labels: bucket
└── Purpose: Objects pending replication

System Metrics

minio_system_drive_used_bytes
├── Labels: drive, pool, set
└── Purpose: Drive space usage
minio_system_drive_latency_seconds
├── Labels: drive, api (read/write)
└── Purpose: Disk I/O latency
minio_system_cpu_usage_percent
└── Purpose: CPU utilization
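Because drive metrics carry `pool` and `set` labels, they can be rolled up to higher-level views. A small sketch, aggregating `minio_system_drive_used_bytes`-style samples by pool (the sample values and drive paths are made up for illustration):

```python
# Aggregate per-drive usage samples by their "pool" label to get
# per-pool usage. Labels and byte counts below are illustrative.
from collections import defaultdict

samples = [
    ({"drive": "/mnt/d1", "pool": "0", "set": "0"}, 4_000_000_000),
    ({"drive": "/mnt/d2", "pool": "0", "set": "0"}, 6_000_000_000),
    ({"drive": "/mnt/d3", "pool": "1", "set": "0"}, 2_000_000_000),
]

by_pool = defaultdict(int)
for labels, used_bytes in samples:
    by_pool[labels["pool"]] += used_bytes
```

In practice this aggregation is usually done in PromQL (`sum by (pool)(...)`) rather than client-side.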

Health Endpoints

MinIO provides dedicated health endpoints for orchestration integration.

Health Endpoint Overview[3]

| Endpoint | Purpose | Use Case |
|----------|---------|----------|
| /minio/health/live | Liveness check | Kubernetes liveness probe |
| /minio/health/ready | Readiness check | Kubernetes readiness probe |
| /minio/health/cluster | Write quorum check | Maintenance checks |
| /minio/health/cluster/read | Read quorum check | Read availability |

Health Check Details

┌─────────────────────────────────────────────────────────┐
│ Health Endpoints │
├─────────────────────────────────────────────────────────┤
│ │
│ /minio/health/live │
│ ├── Returns: 200 OK if process is running │
│ ├── Use: Kubernetes liveness probe │
│ └── Failure: Triggers pod restart │
│ │
│ /minio/health/ready │
│ ├── Returns: 200 OK if ready to serve requests │
│ ├── Use: Kubernetes readiness probe │
│ └── Failure: Removes pod from service │
│ │
│ /minio/health/cluster │
│ ├── Returns: 200 OK if write quorum available │
│ ├── Use: Pre-maintenance checks │
│ └── Checks: All erasure sets have write quorum │
│ │
│ /minio/health/cluster/read │
│ ├── Returns: 200 OK if read quorum available │
│ ├── Use: Read availability verification │
│ └── Checks: All erasure sets have read quorum │
│ │
└─────────────────────────────────────────────────────────┘
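As a sketch of how an operator tool might consume these endpoints before taking a node down, the helper below issues the health request and gates maintenance on write quorum. The function names and canned status values are illustrative, not part of MinIO; only the endpoint paths come from the table above.

```python
# Hedged sketch: pre-maintenance check against the health endpoints above.
# check_health() would hit a real deployment; the example uses canned statuses.
from urllib import request, error

def check_health(base_url, path, timeout=2):
    """Return the HTTP status for a MinIO health endpoint (200 = healthy)."""
    try:
        with request.urlopen(base_url + path, timeout=timeout) as resp:
            return resp.status
    except error.HTTPError as e:
        return e.code
    except OSError:
        return None  # node unreachable

def safe_for_maintenance(statuses):
    """Maintenance is safe only if the write-quorum check returns 200."""
    return statuses.get("/minio/health/cluster") == 200

# Example with canned statuses rather than a live cluster:
statuses = {"/minio/health/live": 200, "/minio/health/cluster": 503}
decision = safe_for_maintenance(statuses)  # 503 here means write quorum is lost
```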

Kubernetes Integration Example

livenessProbe:
  httpGet:
    path: /minio/health/live
    port: 9000
  initialDelaySeconds: 30
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /minio/health/ready
    port: 9000
  initialDelaySeconds: 5
  periodSeconds: 15

Distributed Tracing

MinIO supports OpenTelemetry for distributed tracing.

OpenTelemetry Configuration

| Parameter | Value | Description |
|-----------|-------|-------------|
| Export Protocol | OTLP | OpenTelemetry Protocol |
| Sampling | Parent-based | Follows parent span decision |
| Batch Timeout | 1 second[2] | Max wait before export |
| Max Batch Size | 512 spans | Maximum spans per batch (SDK default) |

Tracing Architecture

┌─────────────────────────────────────────────────────────┐
│ OpenTelemetry Tracing │
├─────────────────────────────────────────────────────────┤
│ │
│ Request Flow │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Span Creation │ │
│ │ ├── Service name │ │
│ │ ├── Node name │ │
│ │ ├── Port │ │
│ │ └── Operation details │ │
│ └─────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Batch Processor │ │
│ │ ├── Collect spans (up to 512) │ │
│ │ ├── Timeout after 1 second │ │
│ │ └── Export via OTLP │ │
│ └─────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ OTLP Exporter │ │
│ │ └── Send to Jaeger/Tempo/other backend │ │
│ └─────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────┘
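The batch-processor stage can be modeled in a few lines: buffer spans, then export when either the batch fills (512 spans) or the timeout (1 second) elapses. This is a simplified stand-in for illustration, not MinIO's actual exporter, which uses the OpenTelemetry Go SDK's batcher[2].

```python
# Illustrative model of batch span export: flush on max batch size or timeout.
import time

class BatchProcessor:
    def __init__(self, export_fn, max_batch=512, timeout_s=1.0):
        self.export_fn = export_fn
        self.max_batch = max_batch
        self.timeout_s = timeout_s
        self.buffer = []
        self.last_flush = time.monotonic()

    def on_span(self, span):
        self.buffer.append(span)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def tick(self):
        """Call periodically; flushes if the batch timeout has elapsed."""
        if self.buffer and time.monotonic() - self.last_flush >= self.timeout_s:
            self.flush()

    def flush(self):
        self.export_fn(self.buffer)
        self.buffer = []
        self.last_flush = time.monotonic()

# Small max_batch so the size-triggered flush is visible:
exported = []
bp = BatchProcessor(exported.append, max_batch=3, timeout_s=0.01)
for i in range(3):
    bp.on_span({"name": f"span-{i}"})
```

The real SDK runs the timeout flush on a background goroutine; `tick()` stands in for that loop here.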

Resource Attributes

Each span includes resource attributes for identification:

Resource Attributes:
├── service.name: "minio"
├── service.instance.id: <node-name>
├── service.version: <minio-version>
├── host.name: <hostname>
└── net.host.port: <port>

Enabling Tracing

# Set OTLP endpoint
export MINIO_TRACING_OTLP_ENDPOINT=http://jaeger:4317
# Enable tracing
mc admin config set ALIAS tracing endpoint=http://jaeger:4317
mc admin service restart ALIAS

Audit Logging

MinIO provides comprehensive audit logging for security and compliance.

Audit Log Contents

| Field | Description | Example |
|-------|-------------|---------|
| Timestamp | Event time | 2025-01-05T10:30:00Z |
| Headers | Request headers | Content-Type, Authorization |
| Claims | Identity claims | User, groups, policies |
| Status | Response status | 200, 403, 500 |
| Duration | Request duration | 45ms |
| Errors | Error details | Access denied reason |

Audit Log Structure

{
  "version": "1",
  "deploymentid": "xxxx-xxxx",
  "time": "2025-01-05T10:30:00.000Z",
  "event": {
    "api": {
      "name": "PutObject",
      "bucket": "mybucket",
      "object": "mykey",
      "status": "OK",
      "statusCode": 200,
      "timeToResponse": "45ms"
    },
    "remotehost": "192.168.1.100",
    "requestID": "xxxx",
    "userAgent": "aws-sdk-go/1.0",
    "requestClaims": {
      "accessKey": "myaccesskey",
      "sub": "user123"
    },
    "requestHeader": {
      "Content-Type": "application/octet-stream"
    },
    "responseHeader": {
      "x-amz-request-id": "xxxx"
    }
  }
}
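A webhook receiver consuming records of this shape might reduce each one to "who did what, and did it succeed". The sketch below follows the documented structure; the `summarize` helper and its field choices are illustrative, not a MinIO API.

```python
# Sketch: summarizing an audit record shaped like the structure above.
import json

record = json.loads("""
{
  "version": "1",
  "time": "2025-01-05T10:30:00.000Z",
  "event": {
    "api": {"name": "PutObject", "bucket": "mybucket",
            "status": "OK", "statusCode": 200, "timeToResponse": "45ms"},
    "requestClaims": {"accessKey": "myaccesskey", "sub": "user123"}
  }
}
""")

def summarize(rec):
    """Reduce an audit record to actor, action, outcome, and latency."""
    api = rec["event"]["api"]
    return {
        "who": rec["event"]["requestClaims"]["sub"],
        "what": f'{api["name"]} {api.get("bucket", "")}',
        "ok": api["statusCode"] < 400,
        "latency": api["timeToResponse"],
    }

summary = summarize(record)
```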

Audit Log Targets

┌─────────────────────────────────────────────────────────┐
│ Audit Log Targets │
├─────────────────────────────────────────────────────────┤
│ │
│ Supported Targets: │
│ ├── HTTP Webhook → External logging service │
│ ├── Kafka → Stream processing │
│ ├── PostgreSQL → Database storage │
│ └── Elasticsearch → Search and analytics │
│ │
│ Configuration: │
│ mc admin config set ALIAS audit_webhook \ │
│ endpoint=http://audit-service:8080 │
│ │
└─────────────────────────────────────────────────────────┘

Subsystem Logging

MinIO provides subsystem-specific loggers for detailed debugging.

Logging Subsystems

| Subsystem | Purpose | Log Contents |
|-----------|---------|--------------|
| Internal | Core operations | Object I/O, metadata ops |
| Replication | Replication events | Queue status, failures |
| Healing | Healing operations | Objects healed, errors |
| Scanner | Background scanner | Scan progress, findings |
| IAM | IAM operations | Auth events, policy changes |

Log Levels

Log Levels (most to least verbose):
├── DEBUG → Detailed debugging information
├── INFO → Normal operational messages
├── WARN → Warning conditions
├── ERROR → Error conditions
└── FATAL → Critical errors (process exit)
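A per-subsystem threshold works like any severity filter: an entry is kept only if its level is at or above the subsystem's configured level. A minimal sketch, reusing the stdlib `logging` constants for the ordering above (the entry/config shapes are illustrative, not MinIO's internal types):

```python
# Sketch: per-subsystem severity filtering over structured log entries.
import logging

LEVELS = {"debug": logging.DEBUG, "info": logging.INFO, "warn": logging.WARNING,
          "error": logging.ERROR, "fatal": logging.CRITICAL}

def keep(entry, subsystem_levels, default="info"):
    """Keep an entry if its level meets the subsystem's threshold."""
    threshold = subsystem_levels.get(entry.get("subsystem"), default)
    return LEVELS[entry["level"]] >= LEVELS[threshold]

config = {"replication": "debug", "healing": "warn"}
entries = [
    {"subsystem": "replication", "level": "debug", "message": "queued"},
    {"subsystem": "healing", "level": "info", "message": "scan start"},
]
visible = [e for e in entries if keep(e, config)]
```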

Subsystem Log Configuration

# Enable debug logging for replication
mc admin config set ALIAS log replication=debug
# Enable debug logging for healing
mc admin config set ALIAS log healing=debug
# View current log configuration
mc admin config get ALIAS log

Log Output Format

┌─────────────────────────────────────────────────────────┐
│ Log Entry Format │
├─────────────────────────────────────────────────────────┤
│ │
│ Structured Log Entry: │
│ { │
│ "level": "info", │
│ "time": "2025-01-05T10:30:00Z", │
│ "subsystem": "replication", │
│ "message": "Object replicated successfully", │
│ "bucket": "mybucket", │
│ "object": "mykey", │
│ "target": "site-b", │
│ "duration": "120ms" │
│ } │
│ │
└─────────────────────────────────────────────────────────┘

Prometheus Integration

Scrape Configuration

prometheus.yml
scrape_configs:
  - job_name: 'minio'
    metrics_path: /minio/v2/metrics/cluster
    scheme: http
    static_configs:
      - targets: ['minio1:9000', 'minio2:9000']
    bearer_token: <your-bearer-token>

Key Dashboards

| Dashboard | Purpose | Key Panels |
|-----------|---------|------------|
| Overview | Cluster health | Nodes online, capacity, requests |
| API | Request analysis | Latency, throughput, errors |
| Replication | Replication health | Lag, queue, failures |
| Resources | System resources | CPU, memory, disk I/O |

Best Practices

  1. Scrape interval: 15-30 seconds for Prometheus
  2. Retention: Keep metrics for trend analysis (30+ days)
  3. Alerting: Set alerts for queue growth, error rates, latency
  4. Tracing sampling: Use parent-based sampling in production
  5. Log rotation: Configure log rotation to prevent disk fill
  6. Audit compliance: Route audit logs to immutable storage
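The alerting practice above amounts to computing a per-second rate from successive counter scrapes and comparing it to a threshold, which is what a Prometheus `rate()`-based rule does. A tiny numeric sketch (the sample values and 1 error/second threshold are illustrative):

```python
# Sketch: error-rate alert from two scrapes of a monotonic counter,
# e.g. minio_api_requests_errors_total, taken one interval apart.
def counter_rate(prev, curr, interval_s):
    """Per-second increase of a monotonic counter over one scrape interval."""
    return max(curr - prev, 0) / interval_s

# Two scrapes 30 s apart (illustrative values):
rate = counter_rate(prev=120, curr=180, interval_s=30)
alert = rate > 1.0  # fire above 1 error/second
```

In production this lives in a Prometheus alerting rule, e.g. `rate(minio_api_requests_errors_total[5m]) > 1`, rather than client code.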

Source Code References
  1. cmd/metrics-v3.go:32-74 - 38 metric collector paths defined
  2. cmd/opentelemetry.go:78-80 - sdktrace.WithBatcher(..., sdktrace.WithBatchTimeout(time.Second))
  3. cmd/healthcheck-router.go:22-27 - Health endpoint path constants