Understanding MinIO AIStor’s replication engine internals helps operators design resilient multi-site architectures and troubleshoot replication issues effectively.
Answer
MinIO uses an event-driven replication engine built on worker pools, a persistent MRF (Most Recent Failures) queue, and timestamp-based conflict resolution. The engine supports multiple replication types, routes work deterministically across workers, and retries failures automatically.
Event Triggers
Replication is triggered by specific events that are evaluated against configured rules.
Event Evaluation Flow
```
Object Operation (PUT/DELETE/Metadata)
        │
        ▼
┌─────────────────────────────────────┐
│       Must Have to Replicate        │ ← Evaluates bucket rules
│     (Rule Qualification Check)      │
└─────────────────────────────────────┘
        │
        ├── No Match → No replication
        │
        └── Match → Check Operation Type
                        │
          ┌─────────────┴─────────────┐
          │                           │
          ▼                           ▼
┌─────────────────┐       ┌─────────────────────┐
│   Object/Meta   │       │  Check Replication  │
│   Replication   │       │  for Deletes        │
└─────────────────┘       └─────────────────────┘
```
Event Types
| Event | Purpose | Trigger |
|---|---|---|
| Must Have to Replicate | Evaluates rules for object qualification | Every object operation |
| Check Replication for Deletes | Determines if delete markers replicate | DELETE operations |
Replication Types
MinIO supports multiple replication scenarios, each with specific handling.
| Type | Description | Use Case |
|---|---|---|
| Basic Object Replication | Regular object PUT operations | New object uploads |
| Replication for Deletes | Delete marker propagation | Object deletions |
| Replication for Metadata | Metadata-only updates | Tag/retention changes |
| Replication for Healing | Failed replication retry | Recovery operations |
| Replication for Existing | Existing object resync | Initial sync, disaster recovery |
Replication Type Flow
```
┌─────────────────────────────────────────────────────────┐
│                    Replication Types                    │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  PUT Object ──────────────► Basic Object Replication    │
│                                                         │
│  DELETE Object ───────────► Replication for Deletes     │
│                                                         │
│  Update Metadata ─────────► Replication for Metadata    │
│                                                         │
│  MRF Queue Item ──────────► Replication for Healing     │
│                                                         │
│  Resync Command ──────────► Replication for Existing    │
│                                                         │
└─────────────────────────────────────────────────────────┘
```
Queue Architecture
The replication engine routes work across worker pools by hashing the bucket and object key, with a dedicated pool for large objects.
Worker Pool Architecture
```
┌─────────────────────────────────────────────────────────┐
│                    Replication Queue                    │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Incoming Object                                        │
│   │                                                     │
│   ▼                                                     │
│  ┌─────────────────────────────────────────────────┐    │
│  │   Hash-Based Routing (bucket + object key)      │    │
│  └─────────────────────────────────────────────────┘    │
│   │                                                     │
│   ├── Size < 128 MiB ──► Standard Workers               │
│   │                      [Worker 1] [Worker 2]...       │
│   │                                                     │
│   └── Size ≥ 128 MiB ──► Large Object Workers           │
│                          [LO Worker 1] [LO 2]...        │
│                                                         │
│  Queue Full? ──────────────► Overflow to MRF Queue      │
│                                                         │
└─────────────────────────────────────────────────────────┘
```
Routing Details
| Aspect | Behavior | Purpose |
|---|---|---|
| Hash-Based Distribution | hash(bucket + object) determines worker | Consistent routing, ordering |
| Large Object Threshold | ≥ 128 MiB[1] | Dedicated workers prevent blocking |
| Overflow Handling | Full channels trigger MRF queue | Prevents dropped operations |
Why Separate Large Object Workers?
- Prevents head-of-line blocking: Large transfers don’t delay small objects
- Optimized throughput: Large objects benefit from dedicated bandwidth
- Resource isolation: Memory/CPU usage separated from standard operations
Retry Mechanism
MinIO implements a multi-stage retry system for reliable replication.
Retry Stages
```
Stage 1: Inline Retry
  │
  └── Failure → Queue to MRF
       │
       ▼
Stage 2: MRF Persistence (every 5 minutes)
  │
  └── Persist queue to disk
       │
       ▼
Stage 3: MRF Processing (every 6 minutes)
  │
  └── Retry queued items
       │
       ▼
Stage 4: Scanner Takeover (after 3 retries)
  │
  └── Background scanner handles persistent failures
```
Stage Details
| Stage | Timing | Description |
|---|---|---|
| Stage 1 | Immediate | Inline failure triggers MRF queue entry |
| Stage 2 | Every 5 minutes[2] | MRF queue persisted to disk for durability |
| Stage 3 | Every 6 minutes[3] | MRF processor retries failed items |
| Stage 4 | After 3 retries[4] | Scanner takes over for persistent failures |
MRF (Most Recent Failures) Queue
```
┌─────────────────────────────────────────────────────────┐
│                        MRF Queue                        │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  In-Memory Queue                                        │
│   │                                                     │
│   ├── Entry: {bucket, object, version, retry_count}     │
│   │                                                     │
│   └── Every 5 min → Persist to disk                     │
│        │                                                │
│        ▼                                                │
│   .minio.sys/replication/                               │
│                                                         │
│  Every 6 min → Process persisted entries                │
│   │                                                     │
│   ├── Success → Remove from queue                       │
│   │                                                     │
│   └── Failure → Increment retry_count                   │
│        │                                                │
│        └── retry_count > 3 → Scanner mode               │
│                                                         │
└─────────────────────────────────────────────────────────┘
```
Conflict Resolution
MinIO uses deterministic rules to resolve conflicts in multi-site replication.
Timestamp-Based Resolution
```
Conflict Detected (same object, different content)
        │
        ▼
┌─────────────────────────────────────────────────────────┐
│                   Compare Timestamps                    │
│                                                         │
│  Replica Timestamp vs Replication Source Timestamp      │
│                                                         │
│             Winner = Most Recent Timestamp              │
└─────────────────────────────────────────────────────────┘
        │
        ▼
  Latest timestamp wins → Object updated
  Older timestamp loses → Replication skipped
```
Resolution Rules
| Scenario | Resolution | Outcome |
|---|---|---|
| Source newer | Source wins | Target updated |
| Target newer | Target wins | Replication skipped |
| Equal timestamps | Source wins | Ensures consistency |
Version Purge States
For delete marker replication, MinIO tracks purge status:
| State | Description | Next Action |
|---|---|---|
| Pending | Delete initiated, replication in progress | Wait for completion |
| Complete | Delete replicated successfully | No action needed |
| Failed | Delete replication failed | Retry via MRF |
Null Version Handling
```
Pre-Replication Objects (null version ID)
        │
        ▼
┌─────────────────────────────────────────────────────────┐
│                   Null Version Check                    │
│                                                         │
│  Object has null version?                               │
│   │                                                     │
│   ├── Yes → Skip replication                            │
│   │        (Prevents resync of legacy objects)          │
│   │                                                     │
│   └── No → Proceed with replication                     │
└─────────────────────────────────────────────────────────┘
```
Why Null Version Check?
- Prevents infinite replication loops
- Excludes objects created before replication was enabled
- Ensures only versioned objects participate in replication
Replication Status Tracking
MinIO tracks replication status per object and per target.
Status Values
| Status | Meaning |
|---|---|
| PENDING | Queued for replication |
| COMPLETED | Successfully replicated |
| FAILED | Replication failed, queued for retry |
| REPLICA | Object is a replica (received from another site) |
Multi-Target Status
For multi-site replication, status is tracked per destination:
```
Object: bucket/key
├── Target 1 (site-a): COMPLETED
├── Target 2 (site-b): PENDING
└── Target 3 (site-c): FAILED
```
Operational Metrics
Key metrics for monitoring replication health:
| Metric | Description | Alert Threshold |
|---|---|---|
| Replication Lag | Time since oldest pending item | > 5 minutes |
| MRF Queue Size | Items awaiting retry | > 1000 items |
| Failed Count | Persistent failures | > 0 after retries |
| Bandwidth Usage | Replication throughput | Near link capacity |
Best Practices
- Network sizing: Ensure sufficient bandwidth for replication traffic
- Monitor MRF queue: Growing queue indicates replication issues
- Timestamp sync: Use NTP across all sites for accurate conflict resolution
- Separate large objects: Consider dedicated replication rules for large files
Source Code References
- [1] `cmd/bucket-replication.go:2467`: `minLargeObjSize = 128 * humanize.MiByte`
- [2] `cmd/bucket-replication.go:3815`: `mrfSaveInterval = 5 * time.Minute`
- [3] `cmd/bucket-replication.go:3816`: `mrfQueueInterval = mrfSaveInterval + time.Minute`
- [4] `cmd/bucket-replication.go:3818`: `mrfRetryLimit = 3`