How does the MinIO AIStor replication engine work internally?

Asked by muratkars · Answered by muratkars · January 4, 2026

Understanding MinIO AIStor’s replication engine internals helps operators design resilient multi-site architectures and troubleshoot replication issues effectively.

Answer

MinIO uses an event-driven replication engine built on worker pools, a persistent MRF (Most Recent Failures) queue, and timestamp-based conflict resolution. The architecture supports multiple replication types with intelligent routing, automatic retries, and deterministic conflict handling.


Event Triggers

Replication is triggered by specific events that are evaluated against configured rules.

Event Evaluation Flow

Object Operation (PUT/DELETE/Metadata)
                  │
                  ▼
┌─────────────────────────────────────┐
│       Must Have to Replicate        │ ← Evaluates bucket rules
│     (Rule Qualification Check)      │
└─────────────────────────────────────┘
                  │
        ├── No Match → No replication
        └── Match → Check Operation Type
                  │
        ┌─────────┴─────────┐
        ▼                   ▼
┌─────────────────┐  ┌─────────────────────┐
│   Object/Meta   │  │  Check Replication  │
│   Replication   │  │     for Deletes     │
└─────────────────┘  └─────────────────────┘

Event Types

| Event | Purpose | Trigger |
|---|---|---|
| Must Have to Replicate | Evaluates rules for object qualification | Every object operation |
| Check Replication for Deletes | Determines if delete markers replicate | DELETE operations |

Replication Types

MinIO supports multiple replication scenarios, each with specific handling.

| Type | Description | Use Case |
|---|---|---|
| Basic Object Replication | Regular object PUT operations | New object uploads |
| Replication for Deletes | Delete marker propagation | Object deletions |
| Replication for Metadata | Metadata-only updates | Tag/retention changes |
| Replication for Healing | Failed replication retry | Recovery operations |
| Replication for Existing | Existing object resync | Initial sync, disaster recovery |

Replication Type Flow

┌─────────────────────────────────────────────────────────┐
│ Replication Types │
├─────────────────────────────────────────────────────────┤
│ │
│ PUT Object ──────────────► Basic Object Replication │
│ │
│ DELETE Object ───────────► Replication for Deletes │
│ │
│ Update Metadata ─────────► Replication for Metadata │
│ │
│ MRF Queue Item ──────────► Replication for Healing │
│ │
│ Resync Command ──────────► Replication for Existing │
│ │
└─────────────────────────────────────────────────────────┘

Queue Architecture

The replication engine uses intelligent routing to distribute work efficiently.

Worker Pool Architecture

┌─────────────────────────────────────────────────────────┐
│ Replication Queue │
├─────────────────────────────────────────────────────────┤
│ │
│ Incoming Object │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Hash-Based Routing (bucket + object key) │ │
│ └─────────────────────────────────────────────────┘ │
│ │ │
│ ├── Size < 128 MiB ──► Standard Workers │
│ │ [Worker 1] [Worker 2]... │
│ │ │
│ └── Size ≥ 128 MiB ──► Large Object Workers │
│ [LO Worker 1] [LO 2]... │
│ │
│ Queue Full? ──────────────► Overflow to MRF Queue │
│ │
└─────────────────────────────────────────────────────────┘

Routing Details

| Aspect | Behavior | Purpose |
|---|---|---|
| Hash-Based Distribution | hash(bucket + object) determines worker | Consistent routing, ordering |
| Large Object Threshold | ≥ 128 MiB[1] | Dedicated workers prevent blocking |
| Overflow Handling | Full channels trigger MRF queue | Prevents dropped operations |

Why Separate Large Object Workers?

  • Prevents head-of-line blocking: Large transfers don’t delay small objects
  • Optimized throughput: Large objects benefit from dedicated bandwidth
  • Resource isolation: Memory/CPU usage separated from standard operations

Retry Mechanism

MinIO implements a multi-stage retry system for reliable replication.

Retry Stages

Stage 1: Inline Retry
  └── Failure → Queue to MRF
Stage 2: MRF Persistence (every 5 minutes)
  └── Persist queue to disk
Stage 3: MRF Processing (every 6 minutes)
  └── Retry queued items
Stage 4: Scanner Takeover (after 3 retries)
  └── Background scanner handles persistent failures

Stage Details

| Stage | Timing | Description |
|---|---|---|
| Stage 1 | Immediate | Inline failure triggers MRF queue entry |
| Stage 2 | Every 5 minutes[2] | MRF queue persisted to disk for durability |
| Stage 3 | Every 6 minutes[3] | MRF processor retries failed items |
| Stage 4 | After 3 retries[4] | Scanner takes over for persistent failures |

MRF (Most Recent Failures) Queue

┌─────────────────────────────────────────────────────────┐
│ MRF Queue │
├─────────────────────────────────────────────────────────┤
│ │
│ In-Memory Queue │
│ │ │
│ ├── Entry: {bucket, object, version, retry_count} │
│ │ │
│ └── Every 5 min → Persist to disk │
│ │ │
│ ▼ │
│ .minio.sys/replication/ │
│ │
│ Every 6 min → Process persisted entries │
│ │ │
│ ├── Success → Remove from queue │
│ │ │
│ └── Failure → Increment retry_count │
│ │ │
│ └── retry_count > 3 → Scanner mode │
│ │
└─────────────────────────────────────────────────────────┘

Conflict Resolution

MinIO uses deterministic rules to resolve conflicts in multi-site replication.

Timestamp-Based Resolution

Conflict Detected (same object, different content)
┌─────────────────────────────────────────────────────────┐
│ Compare Timestamps │
│ │
│ Replica Timestamp vs Replication Source Timestamp │
│ │
│ Winner = Most Recent Timestamp │
└─────────────────────────────────────────────────────────┘
Latest timestamp wins → Object updated
Older timestamp loses → Replication skipped

Resolution Rules

| Scenario | Resolution | Outcome |
|---|---|---|
| Source newer | Source wins | Target updated |
| Target newer | Target wins | Replication skipped |
| Equal timestamps | Source wins | Ensures consistency |

Version Purge States

For delete marker replication, MinIO tracks purge status:

| State | Description | Next Action |
|---|---|---|
| Pending | Delete initiated, replication in progress | Wait for completion |
| Complete | Delete replicated successfully | No action needed |
| Failed | Delete replication failed | Retry via MRF |

Null Version Handling

Pre-Replication Objects (null version ID)
┌─────────────────────────────────────────────────────────┐
│ Null Version Check │
│ │
│ Object has null version? │
│ │ │
│ ├── Yes → Skip replication │
│ │ (Prevents resync of legacy objects) │
│ │ │
│ └── No → Proceed with replication │
└─────────────────────────────────────────────────────────┘

Why Null Version Check?

  • Prevents infinite replication loops
  • Excludes objects created before replication was enabled
  • Ensures only versioned objects participate in replication

Replication Status Tracking

MinIO tracks replication status per object and per target.

Status Values

| Status | Meaning |
|---|---|
| PENDING | Queued for replication |
| COMPLETED | Successfully replicated |
| FAILED | Replication failed, queued for retry |
| REPLICA | Object is a replica (received from another site) |

Multi-Target Status

For multi-site replication, status is tracked per destination:

Object: bucket/key
├── Target 1 (site-a): COMPLETED
├── Target 2 (site-b): PENDING
└── Target 3 (site-c): FAILED

Operational Metrics

Key metrics for monitoring replication health:

| Metric | Description | Alert Threshold |
|---|---|---|
| Replication Lag | Time since oldest pending item | > 5 minutes |
| MRF Queue Size | Items awaiting retry | > 1000 items |
| Failed Count | Persistent failures | > 0 after retries |
| Bandwidth Usage | Replication throughput | Near link capacity |

Best Practices

  1. Network sizing: Ensure sufficient bandwidth for replication traffic
  2. Monitor MRF queue: Growing queue indicates replication issues
  3. Timestamp sync: Use NTP across all sites for accurate conflict resolution
  4. Separate large objects: Consider dedicated replication rules for large files

Source Code References
  1. cmd/bucket-replication.go:2467 - minLargeObjSize = 128 * humanize.MiByte
  2. cmd/bucket-replication.go:3815 - mrfSaveInterval = 5 * time.Minute
  3. cmd/bucket-replication.go:3816 - mrfQueueInterval = mrfSaveInterval + time.Minute
  4. cmd/bucket-replication.go:3818 - mrfRetryLimit = 3