Understanding MinIO AIStor’s multi-site data flow is essential for architects designing geo-distributed deployments and operators managing site replication.
Answer
MinIO uses timestamp-based synchronization with persistent failure queues and concurrent updates to all peers. The multi-site architecture provides eventual consistency across sites through sequence tracking, read-repair mechanisms, and resilient failure recovery with exponential-backoff retries.
Sequence Tracking
MinIO tracks multiple timestamps and status indicators to manage multi-site synchronization.
Tracking Components
| Component | Purpose | Scope |
|---|---|---|
| Replica Update Timestamp | When object was last replicated | Per-object |
| Replication Activity Timestamp | When replication was last attempted | Per-object |
| Target Status Map | Replication status per destination | Per-object, per-target |
| Version Purge Status | Delete marker replication state | Per-version |
Sequence Flow
```
Object Metadata
├── ReplicaTimestamp: 2025-01-05T10:30:00Z
│   └── Last successful replication
│
├── ReplicationActivityTimestamp: 2025-01-05T10:30:05Z
│   └── Last replication attempt
│
├── TargetStatusMap:
│   ├── site-b: COMPLETED
│   ├── site-c: PENDING
│   └── site-d: FAILED
│
└── VersionPurgeStatus:
    └── Pending | Complete | Failed
```
Target Status Values
| Status | Meaning | Next Action |
|---|---|---|
| PENDING | Queued for replication | Will be processed |
| COMPLETED | Successfully replicated | None |
| FAILED | Replication failed | Queued for retry |
| REPLICA | Object is a replica | Do not re-replicate |
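The "Next Action" column above reduces to a simple rule: only PENDING or FAILED objects are queued, and REPLICA objects are always skipped, which is what breaks re-replication loops. A hedged sketch (helper names are assumptions, not MinIO functions):

```python
def should_replicate(status: str) -> bool:
    """Queue for (re)replication only when pending or previously failed."""
    return status in ("PENDING", "FAILED")

def next_action(status: str) -> str:
    """Map each target status to the action from the table above."""
    return {
        "PENDING": "process",
        "COMPLETED": "none",
        "FAILED": "retry",
        "REPLICA": "skip",   # replicas are never re-replicated
    }.get(status, "unknown")
```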
Multi-Site Data Flow
Write Flow (Source Site)
```
Client Write
     │
     ▼
1. Write to local erasure set
     │
     ▼
2. Set ReplicationActivityTimestamp
     │
     ▼
3. Initialize TargetStatusMap (all PENDING)
     │
     ▼
4. Queue for replication to each target
     │
     ├──► Site B ──► On success: COMPLETED
     │
     ├──► Site C ──► On success: COMPLETED
     │
     └──► Site D ──► On failure: FAILED → MRF

5. Update ReplicaTimestamp on success
```
Replication Flow (To Target Site)
```
Incoming Replication Request
     │
     ▼
1. Timestamp Resolution
   └── Compare incoming vs local timestamp
           ├── Incoming newer → Accept
           └── Local newer   → Reject (already updated)
     │
     ▼
2. Write to local erasure set
     │
     ▼
3. Mark object as REPLICA
   └── Prevents re-replication loops
     │
     ▼
4. Send acknowledgment to source
```
Failure Recovery
MinIO implements robust failure recovery for multi-site replication.
MRF Queue (Most Recent Failures)
```
Location: .minio.sys/replication/mrf/<nodename>.bin

Format:
├── Header (binary)
│   └── Version, flags, metadata
└── Payload (msgpack encoded)
    ├── Bucket name
    ├── Object key
    ├── Version ID
    ├── Target site
    ├── Retry count
    └── Failure timestamp

Serialization: MessagePack (msgp)
```
MRF location[1] and serialization format[2] verified from source.
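The MRF payload fields can be modeled as a small record. A minimal sketch: the real queue is MessagePack-encoded via `msgp`, so JSON here is only a stand-in to show the round trip, and the field names are illustrative:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class MRFEntry:
    """Hypothetical MRF queue entry mirroring the payload fields above."""
    bucket: str
    object_key: str
    version_id: str
    target_site: str
    retry_count: int
    failure_timestamp: str

entry = MRFEntry("photos", "cat.jpg", "v1", "site-d", 2, "2025-01-05T10:30:10Z")

# Round-trip the entry; MinIO uses MessagePack, JSON is the stand-in here.
payload = json.dumps(asdict(entry))
restored = MRFEntry(**json.loads(payload))
```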
Retry Logic
Exponential Backoff Retry:
```
Attempt 1 → Immediate retry
    │
    └── Failure → Wait 1 × base_interval
                      │
Attempt 2 ◄───────────┘
    │
    └── Failure → Wait 2 × base_interval
                      │
Attempt 3 ◄───────────┘
    │
    └── Failure → Wait 4 × base_interval
                      │
    ...               │
Attempt N ◄───────────┘
    │
    └── Failure → Exceeds retry limit?
                      │
                      ├── No  → Continue backoff
                      │
                      └── Yes → Drop from queue
                                Log permanent failure
```
Retry Parameters
| Parameter | Description |
|---|---|
| Initial Delay | Base interval for first retry |
| Backoff Multiplier | 2× for exponential growth |
| Max Retries | Configurable retry limit |
| Max Delay | Cap on backoff interval |
| Drop Policy | Dropped after retry limit exceeded |
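The parameters above amount to a capped doubling of the retry delay. A minimal sketch, with illustrative values rather than MinIO defaults:

```python
def backoff_delay(attempt: int, base: float = 1.0, max_delay: float = 60.0) -> float:
    """Delay before retry N (1-based): base * 2^(N-1), capped at max_delay."""
    return min(base * (2 ** (attempt - 1)), max_delay)

# Delays double each attempt until the cap kicks in.
delays = [backoff_delay(n) for n in range(1, 8)]
# → [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
```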
Consistency Mechanisms
Timestamp Resolution
When receiving replicated data, MinIO verifies timestamps to maintain consistency:
```
Incoming Object
└── Timestamp: T_incoming

Local Object (if exists)
└── Timestamp: T_local

Resolution:
├── T_incoming > T_local → Accept incoming
├── T_incoming < T_local → Reject (stale)
└── T_incoming = T_local → Accept (idempotent)

Verification:
└── verifyState() checks object can be accepted
    based on current replication state
```
Read-Repair
MinIO merges incoming metadata with local state during replication:
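A minimal sketch of such a merge, assuming a simplified metadata dict with `mtime`, `user_meta`, and per-target status keys (illustrative names, not MinIO's actual metadata layout):

```python
def merge_metadata(local: dict, incoming: dict) -> dict:
    """Merge incoming replica metadata with local state (illustrative rules)."""
    newer = incoming if incoming["mtime"] >= local["mtime"] else local
    return {
        "mtime": newer["mtime"],           # version info: latest timestamp wins
        "user_meta": newer["user_meta"],   # user metadata: incoming wins if newer
        # replication status: combine target maps from both sides
        "targets": {**local["targets"], **incoming["targets"]},
    }

local = {"mtime": 1, "user_meta": {"k": "old"}, "targets": {"site-b": "COMPLETED"}}
incoming = {"mtime": 2, "user_meta": {"k": "new"}, "targets": {"site-c": "PENDING"}}
merged = merge_metadata(local, incoming)
```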
```
On Replica Receive:
     │
     ▼
1. Read local metadata (if exists)
     │
     ▼
2. Compare with incoming metadata
     │
     ▼
3. Merge:
   ├── User metadata: Incoming wins if newer
   ├── Replication status: Combine target maps
   └── Version info: Latest timestamp wins
     │
     ▼
4. Write merged metadata
```
Version Mismatch Detection
```
Version Info Process:
└── Tracks per-site version information

Per-Site Tracking:
Object: bucket/key
├── Site A: version=v3, timestamp=T1
├── Site B: version=v3, timestamp=T1
└── Site C: version=v2, timestamp=T0  ← Stale

Detection Triggers:
├── Background scanner comparison
├── Read request with stale data detected
└── Replication status check

Resolution:
└── Queue stale site for resync
```
Concurrent Peer Updates
Multi-site replication handles concurrent updates across sites.
Concurrent Write Handling
```
Site A writes object X at T1
Site B writes object X at T2

Scenario 1: T1 < T2
├── Site A's write reaches Site B first
├── Site B overwrites with its own write
├── Site A eventually receives T2 version
└── All sites converge to T2 version

Scenario 2: T2 < T1
├── Site B's write reaches Site A first
├── Site A overwrites with its own write (newer)
├── Site B eventually receives T1 version
└── All sites converge to T1 version

Scenario 3: T1 = T2 (rare)
├── Deterministic tiebreaker (e.g., site ID)
└── Consistent resolution across all sites
```
Update Propagation
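The last-writer-wins resolution behind these scenarios can be sketched as follows. Each site applies the same rule, so delivery order does not affect the final state; the `(timestamp, site_id)` tiebreaker is an assumed deterministic rule matching Scenario 3:

```python
def apply(current: tuple, update: tuple) -> tuple:
    """Keep whichever (timestamp, site_id, value) update wins the comparison."""
    return update if update[:2] > current[:2] else current

w_a = (1, "site-a", "X written at A")   # Site A writes at T1
w_b = (2, "site-b", "X written at B")   # Site B writes at T2

# Regardless of which update arrives first, both sites converge on the T2 write.
at_site_a = apply(w_a, w_b)   # B's update arrives at Site A
at_site_b = apply(w_b, w_a)   # A's update arrives at Site B
```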
```
Site A ◄──────────────────────────────► Site B
  │                                        │
  │       Bidirectional Replication        │
  │                                        │
  └──────────────────┬─────────────────────┘
                     │
                     ▼
                  Site C

Each site:
├── Sends updates to all peers concurrently
├── Receives updates from all peers
├── Applies timestamp-based resolution
└── Maintains consistent final state
```
Monitoring Multi-Site Replication
Key Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| Replication lag | Time since oldest pending | > 5 minutes |
| MRF queue size | Failed items pending retry | Growing trend |
| Cross-site latency | RTT between sites | > expected network RTT |
| Version mismatches | Detected inconsistencies | > 0 |
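An illustrative alert check over these thresholds; function and parameter names are assumptions, and in practice the inputs would come from the Prometheus metrics shown below:

```python
def check_health(lag_seconds: float, mrf_sizes: list[int], mismatches: int) -> list[str]:
    """Evaluate the alert thresholds from the metrics table (illustrative)."""
    alerts = []
    if lag_seconds > 300:                                 # > 5 minutes of lag
        alerts.append("replication lag > 5 minutes")
    if len(mrf_sizes) >= 2 and mrf_sizes[-1] > mrf_sizes[0]:  # growing trend
        alerts.append("MRF queue growing")
    if mismatches > 0:                                    # any inconsistency
        alerts.append("version mismatch detected")
    return alerts
```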
Health Check Commands
```shell
# Check replication status
mc replicate status ALIAS/bucket

# View site replication info
mc admin replicate info ALIAS

# Check replication metrics
mc admin prometheus metrics ALIAS | grep replication
```
Best Practices
- Network reliability: Ensure stable, low-latency links between sites
- Monitor MRF queues: Growing queues indicate connectivity issues
- Timestamp sync: Use NTP across all sites for accurate ordering
- Capacity planning: Size each site for full workload (failover capability)
- Test failover: Regularly validate disaster recovery procedures
Source Code References
- cmd/bucket-replication.go:3927 – `pathJoin(replicationMRFDir, globalLocalNodeNameHex+".bin")` (MRF queue location)
- cmd/bucket-replication.go:3907 – `mw := msgp.NewWriter(w)` (MessagePack serialization)
- internal/bucket/replication/datatypes.go:25-37 – StatusType constants: Pending, Completed, Failed, Replica