How does MinIO AIStor handle multi-site data flow internally?

Asked by muratkars · Answered by muratkars · January 4, 2026

Understanding MinIO AIStor’s multi-site data flow is essential for architects designing geo-distributed deployments and operators managing site replication.

Answer

MinIO uses timestamp-based synchronization with persistent failure queues and concurrent peer updates. The multi-site architecture reaches eventual consistency across sites through sequence tracking, read-repair, and failure recovery with exponential-backoff retries.


Sequence Tracking

MinIO tracks multiple timestamps and status indicators to manage multi-site synchronization.

Tracking Components

| Component | Purpose | Scope |
| --- | --- | --- |
| Replica Update Timestamp | When the object was last replicated | Per-object |
| Replication Activity Timestamp | When replication was last attempted | Per-object |
| Target Status Map | Replication status per destination | Per-object, per-target |
| Version Purge Status | Delete marker replication state | Per-version |

Sequence Flow

┌─────────────────────────────────────────────────────────┐
│ Sequence Tracking │
├─────────────────────────────────────────────────────────┤
│ │
│ Object Metadata │
│ ├── ReplicaTimestamp: 2025-01-05T10:30:00Z │
│ │ └── Last successful replication │
│ │ │
│ ├── ReplicationActivityTimestamp: 2025-01-05T10:30:05Z│
│ │ └── Last replication attempt │
│ │ │
│ ├── TargetStatusMap: │
│ │ ├── site-b: COMPLETED │
│ │ ├── site-c: PENDING │
│ │ └── site-d: FAILED │
│ │ │
│ └── VersionPurgeStatus: │
│ └── Pending | Complete | Failed │
│ │
└─────────────────────────────────────────────────────────┘
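Conceptually, the tracked fields map onto a small per-object metadata record. The Go sketch below is an illustration only: the type and field names are simplified stand-ins, not MinIO's internal structures.

```go
package replication

import "time"

// replicationState is a simplified stand-in for the per-object
// tracking metadata described above; not MinIO's actual type.
type replicationState struct {
	ReplicaTimestamp             time.Time         // last successful replication
	ReplicationActivityTimestamp time.Time         // last replication attempt
	TargetStatusMap              map[string]string // e.g. "site-b": "COMPLETED"
	VersionPurgeStatus           string            // Pending | Complete | Failed
}
```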

Target Status Values

| Status | Meaning | Next Action |
| --- | --- | --- |
| PENDING | Queued for replication | Will be processed |
| COMPLETED | Successfully replicated | None |
| FAILED | Replication failed | Queued for retry |
| REPLICA | Object is a replica | Do not re-replicate |
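These values correspond to the StatusType constants in internal/bucket/replication/datatypes.go (reference [3]). A simplified Go rendering, not a copy of MinIO's file:

```go
package replication

// StatusType mirrors the constants cited in reference [3];
// the comments summarize the table above.
type StatusType string

const (
	Pending   StatusType = "PENDING"   // queued for replication
	Completed StatusType = "COMPLETED" // successfully replicated
	Failed    StatusType = "FAILED"    // queued for retry via the MRF queue
	Replica   StatusType = "REPLICA"   // received from a peer; do not re-replicate
)
```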

Multi-Site Data Flow

Write Flow (Source Site)

┌─────────────────────────────────────────────────────────┐
│ Source Site Write Flow │
├─────────────────────────────────────────────────────────┤
│ │
│ Client Write │
│ │ │
│ ▼ │
│ 1. Write to local erasure set │
│ │ │
│ ▼ │
│ 2. Set ReplicationActivityTimestamp │
│ │ │
│ ▼ │
│ 3. Initialize TargetStatusMap (all PENDING) │
│ │ │
│ ▼ │
│ 4. Queue for replication to each target │
│ │ │
│ ├──► Site B ──► On success: COMPLETED │
│ │ │
│ ├──► Site C ──► On success: COMPLETED │
│ │ │
│ └──► Site D ──► On failure: FAILED → MRF │
│ │
│ 5. Update ReplicaTimestamp on success │
│ │
└─────────────────────────────────────────────────────────┘
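A minimal Go sketch of steps 2 through 5 follows. The names (writeAndQueue, objectMeta, replicate) are hypothetical, and the loop replicates to targets sequentially for clarity, whereas the real flow queues targets concurrently.

```go
package replication

import "time"

type objectMeta struct {
	ReplicaTimestamp  time.Time
	ActivityTimestamp time.Time
	TargetStatus      map[string]string
}

// writeAndQueue performs steps 2-5 of the source-site write flow.
func writeAndQueue(meta *objectMeta, targets []string, replicate func(target string) error) {
	// Step 2: record that replication was attempted.
	meta.ActivityTimestamp = time.Now().UTC()

	// Step 3: initialize every target as PENDING.
	meta.TargetStatus = make(map[string]string, len(targets))
	for _, t := range targets {
		meta.TargetStatus[t] = "PENDING"
	}

	// Step 4: replicate to each target (sequential here for clarity).
	for _, t := range targets {
		if err := replicate(t); err != nil {
			meta.TargetStatus[t] = "FAILED" // handed off to the MRF queue
			continue
		}
		meta.TargetStatus[t] = "COMPLETED"
		// Step 5: update ReplicaTimestamp on success.
		meta.ReplicaTimestamp = time.Now().UTC()
	}
}
```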

Replication Flow (To Target Site)

┌─────────────────────────────────────────────────────────┐
│ Target Site Receive Flow │
├─────────────────────────────────────────────────────────┤
│ │
│ Incoming Replication Request │
│ │ │
│ ▼ │
│ 1. Timestamp Resolution │
│ └── Compare incoming vs local timestamp │
│ │ │
│ ├── Incoming newer → Accept │
│ │ │
│ └── Local newer → Reject (already updated) │
│ │ │
│ ▼ │
│ 2. Write to local erasure set │
│ │ │
│ ▼ │
│ 3. Mark object as REPLICA │
│ └── Prevents re-replication loops │
│ │ │
│ ▼ │
│ 4. Send acknowledgment to source │
│ │
└─────────────────────────────────────────────────────────┘
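The REPLICA marker in step 3 is what breaks replication loops: an object received from a peer must never be queued for outbound replication again. A minimal sketch, assuming a hypothetical shouldReplicate helper:

```go
package replication

// shouldReplicate returns false for objects that arrived via
// replication, so two sites never ping-pong the same update.
func shouldReplicate(status string) bool {
	return status != "REPLICA"
}
```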

Failure Recovery

MinIO implements robust failure recovery for multi-site replication.

MRF Queue (Most Recently Failed)

┌─────────────────────────────────────────────────────────┐
│ MRF Queue │
├─────────────────────────────────────────────────────────┤
│ │
│ Location: .minio.sys/replication/mrf/<nodename>.bin │
│ │
│ Format: │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Header (binary) │ │
│ │ └── Version, flags, metadata │ │
│ ├─────────────────────────────────────────────────┤ │
│ │ Payload (msgpack encoded) │ │
│ │ ├── Bucket name │ │
│ │ ├── Object key │ │
│ │ ├── Version ID │ │
│ │ ├── Target site │ │
│ │ ├── Retry count │ │
│ │ └── Failure timestamp │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ Serialization: MessagePack (msgp) │
│ │
└─────────────────────────────────────────────────────────┘

The MRF queue location[1] and its serialization format[2] are verified against the MinIO source.
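Based on the payload fields listed above, a single MRF entry might look like the record below. MinIO serializes with the code-generating github.com/tinylib/msgp package (reference [2]); this sketch swaps in the reflection-based vmihailenco/msgpack library purely for brevity, and the mrfEntry shape is an assumption.

```go
package main

import (
	"fmt"
	"time"

	"github.com/vmihailenco/msgpack/v5"
)

// mrfEntry is a hypothetical shape for one failed-replication record.
type mrfEntry struct {
	Bucket     string
	Object     string
	VersionID  string
	TargetSite string
	RetryCount int
	FailedAt   time.Time
}

func main() {
	entry := mrfEntry{
		Bucket:     "data",
		Object:     "reports/q1.parquet",
		VersionID:  "0f4a2c9e-example",
		TargetSite: "site-d",
		RetryCount: 1,
		FailedAt:   time.Now().UTC(),
	}
	b, err := msgpack.Marshal(entry)
	if err != nil {
		panic(err)
	}
	fmt.Printf("encoded MRF entry: %d bytes\n", len(b))
}
```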

Retry Logic

Exponential Backoff Retry:
Attempt 1 → Immediate retry
└── Failure → Wait 1 × base_interval
Attempt 2 ◄────────────────┘
└── Failure → Wait 2 × base_interval
Attempt 3 ◄────────────────┘
└── Failure → Wait 4 × base_interval
...
Attempt N ◄────────────────┘
└── Failure → Exceeds retry limit?
├── No → Continue backoff
└── Yes → Drop from queue
Log permanent failure

Retry Parameters

| Parameter | Description |
| --- | --- |
| Initial Delay | Base interval for the first retry |
| Backoff Multiplier | 2× for exponential growth |
| Max Retries | Configurable retry limit |
| Max Delay | Cap on the backoff interval |
| Drop Policy | Entry dropped after the retry limit is exceeded |
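The retry loop sketched below ties these parameters together. All concrete values are illustrative, not MinIO defaults.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryWithBackoff retries op with exponential backoff, a delay cap,
// and a retry limit, mirroring the parameters in the table above.
func retryWithBackoff(op func() error, base, maxDelay time.Duration, maxRetries int) error {
	delay := base
	for attempt := 1; attempt <= maxRetries; attempt++ {
		if err := op(); err == nil {
			return nil
		}
		if attempt == maxRetries {
			break // drop policy: give up after the retry limit
		}
		time.Sleep(delay)
		delay *= 2 // backoff multiplier
		if delay > maxDelay {
			delay = maxDelay // max delay cap
		}
	}
	return errors.New("retry limit exceeded: dropping from queue")
}

func main() {
	failures := 0
	err := retryWithBackoff(func() error {
		failures++
		if failures < 3 {
			return errors.New("transient failure")
		}
		return nil
	}, 100*time.Millisecond, 2*time.Second, 5)
	fmt.Println("result:", err) // succeeds on the third attempt
}
```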

Consistency Mechanisms

Timestamp Resolution

When receiving replicated data, MinIO verifies timestamps to maintain consistency:

┌─────────────────────────────────────────────────────────┐
│ Timestamp Resolution │
├─────────────────────────────────────────────────────────┤
│ │
│ Incoming Object │
│ └── Timestamp: T_incoming │
│ │
│ Local Object (if exists) │
│ └── Timestamp: T_local │
│ │
│ Resolution: │
│ ├── T_incoming > T_local → Accept incoming │
│ ├── T_incoming < T_local → Reject (stale) │
│ └── T_incoming = T_local → Accept (idempotent) │
│ │
│ Verification: │
│ └── verifyState() checks object can be accepted │
│ based on current replication state │
│ │
└─────────────────────────────────────────────────────────┘
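The decision table reduces to a small comparison. The sketch below is a simplified stand-in; the actual verifyState() check involves more replication state than a single timestamp.

```go
package replication

import "time"

// acceptIncoming implements the three-way resolution above:
// newer wins, stale is rejected, equal timestamps are idempotent.
func acceptIncoming(incoming, local time.Time) bool {
	switch {
	case incoming.After(local):
		return true // incoming is newer: accept
	case incoming.Before(local):
		return false // local is newer: reject as stale
	default:
		return true // equal: re-applying is a no-op
	}
}
```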

Read-Repair

MinIO merges incoming metadata with local state during replication:

┌─────────────────────────────────────────────────────────┐
│ Read-Repair │
├─────────────────────────────────────────────────────────┤
│ │
│ On Replica Receive: │
│ │ │
│ ▼ │
│ 1. Read local metadata (if exists) │
│ │ │
│ ▼ │
│ 2. Compare with incoming metadata │
│ │ │
│ ▼ │
│ 3. Merge: │
│ ├── User metadata: Incoming wins if newer │
│ ├── Replication status: Combine target maps │
│ └── Version info: Latest timestamp wins │
│ │ │
│ ▼ │
│ 4. Write merged metadata │
│ │
└─────────────────────────────────────────────────────────┘
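A sketch of that merge, using simplified metadata types; the field names and the COMPLETED short-circuit are assumptions, not MinIO's implementation.

```go
package replication

import "time"

type meta struct {
	ModTime      time.Time
	UserMetadata map[string]string
	TargetStatus map[string]string
}

// merge combines incoming replica metadata with local state per the
// three rules above: newer user metadata wins, target maps are
// unioned, and the latest timestamp is kept.
func merge(local, incoming meta) meta {
	out := local
	if incoming.ModTime.After(local.ModTime) {
		out.UserMetadata = incoming.UserMetadata
		out.ModTime = incoming.ModTime
	}

	// Union the per-target status maps, never regressing a COMPLETED entry.
	merged := make(map[string]string, len(local.TargetStatus)+len(incoming.TargetStatus))
	for t, s := range local.TargetStatus {
		merged[t] = s
	}
	for t, s := range incoming.TargetStatus {
		if merged[t] != "COMPLETED" {
			merged[t] = s
		}
	}
	out.TargetStatus = merged
	return out
}
```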

Version Mismatch Detection

┌─────────────────────────────────────────────────────────┐
│ Version Mismatch Detection │
├─────────────────────────────────────────────────────────┤
│ │
│ Version Info Process: │
│ └── Tracks per-site version information │
│ │
│ Per-Site Tracking: │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Object: bucket/key │ │
│ │ ├── Site A: version=v3, timestamp=T1 │ │
│ │ ├── Site B: version=v3, timestamp=T1 │ │
│ │ └── Site C: version=v2, timestamp=T0 ← Stale │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ Detection Triggers: │
│ ├── Background scanner comparison │
│ ├── Read request with stale data detected │
│ └── Replication status check │
│ │
│ Resolution: │
│ └── Queue stale site for resync │
│ │
└─────────────────────────────────────────────────────────┘
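In code form, stale-site detection is a comparison against the newest version seen. The siteVersion record and staleSites helper below are hypothetical:

```go
package replication

import "time"

type siteVersion struct {
	Site    string
	Version string
	ModTime time.Time
}

// staleSites returns the sites whose version lags the newest one,
// i.e. the sites that should be queued for resync.
func staleSites(versions []siteVersion) []string {
	var newest siteVersion
	for _, v := range versions {
		if v.ModTime.After(newest.ModTime) {
			newest = v
		}
	}
	var stale []string
	for _, v := range versions {
		if v.Version != newest.Version {
			stale = append(stale, v.Site)
		}
	}
	return stale
}
```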

Concurrent Peer Updates

Multi-site replication must resolve concurrent updates to the same object made at different sites.

Concurrent Write Handling

Site A writes object X at T1
Site B writes object X at T2
Scenario 1: T1 < T2
├── Site A's write reaches Site B first
├── Site B overwrites with its own write
├── Site A eventually receives T2 version
└── All sites converge to T2 version
Scenario 2: T2 < T1
├── Site B's write reaches Site A first
├── Site A overwrites with its own write (newer)
├── Site B eventually receives T1 version
└── All sites converge to T1 version
Scenario 3: T1 = T2 (rare)
├── Deterministic tiebreaker (e.g., site ID)
└── Consistent resolution across all sites
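All three scenarios reduce to a last-writer-wins rule plus a deterministic tiebreaker. Using the site ID as the tiebreaker is only the example the text gives; the sketch below assumes it.

```go
package replication

import "time"

type write struct {
	SiteID  string
	ModTime time.Time
}

// winner picks the same version at every site, so all sites converge
// regardless of the order in which updates arrive.
func winner(a, b write) write {
	if a.ModTime.After(b.ModTime) {
		return a // newer write wins
	}
	if b.ModTime.After(a.ModTime) {
		return b // newer write wins
	}
	// Equal timestamps (rare): break the tie deterministically so
	// every site picks the same winner.
	if a.SiteID > b.SiteID {
		return a
	}
	return b
}
```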

Update Propagation

┌─────────────────────────────────────────────────────────┐
│ Concurrent Peer Updates │
├─────────────────────────────────────────────────────────┤
│ │
│ Site A ◄────────────────────────────────────► Site B │
│ │ │ │
│ │ Bidirectional Replication │ │
│ │ │ │
│ └──────────────────┬───────────────────────────┘ │
│ │ │
│ ▼ │
│ Site C │
│ │
│ Each site: │
│ ├── Sends updates to all peers concurrently │
│ ├── Receives updates from all peers │
│ ├── Applies timestamp-based resolution │
│ └── Maintains consistent final state │
│ │
└─────────────────────────────────────────────────────────┘

Monitoring Multi-Site Replication

Key Metrics

| Metric | Description | Alert Threshold |
| --- | --- | --- |
| Replication lag | Time since the oldest pending operation | > 5 minutes |
| MRF queue size | Failed items pending retry | Growing trend |
| Cross-site latency | RTT between sites | > expected network RTT |
| Version mismatches | Detected inconsistencies | > 0 |

Health Check Commands

```sh
# Check replication status
mc replicate status ALIAS/bucket

# View site replication info
mc admin replicate info ALIAS

# Check replication metrics
mc admin prometheus metrics ALIAS | grep replication
```

Best Practices

  1. Network reliability: Ensure stable, low-latency links between sites
  2. Monitor MRF queues: Growing queues indicate connectivity issues
  3. Timestamp sync: Use NTP across all sites for accurate ordering
  4. Capacity planning: Size each site for full workload (failover capability)
  5. Test failover: Regularly validate disaster recovery procedures

Source Code References
  1. cmd/bucket-replication.go:3927 - pathJoin(replicationMRFDir, globalLocalNodeNameHex+".bin")
  2. cmd/bucket-replication.go:3907 - mw := msgp.NewWriter(w) (MessagePack serialization)
  3. internal/bucket/replication/datatypes.go:25-37 - StatusType constants: Pending, Completed, Failed, Replica