How does the MinIO AIStor replication engine work internally?

Asked by muratkars · Answered by muratkars · January 4, 2026

Understanding MinIO AIStor’s replication engine internals helps operators design resilient multi-site architectures and troubleshoot replication issues effectively.

Answer

MinIO uses an event-driven replication engine built on worker pools, a persistent MRF (Most Recent Failures) queue, and timestamp-based conflict resolution. The architecture supports multiple replication types with intelligent routing, automatic retries, and deterministic conflict handling.


Event Triggers

Replication is triggered by specific events that are evaluated against configured rules.

Event Evaluation Flow

Object Operation (PUT/DELETE/Metadata)
                  │
                  ▼
┌─────────────────────────────────────┐
│       Must Have to Replicate        │ ← Evaluates bucket rules
│     (Rule Qualification Check)      │
└─────────────────────────────────────┘
                  │
        ├── No Match → No replication
        └── Match → Check Operation Type
                  │
        ┌─────────┴─────────┐
        ▼                   ▼
┌─────────────────┐  ┌─────────────────────┐
│   Object/Meta   │  │  Check Replication  │
│   Replication   │  │     for Deletes     │
└─────────────────┘  └─────────────────────┘

Event Types

| Event | Purpose | Trigger |
|---|---|---|
| Must Have to Replicate | Evaluates rules for object qualification | Every object operation |
| Check Replication for Deletes | Determines if delete markers replicate | DELETE operations |

Replication Types

MinIO supports multiple replication scenarios, each with specific handling.

| Type | Description | Use Case |
|---|---|---|
| Basic Object Replication | Regular object PUT operations | New object uploads |
| Replication for Deletes | Delete marker propagation | Object deletions |
| Replication for Metadata | Metadata-only updates | Tag/retention changes |
| Replication for Healing | Failed replication retry | Recovery operations |
| Replication for Existing | Existing object resync | Initial sync, disaster recovery |

Replication Type Flow

┌─────────────────────────────────────────────────────────┐
│ Replication Types │
├─────────────────────────────────────────────────────────┤
│ │
│ PUT Object ──────────────► Basic Object Replication │
│ │
│ DELETE Object ───────────► Replication for Deletes │
│ │
│ Update Metadata ─────────► Replication for Metadata │
│ │
│ MRF Queue Item ──────────► Replication for Healing │
│ │
│ Resync Command ──────────► Replication for Existing │
│ │
└─────────────────────────────────────────────────────────┘

Queue Architecture

The replication engine uses intelligent routing to distribute work efficiently.

Worker Pool Architecture

┌─────────────────────────────────────────────────────────┐
│ Replication Queue │
├─────────────────────────────────────────────────────────┤
│ │
│ Incoming Object │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Hash-Based Routing (bucket + object key) │ │
│ └─────────────────────────────────────────────────┘ │
│ │ │
│ ├── Size < 128 MiB ──► Standard Workers │
│ │ [Worker 1] [Worker 2]... │
│ │ │
│ └── Size ≥ 128 MiB ──► Large Object Workers │
│ [LO Worker 1] [LO 2]... │
│ │
│ Queue Full? ──────────────► Overflow to MRF Queue │
│ │
└─────────────────────────────────────────────────────────┘

Routing Details

| Aspect | Behavior | Purpose |
|---|---|---|
| Hash-Based Distribution | hash(bucket + object) determines worker | Consistent routing, ordering |
| Large Object Threshold | ≥ 128 MiB[1] | Dedicated workers prevent blocking |
| Overflow Handling | Full channels trigger MRF queue | Prevents dropped operations |

Why Separate Large Object Workers?

  • Prevents head-of-line blocking: Large transfers don’t delay small objects
  • Optimized throughput: Large objects benefit from dedicated bandwidth
  • Resource isolation: Memory/CPU usage separated from standard operations

Retry Mechanism

MinIO implements a multi-stage retry system for reliable replication.

Retry Stages

Stage 1: Inline Retry
  └── Failure → Queue to MRF
Stage 2: MRF Persistence (every 5 minutes)
  └── Persist queue to disk
Stage 3: MRF Processing (every 6 minutes)
  └── Retry queued items
Stage 4: Scanner Takeover (after 3 retries)
  └── Background scanner handles persistent failures

Stage Details

| Stage | Timing | Description |
|---|---|---|
| Stage 1 | Immediate | Inline failure triggers MRF queue entry |
| Stage 2 | Every 5 minutes[2] | MRF queue persisted to disk for durability |
| Stage 3 | Every 6 minutes[3] | MRF processor retries failed items |
| Stage 4 | After 3 retries[4] | Scanner takes over for persistent failures |

MRF (Most Recent Failures) Queue

┌─────────────────────────────────────────────────────────┐
│ MRF Queue │
├─────────────────────────────────────────────────────────┤
│ │
│ In-Memory Queue │
│ │ │
│ ├── Entry: {bucket, object, version, retry_count} │
│ │ │
│ └── Every 5 min → Persist to disk │
│ │ │
│ ▼ │
│ .minio.sys/replication/ │
│ │
│ Every 6 min → Process persisted entries │
│ │ │
│ ├── Success → Remove from queue │
│ │ │
│ └── Failure → Increment retry_count │
│ │ │
│ └── retry_count > 3 → Scanner mode │
│ │
└─────────────────────────────────────────────────────────┘

Conflict Resolution

MinIO uses deterministic rules to resolve conflicts in multi-site replication.

Timestamp-Based Resolution

Conflict Detected (same object, different content)
┌─────────────────────────────────────────────────────────┐
│ Compare Timestamps │
│ │
│ Replica Timestamp vs Replication Source Timestamp │
│ │
│ Winner = Most Recent Timestamp │
└─────────────────────────────────────────────────────────┘
Latest timestamp wins → Object updated
Older timestamp loses → Replication skipped

Resolution Rules

| Scenario | Resolution | Outcome |
|---|---|---|
| Source newer | Source wins | Target updated |
| Target newer | Target wins | Replication skipped |
| Equal timestamps | Source wins | Ensures consistency |

Version Purge States

For delete marker replication, MinIO tracks purge status:

| State | Description | Next Action |
|---|---|---|
| Pending | Delete initiated, replication in progress | Wait for completion |
| Complete | Delete replicated successfully | No action needed |
| Failed | Delete replication failed | Retry via MRF |

Null Version Handling

Pre-Replication Objects (null version ID)
┌─────────────────────────────────────────────────────────┐
│ Null Version Check │
│ │
│ Object has null version? │
│ │ │
│ ├── Yes → Skip replication │
│ │ (Prevents resync of legacy objects) │
│ │ │
│ └── No → Proceed with replication │
└─────────────────────────────────────────────────────────┘

Why Null Version Check?

  • Prevents infinite replication loops
  • Excludes objects created before replication was enabled
  • Ensures only versioned objects participate in replication

Replication Status Tracking

MinIO tracks replication status per object and per target.

Status Values

| Status | Meaning |
|---|---|
| PENDING | Queued for replication |
| COMPLETED | Successfully replicated |
| FAILED | Replication failed, queued for retry |
| REPLICA | Object is a replica (received from another site) |

Multi-Target Status

For multi-site replication, status is tracked per destination:

Object: bucket/key
├── Target 1 (site-a): COMPLETED
├── Target 2 (site-b): PENDING
└── Target 3 (site-c): FAILED

Operational Metrics

Key metrics for monitoring replication health:

| Metric | Description | Alert Threshold |
|---|---|---|
| Replication Lag | Time since oldest pending item | > 5 minutes |
| MRF Queue Size | Items awaiting retry | > 1000 items |
| Failed Count | Persistent failures | > 0 after retries |
| Bandwidth Usage | Replication throughput | Near link capacity |

Best Practices

  1. Network sizing: Ensure sufficient bandwidth for replication traffic
  2. Monitor MRF queue: Growing queue indicates replication issues
  3. Timestamp sync: Use NTP across all sites for accurate conflict resolution
  4. Separate large objects: Consider dedicated replication rules for large files

Source Code References
  1. cmd/bucket-replication.go:2467 - minLargeObjSize = 128 * humanize.MiByte
  2. cmd/bucket-replication.go:3815 - mrfSaveInterval = 5 * time.Minute
  3. cmd/bucket-replication.go:3816 - mrfQueueInterval = mrfSaveInterval + time.Minute
  4. cmd/bucket-replication.go:3818 - mrfRetryLimit = 3