How does MinIO AIStor coordinate operations across the cluster?

Asked and answered by muratkars · January 4, 2026

Understanding MinIO AIStor’s cluster coordination mechanisms is essential for operators managing distributed deployments and troubleshooting coordination-related issues.

Answer

MinIO AIStor coordinates cluster operations through a grid-based RPC layer paired with a dedicated lock manager and quorum-based distributed locking. The architecture separates general peer communication from lock traffic, so lock operations remain responsive even under heavy data I/O. Quorum-based locking provides consistency guarantees, while lease-based lock management prevents deadlocks when nodes fail.


RPC Communication

MinIO’s inter-node communication uses a grid-based RPC system.

Communication Architecture

┌─────────────────────────────────────────────────────────┐
│                       MinIO Node                        │
├─────────────────────────────────────────────────────────┤
│  ┌─────────────────────┐    ┌─────────────────────────┐ │
│  │  Main RPC Manager   │    │    Lock RPC Manager     │ │
│  │   (Grid Manager)    │    │     (Isolated Grid)     │ │
│  └──────────┬──────────┘    └──────────┬──────────────┘ │
│             │                          │                │
│             ▼                          ▼                │
│  ┌─────────────────────────────────────────────────┐    │
│  │          WebSocket Transport (TLS 1.3)          │    │
│  └─────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────┘
                            │
                            ▼
                ┌───────────────────────┐
                │      Peer Nodes       │
                └───────────────────────┘

RPC Components

| Component        | Purpose                    | Traffic Type                          |
|------------------|----------------------------|---------------------------------------|
| Main RPC Manager | General peer communication | Data operations, healing, replication |
| Lock RPC Manager | Dedicated lock traffic     | Lock acquire, refresh, release        |

Why Separate Lock Traffic?

  • Isolation: Lock operations don’t compete with data I/O
  • Responsiveness: Lock timeouts remain predictable under load
  • Reliability: Lock manager failures don’t cascade to data operations

Transport Details

| Parameter        | Value         | Description                           |
|------------------|---------------|---------------------------------------|
| Protocol         | WebSocket     | Persistent, bidirectional connections |
| Security         | TLS 1.3       | Encrypted inter-node communication    |
| Connection State | Grid state    | Used for online/offline detection     |
| Ping Interval    | 10-15 seconds | Health check frequency                |

Distributed Locking

MinIO implements quorum-based distributed locks for consistency.

Lock Types and Quorum

Total Nodes = N
Write Lock Quorum = N/2 + 1 (strict majority)
Read Lock Quorum = N - N/2 (complement)

| Lock Type     | Quorum Formula            | Purpose                        |
|---------------|---------------------------|--------------------------------|
| Write Lock[1] | len(nodes)/2 + 1          | Exclusive access for mutations |
| Read Lock[2]  | len(nodes) - len(nodes)/2 | Shared access for reads        |
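
In Go terms, the two quorum formulas referenced above (see the dsync source references at the end) come out to:

```go
package main

import "fmt"

// writeLockQuorum returns the strict majority required for an exclusive
// (write) lock: len(nodes)/2 + 1, per the dsync formula.
func writeLockQuorum(nodes int) int {
	return nodes/2 + 1
}

// readLockQuorum returns the complement used for shared (read) locks:
// len(nodes) - len(nodes)/2.
func readLockQuorum(nodes int) int {
	return nodes - nodes/2
}

func main() {
	for _, n := range []int{4, 8} {
		fmt.Printf("%d nodes: write quorum %d, read quorum %d\n",
			n, writeLockQuorum(n), readLockQuorum(n))
	}
}
```

Note that for odd N the two quorums overlap in at most one node, which is what guarantees a writer and a reader cannot both hold quorum simultaneously.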

Quorum Examples

4-node cluster:

Write Lock Quorum = 4/2 + 1 = 3 nodes
Read Lock Quorum = 4 - 4/2 = 2 nodes

8-node cluster:

Write Lock Quorum = 8/2 + 1 = 5 nodes
Read Lock Quorum = 8 - 8/2 = 4 nodes

Lock Flow

Lock Request
      │
      ▼
┌─────────────────────────────┐
│    Determine Lock Server    │ ← Consistent hash of resource
└─────────────────────────────┘
      │
      ▼
┌─────────────────────────────┐
│    Send to Quorum Nodes     │ ← Parallel requests
└─────────────────────────────┘
      │
      ▼
┌─────────────────────────────┐
│  Wait for Quorum Response   │ ← Timeout: 1 second
└─────────────────────────────┘
      │
      ├── Quorum Met → Lock Acquired
      └── Quorum Failed → Retry or Error

Lease Management

Locks use time-based leases to prevent deadlocks from failed nodes.

Lease Parameters

| Parameter           | Value      | Description                  |
|---------------------|------------|------------------------------|
| Lease Duration[3]   | 60 seconds | Absolute expiry time         |
| Refresh Interval[4] | 10 seconds | How often lock is renewed    |
| Expiry Multiplier   | 6×         | Lease = 6 × refresh interval |

Lease Lifecycle

Lock Acquired (T=0)
├── T=10s → Refresh ✓
├── T=20s → Refresh ✓
├── T=30s → Refresh ✓
├── T=40s → Refresh ✓
├── T=50s → Refresh ✓
└── T=60s → Lease expires if no refresh
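
The lifecycle above amounts to a ticker-driven refresh loop. The sketch below is illustrative (intervals scaled down from 10s/60s so it runs quickly); the helper name and shape are assumptions, not MinIO's API.

```go
package main

import (
	"fmt"
	"time"
)

// Scaled-down stand-ins for the documented 10s refresh / 6× expiry values.
const (
	refreshInterval = 10 * time.Millisecond
	leaseDuration   = 6 * refreshInterval
)

// holdLock renews the lease on every tick until refresh() fails or stop is
// closed. If refreshes stop for any reason, the lease simply expires on the
// lock servers after leaseDuration, so a crashed holder cannot deadlock peers.
func holdLock(refresh func() bool, stop <-chan struct{}) (refreshes int) {
	t := time.NewTicker(refreshInterval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			if !refresh() {
				return // lock lost: caller must restart the operation
			}
			refreshes++
		case <-stop:
			return // normal release path
		}
	}
}

func main() {
	stop := make(chan struct{})
	time.AfterFunc(leaseDuration, func() { close(stop) }) // hold for ~one lease
	fmt.Println("refreshes before release:", holdLock(func() bool { return true }, stop))
}
```

The key property is that liveness never depends on the holder: the servers expire the lease on their own clock.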

Failure Scenarios

| Scenario          | Behavior                                          |
|-------------------|---------------------------------------------------|
| Refresh succeeds  | Lease extended for another 60 seconds             |
| Refresh fails     | Lock holder retries until expiry                  |
| Node crashes      | Lease expires after 60 seconds, lock released     |
| Network partition | Quorum loss triggers error, operations fail-safe  |

Lock Loss Handling

When a lock is lost (quorum failure or lease expiry):

  1. Operation receives lock error
  2. Quorum loss is logged
  3. Operation must retry from the beginning
  4. Partial writes are rolled back
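
The retry-from-the-beginning rule can be sketched as a wrapper like the hypothetical helper below (`withLock` and `errLockLost` are illustrative names, not MinIO's API):

```go
package main

import (
	"errors"
	"fmt"
)

// errLockLost stands in for the lock error an operation receives when quorum
// or the lease is lost mid-operation.
var errLockLost = errors.New("lock lost: quorum failure or lease expiry")

// withLock runs op under a lock and restarts the whole operation, lock
// acquisition included, whenever the lock is lost partway through. Partial
// work from a failed attempt is discarded before retrying.
func withLock(acquire func() bool, op func() error, maxAttempts int) error {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if !acquire() {
			continue // quorum not met: log and retry from scratch
		}
		err := op()
		if err == nil {
			return nil
		}
		if !errors.Is(err, errLockLost) {
			return err // genuine failure, not a lost lock
		}
		// lock lost: roll back partial writes (elided) and start over
	}
	return fmt.Errorf("gave up after %d attempts", maxAttempts)
}

func main() {
	calls := 0
	op := func() error {
		calls++
		if calls < 3 {
			return errLockLost // lose the lock twice, then succeed
		}
		return nil
	}
	fmt.Println(withLock(func() bool { return true }, op, 5), "after", calls, "attempts")
}
```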

Lock Server Manager

MinIO uses consistent hashing to distribute lock management across nodes.

Architecture

┌─────────────────────────────────────────────────────────┐
│                   Lock Server Manager                   │
├─────────────────────────────────────────────────────────┤
│                                                         │
│   Resource Key ──► Consistent Hash ──► Lock Server      │
│                                                         │
│   ┌─────────┐     ┌─────────┐     ┌─────────┐           │
│   │ Server 1│     │ Server 2│     │ Server 3│  ...      │
│   │ (Set 1) │     │ (Set 2) │     │ (Set 3) │           │
│   └─────────┘     └─────────┘     └─────────┘           │
│                                                         │
└─────────────────────────────────────────────────────────┘

Key Features

| Feature                      | Description                                        |
|------------------------------|----------------------------------------------------|
| Consistent Hash Distribution | Locks mapped to specific endpoints                 |
| Per-Set Lock Servers         | Each erasure set has dedicated lock management     |
| Handler Determination        | Lock manager identifies handler without acquiring lock |
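
The essential property is that every node computes the same resource-to-server mapping independently. A minimal hash-mod sketch of that idea (a simplified stand-in for MinIO's per-erasure-set scheme, not its actual code):

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// lockServerFor deterministically maps a resource key to one of n lock
// servers. Because the mapping is a pure function of the key, lock requests
// for the same object from any node always converge on the same server,
// with no coordination needed to pick it.
func lockServerFor(resource string, n int) int {
	return int(crc32.ChecksumIEEE([]byte(resource)) % uint32(n))
}

func main() {
	for _, key := range []string{"bucket/a.txt", "bucket/b.txt", "bucket/c.txt"} {
		fmt.Printf("%s -> lock server %d\n", key, lockServerFor(key, 4))
	}
}
```

A plain hash-mod mapping reshuffles most keys when n changes; true consistent hashing (as the table above names it) limits that reshuffling to roughly 1/n of the keys when a node is added or removed.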

Benefits

  • Load distribution: Locks spread across all nodes
  • Locality: Related locks handled by same server when possible
  • Scalability: Adding nodes redistributes lock responsibility

Timeout Configuration

All lock operations have carefully tuned timeouts.

| Operation         | Timeout    | Description                           |
|-------------------|------------|---------------------------------------|
| Lock Acquire[5]   | 1 second   | Initial lock request                  |
| Refresh Call[6]   | 5 seconds  | Lease renewal                         |
| Unlock Call[7]    | 30 seconds | Normal lock release                   |
| Force Unlock[8]   | 30 seconds | Administrative override               |
| Lease Duration[3] | 60 seconds | Maximum lock hold time without refresh |

Timeout Rationale

Lock Acquire (1s)
├── Fast fail for contended resources
└── Allows quick retry with backoff

Refresh Call (5s)
├── Generous for network hiccups
└── Still well under lease duration

Unlock/Force Unlock (30s)
├── Allows completion even under load
└── Prevents orphaned locks

Operational Considerations

Monitoring Lock Health

Key metrics to watch:

  • Lock acquire latency
  • Lock timeout frequency
  • Quorum failures
  • Lease refresh failures

Common Issues

| Symptom           | Likely Cause       | Resolution                 |
|-------------------|--------------------|----------------------------|
| High lock latency | Network congestion | Check inter-node bandwidth |
| Frequent timeouts | Node overload      | Scale out or reduce load   |
| Quorum failures   | Node failures      | Check cluster health       |
| Lock storms       | Hot keys           | Review access patterns     |

Best Practices

  1. Network: Ensure low-latency, reliable inter-node connectivity
  2. Time sync: Use NTP to keep node clocks synchronized
  3. Monitoring: Alert on lock timeout rates
  4. Capacity: Avoid overloading individual nodes

Source Code References
  1. internal/dsync/drwmutex.go:210-217 - Write lock quorum: quorum++ when quorum == tolerance (ensures N/2 + 1)
  2. internal/dsync/drwmutex.go:206,209 - Read lock quorum: len(restClnts) - tolerance where tolerance = len(restClnts) / 2
  3. internal/dsync/drwmutex.go:416 - Lease expiry: drwMutexRefreshInterval * 6 (60 seconds = 6 × 10s)
  4. internal/dsync/drwmutex.go:81 - drwMutexRefreshInterval = 10 * time.Second
  5. internal/dsync/drwmutex.go:69 - drwMutexAcquireTimeout = 1 * time.Second
  6. internal/dsync/drwmutex.go:72 - drwMutexRefreshCallTimeout = 5 * time.Second
  7. internal/dsync/drwmutex.go:75 - drwMutexUnlockCallTimeout = 30 * time.Second
  8. internal/dsync/drwmutex.go:78 - drwMutexForceUnlockCallTimeout = 30 * time.Second