How does MinIO AIStor coordinate operations across the cluster?

Asked and answered by muratkars · January 4, 2026

Understanding MinIO AIStor’s cluster coordination mechanisms is essential for operators managing distributed deployments and troubleshooting coordination-related issues.

Answer

MinIO AIStor coordinates cluster operations through a grid-based RPC layer paired with a dedicated lock manager and quorum-based distributed locking. The architecture separates general peer communication from lock traffic, so lock operations remain responsive even under heavy data I/O. Quorum-based locking provides consistency guarantees, while lease-based lock management prevents deadlocks when nodes fail.


RPC Communication

MinIO’s inter-node communication uses a grid-based RPC system.

Communication Architecture

┌─────────────────────────────────────────────────────────┐
│                       MinIO Node                        │
├─────────────────────────────────────────────────────────┤
│  ┌─────────────────────┐    ┌─────────────────────────┐ │
│  │  Main RPC Manager   │    │    Lock RPC Manager     │ │
│  │   (Grid Manager)    │    │     (Isolated Grid)     │ │
│  └──────────┬──────────┘    └──────────┬──────────────┘ │
│             │                          │                │
│             ▼                          ▼                │
│  ┌─────────────────────────────────────────────────┐    │
│  │          WebSocket Transport (TLS 1.3)          │    │
│  └─────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────┘
                            │
                            ▼
                ┌───────────────────────┐
                │      Peer Nodes       │
                └───────────────────────┘

RPC Components

| Component        | Purpose                    | Traffic Type                          |
|------------------|----------------------------|---------------------------------------|
| Main RPC Manager | General peer communication | Data operations, healing, replication |
| Lock RPC Manager | Dedicated lock traffic     | Lock acquire, refresh, release        |

Why Separate Lock Traffic?

  • Isolation: Lock operations don’t compete with data I/O
  • Responsiveness: Lock timeouts remain predictable under load
  • Reliability: Lock manager failures don’t cascade to data operations

Transport Details

| Parameter        | Value         | Description                           |
|------------------|---------------|---------------------------------------|
| Protocol         | WebSocket     | Persistent, bidirectional connections |
| Security         | TLS 1.3       | Encrypted inter-node communication    |
| Connection State | Grid state    | Used for online/offline detection     |
| Ping Interval    | 10-15 seconds | Health check frequency                |

Distributed Locking

MinIO implements quorum-based distributed locks for consistency.

Lock Types and Quorum

Total Nodes = N
Write Lock Quorum = N/2 + 1 (strict majority)
Read Lock Quorum = N - N/2 (complement)

| Lock Type     | Quorum Formula            | Purpose                        |
|---------------|---------------------------|--------------------------------|
| Write Lock[1] | len(nodes)/2 + 1          | Exclusive access for mutations |
| Read Lock[2]  | len(nodes) - len(nodes)/2 | Shared access for reads        |
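
In Go terms, the two quorum formulas referenced above (see the dsync source references at the end) come out to:

```go
package main

import "fmt"

// writeLockQuorum returns the strict majority required for an exclusive
// (write) lock: len(nodes)/2 + 1, per the dsync formula.
func writeLockQuorum(nodes int) int {
	return nodes/2 + 1
}

// readLockQuorum returns the complement used for shared (read) locks:
// len(nodes) - len(nodes)/2.
func readLockQuorum(nodes int) int {
	return nodes - nodes/2
}

func main() {
	for _, n := range []int{4, 8} {
		fmt.Printf("%d nodes: write quorum %d, read quorum %d\n",
			n, writeLockQuorum(n), readLockQuorum(n))
	}
}
```

Note that for odd N the two quorums overlap in at most one node, which is what guarantees a writer and a reader cannot both hold quorum simultaneously.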

Quorum Examples

4-node cluster:

Write Lock Quorum = 4/2 + 1 = 3 nodes
Read Lock Quorum = 4 - 4/2 = 2 nodes

8-node cluster:

Write Lock Quorum = 8/2 + 1 = 5 nodes
Read Lock Quorum = 8 - 8/2 = 4 nodes

Lock Flow

Lock Request
      │
      ▼
┌─────────────────────────────┐
│    Determine Lock Server    │ ← Consistent hash of resource
└─────────────────────────────┘
      │
      ▼
┌─────────────────────────────┐
│    Send to Quorum Nodes     │ ← Parallel requests
└─────────────────────────────┘
      │
      ▼
┌─────────────────────────────┐
│  Wait for Quorum Response   │ ← Timeout: 1 second
└─────────────────────────────┘
      │
      ├── Quorum Met → Lock Acquired
      └── Quorum Failed → Retry or Error

Lease Management

Locks use time-based leases to prevent deadlocks from failed nodes.

Lease Parameters

| Parameter           | Value      | Description                  |
|---------------------|------------|------------------------------|
| Lease Duration[3]   | 60 seconds | Absolute expiry time         |
| Refresh Interval[4] | 10 seconds | How often lock is renewed    |
| Expiry Multiplier   | 6×         | Lease = 6 × refresh interval |

Lease Lifecycle

Lock Acquired (T=0)
├── T=10s → Refresh ✓
├── T=20s → Refresh ✓
├── T=30s → Refresh ✓
├── T=40s → Refresh ✓
├── T=50s → Refresh ✓
└── T=60s → Lease expires if no refresh
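
The lifecycle above amounts to a ticker-driven refresh loop. The sketch below is illustrative (intervals scaled down from 10s/60s so it runs quickly); the helper name and shape are assumptions, not MinIO's API.

```go
package main

import (
	"fmt"
	"time"
)

// Scaled-down stand-ins for the documented 10s refresh / 6× expiry values.
const (
	refreshInterval = 10 * time.Millisecond
	leaseDuration   = 6 * refreshInterval
)

// holdLock renews the lease on every tick until refresh() fails or stop is
// closed. If refreshes stop for any reason, the lease simply expires on the
// lock servers after leaseDuration, so a crashed holder cannot deadlock peers.
func holdLock(refresh func() bool, stop <-chan struct{}) (refreshes int) {
	t := time.NewTicker(refreshInterval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			if !refresh() {
				return // lock lost: caller must restart the operation
			}
			refreshes++
		case <-stop:
			return // normal release path
		}
	}
}

func main() {
	stop := make(chan struct{})
	time.AfterFunc(leaseDuration, func() { close(stop) }) // hold for ~one lease
	fmt.Println("refreshes before release:", holdLock(func() bool { return true }, stop))
}
```

The key property is that liveness never depends on the holder: the servers expire the lease on their own clock.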

Failure Scenarios

| Scenario          | Behavior                                          |
|-------------------|---------------------------------------------------|
| Refresh succeeds  | Lease extended for another 60 seconds             |
| Refresh fails     | Lock holder retries until expiry                  |
| Node crashes      | Lease expires after 60 seconds, lock released     |
| Network partition | Quorum loss triggers error, operations fail-safe  |

Lock Loss Handling

When a lock is lost (quorum failure or lease expiry):

  1. Operation receives lock error
  2. Quorum loss is logged
  3. Operation must retry from the beginning
  4. Partial writes are rolled back
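
The retry-from-the-beginning rule can be sketched as a wrapper like the hypothetical helper below (`withLock` and `errLockLost` are illustrative names, not MinIO's API):

```go
package main

import (
	"errors"
	"fmt"
)

// errLockLost stands in for the lock error an operation receives when quorum
// or the lease is lost mid-operation.
var errLockLost = errors.New("lock lost: quorum failure or lease expiry")

// withLock runs op under a lock and restarts the whole operation, lock
// acquisition included, whenever the lock is lost partway through. Partial
// work from a failed attempt is discarded before retrying.
func withLock(acquire func() bool, op func() error, maxAttempts int) error {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if !acquire() {
			continue // quorum not met: log and retry from scratch
		}
		err := op()
		if err == nil {
			return nil
		}
		if !errors.Is(err, errLockLost) {
			return err // genuine failure, not a lost lock
		}
		// lock lost: roll back partial writes (elided) and start over
	}
	return fmt.Errorf("gave up after %d attempts", maxAttempts)
}

func main() {
	calls := 0
	op := func() error {
		calls++
		if calls < 3 {
			return errLockLost // lose the lock twice, then succeed
		}
		return nil
	}
	fmt.Println(withLock(func() bool { return true }, op, 5), "after", calls, "attempts")
}
```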

Lock Server Manager

MinIO uses consistent hashing to distribute lock management across nodes.

Architecture

┌─────────────────────────────────────────────────────────┐
│                   Lock Server Manager                   │
├─────────────────────────────────────────────────────────┤
│                                                         │
│   Resource Key ──► Consistent Hash ──► Lock Server      │
│                                                         │
│   ┌─────────┐     ┌─────────┐     ┌─────────┐           │
│   │ Server 1│     │ Server 2│     │ Server 3│  ...      │
│   │ (Set 1) │     │ (Set 2) │     │ (Set 3) │           │
│   └─────────┘     └─────────┘     └─────────┘           │
│                                                         │
└─────────────────────────────────────────────────────────┘

Key Features

| Feature                      | Description                                        |
|------------------------------|----------------------------------------------------|
| Consistent Hash Distribution | Locks mapped to specific endpoints                 |
| Per-Set Lock Servers         | Each erasure set has dedicated lock management     |
| Handler Determination        | Lock manager identifies handler without acquiring lock |
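
The essential property is that every node computes the same resource-to-server mapping independently. A minimal hash-mod sketch of that idea (a simplified stand-in for MinIO's per-erasure-set scheme, not its actual code):

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// lockServerFor deterministically maps a resource key to one of n lock
// servers. Because the mapping is a pure function of the key, lock requests
// for the same object from any node always converge on the same server,
// with no coordination needed to pick it.
func lockServerFor(resource string, n int) int {
	return int(crc32.ChecksumIEEE([]byte(resource)) % uint32(n))
}

func main() {
	for _, key := range []string{"bucket/a.txt", "bucket/b.txt", "bucket/c.txt"} {
		fmt.Printf("%s -> lock server %d\n", key, lockServerFor(key, 4))
	}
}
```

A plain hash-mod mapping reshuffles most keys when n changes; true consistent hashing (as the table above names it) limits that reshuffling to roughly 1/n of the keys when a node is added or removed.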

Benefits

  • Load distribution: Locks spread across all nodes
  • Locality: Related locks handled by same server when possible
  • Scalability: Adding nodes redistributes lock responsibility

Timeout Configuration

All lock operations have carefully tuned timeouts.

| Operation         | Timeout    | Description                           |
|-------------------|------------|---------------------------------------|
| Lock Acquire[5]   | 1 second   | Initial lock request                  |
| Refresh Call[6]   | 5 seconds  | Lease renewal                         |
| Unlock Call[7]    | 30 seconds | Normal lock release                   |
| Force Unlock[8]   | 30 seconds | Administrative override               |
| Lease Duration[3] | 60 seconds | Maximum lock hold time without refresh |

Timeout Rationale

Lock Acquire (1s)
├── Fast fail for contended resources
└── Allows quick retry with backoff

Refresh Call (5s)
├── Generous for network hiccups
└── Still well under lease duration

Unlock/Force Unlock (30s)
├── Allows completion even under load
└── Prevents orphaned locks

Operational Considerations

Monitoring Lock Health

Key metrics to watch:

  • Lock acquire latency
  • Lock timeout frequency
  • Quorum failures
  • Lease refresh failures

Common Issues

| Symptom           | Likely Cause       | Resolution                 |
|-------------------|--------------------|----------------------------|
| High lock latency | Network congestion | Check inter-node bandwidth |
| Frequent timeouts | Node overload      | Scale out or reduce load   |
| Quorum failures   | Node failures      | Check cluster health       |
| Lock storms       | Hot keys           | Review access patterns     |

Best Practices

  1. Network: Ensure low-latency, reliable inter-node connectivity
  2. Time sync: Use NTP to keep node clocks synchronized
  3. Monitoring: Alert on lock timeout rates
  4. Capacity: Avoid overloading individual nodes

Source Code References
  1. internal/dsync/drwmutex.go:210-217 - Write lock quorum: quorum++ when quorum == tolerance (ensures N/2 + 1)
  2. internal/dsync/drwmutex.go:206,209 - Read lock quorum: len(restClnts) - tolerance where tolerance = len(restClnts) / 2
  3. internal/dsync/drwmutex.go:416 - Lease expiry: drwMutexRefreshInterval * 6 (60 seconds = 6 × 10s)
  4. internal/dsync/drwmutex.go:81 - drwMutexRefreshInterval = 10 * time.Second
  5. internal/dsync/drwmutex.go:69 - drwMutexAcquireTimeout = 1 * time.Second
  6. internal/dsync/drwmutex.go:72 - drwMutexRefreshCallTimeout = 5 * time.Second
  7. internal/dsync/drwmutex.go:75 - drwMutexUnlockCallTimeout = 30 * time.Second
  8. internal/dsync/drwmutex.go:78 - drwMutexForceUnlockCallTimeout = 30 * time.Second