Understanding MinIO AIStor’s cluster coordination mechanisms is essential for operators managing distributed deployments and troubleshooting coordination-related issues.
Answer
MinIO uses grid-based RPC with a dedicated lock manager and quorum-based distributed locking. The architecture separates general peer communication from lock traffic, ensuring that lock operations remain responsive even under heavy data I/O. Quorum-based locking provides consistency guarantees while lease-based management prevents deadlocks.
RPC Communication
MinIO’s inter-node communication uses a grid-based RPC system.
Communication Architecture
```
┌─────────────────────────────────────────────────────────┐
│                       MinIO Node                        │
├─────────────────────────────────────────────────────────┤
│  ┌─────────────────────┐    ┌─────────────────────────┐ │
│  │  Main RPC Manager   │    │   Lock RPC Manager      │ │
│  │   (Grid Manager)    │    │   (Isolated Grid)       │ │
│  └──────────┬──────────┘    └──────────┬──────────────┘ │
│             │                          │                │
│             ▼                          ▼                │
│  ┌─────────────────────────────────────────────────┐    │
│  │        WebSocket Transport (TLS 1.3)            │    │
│  └─────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
              ┌───────────────────────┐
              │      Peer Nodes       │
              └───────────────────────┘
```

RPC Components
| Component | Purpose | Traffic Type |
|---|---|---|
| Main RPC Manager | General peer communication | Data operations, healing, replication |
| Lock RPC Manager | Dedicated lock traffic | Lock acquire, refresh, release |
Why Separate Lock Traffic?
- Isolation: Lock operations don’t compete with data I/O
- Responsiveness: Lock timeouts remain predictable under load
- Reliability: Lock manager failures don’t cascade to data operations
Transport Details
| Parameter | Value | Description |
|---|---|---|
| Protocol | WebSocket | Persistent, bidirectional connections |
| Security | TLS 1.3 | Encrypted inter-node communication |
| Connection State | Grid state | Used for online/offline detection |
| Ping Interval | 10-15 seconds | Health check frequency |
Distributed Locking
MinIO implements quorum-based distributed locks for consistency.
Lock Types and Quorum
```
Total Nodes = N
Write Lock Quorum = N/2 + 1   (strict majority)
Read Lock Quorum  = N - N/2   (complement)
```

| Lock Type | Quorum Formula | Purpose |
|---|---|---|
| Write Lock[1] | len(nodes)/2 + 1 | Exclusive access for mutations |
| Read Lock[2] | len(nodes) - len(nodes)/2 | Shared access for reads |
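The two formulas can be sketched directly; the helper names below are illustrative, not MinIO's actual functions (see the Source Code References for where the real computation lives):

```go
package main

import "fmt"

// writeQuorum returns the strict-majority quorum for write locks:
// len(nodes)/2 + 1 (integer division).
func writeQuorum(nodes int) int {
	return nodes/2 + 1
}

// readQuorum returns the complement quorum for read locks:
// len(nodes) - len(nodes)/2.
func readQuorum(nodes int) int {
	return nodes - nodes/2
}

func main() {
	for _, n := range []int{4, 8} {
		fmt.Printf("%d nodes: write quorum=%d, read quorum=%d\n",
			n, writeQuorum(n), readQuorum(n))
	}
	// → 4 nodes: write quorum=3, read quorum=2
	// → 8 nodes: write quorum=5, read quorum=4
}
```

Note that for even N the two quorums can overlap (e.g. 3 + 2 > 4), which is what guarantees a write lock always conflicts with outstanding read locks.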
Quorum Examples
4-node cluster:
```
Write Lock Quorum = 4/2 + 1 = 3 nodes
Read Lock Quorum  = 4 - 4/2 = 2 nodes
```

8-node cluster:

```
Write Lock Quorum = 8/2 + 1 = 5 nodes
Read Lock Quorum  = 8 - 8/2 = 4 nodes
```

Lock Flow
```
Lock Request
      │
      ▼
┌─────────────────────────────┐
│   Determine Lock Server     │ ← Consistent hash of resource
└─────────────────────────────┘
      │
      ▼
┌─────────────────────────────┐
│   Send to Quorum Nodes      │ ← Parallel requests
└─────────────────────────────┘
      │
      ▼
┌─────────────────────────────┐
│  Wait for Quorum Response   │ ← Timeout: 1 second
└─────────────────────────────┘
      │
      ├── Quorum Met    → Lock Acquired
      │
      └── Quorum Failed → Retry or Error
```

Lease Management
Locks use time-based leases to prevent deadlocks from failed nodes.
Lease Parameters
| Parameter | Value | Description |
|---|---|---|
| Lease Duration[3] | 60 seconds | Absolute expiry time |
| Refresh Interval[4] | 10 seconds | How often lock is renewed |
| Expiry Multiplier | 6× | Lease = 6 × refresh interval |
Lease Lifecycle
```
Lock Acquired (T=0)
      │
      ├── T=10s → Refresh ✓
      ├── T=20s → Refresh ✓
      ├── T=30s → Refresh ✓
      ├── T=40s → Refresh ✓
      ├── T=50s → Refresh ✓
      │
      └── T=60s → Lease expires if no refresh
```

Failure Scenarios
| Scenario | Behavior |
|---|---|
| Refresh succeeds | Lease extended for another 60 seconds |
| Refresh fails | Lock holder retries until expiry |
| Node crashes | Lease expires after 60 seconds, lock released |
| Network partition | Quorum loss triggers error, operations fail-safe |
Lock Loss Handling
When a lock is lost (quorum failure or lease expiry):
- Operation receives lock error
- Quorum loss is logged
- Operation must retry from the beginning
- Partial writes are rolled back
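A caller-side pattern for the retry step above can be sketched as follows. Everything here is hypothetical scaffolding (`errLockLost`, `withLockRetry`), shown only to illustrate "retry from the beginning with backoff"; the operation is assumed to roll back its own partial writes before reporting the lost lock.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errLockLost is a hypothetical sentinel for quorum failure or lease expiry.
var errLockLost = errors.New("lock lost: quorum failure or lease expiry")

// withLockRetry restarts op from the beginning whenever the lock is lost,
// doubling the backoff between attempts.
func withLockRetry(op func() error, maxAttempts int) error {
	backoff := 50 * time.Millisecond
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := op()
		if !errors.Is(err, errLockLost) {
			return err // success, or an unrelated error
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, errLockLost)
}

func main() {
	// Simulate an operation that loses its lock twice, then succeeds.
	fails := 2
	err := withLockRetry(func() error {
		if fails > 0 {
			fails--
			return errLockLost
		}
		return nil
	}, 5)
	fmt.Println("result:", err) // → result: <nil>
}
```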
Lock Server Manager
MinIO uses consistent hashing to distribute lock management across nodes.
Architecture
```
┌─────────────────────────────────────────────────────────┐
│                  Lock Server Manager                    │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Resource Key ──► Consistent Hash ──► Lock Server       │
│                                                         │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐              │
│  │ Server 1│    │ Server 2│    │ Server 3│   ...        │
│  │ (Set 1) │    │ (Set 2) │    │ (Set 3) │              │
│  └─────────┘    └─────────┘    └─────────┘              │
│                                                         │
└─────────────────────────────────────────────────────────┘
```

Key Features
| Feature | Description |
|---|---|
| Consistent Hash Distribution | Locks mapped to specific endpoints |
| Per-Set Lock Servers | Each erasure set has dedicated lock management |
| Handler Determination | Lock manager identifies handler without acquiring lock |
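The "handler determination" idea can be illustrated with a toy hash-based mapping. MinIO's real scheme (per erasure set) differs, and `lockServerFor` is a made-up name; the point is that the resource key alone determines the handler, so any node can compute it locally without acquiring the lock.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// lockServerFor maps a resource key to one of n lock servers.
// The mapping is deterministic: every node computes the same answer
// for the same key, with no coordination required.
func lockServerFor(resource string, n int) int {
	return int(crc32.ChecksumIEEE([]byte(resource))) % n
}

func main() {
	for _, key := range []string{"bucket/object-a", "bucket/object-b"} {
		fmt.Printf("%s -> lock server %d\n", key, lockServerFor(key, 4))
	}
}
```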
Benefits
- Load distribution: Locks spread across all nodes
- Locality: Related locks handled by same server when possible
- Scalability: Adding nodes redistributes lock responsibility
Timeout Configuration
All lock operations have carefully tuned timeouts.
| Operation | Timeout | Description |
|---|---|---|
| Lock Acquire[5] | 1 second | Initial lock request |
| Refresh Call[6] | 5 seconds | Lease renewal |
| Unlock Call[7] | 30 seconds | Normal lock release |
| Force Unlock[8] | 30 seconds | Administrative override |
| Lease Duration[3] | 60 seconds | Maximum lock hold time without refresh |
Timeout Rationale
```
Lock Acquire (1s)
└── Fast fail for contended resources
    └── Allows quick retry with backoff

Refresh Call (5s)
└── Generous for network hiccups
    └── Still well under lease duration

Unlock/Force Unlock (30s)
└── Allows completion even under load
    └── Prevents orphaned locks
```

Operational Considerations
Monitoring Lock Health
Key metrics to watch:
- Lock acquire latency
- Lock timeout frequency
- Quorum failures
- Lease refresh failures
Common Issues
| Symptom | Likely Cause | Resolution |
|---|---|---|
| High lock latency | Network congestion | Check inter-node bandwidth |
| Frequent timeouts | Node overload | Scale out or reduce load |
| Quorum failures | Node failures | Check cluster health |
| Lock storms | Hot keys | Review access patterns |
Best Practices
- Network: Ensure low-latency, reliable inter-node connectivity
- Time sync: Use NTP to keep node clocks synchronized
- Monitoring: Alert on lock timeout rates
- Capacity: Avoid overloading individual nodes
Source Code References
- `internal/dsync/drwmutex.go:210-217` - Write lock quorum: `quorum++` when `quorum == tolerance` (ensures N/2 + 1)
- `internal/dsync/drwmutex.go:206,209` - Read lock quorum: `len(restClnts) - tolerance` where `tolerance = len(restClnts) / 2`
- `internal/dsync/drwmutex.go:416` - Lease expiry: `drwMutexRefreshInterval * 6` (60 seconds = 6 × 10s)
- `internal/dsync/drwmutex.go:81` - `drwMutexRefreshInterval = 10 * time.Second`
- `internal/dsync/drwmutex.go:69` - `drwMutexAcquireTimeout = 1 * time.Second`
- `internal/dsync/drwmutex.go:72` - `drwMutexRefreshCallTimeout = 5 * time.Second`
- `internal/dsync/drwmutex.go:75` - `drwMutexUnlockCallTimeout = 30 * time.Second`
- `internal/dsync/drwmutex.go:78` - `drwMutexForceUnlockCallTimeout = 30 * time.Second`