Understanding MinIO AIStor’s internal data I/O path is essential for architects and operators who need to know how the system achieves durability, consistency, and performance in distributed object storage.
Answer
MinIO AIStor implements a distributed, erasure-coded object storage system with strong consistency through quorum-based operations. The architecture ensures data durability through Reed-Solomon erasure coding while maintaining strict consistency guarantees via quorum validation on both read and write paths.
Object Write Flow
The write path ensures atomic, durable commits with erasure coding protection.
Write Sequence
```
Client Request
       │
       ▼
┌─────────────────────┐
│ Put Object Handler  │
└─────────────────────┘
       │
       ▼
┌─────────────────────┐
│   Namespace Lock    │ ← Lock on bucket/object
└─────────────────────┘
       │
       ▼
┌─────────────────────┐
│  Erasure Encoding   │ ← Reed-Solomon: data + parity shards
└─────────────────────┘
       │
       ▼
┌─────────────────────┐
│   Parallel Write    │ ← Workers write to all online disks
└─────────────────────┘
       │
       ▼
┌─────────────────────┐
│  Quorum Validation  │ ← Verify write quorum met
└─────────────────────┘
       │
       ▼
┌─────────────────────┐
│   Atomic Commit     │ ← Rename temp → final, write xl.meta
└─────────────────────┘
```
Write Steps
| Step | Operation | Description |
|---|---|---|
| 1 | Entry | Request arrives at Put Object handler |
| 2 | Lock Acquisition | Namespace lock obtained on bucket/object path |
| 3 | Erasure Encoding | Data split into dataBlocks shards with parityBlocks parity via Reed-Solomon |
| 4 | Parallel Write | Writer workers write encoded shards to all online disks concurrently |
| 5 | Quorum Validation | Verify writes succeeded on ≥ Write Quorum disks |
| 6 | Atomic Commit | Rename temp data to final location, write xl.meta metadata |
| 7 | Post-Write | Failed partial writes queued to MRF (Most Recent Failures) for healing |
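Steps 4 through 7 can be sketched as follows. This is a minimal, illustrative Go sketch, not the AIStor implementation: `writeShard`, `parallelWrite`, and the simulated offline disk are all hypothetical names, and the real system writes through per-disk worker pools with rollback of temp files.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// writeShard is a stand-in for writing one erasure-coded shard to one
// disk; here disk 6 is simulated as offline.
func writeShard(disk int, shard []byte) error {
	if disk == 6 {
		return errors.New("disk offline")
	}
	return nil
}

// parallelWrite fans shard writes out to all disks concurrently, then
// verifies that the success count meets the write quorum. Disks that
// failed after quorum was met would be queued to MRF for healing.
func parallelWrite(shards [][]byte, writeQuorum int) ([]int, error) {
	var mu sync.Mutex
	var wg sync.WaitGroup
	var failed []int
	for disk, shard := range shards {
		wg.Add(1)
		go func(disk int, shard []byte) {
			defer wg.Done()
			if err := writeShard(disk, shard); err != nil {
				mu.Lock()
				failed = append(failed, disk)
				mu.Unlock()
			}
		}(disk, shard)
	}
	wg.Wait()
	if len(shards)-len(failed) < writeQuorum {
		return failed, errors.New("write quorum not met: rolling back")
	}
	return failed, nil // failed disks go to the MRF heal queue
}

func main() {
	shards := make([][]byte, 8) // 4 data + 4 parity
	failed, err := parallelWrite(shards, 5)
	fmt.Println(failed, err) // quorum met despite one offline disk
}
```

With one of eight disks offline, seven writes succeed, which exceeds the quorum of five, so the commit proceeds and the missed disk is left for healing.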
Write Quorum Calculation[1]
Write Quorum = dataBlocks (or dataBlocks + 1 if dataBlocks == parityBlocks)

Example with EC:4 (4 data + 4 parity on 8 disks):
- Write Quorum = 4 + 1 = 5 disks must succeed
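The rule can be sketched in Go, mirroring the `defaultWQuorum()` logic cited in the Source Code References (the function body here is an illustrative reconstruction, not the exact AIStor source):

```go
package main

import "fmt"

// defaultWQuorum: the write quorum equals the number of data blocks,
// plus one when data and parity counts are equal, so that two
// conflicting half-writes can never both claim quorum.
func defaultWQuorum(dataBlocks, parityBlocks int) int {
	if dataBlocks == parityBlocks {
		return dataBlocks + 1
	}
	return dataBlocks
}

func main() {
	fmt.Println(defaultWQuorum(4, 4)) // EC:4 on 8 disks → 5
	fmt.Println(defaultWQuorum(6, 2)) // EC:2 on 8 disks → 6
}
```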
Object Read Flow
The read path prioritizes consistency and enables on-read healing for corrupted data.
Read Sequence
```
Client Request
       │
       ▼
┌─────────────────────┐
│ Get Object Handler  │
└─────────────────────┘
       │
       ▼
┌─────────────────────┐
│ Parallel Meta Read  │ ← Read xl.meta from all disks
└─────────────────────┘
       │
       ▼
┌─────────────────────┐
│  Quorum Selection   │ ← Determine latest valid metadata
└─────────────────────┘
       │
       ▼
┌─────────────────────┐
│  Erasure Decoding   │ ← Read from dataBlocks disks
└─────────────────────┘
       │
       ▼
┌─────────────────────┐
│  On-Read Healing    │ ← Queue corrupted shards for repair
└─────────────────────┘
       │
       ▼
┌─────────────────────┐
│    Return Data      │
└─────────────────────┘
```
Read Steps
| Step | Operation | Description |
|---|---|---|
| 1 | Entry | Request arrives at Get Object handler |
| 2 | Metadata Read | Read workers fetch xl.meta from all disks in parallel |
| 3 | Quorum Selection | Quorum algorithm determines the latest valid metadata version |
| 4 | Erasure Decoding | Parallelized read from dataBlocks disks to reconstruct object |
| 5 | On-Read Healing | Corrupted shards detected via bitrot checksums queued for repair |
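Step 3 above can be sketched as a simple agreement count. This is a deliberately simplified stand-in for AIStor's metadata quorum algorithm: `latestQuorumVersion` is a hypothetical name, and real xl.meta comparison considers more than a single version number.

```go
package main

import "fmt"

// latestQuorumVersion takes the object version (e.g. the mod-time in
// xl.meta) reported by each disk and returns the newest version that
// at least readQuorum disks agree on. Disks reporting 0 are treated
// as unreadable.
func latestQuorumVersion(versions []int64, readQuorum int) (int64, bool) {
	counts := map[int64]int{}
	for _, v := range versions {
		if v != 0 {
			counts[v]++
		}
	}
	var best int64
	found := false
	for v, n := range counts {
		if n >= readQuorum && (!found || v > best) {
			best, found = v, true
		}
	}
	return best, found
}

func main() {
	// 8 disks; one lags with stale version 99, one is unreadable (0).
	versions := []int64{100, 100, 100, 100, 99, 100, 0, 100}
	v, ok := latestQuorumVersion(versions, 4)
	fmt.Println(v, ok) // the latest version with quorum agreement
}
```

Because six disks agree on version 100 and the read quorum is four, the stale and unreadable disks are simply outvoted; their shards are then candidates for on-read healing.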
Read Quorum Calculation[2]
Read Quorum = totalDisks - parityBlocks

Example with EC:4 (8 total disks, 4 parity):
- Read Quorum = 8 - 4 = 4 disks must be available
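As with the write side, this mirrors the `defaultRQuorum()` logic cited in the Source Code References (illustrative reconstruction, not the exact AIStor source):

```go
package main

import "fmt"

// defaultRQuorum: reads need just enough disks to reconstruct the
// object, i.e. totalDisks - parityBlocks (= dataBlocks). Equivalently,
// reads tolerate up to parityBlocks disk failures.
func defaultRQuorum(totalDisks, parityBlocks int) int {
	return totalDisks - parityBlocks
}

func main() {
	fmt.Println(defaultRQuorum(8, 4))  // EC:4 on 8 disks → 4
	fmt.Println(defaultRQuorum(16, 4)) // EC:4 on 16 disks → 12
}
```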
Consistency Guarantees
MinIO provides strong consistency through quorum-based operations.
| Guarantee | Behavior |
|---|---|
| Write Consistency | Succeeds only when data + metadata committed to ≥ Write Quorum disks |
| Read Consistency | Requires ≥ Read Quorum disks available with valid data |
| Atomicity | Partial writes never visible to readers; failures trigger full rollback |
| Durability | Data survives up to parityBlocks disk failures |
Consistency Model
Strong Consistency:
- Read-after-write: Guaranteed (same or different client)
- List-after-write: Guaranteed
- No stale reads: Quorum ensures latest committed version
Failure Handling:
- Write failure → Full rollback, no partial data visible
- Read with degraded disks → Reconstruct from available shards
- Bitrot detection → On-read healing queues repairs

Key Components
xl.meta
The metadata file stored alongside each object containing:
- Object version information
- Erasure coding parameters
- Checksum data for bitrot detection
- Part information for multipart uploads
MRF (Most Recent Failures)
A queue system that:
- Tracks partial operations that achieved quorum but didn’t write to all disks
- Tracks detected corruptions for background healing
- Ensures eventual consistency for degraded operations
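The MRF pattern can be sketched as a producer/consumer queue. This is a minimal sketch under stated assumptions: `healEntry` and the channel-based queue are hypothetical, and the real MRF persists entries and rate-limits healing.

```go
package main

import "fmt"

// healEntry records one partial operation for background healing.
type healEntry struct {
	bucket, object string
	failedDisks    []int
}

func main() {
	queue := make(chan healEntry, 16)

	// Producer: a write met quorum, but disk 6 was offline at the time.
	queue <- healEntry{bucket: "photos", object: "cat.jpg", failedDisks: []int{6}}
	close(queue)

	// Consumer: a background healer drains the queue and re-creates the
	// missing shards from the surviving ones.
	for e := range queue {
		fmt.Printf("healing %s/%s on disks %v\n", e.bucket, e.object, e.failedDisks)
	}
}
```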
Namespace Locking
Distributed locking mechanism that:
- Prevents concurrent writes to same object
- Ensures serializable operations
- Coordinates across all nodes in the erasure set
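The single-node core of this mechanism can be sketched as a per-path mutex map. This is a hypothetical simplification: the real system extends the idea into a distributed lock coordinated across all nodes in the erasure set.

```go
package main

import (
	"fmt"
	"sync"
)

// nsLocker serializes writers on a bucket/object path with one mutex
// per path, created lazily.
type nsLocker struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newNSLocker() *nsLocker {
	return &nsLocker{locks: make(map[string]*sync.Mutex)}
}

// lock blocks until the caller holds the lock for path, returning the
// mutex so the caller can Unlock it when done.
func (n *nsLocker) lock(path string) *sync.Mutex {
	n.mu.Lock()
	l, ok := n.locks[path]
	if !ok {
		l = &sync.Mutex{}
		n.locks[path] = l
	}
	n.mu.Unlock()
	l.Lock()
	return l
}

func main() {
	locker := newNSLocker()
	var count int
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l := locker.lock("bucket/object") // concurrent writers serialize here
			count++
			l.Unlock()
		}()
	}
	wg.Wait()
	fmt.Println(count) // all 100 increments survive: no lost updates
}
```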
Performance Characteristics
| Operation | Parallelism | Limiting Factor |
|---|---|---|
| Write | All disks written concurrently | Slowest disk in quorum |
| Read | dataBlocks disks read concurrently | Reconstruction overhead if degraded |
| Metadata | All disks queried in parallel | Quorum response time |
Optimization Tips
- Balanced erasure sets: Ensure similar disk performance within each set
- Network bandwidth: Size network for parallel shard transfers
- Disk health: Monitor for slow disks that impact quorum operations
Source Code References
- cmd/erasure.go:71-77, defaultWQuorum(): Write quorum = dataCount (or dataCount + 1 if dataCount == parityCount)
- cmd/erasure.go:80-82, defaultRQuorum(): Read quorum = setDriveCount - defaultParityCount