Understanding MinIO AIStor’s erasure coding implementation helps architects design resilient storage systems and operators troubleshoot data protection issues. This answer is based on the implementation by Klaus Post, the engineer who wrote MinIO’s Reed-Solomon erasure coding.
Answer
MinIO uses Klaus Post’s Reed-Solomon implementation with configurable data/parity ratios and deterministic shard placement. The system encodes data into multiple shards distributed across disks, enabling reconstruction even when multiple disks fail. Quorum-based operations ensure consistency while maximizing availability.
Encoding Logic
How Data Is Encoded
```
Original Object (e.g., 2 MB)
                │
                ▼
┌─────────────────────────────────┐
│  Split into dataBlocks shards   │
└─────────────────────────────────┘
                │
                ▼
┌─────────────────────────────────┐
│   Reed-Solomon over GF(2^8)     │  ← Galois Field 256
│  Compute parityBlocks shards    │
└─────────────────────────────────┘
                │
                ▼
┌─────────────────────────────────┐
│    Write all shards to disks    │
└─────────────────────────────────┘
```
Encoding Parameters
| Parameter | Description | Typical Value |
|---|---|---|
| dataBlocks | Number of data shards | Varies by EC config |
| parityBlocks | Number of parity shards | Varies by EC config |
| blockSize[1] | Size of data block before splitting | 1 MiB default |
| shardSize | Size of each shard | ceil(blockSize / dataBlocks) |
Shard Size Calculation
shardSize = ceil(blockSize / dataBlocks)
Example: 1 MiB block with 8 data blocks

```
shardSize = ceil(1 MiB / 8) = 128 KiB per shard
```

Typical shard sizes range around 256 KiB for smaller objects.
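The ceiling division above can be sketched directly; `shardSize` here is an illustrative helper, not MinIO's internal function:

```go
package main

import "fmt"

// shardSize computes the per-disk shard size for one erasure-coded
// block: ceil(blockSize / dataBlocks), done in integer arithmetic.
func shardSize(blockSize, dataBlocks int) int {
	return (blockSize + dataBlocks - 1) / dataBlocks
}

func main() {
	const blockSize = 1 << 20 // 1 MiB default block size
	fmt.Println(shardSize(blockSize, 8)) // prints 131072 (128 KiB)
}
```

The `+ dataBlocks - 1` term rounds up, so a block that does not divide evenly still fits in `dataBlocks` shards.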
Library
MinIO uses the high-performance Reed-Solomon library:
github.com/klauspost/reedsolomon

Key features:
- SIMD-optimized (AVX2, AVX512, NEON)
- GF(2^8) - Galois Field with 256 elements
- Concurrent encoding/decoding
Parity Block Placement
MinIO uses deterministic algorithms to distribute shards across disks.
Placement Algorithms
| Algorithm | Description | Use Case |
|---|---|---|
| CRCMOD[2] | Legacy CRC32-based rotation | Older deployments |
| SIPMOD[2] | SipHash-2-4 with deployment ID | Current default |
| SIPMOD+PARITY[2] | Parity-aware distribution | Optimized placement |
Deterministic Ordering
```
Object Key + Deployment ID
                │
                ▼
┌─────────────────────────────────┐
│  Hash Function (SipHash-2-4)    │
└─────────────────────────────────┘
                │
                ▼
┌─────────────────────────────────┐
│  CRC32 Rotation for Ordering    │
└─────────────────────────────────┘
                │
                ▼
┌─────────────────────────────────┐
│       Map to Erasure Set        │
└─────────────────────────────────┘
                │
                ▼
  Disk 1   Disk 2   Disk 3  ...  Disk N
   [D1]     [D2]     [P1]   ...   [PN]
```
Why Deterministic Placement?
- Consistent reads: Any node can locate shards without coordination
- Balanced distribution: Objects spread evenly across disks
- Predictable healing: System knows where shards should be
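The CRC-based rotation can be sketched with the standard library; `hashOrder` is an illustrative approximation of the legacy CRCMOD-style scheme, not MinIO's exact code (the current SIPMOD default hashes with SipHash-2-4 and the deployment ID instead):

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// hashOrder returns a deterministic rotation of disk indices
// [1..cardinality] derived from a CRC32 of the object key. Every node
// computing this for the same key gets the same ordering, so shards can
// be located without coordination.
func hashOrder(key string, cardinality int) []int {
	keyCrc := crc32.Checksum([]byte(key), crc32.IEEETable)
	start := int(keyCrc % uint32(cardinality))
	order := make([]int, cardinality)
	for i := range order {
		order[i] = 1 + (start+i)%cardinality
	}
	return order
}

func main() {
	// Same key always yields the same permutation of the 8 disks.
	fmt.Println(hashOrder("bucket/object.bin", 8))
}
```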
Quorum Mathematics
Definitions
```
N = Total Disks in Erasure Set
P = Parity Blocks
D = Data Blocks = N - P
```
Quorum Formulas[3]
| Quorum Type | Formula | Purpose |
|---|---|---|
| Write Quorum | D (or D + 1 if D == P) | Minimum disks for successful write |
| Read Quorum | D = N - P | Minimum disks for reconstruction |
Why +1 When D == P?
When data and parity blocks are equal (e.g., EC:4 on 8 disks), requiring exactly D disks creates ambiguity about which version is authoritative. Adding +1 ensures a clear majority.
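These formulas can be sketched as two small helpers; illustrative only, the real versions are MinIO's `defaultWQuorum()` and `defaultRQuorum()`:

```go
package main

import "fmt"

// readQuorum: reconstruction needs exactly the data-block count.
func readQuorum(n, p int) int { return n - p }

// writeQuorum: D disks, plus one more when D == P to avoid
// split-brain ambiguity about which version is authoritative.
func writeQuorum(n, p int) int {
	d := n - p
	if d == p {
		return d + 1
	}
	return d
}

func main() {
	fmt.Println(writeQuorum(12, 4), readQuorum(12, 4)) // prints: 8 8
	fmt.Println(writeQuorum(8, 4), readQuorum(8, 4))   // prints: 5 4
}
```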
Example Calculations
12 disks with 4 parity (EC:4):
```
N = 12, P = 4
D = 12 - 4 = 8

Write Quorum = 8 (D ≠ P, so just D)
Read Quorum  = 8

Fault Tolerance: Can lose 4 disks and still operate
Storage Efficiency: 8/12 = 66.7% usable
```

8 disks with 4 parity (EC:4):

```
N = 8, P = 4
D = 8 - 4 = 4

Write Quorum = 5 (D == P, so D + 1)
Read Quorum  = 4

Fault Tolerance: Can lose 4 disks for reads, 3 for writes
Storage Efficiency: 4/8 = 50% usable
```

16 disks with 4 parity (EC:4):

```
N = 16, P = 4
D = 16 - 4 = 12

Write Quorum = 12
Read Quorum  = 12

Fault Tolerance: Can lose 4 disks
Storage Efficiency: 12/16 = 75% usable
```

Failure Handling
Error Categories
| Category | Description | Resolution |
|---|---|---|
| Bitrot | Silent data corruption detected by checksum | On-read healing queues repair |
| Disk Not Found | Disk offline or failed | Reconstruct from available shards |
| Missing Parts | Partial object data | Heal from parity shards |
| Checksum Mismatch | Data integrity failure | Reconstruct and replace shard |
Reconstruction Process
```
Available Shards < Total Shards?
  │
  ├── No → Direct read (no reconstruction needed)
  │
  └── Yes → Available ≥ Read Quorum?
              │
              ├── Yes → Reed-Solomon reconstruction
              │         └── Rebuild missing shards
              │
              └── No → Read fails (insufficient data)
```
Reconstruction Requirements
- Minimum shards needed: dataBlocks (the read quorum)
- Can reconstruct from: any combination of surviving data + parity shards, as long as at least dataBlocks of them are intact
- Performance impact: reconstruction requires more CPU and I/O than a direct read
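The decision tree above can be expressed as a small function; `readOutcome` is an illustrative sketch, not MinIO's error-handling code:

```go
package main

import "fmt"

// readOutcome decides how a read proceeds given how many shards are
// reachable. The read quorum equals dataBlocks.
func readOutcome(available, total, dataBlocks int) string {
	switch {
	case available == total:
		return "direct read"
	case available >= dataBlocks:
		return "reed-solomon reconstruction"
	default:
		return "read fails: insufficient shards"
	}
}

func main() {
	// A 12-disk set with 8 data blocks (EC:4).
	fmt.Println(readOutcome(12, 12, 8)) // prints: direct read
	fmt.Println(readOutcome(9, 12, 8))  // prints: reed-solomon reconstruction
	fmt.Println(readOutcome(7, 12, 8))  // prints: read fails: insufficient shards
}
```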
Self-Test Verification[4]
MinIO runs comprehensive erasure coding verification at startup.
What Gets Tested
```
Startup Self-Test:
├── Test all (data, parity) configurations
│   ├── Data blocks: 2 to 14
│   └── Parity blocks: 1 to 8
├── Encode test data
├── Corrupt random shards
├── Verify reconstruction
└── Fatal error on any mismatch
```
Why Self-Test?
| Purpose | Benefit |
|---|---|
| Hardware validation | Detects CPU/memory issues affecting encoding |
| Library verification | Ensures Reed-Solomon implementation is correct |
| Silent corruption prevention | Fatal error prevents serving corrupt data |
Test Coverage
Configurations tested at startup:
- EC:2 through EC:8 parity levels
- Data blocks from 2 to 14
- All combinations within valid ranges

If ANY test fails → Server refuses to start

This aggressive testing ensures MinIO never operates with a faulty erasure coding implementation that could silently corrupt data.
Storage Efficiency Reference
| Erasure Set Size | Parity | Data Blocks | Efficiency | Fault Tolerance |
|---|---|---|---|---|
| 8 disks | 2 | 6 | 75% | 2 disk failures |
| 8 disks | 4 | 4 | 50% | 4 disk failures |
| 12 disks | 4 | 8 | 66.7% | 4 disk failures |
| 16 disks | 4 | 12 | 75% | 4 disk failures |
| 16 disks | 8 | 8 | 50% | 8 disk failures |
Source Code References
[1] cmd/object-api-common.go:31 - blockSizeV2 = 1 * humanize.MiByte (default block size)
[2] cmd/format-erasure.go:47-53 - distribution algorithms: CRCMOD, SIPMOD, SIPMOD+PARITY
[3] cmd/erasure.go:71-82 - defaultWQuorum() and defaultRQuorum() implementations
[4] cmd/erasure-coding.go:137-194 - erasureSelfTest() verifies all data/parity configurations at startup