Understanding MinIO AIStor’s metadata handling is essential for optimizing performance and troubleshooting consistency issues in distributed deployments.
Answer
MinIO stores object metadata in a compact binary format (XL format v2) with two-tier caching and quorum-based consistency. The metadata system is designed for durability through atomic writes, performance through tiered caching, and efficiency by inlining small objects directly within metadata files.
Persistence Model
MinIO stores object metadata in xl.meta files using a compact binary format.
XL Format Structure
```
xl.meta File
├── Header
│   ├── Magic: "XL2 " (4 bytes)
│   └── Version flags
└── Metadata (msgpack encoded)
    ├── Version info
    ├── Erasure coding params
    ├── Part information
    ├── Checksums
    ├── User metadata
    └── [Inline data for small objects]
```
Format Details
| Component | Description | Value |
|---|---|---|
| Format | Binary msgpack with header | Compact, fast parsing |
| Magic Header[1] | "XL2 " (4 bytes) | Format identification |
| Major Version[2] | 1 | Breaking changes |
| Minor Version[2] | 4 | Compatible additions |
| Encoding | MessagePack | Efficient binary serialization |
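A minimal sketch of validating this header, assuming the magic and version values quoted in the table. The helper name and the exact byte layout of the version fields are illustrative, not MinIO's actual code:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

// parseXLHeader is a hypothetical sketch: a 4-byte magic "XL2 "
// followed by major and minor version fields, serialized here as
// little-endian uint32s for illustration (the real on-disk encoding
// of the version fields may differ).
func parseXLHeader(b []byte) (major, minor uint32, err error) {
	if len(b) < 12 {
		return 0, 0, errors.New("header too short")
	}
	if !bytes.Equal(b[:4], []byte("XL2 ")) {
		return 0, 0, errors.New("bad magic: not an xl.meta file")
	}
	major = binary.LittleEndian.Uint32(b[4:8])
	minor = binary.LittleEndian.Uint32(b[8:12])
	if major != 1 {
		return 0, 0, fmt.Errorf("unsupported major version %d", major)
	}
	return major, minor, nil
}

func main() {
	hdr := append([]byte("XL2 "), 1, 0, 0, 0, 4, 0, 0, 0)
	major, minor, err := parseXLHeader(hdr)
	fmt.Println(major, minor, err) // 1 4 <nil>
}
```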
Version History
```
XL Format Evolution:
├── v1.0 - Initial format
├── v1.1 - Added inline data support
├── v1.2 - Extended metadata fields
├── v1.3 - Checksum improvements
└── v1.4 - Current version (optimizations)
```
Inline Data
Small objects are stored directly within the xl.meta file for efficiency.
How Inline Data Works
```
Object Size Check
├── Size ≤ Threshold ──► Store in xl.meta (inline)
│                        └── No separate part files
└── Size > Threshold ──► Store as separate parts
                         └── xl.meta contains references
```
Benefits of Inline Data
| Benefit | Description |
|---|---|
| Reduced I/O | Single file read for small objects |
| Better performance | No additional disk seeks |
| Atomic operations | Data and metadata in one file |
| Space efficiency | Avoids small file overhead |
Inline Threshold
Small objects below the inline threshold are embedded directly in metadata, eliminating separate part files and reducing I/O operations.
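The size check can be sketched as follows; the 128 KiB threshold here is an assumed value for illustration only (the real cutoff depends on the MinIO release and erasure configuration):

```go
package main

import "fmt"

// inlineThreshold is an assumed value for illustration; the actual
// cutoff depends on the MinIO release and erasure-set configuration.
const inlineThreshold = 128 * 1024

// storeInline sketches the decision above: objects at or below the
// threshold are embedded in xl.meta, larger ones get separate part files.
func storeInline(objectSize int64) bool {
	return objectSize <= inlineThreshold
}

func main() {
	fmt.Println(storeInline(4 * 1024))         // true: embedded in xl.meta
	fmt.Println(storeInline(16 * 1024 * 1024)) // false: separate part files
}
```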
Atomic Write Process
MinIO ensures metadata durability through atomic write operations.
Write Sequence
```
Step 1: Write new metadata to temp file (UUID in tmp bucket)
   │
   ▼
Step 2: Sync to disk (fsync)
   │
   ▼
Step 3: Atomic rename temp file to xl.meta
```

MinIO relies on POSIX atomic rename semantics for crash safety. If the rename completes, the new metadata is durable; if it fails, the original xl.meta remains intact.
Crash Recovery
```
On Startup:
├── xl.meta exists, no temp files
│   └── Normal state, no recovery needed
└── Orphaned temp files in tmp bucket
    └── Cleaned up automatically
```
Recovery Files
| File | Purpose | Lifetime |
|---|---|---|
| xl.meta | Active metadata | Permanent |
| temp file | New metadata being written | Transient (auto-cleaned) |
Note: The .bkp file is only used in specific rollback scenarios (e.g., undo operations), not in normal writes.
Caching Architecture
MinIO implements a two-tier caching system for metadata performance.
Cache Tiers
```
Metadata Caching

Request ──► Tier 1: BigCache (per-drive)
            ├── Hit ──► Return cached metadata
            └── Miss ──► Tier 2: Metacache
                         ├── Hit ──► Return
                         └── Miss ──► Read xl.meta from disk
```
Tier Details
| Tier | Implementation | Scope | Purpose |
|---|---|---|---|
| Tier 1[3] | BigCache | Per-drive | Hot metadata for individual objects |
| Tier 2 | Metacache | Distributed | Listing cache across cluster |
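The lookup flow across the two tiers can be sketched with plain maps standing in for the real caches. This only shows the control flow: BigCache is an off-heap LRU cache and the metacache is distributed, not simple in-process maps:

```go
package main

import "fmt"

// tieredLookup sketches the flow above with two maps standing in for
// BigCache (tier 1) and the metacache (tier 2).
type tieredLookup struct {
	tier1, tier2 map[string][]byte
	diskReads    int
}

func (t *tieredLookup) get(key string, readDisk func(string) []byte) []byte {
	if v, ok := t.tier1[key]; ok {
		return v // tier-1 hit: no disk I/O
	}
	if v, ok := t.tier2[key]; ok {
		t.tier1[key] = v // promote to tier 1
		return v
	}
	v := readDisk(key) // double miss: fall through to xl.meta on disk
	t.diskReads++
	t.tier1[key], t.tier2[key] = v, v
	return v
}

func main() {
	c := &tieredLookup{tier1: map[string][]byte{}, tier2: map[string][]byte{}}
	disk := func(k string) []byte { return []byte("meta:" + k) }
	c.get("bucket/obj/v1", disk) // miss: reads disk, fills both tiers
	c.get("bucket/obj/v1", disk) // hit: served from tier 1
	fmt.Println(c.diskReads)     // 1
}
```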
BigCache (Tier 1)
```
BigCache (per-drive metadata cache)
├── Key: bucket/object/version
├── Value: Parsed xl.meta contents
├── Eviction: LRU-based
├── Invalidation: On write/delete
└── Benefits:
    ├── Reduces disk I/O for hot objects
    ├── Speeds up repeated reads
    └── Memory-efficient storage
```
Metacache (Tier 2)
```
Metacache (distributed listing cache)
├── Purpose: Accelerate ListObjects operations
├── Features:
│   ├── Caches directory listings
│   ├── Distributed across nodes
│   ├── Supports pagination
│   └── Invalidated on bucket changes
└── Benefits:
    ├── Faster bucket listings
    ├── Reduced disk scanning
    └── Better LIST operation performance
```
Consistency Model
Metadata consistency is maintained through quorum-based operations.
Read Consistency
```
ListOnlineDisks (all disks in erasure set)
   │
   ▼
Read xl.meta from each disk
   │
   ▼
Compare versions (modTime)
   │
   ▼
Select latest with quorum
   │
   ▼
Return consistent metadata
```
Write Consistency
```
Acquire namespace lock
   │
   ▼
Write xl.meta to all disks
   │
   ▼
Wait for write quorum
   │
   ▼
Commit (atomic rename)
   │
   ▼
Release lock
```
Recovery: NSScanner[4]
The Namespace Scanner (NSScanner) provides background metadata consistency checks.
NSScanner Functions
```
NSScanner
├── Background Process:
│   ├── Scans all objects in namespace
│   ├── Detects missing shards
│   ├── Identifies metadata inconsistencies
│   └── Queues objects for healing
├── Detection:
│   ├── Missing xl.meta files
│   ├── Version mismatches across disks
│   ├── Orphaned data files
│   └── Corrupted metadata
└── Actions:
    ├── Queue for healing
    ├── Log inconsistencies
    └── Update metrics
```
Scanner Cycle
| Phase | Action | Frequency |
|---|---|---|
| Enumerate | List all objects | Continuous |
| Validate | Check metadata consistency | Per-object |
| Queue | Add issues to healing queue | As detected |
| Report | Update scanner metrics | Periodic |
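The validate and queue phases above can be sketched as a simple presence check across drives; `needsHealing` is a hypothetical reduction of what the scanner actually inspects (it also checks version agreement and checksum integrity):

```go
package main

import "fmt"

// needsHealing flags an object whose xl.meta is missing on some
// drives but still readable on at least one: present everywhere means
// healthy, present nowhere means the object cannot be recovered by
// healing alone.
func needsHealing(presence []bool) bool {
	have := 0
	for _, ok := range presence {
		if ok {
			have++
		}
	}
	return have > 0 && have < len(presence)
}

func main() {
	fmt.Println(needsHealing([]bool{true, true, true, true}))  // false: fully intact
	fmt.Println(needsHealing([]bool{true, false, true, true})) // true: one drive missing xl.meta
}
```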
Metadata Contents
xl.meta Fields
| Field | Description | Required |
|---|---|---|
| Version | Object version identifier | Yes |
| ModTime | Last modification timestamp | Yes |
| ErasureInfo | Data/parity block configuration | Yes |
| Parts | Part numbers and checksums | Yes |
| Metadata | User-defined metadata (x-amz-meta-*) | No |
| ContentType | MIME type | No |
| ETag | Entity tag for integrity | Yes |
| InlineData | Embedded object data (small objects) | No |
Example Metadata Structure
xl.meta contents (conceptual):
```json
{
  "version": "1.0",
  "modTime": "2025-01-05T10:30:00Z",
  "erasure": {
    "algorithm": "reedsolomon",
    "data": 8,
    "parity": 4,
    "blockSize": 1048576
  },
  "parts": [
    {"number": 1, "size": 5242880, "etag": "abc123..."}
  ],
  "metadata": {
    "content-type": "application/octet-stream",
    "x-amz-meta-custom": "value"
  }
}
```
Performance Considerations
Optimization Tips
| Area | Recommendation | Impact |
|---|---|---|
| Cache sizing | Allocate adequate memory for BigCache | Reduces disk I/O |
| Small objects | Leverage inline data | Faster small object access |
| Listings | Enable metacache | Faster LIST operations |
| Disk type | Use SSDs for metadata | Lower latency |
Monitoring
Key metrics for metadata performance:
- Cache hit ratio (BigCache, Metacache)
- xl.meta read/write latency
- NSScanner cycle time
- Metadata size distribution
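The first metric is a simple ratio of cache hits to total lookups:

```go
package main

import "fmt"

// hitRatio computes the share of metadata lookups served from cache
// rather than from xl.meta reads on disk.
func hitRatio(hits, misses uint64) float64 {
	total := hits + misses
	if total == 0 {
		return 0
	}
	return float64(hits) / float64(total)
}

func main() {
	fmt.Println(hitRatio(900, 100)) // 0.9
}
```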
Source Code References
- `cmd/xl-storage-format-v2.go:42` — `xlHeader = [4]byte{'X', 'L', '2', ' '}` (magic header)
- `cmd/xl-storage-format-v2.go:55,61` — `xlVersionMajor = 1`, `xlVersionMinor = 4`
- `cmd/xl-storage.go:130` — `xlMetaCache *bigcache.BigCache` (per-drive metadata cache)
- `cmd/data-scanner.go:216` — `objAPI.NSScanner(ctx, results)` (namespace scanner invocation)