Understanding MinIO AIStor’s metadata handling is essential for optimizing performance and troubleshooting consistency issues in distributed deployments.
Answer
MinIO stores object metadata in a compact binary format (XL format v2) with two-tier caching and quorum-based consistency. The metadata system is designed for durability through atomic writes, performance through tiered caching, and efficiency by inlining small objects directly within metadata files.
Persistence Model
MinIO stores object metadata in xl.meta files using a compact binary format.
XL Format Structure
```
xl.meta File
├── Header
│   ├── Magic: "XL2 " (4 bytes)
│   └── Version flags
└── Metadata (msgpack encoded)
    ├── Version info
    ├── Erasure coding params
    ├── Part information
    ├── Checksums
    ├── User metadata
    └── [Inline data for small objects]
```
Format Details
| Component | Description | Value |
|---|---|---|
| Format | Binary msgpack with header | Compact, fast parsing |
| Magic Header[1] | "XL2 " (4 bytes) | Format identification |
| Major Version[2] | 1 | Breaking changes |
| Minor Version[2] | 4 | Compatible additions |
| Encoding | MessagePack | Efficient binary serialization |
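A minimal sketch of validating this header, assuming the magic and version values quoted in the table. The helper name and the exact byte layout of the version fields are illustrative, not MinIO's actual code:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

// parseXLHeader is a hypothetical sketch: a 4-byte magic "XL2 "
// followed by major and minor version fields, serialized here as
// little-endian uint32s for illustration (the real on-disk encoding
// of the version fields may differ).
func parseXLHeader(b []byte) (major, minor uint32, err error) {
	if len(b) < 12 {
		return 0, 0, errors.New("header too short")
	}
	if !bytes.Equal(b[:4], []byte("XL2 ")) {
		return 0, 0, errors.New("bad magic: not an xl.meta file")
	}
	major = binary.LittleEndian.Uint32(b[4:8])
	minor = binary.LittleEndian.Uint32(b[8:12])
	if major != 1 {
		return 0, 0, fmt.Errorf("unsupported major version %d", major)
	}
	return major, minor, nil
}

func main() {
	hdr := append([]byte("XL2 "), 1, 0, 0, 0, 4, 0, 0, 0)
	major, minor, err := parseXLHeader(hdr)
	fmt.Println(major, minor, err) // 1 4 <nil>
}
```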
Version History
```
XL Format Evolution:
├── v1.0 - Initial format
├── v1.1 - Added inline data support
├── v1.2 - Extended metadata fields
├── v1.3 - Checksum improvements
└── v1.4 - Current version (optimizations)
```
Inline Data
Small objects are stored directly within the xl.meta file for efficiency.
How Inline Data Works
```
Object Size Check
├── Size ≤ Threshold ──► Store in xl.meta (inline)
│                        └── No separate part files
└── Size > Threshold ──► Store as separate parts
                         └── xl.meta contains references
```
Benefits of Inline Data
| Benefit | Description |
|---|---|
| Reduced I/O | Single file read for small objects |
| Better performance | No additional disk seeks |
| Atomic operations | Data and metadata in one file |
| Space efficiency | Avoids small file overhead |
Inline Threshold
Small objects below the inline threshold are embedded directly in metadata, eliminating separate part files and reducing I/O operations.
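The size check can be sketched as follows; the 128 KiB threshold here is an assumed value for illustration only (the real cutoff depends on the MinIO release and erasure configuration):

```go
package main

import "fmt"

// inlineThreshold is an assumed value for illustration; the actual
// cutoff depends on the MinIO release and erasure-set configuration.
const inlineThreshold = 128 * 1024

// storeInline sketches the decision above: objects at or below the
// threshold are embedded in xl.meta, larger ones get separate part files.
func storeInline(objectSize int64) bool {
	return objectSize <= inlineThreshold
}

func main() {
	fmt.Println(storeInline(4 * 1024))         // true: embedded in xl.meta
	fmt.Println(storeInline(16 * 1024 * 1024)) // false: separate part files
}
```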
Atomic Write Process
MinIO ensures metadata durability through atomic write operations.
Write Sequence
```
Step 1: Write new metadata to temp file (UUID in tmp bucket)
   │
   ▼
Step 2: Sync to disk (fsync)
   │
   ▼
Step 3: Atomic rename temp file to xl.meta
```

MinIO relies on POSIX atomic rename semantics for crash safety. If the rename completes, the new metadata is durable; if it fails, the original xl.meta remains intact.
Crash Recovery
```
On Startup:
├── xl.meta exists, no temp files
│   └── Normal state, no recovery needed
└── Orphaned temp files in tmp bucket
    └── Cleaned up automatically
```
Recovery Files
| File | Purpose | Lifetime |
|---|---|---|
| xl.meta | Active metadata | Permanent |
| temp file | New metadata being written | Transient (auto-cleaned) |
Note: The .bkp file is only used in specific rollback scenarios (e.g., undo operations), not in normal writes.
Caching Architecture
MinIO implements a two-tier caching system for metadata performance.
Cache Tiers
```
Metadata Caching

Request ──► Tier 1: BigCache (per-drive)
            ├── Hit ──► Return cached metadata
            └── Miss ──► Tier 2: Metacache
                         ├── Hit ──► Return
                         └── Miss ──► Read xl.meta from disk
```
Tier Details
| Tier | Implementation | Scope | Purpose |
|---|---|---|---|
| Tier 1[3] | BigCache | Per-drive | Hot metadata for individual objects |
| Tier 2 | Metacache | Distributed | Listing cache across cluster |
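The lookup flow across the two tiers can be sketched with plain maps standing in for the real caches. This only shows the control flow: BigCache is an off-heap LRU cache and the metacache is distributed, not simple in-process maps:

```go
package main

import "fmt"

// tieredLookup sketches the flow above with two maps standing in for
// BigCache (tier 1) and the metacache (tier 2).
type tieredLookup struct {
	tier1, tier2 map[string][]byte
	diskReads    int
}

func (t *tieredLookup) get(key string, readDisk func(string) []byte) []byte {
	if v, ok := t.tier1[key]; ok {
		return v // tier-1 hit: no disk I/O
	}
	if v, ok := t.tier2[key]; ok {
		t.tier1[key] = v // promote to tier 1
		return v
	}
	v := readDisk(key) // double miss: fall through to xl.meta on disk
	t.diskReads++
	t.tier1[key], t.tier2[key] = v, v
	return v
}

func main() {
	c := &tieredLookup{tier1: map[string][]byte{}, tier2: map[string][]byte{}}
	disk := func(k string) []byte { return []byte("meta:" + k) }
	c.get("bucket/obj/v1", disk) // miss: reads disk, fills both tiers
	c.get("bucket/obj/v1", disk) // hit: served from tier 1
	fmt.Println(c.diskReads)     // 1
}
```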
BigCache (Tier 1)
```
BigCache (per-drive metadata cache)
├── Key: bucket/object/version
├── Value: Parsed xl.meta contents
├── Eviction: LRU-based
├── Invalidation: On write/delete
└── Benefits:
    ├── Reduces disk I/O for hot objects
    ├── Speeds up repeated reads
    └── Memory-efficient storage
```
Metacache (Tier 2)
```
Metacache (distributed listing cache)
├── Purpose: Accelerate ListObjects operations
├── Features:
│   ├── Caches directory listings
│   ├── Distributed across nodes
│   ├── Supports pagination
│   └── Invalidated on bucket changes
└── Benefits:
    ├── Faster bucket listings
    ├── Reduced disk scanning
    └── Better LIST operation performance
```
Consistency Model
Metadata consistency is maintained through quorum-based operations.
Read Consistency
```
ListOnlineDisks (all disks in erasure set)
   │
   ▼
Read xl.meta from each disk
   │
   ▼
Compare versions (modTime)
   │
   ▼
Select latest with quorum
   │
   ▼
Return consistent metadata
```
Write Consistency
```
Acquire namespace lock
   │
   ▼
Write xl.meta to all disks
   │
   ▼
Wait for write quorum
   │
   ▼
Commit (atomic rename)
   │
   ▼
Release lock
```
Recovery: NSScanner[4]
The Namespace Scanner (NSScanner) provides background metadata consistency checks.
NSScanner Functions
```
NSScanner
├── Background Process:
│   ├── Scans all objects in namespace
│   ├── Detects missing shards
│   ├── Identifies metadata inconsistencies
│   └── Queues objects for healing
├── Detection:
│   ├── Missing xl.meta files
│   ├── Version mismatches across disks
│   ├── Orphaned data files
│   └── Corrupted metadata
└── Actions:
    ├── Queue for healing
    ├── Log inconsistencies
    └── Update metrics
```
Scanner Cycle
| Phase | Action | Frequency |
|---|---|---|
| Enumerate | List all objects | Continuous |
| Validate | Check metadata consistency | Per-object |
| Queue | Add issues to healing queue | As detected |
| Report | Update scanner metrics | Periodic |
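The validate and queue phases above can be sketched as a simple presence check across drives; `needsHealing` is a hypothetical reduction of what the scanner actually inspects (it also checks version agreement and checksum integrity):

```go
package main

import "fmt"

// needsHealing flags an object whose xl.meta is missing on some
// drives but still readable on at least one: present everywhere means
// healthy, present nowhere means the object cannot be recovered by
// healing alone.
func needsHealing(presence []bool) bool {
	have := 0
	for _, ok := range presence {
		if ok {
			have++
		}
	}
	return have > 0 && have < len(presence)
}

func main() {
	fmt.Println(needsHealing([]bool{true, true, true, true}))  // false: fully intact
	fmt.Println(needsHealing([]bool{true, false, true, true})) // true: one drive missing xl.meta
}
```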
Metadata Contents
xl.meta Fields
| Field | Description | Required |
|---|---|---|
| Version | Object version identifier | Yes |
| ModTime | Last modification timestamp | Yes |
| ErasureInfo | Data/parity block configuration | Yes |
| Parts | Part numbers and checksums | Yes |
| Metadata | User-defined metadata (x-amz-meta-*) | No |
| ContentType | MIME type | No |
| ETag | Entity tag for integrity | Yes |
| InlineData | Embedded object data (small objects) | No |
Example Metadata Structure
xl.meta contents (conceptual):
```json
{
  "version": "1.0",
  "modTime": "2025-01-05T10:30:00Z",
  "erasure": {
    "algorithm": "reedsolomon",
    "data": 8,
    "parity": 4,
    "blockSize": 1048576
  },
  "parts": [
    {"number": 1, "size": 5242880, "etag": "abc123..."}
  ],
  "metadata": {
    "content-type": "application/octet-stream",
    "x-amz-meta-custom": "value"
  }
}
```
Performance Considerations
Optimization Tips
| Area | Recommendation | Impact |
|---|---|---|
| Cache sizing | Allocate adequate memory for BigCache | Reduces disk I/O |
| Small objects | Leverage inline data | Faster small object access |
| Listings | Enable metacache | Faster LIST operations |
| Disk type | Use SSDs for metadata | Lower latency |
Monitoring
Key metrics for metadata performance:
- Cache hit ratio (BigCache, Metacache)
- xl.meta read/write latency
- NSScanner cycle time
- Metadata size distribution
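The first metric is a simple ratio of cache hits to total lookups:

```go
package main

import "fmt"

// hitRatio computes the share of metadata lookups served from cache
// rather than from xl.meta reads on disk.
func hitRatio(hits, misses uint64) float64 {
	total := hits + misses
	if total == 0 {
		return 0
	}
	return float64(hits) / float64(total)
}

func main() {
	fmt.Println(hitRatio(900, 100)) // 0.9
}
```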
Source Code References
- `cmd/xl-storage-format-v2.go:42` — `xlHeader = [4]byte{'X', 'L', '2', ' '}` (magic header)
- `cmd/xl-storage-format-v2.go:55,61` — `xlVersionMajor = 1`, `xlVersionMinor = 4`
- `cmd/xl-storage.go:130` — `xlMetaCache *bigcache.BigCache` (per-drive metadata cache)
- `cmd/data-scanner.go:216` — `objAPI.NSScanner(ctx, results)` (namespace scanner invocation)