How does MinIO AIStor handle metadata internally?

Asked by muratkars Answered by muratkars January 4, 2026

Understanding MinIO AIStor’s metadata handling is essential for optimizing performance and troubleshooting consistency issues in distributed deployments.

Answer

MinIO stores object metadata in the binary XL format v2 and layers multi-tier caching and quorum-based consistency on top. The metadata system is designed for durability through atomic writes, performance through tiered caching, and efficiency through inlining small objects directly within metadata files.


Persistence Model

MinIO stores object metadata in xl.meta files using a compact binary format.

XL Format Structure

┌─────────────────────────────────────────────────────────┐
│ xl.meta File │
├─────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Header │ │
│ │ ├── Magic: "XL2 " (4 bytes) │ │
│ │ └── Version flags │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Metadata (msgpack encoded) │ │
│ │ ├── Version info │ │
│ │ ├── Erasure coding params │ │
│ │ ├── Part information │ │
│ │ ├── Checksums │ │
│ │ ├── User metadata │ │
│ │ └── [Inline data for small objects] │ │
│ └─────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────┘

Format Details

Component          Description                   Value
---------          -----------                   -----
Format             Binary msgpack with header    Compact, fast parsing
Magic Header[1]    "XL2 " (4 bytes)              Format identification
Major Version[2]   1                             Breaking changes
Minor Version[2]   4                             Compatible additions
Encoding           MessagePack                   Efficient binary serialization
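The header check can be sketched as a small parser. Only the 4-byte magic comes from the source reference; the byte layout of the version fields below (little-endian uint16 pairs) is an assumption for illustration, since the real serialization lives in cmd/xl-storage-format-v2.go:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// Magic bytes from cmd/xl-storage-format-v2.go. The version encoding
// that follows is assumed here for illustration only.
var xlMagic = [4]byte{'X', 'L', '2', ' '}

// checkXLHeader validates the magic bytes and extracts the version pair.
func checkXLHeader(buf []byte) (major, minor uint16, err error) {
	if len(buf) < 8 {
		return 0, 0, fmt.Errorf("short header: %d bytes", len(buf))
	}
	if !bytes.Equal(buf[:4], xlMagic[:]) {
		return 0, 0, fmt.Errorf("bad magic %q", buf[:4])
	}
	return binary.LittleEndian.Uint16(buf[4:6]), binary.LittleEndian.Uint16(buf[6:8]), nil
}

func main() {
	major, minor, err := checkXLHeader([]byte{'X', 'L', '2', ' ', 1, 0, 4, 0})
	fmt.Println(major, minor, err) // 1 4 <nil>
}
```

Validating the magic before touching the msgpack body lets a reader reject foreign or corrupted files cheaply.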

Version History

XL Format Evolution:
├── v1.0 - Initial format
├── v1.1 - Added inline data support
├── v1.2 - Extended metadata fields
├── v1.3 - Checksum improvements
└── v1.4 - Current version (optimizations)

Inline Data

Small objects are stored directly within the xl.meta file for efficiency.

How Inline Data Works

Object Size Check
├── Size ≤ Threshold ──► Store in xl.meta (inline)
│ └── No separate part files
└── Size > Threshold ──► Store as separate parts
└── xl.meta contains references

Benefits of Inline Data

Benefit             Description
-------             -----------
Reduced I/O         Single file read for small objects
Better performance  No additional disk seeks
Atomic operations   Data and metadata in one file
Space efficiency    Avoids small file overhead

Inline Threshold

Small objects below the inline threshold are embedded directly in metadata, eliminating separate part files and reducing I/O operations.
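The size check itself is a single comparison. In this sketch the 128 KiB threshold is an assumed value for illustration; MinIO's actual cutoff depends on configuration and erasure-set shape:

```go
package main

import "fmt"

// inlineThreshold is an illustrative constant; the real cutoff in
// MinIO is configuration- and erasure-set-dependent.
const inlineThreshold = 128 * 1024 // 128 KiB (assumed)

// storeInline implements the decision above: objects at or below the
// threshold are embedded in xl.meta, larger ones get separate parts.
func storeInline(size int64) bool {
	return size <= inlineThreshold
}

func main() {
	for _, size := range []int64{4 << 10, 128 << 10, 1 << 20} {
		fmt.Printf("%7d bytes -> inline=%v\n", size, storeInline(size))
	}
}
```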


Atomic Write Process

MinIO ensures metadata durability through atomic write operations.

Write Sequence

Step 1: Write new metadata to temp file (UUID in tmp bucket)
Step 2: Sync to disk (fsync)
Step 3: Atomic rename temp file to xl.meta

MinIO relies on POSIX atomic rename semantics for crash safety. If the rename completes, the new metadata is durable. If it fails, the original xl.meta remains intact.

Crash Recovery

On Startup:
├── xl.meta exists, no temp files
│ └── Normal state, no recovery needed
└── Orphaned temp files in tmp bucket
└── Cleaned up automatically

Recovery Files

File       Purpose                     Lifetime
----       -------                     --------
xl.meta    Active metadata             Permanent
temp file  New metadata being written  Transient (auto-cleaned)

Note: The .bkp file is only used in specific rollback scenarios (e.g., undo operations), not in normal writes.


Caching Architecture

MinIO implements a two-tier caching system for metadata performance.

Cache Tiers

┌─────────────────────────────────────────────────────────┐
│ Metadata Caching │
├─────────────────────────────────────────────────────────┤
│ │
│ Request ──► Tier 1: BigCache (per-drive) │
│ │ │
│ ├── Hit ──► Return cached metadata │
│ │ │
│ └── Miss ──► Tier 2: Metacache │
│ │ │
│ ├── Hit ──► Return │
│ │ │
│ └── Miss ──► Disk │
│ │
└─────────────────────────────────────────────────────────┘

Tier Details

Tier       Implementation  Scope        Purpose
----       --------------  -----        -------
Tier 1[3]  BigCache        Per-drive    Hot metadata for individual objects
Tier 2     Metacache       Distributed  Listing cache across cluster
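The tiered lookup flow can be sketched with plain maps standing in for the real caches; `tieredCache`, the promotion-on-hit behavior, and the key format are illustrative, not MinIO's implementation:

```go
package main

import "fmt"

// tieredCache sketches the lookup flow: maps stand in for BigCache
// (tier 1) and the metacache (tier 2); `disk` stands in for reading
// and parsing xl.meta from storage.
type tieredCache struct {
	tier1 map[string][]byte
	tier2 map[string][]byte
	disk  func(key string) []byte
}

// get checks tier 1, then tier 2, then disk, promoting hits into
// tier 1 so repeated reads of hot metadata stay in memory.
func (c *tieredCache) get(key string) (val []byte, source string) {
	if v, ok := c.tier1[key]; ok {
		return v, "tier1"
	}
	if v, ok := c.tier2[key]; ok {
		c.tier1[key] = v
		return v, "tier2"
	}
	v := c.disk(key)
	c.tier1[key] = v
	return v, "disk"
}

func main() {
	c := &tieredCache{
		tier1: map[string][]byte{},
		tier2: map[string][]byte{"bucket/obj/v1": []byte("cached-meta")},
		disk:  func(key string) []byte { return []byte("meta-from-disk") },
	}
	_, src1 := c.get("bucket/obj/v1") // tier 2 hit, promoted
	_, src2 := c.get("bucket/obj/v1") // now served from tier 1
	_, src3 := c.get("bucket/obj/v2") // miss everywhere -> disk
	fmt.Println(src1, src2, src3) // tier2 tier1 disk
}
```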

BigCache (Tier 1)

┌─────────────────────────────────────────────────────────┐
│ BigCache │
├─────────────────────────────────────────────────────────┤
│ │
│ Scope: Per-drive metadata cache │
│ │
│ Key: bucket/object/version │
│ Value: Parsed xl.meta contents │
│ │
│ Eviction: LRU-based │
│ Invalidation: On write/delete │
│ │
│ Benefits: │
│ ├── Reduces disk I/O for hot objects │
│ ├── Speeds up repeated reads │
│ └── Memory-efficient storage │
│ │
└─────────────────────────────────────────────────────────┘

Metacache (Tier 2)

┌─────────────────────────────────────────────────────────┐
│ Metacache │
├─────────────────────────────────────────────────────────┤
│ │
│ Scope: Distributed listing cache │
│ │
│ Purpose: Accelerate ListObjects operations │
│ │
│ Features: │
│ ├── Caches directory listings │
│ ├── Distributed across nodes │
│ ├── Supports pagination │
│ └── Invalidated on bucket changes │
│ │
│ Benefits: │
│ ├── Faster bucket listings │
│ ├── Reduced disk scanning │
│ └── Better LIST operation performance │
│ │
└─────────────────────────────────────────────────────────┘

Consistency Model

Metadata consistency is maintained through quorum-based operations.

Read Consistency

ListOnlineDisks (all disks in erasure set)
Read xl.meta from each disk
Compare versions (modTime)
Select latest with quorum
Return consistent metadata

Write Consistency

Acquire namespace lock
Write xl.meta to all disks
Wait for write quorum
Commit (atomic rename)
Release lock

Recovery: NSScanner[4]

The Namespace Scanner (NSScanner) provides background metadata consistency checks.

NSScanner Functions

┌─────────────────────────────────────────────────────────┐
│ NSScanner │
├─────────────────────────────────────────────────────────┤
│ │
│ Background Process: │
│ ├── Scans all objects in namespace │
│ ├── Detects missing shards │
│ ├── Identifies metadata inconsistencies │
│ └── Queues objects for healing │
│ │
│ Detection: │
│ ├── Missing xl.meta files │
│ ├── Version mismatches across disks │
│ ├── Orphaned data files │
│ └── Corrupted metadata │
│ │
│ Actions: │
│ ├── Queue for healing │
│ ├── Log inconsistencies │
│ └── Update metrics │
│ │
└─────────────────────────────────────────────────────────┘

Scanner Cycle

Phase      Action                       Frequency
-----      ------                       ---------
Enumerate  List all objects             Continuous
Validate   Check metadata consistency   Per-object
Queue      Add issues to healing queue  As detected
Report     Update scanner metrics       Periodic
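The enumerate/validate/queue cycle can be sketched as follows; `diskView` and `scanNamespace` are illustrative stand-ins for the scanner's real data structures, and only the missing-xl.meta check is modeled:

```go
package main

import "fmt"

// diskView records which objects have a readable xl.meta on one disk.
type diskView map[string]bool

// scanNamespace sketches the detection pass above: walk the namespace
// and queue any object whose xl.meta is missing on some disk. The
// real NSScanner also checks versions, checksums, and orphaned data.
func scanNamespace(objects []string, disks []diskView) (healQueue []string) {
	for _, obj := range objects {
		for _, d := range disks {
			if !d[obj] {
				healQueue = append(healQueue, obj)
				break // queue once per object, not once per bad disk
			}
		}
	}
	return healQueue
}

func main() {
	disks := []diskView{
		{"photos/a.jpg": true, "photos/b.jpg": true},
		{"photos/a.jpg": true}, // b.jpg's xl.meta missing here
	}
	fmt.Println(scanNamespace([]string{"photos/a.jpg", "photos/b.jpg"}, disks))
}
```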

Metadata Contents

xl.meta Fields

Field        Description                           Required
-----        -----------                           --------
Version      Object version identifier             Yes
ModTime      Last modification timestamp           Yes
ErasureInfo  Data/parity block configuration       Yes
Parts        Part numbers and checksums            Yes
Metadata     User-defined metadata (x-amz-meta-*)  No
ContentType  MIME type                             No
ETag         Entity tag for integrity              Yes
InlineData   Embedded object data (small objects)  No

Example Metadata Structure

xl.meta contents (conceptual):

{
  "version": "1.0",
  "modTime": "2025-01-05T10:30:00Z",
  "erasure": {
    "algorithm": "reedsolomon",
    "data": 8,
    "parity": 4,
    "blockSize": 1048576
  },
  "parts": [
    {"number": 1, "size": 5242880, "etag": "abc123..."}
  ],
  "metadata": {
    "content-type": "application/octet-stream",
    "x-amz-meta-custom": "value"
  }
}

Performance Considerations

Optimization Tips

Area           Recommendation                         Impact
----           --------------                         ------
Cache sizing   Allocate adequate memory for BigCache  Reduces disk I/O
Small objects  Leverage inline data                   Faster small object access
Listings       Enable metacache                       Faster LIST operations
Disk type      Use SSDs for metadata                  Lower latency

Monitoring

Key metrics for metadata performance:

  • Cache hit ratio (BigCache, Metacache)
  • xl.meta read/write latency
  • NSScanner cycle time
  • Metadata size distribution

Source Code References
  1. cmd/xl-storage-format-v2.go:42 - xlHeader = [4]byte{'X', 'L', '2', ' '} (magic header)
  2. cmd/xl-storage-format-v2.go:55,61 - xlVersionMajor = 1, xlVersionMinor = 4
  3. cmd/xl-storage.go:130 - xlMetaCache *bigcache.BigCache (per-drive metadata cache)
  4. cmd/data-scanner.go:216 - objAPI.NSScanner(ctx, results) (namespace scanner invocation)