How do locking and concurrency work in AIStor Tables?

Asked by ravindk89 · Answered by ravindk89 · February 11, 2026

AIStor Tables uses a hierarchical read-write locking system to ensure consistency across warehouses, namespaces, and tables while maximizing concurrent throughput. Understanding this locking architecture is essential for diagnosing lock contention and optimizing workload performance in production.

This question covers:

  • The four-level lock hierarchy (warehouse → namespace → table registry shard → metadata pointer)
  • Optimistic vs pessimistic concurrency strategies
  • Deadlock prevention mechanisms
  • Where lock contention occurs and how to reason about it

Answer

AIStor Tables implements a layered locking architecture that balances consistency with concurrency. The design uses coarse-grained locks for rare structural operations (creating warehouses/namespaces) and fine-grained locks for frequent data operations (table commits), with two distinct concurrency strategies depending on the operation type.


Lock Infrastructure

Lock Types

The system selects between three lock implementations depending on deployment topology:

| Lock Type | Deployment | Description |
| --- | --- | --- |
| Local RWMutex (`lsync.LRWMutex`) | Single-node | In-process read-write mutex with reference counting |
| Distributed RWMutex (`dsync.DRWMutex`) | Multi-node | Coordinates across cluster nodes via RPCs |
| Coalesced distributed lock | Multi-node (default) | Optimization that shares a single distributed read lock across multiple goroutines |

All three implement a common RWLocker interface[1]:

```go
type RWLocker interface {
	GetLock(ctx context.Context, timeout *dynamicTimeout) (LockContext, error)
	Unlock(lkCtx LockContext)
	GetRLock(ctx context.Context, timeout *dynamicTimeout) (LockContext, error)
	RUnlock(lkCtx LockContext)
}
```

Coalesced Read Locks

Enabled by default[2], coalesced locks allow multiple concurrent readers to share a single distributed read lock, reducing RPC overhead from N distributed lock RPCs to 1[3]. The first reader acquires the distributed lock; subsequent readers join at no cost. Write locks are never coalesced — each write lock is independent.

Lock Timeouts

| Timeout | Duration | Used For |
| --- | --- | --- |
| `tablesLockTimeout`[4] | 30s (5s min) | Most Tables operations |
| `tablesLockAllShardsTimeout` | 45s (30s min) | Locking all 16 registry shards |
| `tablesLockTimeoutNonBlocking`[5] | Immediate fail | Multi-table transactions (returns OperationConflict on contention) |
| `tablesRecoveryTimeout` | 5 min | Background transaction recovery |

The Four-Level Lock Hierarchy

Locks are organized from coarsest to finest granularity. Higher-level locks protect structural metadata; lower-level locks protect individual table state.

Level 1: Warehouse Registry Lock (`catalog/warehouse.bin`)
Level 2: Namespace Registry Lock (`catalog/{warehouse}/namespace.bin`)
Level 3: Table Registry Shard Locks (`catalog/{warehouse}/namespaces/{ns}/table-registry-shard-{N}.bin`)
Level 4: Metadata Pointer Lock (`catalog/{warehouse}/tables/{uuid}.bin`)

Level 1 — Warehouse Registry

Resource: catalog/warehouse.bin (single file for all warehouses)

| Operation | Lock Type |
| --- | --- |
| ListWarehouses, GetWarehouse | Read |
| CreateWarehouse, DeleteWarehouse | Write |

Warehouse operations are rare, so this single-file lock has minimal contention in practice[6].

Level 2 — Namespace Registry

Resource: catalog/{warehouse}/namespace.bin (one per warehouse)

| Operation | Lock Type |
| --- | --- |
| List/read namespaces | Read |
| CreateNamespace, DeleteNamespace, UpdateNamespaceProperties | Write |

Like warehouse locks, namespace operations are infrequent and short-lived[7].

Level 3 — Table Registry Shards

Resource: 16 shard files[8] per namespace, selected by `xxhash(table_name) & 0xF`

This is the first layer where meaningful contention can occur. The registry is sharded into 16 independent partitions, providing ~16x parallelism over a single-file approach.

| Operation | Shards Locked | Strategy |
| --- | --- | --- |
| CreateTable, DeleteTable | 1 shard | Lock the shard containing the table name |
| RenameTable, RenameView | 1–2 shards | Lock source and destination shards in sorted order[9] |
| DeleteNamespace | All 16 shards | Serialize via `LockNamespaceFunc`, then acquire all in parallel[10] |

Contention scenario: Multiple concurrent CreateTable operations targeting tables that hash to the same shard will serialize at this level.

Level 4 — Metadata Pointer (Per-Table)

Resource: catalog/{warehouse}/tables/{uuid}.bin (one per table, UUID-based)

This is the finest-grained lock and the hot path for table mutations. The system uses two different concurrency strategies here:

| Operation | Strategy |
| --- | --- |
| Regular single-table commits | Optimistic (CAS, no lock) |
| Staged-to-live commits | Pessimistic (pointer lock → shard lock) |
| Rename operations | Pessimistic (shard lock → pointer lock) |
| Multi-table transactions | Pessimistic (all pointer locks in sorted order, non-blocking) |

Concurrency Strategies

Optimistic Concurrency (CAS) — The Hot Path

Regular single-table commits bypass distributed locking entirely and use ETag-based Compare-And-Swap[11]:

1. Read metadata pointer (get current ETag)
2. Read and validate table metadata
3. Apply updates, write new metadata
4. CAS write pointer — succeeds only if ETag unchanged

If a concurrent writer changed the pointer between steps 1 and 4, the CAS fails with a PreConditionFailed error (surfaced as TableUpdateConflict). The client retries.

Why this matters: The common case (single-table writes) incurs zero distributed lock RPCs, making it the lowest-latency path.

Pessimistic Locking — Multi-Table Transactions

Multi-table transactions acquire exclusive locks on all participating table pointers before making any changes[12]:

1. Discover participating tables
2. Recover any pending failed transactions
3. Acquire pointer locks (sorted by path, non-blocking)
4. Validate all table states
5. Write atomic transaction log (WAL)
6. Commit all tables
7. Delete transaction log
8. Release locks

Key design decisions:

  • Sorted lock order: Pointer paths are sorted lexicographically before acquisition to prevent deadlock[13]
  • Non-blocking: Uses immediate-fail timeout — returns OperationConflict instead of waiting[14]
  • Reverse release: Locks are released in reverse order on failure

Deadlock Prevention

The system employs four mechanisms to prevent deadlocks:

1. Lexicographic Lock Ordering

When acquiring multiple locks of the same type, paths are always sorted[9]:

```go
// Rename: always lock in lexicographical order to prevent deadlock
if sourcePath > destPath {
	firstPath, secondPath = destPath, sourcePath
} else {
	firstPath, secondPath = sourcePath, destPath
}
```

2. Non-Blocking Multi-Resource Acquisition

Multi-table transactions use non-blocking lock attempts. If any lock cannot be acquired immediately, all held locks are released and the operation fails with OperationConflict rather than blocking.

3. Serialization for All-Shard Operations

Operations requiring all 16 shard locks (e.g., DeleteNamespace) first acquire a higher-level serialization lock (LockNamespaceFunc)[10] to prevent interleaved shard acquisition between concurrent callers.

4. Disallowing Conflicting Lock Orders

Staged table renames are disallowed because commitStagedToLive locks pointer-then-registry while rename locks registry-then-pointer — allowing both would create a lock ordering inversion.


Lock Contention Analysis

Where Contention Is Most Likely

| Resource | Contention Risk | Cause | Impact |
| --- | --- | --- | --- |
| Table registry shard | Medium | Concurrent creates/deletes hashing to the same shard | Operations serialize within the shard |
| Metadata pointer (CAS) | Low–Medium | Concurrent writers to the same table | Retry on conflict, no blocking |
| Metadata pointer (locked) | Medium | Multi-table transactions overlapping with renames or staged commits | OperationConflict returned |
| All-shard lock | Low | Concurrent DeleteNamespace operations | Serialized by `LockNamespaceFunc` |
| Warehouse/namespace registry | Very Low | Rare structural operations | Minimal in practice |

Contention Patterns by Workload

High-throughput single-table writes (most common):

  • Uses CAS — no distributed lock contention
  • Retries on ETag mismatch are fast and local
  • Scales linearly with cluster size

Concurrent multi-table transactions targeting overlapping tables:

  • Non-blocking acquisition means fast failure, not blocking
  • Clients should implement retry with backoff
  • Contention increases with transaction scope (more tables = more lock overlap probability)

Bulk table creation in a single namespace:

  • 16 shards provide good parallelism
  • Worst case: all new tables hash to the same shard (statistically unlikely)
  • Expected throughput: ~16x better than a single-lock design

Namespace deletion during active table operations:

  • All-shard lock blocks all concurrent creates/deletes in that namespace
  • Short-lived — only held for the registry cleanup

Recovery and Lock Interaction

Failed multi-step transactions (multi-table, staged-to-live, rename) are detected lazily and recovered in the background:

  • Recovery uses the same non-blocking lock acquisition
  • Retries with exponential backoff (50ms → 1s) and jitter on lock contention
  • Maximum 50 attempts over a 5-minute window
  • A Write-Ahead Log (WAL) at catalog/{warehouse}/.tx/pending/{txID}.bin ensures atomicity across crashes

Per-Operation Lock Summary

| Operation | Locks Acquired (in order) |
| --- | --- |
| CreateWarehouse | Warehouse registry (W) |
| DeleteWarehouse | Namespace registry (W) → Warehouse registry (W) |
| ListWarehouses | Warehouse registry (R) |
| CreateNamespace | Namespace registry (W) |
| DeleteNamespace | Namespace func lock → All 16 shard locks (parallel) → Namespace registry (W) |
| CreateTable | Table registry shard (W) |
| DeleteTable | Table registry shard (W) |
| CommitTable (regular) | None — uses CAS on pointer |
| CommitTable (staged→live) | Pointer lock (W) → Registry shard lock (W) |
| RenameTable | Registry shard lock(s) (W, sorted) → Pointer lock (W) |
| Multi-table transaction | All pointer locks (W, sorted, non-blocking) |
| LoadTable | Pointer (R) |
(W) = Write lock, (R) = Read lock

Source Code References
  1. `cmd/namespace-lock.go:41-46` — `RWLocker` interface definition
  2. `cmd/namespace-lock.go:37-38` — `_MINIO_COALESCED_LOCKS` env var, enabled by default
  3. `cmd/coalesced-lock.go:42-56` — `coalescedDistLock` struct, "Multiple goroutines can share a single distributed RLock, reducing RPC overhead"
  4. `cmd/tables-api-utils.go:134-135` — `tablesLockTimeout = newDynamicTimeout(30*time.Second, 5*time.Second)`
  5. `cmd/tables-api-utils.go:141-143` — `tablesLockTimeoutNonBlocking` with `nonBlocking: true`
  6. `cmd/tables-api-utils.go:2390-2407` — `lockWarehouseRegistryForRead` and `lockWarehouseRegistryForWrite`
  7. `cmd/tables-api-utils.go:2412-2415` — `lockNamespaceRegistryForWrite`
  8. `cmd/tables-api-utils.go:115` — `tableRegistryShardCount = 16`
  9. `cmd/tables-api-utils.go:2148-2154` — Lexicographic lock ordering in `LockTableRegistryShardsForRename`
  10. `cmd/tables-api-utils.go:2062-2067` — `LockAllTableRegistryShards` with serialization via `LockNamespaceFunc`
  11. `cmd/tables-metadata-pointer.go:479-482` — `BuildApplyCASWriteWithInfo` CAS-based commit
  12. `cmd/tables-transactions.go:530-546` — `AcquireLocks` with sorted paths and non-blocking acquisition
  13. `cmd/tables-transactions.go:539` — `slices.Sort(paths)` for deterministic lock ordering
  14. `cmd/tables-transactions.go:546` — `tablesLockTimeoutNonBlocking` usage in multi-table lock acquisition