AIStor Tables uses a hierarchical read-write locking system to ensure consistency across warehouses, namespaces, and tables while maximizing concurrent throughput. Understanding this locking architecture is essential for diagnosing lock contention and optimizing workload performance in production.
This question covers:
- The four-level lock hierarchy (warehouse → namespace → table registry shard → metadata pointer)
- Optimistic vs pessimistic concurrency strategies
- Deadlock prevention mechanisms
- Where lock contention occurs and how to reason about it
Answer
AIStor Tables implements a layered locking architecture that balances consistency with concurrency. The design uses coarse-grained locks for rare structural operations (creating warehouses/namespaces) and fine-grained locks for frequent data operations (table commits), with two distinct concurrency strategies depending on the operation type.
Lock Infrastructure
Lock Types
The system selects between three lock implementations depending on deployment topology:
| Lock Type | Deployment | Description |
|---|---|---|
| Local RWMutex (lsync.LRWMutex) | Single-node | In-process read-write mutex with reference counting |
| Distributed RWMutex (dsync.DRWMutex) | Multi-node | Coordinates across cluster nodes via RPCs |
| Coalesced Distributed Lock | Multi-node (default) | Optimization that shares a single distributed read lock across multiple goroutines |
All three implement a common RWLocker interface[1]:
```go
type RWLocker interface {
	GetLock(ctx context.Context, timeout *dynamicTimeout) (LockContext, error)
	Unlock(lkCtx LockContext)
	GetRLock(ctx context.Context, timeout *dynamicTimeout) (LockContext, error)
	RUnlock(lkCtx LockContext)
}
```
Coalesced Read Locks
Enabled by default[2], coalesced locks allow multiple concurrent readers to share a single distributed read lock, reducing RPC overhead from N distributed lock RPCs to 1[3]. The first reader acquires the distributed lock; subsequent readers join at no cost. Write locks are never coalesced — each write lock is independent.
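The coalescing idea can be sketched as follows. This is an illustrative model, not the actual coalescedDistLock code: the first reader performs the (here simulated) distributed acquisition, and overlapping readers join by bumping a reference count.

```go
package main

import (
	"fmt"
	"sync"
)

// coalescedRLock is a minimal sketch of read-lock coalescing. The first
// reader pays for the distributed lock acquisition (simulated by a counter);
// later readers sharing the lock join at no RPC cost. All names here are
// illustrative, not the real implementation.
type coalescedRLock struct {
	mu       sync.Mutex
	readers  int // readers currently sharing the distributed RLock
	rpcCalls int // how many "distributed" acquisitions were actually made
}

func (c *coalescedRLock) RLock() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.readers == 0 {
		c.rpcCalls++ // only the first reader performs the remote acquisition
	}
	c.readers++
}

func (c *coalescedRLock) RUnlock() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.readers--
	// The last reader out would release the distributed lock here.
}

func main() {
	var l coalescedRLock
	// Eight overlapping readers share a single distributed acquisition.
	for i := 0; i < 8; i++ {
		l.RLock()
	}
	for i := 0; i < 8; i++ {
		l.RUnlock()
	}
	fmt.Println("distributed acquisitions:", l.rpcCalls) // prints 1
}
```

Write locks skip this path entirely: in the sketch above they would each take their own distributed lock, matching the "write locks are never coalesced" rule.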
Lock Timeouts
| Timeout | Duration | Used For |
|---|---|---|
| tablesLockTimeout[4] | 30s (5s min) | Most Tables operations |
| tablesLockAllShardsTimeout | 45s (30s min) | Locking all 16 registry shards |
| tablesLockTimeoutNonBlocking[5] | Immediate fail | Multi-table transactions (returns OperationConflict on contention) |
| tablesRecoveryTimeout | 5 min | Background transaction recovery |
The Four-Level Lock Hierarchy
Locks are organized from coarsest to finest granularity. Higher-level locks protect structural metadata; lower-level locks protect individual table state.
```
Level 1: Warehouse Registry Lock      catalog/warehouse.bin
Level 2: Namespace Registry Lock      catalog/{warehouse}/namespace.bin
Level 3: Table Registry Shard Locks   catalog/{warehouse}/namespaces/{ns}/table-registry-shard-{N}.bin
Level 4: Metadata Pointer Lock        catalog/{warehouse}/tables/{uuid}.bin
```
Level 1 — Warehouse Registry
Resource: catalog/warehouse.bin (single file for all warehouses)
| Operation | Lock Type |
|---|---|
| ListWarehouses, GetWarehouse | Read |
| CreateWarehouse, DeleteWarehouse | Write |
Warehouse operations are rare, so this single-file lock has minimal contention in practice[6].
Level 2 — Namespace Registry
Resource: catalog/{warehouse}/namespace.bin (one per warehouse)
| Operation | Lock Type |
|---|---|
| List/read namespaces | Read |
| CreateNamespace, DeleteNamespace, UpdateNamespaceProperties | Write |
Like warehouse locks, namespace operations are infrequent and short-lived[7].
Level 3 — Table Registry Shards
Resource: 16 shard files[8] per namespace, selected by xxhash(table_name) & 0xF
This is the first layer where meaningful contention can occur. The registry is sharded into 16 independent partitions, providing ~16x parallelism over a single-file approach.
| Operation | Shards Locked | Strategy |
|---|---|---|
| CreateTable, DeleteTable | 1 shard | Lock the shard containing the table name |
| RenameTable, RenameView | 1–2 shards | Lock source and dest shards in sorted order[9] |
| DeleteNamespace | All 16 shards | Serialize via LockNamespaceFunc, then acquire all in parallel[10] |
Contention scenario: Multiple concurrent CreateTable operations targeting tables that hash to the same shard will serialize at this level.
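The shard-selection rule can be sketched as below. The real system uses xxhash; the standard library's FNV-1a stands in here purely to keep the example self-contained, so the shard numbers are illustrative.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor sketches registry-shard selection: hash the table name and mask
// to one of 16 shards. FNV-1a is a stand-in for the xxhash used in the
// actual implementation.
func shardFor(tableName string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(tableName))
	return h.Sum32() & 0xF // 0xF == tableRegistryShardCount - 1
}

func main() {
	// Two table names that hash to the same shard will serialize their
	// create/delete operations; names in different shards proceed in parallel.
	for _, name := range []string{"orders", "orders_v2", "customers"} {
		fmt.Printf("%s -> shard %d\n", name, shardFor(name))
	}
}
```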
Level 4 — Metadata Pointer (Per-Table)
Resource: catalog/{warehouse}/tables/{uuid}.bin (one per table, UUID-based)
This is the finest-grained lock and the hot path for table mutations. The system uses two different concurrency strategies here:
| Operation | Strategy |
|---|---|
| Regular single-table commits | Optimistic (CAS, no lock) |
| Staged-to-live commits | Pessimistic (pointer lock → shard lock) |
| Rename operations | Pessimistic (shard lock → pointer lock) |
| Multi-table transactions | Pessimistic (all pointer locks in sorted order, non-blocking) |
Concurrency Strategies
Optimistic Concurrency (CAS) — The Hot Path
Regular single-table commits bypass distributed locking entirely and use ETag-based Compare-And-Swap[11]:
1. Read metadata pointer (get current ETag)
2. Read and validate table metadata
3. Apply updates, write new metadata
4. CAS write pointer — succeeds only if ETag unchanged

If a concurrent writer changed the pointer between steps 1 and 4, the CAS fails with a PreConditionFailed error (surfaced as TableUpdateConflict). The client retries.
Why this matters: The common case (single-table writes) incurs zero distributed lock RPCs, making it the lowest-latency path.
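The optimistic loop can be sketched with an in-memory stand-in for the metadata pointer. All names here are illustrative, not the AIStor Tables API; the point is the read → update → CAS → retry shape.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errPreconditionFailed = errors.New("precondition failed") // surfaced as TableUpdateConflict

// pointerStore models the metadata pointer: a value guarded by an ETag that
// changes on every successful write.
type pointerStore struct {
	mu    sync.Mutex
	etag  int
	value string
}

func (p *pointerStore) read() (string, int) {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.value, p.etag
}

// casWrite succeeds only if the caller's ETag still matches the pointer's.
func (p *pointerStore) casWrite(newValue string, etag int) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.etag != etag {
		return errPreconditionFailed
	}
	p.value = newValue
	p.etag++
	return nil
}

// commit is the optimistic loop: read the pointer, apply the update,
// CAS-write, and retry from a fresh read on conflict.
func commit(p *pointerStore, update func(string) string) string {
	for {
		cur, etag := p.read()
		next := update(cur)
		if err := p.casWrite(next, etag); err == nil {
			return next
		}
		// A concurrent writer moved the pointer between read and CAS: retry.
	}
}

func main() {
	p := &pointerStore{value: "v0"}
	fmt.Println(commit(p, func(string) string { return "v1" })) // prints v1
	// A writer holding a stale ETag gets a conflict and must retry.
	fmt.Println(p.casWrite("v2", 0))
}
```

Note that no lock is held across the whole commit; the only serialization point is the CAS itself, which is why this path scales well under low write contention.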
Pessimistic Locking — Multi-Table Transactions
Multi-table transactions acquire exclusive locks on all participating table pointers before making any changes[12]:
1. Discover participating tables
2. Recover any pending failed transactions
3. Acquire pointer locks (sorted by path, non-blocking)
4. Validate all table states
5. Write atomic transaction log (WAL)
6. Commit all tables
7. Delete transaction log
8. Release locks

Key design decisions:
- Sorted lock order: Pointer paths are sorted lexicographically before acquisition to prevent deadlock[13]
- Non-blocking: Uses immediate-fail timeout — returns OperationConflict instead of waiting[14]
- Reverse release: Locks are released in reverse order on failure
Deadlock Prevention
The system employs four mechanisms to prevent deadlocks:
1. Lexicographic Lock Ordering
When acquiring multiple locks of the same type, paths are always sorted[9]:
```go
// Rename: always lock in lexicographical order to prevent deadlock
if sourcePath > destPath {
	firstPath, secondPath = destPath, sourcePath
} else {
	firstPath, secondPath = sourcePath, destPath
}
```
2. Non-Blocking Multi-Resource Acquisition
Multi-table transactions use non-blocking lock attempts. If any lock cannot be acquired immediately, all held locks are released and the operation fails with OperationConflict rather than blocking.
3. Serialization for All-Shard Operations
Operations requiring all 16 shard locks (e.g., DeleteNamespace) first acquire a higher-level serialization lock (LockNamespaceFunc)[10] to prevent interleaved shard acquisition between concurrent callers.
4. Disallowing Conflicting Lock Orders
Staged table renames are disallowed because commitStagedToLive locks pointer-then-registry while rename locks registry-then-pointer — allowing both would create a lock ordering inversion.
Lock Contention Analysis
Where Contention Is Most Likely
| Resource | Contention Risk | Cause | Impact |
|---|---|---|---|
| Table registry shard | Medium | Concurrent creates/deletes hashing to same shard | Operations serialize within the shard |
| Metadata pointer (CAS) | Low-Medium | Concurrent writers to the same table | Retry on conflict, no blocking |
| Metadata pointer (locked) | Medium | Multi-table transactions overlapping with renames or staged commits | OperationConflict returned |
| All-shard lock | Low | Concurrent DeleteNamespace operations | Serialized by LockNamespaceFunc |
| Warehouse/namespace registry | Very Low | Rare structural operations | Minimal in practice |
Contention Patterns by Workload
High-throughput single-table writes (most common):
- Uses CAS — no distributed lock contention
- Retries on ETag mismatch are fast and local
- Scales linearly with cluster size
Concurrent multi-table transactions targeting overlapping tables:
- Non-blocking acquisition means fast failure, not blocking
- Clients should implement retry with backoff
- Contention increases with transaction scope (more tables means a higher probability of lock overlap)
Bulk table creation in a single namespace:
- 16 shards provide good parallelism
- Worst case: all new tables hash to the same shard (statistically unlikely)
- Expected throughput: ~16x better than a single-lock design
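To quantify "statistically unlikely": assuming uniform hashing, the probability that all n new tables land in the same shard is 16 × (1/16)^n = (1/16)^(n−1). The snippet below just evaluates that formula; it is an illustration, not part of the system.

```go
package main

import (
	"fmt"
	"math"
)

// pAllSameShard returns the probability that n uniformly hashed table names
// all select the same one of 16 registry shards: (1/16)^(n-1).
func pAllSameShard(n int) float64 {
	return math.Pow(1.0/16.0, float64(n-1))
}

func main() {
	for _, n := range []int{2, 5, 10} {
		fmt.Printf("n=%d  P(all same shard)=%.3g\n", n, pAllSameShard(n))
	}
}
```

Even at n=5 the worst case is already below 2 in 100,000, so in practice bulk creates spread close to evenly across the 16 shards.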
Namespace deletion during active table operations:
- All-shard lock blocks all concurrent creates/deletes in that namespace
- Short-lived — only held for the registry cleanup
Recovery and Lock Interaction
Failed multi-step transactions (multi-table, staged-to-live, rename) are detected lazily and recovered in the background:
- Recovery uses the same non-blocking lock acquisition
- Retries with exponential backoff (50ms → 1s) and jitter on lock contention
- Maximum 50 attempts over a 5-minute window
- A Write-Ahead Log (WAL) at catalog/{warehouse}/.tx/pending/{txID}.bin ensures atomicity across crashes
Per-Operation Lock Summary
| Operation | Locks Acquired (in order) |
|---|---|
| CreateWarehouse | Warehouse registry (W) |
| DeleteWarehouse | Namespace registry (W) → Warehouse registry (W) |
| ListWarehouses | Warehouse registry (R) |
| CreateNamespace | Namespace registry (W) |
| DeleteNamespace | Namespace func lock → All 16 shard locks (parallel) → Namespace registry (W) |
| CreateTable | Table registry shard (W) |
| DeleteTable | Table registry shard (W) |
| CommitTable (regular) | None — uses CAS on pointer |
| CommitTable (staged→live) | Pointer lock (W) → Registry shard lock (W) |
| RenameTable | Registry shard lock(s) (W, sorted) → Pointer lock (W) |
| Multi-table transaction | All pointer locks (W, sorted, non-blocking) |
| LoadTable | Pointer (R) |
(W) = Write lock, (R) = Read lock
Source Code References
1. cmd/namespace-lock.go:41-46 — RWLocker interface definition
2. cmd/namespace-lock.go:37-38 — _MINIO_COALESCED_LOCKS env var, enabled by default
3. cmd/coalesced-lock.go:42-56 — coalescedDistLock struct: "Multiple goroutines can share a single distributed RLock, reducing RPC overhead"
4. cmd/tables-api-utils.go:134-135 — tablesLockTimeout = newDynamicTimeout(30*time.Second, 5*time.Second)
5. cmd/tables-api-utils.go:141-143 — tablesLockTimeoutNonBlocking with nonBlocking: true
6. cmd/tables-api-utils.go:2390-2407 — lockWarehouseRegistryForRead and lockWarehouseRegistryForWrite
7. cmd/tables-api-utils.go:2412-2415 — lockNamespaceRegistryForWrite
8. cmd/tables-api-utils.go:115 — tableRegistryShardCount = 16
9. cmd/tables-api-utils.go:2148-2154 — Lexicographic lock ordering in LockTableRegistryShardsForRename
10. cmd/tables-api-utils.go:2062-2067 — LockAllTableRegistryShards with serialization via LockNamespaceFunc
11. cmd/tables-metadata-pointer.go:479-482 — BuildApplyCASWriteWithInfo CAS-based commit
12. cmd/tables-transactions.go:530-546 — AcquireLocks with sorted paths and non-blocking acquisition
13. cmd/tables-transactions.go:539 — slices.Sort(paths) for deterministic lock ordering
14. cmd/tables-transactions.go:546 — tablesLockTimeoutNonBlocking usage in multi-table lock acquisition