AIStor Tables uses a hierarchical read-write locking system to ensure consistency across warehouses, namespaces, and tables while maximizing concurrent throughput. Understanding this locking architecture is essential for diagnosing lock contention and optimizing workload performance in production.
This question covers:
- The four-level lock hierarchy (warehouse → namespace → table registry shard → metadata pointer)
- Optimistic vs pessimistic concurrency strategies
- Deadlock prevention mechanisms
- Where lock contention occurs and how to reason about it
Answer
AIStor Tables implements a layered locking architecture that balances consistency with concurrency. The design uses coarse-grained locks for rare structural operations (creating warehouses/namespaces) and fine-grained locks for frequent data operations (table commits), with two distinct concurrency strategies depending on the operation type.
Lock Infrastructure
Lock Types
The system selects between three lock implementations depending on deployment topology:
| Lock Type | Deployment | Description |
|---|---|---|
| Local RWMutex (lsync.LRWMutex) | Single-node | In-process read-write mutex with reference counting |
| Distributed RWMutex (dsync.DRWMutex) | Multi-node | Coordinates across cluster nodes via RPCs |
| Coalesced Distributed Lock | Multi-node (default) | Optimization that shares a single distributed read lock across multiple goroutines |
All three implement a common RWLocker interface[1]:
```go
type RWLocker interface {
	GetLock(ctx context.Context, timeout *dynamicTimeout) (LockContext, error)
	Unlock(lkCtx LockContext)
	GetRLock(ctx context.Context, timeout *dynamicTimeout) (LockContext, error)
	RUnlock(lkCtx LockContext)
}
```
Coalesced Read Locks
Enabled by default[2], coalesced locks allow multiple concurrent readers to share a single distributed read lock, reducing RPC overhead from N distributed lock RPCs to 1[3]. The first reader acquires the distributed lock; subsequent readers join at no cost. Write locks are never coalesced — each write lock is independent.
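The coalescing idea can be sketched as follows. This is an illustrative model, not the actual coalescedDistLock code: the first reader performs the (here simulated) distributed acquisition, and overlapping readers join by bumping a reference count.

```go
package main

import (
	"fmt"
	"sync"
)

// coalescedRLock is a minimal sketch of read-lock coalescing. The first
// reader pays for the distributed lock acquisition (simulated by a counter);
// later readers sharing the lock join at no RPC cost. All names here are
// illustrative, not the real implementation.
type coalescedRLock struct {
	mu       sync.Mutex
	readers  int // readers currently sharing the distributed RLock
	rpcCalls int // how many "distributed" acquisitions were actually made
}

func (c *coalescedRLock) RLock() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.readers == 0 {
		c.rpcCalls++ // only the first reader performs the remote acquisition
	}
	c.readers++
}

func (c *coalescedRLock) RUnlock() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.readers--
	// The last reader out would release the distributed lock here.
}

func main() {
	var l coalescedRLock
	// Eight overlapping readers share a single distributed acquisition.
	for i := 0; i < 8; i++ {
		l.RLock()
	}
	for i := 0; i < 8; i++ {
		l.RUnlock()
	}
	fmt.Println("distributed acquisitions:", l.rpcCalls) // prints 1
}
```

Write locks skip this path entirely: in the sketch above they would each take their own distributed lock, matching the "write locks are never coalesced" rule.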
Lock Timeouts
| Timeout | Duration | Used For |
|---|---|---|
| tablesLockTimeout[4] | 30s (5s min) | Most Tables operations |
| tablesLockAllShardsTimeout | 45s (30s min) | Locking all 16 registry shards |
| tablesLockTimeoutNonBlocking[5] | Immediate fail | Multi-table transactions (returns OperationConflict on contention) |
| tablesRecoveryTimeout | 5 min | Background transaction recovery |
The Four-Level Lock Hierarchy
Locks are organized from coarsest to finest granularity. Higher-level locks protect structural metadata; lower-level locks protect individual table state.
```
Level 1: Warehouse Registry Lock      catalog/warehouse.bin
Level 2: Namespace Registry Lock      catalog/{warehouse}/namespace.bin
Level 3: Table Registry Shard Locks   catalog/{warehouse}/namespaces/{ns}/table-registry-shard-{N}.bin
Level 4: Metadata Pointer Lock        catalog/{warehouse}/tables/{uuid}.bin
```
Level 1 — Warehouse Registry
Resource: catalog/warehouse.bin (single file for all warehouses)
| Operation | Lock Type |
|---|---|
| ListWarehouses, GetWarehouse | Read |
| CreateWarehouse, DeleteWarehouse | Write |
Warehouse operations are rare, so this single-file lock has minimal contention in practice[6].
Level 2 — Namespace Registry
Resource: catalog/{warehouse}/namespace.bin (one per warehouse)
| Operation | Lock Type |
|---|---|
| List/read namespaces | Read |
| CreateNamespace, DeleteNamespace, UpdateNamespaceProperties | Write |
Like warehouse locks, namespace operations are infrequent and short-lived[7].
Level 3 — Table Registry Shards
Resource: 16 shard files[8] per namespace, selected by xxhash(table_name) & 0xF
This is the first layer where meaningful contention can occur. The registry is sharded into 16 independent partitions, providing ~16x parallelism over a single-file approach.
| Operation | Shards Locked | Strategy |
|---|---|---|
| CreateTable, DeleteTable | 1 shard | Lock the shard containing the table name |
| RenameTable, RenameView | 1–2 shards | Lock source and dest shards in sorted order[9] |
| DeleteNamespace | All 16 shards | Serialize via LockNamespaceFunc, then acquire all in parallel[10] |
Contention scenario: Multiple concurrent CreateTable operations targeting tables that hash to the same shard will serialize at this level.
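The shard-selection rule can be sketched as below. The real system uses xxhash; the standard library's FNV-1a stands in here purely to keep the example self-contained, so the shard numbers are illustrative.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor sketches registry-shard selection: hash the table name and mask
// to one of 16 shards. FNV-1a is a stand-in for the xxhash used in the
// actual implementation.
func shardFor(tableName string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(tableName))
	return h.Sum32() & 0xF // 0xF == tableRegistryShardCount - 1
}

func main() {
	// Two table names that hash to the same shard will serialize their
	// create/delete operations; names in different shards proceed in parallel.
	for _, name := range []string{"orders", "orders_v2", "customers"} {
		fmt.Printf("%s -> shard %d\n", name, shardFor(name))
	}
}
```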
Level 4 — Metadata Pointer (Per-Table)
Resource: catalog/{warehouse}/tables/{uuid}.bin (one per table, UUID-based)
This is the finest-grained lock and the hot path for table mutations. The system uses two different concurrency strategies here:
| Operation | Strategy |
|---|---|
| Regular single-table commits | Optimistic (CAS, no lock) |
| Staged-to-live commits | Pessimistic (pointer lock → shard lock) |
| Rename operations | Pessimistic (shard lock → pointer lock) |
| Multi-table transactions | Pessimistic (all pointer locks in sorted order, non-blocking) |
Concurrency Strategies
Optimistic Concurrency (CAS) — The Hot Path
Regular single-table commits bypass distributed locking entirely and use ETag-based Compare-And-Swap[11]:
1. Read metadata pointer (get current ETag)
2. Read and validate table metadata
3. Apply updates, write new metadata
4. CAS write pointer — succeeds only if ETag unchanged

If a concurrent writer changed the pointer between steps 1 and 4, the CAS fails with a PreConditionFailed error (surfaced as TableUpdateConflict). The client retries.
Why this matters: The common case (single-table writes) incurs zero distributed lock RPCs, making it the lowest-latency path.
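The optimistic loop can be sketched with an in-memory stand-in for the metadata pointer. All names here are illustrative, not the AIStor Tables API; the point is the read → update → CAS → retry shape.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errPreconditionFailed = errors.New("precondition failed") // surfaced as TableUpdateConflict

// pointerStore models the metadata pointer: a value guarded by an ETag that
// changes on every successful write.
type pointerStore struct {
	mu    sync.Mutex
	etag  int
	value string
}

func (p *pointerStore) read() (string, int) {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.value, p.etag
}

// casWrite succeeds only if the caller's ETag still matches the pointer's.
func (p *pointerStore) casWrite(newValue string, etag int) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.etag != etag {
		return errPreconditionFailed
	}
	p.value = newValue
	p.etag++
	return nil
}

// commit is the optimistic loop: read the pointer, apply the update,
// CAS-write, and retry from a fresh read on conflict.
func commit(p *pointerStore, update func(string) string) string {
	for {
		cur, etag := p.read()
		next := update(cur)
		if err := p.casWrite(next, etag); err == nil {
			return next
		}
		// A concurrent writer moved the pointer between read and CAS: retry.
	}
}

func main() {
	p := &pointerStore{value: "v0"}
	fmt.Println(commit(p, func(string) string { return "v1" })) // prints v1
	// A writer holding a stale ETag gets a conflict and must retry.
	fmt.Println(p.casWrite("v2", 0))
}
```

Note that no lock is held across the whole commit; the only serialization point is the CAS itself, which is why this path scales well under low write contention.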
Pessimistic Locking — Multi-Table Transactions
Multi-table transactions acquire exclusive locks on all participating table pointers before making any changes[12]:
1. Discover participating tables
2. Recover any pending failed transactions
3. Acquire pointer locks (sorted by path, non-blocking)
4. Validate all table states
5. Write atomic transaction log (WAL)
6. Commit all tables
7. Delete transaction log
8. Release locks

Key design decisions:
- Sorted lock order: Pointer paths are sorted lexicographically before acquisition to prevent deadlock[13]
- Non-blocking: Uses immediate-fail timeout — returns OperationConflict instead of waiting[14]
- Reverse release: Locks are released in reverse order on failure
Deadlock Prevention
The system employs four mechanisms to prevent deadlocks:
1. Lexicographic Lock Ordering
When acquiring multiple locks of the same type, paths are always sorted[9]:
```go
// Rename: always lock in lexicographical order to prevent deadlock
if sourcePath > destPath {
	firstPath, secondPath = destPath, sourcePath
} else {
	firstPath, secondPath = sourcePath, destPath
}
```
2. Non-Blocking Multi-Resource Acquisition
Multi-table transactions use non-blocking lock attempts. If any lock cannot be acquired immediately, all held locks are released and the operation fails with OperationConflict rather than blocking.
3. Serialization for All-Shard Operations
Operations requiring all 16 shard locks (e.g., DeleteNamespace) first acquire a higher-level serialization lock (LockNamespaceFunc)[10] to prevent interleaved shard acquisition between concurrent callers.
4. Disallowing Conflicting Lock Orders
Staged table renames are disallowed because commitStagedToLive locks pointer-then-registry while rename locks registry-then-pointer — allowing both would create a lock ordering inversion.
Lock Contention Analysis
Where Contention Is Most Likely
| Resource | Contention Risk | Cause | Impact |
|---|---|---|---|
| Table registry shard | Medium | Concurrent creates/deletes hashing to same shard | Operations serialize within the shard |
| Metadata pointer (CAS) | Low-Medium | Concurrent writers to the same table | Retry on conflict, no blocking |
| Metadata pointer (locked) | Medium | Multi-table transactions overlapping with renames or staged commits | OperationConflict returned |
| All-shard lock | Low | Concurrent DeleteNamespace operations | Serialized by LockNamespaceFunc |
| Warehouse/namespace registry | Very Low | Rare structural operations | Minimal in practice |
Contention Patterns by Workload
High-throughput single-table writes (most common):
- Uses CAS — no distributed lock contention
- Retries on ETag mismatch are fast and local
- Scales linearly with cluster size
Concurrent multi-table transactions targeting overlapping tables:
- Non-blocking acquisition means fast failure, not blocking
- Clients should implement retry with backoff
- Contention increases with transaction scope (more tables means a higher probability of lock overlap)
Bulk table creation in a single namespace:
- 16 shards provide good parallelism
- Worst case: all new tables hash to the same shard (statistically unlikely)
- Expected throughput: ~16x better than a single-lock design
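To quantify "statistically unlikely": assuming uniform hashing, the probability that all n new tables land in the same shard is 16 × (1/16)^n = (1/16)^(n−1). The snippet below just evaluates that formula; it is an illustration, not part of the system.

```go
package main

import (
	"fmt"
	"math"
)

// pAllSameShard returns the probability that n uniformly hashed table names
// all select the same one of 16 registry shards: (1/16)^(n-1).
func pAllSameShard(n int) float64 {
	return math.Pow(1.0/16.0, float64(n-1))
}

func main() {
	for _, n := range []int{2, 5, 10} {
		fmt.Printf("n=%d  P(all same shard)=%.3g\n", n, pAllSameShard(n))
	}
}
```

Even at n=5 the worst case is already below 2 in 100,000, so in practice bulk creates spread close to evenly across the 16 shards.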
Namespace deletion during active table operations:
- All-shard lock blocks all concurrent creates/deletes in that namespace
- Short-lived — only held for the registry cleanup
Recovery and Lock Interaction
Failed multi-step transactions (multi-table, staged-to-live, rename) are detected lazily and recovered in the background:
- Recovery uses the same non-blocking lock acquisition
- Retries with exponential backoff (50ms → 1s) and jitter on lock contention
- Maximum 50 attempts over a 5-minute window
- A Write-Ahead Log (WAL) at catalog/{warehouse}/.tx/pending/{txID}.bin ensures atomicity across crashes
Per-Operation Lock Summary
| Operation | Locks Acquired (in order) |
|---|---|
| CreateWarehouse | Warehouse registry (W) |
| DeleteWarehouse | Namespace registry (W) → Warehouse registry (W) |
| ListWarehouses | Warehouse registry (R) |
| CreateNamespace | Namespace registry (W) |
| DeleteNamespace | Namespace func lock → All 16 shard locks (parallel) → Namespace registry (W) |
| CreateTable | Table registry shard (W) |
| DeleteTable | Table registry shard (W) |
| CommitTable (regular) | None — uses CAS on pointer |
| CommitTable (staged→live) | Pointer lock (W) → Registry shard lock (W) |
| RenameTable | Registry shard lock(s) (W, sorted) → Pointer lock (W) |
| Multi-table transaction | All pointer locks (W, sorted, non-blocking) |
| LoadTable | Pointer (R) |
(W) = Write lock, (R) = Read lock
Source Code References
1. cmd/namespace-lock.go:41-46 — RWLocker interface definition
2. cmd/namespace-lock.go:37-38 — _MINIO_COALESCED_LOCKS env var, enabled by default
3. cmd/coalesced-lock.go:42-56 — coalescedDistLock struct: "Multiple goroutines can share a single distributed RLock, reducing RPC overhead"
4. cmd/tables-api-utils.go:134-135 — tablesLockTimeout = newDynamicTimeout(30*time.Second, 5*time.Second)
5. cmd/tables-api-utils.go:141-143 — tablesLockTimeoutNonBlocking with nonBlocking: true
6. cmd/tables-api-utils.go:2390-2407 — lockWarehouseRegistryForRead and lockWarehouseRegistryForWrite
7. cmd/tables-api-utils.go:2412-2415 — lockNamespaceRegistryForWrite
8. cmd/tables-api-utils.go:115 — tableRegistryShardCount = 16
9. cmd/tables-api-utils.go:2148-2154 — Lexicographic lock ordering in LockTableRegistryShardsForRename
10. cmd/tables-api-utils.go:2062-2067 — LockAllTableRegistryShards with serialization via LockNamespaceFunc
11. cmd/tables-metadata-pointer.go:479-482 — BuildApplyCASWriteWithInfo CAS-based commit
12. cmd/tables-transactions.go:530-546 — AcquireLocks with sorted paths and non-blocking acquisition
13. cmd/tables-transactions.go:539 — slices.Sort(paths) for deterministic lock ordering
14. cmd/tables-transactions.go:546 — tablesLockTimeoutNonBlocking usage in multi-table lock acquisition