During a deployment expansion, what happens to the performance of the system?

Asked by muratkars Answered by muratkars July 17, 2025

Understanding scaling behavior and performance impact during expansion is critical for planning large-scale deployments and maintaining service levels during growth phases.

This question addresses:

  • Storage scaling mechanics and minimum units
  • Performance impact during expansion
  • Rebalancing behavior and triggers
  • Best practices for zero-impact scaling

Answer

Scaling adds capacity as new server pools, each a distinct set of hardware and drives. Expansion does not trigger rebalancing of existing data, and cluster performance remains dictated by network bandwidth rather than drive throughput, regardless of scaling strategy.

Scaling Mechanics

Server Pool Architecture:

  • Each expansion creates a new server pool
  • Pools are distinct sets of hardware and drives
  • No data movement between existing and new pools
  • No rebalancing triggered by expansion
  • Stripe size maintained across all pools

Minimum Unit of Scale:

Minimum expansion = 1 complete server pool
- Must match erasure coding requirements
- Example: EC 8:3 (8 data + 3 parity shards) requires at least 11 drives per erasure set
- Typically: 4-16 servers per pool
- Pool size determined by erasure set configuration
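The pool-sizing rule above can be sketched as a quick shell check; the EC layout and server counts below are illustrative, not prescriptive:

```shell
# Drives in a new pool must divide evenly into erasure sets
# (stripe = data + parity shards). All values here are hypothetical.
data_shards=8
parity_shards=3
stripe=$((data_shards + parity_shards))   # 11 drives per erasure set
servers=4
drives_per_server=11
total=$((servers * drives_per_server))
if [ $((total % stripe)) -eq 0 ]; then
  echo "OK: $total drives form $((total / stripe)) erasure sets"
else
  echo "Invalid: $total drives not divisible by stripe size $stripe"
fi
```

The same divisibility check applies to any candidate pool layout before hardware is ordered.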

Performance During Expansion

Key Principle: Performance is network-limited, not storage-limited.

During Expansion:

  1. Existing performance maintained - No impact on current operations
  2. New capacity immediately available - Pool ready for writes
  3. Linear performance scaling - New pool adds full bandwidth
  4. No degradation period - Unlike traditional storage arrays

Network-First Performance Planning

Critical Consideration: Network infrastructure must scale with storage.

Example Scenario:

Original: 20 PiB deployment over 4 racks
- Monitor network saturation first
- Drive saturation occurs after network bottlenecks
- Bottleneck order: server NIC → top-of-rack switch → spine → core router

Expansion Planning:
- Add 20 PiB or 40 PiB as new server pool
- Scale network proportionally
- Match or exceed original baseline performance
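"Scale network proportionally" boils down to holding network capacity per PiB constant. A rough awk sketch using the scenario's numbers (treated here as illustrative):

```shell
# Added network capacity needed to keep Tbps-per-PiB constant
awk 'BEGIN {
  base_pib  = 20; base_tbps = 6.4   # existing deployment (hypothetical)
  add_pib   = 20                    # capacity of the new server pool
  need_tbps = base_tbps * add_pib / base_pib
  printf "new pool needs ~%.1f Tbps of additional network capacity\n", need_tbps
}'
```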

Performance Monitoring Priorities

1. Network Bottlenecks (Primary Focus):

  • NIC utilization per server
  • Spine switch capacity
  • Core network bandwidth
  • Router throughput

2. Storage Performance (Secondary):

  • Drive IOPS and throughput
  • Queue depths
  • Response times

3. System Resources (Tertiary):

  • CPU utilization
  • Memory usage
  • Storage controller performance

Expansion Best Practices

Pre-Expansion Assessment:

Terminal window
# Check current network utilization
mc admin prometheus metrics myminio | grep network_
# Monitor bandwidth per server
mc admin trace myminio --json | jq '.time, .callStats.rx, .callStats.tx'
# Assess current bottlenecks
mc admin speedtest myminio

Network Capacity Planning:

If existing setup shows:
- >70% network utilization
- Network bottlenecks evident
- Performance plateauing
Recommendation: Address network infrastructure BEFORE expansion
- Upgrade NICs (e.g., 100 GbE → 400 GbE)
- Increase spine capacity
- Add core network bandwidth
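The go/no-go decision above is easy to automate once utilization is measured; the 85% figure below is a stand-in for a real monitoring reading:

```shell
# Gate expansion on measured network utilization (70% threshold per guidance above)
util=85        # percent, hypothetical value pulled from monitoring
threshold=70
if [ "$util" -gt "$threshold" ]; then
  echo "Defer expansion: upgrade network first (${util}% > ${threshold}%)"
else
  echo "Network headroom OK: proceed with expansion"
fi
```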

Zero-Impact Scaling Strategy

1. Infrastructure Preparation:

  • Provision network capacity for new pool
  • Ensure power and cooling adequate
  • Configure network segments

2. Pool Addition Process:

Terminal window
# A new pool is added by appending its endpoints to the server startup
# command on every host, then restarting all nodes together:
minio server https://pool1{1...8}.example.com/data{1...4} \
             https://newpool{1...8}.example.com/data{1...4}
# Restart the deployment to bring the new pool online
mc admin service restart myminio
# Verify pool health
mc admin info myminio
# Monitor performance impact (should be none)
mc admin prometheus metrics myminio

3. Client Distribution:

  • Update client configurations for new endpoints
  • Use load balancer for automatic distribution
  • Implement least-connections balancing

Real-World Scaling Example

Phase 1 - Baseline (4 racks, 20 PiB):

Hardware: 64 servers, 100 GbE each
Network: 6.4 Tbps aggregate capacity
Performance: 600 GB/s reads, 400 GB/s writes
Utilization: 60% network, 40% storage

Phase 2 - Network Bottleneck Identified:

Symptoms: Performance plateauing at network limits
Utilization: 85% network, 45% storage
Solution: Upgrade to 200 GbE before expansion

Phase 3 - Expansion (8 racks, 40 PiB):

Hardware: 128 servers, 200 GbE each
Network: 25.6 Tbps aggregate capacity
Performance: 1.2 TB/s reads, 800 GB/s writes
Utilization: 50% network, 40% storage
Result: Linear performance scaling achieved
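The aggregate-capacity figures quoted in the phases above are straightforward to verify as servers × NIC speed:

```shell
# Check the aggregate network capacity quoted for each phase
awk 'BEGIN {
  phase1 = 64  * 100 / 1000   # 64 servers × 100 GbE = 6.4 Tbps
  phase3 = 128 * 200 / 1000   # 128 servers × 200 GbE = 25.6 Tbps
  printf "phase1: %.1f Tbps, phase3: %.1f Tbps (%.0fx)\n", phase1, phase3, phase3 / phase1
}'
```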

Advanced Considerations

Mixed Pool Sizes:

  • Different pools can have different hardware
  • Performance characteristics may vary
  • Plan client placement accordingly

Geographic Distribution:

Terminal window
# Each site is its own deployment; expand pools per site as shown above.
# Register site aliases for site-aware client routing
mc alias set site1 https://site1.minio.example.com
mc alias set site2 https://site2.minio.example.com
# Optionally keep sites in sync with site replication
mc admin replicate add site1 site2

When NOT to Scale

Defer Expansion If:

  1. Network utilization >80%
  2. Existing performance issues unresolved
  3. Infrastructure capacity constraints
  4. Ongoing maintenance windows

Address First:

  • Network bandwidth upgrades
  • Switch fabric optimization
  • Load balancer tuning
  • Client distribution improvement

Performance Validation

Post-Expansion Verification:

Terminal window
# Validate linear scaling
mc admin speedtest myminio --duration 60s
# Check pool distribution
mc admin info myminio | grep "Server Pool"
# Monitor sustained performance
mc admin prometheus metrics myminio | grep -E "(bandwidth|throughput)"

Expected Results:

  • Performance scales linearly with pools
  • No degradation during operation
  • Network remains primary constraint
  • Storage utilization balanced
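"Scales linearly" can be turned into a concrete pass/fail check; the per-pool throughput, measured value, and 10% tolerance below are all assumptions for illustration:

```shell
# Compare measured throughput against the linear expectation, within 10%
pools=2
per_pool_gbs=600     # hypothetical per-pool read throughput, GB/s
measured_gbs=1150    # hypothetical post-expansion speedtest result
expected=$((pools * per_pool_gbs))
floor=$((expected * 90 / 100))
if [ "$measured_gbs" -ge "$floor" ]; then
  echo "Linear scaling OK: ${measured_gbs} GB/s >= ${floor} GB/s"
else
  echo "Below linear: ${measured_gbs} GB/s < ${floor} GB/s"
fi
```

Feed the real numbers in from `mc admin speedtest` output rather than hard-coding them.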

Key Takeaways

MinIO’s server pool architecture enables:

  • Zero-impact scaling - No performance degradation
  • No rebalancing overhead - Instant capacity availability
  • Linear performance growth - Predictable scaling behavior
  • Network-first optimization - Focus on the real bottleneck
  • Flexible expansion - Scale in increments that match business needs

Success depends on growing network infrastructure in proportion to storage, so the network can carry both existing workloads and the traffic generated by the new capacity.
