rustfs/docs/CONCURRENT_PERFORMANCE_OPTIMIZATION.md

# Concurrent GetObject Performance Optimization

## Problem Statement

When multiple concurrent GetObject requests are made to RustFS, performance degrades exponentially:

| Concurrency Level | Single Request Latency | Performance Impact |
|------------------|----------------------|-------------------|
| 1 request        | 59ms                 | Baseline          |
| 2 requests       | 110ms                | 1.9x slower       |
| 4 requests       | 200ms                | 3.4x slower       |

## Root Cause Analysis

The performance degradation was caused by several factors:

1. **Fixed Buffer Sizing**: Using `DEFAULT_READ_BUFFER_SIZE` (1MB) for all requests, regardless of concurrent load
   - High memory contention under concurrent load
   - Inefficient cache utilization
   - CPU context switching overhead

2. **No Concurrency Control**: Unlimited concurrent disk reads causing I/O saturation
   - Disk I/O queue depth exceeded optimal levels
   - Increased seek times on traditional disks
   - Resource contention between requests

3. **Lack of Caching**: Repeated reads of the same objects
   - No reuse of frequently accessed data
   - Unnecessary disk I/O for hot objects

## Solution Architecture

### 1. Concurrency-Aware Adaptive Buffer Sizing

The system now dynamically adjusts buffer sizes based on the current number of concurrent GetObject requests:

```rust
let optimal_buffer_size = get_concurrency_aware_buffer_size(file_size, base_buffer_size);
```

#### Buffer Sizing Strategy

| Concurrent Requests | Buffer Size Multiplier | Typical Buffer | Rationale |
|--------------------|----------------------|----------------|-----------|
| 1-2 (Low)          | 1.0x (100%)          | 512KB-1MB      | Maximize throughput with large buffers |
| 3-4 (Medium)       | 0.75x (75%)          | 256KB-512KB    | Balance throughput and fairness |
| 5-8 (High)         | 0.5x (50%)           | 128KB-256KB    | Improve fairness, reduce memory pressure |
| 9+ (Very High)     | 0.4x (40%)           | 64KB-128KB     | Ensure fair scheduling, minimize memory |

#### Benefits
- **Reduced memory pressure**: Smaller buffers under high concurrency prevent memory exhaustion
- **Better cache utilization**: More requests fit in CPU cache with smaller buffers
- **Improved fairness**: Prevents large requests from starving smaller ones
- **Adaptive performance**: Automatically tunes for different workload patterns

### 2. Hot Object Caching (LRU)

Implemented an intelligent LRU cache for frequently accessed small objects:

```rust
pub struct HotObjectCache {
    max_object_size: usize,      // Default: 10MB
    max_cache_size: usize,       // Default: 100MB
    cache: RwLock<lru::LruCache<String, Arc<CachedObject>>>,
}
```

#### Caching Policy
- **Eligible objects**: Size ≤ 10MB, complete object reads (no ranges)
- **Eviction**: LRU (Least Recently Used)
- **Capacity**: Up to 1000 objects, 100MB total
- **Exclusions**: Encrypted objects, partial reads, multipart

#### Benefits
- **Reduced disk I/O**: Cache hits eliminate disk reads entirely
- **Lower latency**: Memory access is 100-1000x faster than disk
- **Higher throughput**: Free up disk bandwidth for cache misses
- **Better scalability**: Cache hit ratio improves with concurrent load

### 3. Disk I/O Concurrency Control

Added a semaphore to limit maximum concurrent disk reads:

```rust
disk_read_semaphore: Arc<Semaphore>  // Default: 64 permits
```

#### Benefits
- **Prevents I/O saturation**: Limits queue depth to optimal levels
- **Predictable latency**: Avoids exponential latency increase
- **Protects disk health**: Reduces excessive seek operations
- **Graceful degradation**: Queues requests rather than thrashing

### 4. Request Tracking and Monitoring

Implemented RAII-based request tracking with automatic cleanup:

```rust
pub struct GetObjectGuard {
    start_time: Instant,
}

impl Drop for GetObjectGuard {
    fn drop(&mut self) {
        ACTIVE_GET_REQUESTS.fetch_sub(1, Ordering::Relaxed);
        // Record metrics
    }
}
```

#### Metrics Collected
- `rustfs_concurrent_get_requests`: Current concurrent request count
- `rustfs_get_object_requests_completed`: Total completed requests
- `rustfs_get_object_duration_seconds`: Request duration histogram
- `rustfs_object_cache_hits`: Cache hit count
- `rustfs_object_cache_misses`: Cache miss count
- `rustfs_buffer_size_bytes`: Buffer size distribution

## Performance Expectations

### Expected Improvements

Based on the optimizations, we expect:

| Concurrency Level | Before | After (Expected) | Improvement |
|------------------|--------|------------------|-------------|
| 1 request        | 59ms   | 55-60ms          | Similar (baseline) |
| 2 requests       | 110ms  | 65-75ms          | ~40% faster |
| 4 requests       | 200ms  | 80-100ms         | ~50% faster |
| 8 requests       | 400ms  | 100-130ms        | ~65% faster |
| 16 requests      | 800ms  | 120-160ms        | ~75% faster |

### Key Performance Characteristics

1. **Sub-linear scaling**: Latency increases sub-linearly with concurrency
2. **Cache benefits**: Hot objects see near-zero latency from cache hits
3. **Predictable behavior**: Bounded latency even under extreme load
4. **Memory efficiency**: Lower memory usage under high concurrency

## Implementation Details

### Integration Points

The optimization is integrated at the GetObject handler level:

```rust
async fn get_object(&self, req: S3Request<GetObjectInput>) -> S3Result<S3Response<GetObjectOutput>> {
    // 1. Track request
    let _request_guard = ConcurrencyManager::track_request();

    // 2. Try cache
    if let Some(cached_data) = manager.get_cached(&cache_key).await {
        return Ok(S3Response::new(output));  // Fast path
    }

    // 3. Acquire I/O permit
    let _disk_permit = manager.acquire_disk_read_permit().await;

    // 4. Calculate optimal buffer size
    let optimal_buffer_size = get_concurrency_aware_buffer_size(
        response_content_length,
        base_buffer_size
    );

    // 5. Stream with optimal buffer
    let body = StreamingBlob::wrap(
        ReaderStream::with_capacity(final_stream, optimal_buffer_size)
    );
}
```

### Configuration

All defaults can be tuned via code changes:

```rust
// In concurrency.rs
const HIGH_CONCURRENCY_THRESHOLD: usize = 8;
const MEDIUM_CONCURRENCY_THRESHOLD: usize = 4;

// Cache settings
max_object_size: 10 * MI_B,      // 10MB
max_cache_size: 100 * MI_B,      // 100MB
disk_read_semaphore: Semaphore::new(64),  // 64 concurrent reads
```

## Testing Recommendations

### 1. Concurrent Load Testing

Use the provided Go client to test different concurrency levels:

```go
concurrency := []int{1, 2, 4, 8, 16, 32}
for _, c := range concurrency {
    // Run test with c concurrent goroutines
    // Measure average latency and P50/P95/P99
}
```

### 2. Hot Object Testing

Test cache effectiveness with repeated reads:

```bash
# Read same object 100 times with 10 concurrent clients
for i in {1..10}; do
    for j in {1..100}; do
        mc cat rustfs/test/bxx > /dev/null
    done &
done
wait
```

### 3. Mixed Workload Testing

Simulate real-world scenarios:
- 70% small objects (<1MB) - should see high cache hit rate
- 20% medium objects (1-10MB) - partial cache benefit
- 10% large objects (>10MB) - adaptive buffer sizing benefit

### 4. Stress Testing

Test system behavior under extreme load:
```bash
# 100 concurrent clients, continuous reads
ab -n 10000 -c 100 http://rustfs:9000/test/bxx
```

## Monitoring and Observability

### Key Metrics to Watch

1. **Latency Percentiles**
   - P50, P95, P99 request duration
   - Should show sub-linear growth with concurrency

2. **Cache Performance**
   - Cache hit ratio (target: >70% for hot objects)
   - Cache memory usage
   - Eviction rate

3. **Resource Utilization**
   - Memory usage per concurrent request
   - Disk I/O queue depth
   - CPU utilization

4. **Throughput**
   - Requests per second
   - Bytes per second
   - Concurrent request count

### Prometheus Queries

```promql
# Average request duration by concurrency level
histogram_quantile(0.95,
  rate(rustfs_get_object_duration_seconds_bucket[5m])
)

# Cache hit ratio
sum(rate(rustfs_object_cache_hits[5m]))
/
(sum(rate(rustfs_object_cache_hits[5m])) + sum(rate(rustfs_object_cache_misses[5m])))

# Concurrent requests over time
rustfs_concurrent_get_requests

# Memory efficiency (bytes per request)
rustfs_object_cache_size_bytes / rustfs_concurrent_get_requests
```

## Future Enhancements

### Potential Improvements

1. **Request Prioritization**
   - Prioritize small requests over large ones
   - Age-based priority to prevent starvation
   - QoS classes for different clients

2. **Advanced Caching**
   - Partial object caching (hot blocks)
   - Predictive prefetching based on access patterns
   - Distributed cache across multiple nodes

3. **I/O Scheduling**
   - Batch similar requests for sequential I/O
   - Deadline-based I/O scheduling
   - NUMA-aware buffer allocation

4. **Adaptive Tuning**
   - Machine learning based buffer sizing
   - Dynamic cache size adjustment
   - Workload-aware optimization

5. **Compression**
   - Transparent compression for cached objects
   - Adaptive compression based on CPU availability
   - Deduplication for similar objects

## References

- [Issue #XXX](https://github.com/rustfs/rustfs/issues/XXX): Original performance issue
- [PR #XXX](https://github.com/rustfs/rustfs/pull/XXX): Implementation PR
- [MinIO Best Practices](https://min.io/docs/minio/linux/operations/install-deploy-manage/performance-and-optimization.html)
- [LRU Cache Design](https://leetcode.com/problems/lru-cache/)
- [Tokio Concurrency Patterns](https://tokio.rs/tokio/tutorial/shared-state)

## Conclusion

The concurrency-aware optimization addresses the root causes of performance degradation:

1. ✅ **Adaptive buffer sizing** reduces memory contention and improves cache utilization
2. ✅ **Hot object caching** eliminates redundant disk I/O for frequently accessed files
3. ✅ **I/O concurrency control** prevents disk saturation and ensures predictable latency
4. ✅ **Comprehensive monitoring** enables performance tracking and tuning

These changes should significantly improve performance under concurrent load while maintaining compatibility with existing clients and workloads.