fix(net): resolve 1GB upload hang and macos build (Issue #1001 regression) (#1035)

Jitter
2025-12-07 15:35:51 +05:30
committed by GitHub
parent 5f256249f4
commit cd6a26bc3a
4 changed files with 295 additions and 238 deletions

@@ -25,6 +25,21 @@ To resolve this, we needed to transform the passive failure detection (waiting f
## 3. Implemented Solution
We modified the internal gRPC client configuration in `crates/protos/src/lib.rs` to implement a multi-layered health check strategy.
### Solution Overview
The fix implements a multi-layered detection strategy covering both Control Plane (RPC) and Data Plane (Streaming):
1. **Control Plane (gRPC)**:
    * Enabled `http2_keep_alive_interval` (5s) and `keep_alive_timeout` (3s) in `tonic` clients.
    * Enforced `tcp_keepalive` (10s) on the underlying transport.
    * Context: Ensures cluster metadata operations (Raft, status checks) fail fast if a node dies.
2. **Data Plane (File Uploads/Downloads)**:
    * **Client (Rio)**: Updated the `reqwest` client builder in `crates/rio` to enable TCP Keepalive (10s) and HTTP/2 Keepalive (5s). This prevents hangs during large file streaming (e.g., 1GB uploads).
    * **Server**: Enabled `SO_KEEPALIVE` on all incoming TCP connections in `rustfs/src/server/http.rs` so that connections to dead clients are detected and their sockets closed instead of lingering.
3. **Cross-Platform Build Stability**:
    * Guarded Linux-specific profiling code (`jemalloc_pprof`) with `#[cfg(target_os = "linux")]` to fix build failures on macOS/AArch64.
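The platform guard in item 3 can be illustrated with a minimal, standalone sketch. The function `profiling_backend` and its return values are hypothetical stand-ins, not the actual RustFS code. They only show how a Linux-only item is paired with a fallback so that other targets still compile.

```rust
// Hypothetical stand-in for the Linux-only profiling hook. The real code
// depends on `jemalloc_pprof`, which is not reproduced here.
#[cfg(target_os = "linux")]
fn profiling_backend() -> &'static str {
    "jemalloc_pprof"
}

// Fallback compiled on every other target (macOS, Windows, ...), so callers
// never reference a symbol that is missing on their platform.
#[cfg(not(target_os = "linux"))]
fn profiling_backend() -> &'static str {
    "disabled"
}

fn main() {
    // Exactly one of the two definitions above exists in any given build.
    println!("profiling backend: {}", profiling_backend());
}
```

Pairing each `#[cfg(target_os = "linux")]` item with a `#[cfg(not(...))]` counterpart keeps call sites free of platform checks. Whether the actual change adds such a fallback or guards the call sites directly is not stated in the description above.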
### Configuration Changes
```rust