* fix getobject content length resp
* Fix regression in exception handling for non-existent key with enhanced compression predicate and metadata improvements (#915)
* Initial plan
* Fix GetObject regression by excluding error responses from compression
The issue was that CompressionLayer was attempting to compress error responses,
which could cause Content-Length header mismatches. By excluding 4xx and 5xx
responses from compression, we ensure error responses (like NoSuchKey) are sent
correctly without body truncation.
Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>
* Enhance NoSuchKey fix with improved compression predicate and comprehensive tests
- Enhanced ShouldCompress predicate with size-based exclusion (< 256 bytes)
- Added detailed documentation explaining the compression logic
- Added debug logging for better observability
- Created comprehensive test suite with 4 test cases:
- test_get_deleted_object_returns_nosuchkey
- test_head_deleted_object_returns_nosuchkey
- test_get_nonexistent_object_returns_nosuchkey
- test_multiple_gets_deleted_object
- Added extensive inline documentation and comments
- Created docs/fix-nosuchkey-regression.md with full analysis
Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>
* Add compression best practices documentation
Added comprehensive guide covering:
- Best practices for HTTP response compression
- Common pitfalls and solutions
- Performance considerations and trade-offs
- Testing guidelines and examples
- Monitoring and alerting recommendations
- Migration guide for existing services
Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>
* fix
* fmt
* fmt
* Fix/objectdelete (#917)
* fix getobject content length resp
* fix delete object
---------
Co-authored-by: houseme <housemecn@gmail.com>
* Add comprehensive analysis of NoSuchKey fix and related improvements
Created detailed documentation analyzing:
- HTTP compression layer fix (primary issue)
- Content-length calculation fix from PR #917
- Delete object metadata fixes from PR #917
- How all components work together
- Complete scenario walkthrough
- Performance impact analysis
- Testing strategy and deployment checklist
This ties together all the changes in the PR branch including the merged
improvements from PR #917.
Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>
* replace `once_cell` to `std`
* fmt
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>
Co-authored-by: houseme <housemecn@gmail.com>
Co-authored-by: weisd <im@weisd.in>
* fmt
---------
Co-authored-by: weisd <weishidavip@163.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>
Co-authored-by: weisd <im@weisd.in>
* Initial plan
* Implement priority-based heal queue with deduplication
Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>
* Apply cargo fmt formatting fixes
Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>
* Add queue monitoring, better error handling, and adaptive processing
- Add priority-based queue statistics tracking
- Implement queue capacity warnings (>80% full)
- Process multiple tasks per cycle when capacity allows
- Add proper error logging for failed heal request submissions
- Add Hash trait to HealPriority for HashMap support
- Improve observability with detailed queue status logs
Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>
* Add comprehensive diagnostic logging to heal flow
- Add detailed logging for shard availability checks during reads
- Log successful heal request submissions
- Add detailed logging in heal_object for disk checking
- Log each disk's heal decision with reason and endpoint
- Add logging for heal writing and renaming phases
- Improve variable naming (nil_count -> available_shards)
- Add success confirmation messages for renamed healed data
These logs will help diagnose where the heal flow is failing.
Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>
* fix
* feat(heal): enhance robustness, safety, and observability of heal process
- **Logging & Observability**:
- Upgrade critical heal logs from [info](/crates/e2e_test/src/reliant/node_interact_test.rs:196:0-213:1) to `warn` for better visibility.
- Implement structured logging with `tracing` fields for machine readability.
- Add `#[tracing::instrument]` to [HealTask](c/crates/ahm/src/heal/task.rs:182:0-205:1) and [SetDisks](/crates/ecstore/src/set_disk.rs:120:0-131:1) methods for automatic context propagation.
- **Robustness**:
- Add exponential backoff retry (3 attempts) for acquiring write locks in [heal_object](/crates/ahm/src/heal/storage.rs:438:4-460:5) to handle contention.
- Handle [rename_data](/crates/ecstore/src/set_disk.rs:392:4-516:5) failures gracefully by preserving temporary files instead of forcing deletion, preventing potential data loss.
- **Data Safety**:
- Fix [object_exists](/crates/ahm/src/heal/storage.rs:395:4-412:5) to propagate IO errors instead of treating them as "object not found".
- Update [ErasureSetHealer](/crates/ahm/src/heal/erasure_healer.rs:28:0-33:1) to mark objects as failed rather than skipped when existence checks error, ensuring they are tracked for retry.
* fix
* fmt
* improve code for heal_object
* fix
* fix
* fix
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>
Co-authored-by: houseme <housemecn@gmail.com>
Avoid setting AuditSystemState::Starting when target list is empty.
Now checks target availability before state transition, keeping the
system in Stopped state if no enabled targets are found.
- Check targets.is_empty() before setting Starting state
- Return early with Ok(()) when no targets exist
- Maintain consistent state machine behavior
- Prevent transient "Starting" state with no actual targets
Resolves issue where audit system would incorrectly enter Starting
state even when configuration contained no enabled targets.
* feat(obs): unify metrics initialization and fix exporter move error
- Fix Rust E0382 (use after move) by removing duplicate MetricExporter consumption.
- Consolidate MeterProvider construction into single Recorder builder path.
- Remove redundant Recorder::builder(...).install_global() call.
- Ensure PeriodicReader setup is performed only once (HTTP + optional stdout).
- Set global meter provider and metrics recorder exactly once.
- Preserve existing behavior for stdout/file vs HTTP modes.
- Minor cleanup: consistent resource reuse and interval handling.
* update telemetry.rs
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* fix
* fix
* fix
* fix: modify logger level from error to event
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* fix
* chore: upgrade cryptography libraries to RC versions
- Upgrade aes-gcm to 0.11.0-rc.2 with rand_core support
- Upgrade chacha20poly1305 to 0.11.0-rc.2
- Upgrade argon2 to 0.6.0-rc.2 with std features
- Upgrade hmac to 0.13.0-rc.3
- Upgrade pbkdf2 to 0.13.0-rc.2
- Upgrade rsa to 0.10.0-rc.10
- Upgrade sha1 and sha2 to 0.11.0-rc.3
- Upgrade md-5 to 0.11.0-rc.3
These upgrades provide enhanced security features and performance
improvements while maintaining backward compatibility with existing
encryption workflows.
* add
* improve code
* fix
* improve code for audit
* improve code ecfs.rs
* improve code
* improve code for ecfs.rs
* feat(storage): refactor audit and notification with OperationHelper
This commit introduces a significant refactoring of the audit logging and event notification mechanisms within `ecfs.rs`.
The core of this change is the new `OperationHelper` struct, which encapsulates and simplifies the logic for both concerns. It replaces the previous `AuditHelper` and manual event dispatching.
Key improvements include:
- **Unified Handling**: `OperationHelper` manages both audit and notification builders, providing a single, consistent entry point for S3 operations.
- **RAII for Automation**: By leveraging the `Drop` trait, the helper automatically dispatches logs and notifications when it goes out of scope. This simplifies S3 method implementations and ensures cleanup even on early returns.
- **Fluent API**: A builder-like pattern with methods such as `.object()`, `.version_id()`, and `.suppress_event()` makes the code more readable and expressive.
- **Context-Aware Logic**: The helper's `.complete()` method intelligently populates log details based on the operation's `S3Result` and only triggers notifications on success.
- **Modular Design**: All helper logic is now isolated in `rustfs/src/storage/helper.rs`, improving separation of concerns and making `ecfs.rs` cleaner.
This refactoring significantly enhances code clarity, reduces boilerplate, and improves the robustness of logging and notification handling across the storage layer.
* fix
* fix
* fix
* fix
* fix
* fix
* fix
* improve code for audit and notify
* fix
* fix
* fix
* improve code for metrics
* improve code for metrics
* fix
* fix
* Refactor telemetry initialization and environment functions ordering
- Reorder functions in envs.rs by type size (8-bit to 64-bit, signed before unsigned) and add missing variants like get_env_opt_u16.
- Optimize init_telemetry to support three modes: stdout logging (default error level with span tracing), file rolling logs (size-based with retention), and HTTP-based observability with sub-endpoints (trace, metric, log) falling back to unified endpoint.
- Fix stdout logging issue by retaining WorkerGuard in OtelGuard to prevent premature release of async writer threads.
- Enhance observability mode with HTTP protocol, compression, and proper resource management.
- Update OtelGuard to include tracing_guard for stdout and flexi_logger_handles for file logging.
- Improve error handling and configuration extraction in OtelConfig.
* fix
* up
* fix
* fix
* improve code for obs
* fix
* fix
* fix: resolve logic errors in ahm heal module
- Fix response publishing logic in HealChannelProcessor to properly handle errors
- Fix negative index handling in DiskStatusChange event to fail fast instead of silently converting to 0
- Enhance timeout control in heal_erasure_set Step 3 loop to immediately respond to cancellation/timeout
- Add proper error propagation for task cancellation and timeout in bucket healing loop
* fix: stabilize performance impact measurement test
- Increase measurement count from 3 to 5 runs for better stability
- Increase workload from 5000 to 10000 operations for more accurate timing
- Use median of 5 measurements instead of single measurement
- Ensure with_scanner duration is at least baseline to avoid negative overhead
- Increase wait time for scanner state stabilization
* wip
* Update crates/ahm/src/heal/channel.rs
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* refactor: remove redundant ok_or_else + expect in event.rs
Replace redundant ok_or_else() + expect() pattern with
unwrap_or_else() + panic!() to avoid creating unnecessary Error
type when the value will panic anyway. This also defers error
message formatting until the error actually occurs.
* Update crates/ahm/src/heal/task.rs
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* fix(ahm): fix logic errors and add unit tests
- Fix panic in HealEvent::to_heal_request for invalid indices
- Replace unwrap() calls with proper error handling in resume.rs
- Fix race conditions and timeout calculation in task.rs
- Fix semaphore acquisition error handling in erasure_healer.rs
- Improve error message for large objects in storage.rs
- Add comprehensive unit tests for progress, event, and channel modules
- Fix clippy warning: move test module to end of file in heal_channel.rs
* style: apply cargo fmt formatting
* refactor(ahm): address copilot review suggestions
- Add comment to check_control_flags explaining why return value is discarded
- Fix hardcoded median index in performance test using constant and dynamic calculation
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>