* Initial plan
* Implement priority-based heal queue with deduplication
Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>
* Apply cargo fmt formatting fixes
Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>
* Add queue monitoring, better error handling, and adaptive processing
- Add priority-based queue statistics tracking
- Implement queue capacity warnings (>80% full)
- Process multiple tasks per cycle when capacity allows
- Add proper error logging for failed heal request submissions
- Add Hash trait to HealPriority for HashMap support
- Improve observability with detailed queue status logs
Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>
* Add comprehensive diagnostic logging to heal flow
- Add detailed logging for shard availability checks during reads
- Log successful heal request submissions
- Add detailed logging in heal_object for disk checking
- Log each disk's heal decision with reason and endpoint
- Add logging for heal writing and renaming phases
- Improve variable naming (nil_count -> available_shards)
- Add success confirmation messages for renamed healed data
These logs will help diagnose where the heal flow is failing.
Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>
* fix
* feat(heal): enhance robustness, safety, and observability of heal process
- **Logging & Observability**:
- Upgrade critical heal logs from [info](/crates/e2e_test/src/reliant/node_interact_test.rs:196:0-213:1) to `warn` for better visibility.
- Implement structured logging with `tracing` fields for machine readability.
- Add `#[tracing::instrument]` to [HealTask](c/crates/ahm/src/heal/task.rs:182:0-205:1) and [SetDisks](/crates/ecstore/src/set_disk.rs:120:0-131:1) methods for automatic context propagation.
- **Robustness**:
- Add exponential backoff retry (3 attempts) for acquiring write locks in [heal_object](/crates/ahm/src/heal/storage.rs:438:4-460:5) to handle contention.
- Handle [rename_data](/crates/ecstore/src/set_disk.rs:392:4-516:5) failures gracefully by preserving temporary files instead of forcing deletion, preventing potential data loss.
- **Data Safety**:
- Fix [object_exists](/crates/ahm/src/heal/storage.rs:395:4-412:5) to propagate IO errors instead of treating them as "object not found".
- Update [ErasureSetHealer](/crates/ahm/src/heal/erasure_healer.rs:28:0-33:1) to mark objects as failed rather than skipped when existence checks error, ensuring they are tracked for retry.
* fix
* fmt
* improve code for heal_object
* fix
* fix
* fix
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>
Co-authored-by: houseme <housemecn@gmail.com>
* fix: resolve logic errors in ahm heal module
- Fix response publishing logic in HealChannelProcessor to properly handle errors
- Fix negative index handling in DiskStatusChange event to fail fast instead of silently converting to 0
- Enhance timeout control in heal_erasure_set Step 3 loop to immediately respond to cancellation/timeout
- Add proper error propagation for task cancellation and timeout in bucket healing loop
* fix: stabilize performance impact measurement test
- Increase measurement count from 3 to 5 runs for better stability
- Increase workload from 5000 to 10000 operations for more accurate timing
- Use median of 5 measurements instead of single measurement
- Ensure with_scanner duration is at least baseline to avoid negative overhead
- Increase wait time for scanner state stabilization
* wip
* Update crates/ahm/src/heal/channel.rs
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* refactor: remove redundant ok_or_else + expect in event.rs
Replace redundant ok_or_else() + expect() pattern with
unwrap_or_else() + panic!() to avoid creating unnecessary Error
type when the value will panic anyway. This also defers error
message formatting until the error actually occurs.
* Update crates/ahm/src/heal/task.rs
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* fix(ahm): fix logic errors and add unit tests
- Fix panic in HealEvent::to_heal_request for invalid indices
- Replace unwrap() calls with proper error handling in resume.rs
- Fix race conditions and timeout calculation in task.rs
- Fix semaphore acquisition error handling in erasure_healer.rs
- Improve error message for large objects in storage.rs
- Add comprehensive unit tests for progress, event, and channel modules
- Fix clippy warning: move test module to end of file in heal_channel.rs
* style: apply cargo fmt formatting
* refactor(ahm): address copilot review suggestions
- Add comment to check_control_flags explaining why return value is discarded
- Fix hardcoded median index in performance test using constant and dynamic calculation
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* fix: remove code
* improve code for tokio runtime config
* improve code for main
* fix: add tokio enable_all
* upgrade version
* improve for Cargo.toml
* fix: fix datausageinfo
Signed-off-by: junxiang Mu <1948535941@qq.com>
* feat(data-usage): implement local disk snapshot aggregation for data usage statistics
Signed-off-by: junxiang Mu <1948535941@qq.com>
* feat(scanner): improve data usage collection with local scan aggregation
Signed-off-by: junxiang Mu <1948535941@qq.com>
* refactor: improve object existence check and code style
Signed-off-by: junxiang Mu <1948535941@qq.com>
---------
Signed-off-by: junxiang Mu <1948535941@qq.com>
* Refactor: reimplement lock
Signed-off-by: junxiang Mu <1948535941@qq.com>
* Fix: fix test case failed
Signed-off-by: junxiang Mu <1948535941@qq.com>
* Improve: lock pref
Signed-off-by: junxiang Mu <1948535941@qq.com>
* fix(lock): Fix resource cleanup issue when batch lock acquisition fails
Ensure that the locks already acquired are properly released when batch lock acquisition fails to avoid memory leaks
Improve the lock protection mechanism to prevent double release issues
Add complete Apache license declarations to all files
Signed-off-by: junxiang Mu <1948535941@qq.com>
---------
Signed-off-by: junxiang Mu <1948535941@qq.com>
- Merge all AI rules from .rules.md, .cursorrules, and CLAUDE.md into AGENTS.md
- Add competitor keyword prohibition rules (minio, ceph, swift, etc.)
- Simplify rules by removing overly detailed code examples
- Integrate new development principles as highest priority
- Remove old tool-specific rule files
- Fix clippy warnings for format string improvements
Change the object integrity verification from reading all data to streaming processing to avoid memory overflow caused by large objects.
Modify the TLS key log check to use environment variables directly instead of configuration constants.
Add memory limits for object data reading in the AHM module.
Signed-off-by: junxiang Mu <1948535941@qq.com>
* feat(lifecycle): Implement object lifecycle management functionality
Add a lifecycle module to automatically handle object expiration and transition during scanning
Modify the file metadata cache module to be publicly visible to support lifecycle operations
Adjust the scanning interval to a shorter time for testing lifecycle rules
Implement the parsing and execution logic for S3 lifecycle configurations
Add integration tests to verify the lifecycle expiration functionality
Update dependencies to support the new lifecycle features
Signed-off-by: junxiang Mu <1948535941@qq.com>
* fix cargo dependencies
Signed-off-by: junxiang Mu <1948535941@qq.com>
* fix fmt
Signed-off-by: junxiang Mu <1948535941@qq.com>
---------
Signed-off-by: junxiang Mu <1948535941@qq.com>
Co-authored-by: houseme <housemecn@gmail.com>