mirror of
https://github.com/rustfs/rustfs.git
synced 2026-01-17 01:30:33 +00:00
* Initial plan

* Implement priority-based heal queue with deduplication

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* Apply cargo fmt formatting fixes

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* Add queue monitoring, better error handling, and adaptive processing

- Add priority-based queue statistics tracking
- Implement queue capacity warnings (>80% full)
- Process multiple tasks per cycle when capacity allows
- Add proper error logging for failed heal request submissions
- Add Hash trait to HealPriority for HashMap support
- Improve observability with detailed queue status logs

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* Add comprehensive diagnostic logging to heal flow

- Add detailed logging for shard availability checks during reads
- Log successful heal request submissions
- Add detailed logging in heal_object for disk checking
- Log each disk's heal decision with reason and endpoint
- Add logging for heal writing and renaming phases
- Improve variable naming (nil_count -> available_shards)
- Add success confirmation messages for renamed healed data

These logs will help diagnose where the heal flow is failing.

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* fix

* feat(heal): enhance robustness, safety, and observability of heal process

- **Logging & Observability**:
  - Upgrade critical heal logs from `info` to `warn` for better visibility.
  - Implement structured logging with `tracing` fields for machine readability.
  - Add `#[tracing::instrument]` to `HealTask` and `SetDisks` methods for automatic context propagation.
- **Robustness**:
  - Add exponential backoff retry (3 attempts) for acquiring write locks in `heal_object` to handle contention.
  - Handle `rename_data` failures gracefully by preserving temporary files instead of forcing deletion, preventing potential data loss.
- **Data Safety**:
  - Fix `object_exists` to propagate IO errors instead of treating them as "object not found".
  - Update `ErasureSetHealer` to mark objects as failed rather than skipped when existence checks error, ensuring they are tracked for retry.

* fix

* fmt

* improve code for heal_object

* fix

* fix

* fix

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>
Co-authored-by: houseme <housemecn@gmail.com>
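
For readers unfamiliar with the pattern, the priority-based heal queue with deduplication and the >80% capacity warning could look roughly like the sketch below. This is a minimal illustration, not the code in this commit: the priority variants, `HealRequest` fields, `HealQueue`, `PushOutcome`, and `is_nearly_full` are all hypothetical names, and deduplicating on (bucket, object) is an assumption. The only detail taken from the message is that `HealPriority` derives `Hash`.

```rust
use std::cmp::Ordering;
use std::collections::{BinaryHeap, HashSet};

/// Hypothetical priority levels; only the type name and its `Hash` derive
/// come from the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub enum HealPriority {
    Low,
    Normal,
    High,
}

/// Hypothetical request shape, identified by (bucket, object).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct HealRequest {
    pub bucket: String,
    pub object: String,
    pub priority: HealPriority,
}

/// Heap entry: ordered by priority, then FIFO by insertion sequence.
struct Entry {
    priority: HealPriority,
    seq: u64,
    request: HealRequest,
}

impl Ord for Entry {
    fn cmp(&self, other: &Self) -> Ordering {
        // BinaryHeap is a max-heap: a higher priority wins, and within one
        // priority the lower (older) sequence number wins.
        self.priority
            .cmp(&other.priority)
            .then_with(|| other.seq.cmp(&self.seq))
    }
}
impl PartialOrd for Entry {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Entry {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Entry {}

#[derive(Debug, PartialEq, Eq)]
pub enum PushOutcome {
    Queued,
    Duplicate,
    Full,
}

pub struct HealQueue {
    heap: BinaryHeap<Entry>,
    pending: HashSet<(String, String)>,
    capacity: usize,
    next_seq: u64,
}

impl HealQueue {
    pub fn new(capacity: usize) -> Self {
        Self { heap: BinaryHeap::new(), pending: HashSet::new(), capacity, next_seq: 0 }
    }

    /// Queue a request unless the same object is already pending or the queue is full.
    /// Assumption: a duplicate is rejected even if it carries a higher priority.
    pub fn push(&mut self, request: HealRequest) -> PushOutcome {
        let key = (request.bucket.clone(), request.object.clone());
        if self.pending.contains(&key) {
            return PushOutcome::Duplicate;
        }
        if self.heap.len() >= self.capacity {
            return PushOutcome::Full;
        }
        self.pending.insert(key);
        let seq = self.next_seq;
        self.next_seq += 1;
        self.heap.push(Entry { priority: request.priority, seq, request });
        PushOutcome::Queued
    }

    /// Remove and return the highest-priority request, oldest first within a priority.
    pub fn pop(&mut self) -> Option<HealRequest> {
        let entry = self.heap.pop()?;
        let key = (entry.request.bucket.clone(), entry.request.object.clone());
        self.pending.remove(&key);
        Some(entry.request)
    }

    /// True when the queue is more than 80% full (integer form of len / capacity > 0.8).
    pub fn is_nearly_full(&self) -> bool {
        self.heap.len() * 5 > self.capacity * 4
    }
}
```

In this sketch a caller would log a warning whenever `is_nearly_full()` returns true after a push, and could pop several requests per scheduling cycle while worker capacity allows.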
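
The "exponential backoff retry (3 attempts)" used when acquiring write locks can be illustrated with a small generic helper. This is a synchronous, standard-library-only sketch under stated assumptions: the real heal path is presumably asynchronous, and `retry_with_backoff`, its parameters, and the doubling schedule are hypothetical choices made for illustration rather than the function or delays used in `heal_object`.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Run `op` up to `max_attempts` times, sleeping between failed attempts.
/// The delay doubles each time: base, 2 * base, 4 * base, and so on.
/// `op` receives the 1-based attempt number. The last error is returned
/// if every attempt fails. (Hypothetical helper, not from the commit.)
pub fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the error instead of hiding it.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                // Back off before the next attempt to reduce lock contention.
                sleep(base_delay * 2u32.pow(attempt - 1));
                attempt += 1;
            }
        }
    }
}
```

A caller would wrap the lock acquisition in the closure and pass `3` as `max_attempts`. Returning the final error rather than swallowing it matches the spirit of the `object_exists` fix described above, where IO errors are propagated instead of being treated as "object not found".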