Files
rustfs/crates/utils/src/http
Copilot 277d80de13 Fix: Implement priority-based heal queue with comprehensive diagnostic logging (#884)
* Initial plan

* Implement priority-based heal queue with deduplication

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* Apply cargo fmt formatting fixes

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* Add queue monitoring, better error handling, and adaptive processing

- Add priority-based queue statistics tracking
- Implement queue capacity warnings (>80% full)
- Process multiple tasks per cycle when capacity allows
- Add proper error logging for failed heal request submissions
- Add Hash trait to HealPriority for HashMap support
- Improve observability with detailed queue status logs

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* Add comprehensive diagnostic logging to heal flow

- Add detailed logging for shard availability checks during reads
- Log successful heal request submissions
- Add detailed logging in heal_object for disk checking
- Log each disk's heal decision with reason and endpoint
- Add logging for heal writing and renaming phases
- Improve variable naming (nil_count -> available_shards)
- Add success confirmation messages for renamed healed data

These logs will help diagnose where the heal flow is failing.

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* fix

* feat(heal): enhance robustness, safety, and observability of heal process

- **Logging & Observability**:
  - Upgrade critical heal logs from [info](/crates/e2e_test/src/reliant/node_interact_test.rs:196:0-213:1) to `warn` for better visibility.
  - Implement structured logging with `tracing` fields for machine readability.
  - Add `#[tracing::instrument]` to [HealTask](c/crates/ahm/src/heal/task.rs:182:0-205:1) and [SetDisks](/crates/ecstore/src/set_disk.rs:120:0-131:1) methods for automatic context propagation.

- **Robustness**:
  - Add exponential backoff retry (3 attempts) for acquiring write locks in [heal_object](/crates/ahm/src/heal/storage.rs:438:4-460:5) to handle contention.
  - Handle [rename_data](/crates/ecstore/src/set_disk.rs:392:4-516:5) failures gracefully by preserving temporary files instead of forcing deletion, preventing potential data loss.

- **Data Safety**:
  - Fix [object_exists](/crates/ahm/src/heal/storage.rs:395:4-412:5) to propagate IO errors instead of treating them as "object not found".
  - Update [ErasureSetHealer](/crates/ahm/src/heal/erasure_healer.rs:28:0-33:1) to mark objects as failed rather than skipped when existence checks error, ensuring they are tracked for retry.

* fix

* fmt

* improve code for heal_object

* fix

* fix

* fix

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>
Co-authored-by: houseme <housemecn@gmail.com>
2025-11-20 00:36:25 +08:00
..