feat(append): implement object append operations with state tracking (#599)

* feat(append): implement object append operations with state tracking

Signed-off-by: junxiang Mu <1948535941@qq.com>

* chore: rebase

Signed-off-by: junxiang Mu <1948535941@qq.com>

---------

Signed-off-by: junxiang Mu <1948535941@qq.com>
guojidan
2025-09-28 11:06:26 +08:00
committed by GitHub
parent be66cf8bd3
commit 4f73760a45
19 changed files with 4160 additions and 237 deletions

docs/append_write/README.md Normal file

@@ -0,0 +1,147 @@
# Append Write Design
This document captures the current design of the append-write feature in RustFS so that new contributors can quickly understand the moving parts, data flows, and testing expectations.
## Goals & Non-Goals
### Goals
- Allow clients to append payloads to existing objects without re-uploading the full body.
- Support inline objects and spill seamlessly into the segmented layout once the inline threshold is exceeded.
- Preserve strong read-after-write semantics via optimistic concurrency controls (ETag / epoch).
- Expose minimal S3-compatible surface area (`x-amz-object-append`, `x-amz-append-position`, `x-amz-append-action`).
### Non-Goals
- Full multipart-upload parity; append is intentionally simpler and serialized per object.
- Cross-object transactions; each object is isolated.
- Rebalancing or background compaction (future work).
## State Machine
Append state is persisted inside `FileInfo.metadata` under `x-rustfs-internal-append-state` and serialized as `AppendState` (`crates/filemeta/src/append.rs`).
```
Disabled --(initial PUT w/o append)--> SegmentedSealed
Inline --(inline append)--> Inline / InlinePendingSpill
InlinePendingSpill --(spill success)--> SegmentedActive
SegmentedActive --(Complete)--> SegmentedSealed
SegmentedActive --(Abort)--> SegmentedSealed
SegmentedSealed --(new append)--> SegmentedActive
```
Definitions:
- **Inline**: Object data fully stored in metadata (`FileInfo.data`).
- **InlinePendingSpill**: Inline data whose size exceeded the inline threshold after an append; awaiting spill to disk.
- **SegmentedActive**: Object data lives in erasure-coded part(s) plus one or more pending append segments on disk (`append/<epoch>/<uuid>`).
- **SegmentedSealed**: No pending segments; logical content equals committed parts.
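
The diagram above can be read as a transition table. The following is a minimal sketch of that table, not the code in `crates/filemeta/src/append.rs`: the `State`, `Event`, and `next_state` names are hypothetical, and the `SegmentedActive` + append edge is taken from the segmented append flow rather than from the diagram.

```rust
/// States named in the diagram (enum name is hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Disabled,
    Inline,
    InlinePendingSpill,
    SegmentedActive,
    SegmentedSealed,
}

/// Events that label the edges (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    PutWithoutAppend,
    InlineAppend { exceeds_inline_threshold: bool },
    SpillSuccess,
    Append,
    Complete,
    Abort,
}

/// Returns the next state, or `None` when no such edge is documented.
pub fn next_state(state: State, event: Event) -> Option<State> {
    use Event::*;
    use State::*;
    match (state, event) {
        (Disabled, PutWithoutAppend) => Some(SegmentedSealed),
        (Inline, InlineAppend { exceeds_inline_threshold: false }) => Some(Inline),
        (Inline, InlineAppend { exceeds_inline_threshold: true }) => Some(InlinePendingSpill),
        (InlinePendingSpill, SpillSuccess) => Some(SegmentedActive),
        (SegmentedActive, Complete) | (SegmentedActive, Abort) => Some(SegmentedSealed),
        (SegmentedSealed, Append) => Some(SegmentedActive),
        // Not drawn in the diagram; the segmented append flow accepts
        // SegmentedActive, so this sketch keeps the object active.
        (SegmentedActive, Append) => Some(SegmentedActive),
        _ => None,
    }
}
```

Returning `Option` keeps undocumented transitions explicit, so a caller has to decide how to reject them.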
`AppendState` fields:
- `state`: current state enum (see above).
- `epoch`: monotonically increasing counter for concurrency control.
- `committed_length`: logical size already durable in the base parts/inline region.
- `pending_segments`: ordered list of `AppendSegment { offset, length, data_dir, etag, epoch }`.
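
As a rough illustration of the fields above, the sketch below models `AppendState` with plain integer and string types. The field types, the `AppendStateKind` name, and the helper methods are assumptions for illustration; the real definitions live in `crates/filemeta/src/append.rs` and may differ. The sketch also assumes `offset` is an absolute position in the logical object.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AppendStateKind {
    Disabled,
    Inline,
    InlinePendingSpill,
    SegmentedActive,
    SegmentedSealed,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AppendSegment {
    pub offset: u64,      // assumed: absolute offset in the logical object
    pub length: u64,
    pub data_dir: String, // assumed type; the real field may be a UUID
    pub etag: String,
    pub epoch: u64,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AppendState {
    pub state: AppendStateKind,
    pub epoch: u64,
    pub committed_length: u64,
    pub pending_segments: Vec<AppendSegment>,
}

impl AppendState {
    /// Total bytes held in pending segments.
    pub fn pending_length(&self) -> u64 {
        self.pending_segments.iter().map(|s| s.length).sum()
    }

    /// Committed plus pending bytes, i.e. the size a reader would observe.
    pub fn logical_length(&self) -> u64 {
        self.committed_length + self.pending_length()
    }

    /// True when every segment starts exactly where the previous data ends.
    pub fn is_contiguous(&self) -> bool {
        let mut expected = self.committed_length;
        for seg in &self.pending_segments {
            if seg.offset != expected {
                return false;
            }
            expected += seg.length;
        }
        true
    }
}
```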
## Metadata & Storage Layout
### Inline Objects
- Inline payload stored in `FileInfo.data`.
- Hash metadata maintained through `append_inline_data` (re-encoding with bitrot writer when checksums exist).
- When spilling is required, inline data is decoded, appended, and re-encoded into erasure shards, which are written to a per-disk temporary path `append/<epoch>/<segment_id>/part.1` before being renamed into the primary data directory.
### Segmented Objects
- Base object content is represented by standard erasure-coded parts (`FileInfo.parts`, `FileInfo.data_dir`).
- Pending append segments live under `<object>/append/<epoch>/<segment_uuid>/part.1` (per disk).
- Each append stores segment metadata (`etag`, `offset`, `length`) inside `AppendState.pending_segments` and updates `FileInfo.size` to include pending bytes.
- Aggregate ETag is recomputed using multipart MD5 helper (`get_complete_multipart_md5`).
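
To make the pending-segment layout concrete, here is a small sketch that builds the per-disk path `<object>/append/<epoch>/<segment_uuid>/part.1`. The function name and its parameters are hypothetical; it mirrors only the path shape described above, not how RustFS resolves volumes or disks.

```rust
use std::path::{Path, PathBuf};

/// Builds `<object>/append/<epoch>/<segment_uuid>/part.1` (hypothetical helper).
pub fn pending_segment_part_path(object_dir: &Path, epoch: u64, segment_uuid: &str) -> PathBuf {
    object_dir
        .join("append")
        .join(epoch.to_string())
        .join(segment_uuid)
        .join("part.1")
}
```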
### Metadata Writes
- `SetDisks::write_unique_file_info` persists `FileInfo` updates to the quorum of disks.
- During spill/append/complete/abort, all mirrored `FileInfo` copies within `parts_metadata` are updated to keep nodes consistent.
- Abort ensures inline markers are cleared (`x-rustfs-internal-inline-data`) and `FileInfo.data = None` to avoid stale inline reads.
## Request Flows
### Append (Inline Path)
1. Handler (`rustfs/src/storage/ecfs.rs`) validates headers and fills `ObjectOptions.append_*`.
2. `SetDisks::append_inline_object` verifies append position using `AppendState` snapshot.
3. Existing inline payload decoded (if checksums present) and appended in-memory (`append_inline_data`).
4. Storage class decision determines whether to remain inline or spill.
5. Inline success updates `FileInfo.data`, metadata, `AppendState` (state `Inline`, lengths updated).
6. Spill path delegates to `spill_inline_into_segmented` (see segmented path below).
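
The inline path above can be sketched as a pure function: check the position, concatenate in memory, then decide whether the result still fits inline. This is a simplified model, not `append_inline_object`; all names are hypothetical, and a plain byte threshold stands in for the storage class decision.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum InlineOutcome {
    /// Combined payload still fits inline.
    StayInline(Vec<u8>),
    /// Combined payload must be spilled into the segmented layout.
    Spill(Vec<u8>),
}

#[derive(Debug, PartialEq, Eq)]
pub struct PositionMismatch {
    pub expected: u64,
    pub got: u64,
}

/// `inline_threshold` is an assumed stand-in for the storage class decision.
pub fn append_inline(
    existing: &[u8],
    payload: &[u8],
    position: u64,
    inline_threshold: usize,
) -> Result<InlineOutcome, PositionMismatch> {
    // The client must append exactly at the current end of the object.
    let expected = existing.len() as u64;
    if position != expected {
        return Err(PositionMismatch { expected, got: position });
    }
    let mut combined = Vec::with_capacity(existing.len() + payload.len());
    combined.extend_from_slice(existing);
    combined.extend_from_slice(payload);
    if combined.len() <= inline_threshold {
        Ok(InlineOutcome::StayInline(combined))
    } else {
        Ok(InlineOutcome::Spill(combined))
    }
}
```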
### Append (Segmented Path)
1. `SetDisks::append_segmented_object` validates state (must be `SegmentedActive` or `SegmentedSealed`).
2. Snapshot expected offset = committed length + sum of pending segments.
3. Payload encoded using erasure coding; shards written to temp volume; renamed into `append/<epoch>/<segment_uuid>` under object data directory.
4. New `AppendSegment` pushed, `AppendState.epoch` incremented, aggregated ETag recalculated.
5. `FileInfo.size` reflects committed + pending bytes; metadata persisted across quorum.
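
A minimal in-memory sketch of the bookkeeping in steps 2, 4, and 5 follows. Erasure encoding, shard writes, and ETag aggregation are omitted. The type and function names are hypothetical, and tagging the segment with the epoch value from before the increment is an assumption.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Segment {
    pub offset: u64,
    pub length: u64,
    pub epoch: u64,
    pub etag: String,
}

#[derive(Debug)]
pub struct SegmentedState {
    pub epoch: u64,
    pub committed_length: u64,
    pub pending: Vec<Segment>,
    pub size: u64, // stands in for FileInfo.size
}

#[derive(Debug, PartialEq, Eq)]
pub struct PositionMismatch {
    pub expected: u64,
    pub got: u64,
}

/// Records a new pending segment and returns the new logical size.
pub fn append_segmented(
    state: &mut SegmentedState,
    position: u64,
    length: u64,
    etag: &str,
) -> Result<u64, PositionMismatch> {
    // Expected offset = committed length + sum of pending segment lengths.
    let pending: u64 = state.pending.iter().map(|s| s.length).sum();
    let expected = state.committed_length + pending;
    if position != expected {
        return Err(PositionMismatch { expected, got: position });
    }
    // Assumption: the segment carries the epoch in effect before the bump.
    state.pending.push(Segment {
        offset: expected,
        length,
        epoch: state.epoch,
        etag: etag.to_string(),
    });
    state.epoch += 1;
    state.size = expected + length; // committed + pending bytes
    Ok(state.size)
}
```

A second append that reuses a stale position fails without touching the state, which is what the wrong-position e2e cases exercise.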
### GET / Range Reads
1. `SetDisks::get_object_with_fileinfo` inspects `AppendState`.
2. Reads committed data from inline or erasure parts (ignoring inline buffers once segmented).
3. If requested range includes pending segments, loader fetches each segment via `load_pending_segment`, decodes shards, and streams appended bytes.
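
The range handling above amounts to intersecting the requested byte range with the committed region and with each pending segment. The sketch below plans such a read over plain `(offset, length)` pairs. It is an illustration only: `ReadSource` and `plan_range_read` are hypothetical names, segment offsets are assumed to be absolute, and no shards are decoded.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ReadSource {
    /// Bytes served from inline data or erasure parts.
    Committed { offset: u64, length: u64 },
    /// Bytes served from pending segment `index`; `offset` is relative to the segment.
    Pending { index: usize, offset: u64, length: u64 },
}

/// Intersection of two half-open ranges `[start, end)`.
fn overlap(a: (u64, u64), b: (u64, u64)) -> Option<(u64, u64)> {
    let start = a.0.max(b.0);
    let end = a.1.min(b.1);
    (start < end).then_some((start, end))
}

/// `pending` holds `(absolute_offset, length)` per segment, in order.
pub fn plan_range_read(
    committed_length: u64,
    pending: &[(u64, u64)],
    start: u64,
    length: u64,
) -> Vec<ReadSource> {
    let want = (start, start.saturating_add(length));
    let mut plan = Vec::new();
    if let Some((s, e)) = overlap(want, (0, committed_length)) {
        plan.push(ReadSource::Committed { offset: s, length: e - s });
    }
    for (index, &(seg_offset, seg_length)) in pending.iter().enumerate() {
        let seg = (seg_offset, seg_offset.saturating_add(seg_length));
        if let Some((s, e)) = overlap(want, seg) {
            plan.push(ReadSource::Pending { index, offset: s - seg_offset, length: e - s });
        }
    }
    plan
}
```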
### Complete Append (`x-amz-append-action: complete`)
1. `complete_append_object` fetches current `FileInfo`, ensures pending segments exist.
2. Entire logical object (committed + pending) streamed through `VecAsyncWriter` (TODO: potential optimization) to produce contiguous payload.
3. Inline spill routine (`spill_inline_into_segmented`) consolidates data into primary part, sets state `SegmentedSealed`, clears pending list, updates `committed_length`.
4. Pending segment directories removed and quorum metadata persisted.
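
The consolidation can be modelled in memory as follows. This sketch buffers the whole logical object, as the current design does via `VecAsyncWriter`, and then seals the state. The struct and function names are hypothetical, and erasure encoding and directory cleanup are left out.

```rust
#[derive(Debug)]
pub struct AppendObject {
    pub committed: Vec<u8>,    // stands in for the base parts
    pub pending: Vec<Vec<u8>>, // decoded pending segments, in order
    pub committed_length: u64,
    pub epoch: u64,
    pub sealed: bool,          // true once the state is SegmentedSealed
}

#[derive(Debug, PartialEq, Eq)]
pub struct NoPendingSegments;

/// Folds all pending segments into the committed payload and seals the object.
pub fn complete_append(obj: &mut AppendObject) -> Result<u64, NoPendingSegments> {
    if obj.pending.is_empty() {
        return Err(NoPendingSegments);
    }
    // Taking the list both consumes the segments and clears the pending list.
    for segment in std::mem::take(&mut obj.pending) {
        obj.committed.extend_from_slice(&segment);
    }
    obj.committed_length = obj.committed.len() as u64;
    obj.sealed = true;
    obj.epoch += 1; // the epoch increments after each complete
    Ok(obj.committed_length)
}
```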
### Abort Append (`x-amz-append-action: abort`)
1. `abort_append_object` removes pending segment directories.
2. Ensures `committed_length` matches actual durable data (inline length or sum of parts); logs and corrects if mismatch is found.
3. Clears pending list, sets state `SegmentedSealed`, bumps epoch, removes inline markers/data.
4. Persists metadata and returns base ETag (multipart MD5 of committed parts).
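
The metadata side of abort (steps 2 and 3) can be sketched as below. The names are hypothetical; `durable_length` is assumed to be supplied by the caller from the inline length or the sum of parts, and resetting `size` to the committed length is an inference from the rule that size includes pending bytes.

```rust
#[derive(Debug)]
pub struct AbortModel {
    pub committed_length: u64,
    pub pending_lengths: Vec<u64>,
    pub epoch: u64,
    pub sealed: bool,                 // true once the state is SegmentedSealed
    pub inline_data: Option<Vec<u8>>, // stands in for FileInfo.data
    pub size: u64,                    // stands in for FileInfo.size
}

/// Applies the abort bookkeeping; returns true if `committed_length` had drifted.
pub fn abort_append(obj: &mut AbortModel, durable_length: u64) -> bool {
    let corrected = obj.committed_length != durable_length;
    if corrected {
        // The real code logs the mismatch before correcting it.
        obj.committed_length = durable_length;
    }
    obj.pending_lengths.clear();
    obj.sealed = true;
    obj.epoch += 1;
    obj.inline_data = None; // avoid stale inline reads
    obj.size = obj.committed_length; // assumption: no pending bytes remain
    corrected
}
```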
## Error Handling & Recovery
- All disk writes go through quorum helpers (`reduce_write_quorum_errs`, `reduce_read_quorum_errs`) and propagate `StorageError` variants for HTTP mapping.
- Append operations are serialized per object via locking in higher layers (`fast_lock_manager` in `SetDisks::put_object`).
- On spill/append rename failure, temp directories are cleaned up; operation aborts without mutating metadata.
- The abort path realigns `committed_length` if metadata has drifted (observed during development) and strips inline remnants to prevent stale reads.
- Pending segments are only removed once metadata update succeeds; no partial deletion is performed ahead of state persistence.
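
The cleanup-on-rename-failure rule can be illustrated with plain `std::fs` on a single local directory. This is only a sketch of the pattern: `commit_segment` is a hypothetical helper, there is no quorum or erasure coding, and the caller is assumed to update metadata only when it returns `Ok`.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Writes a shard into `tmp_dir`, then renames the directory to `final_dir`.
/// On any failure the temp directory is removed and the error is returned.
pub fn commit_segment(tmp_dir: &Path, final_dir: &Path, shard: &[u8]) -> io::Result<()> {
    let attempt = || -> io::Result<()> {
        fs::create_dir_all(tmp_dir)?;
        fs::write(tmp_dir.join("part.1"), shard)?;
        if let Some(parent) = final_dir.parent() {
            fs::create_dir_all(parent)?;
        }
        fs::rename(tmp_dir, final_dir)
    };
    let result = attempt();
    if result.is_err() {
        // Best-effort cleanup; the original error is what the caller sees.
        let _ = fs::remove_dir_all(tmp_dir);
    }
    result
}
```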
## Concurrency
- Append requests rely on exact `x-amz-append-position` to ensure the client has an up-to-date view.
- Optional header `If-Match` is honored in S3 handler before actual append (shared with regular PUT path).
- `AppendState.epoch` increments after each append/complete/abort; future work may expose it for stronger optimistic control.
- The e2e test `append_segments_concurrency_then_complete` verifies that simultaneous appends result in exactly one success; the loser receives HTTP 400.
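
Why exactly one racer wins can be shown with a toy model: a per-object lock serializes the requests, and the position check rejects whichever request arrives second with a stale offset. A `std::sync::Mutex` stands in for the higher-layer lock here, and all names are hypothetical.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

pub struct ObjectView {
    pub logical_length: u64,
    pub epoch: u64,
}

/// Appends `len` bytes at `position`; on a stale position returns the current length.
pub fn try_append(obj: &Mutex<ObjectView>, position: u64, len: u64) -> Result<u64, u64> {
    let mut guard = obj.lock().expect("lock poisoned");
    if position != guard.logical_length {
        return Err(guard.logical_length);
    }
    guard.logical_length += len;
    guard.epoch += 1;
    Ok(guard.logical_length)
}

/// Spawns `appenders` threads that all append at the same position and
/// returns how many of them succeeded.
pub fn race(appenders: usize, position: u64, len: u64) -> usize {
    let obj = Arc::new(Mutex::new(ObjectView { logical_length: position, epoch: 0 }));
    let handles: Vec<_> = (0..appenders)
        .map(|_| {
            let obj = Arc::clone(&obj);
            thread::spawn(move || try_append(&obj, position, len).is_ok())
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("thread panicked"))
        .filter(|ok| *ok)
        .count()
}
```

With a non-zero `len`, the first successful append moves the logical length, so every other request at the same position fails regardless of scheduling.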
## Key Modules
- `crates/ecstore/src/set_disk.rs`: core implementation (inline append, spill, segmented append, complete, abort, GET integration).
- `crates/ecstore/src/erasure_coding/{encode,decode}.rs`: encode/decode helpers used by append pipeline.
- `crates/filemeta/src/append.rs`: metadata schema + helper functions.
- `rustfs/src/storage/ecfs.rs`: HTTP/S3 layer that parses headers and routes to append operations.
## Testing Strategy
### Unit Tests
- `crates/filemeta/src/append.rs` covers serialization and state transitions.
- `crates/ecstore/src/set_disk.rs` contains lower-level utilities and regression tests for metadata helpers.
- Additional unit coverage is recommended for spill/append failure paths (e.g., injected rename failures).
### End-to-End Tests (`cargo test --package e2e_test append`)
- Inline append success, wrong position, precondition failures.
- Segmented append success, wrong position, wrong ETag.
- Spill threshold transition (`append_threshold_crossing_inline_to_segmented`).
- Pending segment streaming (`append_range_requests_across_segments`).
- Complete append consolidates pending segments.
- Abort append discards pending data and allows new append.
- Concurrency: two clients racing to append, followed by additional append + complete.
### Tooling Considerations
- `make clippy` must pass; the append code relies on async operations and custom logging.
- `make test` / `cargo nextest run` recommended before submitting PRs.
- Use `RUST_LOG=rustfs_ecstore=debug` when debugging append flows; targeted `info!`/`warn!` logs are emitted during spill/abort.
## Future Work
- Streamed consolidation in `complete_append_object` to avoid buffering entire logical object.
- Throttling or automatic `Complete` when pending segments exceed size/quantity thresholds.
- Stronger epoch exposure to clients (header-based conflict detection).
- Automated cleanup or garbage collection for orphaned `append/*` directories.
---
For questions or design discussions, drop a note in the append-write channel or ping the storage team.