mirror of
https://github.com/rustfs/rustfs.git
synced 2026-01-17 09:40:32 +00:00
* feat(append): implement object append operations with state tracking Signed-off-by: junxiang Mu <1948535941@qq.com> * chore: rebase Signed-off-by: junxiang Mu <1948535941@qq.com> --------- Signed-off-by: junxiang Mu <1948535941@qq.com>
148 lines
8.8 KiB
Markdown
148 lines
8.8 KiB
Markdown
# Append Write Design
|
|
|
|
This document captures the current design of the append-write feature in RustFS so that new contributors can quickly understand the moving parts, data flows, and testing expectations.
|
|
|
|
## Goals & Non-Goals
|
|
|
|
### Goals
|
|
- Allow clients to append payloads to existing objects without re-uploading the full body.
|
|
- Support inline objects and spill seamlessly into segmented layout once thresholds are exceeded.
|
|
- Preserve strong read-after-write semantics via optimistic concurrency controls (ETag / epoch).
|
|
- Expose minimal S3-compatible surface area (`x-amz-object-append`, `x-amz-append-position`, `x-amz-append-action`).
|
|
|
|
### Non-Goals
|
|
- Full multipart-upload parity; append is intentionally simpler and serialized per object.
|
|
- Cross-object transactions; each object is isolated.
|
|
- Rebalancing or background compaction (future work).
|
|
|
|
## State Machine
|
|
|
|
Append state is persisted inside `FileInfo.metadata` under `x-rustfs-internal-append-state` and serialized as `AppendState` (`crates/filemeta/src/append.rs`).
|
|
|
|
```
|
|
Disabled --(initial PUT w/o append)--> SegmentedSealed
|
|
Inline --(inline append)--> Inline / InlinePendingSpill
|
|
InlinePendingSpill --(spill success)--> SegmentedActive
|
|
SegmentedActive --(Complete)--> SegmentedSealed
|
|
SegmentedActive --(Abort)--> SegmentedSealed
|
|
SegmentedSealed --(new append)--> SegmentedActive
|
|
```
|
|
|
|
Definitions:
|
|
- **Inline**: Object data fully stored in metadata (`FileInfo.data`).
|
|
- **InlinePendingSpill**: Inline data after append exceeded inline threshold; awaiting spill to disk.
|
|
- **SegmentedActive**: Object data lives in erasure-coded part(s) plus one or more pending append segments on disk (`append/<epoch>/<uuid>`).
|
|
- **SegmentedSealed**: No pending segments; logical content equals committed parts.
|
|
|
|
`AppendState` fields:
|
|
- `state`: current state enum (see above).
|
|
- `epoch`: monotonically increasing counter for concurrency control.
|
|
- `committed_length`: logical size already durable in the base parts/inline region.
|
|
- `pending_segments`: ordered list of `AppendSegment { offset, length, data_dir, etag, epoch }`.
|
|
|
|
## Metadata & Storage Layout
|
|
|
|
### Inline Objects
|
|
- Inline payload stored in `FileInfo.data`.
|
|
- Hash metadata maintained through `append_inline_data` (re-encoding with bitrot writer when checksums exist).
|
|
- When spilling is required, inline data is decoded, appended, and re-encoded into erasure shards written to per-disk `append/<epoch>/<segment_id>/part.1` temporary path before rename to primary data directory.
|
|
|
|
### Segmented Objects
|
|
- Base object content is represented by standard erasure-coded parts (`FileInfo.parts`, `FileInfo.data_dir`).
|
|
- Pending append segments live under `<object>/append/<epoch>/<segment_uuid>/part.1` (per disk).
|
|
- Each append stores segment metadata (`etag`, `offset`, `length`) inside `AppendState.pending_segments` and updates `FileInfo.size` to include pending bytes.
|
|
- Aggregate ETag is recomputed using multipart MD5 helper (`get_complete_multipart_md5`).
|
|
|
|
### Metadata Writes
|
|
- `SetDisks::write_unique_file_info` persists `FileInfo` updates to the quorum of disks.
|
|
- During spill/append/complete/abort, all mirrored `FileInfo` copies within `parts_metadata` are updated to keep nodes consistent.
|
|
- Abort ensures inline markers are cleared (`x-rustfs-internal-inline-data`) and `FileInfo.data = None` to avoid stale inline reads.
|
|
|
|
## Request Flows
|
|
|
|
### Append (Inline Path)
|
|
1. Handler (`rustfs/src/storage/ecfs.rs`) validates headers and fills `ObjectOptions.append_*`.
|
|
2. `SetDisks::append_inline_object` verifies append position using `AppendState` snapshot.
|
|
3. Existing inline payload decoded (if checksums present) and appended in-memory (`append_inline_data`).
|
|
4. Storage class decision determines whether to remain inline or spill.
|
|
5. Inline success updates `FileInfo.data`, metadata, `AppendState` (state `Inline`, lengths updated).
|
|
6. Spill path delegates to `spill_inline_into_segmented` (see segmented path below).
|
|
|
|
### Append (Segmented Path)
|
|
1. `SetDisks::append_segmented_object` validates state (must be `SegmentedActive` or `SegmentedSealed`).
|
|
2. Snapshot expected offset = committed length + sum of pending segments.
|
|
3. Payload encoded using erasure coding; shards written to temp volume; renamed into `append/<epoch>/<segment_uuid>` under object data directory.
|
|
4. New `AppendSegment` pushed, `AppendState.epoch` incremented, aggregated ETag recalculated.
|
|
5. `FileInfo.size` reflects committed + pending bytes; metadata persisted across quorum.
|
|
|
|
### GET / Range Reads
|
|
1. `SetDisks::get_object_with_fileinfo` inspects `AppendState`.
|
|
2. Reads committed data from inline or erasure parts (ignoring inline buffers once segmented).
|
|
3. If requested range includes pending segments, loader fetches each segment via `load_pending_segment`, decodes shards, and streams appended bytes.
|
|
|
|
### Complete Append (`x-amz-append-action: complete`)
|
|
1. `complete_append_object` fetches current `FileInfo`, ensures pending segments exist.
|
|
2. Entire logical object (committed + pending) streamed through `VecAsyncWriter` (TODO: potential optimization) to produce contiguous payload.
|
|
3. Inline spill routine (`spill_inline_into_segmented`) consolidates data into primary part, sets state `SegmentedSealed`, clears pending list, updates `committed_length`.
|
|
4. Pending segment directories removed and quorum metadata persisted.
|
|
|
|
### Abort Append (`x-amz-append-action: abort`)
|
|
1. `abort_append_object` removes pending segment directories.
|
|
2. Ensures `committed_length` matches actual durable data (inline length or sum of parts); logs and corrects if mismatch is found.
|
|
3. Clears pending list, sets state `SegmentedSealed`, bumps epoch, removes inline markers/data.
|
|
4. Persists metadata and returns base ETag (multipart MD5 of committed parts).
|
|
|
|
## Error Handling & Recovery
|
|
|
|
- All disk writes go through quorum helpers (`reduce_write_quorum_errs`, `reduce_read_quorum_errs`) and propagate `StorageError` variants for HTTP mapping.
|
|
- Append operations are single-threaded per object via locking in higher layers (`fast_lock_manager` in `SetDisks::put_object`).
|
|
- On spill/append rename failure, temp directories are cleaned up; operation aborts without mutating metadata.
|
|
- Abort path now realigns `committed_length` if metadata drifted (observed during development) and strips inline remnants to prevent stale reads.
|
|
- Pending segments are only removed once metadata update succeeds; no partial deletion is performed ahead of state persistence.
|
|
|
|
## Concurrency
|
|
|
|
- Append requests rely on exact `x-amz-append-position` to ensure the client has an up-to-date view.
|
|
- Optional header `If-Match` is honored in S3 handler before actual append (shared with regular PUT path).
|
|
- `AppendState.epoch` increments after each append/complete/abort; future work may expose it for stronger optimistic control.
|
|
- e2e test `append_segments_concurrency_then_complete` verifies that simultaneous appends result in exactly one success; the loser receives 400.
|
|
|
|
## Key Modules
|
|
|
|
- `crates/ecstore/src/set_disk.rs`: core implementation (inline append, spill, segmented append, complete, abort, GET integration).
|
|
- `crates/ecstore/src/erasure_coding/{encode,decode}.rs`: encode/decode helpers used by append pipeline.
|
|
- `crates/filemeta/src/append.rs`: metadata schema + helper functions.
|
|
- `rustfs/src/storage/ecfs.rs`: HTTP/S3 layer that parses headers and routes to append operations.
|
|
|
|
## Testing Strategy
|
|
|
|
### Unit Tests
|
|
- `crates/filemeta/src/append.rs` covers serialization and state transitions.
|
|
- `crates/ecstore/src/set_disk.rs` contains lower-level utilities and regression tests for metadata helpers.
|
|
- Additional unit coverage is recommended for spill/append failure paths (e.g., injected rename failures).
|
|
|
|
### End-to-End Tests (`cargo test --package e2e_test append`)
|
|
- Inline append success, wrong position, precondition failures.
|
|
- Segmented append success, wrong position, wrong ETag.
|
|
- Spill threshold transition (`append_threshold_crossing_inline_to_segmented`).
|
|
- Pending segment streaming (`append_range_requests_across_segments`).
|
|
- Complete append consolidates pending segments.
|
|
- Abort append discards pending data and allows new append.
|
|
- Concurrency: two clients racing to append, followed by additional append + complete.
|
|
|
|
### Tooling Considerations
|
|
- `make clippy` must pass; the append code relies on async operations and custom logging.
|
|
- `make test` / `cargo nextest run` recommended before submitting PRs.
|
|
- Use `RUST_LOG=rustfs_ecstore=debug` when debugging append flows; targeted `info!`/`warn!` logs are emitted during spill/abort.
|
|
|
|
## Future Work
|
|
|
|
- Streamed consolidation in `complete_append_object` to avoid buffering entire logical object.
|
|
- Throttling or automatic `Complete` when pending segments exceed size/quantity thresholds.
|
|
- Stronger epoch exposure to clients (header-based conflict detection).
|
|
- Automated cleanup or garbage collection for orphaned `append/*` directories.
|
|
|
|
---
|
|
|
|
For questions or design discussions, drop a note in the append-write channel or ping the storage team.
|