rustfs-obs
Observability library for RustFS providing structured JSON logging, distributed tracing, metrics via OpenTelemetry, and continuous profiling via Pyroscope.
Features
| Feature | Description |
|---|---|
| Structured logging | JSON-formatted logs via tracing-subscriber |
| Rolling-file logging | Minutely / hourly / daily rotation with automatic cleanup and high-precision timestamps |
| Distributed tracing | OTLP/HTTP export to Jaeger, Tempo, or any OTel collector |
| Metrics | OTLP/HTTP export, bridged from the metrics crate facade |
| Continuous Profiling | CPU/Memory profiling export to Pyroscope |
| Log cleanup | Background task: size limits, zstd/gzip compression, retention policies |
| GPU metrics (optional) | Enable with the gpu feature flag |
Quick Start
# Cargo.toml
[dependencies]
rustfs-obs = { version = "0.0.5" }
# Or, with GPU metrics support (use this line instead of the one above)
rustfs-obs = { version = "0.0.5", features = ["gpu"] }
use rustfs_obs::init_obs;
#[tokio::main]
async fn main() {
    // Build config from environment variables, then initialise all backends.
    let _guard = init_obs(None).await.expect("failed to initialise observability");
    tracing::info!("RustFS started");
    // _guard is dropped here; all providers are flushed and shut down.
}
Keep _guard alive for the lifetime of your application. Dropping it triggers an ordered shutdown of every OpenTelemetry provider.
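The reason the binding name matters is a general Rust rule: a value bound to a named variable such as _guard lives until the end of its scope, while a value assigned to the bare pattern _ is dropped at the end of that statement. The sketch below illustrates this with a stand-in type; DemoGuard and run are hypothetical names for illustration and are not part of rustfs-obs.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical stand-in for the guard returned by init_obs.
/// It records "shutdown" when dropped instead of flushing real providers.
struct DemoGuard {
    events: Rc<RefCell<Vec<&'static str>>>,
}

impl Drop for DemoGuard {
    fn drop(&mut self) {
        self.events.borrow_mut().push("shutdown");
    }
}

/// Returns the order of events for the two binding styles.
fn run(bind_to_name: bool) -> Vec<&'static str> {
    let events = Rc::new(RefCell::new(Vec::new()));
    if bind_to_name {
        // Named binding: the guard lives until the end of this block.
        let _guard = DemoGuard { events: Rc::clone(&events) };
        events.borrow_mut().push("work");
    } else {
        // Bare `_` pattern: the guard is dropped at the end of this statement.
        let _ = DemoGuard { events: Rc::clone(&events) };
        events.borrow_mut().push("work");
    }
    let out = events.borrow().clone();
    out
}

fn main() {
    println!("let _guard = ...: {:?}", run(true));
    println!("let _ = ...:      {:?}", run(false));
}
```

With the named binding, the shutdown happens after the work; with the bare underscore, it happens before any work is done, which for a telemetry guard would mean the providers are already shut down when the application starts logging.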
Initialisation
With an explicit OTLP endpoint
use rustfs_obs::init_obs;
let _guard = init_obs(Some("http://otel-collector:4318".to_string()))
    .await
    .expect("observability init failed");
With a custom config struct
use rustfs_obs::{AppConfig, OtelConfig, init_obs_with_config};
let config = AppConfig::new_with_endpoint(Some("http://localhost:4318".to_string()));
let _guard = init_obs_with_config(&config.observability)
    .await
    .expect("observability init failed");
Routing Logic
The library selects a backend automatically based on configuration:
1. Any OTLP endpoint set?
└─ YES → Full OTLP/HTTP pipeline (traces + metrics + logs + profiling)
2. RUSTFS_OBS_LOG_DIRECTORY set to a non-empty path?
└─ YES → Rolling-file JSON logging
+ Stdout mirror enabled if:
- RUSTFS_OBS_LOG_STDOUT_ENABLED=true (explicit), OR
- RUSTFS_OBS_ENVIRONMENT != "production" (automatic)
3. Default → Stdout-only JSON logging (all signals)
Key Points:
- When no log directory is configured, logs automatically go to stdout only (perfect for development)
- When a log directory is set, logs go to rolling files in that directory
- In non-production environments, stdout is automatically mirrored alongside file logging for visibility
- In production mode, you must explicitly set RUSTFS_OBS_LOG_STDOUT_ENABLED=true to see stdout in addition to files
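The routing rules above can be sketched as a small decision function. This is a minimal model for reasoning about configuration, not the crate's implementation: the Backend enum, the Inputs struct, and select_backend are hypothetical names, and the exact string comparison against "production" is an assumption.

```rust
/// Illustrative backend choices; these are not the crate's own types.
#[derive(Debug, PartialEq)]
enum Backend {
    Otlp,
    RollingFile { stdout_mirror: bool },
    StdoutOnly,
}

/// Hypothetical inputs mirroring the environment variables described above.
struct Inputs<'a> {
    any_otlp_endpoint: bool,  // any of the OTLP endpoint variables is set
    log_directory: &'a str,   // RUSTFS_OBS_LOG_DIRECTORY
    log_stdout_enabled: bool, // RUSTFS_OBS_LOG_STDOUT_ENABLED
    environment: &'a str,     // RUSTFS_OBS_ENVIRONMENT
}

fn select_backend(i: &Inputs) -> Backend {
    // Rule 1: any OTLP endpoint selects the full pipeline.
    if i.any_otlp_endpoint {
        return Backend::Otlp;
    }
    // Rule 2: a non-empty log directory selects rolling files,
    // mirrored to stdout when requested or outside production.
    if !i.log_directory.is_empty() {
        let stdout_mirror = i.log_stdout_enabled || i.environment != "production";
        return Backend::RollingFile { stdout_mirror };
    }
    // Rule 3: otherwise log to stdout only.
    Backend::StdoutOnly
}

fn main() {
    let prod_files = Inputs {
        any_otlp_endpoint: false,
        log_directory: "/var/log/rustfs",
        log_stdout_enabled: false,
        environment: "production",
    };
    println!("{:?}", select_backend(&prod_files));
}
```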
Environment Variables
All configuration is read from environment variables at startup.
OTLP / Export
| Variable | Default | Description |
|---|---|---|
| RUSTFS_OBS_ENDPOINT | (empty) | Root OTLP/HTTP endpoint, e.g. http://otel-collector:4318 |
| RUSTFS_OBS_TRACE_ENDPOINT | (empty) | Dedicated trace endpoint (overrides root + /v1/traces) |
| RUSTFS_OBS_METRIC_ENDPOINT | (empty) | Dedicated metrics endpoint |
| RUSTFS_OBS_LOG_ENDPOINT | (empty) | Dedicated log endpoint |
| RUSTFS_OBS_PROFILING_ENDPOINT | (empty) | Dedicated profiling endpoint (e.g. Pyroscope) |
| RUSTFS_OBS_TRACES_EXPORT_ENABLED | true | Toggle trace export |
| RUSTFS_OBS_METRICS_EXPORT_ENABLED | true | Toggle metrics export |
| RUSTFS_OBS_LOGS_EXPORT_ENABLED | true | Toggle OTLP log export |
| RUSTFS_OBS_PROFILING_EXPORT_ENABLED | true | Toggle profiling export |
| RUSTFS_OBS_USE_STDOUT | false | Mirror all signals to stdout alongside OTLP |
| RUSTFS_OBS_SAMPLE_RATIO | 0.1 | Trace sampling ratio 0.0–1.0 |
| RUSTFS_OBS_METER_INTERVAL | 15 | Metrics export interval (seconds) |
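The table says a dedicated trace endpoint overrides the root endpoint plus /v1/traces. A minimal sketch of that resolution rule follows. The function resolve_endpoint is a hypothetical helper, trimming a trailing slash from the root is an assumption, and applying the same rule to metrics and logs with the standard OTLP/HTTP paths /v1/metrics and /v1/logs is also an assumption about the crate's behaviour.

```rust
/// Hypothetical helper: pick the endpoint for one signal.
/// An empty string stands for an unset environment variable.
fn resolve_endpoint(root: &str, dedicated: &str, signal_path: &str) -> Option<String> {
    // A dedicated endpoint is used verbatim.
    if !dedicated.is_empty() {
        return Some(dedicated.to_string());
    }
    // With neither value set, this signal has no OTLP endpoint.
    if root.is_empty() {
        return None;
    }
    // Otherwise derive the endpoint from the root (assumption: drop a trailing slash).
    Some(format!("{}{}", root.trim_end_matches('/'), signal_path))
}

fn main() {
    let root = "http://otel-collector:4318";
    println!("{:?}", resolve_endpoint(root, "", "/v1/traces"));
    println!("{:?}", resolve_endpoint(root, "http://tempo:4318/v1/traces", "/v1/traces"));
    println!("{:?}", resolve_endpoint("", "", "/v1/traces"));
}
```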
Service identity
| Variable | Default | Description |
|---|---|---|
| RUSTFS_OBS_SERVICE_NAME | rustfs | OTel service.name |
| RUSTFS_OBS_SERVICE_VERSION | (crate version) | OTel service.version |
| RUSTFS_OBS_ENVIRONMENT | development | Deployment environment (production, development, …) |
Local logging
| Variable | Default | Description |
|---|---|---|
| RUSTFS_OBS_LOGGER_LEVEL | info | Log level; RUST_LOG syntax supported |
| RUSTFS_OBS_LOG_STDOUT_ENABLED | false | When file logging is active, also mirror to stdout |
| RUSTFS_OBS_LOG_DIRECTORY | (empty) | Directory for rolling log files. When empty, logs go to stdout only |
| RUSTFS_OBS_LOG_FILENAME | rustfs.log | Base filename for rolling logs. Rotated archives include a high-precision timestamp and counter. With the default RUSTFS_OBS_LOG_MATCH_MODE=suffix, names look like <timestamp>-<counter>.rustfs.log (e.g., 20231027103001.123456-0.rustfs.log); with prefix, they look like rustfs.log.<timestamp>-<counter> (e.g., rustfs.log.20231027103001.123456-0). |
| RUSTFS_OBS_LOG_ROTATION_TIME | hourly | Rotation granularity: minutely, hourly, or daily |
| RUSTFS_OBS_LOG_KEEP_FILES | 30 | Number of rolling files to keep (also used by cleaner) |
| RUSTFS_OBS_LOG_MATCH_MODE | suffix | File matching mode: prefix or suffix |
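The two naming schemes described for RUSTFS_OBS_LOG_FILENAME can be sketched as follows. MatchMode, rotated_name, and is_rotated are hypothetical names for illustration; the crate's own matching logic may apply additional checks (for example on the timestamp format).

```rust
/// Illustrative match mode; not the crate's own type.
#[derive(Clone, Copy)]
enum MatchMode {
    Prefix,
    Suffix,
}

/// Build a rotated file name from the base name, a timestamp, and a counter.
fn rotated_name(base: &str, timestamp: &str, counter: u32, mode: MatchMode) -> String {
    match mode {
        // suffix mode: <timestamp>-<counter>.<base>
        MatchMode::Suffix => format!("{timestamp}-{counter}.{base}"),
        // prefix mode: <base>.<timestamp>-<counter>
        MatchMode::Prefix => format!("{base}.{timestamp}-{counter}"),
    }
}

/// Simplified check for whether a file name looks like a rotated archive of `base`.
fn is_rotated(name: &str, base: &str, mode: MatchMode) -> bool {
    match mode {
        MatchMode::Suffix => name.ends_with(format!(".{base}").as_str()),
        MatchMode::Prefix => name.starts_with(format!("{base}.").as_str()),
    }
}

fn main() {
    let ts = "20231027103001.123456";
    for mode in [MatchMode::Suffix, MatchMode::Prefix] {
        let name = rotated_name("rustfs.log", ts, 0, mode);
        println!("{name} rotated={}", is_rotated(&name, "rustfs.log", mode));
    }
}
```

In this sketch the active file rustfs.log matches neither pattern, because both checks require the separating dot next to the base name.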
Log cleanup
| Variable | Default | Description |
|---|---|---|
| RUSTFS_OBS_LOG_MAX_TOTAL_SIZE_BYTES | 2147483648 | Hard cap on total log directory size (2 GiB) |
| RUSTFS_OBS_LOG_MAX_SINGLE_FILE_SIZE_BYTES | 0 | Per-file size cap; 0 = unlimited |
| RUSTFS_OBS_LOG_COMPRESS_OLD_FILES | true | Compress files before deleting |
| RUSTFS_OBS_LOG_GZIP_COMPRESSION_LEVEL | 6 | Gzip level 1 (fastest) – 9 (best) |
| RUSTFS_OBS_LOG_COMPRESSION_ALGORITHM | zstd | Compression codec: zstd or gzip |
| RUSTFS_OBS_LOG_PARALLEL_COMPRESS | true | Enable work-stealing parallel compression |
| RUSTFS_OBS_LOG_PARALLEL_WORKERS | 6 | Number of cleaner worker threads |
| RUSTFS_OBS_LOG_ZSTD_COMPRESSION_LEVEL | 8 | Zstd level 1 (fastest) – 21 (best ratio) |
| RUSTFS_OBS_LOG_ZSTD_FALLBACK_TO_GZIP | true | Fallback to gzip when zstd compression fails |
| RUSTFS_OBS_LOG_ZSTD_WORKERS | 1 | zstdmt worker threads per compression task |
| RUSTFS_OBS_LOG_COMPRESSED_FILE_RETENTION_DAYS | 30 | Delete .gz / .zst archives older than N days; 0 = keep forever |
| RUSTFS_OBS_LOG_EXCLUDE_PATTERNS | (empty) | Comma-separated glob patterns to never clean up |
| RUSTFS_OBS_LOG_DELETE_EMPTY_FILES | true | Remove zero-byte files |
| RUSTFS_OBS_LOG_MIN_FILE_AGE_SECONDS | 3600 | Minimum file age (seconds) before cleanup |
| RUSTFS_OBS_LOG_CLEANUP_INTERVAL_SECONDS | 1800 | How often the cleanup task runs, in seconds (30 minutes) |
| RUSTFS_OBS_LOG_DRY_RUN | false | Report deletions without actually removing files |
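To make the interaction between the keep-files count, the total-size cap, and the minimum file age more concrete, here is one plausible way such limits could combine. This is a sketch under stated assumptions, not the crate's selection algorithm: the FileInfo fields, the oldest-first order, and the function select_for_cleanup are all hypothetical, and compression, exclude patterns, and archive retention are not modelled.

```rust
/// Hypothetical view of one rotated log file.
struct FileInfo {
    name: String,
    size: u64,     // bytes
    age_secs: u64, // seconds since last modification
}

/// Assumed policy: walk files oldest first and select files for removal
/// while either the file count or the total size exceeds its limit,
/// never touching files younger than `min_age_secs`.
fn select_for_cleanup(
    mut files: Vec<FileInfo>,
    keep_files: usize,
    max_total_bytes: u64,
    min_age_secs: u64,
) -> Vec<String> {
    // Oldest (largest age) first.
    files.sort_by(|a, b| b.age_secs.cmp(&a.age_secs));
    let mut count = files.len();
    let mut total: u64 = files.iter().map(|f| f.size).sum();
    let mut selected = Vec::new();
    for f in &files {
        let over_count = count > keep_files;
        let over_size = total > max_total_bytes;
        if !(over_count || over_size) {
            break; // both limits are satisfied
        }
        if f.age_secs < min_age_secs {
            continue; // too young to clean up
        }
        selected.push(f.name.clone());
        count -= 1;
        total -= f.size;
    }
    selected
}

fn main() {
    let files = vec![
        FileInfo { name: "a.log".into(), size: 100, age_secs: 10_000 },
        FileInfo { name: "b.log".into(), size: 100, age_secs: 8_000 },
        FileInfo { name: "c.log".into(), size: 100, age_secs: 5_000 },
        FileInfo { name: "d.log".into(), size: 100, age_secs: 100 },
    ];
    println!("{:?}", select_for_cleanup(files, 2, 1_000, 3_600));
}
```

RUSTFS_OBS_LOG_DRY_RUN=true remains the safer way to see what the real cleaner would select on a given directory.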
Cleaner & Rotation Metrics
The log rotation and cleanup pipeline emits these metrics (via the metrics facade):
| Metric | Type | Description |
|---|---|---|
| rustfs.log_cleaner.deleted_files_total | counter | Number of files deleted per cleanup pass |
| rustfs.log_cleaner.freed_bytes_total | counter | Bytes reclaimed by deletion |
| rustfs.log_cleaner.compress_duration_seconds | histogram | Compression stage duration |
| rustfs.log_cleaner.steal_success_rate | gauge | Work-stealing success ratio in parallel mode |
| rustfs.log_cleaner.runs_total | counter | Successful cleanup loop runs |
| rustfs.log_cleaner.run_failures_total | counter | Failed or panicked cleanup loop runs |
| rustfs.log_cleaner.rotation_total | counter | Successful file rotations |
| rustfs.log_cleaner.rotation_failures_total | counter | Failed file rotations |
| rustfs.log_cleaner.rotation_duration_seconds | histogram | Rotation latency |
| rustfs.log_cleaner.active_file_size_bytes | gauge | Current active log file size |
These metrics cover compression, cleanup, and file rotation end-to-end.
Metric Semantics
- deleted_files_total and freed_bytes_total are emitted after each cleanup pass and include both normal log cleanup and expired compressed archive cleanup.
- compress_duration_seconds measures compression stage wall-clock time for both serial and parallel modes.
- steal_success_rate is updated by the parallel work-stealing path and remains at the last computed value.
- rotation_* metrics are emitted by RollingAppender and include retries; a failed final rotation increments rotation_failures_total.
- active_file_size_bytes is sampled on writes and after a successful roll, so dashboards can track current active file growth.
Grafana Dashboard JSON Draft (Ready to Import)
Save this as rustfs-log-cleaner-dashboard.json, then import it from the Grafana UI. For Prometheus datasources, metric names are usually normalized to underscores, so rustfs.log_cleaner.deleted_files_total becomes rustfs_log_cleaner_deleted_files_total.

The same panels are now checked in at .docker/observability/grafana/dashboards/rustfs.json (row title: Log Cleaner).
{
"uid": "rustfs-log-cleaner",
"title": "RustFS Log Cleaner",
"timezone": "browser",
"schemaVersion": 39,
"version": 1,
"refresh": "10s",
"tags": ["rustfs", "observability", "log-cleaner"],
"time": {
"from": "now-6h",
"to": "now"
},
"panels": [
{
"id": 1,
"title": "Cleanup Runs / Failures",
"type": "timeseries",
"targets": [
{ "refId": "A", "expr": "sum(rate(rustfs_log_cleaner_runs_total[5m]))", "legendFormat": "runs/s" },
{ "refId": "B", "expr": "sum(rate(rustfs_log_cleaner_run_failures_total[5m]))", "legendFormat": "failures/s" }
],
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
},
{
"id": 2,
"title": "Freed Bytes / Deleted Files",
"type": "timeseries",
"targets": [
{ "refId": "A", "expr": "sum(rate(rustfs_log_cleaner_freed_bytes_total[15m]))", "legendFormat": "bytes/s" },
{ "refId": "B", "expr": "sum(rate(rustfs_log_cleaner_deleted_files_total[15m]))", "legendFormat": "files/s" }
],
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }
},
{
"id": 3,
"title": "Compression P95 Latency",
"type": "timeseries",
"targets": [
{
"refId": "A",
"expr": "histogram_quantile(0.95, sum(rate(rustfs_log_cleaner_compress_duration_seconds_bucket[5m])) by (le))",
"legendFormat": "p95"
}
],
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 }
},
{
"id": 4,
"title": "Rotation Success / Failure",
"type": "timeseries",
"targets": [
{ "refId": "A", "expr": "sum(rate(rustfs_log_cleaner_rotation_total[5m]))", "legendFormat": "rotation/s" },
{ "refId": "B", "expr": "sum(rate(rustfs_log_cleaner_rotation_failures_total[5m]))", "legendFormat": "rotation_failures/s" }
],
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 }
},
{
"id": 5,
"title": "Steal Success Rate",
"type": "timeseries",
"targets": [
{ "refId": "A", "expr": "max(rustfs_log_cleaner_steal_success_rate)", "legendFormat": "ratio" }
],
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 }
},
{
"id": 6,
"title": "Active File Size",
"type": "timeseries",
"targets": [
{ "refId": "A", "expr": "max(rustfs_log_cleaner_active_file_size_bytes)", "legendFormat": "bytes" }
],
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 }
}
]
}
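The dashboard queries rely on the dot-to-underscore normalization that Prometheus-style pipelines typically apply to metric names. A simplified sketch of that mapping is shown below; prometheus_name is a hypothetical helper, and real exporters may apply further rules (such as unit suffixes or handling of a leading digit) that are not modelled here.

```rust
/// Simplified normalization: keep characters valid in Prometheus metric
/// names (ASCII letters, digits, '_' and ':') and replace everything else with '_'.
fn prometheus_name(otel_name: &str) -> String {
    otel_name
        .chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() || c == '_' || c == ':' {
                c
            } else {
                '_'
            }
        })
        .collect()
}

fn main() {
    for name in [
        "rustfs.log_cleaner.deleted_files_total",
        "rustfs.log_cleaner.active_file_size_bytes",
    ] {
        println!("{name} -> {}", prometheus_name(name));
    }
}
```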
PromQL Templates
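Use these templates directly in Grafana panels. $__rate_interval is a Grafana variable, so replace it with a fixed window (for example 5m) when reusing a query in a Prometheus alerting rule.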
- Cleanup run rate
sum(rate(rustfs_log_cleaner_runs_total[$__rate_interval]))
- Cleanup failure rate
sum(rate(rustfs_log_cleaner_run_failures_total[$__rate_interval]))
- Cleanup failure ratio
sum(rate(rustfs_log_cleaner_run_failures_total[$__rate_interval])) / clamp_min(sum(rate(rustfs_log_cleaner_runs_total[$__rate_interval])), 1e-9)
- Freed bytes throughput
sum(rate(rustfs_log_cleaner_freed_bytes_total[$__rate_interval]))
- Deleted files throughput
sum(rate(rustfs_log_cleaner_deleted_files_total[$__rate_interval]))
- Compression p95 latency
histogram_quantile(0.95, sum(rate(rustfs_log_cleaner_compress_duration_seconds_bucket[$__rate_interval])) by (le))
- Rotation failure ratio
sum(rate(rustfs_log_cleaner_rotation_failures_total[$__rate_interval])) / clamp_min(sum(rate(rustfs_log_cleaner_rotation_total[$__rate_interval])), 1e-9)
- Work-stealing efficiency (latest)
max(rustfs_log_cleaner_steal_success_rate)
- Active file size (latest)
max(rustfs_log_cleaner_active_file_size_bytes)
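The ratio templates wrap the denominator in clamp_min(..., 1e-9) so that a window with no runs yields 0 rather than a division by zero. The arithmetic can be sketched in Rust as below; failure_ratio is a hypothetical helper that mirrors the PromQL expression, not code from the crate.

```rust
/// Mirrors `failures / clamp_min(runs, 1e-9)` from the PromQL templates.
fn failure_ratio(failure_rate: f64, run_rate: f64) -> f64 {
    failure_rate / run_rate.max(1e-9)
}

fn main() {
    // One failure for every twenty runs: right at the 0.05 alert threshold.
    println!("{}", failure_ratio(0.01, 0.2));
    // No runs and no failures in the window: the clamp keeps the result at 0.
    println!("{}", failure_ratio(0.0, 0.0));
    // Without the clamp, 0.0 / 0.0 is NaN, which never satisfies an alert condition.
    println!("{}", (0.0_f64 / 0.0_f64).is_nan());
}
```

One consequence of the clamp is that failures recorded while the run rate is zero produce a very large ratio, which is usually the desired behaviour for an alert.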
Suggested Alerts
- CleanupFailureRatioHigh: failure ratio > 0.05 for 10m.
- CompressionLatencyP95High: p95 above your baseline SLO for 15m.
- RotationFailuresDetected: rotation failure rate > 0 for 3 consecutive windows.
- NoCleanupActivity: cleanup run rate == 0 in an environment where cleanup is expected to be active.
Metrics Compatibility
The project is currently in active development. Metric names and labels are updated directly when the architecture evolves, and no backward-compatibility shim is maintained for old names. Treat the metric names documented in this README as the current source of truth.
Examples
Stdout-only (development default)
# No RUSTFS_OBS_LOG_DIRECTORY set → stdout JSON
RUSTFS_OBS_LOGGER_LEVEL=debug ./rustfs
Rolling-file logging
export RUSTFS_OBS_LOG_DIRECTORY=/var/log/rustfs
export RUSTFS_OBS_LOGGER_LEVEL=info
export RUSTFS_OBS_LOG_KEEP_FILES=30
export RUSTFS_OBS_LOG_MAX_TOTAL_SIZE_BYTES=5368709120 # 5 GiB
./rustfs
Full OTLP pipeline (production)
export RUSTFS_OBS_ENDPOINT=http://otel-collector:4318
export RUSTFS_OBS_ENVIRONMENT=production
export RUSTFS_OBS_SAMPLE_RATIO=0.05 # 5% trace sampling
export RUSTFS_OBS_LOG_DIRECTORY=/var/log/rustfs
export RUSTFS_OBS_LOG_STDOUT_ENABLED=false
./rustfs
Separate per-signal endpoints
export RUSTFS_OBS_TRACE_ENDPOINT=http://tempo:4318/v1/traces
export RUSTFS_OBS_METRIC_ENDPOINT=http://prometheus-otel:4318/v1/metrics
export RUSTFS_OBS_LOG_ENDPOINT=http://loki-otel:4318/v1/logs
./rustfs
Dry-run cleanup audit
export RUSTFS_OBS_LOG_DIRECTORY=/var/log/rustfs
export RUSTFS_OBS_LOG_DRY_RUN=true
./rustfs
# Observe log output — no files will actually be deleted.
Parallel zstd cleanup (recommended production profile)
export RUSTFS_OBS_LOG_DIRECTORY=/var/log/rustfs
export RUSTFS_OBS_LOG_COMPRESSION_ALGORITHM=zstd
export RUSTFS_OBS_LOG_PARALLEL_COMPRESS=true
export RUSTFS_OBS_LOG_PARALLEL_WORKERS=6
export RUSTFS_OBS_LOG_ZSTD_COMPRESSION_LEVEL=8
export RUSTFS_OBS_LOG_ZSTD_FALLBACK_TO_GZIP=true
export RUSTFS_OBS_LOG_ZSTD_WORKERS=1
./rustfs
Module Structure
rustfs-obs/src/
├── lib.rs # Crate root; public re-exports
├── config.rs # OtelConfig + AppConfig; env-var loading
├── error.rs # TelemetryError type
├── global.rs # init_obs / init_obs_with_config entry points
│
├── telemetry/ # Backend initialisation
│ ├── mod.rs # init_telemetry routing logic
│ ├── guard.rs # OtelGuard RAII lifecycle manager
│ ├── filter.rs # EnvFilter construction helpers
│ ├── resource.rs # OTel Resource builder
│ ├── local.rs # Stdout-only and rolling-file backends
│ ├── otel.rs # Full OTLP/HTTP pipeline
│ └── recorder.rs # metrics-crate → OTel bridge (Recorder)
│
├── cleaner/ # Background log-file cleanup subsystem
│ ├── mod.rs # LogCleaner public API + tests
│ ├── types.rs # Shared cleaner types (match mode, compression codec, FileInfo)
│ ├── scanner.rs # Filesystem discovery
│ ├── compress.rs # Gzip/Zstd compression helper
│ └── core.rs # Selection, compression, deletion logic
│
└── system/ # Host metrics (CPU, memory, disk, GPU)
├── mod.rs
├── attributes.rs
├── collector.rs
├── metrics.rs
└── gpu.rs # GPU metrics (feature = "gpu")
Using LogCleaner Directly
use std::path::PathBuf;
use rustfs_obs::LogCleaner;
use rustfs_obs::types::FileMatchMode;
let cleaner = LogCleaner::builder(
    PathBuf::from("/var/log/rustfs"),
    "rustfs.log.".to_string(), // file_pattern
    "rustfs.log".to_string(),  // active_filename
)
    .match_mode(FileMatchMode::Prefix)
    .keep_files(10)
    .max_total_size_bytes(2 * 1024 * 1024 * 1024) // 2 GiB
    .max_single_file_size_bytes(0) // unlimited
    .compress_old_files(true)
    .gzip_compression_level(6)
    .compressed_file_retention_days(7)
    .exclude_patterns(vec!["current.log".to_string()])
    .delete_empty_files(true)
    .min_file_age_seconds(3600) // 1 hour
    .dry_run(false)
    .build();
let (deleted, freed_bytes) = cleaner.cleanup().expect("cleanup failed");
println!("Deleted {deleted} files, freed {freed_bytes} bytes");
Feature Flags
| Flag | Description |
|---|---|
| (default) | Core logging, tracing, and metrics |
| gpu | GPU utilisation metrics via nvml |
| full | All features enabled |
# Enable GPU monitoring
rustfs-obs = { version = "0.0.5", features = ["gpu"] }
# Enable everything
rustfs-obs = { version = "0.0.5", features = ["full"] }
License
Apache 2.0 — see LICENSE.