Optimize concurrent GetObject performance with Moka cache, comprehensive metrics, complete test suite, cache writeback, and comprehensive documentation (#916)

* Initial plan

* feat: add concurrency-aware buffer sizing and hot object caching for GetObject

- Implement adaptive buffer sizing based on concurrent request load
- Add per-request tracking with automatic cleanup using RAII guards
- Implement hot object cache (LRU) for frequently accessed small files (<= 10MB)
- Add disk I/O semaphore to prevent saturation under extreme load
- Integrate concurrency module into GetObject implementation
- Buffer sizes now adapt: low concurrency uses large buffers for throughput,
  high concurrency uses smaller buffers for fairness and memory efficiency
- Add comprehensive metrics collection for monitoring performance

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* docs: add comprehensive documentation and tests for concurrent GetObject optimization

- Add detailed technical documentation explaining the solution
- Document root cause analysis and solution architecture
- Include performance expectations and testing recommendations
- Add integration tests for concurrency tracking and buffer sizing
- Add cache behavior tests
- Include benchmark tests for concurrent request handling

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* fix: address code review issues in concurrency module

- Fix race condition in cache size tracking by using consistent atomic operations within lock
- Correct buffer sizing logic: 1-2 requests use 100%, 3-4 use 75%, 5-8 use 50%, >8 use 40%
- Improve error message for semaphore acquire failure
- Document limitation of streaming cache implementation (not yet implemented)
- Add TODO for proper streaming cache with suggested approaches
- Update tests to match corrected buffer sizing thresholds

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* docs: add comprehensive implementation summary for concurrent GetObject optimization

- Executive summary of problem and solution
- Detailed architecture documentation for each optimization
- Integration points and code examples
- Complete testing strategy and performance validation steps
- Monitoring and observability guidelines with Prometheus queries
- Deployment considerations and rollout strategy
- Future enhancement roadmap
- Success criteria and quantitative metrics

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* fix

* fix

* refactor: optimize cache with lru 0.16.2 read-first pattern and add advanced features

- Implement optimized read-first cache access using peek() to reduce write lock contention
- Add batch cache operations: get_cached_batch() for efficient multi-object retrieval
- Add cache utility methods: is_cached(), remove_cached(), get_hot_keys()
- Implement warm_cache() for pre-populating cache on startup
- Add get_advanced_buffer_size() with file size and sequential read optimization
- Enhance test suite with 8 new comprehensive tests covering:
  - Batch operations and cache warming
  - Hot keys tracking and analysis
  - Cache removal and LRU behavior verification
  - Concurrent cache access performance
  - Advanced buffer sizing strategies
- Improve documentation and code comments in English throughout

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* docs: add final comprehensive optimization summary

- Complete technical documentation of all optimizations
- Detailed performance analysis and metrics
- Production deployment guide with examples
- Comprehensive API reference and usage patterns
- Migration guide and future enhancement roadmap
- All documentation in professional English

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* fix

* fix

* add moka crate for cache

* feat: replace LRU with Moka cache and add comprehensive metrics

- Replace lru crate with moka 0.12.11 for superior concurrent performance
- Implement lock-free cache with automatic TTL/TTI expiration
- Add size-based eviction using Moka's weigher function
- Integrate comprehensive metrics collection throughout GetObject flow:
  * Cache hit/miss tracking with per-key access counts
  * Request concurrency gauges
  * Disk permit wait time histograms
  * Total request duration tracking
  * Response size and buffer size histograms
- Deep integration with ecfs.rs GetObject operation
- Add hit rate calculation method
- Enhanced CacheStats with hit/miss counters
- Lock-free concurrent reads for better scalability

Moka advantages over LRU:
- True lock-free concurrent access
- Built-in TTL and TTI support
- Automatic size-based eviction
- Better performance under high concurrency
- Native async support

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* docs: add comprehensive Moka cache migration and metrics documentation

- Complete technical documentation of LRU to Moka migration
- Detailed performance comparison and benchmarks
- Comprehensive metrics catalog with 15+ Prometheus metrics
- Prometheus query examples for monitoring
- Dashboard and alerting recommendations
- Migration guide with code examples
- Troubleshooting guide for common issues
- Future enhancement roadmap

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* fix

* fix

* refactor: update tests for Moka cache implementation

- Completely refactor test suite to align with Moka-based concurrency.rs
- Add Clone derive to ConcurrencyManager for test convenience
- Update all tests to handle Moka's async behavior with proper delays
- Add new tests:
  * test_cache_hit_rate - validate hit rate calculation
  * test_ttl_expiration - verify TTL configuration
  * test_is_cached_no_side_effects - ensure contains doesn't affect LRU
  * bench_concurrent_cache_performance - benchmark concurrent access
- Updated existing tests:
  * test_moka_cache_operations - renamed and updated for Moka API
  * test_moka_cache_eviction - validate automatic eviction
  * test_hot_keys_tracking - improved assertions for sorted results
  * test_concurrent_cache_access - validate lock-free performance
- All tests now include appropriate sleep delays for Moka's async processing
- Enhanced documentation and assertions for better test clarity
- Total: 18 comprehensive integration tests

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* docs: add comprehensive Moka test suite documentation

- Complete test suite documentation for all 18 tests
- Detailed test patterns and best practices for Moka
- Debugging guide for common test failures
- Performance expectations and CI integration
- 100% API coverage validation
- Future enhancement roadmap

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* fix

* fmt

* docs: enhance code comments and test documentation throughout

- Significantly improve module-level documentation in concurrency.rs
- Add detailed explanations of performance characteristics and expected improvements
- Document the solution to issue #911 with before/after metrics
- Enhance constant documentation with purpose and impact explanations
- Add comprehensive RAII guard documentation with thread safety notes
- Improve test documentation with detailed test strategy and rationale
- Add Moka-specific test patterns and timing guidance
- Include performance expectations for each test category
- Document why each test matters for solving the original issue
- All documentation written in professional English
- Follow Rust documentation best practices with examples

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* remove lru crate

* upgrade version

* fix: resolve test errors by correcting module structure and test assertions

- Fix test import paths to use crate:: instead of rustfs:: (binary-only crate)
- Keep test file in src/storage/ instead of tests/ (no lib.rs exists)
- Add #[cfg(test)] guard to mod declaration in storage/mod.rs
- Fix Arc type annotations for Moka's ConcurrencyManager in concurrent tests
- Correct test_buffer_size_bounds assertions to match actual implementation:
  * Minimum buffer is 32KB for files <100KB, 64KB otherwise
  * Maximum buffer respects base_buffer_size when concurrency is low
  * Buffer sizing doesn't cap at file size, only at min/max constraints
- All 17 integration tests now pass successfully

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* fix: modify `TimeoutLayer::new` to `TimeoutLayer::with_status_code` and improve docker health check

* fix

* feat: implement cache writeback for small objects in GetObject

- Add cache writeback logic for objects meeting caching criteria:
  * No range/part request (full object retrieval)
  * Object size known and <= 10MB (max_object_size threshold)
  * Not encrypted (SSE-C or managed encryption)
- Read eligible objects into memory and cache via background task
- Serve response from in-memory data for immediate client response
- Add metrics counter for cache writeback operations
- Add 3 new tests for cache writeback functionality:
  * test_cache_writeback_flow - validates round-trip caching
  * test_cache_writeback_size_limit - ensures large objects aren't cached
  * test_cache_writeback_concurrent - validates thread-safe concurrent writes
- Update test suite documentation (now 20 comprehensive tests)

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* improve code for const

* cargo clippy

* feat: add cache enable/disable configuration via environment variable

- Add is_cache_enabled() method to ConcurrencyManager
- Read RUSTFS_OBJECT_CACHE_ENABLE env var (default: false) at startup
- Update ecfs.rs to check is_cache_enabled() before cache lookup and writeback
- Cache lookup and writeback now respect the enable flag
- Add test_cache_enable_configuration test
- Constants already exist in rustfs_config:
  * ENV_OBJECT_CACHE_ENABLE = "RUSTFS_OBJECT_CACHE_ENABLE"
  * DEFAULT_OBJECT_CACHE_ENABLE = false
- Total: 21 comprehensive tests passing

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* fix

* fmt

* fix

* fix

* feat: implement comprehensive CachedGetObject response cache with metadata

- Add CachedGetObject struct with full response metadata fields:
  * body, content_length, content_type, e_tag, last_modified
  * expires, cache_control, content_disposition, content_encoding
  * storage_class, version_id, delete_marker, tag_count, etc.
- Add dual cache architecture in HotObjectCache:
  * Legacy simple byte cache for backward compatibility
  * New response cache for complete GetObject responses
- Add ConcurrencyManager methods for response caching:
  * get_cached_object() - retrieve cached response with metadata
  * put_cached_object() - store complete response
  * invalidate_cache() - invalidate on write operations
  * invalidate_cache_versioned() - invalidate both version and latest
  * make_cache_key() - generate cache keys with version support
  * max_object_size() - get cache threshold
- Add builder pattern for CachedGetObject construction
- Add 6 new tests for response cache functionality (27 total):
  * test_cached_get_object_basic - basic operations
  * test_cached_get_object_versioned - version key handling
  * test_cache_invalidation - write operation invalidation
  * test_cache_invalidation_versioned - versioned invalidation
  * test_cached_get_object_size_limit - size enforcement
  * test_max_object_size - threshold accessor

All 27 tests pass successfully.

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* feat: integrate CachedGetObject cache in ecfs.rs with full metadata and cache invalidation

Integration of CachedGetObject response cache in ecfs.rs:
1. get_object: Cache lookup uses get_cached_object() with full metadata
   - Returns complete response with e_tag, last_modified, content_type, etc.
   - Parses last_modified from RFC3339 string
   - Supports versioned cache keys via make_cache_key()

2. get_object: Cache writeback uses put_cached_object() with metadata
   - Stores content_type, e_tag, last_modified in CachedGetObject
   - Background writeback via tokio::spawn()

3. Cache invalidation added to write operations:
   - put_object: invalidate_cache_versioned() after store.put_object()
   - put_object_extract: invalidate_cache_versioned() after each file extraction
   - copy_object: invalidate_cache_versioned() after store.copy_object()
   - delete_object: invalidate_cache_versioned() after store.delete_object()
   - delete_objects: invalidate_cache_versioned() for each deleted object
   - complete_multipart_upload: invalidate_cache_versioned() after completion

4. Fixed test_adaptive_buffer_sizing to be more robust with parallel tests

All 27 tests pass.

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* fix: add error logging for time parsing in cache lookup and writeback

- Add warning log when RFC3339 parsing fails in cache lookup
- Add warning log when time formatting fails in cache writeback
- Improves debugging for cache-related issues

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* fix

* fix

* upgrade version

* fmt

* add http: 4138 and add max_object_size key

* fix

* fix

* fix

* modify metrics key

* add

* upgrade crates version and improve docker observability

* feat: implement adaptive I/O strategy based on disk permit wait time

Implements adaptive I/O strategy for GetObject operations based on disk permit
wait times, as requested in issue #911.

Key changes:

1. New types in concurrency.rs:
   - IoLoadLevel enum: Low (<10ms), Medium (10-50ms), High (50-200ms), Critical (>200ms)
   - IoStrategy struct: contains optimized I/O parameters based on load level
   - IoLoadMetrics: rolling window metrics for load tracking

2. New ConcurrencyManager methods:
   - calculate_io_strategy(): calculates adaptive strategy from permit wait duration
   - record_permit_wait(): records observation for load tracking
   - smoothed_load_level(): returns averaged load level for stability
   - io_load_stats(): returns (avg_wait, p95_wait, max_wait, count) for monitoring
   - adaptive_buffer_size(): convenience method for buffer sizing

3. Integration in ecfs.rs get_object:
   - Calculate IoStrategy after acquiring disk permit
   - Use strategy buffer_size instead of static sizing
   - Consider strategy.cache_writeback_enabled in cache decision
   - Record new metrics: io.load.level gauge, io.buffer.multiplier gauge,
     io.strategy.selected counter by load level

4. New tests (4 added, 31 total):
   - test_io_load_level_classification: validates load level thresholds
   - test_io_strategy_buffer_sizing: validates buffer multipliers
   - test_calculate_io_strategy: validates manager strategy calculation
   - test_io_load_stats: validates stats tracking

All 31 tests pass. Clippy clean. Formatted.

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* fix

* fix

* docs: add comprehensive architecture design and fix cache hit event notification

1. Added CONCURRENCY_ARCHITECTURE.md with complete design documentation:
   - Executive summary and problem statement
   - Architecture overview with request flow diagram
   - Detailed module analysis for concurrency.rs and ecfs.rs
   - Critical analysis of helper.complete() for cache hits
   - Adaptive I/O strategy design with algorithm
   - Cache architecture with CachedGetObject structure
   - Metrics and monitoring with Prometheus queries
   - Performance characteristics and future enhancements

2. Fixed critical issue: Cache hit path now calls helper.complete()
   - S3 bucket notifications (s3:GetObject events) now trigger for cache hits
   - Event-driven workflows (Lambda, SNS) work correctly for all object access
   - Maintains audit trail for both cache hits and misses

All 31 tests pass.

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* fix: set object info and version_id on helper before complete() for cache hits

When serving from cache, properly configure the OperationHelper before
calling complete() to ensure S3 bucket notifications include complete
object metadata:

1. Build ObjectInfo from cached metadata:
   - bucket, name, size, actual_size
   - etag, mod_time, version_id, delete_marker
   - storage_class, content_type, content_encoding
   - user_metadata (user_defined)

2. Set helper.object(event_info).version_id(version_id_str) before complete()

3. Updated CONCURRENCY_ARCHITECTURE.md with:
   - Complete code example for cache hit event notification
   - Explanation of why ObjectInfo is required
   - Documentation of version_id handling

This ensures:
- Lambda triggers receive proper object metadata for cache hits
- SNS/SQS notifications include complete information
- Audit logs contain accurate object details
- Version-specific event routing works correctly

All 31 tests pass.

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* fix

* improve code

* fmt

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>
Co-authored-by: houseme <housemecn@gmail.com>
This commit is contained in:
Copilot
2025-11-30 01:16:55 +08:00
committed by GitHub
parent a6cf0740cb
commit fdcdb30d28
35 changed files with 6833 additions and 230 deletions

View File

@@ -34,61 +34,111 @@ services:
ports:
- "3200:3200" # tempo
- "24317:4317" # otlp grpc
- "24318:4318" # otlp http
restart: unless-stopped
networks:
- otel-network
healthcheck:
test: [ "CMD", "wget", "--spider", "-q", "http://localhost:3200/metrics" ]
interval: 10s
timeout: 5s
retries: 3
start_period: 15s
otel-collector:
image: otel/opentelemetry-collector-contrib:latest
environment:
- TZ=Asia/Shanghai
volumes:
- ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml
- ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml:ro
ports:
- "1888:1888"
- "8888:8888"
- "8889:8889"
- "13133:13133"
- "4317:4317"
- "4318:4318"
- "55679:55679"
- "1888:1888" # pprof
- "8888:8888" # Prometheus metrics for Collector
- "8889:8889" # Prometheus metrics for application indicators
- "13133:13133" # health check
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
- "55679:55679" # zpages
networks:
- otel-network
depends_on:
jaeger:
condition: service_started
tempo:
condition: service_started
prometheus:
condition: service_started
loki:
condition: service_started
healthcheck:
test: [ "CMD", "wget", "--spider", "-q", "http://localhost:13133" ]
interval: 10s
timeout: 5s
retries: 3
jaeger:
image: jaegertracing/jaeger:latest
environment:
- TZ=Asia/Shanghai
- SPAN_STORAGE_TYPE=memory
- COLLECTOR_OTLP_ENABLED=true
ports:
- "16686:16686"
- "14317:4317"
- "14318:4318"
- "16686:16686" # Web UI
- "14317:4317" # OTLP gRPC
- "14318:4318" # OTLP HTTP
- "18888:8888" # collector
networks:
- otel-network
healthcheck:
test: [ "CMD", "wget", "--spider", "-q", "http://localhost:16686" ]
interval: 10s
timeout: 5s
retries: 3
prometheus:
image: prom/prometheus:latest
environment:
- TZ=Asia/Shanghai
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ./prometheus-data:/prometheus
ports:
- "9090:9090"
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--web.enable-otlp-receiver' # Enable OTLP
- '--web.enable-remote-write-receiver' # Enable remote write
- '--enable-feature=promql-experimental-functions' # Enable info()
- '--storage.tsdb.min-block-duration=15m' # Minimum block duration
- '--storage.tsdb.max-block-duration=1h' # Maximum block duration
- '--log.level=info'
- '--storage.tsdb.retention.time=30d'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/usr/share/prometheus/console_libraries'
- '--web.console.templates=/usr/share/prometheus/consoles'
restart: unless-stopped
networks:
- otel-network
healthcheck:
test: [ "CMD", "wget", "--spider", "-q", "http://localhost:9090/-/healthy" ]
interval: 10s
timeout: 5s
retries: 3
loki:
image: grafana/loki:latest
environment:
- TZ=Asia/Shanghai
volumes:
- ./loki-config.yaml:/etc/loki/local-config.yaml
- ./loki-config.yaml:/etc/loki/local-config.yaml:ro
ports:
- "3100:3100"
command: -config.file=/etc/loki/local-config.yaml
networks:
- otel-network
healthcheck:
test: [ "CMD", "wget", "--spider", "-q", "http://localhost:3100/ready" ]
interval: 10s
timeout: 5s
retries: 3
grafana:
image: grafana/grafana:latest
ports:
@@ -97,14 +147,32 @@ services:
- ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
- GF_SECURITY_ADMIN_USER=admin
- TZ=Asia/Shanghai
- GF_INSTALL_PLUGINS=grafana-pyroscope-datasource
restart: unless-stopped
networks:
- otel-network
depends_on:
- prometheus
- tempo
- loki
healthcheck:
test: [ "CMD", "wget", "--spider", "-q", "http://localhost:3000/api/health" ]
interval: 10s
timeout: 5s
retries: 3
volumes:
prometheus-data:
tempo-data:
networks:
otel-network:
driver: bridge
name: "network_otel_config"
ipam:
config:
- subnet: 172.28.0.0/16
driver_opts:
com.docker.network.enable_ipv6: "true"

View File

@@ -42,7 +42,7 @@ datasources:
customQuery: true
query: 'method="$${__span.tags.method}"'
tracesToMetrics:
datasourceUid: 'prom'
datasourceUid: 'prometheus'
spanStartTimeShift: '-1h'
spanEndTimeShift: '1h'
tags: [ { key: 'service.name', value: 'service' }, { key: 'job' } ]
@@ -91,7 +91,7 @@ datasources:
customQuery: true
query: 'method="$${__span.tags.method}"'
tracesToMetrics:
datasourceUid: 'prom'
datasourceUid: 'Prometheus'
spanStartTimeShift: '1h'
spanEndTimeShift: '-1h'
tags: [ { key: 'service.name', value: 'service' }, { key: 'job' } ]

View File

@@ -65,6 +65,7 @@ extensions:
some_store:
memory:
max_traces: 1000000
max_events: 100000
another_store:
memory:
max_traces: 1000000
@@ -102,6 +103,7 @@ receivers:
processors:
batch:
metadata_keys: [ "span.kind", "http.method", "http.status_code", "db.system", "db.statement", "messaging.system", "messaging.destination", "messaging.operation","span.events","span.links" ]
# Adaptive Sampling Processor is required to support adaptive sampling.
# It expects remote_sampling extension with `adaptive:` config to be enabled.
adaptive_sampling:

View File

@@ -41,6 +41,9 @@ query_range:
limits_config:
metric_aggregation_enabled: true
max_line_size: 256KB
max_line_size_truncate: false
allow_structured_metadata: true
schema_config:
configs:
@@ -51,6 +54,7 @@ schema_config:
index:
prefix: index_
period: 24h
row_shards: 16
pattern_ingester:
enabled: true

View File

@@ -15,66 +15,108 @@
receivers:
otlp:
protocols:
grpc: # OTLP gRPC 接收器
grpc: # OTLP gRPC receiver
endpoint: 0.0.0.0:4317
http: # OTLP HTTP 接收器
http: # OTLP HTTP receiver
endpoint: 0.0.0.0:4318
processors:
batch: # 批处理处理器,提升吞吐量
batch: # Batch processor to improve throughput
timeout: 5s
send_batch_size: 1000
metadata_keys: [ ]
metadata_cardinality_limit: 1000
memory_limiter:
check_interval: 1s
limit_mib: 512
transform/logs:
log_statements:
- context: log
statements:
# Extract Body as attribute "message"
- set(attributes["message"], body.string)
# Retain the original Body
- set(attributes["log.body"], body.string)
exporters:
otlp/traces: # OTLP 导出器,用于跟踪数据
endpoint: "jaeger:4317" # Jaeger 的 OTLP gRPC 端点
otlp/traces: # OTLP exporter for trace data
endpoint: "http://jaeger:4317" # OTLP gRPC endpoint for Jaeger
tls:
insecure: true # 开发环境禁用 TLS生产环境需配置证书
otlp/tempo: # OTLP 导出器,用于跟踪数据
endpoint: "tempo:4317" # tempo 的 OTLP gRPC 端点
insecure: true # TLS is disabled in the development environment and a certificate needs to be configured in the production environment.
compression: gzip # Enable compression to reduce network bandwidth
retry_on_failure:
enabled: true # Enable retry on failure
initial_interval: 1s # Initial interval for retry
max_interval: 30s # Maximum interval for retry
max_elapsed_time: 300s # Maximum elapsed time for retry
sending_queue:
enabled: true # Enable sending queue
num_consumers: 10 # Number of consumers
queue_size: 5000 # Queue size
otlp/tempo: # OTLP exporter for trace data
endpoint: "http://tempo:4317" # OTLP gRPC endpoint for tempo
tls:
insecure: true # 开发环境禁用 TLS生产环境需配置证书
prometheus: # Prometheus 导出器,用于指标数据
endpoint: "0.0.0.0:8889" # Prometheus 刮取端点
namespace: "rustfs" # 指标前缀
send_timestamps: true # 发送时间戳
# enable_open_metrics: true
otlphttp/loki: # Loki 导出器,用于日志数据
endpoint: "http://loki:3100/otlp/v1/logs"
insecure: true # TLS is disabled in the development environment and a certificate needs to be configured in the production environment.
compression: gzip # Enable compression to reduce network bandwidth
retry_on_failure:
enabled: true # Enable retry on failure
initial_interval: 1s # Initial interval for retry
max_interval: 30s # Maximum interval for retry
max_elapsed_time: 300s # Maximum elapsed time for retry
sending_queue:
enabled: true # Enable sending queue
num_consumers: 10 # Number of consumers
queue_size: 5000 # Queue size
prometheus: # Prometheus exporter for metrics data
endpoint: "0.0.0.0:8889" # Prometheus scraping endpoint
namespace: "metrics" # indicator prefix
send_timestamps: true # Send timestamp
metric_expiration: 5m # Metric expiration time
resource_to_telemetry_conversion:
enabled: true # Enable resource to telemetry conversion
otlphttp/loki: # Loki exporter for log data
endpoint: "http://loki:3100/otlp"
tls:
insecure: true
compression: gzip # Enable compression to reduce network bandwidth
extensions:
health_check:
endpoint: 0.0.0.0:13133
pprof:
endpoint: 0.0.0.0:1888
zpages:
endpoint: 0.0.0.0:55679
service:
extensions: [ health_check, pprof, zpages ] # 启用扩展
extensions: [ health_check, pprof, zpages ] # Enable extension
pipelines:
traces:
receivers: [ otlp ]
processors: [ memory_limiter,batch ]
exporters: [ otlp/traces,otlp/tempo ]
processors: [ memory_limiter, batch ]
exporters: [ otlp/traces, otlp/tempo ]
metrics:
receivers: [ otlp ]
processors: [ batch ]
exporters: [ prometheus ]
logs:
receivers: [ otlp ]
processors: [ batch ]
processors: [ batch, transform/logs ]
exporters: [ otlphttp/loki ]
telemetry:
logs:
level: "info" # Collector 日志级别
level: "debug" # Collector log level
encoding: "json" # Log encoding: console or json
metrics:
level: "detailed" # 可以是 basic, normal, detailed
level: "detailed" # Can be basic, normal, detailed
readers:
- periodic:
exporter:
otlp:
protocol: http/protobuf
endpoint: http://otel-collector:4318
- pull:
exporter:
prometheus:
host: '0.0.0.0'
port: 8888

View File

@@ -0,0 +1 @@
*

View File

@@ -14,17 +14,27 @@
global:
scrape_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
evaluation_interval: 15s
external_labels:
cluster: 'rustfs-dev' # Label to identify the cluster
relica: '1' # Replica identifier
scrape_configs:
- job_name: 'otel-collector'
- job_name: 'otel-collector-internal'
static_configs:
- targets: [ 'otel-collector:8888' ] # Scrape metrics from Collector
- job_name: 'otel-metrics'
scrape_interval: 10s
- job_name: 'rustfs-app-metrics'
static_configs:
- targets: [ 'otel-collector:8889' ] # Application indicators
scrape_interval: 15s
metric_relabel_configs:
- job_name: 'tempo'
static_configs:
- targets: [ 'tempo:3200' ] # Scrape metrics from Tempo
- job_name: 'jaeger'
static_configs:
- targets: [ 'jaeger:8888' ] # Jaeger admin port
otlp:
# Recommended attributes to be promoted to labels.

View File

@@ -18,7 +18,9 @@ distributor:
otlp:
protocols:
grpc:
endpoint: "tempo:4317"
endpoint: "0.0.0.0:4317"
http:
endpoint: "0.0.0.0:4318"
ingester:
max_block_duration: 5m # cut the headblock when this much time passes. this is being set for demo purposes and should probably be left alone normally