RustFS S3Select API - SQL Query Interface
AWS S3 Select-compatible SQL query API for the RustFS distributed object storage system
📖 Documentation · 🐛 Bug Reports · 💬 Discussions
📖 Overview
RustFS S3Select API provides AWS S3 Select-compatible SQL query capabilities for the RustFS distributed object storage system. It enables clients to retrieve subsets of object data using SQL expressions, reducing data transfer and improving query performance through server-side filtering.
Note: This is a high-performance submodule of RustFS that provides essential SQL query capabilities for the distributed object storage system. For the complete RustFS experience, please visit the main RustFS repository.
✨ Features
📊 SQL Query Support
- Standard SQL: Support for SELECT, WHERE, GROUP BY, ORDER BY clauses
- Data Types: Support for strings, numbers, booleans, timestamps
- Functions: Built-in SQL functions (aggregation, string, date functions)
- Complex Expressions: Nested queries and complex conditional logic (see the sketch below)
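As a quick illustration, these clauses can be combined in a single expression. Apart from `S3Object`, which is the fixed S3 Select table name, the column names below are hypothetical:

```rust
// Hypothetical query combining SELECT, WHERE, GROUP BY, and ORDER BY.
// `status`, `region`, and `amount` are illustrative column names.
let expression = r#"
    SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM S3Object
    WHERE status = 'shipped'
    GROUP BY region
    ORDER BY revenue DESC
"#;
```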
📁 Format Support
- CSV Files: Comma-separated values with customizable delimiters
- JSON Documents: JSON objects and arrays with path expressions
- Parquet Files: Columnar format with schema evolution
- Apache Arrow: High-performance columnar data format
🚀 Performance Features
- Streaming Processing: Process large files without loading into memory
- Parallel Execution: Multi-threaded query execution
- Predicate Pushdown: Push filters down to storage layer
- Columnar Processing: Efficient columnar data processing with Apache DataFusion
🔧 S3 Compatibility
- S3 Select API: Full compatibility with AWS S3 Select API
- Request Formats: Support for JSON and XML request formats (see the XML sketch after this list)
- Response Streaming: Streaming query results back to clients
- Error Handling: AWS-compatible error responses
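For reference, a minimal XML request body in the AWS shape might look like the string below. This mirrors the public AWS `SelectObjectContentRequest` layout and is shown as a sketch, not a captured RustFS payload:

```rust
// Minimal AWS-style SelectObjectContent request body (XML), held in a
// Rust string literal for illustration.
let request_xml = r#"
<SelectObjectContentRequest>
  <Expression>SELECT name, revenue FROM S3Object WHERE revenue > 10000</Expression>
  <ExpressionType>SQL</ExpressionType>
  <InputSerialization>
    <CSV><FileHeaderInfo>USE</FileHeaderInfo></CSV>
  </InputSerialization>
  <OutputSerialization>
    <JSON><RecordDelimiter>&#10;</RecordDelimiter></JSON>
  </OutputSerialization>
</SelectObjectContentRequest>
"#;
```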
📦 Installation
Add this to your Cargo.toml:
```toml
[dependencies]
rustfs-s3select-api = "0.1.0"
```
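The examples below also use `tokio`, `futures`, and `serde_json`; if you are following along, add them as well (versions indicative):

```toml
[dependencies]
tokio = { version = "1", features = ["full"] }
futures = "0.3"
serde_json = "1"
```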
🔧 Usage
Basic S3 Select Query
```rust
use futures::StreamExt; // `futures` crate, for `.next()` on the result stream
use rustfs_s3select_api::{
    CSVInputSerialization, InputSerialization, OutputSerialization, S3SelectService,
    SelectEvent, SelectRequest,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create S3 Select service
    let s3select = S3SelectService::new().await?;

    // Configure input format (CSV)
    let input_serialization = InputSerialization::CSV(CSVInputSerialization {
        file_header_info: "USE".to_string(),
        record_delimiter: "\n".to_string(),
        field_delimiter: ",".to_string(),
        quote_character: "\"".to_string(),
        quote_escape_character: "\"".to_string(),
        comments: "#".to_string(),
        allow_quoted_record_delimiter: false,
    });

    // Configure output format
    let output_serialization = OutputSerialization::JSON {
        record_delimiter: "\n".to_string(),
    };

    // Create select request
    let select_request = SelectRequest {
        bucket: "sales-data".to_string(),
        key: "2024/sales.csv".to_string(),
        expression: "SELECT name, revenue FROM S3Object WHERE revenue > 10000".to_string(),
        expression_type: "SQL".to_string(),
        input_serialization,
        output_serialization,
        request_progress: false,
    };

    // Execute query
    let mut result_stream = s3select.select_object_content(select_request).await?;

    // Process streaming results
    while let Some(event) = result_stream.next().await {
        match event? {
            SelectEvent::Records(data) => {
                println!("Query result: {}", String::from_utf8(data)?);
            }
            SelectEvent::Stats(stats) => {
                println!("Bytes scanned: {}", stats.bytes_scanned);
                println!("Bytes processed: {}", stats.bytes_processed);
                println!("Bytes returned: {}", stats.bytes_returned);
            }
            SelectEvent::Progress(progress) => {
                println!("Progress: {}%", progress.details.bytes_processed_percent);
            }
            SelectEvent::End => {
                println!("Query completed");
                break;
            }
        }
    }

    Ok(())
}
```
CSV Data Processing
```rust
use futures::StreamExt;
use rustfs_s3select_api::{
    CSVInputSerialization, InputSerialization, OutputSerialization, S3SelectService,
    SelectEvent, SelectRequest,
};

async fn csv_processing_example() -> Result<(), Box<dyn std::error::Error>> {
    let s3select = S3SelectService::new().await?;

    // Configure CSV input with custom settings
    let csv_input = CSVInputSerialization {
        file_header_info: "USE".to_string(),
        record_delimiter: "\r\n".to_string(),
        field_delimiter: "|".to_string(),
        quote_character: "'".to_string(),
        quote_escape_character: "\\".to_string(),
        comments: "//".to_string(),
        allow_quoted_record_delimiter: false,
    };

    // Query with aggregation
    let select_request = SelectRequest {
        bucket: "analytics".to_string(),
        key: "user-events.csv".to_string(),
        expression: r#"
            SELECT
                event_type,
                COUNT(*) as event_count,
                AVG(CAST(duration as DECIMAL)) as avg_duration
            FROM S3Object
            WHERE timestamp >= '2024-01-01'
            GROUP BY event_type
            ORDER BY event_count DESC
        "#
        .to_string(),
        expression_type: "SQL".to_string(),
        input_serialization: InputSerialization::CSV(csv_input),
        output_serialization: OutputSerialization::JSON {
            record_delimiter: "\n".to_string(),
        },
        request_progress: true,
    };

    let mut result_stream = s3select.select_object_content(select_request).await?;
    let mut total_events: u64 = 0;

    while let Some(event) = result_stream.next().await {
        match event? {
            SelectEvent::Records(data) => {
                // Each record arrives as one JSON line (requires the `serde_json` crate)
                let result: serde_json::Value = serde_json::from_slice(&data)?;
                println!(
                    "Event type: {}, Count: {}, Avg duration: {}",
                    result["event_type"], result["event_count"], result["avg_duration"]
                );
                total_events += result["event_count"].as_u64().unwrap_or(0);
            }
            SelectEvent::Progress(progress) => {
                println!("Processing: {}%", progress.details.bytes_processed_percent);
            }
            _ => {}
        }
    }

    println!("Total events processed: {}", total_events);
    Ok(())
}
```
JSON Data Querying
```rust
use futures::StreamExt;
use rustfs_s3select_api::{
    InputSerialization, JSONInputSerialization, JSONType, OutputSerialization,
    S3SelectService, SelectEvent, SelectRequest,
};

async fn json_querying_example() -> Result<(), Box<dyn std::error::Error>> {
    let s3select = S3SelectService::new().await?;

    // Configure JSON input
    let json_input = JSONInputSerialization {
        json_type: JSONType::Lines, // JSON Lines format
    };

    // Query nested JSON data
    let select_request = SelectRequest {
        bucket: "logs".to_string(),
        key: "application.jsonl".to_string(),
        expression: r#"
            SELECT
                s.timestamp,
                s.level,
                s.message,
                s.metadata.user_id,
                s.metadata.request_id
            FROM S3Object[*] s
            WHERE s.level = 'ERROR'
              AND s.metadata.user_id IS NOT NULL
            ORDER BY s.timestamp DESC
        "#
        .to_string(),
        expression_type: "SQL".to_string(),
        input_serialization: InputSerialization::JSON(json_input),
        output_serialization: OutputSerialization::JSON {
            record_delimiter: "\n".to_string(),
        },
        request_progress: false,
    };

    let mut result_stream = s3select.select_object_content(select_request).await?;

    while let Some(event) = result_stream.next().await {
        if let SelectEvent::Records(data) = event? {
            let log_entry: serde_json::Value = serde_json::from_slice(&data)?;
            println!(
                "Error at {}: {} (User: {}, Request: {})",
                log_entry["timestamp"],
                log_entry["message"],
                log_entry["user_id"],
                log_entry["request_id"]
            );
        }
    }

    Ok(())
}
```
Parquet File Analysis
```rust
use futures::StreamExt;
use rustfs_s3select_api::{
    InputSerialization, OutputSerialization, ParquetInputSerialization, S3SelectService,
    SelectEvent, SelectRequest,
};

async fn parquet_analysis_example() -> Result<(), Box<dyn std::error::Error>> {
    let s3select = S3SelectService::new().await?;

    // Parquet files carry their own schema, so no serialization settings are needed
    let parquet_input = ParquetInputSerialization {};

    // Complex analytical query
    let select_request = SelectRequest {
        bucket: "data-warehouse".to_string(),
        key: "sales/2024/q1/sales_data.parquet".to_string(),
        expression: r#"
            SELECT
                region,
                product_category,
                SUM(amount) as total_sales,
                COUNT(*) as transaction_count,
                AVG(amount) as avg_transaction,
                MIN(amount) as min_sale,
                MAX(amount) as max_sale
            FROM S3Object
            WHERE sale_date >= '2024-01-01'
              AND sale_date < '2024-04-01'
              AND amount > 0
            GROUP BY region, product_category
            HAVING SUM(amount) > 50000
            ORDER BY total_sales DESC
            LIMIT 20
        "#
        .to_string(),
        expression_type: "SQL".to_string(),
        input_serialization: InputSerialization::Parquet(parquet_input),
        output_serialization: OutputSerialization::JSON {
            record_delimiter: "\n".to_string(),
        },
        request_progress: true,
    };

    let mut result_stream = s3select.select_object_content(select_request).await?;

    while let Some(event) = result_stream.next().await {
        match event? {
            SelectEvent::Records(data) => {
                let sales_data: serde_json::Value = serde_json::from_slice(&data)?;
                println!(
                    "Region: {}, Category: {}, Total Sales: ${:.2}",
                    sales_data["region"],
                    sales_data["product_category"],
                    sales_data["total_sales"].as_f64().unwrap_or(0.0)
                );
            }
            SelectEvent::Stats(stats) => {
                println!("Query statistics:");
                println!("  Bytes scanned: {}", stats.bytes_scanned);
                println!("  Bytes processed: {}", stats.bytes_processed);
                println!("  Bytes returned: {}", stats.bytes_returned);
            }
            _ => {}
        }
    }

    Ok(())
}
```
Advanced SQL Functions
```rust
use futures::StreamExt;
use rustfs_s3select_api::{
    CSVInputSerialization, InputSerialization, OutputSerialization, S3SelectService,
    SelectEvent, SelectRequest,
};

async fn advanced_sql_functions_example() -> Result<(), Box<dyn std::error::Error>> {
    let s3select = S3SelectService::new().await?;

    // Query with various SQL functions
    let select_request = SelectRequest {
        bucket: "analytics".to_string(),
        key: "user_data.csv".to_string(),
        expression: r#"
            SELECT
                -- String functions
                UPPER(name) as name_upper,
                SUBSTRING(email, 1, POSITION('@' IN email) - 1) as username,
                LENGTH(description) as desc_length,
                -- Date functions
                EXTRACT(YEAR FROM registration_date) as reg_year,
                DATE_DIFF('day', registration_date, last_login) as days_since_reg,
                -- Numeric functions
                ROUND(score, 2) as rounded_score,
                CASE
                    WHEN score >= 90 THEN 'Excellent'
                    WHEN score >= 70 THEN 'Good'
                    WHEN score >= 50 THEN 'Average'
                    ELSE 'Poor'
                END as score_category,
                -- Conditional logic
                COALESCE(nickname, SUBSTRING(name, 1, POSITION(' ' IN name) - 1)) as display_name
            FROM S3Object
            WHERE registration_date IS NOT NULL
              AND score IS NOT NULL
            ORDER BY score DESC
        "#
        .to_string(),
        expression_type: "SQL".to_string(),
        input_serialization: InputSerialization::CSV(CSVInputSerialization {
            file_header_info: "USE".to_string(),
            record_delimiter: "\n".to_string(),
            field_delimiter: ",".to_string(),
            quote_character: "\"".to_string(),
            quote_escape_character: "\"".to_string(),
            comments: "#".to_string(),
            allow_quoted_record_delimiter: false,
        }),
        output_serialization: OutputSerialization::JSON {
            record_delimiter: "\n".to_string(),
        },
        request_progress: false,
    };

    let mut result_stream = s3select.select_object_content(select_request).await?;

    while let Some(event) = result_stream.next().await {
        if let SelectEvent::Records(data) = event? {
            let user: serde_json::Value = serde_json::from_slice(&data)?;
            println!(
                "User: {} ({}) - Score: {} ({})",
                user["display_name"],
                user["username"],
                user["rounded_score"],
                user["score_category"]
            );
        }
    }

    Ok(())
}
```
Streaming Large Datasets
```rust
use futures::StreamExt;
use rustfs_s3select_api::{
    CSVInputSerialization, InputSerialization, OutputSerialization, S3SelectService,
    SelectEvent, SelectRequest,
};
use tokio::io::AsyncWriteExt; // for `write_all` and `flush`

async fn streaming_large_datasets() -> Result<(), Box<dyn std::error::Error>> {
    let s3select = S3SelectService::new().await?;

    let select_request = SelectRequest {
        bucket: "big-data".to_string(),
        key: "large_dataset.csv".to_string(),
        expression: "SELECT * FROM S3Object WHERE status = 'active'".to_string(),
        expression_type: "SQL".to_string(),
        input_serialization: InputSerialization::CSV(CSVInputSerialization {
            file_header_info: "USE".to_string(),
            record_delimiter: "\n".to_string(),
            field_delimiter: ",".to_string(),
            quote_character: "\"".to_string(),
            quote_escape_character: "\"".to_string(),
            comments: "".to_string(),
            allow_quoted_record_delimiter: false,
        }),
        output_serialization: OutputSerialization::JSON {
            record_delimiter: "\n".to_string(),
        },
        request_progress: true,
    };

    let mut result_stream = s3select.select_object_content(select_request).await?;
    let mut batch_count: u64 = 0;
    let mut output_file = tokio::fs::File::create("filtered_results.jsonl").await?;

    while let Some(event) = result_stream.next().await {
        match event? {
            SelectEvent::Records(data) => {
                // Write each batch of results to the output file
                output_file.write_all(&data).await?;
                batch_count += 1;
                if batch_count % 1000 == 0 {
                    println!("Processed {} record batches", batch_count);
                }
            }
            SelectEvent::Progress(progress) => {
                println!(
                    "Progress: {:.1}% ({} bytes processed)",
                    progress.details.bytes_processed_percent,
                    progress.details.bytes_processed
                );
            }
            SelectEvent::Stats(stats) => {
                println!("Final statistics:");
                println!("  Total bytes scanned: {}", stats.bytes_scanned);
                println!("  Total bytes processed: {}", stats.bytes_processed);
                println!("  Total bytes returned: {}", stats.bytes_returned);
                println!(
                    "  Processing efficiency: {:.2}%",
                    (stats.bytes_returned as f64 / stats.bytes_scanned.max(1) as f64) * 100.0
                );
            }
            SelectEvent::End => {
                println!("Streaming completed. Total record batches: {}", batch_count);
                break;
            }
        }
    }

    output_file.flush().await?;
    Ok(())
}
```
HTTP API Integration
```rust
use axum::{
    body::Body,
    extract::{Extension, Path},
    http::StatusCode,
    response::Response,
    routing::post,
    Router,
};
use rustfs_s3select_api::S3SelectHandler;

// Assumes axum 0.7-style APIs and that S3SelectHandler implements Clone
// (required by the Extension layer).
async fn setup_s3select_http_api() -> Router {
    let s3select_handler = S3SelectHandler::new().await.unwrap();

    Router::new()
        .route(
            "/buckets/:bucket/objects/:key/select",
            post(handle_select_object_content),
        )
        .layer(Extension(s3select_handler))
}

async fn handle_select_object_content(
    Path((bucket, key)): Path<(String, String)>,
    Extension(handler): Extension<S3SelectHandler>,
    body: String,
) -> Result<Response, StatusCode> {
    // Parse S3 Select request (XML or JSON)
    let select_request = handler
        .parse_request(&body, &bucket, &key)
        .await
        .map_err(|_| StatusCode::BAD_REQUEST)?;

    // Execute query
    let result_stream = handler
        .execute_select(select_request)
        .await
        .map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)?;

    // Return a streaming response
    Response::builder()
        .header("content-type", "application/xml")
        .header("x-amz-request-id", "12345")
        .body(Body::from_stream(result_stream))
        .map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)
}
```
🏗️ Architecture
S3Select API Architecture
```text
┌─────────────────────────────────────────────────────────────┐
│                     S3 Select HTTP API                      │
├─────────────────────────────────────────────────────────────┤
│   Request    │   Response   │  Streaming   │     Error      │
│   Parsing    │  Formatting  │   Results    │    Handling    │
├─────────────────────────────────────────────────────────────┤
│                  Query Engine (DataFusion)                  │
├─────────────────────────────────────────────────────────────┤
│  SQL Parser  │  Optimizer   │  Execution   │   Streaming    │
├─────────────────────────────────────────────────────────────┤
│                     Storage Integration                     │
└─────────────────────────────────────────────────────────────┘
```
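The query engine layer is built on Apache DataFusion. As a conceptual sketch of what that layer does (plain DataFusion against a local file, not this module's internal wiring), registering an object as the `S3Object` table and running SQL over it looks like this:

```rust
use datafusion::prelude::*;

// Conceptual sketch of the DataFusion layer: expose the object as a table
// named S3Object, then plan and execute SQL against it. In RustFS the data
// comes from the storage layer rather than a local file.
async fn datafusion_sketch() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    ctx.register_csv("S3Object", "sales.csv", CsvReadOptions::new().has_header(true))
        .await?;
    let df = ctx
        .sql("SELECT name, revenue FROM S3Object WHERE revenue > 10000")
        .await?;
    df.show().await?; // print result batches to stdout
    Ok(())
}
```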
Supported Data Formats
| Format | Features | Use Cases |
|---|---|---|
| CSV | Custom delimiters, headers, quotes | Log files, exports |
| JSON | Objects, arrays, nested data | APIs, documents |
| JSON Lines | Streaming JSON records | Event logs, analytics |
| Parquet | Columnar, schema evolution | Data warehousing |
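A caller often needs to pick the input serialization from the object it is querying. The helper below is a hypothetical sketch built on the types shown earlier; the CSV defaults and the choice to treat `.json` as line-delimited JSON are assumptions:

```rust
use rustfs_s3select_api::{
    CSVInputSerialization, InputSerialization, JSONInputSerialization, JSONType,
    ParquetInputSerialization,
};

// Hypothetical helper: map a file extension to an InputSerialization.
fn input_for_key(key: &str) -> Option<InputSerialization> {
    match std::path::Path::new(key).extension()?.to_str()? {
        "csv" => Some(InputSerialization::CSV(CSVInputSerialization {
            file_header_info: "USE".to_string(), // assume a header row
            record_delimiter: "\n".to_string(),
            field_delimiter: ",".to_string(),
            quote_character: "\"".to_string(),
            quote_escape_character: "\"".to_string(),
            comments: "#".to_string(),
            allow_quoted_record_delimiter: false,
        })),
        // Treat both extensions as JSON Lines for simplicity.
        "json" | "jsonl" => Some(InputSerialization::JSON(JSONInputSerialization {
            json_type: JSONType::Lines,
        })),
        "parquet" => Some(InputSerialization::Parquet(ParquetInputSerialization {})),
        _ => None,
    }
}
```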
🧪 Testing
Run the test suite:
```bash
# Run all tests
cargo test

# Test SQL parsing
cargo test sql_parsing

# Test format support
cargo test format_support

# Test streaming
cargo test streaming

# Integration tests
cargo test --test integration

# Performance tests
cargo test --test performance --release
```
📋 Requirements
- Rust: 1.70.0 or later
- Platforms: Linux, macOS, Windows
- Dependencies: Apache DataFusion, Arrow
- Memory: Sufficient RAM for the active query working set (streaming keeps whole objects out of memory)
🌍 Related Projects
This module is part of the RustFS ecosystem:
- RustFS Main - Core distributed storage system
- RustFS S3Select Query - Query engine implementation
- RustFS ECStore - Storage backend
📚 Documentation
For comprehensive documentation, see the links below.
🔗 Links
- Documentation - Complete RustFS manual
- Changelog - Release notes and updates
- GitHub Discussions - Community support
🤝 Contributing
We welcome contributions! Please see our Contributing Guide for details.
📄 License
Licensed under the Apache License, Version 2.0. See LICENSE for details.
RustFS is a trademark of RustFS, Inc.
All other trademarks are the property of their respective owners.
Made with 📊 by the RustFS Team
