Files
rustfs/docs/examples/mnmd/CHECKLIST.md
Copilot ad99019749 Add complete MNMD Docker deployment example with startup coordination and VolumeNotFound fix (#642)
* Initial plan

* Add MNMD Docker deployment example with 4 nodes x 4 drives

- Create docs/examples/mnmd/ directory structure
- Add docker-compose.yml with proper disk indexing (1..4)
- Add wait-and-start.sh for startup coordination
- Add README.md with usage instructions and alternatives
- Add CHECKLIST.md with step-by-step verification
- Fixes VolumeNotFound issue by using correct volume paths
- Implements health checks and startup ordering
- Uses service names for stable inter-node addressing

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* Add docs/examples README as index for deployment examples

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* Add automated test script for MNMD deployment

- Add test-deployment.sh with comprehensive validation
- Test container status, health, endpoints, connectivity
- Update README to reference test script
- Make script executable

Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>

* improve code

* improve code

* improve dep crates `cargo shear --fix`

* upgrade aws-sdk-s3

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: houseme <4829346+houseme@users.noreply.github.com>
Co-authored-by: houseme <housemecn@gmail.com>
2025-10-12 13:15:14 +08:00

7.8 KiB
Raw Permalink Blame History

MNMD Deployment Checklist

This checklist provides step-by-step verification for deploying RustFS in MNMD (Multi-Node Multi-Drive) mode using Docker.

Pre-Deployment Checks

1. System Requirements

  • Docker Engine 20.10+ installed
  • Docker Compose 2.0+ installed
  • At least 8GB RAM available
  • At least 40GB disk space available (for 4 nodes × 4 volumes)

Verify with:

docker --version
docker-compose --version
free -h
df -h

2. File System Checks

  • Using XFS, ext4, or another suitable filesystem (not NFS for production)
  • File system supports extended attributes

Verify with:

df -T | grep -E '(xfs|ext4)'

3. Permissions and SELinux

  • Current user is in docker group or can run sudo docker
  • SELinux is properly configured (if enabled)

Verify with:

groups | grep docker
getenforce  # If enabled, should show "Permissive" or "Enforcing" with proper policies

4. Network Configuration

  • Ports 9000-9031 are available
  • No firewall blocking Docker bridge network

Verify with:

# Check if ports are free
netstat -tuln | grep -E ':(9000|9001|9010|9011|9020|9021|9030|9031)'
# Should return nothing if ports are free

5. Files Present

  • docker-compose.yml exists in current directory

Verify with:

cd docs/examples/mnmd
ls -la
chmod +x wait-and-start.sh  # If needed

Deployment Steps

1. Start the Cluster

  • Navigate to the example directory
  • Pull the latest RustFS image
  • Start the cluster
cd docs/examples/mnmd
docker-compose pull
docker-compose up -d

2. Monitor Startup

  • Watch container logs during startup
  • Verify no VolumeNotFound errors
  • Check that peer discovery completes
# Watch all logs
docker-compose logs -f

# Watch specific node
docker-compose logs -f rustfs-node1

# Look for successful startup messages
docker-compose logs | grep -i "ready\|listening\|started"

3. Verify Container Status

  • All 4 containers are running
  • All 4 containers show as healthy
docker-compose ps

# Expected output: 4 containers in "Up" state with "healthy" status

4. Check Health Endpoints

  • API health endpoints respond on all nodes
  • Console health endpoints respond on all nodes
# Test API endpoints
curl http://localhost:9000/health
curl http://localhost:9010/health
curl http://localhost:9020/health
curl http://localhost:9030/health

# Test Console endpoints
curl http://localhost:9001/health
curl http://localhost:9011/health
curl http://localhost:9021/health
curl http://localhost:9031/health

# All should return successful health status

Post-Deployment Verification

1. In-Container Checks

  • Data directories exist
  • Directories have correct permissions
  • RustFS process is running
# Check node1
docker exec rustfs-node1 ls -la /data/
docker exec rustfs-node1 ps aux | grep rustfs

# Verify all 4 data directories exist
docker exec rustfs-node1 ls -d /data/rustfs{1..4}

2. DNS and Network Validation

  • Service names resolve correctly
  • Inter-node connectivity works
# DNS resolution test
docker exec rustfs-node1 nslookup rustfs-node2
docker exec rustfs-node1 nslookup rustfs-node3
docker exec rustfs-node1 nslookup rustfs-node4

# Connectivity test (using nc if available)
docker exec rustfs-node1 nc -zv rustfs-node2 9000
docker exec rustfs-node1 nc -zv rustfs-node3 9000
docker exec rustfs-node1 nc -zv rustfs-node4 9000

# Or using telnet/curl
docker exec rustfs-node1 curl -v http://rustfs-node2:9000/health

3. Volume Configuration Validation

  • RUSTFS_VOLUMES environment variable is correct
  • All 16 endpoints are configured (4 nodes × 4 drives)
# Check environment variable
docker exec rustfs-node1 env | grep RUSTFS_VOLUMES

# Expected output:
# RUSTFS_VOLUMES=http://rustfs-node{1...4}:9000/data/rustfs{1...4}

4. Cluster Functionality

  • Can list buckets via API
  • Can create a bucket
  • Can upload an object
  • Can download an object
# Configure AWS CLI or s3cmd
export AWS_ACCESS_KEY_ID=rustfsadmin
export AWS_SECRET_ACCESS_KEY=rustfsadmin

# Using AWS CLI (if installed)
aws --endpoint-url http://localhost:9000 s3 mb s3://test-bucket
aws --endpoint-url http://localhost:9000 s3 ls
echo "test content" > test.txt
aws --endpoint-url http://localhost:9000 s3 cp test.txt s3://test-bucket/
aws --endpoint-url http://localhost:9000 s3 ls s3://test-bucket/
aws --endpoint-url http://localhost:9000 s3 cp s3://test-bucket/test.txt downloaded.txt
cat downloaded.txt

# Or using curl
curl -X PUT http://localhost:9000/test-bucket \
  -H "Host: localhost:9000" \
  --user rustfsadmin:rustfsadmin

5. Healthcheck Verification

  • Docker reports all services as healthy
  • Healthcheck scripts work in containers
# Check Docker health status
docker inspect rustfs-node1 --format='{{.State.Health.Status}}'
docker inspect rustfs-node2 --format='{{.State.Health.Status}}'
docker inspect rustfs-node3 --format='{{.State.Health.Status}}'
docker inspect rustfs-node4 --format='{{.State.Health.Status}}'

# All should return "healthy"

# Test healthcheck command manually
docker exec rustfs-node1 nc -z localhost 9000
echo $?  # Should be 0

Troubleshooting Checks

If VolumeNotFound Error Occurs

  • Verify volume indexing starts at 1, not 0
  • Check that RUSTFS_VOLUMES matches mounted paths
  • Ensure all /data/rustfs{1..4} directories exist
# Check mounted volumes
docker inspect rustfs-node1 | jq '.[].Mounts'

# Verify directories in container
docker exec rustfs-node1 ls -la /data/

If Healthcheck Fails

  • Check if nc is available in the image
  • Try alternative healthcheck (curl/wget)
  • Increase start_period in docker-compose.yml
# Check if nc is available
docker exec rustfs-node1 which nc

# Test healthcheck manually
docker exec rustfs-node1 nc -z localhost 9000

# Check logs for errors
docker-compose logs rustfs-node1 | grep -i error

If Startup Takes Too Long

  • Check peer discovery timeout in logs
  • Verify network connectivity between nodes
  • Consider increasing timeout in wait-and-start.sh
# Check startup logs
docker-compose logs rustfs-node1 | grep -i "waiting\|peer\|timeout"

# Check network
docker network inspect mnmd_rustfs-mnmd

If Containers Crash or Restart

  • Review container logs
  • Check resource usage (CPU/Memory)
  • Verify no port conflicts
# View last crash logs
docker-compose logs --tail=100 rustfs-node1

# Check resource usage
docker stats --no-stream

# Check restart count
docker-compose ps

Cleanup Checklist

When done testing:

  • Stop the cluster: docker-compose down
  • Remove volumes (optional, destroys data): docker-compose down -v
  • Clean up dangling images: docker image prune
  • Verify ports are released: netstat -tuln | grep -E ':(9000|9001|9010|9011|9020|9021|9030|9031)'

Production Deployment Additional Checks

Before deploying to production:

  • Change default credentials (RUSTFS_ACCESS_KEY, RUSTFS_SECRET_KEY)
  • Configure TLS certificates
  • Set up proper logging and monitoring
  • Configure backups for volumes
  • Review and adjust resource limits
  • Set up external load balancer (if needed)
  • Document disaster recovery procedures
  • Test failover scenarios
  • Verify data persistence after container restart

Summary

This checklist ensures:

  • ✓ Correct disk indexing (1..4 instead of 0..3)
  • ✓ Proper startup coordination via wait-and-start.sh
  • ✓ Service discovery via Docker service names
  • ✓ Health checks function correctly
  • ✓ All 16 endpoints (4 nodes × 4 drives) are operational
  • ✓ No VolumeNotFound errors occur

For more details, see README.md in this directory.