rustfs/docs/examples/mnmd/CHECKLIST.md

# MNMD Deployment Checklist

This checklist provides step-by-step verification for deploying RustFS in MNMD (Multi-Node Multi-Drive) mode using
Docker.

## Pre-Deployment Checks

### 1. System Requirements

- [ ] Docker Engine 20.10+ installed
- [ ] Docker Compose 2.0+ installed
- [ ] At least 8GB RAM available
- [ ] At least 40GB disk space available (for 4 nodes × 4 volumes)

Verify with:

```bash
docker --version
docker-compose --version
free -h
df -h
```

### 2. File System Checks

- [ ] Using XFS, ext4, or another suitable filesystem (not NFS for production)
- [ ] File system supports extended attributes

Verify with:

```bash
df -T | grep -E '(xfs|ext4)'
```

### 3. Permissions and SELinux

- [ ] Current user is in `docker` group or can run `sudo docker`
- [ ] SELinux is properly configured (if enabled)

Verify with:

```bash
groups | grep docker
getenforce  # If enabled, should show "Permissive" or "Enforcing" with proper policies
```

### 4. Network Configuration

- [ ] Ports 9000-9031 are available
- [ ] No firewall blocking Docker bridge network

Verify with:

```bash
# Check if ports are free
netstat -tuln | grep -E ':(9000|9001|9010|9011|9020|9021|9030|9031)'
# Should return nothing if ports are free
```

### 5. Files Present

- [ ] `docker-compose.yml` exists in current directory

Verify with:

```bash
cd docs/examples/mnmd
ls -la
chmod +x wait-and-start.sh  # If needed
```

## Deployment Steps

### 1. Start the Cluster

- [ ] Navigate to the example directory
- [ ] Pull the latest RustFS image
- [ ] Start the cluster

```bash
cd docs/examples/mnmd
docker-compose pull
docker-compose up -d
```

### 2. Monitor Startup

- [ ] Watch container logs during startup
- [ ] Verify no VolumeNotFound errors
- [ ] Check that peer discovery completes

```bash
# Watch all logs
docker-compose logs -f

# Watch specific node
docker-compose logs -f rustfs-node1

# Look for successful startup messages
docker-compose logs | grep -i "ready\|listening\|started"
```

### 3. Verify Container Status

- [ ] All 4 containers are running
- [ ] All 4 containers show as healthy

```bash
docker-compose ps

# Expected output: 4 containers in "Up" state with "healthy" status
```

### 4. Check Health Endpoints

- [ ] API health endpoints respond on all nodes
- [ ] Console health endpoints respond on all nodes

```bash
# Test API endpoints
curl http://localhost:9000/health
curl http://localhost:9010/health
curl http://localhost:9020/health
curl http://localhost:9030/health

# Test Console endpoints
curl http://localhost:9001/health
curl http://localhost:9011/health
curl http://localhost:9021/health
curl http://localhost:9031/health

# All should return successful health status
```

## Post-Deployment Verification

### 1. In-Container Checks

- [ ] Data directories exist
- [ ] Directories have correct permissions
- [ ] RustFS process is running

```bash
# Check node1
docker exec rustfs-node1 ls -la /data/
docker exec rustfs-node1 ps aux | grep rustfs

# Verify all 4 data directories exist
docker exec rustfs-node1 ls -d /data/rustfs{1..4}
```

### 2. DNS and Network Validation

- [ ] Service names resolve correctly
- [ ] Inter-node connectivity works

```bash
# DNS resolution test
docker exec rustfs-node1 nslookup rustfs-node2
docker exec rustfs-node1 nslookup rustfs-node3
docker exec rustfs-node1 nslookup rustfs-node4

# Connectivity test (using nc if available)
docker exec rustfs-node1 nc -zv rustfs-node2 9000
docker exec rustfs-node1 nc -zv rustfs-node3 9000
docker exec rustfs-node1 nc -zv rustfs-node4 9000

# Or using telnet/curl
docker exec rustfs-node1 curl -v http://rustfs-node2:9000/health
```

### 3. Volume Configuration Validation

- [ ] RUSTFS_VOLUMES environment variable is correct
- [ ] All 16 endpoints are configured (4 nodes × 4 drives)

```bash
# Check environment variable
docker exec rustfs-node1 env | grep RUSTFS_VOLUMES

# Expected output:
# RUSTFS_VOLUMES=http://rustfs-node{1...4}:9000/data/rustfs{1...4}
```

### 4. Cluster Functionality

- [ ] Can list buckets via API
- [ ] Can create a bucket
- [ ] Can upload an object
- [ ] Can download an object

```bash
# Configure AWS CLI or s3cmd
export AWS_ACCESS_KEY_ID=rustfsadmin
export AWS_SECRET_ACCESS_KEY=rustfsadmin

# Using AWS CLI (if installed)
aws --endpoint-url http://localhost:9000 s3 mb s3://test-bucket
aws --endpoint-url http://localhost:9000 s3 ls
echo "test content" > test.txt
aws --endpoint-url http://localhost:9000 s3 cp test.txt s3://test-bucket/
aws --endpoint-url http://localhost:9000 s3 ls s3://test-bucket/
aws --endpoint-url http://localhost:9000 s3 cp s3://test-bucket/test.txt downloaded.txt
cat downloaded.txt

# Or using curl
curl -X PUT http://localhost:9000/test-bucket \
  -H "Host: localhost:9000" \
  --user rustfsadmin:rustfsadmin
```

### 5. Healthcheck Verification

- [ ] Docker reports all services as healthy
- [ ] Healthcheck scripts work in containers

```bash
# Check Docker health status
docker inspect rustfs-node1 --format='{{.State.Health.Status}}'
docker inspect rustfs-node2 --format='{{.State.Health.Status}}'
docker inspect rustfs-node3 --format='{{.State.Health.Status}}'
docker inspect rustfs-node4 --format='{{.State.Health.Status}}'

# All should return "healthy"

# Test healthcheck command manually
docker exec rustfs-node1 nc -z localhost 9000
echo $?  # Should be 0
```

## Troubleshooting Checks

### If VolumeNotFound Error Occurs

- [ ] Verify volume indexing starts at 1, not 0
- [ ] Check that RUSTFS_VOLUMES matches mounted paths
- [ ] Ensure all /data/rustfs{1..4} directories exist

```bash
# Check mounted volumes
docker inspect rustfs-node1 | jq '.[].Mounts'

# Verify directories in container
docker exec rustfs-node1 ls -la /data/
```

### If Healthcheck Fails

- [ ] Check if `nc` is available in the image
- [ ] Try alternative healthcheck (curl/wget)
- [ ] Increase `start_period` in docker-compose.yml

```bash
# Check if nc is available
docker exec rustfs-node1 which nc

# Test healthcheck manually
docker exec rustfs-node1 nc -z localhost 9000

# Check logs for errors
docker-compose logs rustfs-node1 | grep -i error
```

### If Startup Takes Too Long

- [ ] Check peer discovery timeout in logs
- [ ] Verify network connectivity between nodes
- [ ] Consider increasing timeout in wait-and-start.sh

```bash
# Check startup logs
docker-compose logs rustfs-node1 | grep -i "waiting\|peer\|timeout"

# Check network
docker network inspect mnmd_rustfs-mnmd
```

### If Containers Crash or Restart

- [ ] Review container logs
- [ ] Check resource usage (CPU/Memory)
- [ ] Verify no port conflicts

```bash
# View last crash logs
docker-compose logs --tail=100 rustfs-node1

# Check resource usage
docker stats --no-stream

# Check restart count
docker-compose ps
```

## Cleanup Checklist

When done testing:

- [ ] Stop the cluster: `docker-compose down`
- [ ] Remove volumes (optional, destroys data): `docker-compose down -v`
- [ ] Clean up dangling images: `docker image prune`
- [ ] Verify ports are released: `netstat -tuln | grep -E ':(9000|9001|9010|9011|9020|9021|9030|9031)'`

## Production Deployment Additional Checks

Before deploying to production:

- [ ] Change default credentials (RUSTFS_ACCESS_KEY, RUSTFS_SECRET_KEY)
- [ ] Configure TLS certificates
- [ ] Set up proper logging and monitoring
- [ ] Configure backups for volumes
- [ ] Review and adjust resource limits
- [ ] Set up external load balancer (if needed)
- [ ] Document disaster recovery procedures
- [ ] Test failover scenarios
- [ ] Verify data persistence after container restart

## Summary

This checklist ensures:

- ✓ Correct disk indexing (1..4 instead of 0..3)
- ✓ Proper startup coordination via wait-and-start.sh
- ✓ Service discovery via Docker service names
- ✓ Health checks function correctly
- ✓ All 16 endpoints (4 nodes × 4 drives) are operational
- ✓ No VolumeNotFound errors occur

For more details, see [README.md](./README.md) in this directory.