# High Availability Patterns

## Introduction
High availability (HA) is the ability of a system to operate continuously without failing for a designated period. This guide covers proven patterns and strategies to achieve high availability in modern infrastructure deployments.
## Availability Levels

| Availability % | Downtime per Year | Downtime per Month | Typical Use Case |
|---|---|---|---|
| 99% | 3.65 days | 7.31 hours | Development environments |
| 99.9% | 8.77 hours | 43.83 minutes | Standard web applications |
| 99.99% | 52.60 minutes | 4.38 minutes | E-commerce, SaaS platforms |
| 99.999% | 5.26 minutes | 26.30 seconds | Financial services, healthcare |
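These figures follow directly from the availability percentage: the downtime budget is simply (1 - availability) × period. A few lines of Python reproduce the table:

```python
# Downtime budget implied by an availability target:
# budget = (1 - availability) * period
def downtime_budget_seconds(availability_pct: float) -> tuple[float, float]:
    """Return (per-year, per-month) downtime budget in seconds."""
    unavailability = 1 - availability_pct / 100
    year = 365.25 * 24 * 3600          # average year in seconds
    return unavailability * year, unavailability * year / 12

for target in (99.0, 99.9, 99.99, 99.999):
    per_year, per_month = downtime_budget_seconds(target)
    print(f"{target}%: {per_year / 3600:7.2f} h/year, "
          f"{per_month / 60:7.2f} min/month")
```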
## Active-Passive Pattern

Typical target: 99.9% availability. In an active-passive configuration, a single active instance handles all traffic while one or more standby instances remain ready to take over in case of failure.
```
               ┌─────────────┐
               │    Users    │
               └──────┬──────┘
                      │
                 ┌────▼─────┐
                 │   Load   │
                 │ Balancer │
                 └────┬─────┘
                      │
      ┌───────────────┼───────────────┐
      │               │               │
┌─────▼──────┐  ┌─────▼──────┐  ┌─────▼──────┐
│   Active   │  │  Passive   │  │  Passive   │
│  Instance  │  │ Instance 1 │  │ Instance 2 │
│ (Primary)  │  │ (Standby)  │  │ (Standby)  │
└─────┬──────┘  └─────┬──────┘  └─────┬──────┘
      │               │               │
      └───────────────┼───────────────┘
                      │
                ┌─────▼──────┐
                │  Database  │
                │ (Primary)  │
                └────────────┘
```
### Implementation
```
# Health check configuration
health_check {
  interval: 30s
  timeout: 5s
  healthy_threshold: 2
  unhealthy_threshold: 3
  http {
    path: "/health"
    expected_codes: [200, 204]
  }
}

# Failover configuration
failover {
  detection_time: 60s
  promotion_time: 30s
  automatic_failback: false
  priorities {
    primary: 100
    standby_1: 50
    standby_2: 25
  }
}
```
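To make those settings concrete, here is a minimal, illustrative Python sketch of the promotion logic a failover controller might run. The instance table, health URLs, and `promote()` hook are assumptions for illustration, not a real controller API:

```python
import time
import urllib.request

# Hypothetical instance table mirroring the priorities above.
INSTANCES = {
    "primary":   {"priority": 100, "url": "http://primary.internal/health"},
    "standby_1": {"priority": 50,  "url": "http://standby1.internal/health"},
    "standby_2": {"priority": 25,  "url": "http://standby2.internal/health"},
}
CHECK_INTERVAL = 30   # seconds, matches the health check interval above
DETECTION_TIME = 60   # seconds of sustained failure before promoting

def is_healthy(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status in (200, 204)
    except OSError:
        return False

def promote(name: str) -> None:
    # Placeholder: in a real deployment this would repoint the load
    # balancer or virtual IP at the newly promoted instance.
    print(f"promoting {name} to active")

def failover_loop(active: str = "primary") -> None:
    failed_since = None
    while True:
        if is_healthy(INSTANCES[active]["url"]):
            failed_since = None
        elif failed_since is None:
            failed_since = time.monotonic()
        elif time.monotonic() - failed_since >= DETECTION_TIME:
            # Promote the highest-priority healthy standby. There is no
            # automatic failback, matching automatic_failback: false above.
            for name in sorted(INSTANCES, key=lambda n: -INSTANCES[n]["priority"]):
                if name != active and is_healthy(INSTANCES[name]["url"]):
                    promote(name)
                    active = name
                    failed_since = None
                    break
        time.sleep(CHECK_INTERVAL)
```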
### Advantages
- Simple to implement and manage
- Lower cost than active-active
- No data synchronization issues
- Clear failover path
### Disadvantages
- Wasted resources on standby
- Failover time impacts availability
- Manual intervention may be required
- Testing failover can be disruptive
## Active-Active Pattern

Typical target: 99.99% availability. Active-active configurations distribute traffic across multiple active instances, providing both high availability and increased capacity.
```
             ┌─────────────┐
             │    Users    │
             └──────┬──────┘
                    │
               ┌────▼─────┐
               │   Load   │
               │ Balancer │
               └────┬─────┘
                    │
  ┌───────────┬─────┴─────┬───────────┐
  │           │           │           │
┌─▼──────┐  ┌─▼──────┐  ┌─▼──────┐  ┌─▼──────┐
│Active 1│  │Active 2│  │Active 3│  │Active 4│
│  25%   │  │  25%   │  │  25%   │  │  25%   │
└─┬──────┘  └─┬──────┘  └─┬──────┘  └─┬──────┘
  │           │           │           │
  └───────────┴─────┬─────┴───────────┘
                    │
            ┌───────▼────────┐
            │ Shared Storage │
            │  (Clustered)   │
            └────────────────┘
```
### Load Distribution Strategies
- Round Robin: Equal distribution across all instances
- Least Connections: Route to instance with fewest active connections
- Weighted: Distribute based on instance capacity
- Geographic: Route based on user location
- Session Affinity: Maintain a user's session on the same instance (see the sketch below)
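To illustrate the last strategy, here is a minimal hash-based affinity router; the backend names are assumptions for illustration. Hashing the session ID makes routing deterministic, so a returning user lands on the same instance:

```python
import hashlib

# Assumed backend pool, matching the four instances in the diagram.
BACKENDS = ["backend1", "backend2", "backend3", "backend4"]

def pick_backend(session_id: str) -> str:
    """Deterministically map a session to a backend."""
    digest = hashlib.sha256(session_id.encode()).digest()
    return BACKENDS[int.from_bytes(digest[:8], "big") % len(BACKENDS)]

assert pick_backend("user-42") == pick_backend("user-42")  # stable routing
```

Note that simple modulo hashing reshuffles most sessions whenever the pool size changes; consistent hashing avoids that at the cost of extra bookkeeping.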
In nginx, least-connections balancing across the four instances looks like this. Note that upstream `keepalive` only takes effect when the proxied connection uses HTTP/1.1 with the `Connection` header cleared:

```nginx
# Active-Active load balancer configuration
upstream backend {
    least_conn;
    server backend1.example.com:80 weight=5 max_fails=3 fail_timeout=30s;
    server backend2.example.com:80 weight=5 max_fails=3 fail_timeout=30s;
    server backend3.example.com:80 weight=5 max_fails=3 fail_timeout=30s;
    server backend4.example.com:80 weight=5 max_fails=3 fail_timeout=30s;
    keepalive 32;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;
        proxy_http_version 1.1;          # required for upstream keepalive
        proxy_set_header Connection "";  # required for upstream keepalive
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
        proxy_connect_timeout 2s;
        proxy_read_timeout 30s;
    }
}
```
### Advantages
- No wasted resources
- Instant failover capability
- Better performance and scalability
- Load distribution reduces strain
### Disadvantages
- More complex to implement
- Data consistency challenges
- Higher infrastructure costs
- Requires session management
## Multi-Region Pattern

Typical target: 99.999% availability. Multi-region deployments provide the highest level of availability by distributing infrastructure across geographically separated data centers.
```
┌─────────────────────────────────────────────────┐
│               Global Load Balancer              │
└────────┬───────────────┬───────────────┬────────┘
         │               │               │
    ┌────▼────┐     ┌────▼────┐     ┌────▼────┐
    │Region 1 │     │Region 2 │     │Region 3 │
    │ US-East │     │   EU    │     │  APAC   │
    └────┬────┘     └────┬────┘     └────┬────┘
         │               │               │
    ┌────▼────┐     ┌────▼────┐     ┌────▼────┐
    │   App   │     │   App   │     │   App   │
    │ Cluster │◄────┤ Cluster │◄────┤ Cluster │
    └────┬────┘     └────┬────┘     └────┬────┘
         │               │               │
    ┌────▼────┐     ┌────▼────┐     ┌────▼────┐
    │   DB    │     │   DB    │     │   DB    │
    │ Primary │◄────┤ Replica │◄────┤ Replica │
    └─────────┘     └─────────┘     └─────────┘
         ◄─────────── Replication ───────────►
```
### Key Components
- Global Traffic Management (GTM) for intelligent routing
- Cross-region data replication with conflict resolution
- Regional failover automation
- Latency-based routing for optimal performance (sketched after this list)
- Disaster recovery orchestration
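As a sketch of the routing decision, here is an illustrative Python probe that picks the fastest healthy region. The region endpoints are assumptions, and a real GTM measures latency from the client's vantage point (typically via DNS) rather than from a server-side probe:

```python
import time
import urllib.request

# Assumed per-region health endpoints.
REGIONS = {
    "us-east": "https://us-east.example.com/health",
    "eu":      "https://eu.example.com/health",
    "apac":    "https://apac.example.com/health",
}

def measure_latency(url: str, timeout: float = 2.0) -> float | None:
    """Return round-trip time in seconds, or None if the probe fails."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if resp.status != 200:
                return None
    except OSError:
        return None
    return time.monotonic() - start

def pick_region() -> str:
    latencies = {r: measure_latency(u) for r, u in REGIONS.items()}
    healthy = {r: lat for r, lat in latencies.items() if lat is not None}
    if not healthy:
        raise RuntimeError("no healthy region available")
    return min(healthy, key=healthy.get)
```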
## Database High Availability Patterns
Database availability is often the most critical component of system reliability. Here are proven patterns for database HA:
### 1. Primary-Replica Replication
```ini
# MySQL Primary-Replica replication
# Primary configuration (my.cnf on the primary)
[mysqld]
server-id        = 1
log_bin          = /var/log/mysql/mysql-bin.log
binlog_do_db     = production
binlog_format    = ROW
max_binlog_size  = 100M
expire_logs_days = 7

# Replica configuration (my.cnf on the replica)
[mysqld]
server-id       = 2
relay-log       = /var/log/mysql/mysql-relay-bin.log
read_only       = 1
super_read_only = 1
```
### 2. Multi-Master Replication
- All nodes can accept writes
- Conflict resolution mechanisms are required (a minimal example follows this list)
- Suitable for geographically distributed applications
- Examples: Galera Cluster, CockroachDB
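One common conflict-resolution strategy is last-write-wins (LWW), sketched below for illustration; it assumes reasonably synchronized clocks. Note that the systems named above do not all use LWW: Galera, for instance, uses certification-based conflict detection.

```python
from dataclasses import dataclass

@dataclass
class VersionedValue:
    value: str
    timestamp: float   # wall-clock write time
    node_id: str       # tie-breaker so resolution is deterministic

def resolve(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    """Last-write-wins: pick the newer write, breaking ties by node ID."""
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    return a if a.node_id > b.node_id else b
```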
### 3. Database Clustering

| Solution | Type | Consistency | Best For |
|---|---|---|---|
| PostgreSQL Streaming | Primary-Replica | Eventual | Read-heavy workloads |
| MySQL Group Replication | Multi-Master | Strong | Write scaling |
| MongoDB Replica Set | Primary-Secondary | Eventual | Document stores |
| Cassandra | Masterless | Tunable | Large scale, multi-DC |
## Load Balancing Strategies
Effective load balancing is crucial for distributing traffic and detecting failures:
### Health Check Configuration
```nginx
# Advanced health check with custom logic (OpenResty)
location /health {
    access_log off;

    content_by_lua_block {
        local redis = require "resty.redis"
        local mysql = require "resty.mysql"

        -- Check Redis connectivity
        local red = redis:new()
        local ok, err = red:connect("127.0.0.1", 6379)
        if not ok then
            ngx.status = 503
            ngx.say("Redis unhealthy: ", err)
            return
        end

        -- Check MySQL connectivity
        local db = mysql:new()
        local ok, err = db:connect{
            host = "127.0.0.1",
            port = 3306,
            database = "production"
        }
        if not ok then
            ngx.status = 503
            ngx.say("MySQL unhealthy: ", err)
            return
        end

        -- All checks passed
        ngx.status = 200
        ngx.say("Healthy")
    }
}
```
### Circuit Breaker Pattern
- Prevents cascading failures
- Automatic service isolation
- Gradual recovery mechanisms
- Configurable thresholds and timeouts (a minimal sketch follows this list)
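A minimal Python sketch of the pattern; the thresholds and timeout values are illustrative. The breaker opens after consecutive failures, fails fast while open, and lets a trial call through after a cool-down:

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive failures, then allow a
    trial (half-open) call after `reset_timeout` seconds."""

    def __init__(self, max_failures: int = 5, reset_timeout: float = 30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open; failing fast")
            self.opened_at = None  # half-open: allow one trial call

        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip (or re-trip) open
            raise

        self.failures = 0  # success closes the circuit
        return result
```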
## Failure Detection and Recovery
Quick and accurate failure detection is essential for maintaining high availability:
### Detection Methods
- Heartbeat: Regular ping checks between components (see the monitor sketch below)
- Consensus: Quorum-based failure agreement (Raft, Paxos)
- Synthetic Monitoring: Simulated user transactions
- Anomaly Detection: ML-based pattern recognition
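As an illustration of heartbeat-based detection, here is a minimal monitor that suspects a node once a configurable number of intervals pass without a heartbeat; the interval and threshold are assumptions:

```python
import time

class HeartbeatMonitor:
    """Suspect a node down after `misses_allowed` missed intervals."""

    def __init__(self, interval: float = 5.0, misses_allowed: int = 3):
        self.interval = interval
        self.misses_allowed = misses_allowed
        self.last_seen: dict[str, float] = {}

    def record_heartbeat(self, node: str) -> None:
        self.last_seen[node] = time.monotonic()

    def suspected_down(self, node: str) -> bool:
        last = self.last_seen.get(node)
        if last is None:
            return True  # never heard from this node
        return time.monotonic() - last > self.interval * self.misses_allowed
```

In a cluster, individual suspicions like these are usually combined into a quorum decision (as in Raft or Paxos) before acting, to avoid split-brain failovers triggered by a single observer's network hiccup.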
### Recovery Strategies
```bash
#!/bin/bash
# Automated recovery script example

MAX_RETRIES=3
RETRY_DELAY=30

recover_service() {
    local service=$1
    local retries=0

    while [ $retries -lt $MAX_RETRIES ]; do
        echo "Attempting to recover $service (attempt $((retries+1)))"

        # Stop the service
        systemctl stop "$service"
        sleep 5

        # Clear any locks or temp files
        rm -f "/var/lock/$service"/*

        # Start the service
        if systemctl start "$service"; then
            echo "$service recovered successfully"

            # Verify the service is still healthy after a grace period
            sleep 10
            if systemctl is-active --quiet "$service"; then
                return 0
            fi
        fi

        retries=$((retries+1))
        sleep $RETRY_DELAY
    done

    echo "Failed to recover $service after $MAX_RETRIES attempts"
    # Trigger escalation
    notify_oncall "Service $service failed to recover"
    return 1
}
```
## Best Practices
Follow these proven practices to achieve and maintain high availability:
### Design Principles
- Design for failure - assume components will fail
- Eliminate single points of failure (SPOF)
- Implement defense in depth
- Use immutable infrastructure
- Automate everything possible
- Monitor proactively, not reactively
### Testing High Availability
- Chaos Engineering: Intentionally introduce failures
- Load Testing: Verify performance under stress
- Failover Drills: Regular practice of recovery procedures
- Disaster Recovery: Full-scale regional failover tests
### Monitoring and Alerting
```yaml
# Example Prometheus alerting rules
groups:
  - name: high_availability
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"

      - alert: DatabaseReplicationLag
        expr: mysql_slave_lag_seconds > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database replication lag exceeds 10 seconds"
```
### Documentation and Runbooks
Maintain comprehensive documentation for all HA components:
- Architecture diagrams and data flows
- Failover procedures and decision trees
- Recovery time objectives (RTO: how long recovery may take) and recovery point objectives (RPO: how much data loss is tolerable)
- Contact lists and escalation policies
- Post-incident review templates