High Availability Patterns
Introduction
High availability (HA) is the ability of a system to operate continuously without failing for a designated period. This guide covers proven patterns and strategies to achieve high availability in modern infrastructure deployments.
Availability Levels
| Availability % | Downtime per Year | Downtime per Month | Typical Use Case |
|---|---|---|---|
| 99% | 3.65 days | 7.31 hours | Development environments |
| 99.9% | 8.77 hours | 43.83 minutes | Standard web applications |
| 99.99% | 52.60 minutes | 4.38 minutes | E-commerce, SaaS platforms |
| 99.999% | 5.26 minutes | 26.30 seconds | Financial services, healthcare |
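These downtime budgets follow directly from the availability fraction. A quick sketch of the arithmetic (Python is used for all illustrative sketches in this guide), using a 365.25-day year, which matches the table's figures:

# Convert an availability percentage into an allowed downtime budget.
HOURS_PER_YEAR = 365.25 * 24           # 8766 h, matching the table's figures
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # ~730.5 h

def downtime_hours(availability_pct: float, period_hours: float) -> float:
    return period_hours * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    per_year = downtime_hours(pct, HOURS_PER_YEAR)
    per_month = downtime_hours(pct, HOURS_PER_MONTH)
    print(f"{pct}%: {per_year * 60:.2f} min/year, {per_month * 60:.2f} min/month")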
Active-Passive Pattern
Availability target: 99.9%. In an active-passive configuration, one instance handles all traffic while standby instances remain ready to take over in case of failure.
               ┌─────────────┐
               │    Users    │
               └──────┬──────┘
                      │
                  ┌───▼────┐
                  │  Load  │
                  │Balancer│
                  └───┬────┘
                      │
       ┌──────────────┼──────────────┐
       │              │              │
┌──────▼─────┐ ┌──────▼─────┐ ┌──────▼─────┐
│   Active   │ │  Passive   │ │  Passive   │
│  Instance  │ │ Instance 1 │ │ Instance 2 │
│ (Primary)  │ │ (Standby)  │ │ (Standby)  │
└──────┬─────┘ └──────┬─────┘ └──────┬─────┘
       │              │              │
       └──────────────┼──────────────┘
                      │
               ┌──────▼──────┐
               │  Database   │
               │  (Primary)  │
               └─────────────┘
Implementation
# Health check configuration
health_check {
    interval: 30s
    timeout: 5s
    healthy_threshold: 2
    unhealthy_threshold: 3

    http {
        path: "/health"
        expected_codes: [200, 204]
    }
}

# Failover configuration
failover {
    detection_time: 60s
    promotion_time: 30s
    automatic_failback: false

    priorities {
        primary: 100
        standby_1: 50
        standby_2: 25
    }
}
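A sketch of the controller loop this configuration implies: probe the primary on the health-check interval, and once it has been unhealthy for the full detection window, promote the highest-priority healthy standby. All names here (is_healthy, promote) are hypothetical placeholders, not a real failover API:

import time

# Hypothetical failover controller matching the config above.
PRIORITIES = {"primary": 100, "standby_1": 50, "standby_2": 25}
CHECK_INTERVAL = 30   # health_check interval from above
DETECTION_TIME = 60   # failover detection_time from above

def is_healthy(node: str) -> bool:
    ...  # placeholder: probe the node's /health endpoint

def promote(node: str) -> None:
    ...  # placeholder: repoint the load balancer / VIP at the node

def pick_new_primary(failed: str) -> str:
    # Highest-priority node that is not the failed one and passes checks.
    candidates = {n: p for n, p in PRIORITIES.items()
                  if n != failed and is_healthy(n)}
    if not candidates:
        raise RuntimeError("no healthy standby available")
    return max(candidates, key=candidates.get)

def monitor(primary: str) -> None:
    unhealthy_since = None
    while True:
        if is_healthy(primary):
            unhealthy_since = None            # healthy: reset the clock
        elif unhealthy_since is None:
            unhealthy_since = time.monotonic()
        elif time.monotonic() - unhealthy_since >= DETECTION_TIME:
            primary = pick_new_primary(failed=primary)
            promote(primary)                  # automatic_failback stays off
            unhealthy_since = None
        time.sleep(CHECK_INTERVAL)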
Advantages
- Simple to implement and manage
- Lower cost than active-active
- Minimal data synchronization complexity (only one writer at a time)
- Clear failover path
Disadvantages
- Wasted resources on standby
- Failover time impacts availability
- Manual intervention may be required
- Testing failover can be disruptive
Active-Active Pattern
Availability target: 99.99%. Active-active configurations distribute traffic across multiple active instances, providing both high availability and increased capacity.
              ┌─────────────┐
              │    Users    │
              └──────┬──────┘
                     │
                 ┌───▼────┐
                 │  Load  │
                 │Balancer│
                 └───┬────┘
                     │
    ┌──────────┬─────┴────┬──────────┐
    │          │          │          │
┌───▼────┐ ┌───▼────┐ ┌───▼────┐ ┌───▼────┐
│Active 1│ │Active 2│ │Active 3│ │Active 4│
│  25%   │ │  25%   │ │  25%   │ │  25%   │
└───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘
    │          │          │          │
    └──────────┴─────┬────┴──────────┘
                     │
             ┌───────▼────────┐
             │ Shared Storage │
             │  (Clustered)   │
             └────────────────┘
Load Distribution Strategies
- Round Robin: Equal distribution across all instances
- Least Connections: Route to instance with fewest active connections
- Weighted: Distribute based on instance capacity
- Geographic: Route based on user location
- Session Affinity: Maintain user sessions on the same instance
# Active-Active Load Balancer Configuration (nginx)
upstream backend {
    least_conn;

    server backend1.example.com:80 weight=5 max_fails=3 fail_timeout=30s;
    server backend2.example.com:80 weight=5 max_fails=3 fail_timeout=30s;
    server backend3.example.com:80 weight=5 max_fails=3 fail_timeout=30s;
    server backend4.example.com:80 weight=5 max_fails=3 fail_timeout=30s;

    keepalive 32;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
        proxy_connect_timeout 2s;
        proxy_read_timeout 30s;

        # Required for the upstream keepalive pool to take effect
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
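Two of the distribution strategies listed earlier reduce to a few lines of selection logic. A minimal Python sketch, with backend names mirroring the nginx example above:

import random

# Weights mirror the weight=5 settings in the nginx config above.
BACKENDS = {"backend1": 5, "backend2": 5, "backend3": 5, "backend4": 5}

def pick_weighted() -> str:
    # Weighted: higher-weight backends receive proportionally more requests.
    names, weights = zip(*BACKENDS.items())
    return random.choices(names, weights=weights, k=1)[0]

# In-flight request counts, updated as requests start and finish.
active = {name: 0 for name in BACKENDS}

def pick_least_connections() -> str:
    # Least connections: route to the backend with the fewest open requests.
    return min(active, key=active.get)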
Advantages
- No wasted resources
- Instant failover capability
- Better performance and scalability
- Load distribution reduces strain
Disadvantages
- More complex to implement
- Data consistency challenges
- Higher infrastructure costs
- Requires session management
Multi-Region Pattern
Availability target: 99.999%. Multi-region deployments provide the highest level of availability by distributing infrastructure across geographically separated data centers.
┌─────────────────────────────────────────────────┐
│              Global Load Balancer               │
└────────┬──────────────┬──────────────┬──────────┘
         │              │              │
    ┌────▼────┐    ┌────▼────┐    ┌────▼────┐
    │Region 1 │    │Region 2 │    │Region 3 │
    │ US-East │    │   EU    │    │  APAC   │
    └────┬────┘    └────┬────┘    └────┬────┘
         │              │              │
    ┌────▼────┐    ┌────▼────┐    ┌────▼────┐
    │   App   │    │   App   │    │   App   │
    │ Cluster │◄───┤ Cluster │◄───┤ Cluster │
    └────┬────┘    └────┬────┘    └────┬────┘
         │              │              │
    ┌────▼────┐    ┌────▼────┐    ┌────▼────┐
    │   DB    │◄───┤   DB    │◄───┤   DB    │
    │ Primary │    │ Replica │    │ Replica │
    └─────────┘    └─────────┘    └─────────┘
    ◄─────────── Replication ───────────►
Key Components
- Global Traffic Management (GTM) for intelligent routing
- Cross-region data replication with conflict resolution
- Regional failover automation
- Latency-based routing for optimal performance
- Disaster recovery orchestration
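Of these, latency-based routing is the easiest to illustrate: send each client to the healthy region with the lowest measured round-trip time. A sketch with hypothetical probe numbers:

# Hypothetical per-client RTT probes, e.g. collected by the GTM layer.
rtt_ms = {"us-east": 24.0, "eu": 110.0, "apac": 210.0}
healthy = {"us-east": True, "eu": True, "apac": True}

def route(client_rtt_ms: dict[str, float]) -> str:
    candidates = {r: ms for r, ms in client_rtt_ms.items() if healthy.get(r)}
    if not candidates:
        # No healthy region left: hand off to DR orchestration instead.
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)

print(route(rtt_ms))  # -> us-east for this client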
Database High Availability Patterns
Database availability is often the most critical component of system reliability. Here are proven patterns for database HA:
1. Primary-Replica Replication
# MySQL Primary-Replica Configuration

# Primary configuration
[mysqld]
server-id        = 1
log_bin          = /var/log/mysql/mysql-bin.log
binlog_do_db     = production
binlog_format    = ROW
max_binlog_size  = 100M
expire_logs_days = 7

# Replica configuration
[mysqld]
server-id       = 2
relay-log       = /var/log/mysql/mysql-relay-bin.log
read_only       = 1
super_read_only = 1
2. Multi-Master Replication
- All nodes can accept writes
- Conflict resolution mechanisms required
- Suitable for geographically distributed applications
- Examples: Galera Cluster, CockroachDB
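Conflict resolution is the crux of multi-master setups. The simplest (and lossiest) scheme is last-write-wins by timestamp; a toy sketch:

# Toy last-write-wins merge: keep whichever version has the newer timestamp.
# Real systems (e.g. Galera's certification-based replication, CRDTs) are
# far more careful about lost updates than this.
def merge_lww(local: dict, remote: dict) -> dict:
    # Each value is a (payload, timestamp) pair; ties broken arbitrarily.
    merged = dict(local)
    for key, (payload, ts) in remote.items():
        if key not in merged or ts > merged[key][1]:
            merged[key] = (payload, ts)
    return merged

a = {"cart": ("3 items", 1700000005)}
b = {"cart": ("4 items", 1700000009)}
print(merge_lww(a, b))  # -> {'cart': ('4 items', 1700000009)}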
3. Database Clustering
| Solution | Type | Consistency | Best For |
|---|---|---|---|
| PostgreSQL Streaming | Primary-Replica | Eventual | Read-heavy workloads |
| MySQL Group Replication | Multi-Master | Strong | Write scaling |
| MongoDB Replica Set | Primary-Secondary | Eventual | Document stores |
| Cassandra | Masterless | Tunable | Large scale, multi-DC |
Load Balancing Strategies
Effective load balancing is crucial for distributing traffic and detecting failures:
Health Check Configuration
# Advanced health check with custom logic (OpenResty / lua-resty libraries)
location /health {
    access_log off;

    content_by_lua_block {
        local redis = require "resty.redis"
        local mysql = require "resty.mysql"

        -- Check Redis connectivity
        local red = redis:new()
        red:set_timeout(1000)  -- 1s timeout so the check itself stays fast
        local ok, err = red:connect("127.0.0.1", 6379)
        if not ok then
            ngx.status = 503
            ngx.say("Redis unhealthy: ", err)
            return
        end
        red:set_keepalive(10000, 10)  -- return the connection to the pool

        -- Check MySQL connectivity (credentials below are placeholders)
        local db = mysql:new()
        db:set_timeout(1000)
        local ok, err = db:connect{
            host = "127.0.0.1",
            port = 3306,
            database = "production",
            user = "health_check",  -- placeholder account
            password = "change_me"  -- placeholder password
        }
        if not ok then
            ngx.status = 503
            ngx.say("MySQL unhealthy: ", err)
            return
        end
        db:set_keepalive(10000, 10)

        -- All checks passed
        ngx.status = 200
        ngx.say("Healthy")
    }
}
Circuit Breaker Pattern
- Prevents cascading failures
- Automatic service isolation
- Gradual recovery mechanisms
- Configurable thresholds and timeouts
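A minimal sketch of the pattern with the classic three states (closed, open, half-open); the threshold and timeout correspond to the configurable values just listed:

import time

class CircuitBreaker:
    """Classic closed -> open -> half-open breaker."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds before half-open probe
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Timeout elapsed: half-open, allow one probe call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = time.monotonic()  # (re)open on failure
            raise
        else:
            self.failures = 0
            self.opened_at = None  # success: close the circuit
            return result

Wrapping each call to a flaky dependency, e.g. breaker.call(fetch_user, user_id), isolates a struggling service instead of letting its slowness cascade to callers.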
Failure Detection and Recovery
Quick and accurate failure detection is essential for maintaining high availability:
Detection Methods
- Heartbeat: Regular ping checks between components (sketched after this list)
- Consensus: Quorum-based failure agreement (Raft, Paxos)
- Synthetic Monitoring: Simulated user transactions
- Anomaly Detection: ML-based pattern recognition
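Heartbeat detection is the simplest of these: track when each peer last checked in, and flag anyone silent past a tolerance of missed intervals. A minimal sketch:

import time

# Heartbeat monitor: peers call record() periodically; a peer silent for
# more than MISSED_LIMIT intervals is declared failed.
HEARTBEAT_INTERVAL = 5.0  # seconds between expected heartbeats
MISSED_LIMIT = 3          # tolerated missed beats before declaring failure

last_seen: dict[str, float] = {}

def record(peer: str) -> None:
    last_seen[peer] = time.monotonic()

def failed_peers() -> list[str]:
    deadline = HEARTBEAT_INTERVAL * MISSED_LIMIT
    now = time.monotonic()
    return [p for p, t in last_seen.items() if now - t > deadline]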
Recovery Strategies
#!/bin/bash
# Automated recovery script example

MAX_RETRIES=3
RETRY_DELAY=30

recover_service() {
    local service=$1
    local retries=0

    while [ "$retries" -lt "$MAX_RETRIES" ]; do
        echo "Attempting to recover $service (attempt $((retries + 1)))"

        # Stop the service
        systemctl stop "$service"
        sleep 5

        # Clear any locks or temp files
        rm -f /var/lock/"$service"/*

        # Start the service
        if systemctl start "$service"; then
            echo "$service recovered successfully"

            # Verify the service is still healthy after a settle period
            sleep 10
            if systemctl is-active --quiet "$service"; then
                return 0
            fi
        fi

        retries=$((retries + 1))
        sleep "$RETRY_DELAY"
    done

    echo "Failed to recover $service after $MAX_RETRIES attempts"
    # Trigger escalation (notify_oncall is assumed to be defined elsewhere)
    notify_oncall "Service $service failed to recover"
    return 1
}
Best Practices
Follow these proven practices to achieve and maintain high availability:
Design Principles
- Design for failure - assume components will fail
- Eliminate single points of failure (SPOF)
- Implement defense in depth
- Use immutable infrastructure
- Automate everything possible
- Monitor proactively, not reactively
Testing High Availability
- Chaos Engineering: Intentionally introduce failures (a toy example follows this list)
- Load Testing: Verify performance under stress
- Failover Drills: Regular practice of recovery procedures
- Disaster Recovery: Full-scale regional failover tests
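As a taste of chaos engineering, a toy drill might stop one instance at random and assert the service stays reachable through the load balancer. The container names and health URL below are hypothetical:

import random
import subprocess
import urllib.request

# Toy chaos experiment (hypothetical container names and health URL):
# kill one backend at random, verify the service survives, then restore it.
INSTANCES = ["backend1", "backend2", "backend3", "backend4"]
HEALTH_URL = "http://localhost/health"

victim = random.choice(INSTANCES)
subprocess.run(["docker", "stop", victim], check=True)  # inject the failure
try:
    with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
        assert resp.status == 200, "service degraded during instance loss"
        print(f"survived loss of {victim}")
finally:
    subprocess.run(["docker", "start", victim], check=True)  # restore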
Monitoring and Alerting
# Example Prometheus alerting rules
groups:
  - name: high_availability
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"

      - alert: DatabaseReplicationLag
        expr: mysql_slave_lag_seconds > 10  # metric name depends on your exporter
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database replication lag exceeds 10 seconds"
Documentation and Runbooks
Maintain comprehensive documentation for all HA components:
- Architecture diagrams and data flows
- Failover procedures and decision trees
- Recovery time objectives (RTO) and recovery point objectives (RPO)
- Contact lists and escalation policies
- Post-incident review templates