High Availability Patterns
Introduction
High availability (HA) is the ability of a system to operate continuously without failing for a designated period. This guide covers proven patterns and strategies to achieve high availability in modern infrastructure deployments.
Availability Levels
| Availability % | Downtime per Year | Downtime per Month | Typical Use Case |
|---|---|---|---|
| 99% | 3.65 days | 7.31 hours | Development environments |
| 99.9% | 8.77 hours | 43.83 minutes | Standard web applications |
| 99.99% | 52.60 minutes | 4.38 minutes | E-commerce, SaaS platforms |
| 99.999% | 5.26 minutes | 26.30 seconds | Financial services, healthcare |
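These downtime budgets follow directly from the availability fraction. A quick sketch of the arithmetic (Python is used for all illustrative sketches in this guide), using a 365.25-day year, which matches the table's figures:

# Convert an availability percentage into an allowed downtime budget.
HOURS_PER_YEAR = 365.25 * 24           # 8766 h, matching the table's figures
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # ~730.5 h

def downtime_hours(availability_pct: float, period_hours: float) -> float:
    return period_hours * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    per_year = downtime_hours(pct, HOURS_PER_YEAR)
    per_month = downtime_hours(pct, HOURS_PER_MONTH)
    print(f"{pct}%: {per_year * 60:.2f} min/year, {per_month * 60:.2f} min/month")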
Active-Passive Pattern
Availability target: 99.9%. In an active-passive configuration, one instance handles all traffic while standby instances remain ready to take over in case of failure.
               ┌─────────────┐
               │    Users    │
               └──────┬──────┘
                      │
                  ┌───▼────┐
                  │  Load  │
                  │Balancer│
                  └───┬────┘
                      │
       ┌──────────────┼──────────────┐
       │              │              │
┌──────▼─────┐ ┌──────▼─────┐ ┌──────▼─────┐
│   Active   │ │  Passive   │ │  Passive   │
│  Instance  │ │ Instance 1 │ │ Instance 2 │
│ (Primary)  │ │ (Standby)  │ │ (Standby)  │
└──────┬─────┘ └──────┬─────┘ └──────┬─────┘
       │              │              │
       └──────────────┼──────────────┘
                      │
               ┌──────▼──────┐
               │  Database   │
               │  (Primary)  │
               └─────────────┘
Implementation
# Health check configuration
health_check {
    interval: 30s
    timeout: 5s
    healthy_threshold: 2
    unhealthy_threshold: 3

    http {
        path: "/health"
        expected_codes: [200, 204]
    }
}

# Failover configuration
failover {
    detection_time: 60s
    promotion_time: 30s
    automatic_failback: false

    priorities {
        primary: 100
        standby_1: 50
        standby_2: 25
    }
}
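A sketch of the controller loop this configuration implies: probe the primary on the health-check interval, and once it has been unhealthy for the full detection window, promote the highest-priority healthy standby. All names here (is_healthy, promote) are hypothetical placeholders, not a real failover API:

import time

# Hypothetical failover controller matching the config above.
PRIORITIES = {"primary": 100, "standby_1": 50, "standby_2": 25}
CHECK_INTERVAL = 30   # health_check interval from above
DETECTION_TIME = 60   # failover detection_time from above

def is_healthy(node: str) -> bool:
    ...  # placeholder: probe the node's /health endpoint

def promote(node: str) -> None:
    ...  # placeholder: repoint the load balancer / VIP at the node

def pick_new_primary(failed: str) -> str:
    # Highest-priority node that is not the failed one and passes checks.
    candidates = {n: p for n, p in PRIORITIES.items()
                  if n != failed and is_healthy(n)}
    if not candidates:
        raise RuntimeError("no healthy standby available")
    return max(candidates, key=candidates.get)

def monitor(primary: str) -> None:
    unhealthy_since = None
    while True:
        if is_healthy(primary):
            unhealthy_since = None            # healthy: reset the clock
        elif unhealthy_since is None:
            unhealthy_since = time.monotonic()
        elif time.monotonic() - unhealthy_since >= DETECTION_TIME:
            primary = pick_new_primary(failed=primary)
            promote(primary)                  # automatic_failback stays off
            unhealthy_since = None
        time.sleep(CHECK_INTERVAL)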
Advantages
- Simple to implement and manage
- Lower cost than active-active
- Minimal data synchronization complexity (only one writer at a time)
- Clear failover path
Disadvantages
- Wasted resources on standby
- Failover time impacts availability
- Manual intervention may be required
- Testing failover can be disruptive
Active-Active Pattern
Availability target: 99.99%. Active-active configurations distribute traffic across multiple active instances, providing both high availability and increased capacity.
              ┌─────────────┐
              │    Users    │
              └──────┬──────┘
                     │
                 ┌───▼────┐
                 │  Load  │
                 │Balancer│
                 └───┬────┘
                     │
    ┌──────────┬─────┴────┬──────────┐
    │          │          │          │
┌───▼────┐ ┌───▼────┐ ┌───▼────┐ ┌───▼────┐
│Active 1│ │Active 2│ │Active 3│ │Active 4│
│  25%   │ │  25%   │ │  25%   │ │  25%   │
└───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘
    │          │          │          │
    └──────────┴─────┬────┴──────────┘
                     │
             ┌───────▼────────┐
             │ Shared Storage │
             │  (Clustered)   │
             └────────────────┘
Load Distribution Strategies
- Round Robin: Equal distribution across all instances
- Least Connections: Route to instance with fewest active connections
- Weighted: Distribute based on instance capacity
- Geographic: Route based on user location
- Session Affinity: Maintain user sessions on the same instance
# Active-Active Load Balancer Configuration (nginx)
upstream backend {
    least_conn;

    server backend1.example.com:80 weight=5 max_fails=3 fail_timeout=30s;
    server backend2.example.com:80 weight=5 max_fails=3 fail_timeout=30s;
    server backend3.example.com:80 weight=5 max_fails=3 fail_timeout=30s;
    server backend4.example.com:80 weight=5 max_fails=3 fail_timeout=30s;

    keepalive 32;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
        proxy_connect_timeout 2s;
        proxy_read_timeout 30s;

        # Required for the upstream keepalive pool to take effect
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
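Two of the distribution strategies listed earlier reduce to a few lines of selection logic. A minimal Python sketch, with backend names mirroring the nginx example above:

import random

# Weights mirror the weight=5 settings in the nginx config above.
BACKENDS = {"backend1": 5, "backend2": 5, "backend3": 5, "backend4": 5}

def pick_weighted() -> str:
    # Weighted: higher-weight backends receive proportionally more requests.
    names, weights = zip(*BACKENDS.items())
    return random.choices(names, weights=weights, k=1)[0]

# In-flight request counts, updated as requests start and finish.
active = {name: 0 for name in BACKENDS}

def pick_least_connections() -> str:
    # Least connections: route to the backend with the fewest open requests.
    return min(active, key=active.get)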
Advantages
- No wasted resources
- Instant failover capability
- Better performance and scalability
- Load distribution reduces strain
Disadvantages
- More complex to implement
- Data consistency challenges
- Higher infrastructure costs
- Requires session management
Multi-Region Pattern
Availability target: 99.999%. Multi-region deployments provide the highest level of availability by distributing infrastructure across geographically separated data centers.
┌─────────────────────────────────────────────────┐
│              Global Load Balancer               │
└────────┬──────────────┬──────────────┬──────────┘
         │              │              │
    ┌────▼────┐    ┌────▼────┐    ┌────▼────┐
    │Region 1 │    │Region 2 │    │Region 3 │
    │ US-East │    │   EU    │    │  APAC   │
    └────┬────┘    └────┬────┘    └────┬────┘
         │              │              │
    ┌────▼────┐    ┌────▼────┐    ┌────▼────┐
    │   App   │    │   App   │    │   App   │
    │ Cluster │◄───┤ Cluster │◄───┤ Cluster │
    └────┬────┘    └────┬────┘    └────┬────┘
         │              │              │
    ┌────▼────┐    ┌────▼────┐    ┌────▼────┐
    │   DB    │◄───┤   DB    │◄───┤   DB    │
    │ Primary │    │ Replica │    │ Replica │
    └─────────┘    └─────────┘    └─────────┘
    ◄─────────── Replication ───────────►
Key Components
- Global Traffic Management (GTM) for intelligent routing
- Cross-region data replication with conflict resolution
- Regional failover automation
- Latency-based routing for optimal performance
- Disaster recovery orchestration
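Of these, latency-based routing is the easiest to illustrate: send each client to the healthy region with the lowest measured round-trip time. A sketch with hypothetical probe numbers:

# Hypothetical per-client RTT probes, e.g. collected by the GTM layer.
rtt_ms = {"us-east": 24.0, "eu": 110.0, "apac": 210.0}
healthy = {"us-east": True, "eu": True, "apac": True}

def route(client_rtt_ms: dict[str, float]) -> str:
    candidates = {r: ms for r, ms in client_rtt_ms.items() if healthy.get(r)}
    if not candidates:
        # No healthy region left: hand off to DR orchestration instead.
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)

print(route(rtt_ms))  # -> us-east for this client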
Database High Availability Patterns
Database availability is often the most critical component of system reliability. Here are proven patterns for database HA:
1. Primary-Replica Replication
# MySQL Primary-Replica Configuration

# Primary configuration
[mysqld]
server-id        = 1
log_bin          = /var/log/mysql/mysql-bin.log
binlog_do_db     = production
binlog_format    = ROW
max_binlog_size  = 100M
expire_logs_days = 7

# Replica configuration
[mysqld]
server-id       = 2
relay-log       = /var/log/mysql/mysql-relay-bin.log
read_only       = 1
super_read_only = 1
2. Multi-Master Replication
- All nodes can accept writes
- Conflict resolution mechanisms required
- Suitable for geographically distributed applications
- Examples: Galera Cluster, CockroachDB
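Conflict resolution is the crux of multi-master setups. The simplest (and lossiest) scheme is last-write-wins by timestamp; a toy sketch:

# Toy last-write-wins merge: keep whichever version has the newer timestamp.
# Real systems (e.g. Galera's certification-based replication, CRDTs) are
# far more careful about lost updates than this.
def merge_lww(local: dict, remote: dict) -> dict:
    # Each value is a (payload, timestamp) pair; ties broken arbitrarily.
    merged = dict(local)
    for key, (payload, ts) in remote.items():
        if key not in merged or ts > merged[key][1]:
            merged[key] = (payload, ts)
    return merged

a = {"cart": ("3 items", 1700000005)}
b = {"cart": ("4 items", 1700000009)}
print(merge_lww(a, b))  # -> {'cart': ('4 items', 1700000009)}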
3. Database Clustering
| Solution | Type | Consistency | Best For |
|---|---|---|---|
| PostgreSQL Streaming | Primary-Replica | Eventual | Read-heavy workloads |
| MySQL Group Replication | Multi-Master | Strong | Write scaling |
| MongoDB Replica Set | Primary-Secondary | Eventual | Document stores |
| Cassandra | Masterless | Tunable | Large scale, multi-DC |
Load Balancing Strategies
Effective load balancing is crucial for distributing traffic and detecting failures:
Health Check Configuration
# Advanced health check with custom logic (OpenResty / lua-resty libraries)
location /health {
    access_log off;

    content_by_lua_block {
        local redis = require "resty.redis"
        local mysql = require "resty.mysql"

        -- Check Redis connectivity
        local red = redis:new()
        red:set_timeout(1000)  -- 1s timeout so the check itself stays fast
        local ok, err = red:connect("127.0.0.1", 6379)
        if not ok then
            ngx.status = 503
            ngx.say("Redis unhealthy: ", err)
            return
        end
        red:set_keepalive(10000, 10)  -- return the connection to the pool

        -- Check MySQL connectivity (credentials below are placeholders)
        local db = mysql:new()
        db:set_timeout(1000)
        local ok, err = db:connect{
            host = "127.0.0.1",
            port = 3306,
            database = "production",
            user = "health_check",  -- placeholder account
            password = "change_me"  -- placeholder password
        }
        if not ok then
            ngx.status = 503
            ngx.say("MySQL unhealthy: ", err)
            return
        end
        db:set_keepalive(10000, 10)

        -- All checks passed
        ngx.status = 200
        ngx.say("Healthy")
    }
}
Circuit Breaker Pattern
- Prevents cascading failures
- Automatic service isolation
- Gradual recovery mechanisms
- Configurable thresholds and timeouts
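A minimal sketch of the pattern with the classic three states (closed, open, half-open); the threshold and timeout correspond to the configurable values just listed:

import time

class CircuitBreaker:
    """Classic closed -> open -> half-open breaker."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds before half-open probe
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Timeout elapsed: half-open, allow one probe call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = time.monotonic()  # (re)open on failure
            raise
        else:
            self.failures = 0
            self.opened_at = None  # success: close the circuit
            return result

Wrapping each call to a flaky dependency, e.g. breaker.call(fetch_user, user_id), isolates a struggling service instead of letting its slowness cascade to callers.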
Failure Detection and Recovery
Quick and accurate failure detection is essential for maintaining high availability:
Detection Methods
- Heartbeat: Regular ping checks between components (sketched after this list)
- Consensus: Quorum-based failure agreement (Raft, Paxos)
- Synthetic Monitoring: Simulated user transactions
- Anomaly Detection: ML-based pattern recognition
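Heartbeat detection is the simplest of these: track when each peer last checked in, and flag anyone silent past a tolerance of missed intervals. A minimal sketch:

import time

# Heartbeat monitor: peers call record() periodically; a peer silent for
# more than MISSED_LIMIT intervals is declared failed.
HEARTBEAT_INTERVAL = 5.0  # seconds between expected heartbeats
MISSED_LIMIT = 3          # tolerated missed beats before declaring failure

last_seen: dict[str, float] = {}

def record(peer: str) -> None:
    last_seen[peer] = time.monotonic()

def failed_peers() -> list[str]:
    deadline = HEARTBEAT_INTERVAL * MISSED_LIMIT
    now = time.monotonic()
    return [p for p, t in last_seen.items() if now - t > deadline]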
Recovery Strategies
#!/bin/bash
# Automated recovery script example

MAX_RETRIES=3
RETRY_DELAY=30

recover_service() {
    local service=$1
    local retries=0

    while [ "$retries" -lt "$MAX_RETRIES" ]; do
        echo "Attempting to recover $service (attempt $((retries + 1)))"

        # Stop the service
        systemctl stop "$service"
        sleep 5

        # Clear any locks or temp files
        rm -f /var/lock/"$service"/*

        # Start the service
        if systemctl start "$service"; then
            echo "$service recovered successfully"

            # Verify the service is still healthy after a settle period
            sleep 10
            if systemctl is-active --quiet "$service"; then
                return 0
            fi
        fi

        retries=$((retries + 1))
        sleep "$RETRY_DELAY"
    done

    echo "Failed to recover $service after $MAX_RETRIES attempts"
    # Trigger escalation (notify_oncall is assumed to be defined elsewhere)
    notify_oncall "Service $service failed to recover"
    return 1
}
Best Practices
Follow these proven practices to achieve and maintain high availability:
Design Principles
- Design for failure - assume components will fail
- Eliminate single points of failure (SPOF)
- Implement defense in depth
- Use immutable infrastructure
- Automate everything possible
- Monitor proactively, not reactively
Testing High Availability
- Chaos Engineering: Intentionally introduce failures (a toy example follows this list)
- Load Testing: Verify performance under stress
- Failover Drills: Regular practice of recovery procedures
- Disaster Recovery: Full-scale regional failover tests
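As a taste of chaos engineering, a toy drill might stop one instance at random and assert the service stays reachable through the load balancer. The container names and health URL below are hypothetical:

import random
import subprocess
import urllib.request

# Toy chaos experiment (hypothetical container names and health URL):
# kill one backend at random, verify the service survives, then restore it.
INSTANCES = ["backend1", "backend2", "backend3", "backend4"]
HEALTH_URL = "http://localhost/health"

victim = random.choice(INSTANCES)
subprocess.run(["docker", "stop", victim], check=True)  # inject the failure
try:
    with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
        assert resp.status == 200, "service degraded during instance loss"
        print(f"survived loss of {victim}")
finally:
    subprocess.run(["docker", "start", victim], check=True)  # restore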
Monitoring and Alerting
# Example Prometheus alerting rules
groups:
  - name: high_availability
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"

      - alert: DatabaseReplicationLag
        expr: mysql_slave_lag_seconds > 10  # metric name depends on your exporter
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database replication lag exceeds 10 seconds"
Documentation and Runbooks
Maintain comprehensive documentation for all HA components:
- Architecture diagrams and data flows
- Failover procedures and decision trees
- Recovery time objectives (RTO) and recovery point objectives (RPO)
- Contact lists and escalation policies
- Post-incident review templates