High Availability Patterns

Introduction

High availability (HA) is the ability of a system to operate continuously without failing for a designated period. This guide covers proven patterns and strategies to achieve high availability in modern infrastructure deployments.

Key Principle: High availability is achieved through redundancy, fault tolerance, and automated failover mechanisms.

Availability Levels

Availability %   Downtime per Year   Downtime per Month   Typical Use Case
99%              3.65 days           7.31 hours           Development environments
99.9%            8.77 hours          43.83 minutes        Standard web applications
99.99%           52.60 minutes       4.38 minutes         E-commerce, SaaS platforms
99.999%          5.26 minutes        26.30 seconds        Financial services, healthcare
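
The downtime columns follow directly from the availability percentage; the figures above assume an average 365.25-day year and a twelfth of that for a month. A quick Python sketch of the same arithmetic, included purely for illustration:

# Downtime budget per availability level (reproduces the arithmetic behind the table above)
HOURS_PER_YEAR = 365.25 * 24        # average year length used for the yearly column
HOURS_PER_MONTH = HOURS_PER_YEAR / 12

def downtime_budget(availability_pct):
    """Return (hours of allowed downtime per year, hours per month)."""
    unavailable = 1 - availability_pct / 100
    return unavailable * HOURS_PER_YEAR, unavailable * HOURS_PER_MONTH

for pct in (99.0, 99.9, 99.99, 99.999):
    per_year, per_month = downtime_budget(pct)
    print(f"{pct}% -> {per_year:.2f} h/year, {per_month * 60:.2f} min/month")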

Active-Passive Pattern

Typical availability: 99.9%

In an active-passive configuration, one instance handles all traffic while standby instances remain ready to take over in case of failure.

┌─────────────┐
│   Users     │
└──────┬──────┘
       │
   ┌───▼────┐
   │  Load  │
   │Balancer│
   └───┬────┘
       │
  ┌────▼─────────┬──────────────┐
  │              │              │
┌─▼──────────┐ ┌─▼──────────┐ ┌─▼──────────┐
│   Active   │ │  Passive   │ │  Passive   │
│  Instance  │ │ Instance 1 │ │ Instance 2 │
│  (Primary) │ │ (Standby)  │ │ (Standby)  │
└────────────┘ └────────────┘ └────────────┘
      │              │              │
      └──────────────┴──────────────┘
                     │
              ┌──────▼──────┐
              │  Database   │
              │  (Primary)  │
              └─────────────┘

Implementation

# Health check configuration
health_check {
  interval: 30s
  timeout: 5s
  healthy_threshold: 2
  unhealthy_threshold: 3
  
  http {
    path: "/health"
    expected_codes: [200, 204]
  }
}

# Failover configuration
failover {
  detection_time: 60s
  promotion_time: 30s
  automatic_failback: false
  
  priorities {
    primary: 100
    standby_1: 50
    standby_2: 25
  }
}
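
The blocks above are illustrative pseudo-configuration rather than the syntax of any particular product. The promotion logic they imply can be sketched roughly as follows; this is a minimal Python sketch in which check_health and the instance names are placeholders:

# Minimal active-passive failover sketch: promote the highest-priority healthy standby
import time

PRIORITIES = {"primary": 100, "standby_1": 50, "standby_2": 25}
UNHEALTHY_THRESHOLD = 3   # consecutive failed checks, matching unhealthy_threshold above
CHECK_INTERVAL = 30       # seconds, matching the health check interval above

def check_health(instance):
    """Placeholder: probe the instance's /health endpoint and return True if it responds."""
    raise NotImplementedError

def monitor(active):
    """Watch the active instance; on sustained failure, return the standby to promote."""
    failures = 0
    while True:
        failures = 0 if check_health(active) else failures + 1
        if failures >= UNHEALTHY_THRESHOLD:
            standbys = [name for name in PRIORITIES if name != active and check_health(name)]
            if not standbys:
                raise RuntimeError("no healthy standby available")
            # Promote the healthy standby with the highest configured priority.
            return max(standbys, key=PRIORITIES.get)
        time.sleep(CHECK_INTERVAL)

Automatic failback is deliberately omitted, mirroring automatic_failback: false above: returning traffic to the recovered primary is usually a deliberate, operator-driven step.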

Advantages

  • Simple to implement and manage
  • Lower cost than active-active
  • Minimal data synchronization concerns (only the active instance accepts writes)
  • Clear failover path

Disadvantages

  • Wasted resources on standby
  • Failover time impacts availability
  • Manual intervention may be required
  • Testing failover can be disruptive

Active-Active Pattern

Typical availability: 99.99%

Active-active configurations distribute traffic across multiple active instances, providing both high availability and increased capacity.

┌─────────────┐
│   Users     │
└──────┬──────┘
       │
   ┌───▼────┐
   │  Load  │
   │Balancer│
   └───┬────┘
       │
    ┌──┴───────┬──────────┬──────────┐
    │          │          │          │
┌───▼────┐ ┌───▼────┐ ┌───▼────┐ ┌───▼────┐
│Active 1│ │Active 2│ │Active 3│ │Active 4│
│  25%   │ │  25%   │ │  25%   │ │  25%   │
└───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘
    │          │          │          │
    └──────────┴─────┬────┴──────────┘
                     │
            ┌───────▼────────┐
            │ Shared Storage │
            │   (Clustered)  │
            └────────────────┘

Load Distribution Strategies

  • Round Robin: Equal distribution across all instances
  • Least Connections: Route to instance with fewest active connections
  • Weighted: Distribute based on instance capacity
  • Geographic: Route based on user location
  • Session Affinity: Maintain user sessions on the same instance (see the hashing sketch after the configuration below)

# Active-Active Load Balancer Configuration
upstream backend {
  least_conn;
  
  server backend1.example.com:80 weight=5 max_fails=3 fail_timeout=30s;
  server backend2.example.com:80 weight=5 max_fails=3 fail_timeout=30s;
  server backend3.example.com:80 weight=5 max_fails=3 fail_timeout=30s;
  server backend4.example.com:80 weight=5 max_fails=3 fail_timeout=30s;
  
  keepalive 32;
}

server {
  listen 80;
  
  location / {
    proxy_pass http://backend;
    proxy_http_version 1.1;             # required for the upstream keepalive connections above
    proxy_set_header Connection "";     # clear the Connection header so keepalive is honored
    proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
    proxy_connect_timeout 2s;
    proxy_read_timeout 30s;
  }
}
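
The configuration above uses least_conn. When the Session Affinity strategy listed earlier is needed instead, the core idea is to hash a stable client key so repeat requests land on the same instance; a minimal Python sketch (the backend names are placeholders):

# Session affinity via stable hashing of a client key (illustrative only)
import hashlib

BACKENDS = ["backend1", "backend2", "backend3", "backend4"]   # placeholder pool

def pick_backend(session_id):
    """Map a session ID to a backend deterministically, so repeat requests stick."""
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return BACKENDS[int(digest, 16) % len(BACKENDS)]

print(pick_backend("user-42"))   # the same session ID always yields the same backend

Plain modulo hashing reshuffles most sessions whenever the pool size changes; at the load balancer itself, nginx's hash ... consistent or ip_hash directives limit that churn.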

Advantages

  • No wasted resources
  • Instant failover capability
  • Better performance and scalability
  • Load distribution reduces strain

Disadvantages

  • More complex to implement
  • Data consistency challenges
  • Higher infrastructure costs
  • Requires session management

Multi-Region Pattern

Typical availability: 99.999%

Multi-region deployments provide the highest level of availability by distributing infrastructure across geographically separated data centers.

┌─────────────────────────────────────────────────┐
│                 Global Load Balancer             │
└────────┬──────────────┬──────────────┬──────────┘
         │              │              │
    ┌────▼────┐    ┌────▼────┐    ┌────▼────┐
    │Region 1 │    │Region 2 │    │Region 3 │
    │ US-East │    │   EU    │    │  APAC   │
    └────┬────┘    └────┬────┘    └────┬────┘
         │              │              │
    ┌────▼────┐    ┌────▼────┐    ┌────▼────┐
    │   App   │    │   App   │    │   App   │
    │ Cluster │◄───┤ Cluster │◄───┤ Cluster │
    └────┬────┘    └────┬────┘    └────┬────┘
         │              │              │
    ┌────▼────┐    ┌────▼────┐    ┌────▼────┐
    │   DB    │◄───┤   DB    │◄───┤   DB    │
    │Primary  │    │ Replica │    │ Replica │
    └─────────┘    └─────────┘    └─────────┘
         
    ◄─────────── Replication ───────────►

Key Components

  • Global Traffic Management (GTM) for intelligent routing
  • Cross-region data replication with conflict resolution
  • Regional failover automation
  • Latency-based routing for optimal performance (see the sketch below)
  • Disaster recovery orchestration

Consideration: Multi-region deployments require careful planning for data consistency, compliance, and network latency.
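
The latency-based routing listed above is normally performed by a global traffic manager or DNS service. Purely to illustrate the decision it makes, here is a hypothetical Python sketch; the region endpoints and health paths are placeholders, not a real GTM API:

# Latency-based region selection sketch (placeholder endpoints, not a real GTM API)
import time
import urllib.request

REGIONS = {
    "us-east": "https://us-east.example.com/health",
    "eu":      "https://eu.example.com/health",
    "apac":    "https://apac.example.com/health",
}

def probe(url, timeout=2.0):
    """Return the round-trip time in seconds, or None if the region is unreachable."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if resp.status != 200:
                return None
    except OSError:
        return None
    return time.monotonic() - start

def pick_region(default="us-east"):
    latencies = {name: probe(url) for name, url in REGIONS.items()}
    healthy = {name: rtt for name, rtt in latencies.items() if rtt is not None}
    # Prefer the fastest healthy region; fall back to a fixed default if all probes fail.
    return min(healthy, key=healthy.get) if healthy else default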

Database High Availability Patterns

Database availability is often the most critical component of system reliability. Here are proven patterns for database HA:

1. Primary-Replica Replication

# MySQL Primary-Replica Configuration
[mysqld]
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_do_db = production
binlog_format = ROW
max_binlog_size = 100M
expire_logs_days = 7

# Replica configuration
[mysqld]
server-id = 2
relay-log = /var/log/mysql/mysql-relay-bin.log
read_only = 1
super_read_only = 1
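
A common operational companion to the configuration above is a periodic replica-lag check. The sketch below assumes the PyMySQL client library and a dedicated monitoring account; the statement and column names vary by MySQL version, so treat it as illustrative:

# Replica lag check sketch (assumes the PyMySQL library and a monitoring account)
import pymysql

def replica_lag_seconds(host, user, password):
    """Return how many seconds the replica is behind, or None if replication is not running."""
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW REPLICA STATUS")   # "SHOW SLAVE STATUS" on pre-8.0.22 servers
            row = cur.fetchone()
            if not row:
                return None
            # Column name varies by version: Seconds_Behind_Source vs. Seconds_Behind_Master
            return row.get("Seconds_Behind_Source", row.get("Seconds_Behind_Master"))
    finally:
        conn.close()

lag = replica_lag_seconds("replica.example.com", "monitor", "***")
if lag is None or lag > 10:
    print("ALERT: replica not replicating or lagging more than 10 seconds")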

2. Multi-Master Replication

  • All nodes can accept writes
  • Conflict resolution mechanisms required (see the sketch below)
  • Suitable for geographically distributed applications
  • Examples: Galera Cluster, CockroachDB
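
As a deliberately simplified illustration of the conflict-resolution requirement above, the sketch below applies a last-write-wins rule; production multi-master systems typically rely on stronger mechanisms (Galera uses certification-based replication, CockroachDB uses consensus and transactional isolation):

# Last-write-wins conflict resolution sketch (a deliberately simplified strategy)
from dataclasses import dataclass

@dataclass
class Version:
    value: str
    updated_at: float   # wall-clock timestamp assigned by the writing node
    node_id: str

def resolve(a, b):
    """Keep the most recent write; break timestamp ties deterministically by node ID."""
    if a.updated_at != b.updated_at:
        return a if a.updated_at > b.updated_at else b
    return a if a.node_id > b.node_id else b

winner = resolve(Version("alice@new.example", 1718000000.5, "node-a"),
                 Version("alice@old.example", 1717999999.9, "node-b"))
print(winner.value)   # "alice@new.example"

Last-write-wins silently discards one of the conflicting writes and depends on reasonably synchronized clocks, which is why conflict resolution is called out above as a hard requirement rather than an afterthought.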

3. Database Clustering

Solution                           Type                Consistency   Best For
PostgreSQL Streaming Replication   Primary-Replica     Eventual      Read-heavy workloads
MySQL Group Replication            Multi-Master        Strong        Write scaling
MongoDB Replica Set                Primary-Secondary   Eventual      Document stores
Cassandra                          Masterless          Tunable       Large scale, multi-DC

Load Balancing Strategies

Effective load balancing is crucial for distributing traffic and detecting failures:

Health Check Configuration

# Advanced health check with custom logic (requires OpenResty with lua-resty-redis and lua-resty-mysql)
location /health {
  access_log off;

  content_by_lua_block {
    local redis = require "resty.redis"
    local mysql = require "resty.mysql"

    -- Check Redis connectivity
    local red = redis:new()
    red:set_timeout(1000)  -- 1 second timeout so a slow dependency cannot hang the check
    local ok, err = red:connect("127.0.0.1", 6379)
    if not ok then
      ngx.status = 503
      ngx.say("Redis unhealthy: ", err)
      return
    end
    red:set_keepalive(10000, 10)  -- return the connection to the pool

    -- Check MySQL connectivity
    local db = mysql:new()
    db:set_timeout(1000)
    local ok, err = db:connect{
      host = "127.0.0.1",
      port = 3306,
      database = "production",
      user = "health_check",      -- placeholder credentials for a monitoring account
      password = "change_me"
    }
    if not ok then
      ngx.status = 503
      ngx.say("MySQL unhealthy: ", err)
      return
    end
    db:set_keepalive(10000, 10)

    -- All checks passed
    ngx.status = 200
    ngx.say("Healthy")
  }
}

Circuit Breaker Pattern

  • Prevents cascading failures
  • Automatic service isolation
  • Gradual recovery mechanisms
  • Configurable thresholds and timeouts (see the sketch below)
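
A minimal sketch of the circuit breaker behavior described above; the state handling is the standard closed/open/half-open cycle, and the thresholds are illustrative defaults rather than values from any particular library:

# Minimal circuit breaker sketch: closed -> open -> half-open -> closed
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout   # seconds to wait before allowing a trial request
        self.failures = 0
        self.opened_at = None                # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None            # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0                    # a success closes the circuit again
        return result

Wrapping a downstream call, for example breaker.call(fetch_user, user_id) with an illustrative fetch_user function, turns a struggling dependency into fast local failures instead of a growing pile of hung requests, which is what stops the cascade.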

Failure Detection and Recovery

Quick and accurate failure detection is essential for maintaining high availability:

Detection Methods

  • Heartbeat: Regular ping checks between components (see the sketch below)
  • Consensus: Quorum-based failure agreement (Raft, Paxos)
  • Synthetic Monitoring: Simulated user transactions
  • Anomaly Detection: ML-based pattern recognition
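
A minimal sketch of the heartbeat method referenced above; the peer addresses and the TCP probe are placeholders, and real deployments combine this with quorum-based agreement so that a single observer cannot declare false failures:

# Heartbeat-based failure detection sketch (peer addresses and probe are placeholders)
import socket
import time

PEERS = {"app-1": ("10.0.0.11", 8080), "app-2": ("10.0.0.12", 8080)}
MISSED_LIMIT = 3      # consecutive missed heartbeats before a peer is declared failed
INTERVAL = 5.0        # seconds between probe rounds

def heartbeat(addr, timeout=2.0):
    """Treat the peer as alive if its service port accepts a TCP connection in time."""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False

def watch():
    missed = {name: 0 for name in PEERS}
    while True:
        for name, addr in PEERS.items():
            missed[name] = 0 if heartbeat(addr) else missed[name] + 1
            if missed[name] == MISSED_LIMIT:
                print(f"{name} declared failed after {MISSED_LIMIT} missed heartbeats")
        time.sleep(INTERVAL)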

Recovery Strategies

#!/bin/bash
# Automated recovery script example

MAX_RETRIES=3
RETRY_DELAY=30

recover_service() {
  local service=$1
  local retries=0
  
  while [ $retries -lt $MAX_RETRIES ]; do
    echo "Attempting to recover $service (attempt $((retries+1)))"
    
    # Stop the service
    systemctl stop $service
    sleep 5
    
    # Clear any locks or temp files
    rm -f /var/lock/$service/*
    
    # Start the service
    if systemctl start $service; then
      echo "$service recovered successfully"
      
      # Verify service is healthy
      sleep 10
      if systemctl is-active $service; then
        return 0
      fi
    fi
    
    retries=$((retries+1))
    sleep $RETRY_DELAY
  done
  
  echo "Failed to recover $service after $MAX_RETRIES attempts"
  # Trigger escalation
  notify_oncall "Service $service failed to recover"
  return 1
}

Best Practices

Follow these proven practices to achieve and maintain high availability:

Design Principles

  • Design for failure - assume components will fail
  • Eliminate single points of failure (SPOF)
  • Implement defense in depth
  • Use immutable infrastructure
  • Automate everything possible
  • Monitor proactively, not reactively

Testing High Availability

  • Chaos Engineering: Intentionally introduce failures
  • Load Testing: Verify performance under stress
  • Failover Drills: Regular practice of recovery procedures
  • Disaster Recovery: Full-scale regional failover tests

Monitoring and Alerting

Golden Signals: Monitor latency, traffic, errors, and saturation for comprehensive visibility.

# Example Prometheus alerting rules
groups:
  - name: high_availability
    rules:
    - alert: InstanceDown
      expr: up == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        
    - alert: HighErrorRate
      expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High error rate detected"
        
    - alert: DatabaseReplicationLag
      expr: mysql_slave_lag_seconds > 10
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Database replication lag exceeds 10 seconds"

Documentation and Runbooks

Maintain comprehensive documentation for all HA components:

  • Architecture diagrams and data flows
  • Failover procedures and decision trees
  • Recovery time objectives (RTO) and recovery point objectives (RPO)
  • Contact lists and escalation policies
  • Post-incident review templates