Disaster Recovery

Observo supports "Pilot Light" and "Warm Standby" DR models for Site deployments. Each strategy offers a different RPO/RTO trade-off, which customers use to guide their deployment decisions.

Disaster Recovery Architecture

Pilot Light

  • Site metadata is continuously synced to keep RPO low

  • Compute resources are shut off to optimize cost

  • This increases RTO when a failover is triggered, since compute instances for the clusters must be initialized first

  • Recommended when RTO can be longer and cost optimization is more important

  • Typical RPO: < 15 minutes

  • Typical RTO: 1-2 hours

Warm Standby

  • Site metadata is continuously synced to keep RPO low

  • To reduce RTO, a fraction of compute instances are initialized

  • This option results in higher cost than "Pilot Light"

  • Recommended for critical workloads

  • Typical RPO: < 5 minutes

  • Typical RTO: 10-15 minutes
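As a sketch, keeping a fraction of the dataplane running for Warm Standby can be done with the same scaling command used elsewhere on this page; the replica count here is illustrative and should be sized as a fraction of the active site's capacity that meets your RTO target.

```shell
# Warm Standby: keep a minimal dataplane running on the standby site.
# The replica count is illustrative — size it for your workload.
kubectl scale deployment dataplane --replicas=1 -n observo-client
```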

Implementation Details

Standby Site Configuration

In the standby site (as shown in the diagram), the dataplane is scaled to 0 to optimize costs while maintaining readiness. All configuration data is regularly pulled from the manager, just as in the active site, so the standby site can become operational quickly when needed. When a failover is triggered, the dataplane is scaled up and the site resumes full functionality.

Scaling Commands

Scale Dataplane to Zero (For Standby Site)

# Scale the dataplane deployment to 0 replicas
kubectl scale deployment dataplane --replicas=0 -n observo-client

Scale Up Dataplane (During Failover)

# Scale the dataplane deployment to the required number of replicas
kubectl scale deployment dataplane --replicas=3 -n observo-client
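After scaling up, it is worth waiting for the rollout to complete before routing traffic to the site; `kubectl rollout status` blocks until all replicas report ready (the timeout value is illustrative):

```shell
# Wait until the scaled-up dataplane pods are ready before cutting over traffic
kubectl rollout status deployment dataplane -n observo-client --timeout=10m
```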

HPA Management for Dataplane

Disable HPA (For Standby Site)

# Delete the HPA temporarily so it does not scale the dataplane back up
kubectl delete hpa dataplane-hpa -n observo-client

Enable HPA (During Failover)

kubectl autoscale deployment dataplane --min=3 --max=10 --cpu-percent=80 -n observo-client
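To confirm autoscaling is active after failover, you can inspect the HPA that the command above creates (note that `kubectl autoscale` names the HPA after the deployment, i.e. `dataplane`):

```shell
# Verify the HPA exists and is tracking CPU utilization for the dataplane
kubectl get hpa -n observo-client
kubectl describe hpa dataplane -n observo-client
```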

Key Features

  • Automated failover capabilities

  • Regular health checks and monitoring

  • Cross-region data replication

  • Automated backup and restore procedures

  • Configurable sync intervals for metadata

  • Flexible compute scaling options
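As a sketch of a basic health check, the standby site's readiness can be spot-checked with standard kubectl queries; the `app=dataplane` label selector is an assumption and should be matched to your manifests.

```shell
# Confirm the dataplane deployment exists on the standby site
# (in Pilot Light mode it should show 0/0 replicas)
kubectl get deployment dataplane -n observo-client

# List dataplane pods and their readiness
# NOTE: app=dataplane is an assumed label — adjust to your manifests
kubectl get pods -l app=dataplane -n observo-client
```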

Considerations

  • Network latency between primary and DR sites

  • Data consistency requirements

  • Cost implications of chosen strategy

  • Compliance and regulatory requirements

  • Testing and validation procedures

  • Regular testing of failover procedures to ensure recovery works as expected
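For regular failover testing, the commands documented on this page can be combined into a simple drill script. This is a sketch only: the namespace, replica counts, and HPA limits mirror the examples above and should be adjusted to your environment before use.

```shell
#!/usr/bin/env bash
# DR failover drill — a sketch combining the commands documented above.
# Run against the standby cluster's kubeconfig; values are illustrative.
set -euo pipefail

NAMESPACE=observo-client
REPLICAS=3

# 1. Scale the dataplane up from 0
kubectl scale deployment dataplane --replicas="$REPLICAS" -n "$NAMESPACE"

# 2. Wait for all replicas to become ready
kubectl rollout status deployment dataplane -n "$NAMESPACE" --timeout=10m

# 3. Re-enable autoscaling
kubectl autoscale deployment dataplane --min="$REPLICAS" --max=10 \
  --cpu-percent=80 -n "$NAMESPACE"

echo "Failover drill complete: dataplane is up with autoscaling enabled."
```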
