Disaster Recovery

Observo supports "Pilot Light" and "Warm Standby" DR models for Site deployments. Each strategy offers a different RPO/RTO trade-off, which customers use to guide their deployment decisions.

Disaster Recovery Architecture

Pilot Light

  • Site metadata is continuously synced to keep RPO low

  • Compute resources are shut off to optimize cost

  • This increases RTO when a failover is triggered, since compute instances for the clusters must be initialized first

  • Recommended when RTO can be longer and cost optimization is more important

  • Typical RPO: < 15 minutes

  • Typical RTO: 1-2 hours

Warm Standby

  • Site metadata is continuously synced to keep RPO low

  • To reduce RTO, a fraction of compute instances are initialized

  • This option results in higher cost than "Pilot Light"

  • Recommended for critical workloads

  • Typical RPO: < 5 minutes

  • Typical RTO: 10-15 minutes
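As a sketch, keeping a fraction of the dataplane running for Warm Standby can be done with the same scaling command used elsewhere on this page; the replica count here is illustrative and should be sized as a fraction of the active site's capacity that meets your RTO target.

```shell
# Warm Standby: keep a minimal dataplane running on the standby site.
# The replica count is illustrative — size it for your workload.
kubectl scale deployment dataplane --replicas=1 -n observo-client
```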

Implementation Details

Standby Site Configuration

In the standby site (as shown in the diagram), the dataplane is scaled to 0 to optimize costs while maintaining readiness. All configuration data is regularly pulled from the manager, just as in the active site, so the standby site can become operational quickly when needed. When a failover is triggered, the dataplane is scaled up and the site resumes full functionality.

Scaling Commands

Scale Dataplane to Zero (For Standby Site)

# Scale the dataplane deployment to 0 replicas
kubectl scale deployment dataplane --replicas=0 -n observo-client

Scale Up Dataplane (During Failover)

# Scale the dataplane deployment to the required number of replicas
kubectl scale deployment dataplane --replicas=3 -n observo-client
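After scaling up, it is worth waiting for the rollout to complete before routing traffic to the site; `kubectl rollout status` blocks until all replicas report ready (the timeout value is illustrative):

```shell
# Wait until the scaled-up dataplane pods are ready before cutting over traffic
kubectl rollout status deployment dataplane -n observo-client --timeout=10m
```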

HPA Management for Dataplane

Disable HPA (For Standby Site)

# Delete the HPA temporarily so it does not scale the dataplane back up
kubectl delete hpa dataplane-hpa -n observo-client

Enable HPA (During Failover)

kubectl autoscale deployment dataplane --min=3 --max=10 --cpu-percent=80 -n observo-client
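To confirm autoscaling is active after failover, you can inspect the HPA that the command above creates (note that `kubectl autoscale` names the HPA after the deployment, i.e. `dataplane`):

```shell
# Verify the HPA exists and is tracking CPU utilization for the dataplane
kubectl get hpa -n observo-client
kubectl describe hpa dataplane -n observo-client
```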

Key Features

  • Automated failover capabilities

  • Regular health checks and monitoring

  • Cross-region data replication

  • Automated backup and restore procedures

  • Configurable sync intervals for metadata

  • Flexible compute scaling options
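As a sketch of a basic health check, the standby site's readiness can be spot-checked with standard kubectl queries; the `app=dataplane` label selector is an assumption and should be matched to your manifests.

```shell
# Confirm the dataplane deployment exists on the standby site
# (in Pilot Light mode it should show 0/0 replicas)
kubectl get deployment dataplane -n observo-client

# List dataplane pods and their readiness
# NOTE: app=dataplane is an assumed label — adjust to your manifests
kubectl get pods -l app=dataplane -n observo-client
```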

Considerations

  • Network latency between primary and DR sites

  • Data consistency requirements

  • Cost implications of chosen strategy

  • Compliance and regulatory requirements

  • Testing and validation procedures

  • Regular testing of failover procedures to ensure recovery works as expected
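For regular failover testing, the commands documented on this page can be combined into a simple drill script. This is a sketch only: the namespace, replica counts, and HPA limits mirror the examples above and should be adjusted to your environment before use.

```shell
#!/usr/bin/env bash
# DR failover drill — a sketch combining the commands documented above.
# Run against the standby cluster's kubeconfig; values are illustrative.
set -euo pipefail

NAMESPACE=observo-client
REPLICAS=3

# 1. Scale the dataplane up from 0
kubectl scale deployment dataplane --replicas="$REPLICAS" -n "$NAMESPACE"

# 2. Wait for all replicas to become ready
kubectl rollout status deployment dataplane -n "$NAMESPACE" --timeout=10m

# 3. Re-enable autoscaling
kubectl autoscale deployment dataplane --min="$REPLICAS" --max=10 \
  --cpu-percent=80 -n "$NAMESPACE"

echo "Failover drill complete: dataplane is up with autoscaling enabled."
```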
