Disaster Recovery
Observo supports "Pilot Light" and "Warm Standby" DR models for Site deployments. Each strategy offers a different RPO/RTO trade-off, on which customers base their deployment decisions.

Pilot Light
Site metadata is kept in sync so the standby site is ready, minimizing RPO
Compute resources are shut down to minimize cost
This increases RTO when a failover is triggered, since compute instances for the clusters must be initialized first
Recommended when a longer RTO is acceptable and cost optimization is the priority
Typical RPO: < 15 minutes
Typical RTO: 1-2 hours
Warm Standby
Site metadata is kept in sync so the standby site is ready, minimizing RPO
A fraction of the compute instances are kept running to reduce RTO
This option costs more than "Pilot Light"
Recommended for critical workloads
Typical RPO: < 5 minutes
Typical RTO: 10-15 minutes
Implementation Details
Standby Site Configuration
In the standby site (as shown in the diagram), the dataplane is scaled to 0 to optimize costs while maintaining readiness. All configuration data is regularly pulled from the manager, just as on the active site, so the standby site can become operational quickly. When a failover is triggered, the dataplane is scaled up and the site resumes full functionality.
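Before relying on a standby site, it helps to confirm that its dataplane really is scaled to zero. A minimal sketch using standard kubectl, wrapped in a `check_standby` helper (the function name and output messages are illustrative; the `dataplane` deployment and `observo-client` namespace match the scaling commands in this guide):

```shell
#!/usr/bin/env bash
# check_standby: verify that a deployment is scaled to zero replicas.
# Sketch only -- adjust the deployment/namespace for your environment.
check_standby() {
  local deploy="${1:-dataplane}" ns="${2:-observo-client}"
  local replicas
  # Read the desired replica count from the deployment spec
  replicas=$(kubectl get deployment "$deploy" -n "$ns" \
    -o jsonpath='{.spec.replicas}')
  if [ "$replicas" = "0" ]; then
    echo "standby OK: $deploy has 0 replicas"
  else
    echo "WARNING: $deploy has $replicas replicas (expected 0)"
  fi
}
```

Running this periodically (for example, from a cron job on the standby site) can catch accidental scale-ups that would erode the cost savings of the Pilot Light model.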
Scaling Commands
Scale Dataplane to Zero (For Standby Site)
# Scale the dataplane deployment to 0 replicas
kubectl scale deployment dataplane --replicas=0 -n observo-client
Scale Up Dataplane (During Failover)
# Scale the dataplane deployment to the required number of replicas
kubectl scale deployment dataplane --replicas=3 -n observo-client
HPA Management for Dataplane
Disable HPA (For Standby Site)
# Delete the HPA temporarily
kubectl delete hpa dataplane-hpa -n observo-system
Enable HPA (During Failover)
kubectl autoscale deployment dataplane --min=3 --max=10 --cpu-percent=80 -n observo-client
Key Features
Automated failover capabilities
Regular health checks and monitoring
Cross-region data replication
Automated backup and restore procedures
Configurable sync intervals for metadata
Flexible compute scaling options
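The scale-up and HPA commands from the Scaling Commands section can be combined into a single failover routine. A minimal sketch, assuming the same deployment name, replica counts, and HPA bounds used above; the `failover` function name and the readiness wait via `kubectl rollout status` are illustrative additions, not part of Observo:

```shell
#!/usr/bin/env bash
# failover: bring the standby dataplane up and re-enable autoscaling.
# Sketch only -- values mirror the example commands in this guide.
set -euo pipefail

failover() {
  local ns="observo-client"
  # Scale the dataplane deployment back up to its working replica count
  kubectl scale deployment dataplane --replicas=3 -n "$ns"
  # Wait until the new pods are ready before resuming traffic
  kubectl rollout status deployment dataplane -n "$ns" --timeout=300s
  # Recreate the HPA so load-based scaling resumes after failover
  kubectl autoscale deployment dataplane --min=3 --max=10 \
    --cpu-percent=80 -n "$ns"
}
```

Waiting on `kubectl rollout status` before recreating the HPA avoids declaring the failover complete while pods are still initializing.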
Considerations
Network latency between primary and DR sites
Data consistency requirements
Cost implications of chosen strategy
Compliance and regulatory requirements
Testing and validation procedures
Regular testing of failover procedures to ensure recovery works as expected