Disaster Recovery

Observo supports the "Pilot Light" and "Warm Standby" DR models for Site deployments. Each model offers a different trade-off between recovery point objective (RPO) and recovery time objective (RTO), which customers can weigh when making deployment decisions.

Disaster Recovery Architecture

Pilot Light

  • Site metadata is kept in sync to minimize RPO

  • Compute resources are shut off to reduce cost

  • This increases RTO, because compute instances for the clusters must be initialized when a failover is triggered

  • Recommended when a longer RTO is acceptable and cost optimization is the priority

  • Typical RPO: < 15 minutes

  • Typical RTO: 1-2 hours

Warm Standby

  • Site metadata is kept in sync to minimize RPO

  • To reduce RTO, a fraction of the compute instances is kept initialized

  • This option results in higher cost than "Pilot Light"

  • Recommended for critical workloads

  • Typical RPO: < 5 minutes

  • Typical RTO: 10-15 minutes
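
The typical figures above can be turned into a small selection helper. The following is only a sketch: it treats the upper bound of each quoted range as a hard limit (an assumption, since actual RPO and RTO depend on the deployment), and the function name `recommend_model` is hypothetical rather than part of Observo.

```python
from typing import Optional

# Upper bounds of the "typical" figures quoted above, in minutes.
# Treating them as hard limits is an assumption made for illustration.
TYPICAL_LIMITS = {
    "Pilot Light": {"rpo": 15, "rto": 120},
    "Warm Standby": {"rpo": 5, "rto": 15},
}


def recommend_model(target_rpo_min: float, target_rto_min: float) -> Optional[str]:
    """Return the lowest-cost model whose typical RPO and RTO fit the targets.

    Returns None when neither model's typical figures meet the targets.
    """
    # Pilot Light is checked first because it is the lower-cost option.
    for name in ("Pilot Light", "Warm Standby"):
        limits = TYPICAL_LIMITS[name]
        if limits["rpo"] <= target_rpo_min and limits["rto"] <= target_rto_min:
            return name
    return None
```

For example, a workload that tolerates 30 minutes of data loss and 4 hours of downtime maps to "Pilot Light", while one that must be back within 30 minutes maps to "Warm Standby".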

Implementation Details

Standby Site Configuration

In the standby site (as shown in the diagram), the dataplane is scaled to 0 to minimize cost while maintaining readiness. Like the active site, the standby site regularly pulls all configuration data from the manager, so it can become operational quickly when needed. When a failover is triggered, the dataplane is scaled up and the site starts operating with full functionality.

Scaling Commands

Scale Dataplane to Zero (For Standby Site)
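
As a minimal sketch, assuming the dataplane runs as a Kubernetes Deployment, scaling it to zero could be scripted as follows. The Deployment name `dataplane` and the namespace `observo` are placeholders, not names confirmed by this document. The helper only builds the `kubectl scale` argument list, so it can be reviewed before it is executed.

```python
from typing import List

# Hypothetical names: substitute the real dataplane Deployment and namespace.
DATAPLANE_DEPLOYMENT = "dataplane"
SITE_NAMESPACE = "observo"


def scale_to_zero_command(
    deployment: str = DATAPLANE_DEPLOYMENT,
    namespace: str = SITE_NAMESPACE,
) -> List[str]:
    """Build the kubectl invocation that scales the dataplane to 0 replicas."""
    return [
        "kubectl", "scale", "deployment", deployment,
        "--replicas=0",
        "--namespace", namespace,
    ]


if __name__ == "__main__":
    # Prints: kubectl scale deployment dataplane --replicas=0 --namespace observo
    print(" ".join(scale_to_zero_command()))
```

To run the command rather than print it, pass the list to `subprocess.run(cmd, check=True)` on a machine with cluster access.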

Scale Up Dataplane (During Failover)
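
During failover, the reverse operation might look like the sketch below, again assuming a Kubernetes Deployment with the placeholder name `dataplane` in the placeholder namespace `observo`. The replica count and timeout are illustrative parameters. The second command uses `kubectl rollout status` to wait until the new pods are ready.

```python
from typing import List


def scale_up_commands(
    replicas: int,
    deployment: str = "dataplane",   # hypothetical Deployment name
    namespace: str = "observo",      # hypothetical namespace
    timeout_seconds: int = 600,      # illustrative wait limit
) -> List[List[str]]:
    """Build the kubectl invocations that scale the dataplane up and wait for it."""
    if replicas < 1:
        raise ValueError("failover needs at least one dataplane replica")
    scale = [
        "kubectl", "scale", "deployment", deployment,
        f"--replicas={replicas}",
        "--namespace", namespace,
    ]
    # Blocks until the rollout completes or the timeout expires.
    wait = [
        "kubectl", "rollout", "status", f"deployment/{deployment}",
        "--namespace", namespace,
        f"--timeout={timeout_seconds}s",
    ]
    return [scale, wait]
```

Run the two commands in order, for example with `subprocess.run(cmd, check=True)`, so that a failed rollout stops the failover script instead of passing silently.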

HPA Management for Dataplane

Disable HPA (For Standby Site)
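
One possible way to disable autoscaling on the standby site is to remove the HorizontalPodAutoscaler object so that nothing scales the dataplane back up. This is an assumption about the procedure, not the documented Observo command, and the HPA name `dataplane` and namespace `observo` are placeholders.

```python
from typing import List


def disable_hpa_command(
    hpa_name: str = "dataplane",   # hypothetical HPA name
    namespace: str = "observo",    # hypothetical namespace
) -> List[str]:
    """Build the kubectl invocation that removes the dataplane HPA."""
    return [
        "kubectl", "delete", "hpa", hpa_name,
        "--namespace", namespace,
        # Makes the step safe to repeat if the HPA is already gone.
        "--ignore-not-found",
    ]
```

Keep the HPA manifest in version control before deleting the object, since it is needed to restore autoscaling during failover.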

Enable HPA (During Failover)
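
If the HPA was removed on the standby site, it could be restored during failover by re-applying its manifest. The sketch below assumes that approach. The manifest path `dataplane-hpa.yaml`, the HPA name `dataplane`, and the namespace `observo` are all placeholders.

```python
from typing import List


def enable_hpa_commands(
    manifest_path: str = "dataplane-hpa.yaml",  # hypothetical manifest file
    hpa_name: str = "dataplane",                # hypothetical HPA name
    namespace: str = "observo",                 # hypothetical namespace
) -> List[List[str]]:
    """Build the kubectl invocations that restore and then inspect the HPA."""
    apply = [
        "kubectl", "apply",
        "--filename", manifest_path,
        "--namespace", namespace,
    ]
    # Shows current and target replica counts once the HPA is active.
    verify = ["kubectl", "get", "hpa", hpa_name, "--namespace", namespace]
    return [apply, verify]
```

Running the `verify` command after the `apply` command confirms that the autoscaler exists and reports targets for the dataplane.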

Key Features

  • Automated failover capabilities

  • Regular health checks and monitoring

  • Cross-region data replication

  • Automated backup and restore procedures

  • Configurable sync intervals for metadata

  • Flexible compute scaling options

Considerations

  • Network latency between primary and DR sites

  • Data consistency requirements

  • Cost implications of chosen strategy

  • Compliance and regulatory requirements

  • Regular testing and validation of failover procedures to ensure recovery works as expected
