Data Reliability and Scalability

Data Reliability

Reliably delivering data in an observability pipeline is crucial: it ensures accurate, real-time insight into system performance and health. Let's take a look at the different ways Observo ensures reliable and scalable delivery of data in your observability stack.

Dynamic Request Throttling

Dynamic Request Throttling is a core feature in Observo’s networking framework that:

  • Replaces static rate limits with an adaptive system.

  • Automatically adjusts HTTP concurrency based on real-time feedback from downstream services.

  • Continuously optimizes request flow through a feedback loop, similar to TCP congestion control.

  • Enhances performance and reliability by efficiently managing resources under varying loads.

How It Works:

  • Adaptive Management: Observo dynamically throttles requests during periods of high load using signals such as response codes and response times. If a destination's responses indicate rate limiting (e.g., HTTP 429), or if response times spike, Observo reduces the number of outgoing requests to maintain stability and performance.

  • Example: A surge in requests to an order management system can generate increased logs, metrics, and traces, leading to higher latency in downstream destinations like Splunk or Elasticsearch. This can put destinations under higher stress, further increasing response times. When Observo detects that a destination is under stress or unhealthy, it dynamically scales down request parallelism.
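The feedback loop described above resembles the additive-increase/multiplicative-decrease (AIMD) scheme used in TCP congestion control. The sketch below is a minimal illustration of that idea; the class and parameter names are hypothetical and do not reflect Observo's internal implementation.

```python
class AdaptiveThrottle:
    """Minimal AIMD-style concurrency control driven by downstream feedback.

    Illustrative sketch only; names and thresholds are assumptions,
    not Observo's actual implementation.
    """

    def __init__(self, min_concurrency=1, max_concurrency=64, latency_slo_ms=500):
        self.concurrency = max_concurrency // 2   # current request-parallelism limit
        self.min_concurrency = min_concurrency
        self.max_concurrency = max_concurrency
        self.latency_slo_ms = latency_slo_ms

    def on_response(self, status_code, latency_ms):
        """Adjust the concurrency limit based on one response from the destination."""
        if status_code == 429 or latency_ms > self.latency_slo_ms:
            # Multiplicative decrease: back off quickly when the destination
            # signals rate limiting or response times spike.
            self.concurrency = max(self.min_concurrency, self.concurrency // 2)
        elif status_code < 400:
            # Additive increase: probe slowly for more capacity while healthy.
            self.concurrency = min(self.max_concurrency, self.concurrency + 1)
        return self.concurrency
```

Healthy responses nudge parallelism up one step at a time, while a single 429 or latency breach halves it, which is what keeps a stressed destination from being overwhelmed.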

Data Queuing

Observo utilizes queues in every destination that act as a temporary buffer for events. These queues also serve as a way to measure the health of the pipeline. When queues fill up, it typically indicates an issue with a downstream destination, such as rate limiting or the destination being down.

Data Queuing in Observo ensures resilience and efficiency by:

  • Creating a mechanism to temporarily hold events when downstream destinations are unhealthy.

  • Acting as a shock absorber, handling sudden spikes in data volume without dropping events or overwhelming the system.

  • Sending a signal to the Source indicating a problem processing events. This signal serves as a mechanism to inform upstream components (even outside of Observo) about issues in downstream components. For some push-based sources, Observo returns appropriate error codes to signify issues with downstream components. For pull-based sources, Observo stops pulling data until the downstream destination is healthy again.
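The buffering and backpressure behavior above can be sketched with a bounded queue. This is an illustrative model, not Observo's API: the class name, capacity, and status codes are assumptions.

```python
from queue import Queue, Full


class DestinationBuffer:
    """Sketch of a per-destination event queue that doubles as a health signal.

    Illustrative only; names and codes are assumptions, not Observo's API.
    """

    def __init__(self, capacity=10000):
        self.queue = Queue(maxsize=capacity)

    def offer(self, event):
        """Try to buffer an event; report backpressure instead of dropping it."""
        try:
            self.queue.put_nowait(event)
            return 200                 # accepted
        except Full:
            # A full queue suggests the downstream destination is unhealthy
            # or rate-limited. A push-based source receives an error code
            # so it can retry later instead of losing the event.
            return 503

    def should_pull_upstream(self):
        """Pull-based sources pause polling while the buffer is saturated."""
        return not self.queue.full()
```

The key point is that a full buffer is never silent: push-based sources see an error code, and pull-based sources simply stop being polled until the queue drains.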

Putting It All Together

In a robust observability infrastructure, managing data flow and system performance is crucial. Observo integrates:

  • Dynamic Request Throttling: Adjusts request rates based on real-time feedback.

  • Data Queuing: Buffers data to manage data volume spikes and downstream destination issues.

These features work in tandem, ensuring that your observability infrastructure remains:

  • Stable and resilient.

  • Capable of handling fluctuating loads and potential downstream challenges.

Scalability

Introduction

Observo sites are deployed on Kubernetes clusters, allowing them to take full advantage of the platform's auto-scaling features. As the volume of telemetry data fluctuates, the pods within the cluster can automatically scale up or down based on customizable CPU and memory thresholds, managed by the Horizontal Pod Autoscaler (HPA). HPA adjusts the number of pods in a replication controller, deployment, or replica set according to observed CPU utilization or other selected metrics. When pods scale, Kubernetes seamlessly manages load balancing across them using services.

Moreover, the cluster's nodes can also scale horizontally based on user configurations, which is facilitated by tools like Karpenter. Karpenter dynamically adjusts the number of nodes in the cluster to match the workload's needs. This ensures that data processing capacity scales in line with the allocated CPU and memory resources. On average, Observo can process approximately 6 MiB per second of data per vCPU, with each vCPU requiring around 2 GB of memory. This estimation is based on worst-case scenarios involving unstructured event data of 256 bytes.
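The sizing figures above lend themselves to a quick back-of-envelope calculation. The helper below simply encodes those numbers (~6 MiB/s per vCPU, ~2 GB of memory per vCPU); the function name is illustrative.

```python
import math


def size_cluster(target_mib_per_s, throughput_per_vcpu=6.0, mem_gb_per_vcpu=2.0):
    """Back-of-envelope cluster sizing from the worst-case figures above:
    ~6 MiB/s of 256-byte unstructured events per vCPU, ~2 GB RAM per vCPU.

    Returns (vcpus_needed, memory_gb_needed).
    """
    vcpus = math.ceil(target_mib_per_s / throughput_per_vcpu)
    return vcpus, vcpus * mem_gb_per_vcpu


# For example, sustaining 100 MiB/s of worst-case data needs roughly:
vcpus, mem_gb = size_cluster(100)   # 17 vCPUs and 34 GB of memory
```

Because the estimate assumes worst-case unstructured events, real workloads with larger or more structured records will often need fewer resources than this calculation suggests.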

Vertical Scaling

Observo automatically scales to take advantage of all available vCPUs without requiring configuration changes.

Horizontal Scaling

Observo Sites are equipped with a preconfigured load balancer that facilitates the horizontal scaling of both pods and nodes as required. Node scaling within the cluster is managed independently through Karpenter, an open-source Kubernetes node auto-scaler. Pod auto-scaling is handled automatically by Kubernetes, requiring no additional configuration on your part.

Avoiding Hot Spots

Not all connections are equal; some connections generate significantly more data, making it difficult to evenly load balance traffic across aggregators. To mitigate this, we recommend the following best practices:

  • Use protocols that support even load balancing, such as HTTP-based protocols. Avoid plain TCP connections.

  • Distribute data across multiple connections for easier load balancing.

  • Ensure instances are large enough to handle your highest-volume connection, enabling full advantage of vertical scaling.

  • Avoid stateful transformations in aggregators (e.g., the reduce transform) whenever possible, allowing a more balanced load distribution algorithm to be used.

Auto-scaling

For most deployments, auto-scaling should be based on average CPU utilization. Observo is typically CPU-constrained, and CPU utilization provides the strongest signal for auto-scaling. We recommend the following settings, adjustable as needed:

  • Average CPU over 5 minutes with an 85% utilization target.

  • A 5-minute stabilization period for scaling up and down.
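The recommendations above map onto a standard Kubernetes HorizontalPodAutoscaler configuration. The manifest below is an illustrative sketch: the Deployment name and replica bounds are placeholders, not Observo defaults, and the CPU averaging window itself is governed by the metrics pipeline rather than the HPA.

```yaml
# Illustrative HPA matching the recommended settings; names and
# replica bounds are placeholders, not Observo defaults.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: observo-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: observo-worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 85        # 85% average CPU target
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 300   # 5-minute stabilization
    scaleDown:
      stabilizationWindowSeconds: 300   # 5-minute stabilization
```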

Benchmark

The following performance tests demonstrate baseline performance between common protocols:

Test Description    Observo       FluentD

File to TCP         76.7 MiB/s    26.1 MiB/s
TCP to HTTP         26.7 MiB/s    <1 MiB/s
TCP to TCP          69.9 MiB/s    3.9 MiB/s

Throughput Scaling with vCPUs

As more vCPUs are added to the cluster, Observo demonstrates a notable improvement in throughput scalability, reflecting an ability to utilize additional computational resources effectively. Ideally, the throughput would increase linearly with the number of vCPUs, meaning that doubling the vCPUs would result in a near doubling of throughput. However, in realistic scenarios, various factors such as overhead, contention, and diminishing returns typically lead to sub-linear scaling. Despite these practical limitations, Observo consistently outperforms comparable solutions, maintaining significantly higher throughput levels even as the number of vCPUs is scaled up.
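One common way to reason about the sub-linear scaling described above is the Universal Scalability Law, which penalizes ideal linear throughput for contention and coherence (crosstalk) costs. The sketch below uses the per-vCPU rate from the sizing section; the contention and crosstalk coefficients are purely illustrative, not measured Observo values.

```python
def usl_throughput(n, base_mib_s=6.0, contention=0.03, crosstalk=0.0005):
    """Universal Scalability Law sketch of throughput vs. vCPU count.

    base_mib_s comes from the sizing figures above (~6 MiB/s per vCPU);
    the contention/crosstalk coefficients are illustrative assumptions.
    """
    # Ideal throughput (base * n) divided by contention and coherence penalties.
    return base_mib_s * n / (1 + contention * (n - 1) + crosstalk * n * (n - 1))
```

With these coefficients, doubling vCPUs from 8 to 16 yields clearly more throughput, but less than a 2x gain, which matches the sub-linear behavior observed in practice.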

