Log Archival

A comprehensive guide to archiving log data in JSON and Parquet files

Introduction

Log archival is a powerful solution for storing and managing your raw event data. In this guide, we'll explore how to configure archival storage for both JSON and Parquet file formats, with specific focus on compression settings and schema configuration.

Basic Setup

First, let's look at a basic configuration for Azure Blob Storage archival:

name: "event-archival"
description: "Event data archival to Azure Blob Storage"
container_name: "eventdata"
storage_account: "myarchivalaccount"

JSON Configuration

When writing JSON files, we'll use GZip compression to reduce storage costs. Here's an example configuration:

name: "json-archival"
description: "JSON event archival with GZip compression"
container_name: "jsoneventdata"
storage_account: "myarchivalaccount"
encoding:
  codec: "JSON Encoding"
compression: "GZip"
blob_prefix: "year=%Y/month=%m/day=%d"
blob_time_format: "%s"
batch_max_events: 1000
batch_timeout_secs: 300

This configuration will:

  • Store events in JSON format

  • Apply GZip compression

  • Organize files in a year/month/day hierarchy

  • Batch up to 1000 events or flush every 5 minutes
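The naming and compression behavior described above can be sketched in Python. This is a minimal illustration with a hypothetical `write_json_batch` helper, not the archival service's actual code; the service performs this internally based on the configuration.

```python
import gzip
import json
import time

def write_json_batch(events, now):
    """Hypothetical sketch of how a JSON batch is named and compressed."""
    # blob_prefix: "year=%Y/month=%m/day=%d", rendered in UTC
    prefix = time.strftime("year=%Y/month=%m/day=%d", time.gmtime(now))
    # blob_time_format: "%s" -> epoch seconds
    blob_name = f"{prefix}/{int(now)}.log.gz"
    # One JSON object per line, then GZip-compressed (compression: "GZip")
    body = "\n".join(json.dumps(event) for event in events).encode("utf-8")
    return blob_name, gzip.compress(body)

blob_name, payload = write_json_batch([{"log_level": "INFO"}], now=1729612345)
# blob_name -> "year=2024/month=10/day=22/1729612345.log.gz"
```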

Parquet Configuration

For Parquet files, we'll disable compression and include the observo_record field. Here's an example:

name: "parquet-archival"
description: "Parquet event archival with custom schema"
container_name: "parqueteventdata"
storage_account: "myarchivalaccount"
encoding:
  codec: "Parquet"
  include_raw_log: true
  parquet_schema: |
    message root {
      optional binary stream;
      optional binary time;
      optional binary observo_record;
      optional group kubernetes {
        optional binary pod_name;
        optional binary pod_id;
        optional binary namespace;
        optional binary container_name;
        optional group labels {
          optional binary app;
          optional binary environment;
        }
      }
      optional group metrics {
        optional int64 response_time;
        optional int64 status_code;
      }
    }
compression: "None"
blob_prefix: "year=%Y/month=%m/day=%d"

Key points about this configuration:

  1. include_raw_log: true adds the observo_record field

  2. No file-level compression is applied (compression: "None")

  3. Files will have the .parquet extension by default. For an AWS S3 Archival configuration, set Filename Extension = parquet

  4. Custom schema includes both system and application-specific fields

  5. Supported data types:

    • boolean for boolean values

    • int64 for numbers

    • binary for strings

    • double for floating point numbers
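As a quick illustration, the mapping from log field values to these four supported types can be sketched as follows. The `parquet_type_for` helper is hypothetical, for intuition only; the archival service performs this mapping internally.

```python
def parquet_type_for(value):
    """Hypothetical sketch: map a Python log value onto the supported Parquet types."""
    if isinstance(value, bool):   # check bool before int: bool is a subclass of int in Python
        return "boolean"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "double"
    return "binary"               # strings (and anything else) are stored as binary
```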

Advanced Configuration Tips

Batching Optimization

batch_max_bytes: 10485760  # 10MB
batch_max_events: 1000
batch_timeout_secs: 300
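These three limits interact as "flush on whichever is hit first," which can be sketched as follows. `BatchPolicy` is a hypothetical class; the archival service enforces these limits internally.

```python
class BatchPolicy:
    """Sketch of the three flush triggers implied by the batching settings above."""

    def __init__(self, max_bytes=10_485_760, max_events=1000, timeout_secs=300):
        self.max_bytes = max_bytes
        self.max_events = max_events
        self.timeout_secs = timeout_secs

    def should_flush(self, batch_bytes, batch_events, batch_age_secs):
        # A batch is flushed as soon as any one limit is reached
        return (batch_bytes >= self.max_bytes
                or batch_events >= self.max_events
                or batch_age_secs >= self.timeout_secs)
```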

Blob Naming Strategy

blob_prefix: "year=%Y/month=%m/day=%d"
blob_append_uuid: true
blob_time_format: "%s"
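Combined, these settings produce object keys like the following sketch. The exact placement of the UUID relative to the timestamp is an assumption for illustration; the service builds the key itself.

```python
import time
import uuid

def blob_key(now, extension="log.gz"):
    """Hypothetical sketch of how the three naming settings combine into an object key."""
    prefix = time.strftime("year=%Y/month=%m/day=%d", time.gmtime(now))  # blob_prefix
    stamp = str(int(now))              # blob_time_format: "%s" (epoch seconds)
    unique = uuid.uuid4().hex          # blob_append_uuid: true
    return f"{prefix}/{stamp}-{unique}.{extension}"
```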

Health Monitoring

healthcheck: true
time_generated_key: "source_timestamp"

Sample Output Files

With these configurations, you'll see files created like:

For JSON:

jsoneventdata/year=2024/month=10/day=22/1729612345.log.gz

For Parquet:

parqueteventdata/year=2024/month=10/day=22/1729612345.parquet

Parquet Configuration with observo_record Example

Let's look at a practical example of how data is stored when using Parquet with observo_record enabled. We'll use a simplified schema that captures application logs with some specific fields while also storing the complete raw message.

Sample Configuration

name: "app-logs-archival"
description: "Application logs archival in Parquet format with raw message preservation"
container_name: "applogs"
storage_account: "myarchivalaccount"
encoding:
  codec: "Parquet"
  include_raw_log: true
  parquet_schema: |
    message root {
      optional binary timestamp;
      optional binary service_name;
      optional binary log_level;
      optional binary observo_record;
      optional group metrics {
        optional int64 response_time_ms;
        optional int64 status_code;
      }
    }
compression: "None"
blob_prefix: "year=%Y/month=%m/day=%d"

Example Input Log

Let's say we have an incoming JSON log message like this:

{
  "timestamp": "2024-10-22T15:30:45.123Z",
  "service_name": "payment-service",
  "log_level": "ERROR",
  "metrics": {
    "response_time_ms": 1500,
    "status_code": 500
  },
  "error_details": {
    "message": "Database connection timeout",
    "code": "DB_001",
    "stack_trace": "at line 42..."
  },
  "user_id": "usr_123",
  "transaction_id": "tx_789",
  "environment": "production"
}

Resulting Parquet File Content

When this log is written to Parquet with observo_record enabled, it will be stored like this:

Row 1:
├── timestamp: "2024-10-22T15:30:45.123Z"
├── service_name: "payment-service"
├── log_level: "ERROR"
├── metrics
│   ├── response_time_ms: 1500
│   └── status_code: 500
└── observo_record: {
      "timestamp": "2024-10-22T15:30:45.123Z",
      "service_name": "payment-service",
      "log_level": "ERROR",
      "metrics": {
        "response_time_ms": 1500,
        "status_code": 500
      },
      "error_details": {
        "message": "Database connection timeout",
        "code": "DB_001",
        "stack_trace": "at line 42..."
      },
      "user_id": "usr_123",
      "transaction_id": "tx_789",
      "environment": "production"
    }
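On the consumer side, the observo_record column is just a JSON string, so any field that was not promoted into the Parquet schema can still be recovered by parsing it. Here is a sketch with sample data drawn from this example; `row` stands in for a row read back from the Parquet file.

```python
import json

# observo_record arrives as a JSON string; parsing it recovers fields
# (error_details, user_id, ...) that are not columns in the Parquet schema.
row = {
    "log_level": "ERROR",
    "observo_record": json.dumps({
        "log_level": "ERROR",
        "error_details": {"message": "Database connection timeout", "code": "DB_001"},
        "user_id": "usr_123",
    }),
}

raw = json.loads(row["observo_record"])
error_code = raw["error_details"]["code"]  # "DB_001": only available via observo_record
```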

Key Benefits of This Approach

  1. Efficient Querying: The specific fields in the Parquet schema (timestamp, service_name, log_level, metrics) can be queried efficiently as they are stored in a columnar format.

  2. Complete Data Preservation: The observo_record field contains the entire original log message, ensuring no data is lost even if it's not part of the specific Parquet schema.

  3. Storage Optimization: Even though we're storing the full message in observo_record, Parquet's efficient encoding and compression still provide good storage efficiency.

Example Queries

When analyzing this data, you can:

  1. Query specific fields efficiently:

SELECT timestamp, service_name, metrics.status_code
FROM applogs
WHERE log_level = 'ERROR'
AND metrics.response_time_ms > 1000;

  2. Access raw data when needed:

SELECT observo_record
FROM applogs
WHERE service_name = 'payment-service'
AND timestamp >= '2024-10-22';
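For intuition, the first query's filter is equivalent to the following pure-Python sketch over the two sample rows used in this guide (here `rows` is in-memory sample data standing in for rows read from the Parquet file):

```python
# Sample rows matching the two example logs in this guide
rows = [
    {"timestamp": "2024-10-22T15:30:45.123Z", "service_name": "payment-service",
     "log_level": "ERROR", "metrics": {"response_time_ms": 1500, "status_code": 500}},
    {"timestamp": "2024-10-22T15:31:00.000Z", "service_name": "order-service",
     "log_level": "INFO", "metrics": {"response_time_ms": 250, "status_code": 200}},
]

# WHERE log_level = 'ERROR' AND metrics.response_time_ms > 1000
slow_errors = [
    (r["timestamp"], r["service_name"], r["metrics"]["status_code"])
    for r in rows
    if r["log_level"] == "ERROR" and r["metrics"]["response_time_ms"] > 1000
]
# slow_errors -> [("2024-10-22T15:30:45.123Z", "payment-service", 500)]
```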

Another Example with Application Metrics

Let's look at another example with application metrics data:

Input Log

{
  "timestamp": "2024-10-22T15:31:00.000Z",
  "service_name": "order-service",
  "log_level": "INFO",
  "metrics": {
    "response_time_ms": 250,
    "status_code": 200
  },
  "request_details": {
    "path": "/api/v1/orders",
    "method": "POST",
    "client_ip": "192.168.1.100",
    "user_agent": "Mozilla/5.0...",
    "payload_size": 1024
  },
  "business_metrics": {
    "order_value": 199.99,
    "items_count": 3,
    "customer_type": "premium"
  }
}

Resulting Parquet Storage

Row 2:
├── timestamp: "2024-10-22T15:31:00.000Z"
├── service_name: "order-service"
├── log_level: "INFO"
├── metrics
│   ├── response_time_ms: 250
│   └── status_code: 200
└── observo_record: {
      "timestamp": "2024-10-22T15:31:00.000Z",
      "service_name": "order-service",
      "log_level": "INFO",
      "metrics": {
        "response_time_ms": 250,
        "status_code": 200
      },
      "request_details": {
        "path": "/api/v1/orders",
        "method": "POST",
        "client_ip": "192.168.1.100",
        "user_agent": "Mozilla/5.0...",
        "payload_size": 1024
      },
      "business_metrics": {
        "order_value": 199.99,
        "items_count": 3,
        "customer_type": "premium"
      }
    }

Best Practices for Parquet Schema Design with observo_record

  1. Include Frequently Queried Fields: Add fields that you commonly use for filtering, sorting, or aggregating to the Parquet schema.

  2. Consider Field Types: Use appropriate Parquet data types for better query performance:

    • boolean for boolean values

    • int64 for numbers

    • binary for strings

    • double for floating point numbers

  3. Nested Structures: Group related fields together using nested structures (as shown in the metrics group).

  4. Schema Evolution: Plan for future schema changes by making all fields optional.

With this setup, you get the best of both worlds: efficient querying of specific fields through the Parquet schema while maintaining access to complete log data through the observo_record field.

Best Practices

  1. Always use appropriate batching configurations to optimize performance

  2. Implement a logical blob prefix strategy for easy data organization

  3. Enable healthchecks for monitoring

  4. Use appropriate compression based on file format (GZip for JSON, None for Parquet)

  5. Include relevant fields in your Parquet schema to optimize querying

Conclusion

Proper configuration of Archival destinations ensures efficient storage and retrieval of your event data. Whether you choose JSON or Parquet format, following these guidelines will help you set up a robust archival system that meets your specific needs.

Remember to adjust the configurations based on your specific use case and performance requirements.
