Log Archival
A comprehensive guide to archiving log data in JSON and Parquet files
Introduction
Log archival is a powerful solution for storing and managing your raw event data. In this guide, we'll explore how to configure archival storage for both JSON and Parquet file formats, with specific focus on compression settings and schema configuration.
Basic Setup
First, let's look at a basic configuration for Azure Blob Storage archival:
name: "event-archival"
description: "Event data archival to Azure Blob Storage"
container_name: "eventdata"
storage_account: "myarchivalaccount"

JSON Configuration
When writing JSON files, we'll use GZip compression. Here's an example configuration:
name: "json-archival"
description: "JSON event archival with GZip compression"
container_name: "jsoneventdata"
storage_account: "myarchivalaccount"
encoding:
  codec: "JSON Encoding"
  compression: "GZip"
blob_prefix: "year=%Y/month=%m/day=%d"
blob_time_format: "%s"
batch_max_events: 1000
batch_timeout_secs: 300

This configuration will:
Store events in JSON format
Apply GZip compression
Organize files in a year/month/day hierarchy
Batch up to 1000 events or flush every 5 minutes
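As a sketch of the behavior these settings imply, the flush decision and the resulting blob names can be modeled as follows. This is an illustrative reading of the configuration, not the product's actual implementation; the helper names are hypothetical, and `%s` is taken to mean seconds since the Unix epoch.

```python
from datetime import datetime, timezone

BATCH_MAX_EVENTS = 1000
BATCH_TIMEOUT_SECS = 300

def should_flush(event_count: int, batch_age_secs: float) -> bool:
    """A batch is written once it holds batch_max_events, or once
    batch_timeout_secs have elapsed, whichever comes first."""
    return event_count >= BATCH_MAX_EVENTS or batch_age_secs >= BATCH_TIMEOUT_SECS

def blob_name(container: str, event_time: datetime) -> str:
    """blob_prefix is strftime-expanded per event time; blob_time_format
    '%s' is read here as epoch seconds for the file name."""
    prefix = event_time.strftime("year=%Y/month=%m/day=%d")
    return f"{container}/{prefix}/{int(event_time.timestamp())}.log.gz"

ts = datetime(2024, 10, 22, 15, 30, 45, tzinfo=timezone.utc)
name = blob_name("jsoneventdata", ts)
# produces a date-partitioned path ending in <epoch-seconds>.log.gz
```

For example, a batch of 250 events that has been open for 301 seconds flushes on the timeout, while a batch that reaches 1000 events flushes immediately.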
Parquet Configuration
For Parquet files, we'll configure it with no compression and include the observo_record field. Here's an example:
name: "parquet-archival"
description: "Parquet event archival with custom schema"
container_name: "parqueteventdata"
storage_account: "myarchivalaccount"
encoding:
  codec: "Parquet"
  include_raw_log: true
  parquet_schema: |
    message root {
      optional binary stream;
      optional binary time;
      optional binary observo_record;
      optional group kubernetes {
        optional binary pod_name;
        optional binary pod_id;
        optional binary namespace;
        optional binary container_name;
        optional group labels {
          optional binary app;
          optional binary environment;
        }
      }
      optional group metrics {
        optional int64 response_time;
        optional int64 status_code;
      }
    }
  compression: "None"
blob_prefix: "year=%Y/month=%m/day=%d"

Key points about this configuration:
include_raw_log: true adds the observo_record field
No compression is applied
Files will have the .parquet extension by default; for an AWS S3 Archival config, add Filename Extension = parquet
The custom schema includes both system and application-specific fields
Supported data types
boolean for boolean values
int64 for numbers
binary for strings
double for floating point numbers
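As a sketch, the mapping from JSON values to these Parquet primitive types could look like the following. The helper function is hypothetical (not part of the product), but the type names match the list above.

```python
def parquet_type_for(value):
    """Map a JSON value to one of the supported Parquet primitive types.

    bool must be checked before int, since bool is a subclass of int
    in Python.
    """
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "double"
    return "binary"  # strings (and anything else) are stored as binary

# Example: derive column types from a sample log record
sample = {"service_name": "payment-service", "status_code": 500,
          "cpu_load": 0.75, "is_retry": False}
types = {k: parquet_type_for(v) for k, v in sample.items()}
```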
Advanced Configuration Tips
Batching Optimization
batch_max_bytes: 10485760  # 10MB
batch_max_events: 1000
batch_timeout_secs: 300

Blob Naming Strategy

blob_prefix: "year=%Y/month=%m/day=%d"
blob_append_uuid: true
blob_time_format: "%s"

Health Monitoring

healthcheck: true
time_generated_key: "source_timestamp"

Sample Output Files
With these configurations, you'll see files created like:
For JSON:
jsoneventdata/year=2024/month=10/day=22/1698012345.log.gz

For Parquet:

parqueteventdata/year=2024/month=10/day=22/1698012345.parquet

Parquet Configuration with observo_record Example
Let's look at a practical example of how data is stored when using Parquet with observo_record enabled. We'll use a simplified schema that captures application logs with some specific fields while also storing the complete raw message.
Sample Configuration
name: "app-logs-archival"
description: "Application logs archival in Parquet format with raw message preservation"
container_name: "applogs"
storage_account: "myarchivalaccount"
encoding:
  codec: "Parquet"
  include_raw_log: true
  parquet_schema: |
    message root {
      optional binary timestamp;
      optional binary service_name;
      optional binary log_level;
      optional binary observo_record;
      optional group metrics {
        optional int64 response_time_ms;
        optional int64 status_code;
      }
    }
  compression: "None"
blob_prefix: "year=%Y/month=%m/day=%d"

Example Input Log
Let's say we have an incoming JSON log message like this:
{
"timestamp": "2024-10-22T15:30:45.123Z",
"service_name": "payment-service",
"log_level": "ERROR",
"metrics": {
"response_time_ms": 1500,
"status_code": 500
},
"error_details": {
"message": "Database connection timeout",
"code": "DB_001",
"stack_trace": "at line 42..."
},
"user_id": "usr_123",
"transaction_id": "tx_789",
"environment": "production"
}

Resulting Parquet File Content
When this log is written to Parquet with observo_record enabled, it will be stored like this:
Row 1:
├── timestamp: "2024-10-22T15:30:45.123Z"
├── service_name: "payment-service"
├── log_level: "ERROR"
├── metrics
│ ├── response_time_ms: 1500
│ └── status_code: 500
└── observo_record: {
"timestamp": "2024-10-22T15:30:45.123Z",
"service_name": "payment-service",
"log_level": "ERROR",
"metrics": {
"response_time_ms": 1500,
"status_code": 500
},
"error_details": {
"message": "Database connection timeout",
"code": "DB_001",
"stack_trace": "at line 42..."
},
"user_id": "usr_123",
"transaction_id": "tx_789",
"environment": "production"
}

Key Benefits of This Approach
Efficient Querying: The specific fields in the Parquet schema (timestamp, service_name, log_level, metrics) can be queried efficiently because they are stored in columnar format.
Complete Data Preservation: The observo_record field contains the entire original log message, ensuring no data is lost even if it's not part of the specific Parquet schema.
Storage Optimization: Even though the full message is stored in observo_record, Parquet's efficient encoding and compression still provide good storage efficiency.
Example Queries
When analyzing this data, you can:
Query specific fields efficiently:
SELECT timestamp, service_name, metrics.status_code
FROM applogs
WHERE log_level = 'ERROR'
AND metrics.response_time_ms > 1000;

Access raw data when needed:
SELECT observo_record
FROM applogs
WHERE service_name = 'payment-service'
AND timestamp >= '2024-10-22';

Another Example with Application Metrics
Let's look at another example with application metrics data:
Input Log
{
"timestamp": "2024-10-22T15:31:00.000Z",
"service_name": "order-service",
"log_level": "INFO",
"metrics": {
"response_time_ms": 250,
"status_code": 200
},
"request_details": {
"path": "/api/v1/orders",
"method": "POST",
"client_ip": "192.168.1.100",
"user_agent": "Mozilla/5.0...",
"payload_size": 1024
},
"business_metrics": {
"order_value": 199.99,
"items_count": 3,
"customer_type": "premium"
}
}

Resulting Parquet Storage
Row 2:
├── timestamp: "2024-10-22T15:31:00.000Z"
├── service_name: "order-service"
├── log_level: "INFO"
├── metrics
│ ├── response_time_ms: 250
│ └── status_code: 200
└── observo_record: {
"timestamp": "2024-10-22T15:31:00.000Z",
"service_name": "order-service",
"log_level": "INFO",
"metrics": {
"response_time_ms": 250,
"status_code": 200
},
"request_details": {
"path": "/api/v1/orders",
"method": "POST",
"client_ip": "192.168.1.100",
"user_agent": "Mozilla/5.0...",
"payload_size": 1024
},
"business_metrics": {
"order_value": 199.99,
"items_count": 3,
"customer_type": "premium"
}
}

Best Practices for Parquet Schema Design with observo_record
Include Frequently Queried Fields: Add fields that you commonly use for filtering, sorting, or aggregating to the Parquet schema.
Consider Field Types: Use appropriate Parquet data types for better query performance:
boolean for boolean values
int64 for numbers
binary for strings
double for floating point numbers
Nested Structures: Group related fields together using nested structures (as shown in the metrics group).
Schema Evolution: Plan for future schema changes by making all fields optional.
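The schema-evolution point can be illustrated with a small sketch: because every column is optional, rows written before a field existed simply read back as null instead of failing. The dictionaries below stand in for Parquet rows; this is illustrative only.

```python
# Sketch: a row written before "log_level" was added to the schema.
old_row = {"timestamp": "2024-10-22T15:30:45.123Z",
           "service_name": "payment-service"}

# Columns in the evolved schema; all optional, so missing values
# materialize as null (None) rather than causing a read error.
new_columns = ["timestamp", "service_name", "log_level"]
materialized = {col: old_row.get(col) for col in new_columns}
```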
With this setup, you get the best of both worlds: efficient querying of specific fields through the Parquet schema while maintaining access to complete log data through the observo_record field.
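As a concrete sketch of that workflow: fields left out of the Parquet schema can still be recovered by parsing the observo_record value as JSON. The row below is illustrative (field names follow the examples above), not output from the product.

```python
import json

# A row as it might come back from a Parquet query: typed columns plus
# the raw observo_record string.
row = {
    "service_name": "payment-service",
    "log_level": "ERROR",
    "observo_record": json.dumps({
        "service_name": "payment-service",
        "log_level": "ERROR",
        "error_details": {"message": "Database connection timeout",
                          "code": "DB_001"},
        "user_id": "usr_123",
    }),
}

# Fields outside the Parquet schema are recovered from the raw record.
raw = json.loads(row["observo_record"])
error_code = raw["error_details"]["code"]  # "DB_001"
```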
Best Practices
Always use appropriate batching configurations to optimize performance
Implement a logical blob prefix strategy for easy data organization
Enable healthchecks for monitoring
Use appropriate compression based on file format (GZip for JSON, None for Parquet)
Include relevant fields in your Parquet schema to optimize querying
Conclusion
Proper configuration of Archival destinations ensures efficient storage and retrieval of your event data. Whether you choose JSON or Parquet format, following these guidelines will help you set up a robust archival system that meets your specific needs.
Remember to adjust the configurations based on your specific use case and performance requirements.