Query Archived Logs

Use Amazon Athena in your AWS account to query logs stored in Archive. (Athena is serverless, so no cluster needs to be deployed.)

Introduction

Amazon Athena is a serverless, interactive query service provided by Amazon Web Services (AWS). It allows users to analyze and query data stored in Amazon S3 (Simple Storage Service) using standard SQL (Structured Query Language) queries.

All of the Athena commands below can be run from the AWS console.

Athena Setup

  1. Click on “Query your data” on Athena’s home screen.

  2. Choose a data source appropriate for your use case. We chose “S3 - AWS Glue Data Catalog” for our use case.

  3. If you are using AWS Glue to automate schema discovery, skip to the Automated schema discovery section below.

  4. Choose “create a table manually”.

  5. Give the table and the database a name.

  6. Give the S3 bucket location of the data folder. If the data is partitioned into multiple folders, give only the root folder location.

  7. Choose “Apache Hive” as the table type, the file format of your data (“JSON” for our use case), and the SerDe (Serializer/Deserializer) library of choice.

  8. Add the table details, partition details, etc. Alternatively, add a dummy column and skip the other sections; this can be done directly in the query layer later.

  9. In the settings tab, edit the settings and give an S3 bucket location for storing query results and metadata of the database. Create a new S3 bucket, if needed.
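Step 5 can also be done directly from the Athena query editor instead of the console wizard; a minimal sketch (the database name is illustrative):

```sql
-- Create the database that will hold the archived-logs table
-- (the name archived_logs_db is illustrative)
CREATE DATABASE IF NOT EXISTS archived_logs_db;
```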

Creating Schema Definition

Static definition of schema

With Athena (and similar solutions), the table structure and the location of the raw data need to be specified. The format is: first specify an EXTERNAL TABLE, followed by the data format and the location.

Example:

CREATE EXTERNAL TABLE archived_logs (
  docker struct<container_id:string>,
  kubernetes struct<
      container_image:string,
      container_image_id:string,
      container_name:string,
      host:string,
      labels:struct<`app`:string,`pod-template-hash`:string>,
      master_url:string,
      namespace_id:string,
      namespace_name:string,
      pod_id:string,
      pod_name:string
    >,
  message string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://ajay-kubernetes-source/';

The above example is for the JSON format (using the Hive JSON SerDe library). Additional properties can be specified depending on the exact library and data.

Timestamp Values

For custom timestamp values, add the following to SERDEPROPERTIES:

WITH SERDEPROPERTIES (
    "timestamp.formats"="yyyy-MM-dd'T'HH:mm:ss.SSS'Z',yyyy-MM-dd'T'HH:mm:ss"
)

The default timestamp format is yyyy-MM-dd HH:mm:ss.
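In a full table definition, the WITH SERDEPROPERTIES clause goes between ROW FORMAT SERDE and LOCATION; a trimmed sketch of the table above:

```sql
CREATE EXTERNAL TABLE archived_logs (
  message string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES (
  -- accept both millisecond- and second-precision ISO-8601 timestamps
  "timestamp.formats"="yyyy-MM-dd'T'HH:mm:ss.SSS'Z',yyyy-MM-dd'T'HH:mm:ss"
)
LOCATION 's3://ajay-kubernetes-source/';
```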

Malformed JSON Values

If some fields that are expected to contain JSON may hold malformed values, add the following to SERDEPROPERTIES so that malformed records are skipped instead of failing the query:

WITH SERDEPROPERTIES (
  "paths"="bucket,docker,kubernetes,log,object,region,source_type,stream,timestamp,timestamp_end",
  "timestamp.formats"="yyyy-MM-dd'T'HH:mm:ss.SSS'Z',yyyy-MM-dd'T'HH:mm:ss,yyyy-MM-dd'T'HH:mm:ss'Z'",
  "ignore.malformed.json"="true"
)

If the data itself is in a different format, the last few lines after the table definition need to be replaced with the appropriate SerDe (Serializer/Deserializer) library. For example, for Parquet it should be something like:

PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://<myregion>.<mypath>/<sub>/<path>/'
tblproperties ("parquet.compression"="SNAPPY");

Refreshing Table Metadata

Run the MSCK REPAIR TABLE statement on the table to refresh partition metadata. This is especially important if new partition folders / sub-folders might have been added:

MSCK REPAIR TABLE archived_logs;
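For large buckets, repairing the whole table can be slow. A single known partition can also be registered directly, assuming the table was created with a PARTITIONED BY (dt string) clause as in the Parquet example above (the partition value and path are illustrative):

```sql
-- Register one partition explicitly instead of scanning the whole bucket
ALTER TABLE archived_logs ADD IF NOT EXISTS
  PARTITION (dt = '2024-01-01')
  LOCATION 's3://ajay-kubernetes-source/dt=2024-01-01/';
```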

Running a Query

Run queries in the query console, as on any normal SQL table:

SELECT * FROM archived_logs LIMIT 5;
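Nested struct fields are addressed with dot notation. For example, to filter the sample schema defined above by namespace (the filter values are illustrative):

```sql
SELECT kubernetes.pod_name,
       kubernetes.namespace_name,
       message
FROM archived_logs
WHERE kubernetes.namespace_name = 'production'
  AND message LIKE '%error%'
LIMIT 10;
```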

Automated schema discovery

AWS Glue is designed to automatically discover and catalog metadata from various data sources, and it integrates seamlessly with Athena to provide a comprehensive data processing and querying solution. Your Observo representative will assist in deploying AWS Glue in your environment.

Glue Connector Setup

  1. Set up a new crawler.

  2. Choose “add a data source”.

  3. Choose S3 as the type. Choose the S3 bucket where your data is stored (either browse or give the S3 location).

  4. Create a new IAM role.

  5. Set the output database and table. Choose a schedule for crawling and updating the database.

  6. Under “Advanced Options”, choose the following settings:

     These settings ensure that a single schema is created for the data discovered by the crawler in the S3 bucket. For more information on these settings, please refer to the AWS documentation on crawler configuration options.

  7. Save and then run the crawler.

  8. After the crawl succeeds, go to “Tables” on the left panel.

  9. Check the schema of your new table.

  10. Navigate to Athena and use the table created for querying.
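Once created, the crawler can also be started and monitored from the AWS CLI; a sketch (the crawler name is illustrative):

```shell
# Start the crawler and poll its state (READY / RUNNING / STOPPING);
# the name archived-logs-crawler is illustrative
aws glue start-crawler --name archived-logs-crawler
aws glue get-crawler --name archived-logs-crawler --query 'Crawler.State'
```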

Cost

All prices are for the US East region.

Query Run: $5 / TB of data scanned

Glue Crawler: $0.44/DPU-Hour (billed per second, 10 minutes minimum per crawler run)

Data Catalog Storage: First 1M objects free, $1 / 100k objects per month (> 1M objects)

Data Catalog Requests: First 1M requests per month free, $1 / 1M requests per month (> 1M requests)
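As a rough illustration of how these rates combine, the sketch below estimates a monthly bill from the figures above (the workload numbers, and the assumption of 2 DPUs per crawler run, are hypothetical; check current AWS pricing):

```python
# Rough monthly cost estimate from the published rates above.
# Workload figures are illustrative assumptions, not a quote.

QUERY_RATE_PER_TB = 5.00          # Athena: $5 per TB scanned
CRAWLER_RATE_PER_DPU_HOUR = 0.44  # Glue crawler: $0.44 per DPU-hour
CRAWLER_MIN_SECONDS = 600         # 10-minute minimum per crawler run

def query_cost(tb_scanned: float) -> float:
    """Athena cost for a given number of terabytes scanned."""
    return tb_scanned * QUERY_RATE_PER_TB

def crawler_run_cost(run_seconds: float, dpus: int = 2) -> float:
    """Cost of one crawler run, billed per second with a 10-minute minimum.

    The 2-DPU default is an assumption for illustration.
    """
    billed = max(run_seconds, CRAWLER_MIN_SECONDS)
    return dpus * (billed / 3600) * CRAWLER_RATE_PER_DPU_HOUR

# Example: 4 TB scanned per month plus 30 daily 5-minute crawls on 2 DPUs
monthly = query_cost(4) + 30 * crawler_run_cost(300)
print(f"${monthly:.2f}")  # → prints $24.40
```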
