Query Archived Logs
Set up Amazon Athena in your AWS account to query logs stored in Archive.
Introduction
Amazon Athena is a serverless, interactive query service provided by Amazon Web Services (AWS). It allows users to analyze and query data stored in Amazon S3 (Simple Storage Service) using standard SQL (Structured Query Language) queries.
All of the Athena commands below can be run from the AWS console.
Athena Setup
Click on “Query your data” on Athena’s home screen.

Choose a data source appropriate for your use case; we chose “S3 - AWS Glue Data Catalog”.

If you are using AWS Glue to automate schema discovery, skip to Automated Schema Discovery.
Choose “create a table manually”.

Give the table and the database a name.

Give the S3 bucket location of the data folder. If the data is partitioned into multiple folders, give only the root folder location.
Choose the table type as “Apache Hive”, the file format of your data (“JSON” for our use case), and the SerDe (Serializer/Deserializer) library of your choice.

Add the table details, partition details, etc. Alternatively, add a dummy column and skip the other sections; they can be defined directly in the query layer later.
In the settings tab, edit the settings and give an S3 bucket location for storing the metadata of the database and queries. Create a new S3 bucket if needed.

Creating Schema Definition
Static definition of schema
With Athena (and similar solutions), the table structure and the location of the raw data need to be specified: first specify an EXTERNAL TABLE, followed by the data format and the location.
Example:
CREATE EXTERNAL TABLE archived_logs (
  docker struct<container_id:string>,
  kubernetes struct<
    container_image:string,
    container_image_id:string,
    container_name:string,
    host:string,
    labels:struct<`app`:string,`pod-template-hash`:string>,
    master_url:string,
    namespace_id:string,
    namespace_name:string,
    pod_id:string,
    pod_name:string
  >,
  message string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://ajay-kubernetes-source/';

The above example is for the JSON format (with the Hive JSON serialization library). Additional information can also be specified depending on the exact library and data.
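For reference, a log record that the schema above would map might look like the following. All field values here are purely illustrative:

```json
{
  "docker": { "container_id": "0123abcd" },
  "kubernetes": {
    "container_image": "nginx:1.25",
    "container_image_id": "sha256:0a1b2c3d",
    "container_name": "nginx",
    "host": "node-1",
    "labels": { "app": "web", "pod-template-hash": "5d4c8b" },
    "master_url": "https://10.0.0.1:443",
    "namespace_id": "ns-0001",
    "namespace_name": "default",
    "pod_id": "pod-0001",
    "pod_name": "web-5d4c8b-abcde"
  },
  "message": "GET /healthz 200"
}
```

Each top-level struct in the table definition corresponds to a nested JSON object in the record; nested fields are then addressable in queries with dot notation (e.g. `kubernetes.pod_name`).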
Timestamp Values
For custom timestamp values, add the following to SERDEPROPERTIES:
WITH SERDEPROPERTIES (
"timestamp.formats"="yyyy-MM-dd'T'HH:mm:ss.SSS'Z',yyyy-MM-dd'T'HH:mm:ss"
)

The default timestamp format is yyyy-MM-dd HH:mm:ss.
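As a sanity check outside Athena, the two Java-style patterns above roughly correspond to the following Python `strptime` formats (a sketch only; `%f` stands in for the `.SSS` milliseconds):

```python
from datetime import datetime

# strptime equivalents of the Java patterns
# yyyy-MM-dd'T'HH:mm:ss.SSS'Z' and yyyy-MM-dd'T'HH:mm:ss
FORMATS = ["%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%S"]

def parse_ts(value: str) -> datetime:
    """Try each format in order, as the SerDe does with timestamp.formats."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {value}")

print(parse_ts("2024-01-15T08:30:00.123Z"))  # matches the first pattern
print(parse_ts("2024-01-15T08:30:00"))       # falls through to the second
```

Listing the more specific pattern first matters: the formats are tried in order, and a value only falls through to the next pattern when the current one fails to match.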
Malformed JSON Values
If some values of fields that are expected to be JSON may be malformed, add the following to SERDEPROPERTIES (note that ignore.malformed.json is supported by the OpenX JSON SerDe, org.openx.data.jsonserde.JsonSerDe, rather than the Hive JsonSerDe):
WITH SERDEPROPERTIES (
'paths'='bucket,docker,kubernetes,log,object,region,source_type,stream,timestamp,timestamp_end',
"timestamp.formats"="yyyy-MM-dd'T'HH:mm:ss.SSS'Z',yyyy-MM-dd'T'HH:mm:ss,yyyy-MM-dd'T'HH:mm:ss'Z'",
'ignore.malformed.json'='true'
)

If the data itself is in a different format, the last few lines after the table definition will need to be replaced with the appropriate SerDe (Serializer/Deserializer) library. For example, for Parquet it should be something like:
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://<myregion>.<mypath>/<sub>/<path>/<json_file>'
tblproperties ("parquet.compression"="SNAPPY");

Refreshing Table Metadata
Run the MSCK REPAIR TABLE statement on the table to refresh partition metadata. This is especially important if new partition folders / sub-folders might have been added:
MSCK REPAIR TABLE archived_logs;

Running Query
Run the query on the query console, like any normal SQL table:
SELECT * FROM archived_logs LIMIT 5;

Automated schema discovery
AWS Glue is designed to automatically discover and catalog metadata from various data sources, and it integrates seamlessly with Athena to provide a comprehensive data processing and querying solution. Your Observo representative will assist in deploying AWS Glue in your environment.
Glue Connector Setup
Set up a new crawler

Choose “add a data source”

Choose the type as S3, and choose the S3 bucket where your data is stored (either browse or give the S3 location).

Create a new IAM role

Set the output database and table. Choose a schedule for crawling and updating the database.

Under "Advanced Options", choose the following settings:

These settings ensure that a single schema is created for the data discovered by the crawler in the S3 bucket. For more information on these settings, please refer to the AWS documentation on crawler configuration options.
Save, and then run the crawler.


After the crawl succeeds, go to “Tables” on the left panel.
Check the schema of your new table

Navigate to Athena and use the created table for querying.
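Querying works the same as with a manually defined table. As a sketch, assuming the crawler produced a table named `archived_logs` with a partition column `dt` (both names are illustrative here), a partition-pruned query would look like:

```sql
-- Filtering on the partition column limits the data Athena scans
SELECT kubernetes.pod_name, message
FROM archived_logs
WHERE dt = '2024-01-15'
LIMIT 10;
```

Restricting queries to specific partitions reduces the amount of data scanned, which directly reduces query cost.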
Cost
All prices are for US East.
Query Run: $5 / TB
Glue Crawler: $0.44/DPU-Hour (billed per second, 10 minutes minimum per crawler run)
Data Catalog Storage: First 1M objects free, $1 / 100k objects per month (> 1M objects)
Data Catalog Requests: First 1M requests free, $1 / 1M requests per month (> 1M requests)
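As a rough illustration, the list prices above translate into a simple estimate. This is a sketch only; actual billing depends on the data scanned, DPU count, and run time:

```python
# Illustrative cost estimates based on the US East list prices above.
ATHENA_USD_PER_TB = 5.00        # per TB of data scanned by a query
GLUE_USD_PER_DPU_HOUR = 0.44    # billed per second, 10-minute minimum per run

def athena_query_cost(tb_scanned: float) -> float:
    """Cost of a query that scans `tb_scanned` terabytes."""
    return tb_scanned * ATHENA_USD_PER_TB

def glue_crawl_cost(dpus: int, run_seconds: float) -> float:
    """Cost of one crawler run; runs under 10 minutes bill as 10 minutes."""
    billed_seconds = max(run_seconds, 600)
    return dpus * (billed_seconds / 3600) * GLUE_USD_PER_DPU_HOUR

print(athena_query_cost(2.0))             # a query scanning 2 TB
print(round(glue_crawl_cost(2, 120), 4))  # 2 DPUs, 2-minute run billed at the minimum
```

Because the query price scales with data scanned, partitioning the archive and filtering on partition columns is the main lever for keeping Athena costs down.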