AWS Scan Ingestion
The AWS S3 Scan Ingestion Source in Observo AI enables periodic scanning and ingestion of data, such as logs, metrics, or events, from Amazon S3 buckets in formats like JSON, CSV, or Parquet, supporting monitoring, analytics, and security use cases.
Purpose
The purpose of the Observo AI AWS S3 Scan Ingestion source is to enable users to ingest data from Amazon S3 buckets into the Observo AI platform for analysis and processing. It supports formats like JSON, CSV, Parquet, and plain text, allowing the collection of data such as logs, metrics, or events by periodically scanning the bucket. This integration helps organizations streamline data pipelines, enhance observability, and support use cases like monitoring, analytics, and security by processing S3 data efficiently.
Prerequisites
Before configuring the AWS S3 Scan Ingestion source in Observo AI, ensure the following requirements are met to facilitate seamless data ingestion:
Observo AI Platform Setup:
The Observo AI platform must be installed and operational, with support for AWS S3 Scan Ingestion as a data source.
Verify that the platform supports common data formats such as JSON, CSV, Parquet, or plain text. Additional formats may require specific parser configurations.
AWS Account and Permissions:
An active AWS account with access to the target S3 bucket is required.
Required IAM permissions:
For S3: s3:GetObject and s3:ListBucket.
Optionally, configure Amazon S3 Inventory to generate inventory reports for large buckets to optimize scanning performance.
Authentication:
Prepare one of the following authentication methods:
Auto Authentication: Use IAM roles, shared credentials, or environment variables.
Manual Authentication: Provide an AWS access key and secret key.
Secret Authentication: Use a stored secret within Observo AI's secure storage for credentials.
Network and Connectivity:
Ensure Observo AI can communicate with the AWS S3 endpoint (e.g., s3.<region>.amazonaws.com).
Check for proxy settings, firewall rules, or VPC endpoint configurations that may affect connectivity to AWS S3.
| Component | Requirement | Notes |
| --- | --- | --- |
| Observo AI Platform | Must be installed and support S3 Scan Ingestion | Verify support for JSON, CSV, Parquet, etc.; additional parsers may be needed |
| AWS Account | Active account with S3 access | Ensure access to target bucket; consider S3 Inventory for large buckets |
| Authentication | Auto, Manual, or Secret | Prepare IAM roles or credentials accordingly |
| Network | Connectivity to AWS S3 endpoint | Check VPC endpoints, proxies, and firewalls |
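The two S3 permissions listed above can be captured in a minimal IAM policy. Here is a sketch in Python (the bucket name is a placeholder; attach the resulting JSON to the role or user that Observo AI authenticates with):

```python
import json

def minimal_scan_policy(bucket: str) -> str:
    """Build the minimal IAM policy JSON for the scan source:
    s3:ListBucket on the bucket itself, s3:GetObject on its objects."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                # GetObject applies to objects, so the ARN needs the /* suffix.
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }
    return json.dumps(policy, indent=2)

print(minimal_scan_policy("my-logs-bucket"))
```

Note that `s3:ListBucket` attaches to the bucket ARN while `s3:GetObject` attaches to the object ARN (`/*`); putting both actions on the same resource is a common cause of "Access denied" errors.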
Integration
The Integration section outlines default configurations for the AWS S3 Scan Ingestion source. To configure AWS S3 Scan Ingestion as a source in Observo AI, follow these steps to set up and test the data flow:
Log in to Observo AI:
Navigate to the Sources tab.
Click the Add Source button and select Create New.
Choose AWS S3 Scan Ingestion from the list of available sources to begin configuration.
General Settings:
Name: A unique identifier for the source, such as s3-scan-source-1.
Description (Optional): Provide a description for the source.
Scan Interval: The interval at which to run the scan. This is a duration string, such as 30s or 1m.
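Duration strings of this form are a number followed by a unit. A sketch of how such values might resolve to seconds (which units the platform accepts beyond s and m is an assumption here):

```python
import re

def parse_scan_interval(interval: str) -> int:
    """Convert a duration string like '30s' or '1m' into seconds.
    Units assumed: s (seconds), m (minutes), h (hours)."""
    match = re.fullmatch(r"(\d+)([smh])", interval)
    if not match:
        raise ValueError(f"unsupported duration string: {interval!r}")
    value, unit = int(match.group(1)), match.group(2)
    return value * {"s": 1, "m": 60, "h": 3600}[unit]

parse_scan_interval("30s")  # 30
parse_scan_interval("1m")   # 60
```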
AWS Region: The AWS region where the S3 bucket is located.
Example: us-east-1
Bucket Scans: A list of bucket configs that will be scanned for log files (Add as needed)
Bucket: The AWS bucket name.
Example: my-logs-bucket
Prefix (Optional): The AWS bucket prefix.
Example: logs/
File Filter Pattern (Optional): The pattern to filter files. Files matching the filter are read.
Examples:
*.json*
*.gz
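The filter examples above are glob-style patterns. A sketch of the matching behavior, assuming glob semantics applied to the object's file name (whether the platform matches the base name or the full key is an assumption):

```python
from fnmatch import fnmatch

def matches_filter(key: str, pattern: str) -> bool:
    """Glob-match an S3 object key's file name against a filter pattern.
    Matching only the base name is an assumption for this sketch."""
    name = key.rsplit("/", 1)[-1]
    return fnmatch(name, pattern)

matches_filter("logs/app/events.json.gz", "*.gz")     # True
matches_filter("logs/app/events.json.gz", "*.json*")  # True
matches_filter("logs/app/readme.txt", "*.gz")         # False
```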
Time Pattern (Optional): The pattern for time-based prefixes. Supported tokens: yyyy (4-digit year), yy (2-digit year), MM (2-digit month, 01-12), dd (2-digit day, 01-31), HH (2-digit hour, 00-23), mm (2-digit minute, 00-59), ss (2-digit second, 00-59).
Examples:
yyyy/MM/dd
year=yyyy/month=MM/day=dd
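The tokens above map naturally onto strftime directives. A sketch of how a time pattern could expand into a concrete prefix for a given timestamp (the token table mirrors the list in this guide; the expansion logic itself is illustrative, not the platform's implementation):

```python
from datetime import datetime, timezone

# Token table from the docs, ordered so longer tokens are replaced first
# (yyyy before yy); MM is month and mm is minute, as documented.
_TOKENS = [("yyyy", "%Y"), ("yy", "%y"), ("MM", "%m"),
           ("dd", "%d"), ("HH", "%H"), ("mm", "%M"), ("ss", "%S")]

def expand_time_pattern(pattern: str, when: datetime) -> str:
    """Expand a time-based prefix pattern for one timestamp."""
    for token, directive in _TOKENS:
        pattern = pattern.replace(token, directive)
    return when.strftime(pattern)

ts = datetime(2024, 6, 1, tzinfo=timezone.utc)
expand_time_pattern("yyyy/MM/dd", ts)                 # '2024/06/01'
expand_time_pattern("year=yyyy/month=MM/day=dd", ts)  # 'year=2024/month=06/day=01'
```

Scoping each scan to the prefix for the current day (or hour) is what lets the source skip older objects in large buckets.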
Decode Content Type (Optional): How to decode file content: plain (no decompression), gzip (Gzip decompression), zstd (Zstandard decompression), or auto (default; auto-detect based on file extension or content type).
Example: gzip
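Auto-detection as described can combine the file extension with the payload's leading magic bytes. A stdlib-only sketch (zstd is stubbed out here because Zstandard support needs a third-party library; the exact detection rules the platform uses are an assumption):

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"          # first two bytes of any gzip stream
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # Zstandard frame magic number

def auto_decode(key: str, payload: bytes) -> bytes:
    """Sketch of 'auto' decoding: check the extension and leading
    magic bytes, then decompress accordingly."""
    if key.endswith(".gz") or payload[:2] == GZIP_MAGIC:
        return gzip.decompress(payload)
    if payload[:4] == ZSTD_MAGIC:
        raise NotImplementedError("zstd requires a third-party library")
    return payload  # 'plain': no decompression

compressed = gzip.compress(b'{"event": "login"}')
auto_decode("logs/events.json.gz", compressed)  # b'{"event": "login"}'
auto_decode("notes.txt", b"hello")              # b'hello'
```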
Max Files Per Scan (Optional): Maximum number of files to process in a single scan.
Example: 1000
Record Start Pattern (Optional): Regular expression that marks the start of a record for multi-line logs. If not specified, each line is treated as a separate record.
Example: start=>.*
Record End Pattern (Optional): Regular expression that marks the end of a record for multi-line logs.
Example: .*=>end
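Together, the start and end patterns delimit multi-line records. A sketch of how lines might be grouped under these rules (illustrative; the platform's exact buffering behavior may differ):

```python
import re

def assemble_records(lines, start_pattern=None, end_pattern=None):
    """Group raw lines into records. With no start pattern, every line
    is its own record; with one, each matching line begins a new record,
    and an end pattern (if given) closes the current record."""
    if start_pattern is None:
        return list(lines)
    start = re.compile(start_pattern)
    end = re.compile(end_pattern) if end_pattern else None
    records, current = [], []
    for line in lines:
        if start.match(line) and current:
            records.append("\n".join(current))  # flush the open record
            current = []
        current.append(line)
        if end and end.match(line):
            records.append("\n".join(current))
            current = []
    if current:  # flush a trailing record with no explicit end
        records.append("\n".join(current))
    return records

lines = ["start=> request received", "  stack frame 1", "  stack frame 2"]
assemble_records(lines, r"start=>.*")
# ['start=> request received\n  stack frame 1\n  stack frame 2']
```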
Add S3 Metadata (True): Add S3 metadata as attributes.
Add S3 File Name (True): Add S3 file name metadata as attributes.
Add S3 File Path (True): Add S3 file path metadata as attributes.
Add S3 Bucket Name (True): Add S3 Bucket name metadata as attributes.
Extract Attributes (Optional): Regular expression to extract attributes from the file path.
Example: logs/(\\w+)/(\\d{4})/(\\d{2})/(\\d{2})/
Extracted Attribute Names (Add as needed): List of attribute names for the regex capture groups.
Examples:
service
year
month
day
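The extraction regex and the attribute-name list work as a pair: capture group N is assigned name N. A sketch using the documented example path layout (`extract_path_attributes` is a hypothetical helper, not a platform API):

```python
import re

def extract_path_attributes(path, pattern, names):
    """Apply the Extract Attributes regex to an object path and map
    its capture groups onto the configured attribute names."""
    match = re.search(pattern, path)
    if not match:
        return {}
    return dict(zip(names, match.groups()))

extract_path_attributes(
    "logs/billing/2024/06/01/events.json.gz",
    r"logs/(\w+)/(\d{4})/(\d{2})/(\d{2})/",
    ["service", "year", "month", "day"],
)
# {'service': 'billing', 'year': '2024', 'month': '06', 'day': '01'}
```

If the name list is shorter than the number of capture groups, the extra groups are silently dropped here; keeping the two in sync avoids surprises.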
Authentication (Optional):
Access Key: Enter the AWS access key ID to use for the assumed role.
Example: AKIAIOSFODNN7EXAMPLE
Secret Access Key: Enter the AWS secret access key to use for the assumed role.
Example: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Advanced Settings (Optional):
AWS Endpoint: Custom endpoint for use with AWS-compatible services.
Example: http://127.0.0.0:5000/path/to/service
Storage: Enter the ID of a storage extension to be used to track processed results.
Log Scanned Files (True): Log the files that are scanned.
Parser Config:
Enable Source Log Parser: Disabled by default.
Toggle Enable Source Log Parser to enable and select an appropriate parser from the Source Log Parser dropdown.
Add additional parsers as needed.
Pattern Extractor:
Refer to Observo AI’s Pattern Extractor documentation for details on configuring pattern-based data extraction.
Archival Destination:
Toggle Enable Archival on Source to enable.
Under Archival Destination, select from the list of available Archival Destinations (Required).
Save and Test Configuration:
Save the configuration settings.
Upload sample data to the S3 bucket and verify ingestion in the Analytics tab in the target pipeline to confirm data flow.
Example Scenarios
HealthSync Solutions, a fictitious healthcare provider specializing in telemedicine and patient data management, stores patient activity logs, medical device telemetry, and compliance audit data in an Amazon S3 bucket. To enhance observability and ensure compliance with HIPAA regulations, HealthSync integrates the AWS S3 Scan Ingestion source with the Observo AI platform. This integration enables periodic scanning of the S3 bucket to ingest JSON-formatted logs for real-time monitoring of patient interactions, anomaly detection in device data, and generating compliance reports.
Standard AWS S3 Scan Ingestion Source Setup
Here is a standard AWS S3 Scan Ingestion Source configuration example. Only the required sections and their associated field updates are displayed in the table below:
General Settings

| Field | Value | Description |
| --- | --- | --- |
| Name | healthsync-s3-scan | Unique identifier for the AWS S3 Scan Ingestion source. |
| Description | Ingest S3 logs for patient activity and compliance monitoring | Optional description of the source's purpose. |
| Scan Interval | 1m | Scans the S3 bucket every minute for near real-time data ingestion. |
| AWS Region | us-east-1 | AWS region where the S3 bucket is located. |

Bucket Scans

| Field | Value | Description |
| --- | --- | --- |
| Bucket | healthsync-logs-bucket | The AWS S3 bucket containing patient and compliance logs. |
| Prefix | logs/patient-activity/ | Bucket prefix to filter logs related to patient activities. |
| File Filter Pattern | *.json.gz | Filters for JSON files compressed with Gzip. |
| Time Pattern | yyyy/MM/dd | Time-based prefix pattern for organizing logs by year, month, and day. |
| Decode Content Type | gzip | Decompresses Gzip-compressed JSON files. |
| Max Files Per Scan | 500 | Limits to 500 files per scan to manage processing load. |
| Record Start Pattern | start=>.* | Regex to mark the start of multi-line log records. |
| Record End Pattern | .*=>end | Regex to mark the end of multi-line log records. |
| Add S3 Metadata | True | Includes S3 metadata as attributes in ingested data. |
| Add S3 File Name | True | Includes S3 file name as an attribute in ingested data. |
| Add S3 File Path | True | Includes S3 file path as an attribute in ingested data. |
| Add S3 Bucket Name | True | Includes S3 bucket name as an attribute in ingested data. |
| Extract Attributes | logs/(\w+)/(\d{4})/(\d{2})/(\d{2})/ | Regex to extract attributes from the file path. |
| Extracted Attribute Names | service, year, month, day | Names for the regex capture groups. |

Authentication

| Field | Value | Description |
| --- | --- | --- |
| Access Key | AKIAHEALTHSYNC1234567890 | AWS access key ID for accessing the S3 bucket. |
| Secret Access Key | zZ9kLmNpQrStUvWxYz1234567890AbCdEfGh | AWS secret access key for accessing the S3 bucket (securely stored). |

Advanced Settings

| Field | Value | Description |
| --- | --- | --- |
| AWS Endpoint | https://s3.us-east-1.amazonaws.com | Standard AWS S3 endpoint for the us-east-1 region. |
| Storage | healthsync-s3-storage-tracker | ID of the storage extension to track processed files. |
| Log Scanned Files | True | Logs the files scanned during each cycle for auditing purposes. |
Troubleshooting
If issues arise with the AWS S3 Scan Ingestion source in Observo AI, use the following steps to diagnose and resolve them:
Verify Configuration Settings:
Ensure fields like Bucket, Region, Prefix, and File Filter are correctly entered and match the AWS S3 setup.
Confirm that the scan interval and file filter align with the expected data in the bucket.
Check Authentication:
Verify the authentication method:
For Auto authentication, ensure IAM roles, shared credentials, or environment variables are correctly configured.
For Manual authentication, check that the access key and secret key are valid.
For Secret authentication, confirm the secret is accessible in Observo AI.
Validate Permissions:
Ensure the credentials have the required permissions: s3:GetObject and s3:ListBucket.
Check IAM policies in the AWS Console to confirm correct permissions are assigned.
Check Network Connectivity:
Verify that firewall rules, proxy settings, or VPC configurations allow traffic to the AWS S3 endpoint (s3.<region>.amazonaws.com).
Test connectivity using the AWS CLI or tools like curl to ensure Observo AI can access the S3 bucket.
Monitor Logs and Data:
Verify data ingestion by monitoring the Analytics tab in the Observo AI pipeline for throughput and event counts.
Check Observo AI logs for errors related to parsing, object access, or connection issues.
Common Error Messages:
"No data received": Ensure objects are present in the bucket and match the prefix or file filter. Verify network connectivity to the S3 endpoint.
"Invalid payload format": Confirm that the S3 objects are in a supported format (e.g., JSON, CSV, Parquet). Use the appropriate parser in the Source Log Parser settings.
"Bucket does not exist": Check the bucket name and ensure there are no certificate validation issues. Consider adding CA certificates if needed.
"Access denied": Verify that the IAM credentials have the required s3:GetObject and s3:ListBucket permissions.
| Error | Possible Cause | Resolution |
| --- | --- | --- |
| No data received | Incorrect bucket, prefix, or file filter | Verify bucket configuration and file filter |
| Invalid payload format | Unsupported data format | Use appropriate parser for JSON, CSV, Parquet |
| Bucket does not exist | Incorrect bucket name or certificate issues | Check bucket name and certificate settings |
| Access denied | Insufficient IAM permissions | Verify s3:GetObject and s3:ListBucket permissions |
| Connectivity issues | Firewall or proxy blocking access | Test network connectivity and VPC endpoints |
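The table above can be read as a lookup from failure signature to first remedy. An illustrative mapping (the error strings are the ones this guide lists; the helper itself is hypothetical and not part of the product):

```python
# Failure signatures from this guide mapped to their first remedy.
REMEDIES = {
    "No data received": "Check bucket contents, prefix/file filter, and network path to S3.",
    "Invalid payload format": "Confirm the object format and select a matching Source Log Parser.",
    "Bucket does not exist": "Verify the bucket name and any certificate configuration.",
    "Access denied": "Ensure credentials grant s3:GetObject and s3:ListBucket.",
}

def suggest_remedy(error_message: str) -> str:
    """Return the first remedy whose signature appears in the message."""
    for signature, remedy in REMEDIES.items():
        if signature.lower() in error_message.lower():
            return remedy
    return "Check Observo AI logs and S3 connectivity for details."

suggest_remedy("ingestion failed: Access denied for healthsync-logs-bucket")
# 'Ensure credentials grant s3:GetObject and s3:ListBucket.'
```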
Resources
For additional guidance and detailed information, refer to the following resources:
Best Practices:
Use Amazon S3 Inventory for large buckets to optimize scan performance.
Configure specific prefixes and file filters to reduce unnecessary scanning of irrelevant objects.
Ensure secure authentication using IAM roles or secrets to minimize credential exposure.