AWS Scan Ingestion
The AWS S3 Scan Ingestion Source in Observo AI enables periodic scanning and ingestion of data, such as logs, metrics, or events, from Amazon S3 buckets in formats like JSON, CSV, or Parquet, supporting monitoring, analytics, and security use cases.
Purpose
The purpose of the Observo AI AWS S3 Scan Ingestion source is to enable users to ingest data from Amazon S3 buckets into the Observo AI platform for analysis and processing. It supports formats like JSON, CSV, Parquet, and plain text, allowing the collection of data such as logs, metrics, or events by periodically scanning the bucket. This integration helps organizations streamline data pipelines, enhance observability, and support use cases like monitoring, analytics, and security by processing S3 data efficiently.
Prerequisites
Before configuring the AWS S3 Scan Ingestion source in Observo AI, ensure the following requirements are met to facilitate seamless data ingestion:
Observo AI Platform Setup:
The Observo AI platform must be installed and operational, with support for AWS S3 Scan Ingestion as a data source.
Verify that the platform supports common data formats such as JSON, CSV, Parquet, or plain text. Additional formats may require specific parser configurations.
AWS Account and Permissions:
An active AWS account with access to the target S3 bucket is required.
Required IAM permissions:
For S3: s3:GetObject and s3:ListBucket.
Optionally, configure Amazon S3 Inventory to generate inventory reports for large buckets to optimize scanning performance.
Authentication:
Prepare one of the following authentication methods:
Auto Authentication: Use IAM roles, shared credentials, or environment variables.
Manual Authentication: Provide an AWS access key and secret key.
Secret Authentication: Use a stored secret within Observo AI's secure storage for credentials.
Network and Connectivity:
Ensure Observo AI can communicate with the AWS S3 endpoint (e.g., s3.<region>.amazonaws.com).
Check for proxy settings, firewall rules, or VPC endpoint configurations that may affect connectivity to AWS S3.
| Component | Requirement | Notes |
| --- | --- | --- |
| Observo AI Platform | Must be installed and support S3 Scan Ingestion | Verify support for JSON, CSV, Parquet, etc.; additional parsers may be needed |
| AWS Account | Active account with S3 access | Ensure access to target bucket; consider S3 Inventory for large buckets |
| Authentication | Auto, Manual, or Secret | Prepare IAM roles or credentials accordingly |
| Network | Connectivity to AWS S3 endpoint | Check VPC endpoints, proxies, and firewalls |
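The two S3 permissions listed above can be captured in a minimal IAM policy. Here is a sketch in Python (the bucket name is a placeholder; attach the resulting JSON to the role or user that Observo AI authenticates with):

```python
import json

def minimal_scan_policy(bucket: str) -> str:
    """Build the minimal IAM policy JSON for the scan source:
    s3:ListBucket on the bucket itself, s3:GetObject on its objects."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                # GetObject applies to objects, so the ARN needs the /* suffix.
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }
    return json.dumps(policy, indent=2)

print(minimal_scan_policy("my-logs-bucket"))
```

Note that `s3:ListBucket` attaches to the bucket ARN while `s3:GetObject` attaches to the object ARN (`/*`); putting both actions on the same resource is a common cause of "Access denied" errors.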
Integration
The Integration section outlines default configurations for the AWS S3 Scan Ingestion source. To configure AWS S3 Scan Ingestion as a source in Observo AI, follow these steps to set up and test the data flow:
Log in to Observo AI:
Navigate to the Sources tab.
Click the Add Source button and select Create New.
Choose AWS S3 Scan Ingestion from the list of available sources to begin configuration.
General Settings:
Name: A unique identifier for the source, such as s3-scan-source-1.
Description (Optional): Provide a description for the source.
Scan Interval: The interval at which to run the scan. This is a duration string, such as 30s or 1m.
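Duration strings of this form are a number followed by a unit. A sketch of how such values might resolve to seconds (which units the platform accepts beyond s and m is an assumption here):

```python
import re

def parse_scan_interval(interval: str) -> int:
    """Convert a duration string like '30s' or '1m' into seconds.
    Units assumed: s (seconds), m (minutes), h (hours)."""
    match = re.fullmatch(r"(\d+)([smh])", interval)
    if not match:
        raise ValueError(f"unsupported duration string: {interval!r}")
    value, unit = int(match.group(1)), match.group(2)
    return value * {"s": 1, "m": 60, "h": 3600}[unit]

parse_scan_interval("30s")  # 30
parse_scan_interval("1m")   # 60
```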
AWS Region: The AWS region where the S3 bucket is located.
Example: us-east-1
Bucket Scans: A list of bucket configs that will be scanned for log files (Add as needed)
Bucket: The AWS bucket name.
Example: my-logs-bucket
Prefix (Optional): The AWS bucket prefix.
Example: logs/
File Filter Pattern (Optional): The pattern to filter files. Files matching the filter are read.
Examples:
*.json*
*.gz
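The filter examples above are glob-style patterns. A sketch of the matching behavior, assuming glob semantics applied to the object's file name (whether the platform matches the base name or the full key is an assumption):

```python
from fnmatch import fnmatch

def matches_filter(key: str, pattern: str) -> bool:
    """Glob-match an S3 object key's file name against a filter pattern.
    Matching only the base name is an assumption for this sketch."""
    name = key.rsplit("/", 1)[-1]
    return fnmatch(name, pattern)

matches_filter("logs/app/events.json.gz", "*.gz")     # True
matches_filter("logs/app/events.json.gz", "*.json*")  # True
matches_filter("logs/app/readme.txt", "*.gz")         # False
```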
Time Pattern (Optional): The pattern for time-based prefixes. Supported tokens: yyyy (4-digit year), yy (2-digit year), MM (2-digit month, 01-12), dd (2-digit day, 01-31), HH (2-digit hour, 00-23), mm (2-digit minute, 00-59), ss (2-digit second, 00-59).
Examples:
yyyy/MM/dd
year=yyyy/month=MM/day=dd
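The tokens above map naturally onto strftime directives. A sketch of how a time pattern could expand into a concrete prefix for a given timestamp (the token table mirrors the list in this guide; the expansion logic itself is illustrative, not the platform's implementation):

```python
from datetime import datetime, timezone

# Token table from the docs, ordered so longer tokens are replaced first
# (yyyy before yy); MM is month and mm is minute, as documented.
_TOKENS = [("yyyy", "%Y"), ("yy", "%y"), ("MM", "%m"),
           ("dd", "%d"), ("HH", "%H"), ("mm", "%M"), ("ss", "%S")]

def expand_time_pattern(pattern: str, when: datetime) -> str:
    """Expand a time-based prefix pattern for one timestamp."""
    for token, directive in _TOKENS:
        pattern = pattern.replace(token, directive)
    return when.strftime(pattern)

ts = datetime(2024, 6, 1, tzinfo=timezone.utc)
expand_time_pattern("yyyy/MM/dd", ts)                 # '2024/06/01'
expand_time_pattern("year=yyyy/month=MM/day=dd", ts)  # 'year=2024/month=06/day=01'
```

Scoping each scan to the prefix for the current day (or hour) is what lets the source skip older objects in large buckets.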
Decode Content Type (Optional): How to decode file content: plain (no decompression), gzip (Gzip decompression), zstd (Zstandard decompression), or auto (default; auto-detect based on file extension or content type).
Example: gzip
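Auto-detection as described can combine the file extension with the payload's leading magic bytes. A stdlib-only sketch (zstd is stubbed out here because Zstandard support needs a third-party library; the exact detection rules the platform uses are an assumption):

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"          # first two bytes of any gzip stream
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # Zstandard frame magic number

def auto_decode(key: str, payload: bytes) -> bytes:
    """Sketch of 'auto' decoding: check the extension and leading
    magic bytes, then decompress accordingly."""
    if key.endswith(".gz") or payload[:2] == GZIP_MAGIC:
        return gzip.decompress(payload)
    if payload[:4] == ZSTD_MAGIC:
        raise NotImplementedError("zstd requires a third-party library")
    return payload  # 'plain': no decompression

compressed = gzip.compress(b'{"event": "login"}')
auto_decode("logs/events.json.gz", compressed)  # b'{"event": "login"}'
auto_decode("notes.txt", b"hello")              # b'hello'
```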
Max Files Per Scan (Optional): Maximum number of files to process in a single scan.
Example: 1000
Record Start Pattern (Optional): Regular expression that marks the start of a record for multi-line logs. If not specified, each line is treated as a separate record.
Example: start=>.*
Record End Pattern (Optional): Regular expression that marks the end of a record for multi-line logs.
Example: .*=>end
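Together, the start and end patterns delimit multi-line records. A sketch of how lines might be grouped under these rules (illustrative; the platform's exact buffering behavior may differ):

```python
import re

def assemble_records(lines, start_pattern=None, end_pattern=None):
    """Group raw lines into records. With no start pattern, every line
    is its own record; with one, each matching line begins a new record,
    and an end pattern (if given) closes the current record."""
    if start_pattern is None:
        return list(lines)
    start = re.compile(start_pattern)
    end = re.compile(end_pattern) if end_pattern else None
    records, current = [], []
    for line in lines:
        if start.match(line) and current:
            records.append("\n".join(current))  # flush the open record
            current = []
        current.append(line)
        if end and end.match(line):
            records.append("\n".join(current))
            current = []
    if current:  # flush a trailing record with no explicit end
        records.append("\n".join(current))
    return records

lines = ["start=> request received", "  stack frame 1", "  stack frame 2"]
assemble_records(lines, r"start=>.*")
# ['start=> request received\n  stack frame 1\n  stack frame 2']
```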
Add S3 Metadata (True): Add S3 metadata as attributes.
Add S3 File Name (True): Add S3 file name metadata as attributes.
Add S3 File Path (True): Add S3 file path metadata as attributes.
Add S3 Bucket Name (True): Add S3 Bucket name metadata as attributes.
Extract Attributes (Optional): Regular expression to extract attributes from the file path.
Example: logs/(\\w+)/(\\d{4})/(\\d{2})/(\\d{2})/
Extracted Attribute Names (Add as needed): List of attribute names for the regex capture groups.
Examples:
service
year
month
day
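The extraction regex and the attribute-name list work as a pair: capture group N is assigned name N. A sketch using the documented example path layout (`extract_path_attributes` is a hypothetical helper, not a platform API):

```python
import re

def extract_path_attributes(path, pattern, names):
    """Apply the Extract Attributes regex to an object path and map
    its capture groups onto the configured attribute names."""
    match = re.search(pattern, path)
    if not match:
        return {}
    return dict(zip(names, match.groups()))

extract_path_attributes(
    "logs/billing/2024/06/01/events.json.gz",
    r"logs/(\w+)/(\d{4})/(\d{2})/(\d{2})/",
    ["service", "year", "month", "day"],
)
# {'service': 'billing', 'year': '2024', 'month': '06', 'day': '01'}
```

If the name list is shorter than the number of capture groups, the extra groups are silently dropped here; keeping the two in sync avoids surprises.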
Authentication (Optional):
Access Key: Enter the AWS access key ID to use for the assumed role.
Example: AKIAIOSFODNN7EXAMPLE
Secret Access Key: Enter the AWS secret access key to use for the assumed role.
Example: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Advanced Settings (Optional):
AWS Endpoint: Custom endpoint for use with AWS-compatible services.
Example: http://127.0.0.0:5000/path/to/service
Storage: Enter the ID of a storage extension to be used to track processed results.
Log Scanned Files (True): Log the files that are scanned.
Parser Config:
Enable Source Log Parser: Disabled by default.
Toggle Enable Source Log Parser to enable and select an appropriate parser from the Source Log Parser dropdown.
Add additional parsers as needed.
Pattern Extractor:
Refer to Observo AI’s Pattern Extractor documentation for details on configuring pattern-based data extraction.
Archival Destination:
Toggle Enable Archival on Source to enable.
Under Archival Destination, select from the list of available Archival Destinations (Required).
Save and Test Configuration:
Save the configuration settings.
Upload sample data to the S3 bucket and verify ingestion in the Analytics tab in the target pipeline to confirm data flow.
Example Scenarios
HealthSync Solutions, a fictitious healthcare provider specializing in telemedicine and patient data management, stores patient activity logs, medical device telemetry, and compliance audit data in an Amazon S3 bucket. To enhance observability and ensure compliance with HIPAA regulations, HealthSync integrates the AWS S3 Scan Ingestion source with the Observo AI platform. This integration enables periodic scanning of the S3 bucket to ingest JSON-formatted logs for real-time monitoring of patient interactions, anomaly detection in device data, and generating compliance reports.
Standard AWS S3 Scan Ingestion Source Setup
Here is a standard AWS S3 Scan Ingestion Source configuration example. Only the required sections and their associated field updates are displayed in the table below:
General Settings

| Field | Value | Description |
| --- | --- | --- |
| Name | healthsync-s3-scan | Unique identifier for the AWS S3 Scan Ingestion source. |
| Description | Ingest S3 logs for patient activity and compliance monitoring | Optional description of the source's purpose. |
| Scan Interval | 1m | Scans the S3 bucket every minute for near real-time data ingestion. |
| AWS Region | us-east-1 | AWS region where the S3 bucket is located. |

Bucket Scans

| Field | Value | Description |
| --- | --- | --- |
| Bucket | healthsync-logs-bucket | The AWS S3 bucket containing patient and compliance logs. |
| Prefix | logs/patient-activity/ | Bucket prefix to filter logs related to patient activities. |
| File Filter Pattern | *.json.gz | Filters for JSON files compressed with Gzip. |
| Time Pattern | yyyy/MM/dd | Time-based prefix pattern for organizing logs by year, month, and day. |
| Decode Content Type | gzip | Decompresses Gzip-compressed JSON files. |
| Max Files Per Scan | 500 | Limits to 500 files per scan to manage processing load. |
| Record Start Pattern | start=>.* | Regex to mark the start of multi-line log records. |
| Record End Pattern | .*=>end | Regex to mark the end of multi-line log records. |
| Add S3 Metadata | True | Includes S3 metadata as attributes in ingested data. |
| Add S3 File Name | True | Includes S3 file name as an attribute in ingested data. |
| Add S3 File Path | True | Includes S3 file path as an attribute in ingested data. |
| Add S3 Bucket Name | True | Includes S3 bucket name as an attribute in ingested data. |
| Extract Attributes | logs/(\w+)/(\d{4})/(\d{2})/(\d{2})/ | Regex to extract attributes from the file path. |
| Extracted Attribute Names | service, year, month, day | Names for the regex capture groups. |

Authentication

| Field | Value | Description |
| --- | --- | --- |
| Access Key | AKIAHEALTHSYNC1234567890 | AWS access key ID for accessing the S3 bucket. |
| Secret Access Key | zZ9kLmNpQrStUvWxYz1234567890AbCdEfGh | AWS secret access key for accessing the S3 bucket (securely stored). |

Advanced Settings

| Field | Value | Description |
| --- | --- | --- |
| AWS Endpoint | https://s3.us-east-1.amazonaws.com | Standard AWS S3 endpoint for the us-east-1 region. |
| Storage | healthsync-s3-storage-tracker | ID of the storage extension to track processed files. |
| Log Scanned Files | True | Logs the files scanned during each cycle for auditing purposes. |
Troubleshooting
If issues arise with the AWS S3 Scan Ingestion source in Observo AI, use the following steps to diagnose and resolve them:
Verify Configuration Settings:
Ensure fields like Bucket, Region, Prefix, and File Filter are correctly entered and match the AWS S3 setup.
Confirm that the scan interval and file filter align with the expected data in the bucket.
Check Authentication:
Verify the authentication method:
For Auto authentication, ensure IAM roles, shared credentials, or environment variables are correctly configured.
For Manual authentication, check that the access key and secret key are valid.
For Secret authentication, confirm the secret is accessible in Observo AI.
Validate Permissions:
Ensure the credentials have the required permissions: s3:GetObject and s3:ListBucket.
Check IAM policies in the AWS Console to confirm correct permissions are assigned.
Check Network Connectivity:
Verify that firewall rules, proxy settings, or VPC configurations allow traffic to the AWS S3 endpoint (s3.<region>.amazonaws.com).
Test connectivity using the AWS CLI or tools like curl to ensure Observo AI can access the S3 bucket.
Monitor Logs and Data:
Verify data ingestion by monitoring the Analytics tab in the Observo AI pipeline for throughput and event counts.
Check Observo AI logs for errors related to parsing, object access, or connection issues.
Common Error Messages:
"No data received": Ensure objects are present in the bucket and match the prefix or file filter. Verify network connectivity to the S3 endpoint.
"Invalid payload format": Confirm that the S3 objects are in a supported format (e.g., JSON, CSV, Parquet). Use the appropriate parser in the Source Log Parser settings.
"Bucket does not exist": Check the bucket name and ensure there are no certificate validation issues. Consider adding CA certificates if needed.
"Access denied": Verify that the IAM credentials have the required s3:GetObject and s3:ListBucket permissions.
| Error | Possible Cause | Resolution |
| --- | --- | --- |
| No data received | Incorrect bucket, prefix, or file filter | Verify bucket configuration and file filter |
| Invalid payload format | Unsupported data format | Use appropriate parser for JSON, CSV, Parquet |
| Bucket does not exist | Incorrect bucket name or certificate issues | Check bucket name and certificate settings |
| Access denied | Insufficient IAM permissions | Verify s3:GetObject and s3:ListBucket permissions |
| Connectivity issues | Firewall or proxy blocking access | Test network connectivity and VPC endpoints |
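The table above can be read as a lookup from failure signature to first remedy. An illustrative mapping (the error strings are the ones this guide lists; the helper itself is hypothetical and not part of the product):

```python
# Failure signatures from this guide mapped to their first remedy.
REMEDIES = {
    "No data received": "Check bucket contents, prefix/file filter, and network path to S3.",
    "Invalid payload format": "Confirm the object format and select a matching Source Log Parser.",
    "Bucket does not exist": "Verify the bucket name and any certificate configuration.",
    "Access denied": "Ensure credentials grant s3:GetObject and s3:ListBucket.",
}

def suggest_remedy(error_message: str) -> str:
    """Return the first remedy whose signature appears in the message."""
    for signature, remedy in REMEDIES.items():
        if signature.lower() in error_message.lower():
            return remedy
    return "Check Observo AI logs and S3 connectivity for details."

suggest_remedy("ingestion failed: Access denied for healthsync-logs-bucket")
# 'Ensure credentials grant s3:GetObject and s3:ListBucket.'
```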
Resources
For additional guidance and detailed information, refer to the following resources:
Best Practices:
Use Amazon S3 Inventory for large buckets to optimize scan performance.
Configure specific prefixes and file filters to reduce unnecessary scanning of irrelevant objects.
Ensure secure authentication using IAM roles or secrets to minimize credential exposure.