AWS VPC Flow Logs
The AWS VPC Flow Logs Optimizer in Observo AI allows users to perform various optimizations including smart summarization on AWS VPC Flow Logs data.
Purpose
The AWS VPC Flow Logs Optimizer refines raw VPC Flow Logs data by applying advanced aggregation techniques, ensuring that only high-value, actionable information is forwarded for analysis. It supports various aggregations—including smart summarization—to reduce noise and volume while highlighting critical network flow patterns. This optimization improves query performance, lowers storage costs, and enhances the overall efficiency of SIEM integration for robust security monitoring.
Usage
Select the AWS VPC Flow Logs Optimizer transform, then add a Name (required) and a Description (optional).
General Configuration:
Bypass Transform: Defaults to disabled. When enabled, this transform is bypassed entirely, allowing events to pass through without any modifications.
Add Filter Conditions: Defaults to disabled. When enabled, events are filtered through conditions: only events for which the conditions evaluate to true are processed; all others bypass this transform. Build conditions with AND/OR logic using the "+Rule" and "+Group" buttons.
AWS VPC Flow Logs Optimizer: Drop Fields (pulldown):
Enabled: Defaults to enabled, meaning all events are evaluated. Toggle Enabled off to skip processing and pass events unchanged to downstream Transforms.
Fields to Drop: The field names to drop from each event. Click the Add button to add another field to drop.
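As a rough sketch of what the Drop Fields step does (the field names below are hypothetical examples, not required settings):

```python
# Illustrative sketch of the Drop Fields step: remove the configured
# field names from each event before it moves downstream.
# These field names are hypothetical examples, not required settings.
FIELDS_TO_DROP = ["pkt-src-aws-service", "pkt-dst-aws-service"]

def drop_fields(event: dict, fields=FIELDS_TO_DROP) -> dict:
    """Return a copy of the event without the dropped fields."""
    return {k: v for k, v in event.items() if k not in fields}

event = {"srcaddr": "10.0.0.1", "dstaddr": "10.0.0.2",
         "pkt-src-aws-service": "S3"}
print(drop_fields(event))
# {'srcaddr': '10.0.0.1', 'dstaddr': '10.0.0.2'}
```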
Filter Traffic (pulldown):
Enabled: Defaults to enabled, meaning all events are evaluated. Toggle Enabled off to skip processing and pass events unchanged to downstream Transforms.
Filter events with non-OK log_status: Defaults to enabled. Filters out events whose log_status is not OK, which removes SKIPDATA and NODATA events. Toggle off to disable this filtering. For more information, see the AWS flow log documentation.
Filter traffic within private subnets: Defaults to enabled. Drops traffic flowing within private subnets. Toggle off to disable this filtering.
Filter traffic for CIDR pairs: A set of CIDR pairs to evaluate; traffic between the two ranges of a pair is dropped. Click the Add button to add a new pair, with the following inputs:
First CIDR: The first CIDR range of the pair. Example: 192.168.0.0/16.
Second CIDR: The second CIDR range of the pair. Example: 192.168.0.0/16.
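The three filter settings above can be sketched in Python, assuming the standard VPC flow log field names (log_status, srcaddr, dstaddr). The CIDR pair below is a hypothetical example, and this is a simplified model rather than Observo AI's actual implementation:

```python
import ipaddress

# Hypothetical example pair: drop traffic between these two ranges.
CIDR_PAIRS = [(ipaddress.ip_network("10.0.0.0/16"),
               ipaddress.ip_network("10.1.0.0/16"))]

def should_drop(event: dict) -> bool:
    # 1. Filter events with non-OK log_status (removes SKIPDATA / NODATA).
    if event.get("log_status") != "OK":
        return True
    src = ipaddress.ip_address(event["srcaddr"])
    dst = ipaddress.ip_address(event["dstaddr"])
    # 2. Filter traffic within private subnets (both endpoints private).
    if src.is_private and dst.is_private:
        return True
    # 3. Filter traffic between configured CIDR pairs, in either direction.
    for a, b in CIDR_PAIRS:
        if (src in a and dst in b) or (src in b and dst in a):
            return True
    return False

print(should_drop({"log_status": "NODATA",
                   "srcaddr": "8.8.8.8", "dstaddr": "1.1.1.1"}))  # True
print(should_drop({"log_status": "OK",
                   "srcaddr": "8.8.8.8", "dstaddr": "1.1.1.1"}))  # False
```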
Smart Summarization (pulldown):
Enabled: Defaults to enabled, meaning all events are evaluated. Toggle Enabled off to skip processing and pass events unchanged to downstream Transforms.
Aggregation Interval (seconds): The interval, in seconds, over which events are summarized. Default: 60.
Aggregation (pulldown):
Enabled: Defaults to disabled, meaning events are NOT evaluated. Toggle Enabled on to process events and feed aggregated data to downstream Transforms.
Field Names to Aggregate By: A comma-separated list of columns to group by and merge. Use the Add button to add more as needed.
Default examples: srcaddr, dstaddr, srcport, dstport, action.
Max Events: The maximum number of events to group together. Default: 100.
Flush Time (seconds): The maximum time, in seconds, to wait before flushing grouped events to the Destination. Default: 30.
Aggregation Methods: The per-field methods used to merge values within a group (defaults listed in the table below). Click the Add button to add a new field as a key-value pair, with the following inputs:
Field Name: The name of the field whose value is being aggregated.
Aggregation Method: The method used to aggregate the values of the field. For example, if the field is an integer, you can sum the values, or keep the maximum value. If the field is a string, you can keep the first value, or keep the latest value. Here are the possible methods:
Keep first value
Keep last value
Keep maximum value
Keep minimum value
Sum values
| Field Name (Defaults) | Aggregation Method |
| --- | --- |
| start | Keep minimum value |
| end | Keep maximum value |
| protocol | Keep last value |
| tcp_flags | Keep maximum value |
| traffic_path | Keep last value |
| version | Keep maximum value |
Examples
Aggregate Logs
Scenario: Aggregate a set of fields based on (1) the Field Names to Aggregate By and (2) the Aggregation Methods settings. Max Events and Flush Time (seconds) control how often groups are flushed.
Aggregation (pulldown) example settings:
Enabled: Toggled on
Max Events: 100
Flush Time (seconds): 30
Field Names to Aggregate By: srcaddr, dstaddr, srcport, dstport
Aggregation Methods:

| Field Name | Aggregation Method |
| --- | --- |
| srcaddr | Keep first value |
| dstaddr | Keep first value |
| srcport | Keep first value |
| dstport | Keep first value |
| start | Keep first value |
| end | Keep last value |
| packets | Sum values |
| bytes | Sum values |
Input (3 log entries):

| srcaddr | dstaddr | srcport | dstport | start | end | packets | bytes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ip1 | Ip2 | 32000 | 8080 | 01:00:23 | 01:00:25 | 10 | 200 |
| Ip1 | Ip2 | 32000 | 8080 | 01:00:24 | 01:00:27 | 5 | 100 |
| Ip1 | Ip2 | 40000 | 8080 | 01:00:23 | 01:00:24 | 5 | 50 |
Output (aggregated log entries):

| srcaddr | dstaddr | srcport | dstport | start | end | packets | bytes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ip1 | Ip2 | 32000 | 8080 | 01:00:23 | 01:00:27 | 15 | 300 |
| Ip1 | Ip2 | 40000 | 8080 | 01:00:23 | 01:00:24 | 5 | 50 |
Results: Log entries whose group-by fields match are merged into one entry, and the associated packets and bytes fields are summed.
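The grouping and merging above can be sketched in Python. This is a simplified model of the behavior described in the example (only the "first", "last", and "sum" methods are shown), not the actual implementation:

```python
from collections import defaultdict

# Group-by key and per-field methods taken from the example settings above.
GROUP_BY = ("srcaddr", "dstaddr", "srcport", "dstport")
METHODS = {"start": "first", "end": "last", "packets": "sum", "bytes": "sum"}

def aggregate(events):
    groups = defaultdict(list)
    for e in events:
        groups[tuple(e[f] for f in GROUP_BY)].append(e)
    out = []
    for key, evts in groups.items():
        merged = dict(zip(GROUP_BY, key))
        for field, method in METHODS.items():
            vals = [e[field] for e in evts]
            if method == "sum":
                merged[field] = sum(vals)
            elif method == "last":
                merged[field] = vals[-1]
            else:  # "first" (the real transform also offers min/max)
                merged[field] = vals[0]
        out.append(merged)
    return out

logs = [
    {"srcaddr": "Ip1", "dstaddr": "Ip2", "srcport": 32000, "dstport": 8080,
     "start": "01:00:23", "end": "01:00:25", "packets": 10, "bytes": 200},
    {"srcaddr": "Ip1", "dstaddr": "Ip2", "srcport": 32000, "dstport": 8080,
     "start": "01:00:24", "end": "01:00:27", "packets": 5, "bytes": 100},
    {"srcaddr": "Ip1", "dstaddr": "Ip2", "srcport": 40000, "dstport": 8080,
     "start": "01:00:23", "end": "01:00:24", "packets": 5, "bytes": 50},
]
for row in aggregate(logs):
    print(row)
# The two srcport=32000 entries merge: packets 15, bytes 300,
# start 01:00:23, end 01:00:27. The srcport=40000 entry is unchanged.
```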
Smart Summarization
Smart summarization condenses network flow data by identifying and collapsing ephemeral ports within VPC network flows.
Consider the scenario where two IP addresses, namely 'ip1' and 'ip2,' are engaged in communication. Let's assume that 'ip2' serves as a server, actively listening on port 8080, while 'ip1' initiates the connection.
Within the same capture window for a flow log, there can be multiple instances of network interactions between 'ip1' and 'ip2.' However, what remains constant in all of these interactions is that 'ip1' is communicating with 'ip2' on port 8080. Other details, such as the ephemeral ports used for this communication, become less significant.
Utilizing this insight, we can treat these flows uniformly as instances of 'ip1' communicating with 'ip2' on port 8080.
The original data, which includes source and destination addresses (srcaddr, dstaddr), source and destination ports (srcport, dstport), start and end times, and packet and byte counts, looks like this:
| srcaddr | dstaddr | srcport | dstport | start | end | packets | bytes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ip1 | Ip2 | 32456 | 8080 | 01:00:23 | 01:00:25 | 15 | 200 |
| Ip1 | Ip2 | 32458 | 8080 | 01:00:24 | 01:00:27 | 5 | 100 |
After summarization, the data is transformed into the following format:
| srcaddr | dstaddr | srcport | dstport | start | end | packets | bytes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ip1 | Ip2 | - | 8080 | 01:00:23 | 01:00:27 | 20 | 300 |
Before aggregation, we group the flow logs by their start times. Because incoming flow logs have irregular start timestamps, each entry's start timestamp is aligned to the nearest interval boundary (e.g., 01:00, 02:00, 03:00).
During the aggregation process, the earliest start time and the latest end time are selected. For aggregating packet and byte counts, the method used is addition.
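The steps described above (masking the ephemeral source port, aligning start timestamps to an interval boundary, then keeping the earliest start and latest end while summing packets and bytes) can be sketched in Python. Timestamps are modeled as epoch seconds for simplicity, and this is an illustrative simplification rather than the actual Observo AI implementation:

```python
from collections import defaultdict

INTERVAL = 60  # aggregation interval in seconds

def summarize(flows):
    groups = defaultdict(lambda: {"packets": 0, "bytes": 0,
                                  "start": None, "end": None})
    for f in flows:
        # Align the start timestamp down to the interval boundary.
        bucket = (f["start"] // INTERVAL) * INTERVAL
        # srcport is treated as ephemeral and masked with '-'.
        key = (f["srcaddr"], f["dstaddr"], "-", f["dstport"], bucket)
        g = groups[key]
        g["packets"] += f["packets"]   # sum packet counts
        g["bytes"] += f["bytes"]       # sum byte counts
        g["start"] = f["start"] if g["start"] is None else min(g["start"], f["start"])
        g["end"] = f["end"] if g["end"] is None else max(g["end"], f["end"])
    return dict(groups)

# The two flows from the example above, with times as seconds past 01:00:00
# of some hour (3623 corresponds to 01:00:23, and so on).
flows = [
    {"srcaddr": "Ip1", "dstaddr": "Ip2", "srcport": 32456, "dstport": 8080,
     "start": 3623, "end": 3625, "packets": 15, "bytes": 200},
    {"srcaddr": "Ip1", "dstaddr": "Ip2", "srcport": 32458, "dstport": 8080,
     "start": 3624, "end": 3627, "packets": 5, "bytes": 100},
]
result = summarize(flows)
# One summarized flow keyed ('Ip1', 'Ip2', '-', 8080, 3600):
# packets 20, bytes 300, start 3623, end 3627.
```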
AWS VPC Flow Logs Optimizer Best Practices
Here’s a breakdown of best practices when using Observo AI’s VPC Flow Logs Optimizer, which leverages techniques like dropping fields, filtering traffic, smart summarization, and aggregation:
Drop Fields
Identify Low-Value Data: Review the default 29 fields emitted by VPC flow logs and determine which ones are not used for your security, troubleshooting, or compliance needs.
Early Data Reduction: Drop extraneous fields at the ingestion stage to reduce data volume and processing cost without impacting key insights.
Filter Traffic
Focus on High-Value Flows: Set rules to exclude internal or redundant traffic that does not contribute to your analytical objectives.
Tailor Filtering by Context: Use criteria like subnet, CIDR ranges, or specific interface IDs to drop traffic that is known to be “noisy” or irrelevant.
Reduce Unnecessary Log Entries: For example, filter out flows with minimal activity or those that simply indicate “NODATA” events (if applicable), ensuring that your logs only include actionable traffic.
Smart Summarization
Automated Flow Grouping: Leverage ML-powered smart summarization to automatically identify network flows (using the key tuple: source IP, source port, destination IP, destination port, and protocol).
Volume Reduction: By aggregating similar flows, you can reduce log volume by over 80% while preserving important statistics like packet counts, bytes transferred, and time ranges.
Zero-Click Efficiency: This feature works without manual intervention, meaning your system continually adapts and maintains high-level insight with lower data noise.
Aggregation
Custom Aggregation Semantics: In addition to smart summarization, provide options for custom aggregations that let you define how network flows should be grouped based on your domain or infrastructure specifics.
Improved Query Performance: Aggregated data not only reduces storage costs but also speeds up downstream queries and analysis, as smaller, summarized datasets are much faster to process.
Overall Recommendations
Combine Techniques for Maximum Efficiency: By first dropping non-essential fields and filtering out low-value traffic, you minimize the volume before applying smart summarization and aggregation.
Automate Where Possible: Use Observo AI’s dynamic pipelines that automatically adjust to the incoming data, reducing the need for constant manual tuning and boosting developer productivity.
Retain Analytical Integrity: Ensure that any reduction in data volume does not compromise critical insights required for security monitoring, troubleshooting, or cost analysis.
These best practices help you achieve a more efficient observability pipeline, lower storage and processing costs, and improve the overall performance of your AWS VPC Flow Log analysis.
Related Functions
CloudTrail Optimizer: Transform group to process AWS CloudTrail events.
GCP Flow Logs: Optimize VPC flow logs using this transform.