Dedupe
The Dedupe function removes duplicate events from your data stream based on specified criteria. This is useful for reducing redundancy and optimizing storage and processing.
Purpose
Use the Dedupe function when you want to eliminate duplicate events from your data pipeline. It is particularly helpful in scenarios where redundant data can skew analytics or increase costs.
Usage
Select the Dedupe transform, then add a Name (required) and a Description (optional).
General Configuration:
Bypass Transform: Defaults to disabled. When enabled, this transform will be bypassed entirely, allowing the event to pass through without any modifications.
Add Filter Conditions: Defaults to disabled. When enabled, incoming events are filtered through the configured conditions. Only events for which the conditions evaluate to true are processed; all others bypass this transform. Build conditions with AND/OR logic using the "+Rule" and "+Group" buttons.
Dedupe: Number of Events to Cache: The number of recently seen events to cache for comparison against incoming events. Defaults to 5000.
Dedupe Rules: The set of event fields to evaluate. One field entry (a single rule) is added by default as a key-value pair. Click the Add button to add more fields, each with the following inputs:
Field Names to Ignore: The deduplication mechanism matches events using all fields except for the ones specified in the "Field Names to Ignore" parameter.
Field Names to Match: The deduplication mechanism matches events based on the specified fields only.
Note: If neither "Field Names to Match" nor "Field Names to Ignore" is specified, events are compared across all of their fields. In that case, it is recommended to use the "_ob.host" and "_ob.ts" fields, as they are commonly available and help differentiate events. Make sure the field names you specify for deduplication are accurate.
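The configuration above can be sketched as follows. This is an illustrative model only (the function and parameter names are our own, not the product's API): each event is reduced to a comparison key built from either the match fields or every field except the ignored ones, and an event is dropped if its key appears among the last N cached keys.

```python
from collections import OrderedDict

def make_key(event, match_fields=None, ignore_fields=None):
    """Build the comparison key for an event.

    If match_fields is given, only those fields form the key;
    otherwise every field except those in ignore_fields is used.
    """
    if match_fields:
        items = [(f, event.get(f)) for f in match_fields]
    else:
        ignore = set(ignore_fields or ())
        items = sorted((k, v) for k, v in event.items() if k not in ignore)
    return tuple(items)

def dedupe(events, cache_size=5000, match_fields=None, ignore_fields=None):
    """Yield only events whose key was not seen in the last cache_size events."""
    cache = OrderedDict()  # recently seen keys, oldest first
    for event in events:
        key = make_key(event, match_fields, ignore_fields)
        if key in cache:
            continue  # duplicate within the cache window: drop it
        cache[key] = True
        if len(cache) > cache_size:
            cache.popitem(last=False)  # evict the oldest cached key
        yield event
```

Because the cache is bounded, a duplicate arriving more than `cache_size` events after the original would pass through again; choose the cache size accordingly.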
Examples
Field Names to Ignore
Scenario: Ignore the timestamp and hostname fields, which have higher cardinality, to avoid skewing the deduplication process.
Number of Events to Cache: 1000
Dedupe Rules:
- Field Names to Ignore: timestamp, hostname
- Field Names to Match: (blank)
"appname":"shaneIxD",
"facility":"daemon",
"hostname":"for.us",
"message":"A bug was encountered but not in Data Plane, which doesn't have bugs",
"msgid":"ID211",
"procid":6598,
"severity":"debug",
"sourceIPAddress":"10.0.0.99",
"timestamp":"2025-01-27T17:53:42.398Z",
"version":1Results: Would reduce the deduplicated events while ignoring the timestamp and hostname fields.
Field Names to Match
Scenario: Match on the version and facility fields, which have lower cardinality and skew the deduplication process less.
Number of Events to Cache: 5000 (default)
Dedupe Rules:
- Field Names to Ignore: (blank)
- Field Names to Match: version, facility
"appname":"shaneIxD",
"facility":"daemon",
"hostname":"for.us",
"message":"A bug was encountered but not in Data Plane, which doesn't have bugs",
"msgid":"ID211",
"procid":6598,
"severity":"debug",
"sourceIPAddress":"10.0.0.99",
"timestamp":"2025-01-27T17:53:42.398Z",
"version":1Results: Would reduce the deduplicated events while more heavily matching on the version and facility fields.
Limitations
If the Field Names to Ignore feature is used:
You cannot also use the "Field Names to Match" option.
It is recommended to include the commonly available "_ob.host" and "_ob.ts" fields, as they effectively differentiate events.
Take care to provide accurate field names for deduplication: incorrectly specified field names can cause all events to be considered duplicates, potentially resulting in unintended data loss.
If the Field Names to Match feature is used:
Deduplication considers only the targeted fields when identifying duplicate events.
You cannot also use the "Field Names to Ignore" feature.
If incorrect field names are specified, all events may be considered duplicates, potentially leading to unintended data loss. Exercise caution when specifying field names.
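The data-loss risk above is easy to see in a sketch (illustrative helper names only; this is not the product's API): if a match field name is misspelled, that field is missing from every event, so every event produces the same key and everything after the first is dropped.

```python
def match_key(event, match_fields):
    # Missing fields resolve to None, so a misspelled field name
    # contributes the same (name, None) pair to every event's key.
    return tuple((f, event.get(f)) for f in match_fields)

events = [{"version": 1}, {"version": 2}, {"version": 3}]

# "verison" is a typo: the field exists in no event,
# so all three keys collapse to the same value.
keys = {match_key(e, ["verison"]) for e in events}
print(len(keys))  # 1 -> every event after the first would be dropped
```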
Best Practices
Define Key Fields for Deduplication: Identify and configure key fields to ignore or match, such as event_id, timestamp, or user_id, to accurately detect duplicates.
Validate Deduplication Results: Regularly validate the deduplication process to ensure it is working as intended and not discarding non-duplicate records.
Document and Communicate: Document the deduplication logic, including the key fields ignored or matched and the number of events cached, and communicate its impact to stakeholders.
Related Functions
Sample: Reduce the volume of events by setting sampling rate.
Aggregate Metrics: Aggregate multiple metrics into a single metric based on a set of conditions.