Dedupe

The Dedupe function removes duplicate events from your data stream based on specified criteria. This is useful for reducing redundancy and optimizing storage and processing.

Purpose

Use the Dedupe function when you want to eliminate duplicate events from your data pipeline. It is particularly helpful in scenarios where redundant data can skew analytics or increase costs.

Usage

Select the Dedupe transform, then add a Name (required) and a Description (optional).

General Configuration:

  • Bypass Transform: Defaults to disabled. When enabled, this transform is bypassed entirely, allowing events to pass through without modification.

  • Add Filter Conditions: Defaults to disabled. When enabled, incoming events are filtered through the configured conditions. Only events for which the conditions evaluate to true are processed; all others bypass this transform. Conditions are combined with AND/OR logic using the "+Rule" and "+Group" buttons.

Dedupe: Number of Events to Cache: The number of recent events to cache for comparing incoming events against previously seen events. Defaults to 5000.

Dedupe Rules: The set of event fields to evaluate. One field entry (rule) is added by default as a key-value pair. Click the Add button to add another field as a key-value pair, with the following inputs:

  • Field Names to Ignore: The deduplication mechanism matches events using all fields except for the ones specified in the "Field Names to Ignore" parameter.

  • Field Names to Match: The deduplication mechanism matches events based on the specified fields only.

Note: If neither "Field Names to Match" nor "Field Names to Ignore" is specified, all fields are used when comparing events. In that case, it is recommended to rely on the "_ob.host" and "_ob.ts" fields, as they are commonly available and help differentiate events. Ensure that any field names specified for deduplication are accurate.
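The cache-and-compare behavior described above can be sketched in Python. This is a simplified illustration under stated assumptions, not the transform's actual implementation: the function names are hypothetical, and using a sorted JSON string as the comparison key is an assumption.

```python
import json
from collections import OrderedDict

def make_key(event, match=None, ignore=None):
    """Build the comparison key for an event.

    At most one of `match` / `ignore` may be set (the transform does
    not allow both). With neither set, all fields are compared.
    """
    if match:
        fields = {k: event.get(k) for k in match}
    elif ignore:
        fields = {k: v for k, v in event.items() if k not in ignore}
    else:
        fields = event
    # Serialize with sorted keys so field order never affects comparison.
    return json.dumps(fields, sort_keys=True)

def dedupe(events, cache_size=5000, match=None, ignore=None):
    """Drop any event whose key appears among the last `cache_size` keys."""
    cache = OrderedDict()  # insertion-ordered set of recently seen keys
    kept = []
    for event in events:
        key = make_key(event, match, ignore)
        if key in cache:
            continue  # duplicate within the cache window: drop it
        kept.append(event)
        cache[key] = None
        if len(cache) > cache_size:
            cache.popitem(last=False)  # evict the oldest cached key
    return kept
```

Note how the cache size bounds memory but also bounds the window: a duplicate arriving after its original has been evicted will pass through again.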

Examples

Field Names to Ignore

Scenario: Ignore the timestamp and hostname fields that have higher cardinality to avoid skewing the deduplication process.

  • Number of Events to Cache: 1000

Dedupe Rules

  Field Names to Ignore    Field Names to Match
  timestamp                [blank]
  hostname                 [blank]

"appname":"shaneIxD",
"facility":"daemon",
"hostname":"for.us",
"message":"A bug was encountered but not in Data Plane, which doesn't have bugs",
"msgid":"ID211",
"procid":6598,
"severity":"debug",
"sourceIPAddress":"10.0.0.99",
"timestamp":"2025-01-27T17:53:42.398Z",
"version":1

Results: Duplicate events are removed based on all fields except timestamp and hostname.
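A minimal Python sketch of this example. The helper is hypothetical, assuming comparison keys are built from the fields that remain after the ignored fields are removed:

```python
import json

def key_ignoring(event, ignore):
    # Compare on every field except those in `ignore`.
    return json.dumps({k: v for k, v in event.items() if k not in ignore},
                      sort_keys=True)

a = {"hostname": "for.us", "timestamp": "2025-01-27T17:53:42.398Z",
     "severity": "debug", "msgid": "ID211"}
b = {"hostname": "for.eu", "timestamp": "2025-01-27T17:53:43.100Z",
     "severity": "debug", "msgid": "ID211"}

# Ignoring the high-cardinality fields, the two events compare equal,
# so the second would be dropped as a duplicate.
print(key_ignoring(a, {"timestamp", "hostname"}) ==
      key_ignoring(b, {"timestamp", "hostname"}))   # True
```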

Field Names to Match

Scenario: Match on the version and facility fields, which have lower cardinality and are less likely to skew the deduplication process.

  • Number of Events to Cache: 5000 (default)

Dedupe Rules

  Field Names to Ignore    Field Names to Match
  [blank]                  version
  [blank]                  facility

"appname":"shaneIxD",
"facility":"daemon",
"hostname":"for.us",
"message":"A bug was encountered but not in Data Plane, which doesn't have bugs",
"msgid":"ID211",
"procid":6598,
"severity":"debug",
"sourceIPAddress":"10.0.0.99",
"timestamp":"2025-01-27T17:53:42.398Z",
"version":1

Results: Duplicate events are removed based only on the version and facility fields.
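A minimal Python sketch of this example. The helper is hypothetical, assuming comparison keys are built only from the listed fields:

```python
import json

def key_matching(event, match):
    # Compare only on the listed fields.
    return json.dumps({k: event.get(k) for k in sorted(match)})

a = {"facility": "daemon", "version": 1, "msgid": "ID211"}
b = {"facility": "daemon", "version": 1, "msgid": "ID999"}

# Matching only on version and facility, the differing msgid is
# irrelevant: the two events are treated as duplicates.
print(key_matching(a, ["version", "facility"]) ==
      key_matching(b, ["version", "facility"]))   # True
```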

Limitations

  • If the Field Names to Ignore feature is used:

    • You cannot also use the "Field Names to Match" option.

    • It is recommended to include the commonly available "_ob.host" and "_ob.ts" fields as they can effectively differentiate events.

    • Take care to provide accurate field names for deduplication. Incorrectly specified field names can result in all events being considered duplicates, potentially causing unintended data loss.

  • If the Field Names to Match feature is used:

    • The deduplication will only consider the targeted fields for identifying duplicate events.

    • You cannot also use the "Field Names to Ignore" feature.

    • If incorrect field names are specified, all events may be considered duplicates, potentially leading to unintended data loss. Exercise caution when specifying field names.
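The failure mode above can be demonstrated in a few lines of Python. This sketch assumes (as above, hypothetically) that the comparison key is built only from the listed match fields:

```python
import json

def key_matching(event, match):
    # Compare only on the listed fields.
    return json.dumps({k: event.get(k) for k in sorted(match)})

a = {"severity": "debug"}
b = {"severity": "error"}

# "sev" is a typo: neither event contains it, so both keys collapse to
# the same value and every event looks like a duplicate of the first.
print(key_matching(a, ["sev"]) == key_matching(b, ["sev"]))   # True
```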

Best Practices

  • Define Key Fields for Deduplication: Identify and configure key fields to ignore or match, such as event_id, timestamp, or user_id, to accurately detect duplicates.

  • Validate Deduplication Results: Regularly validate the deduplication process to ensure it is working as intended and not discarding non-duplicate records.

  • Document and Communicate: Document the deduplication logic, including the key fields ignored or matched and the number of events cached, and communicate its impact to stakeholders.

  • Sample: Reduce the volume of events by setting a sampling rate.

  • Aggregate Metrics: Aggregate multiple metrics into a single metric based on a set of conditions.
