The Deduplication
feature in Fluvio is the way to deduplicate records based on their keys.
Deduplication can be used on a topic.
It will drop the duplicate records in a window which is defined by the configured bounds
.
The current supported bounds
are age
and count
, they are explained in the bounds section.
The deduplication function is deterministic and survives restarts. During restart the deduplication algorithm scans the data stream and rebuilds the memory object.
Example configuration on topic:
# topic.yaml
version: 0.1.0
meta:
name: topic-with-dedup
deduplication:
bounds:
count: 5 # remember at least 5 last records
age: 5s # remember records for at least 5 seconds
filter:
transform:
uses: infinyon-labs/dedup-filter@0.0.2
A topic can be created using this config file like so:
$ fluvio topic create -c topic.yaml
After creating the topic, it can be tested like so:
$ fluvio produce topic-with-dedup --key-separator :
1:2
1:2
2:5
$ fluvio consume -B topic-with-dedup
2
5
Parameter | default | type | optional | description |
---|---|---|---|---|
count | - | Integer | false | Minimum number of records the filter will remember. It doesn’t guarantee to remember records that came count records before now. |
age | - | Integer | true | Minimum amount of time this filter will remember a record for. It can be specified using this format: 15days 2min 2s , or 2min 5s , or 15ms |
The specific algorithm used for deduplication is defined by a smartmodule. Right now the only smartmodule for this is the dedup-filter
smartmodule.
The dedup-filter
splits the data into smaller chunks and keeps these chunks in memory. Each chunk has an age so it knows if it can
drop a chunk based on it’s age.
When count of total records in memory exceeds the configured bounds.count
, it will check if it can delete the oldest record chunk using the configured bounds.age
and delete it if it can. This allows for fast deletion of old data with minimum amount of bookkeeping.
The chunking of data introduces a small memory usage overhead but allows the filter to have no spikes in execution time and memory usage.