How to Dedupe When You Have Over 10,000 Records in a Duplicate Group
When you are deduplicating a large data set and a single duplicate group contains more than 10,000 records, you will receive an error message:
To work around this, you can create an additional Deduplication task that limits the number of records in any one duplicate group.
In task 4, above, we start by deduplicating on three attributes to limit the number of records in any one duplicate group:
While in this case the final deduplication will be based on OP State, OP Year-Quarter, and OP Channel, here we use OP Sugar Free-Classic-Flavor-Item Type to cut down the number of records in each duplicate group. Because this data set has only 200-300 distinct values in the OP Sugar Free-Classic-Flavor-Item Type attribute, this easily gives us fewer than 10,000 records per duplicate group.
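The task configuration screenshots aren't reproduced here, but the logic of this first pass can be sketched in pandas. Everything in the sketch is an assumption for illustration: the DataFrame and its sample values, and using keep="first" as a stand-in for whatever survivorship rule your Deduplication task applies. Grouping on the final criteria plus the low-cardinality attribute is one safe way to structure the pass, because it only ever merges records that are also duplicates under the final criteria.

```python
import pandas as pd

# Hypothetical sample records using the attribute names from this article.
df = pd.DataFrame({
    "OP State":        ["CA", "CA", "CA", "NY"],
    "OP Year-Quarter": ["2023-Q1", "2023-Q1", "2023-Q1", "2023-Q1"],
    "OP Channel":      ["Retail", "Retail", "Retail", "Online"],
    "OP Sugar Free-Classic-Flavor-Item Type":
        ["SF-Cola-Can", "SF-Cola-Can", "Classic-Cola-Can", "SF-Lemon-Bottle"],
})

final_keys = ["OP State", "OP Year-Quarter", "OP Channel"]
splitter = "OP Sugar Free-Classic-Flavor-Item Type"

# Pass 1: dedupe on the final keys *plus* the low-cardinality splitter.
# Adding the splitter subdivides each oversized duplicate group into many
# smaller subgroups, keeping every group under the 10,000-record limit.
pass1 = df.drop_duplicates(subset=final_keys + [splitter], keep="first")
```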
If you are working with other kinds of data, good attributes for narrowing down duplicate groups include country (there are fewer than 200 countries in the world) or, for US data, state (fewer than 100 possible values). This is a case where knowing your data well, and knowing how many possible values an attribute can hold, helps you work out the best approach.
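Before building the task, you can sanity-check a candidate attribute by counting its distinct values and the worst-case group size it would produce. Continuing the pandas sketch above:

```python
# How many distinct values does the candidate splitter attribute have?
print(df[splitter].nunique())

# Worst case: the largest duplicate group the first pass would see.
# This number is what has to stay under 10,000.
print(df.groupby(final_keys + [splitter]).size().max())
```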
After this task, we filter out the non-surviving records, so we are working with a smaller number of records in the next dedupe task.
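In the job itself this filtering is a separate task that runs after the Deduplication task. The equivalent step in the sketch is to flag one survivor per group and keep only the flagged rows; duplicated() marks every record after the first in each group, so negating it flags the survivors. This produces the same result as the drop_duplicates one-liner above:

```python
# Flag one survivor per duplicate group, then filter out the non-survivors,
# mirroring a Deduplication task followed by a filter task.
df["is_survivor"] = ~df.duplicated(subset=final_keys + [splitter], keep="first")
pass1 = df[df["is_survivor"]].drop(columns="is_survivor")
```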
In the next task, we deduplicate the remaining records based on the final duplicate criteria we are looking for (here, OP State, OP Year-Quarter, and OP Channel):
And then we filter out the non-surviving records again:
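Continuing the sketch, the second pass runs against the much smaller surviving record set, grouping on the final criteria alone, and then filters out the non-survivors the same way:

```python
# Pass 2: with far fewer records per group, dedupe on the final
# criteria alone, then drop the non-survivors again.
pass1["is_survivor"] = ~pass1.duplicated(subset=final_keys, keep="first")
final = pass1[pass1["is_survivor"]].drop(columns="is_survivor")
```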
This will give you your final deduplication based on the attributes you are using to identify your duplicates.
In some cases you might need more than two deduplication passes in your job, as in the job below, so experiment with your data and see what works best for you.
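The same pattern generalizes to any number of passes: list your grouping key sets from most specific to least specific, with the final criteria last, and dedupe in that order. A sketch, again using the hypothetical names from the example above:

```python
# N-pass generalization: each earlier key set must keep all of its
# duplicate groups under the 10,000-record limit.
passes = [
    final_keys + [splitter],  # pass 1: most specific grouping
    final_keys,               # final pass (insert more key sets above as needed)
]

result = df.drop(columns="is_survivor", errors="ignore")  # original records
for keys in passes:
    # Keep one survivor per group, shrinking the data before the next pass.
    result = result.drop_duplicates(subset=keys, keep="first")
```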