Purpose
Find duplicate records within a single Data Source using the specified matching method. From the duplicate records, choose one of the records as the surviving record. The results can then be further processed, for example, use a merge function to eliminate the non-surviving duplicate records.
Category Location: All, Deduplication
Field Description
- Manual review of de-duplication result: Check this box to save the results to a manual data source. This allows for more review options.
- Drop duplicate records: Check this box to only output non-duplicates or surviving records to the results.
- How to identify duplicates: Select the attributes to match exactly on the left and/or to match with the fuzzy option on the right. There are fuzzy level options in the advanced configuration section after an attribute has been selected.
- How to determine surviving record: Select how to determine the surviving record.
- Add Criteria: Select this option to add more criteria to determine the surviving record.
- How to merge records and review results: Check this box to set merging requirements. Options are: Fill only if empty from non-surviving records, use value from any record where, append values from all records, and do not merge.
- Advanced Configurations: This section will contain fuzzy options if you've matched an attribute in that section. Additionally, select the Add Exception (arrow) option to add an exception to the how to merge records logic. (i.e. The selection "fill only if empty" has been selected in the how to merge records section, but you'd like the Description field to append values from all records. You can add that exception for the Description field here)
Tips
- You can use more than one attribute to determine if records are duplicates. For each attribute, use exact matching or fuzzy matching for duplicate identification.
- For fuzzy matching, the fuzziness factor can be further fine-tuned via advanced configuration.
- Fuzzy matching can result in false-positive matches, so be sure to review your results closely to make sure your job produces the results you intend.
- As an example, if setting a character limit of 5 for fuzzy matching, any data with less than 5 characters will match exactly.
- Identification of the surviving record within a group of duplicates uses an elimination methodology where you can define a set of criteria that are applied in priority order to try to achieve only 1 single surviving record.
- You can use the "Manual review of de-duplication result" section to specify a data source to store the de-duplication results and then review it using the "Data - Review De-dupe" option.
- The de-duplication task supports multiple actions to perform sequential de-duplication. The result of each action is stored in attributes "Duplicates1", "Duplicates2", etc… and "Merged Attributes1", "Merged Attributes2", etc… where the number is the corresponding action in top to bottom order. The final results are stored in the attributes "Duplicates" and "Merged Attributes".
- Please note that this task does not update your target system. Your job will need to push the deduplication request to your target system to complete the process. Refer to the following task templates to merge records: SFDC Merge Records, Marketo Merge Leads.
Multiple Actions
- You can use multiple actions to perform sequential de-duplication.
- Example: Action 1 you match records on Email, Company - Action 2 you match records on Email only
- When performing multiple actions, attributes output will include Duplicates, Duplicate 1, Duplicate 2, Duplicate Set IDs, Duplicate Set IDs 1, Duplicate Set IDs 2, Merged Attributes, Merged Attributes 1, Merged Attributes 2.
Using the Image above:
- Column A (Duplicates) is the final information on which record is surviving, which in the image above is the record in row 1. Column D (Duplicate Set IDs) is the final grouping result - it contains the record id of the surviving merged record. Sorting by this field shows you the final duplicate grouping results.
- Duplicates 1 and Duplicate Set IDs 1 are the results from Action 1 in the template. Duplicates 2 and Duplicate Set IDs 2 are the results from Action 2 in the template.
- If you look at Column B (Duplicates 1), you can see that all 3 records are Surviving Merged - in Column E (Duplicate Set IDs 1), you can see that they all have different values, which means they are all surviving merged because they were not matched in Action 1 which was matching on Email and Company (we can see the variations in the Company field in Column H).
- Duplicates 2 is showing the record in row 1 as being the surviving merged record and rows 2 & 3 being the non-surviving records in this group. Duplicate Set IDs2 shows the id for the record in row 1. Additionally, the Duplicate Set Ids attribute (column D) is the final information and also shows this record.
- If a duplicate group is found in Action 1, the Non-surviving record will be removed from further evaluation, but the Surviving Merged record will continue to be compared to the rest of the records in the input data source in subsequent actions.
Merging
-
The merge process is controlled by a default option and can be overridden for individual attributes. There are 4 different options on how to merge data from non-surviving records into the surviving record:
-
"Fill only if empty from non-surviving records" - This option will fill in empty attributes of the surviving record using available values from non-surviving records to achieve maximum completeness.
- The non-surviving records are selected in order of any date attribute and based on your setting of earliest / latest order.
- Once all the available values are harvested from the first non-surviving record, the next latest non-surviving record will be harvested.
- This process is repeated until the surviving record has no more empty attributes, or when all the non-surviving records have been harvested.
- "Always overwrite from non-surviving records" - This option will overwrite the data in the surviving record from a record within the duplicate group where the record is selected in order of any date attribute and based on the setting of earliest / latest order. Note that the surviving record is also within the duplicate group and your logic may indicate that the surviving record is the selected record and, in this case, no data is overwritten.
- "Append values from non-surviving records" - This option will append all the data from non-surviving records into the surviving record using a specified delimiter.
- "Never merge values from non-surviving records" - This option will prevent any data merge into the surviving record.
-
"Fill only if empty from non-surviving records" - This option will fill in empty attributes of the surviving record using available values from non-surviving records to achieve maximum completeness.
- Use the "Advanced Configuration" and "Add Exception" buttons to override the default merge logic on a per attribute level. Each exception can use different merge logic to achieve even the most complicated merge requirements.
- Thoughtfully select only the attributes that are required for merging. Many attributes are populated by other processes and do not need to be merged.
- Please note that the merge option creates a merged record in Openprise, but does not update your target system. Your job will need to push the merged records to your target system to update the winning record. Refer to the task template Export: Add / Update for details on updating records.
Examples
Find duplicate records where Email Address and Company Name attributes are identical.
- The surviving record is the one with Lead Source has value and Job Title has value. If there are multiple records remaining based on those criteria, then use the record with the earliest Created Date.
- By default, fill in the surviving record's empty attributes from records within the duplicates group using the data from the latest Modified Date record first.
- However, for contact information attributes Address, City, State, Zip, Country, and Phone Number, always overwrite them with data from records within the duplicates group using the data from the latest Modified Date.
- Also, for the Notes attribute, always append all the data from the duplicates group into the surviving record.