Purpose
Generates a reference data source for use by a normalization task. The task analyzes any attribute and groups similar values. The output is a reference data source with two attributes: Normalized value and Alias value.
Category Location: All, Match and lookup
Field Description
- Generate normalization reference data for this attribute: Select the attribute for which to generate reference data.
- Advanced configurations: Identification of similar values is based on a combination of the algorithms in this section.
- Degree of fuzziness: Controls how loosely values are fuzzy matched. Accepts 0.1 to 1, where 0.1 is maximum fuzziness.
- Percentage of leading text that must match: The minimum percentage of leading text that two values must share to be considered similar.
- Ignore if characters less than: Similarity is not checked when the character length is less than the value specified in this field.
Tips
- The cleaner the source data, the better the result. When generating company name reference data, use the Company Name Clean Up task to pre-clean the data first.
- Identification of similar values is based on a combination of these algorithms:
- Fuzzy matching based on your parameters
- Values that begin with identical words
- Over 90% similarity
- The matching algorithm is not case sensitive.
- The matching algorithm ignores short words. The length threshold is configurable and defaults to 3 characters.
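The combination of rules listed in these tips can be illustrated with a minimal Python sketch. This is not the task's actual implementation: the function names, the use of `difflib.SequenceMatcher` as the similarity measure, and the default `fuzziness` of 0.8 are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def significant_words(value, min_chars=3):
    # Lowercase for a case-insensitive comparison and drop short words.
    # min_chars mirrors the documented default threshold of 3.
    return [w for w in value.lower().split() if len(w) >= min_chars]


def is_similar(a, b, fuzziness=0.8, min_chars=3):
    # Hypothetical similarity check; the 0.8 default is an assumption.
    wa = significant_words(a, min_chars)
    wb = significant_words(b, min_chars)
    if not wa or not wb:
        return False  # nothing left to compare after dropping short words
    # Rule: values that begin with identical words.
    if wa[0] == wb[0]:
        return True
    # Rules: over 90% similarity, or a fuzzy match within the configured
    # threshold (a lower threshold accepts looser matches).
    ratio = SequenceMatcher(None, " ".join(wa), " ".join(wb)).ratio()
    return ratio > 0.9 or ratio >= fuzziness


def group_values(values, fuzziness=0.8, min_chars=3):
    # Map each normalized (primary) value to the list of its aliases.
    groups = {}
    for value in values:
        for primary in groups:
            if is_similar(primary, value, fuzziness, min_chars):
                groups[primary].append(value)
                break
        else:
            groups[value] = []  # first occurrence becomes the primary value
    return groups


groups = group_values(["Toyota", "Toyota motor", "toyota USA", "Honda", "Hondda"])
```

In this sketch the first value seen in each group becomes the normalized value, which is a simplification; the task may choose the primary value differently.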
Examples
- Generate reference data to normalize company names.
- Primary value = Toyota
- Aliases = Toyota motor, Toyota motor sales, Toyota usa, Toyota financial services
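To show how a normalization task might consume this output, the following Python sketch treats the example as rows of (Normalized value, Alias value) and builds a lookup from them. The helper names and the case-insensitive lookup are assumptions for illustration, not the product's documented behavior.

```python
# Reference rows as (Normalized value, Alias value) pairs,
# taken from the Toyota example above.
reference_rows = [
    ("Toyota", "Toyota motor"),
    ("Toyota", "Toyota motor sales"),
    ("Toyota", "Toyota usa"),
    ("Toyota", "Toyota financial services"),
]


def build_lookup(rows):
    # Hypothetical helper: map each alias (and the normalized value itself)
    # to the normalized value, keyed in lowercase.
    lookup = {}
    for normalized, alias in rows:
        lookup[alias.lower()] = normalized
        lookup[normalized.lower()] = normalized
    return lookup


def normalize(value, lookup):
    # Return the normalized value if the input is a known alias;
    # otherwise leave the input unchanged.
    return lookup.get(value.strip().lower(), value)


lookup = build_lookup(reference_rows)
print(normalize("TOYOTA USA", lookup))
print(normalize("Honda", lookup))
```

Values with no entry in the reference data pass through unchanged, so the lookup is safe to apply to a whole column.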