This article describes one method to build a custom reference table that can be used for segmentation.
Prerequisites
- Pull your data into Openprise if needed. For example, if you want to segment based on job title, you’ll need your people records pulled into Openprise.
- Have a list of words/phrases/categories you want to segment your data into. In this example, we’re segmenting IT Sub-Function into a small handful of categories.
Process
- Use the Word Frequency analysis in the Analytics feature of Openprise to get a list of the top words you have in your data. For example, look for the top 300 words in Job Title. These will become your keywords for the next step.
- Download the data from the Word Frequency chart by selecting Action/Download Data.
- Use the downloaded data to create a Google or Excel spreadsheet. Our sheet will have 3 columns: Keyword (use the word column from the downloaded data), Priority, IT Job Sub-Function.
- Remove any keywords that you don't want to use for segmentation. For example, the keyword President doesn't indicate an IT Sub-function so it can be removed.
-
For each of those 300 keywords, assign a category from your list.
-
Assign a priority to the keyword/category pair. A suggested method of assigning the right priority is as follows, and is based on a 10 point scale. Remember 1 is highest priority.
- Regular single words should be priority 3. Most keywords are in this category.
- More “generic” single words should be priority 4.
- Multi-word combos should be priority 1 because they are more specific.
- Very specific brand names or product names should also be 1.
- A word that can be part of another word or word combo should be 5.
- Priority 2 is usually reserved for more complicated multi-word scenarios where you need granular differentiation between priorities 1 and 3.
- Last resort words should be priority 10.
For example, from our current keyword list for IT Job Sub-function:
Oracle E-Business = 1 (very specific)
Information Officer = 2 (a fairly generic version that can potentially be superseded by another more specific word)
Compliance = 3
Architecture = 4 (this usually has a modifier that goes along with it that will likely have higher priority)
CTO = 5 (this is part of the word Director and Contractor)
Technologist = 10
Import your new sheet and create a job to use it
- Create a new data source to use as a reference, and import your new table with keywords/priority/category.
- Create a test Job that uses the Infer task to compare the data in an attribute (ie. “Title”) with the keyword in your new reference Data Source. You will typically use the match logic “contains” in your Infer rule logic. Make sure you set the option “If more than one matches are found” to “Write single value using priority”.
- Using the Infer Task, capture the keyword (so you can see what matched), category (so you can see the match results) and priority (so you can see what priority won) in new attributes.
To fine-tune your reference Data Source:
- Look at all the keywords matched and the priority associated with each keyword in multi-value attributes.
- When reviewing the results, if a job title isn’t segmented properly, then you can easily see why by looking at all the matched keywords and the associated priority.
- Adjust the keywords and priorities to get the results you want.
- When making adjustments, it is helpful to keep a back-up file in case your changes need to be reverted because the new version doesn't produce better results.
For example:
Say “Cloud Security Architect” is the title, and you matched on these 3 keywords:
Security wins, sub-function = information security. If you don’t like this result and want this title to end up with sub-function = systems, then you can either:
- Change the priorities so Cloud wins over Security, or
- Create a new keyword “Cloud Security” = 1 with sub-function = systems
Conclusion
There is a bit of an art when doing this. Keep in mind some changes can have unintended consequences, and your goal should be to get most titles segmented properly as no mapping from unstructured data to structured data can be 100% perfect.