After running your jobs, it’s critical that you check your data to make sure you are getting the expected results. Without this verification, you can’t be confident that the job is working and that the data is creating the processes you thought you built. A box left checked or unchecked or a missing (or extra) filter can be the difference between successful data management and erroneous data. Before you push any data back to your systems of record, make sure that you have the intended results.
- Check that you have approximately the number of results you expected.
- Example: If you are filtering records (and not intending to include the records filtered out in the next task), you’ll expect fewer than you started with. If you are permuting, you’ll expect more. If you’re cleaning data, you’ll want the same number you started with.
- Spot check for correct values.
- Example: If you’re segmenting job level based on title, compare the input titles to the output job levels and make sure your data is assigned as you expected. Be sure to look at values for all the levels in your checks, not just one, and check multiple records.
- Example: If you’re cleaning “Country” values, make sure that you’re getting the output (full country name, ISO-2, or ISO-3) that you’re expecting. Check countries that might have the country name in different languages or formats, such as Germany and Deutschland or Great Britain, United Kingdom, England, Scotland, Wales, and Northern Ireland. Also check that you are handling territories (ie. Puerto Rico) the way you desire.
- Make sure nothing’s missing.
- Example: If you’re expecting your data to include all the different companies that could be included in the account “IBM,” check that everything that begins with “IBM” is included, such as IBM USA, IBM France, IBM Research. Also make sure you’ve got “International Business Machines” and “I.B.M.” If you’ve captured only part of your data in your results, you won’t have the results you expected.
- Check that your output doesn't include anything that shouldn't be there.
- Example: If you’re filtering for “Annual Revenue Band” and the lowest value is $10,000,000, make sure that an account with a Revenue of $5,000,000 hasn’t slipped in.
- Example: If all your data should be in Europe, check that nothing in other parts of the world is in your output. This is a critical step that many people miss, assuming that if they have the data they want, nothing else has accidentally been included. Check that things that shouldn’t be included aren’t in your output.
Found an error? Make sure you’re using the diagnostic mode for your job and check the values at each step. When you discover where your results go wrong, you’ll know which task you need to fix or add to.
Sample Walk Through
In this sample, we’re bringing together four data sources to be able to assign a metro area to cities in France.
To check the data, we go through the above process:
-
Check the number of results.
We’ve started and ended with the same number of records, so that’s correct in this case. We’re not changing the records, just assigning a metro area value to those that are not rural and are in metropolitan areas.
-
Spot check for correct values.
To do this, we look up some places that have metro area assigned, using the output of the last task.
There are fewer records with a metro area, because some places are rural and will not be assigned a metro area.
Now we can look at individual records to spot check them.
Because these are geographical, we can make sure they’re actually in the right metro areas by using Google Maps.
Sure enough, Saint-Vitte-sur-Briance is within an hour of Limoges, France. We’d want to check a decent sample number of these in different metro areas to make sure they are consistently working correctly.
-
Make sure nothing’s missing.
In the spot checking, Tours is not mapping to a metro area, but it is a city large enough to constitute a metro area. Now we know something isn’t right.
Upon checking the data sources and the tasks, we can see that the postal code for the sample is 37927 CEDEX 9 and we are trying to map it to a postal code value that is 37927 CED*. Because of this “Begins With” won’t work and we need to add a new task to the job to make sure we’re mapping the postal code without the CEDEX value.
When we re-run the data after creating this task, we see that we’re now mapping data for Tours correctly. This will fix the data for other cities that use CEDEX in their postal codes as well.
-
Check that your output doesn't include anything that shouldn’t be there.
This is the step that people most commonly miss, but checking negative data is as important as checking positive data. Using the example of the data before we fixed the CEDEX values in the postal codes, we can see if we look at the records where OP Metro Area has no value, some are truly rural, but others are in places we know are metropolitan areas.
Douarnenez, for example, isn’t close to a major metropolitan area.
But we clearly have a problem, because Paris doesn’t have a metro area in some cases!
But when we fix the data for the CEDEX values in postal code, Paris is correctly assigned to the Paris metro area.
By checking your data carefully before you push anything back to your systems of record, you can be sure that your data is exactly as you want it. It’s worth the time to validate your data before someone on your team—or another team—questions the validity of your results!
Click HERE for Reviewing Outputs documentation.