Domain Knowledge

What it is and why it matters

do·main: n. - 2. A sphere of activity, concern, or function...

knowl·edge: n. - 2. Familiarity, awareness, or understanding gained through experience or study.

(The American Heritage® Dictionary of the English Language, Fourth Edition, as quoted by Dictionary.com)

For data quality, "domain knowledge" refers to information about each field (e.g. first name, country) and what makes it valid or invalid: characters (e.g. letters vs. digits), character pattern, range or list of values (e.g. all states), relationship with other fields (e.g. US states vs. Canadian provinces), and more.
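As a rough illustration, the kinds of domain knowledge listed above could be written down as follows. This is a minimal sketch, not how DQ Now stores its rules: the `FieldRule` class, the regular expressions, the tiny value lists, and the `state_matches_country` helper are all assumptions invented for this example.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical, deliberately tiny value lists; real lists would be complete.
US_STATES = {"CA", "NY", "TX"}
CA_PROVINCES = {"ON", "QC", "BC"}


@dataclass
class FieldRule:
    """Domain knowledge about one field (illustrative structure only)."""
    name: str
    pattern: Optional[str] = None   # allowed characters / character pattern
    allowed: Optional[set] = None   # list of valid values

    def is_valid(self, value: str) -> bool:
        if self.pattern is not None and not re.fullmatch(self.pattern, value):
            return False
        if self.allowed is not None and value not in self.allowed:
            return False
        return True


# Characters: letters (plus apostrophe, space, hyphen) vs. digits.
FIRST_NAME = FieldRule("first_name", pattern=r"[A-Za-z][A-Za-z' -]*")
# Character pattern: five digits with an optional four-digit extension.
US_ZIP = FieldRule("zip", pattern=r"\d{5}(-\d{4})?")
# List of values.
US_STATE = FieldRule("state", allowed=US_STATES)


def state_matches_country(record: dict) -> bool:
    """Relationship with other fields: the state must belong to the country."""
    regions = {"US": US_STATES, "CA": CA_PROVINCES}.get(record["country"])
    # Countries without a known region list are not checked here.
    return regions is None or record["state"] in regions
```

The point of the sketch is only that each rule is a statement about what the field means, which is what separates it from a plain description of what the field happens to contain.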

Every data cleansing engine uses some form of domain knowledge; DQ Now is the only product that also uses this information at the data profiling stage. Other profiling tools tell you what data a field contains; we tell you what it means.

Let's consider an example. One of the fundamental tools of data profiling is a frequency distribution. Here's an analysis of a real data set. Note that it is relatively clean: there are no misfielded values (e.g. a zip code or city that ended up in the country field) and no typos. Each value is plausible (e.g. "Great Britain"), and might pass a quick human inspection. The data is shown two ways: sorted by frequency and by name. Now, how long does it take for a data analyst (or an end user acting in that capacity) to identify all the data quality problems, and then determine the best way to address them?

Their Generic Profile

Total records 541
Distinct values 54
 
Sorted by Frequency                     Sorted by Name
Name                  Count  Percent    Name                  Count  Percent

United Kingdom           89    16.5%    ARGENTINA                 1     0.2%
Germany                  86    15.9%    Argentina                 1     0.2%
Japan                    64    11.8%    Australia                36     6.7%
Australia                36     6.7%    Austria                  15     2.8%
Switzerland              31     5.7%    AUSTRIA                   1     0.2%
France                   29     5.4%    Belgium                  14     2.6%
Sweden                   21     3.9%    BELGIUM                   2     0.4%
Netherlands              18     3.3%    belgium                   1     0.2%
Austria                  15     2.8%    Brazil                    1     0.2%
Belgium                  14     2.6%    Denmark                  13     2.4%
Denmark                  13     2.4%    England                   3     0.6%
UK                       13     2.4%    Finland                   2     0.4%
SWEDEN                   10     1.8%    France                   29     5.4%
New Zealand               9     1.7%    FRANCE                    4     0.7%
Norway                    9     1.7%    france                    1     0.2%
Italy                     8     1.5%    Germany                  86    15.9%
GERMANY                   7     1.3%    GERMANY                   7     1.3%
Spain                     6     1.1%    Great Britain             1     0.2%
Hong Kong                 5     0.9%    Holland                   1     0.2%
FRANCE                    4     0.7%    HOLLAND                   1     0.2%
England                   3     0.6%    Hong Kong                 5     0.9%
ITALY                     3     0.6%    HONG KONG                 1     0.2%
JAPAN                     3     0.6%    Ireland                   2     0.4%
NETHERLANDS               3     0.6%    Israel                    2     0.4%
NORWAY                    3     0.6%    Italy                     8     1.5%
Republic of Singapore     3     0.6%    ITALY                     3     0.6%
The Netherlands           3     0.6%    Japan                    64    11.8%
BELGIUM                   2     0.4%    JAPAN                     3     0.6%
Finland                   2     0.4%    Latvia                    1     0.2%
Ireland                   2     0.4%    MALAYSIA                  1     0.2%
Israel                    2     0.4%    N. Ireland                1     0.2%
Philippines               2     0.4%    Netherlands              18     3.3%
Singapore                 2     0.4%    NETHERLANDS               3     0.6%
ARGENTINA                 1     0.2%    New Zealand               9     1.7%
Argentina                 1     0.2%    NEW ZEALAND               1     0.2%
AUSTRIA                   1     0.2%    Norway                    9     1.7%
belgium                   1     0.2%    NORWAY                    3     0.6%
Brazil                    1     0.2%    Philippines               2     0.4%
france                    1     0.2%    Portugal                  1     0.2%
Great Britain             1     0.2%    Republic of Singapore     3     0.6%
Holland                   1     0.2%    Singapore                 2     0.4%
HOLLAND                   1     0.2%    South Africa              1     0.2%
HONG KONG                 1     0.2%    Spain                     6     1.1%
Latvia                    1     0.2%    Sweden                   21     3.9%
MALAYSIA                  1     0.2%    SWEDEN                   10     1.8%
N. Ireland                1     0.2%    Switzerland              31     5.7%
NEW ZEALAND               1     0.2%    SWITZERLAND               1     0.2%
Portugal                  1     0.2%    switzerland               1     0.2%
South Africa              1     0.2%    TAIWAN                    1     0.2%
switzerland               1     0.2%    THAILAND                  1     0.2%
SWITZERLAND               1     0.2%    The Netherlands           3     0.6%
TAIWAN                    1     0.2%    UK                       13     2.4%
THAILAND                  1     0.2%    uk                        1     0.2%
uk                        1     0.2%    United Kingdom           89    16.5%

The answer: too long!
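Part of the reason is that a generic frequency distribution knows nothing about the field: it simply counts exact strings. The sketch below shows roughly what such a profile computes. It is an illustration with a made-up sample (not the data set above), and the function name `frequency_distribution` is our own invention.

```python
from collections import Counter


def frequency_distribution(values):
    """Return (value, count, percent) rows, counting exact strings only."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(v, c, round(100 * c / total, 1)) for v, c in counts.items()]


# Made-up sample for illustration.
sample = ["Belgium", "BELGIUM", "belgium", "UK", "United Kingdom", "Belgium"]
rows = frequency_distribution(sample)

# The same rows, shown two ways.
by_frequency = sorted(rows, key=lambda r: (-r[1], r[0]))
by_name = sorted(rows, key=lambda r: r[0].lower())

for value, count, percent in by_frequency:
    print(f"{value:<16}{count:>3}{percent:>7.1f}%")
```

Three spellings of Belgium and two names for the United Kingdom come out as five unrelated values, and nothing in the output says which of them matter.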


(Many of the DQ Now features described here are still in beta testing. Current features are listed on the products page.)

DQ Now uses domain knowledge to profile the data, showing a much more useful picture. The overview report is simple:

DQ Now Profile Summary

541 records
29 countries represented

26 values were corrected
60 values were standardized to mixed case

And, most importantly, zero items require the data analyst's attention! We've taken a task that requires several minutes of detailed work and reduced it to the few seconds needed to skim our summary report. That's productivity! Any remaining problems (e.g. an unrecognized value in the country field) would be clearly identified rather than left hidden in a generic distribution report.

How did we do it? There are two classes of problems: those that the cleansing engine will fix (using current settings) and those it won't. Other profiling tools do not separate these two cases, so the data analyst is forced to deal with both in the same profile -- with no way of knowing which problems fall into which category. To use an analogy from another field: a conventional data profile is full of noise (issues that don't require attention) so it's very difficult to discern the signal (issues that the cleansing engine won't fix on its own).
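The separation of the two classes can be sketched in a few lines. This is an illustrative sketch only, not DQ Now's implementation: the `CANONICAL` and `ALIASES` tables, the status names, and the functions `classify` and `profile_summary` are all assumptions made up for this example.

```python
from collections import Counter

# Hypothetical domain knowledge for a country field: canonical names, plus
# known alternatives and the rule that justifies each mapping.
CANONICAL = {"United Kingdom", "Netherlands", "Singapore", "Belgium"}
ALIASES = {
    "uk": ("United Kingdom", "standard abbreviation"),
    "holland": ("Netherlands", "common alternative"),
    "republic of singapore": ("Singapore", "common alternative"),
}
BY_LOWER = {name.lower(): name for name in CANONICAL}


def classify(value):
    """Return (status, cleaned_value) for one raw field value."""
    stripped = value.strip()
    key = stripped.lower()
    if stripped in CANONICAL:
        return "ok", stripped
    if key in BY_LOWER:
        return "case_standardized", BY_LOWER[key]
    if key in ALIASES:
        return "corrected", ALIASES[key][0]
    # Only this class is "signal": the engine will not fix it on its own.
    return "needs_attention", stripped


def profile_summary(values):
    """Count values per status and collect the countries represented."""
    statuses = Counter()
    countries = set()
    for value in values:
        status, cleaned = classify(value)
        statuses[status] += 1
        if status != "needs_attention":
            countries.add(cleaned)
    return statuses, countries


statuses, countries = profile_summary(
    ["Belgium", "BELGIUM", "UK", "Holland", "Atlantis"]
)
print(f"{len(countries)} countries represented")
print(f"{statuses['corrected']} values were corrected")
print(f"{statuses['case_standardized']} values were standardized to mixed case")
print(f"{statuses['needs_attention']} values require attention")
```

In this sketch only the unrecognized value ends up in front of the analyst; everything the engine can handle by itself is reduced to a count.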


If the data analyst wants to validate the corrections, they can zoom into a detail report to get all the information at a glance:

DQ Now Profile Detail

26 values were corrected (details below)
60 values were standardized to mixed case

           
Count  New Value
       Count  Old Value               Correction Rule

   18  United Kingdom
          13  UK                      standard abbreviation
           3  England                 common alternative
           2  Great Britain           common alternative

    5  Netherlands
           3  The Netherlands         common alternative
           2  Holland                 common alternative

    3  Singapore
           3  Republic of Singapore   common alternative
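A detail report of this shape is essentially a two-level grouping: first by new value, then by old value and correction rule. Here is a minimal sketch, assuming the cleansing engine keeps a log of (old value, new value, rule) entries; the log format, the sample entries, and the `detail_report` function are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical correction log: one (old_value, new_value, rule) per record.
corrections = [
    ("UK", "United Kingdom", "standard abbreviation"),
    ("UK", "United Kingdom", "standard abbreviation"),
    ("England", "United Kingdom", "common alternative"),
    ("Holland", "Netherlands", "common alternative"),
]


def detail_report(log):
    """Group corrections by new value, then count each (old value, rule)."""
    grouped = defaultdict(Counter)
    for old, new, rule in log:
        grouped[new][(old, rule)] += 1

    lines = []
    # Largest groups first, as in the report above.
    ordered = sorted(grouped.items(), key=lambda kv: -sum(kv[1].values()))
    for new, olds in ordered:
        lines.append(f"{sum(olds.values()):>3}  {new}")
        for (old, rule), n in olds.most_common():
            lines.append(f"     {n:>3}  {old:<24}{rule}")
    return lines


print("\n".join(detail_report(corrections)))
```

Because every line carries the rule that triggered the change, the analyst can validate a whole group of corrections at once instead of checking records one by one.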

Once DQ Now has applied domain knowledge and found that no values require user intervention, a frequency distribution is no longer needed to understand data quality. (It may still help the marketing department understand where customers are, but that's a separate issue.) Nevertheless, DQ Now is happy to create one. Notice how much simpler it is:

DQ Now Frequency Distribution

541 records
29 countries represented

Sorted by Frequency                     Sorted by Name
Value                 Count  Percent    Value                 Count  Percent

United Kingdom          107    19.8%    Argentina                 2     0.4%
Germany                  93    17.2%    Australia                36     6.7%
Japan                    67    12.4%    Austria                  16     3.0%
Australia                36     6.7%    Belgium                  18     3.3%
France                   34     6.3%    Brazil                    1     0.2%
Switzerland              33     6.1%    Denmark                  13     2.4%
Sweden                   31     5.7%    Finland                   2     0.4%
Netherlands              26     4.8%    France                   34     6.3%
Belgium                  18     3.3%    Germany                  93    17.2%
Austria                  16     3.0%    Hong Kong                 6     1.1%
Denmark                  13     2.4%    Ireland                   2     0.4%
Norway                   12     2.2%    Israel                    2     0.4%
Italy                    11     2.0%    Italy                    11     2.0%
New Zealand              10     1.8%    Japan                    67    12.4%
Hong Kong                 6     1.1%    Latvia                    1     0.2%
Spain                     6     1.1%    Malaysia                  1     0.2%
Singapore                 5     0.9%    Netherlands              26     4.8%
Argentina                 2     0.4%    New Zealand              10     1.8%
Finland                   2     0.4%    Norway                   12     2.2%
Ireland                   2     0.4%    Philippines               2     0.4%
Israel                    2     0.4%    Portugal                  1     0.2%
Philippines               2     0.4%    Singapore                 5     0.9%
Brazil                    1     0.2%    South Africa              1     0.2%
Latvia                    1     0.2%    Spain                     6     1.1%
Malaysia                  1     0.2%    Sweden                   31     5.7%
Portugal                  1     0.2%    Switzerland              33     6.1%
South Africa              1     0.2%    Taiwan                    1     0.2%
Taiwan                    1     0.2%    Thailand                  1     0.2%
Thailand                  1     0.2%    United Kingdom          107    19.8%

Next step: see how DQ Now can be used instead of and in addition to related products.