
Data Quality = Iterative Refinement

An outsider may view data quality as a single event: run the data through the software. In fact, it's a multi-step process. Why? A key reason is that many quality issues do not have a single "correct" solution. For example, should the names of cities outside of the US be in English or in the local language? Some companies will choose the former; others the latter. Should "B. Ross" and "Betsy Ross" at the same address be considered a duplicate? Probably so for a software company; maybe not for a financial services firm. To account for these issues, data quality products allow you to adjust cleansing settings and specify custom transformations.
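
To make this concrete, here's a minimal sketch of a duplicate check whose strictness is itself a setting. (Plain Python with hypothetical names, invented for illustration; not any particular product's API.)

    import re

    def normalize(name):
        """Lowercase and strip punctuation, so "B. Ross" compares as "b ross"."""
        return re.sub(r"[^\w\s]", "", name).lower().strip()

    def is_duplicate(a, b, allow_initials):
        """Same last name is always required; abbreviated first names are a policy choice."""
        parts_a, parts_b = normalize(a).split(), normalize(b).split()
        if parts_a[-1] != parts_b[-1]:
            return False
        if parts_a[0] == parts_b[0]:
            return True
        # "b" vs. "betsy": an initial-only match, accepted only under the looser policy
        return allow_initials and parts_a[0][0] == parts_b[0][0]

    # A software company might choose the looser policy; a financial services firm might not.
    print(is_duplicate("B. Ross", "Betsy Ross", allow_initials=True))   # True
    print(is_duplicate("B. Ross", "Betsy Ross", allow_initials=False))  # False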

Real-world data involves tradeoffs. For cleansing: changes made to fix one problem may introduce errors elsewhere -- the familiar "two steps forward, one step back." For match/dedup: the goal is to loosen the settings far enough to catch the matches you want, but not so far that unrelated records start matching too. Finding this balance requires iterative refinement: run a representative set of data, evaluate the results, refine the settings, and run again.
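
In miniature, the balancing act looks like this. (A hedged sketch: Python's standard difflib stands in for a real similarity engine, and the tiny hand-labeled sample is invented; production matching logic is far richer.)

    from difflib import SequenceMatcher

    # A tiny hand-labeled sample: record pairs and whether each should match.
    SAMPLE = [
        ("Betsy Ross", "B. Ross",    True),
        ("Betsy Ross", "Betsy Roos", True),
        ("Betsy Ross", "Bart Rose",  False),
    ]

    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def evaluate(threshold):
        """Count desired matches found and unrelated records matched."""
        desired = sum(similarity(a, b) >= threshold for a, b, dup in SAMPLE if dup)
        unrelated = sum(similarity(a, b) >= threshold for a, b, dup in SAMPLE if not dup)
        return desired, unrelated

    # The refinement loop in miniature: run, evaluate the results, adjust, run again.
    # Somewhere between "too strict" and "too loose" is the setting you want.
    for threshold in (0.9, 0.8, 0.7, 0.6):
        desired, unrelated = evaluate(threshold)
        print(f"threshold={threshold:.1f}: {desired}/2 desired, {unrelated}/1 unrelated")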


Their Process

Existing data quality products are point solutions, focused on cleansing and match/dedup. They do not provide effective tools to assist with the process of iterative refinement.

[Diagram: their data quality process]

Some data quality vendors claim to provide support for profiling -- but everything we've seen still requires the data analyst to do most of the work, especially when compared with our use of domain knowledge. (Do you agree or disagree with our analysis? Please email us your experience.)


Our Process

Whether used with or instead of existing products, DQ Now fills the above gaps with interactive tools that dramatically reduce both the time spent per cycle and the total number of cycles.

[Diagram: the DQ Now data quality process]

(Many of the DQ Now features described here are still in beta test. Current features are listed on the products page.)

(To keep the diagrams manageable, we've included custom transformations with cleansing.)

Steps:

  • Profile raw data to establish a baseline
  • Cleanse the data, review changes, adjust cleansing settings ... repeat
  • Match & dedup cleansed data, review match groups, adjust match/dedup settings ... repeat
  • Cleanse the data, match & dedup, profile SUSPECT data, adjust cleansing and/or match/dedup settings ... repeat
  • Cleanse the data, match & dedup, profile CLEANSED data, adjust cleansing and/or match/dedup settings ... repeat
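
Strung together, the steps form a driver loop shaped like the following sketch. (The helper functions are trivial stand-ins invented for illustration; they are not DQ Now functions.)

    def profile(records):
        """Report records with missing cities (stand-in for the profiling steps)."""
        return [r["name"] for r in records if not r.get("city")]

    def cleanse(records, settings):
        """Apply cleansing settings and custom transformations (stand-in)."""
        return [{**r, "city": r.get("city") or settings["default_city"]} for r in records]

    def match_dedup(records):
        """Group records naively by normalized name (stand-in)."""
        groups = {}
        for r in records:
            groups.setdefault(r["name"].lower(), []).append(r)
        return groups

    raw = [{"name": "Betsy Ross", "city": ""},
           {"name": "betsy ross", "city": "Philadelphia"}]
    settings = {"default_city": "UNKNOWN"}

    print("baseline issues:", profile(raw))        # profile raw data first
    cleansed = cleanse(raw, settings)              # cleanse; review changes; adjust; repeat
    groups = match_dedup(cleansed)                 # match & dedup; review groups; repeat
    print("match groups:", groups)
    print("remaining issues:", profile(cleansed))  # re-profile the cleansed data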

As another view of the same process, here's a side-by-side comparison that takes a simplified path through it.

The Data Quality Process -- A Comparison

  1. Profile the data to identify problems.
     Their approach: Hard to find all the problems or figure out which ones require attention.
     Our solution: Domain knowledge omits problems the engine will fix and organizes the remainder for easy understanding.
  2. Adjust cleansing settings and add custom transformations.
     Their approach: Awkward, non-standard syntax for custom changes.
     Our solution: Regex provides a powerful way to specify custom changes.
  3. Cleanse.
  4. Review changes.
     Their approach: Spend much too long poring over reports that look like those greenbar printouts from the '70s.
     Our solution: Every change to every field is shown, with an overview and a detail view.
  5. Match & dedup.
  6. Review match groups.
     Their approach: Painstaking effort to understand why records matched and to find items that don't meet company requirements.
     Our solution: Match levels are color-coded and character-level differences are highlighted.
  7. Review cleansed data to identify remaining problems.
     Their approach: Hunt around for problems the engine missed.
     Our solution: See step 1: we profile the latest "cleansed" data to find additional exceptions that surfaced due to other changes.
  8. Repeat as needed.
     Their approach: Go back to step 2. ("No, not again!")
     Our solution: Go back to step 1: our profiler is an integral part of the process.
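
On step 2, "regex" means standard regular expressions. A simplified illustration of a regex-based custom transformation, shown in plain Python rather than any product's rule syntax:

    import re

    # Example custom transformations as (pattern, replacement) pairs.
    RULES = [
        (re.compile(r"\bSt\b\.?", re.IGNORECASE), "Street"),  # expand "St" / "St."
        (re.compile(r"\s{2,}"), " "),                         # collapse repeated spaces
    ]

    def apply_rules(value):
        for pattern, replacement in RULES:
            value = pattern.sub(replacement, value)
        return value.strip()

    print(apply_rules("239  Arch St."))  # -> "239 Arch Street"

Note that "St" can also abbreviate "Saint" -- exactly the kind of side effect the review steps exist to catch.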

For DQ Now, cleansing is a small part of the whole process. We focus on providing useful information in convenient form every step of the way.
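
On step 6, character-level highlighting rests on the same idea as a standard diff. Here's one way to compute the differing spans, sketched with Python's standard difflib. (An illustration of the idea only, not our implementation.)

    from difflib import SequenceMatcher

    def char_diff(a, b):
        """Mark the character spans where two near-duplicate values differ."""
        out = []
        for op, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
            out.append(a[i1:i2] if op == "equal" else f"[{a[i1:i2]}->{b[j1:j2]}]")
        return "".join(out)

    print(char_diff("Betsy Ross", "B. Ross"))  # -> B[etsy->.] Ross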


For completeness, it's worth noting that the above process is usually embedded in a larger one.

  1. Extract data from source.
  2. Import into data quality software.
  3. Profile, cleanse, match, and review (described above).
  4. Export out of data quality software.
  5. Load into destination.
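
In code terms, the outer pipeline is a simple chain around step 3. A schematic sketch, with CSV text standing in for a real source and destination and invented stage functions:

    import csv
    import io

    SOURCE = "name,city\nBetsy Ross,philadelphia\n"  # stand-in for a real source

    def extract(source):                  # steps 1-2: extract and import
        return list(csv.DictReader(io.StringIO(source)))

    def profile_cleanse_match(records):   # step 3: the process described above
        return [{**r, "city": r["city"].title()} for r in records]

    def export_load(records):             # steps 4-5: export and load
        out = io.StringIO()
        writer = csv.DictWriter(out, fieldnames=["name", "city"])
        writer.writeheader()
        writer.writerows(records)
        return out.getvalue()             # stand-in for a real destination

    print(export_load(profile_cleanse_match(extract(SOURCE))))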

Next step: see how we use domain knowledge to save you time.