Data Quality = Iterative Refinement
An outsider may view data quality as a single event: run the data through the software. In fact, it's a multi-step process. Why? A key reason is that many quality issues do not have a single "correct" solution. For example, should the names of cities outside the US be in English or in the local language? Some companies will choose the former; others the latter. Should "B. Ross" and "Betsy Ross" at the same address be considered duplicates? Probably so for a software company; maybe not for a financial services firm. To account for these differences, data quality products let you adjust cleansing settings and specify custom transformations.
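To make the "B. Ross" question concrete, here is a minimal Python sketch of a matching rule whose answer depends on a setting. It is an illustration only, not how DQ Now or any other product works: `MatchPolicy`, `initials_match_full_names`, the record fields, and the sample address are all names we made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchPolicy:
    """Hypothetical setting: may an initial stand in for a full first name?"""
    initials_match_full_names: bool


def _norm(text: str) -> str:
    # Normalize case and surrounding whitespace before comparing.
    return text.strip().lower()


def first_names_match(a: str, b: str, policy: MatchPolicy) -> bool:
    # Drop a trailing period so "B." and "B" compare the same.
    a, b = _norm(a).rstrip("."), _norm(b).rstrip(".")
    if a == b:
        return True
    if policy.initials_match_full_names:
        short, long_ = sorted((a, b), key=len)
        return len(short) == 1 and long_.startswith(short)
    return False


def is_duplicate(rec_a: dict, rec_b: dict, policy: MatchPolicy) -> bool:
    # Two records are duplicates when address, last name, and
    # (policy-dependent) first name all agree.
    return (
        _norm(rec_a["address"]) == _norm(rec_b["address"])
        and _norm(rec_a["last"]) == _norm(rec_b["last"])
        and first_names_match(rec_a["first"], rec_b["first"], policy)
    )


# Made-up records for illustration.
b_ross = {"first": "B.", "last": "Ross", "address": "1 Main St"}
betsy_ross = {"first": "Betsy", "last": "Ross", "address": "1 Main St"}

lenient = MatchPolicy(initials_match_full_names=True)   # e.g. a software company
strict = MatchPolicy(initials_match_full_names=False)   # e.g. a financial services firm

print(is_duplicate(b_ross, betsy_ross, lenient))
print(is_duplicate(b_ross, betsy_ross, strict))
```

The same two records are duplicates under the lenient policy and distinct under the strict one. Neither answer is wrong; the right one depends on the business.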
Real-world data involves tradeoffs. For cleansing: changes made to fix one problem may introduce errors elsewhere -- the familiar "two steps forward, one step back." For match/dedup: the goal is to loosen the settings far enough to capture as many desired matches as possible, but not so far that too many unrelated records match. Finding this balance requires a process of iterative refinement: run a representative set of data, evaluate the results, refine the settings, and run again.
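The run / evaluate / refine loop for match settings can be sketched in a few lines of Python. This is a rough illustration under our own assumptions, not a description of any product: the sample records and their "true entity" labels are invented, and the similarity score comes from the standard library's `difflib.SequenceMatcher`, standing in for whatever matching logic a real tool uses.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Made-up representative sample: (record id, name, true entity label).
# Records that share a label are the matches we want to find.
SAMPLE = [
    (1, "Betsy Ross", "A"),
    (2, "B. Ross", "A"),
    (3, "Betsy Rosse", "A"),
    (4, "Bob Ross", "B"),
    (5, "Diana Ross", "C"),
]


def similarity(a: str, b: str) -> float:
    """Score from 0.0 to 1.0; a stand-in for a real matching engine."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def evaluate(threshold: float, sample=SAMPLE) -> tuple[int, int]:
    """Run one pass and return (desired matches, unrelated matches)."""
    desired = unrelated = 0
    for (_, name1, ent1), (_, name2, ent2) in combinations(sample, 2):
        if similarity(name1, name2) >= threshold:
            if ent1 == ent2:
                desired += 1
            else:
                unrelated += 1
    return desired, unrelated


def sweep(thresholds) -> dict[float, tuple[int, int]]:
    """Re-run the sample at each candidate setting."""
    return {t: evaluate(t) for t in thresholds}


if __name__ == "__main__":
    for t, (desired, unrelated) in sweep([0.9, 0.8, 0.7, 0.6, 0.5]).items():
        print(f"threshold={t:.1f}  desired={desired}  unrelated={unrelated}")
```

Each pass loosens the threshold a little. The analyst watches both counts and stops when further loosening adds more unrelated matches than desired ones. In practice the evaluation step is the expensive part, because real data rarely comes with the true-entity labels this sketch assumes.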
Existing data quality products are point solutions, focused on cleansing and match/dedup. They do not provide effective tools to assist with the process of iterative refinement.
Some data quality vendors claim to provide support for profiling -- but everything we've seen still requires the data analyst to do most of the work, especially when compared with our use of domain knowledge. (Do you agree or disagree with our analysis? Please send us your experience.)
(Many of the DQ Now features described here are still in beta test. Current features are listed on the products page.)
(To keep the diagrams manageable, we've included custom transformations with cleansing.)
As another view of the same issue, here's a side-by-side comparison that takes a simplified path through the process.
The Data Quality Process -- A Comparison
For DQ Now, cleansing is a small part of the whole process. We focus on providing useful information in convenient form every step of the way.
Just for completeness, it's worth noting that the above process is usually embedded in a larger one.
Next step: see how we use domain knowledge to save you time.