How to Use Data Profiling to Avoid a Data Warehouse Quality Train Wreck

So you have a vision for a data warehouse, youdata profiler.
have executive support, and you are sitting downStrategies based on the results
to write up your budget proposal. Once you haveOnce the result of the profiling is available, it is
enough of a subject area scope defined to knowcritical to be realistic about what your analysis tells
roughly which source systems, and which tablesyou.
your DW is going to extract from, the very nextFaced with data quality issues, you have a
thing you need to do is some data profiling.number of options:
Data profiling should be done as early as possible.Reduce scope to avoid data that simply is not
Poor data quality is a serious issue that canviable
undermine every aspect of a DW project. FirstNobody will want to hear that some key
and foremost, if the data has quality issues theinformation that was advertised during the
actual analysis done by users of the dataproject justification phase is simply not viable, but
warehouse will not be of value.its better to break the news now, than after
But more directly important to a project managerspending thousands developing ETL jobs and
is the impact data quality issues in sourcereports that are worthless. Cut your losses if you
systems can have on the duration and quality ofhave to.
ETL (extract, transform and load) jobIsolate problem areas and mark them
development.In some cases, its not necessary to abandon bad
Blindly trusting the meta data and estimating ETLdata completely- it might be possible to isolate
development time without looking at the sourcebad records and ensure that they are clearly
system data risks having development schedulesidentified. If users know that for certain analysis
run over as the developers discover that thethey are only able to use subsets of the data, it
simple extraction code they have written failsmight still be of value to them to run certain
when it comes into contact with real world,analysis on a portion. For example, if sales data
messy data.from a subsidiary is missing certain fields, at least
It is one thing to know that a given text field isthe full analysis can be run on a portion of the
supposed to contain product category code- it issales not affected by the problem. Reports need
quite another to know how many rows areto clearly state "Excluding XYZ".
actually populated, and if the values conform toCleanse issue areas
established categories, or fall into the correctIt may be possible to cleanse data with issues
hierarchy.early.
At a minimum, all data sources should be profiledFor example:
for the following:
1. Remove duplicates from Master files such as
1. Value Completeness- is there a value in thethe customer list using matching software
field.2. Identify missing information and launch data
2. Historical Completeness- is the historical datacollection and/or entry sub-projects to populate
complete- missing data can corrupt results, butthe key tables
might not be noticed at a summary level.3. Use external data sources to enhance existing
3. Data format - are phone numbers phoneor sparse data sets. For example, with an
numbers? Are email addresses properly formed?external service it might be possible to populate
Do postal codes have the correct structure?missing longitude and latitude information from
4. Cross column consistency- Even if all the valuesstreet address to enable map based visualizations.
in the columns are correct, its important to profileRevise your project budget if needed
across columns. Do product codes match theIf you have established a project budget before
product categories? Are geographical tagsyou undertake your data quality investigation, as a
consistent? Do the same products appear underproject manager you need to honestly look
different categories depending on transactionyourself in the mirror and answer the question
date?"Based on what I now know, is the approved
5. Value outliers- By looking for extremely high orbudget still viable?"
low values, "killer rows" can be identified. One rowIdeally, don't set the detailed budget until AFTER
with a value that is orders of magnitude off willyou have done the data quality investigation.
skew averages and totals and surprisingly mightWithout understanding the raw materials available,
not be noticed through the normal course ofit is difficult to design the data warehouse, and
report validation. The most efficient way to profileeven harder to accurately estimate the effort
data is through the use of an automated datainvolved.
profiling tool. While some custom SQL queriesProfile your data early and often. Data quality is
may be necessary for very specific tests,an often overlooked key aspect of data
attempting to do all checks by developing querieswarehouse project success.
is extremely inefficient compared to the use of a