Proactive Approach For Improved Data Quality In Data Warehousing

Ever since data warehousing came to be used as a facilitator for strategic decision making, the importance of the quality of the underlying data has grown manyfold. Data quality issues are much like software quality issues: both can sabotage a project at any stage.

This being my first article ever, it is more thinking aloud than a definitive set of steps. In subsequent articles I will discuss data quality issues in more depth.

1. Data collection process: Many organizations depend on the ETL tools available in the market to make their transactional data ready for OLAP. These tools would be much more effective if the data coming from the systems in day-to-day use had valid contents. So data quality checks should be applied right from the data collection process.

For example, consider feedback collection, where users write ad hoc feedback in response to open-ended questions. To ensure that only valid feedback is registered, techniques ranging from parsing the feedback text for certain keywords to complex text mining algorithms are employed. More efficient data quality checking at this stage takes the data quality burden off the subsequent stages of a DW project.

In my view there are many separate ways of looking at data collection. One is to distinguish implicit data collection from explicit data collection. For example, data collected at the server, proxy or client level to track a user's browsing behavior has to be treated differently, when preparing it for mining, from data collected through data entry forms.

However, proactive steps to ensure that valid content gets into the databases are useful in either case. In explicit collection, this could be a string pattern matching task such as validating the pattern of an email address, failing which we may not allow the form to be submitted. In implicit collection, we need to distinguish between actual user clicks and a bot or scraping program clicking links on your web pages automatically.

2. Data cleansing process: Data cleansing is a difficult process due to the sheer size of the source data. It is not easy to pick out the badly behaving data from a collection of a few terabytes. The techniques used here are many, ranging from fuzzy matching and custom de-duplication algorithms to script-based custom transforms.

The best approach is to study the source data model and build basic rules for checking data quality. This can also be done iteratively. In many cases clients do not provide the data upfront, only the data model with some trial data. The BA and the domain expert can, in mutual consultation, come up with certain rules as to how the actual data should look. These rules may not be very detailed, but that is OK as this is just a first iteration. As the understanding of the source data model evolves, so can the data quality rules. (This might sound almost heavenly to anyone who has been a part of even a single data warehousing project, but it is an approach worth trying.)

Please note that this is different from data profiling tools, which run on the source data. Here we are trying to analyze the metadata and the project requirements so as to specify the data quality. Generally, building these rules requires sound knowledge of the industry concerned as well as a consistent and in-sync data dictionary. The worse part is that once these rules are built, the data modeling team also has to carry out the actual data verification against them manually. This process, being cumbersome and error-prone, might compromise data quality. We will discuss how this effort can be reduced and possibly automated in the next article.
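
To make the idea of first-iteration rules a little more concrete, here is a minimal sketch in Python of how a handful of rules agreed between the BA and the domain expert might be written down once and then checked against trial data by a script instead of by hand. Everything in it is an assumption made for illustration: the field names (customer_id, email, age), the rules themselves, the deliberately loose email pattern and the sample records are hypothetical, and this is not a description of any particular tool or of the approach the next article will take.

```python
import re
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Rule = Tuple[str, Callable[[Record], bool]]

# Deliberately loose email pattern (an assumption): something@something.tld
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Hypothetical first-iteration rules; each is a description plus a check
# that returns True when the record satisfies the rule.
RULES: List[Rule] = [
    ("customer_id is present",
     lambda r: bool(r.get("customer_id", "").strip())),
    ("email looks like an address",
     lambda r: EMAIL_PATTERN.match(r.get("email", "")) is not None),
    ("age is a whole number between 0 and 120",
     lambda r: r.get("age", "").isdecimal() and 0 <= int(r["age"]) <= 120),
]


def check_records(records: List[Record],
                  rules: List[Rule] = RULES) -> List[Tuple[int, str]]:
    """Return (row index, rule description) for every rule a record fails."""
    violations = []
    for index, record in enumerate(records):
        for description, is_satisfied in rules:
            if not is_satisfied(record):
                violations.append((index, description))
    return violations


# Hypothetical trial data of the kind a client might supply with the model.
trial_data = [
    {"customer_id": "C001", "email": "asha@example.com", "age": "34"},
    {"customer_id": "", "email": "not-an-email", "age": "34"},
    {"customer_id": "C003", "email": "ravi@example.com", "age": "250"},
]

for row, description in check_records(trial_data):
    print(f"row {row}: failed rule '{description}'")
```

Keeping the rules in a plain list means a new rule can be appended in a later iteration without touching the checking loop, which matches the idea that the rules evolve along with the understanding of the source data model.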