Ever since data warehousing came to be used as a facilitator for strategic decision making, the importance of the quality of the underlying data has grown manyfold. Data quality issues are much like software quality issues: both can sabotage a project at any stage.

This being my first article ever, it is more thinking aloud than a definitive set of steps. In subsequent articles I will discuss data quality issues in more depth.

1. Data collection process: Many organizations depend on the ETL tools available in the market to make their transactional data ready for OLAP. These tools would be much more effective if the data coming from the systems in day-to-day use had valid contents. Data quality checks should therefore be applied right from the data collection process.

Take feedback collection, where users write ad hoc answers to open-ended questions. To ensure that only valid feedback is registered, techniques ranging from parsing the feedback text for certain keywords to complex text mining algorithms are employed. More efficient data quality checks at this point take the data quality burden off the subsequent stages of a DW project.

In my view there are many separate ways of looking at data collection. One is to distinguish implicit data collection from explicit data collection. For example, data collected at the server, proxy or client level to track a user's browsing behavior has to be treated differently, when preparing it for mining, from data collected through data entry forms.

However, proactive steps to ensure that valid content gets into the databases are useful in either case. In explicit collection, this could be a string pattern matching task, such as validating the pattern of an email address and not allowing the form to be submitted if the check fails. In implicit collection, we need to distinguish actual user clicks from a bot or a scraping program automatically clicking links on your web pages.

2. Data cleansing process: Data cleansing is difficult because of the sheer size of the source data. It is not easy to pick out the badly behaving data from a collection of a few terabytes. Many techniques are used here, including fuzzy matching, custom de-duplication algorithms, and script-based custom transforms.

The best approach is to study the source data model and build basic rules for checking data quality. This can also be done iteratively. In many cases clients do not provide data upfront, only the data model with some trial data. The business analyst (BA) and the domain expert can, in mutual consultation, come up with certain rules describing what the actual data should look like. These rules may not be very detailed, but that is fine, as this is just a first iteration. As the understanding of the source data model evolves, so can the data quality rules. (This might sound almost utopian to anyone who has been part of even a single data warehousing project, but it is an approach worth trying.)

Please note that this is different from data profiling tools, which run on the source data. Here we are trying to analyze the metadata and the project requirements in order to specify the data quality. Building these rules generally requires sound knowledge of the industry concerned and a consistent, in-sync data dictionary. The worst part is that once the rules are built, the data modeling team also has to verify the actual data against them manually. Being cumbersome and error prone, this process might itself compromise data quality. In the next article we will discuss how this manual effort can be reduced and possibly automated.
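As a rough illustration of the rule-based checking described under data cleansing, here is a minimal sketch in Python. It is only one possible shape for such rules, not a recommended tool. The field names (`email`, `age`, `country`), the age bounds, and the deliberately simple email pattern are all assumptions made for the example; real rules would come from the data dictionary and the domain expert.

```python
import re
from typing import Callable, Dict, List, Tuple

# Deliberately simple pattern (an assumption for this sketch): something@something.something
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

Rule = Callable[[dict], bool]

# Hypothetical first-iteration rules; each returns True when the record passes.
RULES: Dict[str, Rule] = {
    "email_is_well_formed": lambda r: bool(EMAIL_PATTERN.match(r.get("email") or "")),
    "age_in_range": lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120,
    "country_not_empty": lambda r: bool((r.get("country") or "").strip()),
}


def check_records(records: List[dict], rules: Dict[str, Rule] = RULES) -> List[Tuple[int, str]]:
    """Return a (row_index, rule_name) pair for every rule that a record fails."""
    failures = []
    for index, record in enumerate(records):
        for name, rule in rules.items():
            if not rule(record):
                failures.append((index, name))
    return failures


if __name__ == "__main__":
    trial_data = [
        {"email": "a@example.com", "age": 30, "country": "IN"},
        {"email": "not-an-email", "age": 150, "country": " "},
    ]
    for row, rule_name in check_records(trial_data):
        print(f"row {row} failed rule: {rule_name}")
```

Because each rule is just a named entry in a dictionary, a later iteration can add or tighten rules without touching the checking loop, which fits the idea of letting the rules evolve along with the understanding of the source data model.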