| So you have a vision for a data warehouse, you | | | | data profiler. |
| have executive support, and you are sitting down | | | | Strategies based on the results |
| to write up your budget proposal. Once you have | | | | Once the result of the profiling is available, it is |
| enough of a subject area scope defined to know | | | | critical to be realistic about what your analysis tells |
| roughly which source systems, and which tables | | | | you. |
| your DW is going to extract from, the very next | | | | Faced with data quality issues, you have a |
| thing you need to do is some data profiling. | | | | number of options: |
| Data profiling should be done as early as possible. | | | | Reduce scope to avoid data that simply is not |
| Poor data quality is a serious issue that can | | | | viable |
| undermine every aspect of a DW project. First | | | | Nobody will want to hear that some key |
| and foremost, if the data has quality issues the | | | | information that was advertised during the |
| actual analysis done by users of the data | | | | project justification phase is simply not viable, but |
| warehouse will not be of value. | | | | its better to break the news now, than after |
| But more directly important to a project manager | | | | spending thousands developing ETL jobs and |
| is the impact data quality issues in source | | | | reports that are worthless. Cut your losses if you |
| systems can have on the duration and quality of | | | | have to. |
| ETL (extract, transform and load) job | | | | Isolate problem areas and mark them |
| development. | | | | In some cases, its not necessary to abandon bad |
| Blindly trusting the meta data and estimating ETL | | | | data completely- it might be possible to isolate |
| development time without looking at the source | | | | bad records and ensure that they are clearly |
| system data risks having development schedules | | | | identified. If users know that for certain analysis |
| run over as the developers discover that the | | | | they are only able to use subsets of the data, it |
| simple extraction code they have written fails | | | | might still be of value to them to run certain |
| when it comes into contact with real world, | | | | analysis on a portion. For example, if sales data |
| messy data. | | | | from a subsidiary is missing certain fields, at least |
| It is one thing to know that a given text field is | | | | the full analysis can be run on a portion of the |
| supposed to contain product category code- it is | | | | sales not affected by the problem. Reports need |
| quite another to know how many rows are | | | | to clearly state "Excluding XYZ". |
| actually populated, and if the values conform to | | | | Cleanse issue areas |
| established categories, or fall into the correct | | | | It may be possible to cleanse data with issues |
| hierarchy. | | | | early. |
| At a minimum, all data sources should be profiled | | | | For example: |
| for the following: | | | | |
| | | | 1. Remove duplicates from Master files such as |
| 1. Value Completeness- is there a value in the | | | | the customer list using matching software |
| field. | | | | 2. Identify missing information and launch data |
| 2. Historical Completeness- is the historical data | | | | collection and/or entry sub-projects to populate |
| complete- missing data can corrupt results, but | | | | the key tables |
| might not be noticed at a summary level. | | | | 3. Use external data sources to enhance existing |
| 3. Data format - are phone numbers phone | | | | or sparse data sets. For example, with an |
| numbers? Are email addresses properly formed? | | | | external service it might be possible to populate |
| Do postal codes have the correct structure? | | | | missing longitude and latitude information from |
| 4. Cross column consistency- Even if all the values | | | | street address to enable map based visualizations. |
| in the columns are correct, its important to profile | | | | Revise your project budget if needed |
| across columns. Do product codes match the | | | | If you have established a project budget before |
| product categories? Are geographical tags | | | | you undertake your data quality investigation, as a |
| consistent? Do the same products appear under | | | | project manager you need to honestly look |
| different categories depending on transaction | | | | yourself in the mirror and answer the question |
| date? | | | | "Based on what I now know, is the approved |
| 5. Value outliers- By looking for extremely high or | | | | budget still viable?" |
| low values, "killer rows" can be identified. One row | | | | Ideally, don't set the detailed budget until AFTER |
| with a value that is orders of magnitude off will | | | | you have done the data quality investigation. |
| skew averages and totals and surprisingly might | | | | Without understanding the raw materials available, |
| not be noticed through the normal course of | | | | it is difficult to design the data warehouse, and |
| report validation. The most efficient way to profile | | | | even harder to accurately estimate the effort |
| data is through the use of an automated data | | | | involved. |
| profiling tool. While some custom SQL queries | | | | Profile your data early and often. Data quality is |
| may be necessary for very specific tests, | | | | an often overlooked key aspect of data |
| attempting to do all checks by developing queries | | | | warehouse project success. |
| is extremely inefficient compared to the use of a | | | | |