| What is Dirty Data? | | | | exercise is prior to data mining. The first step |
| Before we get into how to deal with dirty data let | | | | then is to analyse the quality of the data. Large |
| us define exactly what it is. Dirty data is a term | | | | amounts of aggregated data can be significantly |
| used to refer to information/data that is | | | | affected by relatively small amounts of dirty |
| misleading, incorrect or without generalized | | | | data. |
| formatting, that has been collected by means of | | | | Examples of Dirty data |
| data capture forms It could even be spelling | | | | Common causes of dirty data are: |
| mistakes or poor punctuation, incomplete or | | | | - Wrong fields sizes |
| outdated data or even data that is duplicated in | | | | - Wrong and inconsistent formats |
| the database. | | | | - Logical inconsistency like typing zipcode into |
| But what is the cost of dirty data to business? | | | | phone number box |
| Well it is hard to put a figure on it and that is a | | | | - User Errors |
| big part of the problem. We may know that bad | | | | Most of the problems comes when working with |
| data exists on our systems but we cannot | | | | Text or Excel Files |
| quantify it so tend to ignore it, obviously the | | | | Life is much easier when data source is ODBC |
| wrong approach. And the amount of dirty data in | | | | compliant database however there are still some |
| the system will keep increasing as we allow more | | | | potentional problems |
| people freedom of access to input their | | | | Imagine that you are loading orders from |
| information onto the system. | | | | different countries into your oracle |
| Where is it coming from? | | | | datawarehouse. |
| System data can degrade very rapidly, starting | | | | Part of the data comes from text files, part |
| with customer information such as spelling of | | | | from MS Excel files and some of the data is |
| names, addresses, and missing information. Errors | | | | direct ODBC connection to the source database. |
| like these will accumulate within days, and in a few | | | | Some files are result of manual consolidation of |
| weeks the sales figures will no longer make sense. | | | | multiple files |
| A database that offers any kind of unsecured | | | | Datawarehouse Table Definition is |
| access can become unreliable-and ultimately | | | | - COUNTRY_ID INTEGER |
| worthless within two months. | | | | - ORDER_ID INTEGER |
| Even in a professionally designed and operated | | | | - ORDER_DATE DATE |
| system, where the data is strictly controlled, | | | | - AMOUNT NUMBER(10.2) |
| errors exist. But when does dirty data become a | | | | Every country has different formats for |
| problem? Large companies have database | | | | ORDER_DATE and Amount field. This situation is |
| administrators who train users on data entry, | | | | far too familiar for many ETL Consultants |
| rules for data validation, and software to support | | | | |
| data cleaning. Unfortunately, smaller businesses | | | | In order to load data we need to make sure that |
| might not even know they have a problem. | | | | format of Amount and Order_Date fields is |
| So now we know what dirty data is and how it | | | | consistent. |
| gets there, so what can we do about it? | | | | For amount field we need to get rid of dollars, |
| The basic steps for dealing with bad data are the | | | | pounds and commas. |
| same, whatever the source; detect it, set a | | | | It could easily done by using replace function of |
| toleration level, and remove it. The first two steps | | | | Advanced ETL Processor. |
| are the most difficult to implement and require a | | | | |
| level of sophistication. The techniques involved in | | | | For ORDER_DATE field we will apply multiple date |
| data cleansing are similar, whether you are | | | | formats. |
| dealing with spam, a database, or an XML | | | | Result of Date Format function is a string in |
| document. | | | | 'YYYY-MM-DD HH:NN:SS.ZZZ' format |
| Solving the problem | | | | |
| First you have to identify the dirty data, logical | | | | Full Transformation |
| really as you cannot deal with a problem if it does | | | | |
| not exist. The most straightforward approach is | | | | Result of Transformation |
| to scan the message, document, or database for | | | | |
| characters that don't belong there or are missing. | | | | This is just a small example how Advanced ETL |
| This involves parsing the content and comparing | | | | Processor can help you to validate and transform |
| it against a control that indicates what should be | | | | data. |
| there. | | | | |
| Once this is complete you are able to evaluate | | | | About Advanced ETL Processor |
| the problem. Decisions now have to be made | | | | Advanced ETL Processor is an ETL tool designed |
| concerning the level of the problem, and the | | | | to automate extracting data from ANY database, |
| counter measures to be auctioned | | | | transform, validate it and load into ANY database . |
| Database administrators know there is some dirty | | | | Typical usage of it would be extract data from |
| data in every database but they have to scope | | | | Excel File,Validate Date Formats, Sort data, |
| the cost effectiveness of detection and removal. | | | | deduplicate it and load it into Oracle database, run |
| All databases have critical and non-critical areas, | | | | stored procedure or Sql script, once loading is |
| bad data in critical areas are worth the expense | | | | completed. Unlike Oracle SQL loader, BCP, DTS |
| but whose budget is going to weed out the bad | | | | or SSIS Advanced ETL Processor can also add |
| data from a non critical area. | | | | new and update old records based on primary |
| A truly critical time to implement a data cleaning | | | | key. |