The problem of dirty data

What is Dirty Data?exercise is prior to data mining. The first step
Before we get into how to deal with dirty data letthen is to analyse the quality of the data. Large
us define exactly what it is. Dirty data is a termamounts of aggregated data  can be significantly
used to refer to information/data that isaffected by relatively small amounts of dirty
misleading, incorrect or without generalizeddata. 
formatting, that has been collected by means ofExamples of Dirty data
data capture forms   It could even be spellingCommon causes of dirty data are:
mistakes  or poor punctuation, incomplete or- Wrong fields sizes
outdated data or even data that is duplicated in- Wrong and inconsistent formats
the database.- Logical inconsistency like typing zipcode into
But what is the cost of dirty data to business?phone number box
Well it is hard to put a figure on it and that is a- User Errors
big part of the problem. We may know  that badMost of the problems comes when working with
data exists on our systems but we cannotText or Excel Files
quantify it so tend to ignore it, obviously theLife is much easier when data source is ODBC
wrong approach. And the amount of dirty data incompliant database however there are still some
the system will keep increasing as we allow morepotentional problems
people freedom of access to input theirImagine that you are loading orders from
information onto the system.different countries into your oracle
Where is it coming from?datawarehouse.
System data can degrade very rapidly, startingPart of  the data comes from text files, part
with customer information such as spelling offrom MS Excel files and some of the data is
names, addresses, and missing information. Errorsdirect ODBC connection to the source database.
like these will accumulate within days, and in a fewSome files are result of manual consolidation of
weeks the sales figures will no longer make sense.multiple files
A database that offers any kind of unsecuredDatawarehouse Table Definition is
access can become unreliable-and ultimately- COUNTRY_ID INTEGER
worthless within two months.- ORDER_ID INTEGER
Even in a professionally designed and operated- ORDER_DATE DATE
system, where the data is strictly controlled,- AMOUNT NUMBER(10.2)
errors exist. But when does dirty data become aEvery country has different formats for
problem? Large companies have databaseORDER_DATE and Amount field. This situation is
administrators who train users on data entry,far too familiar for many ETL Consultants
rules for data validation, and software to support
data cleaning. Unfortunately, smaller businessesIn order to load data we need to make sure that
might not even know they have a problem.format of Amount and Order_Date fields is
So now we know what dirty data is and how itconsistent.
gets there, so what can we do about it?For amount field we need to get rid of dollars,
The basic steps for dealing with bad data are thepounds and commas.
same, whatever the source; detect it, set aIt could easily done by using replace function of
toleration level, and remove it. The first two stepsAdvanced ETL Processor.
are the most difficult to implement and require a
level of sophistication. The techniques involved inFor ORDER_DATE field we will apply multiple date
data cleansing are similar, whether you areformats.
dealing with spam, a database, or an XMLResult of Date Format function is a string in
document.'YYYY-MM-DD HH:NN:SS.ZZZ' format
Solving the problem
First you have to identify the dirty data, logicalFull Transformation
really as you cannot deal with a problem if it does
not exist. The most straightforward approach isResult of Transformation
to scan the message, document, or database for
characters that don't belong there or are missing.This is just a small example how Advanced ETL
This involves parsing the content and comparingProcessor can help you to validate and transform
it against a control that indicates what should bedata.
there.
Once this is complete you are able to evaluateAbout Advanced ETL Processor
the problem. Decisions now have to be madeAdvanced ETL Processor is an ETL tool designed
concerning the level of the problem, and theto automate extracting data from ANY database,
counter measures to be auctionedtransform, validate it and load into ANY database .
Database administrators know there is some dirtyTypical usage of it would be extract data from
data in every database but they have to scopeExcel File,Validate Date Formats, Sort data,
the cost effectiveness of detection and removal.deduplicate it and load it into Oracle database, run
All databases have critical and non-critical areas,stored procedure or Sql script, once loading is
bad data in critical areas are worth the expensecompleted. Unlike Oracle SQL loader, BCP,  DTS
but whose budget is going to weed out the bador SSIS Advanced ETL Processor can also add
data from  a non critical area.new and update old records based on primary
A truly critical time to implement a data cleaningkey.