Database Cleansing - the What, Why and How

-link">postcode + suburb + state that you may have
In this article I will attempt to set out simply whatobtained from Australia Post. Such a process
it means to clean a database, why you need towould likely generate a list of records where the
do it, and how you might go about it. Firstly thesuburb was not found, requiring you to manually
"What"investigate and correct the data.
Cleaning a database is done to:Correcting the formatting of your data, is usually
- Remove duplicate recordsdone using some pretty simple SQL perhaps
- Ensure your data is consistently formattedcombined with logic programming. You need to
- Correct data that is obviously wrong e.g. wrongdecide the format you wish to apply to your
postcode for a known suburbdata, for example, whether you would like the
- Find other records that are likely to be thesuburb in title case or all capitals. While this is much
same (more on this later)So "Why" would youless important than getting the data actually right,
want to do that?it can help to make your communications look
To explain why, I am going to use the examplemore professional.
of a customer database, but the principles applyFinding duplicates is a fairly easy task for
to other types of data also.someone who knows a little about the SQL
Have you ever received a marketing message /database language. It is more difficult to find
catalogue in the mail twice or more times? Isimilar records that really are the same person,
receive multiple copies of such communicationsbut are not listed in exactly the same way in your
regularly, and I don't always get around to tellingdatabase. For instance the following two records
the sender of their mistake. This can:may actually be the same
- be interpretted as sloppiness on the part of therbPostcodeState3442JohnCitizenPO Box
organisation33Frankston3199VIC682JonathonCitien14 Beach
- undo your efforts to target / personalise - anyRoadFRANKSTON3199VIC
attempt on the organisation's part to "personalise"Finding records such as the above calls for what
and "target" the message is wasted, because theis usually called "Fuzzy" Matching. Software is
recipient knows immediately that it was aavailable to find such records, and much more
mindless distribution of information using aexperienced SQL programmers could write
database.software to find such possible duplicates.
- waste $$$! Everytime you send aBecause you can't confidently use logic to
communication twice to the one person ordetermine whether or not two records are the
household, you have most likely just wastedsame in the case given above, usually fuzzy
some of your hard-earned funds.matching would leave the data as is, but produce
In addition, cleaning your data, will help you toan exception report, highlighting likely duplicate
analyse your data more accurately. For instance,records.
you will know the real number of contacts andEven when you can determine confidently that
perhaps how they are geographically distributed,two records are the same, you may wish to
rather than the distorted figures that can bemanually process the data cleanup to ensure that
derived from analysing a corrupted database.only the correct data is kept, and that all
It's not a crime! In fact it is very easy for yourassociated pieces of information are transferred
data to get in a state that requires cleaning. Foracross to the valid record e.g. customer payment
example, when a client changes their address,history. It is possible however, to set up your
your staff might update the suburb but forget tode-duplication process to remove all the duplicates
put in the new postcode. Or, an existing clientand clean up all the records automatically.
returns to your organisation several years later,Cleaning your database can take some time, and
without informing new staff that they are ansome manual effort on the part of your staff. If
existing client, and if you don't have theyou are just starting out with a new database, it
appropriate keys on your database preventingis very worthwhile to:
duplicates, the client could be set up again as1. Agree and document the data structure, and
another customer with the same or similar details.what information will be stored in what field (which
Having documented processes that your staffisn't always obvious despite the names you might
can use as a checklist, and appropriate uniquegive fields)
keys on your database fields, will go some way2. Agree the format of the data entered into
to ensuring that your data is kept clean, buteach field
incorrect data will never be prevented."How" then,3. Agree a process to handle the case where a
do you efficiently clean your database?record needs to be entered that won't fit into the
Fixing incorrect information such as the postcodecurrent structure
matching the suburb is usually done by comparingIf you need help cleaning your database, Contact
each record to the correct values in another table.Point can help you. We provide a quick and
For example, to correct all the postcodes in yourefficient service to deal with all the database
data, assuming that the suburb entered is correct,issues discussed above, and can tailor our service
you would write SQL code that would compareto meet your particular needs. Submit a request
the postcode of your record against a table ofnow for an obligation free quote.