Data Parsing Gets at the Roots of Your Legacy Data

Get at the roots of your data conversionThe components of the city lines are city, state,
problems by designing a custom data parsingzip and country. City lines are generally
algorithm. This is an iterative process whichrecognizable by their position (last), the presence
requires a knowledgeable programmer and a userof a recognizable state name or abbreviation and
who is well versed in the actual data, what itthe presence of a 5 or 9 digit number (zip code).
looks like, its anomalies and how they came about.Allowances have to be made to deal with foreign
The business decisions that are required departcountries, absence of zip codes, state name
from the typical standard applied to programmingmisspellings, and other data entry errors.
work because the data to be parsed is generallyAddress lines can generally be recognized by their
passed from one system to another only once.position (between the name and city lines), and
That being the case, some parsing will be manual,the presence of numeric values, key words and
that being the least costly solution.abbreviations (street, avenue, box, etc.). The
A successful data migration project involvescomponents of the address line are the most
transforming data from the format and datacomplicated and include items such as street
integrity rules (or lack of rules) of the sourcenumber, street name, street directional, street
system into a different format and a differenttype, etc.
set of rules, those that makeup the databaseOnce the address parsing algorithm has properly
integrity of your new system. The process ofidentified the address components, the individual
evaluating, standardizing and interpreting theparts can then be reassembled in the form
source data so that it can be properlyrequired by the target system. Specific
reformatted and stored in the target system iscomponents can also be standardized, if desired,
sometimes referred to as "parsing". All data in theby using standard abbreviations and by correcting
source system must be looked at beforemisspellings. These options often accomplish a
migrating it to the new system. Nothing should besignificant portion of data "scrubbing" that would
taken for granted regardless of how well definedotherwise have to be done manually.
the data entry procedures for the old systemThe use of translation tables also play a big part in
may have been. The nature of data and thecorrecting data entry errors or inconsistencies,
flexibility for its use which is designed intoand changing data items that are not incorrect but
systems, results in Parsing being required fordo not conform to the data integrity rules of the
most data elements.new system. An example might be changing
Perhaps the most globally used data itemsabbreviated insurance company names where no
requiring the use of a parsing algorithm are namespecific rules were followed and the data does
and address information. Name and addressnot easily conform to a programmable solution.
parsing is based on the concept that name andSummary:
address information is comprised of numerousThis explanation has focused on the name and
components that have common, identifiableaddress component of data, however, the same
characteristics. Although the process is notgeneral concepts apply to any other data element
infallible, a high degree of success can be achievedthat requires transformation between systems.
in parsing out names and addresses so they canIt is important to know that extensive
be successfully reformatted for use in a systemdevelopment of a good parsing tool only comes
with different formatting requirements than theover time and many conversions.
system which originally captured the data. One ofAdditional uses:
the most common problems which parsing is lessIn some cases the parsing tool can be used to
than perfect in resolving is inconsistent data entry.locate data elements that have not been
A simplified overview of parsing name andcorrelated in the data base but the correlation
address information:exists in transaction items or through some other
A name and address block is made up of threeset of conditions and data.
main components-name lines, address lines andAn example:
city/state/zip code lines. Any of these may occurBy extracting and correlating part numbers and
multiple times or may be absent. Each hastheir associated part names, we electronically built
particular characteristics that can be identified anda translation table which we then used in the
is made up of its own set of components.conversion. This could have been done manually
Alternatively, any two or all three may bebut the example cited involved over a million
combined into one data field.transactions and several hundred part numbers
The name lines will be first and are made up of aand their associated part names. It may seem
name prefix (i.e. Mr., Mrs., Ms, etc.) a first name,unreasonable that a system would be developed
middle name, and last name and a name suffix,and put into production where something as
(i.e. Dr., DDS, etc.). Compound names can besimple as tying the part name and part number
recognized by key words or characters such astogether in the data base, was actually missing.
"and" or "&". Business names can be recognizedBut the conversion was done in our office and we
by keywords such as "Company", "Inc." etc. Todid have to build the translation table as described.
be successful the name parser must take intoThis is only one of many examples from my
account such things as misspellings, plural forms,experience of nearly 2000 data conversions. I am
hyphenated names and abbreviations. To allowsure the reader has at least one they know of
flexibility for the unique characteristics of aand probably more, not counting the ones the
particular region or business, the identification of"got away".
these components is built into tables. The parserA well thought out and well planned data
must also have options to deal with names storedconversion which incorporates the use of good
in reverse order, i.e. "Smith, John" instead of "Johntools, a talented programmer and an expert user,
Smith" and, in the case of last name first, mustcan save many hours before go-live and even
allow for various methods of denoting the lastmore hours after the system is in use.
name termination, i.e. "," or "#" or ";".