| Get at the roots of your data conversion | | | | The components of the city lines are city, state, |
| problems by designing a custom data parsing | | | | zip and country. City lines are generally |
| algorithm. This is an iterative process which | | | | recognizable by their position (last), the presence |
| requires a knowledgeable programmer and a user | | | | of a recognizable state name or abbreviation and |
| who is well versed in the actual data, what it | | | | the presence of a 5 or 9 digit number (zip code). |
| looks like, its anomalies and how they came about. | | | | Allowances have to be made to deal with foreign |
| The business decisions that are required depart | | | | countries, absence of zip codes, state name |
| from the typical standard applied to programming | | | | misspellings, and other data entry errors. |
| work because the data to be parsed is generally | | | | Address lines can generally be recognized by their |
| passed from one system to another only once. | | | | position (between the name and city lines), and |
| That being the case, some parsing will be manual, | | | | the presence of numeric values, key words and |
| that being the least costly solution. | | | | abbreviations (street, avenue, box, etc.). The |
| A successful data migration project involves | | | | components of the address line are the most |
| transforming data from the format and data | | | | complicated and include items such as street |
| integrity rules (or lack of rules) of the source | | | | number, street name, street directional, street |
| system into a different format and a different | | | | type, etc. |
| set of rules, those that makeup the database | | | | Once the address parsing algorithm has properly |
| integrity of your new system. The process of | | | | identified the address components, the individual |
| evaluating, standardizing and interpreting the | | | | parts can then be reassembled in the form |
| source data so that it can be properly | | | | required by the target system. Specific |
| reformatted and stored in the target system is | | | | components can also be standardized, if desired, |
| sometimes referred to as "parsing". All data in the | | | | by using standard abbreviations and by correcting |
| source system must be looked at before | | | | misspellings. These options often accomplish a |
| migrating it to the new system. Nothing should be | | | | significant portion of data "scrubbing" that would |
| taken for granted regardless of how well defined | | | | otherwise have to be done manually. |
| the data entry procedures for the old system | | | | The use of translation tables also play a big part in |
| may have been. The nature of data and the | | | | correcting data entry errors or inconsistencies, |
| flexibility for its use which is designed into | | | | and changing data items that are not incorrect but |
| systems, results in Parsing being required for | | | | do not conform to the data integrity rules of the |
| most data elements. | | | | new system. An example might be changing |
| Perhaps the most globally used data items | | | | abbreviated insurance company names where no |
| requiring the use of a parsing algorithm are name | | | | specific rules were followed and the data does |
| and address information. Name and address | | | | not easily conform to a programmable solution. |
| parsing is based on the concept that name and | | | | Summary: |
| address information is comprised of numerous | | | | This explanation has focused on the name and |
| components that have common, identifiable | | | | address component of data, however, the same |
| characteristics. Although the process is not | | | | general concepts apply to any other data element |
| infallible, a high degree of success can be achieved | | | | that requires transformation between systems. |
| in parsing out names and addresses so they can | | | | It is important to know that extensive |
| be successfully reformatted for use in a system | | | | development of a good parsing tool only comes |
| with different formatting requirements than the | | | | over time and many conversions. |
| system which originally captured the data. One of | | | | Additional uses: |
| the most common problems which parsing is less | | | | In some cases the parsing tool can be used to |
| than perfect in resolving is inconsistent data entry. | | | | locate data elements that have not been |
| A simplified overview of parsing name and | | | | correlated in the data base but the correlation |
| address information: | | | | exists in transaction items or through some other |
| A name and address block is made up of three | | | | set of conditions and data. |
| main components-name lines, address lines and | | | | An example: |
| city/state/zip code lines. Any of these may occur | | | | By extracting and correlating part numbers and |
| multiple times or may be absent. Each has | | | | their associated part names, we electronically built |
| particular characteristics that can be identified and | | | | a translation table which we then used in the |
| is made up of its own set of components. | | | | conversion. This could have been done manually |
| Alternatively, any two or all three may be | | | | but the example cited involved over a million |
| combined into one data field. | | | | transactions and several hundred part numbers |
| The name lines will be first and are made up of a | | | | and their associated part names. It may seem |
| name prefix (i.e. Mr., Mrs., Ms, etc.) a first name, | | | | unreasonable that a system would be developed |
| middle name, and last name and a name suffix, | | | | and put into production where something as |
| (i.e. Dr., DDS, etc.). Compound names can be | | | | simple as tying the part name and part number |
| recognized by key words or characters such as | | | | together in the data base, was actually missing. |
| "and" or "&". Business names can be recognized | | | | But the conversion was done in our office and we |
| by keywords such as "Company", "Inc." etc. To | | | | did have to build the translation table as described. |
| be successful the name parser must take into | | | | This is only one of many examples from my |
| account such things as misspellings, plural forms, | | | | experience of nearly 2000 data conversions. I am |
| hyphenated names and abbreviations. To allow | | | | sure the reader has at least one they know of |
| flexibility for the unique characteristics of a | | | | and probably more, not counting the ones the |
| particular region or business, the identification of | | | | "got away". |
| these components is built into tables. The parser | | | | A well thought out and well planned data |
| must also have options to deal with names stored | | | | conversion which incorporates the use of good |
| in reverse order, i.e. "Smith, John" instead of "John | | | | tools, a talented programmer and an expert user, |
| Smith" and, in the case of last name first, must | | | | can save many hours before go-live and even |
| allow for various methods of denoting the last | | | | more hours after the system is in use. |
| name termination, i.e. "," or "#" or ";". | | | | |