Data cleansing, or data scrubbing, is the act of identifying and correcting corrupt or inaccurate records in a dataset or table. It is mostly applied to databases and files. The term refers to identifying data that is inexact, imprecise, irrelevant or incomplete, and then deleting, replacing or modifying it. Many companies that sell business sales leads and databases also offer data cleansing as a service. Data cleansing helps keep business data up to date and error free.

After the cleaning process, the dataset is consistent with other similar datasets in the system, because all inconsistencies have been removed. The process is different from data validation and also involves the removal of typographical errors. Well-known techniques such as data transformation, statistical methods, parsing (to detect syntax errors) and duplicate elimination are used for data cleansing. Good, clean data needs to fulfill the criteria below:

Accuracy: A combined value over the criteria of integrity, density and consistency.
Completeness: Discrepancies in the data should be corrected.
Density: The proportion of omitted values in the data, relative to the total number of values, must be known.
Consistency: Concerns contradictions and syntactical differences.
Uniformity: Concerns irregularities in the data.
Integrity: A combined value over the criteria of completeness and soundness.
Uniqueness: Related to the number of duplicates in the data.

The cleansing services offered by most data cleaning companies are:

Removing duplicate records.
Tagging and identifying identical records.
Removing forged, bogus or untrue data.
Data validation.
Deleting outdated records.
Comparing records against third-party lists, such as opt-in and opt-out lists, and removing the matches.
Data cleansing, aggregation and organization.
Identifying incomplete or misplaced facts or figures.
Improving data, including product characteristics, assemble order and metaphors.
Eliminating duplicate data or figures that may appear as similar records.

The common challenges faced by data cleansing applications are:

Loss of information: The corrected data often contains less information than the original. Invalid and duplicate entries are deleted, but entries whose information is limited or insufficient are often deleted as well, and that information is lost.
Cost: Data cleansing is expensive and time consuming, so it is important to maintain the cleaned data effectively.

Fortunately, the benefits are worth much more than the challenges.
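
The density and uniqueness criteria, and the duplicate-removal service listed above, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample records, the field names, and the helpers `is_missing`, `density`, `normalize` and `deduplicate` are hypothetical, and treating two rows as duplicates when they match after trimming whitespace and lowercasing is one possible rule, not a standard one.

```python
# Hypothetical contact records; None or an empty string marks an omitted value.
records = [
    {"name": "Ann Lee", "email": "ann@example.com", "city": "Leeds"},
    {"name": "ann lee ", "email": "ANN@example.com", "city": "Leeds"},
    {"name": "Bob Ray", "email": None, "city": ""},
]


def is_missing(value):
    """Treat None and blank strings as omitted values."""
    return value is None or str(value).strip() == ""


def density(rows):
    """Proportion of omitted values among all values in the dataset."""
    total = sum(len(row) for row in rows)
    missing = sum(1 for row in rows for v in row.values() if is_missing(v))
    return missing / total if total else 0.0


def normalize(row):
    """Build a comparison key: trimmed, lowercased values in field order."""
    return tuple(
        (key, "" if is_missing(value) else str(value).strip().lower())
        for key, value in sorted(row.items())
    )


def deduplicate(rows):
    """Keep the first occurrence of each normalized row, drop the rest."""
    seen = set()
    unique = []
    for row in rows:
        key = normalize(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


cleaned = deduplicate(records)
print(f"density of omitted values: {density(records):.3f}")
print(f"duplicates removed: {len(records) - len(cleaned)}")
```

Note that the third record is kept even though most of its fields are empty. Deleting such sparse entries would lower the proportion of omitted values, but it is exactly the loss of information described among the challenges, so a real pipeline would have to choose that threshold deliberately.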