What is Data Profiling?

Before you start any data integration or data analysis project (indeed before you start any project involving data) you need to understand what data you really have; not what data you think you have.

Data Profiling is about building that understanding, and validating everyone's assumptions about what data you have and what uses it can be put to.

Many data projects start off with data which was collected for one reason and is now being put to some new unanticipated use. Data Profiling is about finding gaps in your data which you may need to augment. It's about finding what uses the data will actually support. And most importantly of all, it's about flagging these issues up early in your project, before they become critical.

Any issue fixed in the analysis stage of a project is going to be hundreds of times cheaper to fix than one found during the testing or, worse, rollout phase of your project.

As such Data Profiling is an essential first step for any data project. Not just because it tells you what data you have, but because it is a quick way to find out problems early while they can still be addressed in a cost-effective way.

Traditionally Data Profiling, if it has been done at all, has been a manual, error-prone process. But there are now Data Profiling tools and processes which do the heavy lifting, leaving you to leverage your experience in your business and your requirements, focusing your time and energy where it really matters.

But we still haven't truly said what Data Profiling is. At its simplest, it is a collection of simple to understand and generate statistics and checks which you can perform against your data to find issues, outliers, missing data, or anomalies; all items that you need to address, or at least be aware of, as your project progresses. And while it would be great to have a business expert sitting alongside you during this process, you can quickly find issues and create a meaningful list of questions about any dataset with minimal prior knowledge. Of course to get the best out of your data you are going to need some knowledge of how the data was collected, the business needs and any implicit assumptions.
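
To make the idea of "simple statistics and checks" concrete, here is a minimal sketch in Python of what profiling a single column might look like. It is an illustration only, not the method of any particular Data Profiling tool: the function name profile_column, the sample ages column, and the choice of the common 1.5 x interquartile range rule for flagging outliers are all assumptions made for this example.

```python
import statistics


def is_missing(value):
    """Treat None and empty or whitespace-only strings as missing (an assumption)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def profile_column(values):
    """Return a few basic profiling statistics for one column of raw values."""
    present = [v for v in values if not is_missing(v)]

    # Separate values that parse as numbers from those that do not.
    numeric, non_numeric = [], []
    for v in present:
        try:
            numeric.append(float(v))
        except (TypeError, ValueError):
            non_numeric.append(v)

    profile = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "non_numeric": non_numeric,
        "outliers": [],
    }

    # Quartiles need a handful of values to be meaningful.
    if len(numeric) >= 4:
        q1, _median, q3 = statistics.quantiles(numeric, n=4)
        spread = q3 - q1
        low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
        profile["min"] = min(numeric)
        profile["max"] = max(numeric)
        # Flag values far outside the middle half of the data.
        profile["outliers"] = [x for x in numeric if x < low or x > high]

    return profile


# Hypothetical sample: an "age" column as it might arrive from a source system.
ages = ["34", "29", "", "41", "38", None, "n/a", "35", "420", "31"]
report = profile_column(ages)
for name, value in report.items():
    print(f"{name}: {value}")
```

Even on this tiny sample the report raises the kind of questions the text describes: why some ages are missing, what "n/a" is meant to signify, and whether 420 is a typing error. None of these need business knowledge to find, although answering them does.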