Data Mining Introduction

Introduction
We have been "manually" extracting patterns from data for many years, but as the volume of data and the variety of sources from which we obtain it grow, a more automatic approach is required.
The increasing power of computer technology is both the cause of and the solution to this increase in the data to be processed, since it has expanded data collection and storage.
Direct hands-on data analysis has increasingly been supplemented, or even replaced entirely, by indirect, automatic data processing.
Data mining is the process of uncovering hidden patterns in data. It has been used by businesses, scientists and governments for years, for example to produce market research reports. A primary use for data mining is to analyse patterns of behaviour.
It can easily be divided into three stages: pre-processing, data mining, and validation of results.

Pre-processing
Once the objective for the data deemed useful and interpretable is known, a target data set has to be assembled.
Logically, data mining can only discover patterns that already exist in the collected data. The target data set must therefore be large enough to contain these patterns, but small enough to be mined within an acceptable time frame.
The target set then has to be cleansed. This removes sources that contain noise or missing data.
The clean data is then reduced into feature vectors (a summarized version of the raw data source) at a rate of one vector per source. The feature vectors are then split into two sets, a "training set" and a "test set". The training set is used to "train" the data mining algorithm(s), while the test set is used to verify the accuracy of any patterns found.

Data mining
Data mining commonly involves four classes of task:
- Classification - Arranges the data into predefined groups. For example, email could be classified as legitimate or spam.
- Clustering - Arranges data into groups defined by algorithms that attempt to group similar items together.
- Regression - Attempts to find a function which models the data with the least error.
- Association rule learning - Searches for relationships between variables. Often used in supermarkets to work out which products are frequently bought together. This information can then be used for marketing purposes.

Validation of Results
The final stage is to verify that the patterns produced by the data mining algorithms occur in the wider data set, as not all patterns found by the data mining algorithms are necessarily valid. If the patterns do not meet the required standards, then the pre-processing and data mining stages have to be re-evaluated. When the patterns meet the required standards, they can be turned into knowledge.
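
The stages described in this introduction (cleansing, reduction to feature vectors, a training/test split, a classification task, and validation on the test set) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the email data is invented, the function names are hypothetical, and the nearest-centroid classifier is simply one easy-to-read choice, not a method the text prescribes.

```python
import random

# Invented raw sources: (word count, link count, label).
# None marks a missing value. All numbers are made up for illustration.
RAW_SOURCES = [
    (150, 0, "legitimate"), (200, 1, "legitimate"), (160, 0, "legitimate"),
    (180, 1, "legitimate"), (140, 0, "legitimate"), (170, 1, "legitimate"),
    (30, 6, "spam"), (25, 8, "spam"), (40, 5, "spam"),
    (20, 9, "spam"), (35, 7, "spam"), (45, 6, "spam"),
    (None, 3, "spam"), (110, None, "legitimate"),
]


def cleanse(sources):
    """Pre-processing: drop any source that has a missing field."""
    return [s for s in sources if None not in s]


def to_feature_vector(source):
    """Reduce one raw source to one (features, label) pair."""
    words, links, label = source
    return (float(words), float(links)), label


def split(vectors, test_fraction=0.25, seed=0):
    """Shuffle, then split into a training set and a test set."""
    shuffled = vectors[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train(training_set):
    """Data mining: compute one centroid (mean vector) per label."""
    sums, counts = {}, {}
    for features, label in training_set:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: tuple(v / counts[label] for v in acc)
        for label, acc in sums.items()
    }


def classify(centroids, features):
    """Assign the label whose centroid is closest to the features."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def accuracy(centroids, test_set):
    """Validation: fraction of held-out vectors classified correctly."""
    hits = sum(classify(centroids, f) == label for f, label in test_set)
    return hits / len(test_set)


clean = cleanse(RAW_SOURCES)
vectors = [to_feature_vector(s) for s in clean]
training_set, test_set = split(vectors)
centroids = train(training_set)
print("test accuracy:", accuracy(centroids, test_set))
```

If the accuracy on the test set fell below whatever standard had been agreed, the loop described above would apply: revisit the cleansing, the choice of features, or the algorithm, and run the stages again.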