Introduction

We have been "manually" extracting patterns from data for many years, but as the volume of data and the variety of sources from which we obtain it grow, a more automatic approach is required.

The increasing power of computer technology is both the cause of this growth and its solution: it has expanded data collection and storage, and it makes automatic processing possible. Direct hands-on data analysis has increasingly been supplemented, or even replaced entirely, by indirect, automatic data processing.

Data mining is the process of uncovering hidden patterns in data. It has been used by businesses, scientists and governments for years, for example to produce market research reports. A primary use for data mining is to analyse patterns of behaviour.

The process can easily be divided into stages: pre-processing, data mining and validation of results.

Pre-processing

Once the objective is known, that is, what useful and interpretable information the data is expected to yield, a target data set has to be assembled. Data mining can only discover patterns that already exist in the collected data, so the target data set must be large enough to contain these patterns, yet small enough to be mined within an acceptable time frame.

The target set then has to be cleansed. This removes sources that contain noise or missing data.

The clean data is then reduced to feature vectors (a summarized version of the raw data source), at a rate of one vector per source. The feature vectors are then split into two sets, a "training set" and a "test set". The training set is used to "train" the data mining algorithm(s), while the test set is used to verify the accuracy of any patterns found.

Data mining

Data mining commonly involves four classes of task:
- Classification - Arranges the data into predefined groups. For example, email could be classified as legitimate or spam.
- Clustering - Arranges the data into groups that are not predefined, using algorithms that attempt to group similar items together.
- Regression - Attempts to find a function which models the data with the least error.
- Association rule learning - Searches for relationships between variables. Often used in supermarkets to work out which products are frequently bought together. This information can then be used for marketing purposes.

Validation of Results

The final stage is to verify that the patterns produced by the data mining algorithms occur in the wider data set, as not all patterns found by the algorithms are necessarily valid. If the patterns do not meet the required standards, the pre-processing and data mining stages have to be re-evaluated. Once the patterns meet the required standards, they can be turned into knowledge.
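
The stages described in this text (cleansing, reduction to feature vectors, the training/test split, a classification task and validation) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the sample messages, the `SPAM_WORDS` list, the two-number feature vector and the nearest-centroid classifier are all hypothetical choices made for the example and are not taken from the text above.

```python
import random

# Hypothetical raw sources: (message text, label). None marks missing data.
RAW_SOURCES = [
    ("Meeting moved to 3pm, see agenda attached", "legitimate"),
    ("Win a free prize now!!", "spam"),
    ("Lunch on Friday?", "legitimate"),
    ("Free money! Win big!", "spam"),
    (None, "spam"),                           # missing text
    ("Quarterly report draft for review", "legitimate"),
    ("Claim your prize money now!", "spam"),
    ("Minutes from yesterday's call", None),  # missing label
    ("Can you review my notes today", "legitimate"),
    ("You win! Free prize inside!", "spam"),
]

# Assumed keyword list used only to build the example feature vector.
SPAM_WORDS = {"free", "win", "prize", "money"}


def cleanse(sources):
    """Pre-processing: drop sources that have missing data."""
    return [(text, label) for text, label in sources
            if text is not None and label is not None]


def to_feature_vector(text):
    """Summarise one source as (spam-word count, exclamation-mark count)."""
    words = [w.strip("!?.,").lower() for w in text.split()]
    spam_hits = sum(1 for w in words if w in SPAM_WORDS)
    return (spam_hits, text.count("!"))


def split(vectors, test_fraction=0.25, seed=0):
    """Shuffle, then split into (training set, test set)."""
    shuffled = list(vectors)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train(training_set):
    """Classification: compute the mean feature vector (centroid) per label."""
    sums, counts = {}, {}
    for vector, label in training_set:
        total = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(s / counts[label] for s in total)
            for label, total in sums.items()}


def classify(model, vector):
    """Assign the label whose centroid is nearest to the vector."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))


def accuracy(model, test_set):
    """Validation: fraction of held-out vectors that are classified correctly."""
    correct = sum(1 for v, label in test_set if classify(model, v) == label)
    return correct / len(test_set)


if __name__ == "__main__":
    clean = cleanse(RAW_SOURCES)
    vectors = [(to_feature_vector(text), label) for text, label in clean]
    training_set, test_set = split(vectors)
    model = train(training_set)
    print("accuracy on test set:", accuracy(model, test_set))
```

If the accuracy on the test set falls below whatever standard has been set, the feature definition or the choice of algorithm would be revisited, which mirrors the re-evaluation step described under validation.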