Data Preprocessing

Data preprocessing is the step in data mining that transforms raw data into a meaningful form, so the data can be understood better and used to full effect. Doing so improves the efficiency of the subsequent mining process and the quality of the generated model.

The quality of the data should be taken into consideration before applying any algorithm or data mining technique, in order to get the best results. Preprocessing also gives a clear view of the data, its structure, and what is to be predicted or analyzed.

Why is Data Preprocessing Done?

Data in the real world is dirty, meaning it is often:

  1. Incomplete: lacking attribute values, lacking certain attributes of interest, or sometimes containing only aggregated data.
  2. Noisy: containing errors or outliers.
  3. Inconsistent: containing discrepancies in names.
No quality data means no quality mining results:
quality decisions must be based on quality data, and
data mining needs consistent integration of quality data.

Steps in Data Preprocessing:


  1. Data Cleaning:
    1. Fill in missing values
    2. Smooth noisy data
    3. Identify or remove outliers
    4. Remove inconsistencies
  2. Data Integration:
    1. Integrate data from multiple sources: databases, files, and even surveys.
  3. Data Transformation:
    1. Normalization and aggregation are performed on the data.
    2. Discretization reduces the size of the data.
  4. Data Reduction:
    1. Obtain a reduced representation that is smaller in volume but produces the same or similar analytical results.
Data Cleaning:
Data cleaning is the step that removes unwanted data and fixes missing values in the dataset.
Data is not always available; sometimes tuples have no recorded value for several attributes.

Missing data may be due to:
  1. Equipment malfunction at the time of recording.
  2. Data that was inconsistent with other recorded data and was therefore deleted.
  3. Data not entered due to misunderstanding.
  4. Certain data may not have been considered important at the time of entry.
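The fill-in-missing-values step above can be sketched in plain Python. This is a minimal illustration, assuming a made-up list of sensor readings in which None marks a missing entry; each gap is filled with the mean of the values that were actually observed:

```python
from statistics import mean

# Hypothetical sensor readings; None marks a missing value.
readings = [12.0, None, 14.5, 13.2, None, 15.1]

# Compute the mean over the values that were actually observed.
observed = [v for v in readings if v is not None]
fill = mean(observed)

# Replace each missing entry with that mean.
cleaned = [v if v is not None else fill for v in readings]
print(cleaned)
```

Mean imputation is only one option; a domain-specific default, the median, or dropping the tuple entirely are equally common choices.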
Data Integration:
Data integration is the process of combining data from multiple sources and databases into a single dataset or a dedicated platform.
Careful integration helps reduce and avoid redundancies and inconsistencies in the resulting data set.
This in turn improves the accuracy and speed of the subsequent data mining process.
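A minimal sketch of such integration, assuming two hypothetical sources that describe overlapping sets of customers keyed by the same ID: records sharing a key are merged into one, so the overlap does not produce redundant entries:

```python
# Two hypothetical sources describing overlapping sets of customers.
source_a = {1: {"name": "Alice"}, 2: {"name": "Bob"}}
source_b = {2: {"city": "Paris"}, 3: {"name": "Carol", "city": "Rome"}}

# Merge records that share a key instead of storing them twice.
integrated = {}
for source in (source_a, source_b):
    for key, record in source.items():
        integrated.setdefault(key, {}).update(record)

print(integrated[2])  # {'name': 'Bob', 'city': 'Paris'}
```

Real integration also has to resolve schema differences (the same attribute under different names) and conflicting values between sources, which this sketch does not attempt.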
  
Data Transformation:
Data transformation is an important step that normalizes the data and changes its structure. It includes the following strategies:
  1. Smoothing: removes noise from the data, which in turn helps reveal the important features of the dataset.
  2. Attribute Construction: new attributes are created.
  3. Aggregation: a summarized form of the data is presented.
  4. Normalization: the data is scaled to make sure it falls within a specified range.
  5. Discretization: raw values of a numeric attribute are replaced by interval labels or conceptual labels.
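The normalization and discretization strategies can be illustrated together. The sketch below uses made-up values: min-max normalization scales the attribute into the range [0, 1], and each scaled value is then replaced by an interval label:

```python
values = [2.0, 5.0, 8.0, 11.0]  # made-up raw attribute values

# Min-max normalization: scale the attribute into the range [0, 1].
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Discretization: replace each scaled value with an interval label.
def bucket(x, labels=("low", "medium", "high")):
    # Equal-width bins over [0, 1]; clamp x == 1.0 into the last bin.
    return labels[min(int(x * len(labels)), len(labels) - 1)]

discretized = [bucket(x) for x in normalized]
print(discretized)
```

Equal-width binning is the simplest scheme; equal-frequency bins or bins learned from the data (e.g., by clustering) are common alternatives.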
Data Reduction:
Data reduction is performed to obtain a reduced representation of the data set that is smaller in volume yet produces the same analytical results.

Data reduction methods:
  1. Dimensionality reduction: the process of reducing the number of random variables or attributes under consideration.
  2. Data Compression: transformations are applied to obtain a reduced or compressed representation of the original data.
  3. Numerosity reduction: a technique that replaces the original data volume with alternative, smaller forms of data representation.
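Numerosity reduction by sampling is the easiest of these to sketch. Assuming a made-up data set of 1,000 transaction IDs, the snippet below keeps a 10% simple random sample (seeded so the sketch is repeatable):

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable
data = list(range(1000))  # stand-in for 1,000 transactions

# Numerosity reduction: keep a 10% simple random sample,
# drawn without replacement.
sample = random.sample(data, k=len(data) // 10)
print(len(sample))  # 100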



