Data Preparation/Scrubbing
  • is the act of manipulating raw data into a form that can readily and accurately be analyzed

Data Preparation - For Parametric Statistical Methods

Parametric Statistical Methods assume that the data has a known and specific distribution, often a Gaussian Distribution

Data Scrubbing Type

Description

Normality Tests

  • testing whether sample-data has a normal distribution
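
One way such a test can be sketched is the Jarque-Bera test, which checks whether the sample's skewness and kurtosis match those of a normal distribution. The notes do not name a specific test, so this choice is an assumption; Shapiro-Wilk and Kolmogorov-Smirnov are other common options. The function name `jarque_bera` and the generated samples are illustrative only.

```python
import math
import random

def jarque_bera(sample):
    """Return (JB statistic, p-value) for the null hypothesis of normality."""
    n = len(sample)
    mean = sum(sample) / n
    # Central moments in population form (dividing by n).
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    # Under the null, JB is asymptotically chi-squared with 2 degrees of
    # freedom, whose survival function is exp(-x / 2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value

rng = random.Random(0)  # fixed seed so the illustration is repeatable
gaussian_sample = [rng.gauss(0.0, 1.0) for _ in range(2000)]
skewed_sample = [rng.expovariate(1.0) for _ in range(2000)]

jb_gauss, p_gauss = jarque_bera(gaussian_sample)
jb_skewed, p_skewed = jarque_bera(skewed_sample)
print(f"gaussian: JB={jb_gauss:.2f}, p={p_gauss:.3f}")
print(f"skewed:   JB={jb_skewed:.2f}, p={p_skewed:.3g}")
```

A small p-value is evidence against normality. The chi-squared approximation is asymptotic, so the p-value is unreliable for small samples.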

Transforming Data to Normal Distribution

  • transforming a data variable to take on a normal distribution
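
As a minimal sketch, a log transform is one common way to pull a right-skewed, strictly positive variable toward a normal shape. The notes do not specify a transform, so this is an assumption; Box-Cox and square-root transforms are alternatives. The helper `skewness` and the generated data are illustrative.

```python
import math
import random

def skewness(values):
    """Sample skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(1)  # fixed seed so the illustration is repeatable
# Log-normal data: strictly positive and strongly right-skewed.
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]
# The natural log of log-normal data is normally distributed by construction.
transformed = [math.log(v) for v in raw]

print(f"skewness before: {skewness(raw):.2f}")
print(f"skewness after:  {skewness(transformed):.2f}")
```

The log transform only applies to positive values. Real data will rarely become exactly normal, so it is worth re-running a normality test after transforming.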

Data Preparation - Other

Data Scrubbing Type

Description

Feature Engineering

is the process of using domain knowledge to extract features from raw data and creating feature functions
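
A small illustration, assuming a hypothetical sales record with `timestamp`, `total_price`, and `quantity` fields (none of these names come from the notes). Domain knowledge suggests that time of day, weekends, and unit price matter more to a model than the raw fields do.

```python
from datetime import datetime

def extract_features(record):
    """Derive model-ready features from one raw record (hypothetical schema)."""
    ts = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M")
    return {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),      # Monday = 0 ... Sunday = 6
        "is_weekend": ts.weekday() >= 5,  # Saturday or Sunday
        "price_per_unit": record["total_price"] / record["quantity"],
    }

raw = {"timestamp": "2024-03-09 14:30", "total_price": 45.0, "quantity": 3}
features = extract_features(raw)
print(features)
```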

Normalization

given 𝒙 = {𝑥1, …, 𝑥𝑛}

  • min-max normalization(𝑥ᵢ) = [𝑥ᵢ - 𝑚𝑖𝑛(𝒙)] / [𝑚𝑎𝑥(𝒙) - 𝑚𝑖𝑛(𝒙)]
  • mean normalization(𝑥ᵢ) = [𝑥ᵢ - 𝜇] / [𝑚𝑎𝑥(𝒙) - 𝑚𝑖𝑛(𝒙)]
  • z-score normalization(𝑥ᵢ) = [𝑥ᵢ - 𝜇] / 𝜎 # rescales each variable to zero mean and unit standard deviation

where:

  • 𝜇 - mean
  • 𝜎 - standard deviation
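
The three formulas above can be sketched directly in Python. The function names are illustrative. Using the population standard deviation for 𝜎 is an assumption, since the notes do not say whether the sample or population form is meant.

```python
import statistics

def min_max_normalize(xs):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def mean_normalize(xs):
    """Center on the mean, then scale by the range."""
    mu = statistics.mean(xs)
    lo, hi = min(xs), max(xs)
    return [(x - mu) / (hi - lo) for x in xs]

def z_score_normalize(xs):
    """Center on the mean, then scale by the standard deviation."""
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)  # population form (assumption)
    return [(x - mu) / sigma for x in xs]

xs = [2.0, 4.0, 6.0, 8.0, 10.0]
print(min_max_normalize(xs))   # [0.0, 0.25, 0.5, 0.75, 1.0]
print(mean_normalize(xs))      # [-0.5, -0.25, 0.0, 0.25, 0.5]
print(z_score_normalize(xs))
```

All three divide by a spread measure, so a constant input (max equal to min, or 𝜎 equal to 0) raises `ZeroDivisionError` and needs separate handling.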

Detecting Outliers

  • detecting and removing outliers that would otherwise skew the model
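
One simple detection rule is Tukey's fences: flag any point more than 1.5 × IQR below the first quartile or above the third. The notes do not prescribe a method, so this rule, the multiplier 1.5, and the name `split_outliers` are assumptions for illustration.

```python
import statistics

def split_outliers(values, k=1.5):
    """Return (kept, outliers) using Tukey's fences with multiplier k."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return kept, outliers

data = [10, 12, 11, 13, 12, 11, 10, 12, 95]
kept, outliers = split_outliers(data)
print(outliers)  # [95]
```

Whether a flagged point should be removed, capped, or kept is a judgment call: an outlier can be a recording error or a genuine rare event.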

Discretization

  • divides the range of a continuous attribute into intervals
  • equal-width (distance) binning
    • 𝑏𝑖𝑛-𝑤𝑖𝑑𝑡ℎ = (𝑚𝑎𝑥 - 𝑚𝑖𝑛) / 𝐾, where 𝐾 is the number of bins
  • equal-depth (frequency) binning
    • divides the range into intervals, each containing approximately the same number of data points
  • bottom-up binning
    • starts from many small bins and merges adjacent ones, using criteria such as entropy to characterize the purity of bins
  • clustering
    • groups values with a clustering algorithm so that bin boundaries follow the data
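
The first two binning schemes can be sketched as follows. The function names are illustrative, and each function returns a bin index per value rather than the interval boundaries.

```python
def equal_width_bins(values, k):
    """Assign each value a bin index 0..k-1 using bins of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # bin-width = (max - min) / K
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_depth_bins(values, k):
    """Assign bin indices so each bin holds roughly len(values) / k points."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(equal_width_bins(values, 3))  # [0, 0, 0, 0, 0, 0, 0, 0, 2]
print(equal_depth_bins(values, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

The example shows why the choice matters: a single extreme value leaves the middle equal-width bin empty, while equal-depth binning still spreads the points evenly. This equal-depth sketch assigns by rank, so tied values can end up in different bins.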

Nominal & Ordinal

sometimes we want to convert nominal and ordinal values to “continuous” values

  • nominal - values from an unordered set (colors, profession, etc.)
  • ordinal - values from an ordered set (rank, etc.)
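
A minimal sketch of two common encodings, both assumptions since the notes do not name a technique: one-hot encoding for nominal values (no order is implied) and integer ranks for ordinal values (order is preserved). The function names and example levels are illustrative.

```python
def one_hot_encode(values):
    """Map each nominal value to a 0/1 vector with one slot per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories

def ordinal_encode(values, ordered_levels):
    """Map each ordinal value to its rank in the caller-supplied ordering."""
    rank = {level: i for i, level in enumerate(ordered_levels)}
    return [rank[v] for v in values]

colors = ["red", "green", "blue", "green"]
encoded_colors, categories = one_hot_encode(colors)
print(categories)      # ['blue', 'green', 'red']
print(encoded_colors)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]

sizes = ["small", "large", "medium", "small"]
print(ordinal_encode(sizes, ["small", "medium", "large"]))  # [0, 2, 1, 0]
```

Integer ranks implicitly treat the gaps between levels as equal, which may not hold for the underlying quantity.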

Missing Data

  • Ignoring Instances with Unknown Feature Values
  • Most Common Feature Value
  • Mean substitution
  • Regression or Classification methods
  • Nearest-Neighbor Imputation
  • Treating Missing Feature Values as Special Values
  • Latent Variable Methods
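
Three of the simpler strategies above can be sketched as follows, assuming missing values are represented as `None` (a representation chosen for illustration, as are the function names).

```python
import statistics

def drop_incomplete(rows):
    """Ignore instances that have any unknown feature value."""
    return [row for row in rows if None not in row]

def impute_most_common(column):
    """Replace missing entries with the most common observed value."""
    observed = [v for v in column if v is not None]
    mode = statistics.mode(observed)
    return [mode if v is None else v for v in column]

def impute_mean(column):
    """Replace missing entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in column]

rows = [[20, "red"], [None, "blue"], [40, None], [30, "red"]]
ages = [row[0] for row in rows]
colors = [row[1] for row in rows]

print(drop_incomplete(rows))       # [[20, 'red'], [30, 'red']]
print(impute_most_common(colors))  # ['red', 'blue', 'red', 'red']
print(impute_mean(ages))           # [20, 30, 40, 30]
```

Dropping rows discards information, while mean and mode substitution keep every row but shrink the variance of the imputed feature. The regression, nearest-neighbor, and latent variable methods trade more computation for imputations that use the other features.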