Data Preparation - For Parametric Statistical Methods

Parametric statistical methods assume that the data follow a known, specific distribution, often a Gaussian (normal) distribution.

Data Scrubbing Type	Description
Normality Tests	testing whether sample data follow a normal distribution
Transforming Data to Normal Distribution	transforming a data variable so that it approximately follows a normal distribution
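
The two rows above can be illustrated with a minimal sketch using only the Python standard library. It assumes the Jarque-Bera test as the normality test and a log transform as the transformation; the notes name neither, so both are illustrative choices, and the helper name `jarque_bera` and the synthetic log-normal sample are hypothetical. The p-value is the asymptotic one, so it is only a rough guide for small samples.

```python
import math
import random

def jarque_bera(xs):
    """Return (JB statistic, asymptotic p-value) for a sample (illustrative helper)."""
    n = len(xs)
    mean = sum(xs) / n
    # Biased central moments, as used in the Jarque-Bera statistic.
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)
    # Under normality JB is asymptotically chi-square with 2 degrees of
    # freedom, whose survival function is exp(-x / 2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value

# Assumed example data: a right-skewed (log-normal) sample.
rng = random.Random(0)
skewed = [rng.lognormvariate(0.0, 1.0) for _ in range(500)]

# Log transform: the log of a log-normal variable is normally distributed.
logged = [math.log(x) for x in skewed]

jb_raw, p_raw = jarque_bera(skewed)
jb_log, p_log = jarque_bera(logged)
print("raw:    JB =", jb_raw, "p =", p_raw)
print("logged: JB =", jb_log, "p =", p_log)
```

A small p-value is evidence against normality. The log transform only applies to strictly positive data; other transforms may suit other shapes of skew.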

Data Preparation - Other

Data Scrubbing Type	Description
Feature Engineering	using domain knowledge to extract features from raw data and to create feature functions
Normalization	given 𝒙 = {𝑥₁, …, 𝑥ₙ}, where 𝜇 is the mean and 𝜎 is the standard deviation of 𝒙:
	min-max normalization(𝑥ᵢ) = [𝑥ᵢ - 𝑚𝑖𝑛(𝒙)] / [𝑚𝑎𝑥(𝒙) - 𝑚𝑖𝑛(𝒙)]
	mean normalization(𝑥ᵢ) = [𝑥ᵢ - 𝜇] / [𝑚𝑎𝑥(𝒙) - 𝑚𝑖𝑛(𝒙)]
	z-score normalization(𝑥ᵢ) = [𝑥ᵢ - 𝜇] / 𝜎 (transforms all variables to have the same standard deviation)
Detecting Outliers	detecting and removing outliers that skew the model
Discretization	divides the range of a continuous attribute into intervals:
	equal-width (distance) binning: 𝑏𝑖𝑛-𝑤𝑖𝑑𝑡ℎ = (𝑚𝑎𝑥 - 𝑚𝑖𝑛) / 𝐾
	equal-depth (frequency) binning: divides the range into intervals, each containing the same number of data points
	bottom-up binning: uses criteria such as entropy to characterize the purity of bins
	clustering
Nominal & Ordinal	sometimes we want to convert nominal and ordinal values to "continuous" values:
	nominal: values from an unordered set (colors, professions, etc.)
	ordinal: values from an ordered set (ranks, etc.)
Missing Data	ignoring instances with unknown feature values
	most common feature value
	mean substitution
	regression or classification methods
	nearest-neighbor imputation
	treating missing feature values as special values
	latent variable methods
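
The three normalization formulas in the table can be sketched in a few lines of standard-library Python. The function names and the sample values are hypothetical, and 𝜎 is taken here to be the population standard deviation (`statistics.pstdev`); the notes do not say whether the population or sample version is meant.

```python
import statistics

def min_max_normalize(xs):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def mean_normalize(xs):
    """Center values on the mean and divide by the range."""
    mu = statistics.fmean(xs)
    lo, hi = min(xs), max(xs)
    return [(x - mu) / (hi - lo) for x in xs]

def z_score_normalize(xs):
    """Center values on the mean and divide by the standard deviation."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)  # assumption: population standard deviation
    return [(x - mu) / sigma for x in xs]

values = [2.0, 4.0, 6.0, 8.0]  # assumed example data
print(min_max_normalize(values))
print(mean_normalize(values))
print(z_score_normalize(values))
```

All three functions divide by zero when every value is identical, so a constant column needs separate handling.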
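
Equal-width and equal-depth binning from the Discretization row can be compared with a short sketch. The function names and the data are hypothetical, and the equal-depth version assigns bins by sorted rank, which is one simple way to get equally filled bins when the number of points is divisible by 𝐾.

```python
def equal_width_bins(xs, k):
    """Assign each value to one of k bins of width (max - min) / k."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / k
    # The maximum would fall into bin k, so clamp it into the last bin.
    return [min(int((x - lo) / width), k - 1) for x in xs]

def equal_depth_bins(xs, k):
    """Assign each value to one of k bins holding (roughly) equal counts."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    bins = [0] * len(xs)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(xs)
    return bins

values = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # assumed example data with one outlier
print(equal_width_bins(values, 3))  # [0, 0, 0, 0, 0, 0, 0, 0, 2]
print(equal_depth_bins(values, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

The outlier stretches the equal-width bins so that almost every point lands in bin 0, while equal-depth binning keeps three points per bin.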
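
Two of the simpler strategies in the Missing Data row, mean substitution and the most common feature value, can be sketched as follows. Missing entries are represented by `None`, which is an assumption made for this example, and the function names and data are hypothetical.

```python
import statistics

def impute_mean(xs):
    """Replace missing numeric values with the mean of the observed ones."""
    observed = [x for x in xs if x is not None]
    mu = statistics.fmean(observed)
    return [mu if x is None else x for x in xs]

def impute_most_common(xs):
    """Replace missing values with the most common observed value."""
    observed = [x for x in xs if x is not None]
    mode = statistics.mode(observed)
    return [mode if x is None else x for x in xs]

ages = [20.0, None, 40.0, 30.0]          # assumed numeric feature
colors = ["red", "blue", None, "red"]    # assumed nominal feature
print(impute_mean(ages))           # [20.0, 30.0, 40.0, 30.0]
print(impute_most_common(colors))  # ['red', 'blue', 'red', 'red']
```

Mean substitution suits numeric features, while the most common value also works for nominal ones; both reduce the variance of the imputed feature.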