| Data Scrubbing Type | Description |
|---|---|
| Feature Engineering | The process of using domain knowledge to extract features from raw data and to create feature functions. |
| Normalization | Given 𝒙 = {𝑥₁, …, 𝑥ₙ}:<br>- min-max normalization(𝑥ᵢ) = [𝑥ᵢ - 𝑚𝑖𝑛(𝒙)] / [𝑚𝑎𝑥(𝒙) - 𝑚𝑖𝑛(𝒙)]<br>- mean normalization(𝑥ᵢ) = [𝑥ᵢ - 𝜇] / [𝑚𝑎𝑥(𝒙) - 𝑚𝑖𝑛(𝒙)]<br>- z-score normalization(𝑥ᵢ) = [𝑥ᵢ - 𝜇] / 𝜎 (rescales every variable to zero mean and unit standard deviation)<br>where 𝜇 is the mean of 𝒙 and 𝜎 is its standard deviation. |
| Detecting Outliers | Detecting and removing outliers that would otherwise skew the model. |
| Discretization | Divides the range of a continuous attribute into intervals:<br>- equal-width (distance) binning: 𝑏𝑖𝑛-𝑤𝑖𝑑𝑡ℎ = (𝑚𝑎𝑥 - 𝑚𝑖𝑛) / 𝐾, where 𝐾 is the number of bins<br>- equal-depth (frequency) binning: divides the range into intervals that each contain (approximately) the same number of data points<br>- bottom-up binning: uses criteria such as entropy to characterize the purity of bins<br>- clustering |
| Nominal & Ordinal | Sometimes we want to convert nominal and ordinal values to “continuous” values:<br>- nominal: values from an unordered set (colors, professions, etc.)<br>- ordinal: values from an ordered set (ranks, etc.) |
| Missing Data | Common ways to handle missing feature values:<br>- ignoring instances with unknown feature values<br>- most common feature value<br>- mean substitution<br>- regression or classification methods<br>- nearest-neighbor imputation<br>- treating missing feature values as special values<br>- latent variable methods |
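
The three formulas in the Normalization row can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and 𝜎 is assumed to be the population standard deviation (`statistics.pstdev`), which the table does not specify.

```python
from statistics import fmean, pstdev


def _value_range(xs):
    """Return max(x) - min(x), rejecting constant input (zero range)."""
    span = max(xs) - min(xs)
    if span == 0:
        raise ValueError("all values are equal, so the range is zero")
    return span


def min_max_normalize(xs):
    """(x_i - min(x)) / (max(x) - min(x)); results lie in [0, 1]."""
    lo = min(xs)
    span = _value_range(xs)
    return [(x - lo) / span for x in xs]


def mean_normalize(xs):
    """(x_i - mu) / (max(x) - min(x)); results are centered on 0."""
    mu = fmean(xs)
    span = _value_range(xs)
    return [(x - mu) / span for x in xs]


def z_score_normalize(xs):
    """(x_i - mu) / sigma; results have mean 0 and standard deviation 1."""
    mu = fmean(xs)
    sigma = pstdev(xs)  # assumption: population (not sample) standard deviation
    if sigma == 0:
        raise ValueError("standard deviation is zero")
    return [(x - mu) / sigma for x in xs]


values = [2.0, 4.0, 6.0, 8.0]
print(min_max_normalize(values))
print(mean_normalize(values))
print(z_score_normalize(values))
```

Min-max and mean normalization share the same denominator and differ only in what is subtracted, so they produce the same spacing between points with a different offset.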
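
For the Discretization row, equal-width and equal-depth binning might look like the sketch below. The function names and the sample data are made up for illustration; ties are not handled specially, so equal values can end up in different equal-depth bins.

```python
def equal_width_bins(xs, k):
    """Assign each value a bin index 0..k-1 using bin width (max - min) / k."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(xs)  # constant input: everything falls in one bin
    # The maximum would compute to index k, so clamp it into the last bin.
    return [min(int((x - lo) / width), k - 1) for x in xs]


def equal_depth_bins(xs, k):
    """Assign bin indices so that each bin holds about len(xs) / k points."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    bins = [0] * len(xs)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(xs)
    return bins


ages = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(equal_width_bins(ages, 3))  # [0, 0, 0, 0, 0, 0, 0, 0, 2]
print(equal_depth_bins(ages, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

The sample shows the usual trade-off: a single large value stretches the equal-width bins so that almost all points share one bin, while equal-depth binning keeps the bins balanced.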
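
The Nominal & Ordinal row does not say how the conversion is done. Two common choices, shown here as an assumption rather than as the method the table has in mind, are one-hot encoding for nominal values and rank (integer) encoding for ordinal values. All names below are hypothetical.

```python
def one_hot(values):
    """Map each nominal value to a 0/1 indicator vector, one slot per category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        vectors.append(row)
    return categories, vectors


def ordinal_encode(values, order):
    """Map each ordinal value to its position in the caller-supplied order."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]


colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot(colors)
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]

levels = ["low", "high", "medium"]
print(ordinal_encode(levels, ["low", "medium", "high"]))  # [0, 2, 1]
```

One-hot vectors avoid implying an order between categories, whereas rank encoding deliberately keeps the order but also implies equal spacing between levels, which may not hold for the data.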
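
Two of the simpler strategies in the Missing Data row, mean substitution and most common feature value, can be sketched as follows. The sketch assumes missing entries are represented as `None`; the function names are illustrative only.

```python
from collections import Counter
from statistics import fmean


def impute_mean(xs):
    """Replace each missing (None) entry with the mean of the observed values."""
    observed = [x for x in xs if x is not None]
    if not observed:
        raise ValueError("no observed values to compute a mean from")
    mu = fmean(observed)
    return [mu if x is None else x for x in xs]


def impute_most_common(xs):
    """Replace each missing (None) entry with the most frequent observed value."""
    observed = [x for x in xs if x is not None]
    if not observed:
        raise ValueError("no observed values to take a mode from")
    most_common_value, _count = Counter(observed).most_common(1)[0]
    return [most_common_value if x is None else x for x in xs]


print(impute_mean([1.0, None, 3.0]))              # [1.0, 2.0, 3.0]
print(impute_most_common(["a", None, "b", "a"]))  # ['a', 'a', 'b', 'a']
```

Mean substitution only makes sense for numeric features, while the most-common-value strategy also works for nominal ones.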