Big data vary in shape, and different shapes call for different approaches.

Wide Data

  • thousands/millions of variables
  • hundreds of samples
  • we have too many variables; prone to overfitting
  • need to remove variables, or regularize, or both

Tall Data

  • tens/hundreds of variables
  • thousands/millions of samples
  • sometimes simple models (linear) don’t suffice
  • we have enough samples to fit non-linear models with many interactions, and not too many variables

Wide & Tall Data

  • thousands/millions of variables
  • millions/billions of samples
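
The wide-data advice above (remove variables, or regularize, or both) can be illustrated with the lasso, which shrinks coefficients and sets some of them exactly to zero. What follows is a minimal sketch in plain Python, not a production solver: the function names `soft_threshold` and `lasso_cd`, the objective scaling `(1/(2n))`, and the toy data are all assumptions made for illustration.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=100):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of n rows, each a list of p floats; y is a list of n floats.
    No intercept is fitted, so y should be centered beforehand.
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    resid = list(y)  # residual y - X b, valid because b starts at zero
    col_scale = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_scale[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the partial residual (b[j] added back).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * b[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_scale[j]
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                b[j] = new_bj
    return b


# Toy check (hypothetical data): 4 samples, 2 orthogonal columns.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [3.5, 2.5, -2.5, -3.5]  # equals 3 * column 0 + 0.5 * column 1
coef = lasso_cd(X, y, lam=1.0)
print(coef)  # [2.0, 0.0]
```

In this toy problem the strong coefficient is shrunk from 3 to 2 and the weak one (0.5) is removed entirely, which is the "regularize and remove variables" behaviour in miniature. The data here are not actually wide; they are chosen only so the arithmetic can be checked by hand.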

Tricks of the trade for Wide & Tall Data:

  • exploit sparsity
  • random projections/hashing
  • variable screening
  • subsample rows
  • divide and recombine
  • MapReduce
  • ADMM (divide & conquer)

Methods for Wide Data:

  • Screening and FDR
  • Lasso
  • SVM
  • Stepwise LR Model Building

Methods for Tall Data:

  • GLM
  • Random Forests
  • Boosting
  • Deep Learning
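
One of the tricks of the trade listed above, divide and recombine (the idea behind a MapReduce-style fit), can be sketched for the simplest possible model. The example below is an illustrative assumption, not a method taken from the slide: it fits a one-variable least-squares line by summarizing each chunk of rows with additive sufficient statistics and then adding the summaries together. The names `map_chunk`, `combine`, and `fit_line` are hypothetical.

```python
from functools import reduce


def map_chunk(rows):
    """Map step: summarize one chunk of (x, y) pairs by its sufficient statistics."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def combine(a, b):
    """Reduce step: sufficient statistics from two chunks simply add."""
    return tuple(u + v for u, v in zip(a, b))


def fit_line(stats):
    """Recover the least-squares slope and intercept from the pooled statistics."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1, split into chunks of 3 rows
# as if each chunk lived on a different machine.
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
chunks = [data[i:i + 3] for i in range(0, len(data), 3)]
stats = reduce(combine, map(map_chunk, chunks))
slope, intercept = fit_line(stats)
print(slope, intercept)  # 2.0 1.0
```

For ordinary least squares the recombination is exact, because the pooled statistics equal those of the full data. For penalized or non-linear fits the chunk results generally cannot just be added, which is where iterative schemes such as ADMM come in.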