Quantitative/Numerical Univariate Analysis Descriptive Statistics
- a type of Univariate Analysis Descriptive Statistics whose variable is quantitative
Statistics Terminology
Some may argue that statisticians are not really interested in generalizing from a sample to a specified population but to an idealized superpopulation spanning space and time.
best course on statistics: https://bolt.mph.ufl.edu/6050-6052/
Introduction & Terminology
The field of statistics exists because it is usually impossible to collect data from all individuals of interest (population). Our only solution is to collect data from a subset (sample) of the individuals of interest, but our real desire is to know the “truth” about the population. Quantities such as means, standard deviations and proportions are all important values and are called “parameters” when we are talking about a population. Since we usually cannot get data from the whole population, we cannot know the values of the parameters for that population. We can, however, calculate estimates of these quantities for our sample. When they are calculated from sample data, these quantities are called “statistics.” A statistic estimates a parameter.
- population distribution consists of all units of interest
- empirical distribution consists of observed units collected from the population
- population parameter (𝜽)
- sometimes just called a parameter
- is any summary measure of the population distribution (e.g. mean, variance, etc)
- usually has an unknown value
- sample statistic (𝜽̂)
- sometimes just called a statistic
- is a function that takes the sample distribution as input
- is any summary measure of a sample distribution (e.g. sample mean, sample variance, etc)
- is an estimate of the corresponding population parameter 𝜽
- is a random variable because it is computed from a random sample, which is a subset of the population distribution. Thus, the statistic has a sampling distribution
- see methods estimating sample statistic
- Error
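
The parameter/statistic distinction above can be sketched with a small simulation. This is only an illustration under assumed values: the population (10,000 normal draws with mean 50 and standard deviation 10), the sample size, and the helper name `sample_mean` are all hypothetical and not part of the notes.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical finite population (assumed values, for illustration only).
population = [random.gauss(50, 10) for _ in range(10_000)]

# Parameter: computed from the whole population (unknown in practice).
mu = statistics.fmean(population)


def sample_mean(n):
    """Statistic: the mean of a simple random sample of size n."""
    return statistics.fmean(random.sample(population, n))


# Each new sample gives a different value of the statistic, so the
# statistic is a random variable. Many repetitions approximate its
# sampling distribution.
sample_means = [sample_mean(100) for _ in range(2_000)]

center = statistics.fmean(sample_means)  # centre of the sampling distribution
spread = statistics.stdev(sample_means)  # its spread (the standard error)

print(f"parameter mu           = {mu:.3f}")
print(f"one statistic value    = {sample_means[0]:.3f}")
print(f"mean of the statistics = {center:.3f}")
print(f"std of the statistics  = {spread:.3f}")
```

The individual sample means differ from each other and from `mu`, but they cluster around it, which is what "a statistic estimates a parameter" means in practice.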
Random Process - Random Variables - Stochastic Model - Probability Distribution - Statistical Inference - Statistical Model - Exploratory Data Analysis - Estimator - Probability Model
Many observable phenomena are random in nature. We call such a phenomenon a Random Process (Random Experiment). The random process has outcomes, and subsets of these outcomes are called Events. We map the outcomes to numeric form using Random Variables.
We study and capture our knowledge about this random process by creating a Stochastic Model. The stochastic model predicts the outcome of the process by:
- providing different choices (of values of a random variable)
- the probability of those choices
These two elements are summarized as a Probability Distribution.
This distribution has some parameters (like mean, standard deviation, etc) which were inferred from the observable phenomena using Statistical Inference.
Before inference, the distribution had unknown (not inferred yet) parameters. It was, hence, a family of distributions, since each value of the parameter is a different distribution. This family is called a Statistical Model.
Usually, a statistical model is guessed (exponential, binomial, normal, uniform, Bernoulli, etc) using Exploratory Data Analysis, then its parameters are inferred (estimated) by applying statistical inference (say, algorithms involving loss function minimization) to arrive at a stochastic model (statistical model with known parameters) (a.k.a. Estimator) that captures our knowledge about the random process.
The term ‘Probability Model’ (probabilistic model) is usually an alias for stochastic models.
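
The path from statistical model to stochastic model can be sketched as follows. The observations are simulated (the values 170 and 8 are assumptions), the normal family is chosen by hand rather than by exploratory analysis, and the fit uses `NormalDist.from_samples` from the Python standard library, which estimates the mean and standard deviation directly from the data instead of minimizing an explicit loss function.

```python
import random
from statistics import NormalDist

random.seed(1)  # fixed seed so the sketch is reproducible

# Observations of the random process (simulated with assumed values).
observations = [random.gauss(170, 8) for _ in range(5_000)]

# Statistical model: the normal family {Normal(mu, sigma)},
# with mu and sigma still unknown.
# Statistical inference: estimate mu and sigma from the observations.
fitted = NormalDist.from_samples(observations)

# Stochastic model: one member of the family, parameters now fixed.
print(f"estimated mean  = {fitted.mean:.2f}")
print(f"estimated stdev = {fitted.stdev:.2f}")

# The fitted model gives the possible values and their probabilities,
# e.g. the probability of observing a value of at most 180.
p_below_180 = fitted.cdf(180)
print(f"P(X <= 180)     = {p_below_180:.3f}")
```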
Quantitative/Numerical Univariate Analysis Descriptive Statistics - Types
Central Tendency - what are the most typical values?
see Central Tendency

| Statistic | population parameter notation | sample statistic notation | Description |
| --- | --- | --- | --- |
| mode | | | most frequently occurring value in the distribution |
| median | 𝑀 | 𝑀̅ or 𝑥̃ | the middle value in the sorted distribution; same as the 0.5-quantile, 50th percentile, and 2nd quartile; is a value 𝑚 that minimizes 𝐄[\|𝑋 - 𝑚\|] |
| mean | 𝜇 | 𝑋̅ | average of the distribution; is a value 𝑚 that minimizes 𝐄[(𝑋 - 𝑚)²] |
| mean of absolute values | | | average of the distribution; ignores negative signs when computing the arithmetic mean |
| mid-range | | | average of the distribution; the average of the min and max |

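
A minimal sketch of the central tendency measures described above, using only the Python standard library. The dataset and the helper names `mean_abs_error` and `mean_sq_error` are made up for illustration. The grid search at the end is a numerical check, not a proof, that the median minimizes the mean absolute deviation and the mean minimizes the mean squared deviation.

```python
import statistics

data = [-4, 1, 2, 2, 3, 7, 10]  # made-up sample

mode = statistics.mode(data)                        # most frequent value
median = statistics.median(data)                    # middle of the sorted data
mean = statistics.fmean(data)                       # arithmetic mean
mean_abs = statistics.fmean([abs(x) for x in data]) # mean ignoring signs
mid_range = (min(data) + max(data)) / 2             # average of min and max


def mean_abs_error(m):
    """Sample version of E[|X - m|]."""
    return statistics.fmean([abs(x - m) for x in data])


def mean_sq_error(m):
    """Sample version of E[(X - m)^2]."""
    return statistics.fmean([(x - m) ** 2 for x in data])


# Candidate values m from -5.0 to 11.0 in steps of 0.1.
grid = [i / 10 for i in range(-50, 111)]
best_abs = min(grid, key=mean_abs_error)  # lands on the median
best_sq = min(grid, key=mean_sq_error)    # lands on the mean

print(mode, median, mean, mean_abs, mid_range)
print(best_abs, best_sq)
```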
Dispersion/Variation - how do the values vary?
see Variation

| Statistic | population parameter notation | sample statistic notation | Description |
| --- | --- | --- | --- |
| variance | 𝜎² | 𝜎̂² or 𝑠² | measures how far a set of numbers is spread out from its average value; uses squared deviations, which weight outliers more heavily than data very near the mean; squaring also prevents deviations above the mean from canceling those below it (the signed deviations from the mean always sum to zero) |
| standard deviation | 𝜎 | 𝜎̂ or 𝑠 | measures how far a set of numbers is spread out from its average value; brings the variance back to the original unit of the data |
| quantile | 𝑞ₚ | 𝑞̂ₚ | generalizes the median from 𝑝 = 0.5 to any 𝑝 in [0, 1]; quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities; variants: median, quartiles, percentiles, etc |
| | | | TODO |
| range | | | max - min |

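
A short sketch of the dispersion measures above with the Python standard library. The dataset is made up. `pvariance`/`pstdev` treat the data as a whole population (𝜎², 𝜎), while `variance`/`stdev` treat it as a sample (𝑠², 𝑠) and divide by 𝑛 - 1. The choice of `method="inclusive"` for the quartiles is an assumption; other quantile conventions give slightly different cut points.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up dataset
mean = statistics.fmean(data)

# Signed deviations from the mean cancel out, which is why the
# variance squares them before averaging.
signed_sum = sum(x - mean for x in data)

pop_var = statistics.pvariance(data)  # population variance (divide by n)
pop_std = statistics.pstdev(data)     # population standard deviation
samp_var = statistics.variance(data)  # sample variance (divide by n - 1)
samp_std = statistics.stdev(data)     # sample standard deviation

# Quartiles are the 0.25-, 0.5- and 0.75-quantiles.
quartiles = statistics.quantiles(data, n=4, method="inclusive")

data_range = max(data) - min(data)

print(signed_sum, pop_var, pop_std)
print(samp_var, samp_std)
print(quartiles, data_range)
```

The standard deviation is in the same unit as the data, while the variance is in squared units.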
Statistics Involving Distances
Description
- is the most occurring distance between each data point and the mean
- is the middle distance between each data point and the mean
Description
- is the variance of the distances between each data point
Distribution Shape - are the values symmetrically or asymmetrically distributed?

| Statistic | population parameter notation | sample statistic notation | Description |
| --- | --- | --- | --- |
| skewness | | | measures the symmetry or asymmetry of a dataset about its mean |
| kurtosis | | | measures whether the data are heavy-tailed or light-tailed relative to a normal distribution |

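
The two shape measures can be sketched from central moments. The formulas below are the moment-based (population) versions, which is an assumption; statistical packages often apply small-sample corrections. The function names and the datasets are made up for illustration.

```python
import statistics


def central_moment(data, k):
    """Average of (x - mean)**k over the data."""
    mean = statistics.fmean(data)
    return statistics.fmean([(x - mean) ** k for x in data])


def skewness(data):
    """Third central moment divided by the cubed standard deviation."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Fourth standardized moment minus 3, the value for a normal distribution."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetric))         # zero for symmetric data
print(skewness(right_skewed))      # positive: long tail to the right
print(excess_kurtosis(symmetric))  # negative: lighter tails than a normal
```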
Outliers - are there values that represent abnormalities?

| Statistic | population parameter notation | sample statistic notation | Description |
| --- | --- | --- | --- |

Others

| Statistic | population parameter notation | sample statistic notation | Description |
| --- | --- | --- | --- |
| size | 𝑁 | 𝑛 | number of members of the dataset (sample or population) |
| 𝑘-th moment | 𝑀ₖ | | raw moments vs central moments |

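
The difference between raw and central moments can be sketched as below. The function names and the dataset are made up for illustration: a raw moment averages 𝑥ᵏ (deviations from zero), while a central moment averages (𝑥 - mean)ᵏ (deviations from the mean).

```python
import statistics


def raw_moment(data, k):
    """k-th raw moment: average of x**k."""
    return statistics.fmean([x ** k for x in data])


def central_moment(data, k):
    """k-th central moment: average of (x - mean)**k."""
    mean = raw_moment(data, 1)
    return statistics.fmean([(x - mean) ** k for x in data])


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up dataset
n = len(data)                    # size of the dataset

first_raw = raw_moment(data, 1)           # the mean
first_central = central_moment(data, 1)   # zero by construction
second_central = central_moment(data, 2)  # the population variance

# The second central moment can be recovered from the first two raw moments.
from_raw = raw_moment(data, 2) - raw_moment(data, 1) ** 2

print(n, first_raw, first_central, second_central, from_raw)
```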


