Statistics Terminology

Some may argue that statisticians are not really interested in generalizing from a sample to a specified population, but rather to an idealized superpopulation spanning space and time.

best course on statistics: https://bolt.mph.ufl.edu/6050-6052/

Introduction & Terminology

The field of statistics exists because it is usually impossible to collect data from all individuals of interest (population). Our only solution is to collect data from a subset (sample) of the individuals of interest, but our real desire is to know the “truth” about the population. Quantities such as means, standard deviations and proportions are all important values and are called “parameters” when we are talking about a population. Since we usually cannot get data from the whole population, we cannot know the values of the parameters for that population. We can, however, calculate estimates of these quantities for our sample. When they are calculated from sample data, these quantities are called “statistics.” A statistic estimates a parameter.
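The distinction between a parameter and a statistic can be sketched in a few lines of Python. This is only an illustration: the population is simulated (the heights, the seed, and the sample size of 100 are all assumptions), whereas in practice the population values are usually not available.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated heights in cm (an assumption).
population = [random.gauss(170, 10) for _ in range(10_000)]

# Parameters describe the whole population (normally unknown in practice).
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

# A sample is a subset of the population.
sample = random.sample(population, 100)

# Statistics are computed from the sample and estimate the parameters.
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)  # uses the n - 1 denominator

print(f"parameter mu    = {mu:.2f}, statistic x_bar = {x_bar:.2f}")
print(f"parameter sigma = {sigma:.2f}, statistic s     = {s:.2f}")
```

The statistics land near the parameters but do not equal them, which is the gap that statistical inference tries to quantify.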

population distribution: the distribution over all units of interest

empirical distribution: the distribution over the observed units collected from the population

population parameter (𝜽)

sometimes just called a parameter

is any numerical summary of the population distribution (e.g. mean, variance, etc)

usually has an unknown value

sample statistic (𝜽̂)

sometimes just called a statistic

is a function that takes the sample as input

is any numerical summary of the sample distribution (e.g. sample mean, sample variance, etc)

is an estimate of the corresponding population parameter 𝜽

is a random variable because it is computed from a random sample, which is a subset of the population. Thus, the statistic has a sampling distribution

see methods estimating sample statistic
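To make the idea of a sampling distribution concrete, the sketch below draws many samples from one simulated population and records the sample mean of each. The population (exponential with mean 50), the sample size of 50, and the 2,000 repetitions are assumptions chosen only for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical skewed population (an assumption for this example).
population = [random.expovariate(1 / 50) for _ in range(20_000)]
mu = statistics.fmean(population)

def sample_mean(n):
    """Draw one random sample of size n and return its mean (a statistic)."""
    return statistics.fmean(random.sample(population, n))

# Each new sample gives a different value of the statistic,
# so the statistic itself has a distribution: the sampling distribution.
means = [sample_mean(50) for _ in range(2_000)]

center = statistics.fmean(means)     # close to the parameter mu
std_error = statistics.stdev(means)  # spread of the sampling distribution

print(f"population mean        = {mu:.2f}")
print(f"mean of sample means   = {center:.2f}")
print(f"spread of sample means = {std_error:.2f}")
```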

Error

Random Process - Random Variables - Stochastic Model - Probability Distribution - Statistical Inference - Statistical Model - Exploratory Data Analysis - Estimator - Probability Model

Many observable phenomena are random in nature. We call such a phenomenon a Random Process (Random Experiment). The random process has outcomes, and subsets of these outcomes are called Events. We map the outcomes to a numeric form using Random Variables.

We study and capture our knowledge about this random process by creating a Stochastic Model. The stochastic model predicts the output of an event by:

providing different choices (of values of a random variable)

the probability of those choices

These two elements are summarized as a Probability Distribution.
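As a small illustration of these terms, the sketch below builds the probability distribution of a random variable for an assumed random process: rolling two fair six-sided dice. The choice of process and the names `X`, `distribution`, and `p_event` are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Random process: roll two fair six-sided dice (an assumed example).
outcomes = list(product(range(1, 7), repeat=2))

def X(outcome):
    """Random variable: maps an outcome to a number (the sum of the dice)."""
    return sum(outcome)

# Probability distribution: each possible value paired with its probability.
distribution = {}
for outcome in outcomes:
    value = X(outcome)
    distribution[value] = distribution.get(value, 0) + Fraction(1, len(outcomes))

# An event is a subset of outcomes, e.g. "the sum is at least 10".
p_event = sum(p for value, p in distribution.items() if value >= 10)

for value in sorted(distribution):
    print(value, distribution[value])
print("P(X >= 10) =", p_event)
```

Exact fractions are used instead of floats so that the probabilities sum to exactly 1.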

This distribution has some parameters (like mean, standard deviation, etc) which were inferred from the observable phenomena using Statistical Inference.

Before inference, the distribution had unknown (not inferred yet) parameters. It was, hence, a family of distributions, since each value of the parameter is a different distribution. This family is called a Statistical Model.

Usually, a statistical model is guessed (exponential, binomial, normal, uniform, Bernoulli, etc) using Exploratory Data Analysis, then its parameters are inferred (estimated) by applying statistical inference (say, algorithms involving loss function minimization). The rule that maps the data to a parameter estimate is called an Estimator. The result is a stochastic model (a statistical model with known parameters) that captures our knowledge about the random process.
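The step from a statistical model to a fitted stochastic model can be sketched as follows. It assumes the exponential family was chosen as the statistical model, uses the negative log-likelihood as the loss, and minimizes it with a simple golden-section search. The data are simulated, and the true rate of 0.5 and the helper names are hypothetical.

```python
import math
import random

random.seed(2)  # fixed seed so the sketch is reproducible

# Observed data from the random process (simulated; true rate 0.5 is an assumption).
data = [random.expovariate(0.5) for _ in range(5_000)]

def negative_log_likelihood(rate, xs):
    """Loss for the exponential family f(x; rate) = rate * exp(-rate * x)."""
    return -(len(xs) * math.log(rate) - rate * sum(xs))

def golden_section_minimize(f, lo, hi, tol=1e-8):
    """Minimize a unimodal function f on the interval [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
    return (a + b) / 2

# Statistical inference: pick the family member that minimizes the loss.
rate_hat = golden_section_minimize(
    lambda r: negative_log_likelihood(r, data), 0.01, 10.0
)

# For this family the minimizer is also known in closed form: n / sum(x).
closed_form = len(data) / sum(data)

print(f"numerical estimate   = {rate_hat:.4f}")
print(f"closed-form estimate = {closed_form:.4f}")
```

Before fitting, `rate` is unknown and the model is the whole exponential family; after fitting, `rate_hat` pins down one distribution.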

The term ‘Probability Model’ (probabilistic model) is usually an alias for stochastic models.

Quantitative/Numerical Univariate Analysis Descriptive Statistics - Types

Central Tendency - what are the most typical values?

see Central Tendency

Statistic	population parameter notation	sample statistic notation	Description
Mode			the most frequently occurring value in the distribution
Median	𝑀	𝑀̅ or 𝑥̃	the middle value of the sorted distribution; same as the 0.5-quantile, 50th percentile, and 2nd quartile; a value 𝑚 that minimizes 𝐄[|𝑋 - 𝑚|]
Arithmetic Mean	𝜇	𝑋̅	the sum of the values divided by their count; the value 𝑚 that minimizes 𝐄[(𝑋 - 𝑚)²]
Harmonic Mean			the reciprocal of the arithmetic mean of the reciprocals of the values
Root Mean Square (Quadratic Mean)			the square root of the arithmetic mean of the squared values; squaring removes the sign of each value
Geometric Mean			the 𝑛th root of the product of 𝑛 values
Mid Range			the average of the min and the max
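The measures in the table can be computed with Python's standard `statistics` module, as in the sketch below. The data values are made up, and the grid search at the end is only a brute-force check (not a proof) that the median minimizes the mean absolute deviation and the arithmetic mean minimizes the mean squared deviation.

```python
import math
import statistics

data = [2.0, 3.0, 3.0, 5.0, 8.0, 13.0, 21.0]  # made-up positive values

mode = statistics.mode(data)
median = statistics.median(data)
mean = statistics.fmean(data)
harmonic = statistics.harmonic_mean(data)
geometric = statistics.geometric_mean(data)
rms = math.sqrt(statistics.fmean([x * x for x in data]))
mid_range = (min(data) + max(data)) / 2

def mean_abs_loss(m):
    """Empirical version of E[|X - m|]."""
    return statistics.fmean([abs(x - m) for x in data])

def mean_sq_loss(m):
    """Empirical version of E[(X - m)^2]."""
    return statistics.fmean([(x - m) ** 2 for x in data])

# Brute-force search over candidate values m = 0.00, 0.01, ..., 22.00.
grid = [i / 100 for i in range(0, 2201)]
best_abs = min(grid, key=mean_abs_loss)  # should match the median
best_sq = min(grid, key=mean_sq_loss)    # should be the grid point nearest the mean

print(f"mode={mode}, median={median}, mean={mean:.3f}")
print(f"harmonic={harmonic:.3f}, geometric={geometric:.3f}, rms={rms:.3f}")
print(f"mid range={mid_range}, argmin |.|={best_abs}, argmin (.)^2={best_sq}")
```

For positive data the means come out ordered as harmonic ≤ geometric ≤ arithmetic ≤ root mean square.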

Dispersion/Variation - how do the values vary?

see Variation

Statistic	population parameter notation	sample statistic notation	Description
Variance	𝜎²	𝜎̂² or 𝑠²	measures how far a set of numbers is spread out from its average value, as the mean of the squared deviations from the mean; squaring weights outliers more heavily than data very near the mean, and prevents deviations above the mean from canceling out those below it (the signed deviations always sum to zero)
Standard Deviation	𝜎	𝜎̂ or 𝑠	the square root of the variance; measures how far a set of numbers is spread out from its average value and brings the variance back to the original unit of the data
p-Quantiles	𝑞_𝑝	𝑞̂_𝑝	generalizes the median from 𝑝 = 0.5 to any 𝑝 in [0,1]; quantiles are cut points dividing the range of a probability distribution into contiguous intervals with equal probabilities; variants: median, quartiles, percentiles, etc
Coefficient of Variation			the ratio of the standard deviation to the mean (𝜎/𝜇)
Max, Min			For a subset 𝑆 of an ordered field 𝐹, 𝑠̃∊𝑆 is called the max of 𝑆 if: ∀𝑠∊𝑆: 𝑠≤𝑠̃. It is called the min of 𝑆 if: ∀𝑠∊𝑆: 𝑠≥𝑠̃
Upper Bound, Lower Bound			For a subset 𝑆 of an ordered field 𝐹, 𝑠̃∊𝐹 is called an upper bound of 𝑆 if: ∀𝑠∊𝑆: 𝑠≤𝑠̃. It is called a lower bound of 𝑆 if: ∀𝑠∊𝑆: 𝑠≥𝑠̃
Supremum, Infimum			For a subset 𝑆 of an ordered field 𝐹, 𝑠̃∊𝐹 is called the supremum of 𝑆 if 𝑠̃ is an upper bound of 𝑆 and 𝑠̃≤𝑢 for every other upper bound 𝑢∊𝐹 of 𝑆. It is called the infimum of 𝑆 if 𝑠̃ is a lower bound of 𝑆 and 𝑠̃≥𝑙 for every other lower bound 𝑙∊𝐹 of 𝑆
Range			range = max - min

Statistics Involving Distances

Central Tendency of Deviation	Description
Mode Deviation	the most frequently occurring distance between a data point and the mean
Median Deviation	the middle value of the sorted distances between each data point and the mean

Variation of Distances	Description
Variation	the variance of the distances between each data point
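A minimal sketch of the dispersion measures, using Python's standard `statistics` module on made-up data. The coefficient of variation is computed with its usual definition (standard deviation divided by mean), and the quartiles use the module's `"inclusive"` method, which is one of several quantile conventions.

```python
import statistics

data = [2.0, 3.0, 3.0, 5.0, 8.0, 13.0, 21.0]  # made-up values
mean = statistics.fmean(data)

# Signed deviations cancel out, which is why variance squares them.
signed_sum = sum(x - mean for x in data)

pop_var = statistics.pvariance(data)  # population formula, divides by N
samp_var = statistics.variance(data)  # sample formula, divides by n - 1
samp_sd = statistics.stdev(data)      # square root of samp_var, same unit as data

# Quartiles are the p-quantiles for p = 0.25, 0.5, 0.75.
quartiles = statistics.quantiles(data, n=4, method="inclusive")

cv = samp_sd / mean                   # coefficient of variation (unitless)
value_range = max(data) - min(data)   # range = max - min

# Distances of each point from the mean, then their middle value.
distances = [abs(x - mean) for x in data]
median_deviation = statistics.median(distances)

print(f"sum of signed deviations = {signed_sum:.1e}")
print(f"variance: population={pop_var:.3f}, sample={samp_var:.3f}, sd={samp_sd:.3f}")
print(f"quartiles={quartiles}, cv={cv:.3f}, range={value_range}")
print(f"median deviation from the mean = {median_deviation:.3f}")
```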

Distribution Shape - are the values symmetrically or asymmetrically distributed?

see Distribution Shape

Statistic	population parameter notation	sample statistic notation	Description
Skewness			measures the symmetry or asymmetry of a dataset about its mean
Kurtosis			measures whether the data are heavy-tailed or light-tailed relative to a normal distribution
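Both shape measures can be written in terms of central moments. The sketch below uses the simple moment-based (population-style) formulas, without the small-sample bias corrections some libraries apply; the function names and the two data sets are hypothetical.

```python
import statistics

def central_moment(xs, k):
    """Mean of the k-th powers of the deviations from the mean."""
    m = statistics.fmean(xs)
    return statistics.fmean([(x - m) ** k for x in xs])

def skewness(xs):
    """Third standardized moment: 0 for symmetric data, > 0 for a long right tail."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    """Fourth standardized moment minus 3, the value for a normal distribution."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3.0

symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.0, 2.0, 2.0, 3.0, 20.0]

print("symmetric:   ", skewness(symmetric), excess_kurtosis(symmetric))
print("right-skewed:", skewness(right_skewed), excess_kurtosis(right_skewed))
```

A negative excess kurtosis indicates lighter tails than a normal distribution, and a positive one indicates heavier tails.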

Outliers - are there values that represent abnormalities?

Statistic	population parameter notation	sample statistic notation	Description

Others

Statistic	population parameter notation	sample statistic notation	Description
size	𝑁	𝑛	number of members of dataset (sample or population)
kth Moments	𝑀_𝑘		raw moments vs central moments
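The difference between raw and central moments can be sketched as follows: a raw moment averages the k-th powers of the values themselves, while a central moment averages the k-th powers of the deviations from the mean. The function names and data are hypothetical.

```python
import statistics

def raw_moment(xs, k):
    """k-th raw moment: the mean of x^k."""
    return statistics.fmean([x ** k for x in xs])

def central_moment(xs, k):
    """k-th central moment: the mean of (x - mean)^k."""
    mean = raw_moment(xs, 1)
    return statistics.fmean([(x - mean) ** k for x in xs])

data = [2.0, 3.0, 3.0, 5.0, 8.0, 13.0, 21.0]  # made-up values

first_raw = raw_moment(data, 1)           # the arithmetic mean
second_raw = raw_moment(data, 2)
first_central = central_moment(data, 1)   # zero up to rounding
second_central = central_moment(data, 2)  # the population variance

print(f"raw:     M_1 = {first_raw:.4f}, M_2 = {second_raw:.4f}")
print(f"central: 1st = {first_central:.1e}, 2nd = {second_central:.4f}")
print(f"M_2 - M_1^2  = {second_raw - first_raw ** 2:.4f}")
```

The last line illustrates the identity that the second central moment equals the second raw moment minus the square of the first raw moment.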

Resources

http://uc-r.github.io/descriptives_numeric

／var／log marcus chiu
