A Gaussian process model is a probability distribution over the possible functions that fit a set of points. Because the model gives a distribution over all candidate functions, we can take its mean as the regression function and use its variance to indicate how confident the predictions are. The key points are:
- the posterior over functions is updated as new observations arrive
- a Gaussian process model is a probability distribution over possible functions, and the function values at any finite set of inputs are jointly Gaussian distributed
- the mean of the posterior distribution over functions is the function used for regression predictions
The regression function modeled by a multivariate Gaussian is given as:
- 𝐏(𝐟|𝐗) = 𝑁(𝐟|𝜇,𝐊)
where:
- 𝐗 = [𝑥1, …, 𝑥𝑛] # 𝐗 is a set of 𝑛 observed data points
- 𝐟 = [𝑓(𝑥1), …, 𝑓(𝑥𝑛)] # 𝐟 is the set of 𝑓 values for each observed data point
- 𝜇 = [𝑚(𝑥1), …, 𝑚(𝑥𝑛)] # 𝜇 is the prior mean of 𝐟, 𝑚 represents the mean function (this prior is later updated with the observations)
- 𝐊𝑖𝑗 = 𝑘(𝑥𝑖, 𝑥𝑗) # 𝑘 represents a positive definite kernel function
With no observations, the mean function 𝑚 defaults to 𝑚(𝐗) = 0, since the data is often normalized to zero mean. The Gaussian process model is a distribution over functions whose shape (smoothness) is defined by 𝐊. If the kernel considers points 𝑥𝑖 and 𝑥𝑗 to be similar, the function outputs at the two points, 𝑓(𝑥𝑖) and 𝑓(𝑥𝑗), are expected to be similar as well.
The process of conducting regression with a Gaussian process model is illustrated below: given the observed data (red points) and a mean function 𝑓 (blue line) estimated from these observed data points, we make predictions at new points 𝐗∗ as 𝐟(𝐗∗).
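To make the role of 𝐊 concrete, the following is a minimal sketch that draws a few functions from a zero-mean prior. It assumes a squared-exponential kernel with unit length scale; the helper name `rbf_kernel`, the input grid, and the jitter value are illustrative choices, not part of the original text.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale=1.0, signal_std=1.0):
    """Squared-exponential kernel for 1-D inputs (an illustrative choice)."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return signal_std**2 * np.exp(-0.5 * sq_dist / length_scale**2)

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 50)

# Zero mean function m(X) = 0 and covariance K_ij = k(x_i, x_j).
mean = np.zeros_like(x)
K = rbf_kernel(x, x)

# A small jitter on the diagonal keeps the covariance numerically
# positive definite (assumed value, chosen only for stability).
K_jittered = K + 1e-6 * np.eye(len(x))

# Each row is one function f evaluated at the 50 inputs.
prior_samples = rng.multivariate_normal(mean, K_jittered, size=3)

# Neighbouring inputs are strongly correlated; the two end points are
# almost uncorrelated.
print(K[0, 1], K[0, -1])
```

Inputs that the kernel treats as similar get a covariance close to the prior variance, which is why the sampled functions vary smoothly between neighbouring points.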
[Figure gaussian-regression-example-1.png: observed data (red points) and the estimated mean function (blue line)]
The joint distribution of 𝐟 and 𝐟∗ is expressed as:
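$$
\begin{bmatrix} \mathbf{f} \\ \mathbf{f}_* \end{bmatrix} \sim N\left( \begin{bmatrix} m(\mathbf{X}) \\ m(\mathbf{X}_*) \end{bmatrix}, \begin{bmatrix} \mathbf{K} & \mathbf{K}_* \\ \mathbf{K}_*^\top & \mathbf{K}_{**} \end{bmatrix} \right)
$$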
where:
- 𝐊 = 𝐾(𝐗, 𝐗)
- 𝐊∗ = 𝐾(𝐗, 𝐗∗)
- 𝐊∗∗ = 𝐾(𝐗∗, 𝐗∗)
- 𝑚(𝐗) = 0
- 𝑚(𝐗∗) = 0
This is the joint probability distribution 𝐏(𝐟,𝐟∗|𝐗,𝐗∗) over 𝐟 and 𝐟∗, but regression needs the conditional distribution 𝐏(𝐟∗|𝐟,𝐗,𝐗∗) over 𝐟∗ only. The derivation from the joint distribution 𝐏(𝐟,𝐟∗|𝐗,𝐗∗) to the conditional 𝐏(𝐟∗|𝐟,𝐗,𝐗∗) uses the standard conditioning rule for jointly Gaussian vectors. The result is:
- 𝐟∗|𝐟,𝐗,𝐗∗ ∼ 𝑁(𝐊∗ᵀ𝐊⁻¹𝐟, 𝐊∗∗ - 𝐊∗ᵀ𝐊⁻¹𝐊∗)
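For reference, the conditioning step relies on a standard property of jointly Gaussian vectors. It is restated here in generic symbols (a, b, A, B, C are placeholder names introduced only for this illustration):

```latex
\begin{bmatrix} a \\ b \end{bmatrix}
\sim N\left(
  \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix},
  \begin{bmatrix} A & C \\ C^\top & B \end{bmatrix}
\right)
\;\Longrightarrow\;
b \mid a \sim N\left( \mu_b + C^\top A^{-1} (a - \mu_a),\; B - C^\top A^{-1} C \right)
```

Substituting a = 𝐟, b = 𝐟∗, A = 𝐊, C = 𝐊∗, B = 𝐊∗∗ and zero means gives the conditional mean 𝐊∗ᵀ𝐊⁻¹𝐟 and the conditional covariance 𝐊∗∗ - 𝐊∗ᵀ𝐊⁻¹𝐊∗.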
In more realistic situations, we don’t have access to true function values but noisy versions thereof:
- 𝑦 = 𝑓(𝑥) + ε
Assuming additive independent and identically distributed (i.i.d.) Gaussian noise ε with variance 𝜎𝑛², the prior covariance of the noisy observations becomes 𝑐𝑜𝑣(𝐲) = 𝐊 + 𝜎𝑛²𝐼. The joint distribution of the observed values and the function values at new testing points becomes:
$$
\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \sim N\left( \mathbf{0}, \begin{bmatrix} \mathbf{K} + \sigma_n^2 I & \mathbf{K}_* \\ \mathbf{K}_*^\top & \mathbf{K}_{**} \end{bmatrix} \right)
$$
By deriving the conditional distribution, we get the predictive equations for Gaussian process regression as:
- 𝐟∗|𝐗,𝐲,𝐗∗ ∼ 𝑁(𝐟̄∗, 𝑐𝑜𝑣(𝐟∗))
where:
- 𝐟̄∗ = 𝐊∗ᵀ[𝐊 + 𝜎𝑛²𝐼]⁻¹𝐲
- 𝑐𝑜𝑣(𝐟∗) = 𝐊∗∗ - 𝐊∗ᵀ[𝐊 + 𝜎𝑛²𝐼]⁻¹𝐊∗
Note that the variance 𝑐𝑜𝑣(𝐟∗) does not depend on the observed outputs 𝐲, only on the inputs 𝐗 and 𝐗∗. This is a property of the Gaussian distribution.
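The predictive mean and covariance for noisy observations can be sketched in NumPy as follows. The squared-exponential kernel, the names `rbf_kernel` and `gp_predict`, the noise level, and the toy data are all assumptions made for this illustration. The code computes 𝐊∗ᵀ[𝐊 + 𝜎𝑛²𝐼]⁻¹𝐲 and 𝐊∗∗ - 𝐊∗ᵀ[𝐊 + 𝜎𝑛²𝐼]⁻¹𝐊∗ through a Cholesky factorization rather than an explicit inverse, which is a common numerical choice and not something the text prescribes.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale=1.0, signal_std=1.0):
    """Squared-exponential kernel for 1-D inputs (an illustrative choice)."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return signal_std**2 * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_predict(x_train, y_train, x_test, noise_std=0.1):
    """Return the predictive mean and covariance at x_test."""
    K = rbf_kernel(x_train, x_train)
    K_star = rbf_kernel(x_train, x_test)        # K(X, X*), shape (n, n*)
    K_star_star = rbf_kernel(x_test, x_test)    # K(X*, X*)

    # Cholesky factor of K + sigma_n^2 I.
    L = np.linalg.cholesky(K + noise_std**2 * np.eye(len(x_train)))

    # alpha = (K + sigma_n^2 I)^-1 y via two triangular systems.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_star.T @ alpha

    # v = L^-1 K*, so v^T v = K*^T (K + sigma_n^2 I)^-1 K*.
    v = np.linalg.solve(L, K_star)
    cov = K_star_star - v.T @ v
    return mean, cov

x_train = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_train = np.sin(x_train)
x_test = np.array([-1.0, 0.5, 5.0])

mean, cov = gp_predict(x_train, y_train, x_test)

# Shifting the outputs changes the mean but leaves the covariance untouched.
mean_shifted, cov_shifted = gp_predict(x_train, y_train + 10.0, x_test)
print(np.allclose(cov, cov_shifted))
```

In this sketch the predictive variance at the test input far from the training data (x = 5.0) is much larger than at a test input that coincides with a training point.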
Computational Complexity for Learning Multivariate Unimodal Gaussian Models
For standard (vanilla) Gaussian processes, there are two main constraints:
- the overall computational complexity is 𝑂(𝑁³), where 𝑁 is the dimension of the covariance matrix 𝐊 (the number of training points)
- the memory consumption is quadratic, 𝑂(𝑁²)
Because of the computational complexity and memory consumption, the standard Gaussian process model quickly becomes impractical as the dataset grows. For regression tasks with a big dataset, Sparse GP is used to reduce the computational complexity.
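A rough back-of-the-envelope calculation shows how fast these costs grow. The sketch below assumes a dense covariance matrix stored in 8-byte floats and uses the usual leading-order operation count of about 𝑁³/3 for a Cholesky factorization; the function name `dense_gp_cost` is hypothetical.

```python
def dense_gp_cost(n, bytes_per_float=8):
    """Approximate memory (GiB) and Cholesky flops for an n x n covariance."""
    memory_gib = n * n * bytes_per_float / 2**30
    cholesky_flops = n**3 / 3
    return memory_gib, cholesky_flops

for n in (1_000, 10_000, 100_000):
    memory_gib, flops = dense_gp_cost(n)
    print(f"N={n:>7}: K needs {memory_gib:8.3f} GiB, Cholesky ~{flops:.2e} flops")
```

Doubling 𝑁 multiplies the memory by 4 and the factorization work by 8, so a tenfold increase in data costs roughly a thousand times more computation.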
Hyperparameter Optimization
Kernel functions play a significant role in GPR: the choice of kernel determines almost all the generalization properties of a GP model. There are many covariance functions to choose from, and you can also construct your own, depending on your specific problem. The criteria for choosing include whether the model should be smooth, whether it is sparse, whether it can change drastically, and whether it needs to be differentiable. A more in-depth discussion of choosing a kernel/covariance function for a Gaussian process can be found in [5]. Kernels have hyperparameters, and optimizing them is essential. Here we use the most widely used kernel, RBF, as an example to explain hyperparameter optimization. The general RBF function is given by:
- 𝑘(𝑥𝑖, 𝑥𝑗) = 𝜎𝑓² 𝑒𝑥𝑝(-(𝑥𝑖 - 𝑥𝑗)ᵀ(𝑥𝑖 - 𝑥𝑗) / (2𝑙²))
where the hyperparameters are:
- 𝜎𝑓 - is the vertical scale, which describes how far the function values can spread vertically
- 𝑙 - is the horizontal scale (length scale), which indicates how quickly the correlation between two points drops as their distance increases
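The effect of the two hyperparameters can be checked numerically. The sketch below assumes the common squared-exponential form 𝜎𝑓² exp(-‖𝑥𝑖 - 𝑥𝑗‖² / (2𝑙²)) for the RBF kernel; the function name `rbf` and the chosen values are illustrative only.

```python
import numpy as np

def rbf(xi, xj, signal_std, length_scale):
    """RBF kernel value for two input vectors (assumed squared-exponential form)."""
    diff = np.atleast_1d(xi) - np.atleast_1d(xj)
    return signal_std**2 * np.exp(-(diff @ diff) / (2.0 * length_scale**2))

origin, neighbour = np.array([0.0]), np.array([1.0])

# Correlation between two points one unit apart, for several length scales.
correlations = {}
for length_scale in (0.3, 1.0, 3.0):
    correlations[length_scale] = rbf(origin, neighbour, 1.0, length_scale)
    print(f"l={length_scale}: k(0, 1) = {correlations[length_scale]:.4f}")

# The prior variance at a single point is sigma_f squared.
prior_variance = rbf(origin, origin, signal_std=2.0, length_scale=1.0)
print(prior_variance)
```

A short length scale makes the correlation fall off almost immediately, while a long one keeps distant points strongly correlated; 𝜎𝑓 only rescales the overall amplitude.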
The optimized hyperparameters 𝚯∗ are determined by maximizing the log marginal likelihood:
- 𝚯∗ = 𝑎𝑟𝑔𝑚𝑎𝑥𝚯 [ 𝑙𝑜𝑔 𝐏(𝐲|𝐗,𝚯) ]
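As a minimal sketch of this optimization, the code below evaluates the log marginal likelihood of a zero-mean GP with Gaussian noise, log 𝐏(𝐲|𝐗,𝚯) = -½ 𝐲ᵀ𝐊y⁻¹𝐲 - ½ log|𝐊y| - (𝑛/2) log 2π with 𝐊y = 𝐊 + 𝜎𝑛²𝐼, and picks the best pair (𝑙, 𝜎𝑓) from a small grid. The grid search, the fixed noise level, the synthetic data, and all function names are assumptions for illustration; in practice a gradient-based optimizer is usually preferred.

```python
import numpy as np

def rbf_kernel(x, length_scale, signal_std):
    """Squared-exponential kernel matrix for 1-D inputs (illustrative choice)."""
    sq_dist = (x[:, None] - x[None, :]) ** 2
    return signal_std**2 * np.exp(-0.5 * sq_dist / length_scale**2)

def log_marginal_likelihood(x, y, length_scale, signal_std, noise_std):
    """log P(y | X, Theta) for a zero-mean GP with Gaussian noise."""
    n = len(x)
    K_y = rbf_kernel(x, length_scale, signal_std) + noise_std**2 * np.eye(n)
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # 0.5 * log|K_y| equals the sum of the logs of the Cholesky diagonal.
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * n * np.log(2.0 * np.pi))

# Synthetic data: a sine curve with small Gaussian noise (assumed example).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 40)
y = np.sin(x) + 0.1 * rng.standard_normal(x.shape)

# Candidate hyperparameters Theta = (length_scale, signal_std).
grid = [(l, s) for l in (0.1, 0.5, 1.0, 2.0, 5.0) for s in (0.5, 1.0, 2.0)]
scores = [log_marginal_likelihood(x, y, l, s, noise_std=0.1) for l, s in grid]

best_length_scale, best_signal_std = grid[int(np.argmax(scores))]
print(best_length_scale, best_signal_std)
```

The selected pair is the grid point with the highest log marginal likelihood, which is the discrete counterpart of the argmax over 𝚯.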
Thus, taking the hyperparameters into account, a more general equation for predictions at the new testing points is:
- 𝐟∗|𝐗,𝐲,𝐗∗,𝚯 ∼ 𝑁(𝐟̄∗, 𝑐𝑜𝑣(𝐟∗))
Note that after learning/tuning the hyperparameters, the predictive variance 𝑐𝑜𝑣(𝐟∗) depends not only on the inputs 𝐗 and 𝐗∗ but also on the outputs 𝐲, because the optimized hyperparameters 𝚯∗ are themselves fitted to 𝐲.