A Gaussian process model is a probability distribution over the possible functions that fit a set of points. Because we have a probability distribution over all possible functions, we can use its mean as the regression function and its variance to indicate how confident the predictions are. The key points are summarized as:

the distribution over functions (the posterior) is updated with new observations
a Gaussian process model is a probability distribution over possible functions, and the function values at any finite set of points are jointly Gaussian distributed
the mean function of the posterior distribution over possible functions is the function used for regression predictions

The regression function modeled by a multivariate Gaussian is given as:

𝐏(𝐟|𝐗) = 𝑁(𝐟|𝜇,𝐊)

where:

𝐗 = [𝑥₁, …, 𝑥_𝑛] # 𝐗 is a set of 𝑛 observed data points
𝐟 = [𝑓(𝑥₁), …, 𝑓(𝑥_𝑛)] # 𝐟 is the set of 𝑓 values for each observed data point
𝜇 = [𝑚(𝑥₁), …, 𝑚(𝑥_𝑛)] # 𝜇 is the “estimated” mean of 𝐟, 𝑚 represents the mean function (𝜇 is the “prior” that is later updated with 𝐟)
𝐊_𝑖𝑗 = 𝑘(𝑥_𝑖, 𝑥_𝑗) # 𝑘 represents a positive definite kernel function

With no observations, the mean function 𝑚 defaults to 𝑚(𝐗) = 0, given that the data is often normalized to a zero mean. The Gaussian process model is a distribution over functions whose shape (smoothness) is defined by 𝐊. If points 𝑥_𝑖 and 𝑥_𝑗 are considered similar by the kernel, the function outputs at the two points, 𝑓(𝑥_𝑖) and 𝑓(𝑥_𝑗), are expected to be similar.
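To make the role of 𝐊 concrete, here is a minimal sketch that builds a covariance matrix and draws functions from the zero-mean prior. It assumes a squared-exponential kernel, 1-D inputs, and NumPy; the helper name `rbf_kernel` and the values of `sigma_f` and `length` are illustrative choices, not something fixed by the text above.

```python
import numpy as np

def rbf_kernel(xa, xb, sigma_f=1.0, length=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays xa and xb (assumed kernel)."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * sq_dist / length**2)

rng = np.random.default_rng(0)
X = np.linspace(-3.0, 3.0, 50)
K = rbf_kernel(X, X)

# Zero-mean prior, m(X) = 0. A small jitter keeps K numerically positive definite.
jitter = 1e-8 * np.eye(len(X))
prior_samples = rng.multivariate_normal(np.zeros(len(X)), K + jitter, size=3)

# Neighbouring inputs are strongly correlated, distant inputs almost uncorrelated.
print("k(x_1, x_2)  =", K[0, 1])
print("k(x_1, x_50) =", K[0, -1])
```

Each row of `prior_samples` is one function drawn from the prior, evaluated at the 50 inputs. Because neighbouring inputs have covariance close to 1, the sampled values change smoothly along 𝐗.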

The process of regression with a Gaussian process model is illustrated below: given the observed data (red points) and a mean function 𝑓 (blue line) estimated from these observed data points, we make predictions at new points 𝐗_∗ as 𝐟(𝐗_∗).

The joint distribution of 𝐟 and 𝐟_∗ is expressed as:

[𝐟; 𝐟_∗] ∼ 𝑁([𝑚(𝐗); 𝑚(𝐗_∗)], [𝐊, 𝐊_∗; 𝐊_∗^T, 𝐊_∗∗]) # ';' separates vertically stacked blocks (rows)

where:

𝐊 = 𝐾(𝐗, 𝐗)
𝐊_∗ = 𝐾(𝐗, 𝐗_∗)
𝐊_∗∗ = 𝐾(𝐗_∗, 𝐗_∗)
𝑚(𝐗) = 0
𝑚(𝐗_∗) = 0

This is the joint probability distribution 𝐏(𝐟,𝐟_∗|𝐗,𝐗_∗) over 𝐟 and 𝐟_∗, but regression needs the conditional distribution 𝐏(𝐟_∗|𝐟,𝐗,𝐗_∗) over 𝐟_∗ only. The derivation from the joint to the conditional uses the conditioning theorem for multivariate Gaussians: if [𝐚; 𝐛] ∼ 𝑁([𝜇_𝑎; 𝜇_𝑏], [𝐀, 𝐂; 𝐂^T, 𝐁]), with ';' separating stacked blocks, then 𝐛|𝐚 ∼ 𝑁(𝜇_𝑏 + 𝐂^T𝐀^-1(𝐚 - 𝜇_𝑎), 𝐁 - 𝐂^T𝐀^-1𝐂). With zero means, the result is:

𝐟_∗|𝐟,𝐗,𝐗_∗ ∼ 𝑁(𝐊_∗^T𝐊^-1𝐟, 𝐊_∗∗ - 𝐊_∗^T𝐊^-1𝐊_∗)
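The noise-free conditional can be evaluated in a few lines. The sketch below assumes the standard form, with mean 𝐊_∗^T𝐊^-1𝐟 and covariance 𝐊_∗∗ - 𝐊_∗^T𝐊^-1𝐊_∗, a squared-exponential kernel, and 1-D inputs. The names `rbf_kernel` and `gp_posterior_noise_free` are hypothetical helpers introduced only for illustration.

```python
import numpy as np

def rbf_kernel(xa, xb, sigma_f=1.0, length=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays (assumed kernel)."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * sq_dist / length**2)

def gp_posterior_noise_free(X, f, X_star, jitter=1e-10):
    """Condition a zero-mean GP on exact function values f observed at X."""
    K = rbf_kernel(X, X) + jitter * np.eye(len(X))  # jitter only for numerical stability
    K_s = rbf_kernel(X, X_star)                     # K_*, shape (n, n_star)
    K_ss = rbf_kernel(X_star, X_star)               # K_**
    # Solve linear systems instead of forming K^-1 explicitly.
    mean = K_s.T @ np.linalg.solve(K, f)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov

X = np.array([-2.0, 0.0, 1.5])
f = np.sin(X)                      # toy "true" function values
X_star = np.array([-2.0, 0.5])     # first test point coincides with an observation
mean, cov = gp_posterior_noise_free(X, f, X_star)
print(mean, np.diag(cov))
```

At the test point that coincides with an observed input, the posterior mean reproduces the observed value and the variance collapses to roughly zero. At the unobserved point the variance stays positive.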

In more realistic situations, we don’t have access to true function values but noisy versions thereof:

𝑦 = 𝑓(𝑥) + ε

Assuming additive independent and identically distributed (i.i.d.) Gaussian noise with variance 𝜎_𝑛², the prior covariance of the noisy observations becomes 𝑐𝑜𝑣(𝐲) = 𝐊 + 𝜎_𝑛²𝐼. The joint distribution of the observed values and the function values at new testing points becomes:

[𝐲; 𝐟_∗] ∼ 𝑁(𝟎, [𝐊 + 𝜎_𝑛²𝐼, 𝐊_∗; 𝐊_∗^T, 𝐊_∗∗]) # ';' separates vertically stacked blocks (rows)

By deriving the conditional distribution, we get the predictive equations for Gaussian processes regression as:

𝐟_∗|𝐗,𝐲,𝐗_∗ ∼ 𝑁(𝐟̄_∗, 𝑐𝑜𝑣(𝐟_∗))

where:

𝐟̄_∗ = 𝐊_∗^T[𝐊 + 𝜎_𝑛²𝐼]^-1𝐲
𝑐𝑜𝑣(𝐟_∗) = 𝐊_∗∗ - 𝐊_∗^T[𝐊 + 𝜎_𝑛²𝐼]^-1𝐊_∗

In the variance function 𝑐𝑜𝑣(𝐟_∗), note that the variance does not depend on the observed outputs 𝐲 but only on the inputs 𝐗 and 𝐗_∗. This is a property of the Gaussian distribution.
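The following sketch evaluates the noisy predictive distribution and checks the claim about the variance numerically. It assumes the standard predictive mean 𝐊_∗^T[𝐊 + 𝜎_𝑛²𝐼]^-1𝐲 and covariance 𝐊_∗∗ - 𝐊_∗^T[𝐊 + 𝜎_𝑛²𝐼]^-1𝐊_∗, a squared-exponential kernel with fixed hyperparameters, and a Cholesky factorization for the solves. The function name `gp_predict` and the toy data are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(xa, xb, sigma_f=1.0, length=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays (assumed kernel)."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * sq_dist / length**2)

def gp_predict(X, y, X_star, sigma_n=0.1):
    """Predictive mean and covariance of f_* given noisy observations y."""
    K_y = rbf_kernel(X, X) + sigma_n**2 * np.eye(len(X))   # K + sigma_n^2 I
    K_s = rbf_kernel(X, X_star)
    K_ss = rbf_kernel(X_star, X_star)
    L = np.linalg.cholesky(K_y)                            # K_y = L L^T
    # np.linalg.solve is used for simplicity; a triangular solver would exploit L's structure.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # alpha = K_y^-1 y
    v = np.linalg.solve(L, K_s)                            # v = L^-1 K_*
    mean = K_s.T @ alpha
    cov = K_ss - v.T @ v                                   # K_** - K_*^T K_y^-1 K_*
    return mean, cov

rng = np.random.default_rng(1)
X = np.linspace(-3.0, 3.0, 8)
X_star = np.array([-1.0, 0.25, 2.0])
y1 = np.sin(X) + 0.1 * rng.standard_normal(len(X))
y2 = np.cos(X)          # entirely different outputs at the same inputs

mean1, cov1 = gp_predict(X, y1, X_star)
mean2, cov2 = gp_predict(X, y2, X_star)
same_cov = np.allclose(cov1, cov2)
print("covariances identical:", same_cov)
```

The two calls share 𝐗 and 𝐗_∗ but use different outputs, so the means differ while the covariances agree. This only holds while the hyperparameters are held fixed, as they are here.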

Computation Complexity for learning Multivariate Unimodal Gaussian Models

For standard (vanilla) Gaussian processes, there are two main constraints:

the overall computational complexity is 𝑂(𝑁³), where 𝑁 is the dimension of the covariance matrix 𝐊 (the number of observed points)
the memory consumption is quadratic, 𝑂(𝑁²)

Because of the computational complexity and memory consumption, the standard Gaussian process model gets stuck quickly as the dataset grows. For regression tasks with a big dataset, sparse GPs are used to reduce the computational complexity.
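A rough back-of-the-envelope calculation shows how quickly these costs grow. The sketch assumes the covariance matrix is stored densely in 64-bit floats and that the dominant cost is one Cholesky factorization, whose leading-order operation count is 𝑁³/3. The function name `gp_cost_estimate` is hypothetical, and the numbers are order-of-magnitude estimates rather than measurements.

```python
def gp_cost_estimate(n, bytes_per_float=8):
    """Rough storage and operation-count estimates for a dense n x n covariance matrix."""
    memory_gib = n * n * bytes_per_float / 2**30   # quadratic memory
    cholesky_flops = n**3 / 3                      # cubic time, leading-order term only
    return memory_gib, cholesky_flops

for n in (1_000, 10_000, 100_000):
    mem, flops = gp_cost_estimate(n)
    print(f"N={n:>7}: K needs ~{mem:.2f} GiB, Cholesky ~{flops:.1e} flops")
```

Doubling 𝑁 multiplies the memory by 4 and the factorization cost by 8, which is why exact GPs are typically limited to datasets of modest size.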

Hyperparameters Optimization

Kernel functions play a significant role in GPR. The choice of kernel function determines almost all the generalization properties of a GP model. There are many covariance functions to choose from, or you can build your own, depending on your specific problem. The criteria for choosing include whether the modeled function is smooth, whether it is sparse, whether it can change drastically, and whether it needs to be differentiable. More in-depth information on choosing a kernel/covariance function for a Gaussian process can be found in [5]. Every kernel has hyperparameters, and optimizing them is essential. Here we use the most widely used kernel, RBF, as an example to explain hyperparameter optimization. The general RBF function is given by:

𝑘(𝑥_𝑖, 𝑥_𝑗) = 𝜎_𝑓² 𝑒𝑥𝑝(-(𝑥_𝑖 - 𝑥_𝑗)^T(𝑥_𝑖 - 𝑥_𝑗) / (2𝑙²))

where the hyperparameters are:

𝜎_𝑓 - the vertical scale, which describes how much the function can span vertically
𝑙 - the horizontal scale (length scale), which indicates how quickly the correlation between two points drops as their distance increases
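The effect of the two hyperparameters can be seen by evaluating the kernel directly. This sketch assumes the common squared-exponential parameterization 𝜎_𝑓² exp(-(𝑥_𝑖 - 𝑥_𝑗)² / (2𝑙²)) for scalar inputs; the function name `rbf` and the chosen values are illustrative.

```python
import numpy as np

def rbf(x_i, x_j, sigma_f=1.0, length=1.0):
    """Squared-exponential kernel for scalar inputs (assumed parameterization)."""
    return sigma_f**2 * np.exp(-0.5 * (x_i - x_j) ** 2 / length**2)

# Horizontal scale: covariance between two points one unit apart.
for length in (0.3, 1.0, 3.0):
    print(f"l = {length}: k(0, 1) = {rbf(0.0, 1.0, length=length):.4f}")

# Vertical scale: the prior variance at a single point is sigma_f^2.
prior_var = rbf(0.0, 0.0, sigma_f=2.0)
print("k(x, x) with sigma_f = 2:", prior_var)
```

A small 𝑙 makes points one unit apart nearly independent, giving rapidly varying functions. A large 𝑙 keeps them strongly correlated, giving slowly varying functions. Changing 𝜎_𝑓 rescales every kernel value by the same factor.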

The optimized hyperparameters 𝚯^∗ are determined by maximizing the log marginal likelihood:

𝚯^∗ = 𝑎𝑟𝑔𝑚𝑎𝑥_𝚯 [ 𝑙𝑜𝑔 𝐏(𝐲|𝐗,𝚯) ]
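As a sketch of this maximization, the code below evaluates the log marginal likelihood of a zero-mean GP, log 𝐏(𝐲|𝐗,𝚯) = -½𝐲^T𝐊_y^-1𝐲 - ½log|𝐊_y| - (𝑛/2)log 2𝜋 with 𝐊_y = 𝐊 + 𝜎_𝑛²𝐼, and picks the best (𝜎_𝑓, 𝑙) on a coarse grid. The grid search, the squared-exponential kernel, the noise level treated as known, and all function names are assumptions made for illustration. In practice a gradient-based optimizer is the more usual choice.

```python
import numpy as np

def rbf_kernel(xa, xb, sigma_f, length):
    """Squared-exponential kernel matrix between 1-D input arrays (assumed kernel)."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * sq_dist / length**2)

def log_marginal_likelihood(X, y, sigma_f, length, sigma_n):
    """log P(y | X, Theta) for a zero-mean GP with Gaussian noise."""
    n = len(X)
    K_y = rbf_kernel(X, X, sigma_f, length) + sigma_n**2 * np.eye(n)
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K_y^-1 y
    # log|K_y| = 2 * sum(log(diag(L))), so the 1/2 factor cancels the 2.
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2 * np.pi)

rng = np.random.default_rng(2)
X = np.linspace(-4.0, 4.0, 30)
sigma_n = 0.2                                   # treated as known to keep the search 2-D
y = np.sin(X) + sigma_n * rng.standard_normal(len(X))

grid_sigma_f = np.logspace(-1, 1, 15)
grid_length = np.logspace(-1, 1, 15)
scores = np.array([[log_marginal_likelihood(X, y, sf, ell, sigma_n)
                    for ell in grid_length] for sf in grid_sigma_f])

i, j = np.unravel_index(np.argmax(scores), scores.shape)
best_sigma_f, best_length = grid_sigma_f[i], grid_length[j]
print(f"best sigma_f = {best_sigma_f:.3f}, best l = {best_length:.3f}")
```

The grid keeps the example dependency-free and easy to inspect, but it scales poorly with the number of hyperparameters. Every evaluation also requires a fresh Cholesky factorization, so hyperparameter tuning multiplies the 𝑂(𝑁³) cost.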

Thus, considering hyperparameters, a more general equation of predictions at the new testing points is:

𝐟_∗|𝐗,𝐲,𝐗_∗,𝚯 ∼ 𝑁(𝐟̄_∗, 𝑐𝑜𝑣(𝐟_∗))

Note that after learning/tuning the hyperparameters, the predictive variance 𝑐𝑜𝑣(𝐟_∗) depends not only on the inputs 𝐗 and 𝐗_∗ but also on the outputs 𝐲, because the optimized hyperparameters 𝚯^∗ are themselves fitted to 𝐲.

／var／log marcus chiu
