
Monday, July 2, 2018

Gaussian Processes for dumb dummies

It's quite disheartening when you don't understand something titled "xxx for dummies", which is how I felt after reading Katherine Bailey's blogpost "Gaussian Processes for Dummies". Luckily, the post included Python code and a link to some lecture notes, for which I also found the corresponding recorded lecture on YouTube.

Having looked at all that, the blogpost made more sense, and I now feel I understand Gaussian Processes (GP) a little bit better, at least as they apply to regression. Here is my take on it.

What is GP good for?
GP is an algorithm that takes x and y coordinates as input and returns a numerical fit and the associated standard deviation at each point. What makes GP special is that you don't have to choose a mathematical function for the fit, and you get the uncertainty of the fit at each point.

I refactored the code from Katherine Bailey's blogpost here and show a plot of a GP fit ("mu", red line) to five points of a sine function (blue squares, "Xtrain" and "ytrain"). "Xtest" is a vector of x-coordinates for which I want to evaluate the fit, "mu". "stdv" is a vector of standard deviations of the fit at each point in Xtest, and the gray shading in the plot represents 2 standard deviations of uncertainty. We'll get to "L" later, and there is a hyperparameter ("param" in the code) that we also need to talk about.

mu, stdv, L = Gaussian_process(Xtest, Xtrain, ytrain)


What does the GP algorithm do?
1. Construct the kernel matrix \mathbf{K_{**}} for the test set, where K_{**,ij} = e^{-(x_i-x_j)^2/(2\lambda)}. The kernel is a measure of how similar x_i is to x_j, and \lambda is a hyperparameter, called "param" in the code (note to self: "lambda" is a bad choice for a variable name in Python).

2. Construct the kernel matrix \mathbf{K} for the training set and Cholesky-decompose it to get \mathbf{L}, i.e. \mathbf{K = LL^T}.

3. Construct the kernel matrix \mathbf{K_*} connecting the test set to the training set and compute \mathbf{L_k = L^{-1}K_*} (use the solve function rather than explicitly inverting \mathbf{L}, for better numerical stability).

4. Compute \boldsymbol{\mu} = \mathbf{L^T_kL^{-1}y}_{train}.

5. Compute the standard deviation s_i = \sqrt{K_{**,ii} - \sum_j L^2_{k,ji}}, where K_{**,ii} is the i-th diagonal element of \mathbf{K_{**}}.

6. Compute a new \mathbf{L} by Cholesky-decomposing \mathbf{K_{**} - L_k^T L_k}; this is the \mathbf{L} returned by the code. (All six steps are sketched in code below.)
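Here is a minimal sketch of the six steps in Python (essentially my refactoring of Katherine Bailey's code; the small 1e-5 and 1e-6 terms are jitter added for numerical stability, not part of the math):

import numpy as np

def kernel(a, b, param):
    # squared-exponential kernel: K_ij = exp(-(a_i - b_j)^2 / (2*param))
    sqdist = np.sum(a**2, 1).reshape(-1, 1) + np.sum(b**2, 1) - 2*np.dot(a, b.T)
    return np.exp(-0.5 / param * sqdist)

def Gaussian_process(Xtest, Xtrain, ytrain, param=0.1):
    # 1. kernel matrix for the test set
    K_ss = kernel(Xtest, Xtest, param)
    # 2. kernel matrix for the training set, Cholesky-decomposed to get L
    K = kernel(Xtrain, Xtrain, param)
    L = np.linalg.cholesky(K + 1e-5*np.eye(len(Xtrain)))
    # 3. kernel matrix connecting test and training set; solve instead of inverting L
    K_s = kernel(Xtrain, Xtest, param)
    Lk = np.linalg.solve(L, K_s)
    # 4. the fit: mu = Lk^T L^-1 ytrain
    mu = np.dot(Lk.T, np.linalg.solve(L, ytrain)).reshape(-1)
    # 5. standard deviation at each test point
    s2 = np.diag(K_ss) - np.sum(Lk**2, axis=0)
    stdv = np.sqrt(s2)
    # 6. Cholesky factor of the posterior covariance, used to draw samples from the fit
    L = np.linalg.cholesky(K_ss + 1e-6*np.eye(len(Xtest)) - np.dot(Lk.T, Lk))
    return mu, stdv, L

Xtest and Xtrain are assumed to be column vectors (shape (n, 1)), as in the original code.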

What is the basic idea behind the GP algorithm?
How would you write an algorithm that generates a random function y=f(x)? I would argue that the simplest way is simply to generate a random number y \sim \mathcal{N}(0,1) (i.e. y = np.random.normal()) for each value of x. (\mathcal{N}(0,1) is standard notation for a Gaussian distribution with 0 mean and a standard deviation of 1.)
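For example (a minimal sketch; the x range is arbitrary):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 100)
for _ in range(3):
    # one independent draw from N(0,1) for every value of x
    y = np.random.normal(size=len(x))
    plt.plot(x, y)
plt.show()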

Here's a plot of three such functions.


If I generated 1000 of them, the average y-value at each x-coordinate, \langle y_i \rangle, would be 0. I can change this by sampling y \sim \mathcal{N}(\mu,\sigma^2) = \mu + \sigma \mathcal{N}(0,1) instead.
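In NumPy that shift and scale is just (with arbitrary example values for \mu and \sigma):

import numpy as np

mu, sigma = 2.0, 0.5  # arbitrary example values
y = mu + sigma * np.random.normal(size=100)  # y ~ N(mu, sigma^2)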

You'll notice that these functions are much more jagged than the functions you usually work with. For the functions you usually work with, the values of y tend to be similar if the corresponding values of x are similar, i.e. the y values are correlated by distance (|x_i - x_j|); here they are completely uncorrelated.

This correlation is quantified by the kernel matrix \mathbf{K}, and it can be used to generate smoother functions by y_i \sim \sum_j L_{ij} \mathcal{N}(0,1), where \mathbf{L} is obtained by Cholesky-decomposing \mathbf{K}. This works great, as you can see from the plot below.
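Here is a sketch of that (the kernel width param = 1.0 is an arbitrary choice, and a small jitter term is added so the Cholesky decomposition doesn't fail):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 100)
param = 1.0
K = np.exp(-0.5/param * (x.reshape(-1, 1) - x)**2)  # kernel matrix over the x grid
L = np.linalg.cholesky(K + 1e-6*np.eye(len(x)))     # K = LL^T
for _ in range(3):
    # correlated draws: y_i = sum_j L_ij z_j with z_j ~ N(0,1)
    y = np.dot(L, np.random.normal(size=len(x)))
    plt.plot(x, y)
plt.show()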


You can think of \mathbf{K} and \mathbf{L} a bit like the variance \sigma^2 and the standard deviation \sigma. \langle y_i \rangle = 0 as before, but this can be changed in analogy with the uncorrelated case: y_i \sim \mu + \sum_j L_{ij} \mathcal{N}(0,1).

The GP is a way of generalising this equation to y_i \sim \mu_i + \sum_j L_{ij} \mathcal{N}(0,1), using the training data to obtain values for \mu_i and L_{ij} such that \mu_i matches the y-values in the training data with correspondingly small L_{ij} values, i.e. greater certainty. Now if you generate 1000 such random functions and average them, you will get \boldsymbol{\mu}.
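To see this, draw many random functions from the fit using the mu and L returned by Gaussian_process and average them (a sketch; the five training points are my reconstruction of the sine-function example above):

import numpy as np

Xtrain = np.array([-4.0, -3.0, -2.0, -1.0, 1.0]).reshape(-1, 1)
ytrain = np.sin(Xtrain)
Xtest = np.linspace(-5, 5, 50).reshape(-1, 1)
mu, stdv, L = Gaussian_process(Xtest, Xtrain, ytrain)

# average 1000 random functions y = mu + Lz; the average approaches mu
samples = np.array([mu + np.dot(L, np.random.normal(size=len(Xtest)))
                    for _ in range(1000)])
print(np.abs(samples.mean(axis=0) - mu).max())  # close to 0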



This work is licensed under a Creative Commons Attribution 4.0
