## Monday, July 2, 2018

### Gaussian Processes for dumb dummies

It's quite disheartening when you don't understand something titled "xxx for dummies" so that's what I was after reading Katherine Bailey's blogpost "Gaussian Processes for Dummies". Luckily the post included Python code and a link to some lecture notes for which I also found the corresponding recorded lecture on YouTube.

Having looked at all that, the blogpost made more sense and I now feel I understand Gaussian Processes (GP) a little bit better, at least how it applies to regression, and here is my take on it.

What is GP good for?
GP is an algorithm that takes x and y coordinates as input and returns a numerical fit and the associated standard deviation at each point.  What makes GP special is that you don't have to choose a mathematical function for the fit and you get the uncertainty of the fit at each point

I refactored the code from Katherine Baily's blogpost here and show a plot of a GP fit ("mu", red line) to five points of a $sin$ function (blue squares, "Xtrain" and "ytrain"). "Xtest" is a vector of x-coordinates for which I want to evaluate the fit, "mu". "stdev" is a vector of standard deviations of the fit at each point in Xtest and the gray shading in the plot represents 2 standard deviations of uncertainty.  We'll get to "L" later and there is a hyperparameter ("param" in the code) that we also need to talk about.

mu, stdv, L = Gaussian_process(Xtest, Xtrain, ytrain)

What does the GP algorithm do?
1. Construct the kernel matrix $\mathbf{K_{**}}$, where $K_{**,ij} = e^{-(x_i-x_j)^2/2\lambda}$, for the test set. The kernel is a measure of how similar $x_i$ is to $x_j$ and $\lambda$ is a parameter, called "param" in the code (note to self: "lambda" is a bad choice for a variable name in Python).

2. Construct the kernel matrix $\mathbf{K}$ for the test set and Cholesky-decompose it to get $\mathbf{L}$., i.e. $\mathbf{K = LL^T}$

3. Construct the kernel matrix $\mathbf{K_*}$ connecting the test set to the training set and compute $\mathbf{L_k = L^{-1}K_*}$ (use the solve function for better numerical stability).

4. Compute  $\boldsymbol{\mu} = \mathbf{ L^T_kL^{-1}y}_{train}$

5. Compute the standard deviation $s_i = \sqrt{k_{**,ii} - \sum_j L^2_{k,ij}}$ where $\mathbf{k_{**}}$ is obtained by diagonalising $\mathbf{K_{**}}$

6. Compute $\mathbf{L}$ by Cholesky-decomposing $\mathbf{K_{** }- L_k^T L_k}$

What is the basic idea behind the GP algorithm?
How would you write an algorithm that would generate an random function $y=f(x)$. I would argue that the simplest way is simply to generate a random number $y \sim \mathcal{N}(0,1)$ (i.e. y = np.random.normal()) for each value of $x$. ($\mathcal{N}(0,1)$ is standard notation for a Gaussian distribution with 0 mean and a standard deviation of 1.)

Here's a plot of three such functions.

If I generated 1000 of them the average y-value at each x-coordinate $\langle y_i \rangle$ would be 0. I can change this by $y \sim \mathcal{N}(\mu,\sigma^2) = \mu + \sigma \mathcal{N}(0,1)$.

You'll notice that these functions are much more jagged than the functions you usually work with. Another way of saying this is that the values of $y$ tend to be similar if the corresponding values of $x$ are similar, i.e. the $y$ values are correlated by distance ($|x_i - x_j|$).

This correlation is quantified by the kernel matrix $\mathbf{K}$ and can be used to generate smoother functions by $y_i \sim \sum_j L_{ij} \mathcal{N}(0,1)$. This works great as you can see from this plot

You can think of $\mathbf{K}$ and $\mathbf{L}$ a bit like the variance $\sigma^2$ and the standard deviation $\sigma$.  $\langle y_i \rangle$ = 0 as before but this can be changed in analogy with the uncorrelated case $y_i \sim \mu + \sum_j L_{ij} \mathcal{N}(0,1)$

The GP is a way of generalising this equation as $y_i \sim \mu_i + \sum_j L_{ij} \mathcal{N}(0,1)$ and using the training data to obtain values for $\mu_i$ and $L_{ij}$ such that $\mu_i$ matches the y-values in the training data with correspondingly small $L_{ij}$ values, i.e. greater certainty. Now if you generate a 1000 such random functions and average you will get  $\boldsymbol{\mu}$.