It's quite disheartening when you don't understand something titled "xxx for dummies" so that's what I was after reading Katherine Bailey's blogpost "Gaussian Processes for Dummies". Luckily the post included Python code and a link to some lecture notes for which I also found the corresponding recorded lecture on YouTube.

Having looked at all that, the blogpost made more sense and I now feel I understand Gaussian Processes (GP) a little bit better, at least how it applies to regression, and here is my take on it.

GP is an algorithm that takes x and y coordinates as input and returns a numerical fit and the associated standard deviation at each point.

I refactored the code from Katherine Baily's blogpost here and show a plot of a GP fit ("mu", red line) to five points of a $sin$ function (blue squares, "Xtrain" and "ytrain"). "Xtest" is a vector of

mu, stdv, L = Gaussian_process(Xtest, Xtrain, ytrain)

1. Construct the kernel matrix $\mathbf{K_{**}}$, where $K_{**,ij} = e^{-(x_i-x_j)^2/2\lambda}$, for the test set. The kernel is a measure of how similar $x_i$ is to $x_j$ and $\lambda$ is a parameter, called "param" in the code (note to self: "lambda" is a bad choice for a variable name in Python).

2. Construct the kernel matrix $\mathbf{K}$ for the test set and Cholesky-decompose it to get $\mathbf{L}$., i.e. $\mathbf{K = LL^T}$

3. Construct the kernel matrix $\mathbf{K_*}$ connecting the test set to the training set and compute $\mathbf{L_k = L^{-1}K_*}$ (use the solve function for better numerical stability).

4. Compute $\boldsymbol{\mu} = \mathbf{ L^T_kL^{-1}y}_{train}$

5. Compute the standard deviation $s_i = \sqrt{k_{**,ii} - \sum_j L^2_{k,ij}}$ where $\mathbf{k_{**}}$ is obtained by diagonalising $\mathbf{K_{**}}$

6. Compute $\mathbf{L}$ by Cholesky-decomposing $\mathbf{K_{** }- L_k^T L_k}$

How would you write an algorithm that would generate an random function $y=f(x)$. I would argue that the simplest way is simply to generate a random number $y \sim \mathcal{N}(0,1)$ (i.e. y = np.random.normal()) for each value of $x$. ($ \mathcal{N}(0,1)$ is standard notation for a Gaussian distribution with 0 mean and a standard deviation of 1.)

Here's a plot of three such functions.

If I generated 1000 of them the average

You'll notice that these functions are much more jagged than the functions you usually work with. Another way of saying this is that the values of $y$ tend to be similar if the corresponding values of $x$ are similar, i.e. the $y$ values are

This correlation is quantified by the kernel matrix $\mathbf{K}$ and can be used to generate smoother functions by $y_i \sim \sum_j L_{ij} \mathcal{N}(0,1)$. This works great as you can see from this plot

You can think of $\mathbf{K}$ and $\mathbf{L}$ a bit like the variance $\sigma^2$ and the standard deviation $\sigma$. $\langle y_i \rangle$ = 0 as before but this can be changed in analogy with the uncorrelated case $y_i \sim \mu + \sum_j L_{ij} \mathcal{N}(0,1)$

The GP is a way of generalising this equation as $y_i \sim \mu_i + \sum_j L_{ij} \mathcal{N}(0,1)$ and using the training data to obtain values for $\mu_i$ and $L_{ij}$ such that $\mu_i$ matches the y-values in the training data with correspondingly small $L_{ij}$ values, i.e. greater certainty. Now if you generate a 1000 such random functions and average you will get $\boldsymbol{\mu}$.

This work is licensed under a Creative Commons Attribution 4.0

Having looked at all that, the blogpost made more sense and I now feel I understand Gaussian Processes (GP) a little bit better, at least how it applies to regression, and here is my take on it.

**What is GP good for?**GP is an algorithm that takes x and y coordinates as input and returns a numerical fit and the associated standard deviation at each point.

*What makes GP special is that you don't have to choose a mathematical function for the fit and you get the uncertainty of the fit at each point*I refactored the code from Katherine Baily's blogpost here and show a plot of a GP fit ("mu", red line) to five points of a $sin$ function (blue squares, "Xtrain" and "ytrain"). "Xtest" is a vector of

*x*-coordinates for which I want to evaluate the fit, "mu". "stdev" is a vector of standard deviations of the fit at each point in Xtest and the gray shading in the plot represents 2 standard deviations of uncertainty. We'll get to "L" later and there is a hyperparameter ("param" in the code) that we also need to talk about.mu, stdv, L = Gaussian_process(Xtest, Xtrain, ytrain)

**What does the GP algorithm do?**1. Construct the kernel matrix $\mathbf{K_{**}}$, where $K_{**,ij} = e^{-(x_i-x_j)^2/2\lambda}$, for the test set. The kernel is a measure of how similar $x_i$ is to $x_j$ and $\lambda$ is a parameter, called "param" in the code (note to self: "lambda" is a bad choice for a variable name in Python).

2. Construct the kernel matrix $\mathbf{K}$ for the test set and Cholesky-decompose it to get $\mathbf{L}$., i.e. $\mathbf{K = LL^T}$

3. Construct the kernel matrix $\mathbf{K_*}$ connecting the test set to the training set and compute $\mathbf{L_k = L^{-1}K_*}$ (use the solve function for better numerical stability).

4. Compute $\boldsymbol{\mu} = \mathbf{ L^T_kL^{-1}y}_{train}$

5. Compute the standard deviation $s_i = \sqrt{k_{**,ii} - \sum_j L^2_{k,ij}}$ where $\mathbf{k_{**}}$ is obtained by diagonalising $\mathbf{K_{**}}$

6. Compute $\mathbf{L}$ by Cholesky-decomposing $\mathbf{K_{** }- L_k^T L_k}$

**What is the basic idea behind the GP algorithm?**How would you write an algorithm that would generate an random function $y=f(x)$. I would argue that the simplest way is simply to generate a random number $y \sim \mathcal{N}(0,1)$ (i.e. y = np.random.normal()) for each value of $x$. ($ \mathcal{N}(0,1)$ is standard notation for a Gaussian distribution with 0 mean and a standard deviation of 1.)

Here's a plot of three such functions.

If I generated 1000 of them the average

*y*-value at each*x*-coordinate $\langle y_i \rangle$ would be 0. I can change this by $y \sim \mathcal{N}(\mu,\sigma^2) = \mu + \sigma \mathcal{N}(0,1)$.You'll notice that these functions are much more jagged than the functions you usually work with. Another way of saying this is that the values of $y$ tend to be similar if the corresponding values of $x$ are similar, i.e. the $y$ values are

*correlated*by distance ($|x_i - x_j|$).This correlation is quantified by the kernel matrix $\mathbf{K}$ and can be used to generate smoother functions by $y_i \sim \sum_j L_{ij} \mathcal{N}(0,1)$. This works great as you can see from this plot

You can think of $\mathbf{K}$ and $\mathbf{L}$ a bit like the variance $\sigma^2$ and the standard deviation $\sigma$. $\langle y_i \rangle$ = 0 as before but this can be changed in analogy with the uncorrelated case $y_i \sim \mu + \sum_j L_{ij} \mathcal{N}(0,1)$

The GP is a way of generalising this equation as $y_i \sim \mu_i + \sum_j L_{ij} \mathcal{N}(0,1)$ and using the training data to obtain values for $\mu_i$ and $L_{ij}$ such that $\mu_i$ matches the y-values in the training data with correspondingly small $L_{ij}$ values, i.e. greater certainty. Now if you generate a 1000 such random functions and average you will get $\boldsymbol{\mu}$.

This work is licensed under a Creative Commons Attribution 4.0

## No comments:

Post a Comment