It's quite disheartening when you don't understand something titled "xxx for dummies", and that's how I felt after reading Katherine Bailey's blogpost "Gaussian Processes for Dummies". Luckily, the post included Python code and a link to some lecture notes, for which I also found the corresponding recorded lecture on YouTube.
Having looked at all that, the blogpost made more sense, and I now feel I understand Gaussian Processes (GP) a little better, at least as they apply to regression. Here is my take on it.
What is GP good for?
GP is an algorithm that takes x and y coordinates as input and returns a numerical fit together with the standard deviation of that fit at each point. What makes GP special is that you don't have to choose a mathematical function for the fit, and you get the uncertainty of the fit at each point.
I refactored the code from Katherine Bailey's blogpost here and show a plot of a GP fit ("mu", red line) to five points of a sine function (blue squares, "Xtrain" and "ytrain"). "Xtest" is a vector of x-coordinates for which I want to evaluate the fit, "mu". "stdv" is a vector of standard deviations of the fit at each point in Xtest, and the gray shading in the plot represents 2 standard deviations of uncertainty. We'll get to "L" later, and there is a hyperparameter ("param" in the code) that we also need to talk about.
mu, stdv, L = Gaussian_process(Xtest, Xtrain, ytrain)
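For reference, a plot like the one described above could be produced roughly like this (a sketch only: it assumes mu and stdv are 1-D NumPy arrays of the same length as Xtest, and the styling of the original figure may differ):

import matplotlib.pyplot as plt

# Assumes mu and stdv are 1-D arrays matching Xtest in length
plt.plot(Xtrain, ytrain, 'bs', label='training points (Xtrain, ytrain)')
plt.plot(Xtest, mu, 'r-', label='GP fit (mu)')
plt.fill_between(Xtest.flatten(), mu - 2 * stdv, mu + 2 * stdv,
                 color='gray', alpha=0.4, label='2 standard deviations')
plt.legend()
plt.show()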
What does the GP algorithm do?
1. Construct the kernel matrix \mathbf{K_{**}}, where K_{**,ij} = e^{-(x_i-x_j)^2/(2\lambda)}, for the test set. The kernel is a measure of how similar x_i is to x_j, and \lambda is the hyperparameter called "param" in the code (note to self: "lambda" is a bad choice for a variable name in Python).
2. Construct the kernel matrix \mathbf{K} for the training set and Cholesky-decompose it to get \mathbf{L}, i.e. \mathbf{K = LL^T}.
3. Construct the kernel matrix \mathbf{K_*} connecting the test set to the training set and compute \mathbf{L_k = L^{-1}K_*} (use the solve function for better numerical stability).
4. Compute \boldsymbol{\mu} = \mathbf{L_k^T L^{-1} y}_{train}
5. Compute the standard deviation s_i = \sqrt{K_{**,ii} - \sum_j L^2_{k,ji}}, where K_{**,ii} are the diagonal elements of \mathbf{K_{**}}
6. Compute \mathbf{L} by Cholesky-decomposing \mathbf{K_{**} - L_k^T L_k} (all six steps are sketched in code below)
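Here is a minimal sketch of what such a Gaussian_process function might look like, following the six steps above. The variable names mirror the post's code, but the default param value and the small noise term added to the diagonal (to keep the Cholesky decomposition numerically stable) are my own assumptions, and Xtrain, Xtest and ytrain are assumed to be 1-D NumPy arrays.

import numpy as np

def kernel(a, b, param):
    # Squared-exponential kernel: K_ij = exp(-(a_i - b_j)^2 / (2*param))
    sqdist = (a.reshape(-1, 1) - b.reshape(1, -1)) ** 2
    return np.exp(-0.5 * sqdist / param)

def Gaussian_process(Xtest, Xtrain, ytrain, param=0.1, noise=1e-6):
    # Step 1: kernel matrix K_** for the test set
    K_ss = kernel(Xtest, Xtest, param)
    # Step 2: kernel matrix K for the training set, Cholesky-decomposed to L
    K = kernel(Xtrain, Xtrain, param)
    L = np.linalg.cholesky(K + noise * np.eye(len(Xtrain)))
    # Step 3: kernel matrix K_* connecting training and test sets, L_k = L^{-1} K_*
    K_s = kernel(Xtrain, Xtest, param)
    Lk = np.linalg.solve(L, K_s)
    # Step 4: mu = L_k^T L^{-1} y_train
    mu = Lk.T @ np.linalg.solve(L, ytrain)
    # Step 5: standard deviation from the diagonal of K_** minus the column sums of L_k squared
    stdv = np.sqrt(np.diag(K_ss) - np.sum(Lk ** 2, axis=0))
    # Step 6: Cholesky factor of the posterior covariance K_** - L_k^T L_k
    L_post = np.linalg.cholesky(K_ss + noise * np.eye(len(Xtest)) - Lk.T @ Lk)
    return mu, stdv, L_post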
What is the basic idea behind the GP algorithm?
How would you write an algorithm that generates a random function y = f(x)? I would argue that the simplest way is simply to generate a random number y \sim \mathcal{N}(0,1) (i.e. y = np.random.normal()) for each value of x. (\mathcal{N}(0,1) is standard notation for a Gaussian distribution with mean 0 and a standard deviation of 1.)
Here's a plot of three such functions.
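Such a plot can be generated with a few lines like these (the x range and the number of grid points are arbitrary choices):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 50)              # arbitrary x grid
for _ in range(3):
    y = np.random.normal(size=len(x))   # an independent N(0,1) draw at each x
    plt.plot(x, y)
plt.show()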
If I generated 1000 of them, the average y-value at each x-coordinate, \langle y_i \rangle, would be 0. I can change this by sampling y \sim \mathcal{N}(\mu,\sigma^2) = \mu + \sigma \mathcal{N}(0,1) instead.
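In code this is just a scale and a shift of the standard normal draw (the values of \mu and \sigma here are arbitrary examples, and x is the grid from the snippet above):

mu, sigma = 2.0, 0.5                              # arbitrary example values
y = mu + sigma * np.random.normal(size=len(x))    # y ~ N(mu, sigma^2) at each x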
You'll notice that these functions are much more jagged than the functions you usually work with. Another way of saying this is that, for the functions you usually work with, the values of y tend to be similar if the corresponding values of x are similar, i.e. the y values are correlated by distance (|x_i - x_j|), whereas here they are not.
This correlation is quantified by the kernel matrix \mathbf{K} and can be used to generate smoother functions by y_i \sim \sum_j L_{ij} \mathcal{N}(0,1), where \mathbf{L} is the Cholesky factor of \mathbf{K} (\mathbf{K = LL^T}, as above). This works great, as you can see from this plot.
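A sketch of how such smoother functions can be drawn, using the same squared-exponential kernel as above (param = 0.1 and the small diagonal jitter are again my own assumptions):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 50)
param = 0.1
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / param)   # K_ij = exp(-(x_i-x_j)^2 / (2*param))
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))           # jitter keeps K positive definite
for _ in range(3):
    y = L @ np.random.normal(size=len(x))                   # y_i = sum_j L_ij * N(0,1)
    plt.plot(x, y)
plt.show()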
You can think of \mathbf{K} and \mathbf{L} a bit like the variance \sigma^2 and the standard deviation \sigma. \langle y_i \rangle = 0 as before, but this can be changed in analogy with the uncorrelated case: y_i \sim \mu + \sum_j L_{ij} \mathcal{N}(0,1).
The GP is a way of generalising this equation to y_i \sim \mu_i + \sum_j L_{ij} \mathcal{N}(0,1), using the training data to obtain values for \mu_i and L_{ij} such that \mu_i matches the y-values in the training data, with correspondingly small L_{ij} values, i.e. greater certainty, near the training points. Now if you generate 1000 such random functions and average them, you get \boldsymbol{\mu}.
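As a quick sanity check of that last statement, one could draw many such functions from the mu and L returned by Gaussian_process above and average them (again assuming mu is a 1-D array of length len(Xtest)):

import numpy as np

# mu and L are the values returned by Gaussian_process(Xtest, Xtrain, ytrain)
samples = np.array([mu + L @ np.random.normal(size=len(Xtest)) for _ in range(1000)])
average = samples.mean(axis=0)   # approaches mu as more functions are averaged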

This work is licensed under a Creative Commons Attribution 4.0