Proteins and Wave Functions: March 2021

Monday, March 22, 2021

Can machine learning regression extrapolate?

I recently developed an ML model to predict pIC50 values for molecules and used it together with a genetic algorithm (GA) code to search for molecules with large pIC50 values. However, the GA searches never found molecules with pIC50 values that were larger than those in my training set.

This brought up the general question of whether ML models are capable of outputting values that are larger than those found in the training set. I made a simple example to investigate this issue for different ML models.

The training set consists of three points: $\mathbf{X}_1 = (1, 0, 0)$ corresponds to 1, $\mathbf{X}_2 = (0, 1, 0)$ corresponds to 2, and $\mathbf{X}_3 = (0, 0, 1)$ corresponds to 3.

The code can be found here. If you are new to ML, check out this site.

Linear Regression
This training set can be fit by a linear regression model $y = \mathbf{wX}$ with the weights $\mathbf{w} = (1, 2, 3)$. Clearly this simple ML model can extrapolate in the sense that, for example, $\mathbf{X} = (1, 0, 1)$ will yield 4, which is larger than the maximum value in the training set (3). Similarly, $\mathbf{X} = (0, 0, 2)$ will yield 6.
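The arithmetic above can be checked with a few lines of plain Python. This is only a minimal sketch, not the code linked in the post: the weights are simply set to $\mathbf{w} = (1, 2, 3)$ rather than fitted, and the names `predict`, `fits`, and `extrapolated` are made up for illustration.

```python
# Training set from the post: three one-hot inputs mapped to 1, 2 and 3.
X_train = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
y_train = [1, 2, 3]

# Weights stated in the post (set by hand here, not obtained by fitting).
w = (1, 2, 3)


def predict(x, weights=w):
    """Linear model without intercept: y = w . x"""
    return sum(wi * xi for wi, xi in zip(weights, x))


# The model reproduces the training targets ...
fits = [predict(x) for x in X_train]

# ... and returns values above the training maximum for new inputs.
extrapolated = [predict((1, 0, 1)), predict((0, 0, 2))]

print("fits:", fits)
print("extrapolated:", extrapolated)
```

Because the prediction is a plain weighted sum, nothing caps the output: scaling an input scales the prediction with it.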

Neural Network
Next I tried an NN with one hidden layer of 2 nodes and the sigmoid activation function. For this model $\mathbf{X} = (1, 0, 1)$ yields 1.6 and $\mathbf{X} = (0, 0, 2)$ yields 3.2, which is only slightly larger than 3.

The output of the NN is given by $\mathbf{O}_h\mathbf{w}_{ho}+\mathbf{b}_{ho}$, where $\mathbf{O}_h$ is the output of the hidden layer. With the sigmoid function, each element of $\mathbf{O}_h$ is at most 1, so its maximum value is $\mathbf{O}_h = (1, 1)$, for which $\mathbf{O}_h\mathbf{w}_{ho}+\mathbf{b}_{ho} = 3.3$. So this is the maximum value this NN can output. For comparison, $\mathbf{O}_h = (0.99, 0.65)$ for $\mathbf{X}_3 = (0, 0, 1)$.
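The ceiling argument can be illustrated with a small pure-Python sketch. The parameters below (`W_ih`, `b_ih`, `w_ho`, `b_ho`) are hypothetical values picked for illustration, not the fitted values from my model, and the bound as written assumes it is computed from the positive output weights only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters for a 3-2-1 network (NOT the fitted values from the post).
W_ih = [[0.5, -1.0], [1.0, 0.5], [2.0, 0.3]]  # input -> hidden weights, 3 x 2
b_ih = [0.1, -0.2]                            # hidden biases
w_ho = [2.0, 1.0]                             # hidden -> output weights
b_ho = 0.3                                    # output bias


def forward(x):
    """Output of the network: O_h . w_ho + b_ho with sigmoid hidden units."""
    hidden = [
        sigmoid(sum(x[i] * W_ih[i][j] for i in range(3)) + b_ih[j])
        for j in range(2)
    ]
    return sum(h * w for h, w in zip(hidden, w_ho)) + b_ho


# Each sigmoid output is at most 1, so the network output can never exceed
# the sum of the positive output weights plus the output bias.
upper_bound = sum(max(w, 0.0) for w in w_ho) + b_ho

# Scale the third input up: the output saturates instead of growing.
outputs = [forward((0, 0, s)) for s in (1, 2, 10, 100, 1000)]

print("upper bound:", upper_bound)
print("outputs:", outputs)
```

However large the input becomes, the outputs only creep toward `upper_bound`, which is the behaviour described above.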

If I instead use the ReLU activation function (which doesn't have an upper bound), $\mathbf{X} = (1, 0, 1)$ yields 2.2 and $\mathbf{X} = (0, 0, 2)$ yields 4.2, which is somewhat larger than 3.
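For contrast, here is the same kind of sketch with ReLU in the hidden layer. Again, the parameters are hypothetical and chosen only for illustration (they are not the fitted values, so the numbers differ from the 2.2 and 4.2 quoted above); the point is only that the output keeps growing as the input is scaled up.

```python
def relu(z):
    return max(0.0, z)


# Hypothetical parameters for a 3-2-1 network (NOT the fitted values from the post).
W_ih = [[0.5, -1.0], [1.0, 0.5], [2.0, 0.3]]  # input -> hidden weights, 3 x 2
b_ih = [0.1, -0.2]                            # hidden biases
w_ho = [2.0, 1.0]                             # hidden -> output weights
b_ho = 0.3                                    # output bias


def forward_relu(x):
    """Output of the network: O_h . w_ho + b_ho with ReLU hidden units."""
    hidden = [
        relu(sum(x[i] * W_ih[i][j] for i in range(3)) + b_ih[j])
        for j in range(2)
    ]
    return sum(h * w for h, w in zip(hidden, w_ho)) + b_ho


# Scale the third input up: with ReLU there is no ceiling on the hidden
# activations, so the output grows along with the input.
relu_outputs = [forward_relu((0, 0, s)) for s in (1, 2, 10, 100)]

print("outputs:", relu_outputs)
```

Whether a trained ReLU network actually extrapolates in a useful direction still depends on the signs and sizes of the learned weights; the sketch only shows that the activation function itself imposes no upper limit.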

So, NNs that exclusively use ReLU can in principle yield values that are larger than those found in the training set. But if a layer uses a bounded activation function such as sigmoid, then the room for extrapolation depends on how close the outputs of that layer are to 1 when predicting the largest value in the training set.

Random Forest
The example is so simple that it can be fit with a single decision tree (RF outputs the mean prediction of a collection of such decision trees):