Thursday, November 24, 2016

Which method is more accurate? or Errors have error bars

2017.01.10 update: this blogpost is now available as a citeable preprint

This post is my attempt at distilling some of the information in two papers published by Anthony Nicholls (here and here). Anthony also very kindly provided some new equations, not found in the papers, in response to my questions.

Errors also have error bars
Say you have two methods, A and B, for predicting some property and you want to determine which method is more accurate by computing the property using both methods for the same set of N different molecules for which reference values are available. You evaluate the error (for example the RMSE) of each method relative to the reference values and compare. The point of this post is that these errors have uncertainties (error bars) that depend on the number of data points N (the more data, the smaller the uncertainty), and you have to take these uncertainties into consideration when you compare errors.

The most common error bars reflect 95% confidence and that's what I'll use here.  

The expressions for the error bars assume a large N, where in practice "large" means roughly 10 or more data points. If you use fewer points or would like more accurate estimates, please see the Nicholls papers for what to do.

Root-Mean-Square-Error (RMSE)
The error bars for the RMSE are asymmetric. The lower and upper error bars on the RMSE for method X (RMSE_X) are
L_X = RMSE_X - \sqrt {RMSE_X^2 - \frac{{1.96\sqrt 2 RMSE_X^2}}{{\sqrt {N - 1} }}}
= RMSE_X \left( 1- \sqrt{ 1- \frac{1.96\sqrt{2}}{\sqrt{N-1}}}  \right)

U_X =  RMSE_X \left(  \sqrt{ 1+ \frac{1.96\sqrt{2}}{\sqrt{N-1}}}-1  \right) 
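
As a rough illustration of the two RMSE expressions above, here is a minimal Python sketch. The function names `rmse` and `rmse_error_bars` and the parameter `z` (set to 1.96 for 95% confidence) are my own choices for this example and do not come from the Nicholls papers. Note that the square root in the lower error bar is only real when 1.96√2/√(N−1) < 1, i.e. for N of about 9 or more, which fits the large-N assumption.

```python
import math

def rmse(predicted, reference):
    """Root-mean-square error of predictions relative to reference values."""
    n = len(predicted)
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)) / n)

def rmse_error_bars(rmse_x, n, z=1.96):
    """Return (L_X, U_X) for an RMSE computed from n data points.

    Uses the large-N expressions; z = 1.96 corresponds to 95% confidence.
    """
    t = z * math.sqrt(2.0) / math.sqrt(n - 1)
    if t >= 1.0:
        # The lower error bar would need the square root of a negative number.
        raise ValueError("too few data points for the large-N expression")
    lower = rmse_x * (1.0 - math.sqrt(1.0 - t))
    upper = rmse_x * (math.sqrt(1.0 + t) - 1.0)
    return lower, upper

if __name__ == "__main__":
    # Hypothetical numbers: an RMSE of 1.0 obtained from 50 molecules.
    lower, upper = rmse_error_bars(1.0, 50)
    print(f"RMSE = 1.00 (-{lower:.2f} / +{upper:.2f})")
```

The lower error bar always comes out larger than the upper one, which is the asymmetry mentioned above.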

Mean Absolute Error (MAE)
The error bars for the MAE are also asymmetric. The lower and upper error bars on the MAE for method X (MAE_X) are

L_X =  MAE_X \left( 1- \sqrt{ 1- \frac{1.96\sqrt{2}}{\sqrt{N-1}}}  \right)  

U_X =  MAE_X \left(  \sqrt{ 1+ \frac{1.96\sqrt{2}}{\sqrt{N-1}}}-1  \right)  
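
To show how the MAE and its error bars might be reported together, here is a small Python sketch that returns the MAE along with the interval from MAE_X − L_X to MAE_X + U_X. The function name `mae_with_interval` and the parameter `z` are assumptions made for this example only.

```python
import math

def mae_with_interval(predicted, reference, z=1.96):
    """Return (MAE, MAE - L_X, MAE + U_X) using the large-N expressions."""
    n = len(predicted)
    mae = sum(abs(p - r) for p, r in zip(predicted, reference)) / n
    t = z * math.sqrt(2.0) / math.sqrt(n - 1)
    if t >= 1.0:
        # The lower error bar is undefined for very small N.
        raise ValueError("too few data points for the large-N expression")
    lower_bar = mae * (1.0 - math.sqrt(1.0 - t))
    upper_bar = mae * (math.sqrt(1.0 + t) - 1.0)
    return mae, mae - lower_bar, mae + upper_bar

if __name__ == "__main__":
    # Made-up data: 50 predictions that are off by either 0 or 1.
    predicted = [float(i % 2) for i in range(50)]
    reference = [0.0] * 50
    mae, low, high = mae_with_interval(predicted, reference)
    print(f"MAE = {mae:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```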

Mean Error (ME) 
The error bars for the mean error are symmetric and given by 
L_X = U_X =  \frac{1.96 s_N}{\sqrt{N}}

where s_N is the population standard deviation (e.g. STDEVP in Excel).
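
A minimal Python sketch of the mean-error expression follows. It assumes that s_N is the population standard deviation of the signed errors (predicted minus reference), computed here with `statistics.pstdev` from the standard library, which divides by N just like STDEVP. The function name `mean_error_with_bar` is hypothetical.

```python
import math
import statistics

def mean_error_with_bar(predicted, reference, z=1.96):
    """Return (ME, error bar), where the symmetric bar is z * s_N / sqrt(N)."""
    # Assumption: s_N is taken over the signed errors of the method.
    errors = [p - r for p, r in zip(predicted, reference)]
    n = len(errors)
    me = statistics.mean(errors)
    s_n = statistics.pstdev(errors)  # population standard deviation (divides by N)
    return me, z * s_n / math.sqrt(n)

if __name__ == "__main__":
    # Made-up data for illustration only.
    predicted = [1.2, 0.8, 1.1, 0.9, 1.3, 0.7, 1.0, 1.1, 0.9, 1.2]
    reference = [1.0] * 10
    me, bar = mean_error_with_bar(predicted, reference)
    print(f"ME = {me:.3f} +/- {bar:.3f}")
```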

Pearson’s correlation coefficient, \textbf{r}
The first thing to check is whether your r values themselves are statistically significant, i.e. r_X > r_{significant} where

r_{significant} = \frac{1.96}{\sqrt{N-2+1.96^2}}  

The error bars for the Pearson's r value are asymmetric and given by 
L_X = r_X - \frac{e^{2F_-}-1}{e^{2F_-}+1}
U_X =  \frac{e^{2F_+}-1}{e^{2F_+}+1} - r_X

where

F_{\pm} = \frac{1}{2} \ln \frac{1+r_X}{1-r_X} \pm r_{significant}
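
The expressions for Pearson's r can be sketched in Python as shown here. The function names are my own. The only step not spelled out above is that (e^{2F} − 1)/(e^{2F} + 1) is the hyperbolic tangent of F, so `math.tanh` is used for the back-transformation.

```python
import math

def r_significant(n, z=1.96):
    """Smallest r that is statistically significant for n data points."""
    return z / math.sqrt(n - 2 + z * z)

def pearson_r_error_bars(r_x, n, z=1.96):
    """Return (L_X, U_X) for a Pearson r value computed from n data points."""
    r_sig = r_significant(n, z)
    f = 0.5 * math.log((1.0 + r_x) / (1.0 - r_x))  # the logarithmic term in F
    lower = r_x - math.tanh(f - r_sig)  # tanh(F) = (e^{2F} - 1) / (e^{2F} + 1)
    upper = math.tanh(f + r_sig) - r_x
    return lower, upper

if __name__ == "__main__":
    # Hypothetical case: r = 0.9 obtained from 50 molecules.
    n, r_x = 50, 0.9
    print("significant:", r_x > r_significant(n))
    lower, upper = pearson_r_error_bars(r_x, n)
    print(f"r = {r_x:.2f} (-{lower:.3f} / +{upper:.3f})")
```

Because of the transformation, r_X + U_X can never exceed 1, and the bars become more asymmetric as r_X approaches 1.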

Comparing two methods
If error_X is some measure of the error (RMSE, MAE, etc.) and error_A > error_B, then the difference is statistically significant only if

error_A - error_B > \sqrt {L_A^2 + U_B^2 - 2{r_{AB}}{L_A}{U_B}}

where r_{AB} is the Pearson's r value of method A compared to B, not to be confused with r_A, which compares A to the reference values. Conversely, if this condition is not satisfied then you cannot say with 95% confidence that method B is more accurate than method A, because the error bars are too large.

Note also that if there is a high degree of correlation between the predictions (r_{AB} \approx 1) and the error bars are similar in size L_A \approx U_B then even small differences in error could be significant.

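Usually one can assume that r_{AB} > 0, in which case the right-hand side of the condition is smaller than \sqrt {L_A^2 + U_B^2}, which in turn is no larger than L_A + U_B. So if error_A - error_B > \sqrt {L_A^2 + U_B^2} or, more conservatively, error_A - error_B > L_A + U_B, then the difference is statistically significant, but it is better to evaluate r_{AB} to be sure.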
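
The comparison can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, r_{AB} is computed between the two sets of predictions, and the error bars L_A and U_B are assumed to have been obtained already from the expressions for the chosen error measure.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def difference_is_significant(error_a, error_b, lower_a, upper_b, r_ab=0.0):
    """True if error_A - error_B exceeds the combined error bar.

    Expects error_a > error_b. The default r_ab = 0 is the conservative
    choice whenever the true correlation between A and B is positive.
    """
    if error_a <= error_b:
        raise ValueError("label the methods so that error_a > error_b")
    threshold = math.sqrt(lower_a ** 2 + upper_b ** 2
                          - 2.0 * r_ab * lower_a * upper_b)
    return (error_a - error_b) > threshold

if __name__ == "__main__":
    # Hypothetical RMSEs of 1.0 (A) and 0.8 (B) from 50 molecules, with
    # L_A and U_B taken from the RMSE expressions for N = 50.
    l_a, u_b = 0.2228, 0.1452
    print(difference_is_significant(1.0, 0.8, l_a, u_b))             # r_AB ignored
    print(difference_is_significant(1.0, 0.8, l_a, u_b, r_ab=0.9))   # correlated
```

With these made-up numbers the difference of 0.2 is not significant when r_{AB} is ignored, but it becomes significant once a correlation of 0.9 between the two methods is taken into account.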

The meaning of 95% confidence
Say you compute errors for some property for 50 molecules using method A (error_A) and B (error_B) and observe that Eq 11 (the condition for a statistically significant difference given in the previous section) is true.

Assuming no prior knowledge on the performance of A and B, if you repeat this process an additional 40 times using all new molecules each time then in 38 cases (38/40 = 0.95) the errors observed for method A will likely be between error_A - L_A and error_A + U_A and similarly for method B. For one of the remaining two cases the error is expected to be larger than this range, while for the other remaining case it is expected to be smaller. Furthermore, in 39 of the 40 cases error_A is likely larger than error_B, while error_A is likely smaller than error_B in the remaining case. 



This work is licensed under a Creative Commons Attribution 3.0 Unported License.