Proteins and Wave Functions: RMSD and MUE versus Correlation Coefficient: a simple illustration of the difference

Thursday, March 7, 2013

Here is a simple illustration of what the root-mean-square deviation (RMSD), mean unsigned error (MUE), and correlation coefficient ($r$) can tell you about your data. Imagine that the $x$-axis is experimental data and the $y$-axis is computed data in some arbitrary units. (You can access the data here.)

The blue dots represent a perfect correlation ($y=x$), for which $r$ = 1 and RMSD = MUE = 0.

The red dots represent the function $y=2x-5$. The RMSD = 2.9 and MUE = 2.5 would both seem to indicate a pretty crappy model. However, $r$ = 1.0, indicating that the error is systematic and can be fixed completely by a linear fit. In this case, the MUE is an indicator of the part of this systematic error that can be fixed by an offset (the full linear correction is $x=\frac{1}{2}y+2.5$).
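The red-dot numbers can be checked with a few lines of NumPy. The sketch below assumes the experimental values are the integers 1 through 10. That is a guess rather than the post's linked data set, although it does reproduce the quoted RMSD and MUE. The helper names `rmsd` and `mue` are made up for illustration.

```python
import numpy as np

def rmsd(x, y):
    """Root-mean-square deviation between computed (y) and experimental (x) values."""
    return np.sqrt(np.mean((y - x) ** 2))

def mue(x, y):
    """Mean unsigned error between computed (y) and experimental (x) values."""
    return np.mean(np.abs(y - x))

# Assumption: experimental values 1..10 (not taken from the original data file).
x = np.arange(1, 11, dtype=float)
y_red = 2 * x - 5                      # the "red" model from the text

r = np.corrcoef(x, y_red)[0, 1]
print(f"RMSD = {rmsd(x, y_red):.1f}, MUE = {mue(x, y_red):.1f}, r = {r:.1f}")

# Inverting the systematic error, x = y/2 + 2.5, recovers the experiment exactly.
y_corrected = 0.5 * y_red + 2.5
print(f"RMSD after correction = {rmsd(x, y_corrected):.1f}")
```

Under that assumption the script should print RMSD = 2.9, MUE = 2.5 and r = 1.0 for the red model, and an RMSD of 0.0 once the linear correction is applied.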

The orange dots do not represent a linear function and clearly represent a worse model than the red dots. However, the RMSD = 2.6 and MUE = 2.1 are both slightly better than for the red model. But $r$ = 0.7, indicating that only part of the discrepancy can be fixed by a linear fit.

Indeed, a linear fit to the orange data ($x=1.2y-2.75$) can only reduce the RMSD and MUE to 2.1 and 1.8, respectively.

The relationship between $r$ and the RMSD after a linear fit is $RMSD_{fit}=\sigma_x\sqrt{1-r^2}$, where $\sigma_x$ is the standard deviation of the experimental data, which in this case is 2.9. So knowing $r$ and $\sigma_x$ tells one immediately what the lowest possible RMSD value for a model is using a linear fit.
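The relationship between $r$, $\sigma_x$ and the post-fit RMSD can be checked numerically. The sketch below uses made-up numbers, not the orange data set, and assumes the fit is an ordinary least-squares fit of the experimental values against the computed ones, done here with `np.polyfit`.

```python
import numpy as np

# Hypothetical, made-up data for illustration only (not the orange data set).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])   # "experiment"
y = np.array([3.1, 1.9, 4.2, 6.5, 3.8, 7.9, 5.2, 9.4, 6.1, 8.0])    # "computed"

r = np.corrcoef(x, y)[0, 1]
sigma_x = np.std(x)                    # population standard deviation (ddof=0)

# Least-squares fit of experiment against computation: x ~ a*y + b
a, b = np.polyfit(y, x, 1)
rmsd_fit = np.sqrt(np.mean((a * y + b - x) ** 2))

predicted = sigma_x * np.sqrt(1 - r ** 2)
print(f"r = {r:.3f}, sigma_x = {sigma_x:.3f}")
print(f"RMSD after fit = {rmsd_fit:.3f}, sigma_x*sqrt(1-r^2) = {predicted:.3f}")
```

The two printed RMSD values should agree to rounding. The agreement relies on $\sigma_x$ being the population standard deviation (NumPy's default, `ddof=0`). With the sample standard deviation (`ddof=1`) the two numbers differ by a factor of $\sqrt{n/(n-1)}$.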

Also, you can think of $\sigma_x$ as the RMSD for the very simple model $y=\langle x \rangle$, i.e. the model simply returns the average value of the experimental data. This is the maximum RMSD value for a linear fit (where