Proteins and Wave Functions: RMSD and MUE versus Correlation Coefficient: a simple illustration of the difference

Thursday, March 7, 2013


Here is a simple illustration of what the root-mean-square deviation (RMSD), mean unsigned error (MUE) and correlation coefficient ($r$) can tell you about your data. Imagine that the $x$-axis is experimental data and the $y$-axis is computed data in some arbitrary units (you can access the data here).

The blue dots represent a perfect correlation $(y=x)$ for which $r$ = 1 and RMSD = MUE = 0.

The red dots represent the function $y=2x-5$. The RMSD = 2.9 and MUE = 2.5 would both seem to indicate a pretty crappy model. However, $r$ = 1.0, indicating that the error is systematic and can be fixed completely by a linear fit $[x=\frac{1}{2}y+2.5]$. In this case, the MUE is an indicator of the part of this systematic error that can be fixed by an offset.
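As a rough check of the red-dot numbers, here is a minimal Python sketch. The original data set is not reproduced here, so the $x$ values 1 to 10 are an assumption (they happen to give the RMSD and MUE quoted above), and the helper names `rmsd`, `mue` and `pearson_r` are made up for this illustration.

```python
import numpy as np

def rmsd(x, y):
    """Root-mean-square deviation of computed values y from reference values x."""
    return float(np.sqrt(np.mean((y - x) ** 2)))

def mue(x, y):
    """Mean unsigned (absolute) error of y with respect to x."""
    return float(np.mean(np.abs(y - x)))

def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    return float(np.corrcoef(x, y)[0, 1])

# Assumption: "experimental" values 1..10; the post's actual data may differ.
x = np.arange(1, 11, dtype=float)

# The red model from the text.
y_red = 2 * x - 5

print("RMSD:", round(rmsd(x, y_red), 1))
print("MUE: ", round(mue(x, y_red), 1))
print("r:   ", round(pearson_r(x, y_red), 1))

# Inverting the model, x = y/2 + 2.5, removes the systematic error.
y_fixed = 0.5 * y_red + 2.5
print("RMSD after linear correction:", rmsd(x, y_fixed))
```

The RMSD and MUE look poor, yet $r$ is 1 and the linear correction brings the RMSD to zero, which is the point of the red example.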

The orange dots do not represent a linear function and clearly represent a worse model than the red dots. However, the RMSD = 2.6 and MUE = 2.1 are both slightly better than for the red model. But $r$ = 0.7, indicating that only part of the discrepancy can be fixed by a linear fit.

Indeed, a linear fit to the orange data $(x=1.2y-2.75)$ can only reduce the RMSD and MUE to 2.1 and 1.8, respectively.

The relationship between $r$ and the RMSD after a linear fit is $$\mathrm{RMSD}_{\mathrm{fit}}=\sigma_x\sqrt{1-r^2}$$ where $\sigma_x$ is the standard deviation of the experimental data, which in this case is 2.9. So knowing $r$ and $\sigma_x$ tells one immediately the lowest RMSD that a linear fit can achieve for a model.
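The relation can be checked numerically. The sketch below uses made-up noisy data (not the orange data from the post), fits $x$ as a linear function of $y$ by least squares, as in the fits quoted above, and compares the resulting RMSD with $\sigma_x\sqrt{1-r^2}$. Note that $\sigma_x$ here is the population standard deviation, which is NumPy's default (`ddof=0`).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: hypothetical data, not the data set from the post.
x = np.arange(1, 11, dtype=float)             # "experimental" values
y = x + rng.normal(scale=2.0, size=x.size)    # noisy "computed" values

# Pearson correlation coefficient.
r = np.corrcoef(x, y)[0, 1]

# Least-squares fit of x as a linear function of y: x ~ slope * y + intercept.
slope, intercept = np.polyfit(y, x, 1)
x_fit = slope * y + intercept

# RMSD of the fitted values against the experimental data.
rmsd_fit = np.sqrt(np.mean((x_fit - x) ** 2))

# Prediction from the formula, using the population standard deviation of x.
predicted = np.std(x) * np.sqrt(1 - r ** 2)

print("RMSD after fit:      ", rmsd_fit)
print("sigma_x*sqrt(1-r^2): ", predicted)
```

The two printed numbers should agree to within floating-point error, whatever noise is drawn, because the identity holds exactly for a least-squares line.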

Also, you can think of $\sigma_x$ as the RMSD for the very simple model $y=\langle x \rangle$, i.e. the model simply returns the average value of the experimental data. This is the maximum RMSD value for a linear fit (where $r$ = 0).
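A short sketch of this last point, again assuming $x$ values of 1 to 10 rather than the post's actual data: a model that always returns $\langle x \rangle$ has an RMSD equal to the (population) standard deviation of $x$.

```python
import numpy as np

# Assumption: "experimental" values 1..10; the post's actual data may differ.
x = np.arange(1, 11, dtype=float)

# The constant model: every prediction is the mean of the experimental data.
y_const = np.full_like(x, x.mean())

rmsd_const = np.sqrt(np.mean((y_const - x) ** 2))
sigma_x = np.std(x)  # population standard deviation (ddof=0)

print("RMSD of constant model:", round(float(rmsd_const), 1))
print("sigma_x:               ", round(float(sigma_x), 1))
```

For these assumed values both numbers round to 2.9, consistent with the $\sigma_x$ quoted above.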