Here is a simple illustration of what the root-mean-square deviation (RMSD), mean unsigned error (MUE), and correlation coefficient ($r$) can tell you about your data. Imagine that the $x$-axis is experimental data and the $y$-axis is computed data in some arbitrary units. (You can access the data here.)

The blue dots represent a perfect correlation $(y=x)$ for which $r$ = 1 and RMSD = MUE = 0.

The red dots represent the function $y=2x-5$. The RMSD = 2.9 and MUE = 2.5, and both would seem to indicate a pretty crappy model. However, $r$ = 1.0, indicating that there is a systematic error that can be *fixed completely* by a linear fit. In this case, the MUE is an indicator of part of this systematic error that can be fixed by an offset $[x=\frac{1}{2}y+2.5]$.

The orange dots do not represent a linear function and clearly represent a worse model than the red dots. However, the RMSD = 2.6 and MUE = 2.1 are both slightly better than for the red model. But $r$ = 0.7, indicating that only part of the discrepancy can be fixed by a linear fit.
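The red-model numbers can be reproduced with a short sketch. The post's data file is not included here, so the values `x = 1..10` below are an assumption chosen for illustration; with them, the RMSD and MUE round to the 2.9 and 2.5 quoted for the red dots. The helper names `rmsd`, `mue`, and `pearson_r` are likewise hypothetical.

```python
import math

def rmsd(x, y):
    """Root-mean-square deviation between paired values."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(x, y)) / len(x))

def mue(x, y):
    """Mean unsigned (absolute) error between paired values."""
    return sum(abs(b - a) for a, b in zip(x, y)) / len(x)

def pearson_r(x, y):
    """Pearson correlation coefficient of paired values."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Assumed "experimental" data (hypothetical, not the post's data file)
x = [float(i) for i in range(1, 11)]

# The red model: a purely systematic error, y = 2x - 5
y_red = [2 * xi - 5 for xi in x]
print(rmsd(x, y_red), mue(x, y_red), pearson_r(x, y_red))

# Undo the systematic error with the inverse line x = y/2 + 2.5
y_corrected = [0.5 * yi + 2.5 for yi in y_red]
print(rmsd(x, y_corrected))
```

Because the red error is entirely systematic, the corrected values coincide with `x` and the RMSD drops to zero, which is what $r$ = 1 promises.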

Indeed, a linear fit to the orange data $(x=1.2y-2.75)$ can only reduce the RMSD and MUE to 2.1 and 1.8, respectively.

The relationship between $r$ and the RMSD after a linear fit is $$RMSD_{fit}=\sigma_x\sqrt{1-r^2}$$ where $\sigma_x$ is the standard deviation of the experimental data, which in this case is 2.9. So knowing $r$ and $\sigma_x$ immediately tells you the lowest RMSD a model can reach after a linear fit.
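The identity can be checked numerically. The sketch below uses made-up noisy data (an assumption, not the orange data from the post), fits $x$ as a linear function of $y$ by least squares, and compares the RMSD after the fit with $\sigma_x\sqrt{1-r^2}$. Note that $\sigma_x$ is taken as the population standard deviation (dividing by $n$), which is what makes the two numbers agree.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: "experimental" x and a noisy, biased "computed" y
x = [float(i) for i in range(1, 11)]
y = [2 * xi - 5 + random.gauss(0, 3) for xi in x]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
var_x = sum((a - mean_x) ** 2 for a in x) / n  # population variance of x
var_y = sum((b - mean_y) ** 2 for b in y) / n  # population variance of y
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

r = cov_xy / math.sqrt(var_x * var_y)  # Pearson correlation coefficient
sigma_x = math.sqrt(var_x)

# Least-squares fit of x on y: x ~ slope * y + intercept
slope = cov_xy / var_y
intercept = mean_x - slope * mean_y
x_fit = [slope * yi + intercept for yi in y]

# RMSD between the fitted values and the experimental data
rmsd_fit = math.sqrt(sum((xf - xi) ** 2 for xf, xi in zip(x_fit, x)) / n)

print(rmsd_fit, sigma_x * math.sqrt(1 - r ** 2))
```

The two printed values agree to floating-point precision for any data set, since the residual variance of a least-squares line is $\sigma_x^2(1-r^2)$.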

Also, you can think of $\sigma_x$ as the RMSD for the very simple model $y=\langle x \rangle$, i.e. the model simply returns the average value of the experimental data. This is the maximum RMSD value for a linear fit (where $r$ = 0).
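As a quick check of this baseline, the sketch below compares the RMSD of the constant model $y=\langle x \rangle$ with the population standard deviation of $x$. The data `x = 1..10` is again an assumption made for illustration, not the post's data file.

```python
import math
import statistics

# Assumed "experimental" data (hypothetical)
x = [float(i) for i in range(1, 11)]

# The simplest possible model: always predict the mean of the experiment
mean_x = statistics.mean(x)
baseline = [mean_x] * len(x)

# RMSD of the constant model against the experimental data
rmsd_baseline = math.sqrt(
    sum((b - a) ** 2 for a, b in zip(x, baseline)) / len(x)
)

# Population standard deviation (divides by n, not n - 1)
sigma_x = statistics.pstdev(x)

print(rmsd_baseline, sigma_x)
```

Both values are the same by definition, so any model whose fitted RMSD is close to $\sigma_x$ does little better than predicting the average.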

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

## 5 comments:

The correlation coefficient also has a geometric interpretation:

http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Geometric_interpretation

Another curiosity regarding R is Anscombe's quartet:

http://en.wikipedia.org/wiki/Anscombe%27s_quartet

I do recall having read somewhere that in linear regressions R-squared can be interpreted as the fraction of the total variance that is "explained" by the regression model. In your data, R-squared is 0.47 and R = 0.69 (as computed from your original data; I believe you mistakenly wrote R2 instead of R). That means that the regression explains 47% of the variance, i.e. 69% of the original RMSD (which is the square root of the variance).

Yes, I should have written $r$ instead of $R^2$. This led me to look more carefully at everything and I discovered I made a mistake when computing the RMSD and MUE.

I had a look at $r^2$. For linear models this is also the coefficient of determination (http://en.wikipedia.org/wiki/Coefficient_of_determination), which is a measure of how much of the total variation in $y$ is described by the linear fit. Long story short: $$r=\sqrt{1-\frac{RMSD_{fit}^2}{\sigma^2_x}}$$ So not exactly easy to interpret in terms of RMSD and no direct relation to the RMSD of the data before the fit.

... but that doesn't matter. Post updated
