## Tuesday, March 18, 2014

### ROC curves and picking cutoffs

We just got the second round of reviews for Luca De Vico's latest PLoS ONE paper.  In the paper we try to predict HIV protease mutants that will cleave a particular peptide sequence, and we use peptide-protein interaction energies as a measure of cleavability.  How well does this work?  The reviewer suggested ROC curves to quantify this.  Here's how it works.

We have 11 naturally occurring peptides that we know are cleavable (there are also some non-natural peptides that I'll ignore in this post) and 42 that we know are non-cleavable. Here are the computed interaction energies (in kcal/mol) for all the cleavable peptides and for the non-cleavable peptides with interaction energies < -40 kcal/mol.

| Cleavable (11) | Non-cleavable (42) |
|---|---|
| -72 | |
| -68 | -68 |
| -68 | -63 |
| -64 | -54 |
| -63 | -49 |
| -62 | -45 |
| -62 | -45 |
| -57 | -44 |
| -52 | -42 |
| -47 | |
| -41 | |

If we say that peptides with interaction energies < -40 kcal/mol are cleavable then we will have correctly predicted that all 11 cleavable peptides are cleavable, but we will also have wrongly predicted that 8 non-cleavable peptides are cleavable.  Put another way, our "true positive" rate is 100% (11/11) and our "false positive" rate is 19% (8/42).
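Here is a minimal Python sketch of that bookkeeping. The variable and function names are mine, and the assignment of energies to the two classes is my reading of the table (it is the one that reproduces the rates quoted in this post). Only the 8 non-cleavable peptides below -40 kcal/mol are listed, so the function is only meaningful for cutoffs of -40 kcal/mol or lower.

```python
# Interaction energies in kcal/mol (class assignment is an assumption, see above).
cleavable = [-72, -68, -68, -64, -63, -62, -62, -57, -52, -47, -41]
# Only the non-cleavable peptides with energies < -40 kcal/mol are listed.
non_cleavable_listed = [-68, -63, -54, -49, -45, -45, -44, -42]
N_NON_CLEAVABLE = 42  # total number of non-cleavable peptides


def rates(cutoff):
    """Return (true positive rate, false positive rate) for cutoff <= -40.

    A peptide is predicted to be cleavable if its energy is below the cutoff.
    """
    true_pos = sum(e < cutoff for e in cleavable)
    false_pos = sum(e < cutoff for e in non_cleavable_listed)
    return true_pos / len(cleavable), false_pos / N_NON_CLEAVABLE


tpr, fpr = rates(-40)
print(f"cutoff -40: TPR {tpr:.0%}, FPR {fpr:.0%}")  # cutoff -40: TPR 100%, FPR 19%
```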

If we pick -45 kcal/mol as the cutoff the true and false positive rates are 91% (10/11) and 10% (4/42): we have fewer false positives but we miss a true positive. Plotting the true positive rate against the false positive rate for a range of cutoffs gives the ROC curve.

In a perfect world our true positive rate would be 100% and our false positive rate would be 0%, so we are looking for the point on the curve closest to (0, 1), i.e. the top-left corner of the plot. That point happens to be the one for the -45 kcal/mol cutoff.
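That search can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the names are mine, the class assignment of the energies is my reading of the table, and I only try cutoffs at the listed energies plus -40 kcal/mol. Cutoffs above -40 kcal/mol can't be evaluated from the listed data, but they would only add false positives without adding true positives, so they can't lie closer to the corner.

```python
import math

# Interaction energies in kcal/mol (class assignment is an assumption, see above).
cleavable = [-72, -68, -68, -64, -63, -62, -62, -57, -52, -47, -41]
non_cleavable_listed = [-68, -63, -54, -49, -45, -45, -44, -42]
N_NON_CLEAVABLE = 42  # total number of non-cleavable peptides


def roc_point(cutoff):
    """Return (false positive rate, true positive rate) for one cutoff."""
    true_pos = sum(e < cutoff for e in cleavable)
    false_pos = sum(e < cutoff for e in non_cleavable_listed)
    return false_pos / N_NON_CLEAVABLE, true_pos / len(cleavable)


def distance_to_corner(cutoff):
    """Euclidean distance from the ROC point to the ideal point (0, 1)."""
    fpr, tpr = roc_point(cutoff)
    return math.hypot(fpr, 1.0 - tpr)


# Candidate cutoffs: every distinct listed energy, plus -40 kcal/mol.
cutoffs = sorted(set(cleavable + non_cleavable_listed + [-40]))
best_cutoff = min(cutoffs, key=distance_to_corner)
print(best_cutoff)  # -45
```

Distance to the corner weights missed positives and false alarms equally; if one kind of error is more costly than the other, a different criterion would be more appropriate.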

We can also quantify how good this approach is in general by finding the area under the curve, which ranges from 1 (perfect) down to 0.5 (no better than random guessing). This lets us, for example, compare two different methods for calculating the interaction energies.
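One way to get the area without drawing the curve is to use the fact that it equals the probability that a randomly chosen cleavable peptide has a lower energy than a randomly chosen non-cleavable one, with ties counted as half. The sketch below does that pair counting. It is not necessarily how the number in the paper was computed: the names are mine, the class assignment of the energies is my reading of the table, and I assume the 34 unlisted non-cleavable peptides all have energies of -40 kcal/mol or higher, i.e. above every cleavable peptide.

```python
# Interaction energies in kcal/mol (class assignment is an assumption, see above).
cleavable = [-72, -68, -68, -64, -63, -62, -62, -57, -52, -47, -41]
non_cleavable_listed = [-68, -63, -54, -49, -45, -45, -44, -42]
N_NON_CLEAVABLE = 42
n_unlisted = N_NON_CLEAVABLE - len(non_cleavable_listed)  # energies >= -40

# Count (cleavable, non-cleavable) pairs that are ranked correctly.
correct_pairs = 0.0
for c in cleavable:
    for n in non_cleavable_listed:
        if c < n:
            correct_pairs += 1.0   # cleavable peptide binds more strongly
        elif c == n:
            correct_pairs += 0.5   # a tie counts as half

# Every cleavable peptide lies below every unlisted non-cleavable peptide.
correct_pairs += len(cleavable) * n_unlisted

auc = correct_pairs / (len(cleavable) * N_NON_CLEAVABLE)
print(f"AUC = {auc:.2f}")
```

Running the same calculation on energies from a second method would give a second area to compare against.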