The Lasko et al. paper you cited is great. It goes through alternative estimates for area under the curve and its connection to Wilcoxon/U stats. Unlike most of the presentations I see, Lasko et al. frame the problem properly as that of estimating the true ROC curve (which is not only unknown, but not directly observable).

Like a lot of these things (e.g. collapsed Gibbs for LDA, hybrid Monte Carlo, SGD for logistic regression), the proof and formal definitions are way more complex than the straightforward implementation.

The answer to the question as I framed it is that the convention is to use trapezoids over the uninterpolated sample ROC curve to estimate the area under the ROC curve. For precision-recall, the convention is to use the interpolated step functions, which is why the latter is equivalent to average precision at true positive points (as stated but not explained in the Manning et al. *IR* book). If you add a 1.0 recall and 0.0 precision point, it’ll get interpolated away. And the 0.0 recall and 1.0 precision point is not part of the curve because of the way interpolation is done.
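Here's a minimal sketch of the two conventions, not LingPipe's actual code. The function names and the toy data are made up for illustration, and I'm assuming the test set has at least one positive and one negative item. The ROC area connects the sample ROC points with straight lines (the trapezoid rule); the precision-recall summary is average precision taken at the true positive points.

```python
def roc_points(scores, labels):
    """Sample ROC points (1 - specificity, sensitivity), one per distinct score."""
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos  # assumes pos > 0 and neg > 0
    ranked = sorted(zip(scores, labels), key=lambda sl: -sl[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every item tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def trapezoid_area(points):
    """Area under the piecewise-linear curve through the given (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def average_precision(scores, labels):
    """Mean of the precision values at the rank of each true positive."""
    ranked = sorted(zip(scores, labels), key=lambda sl: -sl[0])
    tp = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            tp += 1
            total += tp / rank
    return total / tp  # assumes at least one positive item


# Hypothetical toy data: predicted scores and gold labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
roc_auc = trapezoid_area(roc_points(scores, labels))
avg_prec = average_precision(scores, labels)
```

On this toy data the ROC area is 8/9 (8 of the 9 positive/negative pairs are ranked correctly), and the average precision is (1/1 + 2/2 + 3/4) / 3.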

In the current LingPipe, my ROC calcs have sensitivity on the x axis for just this reason. I just rewrote the code to follow the conventions in the literature (x = 1-specificity, y = sensitivity/recall for ROC; x = sensitivity/recall, y = precision/positive-predictive-accuracy for precision-recall).

]]>http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U#Calculations

The first one is quadratic in the worst case; the second is Theta(n log n), because you need to sort the test data by predicted score and then do a linear pass over it.
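To make the complexity difference concrete, here's a sketch of both calculations, normalized so that each returns U / (n_pos * n_neg), which is the area-under-ROC estimate. The function names are my own, and the tie handling (half credit for a tied pair, average ranks for tied scores) is the usual convention rather than anything specific to a particular library.

```python
def auc_pairwise(scores, labels):
    """Quadratic method: compare every positive item with every negative item."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tied pair counts as half a win
    return wins / (len(pos) * len(neg))


def auc_rank_sum(scores, labels):
    """Sort-based method: one sort, then linear passes over the ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            j += 1
        # Items at sorted positions i..j-1 are tied; they share the
        # average of the 1-based ranks i+1 .. j.
        avg_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            ranks[order[k]] = avg_rank
        i = j
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)
```

Both functions assume at least one positive and one negative item. They should agree on any input, ties included, which makes the quadratic version a handy test oracle for the fast one.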

]]>I prefer to have recall on the y-axis so that I can have ROC and P/R plots side by side, ROC on the left. Then there’s a single y-axis, and you can read off precision, F-score, and specificity all at once for a given level of recall.

]]>I like the dotted lines, too. I’d just drawn the F-measure contour lines in a previous plot but hadn’t thought about putting them together.

As Dave says, it’s always recall on the x axis and precision on the y. Except for ROC curves, where it’s 1-specificity on the x axis and sensitivity (= recall) on the y axis.

]]>Martin – I like displaying the F1 contour lines. But is there a reason you put precision on the x-axis? It’s almost always on the y-axis in the IR and NLP literature.

]]>The red line is the P/R curve. The diagonal quickly shows you the point of equal precision and recall (about 0.77 in this contrived example). The black contour lines show F-score for equally weighted precision and recall: you can immediately find the operating point with the highest F-score (~0.85) and see its precision (~0.80) and recall (~0.92).

The script that generated this allows you to adjust the weighting of precision and recall, which changes the shape of the contour lines and the slope of the diagonal line.
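The script itself isn't shown here, so the following is only a sketch of one standard way to weight precision and recall, the F-beta measure, and of picking the operating point with the highest score from a list of (precision, recall) pairs. The function names and the three sample points are hypothetical; the points are just the values read off the plot described above plus one extra.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily, beta < 1 weights precision
    more heavily, and beta = 1 gives the balanced F1 score.
    """
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0.0:
        return 0.0  # degenerate point; define the score as 0
    return (1.0 + b2) * precision * recall / denom


def best_operating_point(pr_points, beta=1.0):
    """Return the (precision, recall) pair with the highest F-beta score."""
    return max(pr_points, key=lambda pr: f_beta(pr[0], pr[1], beta))


# Hypothetical operating points as (precision, recall) pairs.
pr_points = [(0.77, 0.77), (0.80, 0.92), (0.95, 0.50)]
best = best_operating_point(pr_points)
best_f1 = f_beta(best[0], best[1])
```

With equal weighting, precision 0.80 and recall 0.92 give an F-score of about 0.856, which matches the ~0.85 read off the contour lines.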

]]>Thanks! That was exactly what I was looking for on the edge cases. This is the kind of stuff that drives me crazy when I’m coding something, because most presentations in textbooks or papers don’t go over degenerate instances.

And the Lasko et al. paper already won me over by saying you can only estimate the true ROC curve. Spot on. This always bugs me when people talk about sensitivity as true positive rate — it’s the sample true positive rate, which is also the maximum likelihood estimate, but we just don’t know the real true positive rate.

]]>(Note: When dealing with precision, recall, etc., it helps if you define 0/0 to be 1. This makes a certain amount of sense: e.g. a classifier that never predicts a positive label when there are indeed no positive items should intuitively have 100% precision. It also saves you from having to exclude otherwise undefined corner cases, although letting 0/0 = 1 will introduce inconsistencies if not used sparingly.)

The two simple extreme points are these:

(A) You never predict the positive class, so your precision is trivially 1 (0/0, see note above) and your recall is necessarily 0 (except in the degenerate case where there are no actual positive items).

(B) You always predict the positive class, so your recall is necessarily 1 and your precision is the proportion of positive items in your test data, which can be any rational between 0 and 1.

There are two other extreme cases, but you can only achieve them with a perfect classifier:

(C) Precision is 1, recall is 1. This generally only happens when your classifier makes no mistakes (or trivially if your test set is empty).

(D) Precision is 0, recall is 0. This happens when the number of true positives is 0, there is at least one positive item (otherwise R=0/0=1), and your classifier emits at least one positive label (otherwise P=0/0=1). For this, you need to take a perfect classifier and invert its predictions.

The last corner only happens in a degenerate case:

(E) Precision is 0, recall is 1 (actually, recall is 0/0 per the Note above). This can only happen on a nonempty test set with no positive items. If you have such a test set, this situation happens whenever your classifier emits at least one positive label.

You can visualize this on a square D,A,C,E where B is somewhere along CE. In practice, if you only vary the decision threshold of a binary classifier, your P/R curve goes from A to B.
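The five cases above can be checked with a few lines of code. This is a sketch using the 0/0 = 1 convention from the note; the helper names and the counts (3 positive and 7 negative items) are made up for illustration.

```python
def ratio(num, den):
    """num / den, with 0/0 defined as 1 (the convention from the note)."""
    if den == 0:
        return 1.0  # with non-negative counts, den == 0 implies num == 0
    return num / den


def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive,
    and false negative counts."""
    return ratio(tp, tp + fp), ratio(tp, tp + fn)


# Hypothetical test set with 3 positive and 7 negative items.
case_a = precision_recall(tp=0, fp=0, fn=3)  # (A) never predict positive
case_b = precision_recall(tp=3, fp=7, fn=0)  # (B) always predict positive
case_c = precision_recall(tp=3, fp=0, fn=0)  # (C) perfect classifier
case_d = precision_recall(tp=0, fp=7, fn=3)  # (D) inverted perfect classifier
# (E) test set with no positive items; classifier emits 2 positive labels.
case_e = precision_recall(tp=0, fp=2, fn=0)
```

Here (A) comes out as precision 1, recall 0; (B) as precision 0.3, recall 1; (C) as 1, 1; (D) as 0, 0; and (E) as precision 0, recall 1.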

]]>T.A. Lasko et al., “The use of receiver operating characteristic curves in biomedical informatics”, Journal of Biomedical Informatics 38 (2005), 404–415.

]]>