Metrics

Sometimes it’s hard to tell the difference between precision, accuracy, recall, and so on, especially for newbies like me.

But let’s try to distinguish them with stories. In this article, you will see some commonly used metrics, including:

  • Metrics for binary classification: accuracy, precision, recall, f1-score, and so on

Binary classification

The following metrics for binary classification will be introduced:

  1. accuracy (精度)
  2. error (rate) (错误率)
  3. precision (准确率,查准率)
  4. recall (召回率,查全率)
  5. f1-score
  6. f-beta score
  7. AUC

(The Chinese names follow 《机器学习》 by 周志华)1

First, let’s learn some terminology from a table.

|                | Predicted Positive | Predicted Negative |
|----------------|--------------------|--------------------|
| Truly Positive | True Positive      | False Negative     |
| Truly Negative | False Positive     | True Negative      |

The table above is called a 2x2 contingency table or confusion matrix (混淆矩阵).2

  • True Positive (TP): We predict it is positive, and it’s right.
  • True Negative (TN): We predict it is negative, and it’s right.
  • False Positive (FP): We predict it is positive, but it’s wrong. In fact, it’s negative.
  • False Negative (FN): We predict it is negative, but it’s wrong. In fact, it’s positive.
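
To make these four counts concrete, here is a minimal Python sketch; the labels and the names `y_true`/`y_pred` are made up purely for illustration:

```python
# Toy example: count TP, TN, FP, FN from true and predicted labels (1 = positive, 0 = negative).
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # predicted positive, truly positive
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # predicted negative, truly negative
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # predicted positive, truly negative
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # predicted negative, truly positive

print(tp, tn, fp, fn)  # 3 2 2 1
```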

Accuracy

Accuracy is the fraction of all your predictions that are correct.

The total number of your predictions is:

$$TP+TN+FP+FN$$

The number of your correct predictions is:

$$TP+TN$$

So the accuracy is:

$$ accuracy = \frac{TP+TN}{TP+TN+FP+FN} $$

Error (rate)

Error (rate) is the fraction of your predictions that are wrong.

So from the formula for accuracy, we can derive the error rate:

$$ error = 1 - accuracy = \frac{FP+FN}{TP+TN+FP+FN} $$
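
With the toy counts from the sketch above, accuracy and error rate follow directly:

```python
# Toy counts from the sketch above: tp = 3, tn = 2, fp = 2, fn = 1.
tp, tn, fp, fn = 3, 2, 2, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 5 / 8 = 0.625
error = 1 - accuracy                        # 3 / 8 = 0.375
```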

Precision

Precision is the fraction of your positive predictions that are correct.

Try to understand it with this example from Zhou’s Machine Learning:

Suppose you have trained a model to help you choose good watermelons.
You want to know the ratio of truly good ones among all the watermelons chosen by your model.
That is precision.

So the number of positive predictions is:

$$TP+FP$$

The number of correct positive predictions is:

$$TP$$

We can get the precision with:

$$ precision = \frac{TP}{TP+FP} $$
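
On the same toy counts, precision only looks at the cases we predicted as positive:

```python
# Same toy counts: tp = 3, fp = 2.
tp, fp = 3, 2
precision = tp / (tp + fp)  # 3 / 5 = 0.6
```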

Recall

Recall is the fraction of the truly positive cases that you predict correctly. The focus is not on your predictions but on the actual cases.

This time we try to understand it with an example I saw on Quora.

Suppose you have a girlfriend who has given you a birthday surprise every year for the last 10 years.
One day, she asks you: sweetie, do you remember all the birthday surprises from me? You need to recall all 10 surprises or you are in danger.

In this situation, all you need is to remember all 10 surprises, no matter how many guesses it takes. You may guess many times before you get the right one for a given year, so you can’t get 100% precision.
But that’s OK; as long as you luckily remember them all in the end, you get a recall of 100%.

So the number of truly positive cases is:

$$TP+FN$$

The number of correct ones among them is:

$$TP$$

The recall is:

$$ recall = \frac{TP}{TP+FN} $$
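
And on the same toy counts, recall only looks at the cases that are truly positive:

```python
# Same toy counts: tp = 3, fn = 1.
tp, fn = 3, 1
recall = tp / (tp + fn)  # 3 / 4 = 0.75
```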

\(F_1\) score

Usually, we can’t get high precision and high recall simultaneously.

The girlfriend example again.
Assume you can’t remember all the surprises clearly and have to guess to get the correct answers. Two strategies come to mind immediately:

  1. Guess as much as you can so that every surprise is covered.
  2. Only mention the surprises you clearly remember and don’t say anything more.

But I think both will put you in danger.
The first gets a high recall and the second gets a high precision. In fact, you should find a middle point.

F1 score is the harmonic mean of precision and recall:

$$ F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall} $$
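
As a quick sketch, the harmonic mean can be computed by hand from the toy precision and recall above, and (assuming scikit-learn is installed) `sklearn.metrics.f1_score` gives the same value:

```python
# F1 as the harmonic mean of the toy precision and recall from above.
precision, recall = 0.6, 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.9 / 1.35 ≈ 0.667

# Cross-check with scikit-learn on the toy labels from the first sketch.
from sklearn.metrics import f1_score
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 1]
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-9
```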

\(F_\beta\) score

Sometimes we think precision is more important than recall, or that recall is more important. The \(F_1\) score doesn’t satisfy us then, so we need the more general \(F_\beta\) score.

This can be understood with two examples.

  1. Consider spam detection. It’s unbearable for us when important mails are flagged as spam, so we want every mail we flag as spam to really be spam. Getting a high precision is more important.
  2. Consider catching criminals. Police officers don’t want to let any suspect slip away. Getting a high recall is more important.

The definition of the \(F_\beta\) score is:

$$ F_n = (1+\beta^2) \cdot \frac{precision \cdot recall}{\beta^2 \cdot precision + recall} $$

Two commonly used F measures are the \(F_2\) measure, which weights recall higher than precision (more emphasis on \(\displaystyle\frac1{recall}\)), and the \( F_{0.5} \) measure, which weights precision higher than recall (more emphasis on \(\displaystyle\frac1{precision}\)).3
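
Assuming scikit-learn is installed, its `fbeta_score` takes \(\beta\) directly, so the two common cases look like this on the toy labels from before:

```python
# F_beta with scikit-learn on the same toy labels.
from sklearn.metrics import fbeta_score

y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 1]
f2 = fbeta_score(y_true, y_pred, beta=2)     # beta = 2: recall weighted higher
f05 = fbeta_score(y_true, y_pred, beta=0.5)  # beta = 0.5: precision weighted higher
print(f2, f05)
```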

AUC

AUC is the area under the ROC (Receiver Operating Characteristic) curve, so let’s discuss the ROC curve first.

ROC

ROC stands for Receiver Operating Characteristic. The ROC curve was first developed during World War II for radar signal detection.

Imagine a radar used to detect enemy aircraft, and you are the radar operator. You need to make a report every time you see a signal on the radar.

  • If you see a signal on the radar and you say it’s an aircraft, and it really is an aircraft, we call this a hit or TP (True Positive).
  • If you see a signal on the radar and you say it’s an aircraft, but actually it’s just a bird, we call this a false alarm or FP (False Positive).
  • If you see a signal on the radar and you say it’s just a bird, but actually it’s an enemy aircraft, we call this a miss or FN (False Negative).
  • If you see a signal on the radar and you say it’s just a bird, and it really is a bird, we call this a TN (True Negative).

In brief, the ROC curve is used to assess and compare the separating ability of models in binary classification.

The x-axis of the curve is the FPR (False Positive Rate), also called the false alarm rate.

$$ FPR = \frac{FP}{Real \ Negative} = \frac{FP}{FP + TN} $$

And the y-axis is the TPR (True Positive Rate), also called the hit rate.

$$ TPR = \frac{TP}{Real \ Positive} = \frac{TP}{TP + FN} $$

For a specific model, how do we plot this ROC curve?

We should be clear about how we decide the class of a case. A binary classification model usually generates a number between 0 and 1 for every case. If this number is larger than a number you preset, you say it’s a positive case; otherwise it’s a negative one. This preset number is called the threshold.

To generate an ROC curve, we vary the threshold from 0 to 1. Each threshold gives a false alarm rate and a hit rate; we plot these points in one figure and connect them with a curve, and that curve is the ROC curve.
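
Here is a minimal sketch of that procedure; the `labels`, `scores`, and the threshold grid below are all made up for illustration:

```python
# Sweep the threshold and collect (FPR, TPR) points for an ROC curve.
labels = [0, 0, 1, 1, 0, 1, 1, 0]                   # true classes
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]  # model outputs in [0, 1]

def roc_points(labels, scores, thresholds):
    pos = sum(labels)        # number of real positive cases
    neg = len(labels) - pos  # number of real negative cases
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        points.append((fp / neg, tp / pos))  # (false alarm rate, hit rate)
    return points

print(roc_points(labels, scores, [0.0, 0.25, 0.5, 0.75, 1.0]))
# a threshold of 0.0 gives (1.0, 1.0); a threshold of 1.0 gives (0.0, 0.0)
```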

Demo of ROC

Below is a demonstration from Navan’s website. You can also see the tutorial on YouTube.

In this demonstration, the blue area represents the real negative cases and the red area represents the real positive cases. The x-axis is the predicted value, ranging from 0 to 1, and the y-axis is the number of cases with that predicted value.

The cases on the right side of the black vertical line will be predicted as positive.

Let’s analyze two extreme situations.

For the first one, let’s set the threshold to 0 (just drag the black vertical line to the left). Then all cases will be predicted as positive. All real negative cases will be falsely predicted as positive (False Positive), so the FPR will be 1. And all real positive cases will be predicted correctly, so the TPR will be 1 too.

For the other one, let’s set the threshold to 1 (this time drag the black vertical line to the right). Then all cases will be predicted as negative. All real negative cases are predicted correctly, so there are no false positives and the false alarm rate will be 0. And all positive cases are predicted wrongly, so the hit rate is 0 too.

From these two situations, two points always exist on the curve: \((0, 0)\) and \((1, 1)\).

Area under the curve

The AUC is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.4

In scikit-learn, AUC is implemented by taking the unique predicted values as thresholds and then integrating the resulting (FPR, TPR) points with the trapezoidal rule.
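
For example, with scikit-learn (assuming it is installed), `roc_curve` picks thresholds from the scores and `auc` integrates the resulting points with the trapezoidal rule; `roc_auc_score` does both in one call:

```python
# AUC with scikit-learn on the toy labels and scores from the ROC sketch.
from sklearn.metrics import roc_curve, auc, roc_auc_score

labels = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]

fpr, tpr, thresholds = roc_curve(labels, scores)
print(auc(fpr, tpr))                  # trapezoidal integration of the (FPR, TPR) points
print(roc_auc_score(labels, scores))  # the same area, computed in one call
```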

Type I and Type II error

Before we talk about Type I and Type II errors, we should explain what a null hypothesis is.

In statistical hypothesis testing, the null hypothesis (零假设) and alternate hypothesis (备择假设) are two rival hypotheses.

  • The null hypothesis is usually symbolized as \(H_0\).
  • The alternative hypothesis stands against the null hypothesis and is often symbolized as \(H_1\).

The purpose of hypothesis testing is to establish whether the experimental evidence supports rejecting the null hypothesis; if it does, we accept the alternative hypothesis.5

For example, suppose we wish to establish whether the hypothesis \(H_0\) that a coin is fair is true. To do so, we toss the coin 100 times and observe heads \(k\) times. (A quick numerical check is sketched after the two cases below.)

  • If \(k = 15\), we reject \(H_0\), that is, we decide on the basis of the evidence that the fair-coin hypothesis should be rejected.
  2. If \(k = 49\), we accept \(H_0\); that is, we decide that the evidence does not support the rejection of the fair-coin hypothesis. The evidence alone, however, does not lead to the conclusion that the coin is fair.
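
As a rough numerical check of those two cases (assuming SciPy ≥ 1.7 for `scipy.stats.binomtest`), a two-sided binomial test gives a tiny p-value for \(k = 15\) and a large one for \(k = 49\):

```python
# Two-sided binomial test of the fair-coin hypothesis H0: p = 0.5.
from scipy.stats import binomtest

print(binomtest(15, n=100, p=0.5).pvalue)  # tiny p-value: strong evidence against H0
print(binomtest(49, n=100, p=0.5).pvalue)  # ~0.92: no evidence against H0
```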

Why called “null”?

There are two explanations. One is that the word “null” in this context means a commonly accepted fact that researchers work to nullify.6 7
The other is that the null statement is expressed as no (significant) relationship between two variables or no (significant) difference between two groups.8

A Type I error is rejecting \(H_0\) even though it is true. A Type I error is also known as a “false positive”, because \(H_0\) is usually a kind of “negative” statement (no relationship).

The probability of such an error is denoted by \(\alpha\) and is also called the significance level of the test.

A Type II error is accepting \(H_0\) even though it is false. A Type II error is also known as a “false negative”.


  1. 周志华, 机器学习 [return]
  2. Wikipedia - Precision and recall [return]
  3. Wikipedia - F1 score [return]
  4. Wikipedia - Receiver operating characteristic [return]
  5. Athanasios Papoulis, Probability, Random Variables and Stochastic Processes, pp. 354-355 [return]
  6. Null Hypothesis Definition and Examples, How to State [return]
  7. Why is Null hypothesis called null? [return]
  8. Why “Null hypothesis” is called “Null hypothesis”? [return]