[R Language Data Science]: Common Evaluation Metrics for Machine Learning
2022-07-01 14:07:00, by JOJO's Data Analysis Adventure
- Personal homepage: JoJo's Data Analysis Adventure
- About me: I am a senior majoring in statistics, admitted for postgraduate study at a top-3 statistics program.
- If this article helps you, please follow, like, bookmark, and subscribe!
- This article is part of the column [R Language Data Science], which covers applications of R in data science: R programming fundamentals, visualization, data manipulation, modeling, machine learning algorithms, and statistical theory and methods. The series is ongoing and updated roughly weekly; you are welcome to subscribe and learn together!
Preface
Before introducing how to build machine learning algorithms, let's first discuss how to evaluate a model. We will focus on metrics for evaluating machine learning algorithms; specifically, we need to quantify what makes one model "better" than another. Throughout this article we will use the caret package:
library(tidyverse)
library(caret)
We will use the heights dataset to illustrate:
library(dslabs)
data(heights)
X <- heights$height
Y <- heights$sex
Here we have a single explanatory variable, height, and one response variable, sex. The variable sex is binary, taking the values Female and Male. What we want to study is how well height predicts sex.
1. Training set and test set
Machine learning algorithms are ultimately evaluated on new data. However, when we train a model we only have data with known outcomes. Therefore, to evaluate a model we typically split the data into two parts: one for training and one for testing. The standard way to generate training and test sets is to split the data at random.
The caret package includes the function createDataPartition, which generates indexes for randomly splitting the data into training and test sets, as follows:
set.seed(12345) # set the random seed for reproducibility
test_index <- createDataPartition(Y,times=1,p=0.3,list = FALSE)
Here, times indicates how many random partitions to generate and p is the proportion of the data assigned to the test set. With the test-set index we can split the dataset:
# Training set
train_set <- heights[-test_index,]
# Test set
test_set <- heights[test_index,]
Now that we have a training set and a test set, the general approach is to train the model using only the training data, use the trained model to predict on the test data, and then compare the predictions with the actual values. For a binary outcome, one of the most commonly used metrics is overall accuracy.
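As an aside, createDataPartition does stratified sampling, so the class proportions of Y are preserved in both parts. A minimal Python sketch of that idea (the labels below are made up for illustration; this is not the caret implementation):

```python
import random

def create_data_partition(y, p=0.3, seed=12345):
    """Return test-set indexes, sampling a proportion p within each class
    so that the class balance of y is preserved (stratified sampling)."""
    random.seed(seed)
    test_index = []
    for cls in set(y):
        idx = [i for i, label in enumerate(y) if label == cls]
        n_test = round(len(idx) * p)
        test_index.extend(random.sample(idx, n_test))
    return sorted(test_index)

# toy labels: 8 Male, 4 Female
y = ["Male"] * 8 + ["Female"] * 4
test_idx = create_data_partition(y, p=0.25)
train_idx = [i for i in range(len(y)) if i not in test_idx]
```

Sampling within each class, rather than from the pooled data, guarantees that a rare class is not accidentally left out of the test set.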
2. Overall accuracy
Overall accuracy is simply defined as the overall proportion of predictions that are correct:

$$\text{accuracy} = \frac{\#\,\text{correct predictions}}{\#\,\text{total predictions}}$$
To demonstrate overall accuracy, we will build two different "algorithms" and compare them. Here we do not fit any actual model; instead we form predictions in two ways:
- 1. Randomly guessing "Male" or "Female"
- 2. Predicting from X (height)
y_hat <- sample(c("Male", "Female"), length(test_index), replace = TRUE)
Note that we are completely ignoring the predictor and simply guessing the sex at random. In machine learning applications, it is useful to represent categorical outcomes as factors, because R functions developed for machine learning (such as those in the caret package) require or recommend that categorical outcomes be coded as factors. So we use the factor function to convert y_hat to a factor; this has no other effect:
y_hat <- sample(c("Male", "Female"), length(test_index), replace = TRUE) %>%
factor(levels = levels(test_set$sex))
mean(y_hat == test_set$sex)
0.484177215189873
As expected, the accuracy of random guessing is about 50% (here 0.48), since we are guessing completely at random. Can we do better? From prior knowledge we know that, on average, men are slightly taller than women. So let's first compute the mean and standard deviation of height for each sex:
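As a quick sanity check (in Python, with simulated labels rather than the heights data), uniform random guessing converges to 50% accuracy regardless of how imbalanced the truth is:

```python
import random

random.seed(1)
# imbalanced truth: ~77% "Male", as in the heights test set
truth = ["Male" if random.random() < 0.77 else "Female" for _ in range(100_000)]
# uniform random guesses ignore the class balance entirely
guess = [random.choice(["Male", "Female"]) for _ in truth]
accuracy = sum(g == t for g, t in zip(guess, truth)) / len(truth)
# accuracy is close to 0.5
```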
heights %>% group_by(sex) %>% summarize(mean(height), sd(height))
| sex | mean(height) | sd(height) |
| --- | --- | --- |
| Female | 64.93942 | 3.760656 |
| Male | 69.31475 | 3.611024 |
How do we use these summaries? Let's try a simple rule: predict Male whenever the height is within two standard deviations of the male average, i.e. taller than about 62 inches (69.3 − 2 × 3.6 ≈ 62):
y_hat <- ifelse(test_set$height > 62, "Male", "Female") %>%
factor(levels = levels(test_set$sex))
mean(y_hat == test_set$sex )
0.810126582278481
The accuracy rises to about 81%, because we used real information: anyone taller than 62 inches is predicted to be male.
3. Confusion matrix
Our rule from the previous section predicts Male for anyone taller than 62 inches. Given that the average woman is about 65 inches tall, this rule seems questionable: if a student is the height of the average woman, shouldn't we predict Female? Generally speaking, overall accuracy can be a deceptive measure. To see this, we start by building a confusion matrix, which tabulates each combination of predicted and actual values. In R we can do this with the table function:
table(predicted = y_hat, actual = test_set$sex)
actual
predicted Female Male
Female 14 2
Male 58 242
If we study this table carefully, we see a problem. Computing the accuracy separately for each sex, we get:
test_set %>%
mutate(y_hat = y_hat) %>%
group_by(sex) %>%
summarize(accuracy = mean(y_hat == sex))
| sex | accuracy |
| --- | --- |
| Female | 0.1944444 |
| Male | 0.9918033 |
The prediction accuracy for women is very low, under 20%. This happens easily when the data are imbalanced. In our dataset,

mean(Y=='Male')

0.773333333333333

77% of the observations are male, so accuracy alone is a misleading metric. As an extreme example, consider credit default: the proportion of defaults may be under 1%, so always predicting "no default" achieves 99% accuracy yet is obviously useless. This is why we turn to other evaluation metrics.
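To make this concrete, here is a small Python check with made-up credit data: a "model" that always predicts the majority class scores 99% accuracy while detecting zero defaults:

```python
# made-up credit data: 1% defaults
truth = ["default"] * 10 + ["no default"] * 990
# a useless "model" that always predicts the majority class
pred = ["no default"] * len(truth)

accuracy = sum(p == t for p, t in zip(pred, truth)) / len(truth)
# per-class accuracy exposes the problem: no default is ever caught
default_recall = sum(p == t for p, t in zip(pred, truth) if t == "default") / 10
```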
Several metrics derived from the confusion matrix can be used instead. A general improvement over overall accuracy is to study sensitivity and specificity separately.
4. Sensitivity and specificity
To define sensitivity and specificity, we need a binary outcome. When the outcome has more than two categories, these terms can be defined for a specific category of interest. To make the definitions concrete, consider the following confusion matrix:

| | Actually positive | Actually negative |
| --- | --- | --- |
| Predicted positive | True positive (TP) | False positive (FP) |
| Predicted negative | False negative (FN) | True negative (TN) |
Generally speaking, sensitivity is defined as the ability of the algorithm to predict a positive outcome when the actual outcome is positive: $\hat Y = 1$ when $Y = 1$. Using the quantities in the table above:

$$\text{sensitivity} = \frac{TP}{TP+FN}$$

This value is also called the true positive rate (TPR) or recall.

Specificity is defined as the ability to predict a negative outcome when the actual outcome is negative ($\hat Y = 0$ when $Y = 0$):

$$\text{specificity} = \frac{TN}{TN+FP}$$

also called the true negative rate (TNR). A related but different quantity is $\frac{TP}{TP+FP}$, called precision or positive predictive value (PPV), which we will need for the F1-score. We can use the confusionMatrix function in caret to compute each of these:
cm <- confusionMatrix(data = y_hat, reference = test_set$sex)
cm
Confusion Matrix and Statistics
reference
Prediction Female Male
Female 14 2
Male 58 242
Accuracy : 0.8101
95% CI : (0.7625, 0.8519)
No Information Rate : 0.7722
P-Value [Acc > NIR] : 0.05927
Kappa : 0.2566
McNemar's Test P-Value : 1.243e-12
Sensitivity : 0.19444
Specificity : 0.99180
Pos Pred Value : 0.87500
Neg Pred Value : 0.80667
Prevalence : 0.22785
Detection Rate : 0.04430
Detection Prevalence : 0.05063
Balanced Accuracy : 0.59312
'Positive' Class : Female
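All of these statistics can be reproduced by hand from the four cell counts in the table (the positive class here is Female). A short Python check using the counts printed above:

```python
# cell counts from the confusion matrix above (positive class: Female)
TP, FP = 14, 2     # predicted Female: actually Female / actually Male
FN, TN = 58, 242   # predicted Male:   actually Female / actually Male
total = TP + FP + FN + TN

accuracy          = (TP + TN) / total               # 0.8101...
sensitivity       = TP / (TP + FN)                  # 0.1944... (recall, TPR)
specificity       = TN / (TN + FP)                  # 0.9918... (TNR)
pos_pred_value    = TP / (TP + FP)                  # 0.875 (precision, PPV)
neg_pred_value    = TN / (TN + FN)                  # 0.8066...
prevalence        = (TP + FN) / total               # 0.2278...
balanced_accuracy = (sensitivity + specificity) / 2  # 0.5931...
```

Every summary line in the confusionMatrix output above follows from just these four numbers.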
Although the sensitivity is quite low, the overall accuracy is still high. As summarized above, the reason is the low proportion of women: the rule fails to predict actual women as women (low sensitivity), yet accuracy barely suffers. This is an example of why it is important to examine sensitivity and specificity, not just accuracy. Before relying on accuracy for a dataset, we should check whether the classes are balanced.
5. F1-score
Although we generally recommend examining both specificity and sensitivity, it is often convenient to have a single-number summary, for example for optimization purposes. One improvement over overall accuracy is the average of sensitivity and specificity, called balanced accuracy. Because sensitivity and specificity are rates, a harmonic mean is more appropriate. In fact, the F1-score, a widely used single-number summary, is the harmonic mean of precision and recall:

$$F_1 = \frac{2}{\frac{1}{\text{recall}}+\frac{1}{\text{precision}}}$$

The larger the F1-score, the better.
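Plugging in the recall (0.19444) and precision (0.875) from the confusionMatrix output above, a quick Python check of the harmonic mean:

```python
recall = 14 / (14 + 58)      # sensitivity from the confusion matrix: 0.1944...
precision = 14 / (14 + 2)    # positive predictive value: 0.875

f1 = 2 / (1 / recall + 1 / precision)
# equivalently: 2 * precision * recall / (precision + recall)
f1_alt = 2 * precision * recall / (precision + recall)
```

Because the harmonic mean is dominated by the smaller of the two rates, the poor recall drags F1 down to about 0.32 even though precision is high.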
Keep in mind that, depending on the context, some types of error are more costly than others. For example, in capital murder cases we care more about precision, because a miscarriage of justice executes an innocent citizen. Conversely, when predicting aircraft failure we care more about recall, because missing a real failure causes far greater losses. We can therefore adjust the F-score to weight recall and precision differently. To do so, we define $\beta$ to represent how much more important recall is than precision:

$$F_\beta = \frac{1}{\frac{\beta^2}{1+\beta^2}\cdot\frac{1}{\text{recall}}+\frac{1}{1+\beta^2}\cdot\frac{1}{\text{precision}}}$$

The F_meas function lets us specify $\beta$; the default is 1.
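A small Python sketch of this weighted harmonic mean (an illustration, not the F_meas implementation); with β = 1 it reduces to the ordinary F1-score, while β > 1 weights recall more heavily:

```python
def f_beta(recall, precision, beta=1.0):
    """Weighted harmonic mean of recall and precision.
    beta expresses how much more important recall is than precision."""
    b2 = beta ** 2
    return 1 / ((b2 / (1 + b2)) / recall + (1 / (1 + b2)) / precision)

recall, precision = 14 / 72, 14 / 16   # values from the confusion matrix above
f1 = f_beta(recall, precision, beta=1)  # ordinary F1
f2 = f_beta(recall, precision, beta=2)  # weights recall more heavily
```

Since recall is much lower than precision here, putting more weight on recall (β = 2) lowers the score further.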
Now let's build a model whose evaluation metric is the F1-score rather than accuracy:
cutoff <- seq(61, 70)
F_1 <- map_dbl(cutoff, function(x){
y_hat <- ifelse(train_set$height > x, "Male", "Female") %>%
factor(levels = levels(test_set$sex))
F_meas(data = y_hat, reference = factor(train_set$sex))
})
plot(cutoff,F_1)
(figure: F1-score plotted against each candidate cutoff)
max(F_1)
0.609164420485175
best_cutoff <- cutoff[which.max(F_1)]
best_cutoff
66
We see that a cutoff of 66 inches yields the highest F1-score, about 0.61.
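The same grid search can be sketched in Python (with a handful of made-up heights, not the dslabs data): for each candidate cutoff, predict "Male" above it and keep the cutoff with the best F1:

```python
def f1_score(truth, pred, positive):
    """Harmonic mean of precision and recall for the given positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(truth, pred))
    fp = sum(t != positive and p == positive for t, p in zip(truth, pred))
    fn = sum(t == positive and p != positive for t, p in zip(truth, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# made-up training data: (height in inches, sex)
heights = [(60, "Female"), (63, "Female"), (64, "Female"), (66, "Male"),
           (68, "Male"), (69, "Male"), (71, "Male"), (62, "Female")]
truth = [s for _, s in heights]

best_cutoff, best_f1 = None, -1.0
for cutoff in range(61, 71):
    pred = ["Male" if h > cutoff else "Female" for h, _ in heights]
    f1 = f1_score(truth, pred, positive="Male")
    if f1 > best_f1:
        best_cutoff, best_f1 = cutoff, f1
```

On this toy sample the search lands on 64 inches; on the real training data the article finds 66.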
6. ROC curves and the AUC
When comparing the two approaches (random guessing vs. a height cutoff), we looked at two metrics, accuracy and F1. The second approach is clearly better. However, while we considered several cutoffs for the second approach, for the first we considered only one option: guessing with equal probability. Note that, because the sample is imbalanced, guessing Male with a higher probability yields higher accuracy:
p <- 0.9
n <- length(test_index)
y_hat_1 <- sample(c("Male", "Female"), n, replace = TRUE, prob=c(p, 1-p)) %>%
factor(levels = levels(test_set$sex))
mean(y_hat_1 == test_set$sex)
0.721518987341772
The accuracy improves, simply because the sample is imbalanced.
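This is just arithmetic: if we guess "Male" with probability p and the true male prevalence is π, the expected accuracy of random guessing is p·π + (1 − p)(1 − π). Checking with the numbers from this example in Python:

```python
p = 0.9              # probability of guessing "Male"
prevalence = 0.7722  # proportion of males in the test set (from the output above)

# expected accuracy of biased random guessing
expected_acc = p * prevalence + (1 - p) * (1 - prevalence)
# ~0.72, close to the 0.7215 observed above
```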
y_hat_2 <- ifelse(test_set$height > 66, "Male", "Female") %>%
factor(levels = levels(test_set$sex))
mean(y_hat_2 == test_set$sex)
0.800632911392405
Now let's see how to draw the ROC curve and compute the AUC. First, load the pROC package:

library(pROC)
roc1<-roc(test_set$sex,test_set$height,levels=c("Male","Female"))
plot(roc1,print.auc=T, auc.polygon=T, grid=c(0.1, 0.2), grid.col=c("green","red"), max.auc.polygon=T, auc.polygon.col="skyblue",print.thres=T)
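The AUC has a handy probabilistic interpretation: it is the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative case. A minimal Python sketch computing the AUC directly from that definition (made-up scores, not the heights data):

```python
def auc(pos_scores, neg_scores):
    """P(score of random positive > score of random negative); ties count 1/2."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# made-up classifier scores (here: heights, using Male as the positive class)
males   = [68, 70, 71, 66]
females = [62, 64, 65, 66]

a = auc(males, females)  # 15.5 wins out of 16 pairs
```

An AUC of 0.5 corresponds to random guessing, and 1.0 to perfect separation of the two classes.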
That's all for this chapter. If it helped you, please like, bookmark, comment, and follow!