[R Language Data Science]: Common Evaluation Metrics for Machine Learning
2022-07-01 14:07:00, by JOJO's Data Analysis Adventure
- Personal homepage: JoJo's Data Analysis Adventure
- About me: I am a senior majoring in statistics, admitted for postgraduate study at a top-3 statistics program.
- If this article helps you, please follow, like, bookmark, and subscribe!
- This article is part of the column [R Language Data Science], which covers applications of R in data science: R programming fundamentals, visualization, data manipulation, modeling, machine learning algorithms, and statistical theory and methods. The series is ongoing and updated roughly weekly; you are welcome to subscribe and learn together!
Preface
Before introducing how to build machine learning algorithms, let's first discuss how to evaluate a model. We will focus on metrics for evaluating machine learning algorithms; specifically, we need to quantify what makes one model "better" than another. Throughout this article we will use the caret package:
library(tidyverse)
library(caret)
We will use the heights dataset to illustrate:
library(dslabs)
data(heights)
X <- heights$height
Y <- heights$sex
Here we have a single explanatory variable, height, and one response variable, sex. The variable sex is binary, taking the values Female and Male. What we want to study is how well height predicts sex.
1. Training set and test set
Machine learning algorithms are ultimately evaluated on new data. However, when we train a model we only have data with known outcomes. Therefore, to evaluate a model we typically split the data into two parts: one for training and one for testing. The standard way to generate training and test sets is to split the data at random.
The caret package includes the function createDataPartition, which generates indexes for randomly splitting the data into training and test sets, as follows:
set.seed(12345) # set the random seed for reproducibility
test_index <- createDataPartition(Y,times=1,p=0.3,list = FALSE)
Here, times indicates how many random partitions to generate and p is the proportion of the data assigned to the test set. With the test-set index we can split the dataset:
# Training set
train_set <- heights[-test_index,]
# Test set
test_set <- heights[test_index,]
Now that we have a training set and a test set, the general approach is to train the model using only the training data, use the trained model to predict on the test data, and then compare the predictions with the actual values. For a binary outcome, one of the most commonly used metrics is overall accuracy.
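As an aside, createDataPartition does stratified sampling, so the class proportions of Y are preserved in both parts. A minimal Python sketch of that idea (the labels below are made up for illustration; this is not the caret implementation):

```python
import random

def create_data_partition(y, p=0.3, seed=12345):
    """Return test-set indexes, sampling a proportion p within each class
    so that the class balance of y is preserved (stratified sampling)."""
    random.seed(seed)
    test_index = []
    for cls in set(y):
        idx = [i for i, label in enumerate(y) if label == cls]
        n_test = round(len(idx) * p)
        test_index.extend(random.sample(idx, n_test))
    return sorted(test_index)

# toy labels: 8 Male, 4 Female
y = ["Male"] * 8 + ["Female"] * 4
test_idx = create_data_partition(y, p=0.25)
train_idx = [i for i in range(len(y)) if i not in test_idx]
```

Sampling within each class, rather than from the pooled data, guarantees that a rare class is not accidentally left out of the test set.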
2. Overall accuracy
Overall accuracy is simply defined as the overall proportion of predictions that are correct:

$$\text{accuracy} = \frac{\#\,\text{correct predictions}}{\#\,\text{total predictions}}$$
To demonstrate overall accuracy, we will build two different "algorithms" and compare them. Here we do not fit any actual model; instead we form predictions in two ways:
- 1. Randomly guessing "Male" or "Female"
- 2. Predicting from X (height)
y_hat <- sample(c("Male", "Female"), length(test_index), replace = TRUE)
Note that we are completely ignoring the predictor and simply guessing the sex at random. In machine learning applications, it is useful to represent categorical outcomes as factors, because R functions developed for machine learning (such as those in the caret package) require or recommend that categorical outcomes be coded as factors. So we use the factor function to convert y_hat to a factor; this has no other effect:
y_hat <- sample(c("Male", "Female"), length(test_index), replace = TRUE) %>%
factor(levels = levels(test_set$sex))
mean(y_hat == test_set$sex)
0.484177215189873
As expected, the accuracy of random guessing is about 50% (here 0.48), since we are guessing completely at random. Can we do better? From prior knowledge we know that, on average, men are slightly taller than women. So let's first compute the mean and standard deviation of height for each sex:
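As a quick sanity check (in Python, with simulated labels rather than the heights data), uniform random guessing converges to 50% accuracy regardless of how imbalanced the truth is:

```python
import random

random.seed(1)
# imbalanced truth: ~77% "Male", as in the heights test set
truth = ["Male" if random.random() < 0.77 else "Female" for _ in range(100_000)]
# uniform random guesses ignore the class balance entirely
guess = [random.choice(["Male", "Female"]) for _ in truth]
accuracy = sum(g == t for g, t in zip(guess, truth)) / len(truth)
# accuracy is close to 0.5
```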
heights %>% group_by(sex) %>% summarize(mean(height), sd(height))
| sex | mean(height) | sd(height) |
| --- | --- | --- |
| Female | 64.93942 | 3.760656 |
| Male | 69.31475 | 3.611024 |
How do we use these summaries? Let's try a simple rule: predict Male whenever the height is within two standard deviations of the male average, i.e. taller than about 62 inches (69.3 − 2 × 3.6 ≈ 62):
y_hat <- ifelse(test_set$height > 62, "Male", "Female") %>%
factor(levels = levels(test_set$sex))
mean(y_hat == test_set$sex )
0.810126582278481
The accuracy rises to about 81%, because we used real information: anyone taller than 62 inches is predicted to be male.
3. Confusion matrix
Our rule from the previous section predicts Male for anyone taller than 62 inches. Given that the average woman is about 65 inches tall, this rule seems questionable: if a student is the height of the average woman, shouldn't we predict Female? Generally speaking, overall accuracy can be a deceptive measure. To see this, we start by building a confusion matrix, which tabulates each combination of predicted and actual values. In R we can do this with the table function:
table(predicted = y_hat, actual = test_set$sex)
actual
predicted Female Male
Female 14 2
Male 58 242
If we study this table carefully, we see a problem. Computing the accuracy separately for each sex, we get:
test_set %>%
mutate(y_hat = y_hat) %>%
group_by(sex) %>%
summarize(accuracy = mean(y_hat == sex))
| sex | accuracy |
| --- | --- |
| Female | 0.1944444 |
| Male | 0.9918033 |
The prediction accuracy for women is very low, under 20%. This happens easily when the data are imbalanced. In our dataset,

mean(Y=='Male')

0.773333333333333

77% of the observations are male, so accuracy alone is a misleading metric. As an extreme example, consider credit default: the proportion of defaults may be under 1%, so always predicting "no default" achieves 99% accuracy yet is obviously useless. This is why we turn to other evaluation metrics.
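To make this concrete, here is a small Python check with made-up credit data: a "model" that always predicts the majority class scores 99% accuracy while detecting zero defaults:

```python
# made-up credit data: 1% defaults
truth = ["default"] * 10 + ["no default"] * 990
# a useless "model" that always predicts the majority class
pred = ["no default"] * len(truth)

accuracy = sum(p == t for p, t in zip(pred, truth)) / len(truth)
# per-class accuracy exposes the problem: no default is ever caught
default_recall = sum(p == t for p, t in zip(pred, truth) if t == "default") / 10
```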
Several metrics derived from the confusion matrix can be used instead. A general improvement over overall accuracy is to study sensitivity and specificity separately.
4. Sensitivity and specificity
To define sensitivity and specificity, we need a binary outcome. When the outcome has more than two categories, these terms can be defined for a specific category of interest. To make the definitions concrete, consider the following confusion matrix:

| | Actually positive | Actually negative |
| --- | --- | --- |
| Predicted positive | True positive (TP) | False positive (FP) |
| Predicted negative | False negative (FN) | True negative (TN) |
Generally speaking, sensitivity is defined as the ability of the algorithm to predict a positive outcome when the actual outcome is positive: $\hat Y = 1$ when $Y = 1$. Using the quantities in the table above:

$$\text{sensitivity} = \frac{TP}{TP+FN}$$

This value is also called the true positive rate (TPR) or recall.

Specificity is defined as the ability to predict a negative outcome when the actual outcome is negative ($\hat Y = 0$ when $Y = 0$):

$$\text{specificity} = \frac{TN}{TN+FP}$$

also called the true negative rate (TNR). A related but different quantity is $\frac{TP}{TP+FP}$, called precision or positive predictive value (PPV), which we will need for the F1-score. We can use the confusionMatrix function in caret to compute each of these:
cm <- confusionMatrix(data = y_hat, reference = test_set$sex)
cm
Confusion Matrix and Statistics
reference
Prediction Female Male
Female 14 2
Male 58 242
Accuracy : 0.8101
95% CI : (0.7625, 0.8519)
No Information Rate : 0.7722
P-Value [Acc > NIR] : 0.05927
Kappa : 0.2566
McNemar's Test P-Value : 1.243e-12
Sensitivity : 0.19444
Specificity : 0.99180
Pos Pred Value : 0.87500
Neg Pred Value : 0.80667
Prevalence : 0.22785
Detection Rate : 0.04430
Detection Prevalence : 0.05063
Balanced Accuracy : 0.59312
'Positive' Class : Female
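All of these statistics can be reproduced by hand from the four cell counts in the table (the positive class here is Female). A short Python check using the counts printed above:

```python
# cell counts from the confusion matrix above (positive class: Female)
TP, FP = 14, 2     # predicted Female: actually Female / actually Male
FN, TN = 58, 242   # predicted Male:   actually Female / actually Male
total = TP + FP + FN + TN

accuracy          = (TP + TN) / total               # 0.8101...
sensitivity       = TP / (TP + FN)                  # 0.1944... (recall, TPR)
specificity       = TN / (TN + FP)                  # 0.9918... (TNR)
pos_pred_value    = TP / (TP + FP)                  # 0.875 (precision, PPV)
neg_pred_value    = TN / (TN + FN)                  # 0.8066...
prevalence        = (TP + FN) / total               # 0.2278...
balanced_accuracy = (sensitivity + specificity) / 2  # 0.5931...
```

Every summary line in the confusionMatrix output above follows from just these four numbers.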
Although the sensitivity is quite low, the overall accuracy is still high. As summarized above, the reason is the low proportion of women: the rule fails to predict actual women as women (low sensitivity), yet accuracy barely suffers. This is an example of why it is important to examine sensitivity and specificity, not just accuracy. Before relying on accuracy for a dataset, we should check whether the classes are balanced.
5. F1-score
Although we generally recommend examining both specificity and sensitivity, it is often convenient to have a single-number summary, for example for optimization purposes. One improvement over overall accuracy is the average of sensitivity and specificity, called balanced accuracy. Because sensitivity and specificity are rates, a harmonic mean is more appropriate. In fact, the F1-score, a widely used single-number summary, is the harmonic mean of precision and recall:

$$F_1 = \frac{2}{\frac{1}{\text{recall}}+\frac{1}{\text{precision}}}$$

The larger the F1-score, the better.
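Plugging in the recall (0.19444) and precision (0.875) from the confusionMatrix output above, a quick Python check of the harmonic mean:

```python
recall = 14 / (14 + 58)      # sensitivity from the confusion matrix: 0.1944...
precision = 14 / (14 + 2)    # positive predictive value: 0.875

f1 = 2 / (1 / recall + 1 / precision)
# equivalently: 2 * precision * recall / (precision + recall)
f1_alt = 2 * precision * recall / (precision + recall)
```

Because the harmonic mean is dominated by the smaller of the two rates, the poor recall drags F1 down to about 0.32 even though precision is high.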
Keep in mind that, depending on the context, some types of error are more costly than others. For example, in capital murder cases we care more about precision, because a miscarriage of justice executes an innocent citizen. Conversely, when predicting aircraft failure we care more about recall, because missing a real failure causes far greater losses. We can therefore adjust the F-score to weight recall and precision differently. To do so, we define $\beta$ to represent how much more important recall is than precision:

$$F_\beta = \frac{1}{\frac{\beta^2}{1+\beta^2}\cdot\frac{1}{\text{recall}}+\frac{1}{1+\beta^2}\cdot\frac{1}{\text{precision}}}$$

The F_meas function lets us specify $\beta$; the default is 1.
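A small Python sketch of this weighted harmonic mean (an illustration, not the F_meas implementation); with β = 1 it reduces to the ordinary F1-score, while β > 1 weights recall more heavily:

```python
def f_beta(recall, precision, beta=1.0):
    """Weighted harmonic mean of recall and precision.
    beta expresses how much more important recall is than precision."""
    b2 = beta ** 2
    return 1 / ((b2 / (1 + b2)) / recall + (1 / (1 + b2)) / precision)

recall, precision = 14 / 72, 14 / 16   # values from the confusion matrix above
f1 = f_beta(recall, precision, beta=1)  # ordinary F1
f2 = f_beta(recall, precision, beta=2)  # weights recall more heavily
```

Since recall is much lower than precision here, putting more weight on recall (β = 2) lowers the score further.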
Now let's build a model whose evaluation metric is the F1-score rather than accuracy:
cutoff <- seq(61, 70)
F_1 <- map_dbl(cutoff, function(x){
y_hat <- ifelse(train_set$height > x, "Male", "Female") %>%
factor(levels = levels(test_set$sex))
F_meas(data = y_hat, reference = factor(train_set$sex))
})
plot(cutoff,F_1)
(figure: F1-score plotted against each candidate cutoff)
max(F_1)
0.609164420485175
best_cutoff <- cutoff[which.max(F_1)]
best_cutoff
66
We see that a cutoff of 66 inches yields the highest F1-score, about 0.61.
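The same grid search can be sketched in Python (with a handful of made-up heights, not the dslabs data): for each candidate cutoff, predict "Male" above it and keep the cutoff with the best F1:

```python
def f1_score(truth, pred, positive):
    """Harmonic mean of precision and recall for the given positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(truth, pred))
    fp = sum(t != positive and p == positive for t, p in zip(truth, pred))
    fn = sum(t == positive and p != positive for t, p in zip(truth, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# made-up training data: (height in inches, sex)
heights = [(60, "Female"), (63, "Female"), (64, "Female"), (66, "Male"),
           (68, "Male"), (69, "Male"), (71, "Male"), (62, "Female")]
truth = [s for _, s in heights]

best_cutoff, best_f1 = None, -1.0
for cutoff in range(61, 71):
    pred = ["Male" if h > cutoff else "Female" for h, _ in heights]
    f1 = f1_score(truth, pred, positive="Male")
    if f1 > best_f1:
        best_cutoff, best_f1 = cutoff, f1
```

On this toy sample the search lands on 64 inches; on the real training data the article finds 66.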
6. ROC curves and the AUC
When comparing the two approaches (random guessing vs. a height cutoff), we looked at two metrics, accuracy and F1. The second approach is clearly better. However, while we considered several cutoffs for the second approach, for the first we considered only one option: guessing with equal probability. Note that, because the sample is imbalanced, guessing Male with a higher probability yields higher accuracy:
p <- 0.9
n <- length(test_index)
y_hat_1 <- sample(c("Male", "Female"), n, replace = TRUE, prob=c(p, 1-p)) %>%
factor(levels = levels(test_set$sex))
mean(y_hat_1 == test_set$sex)
0.721518987341772
The accuracy improves, simply because the sample is imbalanced.
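This is just arithmetic: if we guess "Male" with probability p and the true male prevalence is π, the expected accuracy of random guessing is p·π + (1 − p)(1 − π). Checking with the numbers from this example in Python:

```python
p = 0.9              # probability of guessing "Male"
prevalence = 0.7722  # proportion of males in the test set (from the output above)

# expected accuracy of biased random guessing
expected_acc = p * prevalence + (1 - p) * (1 - prevalence)
# ~0.72, close to the 0.7215 observed above
```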
y_hat_2 <- ifelse(test_set$height > 66, "Male", "Female") %>%
factor(levels = levels(test_set$sex))
mean(y_hat_2 == test_set$sex)
0.800632911392405
Now let's see how to draw the ROC curve and compute the AUC. First, load the pROC package:

library(pROC)
roc1<-roc(test_set$sex,test_set$height,levels=c("Male","Female"))
plot(roc1,print.auc=T, auc.polygon=T, grid=c(0.1, 0.2), grid.col=c("green","red"), max.auc.polygon=T, auc.polygon.col="skyblue",print.thres=T)
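The AUC has a handy probabilistic interpretation: it is the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative case. A minimal Python sketch computing the AUC directly from that definition (made-up scores, not the heights data):

```python
def auc(pos_scores, neg_scores):
    """P(score of random positive > score of random negative); ties count 1/2."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# made-up classifier scores (here: heights, using Male as the positive class)
males   = [68, 70, 71, 66]
females = [62, 64, 65, 66]

a = auc(males, females)  # 15.5 wins out of 16 pairs
```

An AUC of 0.5 corresponds to random guessing, and 1.0 to perfect separation of the two classes.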
That's all for this chapter. If it helped you, please like, bookmark, comment, and follow!