Machine learning 02: model evaluation

2022-07-05 17:16:00 【Fei Fei is a princess】

Evaluation methods

Before a learned model is put into use, its performance usually needs to be evaluated. To do so, we use a "test set" (testing set) to measure how well the model generalizes to new samples, and take the "test error" (testing error) on that set as an approximation of the generalization error.
We assume the test set is drawn by independent sampling from the true sample distribution, so the test set and the training set should be as mutually exclusive as possible.
Given a known data set, it must be split into a training set S and a test set T. Common approaches are the hold-out method, cross-validation, and the bootstrap method.

Hold-out method

Directly divide the data set into two mutually exclusive sets: the training set S and the test set T.
The random split is generally repeated several times, because the same data set with the same split ratio can still be divided into many different training and test sets. For example, for a data set of 1000 samples (500 positive and 500 negative examples) split 7:3 with the class ratio preserved, there are $(C_{500}^{150})^{2}$ different splits. In practice, we split randomly several times and average the results of all runs as the test error (the approximation of the generalization error).
The training/test ratio is usually between 2:1 and 4:1; that is, about 2/3 to 4/5 of the samples are used for training and the rest for testing.
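The split described above can be sketched in a few lines of Python. This is only an illustration: the function name `stratified_holdout`, its parameters, and the toy labels are assumptions made for the example, not part of the original post.

```python
import math
import random

def stratified_holdout(labels, test_ratio=0.3, seed=0):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        idx = idx[:]                      # copy before shuffling
        rng.shuffle(idx)
        n_test = round(len(idx) * test_ratio)
        test_idx.extend(idx[:n_test])     # first part of each class goes to T
        train_idx.extend(idx[n_test:])    # the rest goes to S
    return sorted(train_idx), sorted(test_idx)

# 500 positive and 500 negative examples, split 7:3 as in the text.
labels = [1] * 500 + [0] * 500
train_idx, test_idx = stratified_holdout(labels, test_ratio=0.3, seed=42)

# Number of distinct stratified 7:3 splits: choose 150 test samples per class.
n_splits = math.comb(500, 150) ** 2
```

Calling the function with several different seeds and averaging the resulting test errors gives the repeated hold-out estimate.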

Cross validation

The data set is divided into k mutually exclusive subsets of similar size. Each time, the union of k-1 subsets is used as the training set and the remaining subset as the test set. The average of the k test results is returned. The most commonly used value of k is 10.
Explanation: D1, D2, D3, … are the different subsets. In the first round $D_{10}$ is the test set, in the second round $D_{9}$ is the test set, …, and in the 10th round $D_{1}$ is the test set. Finally, the 10 test results are averaged to give the test error.
As with the hold-out method, there are many ways to divide data set D into k subsets. To reduce the differences introduced by different partitions, k-fold cross-validation is usually repeated p times with different random partitions, and the final estimate is the mean of the p k-fold cross-validation results, for example the common "10 times 10-fold cross-validation".
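A minimal sketch of p times k-fold cross-validation follows. The helper names, the toy labels, and the majority-class "model" inside `majority_error` are hypothetical and stand in for a real learner.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of similar size."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def repeated_k_fold_error(n, k, p, evaluate):
    """Average evaluate(train, test) over p random partitions of k folds each."""
    errors = []
    for rep in range(p):
        folds = k_fold_indices(n, k, seed=rep)   # a different partition per repeat
        for i in range(k):
            test = folds[i]
            train = [j for f in folds[:i] + folds[i + 1:] for j in f]
            errors.append(evaluate(train, test))
    return sum(errors) / len(errors)

# Toy data: 60 negatives and 40 positives.
labels = [0] * 60 + [1] * 40

def majority_error(train, test):
    """Error rate of always predicting the majority class of the training part."""
    ones = sum(labels[i] for i in train)
    pred = 1 if ones * 2 > len(train) else 0
    return sum(labels[i] != pred for i in test) / len(test)

mean_error = repeated_k_fold_error(len(labels), k=10, p=10, evaluate=majority_error)
```

Here the majority class is always 0, so the averaged error comes out at about 0.4, the share of positives in the data.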

Leave-one-out

This is the limiting case of cross-validation: when each subset contains only one sample, we get the leave-one-out method, which has the following characteristics:

It is not affected by how the samples are randomly divided
The results are often more accurate
When the data set is large, the computational overhead becomes unbearable, because one model has to be trained per sample
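To make the cost visible, here is a small leave-one-out sketch. The 1-nearest-neighbour classifier and the one-dimensional toy data are assumptions chosen only to keep the example short; the loop runs once per sample, which is where the overhead comes from.

```python
def loo_error_1nn(xs, ys):
    """Leave-one-out error of a 1-nearest-neighbour classifier on 1-D data."""
    n = len(xs)
    mistakes = 0
    for i in range(n):
        # "Train" on every sample except i, then test on sample i alone.
        others = [j for j in range(n) if j != i]
        nearest = min(others, key=lambda j: abs(xs[j] - xs[i]))
        if ys[nearest] != ys[i]:
            mistakes += 1
    return mistakes / n

xs = [0.1, 0.2, 0.3, 0.5, 0.9, 1.0, 1.1]
ys = [0, 0, 0, 1, 1, 1, 1]
loo_error = loo_error_1nn(xs, ys)   # only the sample at 0.5 is misclassified
```

There is no random seed anywhere in this procedure, which matches the first characteristic listed above.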

Bootstrap method

Based on bootstrap sampling: sample m times with replacement from data set D to obtain the training set, and use the samples that were never drawn as the test set. Both the model actually evaluated and the model we expect to evaluate then use m training samples (the test set holds about $\frac{1}{3}$ of the total number of samples).
Many different training sets can be generated from the initial data set, which is very useful for ensemble learning.
The bootstrap method is useful when the data set is small and hard to split effectively into training and test sets. However, it changes the distribution of the data set and may therefore introduce estimation bias; when enough data is available, the hold-out method and cross-validation are more commonly used.
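The "about 1/3" figure can be checked with a short simulation. The function name `bootstrap_split` and the sample size are assumptions for this sketch; the probability that a sample is never drawn in m draws is $(1-\frac{1}{m})^m$, which tends to $1/e \approx 0.368$.

```python
import random

def bootstrap_split(n, seed=0):
    """Draw n indices with replacement; indices never drawn form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]   # may contain duplicates
    drawn = set(train)
    test = [i for i in range(n) if i not in drawn]
    return train, test

n = 10000
train, test = bootstrap_split(n, seed=1)
oob_fraction = len(test) / n        # empirical share of never-drawn samples
limit = (1 - 1 / n) ** n            # theoretical value, close to 1/e
```

Calling `bootstrap_split` with different seeds yields different training sets from the same data, which is the property ensemble methods rely on.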

Evaluation metrics

An evaluation method alone is not enough to judge the quality of a model; we also have to choose evaluation metrics.

Accuracy and error rate

When comparing the model's results with the ground truth, the two most commonly used metrics are accuracy and error rate.
Accuracy: the proportion of correctly classified samples in the total number of test samples.
Error rate: the proportion of misclassified samples in the total number of test samples.
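Both definitions translate directly into code. The labels and predictions below are made up for the example.

```python
def accuracy(y_true, y_pred):
    """Share of test samples whose prediction matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def error_rate(y_true, y_pred):
    """Share of test samples that are misclassified; equals 1 - accuracy."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
acc = accuracy(y_true, y_pred)      # 6 of 8 correct
err = error_rate(y_true, y_pred)    # 2 of 8 wrong
```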

Precision and recall

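True positive rate: the proportion of all positive samples that are classified as positive
False positive rate: the proportion of all negative samples that are classified as positive

Plotting the true positive rate against the false positive rate gives the ROC curve, and the area under it (AUC) can be computed by integration.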
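As a sketch of the integration step, the code below builds the curve of true positive rate against false positive rate from classifier scores and integrates it with the trapezoidal rule. It assumes binary labels (1 for positive), distinct scores, and at least one sample of each class; the data is invented for the example.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold one sample at a time."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if y_true[i] == 1:
            tp += 1          # one more positive classified as positive
        else:
            fp += 1          # one more negative classified as positive
        points.append((fp / neg, tp / pos))
    return points

def area_under_curve(points):
    """Trapezoidal integration of TPR over FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = area_under_curve(roc_points(y_true, scores))
```

In this example 8 of the 9 positive/negative pairs are ranked correctly, and the integrated area is 8/9 accordingly.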

Comparison tests

Comparing performance raises the following issues:

Test performance is not the same as generalization performance
Test performance changes as the test set changes
Many machine learning algorithms are random to some degree

Therefore, it is not advisable to compare models directly on their metric values, because those values only reflect performance on one specific data set. To make the comparison statistically convincing, we also need hypothesis testing.
Hypothesis testing provides an important basis for comparing classifier performance. From its results we can infer, when classifier A is observed to be better than B on the test set, whether A's generalization performance is statistically better than B's, and how confident we can be in that conclusion.

Paired two-sided t-test

Simply put, it consists of the following steps:

Calculate the mean $\mu$ of the paired differences in test error
Calculate the variance $\sigma^2$ of the differences
Calculate the t statistic $\tau_t$
Look up the critical value in the table for the degrees of freedom $V$ and the significance level $\alpha$. If $\tau_t$ is less than the critical value, then at significance level $\alpha$ the performance of the two classifiers can be considered not significantly different; otherwise, the two classifiers are considered significantly different, and the one with the lower average error rate performs better.
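The steps above can be sketched as follows. The original formula image is missing, so this assumes the usual form of the statistic, $\tau_t = |\sqrt{k}\,\mu / \sigma|$ with $k-1$ degrees of freedom, where k is the number of paired results; the per-fold error rates are invented for the example.

```python
import math
import statistics

def paired_t_statistic(err_a, err_b):
    """t statistic for paired error rates of two classifiers on the same k folds."""
    diffs = [a - b for a, b in zip(err_a, err_b)]
    k = len(diffs)
    mu = statistics.mean(diffs)        # step 1: mean of the differences
    sigma = statistics.stdev(diffs)    # step 2: sample standard deviation (sqrt of variance)
    return abs(math.sqrt(k) * mu / sigma)   # step 3: the statistic

# Hypothetical error rates of classifiers A and B on the same 5 folds.
err_a = [0.10, 0.12, 0.11, 0.13, 0.09]
err_b = [0.14, 0.15, 0.16, 0.15, 0.15]
tau_t = paired_t_statistic(err_a, err_b)

# Step 4: two-sided critical value from a t table, alpha = 0.05, 4 degrees of freedom.
critical = 2.776
significantly_different = tau_t >= critical
```

Here the statistic is about 5.66, well above 2.776, so the two classifiers would be judged significantly different, with A (the lower average error) performing better.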

Friedman test and Nemenyi post-hoc test

The paired two-sided t-test compares the performance of two classifiers on one data set. Often, however, we need to compare several classifiers on a group of data sets, which calls for the rank-based Friedman test.
Suppose we want to compare k algorithms on N data sets. First, use the hold-out method or cross-validation to obtain the test result of each algorithm on each data set. Then, on each data set, sort the algorithms by performance and assign the ranks 1, 2, …; algorithms with the same performance share the average of their ranks. Finally, compute the average rank of each algorithm over all data sets.
Explanation: this step computes the statistic $\tau_{\chi^2}$, which follows a chi-square distribution, where
N: number of data sets
k: number of models
Then calculate the F statistic $\tau_F$.

Explanation: look up the critical value in the table for N and k and compare it with the F statistic $\tau_F$. If $\tau_F$ is less than the critical value, the compared algorithms are considered to perform the same; otherwise they are considered significantly different. Only in the latter case do we proceed to the next step, the Nemenyi post-hoc test.
Explanation: calculate the critical difference CD; $q_\alpha$ is found by looking it up in the table for the number of algorithms k (k and N are known).

Explanation: compare the difference between the average ranks of two algorithms with CD. If the difference is greater than CD, the two algorithms are considered significantly different; if it is less than CD, there is no significant difference.
Explanation: the Friedman test chart shows the performance differences between the algorithms intuitively. It is only a visual representation and introduces no new method.
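The whole procedure can be sketched in Python. The formula images are missing from the post, so the code assumes the standard forms $\tau_{\chi^2} = \frac{12N}{k(k+1)}\left(\sum_i r_i^2 - \frac{k(k+1)^2}{4}\right)$, $\tau_F = \frac{(N-1)\tau_{\chi^2}}{N(k-1)-\tau_{\chi^2}}$ and $CD = q_\alpha\sqrt{\frac{k(k+1)}{6N}}$, where $r_i$ is the average rank of algorithm i. The error table and function names are hypothetical; the table values 5.143 and 2.344 are the commonly tabulated ones for $\alpha = 0.05$, k = 3, N = 4.

```python
import math

def average_ranks(errors):
    """errors[d][a] is the error of algorithm a on data set d (lower is better)."""
    n, k = len(errors), len(errors[0])
    totals = [0.0] * k
    for row in errors:
        order = sorted(range(k), key=lambda a: row[a])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1                      # extend the group of tied algorithms
            shared = (pos + end) / 2 + 1      # average of ranks pos+1 .. end+1
            for a in order[pos:end + 1]:
                totals[a] += shared
            pos = end + 1
    return [t / n for t in totals]

def friedman_f(avg_ranks, n):
    """F statistic derived from the chi-square form of the Friedman statistic."""
    k = len(avg_ranks)
    chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)
    return (n - 1) * chi2 / (n * (k - 1) - chi2)

def nemenyi_cd(q_alpha, k, n):
    """Critical difference of average ranks for the Nemenyi post-hoc test."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))

# Hypothetical errors of algorithms A, B, C on N = 4 data sets (B and C tie on the second).
errors = [
    [0.10, 0.20, 0.30],
    [0.10, 0.20, 0.20],
    [0.10, 0.20, 0.30],
    [0.10, 0.20, 0.30],
]
ranks = average_ranks(errors)                 # average rank per algorithm
tau_f = friedman_f(ranks, n=len(errors))
reject = tau_f > 5.143                        # table value for alpha=0.05, k=3, N=4
cd = nemenyi_cd(q_alpha=2.344, k=3, n=len(errors))
a_vs_c_differs = abs(ranks[0] - ranks[2]) > cd
a_vs_b_differs = abs(ranks[0] - ranks[1]) > cd
```

With these numbers the Friedman test rejects the hypothesis that all algorithms perform the same, and the Nemenyi step finds a significant difference only between A and C, since their rank gap is the only one exceeding CD.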