Chapter VIII: Ensemble Learning
2022-07-08 01:07:00 【Intelligent control and optimization decision Laboratory of Cen】
1. Explain the concept and main idea of ensemble learning.
Ensemble learning accomplishes a learning task by building and combining multiple learners; it is sometimes called a multi-classifier system or committee-based learning. A schematic of ensemble learning is shown below:
Its main idea is that combining multiple learners can often achieve significantly better generalization performance than any single learner, because the errors of the base learners are assumed to be independent of one another. In real tasks, however, the individual learners are trained to solve the same problem, so they obviously cannot be fully independent. In fact, the "accuracy" and "diversity" of individual learners are inherently in conflict: once accuracy is already high, increasing diversity usually requires sacrificing some accuracy. Since the final result is produced by a vote among the learners, how to generate and combine individual learners that are "good yet different" is the core question of ensemble learning research.
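To see why (approximate) independence matters, here is a minimal sketch, not part of the original text, that computes the error rate of a majority vote over $T$ independent base classifiers, each correct with probability $p$:

```python
# Minimal sketch (assumed setup): majority-vote error of T independent base
# classifiers, each correct with probability p. Illustrates why combining
# "good but different" learners lowers the ensemble error.
from math import comb

def majority_vote_error(p: float, T: int) -> float:
    # The ensemble errs when more than half of the T learners are wrong.
    return sum(comb(T, k) * (1 - p) ** k * p ** (T - k)
               for k in range(T // 2 + 1, T + 1))

if __name__ == "__main__":
    for T in (1, 5, 11, 25):  # odd T avoids ties
        print(T, round(majority_vote_error(p=0.7, T=T), 4))
```

Under the independence assumption the error drops quickly as $T$ grows; in practice the learners are correlated, so the improvement is smaller.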
2. Into which categories can ensemble learning methods be divided? Describe the characteristics of each category.
According to how the individual learners are generated, current ensemble learning methods can be divided roughly into two categories: sequential methods, in which there are strong dependencies among the individual learners and they must be generated serially; and parallel methods, in which there are no strong dependencies and the individual learners can be generated simultaneously. The representative of the former is Boosting; the latter is represented by Bagging and Random Forest.
- Boosting Method
A base learner is first trained from the initial training set; the distribution of the training samples is then adjusted according to this learner's performance, so that samples misclassified by the previous base learner receive more attention in subsequent rounds. The next base learner is trained on the adjusted sample distribution, and this is repeated until the number of base learners reaches a predetermined value $T$; finally the $T$ base learners are combined by a weighted vote. The flow chart is shown in the figure below:
Characteristics: each round checks whether the currently generated base learner satisfies a basic condition (e.g., whether it does better than random guessing); if not, the current learner is discarded and the learning process stops. If the number of rounds completed is then far below $T$, the final ensemble may perform poorly; with resampling, learning can be restarted so that the process need not stop early.
- Bagging
This kind of method draws $T$ sample sets, each containing $m$ training samples, from the initial training set by bootstrap sampling, trains a base learner on each sample set, and then combines these base learners. The flow chart of this method is shown below:
Characteristics: bootstrap sampling uses only part of the initial training set for each learner, so the remaining (out-of-bag) samples can be used as a validation set to obtain an out-of-bag estimate of generalization performance (when the base learner is a decision tree this can assist pruning; when the base learner is a neural network it can reduce the risk of overfitting).
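If scikit-learn is available, the out-of-bag estimate described above can be obtained directly; the sketch below uses a synthetic data set purely for illustration (not part of the original text):

```python
# Sketch of Bagging with an out-of-bag (OOB) estimate using scikit-learn
# (assumes scikit-learn is installed; the data set is synthetic and only illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each bootstrap sample leaves out roughly 37% of the training points;
# oob_score=True uses those left-out points to estimate generalization accuracy.
bag = BaggingClassifier(n_estimators=50, oob_score=True, random_state=0)
bag.fit(X, y)
print("OOB accuracy estimate:", bag.oob_score_)
```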
3. For the binary classification problem, describe the implementation process of the AdaBoost algorithm in ensemble learning. Think about how AdaBoost changes the weights, or probability distribution, of the training data in each round.
3.1 AdaBoost algorithm flow
Iterative process:
Illustration:
3.2 Main steps:
1. Initialize the weight distribution of the training data.
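In the standard AdaBoost formulation the initial distribution is uniform:
$$D_1 = (w_{11}, w_{12}, \ldots, w_{1N}), \qquad w_{1i} = \frac{1}{N}, \quad i = 1, 2, \ldots, N$$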
Here $D_1$ denotes the weight distribution over the samples in the first iteration, $w_{11}$ is the weight of the first sample in the first iteration, and $N$ is the total number of samples.
2. Perform $M$ iterations; for $m = 1, 2, \ldots, M$:
a) Learn from the training samples weighted by the distribution $D_m$ $(m = 1, 2, \ldots, M)$ to obtain a weak classifier $G_m(x): x \rightarrow \{-1, +1\}$.
The performance of the weak classifier is measured by the following weighted error $\epsilon_m$:
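$$\epsilon_m = \sum_{i=1}^{N} w_{mi}\, I\big(G_m(x_i) \neq y_i\big)$$
i.e., the sum of the weights of the misclassified samples (the standard AdaBoost weighted error).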
b) Compute the coefficient $\alpha_m$ of the weak classifier $G_m$ (its vote weight), which expresses the importance of $G_m$ in the final classifier. It is computed as follows:
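$$\alpha_m = \frac{1}{2} \ln \frac{1 - \epsilon_m}{\epsilon_m}$$
This is the standard AdaBoost coefficient, and it is exactly what the code in Section 5 computes as `alpha`.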
As $\epsilon_m$ decreases, $\alpha_m$ increases. The formula shows that classifiers with a small error rate carry more weight in the final classifier.
c) Update the weight distribution of the training samples for the next iteration: the weights of incorrectly classified samples increase, while the weights of correctly classified samples decrease. The update is computed as follows:
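$$w_{m+1,i} = \frac{w_{mi}}{Z_m} \exp\big(-\alpha_m\, y_i\, G_m(x_i)\big), \quad i = 1, 2, \ldots, N$$
(the standard AdaBoost weight update, with $D_{m+1} = (w_{m+1,1}, \ldots, w_{m+1,N})$).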
Here $D_{m+1}$ is the sample weight distribution for the next iteration and $w_{m+1,i}$ is the weight of the $i$-th sample in the next iteration. $y_i$ denotes the label of the $i$-th sample ($+1$ or $-1$) and $G_m(x_i)$ is the weak classifier's prediction for sample $x_i$; if the classification is correct, the product $y_i G_m(x_i)$ equals $+1$, otherwise $-1$. $Z_m$ is the normalization factor, computed as follows:
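$$Z_m = \sum_{i=1}^{N} w_{mi} \exp\big(-\alpha_m\, y_i\, G_m(x_i)\big)$$
which makes $D_{m+1}$ a valid probability distribution.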
3. Combine the weak classifiers to obtain the strong classifier.
First, take the weighted sum of the classifiers obtained in all iterations:
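$$f(x) = \sum_{m=1}^{M} \alpha_m G_m(x)$$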
Next, apply the sign function to this sum to obtain the final strong classifier $G(x)$:
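$$G(x) = \operatorname{sign}\big(f(x)\big) = \operatorname{sign}\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right)$$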
The sign function is a logical (step) function that returns the sign of a real number. It is defined as:
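$$\operatorname{sign}(x) = \begin{cases} +1, & x > 0 \\ 0, & x = 0 \\ -1, & x < 0 \end{cases}$$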
4. What is the relationship between random forests and ensemble learning?
A random forest is an algorithm that combines many trees using the idea of ensemble learning; its basic unit is the decision tree, and it is in essence an ensemble learning method.
There are two key words in the name "random forest": "random" and "forest". "Forest" is easy to understand: one tree is just a tree, while hundreds of trees form a forest; this embodies the main idea of random forests, namely the ensemble idea. Bagging, however, comes at a cost: since predictions are no longer made by a single decision tree, it is no longer clear which variables play an important role, so bagging improves prediction accuracy at the expense of interpretability.
Having studied decision trees, we know that they are simple and intuitive, easy to understand, require no data preprocessing, and are remarkably tolerant of outliers. In practice, however, decision trees tend to overfit and their performance is unstable, which is why an ensemble is needed. The theoretical basis is the law of large numbers: if an experiment is repeated many times under unchanged conditions, the frequency of a random event approximates its probability, so there is a certain inevitability behind the apparent randomness.
What is an ensemble? It is like asking thousands of randomly chosen people a complicated question and then aggregating their answers; you will often find that the aggregated answer is more detailed and accurate than an expert's. In machine learning we usually train a model with a single estimator; when we aggregate several estimators to fit the data, we speak of ensemble learning, and the approach is called an ensemble method. A random forest is a product of ensemble learning: it is composed of a group of decision trees, each trained on a different random subset (roughly two thirds) of the training set. Moreover, each tree is no longer grown by splitting on the globally best feature; the random forest injects additional randomness into tree growth by selecting the best feature within a randomly generated subset of features. In this way the random forest trades slightly higher bias for lower variance.

Under this scheme a random forest does not, like a single decision tree, place important features near the root and unimportant ones near the leaves, so which features matter is not immediately visible. Instead, the importance of a feature can be estimated, for example, by computing its average depth across all the decision trees. (Feature importance has many applications: in bank lending, for instance, correctly assessing an enterprise's credit rating determines whether a loan can be recovered effectively, and credit scoring models have many noisy features, so we compute and rank the importance of each feature and then select only the most important ones.) This is an important property of random forests and is used for feature selection. Finally, the prediction is obtained simply by aggregating the results of all the trees.
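If scikit-learn is available, this feature-importance ranking can be sketched as follows. Note that scikit-learn's `feature_importances_` is an impurity-based measure, related to but not the same as the average-depth idea mentioned above, and the synthetic data here merely stands in for, say, credit-scoring features:

```python
# Sketch of feature-importance ranking with a random forest
# (assumes scikit-learn is installed; the data set is synthetic and only illustrative).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Rank features by impurity-based importance (higher = more important);
# the top-ranked features can then be kept for feature selection.
order = np.argsort(forest.feature_importances_)[::-1]
for idx in order:
    print(f"feature {idx}: importance {forest.feature_importances_[idx]:.3f}")
```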
5. Implement in Python the AdaBoost algorithm based on single-level decision trees (decision stumps).
Code:
# -*- coding: utf-8 -*-
import numpy as np
def loadSimData():
    '''
    Input: none
    Function: provide a small data set with two features
    Output: data set and class labels
    '''
    datMat = np.matrix([[1., 2.1], [2., 1.1], [1.3, 1.], [1., 1.], [2., 1.]])
    classLabels = [1.0, 1.0, -1.0, -1.0, 1.0]
    return datMat, classLabels
def stumpClassify(dataMatrix, dimen, thresholdValue, thresholdIneq):
    '''
    Input: data matrix, feature dimension, classification threshold for that feature, inequality direction
    Function: classify the samples with a decision stump
    Output: predicted labels
    '''
    returnArray = np.ones((np.shape(dataMatrix)[0], 1))
    if thresholdIneq == 'lt':
        returnArray[dataMatrix[:, dimen] <= thresholdValue] = -1
    else:
        returnArray[dataMatrix[:, dimen] > thresholdValue] = -1
    return returnArray
def buildStump(dataArray, classLabels, D):
    '''
    Input: data matrix, true class labels, sample weight vector D
    Function: find the single-level decision tree (stump) with the smallest weighted
              error rate on the data set; the criterion clearly depends on the weight vector D
    Output: estimated class labels under weight vector D, minimum weighted error rate,
            best stump (feature index, threshold, inequality direction)
    '''
    dataMatrix = np.mat(dataArray); labelMat = np.mat(classLabels).T
    m, n = np.shape(dataMatrix)
    stepNum = 10.0; bestStump = {}; bestClassEst = np.mat(np.zeros((m, 1)))
    minError = np.inf
    for i in range(n):
        rangeMin = dataMatrix[:, i].min(); rangeMax = dataMatrix[:, i].max()
        stepSize = (rangeMax - rangeMin) / stepNum
        for j in range(-1, int(stepNum) + 1):
            for thresholdIneq in ['lt', 'gt']:
                thresholdValue = rangeMin + float(j) * stepSize
                predictClass = stumpClassify(dataMatrix, i, thresholdValue, thresholdIneq)
                errArray = np.mat(np.ones((m, 1)))
                errArray[predictClass == labelMat] = 0
                weightError = D.T * errArray
                # print("split: dim %d, thresh: %.2f, threIneq: %s, weightError %.3f" % (i, thresholdValue, thresholdIneq, weightError))
                if weightError < minError:
                    minError = weightError
                    bestClassEst = predictClass.copy()
                    bestStump['dimen'] = i
                    bestStump['thresholdValue'] = thresholdValue
                    bestStump['thresholdIneq'] = thresholdIneq
    return bestClassEst, minError, bestStump
def adaBoostTrainDS(dataArray, classLabels, numIt=40):
    '''
    Input: data set, label vector, maximum number of iterations
    Function: build the AdaBoost additive model
    Output: array of weak classifiers
    '''
    weakClass = []  # array of weak classifiers; stores each base classifier bestStump
    m, n = np.shape(dataArray)
    D = np.mat(np.ones((m, 1)) / m)
    aggClassEst = np.mat(np.zeros((m, 1)))
    for i in range(numIt):
        print("i:", i)
        bestClassEst, minError, bestStump = buildStump(dataArray, classLabels, D)  # step 1: find the best stump
        print("D.T:", D.T)
        alpha = float(0.5 * np.log((1 - minError) / max(minError, 1e-16)))  # step 2: compute alpha
        print("alpha:", alpha)
        bestStump['alpha'] = alpha
        weakClass.append(bestStump)  # step 3: add the base classifier to the weak-classifier array
        print("classEst:", bestClassEst)
        expon = np.multiply(-1 * alpha * np.mat(classLabels).T, bestClassEst)
        D = np.multiply(D, np.exp(expon))
        D = D / D.sum()  # step 4: update the weights so that D remains a probability distribution
        aggClassEst += alpha * bestClassEst  # step 5: update the aggregate class estimate
        print("aggClassEst:", aggClassEst.T)
        print(np.sign(aggClassEst) != np.mat(classLabels).T)
        aggError = np.multiply(np.sign(aggClassEst) != np.mat(classLabels).T, np.ones((m, 1)))
        print("aggError", aggError)
        aggErrorRate = aggError.sum() / m
        print("total error:", aggErrorRate)
        if aggErrorRate == 0.0:
            break
    return weakClass
def adaTestClassify(dataToClassify, weakClass):
    '''
    Input: data to classify, array of weak classifiers
    Function: classify new samples with the trained AdaBoost model
    Output: predicted labels
    '''
    dataMatrix = np.mat(dataToClassify)
    m = np.shape(dataMatrix)[0]
    aggClassEst = np.mat(np.zeros((m, 1)))
    for i in range(len(weakClass)):
        classEst = stumpClassify(dataMatrix, weakClass[i]['dimen'],
                                 weakClass[i]['thresholdValue'], weakClass[i]['thresholdIneq'])
        aggClassEst += weakClass[i]['alpha'] * classEst
        print(aggClassEst)
    return np.sign(aggClassEst)
if __name__ == '__main__':
    D = np.mat(np.ones((5, 1)) / 5)
    dataMatrix, classLabels = loadSimData()
    bestClassEst, minError, bestStump = buildStump(dataMatrix, classLabels, D)
    weakClass = adaBoostTrainDS(dataMatrix, classLabels, 9)
    testClass = adaTestClassify(np.mat([0, 0]), weakClass)
Result:
i: 0
D.T: [[0.2 0.2 0.2 0.2 0.2]]
alpha: 0.6931471805599453
classEst: [[-1.]
[ 1.]
[-1.]
[-1.]
[ 1.]]
aggClassEst: [[-0.69314718 0.69314718 -0.69314718 -0.69314718 0.69314718]]
[[ True]
[False]
[False]
[False]
[False]]
aggError [[1.]
[0.]
[0.]
[0.]
[0.]]
total error: 0.2
i: 1
D.T: [[0.5 0.125 0.125 0.125 0.125]]
alpha: 0.9729550745276565
classEst: [[ 1.]
[ 1.]
[-1.]
[-1.]
[-1.]]
aggClassEst: [[ 0.27980789 1.66610226 -1.66610226 -1.66610226 -0.27980789]]
[[False]
[False]
[False]
[False]
[ True]]
aggError [[0.]
[0.]
[0.]
[0.]
[1.]]
total error: 0.2
i: 2
D.T: [[0.28571429 0.07142857 0.07142857 0.07142857 0.5 ]]
alpha: 0.8958797346140273
classEst: [[1.]
[1.]
[1.]
[1.]
[1.]]
aggClassEst: [[ 1.17568763 2.56198199 -0.77022252 -0.77022252 0.61607184]]
[[False]
[False]
[False]
[False]
[False]]
aggError [[0.]
[0.]
[0.]
[0.]
[0.]]
total error: 0.0
[[-0.69314718]]
[[-1.66610226]]
[[-2.56198199]]