Chapter 6 Boosting
2022-07-28 13:18:00 【Sang zhiweiluo 0208】
1 Characteristics of random forests
Each decision tree in a random forest is built from its own independently drawn sample of the data, so the trees are relatively independent of one another.
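To make this concrete, here is a minimal sketch (my illustration, not part of the original text) of bootstrap sampling, the per-tree sampling that keeps the trees relatively independent:

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw one bootstrap sample: n rows sampled with replacement."""
    n = len(X)
    idx = rng.integers(0, n, size=n)   # indices drawn with replacement
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(10).reshape(-1, 1)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])
# Each tree in the forest would be trained on its own independent draw:
X_b, y_b = bootstrap_sample(X, y, rng)
```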
Note: there are two routes from weak classifiers to a strong classifier: sample weighting and classifier weighting.
Sample weighting: after classifying the samples, increase the weights of the samples that were misclassified, so that later classifiers focus on them.
Classifier weighting: weak classifiers with a low misclassification rate receive a higher weight in the final result.
Here "weight" refers to a classifier's contribution to the final predicted value; see the sketch below.
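A minimal sketch (my illustration) of classifier weighting: each weak classifier casts a vote scaled by its weight, and the final prediction is the sign of the weighted sum. The stumps and weights below are made up for demonstration.

```python
import numpy as np

def weighted_vote(classifiers, weights, x):
    """Classifier weighting: low-error classifiers get larger weights,
    so they contribute more to the final predicted value."""
    score = sum(w * g(x) for g, w in zip(classifiers, weights))
    return np.sign(score)

# Two hypothetical stumps and made-up weights:
g1 = lambda x: 1 if x < 2.5 else -1
g2 = lambda x: 1 if x < 8.5 else -1
print(weighted_vote([g1, g2], [0.42, 0.65], 7))   # -> 1.0
```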
2 Boosting
2.1 Boosting
Boosting —— Boosting is a machine learning technique that can be used for regression and classification problems. At each step it produces a weak prediction model (such as a decision tree) and adds it, with a weight, into the overall model.
Gradient boosting —— If the weak prediction model at each step is generated along the negative gradient direction of the loss function, the method is called gradient boosting.
Theoretical significance of boosting —— If a weak classifier exists for a problem, then a strong classifier can be obtained from it.
2.2 The gradient boosting algorithm
The gradient boosting algorithm starts from a given target loss function (chosen according to the actual problem; it is independent of the boosting procedure itself), whose domain is the set of all feasible weak functions (base functions). Through iteration, the algorithm selects a weak function along the negative gradient direction, gradually approaching a local minimum. This view of gradient boosting over a function domain has had a profound influence on many areas of machine learning.
2.3 The boosting algorithm
Given a number of training samples $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ with input vector $x$ and output variable $y$, the goal is to find an approximate function $\hat{F}(\vec{x})$ that makes the loss $L(y, F(\vec{x}))$ as small as possible.
Typical definitions of the loss function $L(y, F(\vec{x}))$ are:
$$L(y, F(\vec{x})) = \frac{1}{2}\big(y - F(\vec{x})\big)^2 \quad \text{or} \quad L(y, F(\vec{x})) = \big|y - F(\vec{x})\big|$$
Suppose the optimal function is $F^{*}(\vec{x})$, namely:
$$F^{*}(\vec{x}) = \underset{F}{\arg\min}\; E_{(x,y)}\big[L(y, F(\vec{x}))\big]$$
Assume $F(\vec{x})$ is a weighted sum of a family of base functions $f_i(\vec{x})$:
$$F(\vec{x}) = \sum_{i=1}^{M} \gamma_i f_i(\vec{x}) + \text{const}$$
Proof that the median is the optimal solution under the absolute-loss function:
Given the samples $x_1, x_2, \ldots, x_n$, compute
$$F(c) = \sum_{i=1}^{n} |x_i - c|$$
Take the partial derivative with respect to $c$ and set it equal to 0:
$$\frac{\partial F}{\partial c} = \sum_{i=1}^{n} \operatorname{sgn}(c - x_i) = 0$$
This requires that the number of samples before $c$ (the first $k$) equal the number of samples after it (the remaining $n - k$), i.e. $c$ is the median.
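A quick numeric check of this claim (my addition): scan candidate constants $c$ and confirm the absolute loss is minimized at the median.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 7.0, 50.0])   # an arbitrary sample (odd n)
cs = np.linspace(0.0, 60.0, 6001)          # candidate constants c
loss = np.abs(x[None, :] - cs[:, None]).sum(axis=1)   # F(c) = sum |x_i - c|
print(cs[np.argmin(loss)], np.median(x))   # both print 3.0
```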
Derivation of the boosting algorithm: start from the best constant model and greedily add one base function at a time:
$$F_0(\vec{x}) = \underset{\gamma}{\arg\min} \sum_{i=1}^{n} L(y_i, \gamma)$$
$$F_m(\vec{x}) = F_{m-1}(\vec{x}) + \underset{f \in H}{\arg\min} \sum_{i=1}^{n} L\big(y_i,\, F_{m-1}(\vec{x}_i) + f(\vec{x}_i)\big)$$
Gradient approximation: because choosing the best $f$ at each step is intractable in general, apply steepest descent in function space:
$$F_m(\vec{x}) = F_{m-1}(\vec{x}) - \gamma_m \sum_{i=1}^{n} \nabla_F L\big(y_i,\, F_{m-1}(\vec{x}_i)\big)$$
with the step size $\gamma_m$ chosen by line search:
$$\gamma_m = \underset{\gamma}{\arg\min} \sum_{i=1}^{n} L\Big(y_i,\, F_{m-1}(\vec{x}_i) - \gamma\, \nabla_F L\big(y_i, F_{m-1}(\vec{x}_i)\big)\Big)$$
The boosting algorithm:
1. Initialize $F_0(\vec{x}) = \underset{\gamma}{\arg\min} \sum_{i=1}^{n} L(y_i, \gamma)$.
2. For $m = 1$ to $M$:
(a) compute the pseudo-residuals $r_{im} = -\Big[\frac{\partial L(y_i, F(\vec{x}_i))}{\partial F(\vec{x}_i)}\Big]_{F = F_{m-1}}$;
(b) fit a base learner $f_m(\vec{x})$ to the pseudo-residuals;
(c) choose the step size $\gamma_m$ by line search;
(d) update $F_m(\vec{x}) = F_{m-1}(\vec{x}) + \gamma_m f_m(\vec{x})$.
3. Output $F_M(\vec{x})$.
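A compact sketch of the algorithm above (my own illustration, not the original's): it assumes squared loss, so the negative gradient is simply the residual $y - F$, uses depth-1 regression stumps as base learners, and replaces the line search with a fixed learning rate.

```python
import numpy as np

def fit_stump(x, r):
    """Fit a depth-1 regression stump to pseudo-residuals r: choose the
    threshold minimizing squared error, predict the mean of r on each side."""
    best = None
    for v in (x[:-1] + x[1:]) / 2.0:            # candidate thresholds
        left, right = r[x < v], r[x >= v]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if best is None or sse < best[0]:
            best = (sse, v, left.mean(), right.mean())
    _, v, cl, cr = best
    return lambda t: np.where(t < v, cl, cr)

def gradient_boost(x, y, M=50, lr=0.1):
    """Gradient boosting for squared loss."""
    F = np.full_like(y, y.mean(), dtype=float)  # step 1: F0 = constant minimizer (mean)
    learners = []
    for _ in range(M):
        r = y - F                               # step 2(a): pseudo-residuals
        f = fit_stump(x, r)                     # step 2(b): fit the base learner
        F = F + lr * f(x)                       # step 2(d): update; fixed lr stands in for line search
        learners.append(f)
    return lambda t: y.mean() + lr * sum(f(t) for f in learners)

x = np.linspace(0.0, 9.0, 100)
y = np.sin(x)
model = gradient_boost(x, y)
print(np.abs(model(x) - y).mean())              # small mean training error
```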
3 Gradient boosting decision tree (GBDT)
3.1 Definition
3.2 Summary
4 Objective function
4.1 Second-order derivative information
At step $t$, the objective is
$$Obj^{(t)} = \sum_{i=1}^{n} L\big(y_i,\, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t)$$
Expanding the loss to second order (Taylor expansion) around $\hat{y}_i^{(t-1)}$:
$$Obj^{(t)} \approx \sum_{i=1}^{n} \Big[ L\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t)$$
where $g_i = \partial_{\hat{y}^{(t-1)}} L\big(y_i, \hat{y}^{(t-1)}\big)$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} L\big(y_i, \hat{y}^{(t-1)}\big)$ are the first and second derivatives of the loss.
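As a small illustration (mine, not the original's), the derivative information $g_i, h_i$ for two common losses can be computed directly:

```python
import numpy as np

def squared_loss_g_h(y, y_pred):
    """L = (1/2)(y - y_pred)^2  ->  g = y_pred - y,  h = 1."""
    return y_pred - y, np.ones_like(y)

def logistic_loss_g_h(y, y_pred):
    """Binary logistic loss, labels y in {0, 1}, raw scores y_pred:
    g = sigmoid(y_pred) - y,  h = sigmoid(y_pred) * (1 - sigmoid(y_pred))."""
    p = 1.0 / (1.0 + np.exp(-y_pred))
    return p - y, p * (1.0 - p)

y = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.3, -0.2, 1.5])     # scores from the first t-1 trees
g, h = logistic_loss_g_h(y, y_pred)
# g and h are all the t-th tree needs from the loss in the second-order objective.
```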
4.2 Calculation of objective function
4.3 Simplification of objective function
5 AdaBoost
5.1 AdaBoost definition
Given a training data set
$$T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}, \quad y_i \in \{-1, +1\}$$
Initialize the weight distribution of the training data:
$$D_1 = (w_{11}, \ldots, w_{1i}, \ldots, w_{1N}), \quad w_{1i} = \frac{1}{N}, \quad i = 1, 2, \ldots, N$$
5.2 The AdaBoost algorithm
For $m = 1, 2, \ldots, M$:
(a) Using the training data weighted by $D_m$, learn a basic classifier $G_m(x)$.
(b) Compute its weighted error rate: $e_m = \sum_{i=1}^{N} w_{mi}\, I\big(G_m(x_i) \neq y_i\big)$.
(c) Compute the coefficient of $G_m(x)$: $\alpha_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m}$.
(d) Update the weight distribution: $w_{m+1,i} = \frac{w_{mi}}{Z_m} \exp\big(-\alpha_m y_i G_m(x_i)\big)$, where $Z_m = \sum_{i=1}^{N} w_{mi} \exp\big(-\alpha_m y_i G_m(x_i)\big)$ is a normalization factor.
Finally, build the combined classifier $f(x) = \sum_{m=1}^{M} \alpha_m G_m(x)$ and output $G(x) = \operatorname{sign}\big(f(x)\big)$.
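The steps above as a minimal sketch in code (my own illustration): decision stumps serve as the basic classifiers $G_m$, with thresholds and both polarities searched exhaustively.

```python
import numpy as np

def best_stump(x, y, w):
    """Step (a): find the weighted-error-minimizing stump over all
    thresholds v and both polarities s (predict s for x < v, -s otherwise)."""
    best = (2.0, None, None)
    for v in np.arange(x.min() + 0.5, x.max() + 0.5):
        for s in (1, -1):
            pred = np.where(x < v, s, -s)
            e = w[pred != y].sum()            # step (b): weighted error rate
            if e < best[0]:
                best = (e, v, s)
    return best

def adaboost(x, y, M=3):
    N = len(x)
    w = np.full(N, 1.0 / N)                   # initialize D1 uniformly
    ensemble = []
    for m in range(M):
        e, v, s = best_stump(x, y, w)
        alpha = 0.5 * np.log((1 - e) / e)     # step (c): classifier coefficient
        pred = np.where(x < v, s, -s)
        w = w * np.exp(-alpha * y * pred)     # step (d): reweight the samples
        w /= w.sum()                          # normalize by Z_m
        ensemble.append((alpha, v, s))
    return ensemble

def predict(ensemble, x):
    """Final classifier G(x) = sign(sum_m alpha_m G_m(x))."""
    f = sum(a * np.where(x < v, s, -s) for a, v, s in ensemble)
    return np.sign(f)
```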
5.3 A worked example
m=1:
| Serial number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| y | 1 | 1 | 1 | -1 | -1 | -1 | 1 | 1 | 1 | -1 |
On the training data with weight distribution $D_1$, the error rate is lowest when the threshold $v$ is 2.5, so the basic classifier is
$$G_1(x) = \begin{cases} 1, & x < 2.5 \\ -1, & x > 2.5 \end{cases}$$
The points $x = 6, 7, 8$ are misclassified, so the error rate is $e_1 = 3 \times 0.1 = 0.3$.
Plugging $e_1$ into the coefficient formula:
$$\alpha_1 = \frac{1}{2} \ln \frac{1 - e_1}{e_1} = \frac{1}{2} \ln \frac{0.7}{0.3} \approx 0.4236$$
The combined classifier so far is
$$f_1(x) = 0.4236\, G_1(x)$$
and $\operatorname{sign}\big(f_1(x)\big)$ still has 3 misclassified points on the training data set.
Update the sample weights:
$$D_2 = (0.0715,\, 0.0715,\, 0.0715,\, 0.0715,\, 0.0715,\, 0.0715,\, 0.1666,\, 0.1666,\, 0.1666,\, 0.0715)$$
The weights of the misclassified points $x = 6, 7, 8$ have increased. These weights are used for the next basic classifier, i.e. for $m = 2$.
m=2:
| x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| y | 1 | 1 | 1 | -1 | -1 | -1 | 1 | 1 | 1 | -1 |
| w | 0.0715 | 0.0715 | 0.0715 | 0.0715 | 0.0715 | 0.0715 | 0.1666 | 0.1666 | 0.1666 | 0.0715 |
On the training data with weight distribution $D_2$, the error rate is lowest when the threshold $v$ is 8.5, so the basic classifier is
$$G_2(x) = \begin{cases} 1, & x < 8.5 \\ -1, & x > 8.5 \end{cases}$$
Now the points $x = 3, 4, 5$ are misclassified, so the error rate is $e_2 = 3 \times 0.0715 \approx 0.2143$.
Plugging $e_2$ into the coefficient formula:
$$\alpha_2 = \frac{1}{2} \ln \frac{1 - e_2}{e_2} \approx 0.6496$$
The combined classifier is now
$$f_2(x) = 0.4236\, G_1(x) + 0.6496\, G_2(x)$$
and $\operatorname{sign}\big(f_2(x)\big)$ still has 3 misclassified points on the training data set.
Update the sample weights:
$$D_3 = (0.0455,\, 0.0455,\, 0.0455,\, 0.1667,\, 0.1667,\, 0.1667,\, 0.1060,\, 0.1060,\, 0.1060,\, 0.0455)$$
The weights of the misclassified points $x = 3, 4, 5$ have increased. These weights are used for the next basic classifier, i.e. for $m = 3$.
m = 3: and so on by analogy ……
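As a consistency check (assuming the `adaboost` and `predict` sketches from section 5.2 above), running on the example data reproduces these rounds:

```python
import numpy as np

x = np.arange(10)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])
ensemble = adaboost(x, y, M=3)
for m, (alpha, v, s) in enumerate(ensemble, 1):
    print(f"m={m}: v={v}, alpha={alpha:.4f}")
# m=1: v=2.5, alpha=0.4236
# m=2: v=8.5, alpha=0.6496
# m=3: v=5.5, alpha=0.7520  (exact arithmetic; the book-style rounded e3 gives 0.7514)
print((predict(ensemble, x) == y).all())   # True: all 10 points classified correctly
```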
5.4 Key points on the weights and the error rate
