
Andrew Ng's Machine Learning Course - Week 7

2022-06-11 10:14:00 J___ code

1. Support vector machines (SVM)

1.1 Optimization objectives

In logistic regression, the loss for a single sample is:

$$-y^{(i)}\log\bigl(h_\theta(x^{(i)})\bigr)-(1-y^{(i)})\log\bigl(1-h_\theta(x^{(i)})\bigr)=-y^{(i)}\log\left(\frac{1}{1+e^{-\theta^{T}x^{(i)}}}\right)-(1-y^{(i)})\log\left(1-\frac{1}{1+e^{-\theta^{T}x^{(i)}}}\right)$$

Take $z=\theta^T x$ as the horizontal axis and the loss value as the vertical axis; plotting the loss for the two values of $y$ gives the curves below:

[Figure: the logistic loss curves (blue) and their piecewise-linear approximations (magenta) plotted against $z$, for $y=1$ and $y=0$]

The blue curve in the figure above is the original logistic loss, and the magenta polyline is a close approximation to it. Expressing this approximation as a new cost function over the whole training set gives:

$$\min_{\theta}\; C \sum_{i=1}^{m}\left[y^{(i)} \operatorname{cost}_{1}\!\left(\theta^{T} x^{(i)}\right)+\left(1-y^{(i)}\right) \operatorname{cost}_{0}\!\left(\theta^{T} x^{(i)}\right)\right]+\frac{1}{2} \sum_{j=1}^{n} \theta_{j}^{2}$$

This is the cost function of the support vector machine, where $\operatorname{cost}_1(z)$ and $\operatorname{cost}_0(z)$ are the magenta polylines in the two subplots, approximating $-\log h_{\theta}(x)$ and $-\log(1-h_{\theta}(x))$ respectively. Compared with logistic regression, the constant $\frac{1}{m}$ in front of both terms and the $\lambda$ in the regularization term are dropped, and the first term is multiplied by a constant $C$ instead. This is the conventional way of writing the SVM objective, and its meaning is unchanged: the original objective can be read as $A+\lambda B$, where a larger $\lambda$ gives $B$ more weight; the new form is $CA+B$, where a smaller $C$ likewise increases the relative weight of $B$. When $C=\frac{1}{\lambda}$, the two objectives yield the same optimal parameters.

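A minimal numpy sketch of this objective, using the standard hinge forms $\max(0,\,1-z)$ and $\max(0,\,1+z)$ as the piecewise-linear $\operatorname{cost}_1$ and $\operatorname{cost}_0$ (function and variable names here are my own, not from the course):

```python
import numpy as np

def cost1(z):
    # piecewise-linear surrogate for -log(h(z)), used when y = 1:
    # zero once z >= 1, growing linearly as z decreases
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    # surrogate for -log(1 - h(z)), used when y = 0:
    # zero once z <= -1, growing linearly as z increases
    return np.maximum(0.0, 1.0 + z)

def svm_objective(theta, X, y, C):
    """C * sum of per-sample costs + (1/2) * ||theta[1:]||^2.
    X has shape (m, n+1) with a leading column of ones; y is 0/1."""
    z = X @ theta
    data_term = C * np.sum(y * cost1(z) + (1 - y) * cost0(z))
    reg_term = 0.5 * np.sum(theta[1:] ** 2)   # theta_0 is not regularized
    return data_term + reg_term
```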
1.2 Intuition about the large margin

The figure below shows the two SVM cost functions. When $y=1$ (left subplot), driving the loss to zero requires $z=\theta^T x \ge 1$; when $y=0$ (right subplot), it requires $z=\theta^T x \le -1$. Logistic regression, by contrast, only needs $\theta^T x \ge 0$ to classify a sample as positive when $y=1$, and $\theta^T x < 0$ to classify it as negative when $y=0$:

[Figure: the SVM cost functions $\operatorname{cost}_1(z)$ for $y=1$ (left) and $\operatorname{cost}_0(z)$ for $y=0$ (right)]

In other words, the SVM is more demanding: it asks for a safety margin on top of correct classification.

When $C$ is set to a very large value, minimizing the objective essentially requires finding a $\theta$ that makes the first term zero. For the first term to be zero, the constraints must hold: $\theta^T x^{(i)} \ge 1$ whenever $y^{(i)}=1$ and $\theta^T x^{(i)} \le -1$ whenever $y^{(i)}=0$.

[Figure: the SVM objective when $C$ is very large, reduced to minimizing the regularization term subject to the margin constraints]

Why is the SVM called a large margin classifier? For the dataset in the figure below, several lines can separate the positive and negative samples, such as the magenta and green lines. The black line visually separates the two classes better, because its shortest distance to the samples (take the smaller of the two distances to the two classes) is larger, which makes the SVM more robust. This distance is called the margin of the SVM:

[Figure: a dataset separable by several lines; the black large-margin boundary versus the magenta and green alternatives]

When $C$ is set very large, outliers can seriously affect the decision boundary. In the figure below, with a large $C$, the black line is the boundary learned before the red cross in the lower-left corner (an outlier) is added. After the outlier is added, separating the samples with maximum margin changes the boundary from the black line to the magenta line, which is clearly worse (this can be seen as overfitting). When $C$ is not set too large, the influence of the outlier is largely ignored and a better boundary like the black line is recovered:

[Figure: the effect of an outlier on the decision boundary for large versus moderate $C$]
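A toy scikit-learn sketch of this effect (scikit-learn and the synthetic data are my own additions, not part of the course): fit a linear-kernel SVM on two clusters plus one mislabeled outlier, once with a small $C$ and once with a large $C$:

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated clusters plus one outlier labeled as the positive class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.3, (20, 2)),
               rng.normal([-2, -2], 0.3, (20, 2)),
               [[-1.5, -1.5]]])            # outlier sitting near the negative cluster
y = np.array([1] * 20 + [0] * 20 + [1])

for C in (0.1, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    print(f"C={C}: w={w.round(2)}, b={b:.2f}")
# With a large C the fitted boundary is much more sensitive to the outlier;
# with a small C it stays close to the boundary learned without the outlier.
```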

1.3 The mathematics behind large margin classification

First, recall the geometric meaning of the vector inner product. For two vectors $u=[u_1,u_2]^T$ and $v=[v_1,v_2]^T$, let $p$ be the projection of $v$ onto $u$ ($p$ is a signed scalar, not a vector); then $u^T v = p\cdot\|u\| = u_1 v_1 + u_2 v_2$.

[Figure: the projection $p$ of $v$ onto $u$]
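A quick numpy check of this identity (the vectors are arbitrary):

```python
import numpy as np

u = np.array([3.0, 1.0])
v = np.array([1.0, 2.0])

p = v @ u / np.linalg.norm(u)          # signed length of the projection of v onto u
print(v @ u, p * np.linalg.norm(u))    # both print 5.0: u^T v = p * ||u||
```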

Now use this inner product property to understand the SVM objective:

[Figure: the SVM objective $\min_\theta \frac{1}{2}\sum_j \theta_j^2$ subject to $\theta^T x^{(i)} \ge 1$ if $y^{(i)}=1$ and $\theta^T x^{(i)} \le -1$ if $y^{(i)}=0$]

Assume $\theta_0=0$ and take the number of features to be $n=2$; the objective then simplifies to:

$$\min_{\theta}\;\frac{1}{2}\sum_{j=1}^{n}\theta_j^2=\frac{1}{2}\left(\theta_1^2+\theta_2^2\right)=\frac{1}{2}\left(\sqrt{\theta_1^2+\theta_2^2}\right)^2=\frac{1}{2}\|\theta\|^2$$

Using the projection view of the inner product, the constraint terms can be written as $\theta^T x^{(i)}=p^{(i)}\|\theta\|$, where $p^{(i)}$ is the projection of $x^{(i)}$ onto $\theta$. The objective and constraints then become:

[Figure: the same objective rewritten with the constraints $p^{(i)}\|\theta\| \ge 1$ if $y^{(i)}=1$ and $p^{(i)}\|\theta\| \le -1$ if $y^{(i)}=0$]

In the left subplot below there are two classes of samples. Suppose the green line is taken as the decision boundary (i.e. $\theta^T x=0$; it passes through the origin because $\theta_0=0$). The blue vector $\theta$ is orthogonal to the green line, so every decision boundary corresponds to a parameter vector $\theta$. Suppose the red crosses are positive samples and the blue circles are negative samples. The projections of the positive sample $x^{(1)}$ and the negative sample $x^{(2)}$ onto $\theta$ are then very short, so satisfying $p^{(i)}\|\theta\| \ge 1$ and $p^{(i)}\|\theta\| \le -1$ requires $\|\theta\|$ to be very large. But the objective is to make $\|\theta\|$ as small as possible, so this direction of $\theta$ (i.e. this choice of decision boundary) is a poor one.

Guided by the objective, the SVM will instead select the decision boundary in the right subplot below, because that boundary gives larger projections and therefore allows $\|\theta\|$ to be smaller:

[Figure: left, a decision boundary with small projections $p^{(i)}$; right, the large-margin boundary chosen by the SVM]
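A small numerical illustration of this argument, with made-up sample points and two candidate directions for $\theta$:

```python
import numpy as np

def min_theta_norm(theta_dir, X_pos, X_neg):
    """Given a direction for theta (with theta_0 = 0), return the smallest ||theta||
    that satisfies p_i * ||theta|| >= 1 for positives and <= -1 for negatives,
    where p_i is the signed projection of x^(i) onto theta."""
    t = theta_dir / np.linalg.norm(theta_dir)
    p_pos = X_pos @ t
    p_neg = X_neg @ t
    if p_pos.min() <= 0 or p_neg.max() >= 0:
        return np.inf          # this direction cannot satisfy the constraints at all
    return max(1.0 / p_pos.min(), 1.0 / -p_neg.max())

# Samples lying close to the x2-axis (small x1, larger |x2|), as in the lecture sketch.
X_pos = np.array([[0.2, 2.0], [0.4, 3.0]])
X_neg = np.array([[-0.2, -2.0], [-0.3, -1.5]])

print(min_theta_norm(np.array([1.0, 0.0]), X_pos, X_neg))  # boundary = x2-axis: tiny projections, needs ||theta|| = 5
print(min_theta_norm(np.array([0.0, 1.0]), X_pos, X_neg))  # boundary = x1-axis: large projections, ||theta|| ≈ 0.67 suffices
```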

1.4 Kernel function Ⅰ

To obtain a decision boundary like the one in the figure below, the model might take the form $h_\theta(x)=\theta_0+\theta_1 x_1+\theta_2 x_2+\theta_3 x_1 x_2+\dots$. Writing $f_1=x_1,\ f_2=x_2,\ f_3=x_1 x_2,\ \dots$, this becomes $h_\theta(x)=\theta_0+\theta_1 f_1+\theta_2 f_2+\dots+\theta_n f_n$. Apart from such combinations of the original features, is there a better way to construct $f_1, f_2, \dots$? This is where kernel functions come in.

Given a sample $x$, use the similarity between $x$ and pre-selected landmarks $l^{(1)}, l^{(2)}, l^{(3)}$ as the new features $f_1, f_2, f_3$:

[Figure: a sample $x$ and three landmarks $l^{(1)}, l^{(2)}, l^{(3)}$ in the $(x_1, x_2)$ plane]

$$f_{1}=\operatorname{similarity}\left(x, l^{(1)}\right)=\exp\left(-\frac{\left\|x-l^{(1)}\right\|^{2}}{2 \sigma^{2}}\right)$$

$$f_{2}=\operatorname{similarity}\left(x, l^{(2)}\right)=\exp\left(-\frac{\left\|x-l^{(2)}\right\|^{2}}{2 \sigma^{2}}\right)$$

$$f_{3}=\operatorname{similarity}\left(x, l^{(3)}\right)=\exp\left(-\frac{\left\|x-l^{(3)}\right\|^{2}}{2 \sigma^{2}}\right)$$

Here $\operatorname{similarity}(x, l^{(i)})$ is a kernel function (this particular one is the Gaussian kernel), usually written $k(x, l^{(i)})$. When $x$ is close to the landmark $l^{(i)}$, the feature $f_i$ is approximately $e^{-0}=1$; when $x$ is far from $l^{(i)}$, $f_i$ is approximately $e^{-\text{(large number)}}=0$. This can be seen in the figure below: with $l^{(1)}=[3,5]^T$, the $z$ axis (which represents the value of $f_1$; the $x$ and $y$ axes represent $x_1$ and $x_2$) reaches its maximum at $x=[3,5]^T$. The value of $\sigma$ controls how quickly $f_1$ changes as $x$ moves away from the landmark:

[Figure: surface plots of $f_1$ over $(x_1, x_2)$ for $l^{(1)}=[3,5]^T$ and different values of $\sigma$]
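A small numpy sketch of the Gaussian kernel showing how $\sigma$ controls the fall-off (landmark and test points chosen arbitrarily):

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    # f = exp(-||x - l||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

l1 = np.array([3.0, 5.0])
for sigma in (0.5, 1.0, 3.0):
    near = gaussian_kernel(np.array([3.0, 5.0]), l1, sigma)   # x on the landmark -> 1
    far = gaussian_kernel(np.array([6.0, 1.0]), l1, sigma)    # x far from the landmark -> near 0
    print(f"sigma={sigma}: f(near)={near:.3f}, f(far)={far:.4f}")
# A larger sigma makes f fall off more slowly as x moves away from the landmark.
```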

So how does the kernel determine the decision boundary? Suppose $h_\theta(x)=\theta_0+\theta_1 f_1+\theta_2 f_2+\theta_3 f_3$ and that the parameters are already known: $\theta_0=-0.5,\ \theta_1=\theta_2=1,\ \theta_3=0$. The magenta sample point in the figure below is close to $l^{(1)}$ and far from $l^{(2)}$ and $l^{(3)}$, so $f_1 \approx 1,\ f_2 \approx 0,\ f_3 \approx 0$ and therefore $h_\theta(x)\approx -0.5+1=0.5>0$: the point is predicted to be positive. Similarly, the green sample point is predicted positive, while the blue sample points, far from all the landmarks, give $h_\theta(x)\approx -0.5<0$ and are predicted negative; this eventually traces out the red decision boundary. Notice that the original feature values $x_1, x_2, \dots$ are not used directly in this computation; the new features $f_1, f_2, f_3$ computed with the kernel are used instead.

[Figure: the three landmarks, several sample points, and the resulting nonlinear decision boundary]
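A sketch of this prediction rule in numpy; the landmark coordinates are hypothetical, the parameter values are the ones from the example above:

```python
import numpy as np

def f(x, l, sigma=1.0):
    # Gaussian similarity between a sample and a landmark
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

# Hypothetical landmark positions and the parameters from the worked example.
landmarks = [np.array([3.0, 2.0]), np.array([1.0, 4.5]), np.array([4.5, 4.5])]
theta = np.array([-0.5, 1.0, 1.0, 0.0])        # [theta_0, theta_1, theta_2, theta_3]

def predict(x):
    feats = np.array([1.0] + [f(x, l) for l in landmarks])   # [1, f1, f2, f3]
    return int(feats @ theta >= 0)                           # predict 1 if h >= 0

print(predict(np.array([3.0, 2.2])))   # close to l(1): f1 ≈ 1, h ≈ 0.5 -> positive
print(predict(np.array([0.0, 0.0])))   # far from all landmarks: h ≈ -0.5 -> negative
```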

1.5 Kernel function Ⅱ

Section 1.4 introduced the idea and role of kernel functions, but not how to choose the landmarks $l^{(i)}$. The number of landmarks is usually taken to be the number of training samples: with $m$ samples, choose $m$ landmarks, placing one at every training sample. Each new feature then measures how close a sample is to one of the training samples. With every sample used as a landmark, the new feature vector can be written as:

[Figure: the feature vector $f=[f_0=1, f_1, \dots, f_m]^T$ with $f_i=\operatorname{similarity}(x, l^{(i)})$ and one landmark per training sample]
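A numpy sketch of building this feature representation with every training sample used as a landmark (function name and toy data are mine):

```python
import numpy as np

def gaussian_feature_matrix(X, sigma=1.0):
    """Each training sample is a landmark: the (i, j) entry of the kernel part is
    similarity(x^(i), l^(j)), giving an (m, m) block plus a column of ones for f_0."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # pairwise ||x_i - x_j||^2
    F = np.exp(-sq_dists / (2 * sigma ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), F])                    # prepend f_0 = 1

X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
print(gaussian_feature_matrix(X).round(3))
# The diagonal of the kernel block is 1: each sample is most similar to itself.
```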

The SVM objective therefore becomes

$$\min_{\theta}\; C \sum_{i=1}^{m}\left[y^{(i)} \operatorname{cost}_{1}\!\left(\theta^{T} f^{(i)}\right)+\left(1-y^{(i)}\right) \operatorname{cost}_{0}\!\left(\theta^{T} f^{(i)}\right)\right]+\frac{1}{2} \sum_{j=1}^{n=m} \theta_{j}^{2}$$

In practice, implementations replace $\sum_{j=1}^{n=m} \theta_{j}^{2}=\theta^T\theta$ with $\theta^T M \theta$ to simplify the computation, where the matrix $M$ depends on the kernel function.

Note: when an SVM is used without a kernel, it is said to use a linear kernel, i.e. the decision function is still based directly on $\theta^T x$.

Let's look at how $C=\frac{1}{\lambda}$ and $\sigma$ affect the SVM (see the scikit-learn sketch after the list):

  • When $C$ is large ($\lambda$ small), the model may overfit: high variance.
  • When $C$ is small ($\lambda$ large), the model may underfit: high bias.
  • When $\sigma$ is large, the features $f_i$ vary smoothly, which may lead to high bias and low variance.
  • When $\sigma$ is small, the features $f_i$ vary sharply, which may lead to low bias and high variance.
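The sketch below uses scikit-learn (not part of the course material, just an illustration) to explore these effects; in scikit-learn's RBF kernel, `gamma` plays the role of $\frac{1}{2\sigma^2}$, so a small $\sigma$ corresponds to a large `gamma`:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

# gamma = 1 / (2 * sigma^2): large gamma (small sigma) -> sharper features, higher variance.
for C in (0.1, 1.0, 100.0):
    for gamma in (0.1, 1.0, 10.0):
        score = cross_val_score(SVC(kernel="rbf", C=C, gamma=gamma), X, y, cv=5).mean()
        print(f"C={C:>5}, gamma={gamma:>4}: cv accuracy = {score:.3f}")
```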

1.6 Using SVMs

To use an SVM, the following choices must be made (see the scikit-learn sketch after the list):

  • Choose an appropriate value of $C$
  • Choose a kernel function (it is also possible to use none); besides the Gaussian kernel there are, for example, polynomial kernels, but the goal is always to build new features from the similarity between a sample and the landmarks
  • Choose an appropriate value of $\sigma$ (for the Gaussian kernel)
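A scikit-learn sketch of these three choices, using a pipeline with feature scaling and a grid search over $C$ and `gamma` (the dataset and parameter grids are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Feature scaling matters for the Gaussian kernel because it is distance-based.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(pipe,
                    {"svc__C": [0.1, 1, 10, 100],       # choice of C
                     "svc__gamma": [0.01, 0.1, 1]},     # choice of sigma via gamma = 1/(2*sigma^2)
                    cv=5)
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.score(X_te, y_te))
```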

There are general rules of thumb for choosing among these options, where $m$ is the number of samples and $n$ the number of features (a rough encoding follows the list):

  • If $n$ is much larger than $m$, i.e. the training set is too small to support a complex nonlinear model, use logistic regression or an SVM without a kernel.

  • If $n$ is small and $m$ is of moderate size, use an SVM with a Gaussian kernel.

  • If $n$ is small and $m$ is large, an SVM with a Gaussian kernel will be very slow; create more features first, and then use logistic regression or an SVM without a kernel.
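These rules can be summed up as a small helper function (the numeric thresholds below are only indicative, not from the course):

```python
def suggest_model(n_features, m_samples):
    """Rough encoding of the rules of thumb above."""
    if n_features >= m_samples:                 # n large relative to m
        return "logistic regression or linear-kernel SVM"
    if m_samples <= 10_000:                     # n small, m moderate
        return "SVM with Gaussian kernel"
    return "add features, then logistic regression or linear-kernel SVM"   # n small, m large

print(suggest_model(n_features=10_000, m_samples=1_000))
print(suggest_model(n_features=100, m_samples=5_000))
print(suggest_model(n_features=100, m_samples=500_000))
```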

A neural network is likely to perform well in all three cases, but it can be slow to train. A major reason to choose the support vector machine is that its cost function is convex, so there are no local minima.

2. Reference resources

https://www.bilibili.com/video/BV164411b7dx?p=70-75

http://www.ai-start.com/ml2014/html/week7.html
