Machine learning 7-Support vector machine
2022-06-29 18:42:00 【Just a】
1. Basic concepts of the SVM model
1.1 Starting from linear classification


Suppose we need to build a classifier that separates the yellow points from the blue points in the figure above. The simplest approach is to choose a straight line in the plane that splits the two classes, so that all yellow points fall on one side and all blue points on the other. There are infinitely many such lines, but which one is optimal?
Clearly, the red dividing line in the middle works better than the blue and green dashed lines: the sample points are, on the whole, far from the red line, which makes it more robust. The blue and green dashed lines, by contrast, each pass close to several sample points, so newly added samples are easily misclassified.
1.2 Basic concepts of the support vector machine (SVM)
Distance from point to hyperplane
In the classification task above, a natural idea for obtaining a robust linear classifier is to find a dividing line such that the samples on both sides are sufficiently far from it. In Euclidean space, the distance from a point $x$ to the line (or, in higher-dimensional space, hyperplane) $w^T x + b = 0$ is:

$$r(x) = \frac{|w^T x + b|}{\|w\|}$$
In the classification problem, if such a line or plane separates the samples exactly, then for every sample $(x_i, y_i) \in D$ with $y_i = \pm 1$: if $y_i = 1$, then $w^T x_i + b \ge 1$; conversely, if $y_i = -1$, then $w^T x_i + b \le -1$.
Support vectors and margin
Samples that satisfy $w^T x_i + b = \pm 1$ lie exactly on these two hyperplanes. Such samples are called "support vectors", and the two hyperplanes are called the maximum-margin boundaries. The sum of the distances from the support vectors of the two classes to the separating plane is

$$\gamma = \frac{2}{\|w\|}$$

and this quantity is called the "margin".
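To make the two formulas concrete, here is a minimal NumPy sketch (the hyperplane and sample point are made-up values, not taken from the original figures) that computes the point-to-hyperplane distance and the margin:

```python
# A minimal sketch: distance r(x) = |w^T x + b| / ||w|| and margin gamma = 2 / ||w||.
import numpy as np

w = np.array([2.0, -1.0])   # hypothetical hyperplane normal
b = 0.5                     # hypothetical bias
x = np.array([1.0, 3.0])    # a sample point

r = np.abs(w @ x + b) / np.linalg.norm(w)   # distance from x to w^T x + b = 0
gamma = 2.0 / np.linalg.norm(w)             # margin between the two boundaries

print(f"distance = {r:.4f}, margin = {gamma:.4f}")
```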

2. Objective function and dual problem of the SVM
2.1 The optimization problem of the support vector machine
Therefore, for completely linearly separable samples, the task of the classification model is to find the hyperplane that maximizes the margin:

$$\max_{w,b} \frac{2}{\|w\|} \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1, \quad i = 1, \dots, m$$

which is equivalent to solving the constrained minimization problem:

$$\min_{w,b} \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1, \quad i = 1, \dots, m$$
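As an illustration, this hard-margin primal can be handed directly to a generic convex solver. The sketch below uses cvxpy on a made-up toy dataset; the choice of tool is ours, not something the post prescribes:

```python
# A sketch of the hard-margin primal: min 1/2 ||w||^2  s.t.  y_i (w^T x_i + b) >= 1.
import cvxpy as cp
import numpy as np

# Toy linearly separable data: two small clusters
X = np.array([[2.0, 2.0], [2.5, 3.0], [-2.0, -2.0], [-3.0, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
```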
2.2 The dual of the optimization problem
Generally speaking, optimization problems with equality or inequality constraints are handled with the Lagrange multiplier method, which transforms the original problem into a dual problem. For the SVM optimization problem, the Lagrangian is:

$$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 + \sum_{i=1}^{m} \alpha_i \left( 1 - y_i (w^T x_i + b) \right), \quad \alpha_i \ge 0$$
Setting the partial derivatives of $L(w, b, \alpha)$ with respect to $w$ and $b$ to zero gives:

$$w = \sum_{i=1}^{m} \alpha_i y_i x_i, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0$$
Substituting these back, the optimization problem finally becomes the dual:

$$\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{s.t.} \quad \sum_{i=1}^{m} \alpha_i y_i = 0, \quad \alpha_i \ge 0$$
Once $\alpha$ is solved, $w$ and $b$ follow, and we have the model. The SMO algorithm is generally used to solve for $\alpha$.
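In practice one rarely implements SMO by hand. Below is a minimal sketch with scikit-learn, whose SVC is backed by libsvm (an SMO-style solver); the toy data is illustrative:

```python
# Fit a linear SVM and read off the learned hyperplane parameters.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.5, 3.0], [-2.0, -2.0], [-3.0, -2.5]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # a very large C approximates the hard margin
clf.fit(X, y)

print("w =", clf.coef_[0], "b =", clf.intercept_[0])
```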
2.3 Support vectors and non-support vectors
Note that $y_i (w^T x_i + b) \ge 1$ is an inequality constraint, so $\alpha_i$ must satisfy $\alpha_i \left( y_i (w^T x_i + b) - 1 \right) = 0$ (the complementary slackness condition for inequality constraints in the KKT conditions). Hence, for every sample $(x_i, y_i)$, either $\alpha_i = 0$ or $y_i (w^T x_i + b) = 1$. So for the SVM's training samples:
- If $\alpha_i = 0$, the sample does not appear in $\sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$, nor in $w = \sum_i \alpha_i y_i x_i$, so it has no effect on the model.
- If $y_i (w^T x_i + b) = 1$, the sample lies exactly on the maximum-margin boundary and is a support vector.
As can be seen, most training samples have no influence on the solution; only the support vectors determine the model.
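This can be checked empirically: a fitted scikit-learn model exposes the support vectors and their signed dual coefficients (attribute names below are scikit-learn's; the toy data is ours):

```python
# Only samples with alpha_i > 0 show up as support vectors; all other samples
# could be deleted without changing the fitted model.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.5, 3.0], [-2.0, -2.0], [-3.0, -2.5]])
y = np.array([1, 1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("support vector indices:", clf.support_)
print("support vectors:\n", clf.support_vectors_)
print("alpha_i * y_i:", clf.dual_coef_)   # signed dual coefficients
```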
3. Soft margin
3.1 Linearly inseparable data
In real-world scenarios, linearly separable data is something one can hope for but rarely count on. Far more common is linearly inseparable data, for which no hyperplane can separate the two classes of samples completely and correctly.
To deal with this, one approach is to allow some samples to be misclassified (but not too many!). A margin that tolerates such misclassification is called a "soft margin". The objective is still to maximize the margin under constraints; the constraint now is that as few samples as possible violate $y_i (w^T x_i + b) \ge 1$.
3.2 Loss function
Based on this idea, we rewrite the optimization objective as

$$\min_{w,b} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \ell_{0/1} \left( y_i (w^T x_i + b) - 1 \right)$$

where $\ell_{0/1}$ is the 0/1 loss and $C > 0$ trades margin width against violations. Because the 0/1 loss is non-convex and discontinuous, it is replaced by a surrogate. Writing $z = y_i (w^T x_i + b)$, the commonly used surrogate losses are:

- hinge loss: $\max(0, 1 - z)$
- exponential loss: $\exp(-z)$
- logistic loss: $\log(1 + \exp(-z))$
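A small sketch comparing the three surrogate losses as functions of the functional margin $z$ (the formulas follow the standard definitions above):

```python
# Evaluate the three surrogate losses on a few margin values z = y_i (w^T x_i + b).
import numpy as np

def hinge(z):        return np.maximum(0.0, 1.0 - z)
def exponential(z):  return np.exp(-z)
def logistic(z):     return np.log(1.0 + np.exp(-z))

z = np.linspace(-2, 2, 5)
print("z          :", z)
print("hinge      :", hinge(z))
print("exponential:", exponential(z))
print("logistic   :", logistic(z))
```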
3.3 Slack variables
When the hinge loss is used, introducing slack variables $\xi_i = \max(0, 1 - y_i (w^T x_i + b))$ turns the problem into

$$\min_{w,b,\xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$$
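Here is a sketch of this slack-variable formulation with cvxpy (again our own choice of solver; the fifth point is made up as an outlier to force a nonzero slack):

```python
# Soft-margin primal: min 1/2 ||w||^2 + C * sum(xi)
#                     s.t. y_i (w^T x_i + b) >= 1 - xi_i, xi_i >= 0.
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [2.5, 3.0], [-2.0, -2.0], [-3.0, -2.5], [0.5, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])   # last point sits on the wrong side
C = 1.0

w, b = cp.Variable(2), cp.Variable()
xi = cp.Variable(len(y), nonneg=True)        # slack variables
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value, "slacks =", xi.value.round(3))
```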
3.4 Solving the soft-margin SVM with slack variables
Setting the partial derivatives of the Lagrangian $L(w, b, \alpha, \xi, \mu)$ with respect to $w$, $b$, and $\xi$ to zero gives:

$$w = \sum_{i=1}^{m} \alpha_i y_i x_i, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0, \qquad C = \alpha_i + \mu_i$$

The resulting dual has the same form as in the hard-margin case, except that each multiplier is now constrained by $0 \le \alpha_i \le C$.
3.5 Support vectors and non-support vectors
The KKT conditions again sort the training samples: those with $\alpha_i = 0$ have no effect on the model; those with $0 < \alpha_i < C$ lie exactly on the maximum-margin boundary; and those with $\alpha_i = C$ lie inside the margin or are misclassified, depending on the value of $\xi_i$.

4. Kernel functions
4.1 From low dimension to high dimension

By mapping the samples into a higher-dimensional feature space with a function $\phi(x)$, data that is linearly inseparable in the original space can become linearly separable.
(Figure: linearly inseparable data in the original space)
(Figure: linearly separable data after the mapping)
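A sketch of this idea with NumPy: points labeled by whether they fall inside a circle are not linearly separable in 2-D, but become separable after an assumed quadratic map $\phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$:

```python
# In the mapped space, z1 + z2 = x1^2 + x2^2, so the plane z1 + z2 = 1 splits
# the inside-circle class from the outside-circle class perfectly.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 1.0, 1, -1)   # inside vs. outside circle

def phi(X):
    return np.column_stack([X[:, 0]**2, X[:, 1]**2, np.sqrt(2) * X[:, 0] * X[:, 1]])

Z = phi(X)
pred = np.where(Z[:, 0] + Z[:, 1] < 1.0, 1, -1)
print("separable in mapped space:", np.all(pred == y))
```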

4.2 Kernel functions
After the mapping, the dual problem involves only inner products $\phi(x_i)^T \phi(x_j)$ in the feature space. Computing $\phi$ explicitly can be expensive or even impossible, so one instead defines a kernel function $\kappa(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ that evaluates this inner product directly in the original space. Common choices include the linear kernel $x_i^T x_j$, the polynomial kernel $(x_i^T x_j + c)^d$, and the Gaussian (RBF) kernel $\exp(-\|x_i - x_j\|^2 / (2\sigma^2))$.
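A small sketch verifying this identity for the quadratic map used above: its feature-space inner product equals the kernel $(x^T z)^2$ evaluated in the original 2-D space:

```python
# The kernel trick in miniature: phi(x)^T phi(z) == (x^T z)^2 for the quadratic map,
# so the dual never needs phi explicitly.
import numpy as np

def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = phi(x) @ phi(z)   # explicit mapping, then inner product
rhs = (x @ z) ** 2      # kernel computed in the original space
print(lhs, rhs)         # both equal 1.0 here
```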

4.3 The choice of kernel function

Some rules of thumb from prior experience:
- If the number of features is much larger than the number of samples, simply use a linear kernel
- If both the number of features and the number of samples are large (e.g., document classification), a linear kernel is generally used
- If the number of features is much smaller than the number of samples, the RBF kernel is usually used

Alternatively, use cross-validation to select the most appropriate kernel, as in the sketch below.
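A sketch of kernel selection by cross-validation with scikit-learn (the dataset and parameter grid are illustrative choices, not from the original post):

```python
# Grid-search over kernels and C with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("best params:", search.best_params_)
print("cv accuracy:", round(search.best_score_, 3))
```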
4.4 Advantages and disadvantages of the SVM model
Advantages:
- Well suited to small-sample classification
- Strong generalization ability
- The optimization problem is convex, so any local optimum is also the global optimum
Disadvantages:
- Computationally expensive; difficult to scale to large training sets
- The output is a hard classification rather than a probability-based soft classification. An SVM can also output probabilities, but the computation is more involved (see the sketch below)
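For reference, a sketch of probability output in scikit-learn: `SVC(probability=True)` fits an extra Platt-scaling calibration step on top of the hard decision, which is why it costs more than a plain fit (dataset is illustrative):

```python
# Hard labels vs. calibrated class probabilities from the same fitted SVM.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = SVC(kernel="rbf", probability=True).fit(X, y)

print("hard labels  :", clf.predict(X[:3]))
print("probabilities:\n", clf.predict_proba(X[:3]).round(3))
```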