
Machine learning from entry to re-entry: a re-understanding of SVM

2022-06-12 07:19:00 Nidiya is trying

Let me start with some rambling. A long time ago I wanted to open a tech blog, but I felt I had no technology worth recording, and with my chronic procrastination on top of that, I never got around to it. Recently it occurred to me that blogging might help me sort out my knowledge and keep a record of my learning; add to that an intermittent fit of AI enthusiasm, and I finally decided to open one, to record recently learned knowledge, AI techniques, and whatever else interests me. Updates will probably be irregular, but I will try to push myself to post in time. (Pep talk to self: tighten up! Get it together!) Of course, as a rookie I am bound to have many shortcomings and make mistakes; I hope you will point them out gently, and I will take your suggestions to heart~~~

Here we go ~~~

The support vector machine is a common machine learning interview question, and I have been grilled on it in interviews several times(!!!). I had studied it before, but not deeply enough, and my foundations were not solid, so now, in my first year of graduate school, I am taking the course again. A few days ago I sorted out several points I had never understood, combining what the teacher covered in class with what other experts have summarized (this is mainly my personal understanding; I won't cover the formula derivations here), and so this rookie's first blog post was born.

I. Development background

1963: the original SVM was proposed;

1992: kernel functions were introduced, solving the nonlinear classification problem;

1995: soft margin maximization was proposed, solving the noise (misclassification) problem;

2001: support vector clustering (SVC) was proposed, extending SVM to unsupervised learning;

2005: a variety of multi-class SVM methods were proposed and compared.

II. The core idea

While ensuring correct classification, make the gap between the hyperplane and the points closest to it (these points are called support vectors) as large as possible. In other words, maximize the margin between the hyperplane and the support vectors; this best guarantees the accuracy of the classification.
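To make the margin idea concrete, here is a minimal sketch using scikit-learn; the toy data, the large C approximating a hard margin, and all variable names are my own illustration, not from the original post:

```python
# A minimal sketch of the max-margin idea with scikit-learn's SVC.
# Toy data and parameters are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # very large C approximates a hard margin

w = clf.coef_[0]
print(clf.support_vectors_)               # the points closest to the hyperplane
print("margin:", 2 / np.linalg.norm(w))   # the quantity SVM maximizes
```

Only the two innermost points come back as support vectors; the printed value 2/‖w‖ is exactly the gap the optimization maximizes.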

III. Comparison with the perceptron model

We all know the perceptron is an old-school (no offense intended) binary linear classification model; it is the foundation of both neural networks and support vector machines. Its idea is to find some hyperplane that separates the dataset. One point worth attention: its loss function is defined only on the misclassified points!!! Its optimization goal is to minimize the distance from the misclassified points to the hyperplane, while the support vector machine's goal is to maximize the distance from the support vectors to the hyperplane. (Put the concepts of "maximum" and "minimum" side by side and the distinction is actually not hard to grasp.)
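The contrast can also be seen directly in code. A small sketch (the synthetic data and all parameter choices are mine) fits both models on the same linearly separable set:

```python
# Perceptron vs. linear SVM on the same separable data: the perceptron stops at
# whatever separating hyperplane it reaches first; the SVM picks the max-margin one.
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

perc = Perceptron().fit(X, y)       # loss counts only misclassified points
svm = LinearSVC(C=10.0).fit(X, y)   # maximizes the margin to the support vectors

print("perceptron:", perc.coef_[0], perc.intercept_[0])
print("linear svm:", svm.coef_[0], svm.intercept_[0])
```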

IV. SVM's optimization objective

When I studied SVM as an undergraduate, I saw many different interpretations of its optimization objective and was thoroughly confused. Since it wasn't on the final exam, I finished the course in a muddle, never dug into it afterwards, and so stayed confused right up to graduation. The main source of confusion was that the objective function sometimes involves the form $[1-y_i(wx_i+b)]_+$ and sometimes the form $\frac{1}{2}\|w\|^2$. Recently the fog has lifted: the two presentations simply introduce, without saying so, the empirical risk and the structural risk respectively.

1. Empirical risk: the empirical risk is the average of the loss function over all sample points in the training set, and empirical risk minimization means minimizing that average. If only the empirical risk is considered, overfitting tends to occur, as with neural networks.

The loss function used by the support vector machine has the form $[1-y_i(wx_i+b)]_+$, where $[z]_+=\begin{cases}0, & z\leqslant 0\\ z, & z>0\end{cases}$ is called the hinge loss function. When a sample point $(x_i,y_i)$ is correctly classified with $y_i(wx_i+b)>1$, its loss is 0.
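Written out in code, the hinge loss is just a positive-part function. A small sketch (the function names and example numbers are mine):

```python
# [z]_+ and the per-sample hinge loss [1 - y(w·x + b)]_+ .
import numpy as np

def positive_part(z):
    """[z]_+ : 0 for z <= 0, z for z > 0."""
    return np.maximum(z, 0.0)

def hinge_loss(w, b, x, y):
    """Zero exactly when the sample is correctly classified with y(w·x + b) > 1."""
    return positive_part(1.0 - y * (np.dot(w, x) + b))

w, b = np.array([1.0, 1.0]), -4.0
print(hinge_loss(w, b, np.array([3.0, 3.0]), +1))  # y(wx+b) = 2   -> loss 0.0
print(hinge_loss(w, b, np.array([2.0, 2.5]), +1))  # y(wx+b) = 0.5 -> loss 0.5
```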

If we want to know how well the model $f(x)$ predicts all the samples in the training set, we only need to compute the loss of every sample point once and add them up, i.e. $R=\frac{1}{N}\sum_{i=1}^{N}l(y_i,f(x_i))$, where $l(y_i,f(x_i))$ is the hinge loss function and $R$ is the empirical risk. The smaller the empirical risk, the better the model $f(x)$ fits the training set. So-called empirical risk minimization is simply minimizing this formula.

2. Structural risk: adding a regularization term (penalty term) after the empirical risk function gives the structural risk. (We need to strike a balance between training error and model complexity, to prevent overfitting and reduce the generalization error, hence the regularization term.) The optimization objective then becomes $L=\min\sum_{i=1}^{N}\left[1-y_i(wx_i+b)\right]_+ + \lambda\|w\|^2$.
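As a sketch of the objective just written down (the hinge loss is restated for self-containment; the value of λ and the toy numbers are choices of mine):

```python
# Structural risk: empirical hinge loss plus the complexity penalty λ‖w‖².
import numpy as np

def hinge_loss(w, b, x, y):
    # per-sample hinge loss [1 - y(w·x + b)]_+ , as defined above
    return max(0.0, 1.0 - y * (np.dot(w, x) + b))

def structural_risk(w, b, X, y, lam):
    empirical = sum(hinge_loss(w, b, xi, yi) for xi, yi in zip(X, y))
    return empirical + lam * np.dot(w, w)   # the penalty discourages large ‖w‖

X = np.array([[3.0, 3.0], [2.0, 2.5], [0.0, 1.0]])
y = np.array([1, 1, -1])
print(structural_risk(np.array([1.0, 1.0]), -4.0, X, y, lam=0.1))  # 0.5 + 0.2 = 0.7
```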

This structural risk is equivalent to SVM's original optimization objective $L=\min\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i$, s.t. $y_i(wx_i+b)\geqslant 1-\xi_i$ and $\xi_i\geqslant 0$ for $i=1,2,\dots,N$.

(For the proof, see p. 131 of Li Hang's Statistical Learning Methods.)

[Addendum: after consulting some references, statistical learning theory says that a learning machine's actual risk consists of two parts: the empirical risk on the training samples and a confidence interval. The SVM idea stated earlier, solving for the separating hyperplane that correctly divides the training set with the largest geometric margin, works precisely because correctly separating the training data keeps the empirical risk minimal, while maximizing the margin keeps the confidence interval minimal; together these yield a smaller actual risk and hence an optimal structural risk. By trading off the empirical risk against the confidence interval, SVM gains good generalization ability. Therefore, the support vector machine is a classifier designed by the structural risk minimization criterion.]

(For more on the relationship between empirical risk, structural risk, and the confidence interval, see the CSDN blog post "The way of machine learning (7): support vector machines" by asd2479745295.)

V. Why it is called a support vector machine

We know that after introducing the Lagrange multipliers, $w$ has the expression $w=\sum_{i=1}^{m}\alpha_i y_i x_i$. From the KKT conditions, when $y_i f(x_i)>1$ we have $\alpha_i=0$. So for the samples that are correctly classified with room to spare (in the post's figure pic1, the sample points inside the blue circles, lying beyond the margin on both sides), their weight contribution $\alpha_i$ to $w$ is 0. That is, non-support-vector samples do not affect the value of $w$ at all; $w$ depends only on the support vectors. Hence the name support vector machine.
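This can be checked numerically. A sketch with scikit-learn (the data and parameters are mine; in scikit-learn, dual_coef_ stores $\alpha_i y_i$ for each support vector):

```python
# Verify w = Σ α_i y_i x_i : only support vectors (α_i > 0) contribute to w.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.6, (25, 2)), rng.normal(3, 0.6, (25, 2))])
y = np.array([-1] * 25 + [1] * 25)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds α_i * y_i, one column per support vector
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))  # True: non-SVs played no part in w
```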

VI. Kernel functions

1. What it does: everyone is familiar with the role of the kernel function: it "maps" the samples into a high-dimensional space (this is not a real mapping, as the second point will show), so that samples which are not linearly separable in the low-dimensional feature space become linearly separable in the high-dimensional one.

2. The principle: so how does it work? Observe that both the SVM optimization function and its prediction function contain terms of the form $a^Tb$, the inner product of two vectors. If $\varphi(x)$ denotes a mapping from the low-dimensional space to a high-dimensional one, then $x$ becomes $\varphi(x)$ after mapping, and those terms take the form $\varphi(x)^T\varphi(z)$. But computing such inner products directly in the high-dimensional space would be far too complicated, and so the kernel function was born. Define the kernel function $k(x,z)=\varphi(x)^T\varphi(z)$: the inner product of $x$ and $z$ in the high-dimensional feature space equals the value obtained by evaluating $k(x,z)$ in the original sample space. In other words, the kernel function is a simple way of computing the post-mapping inner product: the computation stays in the low dimension, while the classification effect shows up in the high dimension.

I had always thought that a kernel function was the mapping function that sends features into the high-dimensional space; only now do I realize that it is in essence a simple way of computing, directly in the low-dimensional space, the inner product of two vectors in the high-dimensional space.
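A worked example makes this concrete. For the polynomial kernel $k(x,z)=(x^Tz)^2$ on 2-D inputs, the explicit map is $\varphi(x)=(x_1^2,\ \sqrt{2}\,x_1x_2,\ x_2^2)$; the small sketch below (my own illustration) checks that both routes give the same number:

```python
# Kernel trick: compute in low dimension, get the high-dimensional inner product.
import numpy as np

def phi(x):
    """Explicit feature map into 3-D space for the kernel k(x, z) = (x·z)²."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    """Same inner product, computed entirely in the original 2-D space."""
    return np.dot(x, z) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(phi(x), phi(z)))  # 16.0, via the explicit high-dimensional map
print(k(x, z))                 # 16.0, without ever forming φ(x) or φ(z)
```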

That is all of my partial understanding of support vector machines. As for the other concepts, such as the functional margin, the geometric margin, the dual problem, and why the number 1 appears in so many of these formulas, the watermelon book (Zhou Zhihua's Machine Learning) explains them in detail, so I won't repeat them. On the whole this article has little technical depth; the main goal is to be easy to understand~

Suggestions and comments are very welcome!!!!!

Malicious criticism not accepted; fine, fine, you're right!

See you in the next post~~~

------------------------ Updated 2022.1.9 ------------------------

VII. Semi-supervised support vector machines

1. Semi-supervised learning: let the learner automatically exploit unlabeled samples for self-learning, improving its performance. Here it can be understood as adding unlabeled samples into the supervised SVM classification algorithm to achieve semi-supervised classification.

2. The idea of the algorithm: iteratively solve the optimization objective

$L=\min\frac{1}{2}\|w\|_2^2 + C_l\sum_{i=1}^{l}\xi_i + C_u\sum_{i=l+1}^{m}\xi_i$, s.t. $y_i(w^Tx_i+b)\geqslant 1-\xi_i$ for $i=1,2,\dots,l$; $\hat{y}_i(w^Tx_i+b)\geqslant 1-\xi_i$ for $i=l+1,l+2,\dots,m$; $\xi_i\geqslant 0$ for $i=1,2,\dots,m$.

where $\xi_i$ ($i=1,2,\dots,l$) are the slack variables of the labeled samples, $\xi_i$ ($i=l+1,l+2,\dots,m$) are those of the unlabeled samples (whose $\hat{y}_i$ are pseudo-labels), and $C_l$, $C_u$ are parameters balancing the influence of labeled and unlabeled data on the result. (Because the predicted pseudo-labels can deviate considerably from the true labels, the labeled samples should carry a larger weight when solving for the hyperplane, so at initialization $C_u$ should be set to a value smaller than $C_l$.)

3. Algorithm steps (a code sketch follows this list):
1) use the labeled samples to train a "master" SVM;
2) use this master SVM to predict the unlabeled data, taking the predictions as pseudo-labels;
3) plug everything into $L$ and solve for the hyperplane and the slack variables;
4) find pairs of unlabeled samples that are assigned opposite classes (the product of their labels is negative) and are both likely to be wrong (the sum of their slack variables is greater than 2), and swap their labels;
5) re-solve the separating hyperplane and the slack variables from the optimization objective $L$;
6) gradually increase $C_u$ until $C_u = C_l$, then stop the algorithm.
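Here is a compact sketch of that loop, under stated assumptions: labels are ±1, scikit-learn's sample_weight plays the role of $C_l$ and $C_u$, and the re-solve after every single swap is batched into one pass per outer iteration for brevity. The helper names and parameter values are mine, not a standard API:

```python
# Sketch of the semi-supervised (TSVM-style) iteration described above.
import numpy as np
from sklearn.svm import SVC

def tsvm_sketch(X_l, y_l, X_u, C_l=1.0, C_u=0.01):
    clf = SVC(kernel="linear", C=C_l).fit(X_l, y_l)   # 1. "master" SVM on labeled data
    y_u = clf.predict(X_u)                            # 2. pseudo-labels (assumed ±1)
    X = np.vstack([X_l, X_u])
    while True:
        # 3./5. solve with both sets; C_u < C_l down-weights the pseudo-labeled part
        sw = np.concatenate([np.full(len(y_l), C_l), np.full(len(y_u), C_u)])
        clf = SVC(kernel="linear", C=1.0).fit(
            X, np.concatenate([y_l, y_u]), sample_weight=sw)
        # slack of each unlabeled sample: ξ_i = [1 - ŷ_i f(x_i)]_+
        xi = np.maximum(0.0, 1.0 - y_u * clf.decision_function(X_u))
        # 4. swap pairs with opposite pseudo-labels that are both likely wrong
        #    (the textbook re-solves after each swap; batched here for brevity)
        for i in range(len(y_u)):
            for j in range(i + 1, len(y_u)):
                if y_u[i] * y_u[j] < 0 and xi[i] + xi[j] > 2.0:
                    y_u[i], y_u[j] = y_u[j], y_u[i]
        # 6. raise C_u gradually; stop once it has reached C_l
        if C_u >= C_l:
            return clf
        C_u = min(2.0 * C_u, C_l)
```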

4. Some improvements: because the unlabeled samples must be adjusted pair by pair, the amount of computation involved is very large, so there is also research on optimizing this step.


Copyright notice: this article was created by [Nidiya is trying]. When reposting, please include the original link, thanks:
https://yzsam.com/2022/03/202203010557068121.html