
Machine learning from entry to re-entry: a re-understanding of SVM

2022-06-12 07:19:00 Nidiya is trying

Let me start with some rambling. A long time ago I wanted to open a tech blog, but I felt I had no technology worth recording, and with my chronic procrastination on top of that, I never got around to it. Recently it occurred to me that blogging might help me sort out my knowledge and keep a record of my learning; add to that an intermittent fit of AI enthusiasm, and I finally decided to open one, to record recently learned knowledge, AI techniques, and whatever else interests me. Updates will probably be irregular, but I will try to push myself to post in time. (Pep talk to self: tighten up! Get it together!) Of course, as a rookie I am bound to have many shortcomings and make mistakes; I hope you will point them out gently, and I will take your suggestions to heart~~~

Here we go ~~~

The support vector machine is a common machine learning interview question, and I have been grilled on it in interviews several times(!!!). I had studied it before, but not deeply enough, and my foundations were not solid, so now, in my first year of graduate school, I am taking the course again. A few days ago I sorted out several points I had never understood, combining what the teacher covered in class with what other experts have summarized (this is mainly my personal understanding; I won't cover the formula derivations here), and so this rookie's first blog post was born.

I. Development background

1963: the original SVM was proposed;

1992: kernel functions were introduced, solving the nonlinear classification problem;

1995: soft margin maximization was proposed, solving the noise (misclassification) problem;

2001: support vector clustering (SVC) was proposed, extending SVM to unsupervised learning;

2005: a variety of multi-class SVM methods were proposed and compared.

II. The core idea

While ensuring correct classification, make the gap between the hyperplane and the points closest to it (these points are called support vectors) as large as possible. In other words, maximize the margin between the hyperplane and the support vectors; this best guarantees the accuracy of the classification.
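To make the margin idea concrete, here is a minimal sketch using scikit-learn; the toy data, the large C approximating a hard margin, and all variable names are my own illustration, not from the original post:

```python
# A minimal sketch of the max-margin idea with scikit-learn's SVC.
# Toy data and parameters are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # very large C approximates a hard margin

w = clf.coef_[0]
print(clf.support_vectors_)               # the points closest to the hyperplane
print("margin:", 2 / np.linalg.norm(w))   # the quantity SVM maximizes
```

Only the two innermost points come back as support vectors; the printed value 2/‖w‖ is exactly the gap the optimization maximizes.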

III. Comparison with the perceptron model

We all know the perceptron is an old-school (no offense intended) binary linear classification model; it is the foundation of both neural networks and support vector machines. Its idea is to find some hyperplane that separates the dataset. One point worth attention: its loss function is defined only on the misclassified points!!! Its optimization goal is to minimize the distance from the misclassified points to the hyperplane, while the support vector machine's goal is to maximize the distance from the support vectors to the hyperplane. (Put the concepts of "maximum" and "minimum" side by side and the distinction is actually not hard to grasp.)
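The contrast can also be seen directly in code. A small sketch (the synthetic data and all parameter choices are mine) fits both models on the same linearly separable set:

```python
# Perceptron vs. linear SVM on the same separable data: the perceptron stops at
# whatever separating hyperplane it reaches first; the SVM picks the max-margin one.
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

perc = Perceptron().fit(X, y)       # loss counts only misclassified points
svm = LinearSVC(C=10.0).fit(X, y)   # maximizes the margin to the support vectors

print("perceptron:", perc.coef_[0], perc.intercept_[0])
print("linear svm:", svm.coef_[0], svm.intercept_[0])
```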

IV. SVM's optimization objective

When I studied SVM as an undergraduate, I saw many different interpretations of its optimization objective and was thoroughly confused. Since it wasn't on the final exam, I finished the course in a muddle, never dug into it afterwards, and so stayed confused right up to graduation. The main source of confusion was that the objective function sometimes involves the form $[1-y_i(wx_i+b)]_+$ and sometimes the form $\frac{1}{2}\|w\|^2$. Recently the fog has lifted: the two presentations simply introduce, without saying so, the empirical risk and the structural risk respectively.

1. Empirical risk: the empirical risk is the average of the loss function over all sample points in the training set, and empirical risk minimization means minimizing that average. If only the empirical risk is considered, overfitting tends to occur, as with neural networks.

The loss function used by the support vector machine has the form $[1-y_i(wx_i+b)]_+$, where $[z]_+=\begin{cases}0, & z\leqslant 0\\ z, & z>0\end{cases}$ is called the hinge loss function. When a sample point $(x_i,y_i)$ is correctly classified with $y_i(wx_i+b)>1$, its loss is 0.
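Written out in code, the hinge loss is just a positive-part function. A small sketch (the function names and example numbers are mine):

```python
# [z]_+ and the per-sample hinge loss [1 - y(w·x + b)]_+ .
import numpy as np

def positive_part(z):
    """[z]_+ : 0 for z <= 0, z for z > 0."""
    return np.maximum(z, 0.0)

def hinge_loss(w, b, x, y):
    """Zero exactly when the sample is correctly classified with y(w·x + b) > 1."""
    return positive_part(1.0 - y * (np.dot(w, x) + b))

w, b = np.array([1.0, 1.0]), -4.0
print(hinge_loss(w, b, np.array([3.0, 3.0]), +1))  # y(wx+b) = 2   -> loss 0.0
print(hinge_loss(w, b, np.array([2.0, 2.5]), +1))  # y(wx+b) = 0.5 -> loss 0.5
```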

If we want to know how well the model $f(x)$ predicts all the samples in the training set, we only need to compute the loss of every sample point once and add them up, i.e. $R=\frac{1}{N}\sum_{i=1}^{N}l(y_i,f(x_i))$, where $l(y_i,f(x_i))$ is the hinge loss function and $R$ is the empirical risk. The smaller the empirical risk, the better the model $f(x)$ fits the training set. So-called empirical risk minimization is simply minimizing this formula.

2. Structural risk: adding a regularization term (penalty term) after the empirical risk function gives the structural risk. (We need to strike a balance between training error and model complexity, to prevent overfitting and reduce the generalization error, hence the regularization term.) The optimization objective then becomes $L=\min\sum_{i=1}^{N}\left[1-y_i(wx_i+b)\right]_+ + \lambda\|w\|^2$.
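As a sketch of the objective just written down (the hinge loss is restated for self-containment; the value of λ and the toy numbers are choices of mine):

```python
# Structural risk: empirical hinge loss plus the complexity penalty λ‖w‖².
import numpy as np

def hinge_loss(w, b, x, y):
    # per-sample hinge loss [1 - y(w·x + b)]_+ , as defined above
    return max(0.0, 1.0 - y * (np.dot(w, x) + b))

def structural_risk(w, b, X, y, lam):
    empirical = sum(hinge_loss(w, b, xi, yi) for xi, yi in zip(X, y))
    return empirical + lam * np.dot(w, w)   # the penalty discourages large ‖w‖

X = np.array([[3.0, 3.0], [2.0, 2.5], [0.0, 1.0]])
y = np.array([1, 1, -1])
print(structural_risk(np.array([1.0, 1.0]), -4.0, X, y, lam=0.1))  # 0.5 + 0.2 = 0.7
```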

This structural risk is equivalent to SVM's original optimization objective $L=\min\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i$, s.t. $y_i(wx_i+b)\geqslant 1-\xi_i$ and $\xi_i\geqslant 0$ for $i=1,2,\dots,N$.

(For the proof, see p. 131 of Li Hang's Statistical Learning Methods.)

[Addendum: after consulting some references, statistical learning theory says that a learning machine's actual risk consists of two parts: the empirical risk on the training samples and a confidence interval. The SVM idea stated earlier, solving for the separating hyperplane that correctly divides the training set with the largest geometric margin, works precisely because correctly separating the training data keeps the empirical risk minimal, while maximizing the margin keeps the confidence interval minimal; together these yield a smaller actual risk and hence an optimal structural risk. By trading off the empirical risk against the confidence interval, SVM gains good generalization ability. Therefore, the support vector machine is a classifier designed by the structural risk minimization criterion.]

(For more on the relationship between empirical risk, structural risk, and the confidence interval, see the CSDN blog post "The way of machine learning (7): support vector machines" by asd2479745295.)

V. Why it is called a support vector machine

We know that after introducing the Lagrange multipliers, $w$ has the expression $w=\sum_{i=1}^{m}\alpha_i y_i x_i$. From the KKT conditions, when $y_i f(x_i)>1$ we have $\alpha_i=0$. So for the samples that are correctly classified with room to spare (in the post's figure pic1, the sample points inside the blue circles, lying beyond the margin on both sides), their weight contribution $\alpha_i$ to $w$ is 0. That is, non-support-vector samples do not affect the value of $w$ at all; $w$ depends only on the support vectors. Hence the name support vector machine.
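This can be checked numerically. A sketch with scikit-learn (the data and parameters are mine; in scikit-learn, dual_coef_ stores $\alpha_i y_i$ for each support vector):

```python
# Verify w = Σ α_i y_i x_i : only support vectors (α_i > 0) contribute to w.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.6, (25, 2)), rng.normal(3, 0.6, (25, 2))])
y = np.array([-1] * 25 + [1] * 25)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds α_i * y_i, one column per support vector
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))  # True: non-SVs played no part in w
```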

VI. Kernel functions

1. What it does: everyone is familiar with the role of the kernel function: it "maps" the samples into a high-dimensional space (this is not a real mapping, as the second point will show), so that samples which are not linearly separable in the low-dimensional feature space become linearly separable in the high-dimensional one.

2. The principle: so how does it work? Observe that both the SVM optimization function and its prediction function contain terms of the form $a^Tb$, the inner product of two vectors. If $\varphi(x)$ denotes a mapping from the low-dimensional space to a high-dimensional one, then $x$ becomes $\varphi(x)$ after mapping, and those terms take the form $\varphi(x)^T\varphi(z)$. But computing such inner products directly in the high-dimensional space would be far too complicated, and so the kernel function was born. Define the kernel function $k(x,z)=\varphi(x)^T\varphi(z)$: the inner product of $x$ and $z$ in the high-dimensional feature space equals the value obtained by evaluating $k(x,z)$ in the original sample space. In other words, the kernel function is a simple way of computing the post-mapping inner product: the computation stays in the low dimension, while the classification effect shows up in the high dimension.

I had always thought that a kernel function was the mapping function that sends features into the high-dimensional space; only now do I realize that it is in essence a simple way of computing, directly in the low-dimensional space, the inner product of two vectors in the high-dimensional space.
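A worked example makes this concrete. For the polynomial kernel $k(x,z)=(x^Tz)^2$ on 2-D inputs, the explicit map is $\varphi(x)=(x_1^2,\ \sqrt{2}\,x_1x_2,\ x_2^2)$; the small sketch below (my own illustration) checks that both routes give the same number:

```python
# Kernel trick: compute in low dimension, get the high-dimensional inner product.
import numpy as np

def phi(x):
    """Explicit feature map into 3-D space for the kernel k(x, z) = (x·z)²."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    """Same inner product, computed entirely in the original 2-D space."""
    return np.dot(x, z) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(phi(x), phi(z)))  # 16.0, via the explicit high-dimensional map
print(k(x, z))                 # 16.0, without ever forming φ(x) or φ(z)
```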

That is all of my partial understanding of support vector machines. As for the other concepts, such as the functional margin, the geometric margin, the dual problem, and why the number 1 appears in so many of these formulas, the watermelon book (Zhou Zhihua's Machine Learning) explains them in detail, so I won't repeat them. On the whole this article has little technical depth; the main goal is to be easy to understand~

Suggestions and comments are very welcome!!!!!

Malicious criticism not accepted; fine, fine, you're right!

See you in the next post~~~

------------------------ Updated 2022.1.9 ------------------------

VII. Semi-supervised support vector machines

1. Semi-supervised learning: let the learner automatically exploit unlabeled samples for self-learning, improving its performance. Here it can be understood as adding unlabeled samples into the supervised SVM classification algorithm to achieve semi-supervised classification.

2. The idea of the algorithm: iteratively solve the optimization objective

$L=\min\frac{1}{2}\|w\|_2^2 + C_l\sum_{i=1}^{l}\xi_i + C_u\sum_{i=l+1}^{m}\xi_i$, s.t. $y_i(w^Tx_i+b)\geqslant 1-\xi_i$ for $i=1,2,\dots,l$; $\hat{y}_i(w^Tx_i+b)\geqslant 1-\xi_i$ for $i=l+1,l+2,\dots,m$; $\xi_i\geqslant 0$ for $i=1,2,\dots,m$.

where $\xi_i$ ($i=1,2,\dots,l$) are the slack variables of the labeled samples, $\xi_i$ ($i=l+1,l+2,\dots,m$) are those of the unlabeled samples (whose $\hat{y}_i$ are pseudo-labels), and $C_l$, $C_u$ are parameters balancing the influence of labeled and unlabeled data on the result. (Because the predicted pseudo-labels can deviate considerably from the true labels, the labeled samples should carry a larger weight when solving for the hyperplane, so at initialization $C_u$ should be set to a value smaller than $C_l$.)

3. Algorithm steps (a code sketch follows this list):
1) use the labeled samples to train a "master" SVM;
2) use this master SVM to predict the unlabeled data, taking the predictions as pseudo-labels;
3) plug everything into $L$ and solve for the hyperplane and the slack variables;
4) find pairs of unlabeled samples that are assigned opposite classes (the product of their labels is negative) and are both likely to be wrong (the sum of their slack variables is greater than 2), and swap their labels;
5) re-solve the separating hyperplane and the slack variables from the optimization objective $L$;
6) gradually increase $C_u$ until $C_u = C_l$, then stop the algorithm.
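Here is a compact sketch of that loop, under stated assumptions: labels are ±1, scikit-learn's sample_weight plays the role of $C_l$ and $C_u$, and the re-solve after every single swap is batched into one pass per outer iteration for brevity. The helper names and parameter values are mine, not a standard API:

```python
# Sketch of the semi-supervised (TSVM-style) iteration described above.
import numpy as np
from sklearn.svm import SVC

def tsvm_sketch(X_l, y_l, X_u, C_l=1.0, C_u=0.01):
    clf = SVC(kernel="linear", C=C_l).fit(X_l, y_l)   # 1. "master" SVM on labeled data
    y_u = clf.predict(X_u)                            # 2. pseudo-labels (assumed ±1)
    X = np.vstack([X_l, X_u])
    while True:
        # 3./5. solve with both sets; C_u < C_l down-weights the pseudo-labeled part
        sw = np.concatenate([np.full(len(y_l), C_l), np.full(len(y_u), C_u)])
        clf = SVC(kernel="linear", C=1.0).fit(
            X, np.concatenate([y_l, y_u]), sample_weight=sw)
        # slack of each unlabeled sample: ξ_i = [1 - ŷ_i f(x_i)]_+
        xi = np.maximum(0.0, 1.0 - y_u * clf.decision_function(X_u))
        # 4. swap pairs with opposite pseudo-labels that are both likely wrong
        #    (the textbook re-solves after each swap; batched here for brevity)
        for i in range(len(y_u)):
            for j in range(i + 1, len(y_u)):
                if y_u[i] * y_u[j] < 0 and xi[i] + xi[j] > 2.0:
                    y_u[i], y_u[j] = y_u[j], y_u[i]
        # 6. raise C_u gradually; stop once it has reached C_l
        if C_u >= C_l:
            return clf
        C_u = min(2.0 * C_u, C_l)
```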

4. Some improvements: because the unlabeled samples must be adjusted pair by pair, the amount of computation involved is very large, so there is also research on optimizing this step.


Copyright notice: this article was created by [Nidiya is trying]. When reposting, please include the original link, thanks:
https://yzsam.com/2022/03/202203010557068121.html