Support Vector Machine (SVM) + Source Code
2022-06-11 04:48:00 【Panbohhhhh】
Disclaimer: this is a personal study note, limited by my time and ability, and is for reference only. Discussion is welcome.
The support vector machine (SVM) is a powerful discriminative model for classification and regression.
First we revisit mapping features to higher-dimensional spaces.
Then we discuss how the support vector machine mitigates the computational and generalization problems that arise when learning from data mapped to a high-dimensional space.
1.1 Kernels and the kernel trick
The perceptron uses a hyperplane as its decision boundary to separate positive instances from negative instances.
The decision boundary is given by:
$f(x) = \langle w, x \rangle + b$
Predictions are made with:
$h(x) = \mathrm{sign}(f(x))$
This is called the primal form.
In support vector machines we instead work with the dual form:
$f(x) = \langle w, x \rangle + b = \sum_i a_i y_i \langle x_i, x \rangle + b$
The key difference between the primal and dual forms is that the primal form computes the inner product of the model parameters with the test instance's feature vector, whereas the dual form computes the inner products of the training instances' feature vectors with the test instance's feature vector.
It is exactly this property that lets us handle data that is not linearly separable.
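To make the difference concrete, here is a minimal sketch (my own toy data and variable names, not from the original post) that fits a linear SVM with scikit-learn and checks that the primal decision function $\langle w, x \rangle + b$ agrees with the dual form $\sum_i a_i y_i \langle x_i, x \rangle + b$; scikit-learn stores the products $a_i y_i$ for the support vectors in dual_coef_.
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (illustrative only)
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])
clf = SVC(kernel='linear', C=1.0).fit(X, y)

x_new = np.array([4.0, 4.0])

# Primal form: f(x) = <w, x> + b
f_primal = clf.coef_[0] @ x_new + clf.intercept_[0]

# Dual form: f(x) = sum_i a_i * y_i * <x_i, x> + b, summed over the support vectors
f_dual = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new) + clf.intercept_[0]

print(f_primal, f_dual)                 # the two values agree
print(clf.decision_function([x_new]))   # and match scikit-learn's own output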
First, let us formalize the idea of mapping features to a higher-dimensional space.
We map the features into a space in which they are linearly related to the response variable. The mapping increases the number of features, for example by creating quadratic terms from the original features, and these synthetic features allow a linear model to represent a nonlinear relationship.
In general, a mapping is written as:
$x \rightarrow \phi(x), \qquad \phi: \mathbb{R}^{d} \rightarrow \mathbb{R}^{D}$
The effect is illustrated in the figures below [1]:

In the second picture, the green plane perfectly separates the purple points from the red ones: the data has become separable.
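As an illustrative sketch (my own toy example, not from the source post), one-dimensional data that cannot be split by a single threshold becomes linearly separable after the simple mapping $\phi(x) = (x, x^2)$:
import numpy as np
from sklearn.svm import SVC

# 1-D data: negative instances in the middle, positive instances on both sides,
# so no single threshold on x can separate the classes
x = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0]).reshape(-1, 1)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1])

# Explicit mapping phi(x) = (x, x^2): in 2-D the classes can be split by a line
X_mapped = np.hstack([x, x ** 2])

clf = SVC(kernel='linear', C=1000.0).fit(X_mapped, y)
print(clf.score(X_mapped, y))   # 1.0 -- perfectly separable in the mapped space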
Returning to the dual form of the decision boundary, notice that the feature vectors only ever appear inside dot products. We can map the data to a higher-dimensional space by applying the mapping to the feature vectors:
$f(x) = \sum_i a_i y_i \langle x_i, x \rangle + b$
$f(x) = \sum_i a_i y_i \langle \phi(x_i), \phi(x) \rangle + b$
As noted above, the mapping lets us represent more complex models, but it also introduces computational and generalization problems.
In particular, mapping the feature vectors and then computing their dot products requires a great deal of processing power.
Note, however, that even after mapping to a higher-dimensional space, the feature vectors still appear only inside dot products. The result of a dot product is a scalar, and once that scalar has been computed we no longer need the mapped feature vectors.
If we could obtain that same scalar by a different route, without ever computing the mapped vectors, the problem would become much easier.
The kernel trick
A kernel is exactly such a function: given only the original feature vectors, it returns the same value as the dot product of their mapped counterparts. The kernel does not explicitly map the feature vectors to the higher-dimensional space, nor does it compute the dot product of the mapped vectors.
Instead, the kernel produces the same value through a different sequence of operations that can be computed efficiently.
A kernel is defined as follows:
$K(x, z) = \langle \phi(x), \phi(z) \rangle$
Let us verify this with a concrete mapping.
Suppose we have two feature vectors $x$ and $z$:
$x = (x_1, x_2), \qquad z = (z_1, z_2)$
Suppose we want to map the feature vectors into a higher-dimensional space via:
$\phi(x) = (x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2)$
As described earlier, the dot product of the mapped feature vectors is:
$\langle \phi(x), \phi(z) \rangle = \langle (x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2),\; (z_1^2,\; z_2^2,\; \sqrt{2}\,z_1 z_2) \rangle$
The following kernel function produces a value equal to the dot product of the mapped feature vectors:
$K(x, z) = \langle x, z \rangle^{2} = (x_1 z_1 + x_2 z_2)^{2} = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2$
$K(x, z) = \langle \phi(x), \phi(z) \rangle$
To make the formula concrete, here is a small worked example of my own (written in plain arithmetic rather than with the formula editor, for convenience):
x = (4, 9), z = (3, 3)
$K(x, z) = \langle x, z \rangle^{2} = (4 \cdot 3 + 9 \cdot 3)^{2} = 4^2 \cdot 3^2 + 2 \cdot 4 \cdot 3 \cdot 9 \cdot 3 + 9^2 \cdot 3^2 = 144 + 648 + 729 = 1521$
$\langle \phi(x), \phi(z) \rangle = \langle (4^2,\; 9^2,\; \sqrt{2} \cdot 4 \cdot 9),\; (3^2,\; 3^2,\; \sqrt{2} \cdot 3 \cdot 3) \rangle = 144 + 729 + 648 = 1521$
The kernel function $K(x, z)$ produces the same value as the dot product of the mapped vectors $\langle \phi(x), \phi(z) \rangle$, yet it never explicitly maps the feature vectors into the high-dimensional space, and it requires far fewer arithmetic operations.
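The same check can be done in a few lines of NumPy (a minimal sketch using the toy vectors from above):
import numpy as np

x = np.array([4.0, 9.0])
z = np.array([3.0, 3.0])

# Explicit mapping phi(v) = (v1^2, v2^2, sqrt(2) * v1 * v2)
def phi(v):
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

explicit = phi(x) @ phi(z)    # dot product computed in the mapped 3-D space
kernel = (x @ z) ** 2         # kernel trick: K(x, z) = <x, z>^2, no mapping needed

print(explicit, kernel)       # both print 1521.0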
In this example only two-dimensional vectors were involved, yet even datasets with a small number of features can generate mapped feature spaces of enormous dimension.
The scikit-learn library provides several commonly used kernel functions:
Polynomial kernel: $K(x, z) = (\gamma \langle x, z \rangle + \delta)^{k}$
The quadratic kernel, that is, the polynomial kernel with $k$ equal to 2, is commonly used in natural language processing.
Sigmoid kernel: $K(x, z) = \tanh(\gamma \langle x, z \rangle + \delta)$, where $\gamma$ and $\delta$ are hyperparameters that can be tuned with cross-validation.
For problems that call for a nonlinear model, the Gaussian kernel is a good first choice. The Gaussian kernel is a radial basis function; a decision boundary that is a hyperplane in the mapped feature space corresponds to a decision boundary resembling a hypersphere in the original space.
The feature space produced by the Gaussian kernel can have infinitely many dimensions, something no explicit mapping can achieve.
The Gaussian kernel is defined as:
$K(x, z) = \exp(-\gamma \lVert x - z \rVert^{2})$
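As a rough correspondence between these formulas and scikit-learn's SVC parameters (gamma plays the role of $\gamma$, coef0 of $\delta$, degree of $k$; the hyperparameter values below are arbitrary placeholders of my own):
from sklearn.svm import SVC

# Polynomial kernel: K(x, z) = (gamma * <x, z> + coef0) ** degree
poly_svm = SVC(kernel='poly', degree=2, gamma=1.0, coef0=1.0)

# Sigmoid kernel: K(x, z) = tanh(gamma * <x, z> + coef0)
sigmoid_svm = SVC(kernel='sigmoid', gamma=0.1, coef0=0.0)

# Gaussian (RBF) kernel: K(x, z) = exp(-gamma * ||x - z||^2)
rbf_svm = SVC(kernel='rbf', gamma=0.1)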
Feature scaling is important whenever you use an SVM, and it matters especially when using the Gaussian kernel.
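For example, a minimal sketch that puts a scaler in front of an RBF-kernel SVM (my own illustration; the dataset and hyperparameters are arbitrary):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=11)

# Standardize features to zero mean and unit variance before the RBF-kernel SVM
model = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', SVC(kernel='rbf', gamma='scale', C=1.0)),
])
model.fit(X_train, y_train)
print(model.score(X_test, y_test))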
Choosing a kernel function is difficult.
Ideally, a kernel should measure the similarity between instances in a way that is useful for the task at hand.
Although kernels are most often used with SVMs, they can be combined with any model that can be expressed in terms of the dot product of two feature vectors,
including but not limited to logistic regression, the perceptron, and principal component analysis (PCA). (I will follow up on these later.)
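For instance, scikit-learn's KernelPCA applies the same idea to PCA; a brief sketch of my own (this topic is deferred to a later post):
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Kernel PCA with an RBF kernel projects the data into a space where
# the two rings can be separated by a straight line
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)   # (400, 2)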
1.2 Maximum-margin classification and support vectors
Again, let us start from a concrete example.
The three lines in the figure are three classifiers, each of which separates the two classes of samples.
Intuitively, the red line looks best.
But as engineers we want rigor and an argument, not just intuition.

Suppose the line in the figure is the decision boundary of a logistic regression classifier. The instance labeled A is far from the decision boundary, so it is predicted to belong to the positive class with high probability. The instance labeled B is also predicted to be positive, but the closer an instance lies to the decision boundary, the lower the probability that it is predicted positive.
Finally, the instance labeled C is predicted positive with very low probability; even a tiny change in the training set could flip its predicted class.
How do we address this?
The most confident predictions are those for training instances far from the decision boundary, so we can use the functional margin to assess the confidence of a prediction.
The functional margin of the training set is defined as:
$\mathrm{funct} = \min_i\, y_i f(x_i), \qquad f(x) = \langle w, x \rangle + b$
Instance A has a large functional margin, while instance C's functional margin is very small; if C were misclassified, its functional margin would be negative.
Instances whose functional margin equals 1 are called support vectors. (Why? A question I still need to work out.)
Related to the functional margin is the geometric margin, the maximum width of the band that separates the support vectors. The geometric margin equals the normalized functional margin: since the functional margin can be scaled arbitrarily through $w$,
it has to be normalized.
When $w$ is a unit vector, the geometric margin equals the functional margin.
The parameters that maximize the geometric margin can be found by solving the following constrained optimization problem:
$\min_{w, b}\ \frac{1}{2}\langle w, w \rangle \quad \text{subject to} \quad y_i(\langle w, x_i \rangle + b) \geq 1$
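A minimal sketch (my own toy example) that fits an approximately hard-margin linear SVM and inspects the functional margins $y_i(\langle w, x_i \rangle + b)$; the support vectors are exactly the instances whose functional margin is about 1, and the geometric margin is $1 / \lVert w \rVert$:
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5],
              [5.0, 5.0], [6.0, 4.5], [5.5, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)   # very large C approximates a hard margin
w, b = clf.coef_[0], clf.intercept_[0]

functional_margins = y * (X @ w + b)
print(functional_margins)          # support vectors have values close to 1
print(clf.support_)                # indices of the support vectors
print(1.0 / np.linalg.norm(w))     # geometric margin: distance from the boundary to the closest instance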
(Open questions to work out:)
One property of the SVM: the constrained optimization problem is convex, so a local minimum is also the global minimum. (How is this proved?)
Finding the parameters that maximize the geometric margin is a quadratic programming problem, usually solved with the sequential minimal optimization (SMO) algorithm. (Not yet studied.)
A worked example: classifying handwritten digits
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn import datasets

# Load the MNIST dataset (note: fetch_mldata has since been removed from
# scikit-learn; see the fetch_openml sketch after this block)
mnist = datasets.fetch_mldata("mnist-original", data_home='/home/tianchi/myspace/datapanbo/mnist')

# Display 15 sample images from the (sorted) dataset in a 3 x 5 grid
counter = 1
for i in range(1, 4):
    for j in range(1, 6):
        plt.subplot(3, 5, counter)
        plt.imshow(mnist.data[(i - 1) * 8000 + j].reshape((28, 28)), cmap=cm.Greys_r)
        plt.axis('off')
        counter += 1
plt.show()
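Note that fetch_mldata was removed from recent scikit-learn releases. If the call above fails, the following sketch loads the same data through fetch_openml (assuming a scikit-learn version recent enough to provide it; 'mnist_784' is OpenML's name for this dataset):
from sklearn.datasets import fetch_openml

# The same 70,000 x 784 digit data, served from openml.org instead of mldata.org
mnist = fetch_openml('mnist_784', version=1, as_frame=False,
                     data_home='/home/tianchi/myspace/datapanbo/mnist')
# mnist.data and mnist.target can then be used exactly as in the code below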

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import classification_report

X, y = mnist.data, mnist.target
X = X / 255.0 * 2 - 1    # rescale pixel values from [0, 255] to [-1, 1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=11)

pipeline = Pipeline([
    ('clf', SVC(kernel='rbf', gamma=0.01, C=100)),
])
parameters = {
    'clf__gamma': (0.01, 0.03, 0.1, 0.3, 1),
    'clf__C': (0.1, 0.3, 1, 3, 10, 30),
}
# Grid search over gamma and C, scored by accuracy
grid_search = GridSearchCV(pipeline, parameters, n_jobs=2, verbose=1, scoring='accuracy')
grid_search.fit(X_train[:100], y_train[:100])

A note here: since the goal was only to verify that the code runs end to end, I used just the first 100 training samples.
print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameter set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))

predictions = grid_search.predict(X_test)
print(classification_report(y_test, predictions))

For reference, running this on 10,000 samples took about an hour (63 minutes) on my machine.
I will upload the code and the dataset to Git.
GIT Code address
References
[1] Kernel functions: https://blog.csdn.net/zb1165048017/article/details/49280979