[statistical learning method] learning notes - support vector machine (Part 2)
2022-07-07 12:34:00 【Sickle leek】
Statistical Learning Methods learning notes — Support Vector Machine (Part 2)
3. Nonlinear support vector machine and kernel function
The main feature of the nonlinear support vector machine is the use of the kernel trick.
3.1 The kernel trick
- Nonlinear classification problems
A nonlinear classification problem is one that can be classified well only with a nonlinear model.
In general, for a given training set $T=\{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, where each instance $x_i$ belongs to the input space, $x_i \in \mathcal{X} = R^n$, and the corresponding class labels are $y_i \in \mathcal{Y} = \{-1, +1\}$, $i = 1, 2, \dots, N$: if the positive and negative examples can be separated correctly by a hypersurface in $R^n$, the problem is called a nonlinearly separable problem.
In the book's illustrative example, the positive and negative instances cannot be separated correctly by a straight line (a linear model), but they can be separated correctly by an elliptic curve (a nonlinear model).
Nonlinear problems are often difficult to solve directly, so one hopes to solve them by way of linear classification: apply a nonlinear transformation that turns the nonlinear problem into a linear one, and solve the original problem by solving the equivalent linear problem.
Let the original space be $\mathcal{X} \subset R^2$, $x = (x^{(1)}, x^{(2)})^T \in \mathcal{X}$, and the new space be $\mathcal{Z} \subset R^2$, $z = (z^{(1)}, z^{(2)})^T \in \mathcal{Z}$. Define the mapping from the original space to the new space: $z = \phi(x) = ((x^{(1)})^2, (x^{(2)})^2)^T$.
Then the ellipse in the original space, $w_1 (x^{(1)})^2 + w_2 (x^{(2)})^2 + b = 0$,
becomes a straight line in the new space, $w_1 z^{(1)} + w_2 z^{(2)} + b = 0$.
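As a quick sanity check of this example, here is a minimal numerical sketch (assuming NumPy; the ellipse coefficients are made-up illustrative values): points lying on the ellipse in the original space satisfy the linear equation exactly after the mapping.

```python
import numpy as np

# Hypothetical ellipse w1*(x^(1))^2 + w2*(x^(2))^2 + b = 0, here x1^2/4 + x2^2 = 1
w1, w2, b = 1.0, 4.0, -4.0

# Sample a few points lying on the ellipse
theta = np.linspace(0, 2 * np.pi, 5)
x = np.stack([2 * np.cos(theta), np.sin(theta)], axis=1)

# Nonlinear map phi(x) = ((x^(1))^2, (x^(2))^2): the ellipse becomes a straight line
z = x ** 2
print(w1 * z[:, 0] + w2 * z[:, 1] + b)   # all entries ~0: w1*z1 + w2*z2 + b = 0
```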
When the kernel trick is applied to SVMs, the basic idea is to map the input space (a Euclidean space $R^n$ or a discrete set) to a feature space (a Hilbert space $\mathcal{H}$) via a nonlinear transformation, so that a hypersurface model in the input space $R^n$ corresponds to a hyperplane model (a support vector machine) in the feature space $\mathcal{H}$. In this way, the learning task for the classification problem is completed by solving a linear SVM in the feature space.
- Definition of kernel function
The idea of the kernel trick is to define only the kernel function $K(x, z)$ in learning and prediction, instead of defining the mapping function $\phi$ explicitly. Usually, computing $K(x, z)$ directly is relatively easy, whereas computing $K(x, z)$ via $\phi(x)$ and $\phi(z)$ is not, because the feature space $\mathcal{H}$ is usually high dimensional, or even infinite dimensional.
For a given kernel $K(x, z)$, the choice of the feature space $\mathcal{H}$ and the mapping function $\phi$ is not unique: different feature spaces can be chosen, and even within the same feature space different mappings can be used.
- Application of the kernel trick in support vector machines
In the dual problem of the linear support vector machine, both the objective function and the decision function (the separating hyperplane) involve only inner products between input instances. The inner product $x_i \cdot x_j$ in the dual objective can be replaced by a kernel function $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$. The objective function of the dual problem then becomes
$$W(\alpha) = \frac{1}{2}\sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^N \alpha_i$$
Similarly, the inner product in the classification decision function can be replaced by the kernel function, and the decision function becomes
$$f(x) = \mathrm{sign}\left(\sum_{i=1}^{N_s} \alpha_i^* y_i \, \phi(x_i) \cdot \phi(x) + b^*\right) = \mathrm{sign}\left(\sum_{i=1}^{N_s} \alpha_i^* y_i K(x_i, x) + b^*\right)$$
This is equivalent to learning a linear SVM from the training samples in the new feature space. Learning is carried out implicitly in the feature space: there is no need to define the feature space or the mapping function explicitly. Such a technique is called the kernel trick. When the mapping function is nonlinear, the SVM learned with the kernel function is a nonlinear classification model.
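A minimal sketch of the kernelized decision function above (assuming NumPy; the multipliers `alpha_star`, the threshold `b_star`, and the support vectors are placeholder values standing in for what solving the dual problem would produce, not a trained model):

```python
import numpy as np

def decision_function(x, support_vectors, support_labels, alpha_star, b_star, kernel):
    # f(x) = sign( sum_i alpha_i^* y_i K(x_i, x) + b^* )
    s = sum(a * y * kernel(xi, x)
            for a, y, xi in zip(alpha_star, support_labels, support_vectors))
    return np.sign(s + b_star)

# Placeholder "learned" quantities, for illustration only
support_vectors = np.array([[0.0, 1.0], [1.0, 0.0]])
support_labels  = np.array([1.0, -1.0])
alpha_star      = np.array([0.5, 0.5])
b_star          = 0.0
poly_kernel     = lambda xi, x: (np.dot(xi, x) + 1) ** 2   # K(x_i, x) = (x_i . x + 1)^2

print(decision_function(np.array([0.2, 0.9]), support_vectors, support_labels,
                        alpha_star, b_star, poly_kernel))
```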
In applications, the kernel function is often chosen directly on the basis of domain knowledge, and its effectiveness needs to be verified by experiments.
3.2 Positive definite kernels
What conditions must a function $K(x, z)$ satisfy to be a kernel function? The kernel functions discussed here are positive definite kernel functions (positive definite kernel functions). Some preliminary knowledge is introduced first.
Suppose $K(x, z)$ is a symmetric function defined on $\mathcal{X} \times \mathcal{X}$, and for any $x_1, x_2, \dots, x_m \in \mathcal{X}$ the Gram matrix of $K(x, z)$ with respect to $x_1, x_2, \dots, x_m$ is positive semidefinite. From the function $K(x, z)$ a Hilbert space can be constructed in three steps: first define a mapping $\phi$ and form a vector space $\mathcal{S}$; then define an inner product on $\mathcal{S}$ to obtain an inner product space; finally complete $\mathcal{S}$ to obtain a Hilbert space.
(1) Define the mapping and form the vector space $\mathcal{S}$
First define the mapping $\phi: x \rightarrow K(\cdot, x)$.
According to this mapping, for any $x_i \in \mathcal{X}$, $\alpha_i \in R$, $i = 1, 2, \dots, m$, define the linear combination $f(\cdot) = \sum_{i=1}^m \alpha_i K(\cdot, x_i)$. Consider the set $\mathcal{S}$ of elements formed by such linear combinations. Since $\mathcal{S}$ is closed under addition and scalar multiplication, $\mathcal{S}$ forms a vector space.
(2) Define an inner product on $\mathcal{S}$, making it an inner product space
Define an operation $*$ on $\mathcal{S}$: for any $f, g \in \mathcal{S}$, with $f(\cdot) = \sum_{i=1}^m \alpha_i K(\cdot, x_i)$ and $g(\cdot) = \sum_{j=1}^l \beta_j K(\cdot, z_j)$, define the operation $*$ as
$$f * g = \sum_{i=1}^m \sum_{j=1}^l \alpha_i \beta_j K(x_i, z_j)$$
(3) Complete the inner product space $\mathcal{S}$ into a Hilbert space
Now complete the inner product space $\mathcal{S}$. From the inner product defined above, the norm $\|f\| = \sqrt{f \cdot f}$ can be obtained.
Therefore $\mathcal{S}$ is a normed vector space. By the theory of functional analysis, an incomplete normed vector space $\mathcal{S}$ can always be completed, yielding a complete normed vector space $\mathcal{H}$. An inner product space that is complete as a normed vector space is a Hilbert space. In this way, the Hilbert space $\mathcal{H}$ is obtained.
This Hilbert space $\mathcal{H}$ is called a reproducing kernel Hilbert space, because the kernel $K$ has the reproducing property, i.e. it satisfies $K(\cdot, x) \cdot f = f(x)$ and $K(\cdot, x) \cdot K(\cdot, z) = K(x, z)$, and $K$ is therefore called a reproducing kernel.
- A necessary and sufficient condition for a positive definite kernel
Theorem (necessary and sufficient condition for a positive definite kernel): Let $K: \mathcal{X} \times \mathcal{X} \rightarrow R$ be a symmetric function. Then $K(x, z)$ is a positive definite kernel function if and only if, for any $x_i \in \mathcal{X}$, $i = 1, 2, \dots, m$, the Gram matrix corresponding to $K(x, z)$,
$$K = [K(x_i, x_j)]_{m \times m}$$
is positive semidefinite.
Definition (equivalent definition of a positive definite kernel): Let $\mathcal{X} \subset R^n$ and let $K(x, z)$ be a symmetric function defined on $\mathcal{X} \times \mathcal{X}$. If for any $x_i \in \mathcal{X}$, $i = 1, 2, \dots, m$, the Gram matrix corresponding to $K(x, z)$,
$$K = [K(x_i, x_j)]_{m \times m}$$
is positive semidefinite, then $K(x, z)$ is called a positive definite kernel.
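A small numerical illustration of this criterion (assuming NumPy): build the Gram matrix of a candidate kernel on a handful of sample points and check that all eigenvalues are nonnegative. This only provides evidence for the finitely many points tested, not a proof of positive definiteness.

```python
import numpy as np

def gram_matrix(kernel, xs):
    # Gram matrix K = [K(x_i, x_j)]_{m x m}
    m = len(xs)
    return np.array([[kernel(xs[i], xs[j]) for j in range(m)] for i in range(m)])

def looks_positive_semidefinite(K, tol=1e-10):
    # A symmetric matrix is positive semidefinite iff all its eigenvalues are >= 0
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

rng = np.random.default_rng(0)
xs = rng.normal(size=(6, 3))                    # a few sample points in R^3
poly = lambda x, z: (x @ z + 1) ** 3            # candidate kernel: polynomial, p = 3
print(looks_positive_semidefinite(gram_matrix(poly, xs)))   # expected: True
```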
3.3 Common kernel functions
(1) Polynomial kernel function (polynomial kernel function)
$$K(x, z) = (x \cdot z + 1)^p$$
The corresponding support vector machine is a polynomial classifier of degree $p$. In this case, the classification decision function becomes
$$f(x) = \mathrm{sign}\left(\sum_{i=1}^{N_s} \alpha_i^* y_i (x_i \cdot x + 1)^p + b^*\right)$$
(2) Gaussian kernel (Gaussian kernel function)
$$K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$$
The corresponding support vector machine is a Gaussian radial basis function classifier. In this case, the classification decision function becomes
$$f(x) = \mathrm{sign}\left(\sum_{i=1}^{N_s} \alpha_i^* y_i \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right) + b^*\right)$$
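A minimal sketch of these two kernels (assuming NumPy; `p` and `sigma` are hyperparameters one would have to choose):

```python
import numpy as np

def polynomial_kernel(x, z, p=3):
    # K(x, z) = (x . z + 1)^p
    return (np.dot(x, z) + 1) ** p

def gaussian_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(polynomial_kernel(x, z), gaussian_kernel(x, z))
```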
(3) String kernel function
Kernel functions can be defined not only on Euclidean spaces but also on sets of discrete data. For example, the string kernel is a kernel function defined on sets of strings; it has applications in text classification, information retrieval, bioinformatics, and other areas.
Rules for constructing kernel functions
(1) Given a kernel function $K$ and any $a > 0$, $aK$ is also a kernel function.
(2) Given two kernel functions $K'$ and $K''$, the product $K'K''$ is also a kernel function.
(3) For any real-valued function $f(x)$ on the input space, $K(x_1, x_2) = f(x_1) f(x_2)$ is a kernel function.
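These closure rules can also be checked numerically on random sample points, in the same spirit as the Gram-matrix test above (assuming NumPy; `base` and `f` are arbitrary illustrative choices):

```python
import numpy as np

def gram(kernel, xs):
    return np.array([[kernel(a, b) for b in xs] for a in xs])

def is_psd(K, tol=1e-10):
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

rng = np.random.default_rng(1)
xs = rng.normal(size=(5, 2))
f = lambda x: np.sin(x).sum()                        # an arbitrary real-valued function
base = lambda x, z: np.exp(-np.sum((x - z) ** 2))    # a known (Gaussian) kernel

candidates = {
    "aK with a > 0": lambda x, z: 2.5 * base(x, z),                 # rule (1)
    "K' * K''":      lambda x, z: base(x, z) * (x @ z + 1) ** 2,    # rule (2)
    "f(x1) f(x2)":   lambda x, z: f(x) * f(z),                      # rule (3)
}
for name, k in candidates.items():
    print(name, is_psd(gram(k, xs)))                 # expected: True for all three
```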
3.4 Nonlinear support vector machines
Definition (nonlinear support vector machine): From a nonlinear classification training set, by maximizing the soft margin with a kernel function, i.e. by solving a convex quadratic programming problem, one learns a classification decision function of the form
$$f(x) = \mathrm{sign}\left(\sum_{i=1}^N \alpha_i^* y_i K(x, x_i) + b^*\right)$$
which is called a nonlinear support vector machine, where $K(x, z)$ is a positive definite kernel function.
Algorithm: Nonlinear support vector machine learning algorithm
Input: training data set $T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, where $x_i \in \mathcal{X} = R^n$, $y_i \in \mathcal{Y} = \{-1, +1\}$, $i = 1, 2, \dots, N$;
Output: classification decision function
(1) Choose an appropriate kernel function $K(x, z)$ and an appropriate penalty parameter $C > 0$, and construct and solve the optimization problem
$$\min_\alpha \; \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^N \alpha_i$$
$$\mathrm{s.t.} \quad \sum_{i=1}^N \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \; i = 1, 2, \dots, N$$
to find the optimal solution $\alpha^* = (\alpha_1^*, \alpha_2^*, \dots, \alpha_N^*)^T$.
(2) Choose a component $\alpha_j^*$ of $\alpha^*$ satisfying $0 < \alpha_j^* < C$ and compute $b^* = y_j - \sum_{i=1}^N \alpha_i^* y_i K(x_i, x_j)$.
(3) Construct the decision function:
$$f(x) = \mathrm{sign}\left(\sum_{i=1}^N \alpha_i^* y_i K(x, x_i) + b^*\right)$$
When $K(x, z)$ is a positive definite kernel function, the problem above is a convex quadratic programming problem, and a solution is guaranteed to exist.
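In practice this convex quadratic program is usually handed to an existing solver rather than implemented from scratch. A minimal sketch with scikit-learn (an assumption: `scikit-learn` is available; the data are synthetic, and `gamma` plays the role of $1/(2\sigma^2)$ in the Gaussian kernel):

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic nonlinearly separable data: label points by whether they fall inside a circle
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(np.sum(X ** 2, axis=1) < 0.5, 1, -1)

# Nonlinear SVM with a Gaussian (RBF) kernel; C is the soft-margin penalty parameter
clf = SVC(kernel="rbf", C=10.0, gamma=2.0)   # gamma = 1 / (2 * sigma^2)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
print("support vectors per class:", clf.n_support_)
```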
4. Sequential minimal optimization (SMO) algorithm
The SMO algorithm addresses how to implement support vector machine learning efficiently. It is a heuristic algorithm whose basic idea is:
If the solution for all the variables satisfies the KKT conditions (Karush-Kuhn-Tucker conditions) of the optimization problem, then a solution of the optimization problem has been obtained, since the KKT conditions are necessary and sufficient for this problem. Otherwise, choose two variables, fix all the other variables, and construct a quadratic programming problem in these two variables alone.
4.1 Solving the two-variable quadratic programming subproblem
To solve the two-variable quadratic programming subproblem, first analyze the constraints, and then find the minimum under these constraints.
4.2 How to choose variables
In each subproblem the SMO algorithm selects two variables to optimize, at least one of which violates the KKT conditions.
(1) The choice of the first variable
SMO calls the process of choosing the first variable the outer loop. The outer loop selects, among the training samples, the sample point that violates the KKT conditions most severely, and takes its corresponding variable as the first variable.
(2) The choice of the second variable
SMO calls the process of choosing the second variable the inner loop. Suppose the first variable $\alpha_1$ has been found in the outer loop; the second variable $\alpha_2$ is then sought in the inner loop. The selection criterion for the second variable is that the chosen $\alpha_2$ should change sufficiently.
(3) Computing the threshold $b$ and the differences $E_i$
After each optimization of the two variables, the threshold $b$ must be recomputed, and the corresponding $E_i$ values must also be updated and saved in a list.
4.3 SMO Algorithm
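The book gives the full SMO procedure; the following is only a heavily simplified sketch of its structure (assuming NumPy), in which the second variable is chosen at random rather than by the maximal $|E_1 - E_2|$ heuristic, and the $E_i$ values are recomputed on demand instead of cached. It shows how the KKT check, the analytic two-variable update with clipping to $[L, H]$, and the update of the threshold $b$ fit together.

```python
import numpy as np

def simplified_smo(X, y, C=1.0, kernel=None, tol=1e-3, max_passes=5):
    """Heavily simplified SMO: the second variable is picked at random."""
    if kernel is None:
        kernel = lambda a, b: a @ b                     # default: linear kernel
    N = len(y)
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
    alpha, b = np.zeros(N), 0.0
    g = lambda k: (alpha * y) @ K[:, k] + b             # current prediction g(x_k)
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(N):
            E_i = g(i) - y[i]
            # Outer-loop choice: an alpha_i that violates the KKT conditions (within tol)
            if (y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0):
                j = (i + np.random.randint(1, N)) % N   # random second variable, j != i
                E_j = g(j) - y[j]
                a_i, a_j = alpha[i], alpha[j]
                # Feasible interval [L, H] for alpha_j from sum(alpha_k y_k) = 0, 0 <= alpha <= C
                if y[i] != y[j]:
                    L, H = max(0.0, a_j - a_i), min(C, C + a_j - a_i)
                else:
                    L, H = max(0.0, a_i + a_j - C), min(C, a_i + a_j)
                eta = K[i, i] + K[j, j] - 2 * K[i, j]
                if L == H or eta <= 0:
                    continue
                # Analytic minimum along the constraint line, then clip to [L, H]
                alpha[j] = np.clip(a_j + y[j] * (E_i - E_j) / eta, L, H)
                if abs(alpha[j] - a_j) < 1e-5:
                    continue
                alpha[i] = a_i + y[i] * y[j] * (a_j - alpha[j])
                # Recompute the threshold b from whichever new multiplier lies strictly in (0, C)
                b1 = b - E_i - y[i] * (alpha[i] - a_i) * K[i, i] - y[j] * (alpha[j] - a_j) * K[i, j]
                b2 = b - E_j - y[i] * (alpha[i] - a_i) * K[i, j] - y[j] * (alpha[j] - a_j) * K[j, j]
                if 0 < alpha[i] < C:
                    b = b1
                elif 0 < alpha[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b
```

The returned multipliers and threshold could then be plugged into the kernelized decision function $f(x)=\mathrm{sign}(\sum_i \alpha_i^* y_i K(x_i, x) + b^*)$ sketched in Section 3.1.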
Summary
- The simplest case of the support vector machine is the linearly separable support vector machine, or hard-margin support vector machine. The condition for constructing it is that the training data are linearly separable; its learning strategy is margin maximization.
- In practice, training data sets are rarely exactly linearly separable; they are often approximately linearly separable. In this case one uses the linear support vector machine, or soft-margin support vector machine. The linear support vector machine is the most basic support vector machine.
- A nonlinear classification problem in the input space can be transformed into a linear classification problem in a high-dimensional feature space by a nonlinear transformation, and a linear support vector machine is then learned in that high-dimensional feature space. Because both the objective function and the classification decision function in the dual problem of the linear support vector machine involve only inner products between instances, the nonlinear transformation need not be specified explicitly; instead the inner product is replaced by a kernel function. The kernel function represents the inner product between two instances after the nonlinear transformation.
- The SMO algorithm is a fast learning algorithm for support vector machines. Its key feature is that it repeatedly decomposes the original quadratic programming problem into two-variable quadratic programming subproblems and solves each subproblem analytically, until all variables satisfy the KKT conditions; in this way the optimal solution of the original quadratic programming problem is obtained heuristically.
Content sources
[1] 《统计学习方法》 (Statistical Learning Methods), Li Hang
[2] https://blog.csdn.net/qq_42794545/article/details/121080091