
[Statistical Learning Method] Learning Notes - Support Vector Machine (I)

2022-07-07 12:34:00 Sickle leek


Support vector machine (support vector machine, SVM) is a binary classification model. Its basic model is the linear classifier with the largest margin defined on the feature space; margin maximization is what distinguishes it from the perceptron. SVM also includes the kernel trick, which makes it essentially a nonlinear classifier.

The learning strategy of SVM is margin maximization, which can be formalized as solving a convex quadratic programming (convex quadratic programming) problem; it is also equivalent to minimizing a regularized hinge loss function. The learning algorithm of SVM is therefore an optimization algorithm for convex quadratic programming.
Support vector machine learning methods build models from simple to complex:

  • Linearly separable SVM: when the training data are linearly separable, a linear classifier is learned by hard-margin maximization; this is also called the hard-margin SVM.
  • Linear SVM: when the training data are approximately linearly separable, a linear classifier is learned by soft-margin maximization; this is also called the soft-margin SVM.
  • Nonlinear SVM: when the training data are linearly non-separable, a nonlinear SVM is learned by using the kernel trick together with soft-margin maximization.

A kernel function computes, from inputs in the input space, the inner product of their images in the feature space. Using a kernel function one can learn a nonlinear SVM, which is equivalent to implicitly learning a linear SVM in a high-dimensional feature space. This approach is called the kernel method, a machine learning technique more general than the SVM itself.
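As a small illustration (not part of the original notes), the sketch below checks that the degree-2 polynomial kernel $K(x,z)=(x\cdot z)^2$ equals the inner product of explicitly mapped feature vectors; the explicit map `phi` is one standard choice for two-dimensional inputs and is assumed here only for illustration.

```python
import numpy as np

def poly2_kernel(x, z):
    """Polynomial kernel of degree 2: K(x, z) = (x . z)^2."""
    return np.dot(x, z) ** 2

def phi(x):
    """Explicit feature map for the 2-D degree-2 polynomial kernel (assumed example)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# The kernel computes the inner product in feature space without
# constructing phi(x) and phi(z) explicitly.
print(poly2_kernel(x, z))          # (1*3 + 2*(-1))^2 = 1.0
print(np.dot(phi(x), phi(z)))      # same value, via the explicit map
```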

1. Linearly separable support vector machine and hard-margin maximization

1.1 Linearly separable support vector machine

Suppose the input space is a Euclidean space or a discrete set, and the feature space is a Euclidean space or a Hilbert space. The linearly separable SVM and the linear SVM assume that the elements of the two spaces correspond one to one, so that an input in the input space maps to a feature vector in the feature space.
The nonlinear SVM uses a nonlinear mapping from the input space to the feature space to map an input to a feature vector. In all cases the input is transformed from the input space into the feature space, and SVM learning is carried out in the feature space.
Suppose a training data set on the feature space is given: $T=\{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, where $x_i\in \mathcal{X}=\mathbf{R}^n$, $y_i\in \mathcal{Y}=\{+1, -1\}$, $i=1,2,\dots,N$. Here $x_i$ is the $i$-th feature vector, also called an instance, and $y_i$ is the class label of $x_i$. When $y_i=+1$, $x_i$ is called a positive example; when $y_i=-1$, $x_i$ is called a negative example. $(x_i, y_i)$ is a sample point. Assume that the training data set is linearly separable.

The goal of learning is to find a separating hyperplane in the feature space that divides the instances into different classes. The separating hyperplane corresponds to the equation $w\cdot x + b=0$; it is determined by the normal vector $w$ and the intercept $b$, and can be denoted by $(w,b)$. The separating hyperplane divides the feature space into two parts, one positive and one negative: the side the normal vector points to is the positive class, and the other side is the negative class.

Generally, when the training data set is linearly separable, there are infinitely many separating hyperplanes that classify the two classes of data correctly. The perceptron finds a separating hyperplane by minimizing misclassification, but its solution is not unique. The linearly separable SVM instead finds the maximum-margin separating hyperplane, which is unique.
 Definition: linearly separable support vector machine

1.2 Functional margin and geometric margin

Generally speaking, the distance from a point to the separating hyperplane indicates how confident the classification prediction is. For a fixed hyperplane $w\cdot x + b =0$, $|w\cdot x +b|$ represents (up to scale) how far the point $x$ is from the hyperplane, and whether the sign of $w\cdot x +b$ agrees with the sign of the class label $y$ indicates whether the classification is correct. So $y(w\cdot x +b)$ can express both the correctness and the confidence of the classification. This is the notion of the functional margin (functional margin).
 Definition: functional margin
The functional margin can express the correctness and confidence of a classification prediction. But if $w$ and $b$ are scaled proportionally, say to $2w$ and $2b$, the hyperplane does not change while the functional margin doubles.
To make the margin determinate, we can impose a constraint on the normal vector $w$ of the separating hyperplane, such as the normalization $||w||=1$. The functional margin then becomes the geometric margin (geometric margin).

Generally, when a sample point $(x_i, y_i)$ is correctly classified by the hyperplane $(w, b)$, the distance from the point $x_i$ to the hyperplane $(w, b)$ is
$$\gamma_i=y_i\left(\frac{w}{||w||} \cdot x_i + \frac{b}{||w||}\right)$$
From this fact the concept of geometric margin is derived.
 Geometric margin
 Definition: geometric margin
From the definitions of the functional margin and the geometric margin, the two are related by
$$\gamma_i=\frac{\hat{\gamma}_i}{||w||}, \qquad \gamma = \frac{\hat{\gamma}}{||w||}$$
If $||w||=1$, the functional margin equals the geometric margin. If the hyperplane parameters $w$ and $b$ are scaled proportionally (the hyperplane does not change), the functional margin scales by the same factor while the geometric margin stays the same.
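A minimal numerical sketch of the relationship above, on assumed toy data (the three points of the book's small worked example): scaling $(w,b)$ changes the functional margin but leaves the geometric margin unchanged.

```python
import numpy as np

def functional_margin(w, b, X, y):
    # hat(gamma)_i = y_i (w . x_i + b)
    return y * (X @ w + b)

def geometric_margin(w, b, X, y):
    # gamma_i = hat(gamma)_i / ||w||
    return functional_margin(w, b, X, y) / np.linalg.norm(w)

# toy linearly separable points (assumed for illustration)
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
w, b = np.array([0.5, 0.5]), -2.0

print(functional_margin(w, b, X, y))          # depends on the scale of (w, b)
print(functional_margin(2 * w, 2 * b, X, y))  # doubled
print(geometric_margin(w, b, X, y))           # unchanged under scaling
print(geometric_margin(2 * w, 2 * b, X, y))
```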

1.3 Margin maximization

The basic idea of SVM learning is to find the separating hyperplane that classifies the training data correctly and has the largest geometric margin. For a linearly separable training set there are infinitely many separating hyperplanes (as in the perceptron case), but the separating hyperplane with the largest geometric margin is unique. Margin maximization here is also called hard-margin maximization (as opposed to the soft-margin maximization discussed later for approximately linearly separable training data).

Margin maximization: finding the hyperplane with the largest geometric margin means classifying the training data with sufficient confidence. That is, not only are the positive and negative instance points separated, but even the hardest instances (the points closest to the hyperplane) are separated with enough confidence.

  1. Maximum-margin separating hyperplane
    Next, consider how to find the separating hyperplane with the largest geometric margin, i.e. the maximum-margin separating hyperplane. Concretely, this can be expressed as the following constrained optimization problem:
    $$\max_{w,b}\ \gamma \qquad \text{s.t.}\ \ y_i\left(\frac{w}{||w||}\cdot x_i+\frac{b}{||w||}\right)\ge \gamma, \quad i=1,2,\dots,N$$
    That is, we want to maximize the geometric margin $\gamma$ of the hyperplane $(w,b)$ on the training data set, under the constraint that the geometric margin of the hyperplane $(w,b)$ on every training sample point is at least $\gamma$.
    Using the relationship between the geometric margin and the functional margin, $\gamma = \frac{\hat{\gamma}}{||w||}$, the problem can be rewritten as
    $$\max_{w,b}\ \frac{\hat{\gamma}}{||w||} \qquad \text{s.t.}\ \ y_i(w\cdot x_i+b)\ge \hat{\gamma}, \quad i=1,2,\dots,N$$
    The value of the functional margin $\hat{\gamma}$ does not affect the solution of the optimization problem. Indeed, if $w$ and $b$ are scaled proportionally to $\lambda w$ and $\lambda b$, the functional margin becomes $\lambda \hat{\gamma}$; this change affects neither the inequality constraints nor the optimization of the objective function. Therefore we can simply take $\hat{\gamma} = 1$.
    Also note that maximizing $\frac{1}{||w||}$ is equivalent to minimizing $\frac{1}{2}||w||^2$, so we obtain the following optimization problem:
    $$\min_{w,b}\ \frac{1}{2}||w||^2 \qquad \text{s.t.}\ \ y_i(w\cdot x_i+b)-1\ge 0, \quad i=1,2,\dots,N$$
    This is a convex quadratic programming (convex quadratic programming) problem.
     Convex optimization problem
    Solving this constrained optimization problem yields $w^*$ and $b^*$, giving the maximum-margin separating hyperplane $w^* \cdot x + b^* = 0$ and the classification decision function $f(x) = \mathrm{sign}(w^* \cdot x + b^*)$, i.e. the linearly separable support vector machine model.

Algorithm: linearly separable SVM learning algorithm (maximum margin method)
Input: linearly separable training data set $T=\{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, where $x_i\in \mathcal{X}=\mathbf{R}^n$, $y_i\in \mathcal{Y}=\{+1, -1\}$, $i=1,2,\dots,N$;
Output: maximum-margin separating hyperplane and classification decision function.
(1) Construct and solve the constrained optimization problem
$$\min_{w, b}\ \frac{1}{2}||w||^2 \qquad \text{s.t.}\ \ y_i(w\cdot x_i + b)-1 \ge 0, \quad i=1,2,\dots,N$$
and find the optimal solution $w^*, b^*$.
(2) Obtain the separating hyperplane $w^* \cdot x + b^* = 0$ and the classification decision function $f(x)=\mathrm{sign}(w^* \cdot x + b^*)$.
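A minimal sketch of the algorithm above, solving the primal convex QP directly with the cvxpy library (assumed installed); the toy data and the solution values quoted in the comments follow the book's small worked example and are not part of these notes.

```python
import numpy as np
import cvxpy as cp

# Linearly separable toy data (assumed for illustration)
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
n = X.shape[1]

w = cp.Variable(n)
b = cp.Variable()

# min (1/2)||w||^2   s.t.   y_i (w . x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)                 # roughly w* = (0.5, 0.5), b* = -2
print(np.sign(X @ w.value + b.value))   # classification decision function sign(w*.x + b*)
```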

  2. Existence and uniqueness of the maximum-margin separating hyperplane
    The maximum-margin separating hyperplane of a linearly separable training data set exists and is unique.
    Theorem (existence and uniqueness of the maximum-margin separating hyperplane): if the training data set $T$ is linearly separable, then the maximum-margin separating hyperplane that classifies all sample points of the training data set completely and correctly exists and is unique.

  3. Support vectors and margin boundaries
    In the linearly separable case, the instances among the training sample points that are closest to the separating hyperplane are called support vectors (support vector). A support vector is a point for which the constraint holds with equality, i.e. $y_i(w\cdot x_i + b)-1=0$.
    For a positive example with $y_i=+1$, the support vector lies on the hyperplane $H_1: w\cdot x + b = 1$; for a negative example with $y_i=-1$, the support vector lies on the hyperplane $H_2: w\cdot x +b = -1$. As shown in Figure 7.3, the points on $H_1$ and $H_2$ are the support vectors.
     Support vectors
    $H_1$ and $H_2$ are parallel, and no instance point falls between them. $H_1$ and $H_2$ bound a strip, with the separating hyperplane parallel to them and located at its center. The width of the strip, i.e. the distance between $H_1$ and $H_2$, is called the margin. The margin depends on the normal vector $w$ of the separating hyperplane and equals $\frac{2}{||w||}$. $H_1$ and $H_2$ are called the margin boundaries.
    In determining the separating hyperplane, only the support vectors play a role. Moving a support vector changes the solution, but moving, or even removing, instance points outside the margin boundaries does not change the solution.
    Because the support vectors play a decisive role in determining the separating hyperplane, this method is called the support vector machine. The number of support vectors is usually very small, so the SVM is determined by very few "important" training samples.
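A small sketch, using scikit-learn's SVC with a linear kernel on assumed toy data, to illustrate that only the support vectors determine the solution: removing a point outside the margin boundaries leaves $w$ and $b$ essentially unchanged. A very large penalty parameter C is used so the soft-margin solver behaves roughly like the hard-margin one.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [5.0, 5.0]])
y = np.array([1, 1, -1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
print(clf.support_vectors_)      # expected: (3, 3) and (1, 1) for this toy set
print(clf.coef_, clf.intercept_)

# Remove the non-support point (5, 5): the hyperplane should not change.
clf2 = SVC(kernel="linear", C=1e6).fit(X[:3], y[:3])
print(clf2.coef_, clf2.intercept_)
```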

1.4 Dual algorithm

Dual algorithm (dual algorithm): to solve the optimization problem of the linearly separable SVM, we regard it as the primal problem (primal problem), apply Lagrange duality, and obtain the optimal solution of the primal problem by solving the dual problem (dual problem). The advantages of doing so are: first, the dual problem is often easier to solve; second, the kernel function can be introduced naturally, which extends the method to nonlinear classification problems.
First, construct the Lagrange function (Lagrange function). For each inequality constraint, introduce a Lagrange multiplier (Lagrange multiplier) $\alpha_i \ge 0$, $i=1,2,\dots,N$, and define the Lagrange function
$$L(w, b, \alpha)=\frac{1}{2}||w||^2 - \sum_{i=1}^N \alpha_i y_i(w\cdot x_i + b)+\sum_{i=1}^N \alpha_i$$
where $\alpha = (\alpha_1, \alpha_2,\dots,\alpha_N)^T$ is the vector of Lagrange multipliers.
According to Lagrange duality, the dual problem of the primal problem is the maximin problem
$$\max_{\alpha}\min_{w,b} L(w, b,\alpha)$$
Therefore, to obtain the solution of the dual problem, we first minimize $L(w, b, \alpha)$ with respect to $w, b$, and then maximize the result with respect to $\alpha$.
(1) Find $\min_{w,b} L(w,b,\alpha)$
Take the partial derivatives of the Lagrange function $L(w,b,\alpha)$ with respect to $w$ and $b$ and set them equal to 0.
 Partial derivatives of the Lagrange function
This gives
$$w = \sum_{i=1}^N \alpha_i y_i x_i, \qquad \sum_{i=1}^N \alpha_i y_i =0$$
Substituting these back into the Lagrange function gives:
 Substituting the partial-derivative conditions into the Lagrange function
(2) Maximize $\min_{w,b} L(w,b, \alpha)$ with respect to $\alpha$, i.e. solve the dual problem.
 Maximizing over the dual variables
Converting the objective of the formula above from maximization to minimization gives the following equivalent dual optimization problem:
 The dual problem
 Theorem 7.2
From the solution of the dual problem, the solution of the primal problem can be obtained via the KKT conditions:
$$w^* = \sum_{i=1}^N \alpha_i^* y_i x_i, \qquad b^*=y_j-\sum_{i=1}^N\alpha_i^*y_i(x_i\cdot x_j)$$
According to this theorem, the separating hyperplane can be written as $\sum_{i=1}^N \alpha_i^* y_i(x\cdot x_i)+b^*=0$,
and the classification decision function can be written as $f(x)=\mathrm{sign}\left(\sum_{i=1}^N \alpha_i^* y_i(x\cdot x_i)+b^*\right)$.
That is, the classification decision function depends only on the inner products between the input $x$ and the training sample inputs. The form above is called the dual form of the linearly separable support vector machine.
Algorithm: linearly separable SVM learning algorithm (dual form)
Input: linearly separable training data set $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$, where $x_i\in \mathcal{X}=\mathbf{R}^n$, $y_i \in \mathcal{Y}=\{-1,+1\}$, $i=1,2,\dots,N$;
Output: maximum-margin separating hyperplane and classification decision function.
(1) Construct and solve the constrained optimization problem
$$\min_\alpha\ \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_jy_iy_j(x_i\cdot x_j)-\sum_{i=1}^N \alpha_i \qquad \text{s.t.}\ \ \sum_{i=1}^N \alpha_iy_i=0, \ \ \alpha_i\ge 0,\ i=1,2,\dots,N$$
and find the optimal solution $\alpha^*=(\alpha_1^*,\alpha_2^*,\dots,\alpha_N^*)^T$;
(2) Compute
$$w^*=\sum_{i=1}^N \alpha_i^* y_ix_i$$
then choose a positive component $\alpha_j^*>0$ of $\alpha^*$ and compute
$$b^*=y_j-\sum_{i=1}^N\alpha_i^*y_i(x_i\cdot x_j)$$
(3) Obtain the separating hyperplane
$$w^* \cdot x +b^* = 0$$
and the classification decision function $f(x)=\mathrm{sign}(w^* \cdot x + b^*)$.
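A minimal sketch of the dual algorithm above with cvxpy (assumed installed): solve for $\alpha^*$, then recover $w^*$ and $b^*$ from a positive component $\alpha_j^*>0$; the toy data and the values quoted in the comments follow the book's small example.

```python
import numpy as np
import cvxpy as cp

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
N = len(y)

# Matrix of the dual objective: Q_ij = y_i y_j (x_i . x_j)
Q = np.outer(y, y) * (X @ X.T)
Q += 1e-8 * np.eye(N)          # tiny ridge so the quadratic form is accepted as PSD

alpha = cp.Variable(N)
objective = cp.Minimize(0.5 * cp.quad_form(alpha, Q) - cp.sum(alpha))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
w = (a * y) @ X                          # w* = sum_i alpha_i* y_i x_i
j = int(np.argmax(a))                    # a component with alpha_j* > 0
b = y[j] - np.sum(a * y * (X @ X[j]))    # b* = y_j - sum_i alpha_i* y_i (x_i . x_j)
print(a, w, b)                           # roughly alpha* = (0.25, 0, 0.25), w* = (0.5, 0.5), b* = -2
```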
Definition (support vector): consider the primal and dual optimization problems. For a sample point $(x_i, y_i)$ of the training data set whose corresponding $\alpha_i^* >0$, the instance $x_i \in \mathbf{R}^n$ is called a support vector.

2. Linear support vector machine and soft-margin maximization

2.1 Linear support vector machines

The SVM learning method above is for linearly separable problems and does not apply to linearly non-separable training data, because in that case the inequality constraints in the method above cannot all hold. How can it be extended to the linearly non-separable case? This requires modifying hard-margin maximization into soft-margin maximization.

Suppose a training data set on the feature space is given: $T=\{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, where $x_i\in \mathcal{X}=\mathbf{R}^n$, $y_i\in \mathcal{Y}=\{+1, -1\}$, $i=1,2,\dots,N$; $x_i$ is the $i$-th feature vector, also called an instance, and $y_i$ is the class label of $x_i$. Usually, the training data contain some outliers (outlier); after removing these outliers, the set formed by the remaining sample points is linearly separable.

Linear non-separability means that some sample points $(x_i,y_i)$ cannot satisfy the constraint (7.14) that the functional margin be greater than or equal to 1. To solve this problem, we can introduce a slack variable $\xi_i\ge 0$ for each sample point $(x_i,y_i)$, so that the functional margin plus the slack variable is greater than or equal to 1. The constraint then becomes
$$y_i(w\cdot x_i+b)\ge 1-\xi_i$$
At the same time, each slack variable $\xi_i$ incurs a cost, and the objective function becomes
$$\frac{1}{2}||w||^2+C\sum_{i=1}^N\xi_i$$
Here $C>0$ is called the penalty parameter and is usually determined by the application problem; a larger $C$ increases the penalty for misclassification, while a smaller $C$ decreases it.
Minimizing this objective function has two goals: making $\frac{1}{2}||w||^2$ as small as possible, i.e. making the margin as large as possible, while keeping the number of misclassified points as small as possible; $C$ is the coefficient that balances the two.
Learning problem: the learning problem of the linearly non-separable SVM becomes the following convex quadratic programming problem
$$\min_{w,b,\xi}\ \frac{1}{2}||w||^2 + C\sum_{i=1}^N \xi_i$$
$$\text{s.t.}\ \ y_i(w\cdot x_i + b)\ge 1-\xi_i, \quad i=1,2,\dots,N$$
$$\xi_i\ge 0, \quad i=1,2,\dots,N$$
It can be shown that the solution for $w$ is unique, but the solution for $b$ may not be: it lies in an interval.
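A minimal sketch of the soft-margin primal problem above, solved with cvxpy (assumed installed) on assumed toy data containing a point that makes the set linearly non-separable; the slack variables absorb the margin violations.

```python
import numpy as np
import cvxpy as cp

# Toy data: the -1 point (1, 1) lies on the segment between two +1 points,
# so the data are not linearly separable (assumed example).
X = np.array([[3.0, 3.0], [4.0, 3.0], [0.5, 0.5], [1.0, 1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0])
N, n = X.shape
C = 1.0

w, b, xi = cp.Variable(n), cp.Variable(), cp.Variable(N)

# min (1/2)||w||^2 + C * sum(xi)
# s.t. y_i (w . x_i + b) >= 1 - xi_i,  xi_i >= 0
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
print(xi.value)   # nonzero slack for the points that violate the margin
```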

Let the solution of the above problem be $w^*,b^*$. We then obtain the separating hyperplane $w^*\cdot x+b^*=0$ and the classification decision function $f(x)=\mathrm{sign}(w^*\cdot x+b^*)$. Such a model is called the linear support vector machine for linearly non-separable training data, or simply the linear SVM. Clearly the linear SVM includes the linearly separable SVM as a special case. Since training data sets in practice are often linearly non-separable, the linear SVM has wider applicability.

Definition (linear support vector machine): for a given linearly non-separable training data set, the separating hyperplane $w^*\cdot x+b^*=0$ obtained by solving the convex quadratic programming problem above, i.e. the soft-margin maximization problem, together with the corresponding classification decision function $f(x)=\mathrm{sign}(w^*\cdot x+b^*)$, is called the linear support vector machine.

2.2 Dual algorithm

Algorithm: linear SVM learning algorithm
Input: training data set $T=\{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, where $x_i\in \mathcal{X}=\mathbf{R}^n$, $y_i\in \mathcal{Y}=\{+1, -1\}$, $i=1,2,\dots,N$;
Output: separating hyperplane and classification decision function.
(1) Choose a penalty parameter $C>0$, and construct and solve the convex quadratic programming problem
$$\min_\alpha\ \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^N \alpha_i$$
$$\text{s.t.}\ \ \sum_{i=1}^N \alpha_i y_i = 0, \quad 0\le \alpha_i \le C, \ i=1,2,\dots,N$$
to find the optimal solution $\alpha^* = (\alpha_1^*, \alpha_2^*, \dots, \alpha_N^*)^T$.
(2) Compute $w^*=\sum_{i=1}^N \alpha_i^* y_i x_i$.
Choose a component $\alpha_j^*$ of $\alpha^*$ satisfying $0< \alpha_j^* < C$ and compute $b^*=y_j-\sum_{i=1}^N y_i\alpha_i^*(x_i \cdot x_j)$.
(3) Obtain the separating hyperplane
$$w^* \cdot x + b^* = 0$$
and the classification decision function $f(x)=\mathrm{sign}(w^* \cdot x + b^*)$.
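In practice this algorithm is usually run by library solvers. A short sketch with scikit-learn's SVC on assumed toy data: `dual_coef_` stores $\alpha_i^* y_i$ for the support vectors, `coef_` corresponds to $w^*$, `intercept_` to $b^*$, and `C` is the penalty parameter.

```python
import numpy as np
from sklearn.svm import SVC

# Same linearly non-separable toy data as above (assumed for illustration)
X = np.array([[3.0, 3.0], [4.0, 3.0], [0.5, 0.5], [1.0, 1.0]])
y = np.array([1, 1, 1, -1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_)         # indices of the support vectors
print(clf.dual_coef_)       # alpha_i* y_i for each support vector
print(clf.coef_)            # w* = sum_i alpha_i* y_i x_i
print(clf.intercept_)       # b*
print(clf.predict([[2.0, 2.5], [0.0, 1.0]]))  # sign(w* . x + b*)
```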

2.3 Support vectors

The support vectors here are support vectors in the soft-margin case. In the linearly non-separable case, for the solution $\alpha^* = (\alpha_1^*, \alpha_2^*, \dots, \alpha_N^*)^T$ of the dual problem, the instances $x_i$ of the sample points $(x_i,y_i)$ with $\alpha_i^*>0$ are called support vectors (soft-margin support vectors). As shown in the figure below, the support vectors are more complicated than in the linearly separable case. The figure also marks the distance $\frac{\xi_i}{||w||}$ from an instance $x_i$ to the margin boundary.
 Soft-margin support vectors
A soft-margin support vector $x_i$ lies either on the margin boundary, or between the margin boundary and the separating hyperplane, or on the misclassified side of the separating hyperplane; the cases are as follows (a small sketch follows the list):

  • If $\alpha_i^*<C$, then $\xi_i=0$ and the support vector $x_i$ lies exactly on the margin boundary.
  • If $\alpha_i^*=C$ and $0<\xi_i<1$, the classification is correct and $x_i$ lies between the margin boundary and the separating hyperplane.
  • If $\alpha_i^*=C$ and $\xi_i=1$, then $x_i$ lies on the separating hyperplane.
  • If $\alpha_i^*=C$ and $\xi_i>1$, then $x_i$ lies on the misclassified side of the separating hyperplane.
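A small sketch of the case analysis above, applied to hypothetical values of $\alpha_i^*$ and $\xi_i$ (assumed for illustration only).

```python
def support_vector_position(alpha_i, xi_i, C, tol=1e-8):
    """Describe a soft-margin support vector (alpha_i > 0) per the case analysis above."""
    if alpha_i < C - tol:                # alpha_i* < C  =>  xi_i = 0
        return "exactly on the margin boundary"
    if xi_i < tol:                       # alpha_i* = C, xi_i = 0: still on the boundary
        return "on the margin boundary"
    if xi_i < 1 - tol:                   # alpha_i* = C, 0 < xi_i < 1
        return "between the margin boundary and the separating hyperplane"
    if abs(xi_i - 1) <= tol:             # alpha_i* = C, xi_i = 1
        return "on the separating hyperplane"
    return "on the misclassified side of the separating hyperplane"  # xi_i > 1

C = 1.0
for a, xi in [(0.3, 0.0), (1.0, 0.5), (1.0, 1.0), (1.0, 1.7)]:
    print(a, xi, "->", support_vector_position(a, xi, C))
```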

2.4 Hinge loss function

For linear SVM learning, the model is the separating hyperplane $w^* \cdot x + b^*=0$ together with the decision function $f(x)=\mathrm{sign}(w^* \cdot x + b^*)$; the learning strategy is soft-margin maximization, and the learning algorithm is convex quadratic programming.
Loss-function interpretation: linear SVM learning has another interpretation, namely minimizing the following objective function:
$$\sum_{i=1}^N[1-y_i(w\cdot x_i+b)]_+ +\lambda||w||^2$$
The first term of the objective function is the empirical risk. The function $L(y(w\cdot x+b))=[1-y(w\cdot x+b)]_+$ is called the hinge loss function (hinge loss); the subscript "+" denotes the positive-part function
$$[z]_+=\begin{cases} z, & z>0\\ 0, & z\le 0 \end{cases}$$
That is, when a sample point $(x_i,y_i)$ is correctly classified and its functional margin (confidence) $y_i(w\cdot x_i+b)$ is greater than 1, the loss is 0; otherwise the loss is $1-y_i(w\cdot x_i+b)$.
The second term of the objective function, with coefficient $\lambda$, is the squared $L_2$ norm of $w$; it is a regularization term.
Theorem: the primal optimization problem of the linear support vector machine
$$\min_{w,b,\xi}\ \frac{1}{2}||w||^2 + C\sum_{i=1}^N \xi_i$$
$$\text{s.t.}\ \ y_i(w\cdot x_i + b)\ge 1-\xi_i, \quad i=1,2,\dots,N$$
$$\xi_i\ge 0, \quad i=1,2,\dots,N$$
is equivalent to the optimization problem $\min_{w,b} \sum_{i=1}^N[1-y_i(w\cdot x_i+b)]_+ +\lambda||w||^2$ (with $\lambda = \frac{1}{2C}$).

The graph of the hinge loss function is shown in the figure below: the horizontal axis is the functional margin $y(w\cdot x + b)$ and the vertical axis is the loss. Because the function has the shape of a hinge, it is called the hinge loss function.
 Hinge loss function
The figure also shows the 0-1 loss function, which can be regarded as the true loss function of the binary classification problem; the hinge loss function is an upper bound on the 0-1 loss function. Since the 0-1 loss is not continuously differentiable, optimizing it directly is difficult, so linear SVM can be viewed as optimizing an objective function formed from an upper bound of the 0-1 loss (the hinge loss). Such an upper-bound loss is also called a surrogate loss function (surrogate loss function).
The dashed line in the figure shows the perceptron loss $[-y_i(w\cdot x_i+b)]_+$: when a sample is correctly classified the loss is 0, otherwise the loss is $-y_i(w\cdot x_i+b)$. By comparison, the hinge loss is 0 only when the sample is not just correctly classified but classified with sufficient confidence; in other words, the hinge loss function places higher demands on learning.
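A short sketch of the three losses discussed above, as functions of the functional margin $z=y(w\cdot x+b)$: the 0-1 loss, the hinge loss $[1-z]_+$ that upper-bounds it, and the perceptron loss $[-z]_+$.

```python
import numpy as np

def zero_one_loss(z):
    return (z <= 0).astype(float)      # 1 if misclassified, else 0

def hinge_loss(z):
    return np.maximum(0.0, 1.0 - z)    # [1 - z]_+

def perceptron_loss(z):
    return np.maximum(0.0, -z)         # [-z]_+

z = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])   # functional margins y(w.x + b)
print(zero_one_loss(z))    # [1. 1. 0. 0. 0.]
print(hinge_loss(z))       # [2. 1. 0.5 0. 0.]  -- upper-bounds the 0-1 loss everywhere
print(perceptron_loss(z))  # [1. 0. 0. 0. 0.]
```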

Content sources :

[1] Li Hang, 《统计学习方法》 (Statistical Learning Methods)
[2] https://www.cnblogs.com/liaohuiqiang/p/10980066.html
