Machine learning theory learning: perceptron
2022-07-02 07:36:00 【wxplol】
The perceptron is a binary linear classification model. Its input is the feature vector of an instance and its output is the class of the instance, taking the values +1 and -1. The perceptron corresponds to a hyperplane in the input space (the feature space) that separates the instances into positive and negative classes; it is a discriminative model. Perceptron prediction uses the learned perceptron model to classify new input instances. The perceptron is the foundation of neural networks and support vector machines. The following reference links are highly recommended; they are very detailed.
Reference link:
Machine learning — Perceptron
1. Perceptron model
Definition: the input x corresponds to a point in the input (feature) space, the output y ∈ {+1, -1} represents the class of the instance, and the function from input to output
$$f(x) = \mathrm{sign}(w \cdot x + b)$$
is called the perceptron, where w and b are the perceptron's weight and bias parameters.
The perceptron is a linear classification model and a discriminative model. The linear equation w·x + b = 0 corresponds to a hyperplane S in the feature space, which divides the feature space into two parts containing the positive and negative classes respectively. S is also called the separating hyperplane.
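As a quick illustration, here is a minimal sketch of the decision function $f(x) = \mathrm{sign}(w \cdot x + b)$, assuming NumPy and hypothetical values for `w` and `b` (not from the original post):

```python
import numpy as np

def perceptron_predict(x, w, b):
    """Return +1 or -1 according to sign(w · x + b)."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Hypothetical parameters: the hyperplane x1 + x2 - 1 = 0
w = np.array([1.0, 1.0])
b = -1.0
print(perceptron_predict(np.array([2.0, 2.0]), w, b))  # +1, point lies on the positive side
print(perceptron_predict(np.array([0.0, 0.0]), w, b))  # -1, point lies on the negative side
```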

2. Perceptron learning strategy
Assuming the training data are linearly separable, the learning goal of the perceptron is to find a separating hyperplane that completely separates the positive and negative samples, that is, to determine w and b. To do this we need to define a loss function and minimize it.
The perceptron takes as its loss function the total distance from the misclassified points to the separating hyperplane S. The distance from any point $x_0$ in the input space $R^n$ to the hyperplane is

$$\frac{1}{\|w\|}|w \cdot x_0 + b|$$

where $\|w\|$ is the L2 norm of w.
【Supplement】
1. Distance from a point to a line
Let the line be Ax + By + C = 0 and the point be $(x_0, y_0)$. Then the distance is

$$d = \frac{|A x_0 + B y_0 + C|}{\sqrt{A^2 + B^2}}$$

2. Distance from a sample to a hyperplane
Suppose the hyperplane is h = w·x + b, where w = (w0, w1, w2, ..., wn) and x = (x0, x1, x2, ..., xn). The distance from a sample point x' to the hyperplane is

$$d = \frac{|w \cdot x' + b|}{\|w\|}$$
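For example, a small NumPy sketch of this distance formula, using hypothetical values for w, b, and x':

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """Distance from point x to the hyperplane w · x + b = 0."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# Hypothetical example: hyperplane 3*x1 + 4*x2 - 5 = 0 and the point (1, 1)
print(distance_to_hyperplane(np.array([1.0, 1.0]), np.array([3.0, 4.0]), -5.0))  # |3 + 4 - 5| / 5 = 0.4
```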
In addition, for a misclassified sample $(x_i, y_i)$: when $w \cdot x_i + b > 0$ we have $y_i = -1$, and when $w \cdot x_i + b < 0$ we have $y_i = +1$. Therefore the distance from a misclassified point to S is

$$-\frac{1}{\|w\|} y_i (w \cdot x_i + b)$$
Dropping the factor $\frac{1}{\|w\|}$, the loss function of the perceptron is defined over the set M of misclassified points as

$$L(w, b) = -\sum_{x_i \in M} y_i (w \cdot x_i + b)$$
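A minimal sketch of this loss, assuming NumPy arrays `X` (one sample per row), labels `y` in {+1, -1}, and current parameters `w`, `b` (the toy data below is hypothetical):

```python
import numpy as np

def perceptron_loss(X, y, w, b):
    """Perceptron loss: sum of -y_i * (w · x_i + b) over misclassified points only."""
    margins = y * (X @ w + b)        # y_i * (w · x_i + b) for every sample
    misclassified = margins <= 0     # wrong side of (or on) the hyperplane
    return -np.sum(margins[misclassified])

# Hypothetical toy data
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
print(perceptron_loss(X, y, w=np.array([1.0, 1.0]), b=0.0))  # only the third point is misclassified: loss = 2.0
```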
3. Perceptron learning algorithm
3.1 The original form of the perceptron algorithm
For a given data set $T = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, find parameters w, b that minimize the loss function

$$\min_{w,b} L(w, b) = -\sum_{x_i \in M} y_i (w \cdot x_i + b)$$
Because w and b must be learned iteratively, stochastic gradient descent is used to optimize the objective function. Here, each stochastic gradient descent step uses only a single misclassified sample to compute the update.
First we compute the gradients, taking partial derivatives with respect to w and b:

$$\nabla_w L(w, b) = \frac{\partial}{\partial w} L(w, b) = -\sum_{x_i \in M} y_i x_i$$
$$\nabla_b L(w, b) = \frac{\partial}{\partial b} L(w, b) = -\sum_{x_i \in M} y_i$$
Then a misclassified point is selected at random and w, b are updated (synchronously):

$$w \leftarrow w + \eta y_i x_i$$
$$b \leftarrow b + \eta y_i$$
where $\eta$ is the learning rate. Through such iterations the loss function L(w, b) keeps decreasing until it is minimized.
Algorithm steps:

Input: training set $T = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, learning rate $\eta$ ($0 < \eta \le 1$)
Output: w, b; perceptron model $f(x) = \mathrm{sign}(w \cdot x + b)$

1. Choose initial values $w_0, b_0$.
2. Select a data point $(x_i, y_i)$ from the training set.
3. If $y_i (w \cdot x_i + b) \le 0$, update $w \leftarrow w + \eta y_i x_i$ and $b \leftarrow b + \eta y_i$.
4. Go to step (2), until the training set contains no misclassified points.
From the algorithm we can see that the perceptron learns mainly from the misclassified points: by adjusting the values of w and b, the separating hyperplane is moved toward a misclassified point, reducing that point's distance to the hyperplane, until the hyperplane passes the point so that it is correctly classified.
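The following is a minimal sketch of the original form above, assuming NumPy, a linearly separable data set `X`, labels `y` in {+1, -1}, and a hypothetical learning rate `eta`; it cycles through the samples and updates on each misclassified point:

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=1000):
    """Original-form perceptron: returns (w, b) separating X by labels y in {+1, -1}."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)                    # step (1): initial values w0 = 0, b0 = 0
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):                # step (2): pick a data point
            if yi * (np.dot(w, xi) + b) <= 0:   # step (3): misclassified?
                w += eta * yi * xi              # w <- w + eta * y_i * x_i
                b += eta * yi                   # b <- b + eta * y_i
                errors += 1
        if errors == 0:                         # step (4): stop when nothing is misclassified
            break
    return w, b

# Hypothetical toy data: two positive points and one negative point
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
w, b = train_perceptron(X, y)
print(w, b)
```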
3.2 The dual form of the perceptron algorithm
We know the general form of the weight update:

$$w \leftarrow w + \eta y_i x_i$$
$$b \leftarrow b + \eta y_i$$

After many iterations, however, w and b have to be updated many times, which is computationally expensive. Is there a way to avoid recomputing the weights so many times? Yes: the dual form of the perceptron algorithm. How is it derived? After n updates, the parameters can be written as

$$w = \sum_{i} a_i y_i x_i$$
$$b = \sum_{i} a_i y_i$$

where $a_i = n_i \eta$ and $n_i$ is the number of updates triggered by the misclassified point $(x_i, y_i)$.
Thus, instead of learning w and b directly, we learn the update coefficients $a_i$ of the misclassified points. In the dual form the inner products between samples can be stored in a Gram matrix, which speeds up the computation: in the original form, every time the parameters change the inner products with every sample have to be recomputed, whereas in the dual form the pairwise inner products $x_i \cdot x_j$ are computed once and reused. This is where the efficiency of the dual form comes from. The Gram matrix is defined as follows:
$$G = \begin{bmatrix} x_1 \cdot x_1 & \cdots & x_1 \cdot x_N \\ x_2 \cdot x_1 & \cdots & x_2 \cdot x_N \\ \vdots & \ddots & \vdots \\ x_N \cdot x_1 & \cdots & x_N \cdot x_N \end{bmatrix}$$
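In NumPy the Gram matrix is simply the matrix of pairwise inner products, e.g. (with hypothetical samples, one per row):

```python
import numpy as np

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])  # hypothetical samples, one per row
G = X @ X.T                                          # G[i, j] = x_i · x_j
print(G)
```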
Algorithm steps:

Input: training set $T = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, learning rate $\eta$ ($0 < \eta \le 1$)
Output: a, b; perceptron model $f(x) = \mathrm{sign}\left(\sum_{j} a_j y_j x_j \cdot x + b\right)$

1. Assign initial values $a_0, b_0$.
2. Select a data point $(x_i, y_i)$.
3. Check whether the point is misclassified by the current model, i.e. if $y_i \left(\sum_{j} a_j y_j x_j \cdot x_i + b\right) \le 0$, update $a_i \leftarrow a_i + \eta$ and $b \leftarrow b + \eta y_i$.
4. Go to step (2), until the training set contains no misclassified points.
To reduce the amount of computation, the inner products in the formula can be computed in advance, giving the Gram matrix.
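A minimal sketch of the dual form, assuming the same hypothetical NumPy toy data as before; the Gram matrix is precomputed once and the updates only touch `a` and `b`:

```python
import numpy as np

def train_perceptron_dual(X, y, eta=1.0, max_epochs=1000):
    """Dual-form perceptron: returns (a, b) with w = sum_i a_i * y_i * x_i."""
    n_samples = X.shape[0]
    a = np.zeros(n_samples)        # a_i accumulates eta once per update triggered by sample i
    b = 0.0
    G = X @ X.T                    # Gram matrix precomputed once, G[i, j] = x_i · x_j
    for _ in range(max_epochs):
        errors = 0
        for i in range(n_samples):
            # decision value for x_i: sum_j a_j * y_j * (x_j · x_i) + b
            if y[i] * (np.sum(a * y * G[:, i]) + b) <= 0:
                a[i] += eta
                b += eta * y[i]
                errors += 1
        if errors == 0:
            break
    return a, b

# Hypothetical toy data, the same as used for the original form
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
a, b = train_perceptron_dual(X, y)
w = np.sum((a * y)[:, None] * X, axis=0)   # recover w = sum_i a_i * y_i * x_i
print(a, b, w)
```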
Reference link:
Introduction to machine learning: notes on Statistical Learning Methods — Perceptron