Machine learning theory learning: perceptron
2022-07-02 07:36:00 【wxplol】
The perceptron is a binary linear classification model. Its input is the feature vector of an instance and its output is the class of the instance, taking the values +1 and -1. The perceptron corresponds to a hyperplane in the input space (the feature space) that separates the instances into positive and negative classes; it is a discriminative model. Perceptron prediction uses the learned perceptron model to classify new input instances. The perceptron is the foundation of neural networks and support vector machines. The following reference links are highly recommended; they are very detailed.
Reference link:
Machine learning — Perceptron
1. Perceptron model
Definition: the input x corresponds to a point in the input (feature) space, the output y ∈ {+1, -1} represents the class of the instance, and the function from input to output
$$f(x) = \mathrm{sign}(w \cdot x + b)$$
is called the perceptron, where w and b are the perceptron's weight and bias parameters.
The perceptron is a linear classification model and a discriminative model. The linear equation w·x + b = 0 corresponds to a hyperplane S in the feature space, which divides the feature space into two parts containing the positive and negative classes respectively. S is also called the separating hyperplane.
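As a quick illustration, here is a minimal sketch of the decision function $f(x) = \mathrm{sign}(w \cdot x + b)$, assuming NumPy and hypothetical values for `w` and `b` (not from the original post):

```python
import numpy as np

def perceptron_predict(x, w, b):
    """Return +1 or -1 according to sign(w · x + b)."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Hypothetical parameters: the hyperplane x1 + x2 - 1 = 0
w = np.array([1.0, 1.0])
b = -1.0
print(perceptron_predict(np.array([2.0, 2.0]), w, b))  # +1, point lies on the positive side
print(perceptron_predict(np.array([0.0, 0.0]), w, b))  # -1, point lies on the negative side
```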

2. Perceptron learning strategy
Assuming the training data are linearly separable, the learning goal of the perceptron is to find a separating hyperplane that completely separates the positive and negative samples, that is, to determine w and b. To do this we need to define a loss function and minimize it.
The perceptron takes as its loss function the total distance from the misclassified points to the separating hyperplane S. The distance from any point $x_0$ in the input space $R^n$ to the hyperplane is

$$\frac{1}{\|w\|}|w \cdot x_0 + b|$$

where $\|w\|$ is the L2 norm of w.
【Supplement】
1. Distance from a point to a line
Let the line be Ax + By + C = 0 and the point be $(x_0, y_0)$. Then the distance is

$$d = \frac{|A x_0 + B y_0 + C|}{\sqrt{A^2 + B^2}}$$

2. Distance from a sample to a hyperplane
Suppose the hyperplane is h = w·x + b, where w = (w0, w1, w2, ..., wn) and x = (x0, x1, x2, ..., xn). The distance from a sample point x' to the hyperplane is

$$d = \frac{|w \cdot x' + b|}{\|w\|}$$
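For example, a small NumPy sketch of this distance formula, using hypothetical values for w, b, and x':

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """Distance from point x to the hyperplane w · x + b = 0."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# Hypothetical example: hyperplane 3*x1 + 4*x2 - 5 = 0 and the point (1, 1)
print(distance_to_hyperplane(np.array([1.0, 1.0]), np.array([3.0, 4.0]), -5.0))  # |3 + 4 - 5| / 5 = 0.4
```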
In addition, for a misclassified sample $(x_i, y_i)$: when $w \cdot x_i + b > 0$ we have $y_i = -1$, and when $w \cdot x_i + b < 0$ we have $y_i = +1$. Therefore the distance from a misclassified point to S is

$$-\frac{1}{\|w\|} y_i (w \cdot x_i + b)$$
Dropping the factor $\frac{1}{\|w\|}$, the loss function of the perceptron is defined over the set M of misclassified points as

$$L(w, b) = -\sum_{x_i \in M} y_i (w \cdot x_i + b)$$
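A minimal sketch of this loss, assuming NumPy arrays `X` (one sample per row), labels `y` in {+1, -1}, and current parameters `w`, `b` (the toy data below is hypothetical):

```python
import numpy as np

def perceptron_loss(X, y, w, b):
    """Perceptron loss: sum of -y_i * (w · x_i + b) over misclassified points only."""
    margins = y * (X @ w + b)        # y_i * (w · x_i + b) for every sample
    misclassified = margins <= 0     # wrong side of (or on) the hyperplane
    return -np.sum(margins[misclassified])

# Hypothetical toy data
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
print(perceptron_loss(X, y, w=np.array([1.0, 1.0]), b=0.0))  # only the third point is misclassified: loss = 2.0
```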
3. Perceptron learning algorithm
3.1 The original form of the perceptron algorithm
For a given data set $T = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, find parameters w, b that minimize the loss function

$$\min_{w,b} L(w, b) = -\sum_{x_i \in M} y_i (w \cdot x_i + b)$$
Because w and b must be learned iteratively, stochastic gradient descent is used to optimize the objective function. Here, each stochastic gradient descent step uses only a single misclassified sample to compute the update.
First we compute the gradients, taking partial derivatives with respect to w and b:

$$\nabla_w L(w, b) = \frac{\partial}{\partial w} L(w, b) = -\sum_{x_i \in M} y_i x_i$$
$$\nabla_b L(w, b) = \frac{\partial}{\partial b} L(w, b) = -\sum_{x_i \in M} y_i$$
Then a misclassified point is selected at random and w, b are updated (synchronously):

$$w \leftarrow w + \eta y_i x_i$$
$$b \leftarrow b + \eta y_i$$
where $\eta$ is the learning rate. Through such iterations the loss function L(w, b) keeps decreasing until it is minimized.
Algorithm steps:

Input: training set $T = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, learning rate $\eta$ ($0 < \eta \le 1$)
Output: w, b; perceptron model $f(x) = \mathrm{sign}(w \cdot x + b)$

1. Choose initial values $w_0, b_0$.
2. Select a data point $(x_i, y_i)$ from the training set.
3. If $y_i (w \cdot x_i + b) \le 0$, update $w \leftarrow w + \eta y_i x_i$ and $b \leftarrow b + \eta y_i$.
4. Go to step (2), until the training set contains no misclassified points.
From the algorithm we can see that the perceptron learns mainly from the misclassified points: by adjusting the values of w and b, the separating hyperplane is moved toward a misclassified point, reducing that point's distance to the hyperplane, until the hyperplane passes the point so that it is correctly classified.
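The following is a minimal sketch of the original form above, assuming NumPy, a linearly separable data set `X`, labels `y` in {+1, -1}, and a hypothetical learning rate `eta`; it cycles through the samples and updates on each misclassified point:

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=1000):
    """Original-form perceptron: returns (w, b) separating X by labels y in {+1, -1}."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)                    # step (1): initial values w0 = 0, b0 = 0
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):                # step (2): pick a data point
            if yi * (np.dot(w, xi) + b) <= 0:   # step (3): misclassified?
                w += eta * yi * xi              # w <- w + eta * y_i * x_i
                b += eta * yi                   # b <- b + eta * y_i
                errors += 1
        if errors == 0:                         # step (4): stop when nothing is misclassified
            break
    return w, b

# Hypothetical toy data: two positive points and one negative point
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
w, b = train_perceptron(X, y)
print(w, b)
```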
3.2 The dual form of the perceptron algorithm
We know the general form of the weight update:

$$w \leftarrow w + \eta y_i x_i$$
$$b \leftarrow b + \eta y_i$$

After many iterations, however, w and b have to be updated many times, which is computationally expensive. Is there a way to avoid recomputing the weights so many times? Yes: the dual form of the perceptron algorithm. How is it derived? After n updates, the parameters can be written as

$$w = \sum_{i} a_i y_i x_i$$
$$b = \sum_{i} a_i y_i$$

where $a_i = n_i \eta$ and $n_i$ is the number of updates triggered by the misclassified point $(x_i, y_i)$.
Thus, instead of learning w and b directly, we learn the update coefficients $a_i$ of the misclassified points. In the dual form the inner products between samples can be stored in a Gram matrix, which speeds up the computation: in the original form, every time the parameters change the inner products with every sample have to be recomputed, whereas in the dual form the pairwise inner products $x_i \cdot x_j$ are computed once and reused. This is where the efficiency of the dual form comes from. The Gram matrix is defined as follows:
$$G = \begin{bmatrix} x_1 \cdot x_1 & \cdots & x_1 \cdot x_N \\ x_2 \cdot x_1 & \cdots & x_2 \cdot x_N \\ \vdots & \ddots & \vdots \\ x_N \cdot x_1 & \cdots & x_N \cdot x_N \end{bmatrix}$$
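In NumPy the Gram matrix is simply the matrix of pairwise inner products, e.g. (with hypothetical samples, one per row):

```python
import numpy as np

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])  # hypothetical samples, one per row
G = X @ X.T                                          # G[i, j] = x_i · x_j
print(G)
```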
Algorithm steps:

Input: training set $T = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, learning rate $\eta$ ($0 < \eta \le 1$)
Output: a, b; perceptron model $f(x) = \mathrm{sign}\left(\sum_{j} a_j y_j x_j \cdot x + b\right)$

1. Assign initial values $a_0, b_0$.
2. Select a data point $(x_i, y_i)$.
3. Check whether the point is misclassified by the current model, i.e. if $y_i \left(\sum_{j} a_j y_j x_j \cdot x_i + b\right) \le 0$, update $a_i \leftarrow a_i + \eta$ and $b \leftarrow b + \eta y_i$.
4. Go to step (2), until the training set contains no misclassified points.
To reduce the amount of computation, the inner products in the formula can be computed in advance, giving the Gram matrix.
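A minimal sketch of the dual form, assuming the same hypothetical NumPy toy data as before; the Gram matrix is precomputed once and the updates only touch `a` and `b`:

```python
import numpy as np

def train_perceptron_dual(X, y, eta=1.0, max_epochs=1000):
    """Dual-form perceptron: returns (a, b) with w = sum_i a_i * y_i * x_i."""
    n_samples = X.shape[0]
    a = np.zeros(n_samples)        # a_i accumulates eta once per update triggered by sample i
    b = 0.0
    G = X @ X.T                    # Gram matrix precomputed once, G[i, j] = x_i · x_j
    for _ in range(max_epochs):
        errors = 0
        for i in range(n_samples):
            # decision value for x_i: sum_j a_j * y_j * (x_j · x_i) + b
            if y[i] * (np.sum(a * y * G[:, i]) + b) <= 0:
                a[i] += eta
                b += eta * y[i]
                errors += 1
        if errors == 0:
            break
    return a, b

# Hypothetical toy data, the same as used for the original form
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
a, b = train_perceptron_dual(X, y)
w = np.sum((a * y)[:, None] * X, axis=0)   # recover w = sum_i a_i * y_i * x_i
print(a, b, w)
```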
Reference link:
Introduction to machine learning: notes on Statistical Learning Methods — Perceptron