[CV] Wu Enda machine learning course notes | Chapter 12
2022-07-02 21:24:00 【Fannnnf】
Unless otherwise noted in this series of articles, the text below each figure explains the figure above it.
machine learning | Coursera
Wu Enda machine learning series _bilibili
Contents
12 Support vector machine (SVM)
12-1 Optimization objectives

- The coordinate system on the left of the figure above shows the cost function for $y=1$. The SVM replaces it with the pink piecewise-linear curve, named $\mathrm{Cost}_1(z)$; the subscript indicates that the value of $y$ is $1$
- Likewise, the coordinate system on the right shows the cost function for $y=0$. The SVM replaces it with the pink curve named $\mathrm{Cost}_0(z)$; the subscript indicates that the value of $y$ is $0$
In logistic regression, the cost function is:
$$J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^m y^{(i)}\log\left(h_\theta(x^{(i)})\right)+(1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
For the support vector machine, first move the minus sign in the formula above inside the summation, and then drop the factor $\frac{1}{m}$ ($\frac{1}{m}$ is a constant; removing it changes the value of the cost function, but the same minimizing $\theta$ is still obtained). The resulting cost function is:
$$J(\theta)=\sum_{i=1}^m\left[y^{(i)}\left(-\log\left(h_\theta(x^{(i)})\right)\right)+(1-y^{(i)})\left(-\log\left(1-h_\theta(x^{(i)})\right)\right)\right]+\frac{\lambda}{2}\sum_{j=1}^{n}\theta_j^2$$
In the formula above, replace $\left(-\log\left(h_\theta(x^{(i)})\right)\right)$ with $\mathrm{Cost}_1(\theta^Tx^{(i)})$ and $\left(-\log\left(1-h_\theta(x^{(i)})\right)\right)$ with $\mathrm{Cost}_0(\theta^Tx^{(i)})$, giving the cost function:
$$J(\theta)=\sum_{i=1}^m\left[y^{(i)}\mathrm{Cost}_1(\theta^Tx^{(i)})+(1-y^{(i)})\mathrm{Cost}_0(\theta^Tx^{(i)})\right]+\frac{\lambda}{2}\sum_{j=1}^{n}\theta_j^2$$
The support vector machine no longer uses the regularization parameter $\lambda$; it uses a parameter $C$ instead. The SVM cost function then becomes:
$$J(\theta)=C\sum_{i=1}^m\left[y^{(i)}\mathrm{Cost}_1(\theta^Tx^{(i)})+(1-y^{(i)})\mathrm{Cost}_0(\theta^Tx^{(i)})\right]+\frac{1}{2}\sum_{j=1}^{n}\theta_j^2$$
- The support vector machine does not predict the probability of $y=1/0$. If $\theta^Tx^{(i)}\ge 0$, the hypothesis outputs 1; otherwise it outputs 0
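To make the objective above concrete, here is a minimal NumPy sketch (not from the course). The hinge-style functions `cost1` and `cost0` follow the piecewise-linear shape described above; the exact slope is an assumption, since the lecture only sketches them as straight-line approximations of the logistic curves.

```python
import numpy as np

def cost1(z):
    """Cost for y = 1: zero once z >= 1, grows linearly as z decreases."""
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    """Cost for y = 0: zero once z <= -1, grows linearly as z increases."""
    return np.maximum(0.0, 1.0 + z)

def svm_cost(theta, X, y, C):
    """SVM objective: C * sum of per-example costs + (1/2) * sum of theta_j^2 for j >= 1.
    X is assumed to include a leading column of ones matching theta_0."""
    z = X @ theta                                  # theta^T x^(i) for every example
    data_term = C * np.sum(y * cost1(z) + (1 - y) * cost0(z))
    reg_term = 0.5 * np.sum(theta[1:] ** 2)        # theta_0 is not regularized
    return data_term + reg_term
```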
12-2 Large margin intuition
The support vector machine is also called a large margin classifier.
The SVM tightens the decision rule: it outputs 1 when $\theta^Tx^{(i)}\ge 1$ and outputs 0 when $\theta^Tx^{(i)}\le -1$, so there is a safety margin between the two results.
An ordinary logistic regression algorithm might produce the pink or green line in the figure above to separate the two classes of samples, whereas the support vector machine produces the black line. The region between the two blue lines is called the margin. The SVM tries to separate the two classes with the largest possible margin, so it tends to be more robust.
As in the figure above, first suppose there is no negative sample on the left; with a large $C$ the black line in the figure is produced. If there is a negative sample on the left, then because $C$ is large, the SVM insists on the largest possible margin between the two classes and produces the pink line instead. If $C$ is not that large, the black line is still produced even with the negative sample on the left.
- $C$ plays roughly the role of the earlier $\frac{1}{\lambda}$; the two are not literally the same, but the effect is similar
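A small scikit-learn sketch of this sensitivity to $C$ (the toy data and the linear-kernel `SVC` call are my own illustration, not the course's code):

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated clusters plus one negative outlier near the positive cluster.
X = np.array([[3.0, 3.0], [3.5, 2.5], [4.0, 3.5],   # positive class
              [0.5, 0.5], [1.0, 1.0], [0.0, 1.5],   # negative class
              [2.8, 2.8]])                           # negative outlier
y = np.array([1, 1, 1, 0, 0, 0, 0])

for C in (0.1, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:<6} w={clf.coef_[0].round(2)} b={clf.intercept_[0]:.2f}")
# With a large C the boundary tends to swing toward the outlier (thin margin);
# with a small C it stays near the large-margin boundary of the clean clusters.
```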
12-3 Mathematical principle of support vector machine
12-4 Kernel function I

For the samples shown in the figure, if the hypothesis is $\ge 0$ we predict $y=1$; otherwise we predict $y=0$.
In the hypothesis shown in the figure above, set $f_1=x_1,\ f_2=x_2,\ f_3=x_1x_2,\ f_4=x_1^2,\ \dots$
The hypothesis then becomes $h_\theta(x)=\theta_0+\theta_1 f_1+\theta_2 f_2+\dots$
This way of building features is unrelated to the method below and does not involve kernel functions.
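As a quick illustration of this (non-kernel) polynomial feature construction, here is a minimal sketch; the parameter values are made up for the example.

```python
import numpy as np

def poly_features(x):
    """Hand-built features f1..f4 from a 2-D input x = [x1, x2], with a leading 1 for theta_0."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2, x1 ** 2])

theta = np.array([-1.0, 0.5, 0.5, 0.0, 0.0])           # illustrative parameters
h = theta @ poly_features(np.array([2.0, 1.0]))         # hypothesis value
print(1 if h >= 0 else 0)                               # predict 1 when h >= 0
```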
However, besides combining the original features, is there a better way to construct $f_1, f_2, f_3$? We can use a kernel function to compute the new features.
Pick three points in the coordinate system and mark them as landmarks $l^{(1)}$, $l^{(2)}$, $l^{(3)}$.
$x$ is a given training example; suppose it has two features, so $x=[x_1,x_2]$.
Define
$$f_1=\mathrm{similarity}\left(x,l^{(1)}\right)=\exp\left(-\frac{\Vert x-l^{(1)}\Vert^2}{2\sigma^2}\right)$$
$$f_2=\mathrm{similarity}\left(x,l^{(2)}\right)=\exp\left(-\frac{\Vert x-l^{(2)}\Vert^2}{2\sigma^2}\right)$$
$$f_3=\dots$$
- $\mathrm{similarity}\left(x,l^{(i)}\right)$ is called the similarity function, or kernel function
- $\mathrm{similarity}\left(x,l^{(i)}\right)$ can also be written as $k\left(x,l^{(i)}\right)$
- $\exp(x)$ denotes $e^x$; this particular kernel is called the Gaussian kernel
- $\sigma^2$ is the parameter of the Gaussian kernel
Written out componentwise, the kernel is:
$$f_1=\mathrm{similarity}\left(x,l^{(1)}\right)=\exp\left(-\frac{\Vert x-l^{(1)}\Vert^2}{2\sigma^2}\right)=\exp\left(-\frac{\sum_{j=1}^n\left(x_j-l_j^{(1)}\right)^2}{2\sigma^2}\right)$$
If $x$ is very close to the landmark $l^{(1)}$, then $f_1\approx \exp\left(-\frac{0^2}{2\sigma^2}\right)\approx 1$.
If $x$ is far from the landmark $l^{(1)}$, then $f_1\approx \exp\left(-\frac{(\text{large number})^2}{2\sigma^2}\right)\approx 0$.
As the figure above shows, if $\sigma^2$ is larger, $f_i$ falls off more slowly (the slope is smaller).
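A minimal NumPy sketch of the Gaussian kernel defined above, used only to illustrate the near/far behaviour and the effect of $\sigma^2$ (the sample points are made up):

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    """similarity(x, l) = exp(-||x - l||^2 / (2 * sigma^2))"""
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

l1 = np.array([3.0, 5.0])
print(gaussian_kernel(np.array([3.1, 5.1]), l1, sigma=1.0))  # close to l1 -> approx 1
print(gaussian_kernel(np.array([9.0, 0.0]), l1, sigma=1.0))  # far from l1 -> approx 0
print(gaussian_kernel(np.array([9.0, 0.0]), l1, sigma=3.0))  # larger sigma -> decays more slowly
```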
As in the figure above, take a point $x$. The SVM has already computed the values of $\theta_0,\theta_1,\dots$ shown in the figure; we can then compute $f_0,f_1,\dots$ from the kernel as shown, substitute the values of $\theta$ and $f$ into the hypothesis, obtain $0.5\ge 0$, and therefore predict $y=1$.
As in the figure above, taking other points $x$, the hypothesis ends up predicting 1 for points close to $l^{(1)}$ and $l^{(2)}$ and 0 for points far from $l^{(1)}$ and $l^{(2)}$. Eventually a red decision curve like the one in the figure can be fitted: points inside the curve are predicted as 1, and points outside the curve are predicted as 0.
12-5 Kernel function II
How do we choose the landmarks?
We usually choose the number of landmarks according to the size of the training set: if there are $m$ examples, we pick $m$ landmarks and let $l^{(1)}=x^{(1)},\ l^{(2)}=x^{(2)},\ \dots,\ l^{(m)}=x^{(m)}$. The advantage is that the new features measure how close an example is to every example in the training set, namely:

$$f_1^{(i)}=\mathrm{similarity}\left(x^{(i)},l^{(1)}\right)$$
$$f_2^{(i)}=\mathrm{similarity}\left(x^{(i)},l^{(2)}\right)$$
$$\dots$$
$$f_m^{(i)}=\mathrm{similarity}\left(x^{(i)},l^{(m)}\right)$$
Writing $f$ in feature-vector form gives
$$f^{(i)}=\begin{bmatrix} f_0^{(i)}=1\\ f_1^{(i)}\\ f_2^{(i)}\\ \vdots\\ f_m^{(i)} \end{bmatrix}$$
$f^{(i)}$ is an $(m+1)$-dimensional vector: besides the $m$ entries coming from the $m$ examples, a bias term $f_0^{(i)}=1$ is added.
The vector $f^{(i)}$ is built by taking all the features of the $i$-th example, applying the kernel against each of the examples $1$ through $m$, stacking the $m$ results into the vector, and prepending the 0-th entry $f_0^{(i)}=1$.
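A hedged NumPy sketch of this construction for the whole training set (not the course's Octave code; the tiny dataset is made up):

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

def kernel_features(X, sigma):
    """Map each example x^(i) to f^(i) = [1, k(x^(i), l^(1)), ..., k(x^(i), l^(m))],
    where the landmarks l^(1..m) are the training examples themselves."""
    m = X.shape[0]
    F = np.ones((m, m + 1))                  # column 0 is the bias term f_0 = 1
    for i in range(m):
        for j in range(m):
            F[i, j + 1] = gaussian_kernel(X[i], X[j], sigma)
    return F

X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0]])   # m = 3 made-up examples
print(kernel_features(X, sigma=1.0).shape)            # (3, 4): m examples, m + 1 features each
```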
Replacing the terms that contain $x$ with $f^{(i)}$, the cost function becomes:
$$C\sum_{i=1}^m\left[y^{(i)}\mathrm{Cost}_1(\theta^Tf^{(i)})+(1-y^{(i)})\mathrm{Cost}_0(\theta^Tf^{(i)})\right]+\frac{1}{2}\sum_{j=1}^{n=m}\theta_j^2$$
- On the $n=m$ in the regularization term: each $\theta_j$ is the weight of the corresponding kernel feature $f_j^{(i)}$. Since the $m$ landmarks $l^{(1)},\dots,l^{(m)}$ give $m$ kernel features, there are $m$ weights $\theta$; the number of features was previously denoted $n$, hence $n=m$. Note that $\theta_0$ is excluded here and is not regularized
In concrete implementations of the regularization term, the summation can be written as $\sum_{j=1}^{n=m}\theta_j^2=\theta^T\theta$, and $\theta^T\theta$ is replaced by $\theta^TM\theta$, where $M$ is a matrix that depends on the chosen kernel. This improves computational efficiency. The modified cost function is:
$$C\sum_{i=1}^m\left[y^{(i)}\mathrm{Cost}_1(\theta^Tf^{(i)})+(1-y^{(i)})\mathrm{Cost}_0(\theta^Tf^{(i)})\right]+\frac{1}{2}\theta^TM\theta$$
In theory, we could also use kernels with logistic regression, but the trick of using $M$ to simplify the computation does not carry over to logistic regression, so the computation would be very slow.
We do not cover here how to minimize the SVM cost function; existing packages (such as liblinear and libsvm) can be used. Before calling these packages to minimize the cost function, we usually need to specify a kernel function, and if we use the Gaussian kernel, feature scaling should be performed first.
In addition, a support vector machine can also be used without a kernel; this is also called the linear kernel. When we do not need a very complex function, or when the training set has many features but few examples, we can choose this kernel-free (linear) SVM.
The effects of the two SVM parameters $C$ and $\sigma$ are as follows:
- $C = 1/\lambda$
- Large $C$: equivalent to a small $\lambda$; may cause overfitting, high variance
- Small $C$: equivalent to a large $\lambda$; may cause underfitting, high bias
- Large $\sigma$: may lead to low variance, high bias
- Small $\sigma$: may lead to low bias, high variance
From https://www.cnblogs.com/sl0309/p/10499278.html

The figure above shows the influence of the two parameters on the result.
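A hedged sketch of exploring these two parameters with scikit-learn (the dataset and grid values are made up; note that scikit-learn's `gamma` for the RBF kernel corresponds to $1/(2\sigma^2)$):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for C in (0.01, 1.0, 100.0):
    for sigma in (0.1, 1.0, 3.0):
        clf = SVC(kernel="rbf", C=C, gamma=1.0 / (2.0 * sigma ** 2)).fit(X_train, y_train)
        print(f"C={C:<6} sigma={sigma:<4} "
              f"train={clf.score(X_train, y_train):.2f}  val={clf.score(X_val, y_val):.2f}")
# Large C / small sigma pushes train accuracy up but risks overfitting (high variance);
# small C / large sigma gives a smoother boundary but risks underfitting (high bias).
```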
12-6 Using support vector machines (SVM)

- Use an SVM software library to compute the values of $\theta$
- We need to choose the value of $C$
- We need to choose the kernel function
- Linear kernel: choose no kernel and fit linearly; suitable when the number of features $n$ is large and the number of examples $m$ is small
- Gaussian kernel: suitable when the number of examples $m$ is large and the number of features $n$ is small; it can fit nonlinear boundaries
When using a nonlinear kernel, the feature values need to be scaled (normalized) first.
The kernel function needs to satisfy Mercer's theorem.
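A minimal scikit-learn sketch of this workflow (the library, dataset, and parameter values are my own choices, not from the course): scale the features, then train an SVM with the chosen kernel and $C$.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Feature scaling before the Gaussian (RBF) kernel, as recommended above.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```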
