[CV] Wu Enda machine learning course notes | Chapter 12
2022-07-02 21:24:00 【Fannnnf】
Unless otherwise noted in this series of articles, each passage of text explains the figure above it.
machine learning | Coursera
Wu Enda machine learning series _bilibili
Contents
12 Support vector machines (SVM)
12-1 Optimization objectives

- The coordinate system on the left of the figure above shows the cost function when $y=1$. The support vector machine replaces it with the pink curve, named $Cost_1(z)$; the subscript indicates that the value of $y$ is $1$.
- Likewise, the coordinate system on the right shows the cost function when $y=0$. The support vector machine replaces it with the pink curve, named $Cost_0(z)$; the subscript indicates that the value of $y$ is $0$.
In logistic regression, the cost function is:
$$J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^m y^{(i)}\log\left(h_\theta(x^{(i)})\right)+(1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
For the support vector machine, first move the minus sign inside the summation, and then drop the factor $\frac{1}{m}$ ($\frac{1}{m}$ is a constant; removing it changes the value of the cost function, but the same minimizing $\theta$ is still obtained). The resulting cost function is:
$$J(\theta)=\sum_{i=1}^m\left[y^{(i)}\left(-\log\left(h_\theta(x^{(i)})\right)\right)+(1-y^{(i)})\left(-\log\left(1-h_\theta(x^{(i)})\right)\right)\right]+\frac{\lambda}{2}\sum_{j=1}^{n}\theta_j^2$$
In the formula above, replace $-\log\left(h_\theta(x^{(i)})\right)$ with $Cost_1(\theta^Tx^{(i)})$ and $-\log\left(1-h_\theta(x^{(i)})\right)$ with $Cost_0(\theta^Tx^{(i)})$, giving the cost function:
$$J(\theta)=\sum_{i=1}^m\left[y^{(i)}Cost_1(\theta^Tx^{(i)})+(1-y^{(i)})Cost_0(\theta^Tx^{(i)})\right]+\frac{\lambda}{2}\sum_{j=1}^{n}\theta_j^2$$
In the support vector machine the regularization parameter $\lambda$ is no longer used; a parameter $C$ placed in front of the data term is used instead. The resulting SVM cost function is:
$$J(\theta)=C\sum_{i=1}^m\left[y^{(i)}Cost_1(\theta^Tx^{(i)})+(1-y^{(i)})Cost_0(\theta^Tx^{(i)})\right]+\frac{1}{2}\sum_{j=1}^{n}\theta_j^2$$
- The support vector machine does not predict the probability that $y=1$ or $y=0$: the hypothesis outputs 1 if $\theta^Tx^{(i)}\ge 0$ and outputs 0 otherwise.
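Below is a minimal NumPy sketch of this objective, assuming the pink curves are the usual hinge-style approximations $Cost_1(z)=\max(0,\,1-z)$ and $Cost_0(z)=\max(0,\,1+z)$; the function and variable names are illustrative, not from the course code:

```python
import numpy as np

# Hinge-style stand-ins for the pink curves: cost1 penalizes z < 1 when y = 1,
# cost0 penalizes z > -1 when y = 0.
def cost1(z):
    return np.maximum(0, 1 - z)

def cost0(z):
    return np.maximum(0, 1 + z)

def svm_cost(theta, X, y, C):
    """SVM objective: C * sum of per-example costs + (1/2) * sum_{j>=1} theta_j^2."""
    z = X @ theta                        # theta^T x^(i) for every example (X includes a bias column)
    per_example = y * cost1(z) + (1 - y) * cost0(z)
    reg = 0.5 * np.sum(theta[1:] ** 2)   # theta_0 is not regularized
    return C * np.sum(per_example) + reg
```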
12-2 Large margin intuition
The support vector machine is also called a large margin classifier.
The SVM changes the decision criterion: it outputs 1 only when $\theta^Tx^{(i)}\ge 1$ and outputs 0 only when $\theta^Tx^{(i)}\le -1$, which builds a safety margin between the two outcomes.
A general logistic regression algorithm might produce the pink or green line in the figure above to separate the two classes of samples, whereas the support vector machine produces the black line. The region between the two blue lines is called the margin. The support vector machine tries to separate the two classes with the largest possible margin, which gives it better robustness.
As in the figure above, first assume there is no negative example on the far left; with a large $C$, the SVM produces the black line shown. If a negative example is then added on the left, the SVM with a large $C$ will insist on the largest possible margin between the two classes and produce the pink line instead; but if $C$ is not too large, the black line is still produced even with that negative example present.
- $C$ plays a role equivalent to the earlier $\frac{1}{\lambda}$; the two are not literally the same thing, but the effect is similar.
12-3 The mathematics of support vector machines
12-4 Kernel functions I

For the sample distribution shown in the figure, we predict $y=1$ whenever the hypothesis is $\ge 0$, and $y=0$ otherwise.
For the hypothesis in the figure above, let $f_1=x_1,\ f_2=x_2,\ f_3=x_1x_2,\ f_4=x_1^2,\ \dots$
The hypothesis then becomes $h_\theta(x)=\theta_0+\theta_1 f_1+\theta_2 f_2+\dots$
This way of constructing features has nothing to do with the method that follows; it does not involve kernel functions.
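As a small illustration of this hand-built feature approach (the $\theta$ values below are made up for the example):

```python
import numpy as np

def poly_features(x):
    """Hand-built features: f1 = x1, f2 = x2, f3 = x1*x2, f4 = x1^2, f5 = x2^2."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

theta = np.array([-1.0, 0.5, 0.5, 1.0, 1.0, 1.0])  # illustrative theta_0 ... theta_5

def hypothesis(x):
    f = poly_features(x)
    return theta[0] + theta[1:] @ f     # theta_0 + theta_1*f_1 + theta_2*f_2 + ...

x = np.array([1.0, 2.0])
print(int(hypothesis(x) >= 0))           # predict 1 if the hypothesis is >= 0, else 0
```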
However, besides combining the original features, is there a better way to construct $f_1, f_2, f_3$? We can use a kernel function to compute new features.
Pick 3 points in the coordinate system and mark them as landmarks $l^{(1)}$, $l^{(2)}$, $l^{(3)}$.
Let $x$ be a given training example; assuming it has two features, $x = [x_1, x_2]$.
Define
$$f_1=\text{similarity}\left(x,l^{(1)}\right)=\exp\left(-\frac{\lVert x-l^{(1)}\rVert^2}{2\sigma^2}\right)$$
$$f_2=\text{similarity}\left(x,l^{(2)}\right)=\exp\left(-\frac{\lVert x-l^{(2)}\rVert^2}{2\sigma^2}\right)$$
$$f_3=\dots$$
- $\text{similarity}\left(x, l^{(i)}\right)$ is called the similarity function, or kernel function
- $\text{similarity}\left(x, l^{(i)}\right)$ can also be written as $k\left(x, l^{(i)}\right)$
- $\exp(x)$ denotes $e^x$; this particular kernel is called the Gaussian kernel
- $\sigma^2$ is the parameter of the Gaussian kernel
Expanding the squared norm, the kernel can be written as:
$$f_1=\text{similarity}\left(x,l^{(1)}\right)=\exp\left(-\frac{\lVert x-l^{(1)}\rVert^2}{2\sigma^2}\right)=\exp\left(-\frac{\sum_{j=1}^n\left(x_j-l_j^{(1)}\right)^2}{2\sigma^2}\right)$$
If $x$ is very close to the landmark $l^{(1)}$, then $f_1\approx \exp\left(-\frac{0^2}{2\sigma^2}\right)\approx 1$.
If $x$ is far from the landmark $l^{(1)}$, then $f_1\approx \exp\left(-\frac{(\text{large number})^2}{2\sigma^2}\right)\approx 0$.
As the figure above shows, the larger $\sigma^2$ is, the more slowly $f_i$ falls off as $x$ moves away from the landmark (the slope decreases).
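A minimal NumPy sketch of this similarity measure (the points, landmark, and $\sigma$ values are illustrative):

```python
import numpy as np

def gaussian_kernel(x, landmark, sigma):
    """Gaussian (RBF) similarity between a point x and a landmark."""
    return np.exp(-np.sum((x - landmark) ** 2) / (2 * sigma ** 2))

l1 = np.array([1.0, 1.0])
print(gaussian_kernel(np.array([1.0, 1.1]), l1, sigma=1.0))  # x close to l(1): about 1
print(gaussian_kernel(np.array([5.0, 6.0]), l1, sigma=1.0))  # x far from l(1): about 0
print(gaussian_kernel(np.array([5.0, 6.0]), l1, sigma=3.0))  # larger sigma: falls off more slowly
```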
As shown above, take a point $x$. The SVM has already learned the values of $\theta_0, \theta_1, \dots$ shown in the figure; using the kernel we compute $f_0, f_1, \dots$ as shown, substitute $\theta$ and $f$ into the hypothesis, and get $0.5\ge 0$, so we predict $y=1$.
As shown above, take another point $x$. Points close to $l^{(1)}$ or $l^{(2)}$ will be predicted as 1, and points far from $l^{(1)}$ and $l^{(2)}$ will be predicted as 0, so the SVM ends up fitting a decision boundary like the red curve in the figure: predictions are 1 inside the curve and 0 outside it. A small sketch of this prediction rule follows.
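Here is a small sketch of that prediction rule, with $\theta$ and landmark values made up in the spirit of the lecture example (not read off the figure):

```python
import numpy as np

theta = np.array([-0.5, 1.0, 1.0, 0.0])                      # illustrative theta_0 ... theta_3
landmarks = np.array([[3.0, 2.0], [1.0, 4.0], [5.0, 5.0]])   # illustrative l(1), l(2), l(3)
sigma = 1.0

def predict(x):
    """Predict 1 if theta^T f >= 0, else 0, where f are Gaussian-kernel features."""
    f = [1.0]                                                 # bias feature f_0 = 1
    for l in landmarks:
        f.append(np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2)))
    return int(np.dot(theta, f) >= 0)

print(predict(np.array([3.0, 2.1])))  # near l(1): theta^T f is about -0.5 + 1 = 0.5, predict 1
print(predict(np.array([9.0, 9.0])))  # far from l(1) and l(2): theta^T f is about -0.5, predict 0
```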
12-5 Kernel functions II
How should the landmarks be chosen?
We usually choose the number of landmarks to match the training set: if there are $m$ examples, we pick $m$ landmarks and set $l^{(1)}=x^{(1)},\ l^{(2)}=x^{(2)},\ \dots,\ l^{(m)}=x^{(m)}$. The advantage is that each new feature now measures how close an example is to one of the examples already in the training set, namely:

$$f_1^{(i)}=\text{similarity}\left(x^{(i)},l^{(1)}\right)$$
$$f_2^{(i)}=\text{similarity}\left(x^{(i)},l^{(2)}\right)$$
$$\vdots$$
$$f_m^{(i)}=\text{similarity}\left(x^{(i)},l^{(m)}\right)$$
Writing $f$ in feature-vector form gives
$$f^{(i)}=\begin{bmatrix} f_0^{(i)}=1 \\ f_1^{(i)} \\ f_2^{(i)} \\ \vdots \\ f_m^{(i)} \end{bmatrix}$$
$f^{(i)}$ is an $(m+1)$-dimensional vector: besides the $m$ kernel features (one per training example), it includes the bias term $f_0^{(i)}=1$.
The vector $f^{(i)}$ means: take all the features of the $i$-th example and compute the kernel with each of the $m$ training examples in turn; the $m$ results are stacked into a vector, and the entry $f_0^{(i)}=1$ is added at position 0. A small sketch of this construction follows.
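A minimal NumPy sketch of this construction, where every training example serves as a landmark (the toy data is illustrative):

```python
import numpy as np

def kernel_features(X, sigma):
    """Map each x^(i) to f^(i) = [1, k(x^(i), x^(1)), ..., k(x^(i), x^(m))]."""
    m = X.shape[0]
    # Pairwise squared distances between all examples (the landmarks are the examples themselves).
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-sq_dist / (2 * sigma ** 2))       # m x m Gaussian similarities
    return np.hstack([np.ones((m, 1)), K])        # prepend f_0 = 1, giving an m x (m+1) matrix

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 1.0]])  # toy training set, m = 3
F = kernel_features(X, sigma=1.0)
print(F.shape)  # (3, 4): row i is f^(i), with a leading 1
```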
Replacing the terms involving $x^{(i)}$ with $f^{(i)}$, the cost function becomes:
$$C\sum_{i=1}^m\left[y^{(i)}Cost_1(\theta^Tf^{(i)})+(1-y^{(i)})Cost_0(\theta^Tf^{(i)})\right]+\frac{1}{2}\sum_{j=1}^{n=m}\theta_j^2$$
- Why $n=m$ in the regularization term: $\theta$ now multiplies the $(m+1)$-dimensional vector $f^{(i)}$, so apart from $\theta_0$ there are $m$ weights $\theta_1,\dots,\theta_m$, one per kernel feature. Since the number of parameters has been denoted $n$ up to now, we have $n=m$. Note that $\theta_0$ is left out and is not regularized.
In an actual implementation, the summation in the regularization term, $\sum_{j=1}^{n=m}\theta_j^2=\theta^T\theta$, is replaced by $\theta^TM\theta$, where $M$ is a matrix that depends on the chosen kernel. This improves computational efficiency. The modified cost function is:
$$C\sum_{i=1}^m\left[y^{(i)}Cost_1(\theta^Tf^{(i)})+(1-y^{(i)})Cost_0(\theta^Tf^{(i)})\right]+\frac{1}{2}\theta^TM\theta$$
In theory we could also use kernel functions with logistic regression, but the trick of using $M$ to simplify the computation does not carry over to logistic regression, so the computation would be very slow.
We will not describe here how to minimize the SVM cost function; you can use existing packages (such as liblinear or libsvm). Before using these packages to minimize the cost function, we usually need to specify a kernel function, and if we use the Gaussian kernel, feature scaling must be performed first.
In addition, a support vector machine can also be used without a kernel; this is referred to as using a linear kernel. When we do not need a very complex nonlinear function, or when the training set has many features but very few examples, we can choose this kernel-free support vector machine.
The effects of the two SVM parameters $C$ and $\sigma$ are as follows ($C = \frac{1}{\lambda}$):
- Large $C$: equivalent to a small $\lambda$, which may cause overfitting (high variance);
- Small $C$: equivalent to a large $\lambda$, which may cause underfitting (high bias);
- Large $\sigma^2$: may lead to low variance but high bias;
- Small $\sigma^2$: may lead to low bias but high variance.
Source: https://www.cnblogs.com/sl0309/p/10499278.html

The figure above shows how these two parameters affect the result.
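As a rough illustration with scikit-learn (an assumption of this sketch, not a library used in the course): its RBF kernel is parameterized by `gamma`, which corresponds to $\frac{1}{2\sigma^2}$, so a larger $\sigma$ means a smaller `gamma`:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: class 1 clustered near the origin, class 0 spread more widely around it.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)), rng.normal(0, 2.0, size=(50, 2))])
y = np.array([1] * 50 + [0] * 50)

sigma = 1.0
clf = SVC(kernel='rbf', C=1.0, gamma=1.0 / (2 * sigma ** 2))  # gamma = 1 / (2 * sigma^2)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy; C and sigma should really be tuned on a validation set
```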
12-6 Using support vector machines (SVM)

- Use an SVM software library to compute the value of $\theta$
- We need to choose the value of $C$
- We need to choose a kernel function
  - Linear kernel: no kernel is used and a linear boundary is fit directly; suitable when the number of features $n$ is large and the number of examples $m$ is small
  - Gaussian kernel: suitable when the number of examples $m$ is large and the number of features $n$ is small; it can fit nonlinear boundaries
When a nonlinear kernel is used, the feature values must be scaled (normalized) first.
The kernel function must satisfy Mercer's theorem.
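A minimal end-to-end sketch using scikit-learn (assumed here for illustration; the data and parameter values are made up), combining feature scaling with a Gaussian-kernel SVM and showing the linear-kernel alternative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(int)           # nonlinear (circular) boundary
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Gaussian (RBF) kernel: scale the features first, as the notes recommend.
rbf_model = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))
rbf_model.fit(X_train, y_train)
print('RBF test accuracy:', rbf_model.score(X_test, y_test))

# Linear kernel (no kernel): an option when n is large and m is small.
linear_model = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
linear_model.fit(X_train, y_train)
print('Linear test accuracy:', linear_model.score(X_test, y_test))
```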
