
Machine learning (Part 1)

2022-06-26 08:54:00 Thick Cub with thorns


Fields related to machine learning:

pattern recognition

Computer vision

data mining

speech recognition

Statistical learning

natural language processing

  1. Training samples
  2. Feature extraction
  3. Learning a function
  4. Prediction
  • Supervised problems: labeled data

  • Unsupervised problems: unlabeled data

  • Regression: output continuous values

  • Classification: output discrete categories

Linear regression

$h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2$

$h_\theta(x)=\sum_{i=0}^n\theta_ix_i=\theta^Tx$

$y^{(i)}=\theta^Tx^{(i)}+\varsigma^{(i)}$

The errors are assumed to be independent and identically distributed, following a Gaussian distribution with mean 0 and variance $\sigma^2$.

$p(\varsigma^{(i)})=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(\varsigma^{(i)})^2}{2\sigma^2}\right)$

$p(y^{(i)}|x^{(i)};\theta)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right)$

Maximum likelihood function:

$L(\theta)=\prod_{i=1}^m p(y^{(i)}|x^{(i)};\theta)=\prod_{i=1}^m\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right)$

We seek $\arg\max L(\theta)$.

$\ell(\theta)=\log L(\theta)$

$\ell(\theta)=m\log\frac{1}{\sqrt{2\pi}\sigma}-\frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^m(y^{(i)}-\theta^Tx^{(i)})^2$

$J(\theta)=\frac{1}{2}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$

We therefore seek $\arg\min J(\theta)$.

$J(\theta)=\frac{1}{2}(X\theta-y)^T(X\theta-y)$

$\nabla_\theta J(\theta)=\nabla_\theta\left(\frac{1}{2}(\theta^TX^T-y^T)(X\theta-y)\right)=\nabla_\theta\left(\frac{1}{2}(\theta^TX^TX\theta-\theta^TX^Ty-y^TX\theta+y^Ty)\right)=X^TX\theta-X^Ty$

Setting the gradient to zero gives $\theta=(X^TX)^{-1}X^Ty$
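The closed-form solution can be checked numerically; a minimal sketch with NumPy (the toy data is invented for illustration):

```python
import numpy as np

# Normal equation theta = (X^T X)^{-1} X^T y from the derivation above.
# Toy data generated from y = 1 + 2*x with no noise, so the recovered
# parameters should be exactly [1, 2].
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])          # first column is the bias feature x_0 = 1
y = np.array([1.0, 3.0, 5.0])

theta = np.linalg.inv(X.T @ X) @ X.T @ y
```

In practice `np.linalg.lstsq` is preferred over forming the explicit inverse, since $X^TX$ may be ill-conditioned.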

Logistic regression

Usable for classification (binary classification) as well as regression

$h_\theta(x)=g(\theta^Tx)=\frac{1}{1+e^{-\theta^Tx}}$

Output range: $(0,1)$

$h_\theta'(x)=h_\theta(x)(1-h_\theta(x))$
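The derivative identity can be verified with a quick finite-difference check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Verify g'(z) = g(z) * (1 - g(z)) numerically at an arbitrary point.
z, eps = 0.7, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(z) * (1 - sigmoid(z))
```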

Gradient descent is used for optimization

There is no closed-form solution, so the derivative is not solved for directly

Decision trees and random forests

Classification algorithm

A tree structure represents the result of classifying the data

  • Root node
  • Non-leaf nodes (decision points)
  • Leaf nodes (class labels)
  • Branches (test outcomes)

Training phase

Classification stage

If two events are independent of each other: $P(X,Y)=P(X)P(Y)$, and $\log(XY)=\log(X)+\log(Y)$

Start at the root and split layer by layer.

Entropy is used to decide which attribute should be the next node

$H(X)$ measures the uncertainty of an event — the degree of internal disorder

The larger the probability $P$, the smaller the value of $H(X)$; the smaller the probability, the larger $H(X)$.

Entropy: $H=-\sum_{i=1}^n p_i\ln(p_i)$

Gini index: $\text{Gini}(p)=\sum_{k=1}^K p_k(1-p_k)=1-\sum_{k=1}^K p_k^2$

The larger $p$ is (the purer the node), the smaller both the entropy and the Gini index
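Both impurity measures are easy to compute directly; a small sketch:

```python
import math

def entropy(probs):
    # H = -sum_i p_i * ln(p_i); terms with p_i = 0 contribute nothing
    return -sum(p * math.log(p) for p in probs if p > 0)

def gini(probs):
    # Gini(p) = 1 - sum_k p_k^2
    return 1.0 - sum(p * p for p in probs)

pure = [1.0, 0.0]       # perfectly pure node: both measures are 0
uniform = [0.5, 0.5]    # maximally mixed two-class node
```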

The basic idea of constructing decision tree

The basic idea of building the tree is that as its depth increases, the entropy of the nodes should decrease rapidly.

The faster entropy decreases, the shallower the tree can be.

After each split, the split with the smallest total entropy over the child sets is best: it yields the maximum information gain and makes the entropy fall fastest.

The version of the decision tree

ID3: information gain

C4.5: information gain ratio

CART: Gini index

ID3 defect:

Information gain favors attributes with many distinct values, where each branch ends up with only a few samples

Evaluation function: $C(T)=\sum_{t\in\text{leaf}}N_tH(t)$, where $N_t$ is the sample count (weight) at leaf $t$ and $H(t)$ is its entropy

The smaller the evaluation function, the better; it acts like a loss function

Continuous attributes can be handled by first discretizing them — dividing the value range into intervals

Handling missing data: when building the tree, missing values can be ignored; when computing gain, only records that have the attribute value are considered

Decision tree pruning

Pre-pruning: terminate early while the decision tree is being built (prevents overfitting)

Post-pruning: prune only after the tree has been fully built

$C_\alpha(T)=C(T)+\alpha|T_{\text{leaf}}|$ — the more leaf nodes, the greater the loss

Random forests

Bootstrapping: sampling with replacement

Bagging: draw $n$ samples with replacement and use them to build a classifier; repeat to build many classifiers

The decision trees then make a joint decision by voting

Randomness: a random fraction of the samples is selected for training, and features are selected at random
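Sampling with replacement can be sketched in a few lines; as a sanity check, a bootstrap sample of size $n$ contains on average about $1-1/e\approx63.2\%$ of the distinct original points (this is a toy sketch of the sampling step only, not of the forest):

```python
import numpy as np

# Draw a bootstrap sample: n indices drawn with replacement from n points.
rng = np.random.default_rng(0)
n = 10_000
sample_idx = rng.integers(0, n, size=n)

# Fraction of distinct original points that made it into the sample;
# expected to be close to 1 - 1/e ≈ 0.632.
frac_unique = len(np.unique(sample_idx)) / n
```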

Bayesian algorithm

Bayes' formula

$P(A|B)=\frac{P(B|A)P(A)}{P(B)}$
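Applied to spam filtering, the formula reads $P(\text{spam}\mid\text{word})=\frac{P(\text{word}\mid\text{spam})P(\text{spam})}{P(\text{word})}$. A sketch with made-up probabilities (the numbers below are purely illustrative):

```python
# Bayes' rule with illustrative (invented) spam-filter probabilities.
p_spam = 0.2              # prior P(A): a message is spam
p_word_spam = 0.6         # P(B|A): a trigger word appears in spam
p_word_ham = 0.05         # the trigger word appears in normal mail

# Law of total probability gives the evidence P(B).
p_word = p_word_spam * p_spam + p_word_ham * (1 - p_spam)

# Posterior P(A|B) = P(B|A) P(A) / P(B)
p_spam_given_word = p_word_spam * p_spam / p_word   # 0.75
```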

Spelling correction

Spam filtering

Model comparison theory

Maximum likelihood: the hypothesis most consistent with the observed data — the one with the largest likelihood $P(D|h)$ — is favored.

Occam's razor: hypotheses with a larger prior $P(h)$ are favored; higher-order polynomials are less common, so they receive a smaller prior.

Naive Bayes: features are assumed to be mutually independent, not influencing one another.

Xgboost

Ensemble classifiers

Predicted value: $\hat y_i=\sum_j w_jx_{ij}$

Loss: $l(y_i,\hat y_i)=(y_i-\hat y_i)^2$

Optimal solution: $F^*(x)=\arg\min E_{(x,y)}[L(y,F(x))]$

Basic idea: each tree that is added improves on what has been built so far

$\hat y_i^{(0)}=0$

$\hat y_i^{(1)}=f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)$

$\hat y_i^{(t)}=\sum_{k=1}^t f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)$

That is, the round-$t$ prediction keeps the first $t-1$ rounds' prediction and adds one new function

Penalty term, applied to every tree: $\Omega(f_t)=\gamma T+\frac{1}{2}\lambda\sum_{j=1}^T w_j^2$

The first term counts the leaf nodes; the second is the regularization penalty on the leaf weights. Together with the loss they form the total objective.

$obj^{(t)}=\sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum_{i=1}^t\Omega(f_i)=\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+\Omega(f_t)+c$

We need to find the $f_t$ that optimizes this objective

Optimization with Taylor expansion

$obj^{(t)}=\sum_{i=1}^n\left[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_if_t^2(x_i)\right]+\Omega(f_t)+const$

$g_i$ is the first derivative and $h_i$ the second derivative of the loss

Rewriting as a sum over the leaf nodes:

$obj^{(t)}=\sum_{i=1}^n\left[g_iw_{q(x_i)}+\frac{1}{2}h_iw_{q(x_i)}^2\right]+\gamma T+\frac{1}{2}\lambda\sum_{j=1}^T w_j^2=\sum_{j=1}^T\left[\left(\sum_{i\in I_j}g_i\right)w_j+\frac{1}{2}\left(\sum_{i\in I_j}h_i+\lambda\right)w_j^2\right]+\gamma T$

Define $G_j=\sum_{i\in I_j}g_i$ and $H_j=\sum_{i\in I_j}h_i$.

$obj^{(t)}=\sum_{j=1}^T\left[G_jw_j+\frac{1}{2}(H_j+\lambda)w_j^2\right]+\gamma T$

Setting the partial derivative with respect to $w_j$ to 0 gives $w_j=-\frac{G_j}{H_j+\lambda}$

$Obj=-\frac{1}{2}\sum_{j=1}^T\frac{G_j^2}{H_j+\lambda}+\gamma T$

To decide whether to split a node into left and right children, compute the gain of the split:

$Gain=\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right]-\gamma$
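The split decision can be sketched directly from this formula; the gradient/Hessian sums below are made-up numbers:

```python
# Gain of splitting one node into left/right children (XGBoost criterion).
def split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    # Gain = 1/2 [ GL^2/(HL+lam) + GR^2/(HR+lam) - (GL+GR)^2/(HL+HR+lam) ] - gamma
    return 0.5 * (GL**2 / (HL + lam)
                  + GR**2 / (HR + lam)
                  - (GL + GR)**2 / (HL + HR + lam)) - gamma

# Illustrative values: gradients of opposite sign separate cleanly,
# so the split has positive gain and should be performed.
gain = split_gain(GL=4.0, HL=2.0, GR=-4.0, HR=2.0, lam=1.0, gamma=0.5)
```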

Adaboost

Adaptive boosting

Samples misclassified by the previous classifier are given larger weights, and the reweighted samples are used to train the next base classifier. A new weak classifier is added in each round, until the error rate is small enough or the preset maximum number of iterations is reached.

The final classifier is a weighted combination of the weak classifiers.

  1. Initialize the sample weight distribution; at the start all samples have equal weight
  2. Train a weak classifier; if a sample is misclassified, increase its weight, and if it is classified correctly, decrease its weight
  3. Combine the weak classifiers into a strong classifier by weighted voting
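One round of the weight update in step 2 can be sketched as follows (four samples, one misclassified; the numbers are illustrative and the standard $\alpha=\frac{1}{2}\ln\frac{1-err}{err}$ formula is assumed):

```python
import math

# Step 1: uniform initial weights over 4 samples.
w = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]   # the weak learner misses sample 3

# Weighted error and the classifier's weight alpha.
err = sum(wi for wi, ok in zip(w, correct) if not ok)
alpha = 0.5 * math.log((1 - err) / err)

# Step 2: raise weights of misclassified samples, lower the rest, renormalize.
w = [wi * math.exp(alpha if not ok else -alpha) for wi, ok in zip(w, correct)]
total = sum(w)
w = [wi / total for wi in w]
```

After this round the misclassified sample carries half of the total weight, so the next weak classifier focuses on it.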

Support vector machine

Classification problem

Suppose there is a hyperplane: $w^Tx+b=0$

Take two points $x'$ and $x''$ on the hyperplane; they satisfy $w^Tx'=-b$ and $w^Tx''=-b$.

Hence $w$ is the normal vector of the plane: $w^T(x''-x')=0$

($x'$ and $x''$ are written in vector form.)

distance(point to plane) $=\left|\frac{w^T}{\|w\|}(x-x')\right|=\frac{1}{\|w\|}|w^Tx+b|$

In SVM classification, positive examples are labeled $y=1$ and negative examples $y=-1$, so a correct classification satisfies $y_i\,y(x_i)>0$.

Find the hyperplane that maximizes the distance of the closest point:

$\arg\max_{w,b}\left(\min_i\frac{y_i(w^Tx_i+b)}{\|w\|}\right)$

By rescaling $w$ and $b$ we can require $y_i(w^Tx_i+b)\ge1$.

So we need $\max_{w,b}\frac{1}{\|w\|}$, which is equivalent to $\min_{w,b}\frac{1}{2}\|w\|^2$ subject to $y_i(w^Tx_i+b)\ge1$.

Using Lagrange multipliers :

$L(w,b,\alpha)=\frac{1}{2}\|w\|^2-\sum_{i=1}^n\alpha_i\left(y_i(w^Tx_i+b)-1\right)$

The dual problem: $\min_{w,b}\max_\alpha L(w,b,\alpha)\ge\max_\alpha\min_{w,b}L(w,b,\alpha)$

Taking partial derivatives with respect to $w$ and $b$ yields two conditions:

$\frac{\partial L}{\partial w}=0\ \Rightarrow\ w=\sum_{i=1}^n\alpha_iy_ix_i$

$\frac{\partial L}{\partial b}=0\ \Rightarrow\ \sum_{i=1}^n\alpha_iy_i=0$

What remains is to optimize over $\alpha$.

Lagrange multiplier method

$\min f(x)$

s.t. $g_i(x)\le0,\quad i=1,\dots,m$

The support vectors are the points that determine the separating hyperplane and its margin

Soft margin

Individual outlier points can distort the separating hyperplane.

Introducing slack variables turns this into a soft-margin problem:

$y_i(wx_i+b)\ge1-\varepsilon_i$

Objective function: $\min\frac{1}{2}\|w\|^2+C\sum_{i=1}^n\varepsilon_i$

When $C$ approaches infinity, classification errors are not tolerated.

When $C$ is small, there is greater tolerance for errors.

Kernel function

Maps from a low-dimensional space to a high-dimensional one.

Benefit: the inner product of high-dimensional samples can be computed in the low-dimensional space.

Computing the kernel (an inner product) in low dimension gives the same result as first mapping the samples to high dimension and taking the inner product there.

Gaussian kernel

$K(X,Y)=\exp\left(-\frac{\|X-Y\|^2}{2\sigma^2}\right)$
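A direct sketch of the Gaussian (RBF) kernel; the test points are invented:

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2)); equals 1 when x == y
    # and decays toward 0 as the points move apart.
    return float(np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2)))

x = np.array([1.0, 2.0])
near = np.array([1.1, 2.0])
far = np.array([4.0, 6.0])
```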

ARIMA

Stationarity:

  • Stationarity means the curve fitted to the sample time series can be extrapolated into the future "by inertia"
  • It requires that the mean and variance of the series show no significant change

Strict vs. weak stationarity:

  • Strict stationarity: the distribution does not change with time. Example: white noise — expectation 0 and variance 1, regardless of when it is sampled
  • Weak stationarity: the expectation and the correlation coefficients (dependence) are constant. The value at a future time $t$ depends on past information, so this dependence is needed

Making the data stationary:

  • Differencing: subtract the value at time $t-1$ from the value at time $t$
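First-order differencing can be sketched with NumPy; the trending toy series below becomes constant (stationary) after one difference:

```python
import numpy as np

# Series with a linear trend: clearly non-stationary (the mean keeps growing).
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# First-order difference: y'_t = y_t - y_{t-1}; removes the linear trend.
diff1 = np.diff(y)
```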

Autoregressive model (AR)

  • Describe the relationship between current value and historical value , Use the historical time data of variables to predict themselves
  • Autoregressive model must satisfy the requirement of stationarity
  • Formula for a $p$-order autoregressive process: $y_t=\mu+\sum_{i=1}^p\gamma_iy_{t-i}+\epsilon_t$
  • $y_t$ is the current value, $\mu$ a constant term, $p$ the order, $\gamma_i$ the autocorrelation coefficients, and $\epsilon_t$ the error

Limitations of autoregressive models :

  • Autoregressive model uses its own data to predict
  • Must be stable
  • Must exhibit autocorrelation; if the autocorrelation coefficient $\varphi_i<0.5$, the model should not be used
  • Autoregression is only applicable to predict the phenomenon related to its own early stage

Moving average model (MA)

  • The moving average model focuses on the accumulation of error terms in the autoregressive model
  • Formula for a $q$-order moving average process: $y_t=\mu+\epsilon_t+\sum_{i=1}^q\theta_i\epsilon_{t-i}$
  • The moving average method can effectively eliminate the random fluctuation in prediction

Autoregressive moving average model (ARMA)

  • The combination of autoregressive and moving average
  • Formula: $y_t=\mu+\sum_{i=1}^p\gamma_iy_{t-i}+\epsilon_t+\sum_{i=1}^q\theta_i\epsilon_{t-i}$

ARIMA: Differential autoregressive moving average model

It transforms a non-stationary time series into a stationary one, then builds a model by regressing the dependent variable on its own lagged values and on the present and lagged values of the random error term.

Choosing the values of $p$ and $q$

Autocorrelation function (ACF)

  • Measures the correlation of an ordered sequence of random variables with itself; it reflects the correlation between values of the same series at different times
  • $ACF(k)=\varrho_k=\frac{Cov(y_t,y_{t-k})}{Var(y_t)}$

Partial autocorrelation function (PACF)

  • ACF does not give the simple (direct) correlation between $x(t)$ and $x(t-k)$
  • because $x(t)$ is also influenced by the intermediate variables $x(t-1)\dots x(t-k+1)$
  • PACF removes the interference of these $k-1$ intermediate random variables $x(t-1),\dots,x(t-k+1)$ and measures the direct influence of $x(t-k)$ on $x(t)$
| Model | ACF | PACF |
| --- | --- | --- |
| AR(p) | Decays toward 0 (geometrically or oscillating) | Cuts off after lag $p$ |
| MA(q) | Cuts off after lag $q$ | Decays toward 0 (geometrically or oscillating) |
| ARMA(p,q) | Decays toward 0 after lag $q$ | Decays toward 0 after lag $p$ |

Cut-off: the values fall within the confidence interval (95% of the points conform)

Modeling process

  • Make the series stationary (determine $d$ by differencing)
  • Determine the orders $p$ and $q$ from the ACF and PACF
  • Fit ARIMA($p$,$d$,$q$)

Model selection with AIC and BIC (lower is better):

AIC: Akaike information criterion. $AIC=2k-2\ln(L)$

BIC: Bayesian information criterion. $BIC=k\ln(n)-2\ln(L)$

$k$ is the number of model parameters, $n$ the number of samples, and $L$ the likelihood.
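Comparing two candidate fits with these criteria; the log-likelihoods below are invented for illustration:

```python
import math

def aic(k, log_l):
    # AIC = 2k - 2 ln(L); log_l is already ln(L)
    return 2 * k - 2 * log_l

def bic(k, n, log_l):
    # BIC = k ln(n) - 2 ln(L)
    return k * math.log(n) - 2 * log_l

# A small model vs a bigger model with a marginally better likelihood:
# the extra parameters are not worth the tiny likelihood improvement.
small = {"k": 3, "log_l": -120.0}
big = {"k": 6, "log_l": -119.0}

aic_small, aic_big = aic(**small), aic(**big)
bic_small, bic_big = bic(n=100, **small), bic(n=100, **big)
```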

Residual test:

Check whether the residuals of the ARIMA model follow a normal distribution with mean 0 and constant variance.

neural network

A picture is represented as a three-dimensional array in a computer

K Nearest neighbor algorithm

A sample is assigned to the class that is most common among its $k$ nearest neighbors.

For a point whose class is unknown:

  • Compute the distance between every point in the labeled dataset and the current point
  • Sort by distance
  • Select the $k$ points closest to the current point
  • Determine the frequency of each class among those $k$ points
  • Return the most frequent class as the prediction for the current point

No training is required; the computational cost is proportional to the size of the training set, complexity $O(n)$
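The steps above fit in a few lines; a minimal sketch using Euclidean distance and made-up 2-D points:

```python
import math
from collections import Counter

def knn_predict(points, labels, query, k=3):
    # Sort training indices by L2 distance to the query, then take a
    # majority vote among the k nearest neighbours.
    nearest = sorted(range(len(points)),
                     key=lambda i: math.dist(points[i], query))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two well-separated clusters for illustration.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
```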

Distance calculation:

The distance metric is a hyperparameter.

Manhattan distance ($L1$): $d_1(I_1,I_2)=\sum_p|I_1^p-I_2^p|$

Euclidean distance ($L2$): $d_2(I_1,I_2)=\sqrt{\sum_p(I_1^p-I_2^p)^2}$

Choosing $k$:

Use cross-validation: hold out part of the training set as a validation set to tune the model parameter, rotating which part is held out.

Neural network loss function

$L=\frac{1}{N}\sum_{i=1}^N\sum_{j\ne y_i}\max(0,f(x_i;W)_j-f(x_i;W)_{y_i}+\delta)$

$\delta$ is the tolerated margin.

Regularization penalty term

$L=\frac{1}{N}\sum_{i=1}^N\sum_{j\ne y_i}\max(0,f(x_i;W)_j-f(x_i;W)_{y_i}+\delta)+\lambda\sum_k\sum_l W_{k,l}^2$

Effect: penalizes large weight parameters

Softmax classifier

Softmax output: normalized class probabilities.

Loss function: cross-entropy loss $L_i=-\log\frac{e^{f_{y_i}}}{\sum_je^{f_j}}$

$f_j(z)=\frac{e^{z_j}}{\sum_ke^{z_k}}$ is called the softmax function.

The input is a vector of real-valued scores; the output is a vector whose elements all lie between 0 and 1 and sum to 1.
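A numerically stable sketch of the softmax function (the score vector is invented):

```python
import numpy as np

def softmax(z):
    # Subtracting the max does not change the result but avoids overflow
    # in exp for large scores.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
```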

Machine learning optimization

Back propagation

gradient descent

Batch size: the number of samples processed in one iteration (one forward + backward pass), limited by machine memory

During training, the overall trend is convergence.

Epoch: one complete pass over all the data

Back propagation

Addition gate: distributes the gradient equally to its inputs

MAX gate: routes the gradient to the largest input

Multiplication gate: swaps the input values (each input's gradient is scaled by the other input)

Characteristics:

  • Hierarchical structure
  • Nonlinearity

The sigmoid function suffers from severe vanishing gradients and is no longer widely used.

ReLU, $\max(0,x)$, is now the main choice.

The more neurons , The better the classification

Preventing overfitting: regularization

Data preprocessing

Weight initialization: $b$ is initialized to a constant, $w$ randomly

Preventing overfitting:

Dropout: during training, randomly keep only a fraction of the neurons for forward and backward propagation
