Machine learning (Part 1)
2022-06-26 08:54:00 【Thick Cub with thorns】
Machine learning (1)
Machine learning covers:
Pattern recognition
Computer vision
Data mining
Speech recognition
Statistical learning
Natural language processing
- Training samples
- Feature extraction
- Learning a function
- Prediction
Supervised problems: labeled data
Unsupervised problems: unlabeled data
Regression: the output is a specific (continuous) value
Classification: the output is a discrete category
Linear regression
$h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2$
$h_\theta(x)=\sum\limits_{i=0}^n\theta_ix_i=\theta^Tx$
$y^{(i)}=\theta^Tx^{(i)}+\varsigma^{(i)}$
The errors $\varsigma^{(i)}$ are assumed independent and identically distributed, following a Gaussian distribution with mean 0 and variance $\sigma^2$:
$p(\varsigma^{(i)})=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(\varsigma^{(i)})^2}{2\sigma^2}\right)$
$p(y^{(i)}|x^{(i)};\theta)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right)$
Likelihood function (to be maximized):
$L(\theta) = \prod\limits_{i=1}^m p(y^{(i)}|x^{(i)};\theta) = \prod\limits_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right)$
We need $\arg\max\limits_\theta L(\theta)$.
$\ell(\theta)=\log L(\theta)$
$\ell(\theta)=m\log\frac{1}{\sqrt{2\pi}\sigma}-\frac{1}{\sigma^2}\cdot\frac{1}{2}\sum\limits_{i=1}^m(y^{(i)}-\theta^Tx^{(i)})^2$
$J(\theta)=\frac{1}{2}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$
Maximizing the likelihood is therefore equivalent to $\arg\min\limits_\theta J(\theta)$.
$J(\theta)=\frac{1}{2}(X\theta-y)^T(X\theta-y)$
$\nabla_\theta J(\theta)= \nabla_\theta\left(\frac{1}{2}(\theta^TX^T-y^T)(X\theta-y)\right) =\nabla_\theta\left(\frac{1}{2}(\theta^TX^TX\theta-\theta^TX^Ty-y^TX\theta+y^Ty)\right) =X^TX\theta-X^Ty$
$\theta=(X^TX)^{-1}X^Ty$
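A minimal NumPy sketch of this closed-form solution (the toy data below is illustrative, not from the text):

```python
import numpy as np

# Toy design matrix with a bias column of ones (illustrative data).
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 2.9, 4.1])

# Normal equation: theta = (X^T X)^{-1} X^T y. lstsq solves the same
# least-squares problem but is numerically more stable than an explicit inverse.
theta_inv = np.linalg.inv(X.T @ X) @ X.T @ y
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_inv, theta_lstsq)  # both approximately [intercept, slope]
```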
Logistic regression
Applicable to classification (binary classification) and regression.
$h_{\theta}(x)=g(\theta^Tx)=\frac{1}{1+e^{-\theta^Tx}}$
The output lies in $(0,1)$.
$h_{\theta}(x)' = h_{\theta}(x)(1-h_{\theta}(x))$
Gradient descent is used for optimization; there is no closed-form solution from setting the derivative to zero (see the sketch below).
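A minimal sketch of logistic regression trained by gradient descent, assuming toy 0/1-labeled data and an illustrative learning rate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative data: X includes a bias column, y holds 0/1 labels.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0, 0, 1, 1])

theta = np.zeros(X.shape[1])
lr = 0.1
for _ in range(1000):
    # Gradient of the negative log-likelihood: X^T (h - y) / m.
    grad = X.T @ (sigmoid(X @ theta) - y) / len(y)
    theta -= lr * grad

print(sigmoid(X @ theta))  # predicted probabilities
```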
Decision trees and random forests
A classification algorithm.
The tree structure represents the result of classifying the data:
- Root node
- Non-leaf nodes (decision points)
- Leaf nodes (class labels)
- Branches (test outcomes)
Training phase
Classification phase
If two events are independent: $P(X,Y)=P(X)\cdot P(Y)$ and $\log(XY)=\log(X)+\log(Y)$
Starting from the root, classify layer by layer.
Entropy is needed to judge which attribute should occupy the higher-level nodes.
$H(X)$ is the uncertainty of an event, i.e., the degree of internal disorder.
- The larger the probability, the smaller the value of $H(X)$
- The smaller the probability, the larger the value of $H(X)$
Entropy: $H(X) = -\sum\limits_{i=1}^n p_i \ln(p_i)$
Gini coefficient: $\text{Gini}(p)=\sum\limits_{k=1}^K p_k(1-p_k)=1-\sum\limits_{k=1}^K p_k^2$
The larger the dominant class probability $p$, the smaller both the entropy and the Gini coefficient.
Basic idea of constructing a decision tree
As the depth of the tree increases, the entropy of the nodes should decrease rapidly.
The faster the entropy decreases, the shallower the tree can be.
After each split, the weighted sum of the subsets' entropies should be as small as possible; this yields the maximum information gain, i.e., the fastest drop in information entropy (see the sketch below).
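A minimal sketch of the entropy, Gini, and information-gain computations described above (function names and toy labels are illustrative):

```python
import numpy as np

def entropy(labels):
    # H = -sum(p_i * ln(p_i)) over the classes present.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def gini(labels):
    # Gini(p) = 1 - sum(p_k^2).
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    # Gain = H(parent) - weighted sum of the children's entropies.
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

# Illustrative labels for one candidate split.
parent = np.array([0, 0, 0, 1, 1, 1])
left, right = parent[:3], parent[3:]
print(entropy(parent), gini(parent), information_gain(parent, left, right))
```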
Decision tree variants
ID3: information gain
C4.5: information gain ratio
CART: Gini coefficient
ID3's defect:
Information gain favors attributes with very many distinct values: many branches, each containing only a few samples.
Evaluation function: $C(T)=\sum\limits_{t\in \text{leaf}} N_t H(t)$, where $N_t$ is the weight (sample count) of leaf $t$ and $H(t)$ is its entropy.
The smaller the evaluation function, the better; it acts like a loss function.
Continuous attributes can be handled by first discretizing them: the value range of a continuous attribute is divided into intervals.
Handling missing data: missing values can be ignored while building the tree; when computing the gain, only records that have a value for the attribute are considered.
Decision tree pruning
Pre-pruning: terminate early while building the decision tree (prevents overfitting).
Post-pruning: prune only after the full decision tree has been built.
$C_{\alpha}(T)=C(T)+\alpha|T_{\text{leaf}}|$: the more leaf nodes, the greater the loss.
Random forests
Bootstrapping: sampling with replacement.
Bagging: draw $n$ samples with replacement and use each bootstrap sample to build a classifier.
The decision trees then vote together on the final decision.
"Random" means: a random fraction of the samples is selected for training, and the features are also randomly selected (see the sketch below).
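A minimal sketch using scikit-learn's RandomForestClassifier, assuming scikit-learn is available; the dataset and parameter values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative data; the parameters mirror the two sources of randomness:
# bootstrap sampling of the rows and a random feature subset at each split.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = RandomForestClassifier(
    n_estimators=100,      # number of bagged trees
    max_features="sqrt",   # random feature subset at each split
    bootstrap=True,        # sample rows with replacement
    random_state=0,
)
clf.fit(X, y)
print(clf.score(X, y))
```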
Bayesian algorithm
Bayes' formula
$P(A|B)=\frac{P(B|A)P(A)}{P(B)}$
Spelling correction
Spam filtering
Model comparison theory
Maximum likelihood: the hypothesis most consistent with the observed data (largest $P(D|h)$) has the advantage; the greater the likelihood, the better.
Occam's razor: hypotheses with larger prior $P(h)$ have the advantage; the greater the prior probability, the better. Higher-order polynomials are increasingly uncommon (low prior).
Naive Bayes: features are assumed independent of each other and do not influence one another (see the sketch below).
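A minimal sketch of the naive Bayes idea applied to spam filtering; all priors and word probabilities below are made-up illustrative numbers:

```python
import numpy as np

# P(spam | words) ∝ P(spam) * Π P(word_i | spam), using the naive
# independence assumption. All probabilities here are assumed toy values.
p_spam, p_ham = 0.4, 0.6
p_word_given_spam = {"free": 0.30, "meeting": 0.02}
p_word_given_ham = {"free": 0.03, "meeting": 0.20}

words = ["free", "meeting"]
score_spam = p_spam * np.prod([p_word_given_spam[w] for w in words])
score_ham = p_ham * np.prod([p_word_given_ham[w] for w in words])

# Normalize to get the posterior; the denominator P(words) cancels out.
print(score_spam / (score_spam + score_ham))
```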
XGBoost
Ensemble classifiers
Predicted value: $\hat{y}_i = \sum_j w_j x_{ij}$
Objective (loss) function: $l(y_i,\hat{y}_i)=(y_i-\hat{y}_i)^2$
Optimal solution: $F^*(x)=\arg\min\limits_F E_{(x,y)}[L(y,F(x))]$
Basic idea: every tree that is added improves on what has already been built.
$\hat{y}_i^{(0)} = 0$
$\hat{y}_i^{(1)}=f_1(x_i)=\hat{y}_i^{(0)}+f_1(x_i)$
$\hat{y}_i^{(t)}=\sum\limits_{k=1}^t f_k(x_i)=\hat{y}_i^{(t-1)}+f_t(x_i)$
That is, the prediction at round $t$ keeps the predictions of the first $t-1$ rounds and adds a new function.
Penalty term (one per tree): $\Omega(f_t)=\gamma T+\frac{1}{2}\lambda\sum\limits_{j=1}^T w_j^2$
The first term counts the leaf nodes; the second is the regularization penalty on the leaf weights. Together with the loss they form the total objective.
$obj^{(t)}=\sum\limits_{i=1}^n l(y_i,\hat{y}_i^{(t)})+\sum\limits_{i=1}^t \Omega(f_i) =\sum\limits_{i=1}^n l(y_i,\hat{y}_i^{(t-1)}+f_t(x_i))+\Omega(f_t)+c$
We need to find the $f_t$ that optimizes this objective.
Optimization with Taylor expansion
$obj^{(t)}=\sum\limits_{i=1}^n\left[l(y_i,\hat{y}_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_if_t^2(x_i)\right]+\Omega(f_t)+const$
$g_i$ is the first-order derivative and $h_i$ the second-order derivative of the loss.
Rewriting the sum as a traversal over the leaf nodes:
$obj^{(t)}=\sum\limits_{i=1}^n\left[g_i w_{q(x_i)}+\frac{1}{2}h_i w_{q(x_i)}^2\right]+\gamma T +\frac{1}{2}\lambda \sum\limits_{j=1}^T w_j^2 =\sum\limits_{j=1}^T\left[\left(\sum\limits_{i \in I_j}g_i\right)w_j+\frac{1}{2}\left(\sum\limits_{i \in I_j}h_i+ \lambda\right)w_j^2\right]+\gamma T$
Define $G_j=\sum\limits_{i \in I_j}g_i$ and $H_j=\sum\limits_{i \in I_j}h_i$:
$obj^{(t)}=\sum\limits_{j=1}^T\left[G_jw_j+\frac{1}{2}(H_j+\lambda)w_j^2\right]+ \gamma T$
Setting the partial derivative with respect to $w_j$ to 0 gives
$w_j=-\frac{G_j}{H_j+\lambda}$
$Obj=-\frac{1}{2}\sum\limits_{j=1}^T\frac{G_j^2}{H_j+\lambda}+\gamma T$
To decide whether to split a node into left and right children,
compute the change in the objective after the split:
$Gain=\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right]-\gamma$
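A minimal sketch of the leaf-weight and gain formulas above (the gradient and hessian arrays are illustrative toy values):

```python
import numpy as np

def leaf_weight(g, h, lam):
    # Optimal leaf weight: w* = -G / (H + lambda).
    return -np.sum(g) / (np.sum(h) + lam)

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    # Gain formula from above; split only if the gain is positive.
    def score(g, h):
        return np.sum(g) ** 2 / (np.sum(h) + lam)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(np.concatenate([g_left, g_right]),
                          np.concatenate([h_left, h_right]))) - gamma

# Illustrative per-sample gradients g_i and hessians h_i after a candidate split.
g_left, h_left = np.array([-1.2, -0.8]), np.array([2.0, 2.0])
g_right, h_right = np.array([0.9, 1.1]), np.array([2.0, 2.0])
print(split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.1))
print(leaf_weight(g_left, h_left, lam=1.0))
```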
AdaBoost
Adaptive boosting.
Samples misclassified by the previous classifier are given larger weights, and the reweighted samples are all used to train the next base classifier. A new weak classifier is added in each round, until a sufficiently small error rate or the pre-specified maximum number of iterations is reached.
The final classifier is a weighted combination of the multiple weak classifiers.
- Initialize the weight distribution of the data; at the start all samples have the same weight
- Train a weak classifier; if a sample is misclassified, increase its weight, and if it is classified correctly, decrease its weight
- Combine the weak classifiers by weighting into a strong classifier (see the sketch below)
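A minimal sketch of one AdaBoost round, assuming labels in {-1, +1} and an illustrative weak learner's predictions:

```python
import numpy as np

# y: true labels; pred: output of the current weak classifier (illustrative).
y = np.array([1, 1, -1, -1, 1])
pred = np.array([1, -1, -1, -1, 1])
w = np.full(len(y), 1.0 / len(y))        # initially uniform sample weights

err = np.sum(w[pred != y])               # weighted error rate
alpha = 0.5 * np.log((1 - err) / err)    # weight of this classifier in the vote

# Increase the weights of misclassified samples, decrease the correct ones,
# then renormalize to a distribution.
w = w * np.exp(-alpha * y * pred)
w /= w.sum()
print(alpha, w)
```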
Support vector machine
Classification problem
Suppose there is a hyperplane: $w^Tx+b=0$
Two points $x'$ and $x''$ on the hyperplane satisfy $w^Tx'=-b$ and $w^Tx''=-b$
So $w$ is the normal vector of the plane: $w^T(x''-x')=0$ ($x'$ and $x''$ are written as vectors)
distance(point to plane) $=\left|\frac{w^T}{||w||}(x-x')\right|=\frac{1}{||w||}\left|w^Tx+b\right|$
In SVM classification, positive examples have $y=1$ and negative examples $y=-1$,
so correct classification satisfies $y_i\,y(x_i)>0$.
Find the line that pushes even the points closest to it as far away as possible:
$\arg\max\limits_{w,b}\left(\min\limits_i \frac{y_i(w^T x_i+b)}{||w||}\right)$
By rescaling $w$ and $b$ we may require: $y_i(w^Tx_i+b) \ge 1$
So we need $\max\limits_{w,b}\frac{1}{||w||}$,
which is equivalent to $\min\limits_{w,b}\frac{1}{2}||w||^2$ subject to $y_i(w^Tx_i+b)\ge 1$
Using Lagrange multipliers :
$L(w,b,\alpha)=\frac{1}{2}||w||^2-\sum\limits_{i=1}^n{\alpha_i}\left(y_i(w^Tx_i+b) - 1\right)$
The dual problem: $\min\limits_{w,b}\max\limits_{\alpha}L(w,b,\alpha)\ \ge\ \max\limits_{\alpha}\min\limits_{w,b}L(w,b,\alpha)$
Taking partial derivatives with respect to $w$ and $b$ gives two conditions:
$\frac{\partial{L}}{\partial{w}}=0 \Rightarrow w=\sum\limits_{i=1}^n \alpha_iy_ix_i$
$\frac{\partial{L}}{\partial{b}}=0 \Rightarrow \sum\limits_{i=1}^n \alpha_iy_i=0$
Substituting these back, only the optimization over $\alpha$ remains.
Lagrange multiplier method:
$\min f(x)$
$s.t. \quad g_i(x) \le 0, \quad i=1,\dots,m$
The support vectors are the points that determine the separating face; they determine the maximum-margin separating hyperplane.
Soft margin
Individual points can affect the entire separating hyperplane.
Introducing slack variables turns it into a soft-margin problem:
$y_i(wx_i+b)\ge1-\varepsilon_i$
Objective function: $\min \frac{1}{2}||w||^2+C\sum\limits_{i=1}^n \varepsilon_i$
As $C$ approaches infinity: no classification mistakes are tolerated.
As $C$ approaches zero: mistakes are tolerated more generously.
Kernel function
Mapping from low dimensional space to high dimensional space
Benefit of kernel functions: the inner product of high-dimensional samples is computed in the low-dimensional space.
The computation simplifies to an inner product in the low dimension, with the result corresponding to the mapping into the high dimension;
the result is the same as taking the inner product in the high-dimensional space.
Gaussian kernel
$K(X,Y)=\exp\left(-\frac{||X-Y||^2}{2\sigma^2}\right)$
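A minimal sketch of the Gaussian kernel; the input vectors and $\sigma$ are illustrative:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

# The kernel trick: this value equals an inner product in a (here infinite-
# dimensional) feature space, computed entirely in the original space.
x = np.array([1.0, 2.0])
y = np.array([2.0, 0.5])
print(gaussian_kernel(x, y))
```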
ARIMA
Stationarity:
- Stationarity means the curve fitted to the sample time series can continue "by inertia" along its existing form into the future
- Stationarity requires that the mean and variance of the series show no significant change
Strict stationarity and weak stationarity:
- Strict stationarity: the distribution does not change with time. Example: white noise, where however the series is sampled, the expectation is 0 and the variance is 1
- Weak stationarity: the expectation and the correlation coefficients (dependence) are unchanged. The value at a future time $t$ depends on its past information, so dependence is needed
Making the data stationary:
- Differencing: take the difference between times $t$ and $t-1$ of the series
Autoregressive model (AR)
- Describes the relationship between the current value and historical values; the variable's own historical data are used to predict itself
- The autoregressive model must satisfy the stationarity requirement
- Formula for a $p$-th order autoregressive process: $y_t=\mu+\sum\limits_{i=1}^p \gamma_iy_{t-i}+\epsilon_t$
- $y_t$ is the current value, $\mu$ a constant term, $p$ the order, $\gamma_i$ the autocorrelation coefficients, $\epsilon_t$ the error
Limitations of autoregressive models:
- The autoregressive model predicts from its own data
- The series must be stationary
- There must be autocorrelation; if the autocorrelation coefficient $\varphi_i<0.5$, the model should not be used
- Autoregression only applies to predicting phenomena related to their own earlier values
Moving average model (MA)
- The moving average model focuses on the accumulation of the error terms in the autoregressive model
- Formula for a $q$-th order moving average process: $y_t=\mu+\epsilon_t+\sum\limits_{i=1}^q \theta_i \epsilon_{t-i}$
- The moving average effectively smooths out random fluctuations in the prediction
Autoregressive moving average model (ARMA)
- The combination of autoregressive and moving average
- Formula: $y_t=\mu+\sum\limits_{i=1}^p\gamma_iy_{t-i}+\epsilon_t+\sum\limits_{i=1}^q\theta_i\epsilon_{t-i}$
ARIMA: autoregressive integrated (differenced) moving average model
It first transforms a non-stationary time series into a stationary one,
then builds the model by regressing the dependent variable only on its own lagged values and on the present and lagged values of the random error term.
Choosing the values of $p$ and $q$:
Autocorrelation function (ACF)
- An ordered sequence of random variables is compared with itself; the autocorrelation function reflects the correlation between values of the same series at different times
- $ACF(k)=\varrho_k=\frac{Cov(y_t,y_{t-k})}{Var(y_t)}$
Partial autocorrelation function (PACF)
- What the ACF measures is not a simple (direct) correlation between $x(t)$ and $x(t-k)$
- $x(t)$ is simultaneously influenced by $x(t-1),\dots,x(t-k+1)$
- PACF removes the interference of the intermediate $k-1$ random variables $x(t-1),\dots,x(t-k+1)$ and measures the correlation of $x(t-k)$ with $x(t)$ alone
| Model | ACF | PACF |
|---|---|---|
| AR(p) | Decays toward 0 (geometrically or oscillating) | Cuts off after lag p |
| MA(q) | Cuts off after lag q | Decays toward 0 (geometrically or oscillating) |
| ARMA(p,q) | Decays toward 0 after lag q (geometrically or oscillating) | Decays toward 0 after lag p (geometrically or oscillating) |
Cut-off: the values fall within the confidence interval (95% of the points follow the rule).
Modeling process
- Make the series stationary (determine $d$ by differencing)
- Determine the orders $p$ and $q$ from the ACF and PACF
- Fit ARIMA(p, d, q)
Model selection with AIC and BIC (the lower the better):
AIC, the Akaike information criterion: $AIC=2k-2\ln(L)$
BIC, the Bayesian information criterion: $BIC=k\ln(n)-2\ln(L)$
$k$ is the number of model parameters, $n$ the number of samples, $L$ the likelihood function (see the sketch below).
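A minimal sketch of the modeling process with statsmodels, assuming statsmodels is installed; the series and the order (1, 1, 1) are illustrative:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Illustrative series: a random walk, which becomes stationary after one
# difference (d=1). In practice, pick p and q from the ACF/PACF plots.
rng = np.random.default_rng(0)
series = pd.Series(np.cumsum(rng.normal(size=200)))

model = ARIMA(series, order=(1, 1, 1))   # assumed orders (p, d, q)
result = model.fit()
print(result.aic, result.bic)            # lower is better for model selection
print(result.forecast(steps=5))          # out-of-sample forecast
```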
Residual diagnostics:
check whether the residuals of the ARIMA model follow a normal distribution with mean 0 and constant variance.
Neural networks
An image is represented in the computer as a three-dimensional array.
K-nearest neighbor algorithm
A sample is assigned to whichever class is most common among its $k$ nearest neighbors.
For a point in a dataset with unknown category:
- Compute the distance between each point in the labeled dataset and the current point
- Sort by distance
- Select the $k$ points with the smallest distance to the current point
- Determine the frequency of each category among those $k$ points
- Return the most frequent category among the $k$ points as the predicted class of the current point
No training is required; the computational complexity is proportional to the number of documents in the training set, i.e., $O(n)$.
Distance computation (the distance metric is a hyperparameter):
Manhattan distance ($L1$): $d_1(I_1,I_2)=\sum\limits_p|I_1^p-I_2^p|$
Euclidean distance ($L2$): $d_2(I_1,I_2)=\sqrt{\sum\limits_p(I_1^p-I_2^p)^2}$
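A minimal KNN sketch implementing the steps above with both distance metrics (data and function names are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3, metric="l2"):
    # Compute the distance from the query point to every training point.
    if metric == "l1":
        dists = np.abs(X_train - x).sum(axis=1)            # Manhattan
    else:
        dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # Euclidean
    nearest = np.argsort(dists)[:k]                        # k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                       # majority vote

# Illustrative data: two clusters.
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.5, 5.0]), k=3))
```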
Choosing the KNN parameter $k$:
Use cross-validation: take part of the training set as a validation set to tune the model parameters (taking each part in turn as the validation set).
Neural network loss function
$L=\frac{1}{N} \sum\limits_{i=1}^N\sum\limits_{j \ne y_i}\max(0,\,f(x_i;W)_j-f(x_i;W)_{y_i}+\delta)$
$\delta$ is the tolerated margin.
Regularization penalty term
$L=\frac{1}{N} \sum\limits_{i=1}^N\sum\limits_{j \ne y_i}\max(0,\,f(x_i;W)_j-f(x_i;W)_{y_i}+\delta)+\lambda \sum\limits_k \sum\limits_l W_{k,l}^2$
Effect: penalizes large weight parameters (see the sketch below).
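A minimal sketch of the hinge loss with the regularization penalty (scores, labels, and the weight matrix are illustrative):

```python
import numpy as np

def svm_loss(scores, y, W, delta=1.0, lam=1e-3):
    # Multiclass hinge loss with an L2 weight penalty.
    # scores: (N, C) class scores f(x_i; W); y: (N,) correct class indices.
    N = scores.shape[0]
    correct = scores[np.arange(N), y][:, None]
    margins = np.maximum(0, scores - correct + delta)
    margins[np.arange(N), y] = 0          # skip the j == y_i term
    data_loss = margins.sum() / N
    reg_loss = lam * np.sum(W ** 2)       # regularization penalty
    return data_loss + reg_loss

# Illustrative scores for 2 samples and 3 classes, with a dummy weight matrix.
scores = np.array([[3.2, 5.1, -1.7], [1.3, 4.9, 2.0]])
y = np.array([0, 1])
W = np.random.default_rng(0).normal(size=(4, 3))
print(svm_loss(scores, y, W))
```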
Softmax classifier
Softmax output: normalized class probabilities.
Loss function: cross-entropy loss $L_i=-\log \frac{e^{f_{y_i}}}{\sum_je^{f_j}}$
$f_j(z)=\frac{e^{z_j}}{\sum_ke^{z_k}}$ is called the softmax function.
The input is a vector of arbitrary real-valued scores;
the output is a vector whose elements each lie between 0 and 1 and sum to 1 (see the sketch below).
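A minimal sketch of the softmax function and cross-entropy loss (the scores are illustrative):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result sums to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy_loss(scores, y):
    # L_i = -log(softmax(scores)[y_i])
    return -np.log(softmax(scores)[y])

scores = np.array([3.2, 5.1, -1.7])   # illustrative class scores
print(softmax(scores))                # probabilities in (0,1) summing to 1
print(cross_entropy_loss(scores, y=0))
```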
Machine learning optimization
Back propagation
Gradient descent
Batch size: the amount the machine processes at a time, i.e., multiple samples per iteration (forward pass + backward pass).
During training, the overall trend is convergence.
Epoch: one complete pass over all the data.
Back propagation
Add gate: distributes the gradient equally.
Max gate: passes the gradient to the largest input.
Multiply gate: swaps; each input's gradient is scaled by the other input's value.
Characteristics:
- Hierarchical structure
- Nonlinearity
The sigmoid function suffers severely from vanishing gradients and is no longer used.
Now ReLU, $\max(0,x)$, is mainly used.
The more neurons, the better the classification (at the cost of a higher risk of overfitting).
Preventing overfitting: regularization.
Data preprocessing
Weight initialization: $b$ initialized to a constant, $w$ initialized randomly.
Preventing overfitting:
Dropout: during training, a random subset of the neurons is kept for each forward and backward propagation (see the sketch below).
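A minimal sketch of inverted dropout, an assumed implementation detail: activations are rescaled at training time so nothing changes at test time.

```python
import numpy as np

def dropout_forward(x, p_keep=0.5, train=True):
    # During training, keep each activation with probability p_keep and
    # rescale by 1/p_keep; at test time, pass activations through unchanged.
    if not train:
        return x
    mask = (np.random.rand(*x.shape) < p_keep) / p_keep
    return x * mask

activations = np.random.rand(4, 5)        # illustrative layer activations
print(dropout_forward(activations, p_keep=0.5))
```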