
Machine learning (Part 1)

2022-06-26 08:54:00 Thick Cub with thorns


Fields related to machine learning:

pattern recognition

Computer vision

data mining

speech recognition

Statistical learning

natural language processing

  1. Training samples
  2. Feature extraction
  3. Learning a function
  4. Prediction
  • Supervised problems: labeled data

  • Unsupervised problems: unlabeled data

  • Regression: output continuous values

  • Classification: output discrete categories

Linear regression

$h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2$

$h_\theta(x)=\sum_{i=0}^n\theta_ix_i=\theta^Tx$

$y^{(i)}=\theta^Tx^{(i)}+\varsigma^{(i)}$

The errors are assumed to be independent and identically distributed, following a Gaussian distribution with mean 0 and variance $\sigma^2$.

$p(\varsigma^{(i)})=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(\varsigma^{(i)})^2}{2\sigma^2}\right)$

$p(y^{(i)}|x^{(i)};\theta)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right)$

Maximum likelihood function:

$L(\theta)=\prod_{i=1}^m p(y^{(i)}|x^{(i)};\theta)=\prod_{i=1}^m\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right)$

We seek $\arg\max L(\theta)$.

$\ell(\theta)=\log L(\theta)$

$\ell(\theta)=m\log\frac{1}{\sqrt{2\pi}\sigma}-\frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^m(y^{(i)}-\theta^Tx^{(i)})^2$

$J(\theta)=\frac{1}{2}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$

We therefore seek $\arg\min J(\theta)$.

$J(\theta)=\frac{1}{2}(X\theta-y)^T(X\theta-y)$

$\nabla_\theta J(\theta)=\nabla_\theta\left(\frac{1}{2}(\theta^TX^T-y^T)(X\theta-y)\right)=\nabla_\theta\left(\frac{1}{2}(\theta^TX^TX\theta-\theta^TX^Ty-y^TX\theta+y^Ty)\right)=X^TX\theta-X^Ty$

Setting the gradient to zero gives $\theta=(X^TX)^{-1}X^Ty$
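The closed-form solution can be checked numerically; a minimal sketch with NumPy (the toy data is invented for illustration):

```python
import numpy as np

# Normal equation theta = (X^T X)^{-1} X^T y from the derivation above.
# Toy data generated from y = 1 + 2*x with no noise, so the recovered
# parameters should be exactly [1, 2].
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])          # first column is the bias feature x_0 = 1
y = np.array([1.0, 3.0, 5.0])

theta = np.linalg.inv(X.T @ X) @ X.T @ y
```

In practice `np.linalg.lstsq` is preferred over forming the explicit inverse, since $X^TX$ may be ill-conditioned.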

Logistic regression

Usable for classification (binary classification) as well as regression

$h_\theta(x)=g(\theta^Tx)=\frac{1}{1+e^{-\theta^Tx}}$

Output range: $(0,1)$

$h_\theta'(x)=h_\theta(x)(1-h_\theta(x))$
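The derivative identity can be verified with a quick finite-difference check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Verify g'(z) = g(z) * (1 - g(z)) numerically at an arbitrary point.
z, eps = 0.7, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(z) * (1 - sigmoid(z))
```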

Gradient descent is used for optimization

There is no closed-form solution, so the derivative is not solved for directly

Decision trees and random forests

Classification algorithm

A tree structure represents the result of classifying the data

  • Root node
  • Non-leaf nodes (decision points)
  • Leaf nodes (class labels)
  • Branches (test outcomes)

Training phase

Classification stage

If two events are independent of each other: $P(X,Y)=P(X)P(Y)$, and $\log(XY)=\log(X)+\log(Y)$

Start at the root and split layer by layer.

Entropy is used to decide which attribute should be the next node

$H(X)$ measures the uncertainty of an event — the degree of internal disorder

The larger the probability $P$, the smaller the value of $H(X)$; the smaller the probability, the larger $H(X)$.

Entropy: $H=-\sum_{i=1}^n p_i\ln(p_i)$

Gini index: $\text{Gini}(p)=\sum_{k=1}^K p_k(1-p_k)=1-\sum_{k=1}^K p_k^2$

The larger $p$ is (the purer the node), the smaller both the entropy and the Gini index
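Both impurity measures are easy to compute directly; a small sketch:

```python
import math

def entropy(probs):
    # H = -sum_i p_i * ln(p_i); terms with p_i = 0 contribute nothing
    return -sum(p * math.log(p) for p in probs if p > 0)

def gini(probs):
    # Gini(p) = 1 - sum_k p_k^2
    return 1.0 - sum(p * p for p in probs)

pure = [1.0, 0.0]       # perfectly pure node: both measures are 0
uniform = [0.5, 0.5]    # maximally mixed two-class node
```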

The basic idea of constructing decision tree

The basic idea of building the tree is that as its depth increases, the entropy of the nodes should decrease rapidly.

The faster entropy decreases, the shallower the tree can be.

After each split, the split with the smallest total entropy over the child sets is best: it yields the maximum information gain and makes the entropy fall fastest.

The version of the decision tree

ID3: information gain

C4.5: information gain ratio

CART: Gini index

ID3 defect:

Information gain favors attributes with many distinct values, where each branch ends up with only a few samples

Evaluation function: $C(T)=\sum_{t\in\text{leaf}}N_tH(t)$, where $N_t$ is the sample count (weight) at leaf $t$ and $H(t)$ is its entropy

The smaller the evaluation function, the better; it acts like a loss function

Continuous attributes can be handled by first discretizing them — dividing the value range into intervals

Handling missing data: when building the tree, missing values can be ignored; when computing gain, only records that have the attribute value are considered

Decision tree pruning

Pre-pruning: terminate early while the decision tree is being built (prevents overfitting)

Post-pruning: prune only after the tree has been fully built

$C_\alpha(T)=C(T)+\alpha|T_{\text{leaf}}|$ — the more leaf nodes, the greater the loss

Random forests

Bootstrapping: sampling with replacement

Bagging: draw $n$ samples with replacement and use them to build a classifier; repeat to build many classifiers

The decision trees then make a joint decision by voting

Randomness: a random fraction of the samples is selected for training, and features are selected at random
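Sampling with replacement can be sketched in a few lines; as a sanity check, a bootstrap sample of size $n$ contains on average about $1-1/e\approx63.2\%$ of the distinct original points (this is a toy sketch of the sampling step only, not of the forest):

```python
import numpy as np

# Draw a bootstrap sample: n indices drawn with replacement from n points.
rng = np.random.default_rng(0)
n = 10_000
sample_idx = rng.integers(0, n, size=n)

# Fraction of distinct original points that made it into the sample;
# expected to be close to 1 - 1/e ≈ 0.632.
frac_unique = len(np.unique(sample_idx)) / n
```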

Bayesian algorithm

Bayes' formula

$P(A|B)=\frac{P(B|A)P(A)}{P(B)}$
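Applied to spam filtering, the formula reads $P(\text{spam}\mid\text{word})=\frac{P(\text{word}\mid\text{spam})P(\text{spam})}{P(\text{word})}$. A sketch with made-up probabilities (the numbers below are purely illustrative):

```python
# Bayes' rule with illustrative (invented) spam-filter probabilities.
p_spam = 0.2              # prior P(A): a message is spam
p_word_spam = 0.6         # P(B|A): a trigger word appears in spam
p_word_ham = 0.05         # the trigger word appears in normal mail

# Law of total probability gives the evidence P(B).
p_word = p_word_spam * p_spam + p_word_ham * (1 - p_spam)

# Posterior P(A|B) = P(B|A) P(A) / P(B)
p_spam_given_word = p_word_spam * p_spam / p_word   # 0.75
```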

Spelling correction

Spam filtering

Model comparison theory

Maximum likelihood: the hypothesis most consistent with the observed data — the one with the largest likelihood $P(D|h)$ — is favored.

Occam's razor: hypotheses with a larger prior $P(h)$ are favored; higher-order polynomials are less common, so they receive a smaller prior.

Naive Bayes: features are assumed to be mutually independent, not influencing one another.

Xgboost

Ensemble classifiers

Predicted value: $\hat y_i=\sum_j w_jx_{ij}$

Loss: $l(y_i,\hat y_i)=(y_i-\hat y_i)^2$

Optimal solution: $F^*(x)=\arg\min E_{(x,y)}[L(y,F(x))]$

Basic idea: each tree that is added improves on what has been built so far

$\hat y_i^{(0)}=0$

$\hat y_i^{(1)}=f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)$

$\hat y_i^{(t)}=\sum_{k=1}^t f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)$

That is, the round-$t$ prediction keeps the first $t-1$ rounds' prediction and adds one new function

Penalty term, applied to every tree: $\Omega(f_t)=\gamma T+\frac{1}{2}\lambda\sum_{j=1}^T w_j^2$

The first term counts the leaf nodes; the second is the regularization penalty on the leaf weights. Together with the loss they form the total objective.

$obj^{(t)}=\sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum_{i=1}^t\Omega(f_i)=\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+\Omega(f_t)+c$

We need to find the $f_t$ that optimizes this objective

Optimization with Taylor expansion

$obj^{(t)}=\sum_{i=1}^n\left[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_if_t^2(x_i)\right]+\Omega(f_t)+const$

$g_i$ is the first derivative and $h_i$ the second derivative of the loss

Rewriting as a sum over the leaf nodes:

$obj^{(t)}=\sum_{i=1}^n\left[g_iw_{q(x_i)}+\frac{1}{2}h_iw_{q(x_i)}^2\right]+\gamma T+\frac{1}{2}\lambda\sum_{j=1}^T w_j^2=\sum_{j=1}^T\left[\left(\sum_{i\in I_j}g_i\right)w_j+\frac{1}{2}\left(\sum_{i\in I_j}h_i+\lambda\right)w_j^2\right]+\gamma T$

Define $G_j=\sum_{i\in I_j}g_i$ and $H_j=\sum_{i\in I_j}h_i$.

$obj^{(t)}=\sum_{j=1}^T\left[G_jw_j+\frac{1}{2}(H_j+\lambda)w_j^2\right]+\gamma T$

Setting the partial derivative with respect to $w_j$ to 0 gives $w_j=-\frac{G_j}{H_j+\lambda}$

$Obj=-\frac{1}{2}\sum_{j=1}^T\frac{G_j^2}{H_j+\lambda}+\gamma T$

To decide whether to split a node into left and right children, compute the gain of the split:

$Gain=\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right]-\gamma$
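The split decision can be sketched directly from this formula; the gradient/Hessian sums below are made-up numbers:

```python
# Gain of splitting one node into left/right children (XGBoost criterion).
def split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    # Gain = 1/2 [ GL^2/(HL+lam) + GR^2/(HR+lam) - (GL+GR)^2/(HL+HR+lam) ] - gamma
    return 0.5 * (GL**2 / (HL + lam)
                  + GR**2 / (HR + lam)
                  - (GL + GR)**2 / (HL + HR + lam)) - gamma

# Illustrative values: gradients of opposite sign separate cleanly,
# so the split has positive gain and should be performed.
gain = split_gain(GL=4.0, HL=2.0, GR=-4.0, HR=2.0, lam=1.0, gamma=0.5)
```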

Adaboost

Adaptive boosting

Samples misclassified by the previous classifier are given larger weights, and the reweighted samples are used to train the next base classifier. A new weak classifier is added in each round, until the error rate is small enough or the preset maximum number of iterations is reached.

The final classifier is a weighted combination of the weak classifiers.

  1. Initialize the sample weight distribution; at the start all samples have equal weight
  2. Train a weak classifier; if a sample is misclassified, increase its weight, and if it is classified correctly, decrease its weight
  3. Combine the weak classifiers into a strong classifier by weighted voting
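One round of the weight update in step 2 can be sketched as follows (four samples, one misclassified; the numbers are illustrative and the standard $\alpha=\frac{1}{2}\ln\frac{1-err}{err}$ formula is assumed):

```python
import math

# Step 1: uniform initial weights over 4 samples.
w = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]   # the weak learner misses sample 3

# Weighted error and the classifier's weight alpha.
err = sum(wi for wi, ok in zip(w, correct) if not ok)
alpha = 0.5 * math.log((1 - err) / err)

# Step 2: raise weights of misclassified samples, lower the rest, renormalize.
w = [wi * math.exp(alpha if not ok else -alpha) for wi, ok in zip(w, correct)]
total = sum(w)
w = [wi / total for wi in w]
```

After this round the misclassified sample carries half of the total weight, so the next weak classifier focuses on it.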

Support vector machine

Classification problem

Suppose there is a hyperplane: $w^Tx+b=0$

Take two points $x'$ and $x''$ on the hyperplane; they satisfy $w^Tx'=-b$ and $w^Tx''=-b$.

Hence $w$ is the normal vector of the plane: $w^T(x''-x')=0$

($x'$ and $x''$ are written in vector form.)

distance(point to plane) $=\left|\frac{w^T}{\|w\|}(x-x')\right|=\frac{1}{\|w\|}|w^Tx+b|$

In SVM classification, positive examples are labeled $y=1$ and negative examples $y=-1$, so a correct classification satisfies $y_i\,y(x_i)>0$.

Find the hyperplane that maximizes the distance of the closest point:

$\arg\max_{w,b}\left(\min_i\frac{y_i(w^Tx_i+b)}{\|w\|}\right)$

By rescaling $w$ and $b$ we can require $y_i(w^Tx_i+b)\ge1$.

So we need $\max_{w,b}\frac{1}{\|w\|}$, which is equivalent to $\min_{w,b}\frac{1}{2}\|w\|^2$ subject to $y_i(w^Tx_i+b)\ge1$.

Using Lagrange multipliers :

$L(w,b,\alpha)=\frac{1}{2}\|w\|^2-\sum_{i=1}^n\alpha_i\left(y_i(w^Tx_i+b)-1\right)$

The dual problem: $\min_{w,b}\max_\alpha L(w,b,\alpha)\ge\max_\alpha\min_{w,b}L(w,b,\alpha)$

Taking partial derivatives with respect to $w$ and $b$ yields two conditions:

$\frac{\partial L}{\partial w}=0\ \Rightarrow\ w=\sum_{i=1}^n\alpha_iy_ix_i$

$\frac{\partial L}{\partial b}=0\ \Rightarrow\ \sum_{i=1}^n\alpha_iy_i=0$

What remains is to optimize over $\alpha$.

Lagrange multiplier method

$\min f(x)$

s.t. $g_i(x)\le0,\quad i=1,\dots,m$

The support vectors are the points that determine the separating hyperplane and its margin

Soft margin

Individual outlier points can distort the separating hyperplane.

Introducing slack variables turns this into a soft-margin problem:

$y_i(wx_i+b)\ge1-\varepsilon_i$

Objective function: $\min\frac{1}{2}\|w\|^2+C\sum_{i=1}^n\varepsilon_i$

When $C$ approaches infinity, classification errors are not tolerated.

When $C$ is small, there is greater tolerance for errors.

Kernel function

Maps from a low-dimensional space to a high-dimensional one.

Benefit: the inner product of high-dimensional samples can be computed in the low-dimensional space.

Computing the kernel (an inner product) in low dimension gives the same result as first mapping the samples to high dimension and taking the inner product there.

Gaussian kernel

$K(X,Y)=\exp\left(-\frac{\|X-Y\|^2}{2\sigma^2}\right)$
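A direct sketch of the Gaussian (RBF) kernel; the test points are invented:

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2)); equals 1 when x == y
    # and decays toward 0 as the points move apart.
    return float(np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2)))

x = np.array([1.0, 2.0])
near = np.array([1.1, 2.0])
far = np.array([4.0, 6.0])
```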

ARIMA

Stationarity:

  • Stationarity means the curve fitted to the sample time series can be extrapolated into the future "by inertia"
  • It requires that the mean and variance of the series show no significant change

Strict vs. weak stationarity:

  • Strict stationarity: the distribution does not change with time. Example: white noise — expectation 0 and variance 1, regardless of when it is sampled
  • Weak stationarity: the expectation and the correlation coefficients (dependence) are constant. The value at a future time $t$ depends on past information, so this dependence is needed

Making the data stationary:

  • Differencing: subtract the value at time $t-1$ from the value at time $t$
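First-order differencing can be sketched with NumPy; the trending toy series below becomes constant (stationary) after one difference:

```python
import numpy as np

# Series with a linear trend: clearly non-stationary (the mean keeps growing).
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# First-order difference: y'_t = y_t - y_{t-1}; removes the linear trend.
diff1 = np.diff(y)
```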

Autoregressive model (AR)

  • Describe the relationship between current value and historical value , Use the historical time data of variables to predict themselves
  • Autoregressive model must satisfy the requirement of stationarity
  • Formula for a $p$-order autoregressive process: $y_t=\mu+\sum_{i=1}^p\gamma_iy_{t-i}+\epsilon_t$
  • $y_t$ is the current value, $\mu$ a constant term, $p$ the order, $\gamma_i$ the autocorrelation coefficients, and $\epsilon_t$ the error

Limitations of autoregressive models :

  • Autoregressive model uses its own data to predict
  • Must be stable
  • Must exhibit autocorrelation; if the autocorrelation coefficient $\varphi_i<0.5$, the model should not be used
  • Autoregression is only applicable to predict the phenomenon related to its own early stage

Moving average model (MA)

  • The moving average model focuses on the accumulation of error terms in the autoregressive model
  • Formula for a $q$-order moving average process: $y_t=\mu+\epsilon_t+\sum_{i=1}^q\theta_i\epsilon_{t-i}$
  • The moving average method can effectively eliminate the random fluctuation in prediction

Autoregressive moving average model (ARMA)

  • The combination of autoregressive and moving average
  • Formula: $y_t=\mu+\sum_{i=1}^p\gamma_iy_{t-i}+\epsilon_t+\sum_{i=1}^q\theta_i\epsilon_{t-i}$

ARIMA: Differential autoregressive moving average model

It transforms a non-stationary time series into a stationary one, then builds a model by regressing the dependent variable on its own lagged values and on the present and lagged values of the random error term.

Choosing the values of $p$ and $q$

Autocorrelation function (ACF)

  • Measures the correlation of an ordered sequence of random variables with itself; it reflects the correlation between values of the same series at different times
  • $ACF(k)=\varrho_k=\frac{Cov(y_t,y_{t-k})}{Var(y_t)}$

Partial autocorrelation function (PACF)

  • ACF does not give the simple (direct) correlation between $x(t)$ and $x(t-k)$
  • because $x(t)$ is also influenced by the intermediate variables $x(t-1)\dots x(t-k+1)$
  • PACF removes the interference of these $k-1$ intermediate random variables $x(t-1),\dots,x(t-k+1)$ and measures the direct influence of $x(t-k)$ on $x(t)$
| Model | ACF | PACF |
| --- | --- | --- |
| AR(p) | Decays toward 0 (geometrically or oscillating) | Cuts off after lag $p$ |
| MA(q) | Cuts off after lag $q$ | Decays toward 0 (geometrically or oscillating) |
| ARMA(p,q) | Decays toward 0 after lag $q$ | Decays toward 0 after lag $p$ |

Cut-off: the values fall within the confidence interval (95% of the points conform)

Modeling process

  • Make the series stationary (determine $d$ by differencing)
  • Determine the orders $p$ and $q$ from the ACF and PACF
  • Fit ARIMA($p$,$d$,$q$)

Model selection with AIC and BIC (lower is better):

AIC: Akaike information criterion. $AIC=2k-2\ln(L)$

BIC: Bayesian information criterion. $BIC=k\ln(n)-2\ln(L)$

$k$ is the number of model parameters, $n$ the number of samples, and $L$ the likelihood.
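Comparing two candidate fits with these criteria; the log-likelihoods below are invented for illustration:

```python
import math

def aic(k, log_l):
    # AIC = 2k - 2 ln(L); log_l is already ln(L)
    return 2 * k - 2 * log_l

def bic(k, n, log_l):
    # BIC = k ln(n) - 2 ln(L)
    return k * math.log(n) - 2 * log_l

# A small model vs a bigger model with a marginally better likelihood:
# the extra parameters are not worth the tiny likelihood improvement.
small = {"k": 3, "log_l": -120.0}
big = {"k": 6, "log_l": -119.0}

aic_small, aic_big = aic(**small), aic(**big)
bic_small, bic_big = bic(n=100, **small), bic(n=100, **big)
```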

Residual test:

Check whether the residuals of the ARIMA model follow a normal distribution with mean 0 and constant variance.

neural network

A picture is represented as a three-dimensional array in a computer

K Nearest neighbor algorithm

A sample is assigned to the class that is most common among its $k$ nearest neighbors.

For a point whose class is unknown:

  • Compute the distance between every point in the labeled dataset and the current point
  • Sort by distance
  • Select the $k$ points closest to the current point
  • Determine the frequency of each class among those $k$ points
  • Return the most frequent class as the prediction for the current point

No training is required; the computational cost is proportional to the size of the training set, complexity $O(n)$
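The steps above fit in a few lines; a minimal sketch using Euclidean distance and made-up 2-D points:

```python
import math
from collections import Counter

def knn_predict(points, labels, query, k=3):
    # Sort training indices by L2 distance to the query, then take a
    # majority vote among the k nearest neighbours.
    nearest = sorted(range(len(points)),
                     key=lambda i: math.dist(points[i], query))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two well-separated clusters for illustration.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
```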

Distance calculation:

The distance metric is a hyperparameter.

Manhattan distance ($L1$): $d_1(I_1,I_2)=\sum_p|I_1^p-I_2^p|$

Euclidean distance ($L2$): $d_2(I_1,I_2)=\sqrt{\sum_p(I_1^p-I_2^p)^2}$

Choosing $k$:

Use cross-validation: hold out part of the training set as a validation set to tune the model parameter, rotating which part is held out.

Neural network loss function

$L=\frac{1}{N}\sum_{i=1}^N\sum_{j\ne y_i}\max(0,f(x_i;W)_j-f(x_i;W)_{y_i}+\delta)$

$\delta$ is the tolerated margin.

Regularization penalty term

$L=\frac{1}{N}\sum_{i=1}^N\sum_{j\ne y_i}\max(0,f(x_i;W)_j-f(x_i;W)_{y_i}+\delta)+\lambda\sum_k\sum_l W_{k,l}^2$

Effect: penalizes large weight parameters

Softmax classifier

Softmax output: normalized class probabilities.

Loss function: cross-entropy loss $L_i=-\log\frac{e^{f_{y_i}}}{\sum_je^{f_j}}$

$f_j(z)=\frac{e^{z_j}}{\sum_ke^{z_k}}$ is called the softmax function.

The input is a vector of real-valued scores; the output is a vector whose elements all lie between 0 and 1 and sum to 1.
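A numerically stable sketch of the softmax function (the score vector is invented):

```python
import numpy as np

def softmax(z):
    # Subtracting the max does not change the result but avoids overflow
    # in exp for large scores.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
```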

Machine learning optimization

Back propagation

gradient descent

Batch size: the number of samples processed in one iteration (one forward + backward pass), limited by machine memory

During training, the overall trend is convergence.

Epoch: one complete pass over all the data

Back propagation

Addition gate: distributes the gradient equally to its inputs

MAX gate: routes the gradient to the largest input

Multiplication gate: swaps the input values (each input's gradient is scaled by the other input)

Characteristics:

  • Hierarchical structure
  • Nonlinearity

The sigmoid function suffers from severe vanishing gradients and is no longer widely used.

ReLU, $\max(0,x)$, is now the main choice.

The more neurons , The better the classification

Preventing overfitting: regularization

Data preprocessing

Weight initialization: $b$ is initialized to a constant, $w$ randomly

Preventing overfitting:

Dropout: during training, randomly keep only a fraction of the neurons for forward and backward propagation
