Deep Learning (3): Classification (Theory)
2022-08-04 02:32:00 【Ali forever】
Preface
This article focuses on the theory of classification: how a classification model is built and how its parameters are adjusted.
I. Backpropagation
In a deep neural network, optimization generally uses gradient descent to adjust the parameters. But because the network has many layers, computing the derivatives directly is very difficult, so computers use a very convenient method to reduce the computational cost: backpropagation.
1. The chain rule
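The original shows the chain-rule formulas as an image; the two standard cases it covers are:

Case 1: if $y=g(x)$ and $z=h(y)$, then

$$\frac{dz}{dx}=\frac{dz}{dy}\,\frac{dy}{dx}$$

Case 2: if $x=g(s)$, $y=h(s)$, and $z=k(x,y)$, then

$$\frac{dz}{ds}=\frac{\partial z}{\partial x}\,\frac{dx}{ds}+\frac{\partial z}{\partial y}\,\frac{dy}{ds}$$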
2. The forward pass
$$L(\theta)=\sum_{n=1}^{N}C^n(\theta)$$

where $C^n(\theta)$ is the cost on the $n$-th example, measuring the distance between the network output $y^n$ and the target $\hat{y}^n$.
$$\frac{\partial L(\theta)}{\partial w}=\sum_{n=1}^{N}\frac{\partial C^n(\theta)}{\partial w}$$
Here we pick one of the $C^n$ and take its partial derivative with respect to a weight $w$. By the chain rule, the derivative splits into:
$$\frac{\partial C}{\partial w}=\frac{\partial z}{\partial w}\cdot\frac{\partial C}{\partial z}$$
For a neuron computing $z=w_1x_1+w_2x_2+b$, we have $\frac{\partial z}{\partial w_1}=x_1$ and $\frac{\partial z}{\partial w_2}=x_2$. Computing these terms is called the forward pass (Forward Pass). The question therefore reduces to solving for $\frac{\partial C}{\partial z}$.
We can see that $z$ passes through an activation function to produce a new value $a$. This too we can split with the chain rule:
$$\frac{\partial C}{\partial z}=\frac{\partial a}{\partial z}\cdot\frac{\partial C}{\partial a}$$
We know that $\frac{\partial a}{\partial z}=\sigma'(z)$, so the difficulty becomes computing $\frac{\partial C}{\partial a}$, because by the chain rule:
$$\frac{\partial C}{\partial a}=\frac{\partial z'}{\partial a}\cdot\frac{\partial C}{\partial z'}+\frac{\partial z''}{\partial a}\cdot\frac{\partial C}{\partial z''}$$

where $z'$ and $z''$ are the pre-activation inputs of the next-layer neurons that $a$ feeds into.
If we keep pushing the derivation forward like this, the formula becomes more and more complicated, and as the number of downstream layers grows the number of terms grows exponentially. So instead of continuing forward, we next perform backpropagation (Backpropagation).
3. The backward pass
Suppose the forward pass has already computed every weight's input and every activation-function derivative; then:
For the final output layer:
$$\frac{\partial C}{\partial z_5}=\frac{\partial y_1}{\partial z_5}\cdot\frac{\partial C}{\partial y_1},\qquad \frac{\partial C}{\partial z_6}=\frac{\partial y_2}{\partial z_6}\cdot\frac{\partial C}{\partial y_2}$$

Here $z_5$ and $z_6$ are the pre-activation values of the two output neurons. Both of these results are easy to obtain, and once we have them we can work backward to compute the terms the forward pass left missing. A simple example:
$$\frac{\partial C}{\partial a(z_4)}=\frac{\partial z_5}{\partial a(z_4)}\cdot\frac{\partial C}{\partial z_5}+\frac{\partial z_6}{\partial a(z_4)}\cdot\frac{\partial C}{\partial z_6}$$
Repeating this calculation layer by layer fills in all the terms that were missing from the forward pass.
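To make the procedure concrete, here is a minimal numpy sketch (not from the original post) of one forward and backward pass through a tiny two-layer sigmoid network with squared-error cost; the layer sizes and data are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up toy dimensions: 2 inputs -> 3 hidden units -> 1 output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
x, y_hat = np.array([0.5, -1.0]), np.array([1.0])  # one training example

# Forward pass: cache every z and a
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
y = sigmoid(z2)
C = 0.5 * np.sum((y - y_hat) ** 2)

# Backward pass: propagate dC/dz from the output layer inward
dC_dz2 = (y - y_hat) * y * (1 - y)        # dC/dy * sigma'(z2)
dC_dW2 = np.outer(dC_dz2, a1)             # dz2/dW2 is a1, a forward-pass value
dC_dz1 = (W2.T @ dC_dz2) * a1 * (1 - a1)  # chain rule back through W2 and sigma
dC_dW1 = np.outer(dC_dz1, x)              # dz1/dW1 is x, a forward-pass value
print(C, dC_dW1, dC_dW2)
```

Every factor used in the backward pass ($a_1$, $x$, $\sigma'(z)$) was already cached during the forward pass, which is exactly why backpropagation is cheap.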
II. Preliminaries of classification
Earlier we introduced how to do regression, but not every problem is a regression problem. Below we introduce another kind of problem: classification.
1. Similarities and differences between classification and regression
What regression and classification share: both are supervised learning, and both make a prediction from the input to obtain a predicted value.
Classification and regression differ in the following respects:

- Different outputs. Classification outputs a discrete variable: a qualitative output that determines the type of the object. Regression outputs a continuous variable: a quantitative output value.
- Different goals. Classification aims to find a decision surface that separates the data; regression aims to find the best fit, an optimal fitting line that comes as close as possible to every point in the data set.
- Different results. A classification result is simply right or wrong; a regression result is judged by how close it is to the true value, not by a binary right/wrong.
- Different use cases. Classification is used for tasks such as handwritten-signature verification, face recognition, and disease diagnosis; regression is used for tasks such as house-price prediction and weather forecasting.
2. Can regression be used for classification?
The idea is to obtain a predicted value via regression and then assign the class whose target value the prediction is closest to. This idea is wrong.
In the figure (not reproduced here), some points have regression outputs far greater than 1. By the scheme above these points still count as class 1, but because squared error penalizes such "too correct" points heavily, the fitted line is dragged toward them, shifting the decision boundary and causing other points to be misclassified.
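A small sketch of this failure mode on made-up 1-D data (the values are illustrative, not from the original figure):

```python
import numpy as np

# Targets are +1 / -1; three class-+1 points lie far to the right,
# which is where the regression outputs "far greater than 1" come from.
x = np.array([-3., -2., -1., 1., 2., 3., 10., 11., 12.])
y = np.array([-1., -1., -1., 1., 1., 1., 1., 1., 1.])

# Least-squares fit y ~ w*x + b, then classify by the sign of w*x + b
A = np.stack([x, np.ones_like(x)], axis=1)
w, b = np.linalg.lstsq(A, y, rcond=None)[0]
print(f"decision boundary at x = {-b / w:.2f}")  # 1.00, not 0.00
```

Without the three far-right points the boundary sits at $x=0$; with them it moves to $x=1$, putting the class-$+1$ point at $x=1$ right on the boundary.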
3. Bayes' theorem
We take a simple ball-drawing example as an introduction:
Prior and posterior probability:
Prior probability: suppose the probability of picking Box 1 is $\frac{2}{3}$ and of picking Box 2 is $\frac{1}{3}$, i.e. $P(B_1)=\frac{2}{3}$ and $P(B_2)=\frac{1}{3}$. These two probabilities are the prior probabilities.
Class-conditional probability: given that we draw from $B_1$, the probability of drawing blue is $P(\text{Blue}|B_1)=\frac{4}{5}$ and of drawing green is $P(\text{Green}|B_1)=\frac{1}{5}$. (The posterior, e.g. $P(B_1|\text{Blue})$, is what Bayes' theorem below computes from these.)
The product rule and the law of total probability:
Product rule: $P(AB)=P(A|B)\,P(B)=P(B|A)\,P(A)$, where $P(AB)$ is the joint probability of $A$ and $B$.
Law of total probability: if the events $A_1, A_2, \ldots, A_N$ form a complete partition (pairwise disjoint, with union equal to the whole sample space), then:
$$P(B)=\sum_{i=1}^{N}P(A_i)\,P(B|A_i)$$
Bayes' theorem:
$$P(B_1|\text{Blue})=\frac{P(B_1,\text{Blue})}{P(\text{Blue})}=\frac{P(\text{Blue}|B_1)\,P(B_1)}{P(\text{Blue}|B_1)\,P(B_1)+P(\text{Blue}|B_2)\,P(B_2)}$$
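As a worked check of this formula: the original figure giving Box 2's contents is not reproduced, so the value of $P(\text{Blue}|B_2)$ below is an assumption for illustration (say Box 2 holds 2 blue and 3 green balls).

```python
from fractions import Fraction as F

p_b1, p_b2 = F(2, 3), F(1, 3)  # priors from the text
p_blue_b1 = F(4, 5)            # from the text
p_blue_b2 = F(2, 5)            # assumed Box 2 contents

p_blue = p_blue_b1 * p_b1 + p_blue_b2 * p_b2  # law of total probability
posterior = p_blue_b1 * p_b1 / p_blue         # Bayes' theorem
print(posterior)  # 4/5 under the assumed Box 2 contents
```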
III. Generative models and discriminative models
Supervised learning models are generally divided into generative models and discriminative models. So what is the difference between the two?
1. Generative models (generative model)
Source-oriented: a generative model focuses on how the data were generated, and then classifies a signal. (When a signal comes in, the generative model asks which class is most likely to have produced it, and assigns the signal to that class.)
We use a simple example to illustrate:
Suppose we have a pile of balls in two known colors, green and yellow, and only those two. Here the color of a ball is y (the target variable), and its position on the coordinate axis is the feature X. We want to know: if a new ball is placed at position x on the axis, what color is it?
We can easily compute the prior probability P(y), and the class-conditional probabilities P(x|y=green) and P(x|y=yellow) can be estimated from the training data, giving a result like the following:
Then, using the product rule, we can compute the joint probabilities P(x, y=green) and P(x, y=yellow).
For a binary classification problem, the final result is simply whichever class has the greater probability.
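A minimal sketch of this generative recipe, assuming 1-D Gaussian class-conditionals fitted by maximum likelihood; the data and class balance are made up.

```python
import numpy as np
from scipy.stats import norm

# Made-up 1-D positions for the two ball colors
x_green = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
x_yellow = np.array([4.0, 4.5, 5.0, 5.5, 6.0])

# Priors P(y) from class counts; class-conditionals P(x|y) as Gaussians
n = len(x_green) + len(x_yellow)
p_green, p_yellow = len(x_green) / n, len(x_yellow) / n
green = norm(x_green.mean(), x_green.std())
yellow = norm(x_yellow.mean(), x_yellow.std())

def classify(x):
    # Compare joint probabilities P(x, y) = P(x|y) * P(y)
    return "green" if green.pdf(x) * p_green > yellow.pdf(x) * p_yellow else "yellow"

print(classify(2.2), classify(5.1))  # green yellow
```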
2. Discriminative models (Discriminative Model)
Result-oriented: a discriminative model focuses on the differences between classes and does not care how the sample data were generated; it classifies a given sample simply according to the "decision boundary" between the samples.
We also use the above example to illustrate:
First we plot the data as above; then we can directly estimate P(Y|X). Through some algorithm we learn a line (the decision boundary) and use it to discriminate; the logistic regression sketch in Section V below is exactly such an algorithm.
3. Differences between generative and discriminative models
- Characteristics
  - A generative model statistically represents the distribution of the data and can reflect the similarity among data of the same class.
  - A discriminative model looks for the optimal separating surface between classes, reflecting the differences among the data.
  - A generative model can be turned into a discriminative model via Bayes' theorem, but a discriminative model cannot be turned into a generative model.
- Pros and cons
  - Advantages of generative models:
    - Usable even when the data are incomplete
    - Fast convergence: approaches the true model more quickly as data grow
    - Still usable when latent variables are present; carries more information than a discriminative model
    - Less prone to overfitting; more flexible for research questions
  - Advantages of discriminative models:
    - The result is intuitive: the difference between one class and the others is visible at a glance
    - Scales to more classes
    - Simpler, and easier to learn and train
  - Disadvantages of generative models:
    - Requires a large amount of training data; unfriendly when data are scarce
    - The estimated distribution is vulnerable to outliers entering the data
    - Very similar features easily produce false positives
  - Disadvantages of discriminative models:
    - A black box: the relationships among variables are unclear and cannot be visualized
- Common models
  - Generative: naive Bayes classifier, Markov models, Gaussian mixture models, restricted Boltzmann machines
  - Discriminative: logistic regression, decision trees, k-nearest neighbors, linear regression, SVM, boosting
- Common uses
  - Generative: NLP, medical diagnosis
  - Discriminative: image and text classification, time-series detection
IV. The Gaussian distribution (Gaussian Distribution) and maximum likelihood (Maximum Likelihood)
1. Why use a Gaussian distribution?
Many events in nature are the combined result of independent random factors, and such random variables are close to Gaussian-distributed, so we assume the data arise as independent random events. If there is serious correlation among them, we generally do not use this assumption.
Second, among all distributions with known mean and variance, the Gaussian has the largest entropy; when the data distribution is unknown, the maximum-entropy model is usually chosen.
2. The multivariate Gaussian distribution formula
$$f_{\mu,\Sigma}(x)=\frac{1}{(2\pi)^{D/2}}\,\frac{1}{|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right\}$$

where $D$ is the dimension of $x$.
The most important quantities in the Gaussian formula are the mean $\mu$ and the covariance matrix $\Sigma$: these two parameters determine the entire probability density. We will compute them with maximum likelihood estimation.
3. Maximum likelihood estimation
Each choice of mean $\mu$ and covariance $\Sigma$ assigns a different density to every sampled point. Assuming the points are drawn independently, we multiply their densities to obtain the likelihood:
$$L(\mu,\Sigma)=f_{\mu,\Sigma}(x^1)\,f_{\mu,\Sigma}(x^2)\cdots f_{\mu,\Sigma}(x^N)$$
Our goal is then to find:
$$\mu^*,\Sigma^*=\arg\max_{\mu,\Sigma}\,L(\mu,\Sigma)$$
Solving this gives:
$$\mu^*=\frac{1}{N}\sum_{n=1}^{N}x^n,\qquad \Sigma^*=\frac{1}{N}\sum_{n=1}^{N}(x^n-\mu^*)(x^n-\mu^*)^T$$
Once the optimal parameters have been computed, we substitute them back into the Gaussian. Since the class-conditional distribution $P(x|C)$ is modeled as this Gaussian, we can plug in a data point and compute the result.
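A minimal numpy sketch of these closed-form estimates on made-up 2-D data:

```python
import numpy as np

# Made-up 2-D feature vectors for one class
X = np.array([[1.0, 2.0], [1.5, 1.8], [0.8, 2.2], [1.2, 2.5], [0.9, 1.9]])
N, D = X.shape

# Closed-form maximum likelihood estimates (note 1/N, not 1/(N-1))
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / N

def gaussian_pdf(x, mu, Sigma):
    d = x - mu
    norm_const = (2 * np.pi) ** (-D / 2) * np.linalg.det(Sigma) ** -0.5
    return norm_const * np.exp(-0.5 * d @ np.linalg.solve(Sigma, d))

print(mu, Sigma, gaussian_pdf(np.array([1.0, 2.0]), mu, Sigma))
```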
4. Optimizing the Gaussian model
The size of the covariance matrix grows with the number of input features: with many features, giving each class its own large covariance matrix adds many parameters and can strongly affect the results. So we can consider sharing one covariance matrix between the classes, reducing the error.
The shared $\Sigma$ is the weighted average below, where $\alpha$ is the fraction of examples belonging to the class with mean $\mu_1$:
$$\Sigma=\alpha\,\Sigma_1+(1-\alpha)\,\Sigma_2$$
When the two classes take the same $\Sigma$, we find that the decision boundary becomes linear.
This is because the posterior probability ultimately reduces to $P(C_1|x)=\sigma(w\cdot x+b)$ (see the sketch below). Since all we really need are $w$ and $b$, going through a generative model wastes effort; the next section shows another way to find this result directly.
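For completeness, here is a sketch of why this holds (the standard derivation; the original relies on figures not reproduced here). Starting from Bayes' theorem,

$$P(C_1|x)=\frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1)+P(x|C_2)P(C_2)}=\frac{1}{1+e^{-z}}=\sigma(z),\qquad z=\ln\frac{P(x|C_1)\,P(C_1)}{P(x|C_2)\,P(C_2)}$$

Substituting the two Gaussians with shared $\Sigma$ into $z$, the quadratic terms $x^T\Sigma^{-1}x$ cancel, leaving a linear function of $x$:

$$z=(\mu_1-\mu_2)^T\Sigma^{-1}x-\frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1+\frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2+\ln\frac{P(C_1)}{P(C_2)}=w^Tx+b$$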
V. Logistic regression (Logistic Regression)
As noted above when discriminative models were introduced, logistic regression is a discriminative model, mainly used for binary classification.
1. The function set (Function Set)
$$z=\sum_i w_i x_i + b$$
$$P_{w,b}(C_1|x)=\sigma(z)=\sigma\Big(\sum_i w_i x_i + b\Big)$$
2. Goodness of a function (Goodness of a Function)
Each training feature vector comes with its class label. Given $w$ and $b$, the probability of generating the observed labels is:
$$L(w,b)=f_{w,b}(x^1)\,f_{w,b}(x^2)\,\big(1-f_{w,b}(x^3)\big)\cdots f_{w,b}(x^N)$$

(the factor $1-f_{w,b}(x^3)$ appears because $x^3$ belongs to class $C_2$).
As before, we use maximum likelihood estimation to find the best $w$ and $b$. To reduce the amount of computation, we take the (negative) logarithm, which turns the product into a sum:
$$w^*,b^*=\arg\max_{w,b}L(w,b)=\arg\min_{w,b}\,\big(-\ln L(w,b)\big)$$
Next we rewrite this in the form of a cross entropy (cross entropy):
$$H(p,q)=-\sum_x p(x)\ln q(x)$$
Here $p(x)$ is the true distribution and $q(x)$ is the distribution estimated by maximum likelihood. Writing $\hat{y}^n=1$ when $x^n$ belongs to $C_1$ and $\hat{y}^n=0$ otherwise, each term becomes:
$$-\ln f_{w,b}(x^1)=-\left[\hat{y}^1\ln f_{w,b}(x^1)+(1-\hat{y}^1)\ln\big(1-f_{w,b}(x^1)\big)\right]$$
The total cross-entropy function becomes:
$$-\ln L(w,b)=-\sum_n\left[\hat{y}^n\ln f_{w,b}(x^n)+(1-\hat{y}^n)\ln\big(1-f_{w,b}(x^n)\big)\right]$$
Cross entropy measures how close two distributions are: here, if the prediction matches the 0/1 label exactly, the cross entropy is 0; the more the two distributions differ, the larger the cross entropy becomes.
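A small numpy sketch of this quantity for 0/1 labels (the clipping epsilon is an implementation detail added here to avoid $\ln 0$):

```python
import numpy as np

def binary_cross_entropy(y_hat, f):
    """y_hat: 0/1 labels; f: model outputs sigma(z) in (0, 1)."""
    eps = 1e-12  # guard against log(0)
    f = np.clip(f, eps, 1 - eps)
    return -np.sum(y_hat * np.log(f) + (1 - y_hat) * np.log(1 - f))

# Matching distributions give ~0; confident wrong predictions blow up
print(binary_cross_entropy(np.array([1, 0]), np.array([1.0, 0.0])))    # ~0
print(binary_cross_entropy(np.array([1, 0]), np.array([0.01, 0.99])))  # ~9.2
```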
3. Finding the best function (Find the best function)
$$\frac{\partial\big(-\ln L(w,b)\big)}{\partial w_i}=-\sum_n\left[\hat{y}^n\frac{\partial\ln f_{w,b}(x^n)}{\partial w_i}+(1-\hat{y}^n)\frac{\partial\ln\big(1-f_{w,b}(x^n)\big)}{\partial w_i}\right]$$
$$\frac{\partial\ln f_{w,b}(x^n)}{\partial w_i}=\frac{\partial z}{\partial w_i}\,\frac{\partial\ln f_{w,b}(x^n)}{\partial z}=x_i^n\cdot\frac{1}{\sigma(z)}\cdot\sigma(z)\big(1-\sigma(z)\big)=x_i^n\big(1-\sigma(z)\big)$$
$$\frac{\partial\ln\big(1-f_{w,b}(x^n)\big)}{\partial w_i}=\frac{\partial z}{\partial w_i}\,\frac{\partial\ln\big(1-\sigma(z)\big)}{\partial z}=x_i^n\cdot\frac{-\sigma(z)\big(1-\sigma(z)\big)}{1-\sigma(z)}=-x_i^n\,\sigma(z)$$
$$\begin{aligned}\frac{\partial\big(-\ln L(w,b)\big)}{\partial w_i}&=-\sum_n\Big[\hat{y}^n x_i^n\big(1-\sigma(z)\big)-(1-\hat{y}^n)\,x_i^n\,\sigma(z)\Big]\\&=-\sum_n\big(\hat{y}^n-\sigma(z)\big)\,x_i^n\end{aligned}$$
Finally we obtain the update rule:
$$w_i\leftarrow w_i-\eta\,\frac{\partial\big(-\ln L\big)}{\partial w_i}=w_i+\eta\sum_n\big(\hat{y}^n-\sigma(z^n)\big)\,x_i^n$$
Comparing this update rule with the one for linear regression, you will find they look exactly the same; the only difference is that in logistic regression $\hat{y}^n$ can only take the value 0 or 1, while in linear regression it can take many different values.
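A minimal numpy sketch of training with this update rule on made-up data (the learning rate, iteration count, and gradient averaging are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up 2-D binary data: labels y_hat are 0 or 1
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_hat = np.concatenate([np.zeros(50), np.ones(50)])

w, b, eta = np.zeros(2), 0.0, 0.1
for _ in range(1000):
    f = sigmoid(X @ w + b)         # P(C1 | x^n) for every example
    err = y_hat - f                # (y_hat^n - sigma(z^n))
    w += eta * X.T @ err / len(X)  # the update rule above, averaged over n
    b += eta * err.mean()
print(w, b)
```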
4. The loss function of logistic regression
As mentioned earlier, logistic regression uses cross entropy as its loss function rather than the mean squared error we used in linear regression. Why is that?
The reason is that with squared error, the gradient contains a factor $\sigma(z)\big(1-\sigma(z)\big)$, which is 0 whenever $\sigma(z)=0$ or $\sigma(z)=1$, even if the prediction is completely wrong. Gradient descent then barely moves on these flat regions of the loss surface and training can stall far from the optimum, which is unacceptable, so we do not use mean squared error for logistic regression.
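Concretely, if we used the squared error $C=\frac{1}{2}\sum_n\big(\sigma(z^n)-\hat{y}^n\big)^2$, the gradient would be

$$\frac{\partial C}{\partial w_i}=\sum_n\big(\sigma(z^n)-\hat{y}^n\big)\,\sigma(z^n)\big(1-\sigma(z^n)\big)\,x_i^n$$

If $\hat{y}^n=1$ but $\sigma(z^n)\approx 0$ (a badly wrong prediction), the factor $\sigma(z^n)\big(1-\sigma(z^n)\big)\approx 0$ kills the gradient. The cross-entropy gradient $-\sum_n\big(\hat{y}^n-\sigma(z^n)\big)x_i^n$ has no such factor.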
Summary
This article presented the theoretical side of classification. I hope you can take what you need from it; a mind map was attached below in the original to aid memorization.