AI zhetianchuan DL regression and classification
2022-07-26 17:48:00 【Teacher, I forgot my homework】
This post mainly introduces logistic regression and softmax regression.

I. Review of regression and classification

Given a set of data points $\{x_1, \dots, x_N\}$ and the corresponding labels $\{t_1, \dots, t_N\}$, we want to predict the label of a new data point $x$; the goal is to find a mapping $f: X \to T$:

- If $T$ is a continuous set, the task is called regression.
- If $T$ is a discrete set, the task is called classification.

Polynomial regression

Consider a regression problem in which the input $x$ and the output $y$ are both scalars. We look for a function

$f(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M$

to fit the data. Whether the regression is linear or nonlinear, we usually choose some cost function, such as the mean squared error (MSE), as the loss function, and minimize it to determine the parameters of $f$.
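As a minimal sketch of the above (assuming NumPy; the degree $M = 3$ and the toy data are purely illustrative), a polynomial can be fit by minimizing the MSE with `np.polyfit`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: noisy samples of y = 1 + 2x - 3x^3 (illustrative).
x = np.linspace(-1.0, 1.0, 50)
y = 1.0 + 2.0 * x - 3.0 * x**3 + 0.05 * rng.standard_normal(50)

# Fit f(x) = w0 + w1 x + w2 x^2 + w3 x^3 by least squares,
# i.e. by minimizing the mean squared error.
w = np.polyfit(x, y, deg=3)       # coefficients, highest degree first
f = np.poly1d(w)

mse = np.mean((f(x) - y) ** 2)    # small training error on this toy data
```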
Linear regression

$f$ is linear:

$f(x) = w^T x + b$

where $b$ (the bias / error term) can be absorbed into $w$ by appending a constant feature $1$ to $x$, giving $f(x) = w^T x$.

- Take the mean squared error (MSE) as the cost function:

$E(w) = \frac{1}{2} \sum_{n=1}^{N} \left( f(x_n) - t_n \right)^2$

- Find the best $w$ and $b$ by minimizing the cost function, e.g. with the least squares method or gradient descent.
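A sketch of the closed-form least-squares solution, with the bias absorbed into $w$ via a constant feature (NumPy assumed; the toy data and true coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: t = 2*x1 - x2 + 0.5 plus a little noise (illustrative).
X = rng.standard_normal((100, 2))
t = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 + 0.01 * rng.standard_normal(100)

# Absorb the bias b into w by appending a constant-1 feature to x.
Xb = np.hstack([X, np.ones((100, 1))])

# Least squares: w = argmin_w ||Xb w - t||^2 (equivalently, minimize MSE).
w, *_ = np.linalg.lstsq(Xb, t, rcond=None)
```

The recovered `w` is approximately `[2.0, -1.0, 0.5]`, the last entry being the absorbed bias.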
(See also: AI zhetianchuan ML - Introduction to regression analysis.)
Binary classification by regression

In feature space, a linear classifier corresponds to a hyperplane:

$w^T x + b = 0$

Two typical linear classifiers:

- the perceptron
- the SVM (see: AI zhetianchuan ML - SVM introduction)

Recall the distinction:

- Regression predicts a continuous target.
- Classification predicts a discrete label.
Binary classification using linear regression:

- Assume $t \in \{0, 1\}$ and consider the case of one-dimensional features: fit $f(x) = wx + b$ to the labels, then classify by thresholding, e.g. predict class 1 when $f(x) \ge 0.5$.
- Assume $t \in \{0, 1\}$ and consider the case of high-dimensional features: fit $f(x) = w^T x$ and threshold in the same way.
Binary classification using nonlinear regression

$f$ can be a nonlinear function, for example built from the logistic sigmoid function:

$h(z) = \frac{1}{1 + e^{-z}}$

We can train this nonlinear regression model just as we train a linear regression model; the original $f(x) = w^T x$ simply becomes $f(x) = h(w^T x)$.
Note: here $h$ is a function such as the logistic sigmoid.
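A one-line sketch of the logistic sigmoid (NumPy assumed):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid h(z) = 1 / (1 + exp(-z)), mapping any score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# f(x) = h(w^T x): the linear score w^T x is squashed into (0, 1).
```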
Looking at the problem from a probabilistic perspective

Suppose the label follows a normal distribution with mean $f(x)$; then its maximum likelihood estimation is equivalent to minimizing

$E(w) = \frac{1}{2} \sum_{n=1}^{N} \left( f(x_n) - t_n \right)^2$

- For regression problems ($t$ is continuous), the normal-distribution assumption is natural.
- For classification problems ($t$ is discrete), the normal-distribution assumption would be strange.
- There is a more suitable assumption for the data distribution of the binary classification problem: the Bernoulli distribution.

Why is the Bernoulli distribution more suitable for binary classification problems?
II. Logistic Regression

For a binary task, a single 0-1 unit is enough to represent a label: $t \in \{0, 1\}$.

We try to learn the conditional probability (with $b$ already absorbed into $w$; $x$ is the input and $t$ the label):

$p(t = 1 \mid x) = h(w^T x) = \frac{1}{1 + e^{-w^T x}}$

Our goal is to find a value of $w$ such that the probability $p(t = 1 \mid x)$

- takes a large value, such as 0.99999, when $x$ belongs to class 1;
- takes a small value, such as 0.00001, when $x$ belongs to class 2 (so that $p(t = 0 \mid x)$ takes a large value).

In essence, we are using another continuous function $h$ to "regress" a discrete mapping ($x \to t$).
Cross-entropy error function

For the Bernoulli distribution, maximizing the conditional data likelihood is equivalent to minimizing

$E(w) = -\sum_{n=1}^{N} \left[ t_n \ln h_n + (1 - t_n) \ln (1 - h_n) \right], \quad h_n = h(w^T x_n)$

This gives a new loss function, the cross-entropy error. Taking out a single term $E_n = -\left[ t \ln h + (1 - t) \ln (1 - h) \right]$:

- if $t = 1$, then $E = -\ln(h)$, which is small when $h$ is near 1 and grows without bound as $h \to 0$;
- if $t = 0$, then $E = -\ln(1 - h)$, which behaves symmetrically.

Both cases heavily penalize confident wrong predictions.
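The per-term behavior above can be checked with a small sketch (NumPy assumed; the clipping constant is an implementation detail added here to guard against $\log 0$):

```python
import numpy as np

def binary_cross_entropy(h, t):
    """E = -[t ln h + (1 - t) ln(1 - h)], averaged over the samples."""
    h = np.clip(h, 1e-12, 1.0 - 1e-12)   # guard against log(0)
    return -np.mean(t * np.log(h) + (1.0 - t) * np.log(1.0 - h))
```

For a single sample with $t = 1$ this reduces to $-\ln(h)$, so a confident wrong prediction ($h = 0.01$) costs far more than a confident right one ($h = 0.99$).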
Training and testing

Summary of classification problems

III. Softmax Regression

We discussed binary classification with one-dimensional and multi-dimensional features above. For multi-class classification, we simply use one function per class, i.e. as many functions as there are classes.

As in the figure above, for an input $x$ the three functions might output 1.2, 4.1 and 1.9; these scores can then be used for regression or classification in subsequent steps. The functions may be linear or nonlinear, as in logistic regression.

Choosing the mean squared error (MSE) as the loss function, we can use the least squares method or gradient descent to compute the parameters.
Representation of label categories

For a classification problem, i.e. a mapping $f$ whose output lies in a discrete set, we have two ways to represent labels:

1. an integer index, $t \in \{1, 2, \dots, K\}$;
2. a one-hot vector, e.g. $t = (0, 0, 1, 0)$ for class 3 when $K = 4$.

With the first method there is an artificial distance relationship between categories, so we usually use the second representation, in which each dimension takes only the two values 0 and 1.

To classify, we only need to see which class's one-hot vector is closest to the output.
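A sketch of the one-hot representation and the nearest-vector rule (NumPy assumed; note that "closest one-hot vector" reduces to an argmax, as the comment explains):

```python
import numpy as np

def one_hot(t, K):
    """Map integer labels to one-hot rows, e.g. 2 -> [0, 0, 1, 0] for K = 4."""
    t = np.asarray(t)
    T = np.zeros((t.size, K))
    T[np.arange(t.size), t] = 1.0
    return T

def classify(outputs):
    """Pick, for each output row y, the class whose one-hot vector is closest.
    Minimizing ||y - e_k||^2 = ||y||^2 - 2*y_k + 1 over k is the same as
    maximizing y_k, so this is simply an argmax over the output dimensions."""
    return np.argmax(outputs, axis=1)
```

For the example scores above, `classify(np.array([[1.2, 4.1, 1.9]]))` picks class index 1.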
From the perspective of probability:

As mentioned above, the Bernoulli distribution is more suitable for binary tasks, which is why we introduced logistic regression. When facing multi-class tasks ($K > 2$), we choose the multinoulli (categorical) distribution.

Review: the multinoulli / categorical distribution assigns probability $p(t_k = 1) = \mu_k$ to class $k$, with $\mu_k \ge 0$ and $\sum_{k=1}^{K} \mu_k = 1$.
Learning the distribution:

- Let $p(t_k = 1 \mid x)$ take the form $h_k(x)$. Clearly, $h_k(x) \ge 0$, and $\sum_{k=1}^{K} h_k(x) = 1$.
- Given a test input $x$, for each $k = 1, 2, \dots, K$, estimate $p(t_k = 1 \mid x)$:
  - when $x$ belongs to the $k$-th class, it should take a large value;
  - when $x$ belongs to another class, it should take a small value.
- Because $p(t_k = 1 \mid x)$ is a (continuous) probability, we need to convert it into discrete values that match the classification.
Softmax function

The following function is called the softmax function:

$h_k(x) = \frac{\exp(w_k^T x)}{\sum_{j=1}^{K} \exp(w_j^T x)}$

- If $w_k^T x \gg w_j^T x$ holds for all $j \ne k$, then $h_k(x) \approx 1$, although its value is still less than 1.
- If $w_k^T x \ll w_j^T x$ holds for all $j \ne k$, then $h_k(x) \approx 0$.
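A sketch of the softmax (NumPy assumed; subtracting the row maximum first is a standard numerical-stability trick not mentioned in the text, and it does not change the output):

```python
import numpy as np

def softmax(z):
    """Softmax along the last axis.  The max-subtraction is only for
    numerical stability; the result is mathematically unchanged."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

p = softmax(np.array([1.2, 4.1, 1.9]))   # the example scores from above
```

The outputs sum to 1, the largest score wins, and every entry stays strictly below 1.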
Again, maximizing the conditional likelihood yields the cross-entropy error function:

$E(W) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln h_k(x_n)$

Note: for each sample $n$, only one term of the inner sum is nonzero, because $t_n$ is one-hot, e.g. $(0, 0, 0, 1, 0, 0)$.
Computing the gradient

$\frac{\partial E}{\partial w_j} = \sum_{n=1}^{N} \left( h_j(x_n) - t_{nj} \right) x_n$

Vector-matrix form

$\nabla_W E = X^T (H - T)$

where the rows of $X$, $H$ and $T$ are the inputs $x_n^T$, the predictions $h(x_n)^T$ and the one-hot labels $t_n^T$.
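The matrix-form gradient can be checked against a finite-difference approximation; a sketch (NumPy assumed; the sizes are illustrative):

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def loss(W, X, T):
    """Cross-entropy E(W) = -sum_n sum_k t_nk ln h_k(x_n)."""
    return -np.sum(T * np.log(softmax(X @ W)))

def grad(W, X, T):
    """Matrix-form gradient: X^T (H - T)."""
    return X.T @ (softmax(X @ W) - T)

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 3))
T = np.eye(4)[rng.integers(0, 4, size=5)]  # one-hot labels, K = 4
W = rng.standard_normal((3, 4))

G = grad(W, X, T)

# Central finite difference on one entry of W.
eps, i, j = 1e-6, 1, 2
Wp, Wm = W.copy(), W.copy()
Wp[i, j] += eps
Wm[i, j] -= eps
num = (loss(Wp, X, T) - loss(Wm, X, T)) / (2 * eps)
```

The analytic entry `G[1, 2]` and the numerical estimate `num` agree to several decimal places.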
Training and testing

Stochastic gradient descent

Minimizing the cost function over the whole training set is computationally expensive, so we usually divide the training set into smaller subsets, or minibatches, optimize the cost function on a single minibatch $(x_i, y_i)$ at a time, and take the average.
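A sketch of minibatch SGD for softmax regression on toy Gaussian-blob data (NumPy assumed; the blob means, learning rate, batch size and epoch count are illustrative choices):

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)

# Toy 3-class data: one Gaussian blob per class (illustrative means).
N, K = 300, 3
means = np.array([[0.0, 4.0], [4.0, 0.0], [-4.0, -4.0]])
labels = rng.integers(0, K, size=N)
X = means[labels] + rng.standard_normal((N, 2))
Xb = np.hstack([X, np.ones((N, 1))])       # absorb the bias as a constant feature
T = np.eye(K)[labels]                      # one-hot targets

W = np.zeros((3, K))
lr, batch = 0.1, 32
for epoch in range(50):
    order = rng.permutation(N)             # reshuffle each epoch
    for s in range(0, N, batch):
        idx = order[s:s + batch]
        H = softmax(Xb[idx] @ W)
        # Average minibatch gradient: X^T (H - T) / |minibatch|
        W -= lr * Xb[idx].T @ (H - T[idx]) / len(idx)

acc = np.mean(np.argmax(Xb @ W, axis=1) == labels)
```

On these well-separated blobs the training accuracy ends up close to 1.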

Introducing the bias

So far we have assumed $f_k(x) = w_k^T x$, where a constant feature $1$ has been appended to $x$. Sometimes the bias term is instead written explicitly,

$f_k(x) = w_k^T x + b_k$

so that the parameters become $\{W, b\}$. Regularization is usually applied only to $w$, not to the bias $b$.
Softmax over-parameterization

Suppose we have parameters $\{w_k\}$. The new parameters $\{w_k - \psi\}$, obtained by subtracting the same vector $\psi$ from every $w_k$, give exactly the same predictions, because

$\frac{\exp\left( (w_k - \psi)^T x \right)}{\sum_{j=1}^{K} \exp\left( (w_j - \psi)^T x \right)} = \frac{\exp(w_k^T x)\, \exp(-\psi^T x)}{\sum_{j=1}^{K} \exp(w_j^T x)\, \exp(-\psi^T x)} = h_k(x)$

Hence minimizing the cross-entropy function has infinitely many solutions.
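This invariance can be verified numerically; a sketch (NumPy assumed; $\psi$ is an arbitrary random shift):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

rng = np.random.default_rng(4)
W = rng.standard_normal((3, 4))    # column k holds the class weights w_k
x = rng.standard_normal(3)
psi = rng.standard_normal(3)       # arbitrary shift vector

h_old = softmax(x @ W)
h_new = softmax(x @ (W - psi[:, None]))   # subtract psi from every w_k
```

The two probability vectors coincide, confirming that the shifted parameters are an equally good solution.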
IV. The relationship between softmax regression and logistic regression

In softmax regression, set $K = 2$:

$h_1(x) = \frac{\exp(w_1^T x)}{\exp(w_1^T x) + \exp(w_2^T x)} = \frac{1}{1 + \exp\left( -(w_1 - w_2)^T x \right)} = g\left( (w_1 - w_2)^T x \right)$

where $h$ is the softmax function and $g$ is the logistic function. If we define the new variable $w = w_1 - w_2$, this is exactly logistic regression.
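The $K = 2$ equivalence can be checked numerically (NumPy assumed; the weight vectors and input are random):

```python
import numpy as np

rng = np.random.default_rng(5)
w1, w2, x = (rng.standard_normal(3) for _ in range(3))

# Softmax with K = 2, probability of class 1:
z1, z2 = w1 @ x, w2 @ x
h1 = np.exp(z1) / (np.exp(z1) + np.exp(z2))

# Logistic regression with the single weight vector w = w1 - w2:
g = 1.0 / (1.0 + np.exp(-(w1 - w2) @ x))
```

`h1` and `g` agree to floating-point precision.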
V. Summary

Cross-entropy in the general sense:

$H(p, q) = -\sum_{x} p(x) \ln q(x)$

where $p$ is the true distribution and $q$ the predicted one; the loss functions above are the special cases in which $p$ is a Bernoulli or one-hot label distribution.