Preliminaries
Objective function, cost function, loss function: are they the same thing?
The meaning of theta
\( h(x)=W^Tx+b=\theta^Tx \)
Linear regression
Hypotheses:
$$ h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2\tag{1} $$
$$ h(x)=\sum_{i=0}^n\theta_ix_i=\theta^Tx\tag{2} $$
where \( x_0=1 \).
Loss:
$$ J(\theta)=\frac 1 2\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2\tag{3} $$
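The hypothesis (2) and the least-squares cost (3) can be sketched in NumPy as follows; the toy data and variable names here are illustrative, not from the original notes:

```python
import numpy as np

def h(theta, X):
    """Linear hypothesis (2): h_theta(x) = theta^T x; X already has x_0 = 1 prepended."""
    return X @ theta

def J(theta, X, y):
    """Least-squares cost (3): J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2."""
    residual = h(theta, X) - y
    return 0.5 * np.sum(residual ** 2)

# toy data generated from y = 1 + 2x with no noise, so the true theta gives J = 0
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # first column is x_0 = 1
y = np.array([1.0, 3.0, 5.0])
print(J(np.array([1.0, 2.0]), X, y))  # 0.0 at the true parameters
```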
Equation (3) is also known as ordinary least squares (the least squares method). It is very similar to the mean squared error (MSE); the differences are:
- The least-squares loss is not divided by the number of samples m, while the mean squared error is divided by m.
- "The method of solving the model based on minimizing the mean squared error is called the 'least squares method'." — Zhou Zhihua, *Machine Learning*

As for why the loss function of linear regression is ordinary least squares, see the probabilistic interpretation in the PDF linked below.
Goal:
- Minimize the ordinary least squares loss \( J(\theta) \)
Solutions:
- Gradient descent
- Normal equation
$$ \theta=(X^TX)^{-1}X^T\vec{y}\tag{4} $$
See https://see.stanford.edu/mate...
Logistic regression
Logistic function:
$$ g(z)=\frac {1} {1+e^{-z}}\tag{5} $$
$$ g'(z)=g(z)(1-g(z))\tag{6} $$
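Equations (5) and (6) can be checked numerically with a short sketch; it also shows the derivative flattening out for large \( |z| \):

```python
import numpy as np

def g(z):
    """Logistic (sigmoid) function (5): g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def g_prime(z):
    """Derivative (6): g'(z) = g(z)(1 - g(z))."""
    gz = g(z)
    return gz * (1.0 - gz)

print(g(0.0))        # 0.5
print(g_prime(0.0))  # 0.25, the maximum of the derivative
print(g_prime(20.0)) # ~0: the gradient vanishes for large |z|
```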
As shown in Fig. 1 above, the sigmoid function's range is (0, 1) and its domain is R. Its derivative reaches its maximum value of 0.25 at x = 0, but as x tends to positive or negative infinity the derivative approaches 0, so there is a risk of vanishing gradients.
Hypotheses:
$$ h_\theta(x)=g(\theta^Tx)=\frac {1} {1+e^{-\theta^Tx}}\tag{7} $$
Feeding the output of linear regression into the logistic function yields logistic regression: the range of values changes from R to (0, 1). With this we can do binary classification: outputs greater than 0.5 belong to one class, and outputs less than 0.5 to the other.
MLE(Maximum Likelihood Estimate)
Next we derive the loss function from the perspective of maximum likelihood estimation.
Generally, MLE seeks \( \arg\max_\theta\, p(D|\theta) \), that is, the parameters under which the observed data are most probable.
Assume:
$$ P(y=1|x;\theta)=h_\theta(x)\tag{8} $$
$$ P(y=0|x;\theta)=1-h_\theta(x)\tag{9} $$
Equations (8) and (9) can be combined into equation (10):
$$ P(y|x;\theta)=(h_\theta(x))^y(1-h_\theta(x))^{1-y}\tag{10} $$
Equation (10) is the likelihood of a single sample. How, then, do we express the likelihood of the whole sample set? See equation (11):
Likelihood:
$$ \begin{equation}\begin{split} L(\theta)&=p(\vec{y}|X;\theta)\\ &=\prod_{i=1}^{m}p(y^{(i)}|x^{(i)};\theta)\\ &=\prod_{i=1}^{m}(h_\theta(x^{(i)}))^{y^{(i)}}(1-h_\theta(x^{(i)}))^{1-y^{(i)}}\tag{11} \end{split}\end{equation} $$
This is in fact our objective function, but to find \( \arg\max L(\theta) \) we need to simplify it further. The simplification is to take the logarithm, i.e. \( \log(L(\theta)) \). The advantages of this are:
- It simplifies the expression, converting a product into a sum
- It prevents numerical underflow during computation
- log is a monotonically increasing function, so it does not change where the maximum of the original function lies
Log Likelihood
$$ \begin{equation}\begin{split} \ell(\theta)&=log(L(\theta))\\ &=\sum_{i=1}^{m}y^{(i)}logh_\theta(x^{(i)})+(1-y^{(i)})log(1-h_\theta(x^{(i)}))\tag{12} \end{split}\end{equation} $$
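The log-likelihood (12) can be sketched directly; the toy data below is an assumption for illustration, and shows that parameters aligned with the labels score a higher (less negative) log-likelihood:

```python
import numpy as np

def log_likelihood(theta, X, y):
    """Log-likelihood (12): sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))  # h_theta(x^(i)) for every sample
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# toy data: label 1 iff the feature is positive (first column is x_0 = 1)
X = np.column_stack([np.ones(4), np.array([-2.0, -1.0, 1.0, 2.0])])
y = np.array([0.0, 0.0, 1.0, 1.0])

# a theta aligned with the labels beats one pointing the wrong way
print(log_likelihood(np.array([0.0, 1.0]), X, y))
print(log_likelihood(np.array([0.0, -1.0]), X, y))
```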
Equation (12) is the log-likelihood we were looking for. Adding a minus sign in front of it turns it into the loss function: the log-likelihood is to be maximized, while the loss function is to be minimized.
Gradient ascent for one sample:
For convenience of derivation, the sum over samples is omitted; that is, only a single sample is considered.
$$ \begin{equation}\begin{split} \frac{\partial \ell(\theta)}{\partial \theta_j}&=\left(y\frac{1}{g(\theta^Tx)}-(1-y)\frac{1}{1-g(\theta^Tx)}\right)\frac{\partial g(\theta^Tx)}{\partial \theta_j}\\ &=\left(y\frac{1}{g(\theta^Tx)}-(1-y)\frac{1}{1-g(\theta^Tx)}\right)g(\theta^Tx)(1-g(\theta^Tx))\frac{\partial \theta^Tx}{\partial \theta_j}\\ &=(y(1-g(\theta^Tx))-(1-y)g(\theta^Tx))x_j\\ &=(y-h_\theta(x))x_j\tag{13} \end{split}\end{equation} $$
So the gradient ascent rule is:
$$ \theta_j:=\theta_j+\alpha(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}\tag{14} $$
Loss function
As mentioned above, equation (12) with a minus sign is our loss function:
$$ \arg\min_\theta\,-\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]\tag{15} $$
And gradient descent rule for all samples:
$$ \theta_j:=\theta_j-\alpha\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}\tag{16} $$
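The batch update rule (16), i.e. gradient descent on the loss (15), can be sketched as a short training loop. The toy separable dataset and the hyperparameters here are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, n_iters=1000):
    """Batch rule (16): theta_j := theta_j - alpha * sum_i (h_theta(x_i) - y_i) x_ij."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        error = sigmoid(X @ theta) - y  # h_theta(x^(i)) - y^(i) for all samples
        theta -= alpha * X.T @ error    # vectorized form of the sum over i
    return theta

# toy separable data: label 1 iff x > 0 (first column is the bias x_0 = 1)
X = np.column_stack([np.ones(6), np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
theta = fit_logistic(X, y)
preds = (sigmoid(X @ theta) > 0.5).astype(float)
print(preds)  # matches y on this separable toy set
```

Note that minimizing the loss (15) by rule (16) is the same as maximizing the log-likelihood (12) by the ascent rule (14); only the sign convention differs.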
As we can see in the figure above, we penalize wrong predictions with an increasingly larger cost.