Deep learning mathematics foundation
2022-07-02 18:49:00 【live_ for_ myself】
Table of contents
- Linear algebra
- Probability
- A simple view of the frequentist and Bayesian schools
- Probability distributions and probability mass functions
- Probability density functions
- Marginal probability
- Conditional probability
- The chain rule of conditional probability
- Independence and conditional independence
- Expectation, variance and covariance
- Covariance, correlation and independence
- Prior and posterior
- Bayes' rule
- Information theory
- Numerical computation
Linear algebra
Scalars, vectors, matrices and tensors
- A scalar is a single number.
- A vector is an ordered array of numbers, by convention written as a column.
- A matrix is a two-dimensional array of numbers.
- A tensor is an array with more than two dimensions (axes).
- Broadcasting
When a matrix and a vector are added, the result is another matrix: $C=A+b$, where $C_{i,j}=A_{i,j}+b_j$. In other words, the vector $b$ is added to every row of the matrix $A$. This implicit copying of $b$ to every row is called broadcasting.
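The broadcasting rule $C_{i,j}=A_{i,j}+b_j$ can be tried out directly. The following is a minimal sketch that assumes NumPy is available; the arrays `A` and `b` are made-up illustration values, not data from the article.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])    # shape (2, 3)
b = np.array([10.0, 20.0, 30.0])   # shape (3,)

# NumPy broadcasts b across the rows of A: C[i, j] = A[i, j] + b[j]
C = A + b

# The same result written out explicitly, one row at a time
C_explicit = np.stack([row + b for row in A])
```

Both `C` and `C_explicit` hold the same values; broadcasting simply avoids building the repeated copies of `b` by hand.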
Matrix and vector multiplication
Matrix product
For $C=AB$, the number of columns of $A$ must equal the number of rows of $B$. If $A$ has shape $m\times n$ and $B$ has shape $n\times p$, then $C$ has shape $m\times p$.
Written element by element:
$C_{i,j}=\sum_{k}A_{i,k}B_{k,j}$
Hadamard product, or element-wise product
Written $A\odot B$. It multiplies corresponding elements, so $A$ and $B$ must have the same shape.
Vector dot product
The dot product of two vectors $x$ and $y$ of the same dimension can be regarded as the matrix product $x^Ty$.
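The three products can be compared side by side. This is a small sketch assuming NumPy; the matrices are arbitrary illustration values.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])          # shape (2, 3)
B = np.array([[1, 0],
              [0, 1],
              [2, 2]])             # shape (3, 2)

C = A @ B                          # matrix product, shape (2, 2)

# Element definition: C[i, j] = sum_k A[i, k] * B[k, j], here for i=0, j=1
c_01 = sum(A[0, k] * B[k, 1] for k in range(A.shape[1]))

H = A * A                          # Hadamard (element-wise) product, same shape as A

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
dot_xy = x @ y                     # dot product: 1*4 + 2*5 + 3*6 = 32
```

In NumPy, `@` is the matrix product and `*` is the element-wise product, which mirrors the distinction between $AB$ and $A\odot B$.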
Properties of the matrix product
Distributive law
$A(B+C)=AB+AC$
Associative law
$A(BC)=(AB)C$
The dot product of two vectors is commutative
$x^Ty=y^Tx$
The matrix product itself is, in general, not commutative.
Matrix notation can represent a system of linear equations
$Ax=b$
Written out in full:
$A_{1,1}x_1+A_{1,2}x_2+\dots+A_{1,n}x_n=b_1$
$A_{2,1}x_1+A_{2,2}x_2+\dots+A_{2,n}x_n=b_2$
and so on for each remaining row of $A$.
Inverse matrix
The inverse of a matrix $A$ is written $A^{-1}$ and satisfies:
$A^{-1}A=I_n$
When $A^{-1}$ exists, it can be used to solve a system of linear equations:
$Ax=b$
$A^{-1}Ax=A^{-1}b$
$I_nx=A^{-1}b$
$x=A^{-1}b$
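The derivation $x=A^{-1}b$ can be checked on a small system. This is a sketch assuming NumPy; `A` and `b` are illustration values. It also shows `np.linalg.solve`, which solves the system without forming the inverse explicitly and is usually the more numerically robust choice in practice.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

A_inv = np.linalg.inv(A)
x_via_inverse = A_inv @ b            # x = A^{-1} b, as in the derivation

# Solves A x = b directly, without computing A^{-1}
x_via_solve = np.linalg.solve(A, b)
```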
In machine learning, the size of a vector is usually measured with a **norm**. Formally, the $L^p$ norm is defined as:
$\|x\|_p=\left(\sum_{i}|x_i|^p\right)^{\frac{1}{p}}$
Intuitively, the norm of a vector $x$ measures the distance from the origin to the point $x$. When $p=2$, the $L^2$ norm is called the Euclidean norm.
The squared $L^2$ norm is also commonly used to measure the size of a vector, and can be computed simply as the dot product $x^Tx$.
The squared $L^2$ norm is more convenient than the $L^2$ norm itself, both mathematically and computationally. For example, the derivative of the squared $L^2$ norm with respect to each element of $x$ depends only on that element, whereas the derivative of the $L^2$ norm with respect to each element depends on the whole vector.
However, the squared $L^2$ norm grows very slowly near the origin. In some machine learning applications it is important to distinguish elements that are exactly zero from elements that are small but nonzero. In these cases the $L^1$ norm can be used:
$\|x\|_1=\sum_{i}|x_i|$
The $L^1$ norm is commonly used when the difference between zero and nonzero elements matters: whenever an element of $x$ moves away from 0 by $\epsilon$, the $L^1$ norm increases by $\epsilon$.
Maximum norm (max norm)
The $L^\infty$ norm is the absolute value of the element with the largest magnitude in the vector:
$\|x\|_\infty=\max_i|x_i|$
Measuring the size of a matrix
The Frobenius norm measures the size of a matrix:
$\|A\|_F=\sqrt{\sum_{i,j}A_{i,j}^2}$
It is analogous to the $L^2$ norm of a vector.
The dot product of two vectors can be expressed in terms of norms:
$x^Ty=\|x\|_2\|y\|_2\cos\theta$
where $\theta$ is the angle between $x$ and $y$.
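The norms above map onto `np.linalg.norm` with different `ord` arguments. A minimal sketch, assuming NumPy and using made-up vectors:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

l1 = np.linalg.norm(x, ord=1)            # |3| + |-4| + |0| = 7
l2 = np.linalg.norm(x)                   # sqrt(9 + 16) = 5
l2_squared = x @ x                       # squared L2 norm via the dot product = 25
l_inf = np.linalg.norm(x, ord=np.inf)    # largest absolute value = 4

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])
fro = np.linalg.norm(A, ord="fro")       # sqrt(1 + 4 + 4 + 16) = 5

# Cosine of the angle between x and y, from x^T y = ||x|| ||y|| cos(theta)
y = np.array([4.0, 3.0, 0.0])
cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
```

Here `x` and `y` are orthogonal, so `cos_theta` comes out as 0.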
Special matrices
Diagonal matrix
Only the main diagonal contains nonzero elements; every other entry is 0. $\mathrm{diag}(v)$ denotes the square diagonal matrix whose diagonal elements are given by the elements of the vector $v$.
To compute the product $\mathrm{diag}(v)x$, scale each element $x_i$ by $v_i$. In other words, $\mathrm{diag}(v)x=v\odot x$.
A square diagonal matrix has an inverse if and only if every diagonal element is nonzero, in which case $\mathrm{diag}(v)^{-1}=\mathrm{diag}([\frac{1}{v_1},\dots,\frac{1}{v_n}]^T)$; that is, each diagonal element is replaced by its reciprocal.
Symmetric matrix
A symmetric matrix is a matrix equal to its own transpose: $A=A^T$.
Orthogonal matrix
A unit vector is a vector with unit norm, that is:
$\|x\|_2=1$
If $x^Ty=0$, the vectors $x$ and $y$ are orthogonal to each other. If both vectors have nonzero norm, the angle between them is 90°.
In $R^n$, at most $n$ vectors with nonzero norm can be mutually orthogonal. If the vectors are mutually orthogonal and each has norm 1, they are called orthonormal.
- An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal, that is:
$A^TA=AA^T=I$
This implies $A^{-1}=A^T$.
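The diagonal and orthogonal matrix properties can be checked numerically. This is a sketch assuming NumPy; the vector `v` and the rotation angle are arbitrary, and a 2D rotation matrix is used only as a convenient example of an orthogonal matrix.

```python
import numpy as np

v = np.array([2.0, 4.0, 5.0])
x = np.array([1.0, 1.0, 2.0])

D = np.diag(v)
scaled = D @ x                  # same values as the element-wise product v * x
D_inv = np.diag(1.0 / v)        # inverse exists because every v_i is nonzero

theta = np.pi / 6
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation matrix, which is orthogonal

Q_inverse = np.linalg.inv(Q)    # should match Q.T up to rounding
```

Because $A^{-1}=A^T$ for an orthogonal matrix, its inverse costs only a transpose, which is one reason orthogonal matrices are of interest.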
Eigendecomposition
Eigendecomposition decomposes a matrix into a set of eigenvectors and eigenvalues.
An eigenvector of a square matrix $A$ is a nonzero vector $v$ such that multiplication by $A$ only rescales $v$:
$Av=\lambda v$
The scalar $\lambda$ is called the eigenvalue corresponding to this eigenvector.
If $v$ is an eigenvector of $A$, then any rescaled vector $sv$ (with $s\neq 0$) is also an eigenvector of $A$, and $sv$ and $v$ have the same eigenvalue. For this reason, usually only unit eigenvectors are considered.
Suppose the matrix $A$ has $n$ linearly independent eigenvectors $\{v^{(1)},v^{(2)},\dots,v^{(n)}\}$ with corresponding eigenvalues $\{\lambda_1,\lambda_2,\dots,\lambda_n\}$.
Concatenate the eigenvectors into a matrix with one eigenvector per column: $V=[v^{(1)},v^{(2)},\dots,v^{(n)}]$.
Similarly, concatenate the eigenvalues into a vector: $\lambda=[\lambda_1,\lambda_2,\dots,\lambda_n]^T$.
The eigendecomposition of $A$ can then be written as:
$A=V\mathrm{diag}(\lambda)V^{-1}$
- Every real symmetric matrix (real entries, equal to its own transpose) can be decomposed using only real eigenvectors and real eigenvalues:
$A=Q\Lambda Q^T$, where $Q$ is an orthogonal matrix composed of eigenvectors of $A$ and $\Lambda$ is a diagonal matrix.
The eigenvalue $\Lambda_{i,i}$ corresponds to the eigenvector in column $i$ of $Q$, written $Q_{:,i}$.
Because $Q$ is an orthogonal matrix, $A$ can be viewed as scaling space by $\lambda_i$ in the direction $v^{(i)}$, as shown in the figure below:
In the figure, the matrix $A$ has two orthonormal eigenvectors: $v^{(1)}$ with eigenvalue $\lambda_1$ and $v^{(2)}$ with eigenvalue $\lambda_2$. The left panel shows the set of all unit vectors $u\in R^2$, which forms the unit circle. The right panel shows the set of all points $Au$; the effect of $A$ is to stretch space by $\lambda_i$ in the direction $v^{(i)}$.
Although every real symmetric matrix $A$ has an eigendecomposition, the eigendecomposition may not be unique. If two or more eigenvectors share the same eigenvalue, then any set of orthogonal vectors lying in the span of those eigenvectors are also eigenvectors with that eigenvalue, so $Q$ could equivalently be built from them instead. By convention, the entries of $\Lambda$ are usually sorted in descending order. Under this convention, the eigendecomposition is unique if and only if all the eigenvalues are unique.
A matrix whose eigenvalues are all positive is called positive definite; a matrix whose eigenvalues are all non-negative is called positive semidefinite. Similarly, a matrix whose eigenvalues are all negative is called negative definite, and a matrix whose eigenvalues are all non-positive is called negative semidefinite.
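For a real symmetric matrix, the decomposition $A=Q\Lambda Q^T$ can be computed and verified numerically. The sketch below assumes NumPy and uses `np.linalg.eigh`, which is intended for symmetric input and returns eigenvalues in ascending order; the matrix `A` is a made-up example.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])              # real symmetric

eigenvalues, Q = np.linalg.eigh(A)      # columns of Q are unit eigenvectors

# Reorder to the descending-eigenvalue convention
order = np.argsort(eigenvalues)[::-1]
eigenvalues, Q = eigenvalues[order], Q[:, order]

A_rebuilt = Q @ np.diag(eigenvalues) @ Q.T

# Positive definite means every eigenvalue is strictly positive
is_positive_definite = bool(np.all(eigenvalues > 0))
```

For this matrix the eigenvalues are 3 and 1, so it is positive definite.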
Singular value decomposition
Singular value decomposition (SVD) factorizes a matrix into singular vectors and singular values.
SVD is more generally applicable than eigendecomposition. Every real matrix has a singular value decomposition, but not every matrix has an eigendecomposition. For example, a non-square matrix has no eigendecomposition, so only SVD can be used.
When analyzing a matrix $A$ with eigendecomposition, we have:
$A=V\mathrm{diag}(\lambda)V^{-1}$
SVD is similar, but this time $A$ is written as the product of three matrices:
$A=UDV^T$
Suppose $A$ is an $m\times n$ matrix. Then $U$ is an $m\times m$ matrix, $D$ is an $m\times n$ matrix, and $V$ is an $n\times n$ matrix.
Each of these matrices is defined to have a special structure: $U$ and $V$ are both orthogonal matrices, and $D$ is a diagonal matrix. Note that $D$ is not necessarily square.
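The shapes of the three factors are easy to confirm on a non-square matrix. This is a sketch assuming NumPy and the usual factorization $A=UDV^T$; note that `np.linalg.svd` returns the singular values as a 1-D array and returns $V^T$ rather than $V$, so the rectangular matrix `D` is assembled by hand here.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # m = 2, n = 3

# full_matrices=True gives U of shape (m, m) and Vt of shape (n, n)
U, singular_values, Vt = np.linalg.svd(A, full_matrices=True)

# Place the singular values on the diagonal of an m x n matrix D
D = np.zeros(A.shape)
k = len(singular_values)
D[:k, :k] = np.diag(singular_values)

A_rebuilt = U @ D @ Vt
```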
A simple view of the frequentist and Bayesian schools
In short, the frequentist and Bayesian schools differ in both their starting point and their goal when dealing with "uncertainty". The frequentist school takes the viewpoint of "nature" and tries to model the "event" itself directly: if the frequency with which an event A occurs in independent repeated trials tends to a limit $p$, then that limit is the probability of the event. For example, to find the probability that a tossed coin lands heads up, we would keep tossing the coin; the frequency of heads as the number of tosses tends to infinity is the probability of heads.
The Bayesian school, by contrast, does not try to characterize the "event" itself; it starts from the viewpoint of the "observer". It does not claim that "the event itself is random" or that "the world has some intrinsic randomness", and in fact says nothing about "the world itself". It starts only from the premise that "the observer's knowledge is incomplete" and builds, within the framework of Bayesian probability theory, a set of methods for reasoning about uncertain knowledge. What the frequentist school calls a "random event" is, in the Bayesian view, not an event with some objective randomness, but simply an event whose outcome the observer does not know: the outcome has not yet become part of the observer's state of knowledge. In that situation the observer tries to infer the outcome from the "evidence" observed so far, and so can only guess. Bayesian probability theory aims to build a reasonably complete framework describing the "guessing process" that best serves rational inference. Therefore, in the Bayesian framework, the same event can be a "deterministic event" for someone who knows the outcome and a "random event" for someone who does not. Randomness does not come from whether the event itself occurs; it only describes the observer's state of knowledge about the event.
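The frequentist reading of the coin example can be illustrated with a simulation. This is only a sketch: the coin is simulated with an assumed true heads probability of 0.5, and the seed and toss counts are arbitrary choices.

```python
import random

rng = random.Random(0)      # fixed seed so the run is repeatable
p_heads = 0.5               # assumed true probability of the simulated coin

def heads_frequency(n_tosses):
    """Fraction of heads observed in n_tosses simulated coin tosses."""
    heads = sum(rng.random() < p_heads for _ in range(n_tosses))
    return heads / n_tosses

# Observed frequency for an increasing number of tosses
frequencies = {n: heads_frequency(n) for n in (10, 1_000, 100_000)}
```

With few tosses the frequency can be far from 0.5; with many tosses it settles close to it, which is the limit the frequentist definition refers to.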
Probability distributions and probability mass functions
Intuitively, the shape that data takes in a statistical chart is called its distribution. The probability distribution of a discrete random variable can be described with a **probability mass function (PMF)**.
Sometimes we first define a random variable and then use the ~ symbol to indicate the distribution it follows, as in $\mathrm{x}\sim P(\mathrm{x})$.
A probability mass function can act on several random variables at once. Such a distribution over multiple random variables is called a joint probability distribution. $P(\mathrm{x}=x,\mathrm{y}=y)$ denotes the probability that $\mathrm{x}=x$ and $\mathrm{y}=y$ occur simultaneously, and can be abbreviated as $P(x,y)$.
Probability density function
When the object of study is a continuous random variable, its probability distribution is described by a probability density function (PDF) rather than a probability mass function.
A probability density function $p(x)$ does not directly give the probability of a particular state. Instead, it gives the probability of landing in an infinitesimal region of volume $\delta x$ as $p(x)\delta x$.
We can integrate the probability density function to obtain the actual probability mass of a set of points. Specifically, the probability that $x$ lies in a set $S$ is the integral of $p(x)$ over that set. In the univariate case, the probability that $x$ lies in the interval $[a,b]$ is $\int_{a}^{b}p(x)dx$.
An example is the uniform distribution:
The uniform distribution is defined by two parameters $a$ and $b$, the minimum and maximum values on the number line, and is commonly abbreviated $U(a,b)$. Its density is $\frac{1}{b-a}$ on $[a,b]$ and 0 elsewhere.
We usually write $\mathrm{x}\sim U(a,b)$ to indicate that $\mathrm{x}$ is uniformly distributed on $[a,b]$.
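To see how a density turns into a probability, the uniform density can be integrated numerically. This is a sketch using only the standard library; the midpoint-rule `integrate` helper and the values `a=2`, `b=6` are illustrative assumptions.

```python
def uniform_pdf(x, a, b):
    """Density of U(a, b): 1/(b-a) inside [a, b], 0 outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def integrate(f, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

a, b = 2.0, 6.0
# Probability that x falls in [3, 4]: should be close to (4 - 3) / (6 - 2)
prob_3_to_4 = integrate(lambda x: uniform_pdf(x, a, b), 3.0, 4.0)
# Integrating over a range that covers [a, b] should give total mass close to 1
total_mass = integrate(lambda x: uniform_pdf(x, a, b), 0.0, 8.0)
```

Note that the density value itself (0.25 here) is not a probability; only its integral over a region is.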
Marginal probability
Sometimes we know the joint probability distribution over a set of variables and want the probability distribution over just a subset of them. The probability distribution over the subset is called the marginal probability distribution.
For discrete random variables $\mathrm{x}$ and $\mathrm{y}$ with known $P(\mathrm{x},\mathrm{y})$, $P(\mathrm{x})$ can be found with the sum rule:
$P(\mathrm{x}=x)=\sum_{y}P(\mathrm{x}=x,\mathrm{y}=y)$
For continuous variables, use integration instead of summation:
$p(x)=\int p(x,y)dy$
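For a discrete joint distribution stored as a table, the sum rule is just a sum along one axis. A minimal sketch assuming NumPy; the joint table `P_xy` is a hypothetical example.

```python
import numpy as np

# Hypothetical joint table P(x, y): rows index x, columns index y
P_xy = np.array([[0.10, 0.20, 0.10],
                 [0.30, 0.05, 0.25]])

P_x = P_xy.sum(axis=1)   # sum over y gives the marginal of x
P_y = P_xy.sum(axis=0)   # sum over x gives the marginal of y
```

The name "marginal" comes from writing these row and column sums in the margins of the table.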
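Conditional probability
The probability of an event given that some other event has occurred is called a conditional probability. The probability of $\mathrm{y}=y$ given $\mathrm{x}=x$ is written $P(\mathrm{y}=y\mid\mathrm{x}=x)$. It can be computed with the following formula:
$P(\mathrm{y}=y\mid\mathrm{x}=x)=\frac{P(\mathrm{y}=y,\mathrm{x}=x)}{P(\mathrm{x}=x)}$
The conditional probability is defined only when $P(\mathrm{x}=x)>0$.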
The chain rule of conditional probability
- Any joint probability distribution over several random variables can be decomposed into a product of conditional distributions over only one variable each:
$P(x^{(1)},\dots,x^{(n)})=P(x^{(1)})\prod_{i=2}^{n}P(x^{(i)}\mid x^{(1)},\dots,x^{(i-1)})$
This rule is called the chain rule or product rule of probability.
It follows from the definition of conditional probability. For example:
$P(a,b,c)=P(a\mid b,c)P(b,c)$
$P(b,c)=P(b\mid c)P(c)$
$P(a,b,c)=P(a\mid b,c)P(b\mid c)P(c)$
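The three-variable factorization $P(a,b,c)=P(a\mid b,c)P(b\mid c)P(c)$ can be verified on a randomly generated joint table. This is a sketch assuming NumPy; the table sizes and seed are arbitrary, and each conditional is computed as a joint divided by a marginal.

```python
import numpy as np

rng = np.random.default_rng(0)
P_abc = rng.random((2, 3, 4))     # axes: a, b, c
P_abc /= P_abc.sum()              # normalize into a valid joint distribution

P_bc = P_abc.sum(axis=0)          # P(b, c), shape (3, 4)
P_c = P_bc.sum(axis=0)            # P(c), shape (4,)

P_a_given_bc = P_abc / P_bc       # P(a | b, c) = P(a, b, c) / P(b, c)
P_b_given_c = P_bc / P_c          # P(b | c) = P(b, c) / P(c)

# Chain rule: multiply the single-variable conditionals back together
rebuilt = P_a_given_bc * P_b_given_c * P_c
```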
Independence and conditional independence
Two random variables x and y are independent if their probability distribution can be expressed as a product of two factors, one involving only x and the other involving only y:
$\forall x,y:\ p(\mathrm{x}=x,\mathrm{y}=y)=p(\mathrm{x}=x)p(\mathrm{y}=y)$
This is because in general $p(x,y)=p(x\mid y)p(y)$, and independence means $p(x\mid y)=p(x)$.
Two random variables x and y are conditionally independent given a random variable z if the conditional probability distribution over x and y factorizes in this way for every value of z:
$\forall x,y,z:\ p(\mathrm{x}=x,\mathrm{y}=y\mid\mathrm{z}=z)=p(\mathrm{x}=x\mid\mathrm{z}=z)p(\mathrm{y}=y\mid\mathrm{z}=z)$
Independence and conditional independence can be written compactly: $x\perp y$ means that x and y are independent, and $x\perp y\mid z$ means that x and y are conditionally independent given z.
Expectation, variance and covariance
The expectation, or expected value, of a function $f(x)$ with respect to a probability distribution $P(x)$ is the average value that $f$ takes when $x$ is drawn from $P$. For discrete random variables it is computed with a sum:
$E_{x\sim P}[f(x)]=\sum_{x}P(x)f(x)$
Of course, the special case $f(x)=x$, which gives the mean of $x$, is more commonly seen: $E[x]=\sum_{x}xP(x)$.
For continuous random variables it is computed with an integral:
$E_{x\sim p}[f(x)]=\int p(x)f(x)dx$
- When the distribution is clear from the context, we may write only the name of the random variable that the expectation is over, as in $E_x[f(x)]$.
- If it is also clear which random variable the expectation is over, the subscript may be omitted entirely, as in $E[f(x)]$.
- By default, $E[\cdot]$ averages over the values of all the random variables inside the brackets. When there is no ambiguity, the square brackets may also be omitted.
The variance measures how much the values of a function of a random variable $x$ vary as different values of $x$ are sampled from its probability distribution:
$\mathrm{Var}(f(x))=E[(f(x)-E[f(x)])^2]$
The square root of the variance is called the standard deviation.
The covariance gives some sense of how strongly two variables are linearly related, as well as the scale of these variables:
$\mathrm{Cov}(f(x),g(y))=E[(f(x)-E[f(x)])(g(y)-E[g(y)])]$
- If the absolute value of the covariance is large, the values of the variables change a lot and are both far from their respective means at the same time.
- If the covariance is positive, both variables tend to take relatively large values at the same time. If the covariance is negative, one variable tends to take a relatively large value when the other takes a relatively small value, and vice versa.
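The definitions above can be evaluated directly. This is a sketch assuming NumPy: the first part uses a hypothetical discrete distribution `P_x` over three values and computes the expectation as $\sum_x P(x)f(x)$ and the variance as $E[(x-E[x])^2]$; the second part estimates a covariance from simulated samples, where the relation `b = 2a + noise` is an arbitrary choice that makes the two variables positively related.

```python
import numpy as np

# Hypothetical discrete distribution over three values of x
x_values = np.array([0.0, 1.0, 2.0])
P_x = np.array([0.2, 0.5, 0.3])

def f(x):
    return x ** 2

E_x = np.sum(P_x * x_values)                  # expectation of x: 1.1
E_f = np.sum(P_x * f(x_values))               # expectation of f(x) = x^2: 1.7
Var_x = np.sum(P_x * (x_values - E_x) ** 2)   # variance of x: 0.49
std_x = np.sqrt(Var_x)                        # standard deviation: 0.7

# Sample-based covariance estimate for two positively related variables
rng = np.random.default_rng(0)
a = rng.normal(size=10_000)
b = 2.0 * a + rng.normal(size=10_000)
cov_ab = np.mean((a - a.mean()) * (b - b.mean()))   # positive, near 2
```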
Covariance, correlation and independence
Covariance and dependence are related, but they are distinct concepts. If two variables are independent, their covariance is zero; if two variables have nonzero covariance, they must be dependent.
However, independence is a different property from zero covariance. If two variables have zero covariance, there is no linear dependence between them. Independence is a stronger requirement than zero covariance, because independence also excludes nonlinear relationships. Two variables can be dependent and still have zero covariance.
For example, suppose we first sample a real number $x$ from the uniform distribution on $[-1,1]$, and then sample a random variable $s$ that takes the value 1 with probability $\frac{1}{2}$ and $-1$ otherwise. We can then generate a random variable $y$ by setting $y=sx$. Clearly $x$ and $y$ are not independent, because $x$ completely determines the magnitude of $y$. However, $\mathrm{Cov}(x,y)=0$.
The covariance matrix of a random vector $x\in R^n$ is an $n\times n$ matrix with $\mathrm{Cov}(x)_{i,j}=\mathrm{Cov}(x_i,x_j)$. Its diagonal elements are the variances: $\mathrm{Cov}(x_i,x_i)=\mathrm{Var}(x_i)$.
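The $y=sx$ example can be simulated to see zero covariance coexisting with dependence. This is a sketch assuming NumPy; the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(-1.0, 1.0, size=n)
s = rng.choice([-1.0, 1.0], size=n)     # +1 or -1 with probability 1/2 each
y = s * x

# Sample covariance: close to 0 even though y depends on x
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# The dependence shows up in the magnitudes: |y| always equals |x|
magnitudes_match = bool(np.allclose(np.abs(x), np.abs(y)))

# 2 x 2 covariance matrix of (x, y); the diagonal holds the variances
cov_matrix = np.cov(np.stack([x, y]))
```

The off-diagonal entries of `cov_matrix` are near zero, while the relationship between `|x|` and `|y|` is exact, which covariance cannot detect because it is not linear.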
Prior and posterior
For a variable $c$, $P(c=i)$ is the prior probability. The word "prior" indicates that it expresses the model's belief about $c$ before $x$ has been observed. By contrast, $P(c\mid x)$ is the posterior probability, because it is computed after observing $x$.
Bayes' rule
We often know $P(y\mid x)$ and need to compute $P(x\mid y)$. Fortunately, if we also know $P(x)$, we can use Bayes' rule:
$P(x\mid y)=\frac{P(x)P(y\mid x)}{P(y)}$
- Although $P(y)$ appears in the formula, it can usually be computed as $P(y)=\sum_{x}P(y\mid x)P(x)$, so we do not need to know $P(y)$ in advance. Bayes' rule follows directly from the definition of conditional probability.
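A small numeric example makes the roles of prior, likelihood and posterior concrete. This sketch uses only built-in Python and the standard form of Bayes' rule, $P(x\mid y)=P(y\mid x)P(x)/P(y)$; all the probabilities and the labels `"cause"` and `"no_cause"` are hypothetical.

```python
# Hypothetical setup: x is a binary cause, y is the event "the observation was made"
P_x = {"cause": 0.01, "no_cause": 0.99}            # prior P(x)
P_y_given_x = {"cause": 0.95, "no_cause": 0.05}    # likelihood P(y | x)

# P(y) = sum_x P(y | x) P(x), so it does not need to be known in advance
P_y = sum(P_y_given_x[x] * P_x[x] for x in P_x)

# Posterior P(x | y) for each value of x
posterior = {x: P_y_given_x[x] * P_x[x] / P_y for x in P_x}
```

With these numbers the posterior probability of `"cause"` is about 0.16: much higher than the 0.01 prior, but still well below the 0.95 likelihood, because the cause is rare to begin with.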
Information theory
The basic idea of information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred. A message saying "the sun rose this morning" carries so little information that it is not worth sending, while a message saying "there was a solar eclipse this morning" is very informative.
We would like to quantify information in a way that formalizes this idea. Specifically:
- Likely events should have low information content, and in the extreme case, events that are guaranteed to happen should have no information content at all.
- Less likely events should have higher information content.
- Independent events should have additive information. For example, learning that a tossed coin came up heads twice should convey twice as much information as learning that it came up heads once.
To satisfy all three properties, we define the **self-information** of an event $\mathrm{x}=x$ as:
$I(x)=-\log P(x)$
Here $\log$ is the natural logarithm with base $e$, so $I(x)$ is measured in nats.
Shannon entropy
Self-information deals with only a single outcome. We can quantify the total amount of uncertainty in an entire probability distribution using the Shannon entropy:
$H(x)=E_{x\sim P}[I(x)]=-E_{x\sim P}[\log P(x)]$
also written $H(P)$. In other words, the Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution. Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy; distributions that are closer to uniform have high entropy. When $x$ is continuous, the Shannon entropy is known as the differential entropy.
The following figure shows the Shannon entropy of a binary random variable.
This figure shows how distributions closer to deterministic have low Shannon entropy, while distributions close to uniform have high Shannon entropy. The horizontal axis is $p$, the probability that the binary random variable equals 1. When $p$ is near 0, the distribution is nearly deterministic, because the random variable is nearly always 0. When $p$ is near 1, the distribution is nearly deterministic, because the random variable is nearly always 1. When $p=0.5$, the entropy is maximal, because the distribution is uniform over the two outcomes (0 and 1).
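The behaviour of the binary entropy curve can be reproduced with a few lines of code. This sketch uses only the standard library, takes self-information as $-\log P(x)$ and entropy as $-\sum_x P(x)\log P(x)$ with natural logarithms (nats), and evaluates a handful of arbitrarily chosen values of $p$.

```python
import math

def self_information(p):
    """Self-information -log P(x) of an event with probability p, in nats."""
    return -math.log(p)

def entropy(probs):
    """Shannon entropy -sum_x P(x) log P(x) in nats; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def binary_entropy(p):
    """Entropy of a binary variable that equals 1 with probability p."""
    return entropy([p, 1.0 - p])

# Additivity: two independent heads carry twice the information of one head
info_one_head = self_information(0.5)
info_two_heads = self_information(0.5 * 0.5)

# Entropy at a few points along the horizontal axis of the binary entropy plot
curve = {p: binary_entropy(p) for p in (0.01, 0.25, 0.5, 0.75, 0.99)}
```

The largest value in `curve` is at `p = 0.5`, where it equals $\log 2$ nats, and the values shrink toward 0 as `p` approaches 0 or 1.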
KL divergence
If we have two separate probability distributions $P(x)$ and $Q(x)$ over the same random variable $x$, we can measure how different they are using the Kullback-Leibler (KL) divergence:
$D_{KL}(P\|Q)=E_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right]=E_{x\sim P}[\log P(x)-\log Q(x)]$
The KL divergence is asymmetric. Suppose we have a distribution $p(x)$ and want to approximate it with another distribution $q(x)$. We can choose to minimize either $D_{KL}(p\|q)$ or $D_{KL}(q\|p)$.
To illustrate the effect of each choice, let $p$ be a mixture of two Gaussian distributions and let $q$ be a single Gaussian. Which direction of the KL divergence to use depends on the problem. Some applications require an approximation $q$ that places high probability wherever the true distribution $p$ places high probability, while other applications require an approximation $q$ that rarely places high probability where the true distribution $p$ places low probability. The choice of direction reflects which of these considerations takes priority in each application.
The figure below (left) shows the effect of minimizing $D_{KL}(p\|q)$. In this case we select a $q$ that has high probability where $p$ has high probability. When $p$ has multiple modes, $q$ blurs the modes together in order to put high probability mass on all of them.
The figure below (right) shows the effect of minimizing $D_{KL}(q\|p)$. In this case we select a $q$ that has low probability where $p$ has low probability. When $p$ has multiple modes that are widely separated, as in the figure, minimizing the KL divergence selects a single mode, to avoid placing probability mass in the low-probability region between the modes of $p$.
The KL divergence is non-negative. It is 0 if and only if $P$ and $Q$ are the same distribution in the case of discrete variables, or equal "almost everywhere" in the case of continuous variables. Because the KL divergence is non-negative and measures the difference between two distributions, it is often thought of as a distance between them. However, it is not a true distance measure, because it is not symmetric: for some $P$ and $Q$, $D_{KL}(P\|Q)\neq D_{KL}(Q\|P)$. This asymmetry means that the choice between $D_{KL}(P\|Q)$ and $D_{KL}(Q\|P)$ has important consequences.
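The asymmetry and non-negativity are easy to observe on a pair of discrete distributions. This sketch uses only the standard library and computes $D_{KL}(P\|Q)=\sum_x P(x)(\log P(x)-\log Q(x))$ in nats; the distributions `P` and `Q` are made-up examples, and the helper assumes $Q(x)>0$ wherever $P(x)>0$.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for discrete distributions given as lists.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.4, 0.1]
Q = [0.3, 0.3, 0.4]

kl_pq = kl_divergence(P, Q)   # D_KL(P || Q)
kl_qp = kl_divergence(Q, P)   # D_KL(Q || P), a different number
kl_pp = kl_divergence(P, P)   # identical distributions give 0
```

Both directions are positive here but differ in value, which is why the direction has to be chosen deliberately.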
Cross-entropy
The cross-entropy, $H(P,Q)=H(P)+D_{KL}(P\|Q)$, is closely related to the KL divergence. It has the same form as the KL divergence but lacks the $\log P(x)$ term on the left:
$H(P,Q)=-E_{x\sim P}[\log Q(x)]$
Minimizing the cross-entropy with respect to $Q$ is equivalent to minimizing the KL divergence, because $Q$ does not participate in the omitted term.
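The identity $H(P,Q)=H(P)+D_{KL}(P\|Q)$ can be confirmed numerically. This sketch uses only the standard library, natural logarithms, and two made-up discrete distributions; the cross-entropy is computed as $-\sum_x P(x)\log Q(x)$.

```python
import math

P = [0.5, 0.4, 0.1]
Q = [0.3, 0.3, 0.4]

# H(P) = -sum_x P(x) log P(x)
entropy_p = -sum(p * math.log(p) for p in P)

# H(P, Q) = -sum_x P(x) log Q(x)
cross_entropy_pq = -sum(p * math.log(q) for p, q in zip(P, Q))

# D_KL(P || Q) = sum_x P(x) (log P(x) - log Q(x))
kl_pq = sum(p * (math.log(p) - math.log(q)) for p, q in zip(P, Q))
```

Since `entropy_p` does not depend on `Q`, any `Q` that lowers `cross_entropy_pq` lowers `kl_pq` by the same amount.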
Numerical computation
Gradient-based optimization
The function we want to minimize or maximize is called the objective function, or criterion. When we are minimizing it, it is also called the cost function, loss function or error function.
We usually denote the value of $x$ that minimizes or maximizes a function with a superscript $*$, for example $x^*=\arg\min f(x)$.
Gradient descent proposes the new point
$x'=x-\epsilon\nabla_xf(x)$
where $\epsilon$ is the learning rate, a positive scalar that determines the step size.
The following figure illustrates how the gradient descent algorithm uses the derivative of a function: it follows the function downhill (in the direction opposite to the sign of the derivative) until it reaches a minimum.
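A one-dimensional example shows the update rule in action. This is a minimal sketch of plain gradient descent with the update $x' = x-\epsilon f'(x)$; the objective $f(x)=(x-3)^2+1$, the starting point, the learning rate and the step count are all illustrative assumptions.

```python
def f(x):
    """Example objective with its minimum at x = 3."""
    return (x - 3.0) ** 2 + 1.0

def grad_f(x):
    """Derivative of f."""
    return 2.0 * (x - 3.0)

def gradient_descent(x0, learning_rate=0.1, steps=200):
    """Repeatedly step against the derivative: x' = x - epsilon * f'(x)."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad_f(x)
    return x

x_star = gradient_descent(x0=-5.0)   # approaches the minimizer x = 3
```

For this quadratic, each step shrinks the distance to the minimum by a constant factor; a learning rate that is too large (above 1.0 for this particular function) would make the iterates diverge instead.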