Introduction to deep learning: automatic differentiation (PyTorch)

2022-07-03 10:33:00 【-Plain heart to warm】

Automatic differentiation

Chain rule and automatic differentiation

Vector chain rule

Scalar chain rule:
$\quad\ {\partial y \over \partial x}={\partial y \over \partial u}{\partial u \over \partial x}$
Extended to vectors:
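
For reference, a sketch of the standard vector forms in numerator layout, with the shape of each factor written underneath. The dimension names are assumptions made here for illustration: $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{u} \in \mathbb{R}^k$, $\mathbf{y} \in \mathbb{R}^m$.

$$
\underbrace{\frac{\partial y}{\partial \mathbf{x}}}_{(1,n)}
= \underbrace{\frac{\partial y}{\partial u}}_{(1,)}\,
  \underbrace{\frac{\partial u}{\partial \mathbf{x}}}_{(1,n)}
\qquad
\underbrace{\frac{\partial y}{\partial \mathbf{x}}}_{(1,n)}
= \underbrace{\frac{\partial y}{\partial \mathbf{u}}}_{(1,k)}\,
  \underbrace{\frac{\partial \mathbf{u}}{\partial \mathbf{x}}}_{(k,n)}
\qquad
\underbrace{\frac{\partial \mathbf{y}}{\partial \mathbf{x}}}_{(m,n)}
= \underbrace{\frac{\partial \mathbf{y}}{\partial \mathbf{u}}}_{(m,k)}\,
  \underbrace{\frac{\partial \mathbf{u}}{\partial \mathbf{x}}}_{(k,n)}
$$

In each case the inner dimension of the product cancels, which is a quick way to check that a chain-rule expression has been written in the right order.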

Example 1

(figure not available)

Example 2

![Example 2](https://img-blog.csdnimg.cn/4649ebba71e2492e91f9869a35f61140.png)

Automatic differentiation

Automatic differentiation computes the derivative of a function at a specified value.
It differs from:
- Symbolic differentiation
  $In[1]:= D[4x^3+x^2+3, x]$
  $Out[1]= 2x+12x^2$
- Numerical differentiation
  ${\partial f(x) \over \partial x }= \lim_{h \to 0}{f(x+h) - f(x) \over h}$
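
To make the contrast concrete, here is a minimal sketch that differentiates the same polynomial $4x^3+x^2+3$ at a single point, once with a finite-difference approximation and once with PyTorch's autograd. The evaluation point `x0`, the step size `h`, and the variable names are assumptions chosen for illustration.

```python
import torch

def f(x):
    # Same polynomial as in the symbolic example
    return 4 * x**3 + x**2 + 3

x0 = 2.0   # evaluation point (arbitrary choice)
h = 1e-4   # finite-difference step (arbitrary choice)

# Numerical differentiation: approximate the limit with a small but finite h
numeric = (f(x0 + h) - f(x0)) / h

# Automatic differentiation: derivative at x0, exact up to floating point
x = torch.tensor(x0, dtype=torch.float64, requires_grad=True)
f(x).backward()
auto = x.grad.item()

# Closed form taken from the symbolic result: 12x^2 + 2x
exact = 12 * x0**2 + 2 * x0
print(numeric, auto, exact)
```

The finite-difference value carries a truncation error that depends on `h`, while the autograd value agrees with the closed form at that point.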

Computational graph

Decompose the code into operators
Express the computation as an acyclic graph
Explicit construction
- Tensorflow/Theano/MXNet

from mxnet import sym

a = sym.var('a')  # sym.var needs a name for the symbol
b = sym.var('b')
c = 2 * a + b
# bind data into a and b later

First define the formula, then plug in the values.

Implicit construction
- PyTorch/MXNet

from mxnet import autograd, nd

with autograd.record():
	a = nd.ones((2, 1))
	b = nd.ones((2, 1))
	c = 2 * a + b
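
PyTorch builds the graph implicitly in the same way, as operations run. The following is a minimal sketch that mirrors the MXNet snippet; setting `requires_grad=True` is what asks PyTorch to record operations on `a` and `b`.

```python
import torch

a = torch.ones(2, 1, requires_grad=True)
b = torch.ones(2, 1, requires_grad=True)
c = 2 * a + b        # the graph is recorded while this line executes

# The result remembers the operation that produced it
print(c.grad_fn is not None)

c.sum().backward()   # reduce to a scalar before backpropagating
print(a.grad)        # dc/da = 2 for every element
print(b.grad)        # dc/db = 1 for every element
```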

Two modes of automatic differentiation

(figure not available)

Reverse accumulation

(figure not available)

Reverse accumulation summary

Build the computational graph
Forward: execute the graph and store the intermediate results
Backward: execute the graph in the opposite direction
- Prune branches that are not needed
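
The steps above can be sketched in a few lines of plain Python. This is a toy illustration for scalars only, not how PyTorch or MXNet implement it; the `Var` class and the `backward` function are hypothetical names introduced here.

```python
class Var:
    """A scalar node in the computational graph (illustrative only)."""

    def __init__(self, value, parents=()):
        self.value = value      # intermediate result stored by the forward pass
        self.parents = parents  # pairs of (parent node, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    """Walk the graph from the output back to the inputs."""
    # A topological order guarantees that a node's gradient is complete
    # before it is passed on to its parents. Only nodes reachable from
    # the output are visited, so unrelated branches are skipped.
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local


# Forward: build and execute the graph for z = x * y + x
x, y = Var(3.0), Var(4.0)
z = x * y + x
# Backward: a single pass yields the gradient for every input
backward(z)
print(z.value, x.grad, y.grad)
```

A single backward pass fills in `grad` for both inputs, which is why this mode suits functions with many inputs and one scalar output.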

Complexity

Computational complexity: O(n), where n is the number of operators
- The forward and backward passes usually cost about the same
Memory complexity: O(n), because all intermediate results of the forward pass must be stored

Because all intermediate results have to be kept, reverse accumulation consumes a lot of GPU memory.

Compared with forward accumulation:
- O(n) computational complexity to compute the gradient of one variable
- O(1) memory complexity
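
For comparison, forward accumulation can be sketched with dual numbers: each value carries its derivative with respect to one chosen input, so nothing has to be stored for a later pass, but every input needs its own forward pass. The `Dual` class and the function `f` are hypothetical names used only for this illustration.

```python
class Dual:
    """A value together with its derivative w.r.t. one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)


def f(x, y):
    return x * y + x


# One forward pass per input: seed that input's derivative with 1
df_dx = f(Dual(3.0, 1.0), Dual(4.0, 0.0)).deriv
df_dy = f(Dual(3.0, 0.0), Dual(4.0, 1.0)).deriv
print(df_dx, df_dy)
```

With two inputs this takes two passes; a network with millions of parameters would need millions of passes, which is why deep learning frameworks rely on reverse accumulation despite its memory cost.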

Automatic differentiation implementation

Automatic differentiation

Suppose we want to differentiate the function $y = 2x^Tx$ with respect to the column vector x.

import torch

x = torch.arange(4.0)
x

tensor([0., 1., 2., 3.])

Before computing the gradient of y with respect to x, we need a place to store the gradient.

x.requires_grad_(True)	# Equivalent to `x = torch.arange(4.0, requires_grad=True)`
x.grad	# The default value is None

Now let's calculate y.

y = 2 * torch.dot(x, x)
y

tensor(28., grad_fn=<MulBackward0>)

Calling the backpropagation function automatically computes the gradient of y with respect to each component of x.

y.backward()
x.grad

tensor([ 0., 4., 8., 12.])

The gradient of $y = 2x^Tx$ with respect to x should be 4x. We can verify this:

x.grad == 4 * x

tensor([True, True, True, True])
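
The value 4x follows from writing the dot product out component by component:

$$
y = 2x^Tx = 2\sum_{i} x_i^2
\quad\Longrightarrow\quad
{\partial y \over \partial x_i} = 4x_i
\quad\Longrightarrow\quad
{\partial y \over \partial x} = 4x
$$

For x = [0, 1, 2, 3] this gives [0, 4, 8, 12].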

Now let's compute another function of x.

# By default, PyTorch accumulates gradients, so we need to clear the previous values
x.grad.zero_()
y = x.sum()
y.backward()
x.grad

tensor([1., 1., 1., 1.])
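
A small sketch of what happens when the gradient is not cleared: the second call to `backward` adds its result to the values already stored in `x.grad`. The names `first`, `second`, and `third` are introduced here only to capture the snapshots.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

x.sum().backward()
first = x.grad.clone()    # gradient of sum(x) is all ones

x.sum().backward()        # no zero_() in between
second = x.grad.clone()   # the new gradient was added to the old one

x.grad.zero_()
x.sum().backward()
third = x.grad.clone()    # all ones again after clearing

print(first, second, third)
```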

In deep learning, our goal is usually not to compute the full derivative matrix, but the sum of the partial derivatives computed separately for each sample in the batch.

# Calling `backward` on a non-scalar requires a `gradient` argument, which gives the gradient of the differentiated function with respect to `self`
x.grad.zero_()
y = x * x
# Equivalent to y.backward(torch.ones(len(x)))
y.sum().backward()
x.grad

tensor([0., 2., 4., 6.])

Why do we sum before taking the derivative?
Because a gradient can be created implicitly only for a scalar output (that is, a single number).
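
The following sketch checks both routes side by side and shows that calling `backward()` on a non-scalar without an argument is rejected. The variable names are chosen here for illustration.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

# Route 1: reduce to a scalar first
y = x * x
y.sum().backward()
grad_via_sum = x.grad.clone()

# Route 2: pass a vector of ones as the `gradient` argument
x.grad.zero_()
y = x * x
y.backward(gradient=torch.ones_like(y))
grad_via_arg = x.grad.clone()

# Calling backward() on a non-scalar with no argument raises an error
y = x * x
try:
    y.backward()
    raised = False
except RuntimeError:
    raised = True

print(grad_via_sum, grad_via_arg, raised)
```

Both routes give the same gradient because summing the outputs is the same as weighting each output's gradient by one.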

Moving some computations out of the recorded computational graph

x.grad.zero_()
y = x * x
u = y.detach()	# Treat u as a constant: same values as y, but cut off from the graph
z = u * x

z.sum().backward()
x.grad == u

tensor([True, True, True, True])

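This is useful later, when some network parameters need to be held fixed.

Because the computation of y itself was recorded, we can still backpropagate through y to get the derivative of y = x * x with respect to x, which is 2 * x.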

x.grad.zero_()
y.sum().backward()
x.grad == 2 * x

tensor([True, True, True, True])
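
As a sketch of the parameter-freezing use case, the example below turns off gradient tracking for one layer of a tiny model. The two-layer model, its sizes, and the names `backbone` and `head` are hypothetical and exist only for this illustration; `requires_grad_(False)` is one common way to freeze parameters, alongside `detach()`.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical two-layer model used only for illustration
backbone = nn.Linear(4, 3)
head = nn.Linear(3, 1)

# Freeze the first layer: its parameters are treated as constants
for p in backbone.parameters():
    p.requires_grad_(False)

x = torch.randn(5, 4)
loss = head(backbone(x)).sum()
loss.backward()

print(backbone.weight.grad)           # None: no gradient was computed
print(head.weight.grad is not None)   # the head still receives gradients
```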

Even if building the computational graph of a function requires passing through Python control flow (for example, conditionals, loops, or arbitrary function calls), we can still compute the gradient of the resulting variable.

def f(a):
	b = a * 2
	while b.norm() < 1000:
		b = b * 2
	if b.sum() > 0:
		c = b
	else:
		c = 100 * b
	return c

a = torch.randn(size=(), requires_grad=True)
d = f(a)
d.backward()

a.grad == d / a

tensor(True)
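
The check `a.grad == d / a` works because, for any given input, the branches taken fix a constant k such that f(a) = k * a, so the derivative is k = d / a. The sketch below repeats the function with two fixed inputs instead of a random draw so that both branches of the `if` are exercised; the input values and the `results` list are assumptions made for illustration.

```python
import torch

def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

results = []
for value in (3.0, -3.0):   # fixed inputs, one per branch of the `if`
    a = torch.tensor(value, requires_grad=True)
    d = f(a)
    d.backward()
    # d = k * a for a constant k set by the control flow, so dd/da = k
    results.append((value, a.grad.item(), (d / a).item()))

print(results)
```

The positive input doubles nine times in total, giving k = 512; the negative input takes the `else` branch as well, giving k = 51200.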

QA

What is the difference between explicit construction and implicit construction?
Explicit construction: give the formula first, then the values.
Implicit construction: give the values first, then the formula.
Why does deep learning usually differentiate a scalar?
Because the loss is a scalar most of the time.

Copyright notice
This article was written by [-Plain heart to warm]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/184/202207030927196661.html