[Deep Learning] (Problem Record): Linear Regression with Mini-Batch Stochastic Gradient Descent
2022-07-30 10:14:00 【aaaafeng】
Preface
Blogger's homepage: 阿阿阿阿锋的主页_CSDN
Original article: 原文
When I read a book and feel that what comes later contradicts what came before, I often can't help wondering whether something is wrong with the writing (sometimes I even curse the author a little in my head for being so careless). Yet experience, and what sanity I have left, remind me that the problem is most likely my own.
My environment: Windows 10, Python 3.6.
I am still a beginner; if there are any mistakes in this article, corrections are welcome.
1. The Problem and the Code
In the code, the line `param[:] = param - lr * param.grad / batch_size` inside the `sgd` function had me thoroughly confused.
For example, a mini-batch in this code contains 10 samples. So I assumed that when computing gradients for the parameter set `params`, the gradient of each parameter would be a vector (think of it as an array): for each parameter, each of the 10 samples yields a gradient, giving 10 corresponding values.
Hence my doubt: `/ batch_size` is there to take the average of the gradient, but the dividend to its left is not a scalar (an ordinary single value), so how does this line of code produce the average we want?
Note: the code is adapted from Dive into Deep Learning (《动手学深度学习》).
Code:
```python
# Goal: train a linear regression model using mini-batch stochastic gradient descent
%matplotlib inline
from IPython import display
from matplotlib import pyplot as plt
from mxnet import autograd, nd
import random

# Generate the training set
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2
features = nd.random.normal(scale=1, shape=(num_examples, num_inputs))
labels = true_w[0] * features[:, 0] + true_w[1] * features[:, 1] + true_b
labels += nd.random.normal(scale=0.01, shape=labels.shape)
features[0], labels[0]

def use_svg_display():
    # Display plots as vector graphics
    display.set_matplotlib_formats('svg')

def set_figsize(figsize=(3.5, 2.5)):
    use_svg_display()
    # Set the figure size
    plt.rcParams['figure.figsize'] = figsize

set_figsize()
plt.scatter(features[:, 1].asnumpy(), labels.asnumpy(), 1);  # the semicolon shows only the plot (otherwise a line of text is printed too)

# This function is saved in the d2lzh package for later use
def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)  # samples are read in random order
    for i in range(0, num_examples, batch_size):
        j = nd.array(indices[i: min(i + batch_size, num_examples)])
        yield features.take(j), labels.take(j)  # take returns the elements at the given indices

batch_size = 10  # size of one "mini-batch"

# Initialize the model parameters we want to train
w = nd.random.normal(scale=0.01, shape=(num_inputs, 1))
b = nd.zeros(shape=(1,))
w.attach_grad()
b.attach_grad()

def linreg(X, w, b):  # our model function
    return nd.dot(X, w) + b

def squared_loss(y_hat, y):  # the loss function we use
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2

def sgd(params, lr, batch_size):  # update the parameters
    for param in params:
        param[:] = param - lr * param.grad / batch_size

lr = 0.03  # learning rate
num_epochs = 3  # number of training epochs
net = linreg  # just an alias
loss = squared_loss

for epoch in range(num_epochs):  # training takes num_epochs epochs in total
    # In each epoch, every sample in the training set is used once
    # (assuming the number of samples is divisible by the batch size).
    # X and y are the features and labels of one mini-batch.
    for X, y in data_iter(batch_size, features, labels):
        with autograd.record():
            l = loss(net(X, w, b), y)  # l is the loss on the mini-batch X, y
        l.backward()  # gradient of the mini-batch loss w.r.t. the model parameters
        sgd([w, b], lr, batch_size)  # update the parameters with mini-batch SGD
    train_l = loss(net(features, w, b), labels)
    print('epoch %d, loss %f' % (epoch + 1, train_l.mean().asnumpy()))

print('\nweights:')
print(true_w, w)
print('\nbias:')
print(true_b, b)
```
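As a quick sanity check (a sketch that assumes the listing above has already been run), you can pull one mini-batch from `data_iter` and compare the shapes of the data with the shapes of the gradient buffers:

```python
# Assumes features, labels, w, b, batch_size and data_iter from the listing above.
X, y = next(data_iter(batch_size, features, labels))
print(X.shape, y.shape)            # (10, 2) (10,): one row per sample in the mini-batch
print(w.grad.shape, b.grad.shape)  # (2, 1) (1,): one gradient entry per parameter, not per sample
```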
2. Analyzing the Problem
After thinking it over for a while, it suddenly occurred to me: why not just print out the gradients of the parameters `param` and take a look? Then the situation would be clear! Here is the code:
```python
def sgd(params, lr, batch_size):  # update the parameters
    for param in params:
        param[:] = param - lr * param.grad / batch_size
        print('\nparam.grad:')
        print(param.grad)
```
I merely added two lines of printing to the `sgd` function, then looked at the effect. (This function is called once per mini-batch, so one look is enough, since all I want to know is the data type of the parameters' gradients.)
Output:
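For illustration (the numbers below are made up; only the shapes are meaningful), one call prints something like:

```
param.grad:
[[ 0.012]
 [-0.034]]
<NDArray 2x1 @cpu(0)>

param.grad:
[0.005]
<NDArray 1 @cpu(0)>
```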
The first `param.grad` printed holds the gradients of the two weight parameters; the one after it is the gradient of the bias parameter. In other words, the gradient computed for each parameter has only one value.
Why does a batch of 10 samples yield only one value?
What is this value? In fact, executing `l.backward()` on a vector `l` is equivalent to executing `l.sum().backward()`: each sample in the batch contributes one gradient value, and those 10 gradient values are added up to give the parameter's gradient. Dividing by `batch_size` to take the average is then the natural thing to do (the division is elementwise, so for `w` it averages each of the two summed entries).
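Here is a minimal sketch of that behavior, separate from the post's code and using toy numbers, showing both the implicit summation and the elementwise division:

```python
from mxnet import autograd, nd

w = nd.array([1.0])                # a single shared parameter
w.attach_grad()
X = nd.array([1.0, 2.0, 3.0])      # three "samples" in one batch
with autograd.record():
    l = X * w                      # vector loss: one entry per sample
l.backward()                       # MXNet runs l.sum().backward() under the hood
print(w.grad)                      # [6.] = 1 + 2 + 3: the per-sample gradients, summed
print(w.grad / 3)                  # [2.]: dividing by the batch size gives the mean gradient
```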
Actually, I came across this explanation not long after I got stuck; but since I was already deep in the wrong mental model, reading it at that point only made me more confused.
🧭 Summary
As you sow, so shall you reap.
Whatever shape a variable has, the gradient computed for that variable has the same shape.
The reason I subconsciously expected a set of values rather than a single value is that I had earlier seen an example of taking the gradient of a matrix, where the result is a set of values (a matrix). That is what tripped me up here: in this code, each parameter object we differentiate with respect to is a single value; there just happen to be multiple data samples.
| Take the gradient of a matrix (vector) | The gradient obtained is a matrix (vector) |
|---|---|
| Take the gradient of a scalar | The gradient obtained is a scalar |
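A minimal sketch of this shape rule, independent of the code above:

```python
from mxnet import autograd, nd

A = nd.ones((2, 3))
A.attach_grad()
with autograd.record():
    s = (A * A).sum()   # a scalar built from the matrix A
s.backward()
print(A.grad.shape)     # (2, 3): the gradient has the same shape as A

x = nd.array([3.0])
x.attach_grad()
with autograd.record():
    t = x * x
t.backward()
print(x.grad)           # [6.]: a scalar variable gets a scalar gradient
```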
One more feeling: as a former habitual C/C++ programmer, this flexibility of variable data types in Python really blows my mind. I am very much not used to it, and I have been tripped up by it many times. Sigh.