[Deep Learning] (Problem Record): Linear Regression with Mini-Batch Stochastic Gradient Descent
2022-07-30 10:14:00 【aaaafeng】
Preface
Blogger's homepage: 阿阿阿阿锋的主页_CSDN
Original article: 原文
When I read a book and feel that what comes later contradicts what came before, I often can't help wondering whether something is wrong with the writing (sometimes I even curse the author a little in my head for being so careless). Yet experience, and what sanity I have left, remind me that the problem is most likely my own.
My environment: Windows 10, Python 3.6.
I am still a beginner; if there are any mistakes in this article, corrections are welcome.
1. The Problem and the Code
In the code, the line `param[:] = param - lr * param.grad / batch_size` inside the `sgd` function had me thoroughly confused.
For example, a mini-batch in this code contains 10 samples. So I assumed that when computing gradients for the parameter set `params`, the gradient of each parameter would be a vector (think of it as an array): for each parameter, each of the 10 samples yields a gradient, giving 10 corresponding values.
Hence my doubt: `/ batch_size` is there to take the average of the gradient, but the dividend to its left is not a scalar (an ordinary single value), so how does this line of code produce the average we want?
Note: the code is adapted from Dive into Deep Learning (《动手学深度学习》).
Code:
```python
# Goal: train a linear regression model using mini-batch stochastic gradient descent
%matplotlib inline
from IPython import display
from matplotlib import pyplot as plt
from mxnet import autograd, nd
import random

# Generate the training set
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2
features = nd.random.normal(scale=1, shape=(num_examples, num_inputs))
labels = true_w[0] * features[:, 0] + true_w[1] * features[:, 1] + true_b
labels += nd.random.normal(scale=0.01, shape=labels.shape)
features[0], labels[0]

def use_svg_display():
    # Display plots as vector graphics
    display.set_matplotlib_formats('svg')

def set_figsize(figsize=(3.5, 2.5)):
    use_svg_display()
    # Set the figure size
    plt.rcParams['figure.figsize'] = figsize

set_figsize()
plt.scatter(features[:, 1].asnumpy(), labels.asnumpy(), 1);  # the semicolon shows only the plot (otherwise a line of text is printed too)

# This function is saved in the d2lzh package for later use
def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)  # samples are read in random order
    for i in range(0, num_examples, batch_size):
        j = nd.array(indices[i: min(i + batch_size, num_examples)])
        yield features.take(j), labels.take(j)  # take returns the elements at the given indices

batch_size = 10  # size of one "mini-batch"

# Initialize the model parameters we want to train
w = nd.random.normal(scale=0.01, shape=(num_inputs, 1))
b = nd.zeros(shape=(1,))
w.attach_grad()
b.attach_grad()

def linreg(X, w, b):  # our model function
    return nd.dot(X, w) + b

def squared_loss(y_hat, y):  # the loss function we use
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2

def sgd(params, lr, batch_size):  # update the parameters
    for param in params:
        param[:] = param - lr * param.grad / batch_size

lr = 0.03  # learning rate
num_epochs = 3  # number of training epochs
net = linreg  # just an alias
loss = squared_loss

for epoch in range(num_epochs):  # training takes num_epochs epochs in total
    # In each epoch, every sample in the training set is used once
    # (assuming the number of samples is divisible by the batch size).
    # X and y are the features and labels of one mini-batch.
    for X, y in data_iter(batch_size, features, labels):
        with autograd.record():
            l = loss(net(X, w, b), y)  # l is the loss on the mini-batch X, y
        l.backward()  # gradient of the mini-batch loss w.r.t. the model parameters
        sgd([w, b], lr, batch_size)  # update the parameters with mini-batch SGD
    train_l = loss(net(features, w, b), labels)
    print('epoch %d, loss %f' % (epoch + 1, train_l.mean().asnumpy()))

print('\nweights:')
print(true_w, w)
print('\nbias:')
print(true_b, b)
```
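As a quick sanity check (a sketch that assumes the listing above has already been run), you can pull one mini-batch from `data_iter` and compare the shapes of the data with the shapes of the gradient buffers:

```python
# Assumes features, labels, w, b, batch_size and data_iter from the listing above.
X, y = next(data_iter(batch_size, features, labels))
print(X.shape, y.shape)            # (10, 2) (10,): one row per sample in the mini-batch
print(w.grad.shape, b.grad.shape)  # (2, 1) (1,): one gradient entry per parameter, not per sample
```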
2. Analyzing the Problem
After thinking it over for a while, it suddenly occurred to me: why not just print out the gradients of the parameters `param` and take a look? Then the situation would be clear! Here is the code:
```python
def sgd(params, lr, batch_size):  # update the parameters
    for param in params:
        param[:] = param - lr * param.grad / batch_size
        print('\nparam.grad:')
        print(param.grad)
```
I merely added two lines of printing to the `sgd` function, then looked at the effect. (This function is called once per mini-batch, so one look is enough, since all I want to know is the data type of the parameters' gradients.)
Output:
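For illustration (the numbers below are made up; only the shapes are meaningful), one call prints something like:

```
param.grad:
[[ 0.012]
 [-0.034]]
<NDArray 2x1 @cpu(0)>

param.grad:
[0.005]
<NDArray 1 @cpu(0)>
```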
The first `param.grad` printed holds the gradients of the two weight parameters; the one after it is the gradient of the bias parameter. In other words, the gradient computed for each parameter has only one value.
Why does a batch of 10 samples yield only one value?
What is this value? In fact, executing `l.backward()` on a vector `l` is equivalent to executing `l.sum().backward()`: each sample in the batch contributes one gradient value, and those 10 gradient values are added up to give the parameter's gradient. Dividing by `batch_size` to take the average is then the natural thing to do (the division is elementwise, so for `w` it averages each of the two summed entries).
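Here is a minimal sketch of that behavior, separate from the post's code and using toy numbers, showing both the implicit summation and the elementwise division:

```python
from mxnet import autograd, nd

w = nd.array([1.0])                # a single shared parameter
w.attach_grad()
X = nd.array([1.0, 2.0, 3.0])      # three "samples" in one batch
with autograd.record():
    l = X * w                      # vector loss: one entry per sample
l.backward()                       # MXNet runs l.sum().backward() under the hood
print(w.grad)                      # [6.] = 1 + 2 + 3: the per-sample gradients, summed
print(w.grad / 3)                  # [2.]: dividing by the batch size gives the mean gradient
```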
Actually, I came across this explanation not long after I got stuck; but since I was already deep in the wrong mental model, reading it at that point only made me more confused.
🧭 Summary
As you sow, so shall you reap.
Whatever shape a variable has, the gradient computed for that variable has the same shape.
The reason I subconsciously expected a set of values rather than a single value is that I had earlier seen an example of taking the gradient of a matrix, where the result is a set of values (a matrix). That is what tripped me up here: in this code, each parameter object we differentiate with respect to is a single value; there just happen to be multiple data samples.
| Take the gradient of a matrix (vector) | The gradient obtained is a matrix (vector) |
|---|---|
| Take the gradient of a scalar | The gradient obtained is a scalar |
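A minimal sketch of this shape rule, independent of the code above:

```python
from mxnet import autograd, nd

A = nd.ones((2, 3))
A.attach_grad()
with autograd.record():
    s = (A * A).sum()   # a scalar built from the matrix A
s.backward()
print(A.grad.shape)     # (2, 3): the gradient has the same shape as A

x = nd.array([3.0])
x.attach_grad()
with autograd.record():
    t = x * x
t.backward()
print(x.grad)           # [6.]: a scalar variable gets a scalar gradient
```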
One more feeling: as a former habitual C/C++ programmer, this flexibility of variable data types in Python really blows my mind. I am very much not used to it, and I have been tripped up by it many times. Sigh.