[Deep Learning] (Problem Record) - Linear Regression - Mini-Batch Stochastic Gradient Descent
2022-07-30 10:14:00 【aaaafeng】
Preface
Author's homepage: 阿阿阿阿锋的主页_CSDN
Original article: 原文
When I'm reading a book and the text seems to contradict itself, I often can't help wondering whether something is wrong with the writing (sometimes I even scold the author in my head: how could they be so careless?). Yet experience, and what remains of my sanity, remind me that it is most likely my own problem.
My environment: Windows 10, Python 3.6.
I'm still a beginner; if there are any mistakes in this article, please point them out.
1. The problem and the code
In the `sgd` function, this line had me thoroughly confused:

`param[:] = param - lr * param.grad / batch_size`

For example, the code sets the mini-batch size to 10 samples. So I figured that when the gradients of the parameter set `params` are computed, the gradient of each parameter ought to be vector-valued (think of it as an array): for each parameter, the 10 samples should each produce their own corresponding gradient value, 10 values in total.

Hence my doubt: `/ batch_size` is clearly there to average the gradient, but the dividend to its left would then not be a scalar (an ordinary single value), so how does this line of code produce the average we want?
Note: the code is adapted from *Dive into Deep Learning* (《动手学深度学习》).
Code:
```python
# Goal: train a linear regression model with mini-batch stochastic gradient descent
%matplotlib inline
from IPython import display
from matplotlib import pyplot as plt
from mxnet import autograd, nd
import random

# Generate the training set
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2
features = nd.random.normal(scale=1, shape=(num_examples, num_inputs))
labels = true_w[0] * features[:, 0] + true_w[1] * features[:, 1] + true_b
labels += nd.random.normal(scale=0.01, shape=labels.shape)
features[0], labels[0]  # peek at the first example

def use_svg_display():
    # Display plots as vector graphics
    display.set_matplotlib_formats('svg')

def set_figsize(figsize=(3.5, 2.5)):
    use_svg_display()
    # Set the figure size
    plt.rcParams['figure.figsize'] = figsize

set_figsize()
plt.scatter(features[:, 1].asnumpy(), labels.asnumpy(), 1);  # the semicolon makes only the figure show (otherwise a line of text is also displayed)

# This function is saved in the d2lzh package for convenient later use
def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)  # samples are read in random order
    for i in range(0, num_examples, batch_size):
        j = nd.array(indices[i: min(i + batch_size, num_examples)])
        yield features.take(j), labels.take(j)  # take() returns the elements at the given indices

batch_size = 10  # size of one "mini-batch"

# Initialize the model parameters we want to train
w = nd.random.normal(scale=0.01, shape=(num_inputs, 1))
b = nd.zeros(shape=(1,))
w.attach_grad()
b.attach_grad()

def linreg(X, w, b):  # our model function
    return nd.dot(X, w) + b

def squared_loss(y_hat, y):  # the loss function we use
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2

def sgd(params, lr, batch_size):  # update (iterate) the parameters
    for param in params:
        param[:] = param - lr * param.grad / batch_size

lr = 0.03  # learning rate
num_epochs = 3  # number of training epochs
net = linreg  # give it a short alias
loss = squared_loss

for epoch in range(num_epochs):  # training takes num_epochs epochs in total
    # In each epoch, every sample in the training set is used once
    # (assuming the number of samples is divisible by the batch size).
    # X and y are the features and labels of a mini-batch of samples.
    for X, y in data_iter(batch_size, features, labels):
        with autograd.record():
            l = loss(net(X, w, b), y)  # l is the loss on the mini-batch X and y
        l.backward()  # compute the gradient of the mini-batch loss w.r.t. the model parameters
        sgd([w, b], lr, batch_size)  # update the parameters with mini-batch SGD
    train_l = loss(net(features, w, b), labels)
    print('epoch %d, loss %f' % (epoch + 1, train_l.mean().asnumpy()))

print('\nweights:')
print(true_w, w)
print('\nbias:')
print(true_b, b)
```
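One detail in `sgd` worth a side note (my own aside, not from the book): the update writes `param[:] = ...` rather than `param = ...`. Slice assignment modifies the existing NDArray in place, so the gradient buffer created by `attach_grad()` stays attached to it; a plain assignment would rebind the name to a brand-new array with no gradient buffer. A minimal sketch:

```python
from mxnet import nd

w = nd.zeros((2, 1))
w.attach_grad()

before = id(w)
w[:] = w - 0.1          # in-place update: still the same underlying array
print(id(w) == before)  # True -- the grad buffer from attach_grad() survives
```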
2. Analyzing the problem
After puzzling over it for a while, it suddenly occurred to me: why not just print out the gradient of the parameter `param` and take a look? Then the situation would be clear at once! Here is the code:
```python
def sgd(params, lr, batch_size):  # update (iterate) the parameters
    for param in params:
        param[:] = param - lr * param.grad / batch_size
        print('\nparam.grad:')
        print(param.grad)
```
All I did was add two print lines inside the `sgd` function; now let's run it again and look at the effect. (This function is called every time a mini-batch of samples is processed, but one look at the output is enough, since all I want to know is what kind of data the parameters' gradients are.)
Output:

The `param.grad` printed first holds the gradients of the two weight parameters; the one printed after it holds the gradient of the bias parameter. In other words, the gradient obtained for each scalar parameter has only one value.
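To make the shapes concrete, here is a standalone check (my own sketch, not part of the book's code) of the gradient buffers for this model's parameters right after `attach_grad()`:

```python
from mxnet import nd

w = nd.random.normal(scale=0.01, shape=(2, 1))
b = nd.zeros(shape=(1,))
w.attach_grad()  # allocates a gradient buffer with the same shape as w
b.attach_grad()
print(w.grad.shape, b.grad.shape)  # (2, 1) (1,) -- one gradient value per scalar parameter
```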
Why, with 10 samples in a batch, is only one value obtained per parameter?
What is this value? In fact, executing `l.backward()` is equivalent to executing `l.sum().backward()`. That is, each sample in the batch yields its own gradient value, and these 10 gradient values are added together to give the parameter's gradient. Dividing by `batch_size` to take the average is then the natural thing to do.
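This is easy to verify with a toy example (my own sketch, not from the book): a single "parameter" `w` and a batch of 3 inputs. Calling `backward()` on the non-scalar loss behaves like calling it on the loss's sum, so `w.grad` ends up holding the per-sample gradients added together:

```python
from mxnet import autograd, nd

w = nd.array([2.0])
w.attach_grad()
x = nd.array([1.0, 2.0, 3.0])   # a "batch" of 3 samples
with autograd.record():
    l = x * w                   # per-sample losses, shape (3,)
l.backward()                    # same effect as l.sum().backward()
print(w.grad)                   # [6.] = 1 + 2 + 3: the summed per-sample gradients
```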
I actually came across this explanation not long afterwards, but because my thinking was already stuck in a rut, reading it at that point only made me more confused.
🧭 Summary
As you sow, so shall you reap: whatever shape a variable has, the gradient computed for that variable has the same shape.

The reason I subconsciously expected a set of values rather than a single value is that I had earlier seen an example of taking the gradient of a matrix, where the result is indeed a set of values (a matrix). That is exactly where I got confused here: each parameter object whose gradient we compute is a single scalar value; there just happen to be multiple data samples.
| Take the gradient of a matrix (vector) | the resulting gradient is a matrix (vector) |
| --- | --- |
| Take the gradient of a scalar | the resulting gradient is a scalar |
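This, too, is straightforward to check (again my own sketch): the gradient buffer always matches the variable's shape, whether the variable is a scalar, a vector, or a matrix:

```python
from mxnet import autograd, nd

A = nd.ones((2, 3))
A.attach_grad()
with autograd.record():
    y = (A * A).sum()   # a scalar function of the matrix A
y.backward()
print(A.grad.shape)     # (2, 3) -- same shape as A itself
```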
One more feeling: as a long-time C/C++ programmer, the flexibility of Python's variable types really blows my mind, and I'm still far from used to it. I've been tripped up by it many times. Sigh.