SGD has many improved forms. Why do most papers still use SGD?
2022-06-30 10:41:00 【Xiaobai learns vision】
Stochastic gradient descent (SGD) is not only fast but also has many desirable properties: it can automatically escape saddle points and poor local optima. It also has shortcomings, and many improved variants of SGD exist. So why do most papers still choose SGD? This article presents an excellent answer from Zhihu.
Because SGD (with momentum) is still often the method that works better in practice.
In both theory and practice, Adam and the rest of the family of optimizers that use adaptive learning rates are not good at finding flat minima, and flat minima are important for generalization. A model trained with Adam may therefore reach a lower training loss but often shows worse test performance. This is the main reason adaptive learning rates are avoided in many tasks.
At the same time, the theory of SGD is well understood, whereas adaptive optimizers such as Adam are highly heuristic and their theoretical mechanisms remain unclear.
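To make "adaptive learning rate" concrete, here is a minimal sketch of one update step for each optimizer, written in plain Python rather than taken from any library. The function names and default hyperparameters are illustrative assumptions, and the momentum form shown (velocity accumulates raw gradients) is only one common convention.

```python
import math

def sgd_momentum_step(w, grad, velocity, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update on a list of parameters (illustrative helper)."""
    # Every coordinate shares the same learning rate; the step scales with the gradient.
    new_velocity = [momentum * v + g for v, g in zip(velocity, grad)]
    new_w = [wi - lr * v for wi, v in zip(w, new_velocity)]
    return new_w, new_velocity

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (illustrative helper); t is the 1-based step count."""
    # Exponential moving averages of the gradient and of the squared gradient.
    new_m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    new_v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction for the zero-initialised averages.
    m_hat = [mi / (1 - beta1 ** t) for mi in new_m]
    v_hat = [vi / (1 - beta2 ** t) for vi in new_v]
    # Each coordinate is rescaled by its own gradient magnitude.
    new_w = [wi - lr * mh / (math.sqrt(vh) + eps)
             for wi, mh, vh in zip(w, m_hat, v_hat)]
    return new_w, new_m, new_v
```

With gradients of very different magnitude, say `[100.0, 0.01]`, the first Adam step moves both coordinates by roughly `lr`, while the SGD step moves them in proportion to the gradient. That per-coordinate rescaling is what the term "adaptive" refers to here.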
The question itself is slightly inaccurate. In computer vision, SGD is still the dominant optimizer today. In natural language processing (especially with Transformer-based models), however, Adam is already the most popular optimizer.
So why do SGD and Adam each have their own strengths?
If you use an adaptive optimizer such as Adam in computer vision, the result is likely to fall a few points short of the SGD baseline. The main reason is that adaptive optimizers tend to find sharp minima, so their generalization performance is often significantly worse than SGD's.
If you train Transformer-style models, Adam optimizes faster and better. The main reason is that the loss landscape of NLP tasks has many "steep cliffs"; an adaptive learning rate copes better with this extreme situation and avoids exploding gradients. For the same reason, gradient clipping, which is rarely used in computer vision, is almost indispensable in NLP tasks. (See the ICLR 2020 paper "Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity", https://arxiv.org/abs/1905.11881.)
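As an illustration of what gradient clipping does at such a "cliff", the following is a minimal sketch of clipping by global norm in plain Python. The helper name `clip_grad_norm` and the flat-list representation of the gradient are assumptions made for this example, not the interface of any particular framework.

```python
import math

def clip_grad_norm(grads, max_norm):
    """Rescale a flat list of gradients so its L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    Illustrative helper, not a library function.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        # Shrink every component by the same factor, so the direction is kept.
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm
```

A gradient of `[3.0, 4.0]` has norm 5, so clipping to `max_norm=1.0` scales it to roughly `[0.6, 0.8]`; a gradient that is already within the limit is returned unchanged. The update direction is preserved and only the step length is capped.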
There are exceptions. Although generative adversarial networks (GANs) are usually vision tasks, Adam has become the most popular optimizer for them. The main reason is that GAN training is quite unstable and its loss landscape is very different from that of ordinary vision tasks. For GANs, training stably is already good enough, and the significance of flat minima for GANs is not very clear.
For more extreme loss landscapes, Adam may have a comparative advantage. Although Adam is not good at finding flat minima, it can escape saddle points faster than SGD (there is a theoretical guarantee for this).
Finally, many people mistakenly believe Adam has two advantages that do not actually exist. To some extent, this has also hindered Adam's adoption.
Misconception 1: with Adam there is no need to tune the initial learning rate.
Although Adam's default learning rate of 0.001 is widely used, the areas where Adam outperforms SGD are precisely the ones where its learning rate has been retuned. For example, GANs are usually trained with a learning rate of 0.0002 rather than 0.001, while Transformers take an initial learning rate larger than 0.001; the default setting is a learning rate of 0.2 with the Noam scheduler. The learning rate has a large impact on the results and is arguably the optimizer's most important hyperparameter.
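For readers unfamiliar with the Noam scheduler, here is a minimal sketch of the schedule as it is usually written for Transformers: linear warmup followed by inverse-square-root decay. The values `d_model=512` and `warmup_steps=4000` and the multiplicative `factor` are assumptions for illustration; how a base learning rate such as 0.2 maps onto `factor` depends on the implementation and is not specified in the text.

```python
def noam_lr(step, factor=1.0, d_model=512, warmup_steps=4000):
    """Learning rate at a 1-based training step under the Noam schedule.

    Rises linearly for warmup_steps steps, then decays as step ** -0.5.
    All default values are illustrative assumptions.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return factor * d_model ** -0.5 * min(step ** -0.5,
                                          step * warmup_steps ** -1.5)
```

The rate peaks at `step == warmup_steps` and halves each time the step count quadruples after that, so even this "Adam default" already includes both a tuned peak and a decay.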
(In ordinary vision tasks Adam generally does not need its learning rate tuned. But in those tasks it cannot match SGD whether it is tuned or not.)
Misconception 2: Adam does not need learning rate decay.
Too many people hold this misconception. I have even met engineers with several years of experience, and some PhDs, who deeply misunderstand this point. The answer is that using an adaptive optimizer has almost nothing to do with whether an LR scheduler is needed; the two often need to be used together (stacked).
The convergence proofs for both SGD and Adam require the learning rate to eventually become small enough. However, the learning rate of an adaptive optimizer does not automatically drop to a very low level during training.
You can verify this by training a common model on CIFAR or ImageNet: in the final stage of training, if you do not actively lower the learning rate, the loss will not converge to a smaller value on its own. Learning rate decay is necessary both in theory and in practice.
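The same effect can be seen without a GPU on a toy problem. The sketch below is an assumption-laden stand-in for the CIFAR/ImageNet experiment, not a reproduction of it: it runs a hand-written Adam on the one-dimensional loss 0.5 * w**2 with Gaussian gradient noise, once with a constant learning rate and once with a step decay. The helper names, the noise model, and the schedule (multiply by 0.1 every 2000 steps) are all illustrative choices.

```python
import math
import random

def run_adam(lr_schedule, steps=6000, seed=0):
    """Run Adam on 0.5 * w**2 with noisy gradients; return mean loss of the last 500 steps."""
    rng = random.Random(seed)
    w, m, v = 1.0, 0.0, 0.0
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    tail = []
    for t in range(1, steps + 1):
        grad = w + rng.gauss(0.0, 1.0)  # true gradient w plus unit-variance noise
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad * grad
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= lr_schedule(t) * m_hat / (math.sqrt(v_hat) + eps)
        if t > steps - 500:
            tail.append(0.5 * w * w)
    return sum(tail) / len(tail)

def constant_lr(t):
    """Constant learning rate of 0.1 (illustrative)."""
    return 0.1

def step_decay_lr(t):
    """Start at 0.1 and multiply by 0.1 every 2000 steps (illustrative)."""
    return 0.1 * (0.1 ** ((t - 1) // 2000))

loss_const = run_adam(constant_lr)
loss_decay = run_adam(step_decay_lr)
print(f"constant lr: {loss_const:.5f}   step decay: {loss_decay:.5f}")
```

With a constant learning rate, Adam keeps taking steps of roughly the same size around the minimum, so the loss settles at a noise floor set by that rate; lowering the rate by hand lowers the floor. The adaptive scaling normalizes the gradient magnitude but does not shrink the step on its own.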