Implementing multi-GPU distributed training with Horovod in Amazon SageMaker Pipe mode
2020-11-07 20:15:00 【InfoQ】
Today, a variety of techniques let us train deep learning models with only a small amount of data. These include transfer learning for image classification tasks, few-shot learning and even one-shot learning, as well as fine-tuning language models based on pre-trained BERT or GPT2 models. However, some use cases still require large amounts of training data. For example, if your images are completely different from those in the ImageNet dataset, or your language corpus covers only a specific domain rather than general text, transfer learning is unlikely to deliver the desired model performance. As a deep learning researcher, you may also need to try new ideas or approaches from scratch. In these cases, we have to train large deep learning models on large datasets; without an efficient way to train, the whole process can take days, weeks, or even months.
In this article, we'll learn how to run multi-GPU training on a single Amazon SageMaker instance, and discuss how to implement more efficient multi-GPU and multi-node distributed training on Amazon SageMaker.
Link to the original text: https://www.infoq.cn/article/0867pYEmzviBfvZxW37k. Reproduction without the author's permission is prohibited.
Copyright notice
This article was created by [InfoQ]. Please include a link to the original when reposting. Thank you.