MLPerf Training v2.0 results released: with the same GPU configuration, Baidu PaddlePaddle's performance ranks first in the world
2022-07-01 16:22:00 【Paddlepaddle】
This article was first published on the PaddlePaddle (飞桨) official WeChat account. See the original post:
MLPerf Training v2.0 results released: with the same GPU configuration, Baidu PaddlePaddle's performance ranks first in the world
The latest MLPerf Training v2.0 results were published on June 30. In them, Baidu submitted BERT Large GPU training results using the PaddlePaddle framework and the Baidu AI Cloud Baige computing platform. Among all submissions with the same GPU configuration, these results ranked first. They surpassed the NGC PyTorch framework, which is heavily customized and optimized and had long led the list, and they demonstrated PaddlePaddle's performance advantage to the world.

Figure 1: Top five BERT training performance results in MLPerf Training v2.0
Figure 1 shows the top five BERT training performance results in MLPerf Training v2.0 on 8 NVIDIA A100 GPUs (400 W power, 80 GB memory). The Baidu PaddlePaddle submission is 5% to 11% faster than the other submissions.
The technology behind the “world first”
PaddlePaddle set the world's best training performance for the BERT model on 8 GPUs. This comes from the framework's strong baseline performance and its leading distributed training technology. It also comes from deep co-optimization between PaddlePaddle and NVIDIA GPUs.
Many factors affect the final performance of a deep learning training job. These range from data loading to model computation, from low-level operators to high-level distributed strategies, and from load balancing across devices to the end-to-end scheduling mechanism. Building on its architectural design and long-term practice, PaddlePaddle has optimized high-performance training systematically, mainly in the following areas:
Load balancing between data loading and model training
Distributed training often suffers from load imbalance. To address this, PaddlePaddle assigns model training and data loading/preprocessing to different devices. Each kind of heterogeneous compute resource is then put to its best use, and data I/O is balanced against computation.
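The general idea of overlapping data I/O with computation can be sketched in Python. In this version, loading and preprocessing run on a background thread that feeds a bounded queue, while the main thread "trains". This is a minimal sketch under stated assumptions, not PaddlePaddle's implementation. A thread stands in for a separate device, and `load_and_preprocess` and `train_step` are hypothetical placeholders.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the data stream


def load_and_preprocess(i):
    """Hypothetical stand-in for reading and preprocessing batch i."""
    return [i * 2, i * 2 + 1]


def train_step(batch):
    """Hypothetical stand-in for one training step; returns a fake loss."""
    return sum(batch) / len(batch)


def producer(num_batches, q):
    # Runs on a separate worker (standing in for another device),
    # so preprocessing overlaps with training on the main thread.
    for i in range(num_batches):
        q.put(load_and_preprocess(i))  # blocks while the buffer is full
    q.put(_SENTINEL)


def train_with_prefetch(num_batches, buffer_size=4):
    # The bounded queue keeps loading and training within a few batches of each other.
    q = queue.Queue(maxsize=buffer_size)
    worker = threading.Thread(target=producer, args=(num_batches, q), daemon=True)
    worker.start()
    losses = []
    while True:
        batch = q.get()  # blocks only if loading has fallen behind
        if batch is _SENTINEL:
            break
        losses.append(train_step(batch))
    worker.join()
    return losses


if __name__ == "__main__":
    print(train_with_prefetch(3))
```

The bounded buffer is the design choice that matters here. If loading is slower, training waits at `q.get()`. If training is slower, loading waits at `q.put()`. Neither side can run far ahead of the other.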
Accelerating models with variable-length sequence inputs
Most models that take variable-length sequences pad every sequence to a common length, which causes redundant computation. PaddlePaddle provides efficient support for variable-length inputs and the matching model structures, so that GPU compute is spent on useful work. This noticeably improves computational efficiency, especially for Transformer-style models.
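A minimal Python sketch makes the cost of padding concrete. It assumes each batch is padded to its longest sequence, then counts the wasted token positions. It also shows one common alternative: packing sequences into a flat buffer with cumulative offsets, so no padding is stored or computed. The function names are illustrative only and are not a PaddlePaddle API.

```python
from itertools import accumulate


def padding_stats(seq_lens):
    """Return (padded tokens, useful tokens, wasted fraction) for one batch."""
    max_len = max(seq_lens)
    padded = max_len * len(seq_lens)  # positions computed when padding to the longest sequence
    useful = sum(seq_lens)            # positions that hold real tokens
    return padded, useful, 1 - useful / padded


def pack(sequences):
    """Concatenate sequences into one flat buffer plus cumulative offsets.

    flat[offsets[i]:offsets[i + 1]] holds sequence i, so kernels that
    understand offsets can skip padding entirely.
    """
    flat = [tok for seq in sequences for tok in seq]
    offsets = [0] + list(accumulate(len(s) for s in sequences))
    return flat, offsets


def unpack(flat, offsets):
    """Recover the original sequences from a packed buffer."""
    return [flat[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


if __name__ == "__main__":
    print(padding_stats([128, 64, 32, 32]))
    print(pack([[1, 2, 3], [4], [5, 6]]))
```

Take a batch with lengths 128, 64, 32 and 32. Padding computes 512 positions to process 256 real tokens, so half of the work is wasted. For attention, whose cost grows faster than linearly with sequence length, the waste can be larger still.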
High-performance operator library and operator fusion
To push the framework's baseline performance to the limit, PaddlePaddle developed PHI, a high-performance operator library. PHI thoroughly optimizes GPU kernel implementations and increases parallelism inside each operator. It also uses operator fusion to reduce memory-access overhead, extracting the maximum performance from the GPU.
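Even plain Python can show why fusion reduces memory traffic. The sketch below compares two ways of computing a bias add followed by GELU. The unfused version materializes an intermediate buffer, while the fused version does both in a single pass. The bias-plus-GELU pair is an illustrative choice, not a claim about which operators PHI actually fuses. On a GPU, the saving comes from not writing the intermediate tensor to global memory and reading it back.

```python
import math


def gelu(x):
    # Exact GELU using the Gaussian error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def bias_gelu_unfused(xs, bias):
    # Two "kernels": the first writes an intermediate buffer, the second
    # reads it back, so every element makes an extra round trip through memory.
    tmp = [x + b for x, b in zip(xs, bias)]
    return [gelu(t) for t in tmp]


def bias_gelu_fused(xs, bias):
    # One "kernel": the bias add and GELU happen in the same pass,
    # so no intermediate buffer is ever materialized.
    return [gelu(x + b) for x, b in zip(xs, bias)]


if __name__ == "__main__":
    xs, bias = [-1.0, 0.0, 2.0], [0.5, 0.0, -0.5]
    print(bias_gelu_fused(xs, bias))
```

Both versions perform the same floating-point operations in the same order, so they return identical values. Only the memory traffic differs.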
Hybrid parallel training with high speedup
Traditional data parallelism is limited by performance and GPU memory bottlenecks. To overcome them, PaddlePaddle implements a hybrid parallel distributed training strategy that combines data parallelism, model parallelism, and grouped parameter sharding. In some scenarios, this achieves superlinear speedup in distributed training.
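A hedged Python sketch shows how these strategies can be combined. Each global rank is laid out on a three-dimensional grid of data-parallel, sharding, and model-parallel degrees, and the sketch lists the peer group each rank belongs to along every axis. The axis order (model parallel innermost) and all names are assumptions for illustration. The article does not describe PaddlePaddle's actual topology code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coord:
    dp: int        # data-parallel index
    sharding: int  # parameter-sharding index
    mp: int        # model-parallel index


def rank_to_coord(rank, dp, sharding, mp):
    """Map a global rank onto a dp x sharding x mp grid.

    Assumes model parallelism varies fastest, so model-parallel peers
    are adjacent ranks (typically on the same node).
    """
    world = dp * sharding * mp
    if not 0 <= rank < world:
        raise ValueError(f"rank {rank} out of range for world size {world}")
    return Coord(rank // (mp * sharding), (rank // mp) % sharding, rank % mp)


def model_parallel_group(rank, dp, sharding, mp):
    """Ranks that split the same layers between them."""
    c = rank_to_coord(rank, dp, sharding, mp)
    return [c.dp * sharding * mp + c.sharding * mp + m for m in range(mp)]


def sharding_group(rank, dp, sharding, mp):
    """Ranks that jointly hold one copy of the sharded parameters/optimizer state."""
    c = rank_to_coord(rank, dp, sharding, mp)
    return [c.dp * sharding * mp + s * mp + c.mp for s in range(sharding)]


def data_parallel_group(rank, dp, sharding, mp):
    """Ranks that hold replicas of the same shard and average gradients."""
    c = rank_to_coord(rank, dp, sharding, mp)
    return [d * sharding * mp + c.sharding * mp + c.mp for d in range(dp)]


if __name__ == "__main__":
    print(rank_to_coord(5, dp=2, sharding=2, mp=2))
```

Suppose 8 GPUs are split as `dp=2, sharding=2, mp=2`. Rank 5 then sits at `Coord(dp=1, sharding=0, mp=1)`. Its model-parallel peers are ranks 4 and 5, it shares sharded state with rank 7, and it averages gradients with rank 1. Under this layout, each rank stores about 1/(mp × sharding) of the parameter and optimizer state. That reduced footprint is what relieves the memory bottleneck of plain data parallelism.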
Fully asynchronous end-to-end scheduling
The stages of model training synchronize frequently and overlap little in time. PaddlePaddle designed an asynchronous scheduling mechanism that removes most synchronization operations while still preserving model convergence. Data processing, training, and collective communication are now scheduled almost entirely asynchronously, which improves end-to-end performance.
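The pattern of replacing per-step synchronization with asynchronous dispatch can be sketched in Python. The sketch assumes a single-worker executor as a stand-in for an in-order device stream. The host enqueues every step immediately and blocks only when it actually needs a value, here once every `sync_every` steps for logging. `device_step`, `run` and `sync_every` are hypothetical names. This illustrates the scheduling idea only, not PaddlePaddle's scheduler, and it does not model how convergence is preserved.

```python
from concurrent.futures import ThreadPoolExecutor


def device_step(step):
    """Hypothetical stand-in for the work a device runs for one training step."""
    return 1.0 / (step + 1)  # fake, decreasing loss


def run(num_steps, sync_every):
    """Enqueue every step without waiting; block only every `sync_every` steps.

    A single-worker executor runs submitted tasks one at a time in
    submission order, loosely mimicking one in-order device stream.
    """
    logged = []
    syncs = 0
    with ThreadPoolExecutor(max_workers=1) as stream:
        for step in range(num_steps):
            pending = stream.submit(device_step, step)  # returns immediately
            if (step + 1) % sync_every == 0:
                logged.append((step, pending.result()))  # the only blocking call
                syncs += 1
    # Leaving the `with` block waits for any remaining queued steps.
    return logged, syncs


if __name__ == "__main__":
    print(run(10, sync_every=5))
```

With `run(10, sync_every=5)`, the host blocks twice instead of ten times. In a real framework, each avoided synchronization is a point where the host would otherwise stall until the device drained its work queue.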
Supporting large-model innovation and industrial deployment
Baidu has long prioritized research and development of large models and is committed to bringing them into industrial use. Training large models requires strong support from the deep learning framework for high-performance distributed training.
PaddlePaddle's distributed training grew out of industrial practice and has steadily extended its lead. PaddlePaddle has released a series of notable technologies, including:
- the industry's first general-purpose heterogeneous parameter server architecture,
- a 4D hybrid parallel training strategy, and
- an end-to-end adaptive distributed training architecture.

Each of these was refined for different model structures and for sparse and dense workloads. Together they enable high-performance training on heterogeneous hardware for algorithms across many fields, including computer vision, natural language processing, personalized recommendation, and scientific computing. This helps large-model innovation iterate quickly.
This distributed technology and these high-performance training features have let PaddlePaddle-based hardware and software solutions keep achieving strong results in MLPerf. They have also supported the release of many industry-leading Wenxin (文心) large models. Examples include “Pengcheng-Baidu·Wenxin”, the world's first knowledge-enhanced model with hundreds of billions of parameters, and “State Grid-Baidu·Wenxin”, a knowledge-enhanced NLP large model for the electric power industry. Others are “SPD Bank (Pufa)-Baidu·Wenxin”, a knowledge-enhanced NLP large model for the financial industry, and a ten-million-scale AlphaFold2 protein structure analysis model on domestic hardware clusters.
Conclusion
PaddlePaddle's first-place BERT training performance in MLPerf Training v2.0 is a remarkable achievement. It reflects PaddlePaddle's long-term work on performance optimization and also support from across the hardware ecosystem. In recent years, hardware vendors have widely recognized PaddlePaddle's technical strength, and cooperation has grown closer. Hardware and software are developed in concert, and joint ecosystem building has been fruitful. Not long ago (May 26), NGC-Paddle, developed jointly by NVIDIA and PaddlePaddle, was officially launched. In the same MLPerf round, Graphcore also achieved excellent results using the PaddlePaddle framework. Going forward, PaddlePaddle will build on its performance advantages and keep innovating in hardware-software co-optimization and large-scale distributed training. Its goal is to give users a deep learning framework that is convenient, easy to use, and fast.
About MLPerf
MLPerf is an AI benchmark initiated by renowned academic researchers and industry experts. It aims to provide a fair, practical benchmarking platform that shows the best performance of industry-leading AI software and hardware systems. Its results are widely recognized in the AI field. Nearly all major hardware vendors and software service providers use published MLPerf results to build their own benchmarking systems. They use these systems to test how new AI accelerator chips and deep learning frameworks perform on MLPerf models.
Livestream announcement
On Wednesday, July 6 at 20:00, Yu Dianhai, chief architect of PaddlePaddle, and Zeng Jinle, senior R&D engineer at PaddlePaddle, will host a livestream. They will reveal the key technologies behind Baidu PaddlePaddle's “world first” performance with the same GPU configuration.
To sign up, reply 【Study】 to the PaddlePaddle official account. More gifts await you in the livestream!
Follow the 【飞桨PaddlePaddle】 official account for more technical content~