Distributed model loading and incremental training on Ascend 910
2022-06-10 04:36:00 【MSofficial】
Problem description:
【Function module】
mindspore.train.serialization.load_distributed_checkpoint
【Steps & symptom】
1. On ModelArts, train a 13B-parameter model from scratch on 16 nodes with 128 Ascend 910 cards. The model code is adapted from: https://gitee.com/mindspore/mindspore/tree/r1.3/model_zoo/official/nlp/pangu_alpha
2. With the same compute resources, load the model trained in step 1 via the load_distributed_checkpoint API and run incremental training.
3. Step 1 trains normally. During loading in step 2, an out-of-memory error is reported: the 8 cards of each node are allocated 2048 GB of memory, and load_distributed_checkpoint exceeds it while loading the model.
4. Note: with the same code, training a 1.3B-parameter model first and then loading it with load_distributed_checkpoint for incremental training works fine; loss and accuracy are both normal.
【Screenshots】
The figure below shows the resource usage of one node. Starting at minute 62, the model begins loading the previously saved ckpt via load_distributed_checkpoint; memUsage then climbs from 0 to about 95%, and the training process is killed by the system.
The foreground log reports no error, only one extra exception: the external kill signal cannot be caught, so the plog cannot be uploaded to the bucket.
/bin/sh: line 1: 15 Killed /bin/bash run_train.sh 's3://aix/PanGu/' 'PanGu/train.py' '/tmp/log/aix-b-model.log' --'epoch_size'='2' --'mode'='13B' --'obs_version_suffix'='increment' --'pre_trained'='obs://aix/saved_model/13B_increment_lm' --data_url='s3://aix/data_test/' --train_url='s3://aix/PanGu/saved_model/V0089/'
Answer:
Hello, could you provide information about the ckpt files and the embedding vocabulary:
1. How large is each ckpt file?
2. What is the embedding vocabulary size?
We suspect the ckpt files are too large, and having multiple cards load them at the same time leads to OOM. Our suggested modifications and optimizations:
1. If the number of cards before and after training is unchanged, there is no need to call the load_distributed_checkpoint interface; just call the load_checkpoint interface directly. That is, each card loads its own ckpt and does not need to load all the ckpts.
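A minimal sketch of suggestion 1, where each card loads only the checkpoint it saved itself. The `rank_{id}/model.ckpt` directory layout is an assumption for illustration, not necessarily the exact layout produced by the PanGu-α scripts:

```python
# Sketch: with an unchanged card count, each rank loads only its own ckpt.
# The path layout below is a hypothetical example.

def own_ckpt_path(ckpt_dir: str, rank_id: int) -> str:
    """Path of the checkpoint this rank saved in the previous run."""
    return f"{ckpt_dir}/rank_{rank_id}/model.ckpt"

# In a real run (requires MindSpore):
# from mindspore import load_checkpoint, load_param_into_net
# from mindspore.communication import get_rank
# param_dict = load_checkpoint(own_ckpt_path("/cache/ckpt", get_rank()))
# load_param_into_net(net, param_dict)
```

This keeps the per-card memory cost at a single ckpt instead of all 128.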
2. If the number of cards changes, for example from 128 cards to 64 cards:
a. If optimizer parallelism is not enabled, each model-parallel (mp) group holds a complete copy of the model. For example, with mp=8, every 8 ckpts form a complete model, so each card only needs to pass those 8 ckpts to load_distributed_checkpoint, or each card can call load_checkpoint to load a single ckpt.
b. If optimizer parallelism is enabled, all the ckpts together form one complete model; you need to call the load_distributed_checkpoint interface, and you should slim down the ckpts.
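For case 2b, every rank must pass load_distributed_checkpoint the full, rank-ordered list of checkpoint files. A hedged sketch, again assuming a hypothetical `rank_{id}/model.ckpt` layout:

```python
# Sketch: build the rank-ordered list of all ckpt files that
# load_distributed_checkpoint expects. The file layout is an assumption.

def all_ckpt_paths(ckpt_dir: str, num_cards: int) -> list:
    """All per-rank checkpoint files, ordered by rank id."""
    return [f"{ckpt_dir}/rank_{r}/model.ckpt" for r in range(num_cards)]

# Real call (requires MindSpore; predict_strategy comes from the new
# parallel configuration):
# from mindspore.train.serialization import load_distributed_checkpoint
# load_distributed_checkpoint(net, all_ckpt_paths("/cache/ckpt", 128),
#                             predict_strategy)
```

Note that every rank reads the full list here, which is exactly why the slimming in point 3 matters when the ckpts are large.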
3. Slim down the ckpts. Because the embedding is data-parallel by default, suppose it occupies 2 GB; then the 128 ckpts carry 256 GB of embedding copies in total. When the 8 cards of one machine load at the same time, 256 GB × 8 of memory is needed momentarily.
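The arithmetic behind this estimate, using the numbers from the answer:

```python
# Back-of-envelope for the peak memory in point 3.
embedding_gb = 2       # one data-parallel embedding copy
num_ckpts = 128        # each ckpt carries its own embedding copy
total_embedding_gb = embedding_gb * num_ckpts     # 256 GB across all ckpts
cards_loading = 8      # one node's cards all load at the same time
peak_gb = total_embedding_gb * cards_loading      # 2048 GB: the node's limit
```

The peak lands exactly at the node's 2048 GB allocation, which matches the observed kill at ~95% memUsage.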
a. The slimming procedure:
i. Store the embedding of each ckpt separately, i.e. delete the embedding variable from every ckpt and keep a single separate copy.
ii. If you do not need the optimizer state, you can also delete the optimizer variables from every ckpt.
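A sketch of slimming steps (i) and (ii) on a loaded parameter dict: split it into the kept weights, the embedding (to be saved once, separately), and the optimizer state (dropped). The name patterns ("embedding", "moments.", "global_step") are assumptions about how the variables are named, not guaranteed to match a given model:

```python
# Sketch: split a checkpoint parameter dict for slimming.
# Name patterns are hypothetical; adjust to your model's parameter names.

def slim_param_dict(param_dict, drop_optimizer=True):
    """Return (kept, embedding, optimizer_state) sub-dicts."""
    kept, embedding, optim = {}, {}, {}
    for name, value in param_dict.items():
        if "embedding" in name:
            embedding[name] = value          # step i: store separately
        elif drop_optimizer and (name.startswith("moments.")
                                 or name.startswith("global_step")):
            optim[name] = value              # step ii: can be dropped
        else:
            kept[name] = value
    return kept, embedding, optim

# Real usage (requires MindSpore): load each ckpt with load_checkpoint,
# slim it, re-save the kept dict with save_checkpoint, and save the
# embedding dict once; at restore time, load the single embedding copy
# plus each rank's slimmed ckpt.
```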