
Distributed checkpoint loading and incremental training on Ascend 910

2022-06-10 04:36:00 MSofficial

Problem description:

【Function module】

mindspore.train.serialization.load_distributed_checkpoint

【Operation steps & problem phenomenon】

1. On ModelArts, train a 13-billion-parameter model from scratch on 16 nodes with 128 Ascend 910 cards. The model code is adapted from https://gitee.com/mindspore/mindspore/tree/r1.3/model_zoo/official/nlp/pangu_alpha

2. Using the same compute resources, load the model trained in step 1 with the load_distributed_checkpoint API and run incremental training.

3. Training in step 1 works normally, but loading in step 2 fails because memory is exceeded. Each node's 8 cards are allocated 2048 GB of memory, and load_distributed_checkpoint exceeds this limit while loading the model.

4. Note: with the same code, if a 1.3-billion-parameter model is trained first and then loaded with load_distributed_checkpoint for incremental training, the whole process works without problems and the loss and accuracy are normal.

【Screenshot information】

The figure below shows the resource usage of one node. Starting at about minute 62, the model begins loading the previously saved ckpt via load_distributed_checkpoint; memUsage then climbs from 0 to about 95%, and the training process is killed by the system.

(Screenshot: node resource usage over time)

No error is reported in the foreground log; there is only one extra exception line. Because the external kill signal cannot be caught, the plog cannot be uploaded to the OBS bucket.

/bin/sh: line 1: 15 Killed /bin/bash run_train.sh 's3://aix/PanGu/' 'PanGu/train.py' '/tmp/log/aix-b-model.log' --'epoch_size'='2' --'mode'='13B' --'obs_version_suffix'='increment' --'pre_trained'='obs://aix/saved_model/13B_increment_lm' --data_url='s3://aix/data_test/' --train_url='s3://aix/PanGu/saved_model/V0089/'

Answer:

Hello, could you provide some information about the ckpt files and the embedding vocabulary:

1. How large is each ckpt file?

2. What is the embedding vocabulary size?

We suspect the ckpt files are too large, and having multiple cards load them at the same time can lead to OOM. Here are our suggestions for modification and optimization:

1. If the number of cards is the same before and after training, there is no need to call the load_distributed_checkpoint interface; just call the load_checkpoint interface directly. That is, each card loads only its own ckpt and does not need to load all of the ckpts (see the first sketch after this list).

2. If the number of training cards changes, for example from 128 cards to 64 cards:

    a. If optimizer parallelism is not enabled, each model-parallel (mp) group holds a complete copy of the model. For example, with mp=8, every 8 ckpts form a complete model, so each card only needs to pass those 8 ckpts to load_distributed_checkpoint, or each card can call load_checkpoint to load a single ckpt (see the second sketch after this list).

    b. If optimizer parallelism is enabled, all of the ckpts together form one complete model. In that case the load_distributed_checkpoint interface must be called, and the ckpts should be slimmed down.

3. Slim down the ckpts. Because the embedding is data-parallel by default, every card saves a full copy of it: if one copy occupies 2 GB, the 128 ckpts together hold 256 GB of embedding. When the 8 cards of one machine load at the same time, 256 GB * 8 = 2048 GB of memory is needed instantaneously, which is exactly the per-node memory allocation described above.

    a. The slimming process (see the third sketch after this list):

        i. Store the embedding from the ckpts separately: delete the embedding variable from every ckpt and keep a single separate copy of it.

        ii. If the optimizer state is not needed, the optimizer variables can also be removed from all of the ckpts.
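To illustrate suggestion 1, here is a minimal sketch of per-card loading. The directory, the rank-based file naming and the stand-in network are assumptions for illustration; load_checkpoint and load_param_into_net are the MindSpore interfaces being referred to.

    import os
    from mindspore import nn, load_checkpoint, load_param_into_net

    # Hypothetical stand-in for the real training network; replace with the PanGu net
    # (plus optimizer, if its state was saved into the ckpt).
    net = nn.Dense(16, 16)

    # On Ascend, the rank id is normally exposed through the RANK_ID environment variable.
    rank = int(os.getenv("RANK_ID", "0"))

    # Assumption: one ckpt was saved per card and named after its rank (naming scheme is illustrative).
    ckpt_file = os.path.join("/cache/ckpt", f"pangu_rank_{rank}.ckpt")

    # Each card loads only its own ckpt instead of all 128 of them.
    param_dict = load_checkpoint(ckpt_file)
    load_param_into_net(net, param_dict)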
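For suggestion 2a, a sketch of passing only one model-parallel group's ckpts to load_distributed_checkpoint. The mp value, the grouping of ranks and the file names are assumptions; depending on the parallel configuration, a strategy argument such as predict_strategy may also be needed.

    import os
    from mindspore import nn
    from mindspore.train.serialization import load_distributed_checkpoint

    # Hypothetical stand-in network; replace with the PanGu training network.
    net = nn.Dense(16, 16)

    mp = 8  # model-parallel size (assumption; take it from your parallel config)
    rank = int(os.getenv("RANK_ID", "0"))
    # Which 8 ckpts form one complete model copy depends on the original parallel
    # strategy; the contiguous grouping below is only illustrative.
    group_start = (rank // mp) * mp

    # Pass only the 8 ckpts of one model-parallel group, not all 128 files.
    ckpt_files = [
        os.path.join("/cache/ckpt", f"pangu_rank_{r}.ckpt")
        for r in range(group_start, group_start + mp)
    ]
    load_distributed_checkpoint(net, ckpt_files)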
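Suggestion 3 can be done offline, one ckpt at a time, before the incremental training job starts, which avoids the simultaneous-load memory spike. A minimal sketch, assuming the paths, file names and parameter-name patterns shown in the comments (they are guesses; check the keys of your own ckpts):

    import os
    from mindspore import load_checkpoint, save_checkpoint

    ckpt_dir = "/cache/ckpt"       # illustrative paths
    slim_dir = "/cache/ckpt_slim"
    os.makedirs(slim_dir, exist_ok=True)

    # Assumed name patterns: the embedding parameter contains "embedding", and optimizer
    # states carry prefixes such as "moments."/"adam_m."/"adam_v."; adjust to your model.
    OPTIMIZER_PREFIXES = ("moments.", "adam_m.", "adam_v.", "global_step", "beta1_power", "beta2_power")

    for i in range(128):
        params = load_checkpoint(os.path.join(ckpt_dir, f"pangu_rank_{i}.ckpt"))

        if i == 0:
            # Keep a single separate copy of the embedding (data parallelism keeps it identical on every card).
            embedding = [{"name": k, "data": v} for k, v in params.items() if "embedding" in k]
            save_checkpoint(embedding, os.path.join(slim_dir, "embedding_only.ckpt"))

        # Drop the embedding and, if the optimizer state is not needed, the optimizer variables.
        slim = [
            {"name": k, "data": v}
            for k, v in params.items()
            if "embedding" not in k and not k.startswith(OPTIMIZER_PREFIXES)
        ]
        save_checkpoint(slim, os.path.join(slim_dir, f"pangu_rank_{i}_slim.ckpt"))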
