Multi-card training in PyTorch
2022-07-29 04:14:00 【ytusdc】
- What is the workflow of multi-card training in PyTorch?
- With one model replica per card, are the BatchNorm parameters the same on every card?
- Does PyTorch's DistributedDataParallel keep the model parameters exactly the same on every GPU?
The parameters are the same, but at some moments the local gradients differ (until they are synchronized).
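On the BatchNorm question above: with ordinary BatchNorm each card computes batch statistics only over its local mini-batch, so the per-step statistics differ between cards (DDP's default buffer broadcasting keeps the running statistics consistent). If you want the statistics computed over the combined batch of all cards, the model can be converted to SyncBatchNorm before wrapping it in DDP. A minimal sketch (the layer sizes here are arbitrary, not from the post):

```python
import torch.nn as nn

# Toy model containing a BatchNorm layer; the sizes are arbitrary.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)

# Replace every BatchNorm*d layer with SyncBatchNorm, which aggregates
# batch statistics across all processes during distributed training.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

print(type(model[1]).__name__)  # the BN layer has been converted
```

In a real DDP script the conversion is done after model construction and before `DistributedDataParallel(model, ...)`.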
In DDP mode, the per-process workflow can be pictured as:
- compute its own loss in parallel (forward pass)
- run backward in parallel
- synchronize (all-reduce and average) the gradients across cards
- update the parameters
Because the initialization is identical across cards (DDP broadcasts the rank-0 state at construction) and every process applies the same averaged gradients, DDP guarantees that the model parameters stay identical across processes.
The NOTICE and WARNING sections in the class docstring of the source code spell out what you must comply with to keep the parameters consistent between processes. If in doubt, you can also run evaluation once in every process; each process should output the same result.
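The consistency argument above can be illustrated without any GPUs. The sketch below (plain Python with made-up numbers, not DDP itself) simulates two processes: both start from the same parameter, each computes a different local gradient on its own data shard, the gradients are averaged as DDP's all-reduce would do, and both apply the same SGD step, so the parameters remain identical:

```python
# Simulate DDP gradient synchronization for one scalar parameter.
# Both "cards" start from the same initialization.
w = [0.5, 0.5]        # parameter copy on card 0 and card 1
data = [2.0, 4.0]     # each card's local mini-batch (one sample each)
lr = 0.1

# Local loss on each card: (w * x - 1)^2  ->  gradient 2 * x * (w * x - 1).
# The local gradients differ because the data shards differ.
grads = [2 * x * (wi * x - 1) for wi, x in zip(w, data)]

# All-reduce: every card ends up with the same averaged gradient.
avg_grad = sum(grads) / len(grads)

# Each process applies the identical SGD update independently.
w = [wi - lr * avg_grad for wi in w]

print(w[0] == w[1])  # True: parameters stay in sync after the step
```

Same initialization plus identical (averaged) gradients at every step is exactly why the replicas never drift apart.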
- When multi-card training enlarges the batch size, why does accuracy sometimes drop instead of improve? Have you thought about how to solve it?
Multi-card training with a large batch size:
Theoretical advantages:
The influence of noise in the data is reduced, which may make it easier to approach an optimum.
Drawbacks and problems:
Reduced gradient variance. (In theory, for convex optimization problems a lower gradient variance gives a better optimization result; in practice, however, Keskar et al. verified that increasing the batch size leads to worse generalization.)
For non-convex optimization problems, the loss function contains many local optima. The noise of a small batch size makes it easy to jump out of a local optimum, while a large batch size may get stuck in a local optimum and never escape.
Solutions:
Increase the learning_rate. This can cause problems of its own: using a large learning_rate from the very start of training may prevent the model from converging.
Use warmup to mitigate the non-convergence caused by a large learning_rate.
Warmup
Link: deep learning training strategies -- learning rate warmup (Warmup)
Using a large learning_rate at the very beginning of training may keep it from converging. The idea of warmup is to start training with a small learning rate and gradually increase it as training proceeds until it reaches the base learning_rate, then continue training with an ordinary decay schedule (e.g. CosineAnnealingLR).
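As a concrete sketch of this schedule (the function name and step counts below are my own, not from the post): linear warmup up to the base learning rate, followed by cosine annealing in the style of CosineAnnealingLR:

```python
import math

def lr_at_step(step, warmup_steps, total_steps, base_lr, min_lr=0.0):
    """Linear warmup to base_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        # Warmup phase: grow linearly from base_lr / warmup_steps to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine annealing over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

schedule = [lr_at_step(s, warmup_steps=5, total_steps=20, base_lr=0.1)
            for s in range(20)]
print(schedule[0], schedule[4], schedule[-1])  # small start, peak, decayed end
```

In PyTorch one common way to get this behavior is to combine a warmup `LambdaLR` with `CosineAnnealingLR` via `torch.optim.lr_scheduler.SequentialLR`.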