[framework] multi learner

2022-07-05 07:09:00 hanjialeOK

Run the container

Run the container on each machine:

docker run -it -v /data:/root/data/ --network host --name multi_learner hanjl/cuda:framework

Modify the host address

Enter the container on each machine and modify the container's IP address:

vim /etc/hosts

The file shows:

127.0.0.1
127.0.1.1

For example, on the 4.7 server, change the second line to the host address:

***.**.4.7

Modify the SSH port

Enter the container on each machine and change the SSH port to 2233:

sed -i 's/\(^Port\)/#\1/' /etc/ssh/sshd_config
echo Port 2233 >> /etc/ssh/sshd_config
service ssh restart
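
The three commands above can be tried on a scratch file first, before touching the real sshd_config. This is only a minimal sketch: the function name set_ssh_port is hypothetical, and it assumes GNU sed (for the -i flag).

```shell
# Hypothetical helper: comment out existing Port lines and append a new one.
# Assumes GNU sed. The config path is an argument so a copy can be used.
set_ssh_port() {
    local config="$1" port="$2"
    # Comment out any active "Port ..." directive (same sed expression as above).
    sed -i 's/\(^Port\)/#\1/' "$config"
    # Append the new port directive at the end of the file.
    echo "Port $port" >> "$config"
}

# Try it on a scratch file before editing /etc/ssh/sshd_config.
scratch="$(mktemp)"
printf 'Port 22\nPermitRootLogin yes\n' > "$scratch"
set_ssh_port "$scratch" 2233
cat "$scratch"
```

Once the result looks right, the same call can be pointed at /etc/ssh/sshd_config, followed by the service restart.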

SSH from each machine to every other machine, and make sure all machines can connect to one another directly, without a password or a confirmation prompt.
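
One way to check this is to ask ssh to fail instead of prompting. The sketch below is an assumption, not part of the framework: check_passwordless, SSH_BIN and SSH_PORT are hypothetical names introduced for illustration. With BatchMode=yes, ssh should refuse to ask for a password or a host-key confirmation, so any host that still needs one is reported as FAILED.

```shell
# Hypothetical helper: report whether each host accepts a non-interactive login.
# SSH_BIN and SSH_PORT are illustrative knobs, not framework settings.
SSH_BIN="${SSH_BIN:-ssh}"

check_passwordless() {
    local host failed=0
    for host in "$@"; do
        # BatchMode=yes makes ssh fail rather than prompt for input.
        if "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=5 \
                -p "${SSH_PORT:-2233}" "$host" true; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done
    return "$failed"
}

# Example (run on every machine, listing the other machines):
#   check_passwordless <other-host-1> <other-host-2>
```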

Configure worker IDs

On the master, edit the SSH configuration file:

vim ~/.ssh/config

Change it to the following:

Host by08
  HostName ***.***.4.8
  Port 2233

Host by07
  HostName ***.***.4.7
  Port 2233
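
With more than two workers, the entries can be generated rather than typed by hand. A minimal sketch, assuming every worker uses the same port; print_host_entry is a hypothetical helper, and the addresses in angle brackets are placeholders for the real machine addresses.

```shell
# Hypothetical helper: print one Host block in the format shown above.
print_host_entry() {
    local name="$1" address="$2" port="${3:-2233}"
    printf 'Host %s\n  HostName %s\n  Port %s\n\n' "$name" "$address" "$port"
}

# Placeholders only: replace the bracketed values with the real addresses.
print_host_entry by08 '<by08-address>'
print_host_entry by07 '<by07-address>'
```

The output can be reviewed and then appended to ~/.ssh/config.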

Download the files

Download the framework code on each machine, and make sure the framework path is the same on all of them.

Run

Run this on the master only:

horovodrun -np 4 -H by07:2,by08:2 python learner.py --config examples/ppo/walker2d_learner_multi.yaml
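
In the command above, -np is the total number of processes and -H lists each host with its number of slots, so -np 4 matches 2 + 2. The sketch below derives both arguments from the same list so they cannot drift apart. build_horovod_args is a hypothetical helper, not part of Horovod.

```shell
# Hypothetical helper: build the -np and -H arguments from "host:slots" pairs.
build_horovod_args() {
    local total=0 hosts="" pair slots
    for pair in "$@"; do
        slots="${pair##*:}"             # text after the last colon = slots on that host
        total=$((total + slots))        # -np is the sum of all slots
        hosts="${hosts:+$hosts,}$pair"  # join the pairs with commas
    done
    echo "-np $total -H $hosts"
}

build_horovod_args by07:2 by08:2
# prints: -np 4 -H by07:2,by08:2
```

It could then be used as: horovodrun $(build_horovod_args by07:2 by08:2) python learner.py --config examples/ppo/walker2d_learner_multi.yaml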

Then run the following on each worker:

python actor.py --config examples/ppo/walker2d_actor.yaml

Free GPU memory

Currently the learner does not exit automatically, so horovodrun keeps holding GPU memory. The GPU memory has to be released manually on every worker.

First, check which processes are holding GPU memory:

fuser -v /dev/nvidia0

Then kill them. Note that one of the listed processes is obviously a system process and does not need to be killed.
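
To avoid touching the system process, the PIDs can be filtered by command name before killing. This is a sketch under one assumption: the leftover learner processes were started with python, so their command name starts with "python". Both function names are hypothetical. Without -v, fuser writes only the PIDs to standard output, which is what the loop relies on.

```shell
# Hypothetical helper: from "PID COMMAND" lines on stdin,
# print the PIDs whose command name starts with "python".
select_python_pids() {
    awk '$2 ~ /^python/ { print $1 }'
}

# List processes holding the GPU device, look up each command name,
# and keep only the Python ones (assumption: the learner runs as python).
list_gpu_python_pids() {
    local pid
    for pid in $(fuser /dev/nvidia0 2>/dev/null); do
        echo "$pid $(ps -o comm= -p "$pid")"
    done | select_python_pids
}

# Review the list first, then kill:
#   list_gpu_python_pids
#   kill $(list_gpu_python_pids)
```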

Copyright notice
This article was written by [hanjialeOK]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202140600198696.html