Using Amazon EC2 to reduce AWS DeepRacer training costs: a hands-on guide to DeepRacer-for-Cloud
2022-08-04 06:19:00 【Refers to the sword】
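CSDN Topic Challenge, Round 1
Event details: https://marketing.csdn.net/p/bb5081d88a77db8d6ef45bb7b6ef3d7f
Entry topic: Which artificial intelligence technology has amazed you?
Topic description: On your long journey of training models, which artificial intelligence technology has impressed you the most?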
Preface
Amazon DeepRacer is a 1/18-scale race car that provides a platform for getting started with reinforcement learning (RL) through autonomous driving. RL is an advanced machine learning (ML) technique that takes a very different approach to training models than other ML methods. Its strength is that it can learn very complex behaviors without any labeled training data, and it can make short-term decisions while optimizing for a longer-term goal. With Amazon DeepRacer, you can experiment with and learn RL hands-on through autonomous driving. You can start with virtual cars and tracks in the cloud-based 3D racing simulator and, for a real-world experience, deploy trained models to an Amazon DeepRacer car to race against friends or take part in the global Amazon DeepRacer League.
DeepRacer training costs approximately USD 3.50 per hour. This article presents an open-source-based approach to training DeepRacer models on an Amazon EC2 host, which can reduce the training cost by more than 70% (taking the g4dn.2xlarge instance type in us-east-1 as an example). The approach supports a variety of instance types, so you can choose one according to your needs.
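As a rough illustration of the cost claim, the following sketch computes the hourly cost implied by a given reduction and the savings over a training run. The 3.5 USD/hour figure and the 70% reduction come from the text; the function names and the 10-hour run length are illustrative assumptions, and actual EC2 pricing varies by region and instance type.

```python
def reduced_hourly_cost(console_rate_usd: float, reduction: float) -> float:
    """Hourly cost after applying a fractional cost reduction (0.7 means 70%)."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return console_rate_usd * (1.0 - reduction)


def savings_for_run(console_rate_usd: float, reduction: float, hours: float) -> float:
    """Total amount saved over a training run of the given length."""
    return console_rate_usd * reduction * hours


if __name__ == "__main__":
    console_rate = 3.5  # USD per hour, the figure quoted in the article
    reduction = 0.70    # lower bound of "more than 70%" from the article
    hours = 10          # hypothetical length of a training run
    hourly = reduced_hourly_cost(console_rate, reduction)
    saved = savings_for_run(console_rate, reduction, hours)
    print(f"Hourly cost after reduction: at most about {hourly:.2f} USD")
    print(f"Savings over {hours} hours: at least about {saved:.2f} USD")
```

Because the article states a reduction of "more than 70%", the printed hourly cost is an upper bound and the savings a lower bound.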
1. Technology overview
1.1 Amazon DeepRacer
With Amazon DeepRacer, you can create your own machine learning models (the "training" workflow) and race them (the "evaluation" workflow). You pay for training, evaluating, and storing machine learning models. Charges are based on the time you spend training and evaluating a new model and on the storage size of that model. You can also buy a fully autonomous 1/18-scale DeepRacer car to try your model on a physical track. No purchase is necessary to enter the DeepRacer League.
1.2 Amazon SageMaker
Amazon SageMaker helps data scientists and developers quickly prepare, build, train, and deploy high-quality machine learning (ML) models by bringing together a broad set of capabilities purpose-built for ML. With SageMaker, DeepRacer models can be trained serverlessly in the cloud.
1.3 Amazon RoboMaker
RoboMaker is used to simulate and deploy robotics applications that run in the cloud. During DeepRacer training, it uses the Gazebo simulator to simulate the real-world environment. Amazon RoboMaker is the most comprehensive cloud solution for robotics developers to simulate, test, and securely deploy robotics applications at scale. RoboMaker provides a fully managed, scalable simulation infrastructure that customers use for multi-robot simulation and for CI/CD integration with regression testing in simulation. In addition, Amazon RoboMaker provides an IDE, application deployment capabilities, ROS extension tools, and seamless integration with various Amazon and AWS services, enabling customers to innovate and deliver best-in-class robotics solutions. RoboMaker's managed ROS and Gazebo software stack frees up significant engineering resources so that you can start building quickly.
1.4 Amazon EC2
Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers. The simple web service interface of Amazon EC2 lets you obtain and configure capacity with minimal friction. It gives you complete control of your computing resources and lets you run on Amazon's proven computing environment.
This approach creates an AWS DeepRacer training environment, which can be deployed in the cloud or locally on Ubuntu Linux, Windows, or Mac.
(This section is quoted from: "Use Amazon EC2 to further reduce the cost of DeepRacer training")
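2. Implementation approach
DeepRacer-for-Cloud provides a quick and easy way to get up and running with a DeepRacer training environment in AWS or Azure, using either Azure N-Series virtual machines or AWS EC2 accelerated computing instances, or locally on your own desktop or server.
(This section is quoted from: deepracer-for-cloud)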
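3. Demonstration
The official AWS blog post on DeepRacer-for-Cloud:
Direct link: https://aws.amazon.com/cn/blogs/china/use-amazon-ec2-to-further-reduce-the-cost-of-deepracer-training/
Because the blog post already describes the approach in detail, this article extracts the scripts from it and resolves the problems encountered when running them.
This walkthrough uses the Deep Learning AMI (Ubuntu 18.04) Version 60.1.
The AWS DeepRacer-for-Cloud installation and training scripts are as follows.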
Step-1
Connect to the EC2 instance you created and run the following command to pull the code from GitHub:
git clone https://github.com/aws-deepracer-community/deepracer-for-cloud.git
Step-2
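Run the first-stage environment preparation script, which installs the base components required for local DeepRacer training, and then reboot the EC2 instance.
This step may report errors; see the solutions at the end of the article.
To cover the failure case, the three commands (removing the two lock files and running prepare.sh) are executed twice below. Just remember to run sudo apt-get update first.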
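# Import the NVIDIA repository signing key so that apt-get update succeeds
wget -qO - http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/7fa2af80.pub | sudo apt-key add -
sudo apt-get update
# Clear stale dpkg locks (only when no other apt/dpkg process is still running)
sudo rm -f /var/lib/dpkg/lock-frontend
sudo rm -f /var/lib/dpkg/lock
# Run in a subshell so the working directory is unchanged for the retry below
(cd deepracer-for-cloud && ./bin/prepare.sh)
# Retry in case the first attempt failed because a lock was held
sudo rm -f /var/lib/dpkg/lock-frontend
sudo rm -f /var/lib/dpkg/lock
(cd deepracer-for-cloud && ./bin/prepare.sh)
sudo reboot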
Step-3
Reconnect to the EC2 instance and run the second-stage environment initialization script:
cd deepracer-for-cloud/ && bin/init.sh -c aws -a gpu
Step-4
Start training
Run source bin/activate.sh to load the scripts required for DeepRacer training.
Edit the reward function in deepracer-for-cloud/custom_files/reward_function.py:
def reward_function(params):
    '''
    Example of penalizing steering, which helps mitigate zig-zag behaviors
    '''
    # Read input parameters
    distance_from_center = params['distance_from_center']
    track_width = params['track_width']
    steering = abs(params['steering_angle'])  # Only need the absolute steering angle

    # Calculate 3 markers that are farther and farther away from the center line
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width

    # Give higher reward if the car is closer to center line and vice versa
    if distance_from_center <= marker_1:
        reward = 1
    elif distance_from_center <= marker_2:
        reward = 0.5
    elif distance_from_center <= marker_3:
        reward = 0.1
    else:
        reward = 1e-3  # likely crashed / close to off track

    # Steering penalty threshold, change the number based on your action space setting
    ABS_STEERING_THRESHOLD = 15

    # Penalize reward if the car is steering too much
    if steering > ABS_STEERING_THRESHOLD:
        reward *= 0.8

    return float(reward)
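Before uploading a reward function, it can help to call it locally with a few hand-made inputs. The sketch below repeats a compact copy of the function so that it runs on its own; the sample dictionaries are hypothetical and contain only the three keys this function reads, whereas the simulator passes many more.

```python
def reward_function(params):
    """Compact copy of the centerline reward with a steering penalty."""
    distance_from_center = params['distance_from_center']
    track_width = params['track_width']
    steering = abs(params['steering_angle'])

    # Reward bands at 10%, 25% and 50% of the track width
    if distance_from_center <= 0.1 * track_width:
        reward = 1.0
    elif distance_from_center <= 0.25 * track_width:
        reward = 0.5
    elif distance_from_center <= 0.5 * track_width:
        reward = 0.1
    else:
        reward = 1e-3  # likely off track

    # Penalize steering beyond 15 degrees
    if steering > 15:
        reward *= 0.8
    return float(reward)


# Hypothetical sample inputs for a 1.0 m wide track
samples = [
    {'distance_from_center': 0.0, 'track_width': 1.0, 'steering_angle': 0},
    {'distance_from_center': 0.2, 'track_width': 1.0, 'steering_angle': 20},
    {'distance_from_center': 0.6, 'track_width': 1.0, 'steering_angle': 0},
]

if __name__ == "__main__":
    for sample in samples:
        print(sample, '->', reward_function(sample))
```

A check like this catches indentation mistakes and missing keys before any training time is spent.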
Edit the training hyperparameters in deepracer-for-cloud/custom_files/hyperparameters.json, for example:
{
    "batch_size": 64,
    "beta_entropy": 0.01,
    "discount_factor": 0.995,
    "e_greedy_value": 0.05,
    "epsilon_steps": 10000,
    "exploration_type": "categorical",
    "loss_type": "huber",
    "lr": 0.0003,
    "num_episodes_between_training": 20,
    "num_epochs": 10,
    "stack_size": 1,
    "term_cond_avg_score": 350.0,
    "term_cond_max_episodes": 1000,
    "sac_alpha": 0.2
}
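A malformed hyperparameters file is easier to catch before uploading than after training has started. The following sketch parses the example above and applies a few basic range checks. The checks and the function name are illustrative assumptions, not the validation that DeepRacer itself performs.

```python
import json

# The example hyperparameters from this article
HYPERPARAMETERS_JSON = """
{
    "batch_size": 64,
    "beta_entropy": 0.01,
    "discount_factor": 0.995,
    "e_greedy_value": 0.05,
    "epsilon_steps": 10000,
    "exploration_type": "categorical",
    "loss_type": "huber",
    "lr": 0.0003,
    "num_episodes_between_training": 20,
    "num_epochs": 10,
    "stack_size": 1,
    "term_cond_avg_score": 350.0,
    "term_cond_max_episodes": 1000,
    "sac_alpha": 0.2
}
"""


def check_hyperparameters(hp: dict) -> list:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    batch_size = hp.get("batch_size")
    if not (isinstance(batch_size, int) and batch_size > 0):
        problems.append("batch_size must be a positive integer")
    if not 0.0 < hp.get("discount_factor", 0.0) <= 1.0:
        problems.append("discount_factor must be in (0, 1]")
    if not hp.get("lr", 0.0) > 0.0:
        problems.append("lr must be positive")
    if hp.get("num_epochs", 0) < 1:
        problems.append("num_epochs must be at least 1")
    return problems


if __name__ == "__main__":
    hyperparameters = json.loads(HYPERPARAMETERS_JSON)  # raises on invalid JSON
    issues = check_hyperparameters(hyperparameters)
    print("OK" if not issues else "\n".join(issues))
```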
Edit the vehicle settings in deepracer-for-cloud/custom_files/model_metadata.json, including the action space, sensors, and neural network type, for example:
{
    "action_space": [
        {
            "steering_angle": -30,
            "speed": 0.6
        },
        {
            "steering_angle": -15,
            "speed": 0.6
        },
        {
            "steering_angle": 0,
            "speed": 0.6
        },
        {
            "steering_angle": 15,
            "speed": 0.6
        },
        {
            "steering_angle": 30,
            "speed": 0.6
        }
    ],
    "sensor": ["FRONT_FACING_CAMERA"],
    "neural_network": "DEEP_CONVOLUTIONAL_NETWORK_SHALLOW",
    "training_algorithm": "clipped_ppo",
    "action_space_type": "discrete",
    "version": "3"
}
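Writing a discrete action space by hand gets tedious once you want more steering angles or speeds. As a minimal sketch, the action list above can be generated from a list of angles and a single speed. The helper names are hypothetical; the remaining metadata fields are copied from the example in this article.

```python
import json


def build_action_space(steering_angles, speed):
    """One discrete action per steering angle, all at the same speed."""
    return [{"steering_angle": angle, "speed": speed} for angle in steering_angles]


def build_model_metadata(action_space):
    """Wrap an action space in the metadata fields used in this article."""
    return {
        "action_space": action_space,
        "sensor": ["FRONT_FACING_CAMERA"],
        "neural_network": "DEEP_CONVOLUTIONAL_NETWORK_SHALLOW",
        "training_algorithm": "clipped_ppo",
        "action_space_type": "discrete",
        "version": "3",
    }


if __name__ == "__main__":
    actions = build_action_space([-30, -15, 0, 15, 30], 0.6)
    print(json.dumps(build_model_metadata(actions), indent=4))
```

The printed JSON can be saved as model_metadata.json. If you change the angles, remember that the steering penalty threshold in the reward function depends on the action space.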
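Edit deepracer-for-cloud/run.env and add the following:
DR_LOCAL_S3_BUCKET=<name of the bucket you created>
DR_UPLOAD_S3_BUCKET=<name of the bucket you created>
Alternatively, insert the two lines at the top of the file from the shell:
sed -i '1i\DR_LOCAL_S3_BUCKET=<name of the bucket you created>' run.env
sed -i '1i\DR_UPLOAD_S3_BUCKET=<name of the bucket you created>' run.env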
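If run.env already contains an assignment for one of these keys, inserting another line leaves two definitions in the file. As a hedged alternative, the sketch below replaces an existing KEY=value line in place and appends the key only when it is missing. The function names, the sample file content, and the bucket name are hypothetical, and nothing here is part of deepracer-for-cloud itself.

```python
from pathlib import Path


def set_env_var(env_text: str, key: str, value: str) -> str:
    """Return env_text with key set to value, replacing an existing assignment or appending one."""
    lines = env_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if line.strip().startswith(f"{key}="):
            lines[i] = f"{key}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


def update_env_file(path, updates: dict) -> None:
    """Apply several key/value updates to an env file on disk."""
    env_path = Path(path)
    text = env_path.read_text() if env_path.exists() else ""
    for key, value in updates.items():
        text = set_env_var(text, key, value)
    env_path.write_text(text)


if __name__ == "__main__":
    # Hypothetical file content; a real run.env has many more settings
    sample = "SOME_OTHER_SETTING=1\nDR_LOCAL_S3_BUCKET=bucket\n"
    bucket = "my-deepracer-bucket"  # hypothetical bucket name
    updated = set_env_var(sample, "DR_LOCAL_S3_BUCKET", bucket)
    updated = set_env_var(updated, "DR_UPLOAD_S3_BUCKET", bucket)
    print(updated, end="")
```

To modify the real file, call update_env_file("run.env", {...}) from inside the deepracer-for-cloud directory.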
Step-5
Update the Python version
sudo apt-get -y install python3.8
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 2
Step-6
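After the update, run dr-update to make the configuration take effect.
Then run dr-upload-custom-files to upload the custom files to S3. At this point, custom_files in the bucket contains the files edited in Step-4: reward_function.py, hyperparameters.json, and model_metadata.json.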
Step-7
Run dr-start-training to start training.
Problems encountered
The following error occurred when running sudo apt-get update:
W: GPG error: http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY F60F4B3D7FA2AF80
E: The repository 'http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Release' is not signed.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
Solution to the W (GPG) error
Because this walkthrough uses Ubuntu 18.04,
the following fix applies:
wget -qO - http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/7fa2af80.pub | sudo apt-key add -
sudo apt-get update
Update the Python version
sudo apt-get install python3.8
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 2
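Notes:
The first argument, --install, registers an alternative with update-alternatives.
The second argument is the link path. On success, a symlink to the real command is created at this fixed location, and all later management operates on this symlink.
(--install link name path priority)
Here, link is the generic path shared by programs that provide the same function, such as /usr/bin/java (it must be an absolute path); name is the name of the alternative group, such as java; path is the location of the new command or program you want to use; priority is an integer, usually chosen according to the version number. When the link already exists, the new priority must be higher than the current value, because in automatic mode the system enables the alternative with the highest priority.
The third argument is the name, which is used as the key for later management.
The fourth argument is the absolute path of the managed command.
The fifth argument is the priority; the larger the number, the higher the priority.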
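The priority rule can be pictured with a small model: in automatic mode, the alternative with the highest priority wins. The sketch below is only an illustration of that rule, not how update-alternatives is implemented. The python3.6 entry and its priority of 1 are hypothetical; the python3.8 entry with priority 2 matches the command used in this article.

```python
from typing import List, NamedTuple


class Alternative(NamedTuple):
    path: str      # absolute path of the managed command
    priority: int  # larger number means higher priority


def pick_auto(alternatives: List[Alternative]) -> Alternative:
    """Return the alternative that automatic mode would select."""
    if not alternatives:
        raise ValueError("no alternatives registered")
    return max(alternatives, key=lambda alt: alt.priority)


if __name__ == "__main__":
    registered = [
        Alternative("/usr/bin/python3.6", 1),  # hypothetical existing entry
        Alternative("/usr/bin/python3.8", 2),  # entry registered in this article
    ]
    print(pick_auto(registered).path)
```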
Problem 3
E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
Solution (first make sure that no other apt or dpkg process is still running):
$ sudo rm /var/lib/dpkg/lock-frontend
$ sudo rm /var/lib/dpkg/lock
4. Summary
Although the official blog post is very clear, the environment keeps changing, so some details can behave unexpectedly. When you work through it yourself, stay flexible, find the sticking points in the experiment, document them, and share them. I hope this post helps.