About YOLOv7 and GPUs
2022-08-04 04:47:00 【Sister tt】
A note up front: this article is definitely not meant to be pinned on this site. It is just a set of notes for myself, with no intention of bragging; I keep it for my own reference, and nobody needs to vouch for it.
Human pose estimation methods currently fall into two groups, top-down and bottom-up. Unlike object detection, algorithms based on heatmaps, or on a detector followed by keypoint processing, depend more heavily on compute and take somewhat longer at inference. This year, keypoint detectors built on a YOLO baseline appeared. Anyone who has worked on object detection knows that YOLO is the detector family with the most industrial deployments and the most variants; its simple design and long-active community keep it a hot topic.
YOLO-Pose and KaPao (referred to below as yolo-like-pose) both build on the popular YOLO object detection framework and propose a heatmap-free approach, similar to Google's much earlier idea of regressing keypoints directly. yolo-like-pose does not use a detector for second-stage processing, nor does it stitch heatmaps together. Although it is a brute-force keypoint regression algorithm, it has a certain advantage in processing speed.
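Last November, the University of Waterloo was the first to propose KaPao: Rethinking Keypoint Representations: Modeling Keypoints and Poses as Objects for Multi-Person Human Pose Estimation, which performs keypoint detection on top of YOLOv5. The paper has been accepted by ECCV 2022. The performance of the proposed algorithm is as follows:
This April, YOLO-Pose was also posted on arXiv. In its survey, the paper found that heatmap-based methods widely use L1 loss. However, L1 loss is not necessarily suited to obtaining the best OKS. And because a heatmap is a probability map, a purely heatmap-based method cannot use OKS as the loss; OKS can serve as a loss function only when regressing to keypoint locations. YOLO-Pose therefore uses an OKS loss as the keypoint loss: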
if self.kpt_label:
    # Direct kpt prediction
    pkpt_x = ps[:, 6::3] * 2. - 0.5
    pkpt_y = ps[:, 7::3] * 2. - 0.5
    pkpt_score = ps[:, 8::3]
    kpt_mask = (tkpt[i][:, 0::2] != 0)
    lkptv += self.BCEcls(pkpt_score, kpt_mask.float())
    # l2 distance based loss
    # lkpt += (((pkpt-tkpt[i])*kpt_mask)**2).mean()  # Try to make this loss based on distance instead of ordinary difference
    # oks based loss
    d = (pkpt_x-tkpt[i][:,0::2])**2 + (pkpt_y-tkpt[i][:,1::2])**2
    s = torch.prod(tbox[i][:,-2:], dim=1, keepdim=True)
    kpt_loss_factor = (torch.sum(kpt_mask != 0) + torch.sum(kpt_mask == 0))/torch.sum(kpt_mask != 0)
    lkpt += kpt_loss_factor*((1 - torch.exp(-d/(s*(4*sigmas**2)+1e-9)))*kpt_mask).mean()
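The OKS-based term in the snippet above can be tried in isolation with NumPy. This is only a minimal sketch, not the authors' implementation: the function name oks_keypoint_loss, the array shapes, and the example sigma values are assumptions made for illustration; only the formula follows the snippet.

```python
import numpy as np

def oks_keypoint_loss(pred_xy, true_xy, visible, box_wh, sigmas, eps=1e-9):
    """OKS-style keypoint loss mirroring the snippet above (hypothetical helper).

    pred_xy, true_xy: (N, K, 2) keypoint coordinates
    visible:          (N, K) boolean mask of labelled keypoints
    box_wh:           (N, 2) target box width and height
    sigmas:           (K,) per-keypoint falloff constants
    """
    d = ((pred_xy - true_xy) ** 2).sum(axis=-1)      # squared distance, (N, K)
    s = box_wh.prod(axis=1, keepdims=True)           # box area as object scale, (N, 1)
    per_kpt = 1.0 - np.exp(-d / (s * (4.0 * sigmas ** 2) + eps))
    # total / visible: compensates for masked keypoints in the mean below
    factor = visible.size / max(int(visible.sum()), 1)
    return float(factor * (per_kpt * visible).mean())

# Two people, three keypoints each; sigma values are made up for the example.
true_xy = np.array([[[10., 10.], [20., 20.], [30., 30.]],
                    [[5., 5.], [15., 15.], [25., 25.]]])
visible = np.array([[True, True, False], [True, True, True]])
box_wh = np.array([[100., 200.], [80., 160.]])
sigmas = np.array([0.05, 0.05, 0.1])

perfect = oks_keypoint_loss(true_xy, true_xy, visible, box_wh, sigmas)
slightly_off = oks_keypoint_loss(true_xy + 5.0, true_xy, visible, box_wh, sigmas)
far_off = oks_keypoint_loss(true_xy + 1000.0, true_xy, visible, box_wh, sigmas)
```

A perfect prediction gives a loss of 0, and the loss saturates towards 1 as the predicted keypoints move far from the labels relative to the box area.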
Performance is as follows:
Last week, the YOLOv7 authors also released a human keypoint detection model, based on YOLOv7-w6.
The authors provide a .pt file and an inference test script; interested readers can take a look. This article focuses on exporting yolov7-pose.pt to an ONNX file and running inference with it.
yolov7-pose + onnxruntime
First download the officially trained model, then run inference with the provided script:
% weigths = torch.load('weights/yolov7-w6-pose.pt')
% image = cv2.imread('sample/pose.jpeg')
!python pose.py
1. yolov7-w6 vs. yolov7-w6-pose
First, look at the detection head used by yolov7-w6:
It has four detection heads at different scales, 15×15, 30×30, 60×60, and 120×120, corresponding to output nodes 114, 115, 116, and 117.
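The four grid sizes follow from dividing the input size by each sampling stride. A small sketch, assuming a 960-pixel square input (the size used in the export command in this article) and strides of 64, 32, 16, and 8; the helper name head_grid_sizes is made up for illustration:

```python
def head_grid_sizes(img_size, strides):
    """Side length of each detection grid for a square input (hypothetical helper)."""
    return [img_size // s for s in strides]

# 960-pixel input with strides 64, 32, 16, 8
grids = head_grid_sizes(960, [64, 32, 16, 8])
print(grids)  # [15, 30, 60, 120]
```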
Next, look at the detection head used by yolov7-w6-pose:
I won't repeat what overlaps with the above; just a few points:
nkpt denotes the 17 keypoints of the human body
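To make the role of nkpt concrete, here is a sketch of how one raw prediction vector could be split. The layout (4 box values, 1 objectness score, nc class scores, then x, y, confidence per keypoint) is inferred from the 6::3, 7::3, 8::3 indexing in the loss snippet earlier, so treat it as an assumption; split_prediction is a hypothetical helper, not a function from the repo.

```python
import numpy as np

NKPT = 17   # number of human keypoints
NC = 1      # single "person" class

def split_prediction(row, nc=NC, nkpt=NKPT):
    """Split one raw prediction vector into box, objectness, class scores, keypoints."""
    head = 5 + nc                       # x, y, w, h, objectness, then class scores
    assert row.shape[-1] == head + 3 * nkpt
    box, obj, cls = row[:4], row[4], row[5:head]
    kpts = row[head:].reshape(nkpt, 3)  # each keypoint: x, y, confidence
    return box, obj, cls, kpts

row = np.arange(5 + NC + 3 * NKPT, dtype=np.float32)  # dummy vector of length 57
box, obj, cls, kpts = split_prediction(row)
```

With nc=1 and nkpt=17, each prediction carries 6 + 51 = 57 values under this assumed layout.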
Exporting to ONNX directly with the export script is bound to fail. In the previous section we saw that the detection head used by the pose .pt model is IKeypoint, so the script needs to be changed. Insert the following at the corresponding position in export.py:
# Original code:
for k, m in model.named_modules():
    m._non_persistent_buffers_set = set()  # pytorch 1.6.0 compatibility
    if isinstance(m, models.common.Conv):  # assign export-friendly activations
        if isinstance(m.act, nn.Hardswish):
            m.act = Hardswish()
        elif isinstance(m.act, nn.SiLU):
            m.act = SiLU()
model.model[-1].export = not opt.grid  # set Detect() layer grid export
# Modified code:
for k, m in model.named_modules():
    m._non_persistent_buffers_set = set()  # pytorch 1.6.0 compatibility
    if isinstance(m, models.common.Conv):  # assign export-friendly activations
        if isinstance(m.act, nn.Hardswish):
            m.act = Hardswish()
        elif isinstance(m.act, nn.SiLU):
            m.act = SiLU()
    elif isinstance(m, models.yolo.IKeypoint):
        m.forward = m.forward_keypoint  # assign forward (optional)
        # switch the detection head to the keypoint forward
model.model[-1].export = not opt.grid  # set Detect() layer grid export
forward_keypoint already exists in the original yolov7 repo source; the author has it packaged, but apparently does not plan to open it up yet. Use the following command to export:
python export.py --weights 'weights/yolov7-w6-pose.pt' --img-size 960 --simplify True
import onnxruntime
import matplotlib.pyplot as plt
import torch
import cv2
from torchvision import transforms
import numpy as np
from utils.datasets import letterbox
from utils.general import non_max_suppression_kpt
from utils.plots import output_to_keypoint, plot_skeleton_kpts
device = torch.device("cpu")
image = cv2.imread('sample/pose.jpeg')
image = letterbox(image, 960, stride=64, auto=True)[0]
image_ = image.copy()
image = transforms.ToTensor()(image)
image = torch.tensor(np.array([image.numpy()]))
sess = onnxruntime.InferenceSession('weights/yolov7-w6-pose.onnx')
out = sess.run(['output'], {'images': image.numpy()})[0]
out = torch.from_numpy(out)
output = non_max_suppression_kpt(out, 0.25, 0.65, nc=1, nkpt=17, kpt_label=True)
output = output_to_keypoint(output)
nimg = image[0].permute(1, 2, 0) * 255
nimg = nimg.cpu().numpy().astype(np.uint8)
nimg = cv2.cvtColor(nimg, cv2.COLOR_RGB2BGR)
for idx in range(output.shape[0]):
    plot_skeleton_kpts(nimg, output[idx, 7:].T, 3)
# In a Jupyter notebook, run %matplotlib inline first
plt.figure(figsize=(8, 8))
plt.axis('off')
plt.imshow(nimg)
plt.show()
The inference results are almost unchanged, but the time taken is roughly halved. A few more points:
In image = letterbox(image, 960, stride=64, auto=True)[0], stride refers to the largest stride. Compared with yolov5s, yolov7-w6 has one more downsampling step, which adds a stride of 64 on top of 8, 16, and 32
In output = non_max_suppression_kpt(out, 0.25, 0.65, nc=1, nkpt=17, kpt_label=True), values such as nc and kpt_label can be seen when you inspect the model file in Netron
The resulting ONNX file is nearly three times the size of the original half-precision model; the reason is left for later investigation
yolov7-w6-pose is extremely memory-hungry: inference on a single 960×960 image needs 2-4 GB of GPU memory, so training is even harder to imagine
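As a rough illustration of why the largest stride matters for letterbox, the sketch below computes the padded output shape for an aspect-preserving resize whose padding only goes up to the next multiple of the stride. It is a simplified stand-in for what auto=True appears to do, not the repo's letterbox function; letterbox_shape and its defaults are assumptions.

```python
def letterbox_shape(h, w, new_size=960, stride=64):
    """Output (height, width) of an aspect-preserving resize with minimal stride padding."""
    r = min(new_size / h, new_size / w)      # scale so the longer side fits new_size
    new_h, new_w = round(h * r), round(w * r)
    pad_h = (new_size - new_h) % stride      # pad only up to the next stride multiple
    pad_w = (new_size - new_w) % stride
    return new_h + pad_h, new_w + pad_w

print(letterbox_shape(1080, 1920))  # (576, 960)
```

Both output sides are multiples of 64, so every detection head, including the one with stride 64, sees a whole number of grid cells.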
What follows is about GPUs. I also came across some material from others along the way; please don't mind me including it.
Personalized recommendation has become the main way people obtain information. In the past, people mostly found information by actively searching for what interested them; now, information distribution platforms built on algorithmic recommendation automatically identify user interests, quickly filter information, and push the content users care about.
On the one hand, recommender systems greatly improve the user experience; on the other hand, personalized distribution is more accurate and efficient, helping platforms match users with information more precisely and greatly improving the efficiency of traffic monetization. Traffic monetization engines built on recommendation technology have even created business empires worth billions.
From short video and news feeds to search advertising and online shopping, these applications are all built on precise recommender systems, and the core credit goes to the deep learning models behind them.
However, with the accumulation of massive data and ever more frequent iteration on user data, the scalability and training speed of the underlying systems face severe challenges. People have found that general-purpose deep learning frameworks cannot directly meet the needs of industrial-grade recommender systems; they must be deeply customized, or a dedicated system has to be developed.
To address the pain points of modern recommender systems, the OneFlow team at 一流科技 launched OneEmbedding, a high-performance, scalable, and highly flexible recommender system component. It is as simple to use as a general-purpose deep learning framework, yet its performance far exceeds that of general frameworks and even surpasses systems custom-built for recommendation scenarios such as NVIDIA HugeCTR.
Specifically, on the DCN and DeepFM models, whether in FP32 or automatic mixed-precision (AMP) training, OneEmbedding substantially outperforms HugeCTR; on the DLRM model, where HugeCTR's deep optimization is somewhat "overfitted", OneEmbedding performs roughly on par with HugeCTR.
(All tests above were run on: CPU Intel(R) Xeon(R) Platinum 8336C CPU @ 2.30GHz * 2; CPU Memory 1920GB; GPU NVIDIA A100-SXM-80GB * 8; SSD Intel SSD D7P5510 Series 3.84TB * 4)
When building a recommendation model with OneFlow, the user only needs a few lines of code like the following to configure the Embedding table and train a recommendation model with a TB-scale table:
# self.embedding = nn.Embedding(vocab_size, embedding_vec_size)
self.embedding = flow.one_embedding.MultiTableEmbedding(
Examples of common search, advertising, and recommendation models built on OneEmbedding: https://github.com/Oneflow-Inc/models/tree/main/RecommenderSystems
Challenges of large-scale recommender systems
Generally, a recommender system uses discrete features (sparse features) such as gender, age, and behavior. The feature ID is used to look up an Embedding table, and the corresponding Embedding vector is fetched and passed downstream.
The commonly used public dataset Criteo1T contains on the order of one billion feature IDs. If embedding_dims is set to 128, a total of 512 GB is needed to hold the Embedding parameters; with the Adam optimizer, which must store the two extra state variables m and v, the required storage reaches 1536 GB. In real applications the data scale is several orders of magnitude larger than Criteo, and the model capacity is larger still.
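The storage figures can be checked with a few lines of arithmetic. The sketch below assumes exactly one billion feature IDs and FP32 parameters (4 bytes each), neither of which is stated precisely in the text; embedding_storage_gb is a hypothetical helper.

```python
import math

def embedding_storage_gb(num_ids, dim, bytes_per_value=4, optimizer_states=0):
    """Storage in GB (10**9 bytes) for an embedding table plus per-parameter optimizer states."""
    copies = 1 + optimizer_states
    return num_ids * dim * bytes_per_value * copies / 1e9

table_only = embedding_storage_gb(1_000_000_000, 128)                     # FP32 table
with_adam = embedding_storage_gb(1_000_000_000, 128, optimizer_states=2)  # plus m and v
cards_40gb = math.ceil(table_only / 40)                                   # 40 GB cards for the table alone
print(table_only, with_adam, cards_40gb)  # 512.0 1536.0 13
```

Under these assumptions the table alone already exceeds the memory of a dozen 40 GB GPUs, which is the pressure the storage schemes discussed in this article try to relieve.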
The core problem of a large-scale recommender system is how to support querying and updating massive Embedding tables efficiently and economically. Trading off scale, cost, and efficiency, the following common solutions have emerged.
The most common solution is to deploy the Embedding entirely on CPUs, using the large capacity and low cost of CPU memory to expand the parameter size; its strength is that model size is almost unlimited. The drawback is equally obvious: in both compute performance and bandwidth, CPUs are far behind GPUs, so the Embedding part becomes a significant bottleneck, and it often takes dozens or even hundreds of CPU servers to support one industrial-grade recommender system.
Given the GPU's advantage in dense computation, some propose training large Embedding models on GPUs. The problem is that GPUs are expensive and have limited memory. Using A100s with 40 GB of memory to train 128-dimensional embedding vectors on the Criteo data, at least 13 cards are needed to hold the 512 GB Embedding table. With only 40 GB per card, the distributed Embedding relies on so-called model parallelism; ideally, handling a larger model only requires adding more GPUs.
In reality, GPUs are very costly compared with CPUs, and the dense body of a recommendation model is comparatively small. Scaling out with model parallelism only solves the Embedding size problem; the gain in training speed is limited, and the extra inter-device communication it introduces can even slow training down. It is therefore generally suitable only for small clusters.
To relieve the transfer bandwidth problem between GPUs, the industry has developed interconnects with higher bandwidth than Ethernet, such as NVSwitch and InfiniBand. On the one hand, this means extra cost; on the other hand, the infrastructure of many users cannot be retrofitted or upgraded accordingly.
So, is there a solution that lets you have your cake and eat it too?
To address the problems of the schemes above, the OneFlow team designed OneEmbedding. Through hierarchical storage, a single card can support TB-scale model training; through horizontal scaling, model capacity has no ceiling; through OneFlow's automatic pipelining mechanism, operator optimization, and quantized communication compression, it pushes performance to the limit. While being as simple to use as PyTorch, OneEmbedding delivers more than 3 times the performance of TorchRec on the DLRM model, and with mixed precision enabled, which TorchRec does not support, more than 7 times.
(TorchRec performance data is taken from the 8-card A100 test results: https://github.com/facebookresearch/dlrm/tree/main/torchrec_dlrm/#preliminary-training-results )
Core advantages of OneEmbedding
Hierarchical storage: a single card can support TB-scale model training
By exploiting the spatial and temporal locality of data, a multi-level cache strikes a good compromise between performance and cost. OneEmbedding implements multi-level caching based on this idea, so even a user with only one GPU can train a TB-scale model.
Users can deploy the Embedding in GPU memory, or in CPU memory or even on SSD. This takes advantage of the lower cost of CPU memory or SSD to expand the Embedding parameter scale, while using GPU memory as a cache to achieve high performance.
OneEmbedding dynamically caches recently and frequently accessed entries in GPU memory, while evicting entries with low recent access frequency to the underlying storage such as CPU memory or SSD. Given that the data follows a power-law distribution, and with an effective cache management algorithm, OneEmbedding can keep the GPU cache hit rate at a high level.
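The caching idea can be illustrated with a toy two-tier table. The text does not say which eviction policy OneEmbedding uses, so the sketch below uses plain LRU as a stand-in; LRUEmbeddingCache and the dict-based "fast" and "slow" tiers are assumptions for illustration only.

```python
from collections import OrderedDict

class LRUEmbeddingCache:
    """Toy cache: a small 'fast' tier in front of a larger 'slow' store (both plain dicts)."""

    def __init__(self, capacity, backing_store):
        self.capacity = capacity
        self.fast = OrderedDict()   # stands in for GPU memory
        self.slow = backing_store   # stands in for CPU memory or SSD
        self.hits = 0
        self.misses = 0

    def lookup(self, feature_id):
        if feature_id in self.fast:
            self.hits += 1
            self.fast.move_to_end(feature_id)    # mark as most recently used
            return self.fast[feature_id]
        self.misses += 1
        row = self.slow[feature_id]              # slow fetch on a miss
        self.fast[feature_id] = row
        if len(self.fast) > self.capacity:
            evicted_id, evicted_row = self.fast.popitem(last=False)
            self.slow[evicted_id] = evicted_row  # write the evicted row back
        return row

store = {i: [float(i)] * 4 for i in range(100)}
cache = LRUEmbeddingCache(capacity=3, backing_store=store)
# Skewed access pattern: IDs 0 and 1 are "hot", 7 and 9 are rare.
for fid in [0, 1, 0, 0, 1, 7, 0, 1, 9, 0]:
    cache.lookup(fid)
hit_rate = cache.hits / (cache.hits + cache.misses)
```

Even with room for only 3 of the 100 rows, the hot IDs stay resident, which is the effect a skewed (power-law) access pattern makes possible.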
It is worth emphasizing that OneEmbedding uses CPU memory and SSD purely as storage devices; all computation runs on the GPU. OneEmbedding currently provides three preset storage schemes:
Store all model parameters in GPU memory
Use CPU memory as the storage device for Embedding parameters, with the GPU as a cache
Use SSD as the storage device for Embedding parameters, with the GPU as a cache
# Use SSD as the storage device, with the GPU as a cache
store_options = flow.one_embedding.make_cached_ssd_store_options(
Depending on the hardware actually available, the user can configure this with just a few lines of code and optimize scale, efficiency, and cost all at once.
To hide the latency of fetching data from CPU memory and SSD, OneEmbedding introduces techniques such as pipelining and data prefetching, so that even with CPU memory or SSD as the storage backend, efficiency stays on par with pure GPU training.
The three storage schemes were tested separately. The test case matches MLPerf's DLRM model, with a parameter size of about 90 GB. When using SSD or CPU memory as the storage device, we configured a GPU cache of 12 GB per GPU. Compared with the 90 GB total parameter size, only part of the parameters can live in GPU memory; the rest stay in CPU memory or on SSD and are swapped into the GPU cache dynamically as training proceeds. The test results are shown in the figure below.
(Test environment: CPU Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz * 2; CPU Memory 512GB; GPU NVIDIA A100-PCIE-40GB * 4; SSD Intel SSD D7P5510 Series 7.68TB * 4)
(1) The pure GPU memory scheme performs best, but since GPU memory is only 4x40 GB, in theory it can train at most a 160 GB model;
(2) Compared with the pure GPU memory scheme, the GPU cache + CPU storage scheme loses only a little performance, but raises the ceiling on parameter size to the capacity of CPU memory, typically hundreds of GB to several TB;
(3) Going further, if a larger performance loss is acceptable, the GPU cache + SSD storage scheme raises the ceiling on parameter size to the capacity of the SSD, so the model can reach tens of TB or even more.
Suppose we want to fully train the above DLRM model on a server with only one NVIDIA A30-24GB GPU. 24 GB of GPU memory clearly cannot train a 90 GB model directly. With hierarchical storage, using CPU memory as the storage device and GPU memory as the cache, even models larger than 90 GB can be supported.
Horizontal scaling: linear speedup across multiple cards, breaking the model-size ceiling
With hierarchical storage, OneEmbedding raises the limit on Embedding parameter size in the single-card case: as long as memory is large enough, it can train TB-scale models. If the model capacity grows further, even beyond the capacity of CPU memory, users can build on the multi-level storage and use OneFlow's parallelism to scale out easily to multiple machines and cards and train larger models.
In recommender systems, the parameters of the model body are much smaller than the Embedding. So we usually set the Embedding part to model parallelism and the model body to data parallelism. Using multiple machines and cards further increases the Embedding size.
In terms of implementation, each rank stores one part of the Embedding. Feature IDs arrive at each rank and may contain duplicates, so they are first deduplicated (the ID Shuffle in the figure below); each rank then uses its deduplicated IDs to query the Embedding and obtain the corresponding local data; all ranks then combine the data across ranks to obtain the complete Embedding data (the Embedding Shuffle in the figure below); finally, each rank completes the rest of the model training in data-parallel fashion.
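The deduplicate, look up locally, then recombine flow can be simulated in a single process with NumPy. This is only a sketch under assumptions: the text does not say how IDs are assigned to ranks, so id % world_size is used as a stand-in, the "ranks" are just dicts, and no real communication takes place. sharded_lookup is a hypothetical helper.

```python
import numpy as np

def sharded_lookup(ids, shards):
    """Look up `ids` in a table split across len(shards) ranks by `id % world_size`."""
    world_size = len(shards)
    unique_ids, inverse = np.unique(ids, return_inverse=True)    # deduplicate ("ID Shuffle")
    dim = next(iter(shards[0].values())).shape[0]
    unique_rows = np.empty((len(unique_ids), dim), dtype=np.float32)
    for rank in range(world_size):
        mine = np.nonzero(unique_ids % world_size == rank)[0]    # IDs owned by this rank
        for pos in mine:
            unique_rows[pos] = shards[rank][int(unique_ids[pos])]  # local lookup
    # Scatter rows back to the original, possibly repeated, ID order ("Embedding Shuffle")
    return unique_rows[inverse]

full_table = np.arange(10 * 4, dtype=np.float32).reshape(10, 4)
shards = [{i: full_table[i] for i in range(10) if i % 2 == r} for r in range(2)]
ids = np.array([3, 7, 3, 0, 7])
rows = sharded_lookup(ids, shards)
```

Each distinct ID is fetched once from the rank that owns it, and the inverse index restores the original order, so the result matches a direct lookup in the unsharded table.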
The figure below shows the model throughput with different numbers of GPUs when OneEmbedding trains the DLRM model with the pure GPU memory strategy, under the FP32 and AMP configurations.
(Test environment: CPU Intel(R) Xeon(R) Platinum 8336C CPU @ 2.30GHz * 2; CPU Memory 1920GB; GPU NVIDIA A100-SXM-80GB * 8; SSD Intel SSD D7P5510 Series 3.84TB * 4)
As can be seen, model throughput increases markedly as the number of GPUs grows. With mixed precision, a single GPU reaches a throughput of 6 million, and scaling to 8 GPUs gives close to 40 million.
Pipelining: automatically overlapping computation and data transfer
In the DLRM model, the Dense Features go into the Bottom MLP, while the Sparse Features go through the Embedding lookup to obtain the corresponding feature vectors. Both then enter the Interaction for feature crossing and finally go into the Top MLP.
Embedding-related operations include lookup (Embedding Lookup) and update (Embedding Update). Because OneEmbedding uses hierarchical storage, a feature ID may miss the cache; when that happens, pulling the data takes longer and slows down training.
To avoid this, OneEmbedding adds a data prefetch (Embedding Prefetch) operation to ensure that lookups and updates can run on the GPU. Since prefetching has no dependency between consecutive iterations, the Embedding data needed by the next iteration can be prefetched while the current iteration is computing, so computation and prefetching overlap.
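The overlap between computing one iteration and prefetching the next can be sketched with a background thread. This is an illustration of the idea only: OneFlow does this with streams and its own runtime, whereas the sketch uses a ThreadPoolExecutor, and fetch_embeddings, compute, and train_with_prefetch are made-up stand-ins.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_embeddings(batch_ids):
    """Stand-in for a slow fetch from CPU memory or SSD."""
    time.sleep(0.01)
    return [float(i) for i in batch_ids]

def compute(rows):
    """Stand-in for the forward/backward pass on already-fetched rows."""
    time.sleep(0.01)
    return sum(rows)

def train_with_prefetch(batches):
    """Assumes at least one batch."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_embeddings, batches[0])       # prefetch the first batch
        for step in range(len(batches)):
            rows = pending.result()                               # wait for this step's data
            if step + 1 < len(batches):
                # start fetching the next batch before computing the current one
                pending = pool.submit(fetch_embeddings, batches[step + 1])
            results.append(compute(rows))                         # overlaps with the fetch
    return results

losses = train_with_prefetch([[1, 2], [3, 4], [5, 6]])
```

Because the next fetch is submitted before compute runs, the fetch latency of every batch after the first is hidden behind the previous batch's computation.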
While Embedding data is being queried and exchanged, the Dense Features, which are independent of the Embedding operations, can enter the Bottom MLP for computation, overlapping in time. The complete overlapped execution timeline is shown in the figure below.
Such complex pipeline control is a very challenging problem in traditional deep learning frameworks. Moreover, in real recommendation scenarios user data keeps changing, which requires the pipelining mechanism to handle dynamic data as well.
OneFlow's Actor mechanism makes all of this simple. Each Actor coordinates distributed work through its own internal state machine and message mechanism. By giving each Actor multiple storage blocks, different Actors can work at the same time and overlap their working time, forming a pipeline between Actors. We only need to assign the Embedding operations to a separate stream, and the system forms the pipeline on its own.
Operator optimization: approaching the GPU's peak performance
The OneFlow team has not only deeply optimized general-purpose operators, but also added several high-performance CUDA operator implementations tailored to popular recommender models.
For the feature crossing part of the DLRM and DCN models, OneFlow implements the FusedDotFeatureInteraction and FusedCrossFeatureInteraction operators respectively.
(FusedCrossFeatureInteraction operator; figure from "Deep & Cross Network for Ad Click Predictions")
For the multiple fully connected layers in the model, OneFlow implements the FusedMLP operator on top of the cublasLt matrix library.
For fully connected layers with a Dropout operation, OneFlow deeply customizes the ReluDropout operation: it uses a bitmask to store the mask produced in the forward pass, and in backpropagation it sets the cublasLt matrix multiplication parameter alpha=dropout_scale to fuse the backward operators.
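The bitmask idea can be shown with NumPy: the forward pass keeps one bit per element, and the backward pass multiplies the upstream gradient by that mask and by the dropout scale. This is a sketch under assumptions, not OneFlow's kernel: folding the ReLU mask and the dropout mask into a single bit is my own simplification, and the scale is applied as a plain multiply rather than through the matmul's alpha parameter.

```python
import numpy as np

def relu_dropout_forward(x, keep_mask, rate):
    """ReLU followed by dropout; returns the output and a bit-packed mask for backward."""
    scale = 1.0 / (1.0 - rate)
    mask = (x > 0) & keep_mask              # element survives both ReLU and dropout
    out = np.where(mask, x * scale, 0.0)
    return out, np.packbits(mask), scale    # 1 bit per element instead of 1 byte

def relu_dropout_backward(grad_out, packed_mask, scale):
    """Gradient wrt x: upstream gradient where the bit is set, times the dropout scale."""
    mask = np.unpackbits(packed_mask, count=grad_out.size).astype(bool)
    return grad_out * mask.reshape(grad_out.shape) * scale

x = np.array([[1.0, -2.0, 3.0, 4.0]])
keep = np.array([[True, True, False, True]])   # fixed mask so the example is deterministic
out, packed, scale = relu_dropout_forward(x, keep, rate=0.5)
grad_x = relu_dropout_backward(np.ones_like(x), packed, scale)
```

Packing the mask to bits cuts its memory footprint by 8x compared with a byte mask, and the backward pass needs nothing else from the forward pass.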
Quantized compression: squeezing out communication efficiency
Much recent work quantizes and compresses the data communicated during model training to save traffic and improve communication efficiency, and OneEmbedding supports this as well.
In parallel training, ranks need to communicate to exchange Embedding data. We first quantize the floating-point data to int8, then dequantize it after the exchange.
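A minimal sketch of the quantize, exchange, dequantize round trip is shown below. The text only says that floats are converted to int8 and restored afterwards; the per-row symmetric scheme and the function names quantize_int8 and dequantize_int8 are assumptions for illustration.

```python
import numpy as np

def quantize_int8(rows):
    """Per-row symmetric quantization: float32 rows -> int8 values plus one scale per row."""
    max_abs = np.abs(rows).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.round(rows / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Restore approximate float32 rows from int8 values and per-row scales."""
    return q.astype(np.float32) * scale

rows = np.array([[0.5, -1.0, 0.25], [0.0, 0.0, 0.0]], dtype=np.float32)
q, scale = quantize_int8(rows)           # this is what would be sent between ranks
restored = dequantize_int8(q, scale)
max_error = np.abs(restored - rows).max()
```

Each value travels as 1 byte instead of 4, at the cost of a rounding error of at most half a quantization step per element.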
Taking the DLRM model as an example, with the pure GPU memory storage configuration, the figure below shows the model throughput measured before and after quantization under FP32 and mixed precision.
Model accuracy (AUC) before and after quantization:
(Test environment: CPU Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz * 2; CPU Memory 512GB; GPU NVIDIA A100-PCIE-40GB * 4; SSD Intel SSD D7P5510 Series 7.68TB * 4)
The results show that, without affecting model accuracy, quantized communication gives a 64% improvement over the default communication mode under FP32 and a 13% improvement under mixed precision.
Ease of use: building large-scale recommendation models is as simple as using PyTorch
OneEmbedding is an internal extension component of OneFlow, which means users can use OneEmbedding's advanced features while still enjoying the flexibility of OneFlow as a general framework to build their own recommendation models.
class DLRMModule(nn.Module):
    def __init__(self, args):
        super(DLRMModule, self).__init__()
        self.bottom_mlp = FusedMLP(...)
        self.embedding = OneEmbedding(...)
        self.interaction = FusedDotInteraction(...)
        self.top_mlp = FusedMLP(...)

    def forward(self, sparse_feature, dense_feature):
        dense_fields = self.bottom_mlp(dense_feature)
        embedding = self.embedding(sparse_feature)
        features = self.interaction(dense_fields, embedding)
        return self.top_mlp(features)
Finally, it is worth mentioning that OneEmbedding encodes feature IDs through a built-in encoding mechanism and supports dynamic insertion of new data. Users do not need to plan the Embedding capacity ahead of time, nor do any special processing on the feature IDs in the dataset. This dynamic mechanism naturally supports incremental training scenarios and also reduces the burden of use.
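What dynamic insertion means for the user can be illustrated with a toy table that allocates a row the first time an ID appears. The text does not describe OneEmbedding's actual encoding mechanism, so the dict-based mapping below is only a stand-in, and DynamicEmbedding is a hypothetical class.

```python
import numpy as np

class DynamicEmbedding:
    """Toy table that assigns a row to each feature ID the first time it is seen."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.rng = np.random.default_rng(seed)
        self.row_of = {}   # raw feature ID -> row index
        self.rows = []     # one vector per known ID

    def lookup(self, feature_ids):
        out = []
        for fid in feature_ids:
            if fid not in self.row_of:   # unseen ID: insert on the fly
                self.row_of[fid] = len(self.rows)
                self.rows.append(self.rng.normal(0.0, 0.01, self.dim).astype(np.float32))
            out.append(self.rows[self.row_of[fid]])
        return np.stack(out)

table = DynamicEmbedding(dim=4)
first = table.lookup([10**12, 42, 10**12])   # arbitrary, sparse 64-bit IDs
second = table.lookup([42, 7])               # 7 is new and gets added
```

No vocabulary size is declared up front and the raw IDs are used as-is, which is the property that makes incremental training on new data straightforward.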
OneFlow's models repository currently provides a series of models built on OneEmbedding, such as DLRM, DeepFM, xDeepFM, DCN, PNN, and MMoE; more recommendation models will be added later (https://github.com/Oneflow-Inc/models/tree/main/RecommenderSystems).
OneEmbedding is a component built for the training needs of large-scale recommender models. Its flexible hierarchical storage, highly optimized data pipeline, and easy horizontal scaling let users train TB-scale recommendation models with ease.
Currently, the OneFlow framework provides some model examples so you can try OneEmbedding with one click. Later, the OneFlow team will release Flow-Recommender, a recommender model library covering mainstream industry models; it will support not only distributed training but also distributed inference. Interested readers are welcome to follow along.
OneEmbedding repository: https://github.com/Oneflow-Inc/models/tree/main/RecommenderSystems
OneEmbedding documentation: https://docs.oneflow.org/master/cookies/one_embedding.html
OneEmbedding API documentation: https://oneflow.readthedocs.io/en/master/one_embedding.html
