Major update: the latest YOLOv4 paper, with an interpretation of the YOLOv4 framework

2022-06-25 20:37:00 【SophiaCV】

Paper and code

Paper: https://arxiv.org/abs/2004.10934v1

Code: https://github.com/AlexeyAB/darknet

This post is a translation and framework interpretation of the YOLOv4 paper. A PDF version is also available for download: YOLOv4 reading notes (with mind map), "YOLOv4: Optimal Speed and Accuracy of Object Detection".

Abstract:

A huge number of features are said to improve the accuracy of convolutional neural networks (CNNs). Combinations of these features need to be tested in practice on large datasets, and the results need theoretical justification. Some features work only on certain models, only on certain problems, or only on small datasets; others, such as batch normalization and residual connections, apply to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT), and Mish activation. We use the following new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, CmBN, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) on the MS COCO dataset at a real-time speed of about 65 FPS on a Tesla V100.

The key takeaway: the authors bring Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT), Mish activation, Mosaic data augmentation, DropBlock, CIoU, and more into YOLOv4, which puts it ahead of competing detectors. On MS COCO it reaches 43.5% AP (65.7% AP50) while running at about 65 FPS on a Tesla V100.

Contributions

From the outset, the goal of YOLO has been a fast and efficient object detector. The main contributions of this paper are as follows:

An efficient and powerful object detector is designed, so that anyone with a single 1080 Ti or 2080 Ti GPU can train a very fast and accurate object detector;
We verify the influence of SOTA bag-of-freebies and bag-of-specials methods of object detection during detector training;
The authors modify SOTA methods (including CBN, PAN, and SAM) to make them more efficient and better suited to single-GPU training.

Method

Building on existing real-time networks, the authors propose two options:

For GPUs, use a small number of groups (1-8) in the convolutional layers, as in CSPResNeXt50 / CSPDarknet53;
For VPUs, use grouped convolution but avoid SE modules.
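To make the effect of the group count concrete, here is a minimal sketch that counts the weights of a single convolutional layer for several group settings. The function name and the 256-channel, 3x3 layer are illustrative assumptions and are not taken from the paper or from the darknet code.

```python
def conv_weight_count(in_channels, out_channels, kernel_size, groups=1):
    """Number of weights (bias excluded) in a 2D convolution with `groups` groups."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees in_channels / groups input channels.
    return (in_channels // groups) * out_channels * kernel_size * kernel_size


if __name__ == "__main__":
    # Hypothetical layer: 256 -> 256 channels with a 3x3 kernel.
    for groups in (1, 2, 4, 8):
        print(groups, conv_weight_count(256, 256, 3, groups=groups))
```

The weight count falls in proportion to the number of groups, which is why the group count is one of the knobs for trading accuracy against speed.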

Network structure selection

Choosing the architecture means finding a trade-off among input resolution, number of layers, number of parameters, and number of output filters. The authors' experiments show that CSPResNeXt50 is better than CSPDarknet53 at classification but performs worse at detection.

Once the backbone is chosen, the next goal is to select additional blocks that enlarge the receptive field and a better feature aggregation method (such as FPN, PAN, ASFF, or BiFPN). The best model for classification is not necessarily the best for detection. Instead, a detection model needs the following:

Higher input resolution, to better detect small objects;
More layers, for a larger receptive field;
More parameters, so that the model has the capacity to detect objects of different sizes in the same image.

In short: choose a model with a larger receptive field and more parameters as the backbone. Comparing the candidate backbones on these measures, CSPResNeXt50 contains only 16 3x3 convolutional layers, with a 425x425 receptive field and 20.6M parameters, while CSPDarknet53 contains 29 3x3 convolutional layers, with a 725x725 receptive field and 27.6M parameters. Both the theoretical analysis and the experiments therefore indicate that CSPDarknet53 is the more suitable backbone.
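To illustrate why more layers give a larger receptive field, the sketch below applies the standard recurrence for a plain stack of convolutions: each layer adds (kernel - 1) times the accumulated stride. The layer lists are toy examples, not the actual CSPDarknet53 layout, so the sketch is not expected to reproduce the 725x725 figure.

```python
def receptive_field(layers):
    """Receptive field of a plain conv stack.

    `layers` is an iterable of (kernel_size, stride) pairs, ordered input to output.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


if __name__ == "__main__":
    print(receptive_field([(3, 1)] * 3))  # three 3x3 stride-1 layers
    print(receptive_field([(3, 2)] * 5))  # five 3x3 stride-2 layers
```

Stride-2 layers grow the receptive field much faster than stride-1 layers, so depth and downsampling together decide how much of the input each output location can see.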

On top of CSPDarknet53, the authors add an SPP module, because it significantly enlarges the receptive field, separates out the most important context features, and hardly reduces inference speed. They also use PANet, rather than FPN, as the method for aggregating parameters from different backbone levels.
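A rough sketch of the SPP idea follows: the same feature map is max-pooled at stride 1 with several kernel sizes, and the results are stacked with the input as extra channels. The kernel sizes 5, 9, and 13 follow the SPP description in the paper; the single-channel nested-list representation and the function names are assumptions made to keep the example dependency-free.

```python
def max_pool_same(feature, kernel):
    """Stride-1 max pooling that keeps the spatial size.

    The window is clipped at the border, which is equivalent to padding with -inf.
    """
    height, width = len(feature), len(feature[0])
    radius = kernel // 2
    pooled = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                feature[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


def spp(feature, kernels=(5, 9, 13)):
    """Return the input map plus one pooled map per kernel, stacked as channels."""
    return [feature] + [max_pool_same(feature, k) for k in kernels]


if __name__ == "__main__":
    fmap = [[0] * 7 for _ in range(7)]
    fmap[3][3] = 1  # a single activation in the center
    for channel in spp(fmap):
        print(sum(map(sum, channel)))  # how far the activation has spread
```

Because the pooling runs at stride 1, the spatial size is unchanged and only the channel count grows, which is consistent with the small cost in inference speed noted above.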

The final model is: CSPDarknet53 + SPP + PANet (path-aggregation neck) + YOLOv3 head = YOLOv4.

Selection of tricks

To improve the training of an object detection model, a CNN typically uses the following components:

Activations: ReLU, Leaky-ReLU, PReLU, ReLU6, SELU, Swish, or Mish
Bounding box regression loss: MSE, IoU, GIoU, CIoU, DIoU
Data augmentation: CutOut, MixUp, CutMix
Regularization: DropOut, DropPath, Spatial DropOut, DropBlock
Normalization: BN, SyncBN, FRN, CBN
Skip-connections: residual connections, weighted residual connections, cross stage partial connections
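Of the bounding box regression losses listed above, CIoU is the one that appears in the final YOLOv4 configuration. The sketch below follows the commonly cited CIoU definition (IoU term, normalized center distance, and an aspect-ratio term); the (x1, y1, x2, y2) box format, the function name, and the eps constant are assumptions, and this is not the darknet implementation.

```python
import math


def ciou_loss(box, gt, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2) with x2 > x1 and y2 > y1."""
    bx1, by1, bx2, by2 = box
    gx1, gy1, gx2, gy2 = gt
    bw, bh = bx2 - bx1, by2 - by1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Plain IoU.
    inter_w = max(0.0, min(bx2, gx2) - max(bx1, gx1))
    inter_h = max(0.0, min(by2, gy2) - max(by1, gy1))
    inter = inter_w * inter_h
    iou = inter / (bw * bh + gw * gh - inter + eps)

    # Squared distance between box centers.
    rho2 = ((bx1 + bx2 - gx1 - gx2) ** 2 + (by1 + by2 - gy1 - gy2) ** 2) / 4.0
    # Squared diagonal of the smallest box enclosing both.
    c2 = (max(bx2, gx2) - min(bx1, gx1)) ** 2 + (max(by2, gy2) - min(by1, gy1)) ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(bw / (bh + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


if __name__ == "__main__":
    print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
    print(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes
```

Unlike a plain IoU loss, the center-distance term still varies with how far apart the boxes are when they do not overlap, so the loss stays informative in that case.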

From these candidates, the authors make the following choices: Mish as the activation function and DropBlock for regularization. Because the focus is on single-GPU training, SyncBN is not considered.
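For reference, Mish is defined as x * tanh(softplus(x)). A minimal scalar sketch is shown below; the numerically stable softplus form is an implementation choice made here, not something specified in the post.

```python
import math


def softplus(x):
    """log(1 + exp(x)), written so that large |x| does not overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


if __name__ == "__main__":
    for value in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(value, mish(value))
```

Mish is smooth everywhere, lets small negative values through, and approaches the identity for large positive inputs.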

Other improvement strategies

To make the detector better suited to single-GPU training, the authors also introduce several additional designs and improvements:

Two new data augmentation methods are introduced: Mosaic and Self-Adversarial Training (SAT);
Optimal hyperparameters are selected with a genetic algorithm (GA);
Existing methods are modified to make training and inference more efficient: modified SAM, modified PAN, and CmBN.
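As a rough illustration of Mosaic, which the paper describes as mixing four training images into one, the sketch below splits a canvas into four quadrants around a random center and fills each quadrant from a different image. This is a heavily simplified assumption-based sketch: the images are equal-sized nested lists, nothing is resized, bounding box labels are not handled, and the function name is hypothetical.

```python
import random


def mosaic(images, size, rng=random):
    """Combine four size x size images into one by quadrant around a random center."""
    if len(images) != 4:
        raise ValueError("mosaic needs exactly four images")
    # Keep the center away from the border so every image contributes pixels.
    cx = rng.randint(size // 4, 3 * size // 4)
    cy = rng.randint(size // 4, 3 * size // 4)
    canvas = [[0] * size for _ in range(size)]
    for y in range(size):
        for x in range(size):
            # Quadrant index: 0 top-left, 1 top-right, 2 bottom-left, 3 bottom-right.
            quadrant = (1 if x >= cx else 0) + (2 if y >= cy else 0)
            canvas[y][x] = images[quadrant][y][x]
    return canvas


if __name__ == "__main__":
    # Four constant "images" so the quadrants are easy to see when printed.
    imgs = [[[i] * 8 for _ in range(8)] for i in range(4)]
    for row in mosaic(imgs, 8, random.Random(0)):
        print(row)
```

A real pipeline would also crop or rescale each source image and remap its bounding boxes into the combined canvas.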

YOLOv4

In summary, YOLOv4 consists of the following:

Backbone: CSPDarknet53
Neck: SPP, PAN
Head: YOLOv3
Tricks (backbone): CutMix, Mosaic, DropBlock, Label Smoothing
Modified (backbone): Mish, CSP, MiWRC
Tricks (detector): CIoU, CmBN, DropBlock, Mosaic, SAT, Eliminate grid sensitivity, Multiple Anchors, Cosine Annealing scheduler, Random training shapes
Modified (detector): Mish, SPP, SAM, PAN, DIoU-NMS
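One of the detector tricks listed above, the cosine annealing scheduler, is easy to show in a few lines. The sketch uses the textbook cosine schedule without restarts; the function name and the learning-rate values are hypothetical, and the exact schedule in darknet may differ.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate that decays from lr_max to lr_min along half a cosine wave."""
    step = min(max(step, 0), total_steps)  # clamp to the schedule range
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cosine


if __name__ == "__main__":
    # Hypothetical schedule: 100 steps, peak learning rate 0.01.
    for step in (0, 25, 50, 75, 100):
        print(step, cosine_annealing_lr(step, 100, lr_max=0.01))
```

The rate changes slowly at the start and end of training and fastest in the middle.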