2022-07-31 14:26:00
The DeepLab series was first proposed at ICLR 2015. It combines deep convolutional neural networks (DCNNs) with a probabilistic graphical model, the conditional random field (CRF), to classify images at the pixel level (the semantic segmentation task). Applying DCNNs to pixel-level classification runs into two major obstacles: signal downsampling and spatial "insensitivity" (invariance). Because of their translation invariance, DCNNs work well on many abstract image tasks, such as large-scale classification on ImageNet and object detection on COCO. The first obstacle comes from the repeated combination of max pooling and downsampling ("striding") performed at every DCNN layer, which reduces signal resolution. The model addresses it with the hole algorithm ("hole" algorithm, also called the "atrous" algorithm), and it uses a fully connected CRF to refine the segmentation result.
DeepLabV1 has three advantages:
(1) Speed: the DCNN with atrous convolution runs at 8 fps, and the fully connected CRF that follows needs only 0.5 s.
(2) Accuracy: it took first place on PASCAL VOC, 7.2 percentage points ahead of the runner-up, reaching 71.6% IOU on the PASCAL VOC-2012 test set.
(3) Simplicity: the whole model consists of two modules, the DCNN and the CRF.
atrous algorithm
The DCNN used by the model is based on the VGG-16 architecture, modified so that it can extract pixel-level features. The final fully connected layers are replaced with convolutions, which gives an output sampling stride of 32 pixels. That is too coarse for segmentation, so, following earlier work, the paper removes the downsampling after the last two max-pooling layers and changes the convolutions that follow into atrous convolutions. The aim is to enlarge the receptive field without the information loss caused by pooling, as shown in the figure below:
The figure above shows 1D atrous convolution with kernel size = 3, input stride = 2, and output stride = 1. For a detailed explanation of atrous convolution, see 空洞卷积详解. A 2D illustration is shown below:
Atrous convolution has a parameter, rate, which means that rate-1 holes are inserted between adjacent taps of an ordinary convolution kernel. When rate=1 it is the same as an ordinary convolution kernel. From the kernel's point of view, this is equivalent to inserting rate-1 zeros between adjacent points of a standard kernel and then convolving the input with the expanded kernel, so the receptive field grows. In the figure below, for example, with rate=2 the original 3×3 kernel becomes 5×5, with zeros filling the gaps. From the input's point of view, it is equivalent to a standard convolution that skips rate-1 pixels between the samples it reads.
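The two views described above (a zero-padded kernel versus sampling the input every few pixels) can be checked numerically. The following is a minimal 1D sketch using NumPy; the function names `dilate_kernel` and `atrous_conv1d` are made up for this illustration and do not come from any DeepLab code.

```python
import numpy as np

def dilate_kernel(kernel, rate):
    """Insert rate-1 zeros between adjacent taps of a 1D kernel."""
    k = len(kernel)
    dilated = np.zeros(k + (k - 1) * (rate - 1), dtype=float)
    dilated[::rate] = kernel
    return dilated

def atrous_conv1d(x, kernel, rate):
    """Valid 1D atrous convolution (cross-correlation form):
    the kernel taps read input samples that are `rate` apart."""
    k = len(kernel)
    span = (k - 1) * rate + 1  # effective kernel size
    return np.array([
        sum(kernel[j] * x[i + j * rate] for j in range(k))
        for i in range(len(x) - span + 1)
    ])

x = np.arange(10, dtype=float)
kernel = np.array([1.0, 2.0, 1.0])

dilated = dilate_kernel(kernel, rate=2)              # taps 1, 0, 2, 0, 1
direct = atrous_conv1d(x, kernel, rate=2)            # input-sampling view
via_zeros = np.correlate(x, dilated, mode="valid")   # zero-inserted-kernel view
print(np.allclose(direct, via_zeros))
```

With rate=2 a kernel of size 3 covers 5 input samples, which is the 1D analogue of the 3×3 kernel growing to 5×5.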
Controlling the receptive field with convolution and speeding up computation
The VGG16 network pre-trained on ImageNet has a receptive field of 224×224 (with zero padding) and 404×404 when applied convolutionally. After the network is converted into a fully convolutional one, the first fully connected layer has 4,096 filters of large 7×7 spatial size and becomes the computational bottleneck of the dense score map computation. The paper directly subsamples the 7×7 filters to 4×4 (or 3×3). This shrinks the receptive field to 128×128 (with zero padding) and 308×308, and cuts the computation by a factor of 2 to 3.
In convolutional networks there is a natural trade-off between classification accuracy and localization accuracy: deeper models with multiple max-pooling layers have proven the most successful at classification, but their increased invariance and large receptive fields make it hard to infer position from the scores. Current research follows two main directions: (1) fusing feature maps from multiple CNN layers to sharpen boundary estimates; (2) using superpixel-based methods. This paper simply couples the DCNN with a conditional random field: the DCNN classifies the pixels and gives a rough estimate of the boundaries, and the fully connected CRF is applied as post-processing to recover precise pixel boundaries.
Take the figure below as an example. The first column shows the original image and the ground truth. In the second column, the first row is the DCNN feature map before the softmax (the score map) and the second row is the feature map after the softmax (the belief map). The third, fourth, and fifth columns show the score map and belief map after 1, 2, and 10 CRF iterations. The segmentation is clearly better after CRF processing.
As the figure below shows, the score maps output by the DCNN are usually very smooth and produce homogeneous classification results. In this situation a short-range CRF is a poor fit, because the goal should be to recover detailed local structure rather than to smooth it further, and short-range CRFs are mainly used for smoothing. Using a local-range CRF with contrast-sensitive potentials can potentially improve localization, but thin structures remain a problem, and the discrete optimization involved is expensive.
To overcome the shortcomings of the short-range CRF, the paper integrates a fully connected CRF into the model. As the figure below shows, the DCNN turns the original image into a score map, which is upsampled 16× to the original image size by bilinear interpolation and then passed through the fully connected CRF to produce the final output.
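The upsampling step in this pipeline can be sketched with plain NumPy. This is only an illustration under stated assumptions: the helper names are hypothetical, corner-aligned interpolation is assumed, the 21 classes and the 4×4 score map are made-up sizes, and the CRF stage itself is left out.

```python
import numpy as np

def interp_matrix(n_in, n_out):
    """Rows hold linear-interpolation weights mapping n_in samples to n_out.
    Corner-aligned sampling is an assumption of this sketch."""
    pos = np.linspace(0, n_in - 1, n_out)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, n_in - 1)
    frac = pos - lo
    m = np.zeros((n_out, n_in))
    m[np.arange(n_out), lo] += 1.0 - frac
    m[np.arange(n_out), hi] += frac
    return m

def bilinear_upsample(score_map, factor):
    """Upsample a (classes, H, W) score map by `factor` along both axes."""
    _, h, w = score_map.shape
    mh = interp_matrix(h, h * factor)
    mw = interp_matrix(w, w * factor)
    return np.einsum("oh,chw,pw->cop", mh, score_map, mw)

def softmax(scores):
    """Softmax over the class axis: turns a score map into a belief map."""
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
score_map = rng.normal(size=(21, 4, 4))       # hypothetical: 21 classes, 4x4 map
upsampled = bilinear_upsample(score_map, 16)  # upsampling factor taken from the text
belief = softmax(upsampled)                   # per-pixel class probabilities
labels = belief.argmax(axis=0)                # prediction before any CRF refinement
print(upsampled.shape, labels.shape)
```

In the full model the belief map would feed the fully connected CRF rather than being turned into labels directly.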
The conditional random field refines object boundaries, smooths noisy segmentation results, and fills holes in the middle of predicted objects, giving a more accurate segmentation. The formula is shown in the figure below. x is the pixel label assignment, P(x_i) is the label assignment probability at pixel i, θ_i is the log probability, and θ_ij is a filter; the term in brackets is a weighted sum of two kernels. The first kernel depends on both the difference in pixel values and the difference in pixel positions. It is a bilateral filter, which preserves edges and encourages pixels with similar color and position to receive the same label. The second kernel depends only on the difference in pixel positions. It is a Gaussian filter and makes the prediction smoother. The σ values (Gaussian kernel sizes) and w values (weights) are found by cross-validation.
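For reference, the energy described above can be written out as follows. This is a reconstruction in the usual fully connected CRF notation, not a copy of the missing figure: the symbols p (pixel position), I (pixel color), μ, and the subscripts on σ and w are notation assumed here.

```latex
E(x) = \sum_i \theta_i(x_i) + \sum_{ij} \theta_{ij}(x_i, x_j),
\qquad \theta_i(x_i) = -\log P(x_i)

\theta_{ij}(x_i, x_j) = \mu(x_i, x_j) \left[
  w_1 \exp\left( -\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\alpha^2}
                 -\frac{\lVert I_i - I_j \rVert^2}{2\sigma_\beta^2} \right)
  + w_2 \exp\left( -\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\gamma^2} \right)
\right],
\qquad \mu(x_i, x_j) = \begin{cases} 1 & x_i \neq x_j \\ 0 & \text{otherwise} \end{cases}
```

The first exponential is the bilateral kernel (position and color), the second is the Gaussian kernel (position only), and μ makes the penalty apply only to pixel pairs that receive different labels.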
Multi-scale prediction improves accuracy, but adding the CRF improves it even more.
For its time, DeepLab was a large improvement over earlier models.
Compared with DeepLabV1, DeepLabV2 makes two main improvements: (1) it uses ResNet instead of VGG16 as the backbone, improving the feature extraction part of the model to raise accuracy; (2) it proposes atrous spatial pyramid pooling (ASPP) to replace multi-scale prediction, which greatly enlarges the receptive field.
The DeepLab series faces three main challenges: (1) reduced feature resolution, (2) objects at multiple scales, and (3) reduced localization accuracy caused by DCNN invariance.
The first challenge is caused by the repeated combination of max pooling and downsampling ("striding") performed at consecutive layers of DCNNs originally designed for image classification. Stacked max-pooling layers and fully connected layers lose spatial information. The paper removes the downsampling from the last few max-pooling layers and uses atrous convolution instead to increase the output resolution and keep more spatial information.
The second challenge is caused by objects appearing at multiple scales. A standard way to deal with it is to feed the DCNN resized versions of the same image and then aggregate the features or score maps. The authors show that this does improve the performance of their system, but at the cost of computing feature responses at all DCNN layers for every resized version of the input image. Inspired by spatial pyramid pooling, the paper proposes "atrous spatial pyramid pooling" (ASPP): the feature map is passed through parallel atrous convolution layers with different dilation rates, and their outputs are fused to predict the segmentation result.
The third challenge is the same as in DeepLabV1: it stems from the translation invariance of DCNNs and is again handled by refining the result with a CRF.
The main model is similar to DeepLabV1, with two differences. First, the backbone of DeepLabV2 becomes ResNet. Second, the output stride of DeepLabv1 is 16, so the prediction has to be upsampled 16× by bilinear interpolation, whereas the output stride of DeepLabv2 is 8, so only 8× upsampling is needed, and the results are much better.
To handle multi-scale information in segmentation, that is, challenge 2 above, the authors tried two options:
(1) Option 1: parallel DCNN branches that share the same parameters extract DCNN score maps from several resized versions of the original image. To produce the final result, the feature maps of the parallel DCNN branches are bilinearly interpolated to the original image resolution and fused by taking the maximum response across scales at each position. Multi-scale processing improves performance significantly, but at the cost of computing feature responses at all DCNN layers for every input scale.
(2) Option 2: this is the highlight of the paper and the biggest change from DeepLabv1, the introduction of ASPP. The figure below shows ASPP applied to a feature map. It depicts four ASPP branches with different rate values, with kernel_size=3 and dilation rates 6, 12, 18, and 24. After the feature map passes through the four branches, the field-of-view of each branch on the original feature map is marked with a rectangle of a different color. To classify the center pixel (orange), ASPP exploits multi-scale features by applying multiple parallel filters with different rates. The segmentation results improve considerably.
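The ASPP idea can be sketched on a single-channel feature map. This is a simplified illustration, not the authors' implementation: the function names are hypothetical, each branch uses one 3×3 kernel, and fusing the branches by summation is an assumption of the sketch.

```python
import numpy as np

def atrous_conv2d(x, kernel, rate):
    """'Same'-padded 2D atrous convolution (cross-correlation form)
    of a single-channel map with a square kernel."""
    k = kernel.shape[0]
    pad = rate * (k // 2)
    xp = np.pad(x, pad)  # zero padding on all sides
    h, w = x.shape
    out = np.zeros_like(x, dtype=float)
    for i in range(k):
        for j in range(k):
            # Tap (i, j) reads the input shifted by (i - k//2, j - k//2) * rate.
            out += kernel[i, j] * xp[i * rate:i * rate + h, j * rate:j * rate + w]
    return out

def aspp(x, kernels, rates):
    """Run parallel atrous branches and fuse them by summation (assumed fusion)."""
    return sum(atrous_conv2d(x, k, r) for k, r in zip(kernels, rates))

def field_of_view(kernel_size, rate):
    """Side length of the input window covered by one dilated kernel."""
    return (kernel_size - 1) * rate + 1

rates = (6, 12, 18, 24)               # dilation rates from the text
kernels = [np.ones((3, 3))] * 4       # placeholder weights for illustration

x = np.zeros((65, 65))
x[32, 32] = 1.0                       # single impulse at the center
fused = aspp(x, kernels, rates)
print([field_of_view(3, r) for r in rates])
```

Feeding an impulse makes the sampling pattern visible: each branch responds at the center and at offsets equal to its own rate, so the four branches look at four nested windows around the same pixel.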
The authors use a few tricks during training. First, data expansion and data augmentation. Second, the hyperparameters follow those of DeepLabV1, with the dilation rate used to control the FOV. Third, LargeFOV refers to the dilated convolution strategy with dilation rate r=12: dilated convolution is applied at fc6 of VGG-16, and fc7 and fc8 are changed into 1×1 convolutions; the result is named DeepLab-LargeFOV. The comparison with mainstream models on the PASCAL VOC 2012 test set follows:
Corresponding improvements (DeepLabV3)
(1) The network becomes deeper. The original ResNet is modified by stacking three copies of block4 after block4, named block5, block6, and block7. The downsampling is adjusted as well: the original network downsamples five times, while the modified one downsamples only four times, with no further downsampling after block4.
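(2) Multi-grid method: different rates are used in block4 to block7. The final rate equals the product of the unit rate and the corresponding rate. For example, with Multi-grid = [1,2,4] and output_stride = 16, the final rate = 2 * [1,2,4] = [2,4,8].
(3) Improved ASPP: as shown in the figure below, a 1×1 convolution and global average pooling are added at the end. When rate = feature map size, the dilated conv degenerates into a 1×1 conv, so this 1×1 conv is equivalent to an atrous convolution with a very large rate. The authors designed the ASPP structure shown in the figure below.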
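The arithmetic in points (1) and (2) can be written down in a few lines. This sketch assumes that each downsampling halves the resolution and that the unit rate is the full stride of 32 divided by the chosen output stride, which reproduces the example in the text; the function names are hypothetical.

```python
def stride_after(num_downsamplings):
    """Overall stride after a number of stride-2 downsampling steps."""
    return 2 ** num_downsamplings

def multi_grid_rates(multi_grid, output_stride, full_stride=32):
    """Final atrous rates for one block: unit rate times each multi-grid entry.
    Assumes unit rate = full_stride / output_stride."""
    unit_rate = full_stride // output_stride
    return [unit_rate * g for g in multi_grid]

print(stride_after(5))                    # original network: five downsamplings
print(stride_after(4))                    # modified network: four downsamplings
print(multi_grid_rates([1, 2, 4], 16))    # example from the text
```

Under the same assumption, output_stride = 8 would give a unit rate of 4 and final rates of [4, 8, 16].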
Combining structures two and three brings no further improvement; by comparison, the ASPP structure works better. So "deeplab v3" generally refers to the ASPP structure, that is, structure three.
The relationship between output stride and the final result is shown below: the larger the output_stride, the worse the result, which shows that preserving spatial resolution is necessary.
The relationship between network depth, Multi-grid, and the results is shown below. Results keep improving as the network gets deeper, but the gains become smaller and smaller. Using multi-grid works better than (1,1,1), simply doubling the unit rates is not very effective, and a deeper network combined with multi-grid improves the results.
The figure below shows the effect of different ASPP settings, along with the effect of multi-grid and image pooling. It also shows that adding too much dilation hurts.
Finally, the comparison between different models is shown below.
DeepLabV3 introduces BN into ASPP, which improves the model. The final model uses image pooling to make up for the information lost through dilation, which also helps performance considerably.
DeepLabV3+(DeepLabV3 plus)
Corresponding improvements
(1) An encoder-decoder is introduced on top of DilatedFCN: a decoder module is added to the network, turning the overall structure into an encoder-decoder. (2) The backbone used by DeepLabv3 is ResNet; the v3+ model tries an improved Xception, which is built mainly on depthwise separable convolution.
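To see why depthwise separable convolution is attractive, it helps to count weights. The sketch below compares a standard convolution with a depthwise convolution followed by a 1×1 pointwise convolution; the channel counts are arbitrary example values and biases are ignored.

```python
def standard_conv_params(kernel_size, c_in, c_out):
    """Weights in a standard convolution: one k x k filter per (input, output) channel pair."""
    return kernel_size * kernel_size * c_in * c_out

def separable_conv_params(kernel_size, c_in, c_out):
    """Weights in a depthwise separable convolution:
    one k x k filter per input channel, then a 1x1 pointwise convolution."""
    depthwise = kernel_size * kernel_size * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

k, c_in, c_out = 3, 256, 256  # example sizes chosen for illustration
standard = standard_conv_params(k, c_in, c_out)
separable = separable_conv_params(k, c_in, c_out)
print(standard, separable, round(standard / separable, 1))
```

For these sizes the separable version needs roughly an order of magnitude fewer weights than the standard one.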
The overall architecture of DeepLabV3+ is shown in the figure below. Its encoder consists mainly of a DCNN module with atrous convolution and an atrous spatial pyramid pooling (ASPP) module. The former can use a classification network such as ResNet or Xception as the backbone, and the latter mainly brings in multi-scale information. The innovation in V3+ is the decoder module, which fuses high-level and low-level features to improve segmentation accuracy.
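In the figure, (a) is the v3 structure (spatial pyramid pooling), (b) is the common encoder-decoder structure, and (c) is the encoder-decoder structure based on deeplab v3 proposed in this paper, as shown below. The features produced by the encoder are first upsampled 4× by bilinear interpolation and then concatenated with low-level features of the same size from the encoder, such as the features of an early ResNet layer. The encoder output has only 256 channels, while the low-level features may have many more, so to keep the high-level encoder features from being drowned out, a 1×1 convolution first reduces the dimensionality of the low-level features. After the two feature sets are concatenated, a 3×3 convolution fuses them further, and a final bilinear interpolation produces a segmentation prediction of the same size as the original image.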
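The channel bookkeeping in the decoder can be sketched with NumPy. This illustration covers only the 1×1 reduction and the concatenation; the bilinear upsampling and the 3×3 fusion convolution are omitted, the tensors are random placeholders, and the spatial size, the low-level channel count, and the 48 reduced channels are assumed values, not taken from the text.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution as a per-pixel linear map over channels.
    x: (C_in, H, W), weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

rng = np.random.default_rng(0)

# Encoder output with 256 channels, assumed to be already upsampled 4x.
encoder_out = rng.normal(size=(256, 32, 32))
# Hypothetical low-level feature map with the same spatial size.
low_level = rng.normal(size=(256, 32, 32))

# Reduce the low-level channels before fusion (48 is an assumed choice).
reduce_w = rng.normal(size=(48, 256))
low_reduced = conv1x1(low_level, reduce_w)

# Concatenate along the channel axis; a 3x3 convolution would follow.
fused = np.concatenate([encoder_out, low_reduced], axis=0)
print(low_reduced.shape, fused.shape)
```

After the reduction the low-level features contribute far fewer channels than the 256-channel encoder output, which is the point of the 1×1 convolution.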
modified Xception
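Because V3+ mainly uses Xception as its backbone, the paper focuses on the changes made to Xception. As shown in the figure below, the network has three parts: entry flow, middle flow, and exit flow. The differences from the original Xception are:
(1) More layers: the middle flow is repeated 16 times, versus 8 in the original Xception.
(2) All max pooling operations are replaced by depthwise separable convolutions with stride=2.
(3) Extra BN and ReLU are added after each 3×3 depthwise convolution, similar to MobileNet. Why adding this ReLU helps is a bit of a mystery.
For implementation details, see the Xception implementation in the source code.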
The experimental results on PASCAL VOC 2012 Test follow. The first figure compares different models and the second compares different backbones; the model with modified Xception as the backbone performs best. I saw the same thing in my own experiments: under identical conditions, the resnet101 backbone reached an MIoU of 0.9 at 780 epochs, while Xception65 got there in 256 epochs and ended up better than resnet. The newer resnest did even better than Xception in my experiments, which is consistent with the resnest paper. I tried four backbones, and the ranking was resnest > Xception > resnext > resnet; feel free to try it yourself. The segmentation quality of deeplab is very good.
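Since the comparison above is stated in terms of MIoU, here is a minimal sketch of how mean intersection-over-union is commonly computed from a confusion matrix. The function name and the tiny label maps are made up for illustration, and skipping classes that appear in neither map is a choice of this sketch.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in the prediction or the target."""
    # Confusion matrix: rows are true classes, columns are predicted classes.
    conf = np.bincount(
        num_classes * target.ravel() + pred.ravel(),
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    valid = union > 0  # ignore classes absent from both maps
    return (intersection[valid] / union[valid]).mean()

target = np.array([[0, 0, 1],
                   [1, 1, 2]])
pred = np.array([[0, 1, 1],
                 [1, 1, 2]])
print(mean_iou(pred, target, num_classes=3))
```

In this toy case the per-class IoUs are 1/2, 3/4, and 1, so one mislabeled pixel out of six already pulls the mean down to 0.75.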
The results on Cityscapes are shown below:
