ResNet in detail: what problem is ResNet actually solving?
2022-08-04 07:18:00 【hot-blooded chef】
Original author's open-source code: https://github.com/KaimingHe/deep-residual-networks
Paper: https://arxiv.org/pdf/1512.03385.pdf
1. The network degradation problem
Before ResNet, mainstream networks such as AlexNet and VGG were built by simply stacking layers, and the clear trend was that deeper networks recognized better. In practice, however, once the depth passes a certain point, accuracy saturates and then degrades rapidly.
2. Causes of network degradation
By the chain rule used in backpropagation, if the gradient factor at each layer lies in (0, 1), the gradient shrinks layer by layer and eventually vanishes; conversely, if the factor passed through each layer is greater than 1, it grows layer by layer and the gradient explodes. So, at first glance, simply stacking more layers seems bound to degrade the network.
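To make this concrete, here is a toy numerical sketch (Python, purely illustrative and not from the original post): the chain rule multiplies per-layer gradient factors, so factors below 1 vanish and factors above 1 blow up as depth grows.

```python
depth = 50
shrinking_factor = 0.9   # per-layer gradient magnitude in (0, 1)
growing_factor = 1.1     # per-layer gradient magnitude greater than 1

print(shrinking_factor ** depth)  # ~5e-3: the gradient all but vanishes
print(growing_factor ** depth)    # ~117: the gradient explodes
```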
Although vanishing/exploding gradients do arise when the network has too many hidden layers, the paper points out that this problem is largely addressed by normalized initialization and intermediate normalization (batch normalization) layers. Network degradation is therefore not caused by vanishing/exploding gradients. What does cause it? Another paper offers an answer: The Shattered Gradients Problem: If resnets are the answer, then what is the question?
The gist is that as networks grow deeper, the gradients propagated back become less and less correlated with one another, eventually approaching white noise. Since images exhibit local correlation, it is reasonable to expect gradients to exhibit similar correlation for the updates to be meaningful; if the gradients approach white noise, the updates may amount to little more than random perturbation.
3. Residual networks
To address the degradation problem, the authors proposed the residual network. The mathematical model of a residual block is shown in the figure below. The biggest difference from earlier networks is the extra identity shortcut branch; because of this branch, during backpropagation the loss can pass its gradient directly to earlier layers through the shortcut, which alleviates the degradation problem.
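As a concrete illustration, a basic residual block of this kind could be written in TF2/Keras roughly as follows; this is a minimal sketch rather than the author's released code, and the BatchNorm/ReLU placement simply follows the common ResNet-v1 pattern.

```python
import tensorflow as tf
from tensorflow.keras import layers

def basic_block(x, filters):
    """Two 3x3 convs plus an identity shortcut: output = ReLU(F(x) + x).

    Assumes x already has `filters` channels so the identity shortcut matches.
    """
    shortcut = x                                              # identity branch
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.add([y, shortcut])                             # F(x) + x
    return layers.ReLU()(y)
```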
When analyzing the causes of degradation in Section 2, we saw that the correlation between gradients matters. Taking gradient correlation as the metric, the authors of that paper analyzed a range of structures and activation functions and found that ResNet is particularly good at preserving it: the correlation decays as $\frac{1}{\sqrt{L}}$ rather than $\frac{1}{\sqrt{2^L}}$. This is quite intuitive from the point of view of gradient flow: part of the gradient is passed back untouched through the shortcut, and that part remains strongly correlated.
Beyond that, the residual connection introduces no new parameters, only one extra addition, and with GPU acceleration this extra computation is essentially negligible.
Note, however, that because a residual block ends with the operation $F(x) + x$, the shapes of $F(x)$ and $x$ must match. In the actual networks a 1x1 convolution can be used to change the number of channels. The structure on the left of the figure above is the one used in ResNet-34, and the bottleneck-like structure on the right is used in ResNet-50/101/152.
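A sketch of the bottleneck variant, with a 1x1 projection on the shortcut when the channel count changes (again TF2/Keras, illustrative rather than the original implementation):

```python
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_block(x, mid_filters, out_filters):
    """1x1 -> 3x3 -> 1x1 with a shortcut; shapes of F(x) and x must match."""
    shortcut = x
    if x.shape[-1] != out_filters:
        # A 1x1 conv changes the channel count so that F(x) + x is well defined
        shortcut = layers.Conv2D(out_filters, 1)(x)
    y = layers.Conv2D(mid_filters, 1, activation="relu")(x)               # reduce channels
    y = layers.Conv2D(mid_filters, 3, padding="same", activation="relu")(y)
    y = layers.Conv2D(out_filters, 1)(y)                                  # restore channels
    return layers.ReLU()(layers.add([y, shortcut]))
```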
The right-hand design also reduces the parameter count substantially. Comparing the two:
- Left: 3x3x256x256 + 3x3x256x256 = 1,179,648 parameters
- Right: 1x1x256x64 + 3x3x64x64 + 1x1x64x256 = 69,632 parameters

So the bottleneck design cuts the parameters of a single residual block by roughly a factor of 17 (more than an order of magnitude; the arithmetic is checked in the snippet below), and the ResNet family of networks is built by stacking these blocks.
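The counts can be verified directly (weights only; biases and BatchNorm parameters ignored):

```python
plain      = 3*3*256*256 + 3*3*256*256             # two 3x3, 256-channel convs
bottleneck = 1*1*256*64 + 3*3*64*64 + 1*1*64*256   # 1x1 -> 3x3 -> 1x1
print(plain, bottleneck, plain / bottleneck)        # 1179648 69632 ~16.9
```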
4. Experimental results
In the recommended configurations shown above, the authors also replaced the fully connected layers with global average pooling. On one hand this reduces the number of parameters; on the other hand, fully connected layers are prone to overfitting and rely heavily on dropout regularization, whereas global average pooling itself acts as a regularizer and helps keep the whole architecture from overfitting. In addition, global average pooling aggregates spatial information, so it is more robust to spatial transformations of the input.
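For intuition, a classification head of this kind might look as follows in TF2/Keras; the 7x7x512 feature-map shape and the 1000 classes are illustrative assumptions, not figures from the original post.

```python
import tensorflow as tf
from tensorflow.keras import layers

features = tf.keras.Input(shape=(7, 7, 512))         # final conv feature map (assumed shape)
pooled = layers.GlobalAveragePooling2D()(features)   # (7, 7, 512) -> (512,), no parameters
logits = layers.Dense(1000)(pooled)                  # 512*1000 + 1000 = 513,000 parameters
head = tf.keras.Model(features, logits)
head.summary()                                       # far smaller than a 7*7*512 -> 4096 dense layer
```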
The final experimental comparison is also very clear: ResNet-34 effectively alleviates the degradation seen in its plain counterpart. Pushing depth further, the authors stacked networks to more than 1000 layers; although such depths are not widely used in industry, they are of real academic significance.
5. Summary
Finally, to answer our question, "What problem is ResNet actually solving?", let us look again at the structure of a Res Block.
Suppose $x = 5$ and the desired mapping is $H(x) = 5.1$.

- With a plain (non-residual) structure, the network has to learn the mapping $F'(5) = 5.1$.
- With a residual structure, $H(x) = F(x) + x$, so the network only has to learn the residual $F(5) = 0.1$.

Here $F'$ and $F$ both denote the mapping realized by the network parameters, and the mapping after introducing the residual is far more sensitive to changes in the output. For example, when the target output moves from 5.1 to 5.2, the output of the plain mapping $F'$ increases by 0.1/5.1 ≈ 2%, whereas for the residual structure the mapping $F$ goes from 0.1 to 0.2, an increase of 100%. The latter's output change clearly exerts a much stronger pull on the weight updates, so training works better. (Adapted from: resnet(残差网络)的F(x)究竟长什么样子?) Subsequent experiments confirmed this hypothesis: residual networks train better than plain networks. The problem ResNet solves, then, is training deep networks more effectively.
Finally, here is a ResNet implementation using Keras and TF2.
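The author's own implementation is not reproduced here; as a stand-in, a reference ResNet-50 can be instantiated directly from tf.keras (assuming TensorFlow 2 is installed):

```python
import tensorflow as tf

# Randomly initialized ResNet-50 head-to-toe for 1000 classes;
# pass weights="imagenet" to load pretrained weights instead.
model = tf.keras.applications.ResNet50(weights=None, classes=1000)
model.summary()
```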