2018-UperNet ECCV
2022-07-29 10:16:00 【Talk about it】
Paper title: Unified Perceptual Parsing for Scene Understanding
Paper link: https://arxiv.org/abs/1807.10221
Code: https://github.com/CSAILVision/unifiedparsing
Authors: Megvii Technology (旷视科技)
1. Introduction
1.1 Overview
Humans often recognize objects by observing them from multiple angles and at multiple levels, combining cues such as an object's shape, its texture, the context it appears in, and what parts it contains. Take a window: it is made of glass, it sits in a wall, and its shape is rectangular. Putting this pile of observations together, we conclude: ah, this is a window.
In the computer vision world there are people working on scene parsing, material recognition, object detection, semantic segmentation, and so on, but little research integrates these tasks into a single model, that is, into a multi-task setting.
Moreover, multi-task learning has few datasets to draw on, and they are hard to build, because the labels of different tasks are heterogeneous. For example, the ADE20K dataset used for scene parsing is annotated entirely at the pixel level, while the annotations of the Describable Textures Dataset (DTD) are all image-level. This heterogeneity has been the bottleneck for building such datasets.
1.2 A new dataset
To address the lack of multi-task datasets, the authors build the Broadly and Densely Labeled Dataset (Broden), which unifies ADE20K, Pascal-Context, Pascal-Part, OpenSurfaces, and the Describable Textures Dataset (DTD). Together these datasets cover a variety of scenes, objects, object parts, and materials. The authors then handle class imbalance by removing classes that appear in fewer than 50 images and classes covering fewer than 50,000 pixels. In total, the resulting multi-task dataset contains 62,262 images.
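As a rough sketch of this filtering step (a minimal illustration only; image_counts and pixel_counts are hypothetical precomputed statistics, not part of the released tooling):

# Minimal sketch of the class-balancing rules described above.
# image_counts / pixel_counts are hypothetical dicts mapping each class
# label to the number of images it appears in and its total pixel count
# across the merged dataset.
def filter_classes(image_counts, pixel_counts, min_images=50, min_pixels=50_000):
    """Keep classes appearing in >= min_images images with >= min_pixels pixels."""
    return {
        label for label in image_counts
        if image_counts[label] >= min_images
        and pixel_counts.get(label, 0) >= min_pixels
    }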

2. Network
2.1 Overall framework
UPerNet's overall design is based on FPN (Feature Pyramid Network) and PPM (Pyramid Pooling Module), as shown in the architecture figure of the paper.

The authors design a different head for each task:
- For the scene parsing task: scene-category annotations are image-level, so no upsampling is needed. The output of the PPM head is simply followed by a convolution, a pooling layer, and a linear classifier.
- For the object and object-part segmentation tasks, i.e., semantic segmentation: UPerNet fuses the features from every FPN level and feeds the fused features into two heads of identical structure, one segmenting objects and the other segmenting object parts.
- For the material task, i.e., material detection: prediction is made on the fused FPN output, because context is also very important for materials. Take a glass cup: a priori, glass cups usually sit on tables, so a model that can exploit the contextual cue in the image (the cup is on the table) detects the glass better than one without contextual semantics.
- For the texture task, i.e., texture detection: the head is specially designed, because stacking in extra information from other levels, or fusing this task with the other detection heads, actually hurts texture detection. The lowest-level FPN features are therefore fed directly into the texture head, which appends 4 extra convolution layers of 128 channels each, and gradients from this head are not backpropagated into the rest of the network, to avoid interfering with the other tasks. The reasoning is twofold. First, texture is the lowest level of semantic information, recognizable at a glance, so there is no need to fuse in high-level semantics. Second, while training on the other tasks, the model already picks up texture implicitly; after all, objects of the same class tend to have homogeneous textures, and every object class comes with its characteristic texture. A minimal sketch of the scene and texture heads follows this list.
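To make these head designs concrete, here is a minimal PyTorch sketch of the scene and texture heads. This is my reading of the paper, not the authors' released code; the 512-channel width of the scene head is an assumption.

import torch.nn as nn

class SceneHead(nn.Module):
    # Image-level scene classification: conv -> global pool -> linear,
    # applied to the PPM head output; no upsampling is needed.
    def __init__(self, in_channels, num_scenes):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)  # 512 is an assumed width
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(512, num_scenes)

    def forward(self, x):
        return self.fc(self.pool(self.conv(x)).flatten(1))

class TextureHead(nn.Module):
    # Texture head on the lowest-level features: four extra 128-channel
    # convs; the input is detached so texture gradients never reach the
    # shared backbone and interfere with the other tasks.
    def __init__(self, in_channels, num_textures):
        super().__init__()
        layers, c = [], in_channels
        for _ in range(4):
            layers += [nn.Conv2d(c, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
            c = 128
        self.convs = nn.Sequential(*layers)
        self.classifier = nn.Conv2d(128, num_textures, kernel_size=1)

    def forward(self, x):
        return self.classifier(self.convs(x.detach()))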
2.2 Semantic segmentation head
Since I work on semantic segmentation myself, I only looked closely at the semantic segmentation head. Its two building blocks:
- PPM head: the Pyramid Pooling Module from PSPNet (https://arxiv.org/abs/1612.01105), CVPR 2017.
- FPN: the Feature Pyramid Network, by Kaiming He and colleagues, 2017.

3. Code
import torch
import torch.nn as nn
import torch.nn.functional as F
class BasicBlock(nn.Module):
    # Standard two-conv residual block (used by resnet18/34); it does not
    # expand the channel count, so expansion is 1 (Bottleneck uses 4).
    expansion: int = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
                 base_width=64, dilation=1, norm_layer=None):
        super(BasicBlock, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        if groups != 1 or base_width != 64:
            raise ValueError("BasicBlock only supports groups=1 and base_width=64")
        if dilation > 1:
            raise NotImplementedError("Dilation > 1 not supported in BasicBlock")
        # Both self.conv1 and self.downsample layers downsample the input when stride != 1
        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=3, stride=stride,
                               padding=1, bias=False)
        self.bn1 = norm_layer(planes)
        self.relu = nn.ReLU(inplace=True)
        # The second conv always keeps stride 1; only conv1 downsamples.
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=1,
                               padding=1, bias=False)
        self.bn2 = norm_layer(planes)
        self.downsample = downsample
        self.stride = stride
    def forward(self, x):
        identity = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        if self.downsample is not None:
            identity = self.downsample(x)
        out += identity
        out = self.relu(out)
        return out
class Bottleneck(nn.Module):
    # 1x1 -> 3x3 -> 1x1 residual block (used by resnet50/101/152).
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None,
                 groups=1, base_width=64, dilation=1, norm_layer=None):
        super(Bottleneck, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        width = int(planes * (base_width / 64.0)) * groups
        # Both self.conv2 and self.downsample layers downsample the input when stride != 1
        self.conv1 = nn.Conv2d(inplanes, width, kernel_size=1, stride=1, bias=False)
        self.bn1 = norm_layer(width)
        self.conv2 = nn.Conv2d(width, width, kernel_size=3, stride=stride, bias=False,
                               padding=dilation, dilation=dilation, groups=groups)
        self.bn2 = norm_layer(width)
        self.conv3 = nn.Conv2d(width, planes * self.expansion, kernel_size=1, stride=1, bias=False)
        self.bn3 = norm_layer(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride
    def forward(self, x):
        identity = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)
        out = self.conv3(out)
        out = self.bn3(out)
        if self.downsample is not None:
            identity = self.downsample(x)
        out += identity
        out = self.relu(out)
        return out
class ResNet(nn.Module):
    def __init__(
            self, block, layers, num_classes=1000, zero_init_residual=False, groups=1,
            width_per_group=64, replace_stride_with_dilation=None, norm_layer=None):
        super(ResNet, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        self._norm_layer = norm_layer
        self.inplanes = 64
        self.dilation = 1
        if replace_stride_with_dilation is None:
            # each element in the tuple indicates if we should replace
            # the 2x2 stride with a dilated convolution instead
            replace_stride_with_dilation = [False, False, False]
        if len(replace_stride_with_dilation) != 3:
            raise ValueError(
                "replace_stride_with_dilation should be None "
                f"or a 3-element tuple, got {replace_stride_with_dilation}"
            )
        self.groups = groups
        self.base_width = width_per_group
        # Stem: 7x7 stride-2 conv + 3x3 stride-2 max pool -> 1/4 resolution
        self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = norm_layer(self.inplanes)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2, dilate=replace_stride_with_dilation[0])
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2, dilate=replace_stride_with_dilation[1])
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2, dilate=replace_stride_with_dilation[2])
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
        # Zero-initialize the last BN in each residual branch,
        # so that the residual branch starts with zeros, and each residual block behaves like an identity.
        # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
        if zero_init_residual:
            for m in self.modules():
                if isinstance(m, Bottleneck):
                    nn.init.constant_(m.bn3.weight, 0)
                elif isinstance(m, BasicBlock):
                    nn.init.constant_(m.bn2.weight, 0)
    def _make_layer(self, block, planes, blocks, stride=1, dilate=False):
        norm_layer = self._norm_layer
        downsample = None
        previous_dilation = self.dilation
        if dilate:
            # Replace the stride with dilation: grow the dilation rate and
            # keep the spatial resolution of this stage unchanged.
            self.dilation *= stride
            stride = 1
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(self.inplanes, planes * block.expansion, kernel_size=1, stride=stride, bias=False),
                norm_layer(planes * block.expansion))
        layers = []
        layers.append(
            block(
                self.inplanes, planes, stride, downsample, self.groups, self.base_width, previous_dilation, norm_layer
            )
        )
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(
                block(
                    self.inplanes,
                    planes,
                    groups=self.groups,
                    base_width=self.base_width,
                    dilation=self.dilation,
                    norm_layer=norm_layer,
                )
            )
        return nn.Sequential(*layers)
    def _forward_impl(self, x):
        # Collect the outputs of the four stages: 1/4, 1/8, 1/16, 1/32 resolution.
        out = []
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        x = self.layer1(x)
        out.append(x)
        x = self.layer2(x)
        out.append(x)
        x = self.layer3(x)
        out.append(x)
        x = self.layer4(x)
        out.append(x)
        return out

    def forward(self, x):
        return self._forward_impl(x)
def _resnet(block, layers, pretrained_path=None, **kwargs):
    m = ResNet(block, layers, **kwargs)
    if pretrained_path is not None:
        # strict=False tolerates checkpoint keys that do not match,
        # e.g. the classification fc head that the decoder never uses.
        m.load_state_dict(torch.load(pretrained_path), strict=False)
    return m


def resnet50(pretrained_path=None, **kwargs):
    return _resnet(Bottleneck, [3, 4, 6, 3], pretrained_path, **kwargs)


def resnet101(pretrained_path=None, **kwargs):
    return _resnet(Bottleneck, [3, 4, 23, 3], pretrained_path, **kwargs)
class PPM(nn.ModuleList):
    """Pyramid Pooling Module from PSPNet (https://arxiv.org/abs/1612.01105, CVPR 2017).

    Pools the input at several grid sizes, projects each pooled map to
    out_channels with a 1x1 conv, and upsamples every branch back to the
    input size. Note: this implementation uses adaptive max pooling,
    whereas the original PSPNet uses adaptive average pooling.
    """

    def __init__(self, pool_sizes, in_channels, out_channels):
        super(PPM, self).__init__()
        self.pool_sizes = pool_sizes
        self.in_channels = in_channels
        self.out_channels = out_channels
        for pool_size in pool_sizes:
            self.append(
                nn.Sequential(
                    nn.AdaptiveMaxPool2d(pool_size),
                    nn.Conv2d(self.in_channels, self.out_channels, kernel_size=1),
                )
            )

    def forward(self, x):
        outputs = []
        for ppm in self:
            ppm_out = nn.functional.interpolate(ppm(x), size=(x.size(2), x.size(3)),
                                                mode='bilinear', align_corners=True)
            outputs.append(ppm_out)
        return outputs
class PPMHEAD(nn.Module):
    # PPM head: concatenate the input with the pooled branches, then reduce
    # the channels with a 1x1 conv + BN + ReLU.
    def __init__(self, in_channels, out_channels, pool_sizes=[1, 2, 3, 6], num_classes=31):
        super(PPMHEAD, self).__init__()
        self.pool_sizes = pool_sizes
        self.num_classes = num_classes
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.psp_modules = PPM(self.pool_sizes, self.in_channels, self.out_channels)
        self.final = nn.Sequential(
            nn.Conv2d(self.in_channels + len(self.pool_sizes) * self.out_channels,
                      4 * self.out_channels, kernel_size=1),
            nn.BatchNorm2d(4 * self.out_channels),
            nn.ReLU(),
        )

    def forward(self, x):
        out = self.psp_modules(x)
        out.append(x)
        out = torch.cat(out, 1)
        out = self.final(out)
        return out
class FPNHEAD(nn.Module):
    # Top-down decoder: PPM on the 1/32 map, then repeatedly upsample,
    # concatenate a lateral feature, and reduce the channels again.
    def __init__(self, channels=2048):
        super(FPNHEAD, self).__init__()
        self.PPMHead = PPMHEAD(in_channels=channels, out_channels=channels // 4)
        self.Conv_fuse1 = nn.Sequential(
            nn.Conv2d(channels // 2, channels // 2, 1),
            nn.BatchNorm2d(channels // 2),
            nn.ReLU()
        )
        self.Conv_fuse1_ = nn.Sequential(
            nn.Conv2d(channels // 2 + channels, channels // 2, 1),
            nn.BatchNorm2d(channels // 2),
            nn.ReLU()
        )
        self.Conv_fuse2 = nn.Sequential(
            nn.Conv2d(channels // 4, channels // 4, 1),
            nn.BatchNorm2d(channels // 4),
            nn.ReLU()
        )
        self.Conv_fuse2_ = nn.Sequential(
            nn.Conv2d(channels // 2 + channels // 4, channels // 4, 1),
            nn.BatchNorm2d(channels // 4),
            nn.ReLU()
        )
        self.Conv_fuse3 = nn.Sequential(
            nn.Conv2d(channels // 8, channels // 8, 1),
            nn.BatchNorm2d(channels // 8),
            nn.ReLU()
        )
        self.Conv_fuse3_ = nn.Sequential(
            nn.Conv2d(channels // 4 + channels // 8, channels // 8, 1),
            nn.BatchNorm2d(channels // 8),
            nn.ReLU()
        )
        self.fuse_all = nn.Sequential(
            nn.Conv2d(channels * 2 - channels // 8, channels // 4, 1),
            nn.BatchNorm2d(channels // 4),
            nn.ReLU()
        )
    def forward(self, input_fpn):
        """input_fpn: the four backbone feature maps, ordered 1/4, 1/8, 1/16, 1/32."""
        # 1/32 feature map through the PPM head: torch.Size([1, 2048, 7, 7])
        x1 = self.PPMHead(input_fpn[-1])

        # Upsample: [1, 2048, 7, 7] -> [1, 2048, 14, 14]
        x = nn.functional.interpolate(x1,
                                      size=(x1.size(2) * 2, x1.size(3) * 2),
                                      mode='bilinear',
                                      align_corners=True)
        # Fuse with the 1/16 map by channel concatenation:
        # [1, 1024, 14, 14] + [1, 2048, 14, 14] -> [1, 3072, 14, 14]
        x = torch.cat([x, self.Conv_fuse1(input_fpn[-2])], dim=1)

        # Reduce channels: [1, 3072, 14, 14] -> [1, 1024, 14, 14]
        x2 = self.Conv_fuse1_(x)
        # Upsample to [1, 1024, 28, 28]
        x = nn.functional.interpolate(x2,
                                      size=(x2.size(2) * 2, x2.size(3) * 2),
                                      mode='bilinear',
                                      align_corners=True)
        # Fuse with the 1/8 map:
        # [1, 512, 28, 28] + [1, 1024, 28, 28] -> [1, 1536, 28, 28]
        x = torch.cat([x, self.Conv_fuse2(input_fpn[-3])], dim=1)

        # Reduce channels: [1, 1536, 28, 28] -> [1, 512, 28, 28]
        x3 = self.Conv_fuse2_(x)
        # Upsample 1/8 -> 1/4: [1, 512, 28, 28] -> [1, 512, 56, 56]
        x = nn.functional.interpolate(x3,
                                      size=(x3.size(2) * 2, x3.size(3) * 2),
                                      mode='bilinear',
                                      align_corners=True)
        # Fuse with the 1/4 map -> [1, 768, 56, 56]
        x = torch.cat([x, self.Conv_fuse3(input_fpn[-4])], dim=1)

        # Reduce channels: [1, 768, 56, 56] -> [1, 256, 56, 56]
        x4 = self.Conv_fuse3_(x)

        # Bring every level up to 1/4 resolution and fuse them all:
        # x1: [1, 2048, 56, 56], x2: [1, 1024, 56, 56],
        # x3: [1, 512, 56, 56],  x4: [1, 256, 56, 56]
        x1 = F.interpolate(x1, x4.size()[-2:], mode='bilinear', align_corners=True)
        x2 = F.interpolate(x2, x4.size()[-2:], mode='bilinear', align_corners=True)
        x3 = F.interpolate(x3, x4.size()[-2:], mode='bilinear', align_corners=True)
        x = self.fuse_all(torch.cat([x1, x2, x3, x4], 1))
        return x
class UPerNet(nn.Module):
    def __init__(self, num_classes):
        super(UPerNet, self).__init__()
        self.num_classes = num_classes
        # The encoder can be any backbone that returns four feature maps at
        # 1/4, 1/8, 1/16, 1/32 resolution with 256/512/1024/2048 channels,
        # e.g. a ResNet or a transformer encoder such as PVT. Dilation is
        # left off so the strides stay as the decoder's shape comments assume.
        self.backbone = resnet50()
        self.in_channels = 2048
        self.channels = 512
        self.decoder = FPNHEAD()
        # Segmentation head
        self.cls_seg = nn.Sequential(
            nn.Conv2d(512, self.num_classes, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # Input: [1, 3, 224, 224]. The backbone returns four feature maps:
        # torch.Size([1, 256, 56, 56])   (1/4)
        # torch.Size([1, 512, 28, 28])   (1/8)
        # torch.Size([1, 1024, 14, 14])  (1/16)
        # torch.Size([1, 2048, 7, 7])    (1/32)
        x = self.backbone(x)
        # The decoder returns a 1/4-resolution map: torch.Size([1, 512, 56, 56])
        x = self.decoder(x)
        # Bilinearly upsample 4x, back to the input resolution
        x = nn.functional.interpolate(x, size=(x.size(2) * 4, x.size(3) * 4),
                                      mode='bilinear', align_corners=True)
        x = self.cls_seg(x)
        return x
if __name__ == '__main__':
    x = torch.randn(1, 3, 224, 224)
    model = UPerNet(num_classes=19)
    y = model(x)
    print(y.shape)  # torch.Size([1, 19, 224, 224])
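Since the decoder only assumes four backbone outputs with 256/512/1024/2048 channels at 1/4 to 1/32 resolution, swapping the encoder is a one-line change. A minimal sketch using the resnet101 constructor defined above:

# Minimal sketch: drop ResNet-101 in place of ResNet-50. Any encoder that
# returns four feature maps with the same channel layout works here.
model = UPerNet(num_classes=19)
model.backbone = resnet101()
y = model(torch.randn(1, 3, 224, 224))
print(y.shape)  # torch.Size([1, 19, 224, 224])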