Hands on deep learning (32) -- Fully Convolutional Networks (FCN)
1. What is a fully convolutional network (Fully Convolutional Network, FCN)?
A CNN is usually paired with fully connected layers, which map the feature maps produced by the convolutional layers to a fixed-length feature vector. Take AlexNet as an example: because image classification expects a single numerical description (class probabilities) of the whole input image, AlexNet's ImageNet model outputs a 1000-dimensional vector representing the probability that the input image belongs to each class (after softmax normalization).
For example, as the figure below shows, feeding an image into AlexNet produces a vector of length 1×1000, and from this vector the category of the image is judged to be "cat".
A conventional CNN used this way has several problems:
- High storage cost
  - The sliding windows are large, and each window needs storage space for its features and its predicted category
  - With a fully connected structure, the last few layers require nearly exponential amounts of storage
- Low computational efficiency: neighboring windows overlap, so there is a great deal of repeated computation
- The sliding windows are independent of one another, and the fully connected layers at the end only ever act on local features
To address these issues, replacing the fully connected layers in the model with convolutional layers alleviates them to some extent. A network composed entirely of convolutional layers in this way is called a fully convolutional network (FCN).
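To make the replacement concrete, here is a minimal sketch (not from the original) showing that a fully connected layer acting on a 1×1 feature map is equivalent to a 1×1 convolution whose kernel is the reshaped weight matrix:

```python
import torch
from torch import nn

# A fully connected layer and a 1x1 convolution sharing the same (reshaped) weights
fc = nn.Linear(512, 1000)
conv = nn.Conv2d(512, 1000, kernel_size=1)
conv.weight.data.copy_(fc.weight.data.reshape(1000, 512, 1, 1))
conv.bias.data.copy_(fc.bias.data)

x = torch.rand(1, 512, 1, 1)  # a 512-channel, 1x1 feature map
# Both produce the same 1000-dimensional output
print(torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-6))  # True
```

Unlike the fully connected layer, the convolution also accepts inputs larger than 1×1 and produces one prediction per spatial location, which is exactly what dense prediction needs.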
2. FCN: a foundational work in semantic segmentation
The idea is to replace the CNN's final fully connected layer with a transposed convolution layer, so that a prediction is made for every pixel, achieving semantic segmentation.

3. Semantic segmentation with FCN
%matplotlib inline
import torch
import torchvision
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l
import os
3.1 Model building
# Use a ResNet-18 pretrained on the ImageNet dataset to extract image features,
# and denote the network instance as pretrained_net.
# Note that the last layers of ResNet-18 are a global average pooling layer and
# a fully connected layer, which are not needed in an FCN.
pretrained_net = torchvision.models.resnet18(pretrained=True)
list(pretrained_net.children())[-3:]
[Sequential(
   (0): BasicBlock(
     (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (relu): ReLU(inplace=True)
     (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (downsample): Sequential(
       (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
       (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     )
   )
   (1): BasicBlock(
     (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (relu): ReLU(inplace=True)
     (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
     (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   )
 ),
 AdaptiveAvgPool2d(output_size=(1, 1)),
 Linear(in_features=512, out_features=1000, bias=True)]
# Create a new network instance from pretrained_net, removing the parts the FCN does not need
net = nn.Sequential(*list(pretrained_net.children())[:-2])
# Given an input of height and width (320, 480), the forward pass of net reduces
# the input height and width to 1/32 of the original, i.e., (10, 15)
X = torch.rand(size=(1,3,320,480))
net(X).shape
torch.Size([1, 512, 10, 15])
# Use a 1x1 convolution layer to convert the number of output channels into the
# number of classes in the Pascal VOC2012 dataset (21 classes).
# Choosing 21 output channels here keeps the computation of the subsequent
# transposed convolution layer as small as possible.
num_classes = 21
net.add_module('final_conv', nn.Conv2d(512, num_classes, kernel_size=1))
Finally, we use a transposed convolution layer to increase the height and width of the feature maps by a factor of 32, restoring them to the height and width of the input image.
Since $(320-64+16\times 2+32)/32=10$ and $(480-64+16\times 2+32)/32=15$, we construct a transposed convolution layer with stride $32$, setting the height and width of the convolution kernel to $64$ and the padding to $16$.
In general, if the stride is $s$, the padding is $s/2$ (assuming $s/2$ is an integer), and the height and width of the convolution kernel are $2s$, then the transposed convolution kernel enlarges the input height and width by a factor of $s$, as the short derivation below shows.
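This follows from the output-size formula for transposed convolution (with dilation $1$ and no output padding), $H_\text{out} = (H_\text{in}-1)\,s - 2p + k$. Substituting $p = s/2$ and $k = 2s$ gives

$$H_\text{out} = (H_\text{in}-1)\,s - s + 2s = H_\text{in}\,s.$$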
net.add_module('transpose_conv', nn.ConvTranspose2d(num_classes, num_classes, kernel_size=64, padding=16, stride=32))
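As a quick sanity check (a sketch, not in the original), the full network should now map a 320×480 input to per-pixel class scores of the same height and width:

```python
X = torch.rand(size=(1, 3, 320, 480))
print(net(X).shape)  # expected: torch.Size([1, 21, 320, 480])
```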
3.2 Initializing the transposed convolution layer
In image processing we sometimes need to enlarge an image, i.e., perform upsampling. Bilinear interpolation is one of the commonly used upsampling methods, and it is also often used to initialize transposed convolution layers. To explain bilinear interpolation, suppose that given an input image we want to compute each pixel of the upsampled output image:
- First, map the coordinates $(x, y)$ of the output image to coordinates $(x', y')$ on the input image, for example according to the ratio of the input size to the output size. Note that the mapped $x'$ and $y'$ are real numbers.
- Then, find the 4 pixels on the input image closest to the coordinates $(x', y')$.
- Finally, the pixel of the output image at coordinates $(x, y)$ is computed from these 4 input pixels and their relative distances to $(x', y')$, as in the formula below.
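Concretely, writing $x' = x_0 + \Delta x$ and $y' = y_0 + \Delta y$ with integers $x_0, y_0$ and fractional parts $\Delta x, \Delta y \in [0, 1)$, the interpolated value is

$$f(x', y') = (1-\Delta x)(1-\Delta y)\,f(x_0, y_0) + \Delta x\,(1-\Delta y)\,f(x_0+1, y_0) + (1-\Delta x)\,\Delta y\,f(x_0, y_0+1) + \Delta x\,\Delta y\,f(x_0+1, y_0+1).$$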
Upsampling by bilinear interpolation can be implemented with a transposed convolution layer whose kernel is constructed by the bilinear_kernel function below. Owing to space constraints, we only give the implementation of bilinear_kernel and do not discuss the principle of the algorithm.
def bilinear_kernel(in_channels, out_channels, kernel_size):
    factor = (kernel_size + 1) // 2
    if kernel_size % 2 == 1:
        center = factor - 1
    else:
        center = factor - 0.5
    og = (torch.arange(kernel_size).reshape(-1, 1),
          torch.arange(kernel_size).reshape(1, -1))
    filt = (1 - torch.abs(og[0] - center) / factor) * \
           (1 - torch.abs(og[1] - center) / factor)
    weight = torch.zeros(
        (in_channels, out_channels, kernel_size, kernel_size))
    weight[range(in_channels), range(out_channels), :, :] = filt
    return weight
# Here we use a transposed convolution layer to implement bilinear interpolation:
# construct a layer that doubles the input height and width, and initialize its
# kernel with the bilinear_kernel function.
conv_trans = nn.ConvTranspose2d(3, 3, kernel_size=4, padding=1, stride=2, bias=False)
conv_trans.weight.data.copy_(bilinear_kernel(3, 3, 4))
# Read the image X and denote the result of upsampling as Y; to display the
# image we need to adjust the dimension order.
img = torchvision.transforms.ToTensor()(d2l.Image.open('./course_file/pytorch/img/catdog.jpg'))
X = img.unsqueeze(0)
Y = conv_trans(X)
out_img = Y[0].permute(1, 2, 0).detach()
d2l.set_figsize()
print('input image shape:', img.permute(1, 2, 0).shape)
d2l.plt.imshow(img.permute(1, 2, 0))
input image shape: torch.Size([561, 728, 3])

print('output image shape:', out_img.shape)
d2l.plt.imshow(out_img);
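As a side check (a sketch, not in the original), we can compare against PyTorch's built-in bilinear upsampling; the two methods handle image borders slightly differently, so we only expect the results to be close rather than identical:

```python
Y_ref = F.interpolate(X, scale_factor=2, mode='bilinear', align_corners=False)
print((Y - Y_ref).abs().max())  # differences concentrate near the image borders
```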

As we can see, the transposed convolution layer enlarges both the height and the width of the image by a factor of 2. Apart from the different coordinate scales, the image enlarged by bilinear interpolation looks no different from the original. Therefore, in the fully convolutional network we initialize the transposed convolution layer with bilinear-interpolation upsampling. For the $1\times 1$ convolution layer, we use Xavier initialization.
W = bilinear_kernel(num_classes, num_classes, 64)
net.transpose_conv.weight.data.copy_(W);
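The Xavier initialization of the $1\times 1$ layer is not shown above; a minimal sketch of one way to do it in PyTorch (assuming the final_conv module added earlier):

```python
nn.init.xavier_uniform_(net.final_conv.weight)
```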
3.3 Reading the dataset
We read the dataset using the method introduced in https://blog.csdn.net/jerry_liufeng/article/details/120820270.
We specify that the shape of the randomly cropped output images is $320\times 480$: both the height and the width are divisible by $32$.
#@save
def read_voc_images(voc_dir, is_train=True):
    """Read all VOC feature and label images."""
    txt_fname = os.path.join(voc_dir, 'ImageSets', 'Segmentation',
                             'train.txt' if is_train else 'val.txt')
    mode = torchvision.io.image.ImageReadMode.RGB
    with open(txt_fname, 'r') as f:
        images = f.read().split()
    features, labels = [], []
    for i, fname in enumerate(images):
        features.append(
            torchvision.io.read_image(
                os.path.join(voc_dir, 'JPEGImages', f'{fname}.jpg')))
        # Semantic segmentation labels every pixel, so the labels are better
        # stored as uncompressed .png files
        labels.append(
            torchvision.io.read_image(
                os.path.join(voc_dir, 'SegmentationClass', f'{fname}.png'),
                mode))
    return features, labels
#@save
VOC_COLORMAP = [[0, 0, 0], [128, 0, 0], [0, 128, 0], [128, 128, 0],
                [0, 0, 128], [128, 0, 128], [0, 128, 128], [128, 128, 128],
                [64, 0, 0], [192, 0, 0], [64, 128, 0], [192, 128, 0],
                [64, 0, 128], [192, 0, 128], [64, 128, 128], [192, 128, 128],
                [0, 64, 0], [128, 64, 0], [0, 192, 0], [128, 192, 0],
                [0, 64, 128]]

#@save
VOC_CLASSES = [
    'background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus',
    'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike',
    'person', 'potted plant', 'sheep', 'sofa', 'train', 'tv/monitor']
""" Defining a function will RGB Color column and category index are mapped """
#@save
def voc_colormap2label():
    """Build the mapping from RGB colors to VOC class indices."""
    colormap2label = torch.zeros(256**3, dtype=torch.long)
    for i, colormap in enumerate(VOC_COLORMAP):
        colormap2label[(colormap[0] * 256 + colormap[1]) * 256 +
                       colormap[2]] = i
    return colormap2label
#@save
def voc_label_indices(colormap, colormap2label):
    """Map the RGB values in a VOC label image to their class indices."""
    colormap = colormap.permute(1, 2, 0).numpy().astype('int32')
    idx = ((colormap[:, :, 0] * 256 + colormap[:, :, 1]) * 256 +
           colormap[:, :, 2])
    return colormap2label[idx]
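A quick check (a sketch, not in the original) that the mapping works as expected: in VOC_COLORMAP the color [128, 0, 0] belongs to class 1, 'aeroplane':

```python
cm2l = voc_colormap2label()
print(cm2l[(128 * 256 + 0) * 256 + 0])  # tensor(1), i.e. VOC_CLASSES[1] == 'aeroplane'
```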
#@save
def voc_rand_crop(feature, label, height, width):
    """Randomly crop both the feature and the label image."""
    rect = torchvision.transforms.RandomCrop.get_params(
        feature, (height, width))
    feature = torchvision.transforms.functional.crop(feature, *rect)
    label = torchvision.transforms.functional.crop(label, *rect)
    return feature, label
#@save
class VOCSegDataset(torch.utils.data.Dataset):
    """A custom dataset for loading the VOC dataset."""

    def __init__(self, is_train, crop_size, voc_dir):
        self.transform = torchvision.transforms.Normalize(
            mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        self.crop_size = crop_size
        features, labels = read_voc_images(voc_dir, is_train=is_train)
        self.features = [
            self.normalize_image(feature)
            for feature in self.filter(features)]
        self.labels = self.filter(labels)
        self.colormap2label = voc_colormap2label()
        print('read ' + str(len(self.features)) + ' examples')

    def normalize_image(self, img):
        return self.transform(img.float())

    def filter(self, imgs):
        # Keep only the images that are at least as large as the crop size
        return [
            img for img in imgs if (img.shape[1] >= self.crop_size[0] and
                                    img.shape[2] >= self.crop_size[1])]

    def __getitem__(self, idx):
        feature, label = voc_rand_crop(self.features[idx], self.labels[idx],
                                       *self.crop_size)
        return (feature, voc_label_indices(label, self.colormap2label))

    def __len__(self):
        return len(self.features)
#@save
def load_data_voc(batch_size, crop_size):
    """Load the VOC semantic segmentation dataset."""
    # voc_dir = d2l.download_extract('voc2012',
    #                                os.path.join('VOCdevkit', 'VOC2012'))
    voc_dir = os.path.join("../data/VOCdevkit/VOC2012/")
    num_workers = d2l.get_dataloader_workers()
    train_iter = torch.utils.data.DataLoader(
        VOCSegDataset(True, crop_size, voc_dir), batch_size, shuffle=True,
        drop_last=True, num_workers=num_workers)
    test_iter = torch.utils.data.DataLoader(
        VOCSegDataset(False, crop_size, voc_dir), batch_size, drop_last=True,
        num_workers=num_workers)
    return train_iter, test_iter
batch_size, crop_size = 24, (320, 480)
train_iter, test_iter = load_data_voc(batch_size, crop_size)
read 1114 examples
read 1078 examples
3.4 Training
def loss(inputs, targets):
    # Per-pixel cross-entropy, averaged over the height and width dimensions
    # so that each example keeps its own loss value
    return F.cross_entropy(inputs, targets, reduction='none').mean(1).mean(1)
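Here the prediction has shape (batch, 21, h, w) and the target (batch, h, w); with reduction='none' the cross-entropy is returned per pixel, and the two .mean(1) calls reduce over the height and then the width, leaving one loss per example. A minimal shape check with random tensors (a sketch, not in the original):

```python
pred = torch.rand(2, 21, 320, 480)            # (batch, classes, h, w)
target = torch.randint(0, 21, (2, 320, 480))  # (batch, h, w)
print(loss(pred, target).shape)               # torch.Size([2])
```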
num_epochs, lr, wd, devices = 5, 0.001, 1e-3, d2l.try_all_gpus()
trainer = torch.optim.SGD(net.parameters(), lr=lr, weight_decay=wd)
# Training and evaluation on multiple GPUs
def train_batch(net, X, y, loss, trainer, devices):
    if isinstance(X, list):
        # Needed when fine-tuning BERT (discussed later)
        X = [x.to(devices[0]) for x in X]
    else:
        X = X.to(devices[0])
    y = y.to(devices[0])
    net.train()
    trainer.zero_grad()
    pred = net(X)
    l = loss(pred, y)
    l.sum().backward()
    trainer.step()
    train_loss_sum = l.sum()
    train_acc_sum = d2l.accuracy(pred, y)
    return train_loss_sum, train_acc_sum
def train(net, train_iter, test_iter, loss, trainer, num_epochs,
          devices=d2l.try_all_gpus()):
    timer, num_batches = d2l.Timer(), len(train_iter)
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs], ylim=[0, 1],
                            legend=['train loss', 'train acc', 'test acc'])
    net = nn.DataParallel(net, device_ids=devices).to(devices[0])  # run on multiple GPUs
    for epoch in range(num_epochs):
        # Four accumulators: training loss, training accuracy,
        # number of examples, number of labeled pixels
        metric = d2l.Accumulator(4)
        for i, (features, labels) in enumerate(train_iter):
            timer.start()
            l, acc = train_batch(net, features, labels, loss, trainer,
                                 devices)
            metric.add(l, acc, labels.shape[0], labels.numel())
            timer.stop()
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(
                    epoch + (i + 1) / num_batches,
                    (metric[0] / metric[2], metric[1] / metric[3], None))
        print(metric[0] / metric[2])
        test_acc = d2l.evaluate_accuracy_gpu(net, test_iter)
        animator.add(epoch + 1, (None, None, test_acc))
    print(f'loss {metric[0] / metric[2]:.3f}, train acc '
          f'{metric[1] / metric[3]:.3f}, test acc {test_acc:.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec on '
          f'{str(devices)}')
train(net, train_iter, test_iter, loss, trainer, num_epochs, devices)
loss 0.420, train acc 0.869, test acc 0.854
0.8 examples/sec on [device(type='cuda', index=0)]

3.5 Prediction
When predicting, the input image must be standardized in each channel and converted into the four-dimensional input format expected by the convolutional neural network.
def predict(img):
    X = test_iter.dataset.normalize_image(img).unsqueeze(0)
    pred = net(X.to(devices[0])).argmax(dim=1)
    return pred.reshape(pred.shape[1], pred.shape[2])
To visualize the predicted class of each pixel, we map the predicted classes back to their annotation colors in the dataset.
def label2image(pred):
    colormap = torch.tensor(d2l.VOC_COLORMAP, device=devices[0])
    X = pred.long()
    return colormap[X, :]
The images in the test dataset vary in size and shape. Because the model uses a transposed convolution layer with stride 32, when the height or width of an input image is not divisible by 32, the output height or width of the transposed convolution layer deviates from the size of the input image. To solve this problem, we can crop multiple rectangular regions whose heights and widths are integer multiples of 32 from the image and run the forward pass on the pixels of each region separately. Note that the union of these regions must completely cover the input image. When a pixel is covered by multiple regions, the average of the transposed convolution outputs for that pixel across the different regions can be used as the input of the softmax operation to predict its class. A rough sketch of this tiling-and-averaging idea is given below.
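Here is a minimal sketch of that idea (not from the original; it reuses net, devices, test_iter, and num_classes defined above, and assumes the image is at least as large as one tile and at most twice the tile size in each dimension, so that four corner crops cover it):

```python
def predict_logits_tiled(img, tile_h=320, tile_w=480):
    _, h, w = img.shape
    logits = torch.zeros(num_classes, h, w, device=devices[0])
    counts = torch.zeros(1, h, w, device=devices[0])
    with torch.no_grad():
        for top in (0, h - tile_h):          # top and bottom rows of tiles
            for left in (0, w - tile_w):     # left and right columns of tiles
                crop = img[:, top:top + tile_h, left:left + tile_w]
                X = test_iter.dataset.normalize_image(crop).unsqueeze(0)
                out = net(X.to(devices[0]))[0]  # (num_classes, tile_h, tile_w)
                logits[:, top:top + tile_h, left:left + tile_w] += out
                counts[:, top:top + tile_h, left:left + tile_w] += 1
    # Average overlapping logits, then take the per-pixel argmax
    return (logits / counts).argmax(dim=0)
```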
For simplicity, we read only a few of the larger test images and crop a $320\times 480$ region from the top-left corner of each image for prediction. For these test images, we print their cropped regions row by row, then the prediction results, and finally the ground-truth categories.
voc_dir = os.path.join("../data/VOCdevkit/VOC2012/")
test_images, test_labels = d2l.read_voc_images(voc_dir, False)
n, imgs = 4, []
for i in range(n):
    crop_rect = (0, 0, 320, 480)
    X = torchvision.transforms.functional.crop(test_images[i], *crop_rect)
    pred = label2image(predict(X))
    imgs += [
        X.permute(1, 2, 0),
        pred.cpu(),
        torchvision.transforms.functional.crop(test_labels[i],
                                               *crop_rect).permute(1, 2, 0)]
d2l.show_images(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n, scale=2);

4. Summary
- A fully convolutional network first uses a convolutional neural network to extract image features, then converts the number of channels into the number of classes with a 1x1 convolution layer, and finally transforms the height and width of the feature maps to those of the input image with a transposed convolution layer.
- In a fully convolutional network, we can initialize the transposed convolution layer with bilinear-interpolation upsampling.
- For how the transposed convolution parameters are calculated, see https://blog.csdn.net/jerry_liufeng/article/details/120816608?spm=1001.2014.3001.5501
- The final semantic segmentation results are not very good; this is likely related to the number of training iterations and the construction of the network layers used here.
- Because each input is 320x480, a batch of images places a heavy demand on GPU memory, so you may need to adjust the batch size for training (I reduced the batch size to 24). Even so, training throughput is still very low.