Machine learning notes - Convolutional neural network cheat sheet

2022-06-11 09:08:00 Sit and watch the clouds rise

One、Overview

A traditional convolutional neural network, also known as a CNN, is a specific type of neural network that is generally composed of the following layers:

Two、The main types of layers

1、Convolution layer (CONV)

The convolution layer (CONV) uses filters that perform convolution operations while scanning the input I with respect to its dimensions. Its hyperparameters include the filter size F and the stride S. The resulting output O is called a feature map or activation map.

Remark: the convolution step can also be generalized to the 1D and 3D cases.
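
As a rough illustration (a naive single-channel numpy sketch, not how deep learning frameworks actually implement convolutions), the scanning of the input with a filter of size F and stride S can be written as:

import numpy as np

def conv2d(I, K, stride=1):
    """Naive 2D convolution (single channel, no padding)."""
    F = K.shape[0]                       # filter size (F x F)
    O = (I.shape[0] - F) // stride + 1   # output size along one dimension
    out = np.zeros((O, O))
    for i in range(O):
        for j in range(O):
            patch = I[i*stride:i*stride+F, j*stride:j*stride+F]
            out[i, j] = np.sum(patch * K)  # one convolution operation per position
    return out

I = np.random.rand(5, 5)   # 5x5 input
K = np.ones((3, 3))        # 3x3 filter
print(conv2d(I, K).shape)  # (3, 3) feature map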

2、Pooling (POOL)

The pooling layer (POOL) is a downsampling operation, typically applied after a convolution layer, which provides some spatial invariance. In particular, max pooling and average pooling are special kinds of pooling, where the maximum and average values are taken, respectively.

Max pooling: each pooling operation selects the maximum value of the current view.

Average pooling: each pooling operation averages the values of the current view.
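
A minimal numpy sketch of both kinds of pooling (single channel, with a hypothetical window size and stride of 2):

import numpy as np

def pool2d(I, size=2, stride=2, mode="max"):
    """Naive 2D pooling (single channel): 'max' or 'average'."""
    O = (I.shape[0] - size) // stride + 1
    out = np.zeros((O, O))
    for i in range(O):
        for j in range(O):
            patch = I[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = patch.max() if mode == "max" else patch.mean()
    return out

I = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(I, mode="max"))      # maximum of each 2x2 view
print(pool2d(I, mode="average"))  # average of each 2x2 view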

3、Fully Connected (FC)

The fully connected layer (FC) operates on a flattened input where each input is connected to all neurons. If present, FC layers are usually found towards the end of CNN architectures and can be used to optimize objectives such as class scores.

Three、Filter hyperparameters

The convolution layer contains filters, and it is important to understand the meaning behind their hyperparameters.

1、Dimensions of a filter

A filter of size F\times F applied to an input containing C channels is a volume of size F \times F \times C that performs convolutions on an input of size I \times I \times C and produces an output feature map (also called activation map) of size O \times O \times 1.

Remark: the application of K filters of size F\times F results in an output feature map of size O \times O \times K.

2、Stride

For a convolution or pooling operation, the stride S denotes the number of pixels by which the window moves after each operation.

3、Zero-padding

Zero-padding denotes the process of adding P zeros to each side of the boundaries of the input. This value can either be specified manually or set automatically through one of the commonly used padding modes (such as 'valid', 'same' or 'full').

Four、Tuning hyperparameters

1、 Parameter compatibility in convolution layer

By noting I the length of the input volume size, F the length of the filter, P the amount of zero padding, and S the stride, the output size O of the feature map along that dimension is given by the following formula:
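
Assuming P zeros are added on each side of the input (the convention defined above), the standard relation is:

        \boxed{O=\frac{I-F+2P}{S}+1}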

2、Understanding the complexity of the model

In order to assess the complexity of a model, it is often useful to determine the number of parameters that its architecture will have. In a given layer of a convolutional neural network, this can be done as sketched below.
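
A minimal sketch, assuming the usual convention that each filter and each output neuron carries one bias term (POOL layers have no parameters to learn):

def conv_params(F, C_in, K):
    """CONV layer: K filters of size F x F x C_in, plus one bias per filter."""
    return (F * F * C_in + 1) * K

def fc_params(n_in, n_out):
    """FC layer: one weight per input-output pair, plus one bias per output neuron."""
    return (n_in + 1) * n_out

def pool_params():
    """POOL layer: a pure downsampling operation, no learnable parameters."""
    return 0

print(conv_params(F=3, C_in=3, K=64))    # 1792
print(fc_params(n_in=4096, n_out=1000))  # 4097000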

3、Receptive field

The receptive field at layer k is the area, denoted R_k \times R_k, of the input that each pixel of the k-th activation map can "see". By calling F_j the filter size of layer j and S_i the stride of layer i, and using the convention S_0 = 1, the receptive field of layer k can be computed with the following formula:
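
The expression below is the standard one and is consistent with the worked example that follows:

        \boxed{R_k=1+\sum_{j=1}^{k}(F_j-1)\prod_{i=0}^{j-1}S_i}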

For example, with F_1 = F_2=3 and S_1 = S_2=1, we get R_2 =1+2\cdot 1+2\cdot 1=5.

Five、Common activation functions

(1)Rectified Linear Unit

The rectified linear unit layer (ReLU) is an activation function g that is applied elementwise to the volume. It aims at introducing non-linearities into the network. Common variants such as the Leaky ReLU and the ELU keep a small non-zero output for negative inputs, as sketched below.
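
A minimal numpy sketch (the small constants eps and alpha below are hypothetical choices):

import numpy as np

def relu(z):
    return np.maximum(z, 0)                              # g(z) = max(z, 0)

def leaky_relu(z, eps=0.01):
    return np.where(z > 0, z, eps * z)                   # small linear slope for z < 0

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))   # smooth saturation for z < 0

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z), leaky_relu(z), elu(z))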

(2)Softmax

The softmax step can be seen as a generalized logistic function that takes as input a vector of scores x\in\mathbb{R}^n and outputs a probability vector p\in\mathbb{R}^n through a softmax function at the end of the architecture. It is defined as follows:

        \boxed{p=\begin{pmatrix}p_1\\\vdots\\p_n\end{pmatrix}}\quad\textrm{where}\quad\boxed{p_i=\frac{e^{ x_i}}{\displaystyle\sum_{j=1}^ne^{x_j}}}
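
A direct numpy transcription of this definition (with the usual subtraction of the maximum score, which does not change the result but avoids overflow):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # shift by the max for numerical stability
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # a probability vector that sums to 1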

Six、Object detection

(1) Model type

There are 3 main types of object recognition algorithms, which differ in the nature of what is predicted: image classification, classification with localization, and detection.

(2)Detection

In the context of object detection, different methods are used depending on whether we just want to locate the object or detect a more complex shape in the image. The two main ones are bounding box detection, which locates the part of the image where the object is, and landmark detection, which detects a set of characteristic points of a shape.

(3)Intersection over Union (IoU)

Intersection over Union, also known as IoU, is a function that quantifies how correctly positioned a predicted bounding box B_p is over the actual bounding box B_a. It is defined as:
        \boxed{\textrm{IoU}(B_p,B_a)=\frac{B_p\cap B_a}{B_p\cup B_a}}

Remark: we always have IoU\in[0,1]. By convention, a predicted bounding box B_p is considered reasonably good if \textrm{IoU}(B_p,B_a)\geqslant0.5.
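
A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates:

def iou(box_p, box_a):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_p[0], box_a[0]), max(box_p[1], box_a[1])
    x2, y2 = min(box_p[2], box_a[2]), min(box_p[3], box_a[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    union = area_p + area_a - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7 ≈ 0.14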

 (4)Anchor boxes

Anchor boxes are a technique used to predict overlapping bounding boxes. In practice, the network is allowed to predict more than one box at the same time, where each box prediction is constrained to have a given set of geometrical properties. For instance, the first prediction can be a rectangular box of a given shape, while the second can be another rectangular box of a different geometrical form.

(5)Non-max suppression

The non-max suppression technique aims at removing duplicate overlapping bounding boxes of the same object by selecting the most representative ones. After removing all boxes with a predicted probability lower than 0.6, the following steps are repeated while boxes remain:

For a given class,
        • Step 1: pick the box with the largest prediction probability.
        • Step 2: discard any box having an IoU ⩾ 0.5 with the previous box.
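
A minimal sketch of these steps for one class, reusing the iou helper from the sketch above (the 0.6 and 0.5 thresholds are the ones stated in the text):

def non_max_suppression(boxes, prob_threshold=0.6, iou_threshold=0.5):
    """boxes: list of (probability, (x1, y1, x2, y2)) predictions for one class."""
    # remove all boxes with a probability prediction lower than the threshold
    boxes = [b for b in boxes if b[0] >= prob_threshold]
    kept = []
    while boxes:
        # Step 1: pick the box with the largest prediction probability
        best = max(boxes, key=lambda b: b[0])
        kept.append(best)
        # Step 2: discard any box with IoU >= 0.5 with the selected box
        boxes = [b for b in boxes
                 if b is not best and iou(b[1], best[1]) < iou_threshold]
    return kept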

(6)YOLO

You Only Look Once (YOLO) is an object detection algorithm that performs the following steps:

Step 1: divide the input image into a G×G grid.

Step 2: for each grid cell, run a CNN that predicts y of the following form:

        \boxed{y=\big[\underbrace{p_c,b_x,b_y,b_h,b_w,c_1,c_2,...,c_p}_{\textrm{repeated }k\textrm{ times}},...\big]^T\in\mathbb{R}^{G\times G\times k\times(5+p)}}

where p_c is the probability of detecting an object, b_x, b_y, b_h, b_w are the properties of the detected bounding box, c_1,...,c_p is a one-hot representation of which of the p classes was detected, and k is the number of anchor boxes.

Step 3: run the non-max suppression algorithm to remove any potential duplicate overlapping bounding boxes.
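
As an illustration of the shape of the prediction volume only (the grid size, number of anchor boxes and number of classes below are hypothetical values, not fixed by YOLO itself):

import numpy as np

G, k, p = 19, 5, 80              # hypothetical grid size, anchor boxes and classes
y = np.zeros((G, G, k, 5 + p))   # p_c, b_x, b_y, b_h, b_w and p class scores per anchor

cell = y[3, 7, 0]                # prediction of anchor box 0 in grid cell (3, 7)
p_c, box, classes = cell[0], cell[1:5], cell[5:]
print(y.shape)                   # (19, 19, 5, 85)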

(7)R-CNN

Regions with Convolutional Neural Networks (R-CNN) is an object detection algorithm that first segments the image to find potential relevant bounding boxes, and then runs a detection algorithm to find the most probable objects in those bounding boxes.

Remark: although the original algorithm is computationally expensive and slow, newer architectures such as Fast R-CNN and Faster R-CNN allow the algorithm to run faster.

Seven、Face verification and recognition

(1) Model type

The two main model types are face verification (checking whether an image corresponds to a claimed identity, a one-to-one lookup) and face recognition (finding which of the known identities an image corresponds to, a one-to-many lookup).

(2)One Shot Learning

One Shot Learning is a face verification algorithm that uses a limited training set to learn a similarity function that quantifies how different two given images are. The similarity function applied to two images is often noted d(\textrm{image 1}, \textrm{image 2}).

(3)Siamese Network

Siamese Networks aim at learning how to encode images and then quantify how different two images are. For a given input image x^{(i)}, the encoded output is often noted f(x^{(i)}).

(4)Triplet loss

The triplet loss \ell is a loss function computed on the embedding representation of a triplet of images A (anchor), P (positive) and N (negative). The anchor and the positive example belong to the same class, while the negative example belongs to another one. By calling \alpha\in\mathbb{R}^+ the margin parameter, this loss is defined as follows:
        \boxed{\ell(A,P,N)=\max\left(d(A,P)-d(A,N)+\alpha,0\right)}
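
A minimal numpy sketch, assuming the embeddings are already computed and that d is the squared Euclidean distance (the margin alpha = 0.2 is a hypothetical choice):

import numpy as np

def d(u, v):
    """Squared Euclidean distance between two embeddings."""
    return np.sum((u - v) ** 2)

def triplet_loss(f_A, f_P, f_N, alpha=0.2):
    """f_A, f_P, f_N: embeddings of the anchor, positive and negative images."""
    return max(d(f_A, f_P) - d(f_A, f_N) + alpha, 0.0)

f_A = np.array([0.0, 1.0])
f_P = np.array([0.1, 0.9])   # same identity as the anchor
f_N = np.array([1.0, 0.0])   # different identity
print(triplet_loss(f_A, f_P, f_N))  # 0.0: the negative is already far enough away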

Eight、Neural style transfer

(1)Motivation

The goal of neural style transfer is to generate an image G based on a given content C and a given style S.

(2)Activation

In a given layer l, the activation is noted a^{[l]} and is of dimensions n_H\times n_w\times n_c.

(3)Content cost function

The content cost function J_{\textrm{content}}(C,G) is used to determine how the generated image G differs from the original content image C. It is defined as follows:

        \boxed{J_{\textrm{content}}(C,G)=\frac{1}{2}||a^{[l](C)}-a^{[l](G)}||^2}

(4)Style matrix

The style matrix G^{[l]} of a given layer l is a Gram matrix, where each of its elements G_{kk'}^{[l]} quantifies how correlated the channels k and k' are. It is defined with respect to the activations a^{[l]} as follows:
        \boxed{G_{kk'}^{[l]}=\sum_{i=1}^{n_H^{[l]}}\sum_{j=1}^{n_w^{[l]}}a_{ijk}^{[l]}a_{ijk'}^{[l]}}

Remark: the style matrices of the style image and of the generated image are noted G^{[l](S)} and G^{[l](G)} respectively.

(5)Style cost function

The style cost function J_{\textrm{style}}(S,G) is used to determine how the generated image G differs from the style S. It is defined as follows:

\boxed{J_{\textrm{style}}^{[l]}(S,G)=\frac{1}{(2n_Hn_wn_c)^2}||G^{[l](S)}-G^{[l](G)}||_F^2=\frac{1}{(2n_Hn_wn_c)^2}\sum_{k,k'=1}^{n_c}\Big(G_{kk'}^{[l](S)}-G_{kk'}^{[l](G)}\Big)^2}
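
A minimal numpy sketch of the per-layer style cost, computing the Gram matrices exactly as defined above (the toy activation shapes are arbitrary):

import numpy as np

def gram_matrix(a):
    """a: activation of shape (n_H, n_w, n_c); returns the (n_c, n_c) Gram matrix."""
    n_H, n_w, n_c = a.shape
    flat = a.reshape(n_H * n_w, n_c)   # one row per spatial position (i, j)
    return flat.T @ flat               # G[k, k'] = sum_{i,j} a_ijk * a_ijk'

def style_cost_layer(a_S, a_G):
    """Per-layer style cost between the style image and the generated image."""
    n_H, n_w, n_c = a_S.shape
    G_S, G_G = gram_matrix(a_S), gram_matrix(a_G)
    return np.sum((G_S - G_G) ** 2) / (2 * n_H * n_w * n_c) ** 2

a_S = np.random.rand(4, 4, 3)   # toy activations of the style image at layer l
a_G = np.random.rand(4, 4, 3)   # toy activations of the generated image at layer l
print(style_cost_layer(a_S, a_G))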

(6)Overall cost function

The overall cost function is defined as a combination of the content and style cost functions, weighted by parameters \alpha,\beta, as shown below:

        \boxed{J(G)=\alpha J_{\textrm{content}}(C,G)+\beta J_{\textrm{style}}(S,G)}

Remark: a higher value of \alpha makes the model care more about the content, while a higher value of \beta makes it care more about the style.

Nine、Architectures using computational tricks

(1)Generative Adversarial Network

Generative adversarial networks, also known as GANs, are composed of a generative model and a discriminative model: the generative model aims at generating the most realistic output possible, which is then fed into the discriminative model, whose goal is to differentiate the generated image from a real one.

Remark: use cases of GAN variants include text-to-image generation, music generation and synthesis.

(2)ResNet

The Residual Network architecture (also called ResNet) uses residual blocks with a high number of layers and is designed to decrease the training error. The residual block has the following characterizing equation:

        \boxed{a^{[l+2]}=g(a^{[l]}+z^{[l+2]})}
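
A minimal numpy sketch of the skip connection (the two dense layers below stand in for whatever layers sit inside the block; their weights are random placeholders):

import numpy as np

def relu(z):
    return np.maximum(z, 0)

def residual_block(a_l, W1, b1, W2, b2):
    """Returns a[l+2] = g(a[l] + z[l+2]) for a two-layer block with activation g = ReLU."""
    a_l1 = relu(W1 @ a_l + b1)   # first layer inside the block
    z_l2 = W2 @ a_l1 + b2        # pre-activation of the second layer
    return relu(a_l + z_l2)      # the shortcut adds a[l] before the activation

n = 4
rng = np.random.default_rng(0)
a_l = rng.normal(size=n)
print(residual_block(a_l, rng.normal(size=(n, n)), np.zeros(n),
                     rng.normal(size=(n, n)), np.zeros(n)))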

(3)Inception Network

This architecture uses inception modules and aims at trying out different convolutions in order to increase its performance through feature diversification. In particular, it uses the 1\times1 convolution trick to limit the computational burden.


Copyright notice: this article was written by [Sit and watch the clouds rise]; please include a link to the original article when reposting.
https://yzsam.com/2022/162/202206110859382613.html