Swin Transformer Code Explanation
2022-06-12 16:45:00 【QT-Smile】
1. Downsampling is by a factor of 4, therefore patch_size=4.

3. embed_dim=96 is the C in the figure below: the number of channels after the first Linear Embedding.

4. After the first fully connected layer, the number of channels is doubled.

6. Whether to use a bias for q, k, v in Multi-Head Attention (qkv_bias); the default here is to use it.
7. The first drop rate is applied right after PatchEmbed.
8. The second drop rate is used inside Multi-Head Attention.
9. The third drop rate is used in every Swin Transformer block; it grows gradually from 0 up to 0.1.
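To keep the three rates apart, here is a small illustration of where each one acts; the module names are illustrative, not the repo's exact attributes:

```python
import torch.nn as nn

drop_rate, attn_drop_rate, drop_path_rate = 0.0, 0.0, 0.1

pos_drop = nn.Dropout(p=drop_rate)        # point 7: applied right after PatchEmbed
attn_drop = nn.Dropout(p=attn_drop_rate)  # point 8: applied to the attention weights
# Point 9: drop_path_rate is a stochastic-depth rate that ramps from 0 to 0.1
# across the blocks; point 20 below shows how the per-block values are built.
```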

10. patch_norm: by default, a normalization layer is appended after PatchEmbed.
11. Not used by default; if enabled, it only saves memory (checkpointing).
12. Corresponds to the number of Swin Transformer blocks in each stage.
13. Corresponds to the number of output channels of stage 4.

14. PatchEmbed divides the image into non-overlapping patches.
15. PatchEmbed corresponds to the Patch Partition and Linear Embedding in the architecture diagram.

16. Patch Partition is actually implemented by a convolution.
17. Pad on the right in the width direction and at the bottom in the height direction.
18. Flatten starting from dimension 2.
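Putting points 16–18 together, here is a minimal sketch of a PatchEmbed forward pass (a simplified re-implementation, not the repo's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEmbed(nn.Module):
    """Patch Partition + Linear Embedding as one strided conv (point 16)."""
    def __init__(self, patch_size=4, in_c=3, embed_dim=96):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Conv2d(in_c, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        _, _, H, W = x.shape
        # Point 17: pad on the right (width) and at the bottom (height) so that
        # H and W become multiples of patch_size. F.pad pairs: (left, right, top, bottom).
        pad_w = (self.patch_size - W % self.patch_size) % self.patch_size
        pad_h = (self.patch_size - H % self.patch_size) % self.patch_size
        if pad_w or pad_h:
            x = F.pad(x, (0, pad_w, 0, pad_h))
        x = self.proj(x)                  # [B, C, H/4, W/4]
        x = x.flatten(2).transpose(1, 2)  # point 18: flatten from dim 2 -> [B, H*W, C]
        return self.norm(x)

out = PatchEmbed()(torch.randn(1, 3, 224, 224))  # -> [1, 3136, 96]
```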
19. This is the first drop rate mentioned above, connected directly after PatchEmbed.
20. A drop-path rate is set for each Swin Transformer block used, growing from 0 all the way up to drop_path_rate.
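A minimal sketch of how such a schedule is computed (the variable names depths and dpr follow common implementations and are assumptions here):

```python
import torch

drop_path_rate = 0.1
depths = (2, 2, 6, 2)  # Swin-T: number of blocks per stage (point 12)
# One stochastic-depth rate per block, growing linearly from 0 to drop_path_rate.
dpr = [x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))]
print(dpr)  # 12 values from 0.0 up to 0.1
```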
21. Loop to generate each stage; note that a stage in the code differs slightly from a stage in the paper. Stage 4 has no Patch Merging, only Swin Transformer blocks.
22. The number of Swin Transformer blocks that need to be stacked in this stage.
23.
drop_rate: applied directly after PatchEmbed
attn_drop: the attention dropout used in this stage
dpr: the drop-path rates used by the different Swin Transformer blocks in this stage
24. The first 3 stages built have a PatchMerging layer, but the last stage does not.
Here self.num_layers=4.
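The per-stage settings implied by points 22–24 can be sketched like this (a runnable illustration, not the repo's code):

```python
import torch

embed_dim, drop_path_rate = 96, 0.1
depths = (2, 2, 6, 2)  # Swin-T
dpr = [x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))]

num_layers = len(depths)  # self.num_layers = 4
for i in range(num_layers):
    print(dict(
        dim=int(embed_dim * 2 ** i),                         # 96, 192, 384, 768
        depth=depths[i],                                     # blocks stacked in this stage (point 22)
        drop_path=dpr[sum(depths[:i]):sum(depths[:i + 1])],  # this stage's slice of dpr (point 23)
        has_patch_merging=(i < num_layers - 1),              # point 24: not in the last stage
    ))
```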
27. The feature matrix passed into the Patch Merging below has shape x: [B, H*W, C].


28. When the H and W of x are not integer multiples of 2, padding is needed: pad a column of zeros on the right and a row of zeros at the bottom.
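Points 4, 27 and 28 come together in Patch Merging; below is a simplified re-implementation for reference (a sketch, not the repo's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchMerging(nn.Module):
    """2x downsampling: H and W halve; the linear layer maps 4C -> 2C (point 4)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):
        B, L, C = x.shape        # point 27: x is [B, H*W, C]
        x = x.view(B, H, W, C)
        # Point 28: if H or W is odd, pad a column of zeros on the right and a
        # row of zeros at the bottom. Pad pairs run last-dim-first: (C, W, H).
        if H % 2 == 1 or W % 2 == 1:
            x = F.pad(x, (0, 0, 0, W % 2, 0, H % 2))
        x0 = x[:, 0::2, 0::2, :]  # top-left pixel of every 2x2 neighborhood
        x1 = x[:, 1::2, 0::2, :]  # bottom-left
        x2 = x[:, 0::2, 1::2, :]  # top-right
        x3 = x[:, 1::2, 1::2, :]  # bottom-right
        x = torch.cat([x0, x1, x2, x3], -1)  # [B, H/2, W/2, 4C]
        x = x.view(B, -1, 4 * C)             # [B, (H/2)*(W/2), 4C]
        return self.reduction(self.norm(x))  # [B, (H/2)*(W/2), 2C]

out = PatchMerging(96)(torch.randn(1, 56 * 56, 96), 56, 56)  # -> [1, 784, 192]
```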

31. For the classification model, the following code needs to be added.
32. Initialize the weights of the whole model.
33. Perform the 4× downsampling.
34. L = H*W.
35. Drop activations at a certain rate (dropout).
36. Pass through each stage.
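Points 31–36 can be summarized in a toy skeleton of the forward pass (real stages are replaced by Identity; all names here are illustrative, not the repo's):

```python
import torch
import torch.nn as nn

class ToySwin(nn.Module):
    """Toy skeleton mirroring points 31-36; not the repo's implementation."""
    def __init__(self, embed_dim=96, num_classes=1000):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, embed_dim, 4, 4)  # point 33: 4x downsampling
        self.pos_drop = nn.Dropout(0.0)                   # point 35: drop at a given rate
        self.layers = nn.ModuleList([nn.Identity()])      # point 36: the stages
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)     # point 31: classification head
        # Point 32: the real model initializes all weights here, e.g. via self.apply(...).

    def forward(self, x):
        x = self.patch_embed(x).flatten(2).transpose(1, 2)  # [B, L, C], L = H*W (point 34)
        x = self.pos_drop(x)
        for layer in self.layers:  # point 36: go through each stage
            x = layer(x)
        x = self.norm(x)
        x = x.mean(dim=1)          # global average pool over L
        return self.head(x)

logits = ToySwin()(torch.randn(1, 3, 224, 224))  # -> [1, 1000]
```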
The following code builds the Swin-T model.

Swin-T's pre-trained weights on ImageNet-1k.
39. Swin_B(window7_224)
40. Swin_B(window7_384)
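For reference, the variant settings from the Swin Transformer paper can be listed as plain data (note that the official 384-resolution Swin-B checkpoint uses a 12×12 window; patch_size=4 throughout):

```python
# Known configurations from the Swin Transformer paper, shown as plain dicts:
configs = {
    "swin_t_window7_224":  dict(embed_dim=96,  depths=(2, 2, 6, 2),  num_heads=(3, 6, 12, 24), window_size=7),
    "swin_b_window7_224":  dict(embed_dim=128, depths=(2, 2, 18, 2), num_heads=(4, 8, 16, 32), window_size=7),
    "swin_b_window12_384": dict(embed_dim=128, depths=(2, 2, 18, 2), num_heads=(4, 8, 16, 32), window_size=12),
}
for name, cfg in configs.items():
    print(name, cfg)
```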


41. The original feature matrix needs to be shifted to the right and down; the specific shift distance equals the window size divided by 2, rounded down.
self.shift_size is this rightward/downward shift distance.
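The cyclic shift itself is done with torch.roll in the official implementation; a minimal runnable illustration:

```python
import torch

window_size = 7
shift_size = window_size // 2  # floor(7 / 2) = 3 (point 41)

x = torch.arange(16.).view(1, 4, 4, 1)  # toy [B, H, W, C] feature map
# Rolling shifts the map so the shifted windows line up; rows and columns
# wrap around (cyclic shift), and the inverse roll restores the original map.
shifted = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))
restored = torch.roll(shifted, shifts=(shift_size, shift_size), dims=(1, 2))
assert torch.equal(x, restored)
```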
42. The number of Swin Transformer blocks in one stage.
43. When shift_size=0, W-MSA is used; when shift_size is not 0, SW-MSA is used with shift_size = self.shift_size.
44. A Swin Transformer block can be either W-MSA or SW-MSA; here a W-MSA and an SW-MSA pair is not treated together as a single Swin Transformer block.
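How the blocks alternate between the two attention types (points 42–44) can be seen from this small runnable sketch:

```python
window_size, depth = 7, 6  # e.g. stage 3 of Swin-T has 6 blocks
for i in range(depth):
    shift_size = 0 if i % 2 == 0 else window_size // 2
    kind = "W-MSA" if shift_size == 0 else "SW-MSA"  # point 43
    print(f"block {i}: shift_size={shift_size} -> {kind}")
```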
45. depth represents the numbers circled in the figure below.

46. The downsampling here is implemented with Patch Merging.
47. This is where SW-MSA is actually used.
48. Swin Transformer blocks do not change the height and width of the feature matrix, so the situation is the same for every SW-MSA; the size of attn_mask therefore never changes, and attn_mask only needs to be created once.
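A simplified version of that one-time mask construction (condensed from the official create_mask logic):

```python
import torch

def create_mask(H, W, window_size=7, shift_size=3):
    """SW-MSA attention mask: after the cyclic shift, pixels that came from
    different image regions must not attend to each other (-100 before softmax)."""
    img_mask = torch.zeros((1, H, W, 1))
    slices = (slice(0, -window_size), slice(-window_size, -shift_size), slice(-shift_size, None))
    cnt = 0
    for h in slices:
        for w in slices:
            img_mask[:, h, w, :] = cnt  # label each region with its own id
            cnt += 1
    # Partition into non-overlapping windows -> [num_windows, window_size**2].
    m = img_mask.view(1, H // window_size, window_size, W // window_size, window_size, 1)
    m = m.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size)
    # Pairs with different region ids get -100; same-region pairs stay 0.
    attn_mask = m.unsqueeze(1) - m.unsqueeze(2)
    return attn_mask.masked_fill(attn_mask != 0, -100.0)

mask = create_mask(56, 56)  # H and W never change inside a stage, so build once (point 48)
```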
50. At this point, all the Swin Transformer blocks of one stage have been created.
51. This is the Patch Merging downsampling; it halves the height and width.
52. The +1 here is to guard against the incoming H and W being odd, which requires padding.
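The +1 is just integer division that rounds up, so odd inputs map to the padded size:

```python
H, W = 7, 7                        # odd spatial size entering Patch Merging
H, W = (H + 1) // 2, (W + 1) // 2  # -> 4, 4: the size after padding and halving
```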

59. The following two x tensors have the same shape.
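Those two x tensors are the shortcut and the attention output inside a block; a toy sketch of the residual structure (attn and mlp are stand-ins, not the real modules):

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Toy stand-in for a Swin Transformer block, showing only the residuals."""
    def __init__(self, dim=96):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.Identity()  # stands in for W-MSA / SW-MSA
        self.mlp = nn.Identity()   # stands in for the 2-layer MLP

    def forward(self, x):
        shortcut = x                        # [B, L, C]
        x = self.attn(self.norm1(x))        # same shape as shortcut (point 59)
        x = shortcut + x                    # first residual connection
        return x + self.mlp(self.norm2(x))  # second residual connection

y = ToyBlock()(torch.randn(1, 3136, 96))  # shape preserved: [1, 3136, 96]
```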
