
Detailed Explanation of the ConVIRT Paper (Medical Images)

2022-06-12 06:36:00 PD, I am your true love fan


Preface

ConVIRT stands for Contrastive Learning of Medical Visual Representations from Paired Images and Text. It is also a contrastive-learning work, but one that combines multiple modalities, and it predates CLIP.

  • Learning visual representations of medical images is central to medical imaging, but progress has been held back by the small size of manually labeled datasets;
  • Existing work usually relies on models pretrained on ImageNet, which works poorly because the image characteristics are completely different;
  • Or it extracts rule-based labels from the text reports paired with the medical images, but those labels are inaccurate and hard to generalize;

How the medical field obtains labels

  • Ask experts to label by hand: high quality, but this keeps the quantity small;
  • Use rules to extract labels from reports; medical systems contain many such text-image pairs;
    • But extraction rules are often hard to get right, and some labels are difficult to extract;
    • And because every doctor writes differently, such rules also transfer poorly across hospitals;

The overall architecture

[Figure: overall ConVIRT architecture]

  • An image is first randomly cropped and then further augmented; it goes through the image encoder (ResNet-50) and finally an MLP head, yielding a 512-dimensional feature representation;
  • From the text paired with this image, a span is randomly sampled (possibly just a few words or an incomplete sentence); it goes through the text encoder (BERT) and finally an MLP head, also yielding a 512-dimensional feature representation;
  • Because a batch contains N image-text pairs, each sample can be seen as having one positive and N-1 negatives; an InfoNCE loss is then computed separately for the image and the text direction (written out after the figure below);

[Figure: the two InfoNCE loss terms]
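Written out, the two terms follow the standard InfoNCE form from the paper: with $v_i$ the image feature and $u_i$ the text feature of pair $i$, $\langle\cdot,\cdot\rangle$ cosine similarity, and $\tau$ a temperature,

$$\ell_i^{(v\to u)} = -\log\frac{\exp(\langle v_i, u_i\rangle/\tau)}{\sum_{k=1}^{N}\exp(\langle v_i, u_k\rangle/\tau)},$$

with $\ell_i^{(u\to v)}$ defined symmetrically and the total loss a $\lambda$-weighted average, $\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\bigl(\lambda\,\ell_i^{(v\to u)} + (1-\lambda)\,\ell_i^{(u\to v)}\bigr)$.

A minimal PyTorch sketch of this bidirectional loss (the default values for `tau` and `lam` are illustrative, not the paper's tuned settings):

```python
import torch
import torch.nn.functional as F

def convirt_loss(v, u, tau=0.1, lam=0.75):
    """Bidirectional InfoNCE over a batch of N paired features.

    v: (N, 512) image features (image encoder + MLP head)
    u: (N, 512) text features (text encoder + MLP head)
    """
    v = F.normalize(v, dim=1)   # cosine similarity becomes a dot product
    u = F.normalize(u, dim=1)
    logits = v @ u.t() / tau    # (N, N); diagonal entries are the positives
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2u = F.cross_entropy(logits, targets)      # image -> text direction
    loss_u2v = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return lam * loss_v2u + (1 - lam) * loss_u2v
```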

Datasets

  • The public MIMIC-CXR database, version 2: a set of chest X-ray images paired with their text reports, which has become a standard resource for multimodal modeling of medical images since its release. After preprocessing, the dataset contains about 217k image-text pairs; each pair contains on average 1.7 images and 6.0 sentences;
  • Bone images: a set of musculoskeletal image-text pairs obtained from the Rhode Island Hospital system. After chest images, musculoskeletal images are the second most common type of radiograph in a typical hospital. The dataset contains 48k image-text pairs; each pair contains on average 2.5 images and 8.0 sentences.

Experiments

Classification tasks

  • RSNA Pneumonia Detection: a binary task, pneumonia or not;
  • CheXpert image classification: a multi-label classification task for lung conditions;
  • COVIDx image classification: a three-way task, novel coronavirus pneumonia vs. common pneumonia vs. normal;
  • MURA bony abnormality detection: a binary task, whether the musculoskeletal image is normal or abnormal;

Two evaluation settings: linear probing, where only the classification head is trained, and fine-tuning, where the whole network is updated (a minimal sketch of the former follows).
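A minimal sketch of the linear-probe setup, assuming a torchvision ResNet-50 into which the pretrained weights have already been loaded (the helper name and arguments are mine, for illustration):

```python
import torch.nn as nn
from torchvision.models import resnet50

def build_linear_probe(num_classes, pretrained_state=None):
    """Freeze the pretrained backbone and train only a new classification head."""
    model = resnet50()
    if pretrained_state is not None:         # e.g. ConVIRT-pretrained weights
        model.load_state_dict(pretrained_state, strict=False)
    for p in model.parameters():             # linear probe: nothing in the
        p.requires_grad = False              # backbone gets gradient updates
    # New head: 2 outputs for RSNA/MURA, 3 for COVIDx; for the multi-label
    # CheXpert task, one output per label with a BCE loss would be used instead.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```

For fine-tuning, the freezing loop is simply dropped, so the whole network is updated.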

[Figure: classification results for linear probing and fine-tuning]

  • From top to bottom, the rows are:
    • Random initialization;
    • ResNet-50 pretrained on ImageNet;
    • Caption-LSTM, an image-captioning ("describe the picture") network;
    • Caption-Transformer, the same but with a Transformer replacing the LSTM (a COCO image captioning benchmark model);
    • Contrastive-Binary, also a contrastive-learning network: a pair is fed in and the network judges whether the two items belong together, which serves as the pretraining task (see the sketch after this list);
  • Because COVIDx does not have that much data, no evaluation was done at the 1% data level;
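A hedged sketch of what the Contrastive-Binary pretext task could look like: paired features are fused and a small head predicts whether they belong together (the fusion by concatenation and the shift-by-one negatives are assumptions for illustration, not details from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairMatcher(nn.Module):
    """Binary pretext task: do this image feature and text feature match?"""
    def __init__(self, dim=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, v, u):                      # v, u: (N, dim)
        return self.head(torch.cat([v, u], dim=1)).squeeze(1)  # one logit per pair

def binary_pair_loss(matcher, v, u):
    pos = matcher(v, u)                  # aligned pairs -> label 1
    neg = matcher(v, u.roll(1, dims=0))  # mismatched pairs -> label 0
    logits = torch.cat([pos, neg])
    labels = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)])
    return F.binary_cross_entropy_with_logits(logits, labels)
```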

The figure below visualizes the feature space: on the left the ImageNet-pretrained model, on the right ConVIRT.
[Figure: feature-space visualization, ImageNet pretraining (left) vs. ConVIRT (right)]

Zero-shot tasks

As with CLIP, the most multimodal use is zero-shot: no fine-tuning is needed, and images are classified through prompts.


  • Image-Image: pass all images through the image encoder; similar images should cluster together. (One image serves as the query, and its similarity to all other images is computed; this is similarity in the contrastive feature space.)
    • The CheXpert dataset is used again, but not as a multi-label task: the authors take images that carry only one label as queries (after expert screening, 10 images are kept per category);
    • These queries are ranked against all other images by similarity; a retrieved image counts as a positive if it also carries the query's label, and as a negative otherwise;
  • Text-Image: experts write a description of the symptoms for each CheXpert label (5 in total); these are passed through the text encoder and compared with the features of all images from the image encoder (these features have, of course, gone through the MLP layer). Positives and negatives are determined as above; a minimal sketch of this ranking follows the list;
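Both zero-shot tasks reduce to ranking candidates by cosine similarity in the shared feature space; a minimal sketch (features are assumed to already include the MLP projection):

```python
import torch
import torch.nn.functional as F

def rank_by_similarity(query_feats, gallery_feats, k=10):
    """Top-k gallery indices per query, ranked by cosine similarity.

    Image-Image: queries are the expert-picked single-label images,
    the gallery holds all candidate image features.
    Text-Image: queries are the expert-written label descriptions
    passed through the text encoder; the gallery holds image features.
    """
    q = F.normalize(query_feats, dim=1)
    g = F.normalize(gallery_feats, dim=1)
    sims = q @ g.t()                    # (num_queries, num_gallery)
    return sims.topk(k, dim=1).indices  # a hit is a positive iff it shares
                                        # the query's label
```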

[Figure: zero-shot retrieval results]

Hyperparameter settings

[Figure: hyperparameter ablation results]

  • A large batch size hurts Image-Image and Text-Image retrieval, possibly because some in-batch negatives are themselves latent positives;
  • Removing the activation function from the final MLP also degrades Image-Image and Text-Image performance (the two head variants are sketched below);
  • Neither issue affects the classification tasks;
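For reference, the ablated component is the non-linearity inside the final projection head; a minimal sketch of the two variants being compared (the 2048-to-512 layer sizes are assumptions based on ResNet-50's output width):

```python
import torch.nn as nn

# Default non-linear projection head.
proj_nonlinear = nn.Sequential(
    nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, 512))

# Ablated linear variant: dropping the activation hurt Image-Image and
# Text-Image retrieval, while classification performance was unaffected.
proj_linear = nn.Sequential(
    nn.Linear(2048, 512), nn.Linear(512, 512))
```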

Comparison with other contrastive-learning models; because SimCLR and MoCo contrast images only, the comparison is restricted to the Image-Image task.
[Figure: comparison with SimCLR and MoCo on Image-Image retrieval]

  • All methods are trained on the MIMIC-CXR dataset. SimCLR was not used in its v2 version, so it naturally falls short; but MoCo v2 clearly uses a larger and more consistent dictionary, so why does it still fail to match ConVIRT?
    • I see several possible reasons: the hyperparameters were not tuned for this data (they were carried over directly), and using the text data improves the model's learning efficiency and accuracy.