KDD 2021 | MoCL: Contrastive Learning on Molecular Graphs Using Multi-level Domain Knowledge

2022-06-10 17:12:00 DrugAI

Compiled by Tao Wen | Reviewed by Yang Huidan

This article introduces research published at KDD 2021, a collaboration between Michigan State University and Agios Pharmaceuticals. The authors study graph contrastive learning in the biomedical domain and propose a new framework called MoCL, which uses domain knowledge at both the local and global levels to assist representation learning. Local-level domain knowledge guides the augmentation process so that variation can be introduced without changing the semantics of the graph. Global-level knowledge encodes similarity information between graphs across the whole dataset, helping the model learn semantically richer representations. The authors evaluate MoCL on various molecular datasets and show that it achieves state-of-the-art performance.

1

Introduction

Graph neural networks (GNNs) have been shown to achieve state-of-the-art performance on graph-related tasks and are increasingly used in the biomedical domain to solve drug-related problems. However, like most deep learning models, they require large amounts of labeled data for training, while in the real world labels for a specific task are usually scarce, so pre-training schemes for GNNs have recently been actively explored. Unlike images, however, graphs pose unique challenges for contrastive learning. First, the structural information and semantics of graphs differ significantly across domains (e.g., social networks versus molecular graphs), so it is hard to design a general augmentation scheme for all scenarios. Second, most current graph contrastive learning frameworks ignore the global structure of the data; for example, two graphs with similar structures should lie closer in the embedding space. Third, the contrastive scheme is not unique: contrast can occur at the node-graph, node-node, or graph-graph level.

Beyond these graph-specific challenges, contrastive learning itself still has open problems. For example, mutual information is hard to estimate accurately in high-dimensional settings, and the relationship between mutual information maximization and contrastive learning is not fully understood.

The authors therefore aim to address these challenges in the biomedical domain. Their hypothesis is that better representations can be learned by injecting domain knowledge into the augmentation and contrastive schemes. They propose using both local and global domain knowledge to assist contrastive learning on molecular graphs. They introduce a new augmentation scheme called substructure substitution, in which a valid substructure in a molecule is replaced by a bioisostere, introducing variation without changing the molecular properties. The replacement rules come from domain resources, which the authors regard as local-level domain knowledge. Global-level domain knowledge encodes the global similarity between graphs; the authors propose using this information to learn richer representations through a dual contrastive objective. This paper is the first attempt to use domain knowledge to assist contrastive learning. The authors' contributions are as follows:

  • A molecular graph augmentation scheme based on local-level domain knowledge is proposed, which keeps the semantics of the graph unchanged during augmentation.
  • Similarity information between molecular graphs is exploited through an additional global contrastive loss, encoding the global structure of the data into the graph representations.
  • A theoretical grounding is provided for the learning objective, relating it to the triplet loss and showing the effectiveness of the whole framework.
  • MoCL is evaluated on various molecular datasets and shown to outperform state-of-the-art methods.

2

Method

2.1 Contrastive learning framework

Figure 1 shows the overall framework of MoCL. First, two augmented views are generated using local domain knowledge. They are then fed, together with the original view (blue), through the same GNN encoder and projection head. The local-level contrast maximizes the mutual information between the two augmented views, while the global-level contrast maximizes the mutual information between two similar graphs, where the similarity information comes from global-level domain knowledge.

Figure 1: The overall framework of MoCL
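The local contrast between the two augmented views follows the usual contrastive-learning recipe. As a minimal sketch (not the authors' implementation), the loss family involved can be illustrated by an NT-Xent (normalized temperature-scaled cross-entropy) objective over paired views, assuming the projected embeddings are already L2-normalized:

```python
import math

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent loss over two lists of L2-normalized embeddings, where
    z1[i] and z2[i] are the two augmented views of the same graph.
    A minimal pure-Python sketch of the local contrastive objective."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    n = len(z1)
    z = z1 + z2                        # 2n views; i and (i+n) % 2n are positives
    loss = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)
        # temperature-scaled similarities to all other views (the denominator)
        denom = sum(math.exp(dot(z[i], z[j]) / tau) for j in range(2 * n) if j != i)
        pos_sim = math.exp(dot(z[i], z[pos]) / tau)
        loss += -math.log(pos_sim / denom)
    return loss / (2 * n)
```

Perfectly aligned view pairs give a lower loss than mismatched ones, which is exactly the pressure that makes the encoder invariant to the augmentation.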

2.2 Local-level domain knowledge

Most existing augmentation methods may change the semantics of a molecular graph (Figure 2a, b, c, d). Among the general augmentations, only attribute masking (Figure 2d) does not violate biological assumptions, because it does not alter the molecule itself; it only masks some atom and edge attributes.

Therefore, the author injects domain knowledge to assist the enhancement process . The author proposes a substructure substitution , The effective substructure in the molecule is replaced by a bioelectronic isosteric body , The bioelectronic isosteric body produces a new molecule with physical or chemical properties similar to the original molecule ( chart 2e). The author collected from the domain resources 218 Bar rule , Each rule consists of a source substructure and a target substructure , And increased. 12 An additional rule is to subtract and add carbon groups from a molecule . therefore MoCL contain 230 Bar rule , Used to generate molecular variants with similar properties .

Figure 2: Comparison of augmentations
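To make the rule-based substitution concrete, the toy sketch below stores a couple of classic bioisostere pairs and applies them as plain SMILES string rewrites. This is only schematic: the rule list here is hypothetical (the paper's 230 rules come from domain resources), and real molecule editing requires a cheminformatics toolkit such as RDKit (e.g. `Chem.ReplaceSubstructs`), since naive string matching ignores the actual molecular graph.

```python
# Toy rules: (source substructure -> replacement), written as SMILES fragments.
# Illustrative bioisostere pairs only, not the paper's actual rule set.
RULES = [
    ("C(=O)O", "c1nnn[nH]1"),   # carboxylic acid -> tetrazole (classic bioisostere)
    ("C(=O)N", "S(=O)(=O)N"),   # amide -> sulfonamide (illustrative)
]

def augment(smiles, rules=RULES):
    """Return every variant obtained by applying one rule once.
    String replacement stands in for proper substructure matching."""
    variants = []
    for src, dst in rules:
        if src in smiles:
            variants.append(smiles.replace(src, dst, 1))
    return variants
```

Each applicable rule yields one augmented view; applying rules repeatedly corresponds to the higher augmentation "strengths" evaluated later in the paper.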

2.3 Global-level domain knowledge

Maximizing the mutual information between corresponding views learns transformation-invariant representations, but it may ignore the global semantics of the data; for example, graphs with similar structure or semantics should lie closer in the embedding space. For molecular graphs, such information can be obtained from multiple sources. For general graph structure, extended-connectivity fingerprints (ECFP) encode molecular substructures and are widely used to compute structural similarity between molecular graphs. The authors use ECFP to compute pairwise similarity between molecular graphs and propose two strategies to incorporate global semantics into the learning framework: the first uses the similarity as direct supervision; the second uses a contrastive objective in which two similar graphs have higher mutual information.
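Similarity over ECFP bit vectors is conventionally measured with the Tanimoto (Jaccard) coefficient. A minimal sketch, representing each fingerprint as the set of its on-bit indices; `knn_positives` is an assumed illustration of how such similarities could define the neighborhoods used as positives in a global contrastive term, not the authors' exact procedure:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints,
    each given as the set of its on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def knn_positives(fps, k):
    """For each molecule, return the indices of its k most similar
    neighbors -- candidate positive pairs for a global contrastive loss."""
    out = []
    for i, fp in enumerate(fps):
        ranked = sorted(
            (j for j in range(len(fps)) if j != i),
            key=lambda j: tanimoto(fp, fps[j]),
            reverse=True,
        )
        out.append(ranked[:k])
    return out
```

The neighborhood size `k` here corresponds to the hyperparameter whose sensitivity is examined in Section 3.3.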

3

Experiments

3.1 Local-level domain knowledge

Figure 3 shows the results of different augmentation combinations under the linear evaluation protocol on all datasets. Each cell represents the performance gain under the linear protocol of a model trained with a given augmentation combination, relative to a GNN trained from scratch. Blue denotes negative values, red positive. The representations produced by MoCL-DK, combined with a linear classifier, achieve prediction accuracy comparable to a supervised GNN (bace, bbbp, sider) or even better (clintox, mutag). The rows and columns containing MoCL-DK generally have higher values, so combining MoCL-DK with other augmentation methods almost always yields better results. Attribute masking and MoCL-DK are effective in essentially all scenarios, and combining them usually gives better performance. This supports the authors' earlier hypothesis that MoCL-DK and attribute masking do not violate biological assumptions and therefore outperform the other augmentations.

Figure 3: Augmentation combinations under the linear evaluation protocol
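For readers unfamiliar with the linear evaluation protocol used above: the pre-trained encoder is frozen, and only a linear classifier is trained on its fixed embeddings. A minimal sketch of such a probe, using plain logistic regression by gradient descent as a hypothetical stand-in for the standard setup:

```python
import math

def linear_probe(embeddings, labels, lr=0.5, epochs=200):
    """Linear evaluation sketch: the encoder is frozen, and only a
    logistic-regression head is fit to its fixed output embeddings."""
    def sigmoid(z):  # numerically stable for large |z|
        if z >= 0:
            return 1.0 / (1.0 + math.exp(-z))
        return math.exp(z) / (1.0 + math.exp(z))

    d = len(embeddings[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            g = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    # return a 0/1 predictor over the frozen-embedding space
    return lambda x: int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)
```

Because the encoder receives no gradient, the probe's accuracy reflects the quality of the pre-trained representation itself, which is why Figure 3 reports gains under this protocol.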

Table 1 shows the experimental results under both the linear protocol and the semi-supervised protocol. Compared with other methods that use data augmentation and contrastive learning, MoCL performs best on most datasets.

Table 1: Average AUC of various methods

The proposed augmentation MoCL-DK can be applied multiple times to generate more complex views. The authors tried a range of strengths, where strength refers to the number of augmentations applied (e.g., substituting again after a substitution gives strength two). Figure 4 compares the effect of different strengths. For most datasets, performance first rises and then declines as the number of augmentations increases, and MoCL-DK3 usually achieves better results.

Figure 4: Average AUC of MoCL-Local at different strengths

3.2 Global-level domain knowledge

Figure 5 shows the performance gain of different augmentation methods after adding global domain knowledge. Global information generally improves the performance of all augmentation methods, and the gains are much larger for the proposed domain-knowledge augmentations (MoCL-DK1 and MoCL-DK3) than for the other schemes.

Figure 5: Performance gain of different augmentation methods after adding global domain knowledge

3.3 Sensitivity analysis

Figure 6 shows the performance surfaces of the proposed method under different hyperparameter combinations. For the global loss, a relatively small (but not too small) neighborhood size and a larger (but not too large) weight give the best results.

Figure 6: Average AUC under different hyperparameter combinations

4

Summary

In this work, the authors use multi-level domain knowledge to assist contrastive representation learning on molecular graphs. Local domain knowledge supports a new augmentation scheme, while global domain knowledge integrates the global structure of the data into the learning process. The authors show that both kinds of knowledge improve the quality of the learned representations.

Reference

Paper link

https://dl.acm.org/doi/10.1145/3447548.3467186
