Google | Deep embedding and alignment of protein sequences

2022-07-03 05:14:00 【Zhiyuan community】

【Title】 Deep embedding and alignment of protein sequences

【Authors】 Felipe Llinares-López, Quentin Berthet, Mathieu Blondel, Olivier Teboul, Jean-Philippe Vert

【Publication date】 2021/07/01

【Institution】 Google

【Paper link】 https://doi.org/10.1101/2021.11.15.468653

【Code link】 https://github.com/google-research/google-research/tree/master/dedal

Protein sequence alignment is a key component of most bioinformatics methods for studying protein structure and function. Aligning highly divergent sequences nevertheless remains a difficult task that current algorithms often fail to perform accurately, which leaves many proteins poorly annotated. Drawing on recent advances in deep learning for language modeling and differentiable programming, the paper proposes DEDAL, a flexible model for aligning protein sequences and detecting homologs. DEDAL is a pre-trained model that learns to align sequences by observing large datasets of raw protein sequences and of correct alignments. Once trained, DEDAL improves alignment accuracy on remote homologs two- to three-fold over existing methods, and it discriminates better between remote homologs and evolutionarily unrelated sequences. This opens the way to improvements in the many downstream tasks in structural and functional genomics that rely on sequence alignment.

The figure above gives an overview of DEDAL (Deep Embedding and Differentiable ALignment). DEDAL builds on the standard Smith-Waterman (SW) algorithm and uses a smoothed variant of it. Like SW, it efficiently finds an optimal alignment between two sequences, but it parameterizes the SW scoring function flexibly, so that the scores adapt to each sequence pair and to each position within each sequence. DEDAL aligns two sequences by running the SW local alignment algorithm with specific parameters (gap open O, gap extend E, substitution scores S). These parameters depend on the input sequences: each sequence is encoded by a Transformer encoder Tφ, and a parameterizer Pβ then converts the continuous representations into the parameter matrices that SW requires. Tφ and Pβ depend on parameters φ and β, which are learned during training. A large corpus of raw sequences is used to pre-train Tφ as a language model. This is combined with sequence pairs with known alignments to learn Tφ and Pβ jointly, by end-to-end gradient-based optimization through a continuous, differentiable variant of SW and a dedicated alignment loss function.
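To make the role of the three parameter matrices concrete, here is a minimal sketch of SW local alignment with affine gaps in which S, O and E are given per pair of positions, as the paragraph above describes. It is an illustration in plain Python, not DEDAL's code: the function names, the three-state recursion layout, the convention that gap penalties are positive numbers subtracted from the score, and the `uniform_params` helper with its default values are all assumptions made for this example.

```python
def smith_waterman(S, O, E):
    """Best local alignment score with position-specific affine gap penalties.

    S[i][j]: substitution score for aligning x[i] with y[j].
    O[i][j], E[i][j]: gap-open and gap-extend penalties (positive numbers),
    indexed here by the cell where the gap state is entered or extended
    (an assumption of this sketch).
    """
    n, m = len(S), len(S[0])
    neg = float("-inf")
    # M: alignment ends with x[i-1] matched to y[j-1].
    # X: alignment ends with x[i-1] facing a gap; Y: y[j-1] facing a gap.
    M = [[neg] * (m + 1) for _ in range(n + 1)]
    X = [[neg] * (m + 1) for _ in range(n + 1)]
    Y = [[neg] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s, o, e = S[i - 1][j - 1], O[i - 1][j - 1], E[i - 1][j - 1]
            # The 0.0 option lets a local alignment start at any match.
            M[i][j] = s + max(0.0, M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] - o, X[i - 1][j] - e)
            Y[i][j] = max(M[i][j - 1] - o, Y[i][j - 1] - e)
            # A local alignment is only allowed to end on a match.
            best = max(best, M[i][j])
    return best


def uniform_params(x, y, match=2.0, mismatch=-1.0, gap_open=2.0, gap_extend=1.0):
    """Hypothetical helper: the classic context-free setting as a special case."""
    S = [[match if a == b else mismatch for b in y] for a in x]
    O = [[gap_open] * len(y) for _ in x]
    E = [[gap_extend] * len(y) for _ in x]
    return S, O, E


S, O, E = uniform_params("ACGT", "AGT")
print(smith_waterman(S, O, E))
```

In a classical aligner every row of O and E holds the same constant and S is a table lookup. In the setting described above, the matrices would instead be filled in by the parameterizer for each new pair of sequences, while the recursion itself stays unchanged.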

Once trained, DEDAL produces gap and substitution scoring matrices computed specifically for each new pair of sequences. Moreover, the gap and substitution scores are contextual: for each pair of positions, they depend on the full sequences being aligned. A standard SW algorithm then uses these parameters to compute the optimal alignment. The paper shows that DEDAL can be trained efficiently on modern hardware with accelerators. Once trained, DEDAL improves the quality of predicted alignments for remote homologs two- to three-fold compared with standard SW, and it produces an alignment score that detects remote homologs more accurately.

The figure above shows an example alignment of two protein domain sequences from Pfam-A seed.

a. Alignments taken from the Pfam-A seed database (second row), predicted by DEDAL (third row), and predicted with the PFASUM70 substitution matrix (fourth row). All residues aligned by Pfam-A seed and by DEDAL in the two sequences are shown, but the unaligned residues upstream and downstream of the PFASUM alignment are not. Residues highlighted in green are conserved residues that are correctly aligned. Residues shown in red mark differences between the predicted alignment and the Pfam-A seed alignment.

b. Substitution scores between all pairs of residues, from the PFASUM substitution matrix.

c. SW parameters predicted by DEDAL.

On the technical side, the paper explores two ways to build the differentiable SW alignment module needed to train DEDAL's parameters on the "learning to align" task: smoothing and perturbation. It finds no clear difference in performance between the two, and the final DEDAL model implements the perturbation-based method. Regarding the alignments used to train DEDAL, the paper finds that using Pfam extended domains instead of Pfam domains is beneficial when DEDAL is expected to predict accurate local alignments. When DEDAL is pre-trained on the masked language modeling task, excluding sequences related to the out-of-distribution families from the "protein universe" leads to a slight drop in performance on remote homologs, although the performance gap relative to the baseline is not pronounced.
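The two relaxations named above can be illustrated on the single operation that makes SW non-differentiable, the max over candidate scores. The sketch below is a generic plain-Python illustration, not DEDAL's implementation: the function names, the temperature and noise-scale parameters, and the choice of Gaussian noise are assumptions made for this example. Smoothing replaces max with a temperature-scaled log-sum-exp, whose gradient is a softmax. Perturbation adds random noise to the scores and averages the resulting hard argmax choices.

```python
import math
import random


def smooth_max(values, temperature=1.0):
    """Log-sum-exp relaxation of max; approaches max(values) as temperature -> 0."""
    m = max(values)
    total = sum(math.exp((v - m) / temperature) for v in values)
    return m + temperature * math.log(total)


def smooth_argmax(values, temperature=1.0):
    """Gradient of smooth_max with respect to values: a softmax distribution."""
    m = max(values)
    weights = [math.exp((v - m) / temperature) for v in values]
    z = sum(weights)
    return [w / z for w in weights]


def perturbed_argmax(values, sigma=1.0, num_samples=1000, seed=0):
    """Monte Carlo estimate of the expected one-hot argmax under Gaussian noise."""
    rng = random.Random(seed)
    counts = [0] * len(values)
    for _ in range(num_samples):
        noisy = [v + sigma * rng.gauss(0.0, 1.0) for v in values]
        counts[noisy.index(max(noisy))] += 1
    return [c / num_samples for c in counts]


scores = [1.0, 2.0, 3.0]
print(smooth_max(scores, temperature=0.5))
print(smooth_argmax(scores, temperature=0.5))
print(perturbed_argmax(scores, sigma=0.5))
```

Both variants return a probability-like weighting over the candidates instead of a single hard choice, which is what allows gradients to flow back to the scores. Applied inside the SW recursion, the same idea yields a soft alignment rather than one optimal path.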

Regarding the strategy of training the Transformer and the parameterizer jointly, end to end, the paper finds that it is indeed significantly better than the more classic two-step strategy, in which the Transformer encoder is first trained on the masked language modeling task and then frozen while the parameterizer is trained on the "learning to align" task. This suggests that a general-purpose language model such as ESM is not sufficient on its own and should at least be fine-tuned to reach the best alignment performance.

The figure above shows the learned embeddings applied to downstream tasks. The paper assesses the benefit of contextual embeddings by training a model in which the substitution cost is constrained to depend only on the amino acids being aligned. Unsurprisingly, the performance of this model drops sharply, down to the level of the best substitution matrix.
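The difference between the two settings compared above can be sketched in a few lines. In the constrained model, the score for a pair of positions is a lookup on the two amino acids alone, so identical residues always score the same. In the contextual model, the score is computed from per-position embeddings that can differ for the same residue in different surroundings. Everything in this sketch is hypothetical: the function names, the toy score table, and in particular the plain dot product, which stands in for a learned parameterizer and is not DEDAL's actual Pβ.

```python
def context_free_scores(x, y, table, default=-1.0):
    """S[i][j] depends only on the residues x[i] and y[j] (symmetric lookup)."""
    return [[table.get((a, b), table.get((b, a), default)) for b in y] for a in x]


def contextual_scores(emb_x, emb_y):
    """S[i][j] from per-position embeddings; a dot product is used as a stand-in."""
    return [[sum(u * v for u, v in zip(ex, ey)) for ey in emb_y] for ex in emb_x]


# Toy table: both occurrences of "A" in x get the same score against y's "A".
table = {("A", "A"): 4.0, ("A", "G"): 0.0, ("G", "G"): 6.0}
print(context_free_scores("AGA", "A", table))

# Toy embeddings: the two "A" positions carry different context vectors,
# so they can receive different scores against the same position of y.
emb_x = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
emb_y = [[2.0, 3.0]]
print(contextual_scores(emb_x, emb_y))
```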

Copyright notice
This article was written by [Zhiyuan community]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/184/202207030510166475.html
