ESMFold: a new breakthrough in protein structure prediction after AlphaFold2
2022-07-26 23:00:00 【weixin_ four million five hundred and twenty-eight thousand thr】
Alexander Rives, head of the protein team at Meta AI, has just announced the team's latest result on Twitter: ESMFold, which performs high-accuracy, end-to-end, atomic-level structure prediction from a single protein sequence, with inference faster than AlphaFold. It is built on the largest protein language model to date.
A year ago, DeepMind open-sourced AlphaFold2, which appeared in both Nature and Science and swept through the biology and AI communities. A year later, Meta has arrived with ESMFold, which is an order of magnitude faster. It is not only fast: the model has a full 15 billion parameters.

LeCun praised the work in a tweet, calling it a great new achievement by the Meta-FAIR protein team.

According to co-author Zeming Lin, the 3-billion-parameter model was trained on 256 GPUs for 3 weeks, while ESMFold took 10 days on 128 GPUs. Figures for the 15-billion-parameter version are not yet clear.
He also said that the code will definitely be open-sourced later. Stay tuned!
Big and fast!
Today's main character is ESMFold, a model that predicts atomic-level structure directly from a single protein sequence, with high accuracy and end to end.

Paper: https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1
The benefit of 15 billion parameters goes without saying: after training, today's large model can predict the three-dimensional structure of proteins at atomic-level accuracy.
In terms of accuracy, ESMFold is roughly on par with AlphaFold2 and RoseTTAFold.
However, ESMFold's inference is an order of magnitude faster than AlphaFold2's!
An order of magnitude can be hard to picture, so for the speed comparison among the three, just look at the figure below.

What's the difference?
Although AlphaFold2 and RoseTTAFold achieved breakthroughs in atomic-resolution structure prediction, they rely on multiple sequence alignments (MSAs) and templates of similar protein structures to reach their best performance.
By contrast, ESMFold uses the internal representations of a language model, so it can generate a structure prediction from a single input sequence. This greatly speeds up structure prediction.

The researchers found that for low-perplexity sequences, ESMFold's predictions are comparable to those of state-of-the-art models.
Moreover, the accuracy of structure prediction is closely tied to the perplexity of the language model. In other words, the better the language model understands a sequence, the better it understands the structure.
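The language-model measure behind this relationship is perplexity. A minimal sketch of how it is computed is shown here, assuming we already have the probability the model assigned to the true residue at each position. The probability lists are made-up illustration values, not outputs of ESM-2.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability of the true tokens)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

# Hypothetical probabilities assigned to the correct amino acid at each position.
well_understood = [0.90, 0.80, 0.95, 0.85]
# 0.05 = 1/20, i.e. a uniform guess over the 20 standard amino acids.
poorly_understood = [0.05, 0.05, 0.05, 0.05]

print(perplexity(well_understood))    # close to 1: the model is rarely surprised
print(perplexity(poorly_understood))  # about 20: no better than guessing
```

Lower perplexity means the model is less surprised by the sequence, which is the sense in which it "understands" it.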

At present, there are billions of protein sequences whose structure and function are unknown, many of them from metagenomic sequencing.
With ESMFold, the researchers need only 6 hours to fold a random sample of 1 million metagenomic sequences.
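To put those two numbers in perspective, the implied aggregate throughput can be worked out directly. This is only back-of-the-envelope arithmetic: the article does not say how much hardware was used, so the device count below is a purely hypothetical value for illustration.

```python
N_SEQUENCES = 1_000_000   # size of the random sample, from the article
N_HOURS = 6               # wall-clock time, from the article

def aggregate_rate(n_sequences, n_hours):
    """Sequences folded per second across all hardware combined."""
    return n_sequences / (n_hours * 3600)

def seconds_per_sequence(rate, n_devices):
    """Average time one device spends per sequence, if work is split evenly."""
    return n_devices / rate

rate = aggregate_rate(N_SEQUENCES, N_HOURS)
print(f"{rate:.1f} sequences per second")  # → 46.3 sequences per second

# Hypothetical: if the job were spread evenly over 1000 devices.
print(f"{seconds_per_sequence(rate, 1000):.1f} s per sequence per device")  # → 21.6
```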

A large fraction of them have high confidence and are unlike any known structure (there is no record of them in the databases).
The researchers believe ESMFold can help us understand protein structures that lie beyond current knowledge.

In addition, because ESMFold predicts an order of magnitude faster than existing models, researchers can use it to help close the gap between fast-growing protein sequence databases and the much slower-growing databases of protein structure and function.
A 15-billion-parameter protein language model
Next, let's look at Meta's brand-new ESMFold in detail.
ESM-2 is a Transformer-based language model that uses the attention mechanism to learn patterns of interaction between pairs of amino acids in the input sequence.
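As a rough illustration of what "attention between pairs of amino acids" means, the sketch below computes scaled dot-product attention weights for a toy four-residue sequence. The embeddings are made-up numbers, and the learned query and key projections of a real Transformer are replaced by the identity, so this shows the mechanism only and is not ESM-2's actual configuration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Return an L x L matrix; row i says how strongly residue i attends to each residue j."""
    d = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights.append(softmax(scores))
    return weights

# Toy 2-dimensional embeddings for a 4-residue sequence (illustrative values only).
sequence = "MKTA"
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.9, 0.1]]

# Assumption: identity query/key projections instead of learned ones.
W = attention_weights(embeddings, embeddings)
for residue, row in zip(sequence, W):
    print(residue, [round(w, 2) for w in row])
```

Each row sums to 1, and the full matrix is a pairwise map over residues, which is why attention is a natural carrier of contact-like information.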
Compared with the previous-generation model ESM-1b, Meta improved the model architecture and training parameters and added more compute and data. Relative positional embeddings were also added, so the model can generalize to sequences of arbitrary length.
In terms of results, the 150-million-parameter ESM-2 model outperforms the 650-million-parameter ESM-1b model.
On structure-prediction benchmarks, ESM-2 also surpasses other protein language models. This improvement is consistent with the scaling laws established in large-scale language modeling.

As ESM-2 is scaled up, the accuracy of language modeling improves substantially.

End to end single sequence structure prediction
A key difference between ESMFold and AlphaFold2 is that ESMFold uses language-model representations, which removes the need for explicit homologous sequences (in the form of an MSA) as input.
ESMFold simplifies the Evoformer in AlphaFold2 by replacing the computationally expensive network modules that process the MSA with a Transformer module that processes a single sequence. This simplification makes ESMFold much faster than MSA-based models.
The output of the folding trunk is then processed by a structure module, which outputs the final atomic-level structure and the confidence of the prediction.

The researchers compared ESMFold with AlphaFold2 and RoseTTAFold on the CAMEO (April 2022 to June 2022) and CASP14 (May 2020) test sets.
When only a single sequence is given as input, ESMFold performs much better than AlphaFold2.
With its full pipeline, AlphaFold2 reaches 88.3 and 84.7 on CAMEO and CASP14 respectively. On CAMEO, ESMFold achieves accuracy comparable to RoseTTAFold, with an average TM-score of 82.0.
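For readers unfamiliar with the metric, the standard TM-score formula can be sketched as follows. TM-score normally lies between 0 and 1, so the figures quoted here are presumably reported on a 0 to 100 scale; that reading is an assumption. The sketch scores one fixed superposition only, whereas real tools also search for the superposition that maximizes the score.

```python
def tm_score(distances, target_length):
    """TM-score for one fixed superposition.

    distances: distance in angstroms between each aligned residue pair.
    target_length: number of residues in the reference (target) structure.
    """
    if target_length < 19:
        # Guard for this sketch: the d0 formula is not positive for shorter chains.
        raise ValueError("target_length must be at least 19")
    # Length-dependent distance scale from the standard TM-score definition.
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length

# A perfect prediction: every residue aligned with zero deviation.
print(tm_score([0.0] * 100, 100))  # → 1.0

# Only half of the residues aligned (perfectly); the rest contribute nothing.
print(tm_score([0.0] * 50, 100))   # → 0.5
```

Because the sum is divided by the target length rather than the number of aligned residues, missing or badly placed regions lower the score instead of being ignored.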

Conclusion
The researchers found that a language model trained with an unsupervised objective on a large, evolutionarily diverse database of protein sequences can predict protein structure at atomic resolution.
Scaling the language model to 15B parameters makes it possible to study systematically how scale affects the learning of protein structure.
Structure-prediction accuracy follows a nonlinear curve as a function of model scale, and there is a strong link between how well the language model understands a sequence and the quality of the structure prediction.
The ESM-2 family is the largest protein language model trained so far, with only an order of magnitude fewer parameters than the largest recently developed text models.
Moreover, ESM-2 is a substantial improvement over previous models: even at 150M parameters, ESM-2 captures a more accurate picture of structure than the ESM-1 generation of language models does at 650 million parameters.
The researchers say the biggest driver of ESMFold's performance is the language model. Because the perplexity of the language model is strongly linked to the accuracy of structure prediction, they found that when ESM-2 understands a protein sequence well, predictions comparable to those of the current state-of-the-art models can be obtained.
ESMFold produces accurate atomic-resolution structure predictions, with inference an order of magnitude faster than AlphaFold2.
In practice the speed advantage is even greater, because ESMFold does not need to search for evolutionarily related sequences to build an MSA.
Faster methods can cut the search time, but even after the reduction the search may still take a long time.
The benefit of a much shorter inference time is self-evident: the speedup makes it feasible to map the structural space of large metagenomic sequence databases.
Alongside structure-based tools for identifying remote homology and conservation, fast and accurate structure prediction with ESMFold can also play an important role in the structural and functional analysis of large collections of new sequences.
Obtaining millions of predicted structures in a limited amount of time helps build a new understanding of the breadth and diversity of natural proteins, and can lead to the discovery of new protein structures and functions.