January 23, 2022 [reading notes] - bioinformatics and functional genomics (Chapter 6: multiple sequence alignment)
2022-06-30 07:39:00 【Muyiqing】
- Learning goals
- Understand the three main stages of multiple sequence alignment (MSA) with ClustalW;
- Describe several other MSA programs, understand how they work, and compare them with ClustalW;
- Understand the importance of benchmarking and some basic conclusions about MSA;
- Understand some of the issues that arise in MSA of genomic regions.
- 6.1 Introduction
- This chapter covers general questions about MSA
- It introduces five approaches to MSA;
- It surveys databases of multiple sequence alignments, such as Pfam;
- It discusses multiple sequence alignment of genomic DNA.
- Definition of multiple sequence alignment
- A multiple sequence alignment is a set of three or more protein (or nucleic acid) sequences that are partially or completely aligned.
- A protein family does not necessarily have a single "correct" alignment (β globin and myoglobin share only 25% identity, yet their three-dimensional structures are almost the same).
- In a multiple sequence alignment, amino acid residues are arranged in columns. Residues can be placed in the same column on the basis of features such as:
- Highly conserved residues, such as cysteines that form disulfide bonds.
- Conserved motifs, such as transmembrane spans or immunoglobulin domains.
- Conserved features of protein secondary structure, such as residues that contribute to α helices, β sheets, or transition domains.
- Regions that show consistent patterns of insertions or deletions.
- Typical applications and practical strategies of multiple sequence alignment
- When and why should multiple sequence alignment be used?
- 1. If the protein of interest is related to a large group of proteins, the members of that group can usually provide information about its likely function, structure, and evolution.
- 2. Most protein families have distantly related members, and MSA detects homology more sensitively than pairwise alignment.
- 3. When viewing database search results, an MSA shows conserved residues and motifs more clearly.
- 4. Algorithms that assess whether a variant (SNP) is deleterious usually rely on multiple sequence alignments of DNA and protein to measure cross-species conservation; deleterious mutations tend to occur at more conserved sites.
- 5. The study of population data can provide insight into many biological questions involving evolution, structure, and function.
- 6. When the complete genome of a species is sequenced, a major part of the analysis is to define which protein family each gene product belongs to.
- 7. Phylogenetic algorithms take a multiple sequence alignment as raw data and generate a phylogenetic tree from it.
- 8. Consensus sequences containing transcription factor binding sites and other conserved elements are identified mainly from conserved noncoding sequences detected by multiple sequence alignment.
- 6.2 Five main approaches to multiple sequence alignment
- Five common methods
- Exact methods
- Progressive alignment
- Iterative methods
- Consistency-based methods
- Structure-based methods
- Exact methods
- An extension of the dynamic programming algorithm for pairwise alignment described by Needleman and Wunsch.
- The pairwise dynamic programming algorithm is applied to a multidimensional alignment matrix, and the goal is to maximize the sum of the alignment scores over every pair of sequences.
- Pros and cons: exact methods produce the optimal alignment, but they are infeasible in time and space for more than a few sequences. For N sequences of average length L, the time requirement is O(2^N * L^N). By comparison, the time complexity of ClustalW is O(N^4 + L^2) and that of MUSCLE is O(N^4 + NL^2). These algorithms are fast, but as heuristics they cannot guarantee the optimal alignment.
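To make the cost of the exact approach concrete, here is a minimal Python sketch of the pairwise dynamic programming recurrence that exact methods generalize. The scoring values (`match=1`, `mismatch=-1`, `gap=-2`) and the helper name `lattice_cells` are assumptions chosen for illustration, not values from the chapter.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of two sequences (linear gap penalty)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def lattice_cells(length, n_sequences):
    """Number of cells in the N-dimensional matrix an exact method must fill."""
    return (length + 1) ** n_sequences


pair_score = nw_score("ACGT", "AGT")   # three matches and one gap
cells_2 = lattice_cells(300, 2)        # pairwise case
cells_5 = lattice_cells(300, 5)        # five sequences of the same length
```

The pairwise matrix for two sequences of length 300 has about 9 x 10^4 cells, while the five-sequence lattice has more than 10^12, which is why heuristics are used in practice.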
- Progressive alignment
- Proposed by Fitch and Yasunobu (1975); described by Hogeweg and Hesper (1984), who applied it to the alignment of 5S ribosomal RNA sequences; extended by Da-Fei Feng and Russell Doolittle (1987, 1990).
- Methods and strategies
- Pairwise alignment scores are calculated for all the protein sequences to be aligned. The alignment starts with the two most similar sequences, and the remaining sequences are then added progressively.
- Pros and cons: hundreds of sequences can be aligned rapidly. The main limitation is that the final alignment depends on the order in which the sequences are added.
- Common progressive alignment tools
- ClustalW
- Web tools
- ClustalW runs in three stages
- Stage 1: a series of pairwise alignments
- Dynamic programming is used to generate pairwise alignments between all the proteins to be aligned. For example, five sequences yield 10 pairwise alignment scores.
- Stage 2: build the guide tree
- A guide tree is calculated from the distance (or similarity score) matrix
- There are two main ways to build a guide tree (introduced in Chapter 7)
- Unweighted pair group method with arithmetic mean (UPGMA)
- Neighbor joining
- The main features of a tree
- Topology (the branching order)
- Evolutionary distance (the branch lengths)
- The tree reflects how closely related the sequences in the multiple alignment are
- Stage 3: create the multiple sequence alignment in a series of steps that follow the branching order of the guide tree
- The algorithm selects the two closest sequences from the guide tree and aligns them pairwise. These two sequences sit at leaf nodes of the tree, that is, at the positions of the existing sequences. The next sequence is either added to this pairwise alignment or used to start another pairwise alignment. The alignment proceeds progressively until the root of the tree is reached and all sequences are aligned.
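The first two stages can be sketched in Python as follows. This is a simplified illustration, not ClustalW's implementation: it assumes the toy sequences are already the same length and uses the fraction of mismatched positions as the distance, whereas ClustalW derives distances from pairwise alignments. The function names `p_distance` and `upgma_order` are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of positions that differ (assumes equal-length sequences)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma_order(dist, names):
    """Return the UPGMA merge order; dist maps frozenset({a, b}) to a distance."""
    sizes = {name: 1 for name in names}   # cluster label -> number of leaves
    d = dict(dist)
    merges = []
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sorted(sizes), 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = f"({a},{b})"
        # The distance to the merged cluster is the size-weighted average.
        for c in sizes:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
        merges.append((a, b))
    return merges


seqs = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAAGGA", "D": "TCGAAGCA"}
# Stage 1: one distance per pair of sequences, N(N-1)/2 in total.
distances = {
    frozenset((m, n)): p_distance(seqs[m], seqs[n])
    for m, n in combinations(seqs, 2)
}
# Stage 2: the merge order of the guide tree, which stage 3 then follows.
order = upgma_order(distances, list(seqs))
```

Here `order` lists the pairs in the sequence in which a progressive aligner would combine them, starting with the two closest sequences and ending at the root.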
- Iterative methods
- Iterative methods use progressive alignment to compute a suboptimal solution, then modify the alignment with dynamic programming or other methods until the solution converges. An initial tree is split, and the profiles on either side of the split are realigned. These methods therefore build an initial alignment and then revise it to try to improve it, using an objective function to maximize the score.
- Progressive alignment has a limitation: once an error is introduced during the alignment, it cannot be corrected. Iterative methods can overcome this limitation.
- The MAFFT multiple sequence alignment package includes several progressive methods:
- A single-round progressive method, similar to ClustalW, that uses a fast Fourier transform;
- A two-round method: an initial multiple sequence alignment is generated, refined distances are calculated from that alignment, and a second progressive alignment is performed;
- PartTree progressive alignment: pairwise distances are calculated from matching 6-tuples, an approach known as k-mer counting.
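The k-mer counting idea can be illustrated with a short Python sketch. The distance formula below (one minus the fraction of shared 6-mers) is an assumption chosen for clarity and is not claimed to be the exact formula used by MAFFT; the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k=6):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=6):
    """1 - (shared k-mers / k-mers in the shorter sequence); no alignment needed."""
    n_kmers = min(len(a), len(b)) - k + 1
    if n_kmers <= 0:
        raise ValueError("sequences must be at least k residues long")
    # Counter intersection keeps the minimum count of each shared k-mer.
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return 1.0 - shared / n_kmers


seq_a = "MKTAYIAKQRQISFVK"
seq_b = "MKTAYIAKQRWWWWWW"   # shares only the first ten residues with seq_a
d_same = kmer_distance(seq_a, seq_a)
d_partial = kmer_distance(seq_a, seq_b)
```

Because no dynamic programming is involved, distances of this kind can be computed quickly for large numbers of sequences before a guide tree is built.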
- MUSCLE runs in three stages
- A progressive multiple sequence alignment produces a draft alignment
- The tree is improved and a new progressive alignment is built
- The guide tree is iteratively refined by systematically partitioning it into subsets: an edge (branch) of the tree is deleted to split the tree into two parts.
- Consistency-based methods
- Main idea: for sequences x, y, and z, if a residue of x aligns with a residue of z, and that residue of z aligns with a residue of y, then the residue of x should align with the residue of y.
- Consistency-based methods draw on information from multiple sequences when scoring each pairwise alignment. What sets this approach apart is that it incorporates evidence from the multiple sequence alignment into the pairwise alignments.
- The ProbCons algorithm consists of five steps
- Compute a posterior probability matrix for each pair of sequences
- Compute the expected accuracy of each pairwise alignment
- Use the "probabilistic consistency transformation" to re-estimate the quality score of each pairwise alignment
- Build a guide tree from the expected accuracies by hierarchical clustering
- Align the sequences progressively in the order given by the guide tree
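The consistency step can be sketched as a small matrix computation. The sketch assumes the usual formulation in which the re-estimated matrix for x and y averages, over every sequence z in the set, the product of the x-z and z-y posterior matrices (with z = x and z = y each contributing the original x-y matrix). The toy probabilities and the function names are illustrative assumptions, not values from the chapter.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def consistency_transform(p_xy, via):
    """Re-estimate P(x_i ~ y_j) using every intermediate sequence z.

    via is a list of (P_xz, P_zy) matrix pairs, one per third sequence z.
    """
    n_seqs = len(via) + 2                      # x, y, and each z
    rows, cols = len(p_xy), len(p_xy[0])
    # z = x and z = y each contribute the original matrix once.
    total = [[2.0 * p_xy[i][j] for j in range(cols)] for i in range(rows)]
    for p_xz, p_zy in via:
        prod = matmul(p_xz, p_zy)
        for i in range(rows):
            for j in range(cols):
                total[i][j] += prod[i][j]
    return [[value / n_seqs for value in row] for row in total]


# Toy posteriors for three sequences of two residues each.
p_xy = [[0.5, 0.1], [0.1, 0.5]]
p_xz = [[0.9, 0.0], [0.0, 0.9]]
p_zy = [[0.9, 0.0], [0.0, 0.9]]
p_new = consistency_transform(p_xy, [(p_xz, p_zy)])
```

In this toy case the diagonal entries rise from 0.5 to about 0.60 because z supports those pairings, while the off-diagonal entries fall because z gives them no support.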
- Structure-based methods
- Using three-dimensional structure information for one or more of the proteins being aligned may improve the accuracy of a multiple sequence alignment. Programs that let users incorporate structural information include PRALINE and the Expresso module of T-COFFEE.
- 6.3 Benchmarking with standard datasets: methods, findings, and challenges
- Benchmark studies of algorithms and software use standard datasets for which a "gold standard" correct answer is available. The answer consists of highly reliable true positive relationships, so programs can be compared objectively to judge which is most accurate.
- Criteria for judging the quality of a benchmark dataset:
- Relevance: the benchmark should include tasks that users actually face when using the software
- Solvability: the tasks should be neither too simple nor too difficult
- Scalability: some tasks are small in scale, while others involve aligning a large number of proteins
- Availability: the benchmark database should be public
- Independence: the methods used to build the benchmark database should not themselves be sequence alignment methods under evaluation
- Evolution: the benchmark dataset should be extended over time to cover new problems
- Recognized benchmark datasets for multiple sequence alignment include BAliBASE, HOMSTRAD, OXBench, PREFAB, SABmark, and IRMBASE. A common approach is to derive reference alignments from proteins with known three-dimensional structures, which are determined by X-ray crystallography.
- The performance of an MSA algorithm on a benchmark dataset can be evaluated with objective scoring functions; a common choice is the sum-of-pairs score.
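One common form of the sum-of-pairs score is the fraction of residue pairs aligned in the reference that the test alignment also aligns. The Python sketch below assumes that definition; benchmark suites may differ in details, and the function names and toy alignments are illustrative.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Set of residue pairs that share a column; gaps ('-') are skipped.

    Each residue is identified as (sequence index, residue index).
    """
    next_index = [0] * len(alignment)
    pairs = set()
    for column in zip(*alignment):
        residues = []
        for seq_id, char in enumerate(column):
            if char != "-":
                residues.append((seq_id, next_index[seq_id]))
                next_index[seq_id] += 1
        pairs.update(combinations(residues, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    ref_pairs = aligned_pairs(reference)
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)


reference = ["AC-GT", "ACAGT", "A--GT"]
test = ["A-CGT", "ACAGT", "A--GT"]   # the C of the first sequence is misplaced
score = sum_of_pairs_score(test, reference)
```

The reference contains 10 aligned residue pairs and the test alignment recovers 9 of them, so `score` is 0.9.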