January 23, 2022 [Reading notes] - Bioinformatics and Functional Genomics (Chapter 6: Multiple Sequence Alignment)

2022-06-30 07:39:00 【Muyiqing】

Learning goals
- Understand the three main stages of multiple sequence alignment (MSA) with ClustalW;
- Describe several other MSA programs, understand how they work, and compare how they resemble and differ from ClustalW;
- Understand the importance of benchmarking and some basic conclusions about MSA;
- Understand some of the issues that arise in MSA of genomic regions.
6.1 Introduction
- This chapter discusses general questions about MSA
  - Introduces five approaches to MSA;
  - Covers databases of multiple sequence alignments, such as Pfam;
  - Discusses multiple sequence alignment of genomic DNA.
- Definition of multiple sequence alignment
  - A multiple sequence alignment is a set of three or more protein (or nucleic acid) sequences that are partially or completely aligned.
  - A protein family does not necessarily have a single "correct" alignment (β globin and myoglobin share only about 25% identity, yet their three-dimensional structures are almost identical).
  - A multiple sequence alignment is characterized by columns of aligned amino acid residues. The alignment can be guided by features of those residues, such as:
    - Highly conserved residues, such as cysteines that form disulfide bonds.
    - Conserved motifs, such as transmembrane spans or immunoglobulin domains.
    - Conserved features of protein secondary structure, such as residues that contribute to α helices, β sheets, or transitional domains.
    - Regions that show consistent patterns of insertions or deletions.
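
As a small illustration of what "highly conserved residues" means in terms of alignment columns, the sketch below reports the most common residue in each column and how often it occurs. The function name `column_conservation` and the toy sequences are made up for this note; real tools use more refined conservation scores.

```python
from collections import Counter

def column_conservation(alignment):
    """For each column, return (most common residue, its frequency).

    `alignment` is a list of equal-length strings; '-' marks a gap.
    Gaps are not counted as residues, but they still count toward
    the column size, so gappy columns score lower.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    result = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            result.append(("-", 0.0))
            continue
        residue, count = Counter(residues).most_common(1)[0]
        result.append((residue, count / len(column)))
    return result

# Toy alignment (made-up sequences, for illustration only).
toy = ["ACDC-G", "ACEC-G", "ACDCWG"]
conservation = column_conservation(toy)
for index, (residue, freq) in enumerate(conservation):
    print(index, residue, round(freq, 2))
```

Columns with frequency 1.0 (here the two cysteine columns, among others) are fully conserved; the column that is mostly gaps scores low, which is one simple way to spot an insertion or deletion region.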
- Typical applications and practical strategies of multiple sequence alignment
  - When should multiple sequence alignment be used, and why?
  - 1. If the protein of interest is related to a larger group of proteins, the members of that group can often provide information about its likely function, structure, and evolution.
  - 2. Most protein families have distantly related members; MSA detects homology more sensitively than pairwise alignment.
  - 3. When reviewing database search results, an MSA makes conserved residues and motifs easier to see.
  - 4. Algorithms that assess whether a mutation (SNP) is deleterious usually rely on multiple sequence alignments of DNA and protein to evaluate cross-species conservation: deleterious variants tend to occur at more conserved positions.
  - 5. The study of population data can provide insight into many biological questions involving evolution, structure, and function.
  - 6. When the complete genome of any species is sequenced, a major part of the analysis is defining which protein family each gene product belongs to.
  - 7. Phylogenetic algorithms take a multiple sequence alignment as their raw data and generate a phylogenetic tree from it.
  - 8. Consensus sequences containing transcription factor binding sites and other conserved elements are identified mainly from conserved noncoding sequences detected by multiple sequence alignment.
6.2 Five main approaches to multiple sequence alignment
- Five common approaches
  - Exact methods
  - Progressive alignment
  - Iterative methods
  - Consistency-based methods
  - Structure-based methods
- Exact methods
  An extension of the dynamic programming algorithm for pairwise alignment described by Needleman and Wunsch
  - The pairwise dynamic programming algorithm is extended to a multidimensional matrix; the goal is to maximize the sum of the alignment scores over every pair of sequences.
  - Pros and cons: exact methods produce the optimal alignment, but they are infeasible in time and space for more than a few sequences. For N sequences the time required is O(2^N * L^N), where N is the number of sequences and L is the average sequence length. By comparison, the time complexity of ClustalW is O(N^4 + L^2) and that of MUSCLE is O(N^4 + N*L^2). These algorithms are fast, but because they are heuristics they cannot guarantee an optimal alignment.
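
To get a feel for the gap between these complexity classes, the sketch below simply plugs numbers into the three expressions quoted above, ignoring constant factors. The function names are made up for this note, and the results are orders of magnitude, not run-time predictions.

```python
def exact_cost(n, length):
    """Operation count suggested by O(2^N * L^N), constants ignored."""
    return (2 ** n) * (length ** n)

def clustalw_cost(n, length):
    """Operation count suggested by O(N^4 + L^2), constants ignored."""
    return n ** 4 + length ** 2

def muscle_cost(n, length):
    """Operation count suggested by O(N^4 + N*L^2), constants ignored."""
    return n ** 4 + n * length ** 2

# Compare the three expressions for sequences of average length 100.
for n in (2, 3, 5, 10):
    print(n, f"{exact_cost(n, 100):.1e}", clustalw_cost(n, 100), muscle_cost(n, 100))
```

For two sequences of length 100 the exact expression gives 40,000, but for ten sequences it is already on the order of 10^23, while the heuristic expressions stay in the tens or hundreds of thousands.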
- Progressive alignment
  Proposed by Fitch and Yasunobu (1975), described by Hogeweg and Hesper (1984), who applied it to the alignment of 5S ribosomal RNA sequences, and extended by Da-Fei Feng and Russell Doolittle (1987, 1990)
  - Method and strategy
    - Compute pairwise alignment scores for all of the protein sequences to be aligned, start with the most similar pair, then progressively add more sequences to the alignment.
  - Pros and cons: supports rapid alignment of hundreds of sequences. The main limitation is that the final alignment depends on the order in which the sequences are added.
  - Common progressive alignment tools
    - ClustalW
      - Web tools
    - ClustalW proceeds in three stages
      - Stage 1: a series of pairwise alignments
        Dynamic programming is used to generate pairwise alignments between all of the proteins to be aligned. For example, five sequences yield 10 pairwise alignment scores.
      - Stage 2: build the guide tree
        A guide tree is computed from the distance (or similarity score) matrix.
        There are two main ways to build a guide tree (introduced in Chapter 7):
        - Unweighted pair group method with arithmetic mean (UPGMA)
        - Neighbor joining
        The main features of a tree:
        - Topology (the branching order)
        - Evolutionary distance (the branch lengths)
        The tree reflects how closely related the sequences in the multiple alignment are.
      - Stage 3: create the multiple sequence alignment in a series of steps that follow the branching order of the guide tree
        The algorithm selects the two most closely related sequences from the guide tree and aligns them pairwise. These two sequences sit at leaf nodes of the tree, that is, at the positions of existing sequences. The next sequence is either added to this pairwise alignment or used to start another pairwise alignment. Alignment proceeds progressively until the root of the tree is reached and all sequences are aligned.
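
The first two stages, and the merge order that drives the third, can be sketched in a few lines. This is a minimal illustration and not ClustalW's implementation: it assumes a unit-cost edit distance as a stand-in for ClustalW's pairwise scoring, uses UPGMA for the guide tree, and only reports the order in which sequences and groups would be aligned (the profile alignment itself is omitted). All function names and the five toy sequences are made up for this note.

```python
from itertools import combinations

def edit_distance(a, b):
    """Unit-cost global alignment distance computed by dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # gap in b
                               current[j - 1] + 1,             # gap in a
                               previous[j - 1] + (ca != cb)))  # match or mismatch
        previous = current
    return previous[-1]

def pairwise_distances(seqs):
    """Stage 1: one normalized distance per pair, N*(N-1)/2 in total."""
    dist = {}
    for x, y in combinations(sorted(seqs), 2):
        longest = max(len(seqs[x]), len(seqs[y]))
        dist[(x, y)] = edit_distance(seqs[x], seqs[y]) / longest
    return dist

def upgma_merge_order(dist, names):
    """Stage 2: UPGMA clustering; returns the merges, closest pair first."""
    dist = dict(dist)
    size = {name: 1 for name in names}  # cluster label -> number of leaves
    merges = []
    while len(size) > 1:
        a, b = min(dist, key=dist.get)  # closest pair of clusters
        merged = f"({a},{b})"
        merges.append((a, b))
        for c in size:
            if c in (a, b):
                continue
            d_ac = dist.pop(tuple(sorted((a, c))))
            d_bc = dist.pop(tuple(sorted((b, c))))
            # Average distance, weighted by the number of leaves per cluster.
            new_d = (size[a] * d_ac + size[b] * d_bc) / (size[a] + size[b])
            dist[tuple(sorted((merged, c)))] = new_d
        del dist[(a, b)]
        size[merged] = size.pop(a) + size.pop(b)
    return merges

# Five made-up sequences, so stage 1 produces 10 pairwise distances.
seqs = {"s1": "ACDEFGHIKL", "s2": "ACDEFGHIKV", "s3": "ACDQFGHLKV",
        "s4": "MCDQWGTLRV", "s5": "MCNQWGTLRA"}
distances = pairwise_distances(seqs)
merges = upgma_merge_order(distances, seqs)
print(len(distances), "pairwise distances")
for step, (a, b) in enumerate(merges, start=1):
    print(f"stage 3, step {step}: align {a} with {b}")
```

The printed steps mirror the description above: the closest pair of leaves is aligned first, the next sequence either joins that alignment or starts a new one, and the final step joins the two groups at the root.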
- Iterative methods
  - Iterative methods use a progressive alignment strategy to compute a suboptimal solution, then use dynamic programming or other methods to modify the alignment until the solution converges. For example, the initial tree is split and the profiles on the two sides are realigned. These methods thus build an initial alignment and then modify it to try to improve it, using an objective function to maximize the score.
  - Progressive alignment has a limitation: once an error enters the alignment it cannot be corrected. Iterative methods overcome this limitation.
  - The MAFFT multiple sequence alignment package includes several progressive methods:
    - A single-round progressive method similar to ClustalW, in which a fast Fourier transform is used to speed up alignment;
    - A two-round method: a multiple sequence alignment is generated first, refined distances are then calculated from that alignment, and a second progressive alignment is built;
    - PartTree progressive alignment: pairwise distances are computed from matching 6-tuples, a method known as k-mer counting.
  - MUSCLE runs in three stages
    - Produce a draft alignment using progressive multiple sequence alignment
    - Improve the tree and build a new progressive alignment
    - Iteratively refine the alignment by systematically partitioning the guide tree into subsets: an edge (branch) of the tree is deleted to split it into two subtrees.
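
The k-mer counting idea mentioned above for PartTree can be sketched as follows: two sequences are considered close if they share many short words, which avoids running a full dynamic programming alignment for every pair. This is a simplified illustration; the distance formula, the function names, and the example sequences are assumptions made for this note, and the distances actually used by MAFFT are defined differently in detail.

```python
from collections import Counter

def kmer_counts(seq, k=6):
    """Count every overlapping k-mer (word of length k) in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_distance(a, b, k=6):
    """1 - (shared k-mers / k-mers available in the shorter sequence)."""
    possible = min(len(a), len(b)) - k + 1
    if possible <= 0:
        raise ValueError("sequences must be at least k residues long")
    # Counter intersection keeps the minimum count of each shared k-mer.
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return 1.0 - shared / possible

# Made-up sequences; k=3 keeps the arithmetic easy to follow.
print(kmer_distance("ACDEFG", "ACDEFY", k=3))           # shares 3 of 4 words
print(kmer_distance("ACDEFGHIKL", "MNPQRSTVWY", k=6))   # shares no words
```

Because counting words is linear in sequence length, this kind of distance is cheap enough to compute for very large sequence sets, at the cost of being cruder than an alignment-based distance.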
- Consistency-based methods
  - Main idea: for sequences x, y, and z, if a residue of x aligns with a residue of z, and that residue of z aligns with a residue of y, then the residue of x should align with the residue of y.
  - Consistency-based methods draw on information from multiple sequences when scoring a pairwise alignment. Their distinctive feature is that they incorporate evidence from multiple sequences into each pairwise alignment.
  - The ProbCons algorithm has five steps
    - Compute a posterior probability matrix for each pair of sequences
    - Compute the expected accuracy of each pairwise alignment
    - Re-estimate the quality score of each pairwise alignment using the "probabilistic consistency transformation"
    - Build a guide tree from the expected accuracies using hierarchical clustering
    - Align the sequences progressively in the order given by the guide tree
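
The consistency idea (x aligns with z, z aligns with y, so x should align with y) can be illustrated with a simplified re-estimation step. The sketch below assumes that each pair of sequences already has a matrix of residue match probabilities, and it re-estimates the x-y matrix by averaging the paths that go through each intermediate sequence z. The exact formula, the function names, and the toy numbers are assumptions made for this note, not a transcription of ProbCons.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def consistency_transform(p_xy, via):
    """Re-estimate residue match probabilities between x and y.

    p_xy: matrix where p_xy[i][j] is the probability that x[i] aligns with y[j].
    via:  list of (p_xz, p_zy) matrix pairs, one per intermediate sequence z.
    The direct x-y evidence is counted twice (z = x and z = y), and the
    total is averaged over all sequences considered.
    """
    total = [[2.0 * value for value in row] for row in p_xy]
    for p_xz, p_zy in via:
        product = matmul(p_xz, p_zy)  # support for x[i]~y[j] through z
        total = [[t + p for t, p in zip(t_row, p_row)]
                 for t_row, p_row in zip(total, product)]
    n_seqs = len(via) + 2
    return [[value / n_seqs for value in row] for row in total]

# Toy numbers: x and y are only weakly matched directly, but both
# match the intermediate sequence z strongly along the diagonal.
p_xy = [[0.5, 0.1], [0.1, 0.5]]
p_xz = [[0.9, 0.0], [0.0, 0.9]]
p_zy = [[0.9, 0.0], [0.0, 0.9]]
updated = consistency_transform(p_xy, [(p_xz, p_zy)])
print(updated)
```

In this toy case the diagonal matches, which z supports, gain probability, while the off-diagonal matches, which z does not support, lose it. That is the sense in which evidence from a third sequence is folded into a pairwise score.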
- Structure-based methods
  - Using three-dimensional structure information for one or more of the proteins being aligned can improve the accuracy of a multiple sequence alignment. Algorithms that let users incorporate structural information include PRALINE and the Expresso module of T-COFFEE.
6.3 Benchmarking studies: methods, findings, and challenges
- Benchmark data sets provide a "gold standard" correct answer made up of highly reliable true positive relationships. Software programs are then compared against it to judge objectively which is the most accurate.
- Criteria for assessing the quality of a benchmark data set:
  - Relevance: the benchmark should include the tasks that users actually encounter when using the software
  - Solvability: the tasks should be neither too easy nor too hard
  - Scalability: some tasks are small, while others involve aligning large numbers of proteins
  - Availability: the benchmark database should be public
  - Independence: the methods used to build the benchmark database should not be the methods being evaluated for sequence alignment
  - Extensibility: the benchmark should be expanded over time to cover new problems
- Recognized benchmark data sets for multiple sequence alignment: BAliBASE, HOMSTRAD, OXBench, PREFAB, SABmark, and IRMBASE. A common approach is to derive the reference alignments from proteins with known three-dimensional structures, determined by X-ray crystallography.
- The performance of an MSA algorithm on a benchmark data set can be assessed with objective scoring functions; a common one is the sum-of-pairs score.
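
One common form of the sum-of-pairs score is the fraction of residue pairs aligned in the reference alignment that are also aligned in the alignment being tested. The sketch below implements that definition under the assumption that both alignments contain the same sequences in the same row order; the function names and the toy alignments are made up for this note, and individual benchmarks may define their scores somewhat differently.

```python
from itertools import combinations

def aligned_pairs(alignment):
    """Set of residue pairs that share a column.

    Each residue is identified as (row index, position in the ungapped
    sequence), so the same residue has the same identity in any alignment.
    """
    counters = [0] * len(alignment)
    pairs = set()
    for column in zip(*alignment):
        present = []
        for row, char in enumerate(column):
            if char != "-":
                present.append((row, counters[row]))
                counters[row] += 1
        pairs.update(combinations(present, 2))
    return pairs

def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        raise ValueError("reference alignment contains no aligned pairs")
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)

# Toy example: the test alignment misplaces one residue of the first row.
reference = ["AC-GT", "ACAGT", "A-AGT"]
test = ["ACG-T", "ACAGT", "A-AGT"]
score = sum_of_pairs_score(test, reference)
print(round(score, 3))
```

A score of 1.0 means every aligned pair in the reference was recovered; shifting a single residue, as in the toy test alignment, already removes two of the eleven reference pairs.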