
January 23, 2022 [Reading notes] - Bioinformatics and Functional Genomics (Chapter 6: Multiple Sequence Alignment)

2022-06-30 07:39:00 Muyiqing

  • Learning goals
    • Understand the three main stages ClustalW uses to perform multiple sequence alignment (MSA);
    • Describe several other MSA programs, understand how they work, and compare their similarities to and differences from ClustalW;
    • Understand the importance of benchmarking, and some basic conclusions about MSA drawn from it;
    • Understand some of the issues that arise in MSA of genomic regions.
  • 6.1 Introduction
    • This chapter discusses general questions about MSA
      • Introduces five approaches to MSA;
      • Surveys databases of multiple sequence alignments, such as Pfam;
      • Discusses multiple sequence alignment of genomic DNA.
    • Definition of a multiple sequence alignment
      • A multiple sequence alignment is a set of three or more protein (or nucleic acid) sequences that can be partially or completely aligned.
      • A protein family does not necessarily have a single “correct” alignment (β-globin and myoglobin share only 25% identity, yet their three-dimensional structures are almost identical).
      • A multiple sequence alignment arranges amino acid residues in aligned columns. Which residues belong in the same column can be decided from features of the residues, such as:
        • Highly conserved residues, such as cysteines that can form disulfide bonds.
        • Conserved motifs, such as transmembrane spans or immunoglobulin domains.
        • Conserved features of protein secondary structure, such as residues that help form α-helices, β-sheets, or transitional domains.
        • Regions that show consistent patterns of insertions or deletions.
    • Typical uses and practical strategies of multiple sequence alignment
      • When and why should multiple sequence alignment be used?
      • 1. If the protein of interest is related to a larger group of proteins, the members of that group can often provide information about its likely function, structure, and evolution.
      • 2. Most protein families have distantly related members; MSA can detect homology more sensitively than pairwise alignment.
      • 3. When viewing database search results, an MSA shows conserved residues and motifs more intuitively.
      • 4. Algorithms that assess whether a variant (such as a SNP) is deleterious usually rely on multiple sequence alignments of DNA and protein to measure cross-species conservation: deleterious mutations tend to occur at more conserved sites.
      • 5. Studies of population data can give insight into many biological questions involving evolution, structure, and function.
      • 6. When the complete genome of any species is sequenced, a major part of the analysis is to determine which protein family each gene product belongs to.
      • 7. Phylogenetic algorithms take a multiple sequence alignment as their raw input and generate a phylogenetic tree.
      • 8. Consensus sequences containing transcription factor binding sites and other conserved elements are identified mainly from conserved noncoding sequences detected by multiple sequence alignment.
  • 6.2 Five main approaches to multiple sequence alignment
    • Five common methods
      • Exact methods
      • Progressive alignment
      • Iterative methods
      • Consistency-based methods
      • Structure-based methods
    • Exact methods
      An extension of the pairwise dynamic programming algorithm described by Needleman and Wunsch.
      • The pairwise dynamic programming algorithm is generalized so that the alignment matrix becomes multidimensional; the goal is to maximize the sum of the alignment scores over every pair of sequences.
      • Pros and cons: exact methods produce an optimal alignment, but they are infeasible in time and space for more than a few sequences. For N sequences the computation time is O(2^N * L^N), where N is the number of sequences and L is the average sequence length. By comparison, the time complexity of ClustalW is O(N^4 + L^2) and that of MUSCLE is O(N^4 + NL^2). These algorithms are fast, but because they are heuristics they cannot guarantee an optimal alignment.
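To make the gap between these growth rates concrete, here is a small sketch that simply evaluates the three expressions quoted above. The function names and the average length of 300 residues are illustrative assumptions, and the numbers are order-of-magnitude step counts, not measured running times.

```python
def exact_cost(n, length):
    """Step count suggested by O(2^N * L^N) for an exact alignment."""
    return (2 ** n) * (length ** n)


def clustalw_cost(n, length):
    """Step count suggested by O(N^4 + L^2), as quoted for ClustalW."""
    return n ** 4 + length ** 2


def muscle_cost(n, length):
    """Step count suggested by O(N^4 + N * L^2), as quoted for MUSCLE."""
    return n ** 4 + n * length ** 2


if __name__ == "__main__":
    avg_len = 300  # assumed average protein length, for illustration only
    for n in (2, 3, 5, 10):
        print(
            f"N={n:2d}  exact={exact_cost(n, avg_len):.2e}  "
            f"ClustalW={clustalw_cost(n, avg_len)}  "
            f"MUSCLE={muscle_cost(n, avg_len)}"
        )
```

The exact term grows exponentially in N, while the heuristic terms grow polynomially, which is why exact methods are limited to a handful of sequences.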
    • Progressive alignment
      Proposed by Fitch and Yasunobu (1975); described by Hogeweg and Hesper (1984), who applied it to the alignment of 5S ribosomal RNA sequences; extended by Da-Fei Feng and Russell Doolittle (1987, 1990).
      • Method and strategy
        • Compute pairwise alignment scores for all the protein sequences to be aligned, start with the most similar pair, then progressively add more sequences to the alignment.
      • Pros and cons: supports rapid alignment of hundreds of sequences. The main limitation is that the final alignment depends on the order in which the sequences are added.
      • Common progressive alignment tools
        • ClustalW
          • Available as a web tool
        • ClustalW proceeds in three stages
          • Stage 1: a series of pairwise alignments
            • Dynamic programming is used to generate a pairwise alignment between every pair of proteins to be aligned; for example, five sequences yield 10 pairwise alignment scores.
          • Stage 2: build the guide tree
            • A guide tree is computed from the distance (or similarity score) matrix.
            • There are two main ways to build a guide tree (introduced in Chapter 7)
              • Unweighted pair group method with arithmetic mean (UPGMA)
              • Neighbor joining
            • The main features of a tree
              • Topology (the branching order)
              • Evolutionary distance (the branch lengths)
            • The tree reflects how closely related the sequences in the multiple alignment are.
          • Stage 3: create the multiple sequence alignment through a series of steps that follow the order of the guide tree
            • The algorithm selects the two closest sequences from the guide tree and aligns them pairwise. These two sequences sit at the leaf nodes of the tree, that is, the positions of the extant sequences. The next sequence is either added to this pairwise alignment or used to start another pairwise alignment. Alignment proceeds progressively until the root of the tree is reached and all sequences are aligned.
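The three stages can be sketched in a few lines. This is a toy illustration, not ClustalW itself: the scoring values (+1/-1/-1), the score-to-distance transform, and the function names are all assumptions, and stage 3 is reduced to reporting the order in which a progressive aligner would merge sequences and groups.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score by Needleman-Wunsch dynamic programming."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def pairwise_distances(seqs):
    """Stage 1: one distance for every unordered pair of sequences."""
    dist = {}
    for x, y in combinations(sorted(seqs), 2):
        score = nw_score(seqs[x], seqs[y])
        # Toy transform (an assumption): identical sequences get distance 0.
        dist[frozenset((x, y))] = 1.0 - score / max(len(seqs[x]), len(seqs[y]))
    return dist


def upgma(dist):
    """Stage 2: UPGMA clustering; returns clusters in the order they are joined."""
    d = dict(dist)
    size = {name: 1 for pair in d for name in pair}
    joins = []
    while len(size) > 1:
        # Closest pair of clusters, ties broken alphabetically.
        a, b = min((sorted(p) for p in d), key=lambda p: (d[frozenset(p)], p))
        new = f"({a},{b})"
        for c in size:
            if c not in (a, b):
                # Size-weighted mean equals the mean over all member sequences.
                d[frozenset((new, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        d = {p: v for p, v in d.items() if a not in p and b not in p}
        size[new] = size.pop(a) + size.pop(b)
        joins.append(new)
    return joins


if __name__ == "__main__":
    seqs = {"s1": "GATTACA", "s2": "GATTACA", "s3": "GATCACA",
            "s4": "CCCGGG", "s5": "CCCGGT"}
    distances = pairwise_distances(seqs)
    print(len(distances), "pairwise distances")
    # Stage 3 would align sequences and profiles in this order.
    for step, cluster in enumerate(upgma(distances), start=1):
        print(step, cluster)
```

The last cluster printed is the full guide tree in a Newick-like form; a real progressive aligner would perform a sequence-to-profile or profile-to-profile alignment at each join.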
    • Iterative methods
      • Iterative methods use a progressive alignment strategy to compute a suboptimal solution, then modify the alignment with dynamic programming or other methods until the solution converges. An initial tree is split and the profiles on either side are realigned. These methods therefore build an initial alignment and then modify it to try to improve it, using an objective function to maximize the score.
      • Progressive alignment has a limitation: once an error is introduced during alignment it cannot be corrected. Iterative methods can overcome this limitation.
      • The MAFFT multiple sequence alignment package includes these progressive methods:
        • A single-round progressive method similar to ClustalW, in which a fast Fourier transform is used to rapidly identify homologous regions;
        • A two-round method: an initial multiple sequence alignment is generated, refined distances are computed from that alignment, and a second progressive alignment is built;
        • PartTree progressive alignment: pairwise distances are computed from matching 6-tuples, an approach called k-mer counting.
      • MUSCLE operates in three stages
        • A draft alignment is produced by progressive multiple sequence alignment
        • The tree is improved and a new progressive alignment is built
        • The guide tree is iteratively refined by systematically partitioning it to obtain subsets: an edge (branch) of the tree is deleted to create a bipartition.
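A minimal sketch of k-mer counting as a distance measure, assuming k = 6 as in the note above. The normalization (shared k-mers divided by the number of k-mers in the shorter sequence) and the function names are illustrative assumptions, not MAFFT's actual formula.

```python
from collections import Counter


def kmer_counts(seq, k=6):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=6):
    """1 minus the fraction of k-mers shared, relative to the shorter sequence."""
    possible = min(len(a), len(b)) - k + 1
    if possible <= 0:
        raise ValueError("both sequences must contain at least k residues")
    # Counter intersection keeps the minimum count of each shared k-mer.
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return 1.0 - shared / possible


if __name__ == "__main__":
    print(kmer_distance("MKVLAAGIVGLLLA", "MKVLAAGIVALLLA"))
```

Because no alignment is computed, the cost per pair is roughly linear in sequence length, which is what makes this kind of distance attractive for very large sequence sets.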
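The edge-deletion step in MUSCLE's third stage can be illustrated by listing the splits that each edge of a tree induces. This is only a sketch under assumptions: the tree is a hypothetical nested tuple of sequence names, and the realignment of the two resulting profiles and the check that the score improved are omitted.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Yield (inside, outside) leaf sets, one pair per edge of the tree."""
    everything = frozenset(leaves(tree))
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            continue
        for child in node:
            # Deleting the edge above `child` separates its leaves from the rest.
            inside = frozenset(leaves(child))
            yield inside, everything - inside
            stack.append(child)


if __name__ == "__main__":
    guide_tree = (("a", "b"), ("c", ("d", "e")))
    for inside, outside in bipartitions(guide_tree):
        print(sorted(inside), "|", sorted(outside))
```

In a rooted binary tree the two edges below the root produce the same split with the sides swapped, so a real implementation would visit only one of them.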
    • Consistency-based methods
      • Main idea: for sequences x, y and z, if a residue of x aligns with a residue of z, and that residue of z aligns with a residue of y, then the residue of x should align with the residue of y.
      • Consistency-based methods draw on information from multiple sequences when scoring a pairwise alignment. The approach is distinctive in that it incorporates evidence from the multiple sequence alignment into each pairwise alignment.
      • The ProbCons algorithm consists of five steps
        • Compute a posterior probability matrix for each pair of sequences
        • Compute the expected accuracy of each pairwise alignment
        • Use the “probabilistic consistency transformation” to re-estimate the match quality scores of each pairwise alignment
        • Build a guide tree by hierarchical clustering based on expected accuracy
        • Align the sequences progressively in the order given by the guide tree
    • Structure-based methods
      • Using three-dimensional structure information for one or more of the proteins being aligned can improve the accuracy of a multiple sequence alignment. Programs that let users incorporate structural information include PRALINE and the Expresso module of T-COFFEE.
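The consistency idea can be shown numerically. The sketch below assumes the commonly cited form of the probabilistic consistency transformation, in which the match probabilities for x and y are re-estimated as the average, over every sequence z in the set, of the matrix product of the x-z and z-y posterior matrices (with z = x and z = y each contributing the original x-y matrix). The posterior values and all names are made up for illustration; this is not ProbCons code.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def consistency_transform(post, x, y, names):
    """Re-estimate P(x_i ~ y_j) by averaging paths x -> z -> y over all z."""
    rows, cols = len(post[x, y]), len(post[x, y][0])
    # z = x and z = y each contribute the original matrix once.
    total = [[2.0 * post[x, y][i][j] for j in range(cols)] for i in range(rows)]
    for z in names:
        if z in (x, y):
            continue
        via = matmul(post[x, z], post[z, y])
        for i in range(rows):
            for j in range(cols):
                total[i][j] += via[i][j]
    return [[value / len(names) for value in row] for row in total]


if __name__ == "__main__":
    names = ["x", "y", "z"]
    certain = [[1.0, 0.0], [0.0, 1.0]]
    post = {
        ("x", "y"): [[0.5, 0.0], [0.0, 0.5]],  # weak direct evidence
        ("x", "z"): certain,                   # strong evidence through z
        ("z", "y"): certain,
    }
    print(consistency_transform(post, "x", "y", names))
```

In this toy case the direct x-y evidence is weak, but both sequences align confidently to z, so the transformed x-y probabilities on the diagonal rise above their original value of 0.5.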
  • 6.3 Benchmarking with standard data sets: methods, findings and challenges
    • Benchmark studies run different algorithms and programs on standard data sets for which a “gold standard” correct answer is available, made up of highly reliable true-positive relationships; the programs are then compared to judge objectively which is the most accurate.
    • Criteria for evaluating the quality of a benchmark data set:
      • Relevance: the benchmark should include the tasks that users actually encounter when using the software
      • Solvability: the tasks should be neither too easy nor too hard
      • Scalability: some tasks are small-scale, while others involve aligning a large number of protein sequences
      • Accessibility: the benchmark data set should be publicly available
      • Independence: the methods used to build the benchmark should not also be used to produce the alignments being evaluated
      • Evolution: the benchmark should be extended over time to accommodate new problems
    • Recognized benchmark data sets for multiple sequence alignment: BAliBASE, HOMSTRAD, OXBench, PREFAB, SABmark and IRMBASE. The common approach is to derive the reference alignments from proteins of known three-dimensional structure, with the structures determined by X-ray crystallography.
    • The performance of an MSA algorithm on a benchmark data set can be evaluated with an objective scoring function; a common measure is the sum-of-pairs score.
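One common way to define a sum-of-pairs score against a reference is the fraction of residue pairs aligned in the reference that the test alignment also places in the same column. The sketch below follows that definition; the function names and the input format (equal-length gapped strings, given in the same sequence order, with "-" as the gap character) are assumptions.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Set of residue pairs that share a column in a gapped alignment."""
    pairs = set()
    position = [0] * len(alignment)  # next residue index in each sequence
    for column in zip(*alignment):
        present = []
        for seq_index, char in enumerate(column):
            if char != "-":
                present.append((seq_index, position[seq_index]))
                position[seq_index] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs recovered by the test alignment."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        raise ValueError("reference alignment contains no aligned residue pairs")
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


if __name__ == "__main__":
    reference = ["ACG-T", "ACGAT"]
    test = ["AC-GT", "ACGAT"]
    print(sum_of_pairs_score(test, reference))
```

A score of 1.0 means every residue pair in the reference was reproduced; misplacing a single gap lowers the score in proportion to the pairs it breaks.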

 
