December 19, 2021 [Reading notes] - Bioinformatics and Functional Genomics (Chapter 5: Advanced Database Searching)

2022-06-30 07:38:00 Muyiqing

  • 5.5 Using BLAST-like alignment tools to search genomic DNA quickly
    • Need: as genomic DNA databases keep growing, alignment calls for more, and more specialized, tools that can:
      • Find exons within genomic DNA
      • Tolerate sequencing errors in the genomic DNA being compared
      • Handle the deletions, duplications, inversions, and translocations that appear when the genomes of related species are aligned
      • Detect small differences between DNA sequences, such as SNPs
    • Evaluating genome alignment with benchmark sets
      • On simulated sequence sets generated with the ROSE (Random Model of Sequence Evolution) package, the global alignment tool LAGAN achieved the highest sensitivity, while local alignment tools (such as BLASTZ) were more accurate in coding regions
    • PatternHunter: spaced (discontiguous) seeds improve sensitivity
      • PatternHunter places positions where a mismatch is allowed between the positions that must match, which improves both speed and sensitivity (two seed models are described)
      • With 1 for a position that must match and 0 for a position where a mismatch is allowed, the seed models look like this:
        • BLASTN:11111111111
        • PatternHunter:110100110010101111 (another model: 11101001010011011)
        • Reason: overlapping placements of a spaced seed share very few positions, so successive hits are more independent of one another than with a contiguous seed
      • BLASTZ, MegaBLAST, and homologous-protein search algorithms also use this strategy
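The difference between the two seed models can be tried on a toy example. The sketch below is only an illustration of the idea, not PatternHunter's implementation: the function name `seed_hits` and the two short sequences are made up for this note, and only the two seed strings come from the notes above.

```python
def seed_hits(query, target, model):
    """Return (i, j) pairs where query[i:] and target[j:] agree at every
    '1' position of the seed model ('0' positions are ignored)."""
    span = len(model)
    care = [k for k, c in enumerate(model) if c == "1"]
    # Hash the target: key = letters found at the must-match positions.
    index = {}
    for j in range(len(target) - span + 1):
        key = "".join(target[j + k] for k in care)
        index.setdefault(key, []).append(j)
    # Look up every window of the query in that hash.
    hits = []
    for i in range(len(query) - span + 1):
        key = "".join(query[i + k] for k in care)
        hits.extend((i, j) for j in index.get(key, []))
    return hits


CONTIGUOUS = "1" * 11            # BLASTN-style model from the notes
SPACED = "110100110010101111"    # PatternHunter-style model from the notes

target = "ACGTTGCAAGCTTAGGCT"    # toy data, not from the book
query = "ACGTTACAAGCCTAGGCT"     # same sequence, substitutions at offsets 5 and 11

print(seed_hits(query, target, SPACED))      # [(0, 0)]
print(seed_hits(query, target, CONTIGUOUS))  # []
```

Both models require 11 matching positions, but the two substitutions break every run of 11 consecutive matches, so only the spaced model still reports the alignment.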

    • BLASTZ
      • Designed to align human and mouse genomic DNA sequences.
      • How it works:
        • 1. Lineage-specific interspersed repeats are removed from both sequences
        • 2. Seeds are 12 matching positions arranged in the spaced pattern 1110100110010101111; each seed is first extended without gaps, and when the score exceeds a threshold the extension is repeated with gaps allowed
        • 3. For regions between adjacent successful alignments, step 2 is repeated with a shorter (more sensitive) word size, such as 7.
          • BLASTZ genome alignments can be viewed in the UCSC Genome Browser
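The gap-free extension stage can be sketched as follows. This is a minimal illustration, assuming a simple +1/-1 scoring scheme and an X-drop stopping rule; the function names, the scores, and the `xdrop` value are assumptions made for this note, and BLASTZ itself uses its own scoring matrix and thresholds.

```python
def _walk(a, b, i, j, step, match, mismatch, xdrop):
    """Walk from (i, j) in direction `step` (+1 or -1) without gaps.
    Return (best score gain, number of positions kept)."""
    best = run = 0
    best_len = n = 0
    while 0 <= i < len(a) and 0 <= j < len(b):
        run += match if a[i] == b[j] else mismatch
        n += 1
        if run > best:
            best, best_len = run, n
        elif best - run > xdrop:
            break  # score fell too far below the best seen so far
        i += step
        j += step
    return best, best_len


def extend_ungapped(a, b, i, j, seed_len, match=1, mismatch=-1, xdrop=2):
    """Extend the seed a[i:i+seed_len] / b[j:j+seed_len] in both directions.
    Return (score, start_in_a, start_in_b, aligned_length)."""
    seed_score = sum(match if a[i + k] == b[j + k] else mismatch
                     for k in range(seed_len))
    right, right_len = _walk(a, b, i + seed_len, j + seed_len, 1,
                             match, mismatch, xdrop)
    left, left_len = _walk(a, b, i - 1, j - 1, -1, match, mismatch, xdrop)
    return (seed_score + left + right, i - left_len, j - left_len,
            left_len + seed_len + right_len)


a = "CCACGTACGTAA"   # toy sequences, not from the book
b = "GGACGTTCGTTT"
print(extend_ungapped(a, b, 2, 2, 4))   # (6, 2, 2, 8)
```

In a BLASTZ-like pipeline, only extensions whose score passes a threshold would go on to the slower gapped stage.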

    • Enredo and Pecan (brief)
      • Used by Ensembl for multiple sequence alignment; judged by several criteria, their alignments are more accurate than those of other software
    • MegaBLAST and discontiguous MegaBLAST
      • MegaBLAST: an NCBI tool optimized for fast alignment of long DNA query sequences. The default word size is 28 and can be raised to as much as 256, which increases speed.
        • A smaller word size gives higher sensitivity but a slower search.
        • A percent-identity threshold for reported hits can be set
        • Match and mismatch scores can be set
      • Discontiguous MegaBLAST: a tool for aligning more distantly related genomic sequences.

    • BLAST-like alignment tool (BLAT)
      • A very fast search tool for genomic DNA
      • BLAT breaks the whole genomic DNA database into a word index made up of all non-overlapping 11-mers in the genome.
      • The database-indexing strategy used by BLAT is also used by SSAHA2 and, later, by MegaBLAST
      • Other properties:
        • BLAST triggers an extension when two hits occur, whereas BLAT requires multiple hits;
        • BLAT is mainly intended to find matches that are at least 95% identical to the query sequence
        • BLAT looks for intron-exon boundaries, so in effect it builds a model of gene structure.
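The indexing idea can be sketched on a toy scale: index the non-overlapping k-mers of the database, then look up every overlapping k-mer of the query. This is only a sketch, not BLAT's code: it uses k = 4 instead of 11 so the example stays readable, and grouping hits by diagonal to stand in for "multiple hits" is a simplification chosen for this note. All function names and sequences are hypothetical.

```python
from collections import defaultdict


def build_index(genome, k):
    """Map each non-overlapping k-mer of the genome to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index


def query_hits(index, query, k):
    """Look up every overlapping k-mer of the query.
    Return (query_pos, genome_pos) pairs."""
    hits = []
    for q in range(len(query) - k + 1):
        for g in index.get(query[q:q + k], []):
            hits.append((q, g))
    return hits


def diagonals_with_multiple_hits(hits, min_hits=2):
    """Keep diagonals (genome_pos - query_pos) supported by several hits."""
    counts = defaultdict(int)
    for q, g in hits:
        counts[g - q] += 1
    return sorted(d for d, n in counts.items() if n >= min_hits)


genome = "ACGTGGCATTAGCCGATACG"   # toy data, not from the book
query = genome[3:15]              # "TGGCATTAGCCG"
index = build_index(genome, 4)
hits = query_hits(index, query, 4)
print(hits)                                # [(1, 4), (5, 8)]
print(diagonals_with_multiple_hits(hits))  # [3]
```

Because the database words do not overlap, the index holds roughly 1/k as many entries as an index of all overlapping words, while scanning the query with overlapping words still catches every database word that falls inside a matching region.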

    • LAGAN
      • A pairwise sequence alignment tool

      • The global pairwise alignment proceeds in three steps
        • 1. Local alignments between the two sequences are generated first to identify a set of anchors; several short inexact word matches are allowed in place of one long exact word match;
        • 2. A rough global map is built, consisting of the maximal-scoring ordered subset of the anchors;
        • 3. The final global alignment is computed, restricted to the region around the rough map.
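Step 2 can be illustrated with a small chaining routine. The notes do not say how LAGAN chains its anchors, so the sketch below uses a generic quadratic-time dynamic program as a stand-in; the anchor format `(a_start, b_start, length, score)` and the example anchors are hypothetical.

```python
def chain_anchors(anchors):
    """Pick the highest-scoring set of anchors that appear in the same order,
    without overlapping, in both sequences.
    Each anchor is (a_start, b_start, length, score)."""
    anchors = sorted(anchors)
    n = len(anchors)
    if n == 0:
        return []
    best = [anchor[3] for anchor in anchors]  # best chain score ending at i
    prev = [-1] * n                           # previous anchor in that chain
    for i in range(n):
        a_i, b_i, _, score_i = anchors[i]
        for j in range(i):
            a_j, b_j, len_j, _ = anchors[j]
            # anchor j must end before anchor i starts, in both sequences
            if a_j + len_j <= a_i and b_j + len_j <= b_i:
                if best[j] + score_i > best[i]:
                    best[i] = best[j] + score_i
                    prev[i] = j
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(anchors[i])
        i = prev[i]
    return chain[::-1]


anchors = [(0, 0, 5, 5), (10, 40, 6, 6), (12, 10, 4, 4),
           (20, 18, 8, 8), (35, 30, 5, 5)]
print(chain_anchors(anchors))
# [(0, 0, 5, 5), (12, 10, 4, 4), (20, 18, 8, 8), (35, 30, 5, 5)]
```

The anchor at (10, 40) is dropped because it cannot be placed in order with the others. Step 3 would then run the full dynamic-programming alignment only in a band around the chained anchors.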
    • SSAHA2
      • SSAHA2 converts the DNA database into a hash table of fixed-length words, so pairwise alignment can find matches quickly by hash lookup.
  • 5.6 Aligning next-generation sequencing reads to a reference genome
    • 1977: Sanger sequencing; 2005: next-generation sequencing (NGS)
    • Alignment considerations:
      • Matches and mismatches
      • Running speed
        • Indexes are introduced: hash tables and suffix trees
    • Hash-table-based alignment
      • Uses the "seed and extend" strategy
      • 1. Two kinds of input data:
        • The reference genome sequence
        • A large number of short reads
      • 2. Index the reads and build multiple hash tables
      • 3. Search the hash tables to identify matching regions in the database.
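The three steps above can be sketched as follows. This is a deliberately simplified illustration, not the algorithm of any particular aligner: it builds a single hash table keyed on the first k bases of each read, whereas the tools described in the notes build multiple hash tables so that a mismatch inside one seed does not lose the read. The function name, k = 4, and the mismatch limit are assumptions.

```python
from collections import defaultdict


def map_reads(reference, reads, k=4, max_mismatches=1):
    """Seed-and-extend mapping with one hash table over the reads.
    Return {read_id: [(reference_pos, mismatches), ...]}."""
    # Step 2: index the reads; seed = first k bases of each read.
    seed_table = defaultdict(list)
    for read_id, read in enumerate(reads):
        seed_table[read[:k]].append(read_id)
    # Step 3: scan the reference and look each window up in the hash table.
    placements = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        for read_id in seed_table.get(reference[pos:pos + k], []):
            read = reads[read_id]
            window = reference[pos:pos + len(read)]
            if len(window) < len(read):
                continue  # read would run past the end of the reference
            # Extend: compare the whole read against the reference window.
            mismatches = sum(1 for x, y in zip(read, window) if x != y)
            if mismatches <= max_mismatches:
                placements[read_id].append((pos, mismatches))
    return dict(placements)


reference = "ACGTGGCATTAGCCGATACG"             # toy data, not from the book
reads = ["GGCATTAG", "TTAGCGGA", "AAAAAAAA"]
print(map_reads(reference, reads))   # {0: [(4, 0)], 1: [(8, 1)]}
```

The first read maps exactly, the second maps with one mismatch, and the third has no seed hit and is left unmapped.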
    • Alignment based on the Burrows-Wheeler transform (suffix trees)
      • Suffix trees and suffix arrays are one way to speed up alignment and are used by the common tools BWA and Bowtie2, which take into account read length, sequencing error rate, and gap penalties, and support both local and global alignment of reads.
      • The BWT transforms the reference genome into a form that compresses well (lossless compression): the complete original sequence can be recovered from the compressed data.
        • 1. Given a string of length N, build the N×N matrix whose rows are the cyclic shifts of the string
        • 2. Sort the rows lexicographically to obtain matrix M; the first column is F and the last column is L
        • 3. The compressed form keeps only the information (or an index) for F and L, from which matrix M can be rebuilt quickly
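The three steps can be tried on a short string. The sketch below builds the sorted rotation matrix explicitly, which is fine for a toy example but far too slow and memory-hungry for a genome (real tools derive the transform from a suffix array). The `$` end marker is an assumption of this sketch; it must not occur in the text and must sort before every other character. Note that column F is just column L sorted, so the inverse here needs only L.

```python
def bwt(text, end="$"):
    """Return columns (F, L) of the sorted cyclic-shift matrix of text + end."""
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    first = "".join(row[0] for row in rotations)
    last = "".join(row[-1] for row in rotations)
    return first, last


def inverse_bwt(last):
    """Recover the original text (without the end marker) from column L."""
    n = len(last)
    # Stable sort of the positions in L: order[r] is the position in L of
    # the same character occurrence that sits at row r of column F.
    order = sorted(range(n), key=lambda i: last[i])
    first = [last[i] for i in order]      # column F is L sorted
    row = order[0]                        # row 0 starts with the end marker
    out = []
    for _ in range(n - 1):
        out.append(first[row])
        row = order[row]
    return "".join(out)


print(bwt("banana"))            # ('$aaabnn', 'annb$aa')
print(inverse_bwt("annb$aa"))   # banana
```

Equal characters tend to cluster in L (for example the run `nn` above), which is what makes the transformed string easy to compress.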
  • 5.7 Perspective
    • As BLAST searching has become a basic tool for studying proteins and genes, many specialized applications have been developed, with different algorithms and specialized databases.
    • BLAST is not suited to searching large amounts of genomic DNA; other methods manage it by using longer word sizes, spaced seeds, and indexing of the database and the query sequences.
    • Short-read aligners are designed specifically to align millions of short reads to a reference genome; typical applications include finding SNPs and structural variants (SVs).
  • 5.8 Common pitfalls
    • For any bioinformatics problem, define the goal of the database search clearly, that is, what you want to achieve
    • Watch for false positives in BLAST results, remove them, and set an appropriate expect (E-value) threshold
    • Use the tools and databases suited to the specific goal
  • I hope these notes are helpful. You are welcome to join the discussion group, or add WeChat bbplayer2021, to share your experience of learning bioinformatics.

 

Copyright notice
This article was written by [Muyiqing]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202160539309119.html