December 19, 2021 [Reading notes] Bioinformatics and Functional Genomics (Chapter 5: Advanced Database Searching)

2022-06-30 07:38:00 【Muyiqing】

5.5 Using BLAST-like alignment tools to quickly search genomic DNA
- Requirements: as the number of genomic DNA databases grows, alignment tools have to meet more and more demands
  - They can find exons within genomic DNA
  - They account for sequencing errors in genomic DNA during alignment
  - They include algorithms that handle deletions, duplications, inversions, and translocations between the genomes of related species
  - They include algorithms that handle small differences between DNA sequences, such as SNP sites
- Evaluating genome alignment with benchmark sets
  - When simulated sequence sets generated with the Random-model Of Sequence Evolution (ROSE) software package are used for testing, the global alignment tool LAGAN achieves the highest sensitivity, while local alignment tools (such as BLASTZ) align coding regions more accurately
- PatternHunter: spaced (discontiguous) seeds improve sensitivity
  - PatternHunter places positions where a mismatch is tolerated between the required match positions of the seed, which improves both speed and sensitivity (two models are described)
  - Writing a required match as 1 and a position where a mismatch is allowed as 0, the seed models look like this:
    - BLASTN: 11111111111
    - PatternHunter: 110100110010101111 (another: 11101001010011011)
    - Reason: adjacent hits of a spaced seed share very few positions, so the hits are more independent of one another than with a contiguous seed model
  - BLASTZ, MegaBLAST, and homology search algorithms for proteins also use this strategy
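To make the seed models above concrete, here is a minimal Python sketch of seed matching. It is an illustration only, not PatternHunter's actual implementation: the function name `seed_hits` and the two short test sequences are made up for this note. The sequences differ at a single position that falls on a 0 of the spaced model, so the spaced seed still reports a hit while the contiguous 11-mer model reports none.

```python
def seed_hits(a, b, model):
    """Return (i, j) pairs where a[i:] and b[j:] agree at every '1' of the model."""
    span = len(model)
    care = [k for k, c in enumerate(model) if c == "1"]  # required match positions

    # Index every window of b by the letters found at the required positions.
    index = {}
    for j in range(len(b) - span + 1):
        key = "".join(b[j + k] for k in care)
        index.setdefault(key, []).append(j)

    # Look up every window of a in that index.
    hits = []
    for i in range(len(a) - span + 1):
        key = "".join(a[i + k] for k in care)
        for j in index.get(key, []):
            hits.append((i, j))
    return hits


CONTIGUOUS = "1" * 11              # BLASTN-style model from the notes
SPACED = "110100110010101111"      # PatternHunter model from the notes

# Hypothetical sequences: identical except for position 8 (0-based).
seq_a = "ACGTTGCAAGCTTAGGCT"
seq_b = "ACGTTGCACGCTTAGGCT"

spaced_hits = seed_hits(seq_a, seq_b, SPACED)
contiguous_hits = seed_hits(seq_a, seq_b, CONTIGUOUS)
```

Both models require 11 matching positions, but every contiguous 11-mer window of these 18-base sequences covers the mismatch, whereas the spaced model skips over it.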
- BLASTZ
  - Designed to align human and mouse genomic DNA sequences.
  - How it works:
    - Lineage-specific interspersed repeats are removed from both sequences
    - Matches at 12 positions of the spaced seed 1110100110010101111 are extended without gaps; when the score exceeds a threshold, the extension is repeated with gaps allowed
    - For regions adjacent to a successful alignment, the second step is repeated with a shorter (more sensitive) word length, such as 7.
  - BLASTZ genome alignments can be visualized in the UCSC Genome Browser
- Enredo and Pecan (brief)
  - Used by Ensembl for multiple sequence alignment; by several criteria the alignments are more accurate than those of other software
- MegaBLAST and discontiguous MegaBLAST
  - MegaBLAST: an NCBI tool optimized for fast alignment of long DNA query sequences. The default word length is 28 and can be raised as high as 256 to increase speed.
    - A smaller word length gives higher sensitivity but a slower search.
    - A percent-identity threshold for the output can be set
    - Match and mismatch scores can be set
  - Discontiguous MegaBLAST: a tool for comparing more distantly related genomic sequences.
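The word-length trade-off can be seen with a small Python sketch. This is not MegaBLAST code: `word_hits` is a hypothetical helper that simply counts exact shared words of length `w`, and the sequences and word lengths (8 and 11 instead of 28) are scaled down so the example stays readable.

```python
def word_hits(query, subject, w):
    """Return (i, j) pairs where query[i:i+w] equals subject[j:j+w] exactly."""
    positions = {}
    for j in range(len(subject) - w + 1):
        positions.setdefault(subject[j:j + w], []).append(j)
    return [
        (i, j)
        for i in range(len(query) - w + 1)
        for j in positions.get(query[i:i + w], [])
    ]


# Hypothetical pair of sequences with one mismatch at position 8 (0-based).
query = "ACGTTGCAAGCTTAGGCT"
subject = "ACGTTGCACGCTTAGGCT"

hits_short_word = word_hits(query, subject, 8)    # shorter word: more seeds to extend
hits_long_word = word_hits(query, subject, 11)    # longer word: fewer seeds, faster
```

With the shorter word the related region is still seeded (at the cost of more candidate hits to process), while the longer word finds no seed at all in this pair.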
- BLAST-like alignment tool (BLAT)
  - A very fast search tool for genomic DNA
  - BLAT breaks the entire genomic DNA database into an index of words, made up of all non-overlapping 11-mers in the genome.
  - The database indexing strategy used by BLAT was also adopted by SSAHA2 and later by MegaBLAST
  - Other properties:
    - BLAST triggers an extension when two hits occur, whereas BLAT requires multiple hits;
    - BLAT is mainly intended to find matches that are 95% or more similar to the query sequence
    - BLAT searches for intron-exon boundaries, which in essence builds a model of gene structure.
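The indexing idea described for BLAT can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not BLAT's real code: the function names, the toy genome, the word length of 4 (instead of 11), and the `min_hits=2` rule are all chosen for the example. The database is cut into non-overlapping words, the query is scanned with every overlapping word, and hits are grouped by diagonal (genome position minus query position) so that only diagonals supported by several hits would be extended.

```python
from collections import Counter, defaultdict


def build_tile_index(genome, k=11):
    """Index the non-overlapping k-mers of the genome (step size k)."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return dict(index)


def find_hits(query, index, k=11):
    """Look up every overlapping k-mer of the query (step size 1)."""
    hits = []
    for q in range(len(query) - k + 1):
        for g in index.get(query[q:q + k], []):
            hits.append((q, g))
    return hits


def supported_diagonals(hits, min_hits=2):
    """Keep diagonals (genome pos - query pos) that collect at least min_hits hits."""
    counts = Counter(g - q for q, g in hits)
    return {d: n for d, n in counts.items() if n >= min_hits}


# Toy data: the query is a copy of genome[6:22].
genome = "ACGTACGGTCAATGCCGTAGGCTTAACGGATCCGTA"
query = "GGTCAATGCCGTAGGC"

index = build_tile_index(genome, k=4)
hits = find_hits(query, index, k=4)
candidates = supported_diagonals(hits)
```

Here three hits fall on diagonal 6, the true location of the query, while a single chance hit on another diagonal is discarded. Indexing only non-overlapping words keeps the index roughly k times smaller than indexing every position.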
- LAGAN
  - Pairwise sequence alignment
  - The global pairwise alignment is carried out in three steps
    - 1. Generate local alignments between the two sequences to identify a set of anchors, allowing multiple short inexact word matches instead of long exact word matches;
    - 2. Build a rough global map: the highest-scoring ordered set of anchors;
    - 3. Compute the final global alignment, restricted to the region defined by the rough map.
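Step 2 can be illustrated with a generic anchor-chaining sketch in Python. This is a plain quadratic dynamic program written for this note, under the assumption that each anchor is reduced to a point `(x, y, score)`; LAGAN's own chaining procedure is more elaborate, and the anchor values below are invented.

```python
def best_chain(anchors):
    """Return the highest-scoring chain of anchors that increases in both sequences.

    Each anchor is (x, y, score): a position in sequence 1, a position in
    sequence 2, and the score of the local alignment found there.
    """
    if not anchors:
        return []
    order = sorted(anchors)
    best = [score for _, _, score in order]   # best chain score ending at each anchor
    prev = [-1] * len(order)                  # back-pointers for traceback

    for i, (xi, yi, si) in enumerate(order):
        for j in range(i):
            xj, yj, _ = order[j]
            # Anchor j may precede anchor i only if it lies before it in both sequences.
            if xj < xi and yj < yi and best[j] + si > best[i]:
                best[i] = best[j] + si
                prev[i] = j

    i = max(range(len(order)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(order[i])
        i = prev[i]
    return chain[::-1]


# Hypothetical anchors from a local alignment step.
anchors = [(10, 12, 5), (30, 8, 9), (40, 45, 7), (60, 50, 4), (55, 90, 3)]
rough_map = best_chain(anchors)
```

The resulting chain is the "rough map"; step 3 would then run the global alignment only in the boxes between consecutive anchors of this chain.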
- SSAHA2
  - SSAHA2 converts the DNA database into a hash table of fixed-length words, so that pairwise alignment can quickly find matches in the hash table.
5.6 Aligning next-generation sequencing reads to a reference genome
- 1977: Sanger sequencing; 2005: next-generation sequencing (NGS)
- Alignment considerations:
  - Matches and mismatches
  - Running speed
    - Indexes are introduced: hash tables and suffix trees
- Hash-table-based alignment
  - Uses a "seed-and-extend" strategy
  - 1. Two kinds of input data:
    - The reference genome sequence
    - A large number of short sequence reads
  - 2. Index the reads and build multiple hash tables
  - 3. Then search the hash tables to identify matching regions in the database.
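The three steps above can be sketched in Python. This is a deliberately simplified illustration, not the algorithm of any particular aligner: it builds a single hash table keyed on the first `k` bases of each read (real tools build several tables so that a mismatch inside the seed is tolerated), and the function names, toy reference, and reads are assumptions made for this note.

```python
from collections import defaultdict


def index_reads(reads, k):
    """Hash each read by its first k bases (the seed)."""
    table = defaultdict(list)
    for read_id, read in enumerate(reads):
        table[read[:k]].append(read_id)
    return table


def map_reads(reference, reads, k=4, max_mismatches=1):
    """Scan the reference, look up seeds in the read table, then extend and verify."""
    table = index_reads(reads, k)
    placements = {}
    for pos in range(len(reference) - k + 1):
        for read_id in table.get(reference[pos:pos + k], []):
            read = reads[read_id]
            window = reference[pos:pos + len(read)]
            if len(window) < len(read):
                continue  # read would run off the end of the reference
            # Ungapped extension: count mismatches over the full read length.
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                placements.setdefault(read_id, []).append((pos, mismatches))
    return placements


# Toy data: read 0 matches exactly, read 1 has one mismatch, read 2 does not map.
reference = "TTGACCGTAGGCTAACGTTCAGGA"
reads = ["CCGTAGGC", "AACGTACA", "GGGGGGGG"]

placements = map_reads(reference, reads)
```

Each placement is a `(position, mismatches)` pair; a read with no entry in the result did not map within the allowed number of mismatches.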
- Alignment based on the Burrows-Wheeler transform (suffix trees)
  - Using suffix trees and suffix arrays is one way to speed up alignment. BWA and Bowtie2 are commonly used; they take read length, sequencing error rate, and gap penalties into account, and handle both local and global alignment of reads.
  - The BWT transforms and compresses the reference genome (lossless compression), meaning the complete original sequence can be restored from the compressed data.
    - 1. Given a string of length N, generate an N×N matrix
    - 2. Sort the rows lexicographically to obtain matrix M; each row is a cyclic shift of the string, the first column is F and the last column is L
    - 3. The compressed string keeps only the information (or an index) of F and L, from which matrix M can be quickly restored
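A small Python sketch of these three steps follows. It builds the full rotation matrix, which is fine for a toy string but far too slow and memory-hungry for a genome (real aligners construct the transform via suffix arrays). It assumes a terminator character `$` that does not occur in the text and sorts before every other character. Note that L alone is enough to recover the text, because F is just L sorted.

```python
def bwt(text, end="$"):
    """Return the first (F) and last (L) columns of the sorted rotation matrix."""
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))  # rows of matrix M
    first = "".join(row[0] for row in rotations)
    last = "".join(row[-1] for row in rotations)
    return first, last


def inverse_bwt(last):
    """Rebuild the original text from L alone."""
    n = len(last)
    first = sorted(last)  # F is L in sorted order
    # Stable sort: order[r] is the row whose last character is the same
    # occurrence as the first character of row r.
    order = sorted(range(n), key=lambda i: last[i])
    row = order[0]        # row 0 starts with the terminator; step past it
    out = []
    for _ in range(n - 1):
        out.append(first[row])
        row = order[row]
    return "".join(out)


f_column, l_column = bwt("banana")
restored = inverse_bwt(l_column)
```

Equal characters end up clustered in L (for example `nn` and `aa` above), which is what makes the transformed string easy to compress.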
5.7 Perspective
- As BLAST searching has become a basic tool for studying proteins and genes, many specialized applications have been developed, including different algorithms and specialized databases.
- BLAST cannot search large amounts of genomic DNA; other methods achieve this by using longer word lengths, spaced seeds, and indexes of the database and query sequences.
- Short-read alignment tools are designed specifically to align millions of short reads to a reference genome; typical applications include finding SNP sites and structural variant (SV) sites.
5.8 Common pitfalls
- For any bioinformatics problem, be clear about the goal of the database search, that is, what you want to achieve
- Watch for false positives in BLAST results, remove them, and set an appropriate expect (E-value) threshold
- Use the right tools and databases for the specific goal