December 19, 2021 [Reading notes] Bioinformatics and Functional Genomics (Chapter 5: Advanced Database Searching)
2022-06-30 07:38:00 【Muyiqing】
- 5.5 Using BLAST-like alignment tools to search genomic DNA quickly
- Motivation: as genomic DNA databases keep growing, alignment calls for more and more specialized tools
- They can find exons within genomic DNA
- Alignment must take into account that genomic DNA contains sequencing errors
- Dedicated algorithms handle deletions, duplications, inversions, and translocations between the genomes of related species
- Dedicated algorithms handle small differences between DNA sequences, such as SNP sites
- Benchmark sets are used to evaluate the performance of genome alignment
- When tested on simulated sequence sets generated with the ROSE (random model of sequence evolution) package, the global alignment tool LAGAN had the highest sensitivity, while local alignment tools (such as BLASTZ) aligned the coding regions more accurately
- PatternHunter: spaced (discontiguous) seeds improve sensitivity
- PatternHunter places positions where a mismatch is tolerated between the positions that must match, which improves both speed and sensitivity (two seed models are described)
- Writing a required match as 1 and a position that may mismatch as 0, the seed models look like this:
- BLASTN: 11111111111
- PatternHunter: 110100110010101111 (another one: 111010010100110111)
- Reason: neighboring seed matches share very few positions, so the matches are more independent of each other than with a contiguous seed model
- BLASTZ, MegaBLAST, and homologous protein search algorithms also use this strategy
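To make the seed models above concrete, here is a minimal Python sketch of seed matching. It is not PatternHunter's implementation; the function names `seed_key` and `seed_hits` and the toy sequences are made up for illustration. Two windows count as a hit when they agree at every `1` position of the seed, whatever happens at the `0` positions.

```python
from collections import defaultdict

def seed_key(seq, start, seed):
    """Characters of seq that fall under the '1' positions of seed."""
    return "".join(seq[start + i] for i, c in enumerate(seed) if c == "1")

def seed_hits(query, target, seed):
    """Return (query_pos, target_pos) pairs whose windows agree
    at every '1' position of the seed."""
    index = defaultdict(list)
    for t in range(len(target) - len(seed) + 1):
        index[seed_key(target, t, seed)].append(t)
    hits = []
    for q in range(len(query) - len(seed) + 1):
        for t in index.get(seed_key(query, q, seed), []):
            hits.append((q, t))
    return hits

CONTIGUOUS = "1" * 11            # BLASTN-style seed from the notes
SPACED = "110100110010101111"    # spaced seed as written in the notes

# Toy data: the query differs from the target at offsets 4 and 11,
# both of which are '0' positions of the spaced seed.
target = "ACGTTGCAAGCTTAGGCA"
query = "ACGTAGCAAGCGTAGGCA"

spaced_hits = seed_hits(query, target, SPACED)
contiguous_hits = seed_hits(query, target, CONTIGUOUS)
```

In this toy case the spaced seed still produces a hit, while no 11 consecutive positions match, so the contiguous seed finds nothing. Both seeds require 11 matching positions.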
- BLASTZ
- Developed to align human and mouse genomic DNA sequences.
- Procedure:
- Lineage-specific interspersed repeats are removed from both sequences
- Matches of word length 12 are found with the spaced seed 1110100110010101111 and extended without gaps; when the score exceeds a threshold, the extension is repeated with gaps allowed
- For regions adjacent to a successful alignment, the second step is repeated with a smaller (more sensitive) word length, such as 7.
- Genome sequences aligned with BLASTZ can be visualized in the UCSC Genome Browser
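The staged extension described for BLASTZ can be sketched as follows. This is only an illustration of gap-free extension followed by a threshold check, not BLASTZ's code: the +1/-1 scoring, the X-drop stopping rule (`xdrop`), and the threshold value are all assumptions introduced here, and the gapped step itself is left out.

```python
def _extend(a, b, i, j, step, match, mismatch, xdrop):
    """Walk from (i, j) in direction step (+1 or -1) without gaps.
    Return (best score gained, number of columns kept)."""
    score = best = kept = n = 0
    while 0 <= i < len(a) and 0 <= j < len(b):
        score += match if a[i] == b[j] else mismatch
        n += 1
        if score > best:
            best, kept = score, n
        elif best - score > xdrop:
            break  # score fell too far below the best seen so far
        i += step
        j += step
    return best, kept

def ungapped_extension(a, b, i, j, seed_len, match=1, mismatch=-1, xdrop=2):
    """Extend the seed hit a[i:i+seed_len] ~ b[j:j+seed_len] in both
    directions. Return (score, (start, end)), the span being in a."""
    seed = sum(match if a[i + k] == b[j + k] else mismatch
               for k in range(seed_len))
    right, r = _extend(a, b, i + seed_len, j + seed_len, +1,
                       match, mismatch, xdrop)
    left, l = _extend(a, b, i - 1, j - 1, -1, match, mismatch, xdrop)
    return seed + left + right, (i - l, i + seed_len + r)

a = "GGACGTTCGATT"
b = "CCACGTACGACC"
score, span = ungapped_extension(a, b, 2, 2, seed_len=4)

GAPPED_THRESHOLD = 5  # hypothetical value
needs_gapped_extension = score >= GAPPED_THRESHOLD
```

Only hits whose gap-free score passes the threshold go on to the slower gapped extension, which is what keeps the overall search fast.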
- Enredo and Pecan (skipped)
- Used by Ensembl for multiple sequence alignment; by several criteria the alignments are more accurate than those of other software
- MegaBLAST and discontiguous MegaBLAST
- MegaBLAST: an NCBI tool optimized for fast alignment of long DNA query sequences. The default word length is 28 and can be raised to 256; a larger word length runs faster.
- A smaller word length gives higher sensitivity but runs more slowly.
- A percent-identity threshold for the reported matches can be set
- Match and mismatch scores can be set
- Discontiguous MegaBLAST: a tool for comparing more distantly related genomic sequences.
- BLAST-like alignment tool (BLAT)
- A very fast search tool for genomic DNA
- BLAT breaks the entire genomic DNA database down into a word index containing all non-overlapping 11-mers in the genome.
- The database indexing strategy used by BLAT is also used by SSAHA2 and, later, MegaBLAST
- Other properties:
- BLAST triggers an extension when two hits occur; BLAT requires multiple hits;
- BLAT is mainly intended to find matches that are at least 95% similar to the query sequence
- BLAT searches for intron-exon boundaries, which in effect builds a model of the gene structure.
- Query example (figure)
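The BLAT indexing idea (index the non-overlapping words of the database, then look up every overlapping word of the query) can be sketched in Python as below. This is a simplified illustration, not BLAT's implementation: the function names are made up, the toy example uses k = 4 rather than 11 to stay readable, and "several hits on the same diagonal" is an assumed stand-in for the multiple-hit requirement.

```python
from collections import defaultdict

def build_index(genome, k):
    """Map each non-overlapping k-mer of the genome to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index

def diagonal_hits(query, index, k):
    """Look up every overlapping k-mer of the query and count hits
    per diagonal (genome position minus query position)."""
    diagonals = defaultdict(int)
    for q in range(len(query) - k + 1):
        for g in index.get(query[q:q + k], []):
            diagonals[g - q] += 1
    return dict(diagonals)

def candidate_diagonals(query, index, k, min_hits=2):
    """Diagonals supported by at least min_hits word matches."""
    counts = diagonal_hits(query, index, k)
    return sorted(d for d, n in counts.items() if n >= min_hits)

genome = "TTGACCATGCAGTCAAGGCTTACGATCC"
query = genome[5:21]              # a fragment taken from offset 5
index = build_index(genome, 4)
counts = diagonal_hits(query, index, 4)
candidates = candidate_diagonals(query, index, 4)
```

Indexing only non-overlapping words keeps the index roughly k times smaller than an index of every word, and the query side compensates by trying every offset.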
- LAGAN
- Pairwise sequence alignment
- The global pairwise alignment is carried out in three steps
- 1. Local alignments between the two sequences are generated first to identify a set of anchors, allowing multiple short inexact word matches instead of long exact word matches;
- 2. A rough global map is built, consisting of the highest-scoring ordered set of anchors;
- 3. The final global alignment is computed, restricted to the region defined by the rough map.
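Step 2 of the procedure above, choosing an ordered set of anchors with the highest total score, can be sketched as a small dynamic program. This is an illustration under simplifying assumptions, not LAGAN's code: each anchor is reduced to a hypothetical `(start_a, start_b, score)` tuple, anchor lengths and overlaps are ignored, and the quadratic loop is chosen for readability rather than speed.

```python
def chain_anchors(anchors):
    """anchors: list of (start_a, start_b, score).
    Return the highest-scoring chain in which both start
    coordinates strictly increase from one anchor to the next."""
    anchors = sorted(anchors)
    n = len(anchors)
    if n == 0:
        return []
    best = [a[2] for a in anchors]   # best chain score ending at anchor i
    prev = [-1] * n                  # predecessor of anchor i in that chain
    for i in range(n):
        for j in range(i):
            ordered = (anchors[j][0] < anchors[i][0]
                       and anchors[j][1] < anchors[i][1])
            if ordered and best[j] + anchors[i][2] > best[i]:
                best[i] = best[j] + anchors[i][2]
                prev[i] = j
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(anchors[i])
        i = prev[i]
    return chain[::-1]

anchors = [(10, 12, 5), (30, 8, 9), (40, 45, 7), (60, 50, 4), (55, 90, 6)]
rough_map = chain_anchors(anchors)
rough_map_score = sum(score for _, _, score in rough_map)
```

The final global alignment then only needs to be computed in the regions between consecutive anchors of the chain, which is far cheaper than filling the full dynamic programming matrix.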
- SSAHA2
- SSAHA2 converts the DNA database into a hash table of fixed-length words, so that pairwise alignment can quickly find matches by looking them up in the hash table.
- 5.6 Aligning next-generation sequencing reads to a reference genome
- 1977: Sanger sequencing; 2005: next-generation sequencing (NGS)
- Alignment considerations:
- Matches and mismatches
- Running speed
- Indexing: hash tables and suffix trees
- Hash-table-based alignment
- Uses a "seed-and-extend" strategy
- 1. Two kinds of input data:
- The reference genome sequence
- A large number of short reads
- 2. Index the reads and build multiple hash tables
- 3. Then search the hash tables to identify matching regions in the database.
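The three steps above can be sketched as a toy read mapper. It is a minimal sketch, not how any particular aligner works: the function names are made up, the seed is simply the first `k` bases of each read, only substitutions are allowed during extension, and a single hash table is built where real tools build several so that a mismatch inside one seed does not hide the read.

```python
from collections import defaultdict

def index_reads(reads, k):
    """Hash table: first k bases of each read (the seed) -> read ids."""
    table = defaultdict(list)
    for rid, read in enumerate(reads):
        if len(read) >= k:
            table[read[:k]].append(rid)
    return table

def map_reads(reference, reads, k=4, max_mismatches=1):
    """Scan the reference; where a k-mer equals a read's seed, extend by
    comparing the whole read, allowing up to max_mismatches substitutions.
    Return (read_id, reference_position, mismatches) tuples."""
    table = index_reads(reads, k)
    placements = []
    for pos in range(len(reference) - k + 1):
        for rid in table.get(reference[pos:pos + k], []):
            read = reads[rid]
            window = reference[pos:pos + len(read)]
            if len(window) < len(read):
                continue  # read would run past the end of the reference
            mismatches = sum(1 for r, w in zip(read, window) if r != w)
            if mismatches <= max_mismatches:
                placements.append((rid, pos, mismatches))
    return placements

reference = "GATTACAGGCTTACGATCCA"
reads = ["TACAGGCT", "CTTACGTT", "AAAAAAAA"]
placements = map_reads(reference, reads)
```

The first read maps exactly, the second maps with one mismatch, and the third has no seed match, so it is never compared against the reference at all.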
- Alignment based on the Burrows-Wheeler transform (suffix tree)
- Suffix trees and suffix arrays are one way to speed up alignment and are used by the common tools BWA and Bowtie2. These take into account read length, sequencing error rate, and gap penalties, and consider both local and global alignment of the reads.
- The BWT transforms and compresses the reference genome losslessly, i.e. the complete original sequence can be restored from the compressed data.
- 1. Given a string of length N, generate an N×N matrix in which each row is a cyclic shift of the string
- 2. Sort the rows in lexicographic order to obtain matrix M; its first column is F and its last column is L
- 3. The compressed string keeps only the information in F and L (or an index of them), from which matrix M can be restored quickly
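The three BWT steps above can be tried out with a naive Python sketch. It builds the full matrix of cyclic shifts, which is fine for a short string but not how BWA or Bowtie2 handle a genome; the end marker `$` and the function names are assumptions made here. Since F is just the sorted characters of L, the inverse needs only L.

```python
def bwt(text, end="$"):
    """Return (F, L): the first and last columns of the sorted
    matrix of cyclic shifts of text + end marker."""
    assert end not in text
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    first = "".join(row[0] for row in rotations)
    last = "".join(row[-1] for row in rotations)
    return first, last

def inverse_bwt(last, end="$"):
    """Rebuild the original text from L by repeatedly prepending L
    to the partial rows and sorting them again."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    row = next(r for r in table if r.endswith(end))
    return row[:-1]

F, L = bwt("banana")
restored = inverse_bwt(L)
```

The runs of identical letters in L are what make the transformed string easy to compress, and the round trip back to the original string is what "lossless" means here.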
- 5.7 Perspective
- As BLAST searching has become a basic tool for studying proteins and genes, many specialized applications have been developed, including different algorithms and specialized databases.
- BLAST is not suited to searching large amounts of genomic DNA; other methods achieve this by using longer word lengths, spaced seeds, and indexing of the database and query sequences.
- Short-read alignment tools are designed specifically to align millions of short sequences to a reference genome; typical applications include finding SNPs and structural variants (SVs).
- 5.8 Common pitfalls
- For any bioinformatics problem, you must define the goal of the database query, i.e. what you are trying to achieve
- Watch for false positives in BLAST results: remove them from the results and set an appropriate expect threshold
- Use the right tools and databases for the specific goal
- I hope this article helps. You are also welcome to join the discussion group, or add WeChat bbplayer2021, to share your experience learning bioinformatics.