December 19, 2021 [Reading notes] Bioinformatics and Functional Genomics (Chapter 5: Advanced Database Searching)
2022-06-30 07:38:00 【Muyiqing】
- 5.5 Using BLAST-like alignment tools to search genomic DNA rapidly
- Requirements: as genomic DNA databases keep growing, more and more is demanded of alignment tools
- They should be able to find exons within genomic DNA
- They should take into account that genomic DNA contains sequencing errors
- They need algorithms that cope with deletions, duplications, inversions, or translocations between the genomes of related species
- They need algorithms that handle small differences between DNA sequences, such as SNPs
- Genome alignment performance is evaluated with benchmark sets
- When tested on simulated sequence sets generated with the Random Model of Sequence Evolution (ROSE) software package, the global alignment tool LAGAN had the highest sensitivity, while local alignment tools (such as BLASTZ) aligned certain regions more accurately
- PatternHunter: spaced (discontiguous) seeds improve sensitivity
- PatternHunter places positions where a mismatch is allowed between the required match positions of a seed, which improves both speed and sensitivity (two seed models are described)
- Writing 1 for a required match and 0 for a position where a mismatch is allowed, the seed models look like this:
- BLASTN: 11111111111
- PatternHunter: 110100110010101111 (an alternative model: 11101001010011011)
- Reason: neighboring seed hits share very few positions, so the hits are more independent of one another than with a contiguous seed model
- Homology search algorithms such as BLASTZ and MegaBLAST also use this strategy
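
The difference between the two seed models can be sketched in a few lines of Python. This is only an illustration of the idea, not PatternHunter's implementation: the function name `seed_hits` and the toy sequences are invented for the example, and the two sequences are assumed to be already lined up without gaps.

```python
def seed_hits(a: str, b: str, seed: str) -> list[int]:
    """Return every offset i at which a and b agree on all '1' positions of the seed.

    Positions marked '0' in the seed are ignored, so mismatches are tolerated there.
    """
    required = [k for k, c in enumerate(seed) if c == "1"]
    span = len(seed)
    hits = []
    for i in range(min(len(a), len(b)) - span + 1):
        if all(a[i + k] == b[i + k] for k in required):
            hits.append(i)
    return hits


CONTIGUOUS = "1" * 11              # BLASTN-style seed: 11 consecutive matches
SPACED = "110100110010101111"      # spaced seed from the notes: 11 matches in a span of 18

# Toy pair of 18-mers with mismatches at indices 5 and 11,
# both of which are '0' positions of the spaced seed.
x = "ACGTACGTTGCATGCAAC"
chars = list(x)
chars[5] = "A"    # C -> A
chars[11] = "G"   # A -> G
y = "".join(chars)

print(seed_hits(x, y, SPACED))      # the spaced seed still finds a hit
print(seed_hits(x, y, CONTIGUOUS))  # no run of 11 identical bases exists
```

In this toy case the longest run of identical bases is 6, so the contiguous seed finds nothing, while the spaced seed of the same weight still reports the match.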
- BLASTZ
- Used to align human and mouse genomic DNA sequences.
- How it works:
- Lineage-specific interspersed repeats are removed from both sequences
- Word matches of size 12 are located with the spaced seed 1110100110010101111 (12 required positions in a span of 19) and are first extended without gaps; when the score exceeds a threshold, the extension is repeated with gaps allowed
- For regions between adjacent successful alignments, step 2 is repeated with a smaller (more sensitive) word size, such as 7.
- BLASTZ genome alignments can be visualized in the UCSC Genome Browser
- Enredo and Pecan (brief)
- Used by Ensembl for multiple sequence alignment; by several criteria the alignments are more accurate than those of other software
- MegaBLAST and discontiguous MegaBLAST
- MegaBLAST: an NCBI tool optimized for rapidly aligning long DNA query sequences. The default word size is 28 and can be raised to 256, which increases speed.
- A smaller word size gives higher sensitivity but a slower search.
- A percent-identity threshold can be set for the reported alignments
- Match and mismatch scores can be set
- Discontiguous MegaBLAST: a tool for aligning more distantly related genomic sequences.
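
For local runs, these settings correspond to options of the NCBI BLAST+ `blastn` program (`-task`, `-word_size`, `-perc_identity`, `-reward`, `-penalty`). The helper below only assembles a command line as a list of arguments; the function name `megablast_command`, the file name `query.fa`, the database name `genome_db`, and the chosen values are placeholders for illustration, not something taken from the book.

```python
def megablast_command(query, db, word_size=None, perc_identity=None,
                      reward=None, penalty=None, discontiguous=False):
    """Build a blastn argument list for MegaBLAST or discontiguous MegaBLAST."""
    task = "dc-megablast" if discontiguous else "megablast"
    if word_size is None:
        # Usual defaults: 28 for megablast, 11 for dc-megablast.
        word_size = 11 if discontiguous else 28
    cmd = ["blastn", "-task", task, "-query", query, "-db", db,
           "-word_size", str(word_size)]
    if perc_identity is not None:
        cmd += ["-perc_identity", str(perc_identity)]   # identity cutoff for reported hits
    if reward is not None:
        cmd += ["-reward", str(reward)]                 # score for a match
    if penalty is not None:
        cmd += ["-penalty", str(penalty)]               # score for a mismatch (negative)
    return cmd


cmd = megablast_command("query.fa", "genome_db", perc_identity=95,
                        reward=1, penalty=-2)
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd)` on a machine where BLAST+ is installed and the database exists.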
- BLAST-like alignment tool (BLAT)
- A very fast search tool for genomic DNA
- BLAT breaks the entire genomic DNA database into an index of words, consisting of all non-overlapping 11-mers in the genome.
- The database-indexing strategy used by BLAT is also used by SSAHA2 and by later versions of MegaBLAST
- Other features:
- BLAST triggers an extension when two hits occur, whereas BLAT requires multiple hits;
- BLAT is mainly intended to find matches that are at least 95% identical to the query sequence
- BLAT searches for intron-exon boundaries, in effect building a model of gene structure.
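
The indexing idea can be sketched as follows. This is a simplified illustration rather than BLAT's actual code: the function names and the toy genome are invented, and `k` is set to 4 in the demonstration only so that the example stays readable (the notes give 11 for BLAT).

```python
from collections import defaultdict


def build_index(genome: str, k: int = 11) -> dict[str, list[int]]:
    """Index the non-overlapping k-mers of the genome: word -> start positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):   # step of k: words do not overlap
        index[genome[pos:pos + k]].append(pos)
    return index


def find_hits(query: str, index: dict[str, list[int]], k: int = 11) -> list[tuple[int, int]]:
    """Look up every overlapping k-mer of the query; return (query_pos, genome_pos) pairs."""
    hits = []
    for q in range(len(query) - k + 1):
        for g in index.get(query[q:q + k], []):
            hits.append((q, g))
    return hits


genome = "ACGTTGCAAGCTTAGGCATC"
index = build_index(genome, k=4)        # words at positions 0, 4, 8, 12, 16
hits = find_hits("GTTGCAAG", index, k=4)
print(hits)
```

Indexing only non-overlapping words keeps the index about k times smaller than an index of every k-mer. Sliding over all overlapping words of the query compensates for this: any exact match of length at least 2k-1 must contain one complete indexed word.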
- LAGAN
- Pairwise sequence alignment
- The global pairwise alignment proceeds in three steps
- 1. Local alignments between the two sequences are generated first to identify a set of anchors, allowing multiple short inexact word matches rather than requiring long exact word matches;
- 2. A rough global map is built, consisting of the highest-scoring ordered subset of the anchors;
- 3. The final global alignment is computed, restricted to the region defined by the rough map.
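
Step 2 can be pictured as a chaining problem: among all anchors, choose the subset with the highest total score whose coordinates increase in both sequences. The sketch below uses a simple quadratic dynamic program and treats each anchor as a point `(x, y, score)`; both the function name `best_chain` and this simplified anchor representation are assumptions made for illustration, not LAGAN's own implementation.

```python
def best_chain(anchors):
    """Return the highest-scoring chain of anchors that increases in both x and y.

    anchors: list of (x, y, score) tuples, where x and y are positions in the
    first and second sequence.
    """
    anchors = sorted(anchors)                 # sort by x, then y
    n = len(anchors)
    if n == 0:
        return []
    best = [a[2] for a in anchors]            # best chain score ending at each anchor
    prev = [-1] * n                           # back-pointers for traceback
    for j in range(n):
        xj, yj, sj = anchors[j]
        for i in range(j):
            xi, yi, _ = anchors[i]
            if xi < xj and yi < yj and best[i] + sj > best[j]:
                best[j] = best[i] + sj
                prev[j] = i
    j = max(range(n), key=best.__getitem__)   # end of the best chain
    chain = []
    while j != -1:
        chain.append(anchors[j])
        j = prev[j]
    return chain[::-1]


anchors = [(10, 12, 5), (30, 8, 9), (40, 45, 7), (60, 50, 4), (70, 20, 6)]
chain = best_chain(anchors)
print(chain)
```

The anchors left out of the chain, here `(10, 12, 5)` and `(70, 20, 6)`, are the ones that would break the ordering. The final global alignment then only has to be computed in a band around the chained anchors.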
- SSAHA2
- SSAHA2 converts the DNA database into a hash table of fixed-length words, so that pairwise alignment can quickly look up matches in the hash table.
- 5.6 Aligning next-generation sequencing reads to a reference genome
- 1977: Sanger sequencing; 2005: next-generation sequencing (NGS)
- Alignment considerations:
- Matches and mismatches
- Running speed
- Use of an index: hash tables and suffix trees
- Hash-table-based alignment
- Uses a "seed-and-extend" strategy
- 1. Two kinds of data are input:
- The reference genome sequence
- A large number of short sequence reads
- 2. Index the fragments and build multiple hash tables
- 3. Search the hash tables to identify matching regions in the database.
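
A minimal seed-and-extend sketch is given below. It makes several simplifying assumptions that are not from the book: the hash table is built over the reference (some tools hash the reads instead), seeds are exact, and the "extension" is just a mismatch count over the full read with no gaps. The function names and the toy sequences are invented for the example.

```python
from collections import defaultdict


def index_reference(ref: str, k: int) -> dict[str, list[int]]:
    """Hash table of every overlapping k-mer in the reference: word -> positions."""
    table = defaultdict(list)
    for i in range(len(ref) - k + 1):
        table[ref[i:i + k]].append(i)
    return table


def align_read(read: str, ref: str, table: dict[str, list[int]],
               k: int, max_mismatches: int = 1) -> list[tuple[int, int]]:
    """Return sorted (reference_start, mismatches) placements of the read."""
    found = {}
    for offset in range(0, len(read) - k + 1, k):        # non-overlapping seeds of the read
        for hit in table.get(read[offset:offset + k], []):
            start = hit - offset                         # implied start of the whole read
            if start < 0 or start + len(read) > len(ref) or start in found:
                continue
            window = ref[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))   # ungapped extension
            if mismatches <= max_mismatches:
                found[start] = mismatches
    return sorted(found.items())


ref = "TTGACCGATAGCTAGGCTTAACGGATCC"
read = ref[8:20]
read = read[:1] + "C" + read[2:]        # introduce one sequencing error
table = index_reference(ref, k=4)
placements = align_read(read, ref, table, k=4)
print(placements)
```

Splitting the read into several non-overlapping seeds is what makes the lookup tolerant of errors: a read with m mismatches, cut into m+1 seeds, must contain at least one seed that matches exactly.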
- Alignment based on the Burrows-Wheeler transform (suffix trees)
- Suffix trees and suffix arrays are one way to speed up alignment. BWA and Bowtie2 are commonly used; they take into account read length, sequencing error rate, and gap penalties, and weigh local against global alignment of the reads.
- The BWT transforms the reference genome so that it can be compressed losslessly, that is, the complete original sequence can be recovered from the compressed data.
- 1. Given a string of length N, generate an N×N matrix of its cyclic shifts
- 2. Sort the rows lexicographically to obtain the matrix M; each row is a cyclic shift of the string, the first column is F and the last column is L
- 3. Only the information (or an index) for F and L needs to be kept with the compressed string; from it the matrix M can be restored quickly
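
The three steps can be tried out on a short string. The sketch below builds the full rotation matrix, which is fine for a demonstration but far too slow and memory-hungry for a genome (real aligners derive the BWT from a suffix array). The function names are invented, and the code assumes an end marker `$` that does not occur in the text and sorts before every other character.

```python
def bwt(text: str, end: str = "$") -> str:
    """Burrows-Wheeler transform: last column L of the sorted rotation matrix M."""
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))   # rows of M
    return "".join(row[-1] for row in rotations)


def inverse_bwt(last: str) -> str:
    """Recover the original text from L alone.

    The first column F is just L sorted, and the k-th occurrence of a character
    in F is the same text character as its k-th occurrence in L.
    """
    # order[r] = index in L of the character found at F[r] (sorted() is stable).
    order = sorted(range(len(last)), key=lambda i: last[i])
    out = []
    r = order[0]                      # row 0 starts with the end marker
    for _ in range(len(last) - 1):
        out.append(last[order[r]])    # this is F[r], the next character of the text
        r = order[r]
    return "".join(out)


transformed = bwt("banana")
print(transformed)
print(inverse_bwt(transformed))
```

Because F can always be obtained by sorting L, storing L is enough to rebuild the text. L also tends to contain runs of identical characters, which is what makes it compress well.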
- 5.7 Perspective
- As BLAST searching has become a fundamental tool for studying proteins and genes, many specialized applications have been developed, including different algorithms and specialized databases.
- BLAST is not suited to searching large amounts of genomic DNA; other methods achieve this by using longer word sizes, spaced seeds, and indexing of the database and query sequences.
- Short-read alignment tools are designed specifically to align millions of short reads to a reference genome; typical applications include finding SNPs and structural variants (SVs).
- 5.8 Common pitfalls
- For any bioinformatics problem, you must be clear about the goal of the database query, that is, what you are trying to achieve
- Consider false positives in BLAST output, remove them from the results, and reset the expect threshold to an appropriate value
- Try to use the tools and databases that suit the specific goal
- I hope this article helps. You are welcome to join the discussion group, or add VX: bbplayer2021, to share your experience of learning bioinformatics.