November 16, 2021 [reading notes] - metagenome analysis workflow
2022-06-30 07:37:00 【Muyiqing】
These are my notes from the MEG gene metagenomics course
Contents
1. Quality control
Quality check
FastQC
Low-quality sequence filtering and adapter removal
Trimmomatic
Host contamination sequence filtering
FastQ Screen; Bowtie2
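In practice the quality-control steps above are run with the listed tools. Purely to illustrate what low-quality filtering means, the sketch below parses FASTQ records in plain Python and keeps reads whose mean Phred score reaches a cutoff. The Phred+33 encoding, the cutoff of 20 and all function names are assumptions made for this example, not settings from the course, and a mean-quality filter is simpler than the trimming a dedicated tool performs.

```python
from typing import Iterator, List, Tuple

Read = Tuple[str, str, str]  # (header, sequence, quality string)


def parse_fastq(lines: List[str]) -> Iterator[Read]:
    """Yield (header, sequence, quality) from 4-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, qual = (line.rstrip("\n") for line in lines[i:i + 4])
        yield header, seq, qual


def mean_phred(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string, assuming Phred+33 encoding."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_reads(reads, min_mean_q: float = 20.0) -> List[Read]:
    """Keep reads whose mean quality is at least min_mean_q (hypothetical cutoff)."""
    return [r for r in reads if r[2] and mean_phred(r[2]) >= min_mean_q]


# Two toy records: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
example = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "ACGT", "+", "####",
]
kept = filter_reads(parse_fastq(example))
print([r[0] for r in kept])
```

Only the first toy read passes the cutoff here; the second is discarded because every base has a very low score.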
2. Reads-based analysis
Definition: the clean sequenced reads are aligned directly against the sequences in a database, and the alignments give the detected abundance of known species or functional gene sequences
2.1 Species classification and identification
Alignment against a genome database does not necessarily identify the species, because a read may hit a conserved sequence shared by several species
Limitation of marker-based detection: 16S databases have low resolution, so species-level classification is ambiguous
2.2 Revealing species diversity
Methods based on genome or protein databases may overestimate species diversity: they report many species at very low abundance, and these calls are not very reliable;
Methods based on 16S sequence alignment (EMIRGE, phyloFlash) give a diversity close to that from 16S sequencing, although the abundance ratios differ. In principle, the metagenomic results are closer to the true distribution
2.3 Functional classification and identification
2.4 Choice of database and alignment algorithm
The size of the database directly determines the accuracy of reads-based analysis results
Large databases are slow to search, so reads are generally compared against smaller general databases (KEGG/eggNOG/COG) or specialized databases (ARGs-OAP/NCycDB/CAZy)
The reference database is usually a protein database
Choice of algorithm:
BLASTX
DIAMOND
HMMER
Note: reads-based analysis usually requires multi-threaded parallel computing
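BLASTX and DIAMOND can both write tab-separated alignment tables (BLAST output format 6). A minimal sketch, assuming the default 12-column layout in which column 1 is the read ID, column 2 the subject ID and column 12 the bit score, keeps the best hit per read and counts reads per subject. The function names and the toy rows are illustrative only.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def best_hits(rows: Iterable[str]) -> Dict[str, Tuple[str, float]]:
    """Map each read ID to the (subject ID, bit score) of its highest-scoring hit."""
    best: Dict[str, Tuple[str, float]] = {}
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        read_id, subject_id, bitscore = fields[0], fields[1], float(fields[11])
        if read_id not in best or bitscore > best[read_id][1]:
            best[read_id] = (subject_id, bitscore)
    return best


def count_by_subject(best: Dict[str, Tuple[str, float]]) -> Counter:
    """Count how many reads were assigned to each subject sequence."""
    return Counter(subject for subject, _score in best.values())


# Toy rows: read r1 hits two genes, read r2 hits one.
rows = [
    "r1\tgeneA\t90.0\t50\t5\t0\t1\t150\t1\t50\t1e-20\t95.0",
    "r1\tgeneB\t80.0\t50\t10\t0\t1\t150\t1\t50\t1e-10\t60.0",
    "r2\tgeneA\t85.0\t50\t7\t0\t1\t150\t1\t50\t1e-15\t88.0",
]
hits = best_hits(rows)
counts = count_by_subject(hits)
print(counts)
```

Counting only the best hit per read is one simple convention; other choices, such as filtering by e-value or identity first, are equally plausible.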
2.5 Principles and methods of sequence normalization
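The notes record only the heading for this topic. As one common illustration of sequence normalization (homogenization), and not necessarily the method taught in the course, raw hit counts can be divided by each sample's total so that samples sequenced to different depths become comparable. The sample contents and names below are made up.

```python
from typing import Dict


def relative_abundance(counts: Dict[str, int]) -> Dict[str, float]:
    """Scale raw counts so that they sum to 1 within one sample."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: value / total for name, value in counts.items()}


# Hypothetical samples: same composition, tenfold difference in depth.
sample_a = {"geneA": 30, "geneB": 70}
sample_b = {"geneA": 300, "geneB": 700}

rel_a = relative_abundance(sample_a)
rel_b = relative_abundance(sample_b)
print(rel_a, rel_b)
```

After scaling, the two samples have identical profiles even though their raw counts differ by a factor of ten.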
2.6 In-depth analysis of the data tables
Alpha diversity
Beta diversity
Species composition
Differential comparison
Correlation analysis
Network analysis
Phylogenetic tree
Metabolic pathways, pathway enrichment
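For the alpha and beta entries in the list above, a minimal sketch: the Shannon index is a widely used within-sample (alpha) measure and Bray-Curtis dissimilarity a widely used between-sample (beta) measure. The notes do not say which indices the course used, so these two are assumptions chosen as typical examples.

```python
import math
from typing import Sequence


def shannon_index(counts: Sequence[float]) -> float:
    """Shannon diversity H = -sum(p_i * ln p_i) over taxa with nonzero counts."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in props)


def bray_curtis(a: Sequence[float], b: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity between two samples listing the same taxa in the same order."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(a) + sum(b)
    return numerator / denominator if denominator else 0.0


even = [10, 10]        # two equally abundant taxa
dominated = [19, 1]    # one taxon dominates
print(shannon_index(even), shannon_index(dominated))
print(bray_curtis([1, 0], [0, 1]), bray_curtis([1, 2], [1, 2]))
```

The even community has the higher Shannon index, and Bray-Curtis ranges from 0 for identical samples to 1 for samples that share no taxa.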
2.7 Analysis examples
Biological replicates are included; functional genes are selected (genes related to the nitrogen cycle)
3. Contig assembly
3.1 Assembly algorithm: the de Bruijn graph algorithm
3.2 Contig assembly algorithms and strategies
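A toy sketch of the de Bruijn graph idea, assuming error-free reads and a single k: every k-mer in a read becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix, and a contig is read off by following nodes that have exactly one successor. Real assemblers also handle sequencing errors, reverse complements, coverage and multiple k values, none of which is attempted here. The function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def de_bruijn_graph(reads: List[str], k: int) -> Dict[str, List[str]]:
    """Build a graph whose nodes are (k-1)-mers and whose edges are k-mers."""
    graph: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_unambiguous(graph: Dict[str, List[str]], start: str) -> str:
    """Extend a contig from start while the current node has one distinct successor."""
    contig = start
    node = start
    visited = {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break  # dead end or branch point
        node = successors.pop()
        if node in visited:
            break  # stop on a cycle
        visited.add(node)
        contig += node[-1]
    return contig


# Two overlapping toy reads that share the 3-mer GTC.
reads = ["ACGTC", "GTCAG"]
graph = de_bruijn_graph(reads, k=3)
contig = walk_unambiguous(graph, "AC")
print(contig)
```

The two reads are joined through their shared k-mer into a single longer sequence, which is the basic mechanism behind contig assembly.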
Commonly used tools
metaSPAdes (high memory consumption)
SPAdes
MEGAHIT (good balance)
CLC
IDBA-UD
Contigs are usually assembled independently for each sample, but if conditions permit, the clean reads of biological replicates can be pooled and assembled together.
3.3 Contigs-based analysis
Contigs can be compared against databases of known genes;
ORF and protein prediction can be performed;
For specific genes and fragments, contigs can be used to look for upstream and downstream regulatory sites or regional features
4. Binning
4.1 MAGs: metagenome-assembled genomes
MAGs workflow
MetaWRAP: described as a new tool in 2020, with a higher degree of integration
Definition: binning is the process of grouping similar contigs together to form MAGs
Notes:
1. Each bin in the resulting bin set consists of the multiple contigs assigned to it, but the contigs remain independent
2. A MAG obtained by binning is not the genome of a species
4.2 Basis for binning
Coverage
TNFs: tetranucleotide frequencies
GC content
Taxonomy: species classification information
Distribution patterns of contigs
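Two of the features listed above, GC content and tetranucleotide frequencies, can be computed directly from a contig sequence. A minimal sketch follows; it counts all 256 tetramers over A, C, G and T on the given strand only, whereas binning tools often merge reverse-complement tetramers, which this example does not do. The function names are illustrative.

```python
from collections import Counter
from itertools import product
from typing import Dict


def gc_content(seq: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def tetranucleotide_frequencies(seq: str) -> Dict[str, float]:
    """Relative frequency of each of the 256 ACGT tetramers in the sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    total = sum(counts[k] for k in kmers)  # ignores tetramers containing N etc.
    return {k: (counts[k] / total if total else 0.0) for k in kmers}


contig = "ACGTACGTGGCC"
print(gc_content(contig))
tnf = tetranucleotide_frequencies(contig)
print(tnf["ACGT"], tnf["AAAA"])
```

Contigs from the same genome tend to have similar vectors of this kind, which is why such composition features, together with coverage, can be used to group contigs.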
4.3 Binning quality assessment
Principle: the set of universal single-copy marker genes (SCGs) expected in the species' phylogenetic lineage is used to assess the contamination and completeness of each MAG after binning;
Completeness: the genes in the MAG are compared with the corresponding SCGs to check whether the expected genes are all present. The higher the value, the better the bin quality
Contamination: the extent to which one MAG contains multiple species. The lower the value, the better the bin quality
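The two measures can be illustrated with a deliberately simplified calculation: completeness as the percentage of expected single-copy marker genes found at least once, and contamination as the percentage of extra copies relative to the size of the marker set. This is a sketch under those assumptions; dedicated assessment tools use more refined schemes, and the marker names below are made up.

```python
from typing import Dict, List, Tuple


def completeness_and_contamination(
    marker_copies: Dict[str, int], marker_set: List[str]
) -> Tuple[float, float]:
    """Return (completeness %, contamination %) for one bin.

    marker_copies maps a marker gene to the number of copies found in the bin;
    marker_set lists the single-copy markers expected for the lineage.
    """
    n = len(marker_set)
    present = sum(1 for m in marker_set if marker_copies.get(m, 0) >= 1)
    extra = sum(max(marker_copies.get(m, 0) - 1, 0) for m in marker_set)
    return 100.0 * present / n, 100.0 * extra / n


# Hypothetical bin: m2 is found twice, m4 is missing.
markers = ["m1", "m2", "m3", "m4"]
found = {"m1": 1, "m2": 2, "m3": 1}
completeness, contamination = completeness_and_contamination(found, markers)
print(completeness, contamination)
```

A marker that is expected once but found twice suggests that contigs from more than one genome ended up in the same bin, which is why duplicated markers are counted as contamination.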
4.4 Integration of binning results
Three options
Co-assembly (consumes a lot of computing resources)
Binning each sample separately (bin contamination is as high as with co-assembly)
dRep
Assembly followed by dereplication can yield more and higher-quality bins
4.5 Further analysis of bins
Single organism
Display the distribution of genes across the genome
Collinearity analysis with related species
Draw a cell metabolism model
Analyze upstream and downstream genes
Microbial community
Draw a specialized cell metabolism model
Build a more accurate phylogenetic tree from multiple conserved proteins
Analyze species associations through metabolic networks
5. Summary
5.1 Comparison of analysis approaches
5.2 Limitations and challenges of the analysis
- High cost; consistency is difficult to maintain
- The genomes obtained are incomplete and lack clear taxonomic information
- Sequencing results do not necessarily represent the active microbial groups
- The properties of the microorganisms affect quantification, and the relative abundances from a metagenome cannot reflect absolute abundance in the actual sample
5.3 Suggestions for analysis
- Choose reads-based or assembly-based analysis according to the scientific question
- Databases, alignment algorithms and assembly tools all require a balance between accuracy and speed
- Keep learning metagenome analysis methods and software, paying attention to each new method's principle, scope of application and runtime efficiency
- Learn to use the data in public databases and analyze it together with your own data