November 16, 2021 [reading notes] - metagenome analysis workflow
2022-06-30 07:37:00 【Muyiqing】
These are my notes from the MEG gene metagenomics course
1. Quality control
Quality check
FastQC
Low-quality read filtering and adapter removal
Trimmomatic
Host contamination filtering
FastQ Screen; Bowtie2
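To make the filtering step concrete, here is a minimal sketch of sliding-window quality trimming, the idea behind Trimmomatic's SLIDINGWINDOW step. It is a simplified illustration and not Trimmomatic's implementation; the function names, the Phred+33 encoding, and the default window size, quality threshold and minimum length are assumptions chosen for the example.

```python
def phred33_scores(quality_string):
    """Decode a FASTQ quality string (assumed Phred+33) into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def sliding_window_trim(sequence, quality_string, window=4, min_mean_quality=15):
    """Scan from the 5' end and cut the read at the first window whose
    mean quality falls below the threshold (simplified illustration)."""
    scores = phred33_scores(quality_string)
    for start in range(0, len(scores) - window + 1):
        window_scores = scores[start:start + window]
        if sum(window_scores) / window < min_mean_quality:
            return sequence[:start], quality_string[:start]
    return sequence, quality_string


def passes_length_filter(sequence, min_length=36):
    """Discard reads that became too short after trimming."""
    return len(sequence) >= min_length


if __name__ == "__main__":
    seq, qual = sliding_window_trim("ACGTACGTAC", "IIIIII####")
    print(seq, qual, passes_length_filter(seq))
```

In practice the real tools should be used for this step; the sketch only shows why a read with a low-quality tail comes out shorter, and why a length filter is applied afterwards.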
2. Reads-based analysis
Definition: align the sequenced clean reads directly against the sequences in a database; the alignments give the detected abundance of known species or functional gene sequences
2.1 Species classification and identification
Aligning against a genome database does not necessarily identify the species, because reads may hit conserved sequences shared by several species
Limitation of marker-based detection: 16S databases have low resolution, so species-level classification is ambiguous
2.2 Revealing species diversity
Methods based on genome or protein databases may overestimate species diversity, producing many species of very low abundance and low reliability;
Methods based on 16S sequence alignment (EMIRGE, phyloFlash) give a diversity close to that of 16S sequencing, but the abundance ratios differ. In principle, the metagenomic results are closer to the true distribution
2.3 Functional classification and identification

2.4 Choosing a database and alignment algorithm
The size of the database directly determines the accuracy of reads-based analysis results
Large databases are slow to search, so reads are usually aligned against smaller general databases (KEGG/eggNOG/COG) or specialized databases (ARGs-OAP/NCycDB/CAZy)
The reference database is usually a protein database
Choice of algorithm:
BLASTX
DIAMOND
HMMER
Note: reads-based analysis usually requires multi-threaded parallel computing
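As an illustration of how alignment output turns into an abundance table, the sketch below counts reads per reference sequence from BLAST tabular output (the 12-column format that BLAST+ writes with -outfmt 6 and DIAMOND writes by default). The function name, the e-value cutoff and the best-hit rule are assumptions made for this example, not settings taken from the course.

```python
import csv
from collections import Counter


def count_best_hits(tabular_lines, max_evalue=1e-5):
    """Count reads per reference sequence, keeping only the
    highest-bitscore hit of each read (hypothetical helper)."""
    best = {}  # read id -> (bitscore, subject id)
    for row in csv.reader(tabular_lines, delimiter="\t"):
        if len(row) < 12:
            continue  # skip malformed or truncated lines
        read_id, subject_id = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue
        if read_id not in best or bitscore > best[read_id][0]:
            best[read_id] = (bitscore, subject_id)
    return Counter(subject for _, subject in best.values())


if __name__ == "__main__":
    # Made-up example hits: qseqid sseqid pident length mismatch gapopen
    # qstart qend sstart send evalue bitscore
    hits = [
        "read1\tgeneA\t95.0\t50\t2\t0\t1\t150\t10\t59\t1e-20\t100.0",
        "read1\tgeneB\t80.0\t50\t10\t0\t1\t150\t10\t59\t1e-10\t60.0",
        "read2\tgeneA\t90.0\t50\t5\t0\t1\t150\t1\t50\t1e-15\t85.0",
        "read3\tgeneB\t60.0\t30\t12\t0\t1\t90\t1\t30\t0.5\t25.0",
    ]
    print(count_best_hits(hits))
```

The same function can be fed an open file handle, since it only needs an iterable of lines. The raw counts it returns still need normalization before samples are compared.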
2.5 Principles and methods of sequence normalization
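One common way to make read counts comparable across genes of different length and samples of different depth is a TPM-style normalization: divide each count by the gene length, then rescale so the values sum to one million. The sketch below assumes this scheme purely for illustration; it is not necessarily the method taught in the course, and the function and argument names are hypothetical.

```python
def tpm(read_counts, gene_lengths_bp):
    """TPM-style normalization: length-normalize each count, then scale
    so that all values in the sample sum to one million."""
    rates = {gene: read_counts[gene] / gene_lengths_bp[gene] for gene in read_counts}
    total_rate = sum(rates.values())
    if total_rate == 0:
        return {gene: 0.0 for gene in rates}
    return {gene: rate / total_rate * 1e6 for gene, rate in rates.items()}


if __name__ == "__main__":
    counts = {"geneA": 100, "geneB": 100}     # made-up read counts
    lengths = {"geneA": 1000, "geneB": 2000}  # made-up gene lengths in bp
    print(tpm(counts, lengths))
```

With equal read counts, the shorter gene ends up with the higher normalized value, which is the point of length normalization. The result is still a relative quantity within each sample.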

2.6 In-depth analysis of the data table
Alpha diversity
Beta diversity
Species composition
Differential comparison
Correlation analysis
Network analysis
Phylogenetic tree
Metabolic pathways and pathway enrichment
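For the first two items in the list above, a minimal sketch of one alpha-diversity measure (Shannon index) and one beta-diversity measure (Bray-Curtis dissimilarity) computed from abundance vectors. The choice of these two measures is an assumption made for illustration; the notes do not say which indices the course uses.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p * ln p) over taxa with non-zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity between two samples whose abundance
    vectors list the same taxa in the same order."""
    numerator = sum(abs(a - b) for a, b in zip(sample_a, sample_b))
    denominator = sum(sample_a) + sum(sample_b)
    if denominator == 0:
        return 0.0
    return numerator / denominator


if __name__ == "__main__":
    print(shannon_index([10, 10, 10, 10]))
    print(bray_curtis([10, 0, 5], [0, 10, 5]))
```

Shannon is computed per sample, while Bray-Curtis compares a pair of samples: 0 means identical composition and 1 means no shared taxa.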

2.7 Analysis examples
Biological replicates are available; functional genes are selected (genes related to the nitrogen cycle)
3. Contig assembly
3.1 Assembly algorithm: de Bruijn graph
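A minimal sketch of the de Bruijn graph idea: every k-mer in the reads becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix, and contigs are read off along unambiguous paths. The function names and the toy reads are assumptions for illustration; real assemblers additionally handle reverse complements, sequencing errors, coverage and branching.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes
    that follow it in some k-mer of the reads."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_unambiguous(graph, start):
    """Walk forward from `start` while the current node has exactly one
    successor, spelling out the sequence (toy contig extension)."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in visited:
            break  # stop on a cycle
        contig += next_node[-1]
        visited.add(next_node)
        node = next_node
    return contig


if __name__ == "__main__":
    graph = build_de_bruijn_graph(["ACGTT", "GTTCA"], k=3)
    print(extend_unambiguous(graph, "AC"))
```

The two toy reads overlap by "GTT", and the walk merges them into a single sequence because the shared k-mer produces one shared edge in the graph.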

3.2 Contig assembly algorithms and strategy
Commonly used tools
metaSPAdes (high memory consumption)
SPAdes
MEGAHIT (good balance)
CLC
IDBA-UD
Contig assembly is usually performed independently for each sample, but the clean reads of biological replicates can be pooled and assembled together if resources permit.
3.3 Contigs-based analysis
Contigs can be aligned against databases of known genes;
ORF and protein prediction can be performed;
For specific genes and fragments, contigs can be used to look for upstream and downstream regulatory sites or regional features
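To show what ORF prediction on a contig means at its simplest, the sketch below scans the three reading frames of one strand for ATG ... stop stretches. It is a naive illustration under stated assumptions (ATG as the only start codon, a hypothetical minimum length, no partial genes at contig ends); dedicated gene predictors are considerably more sophisticated.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement so the other strand can be scanned too."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs_one_strand(seq, min_length_nt=90):
    """Return (start, end, sequence) for every ATG...stop stretch of at
    least `min_length_nt` nucleotides in the three forward frames."""
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                end = pos + 3
                if end - start >= min_length_nt:
                    orfs.append((start, end, seq[start:end]))
                start = None
    return orfs


if __name__ == "__main__":
    contig = "CCATGAAATTTGGGTAACC"  # made-up toy contig
    print(find_orfs_one_strand(contig, min_length_nt=9))
    print(find_orfs_one_strand(reverse_complement(contig), min_length_nt=9))
```

The coordinates returned are 0-based with an exclusive end, so the flanking sequence before `start` and after `end` is where one would look for upstream and downstream features.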
4. Binning
4.1 MAGs: metagenome-assembled genomes
MAG workflow
MetaWRAP: a new tool as of 2020, with a higher degree of integration
Definition: binning is the process of grouping similar contigs together; each resulting bin is a MAG
Notes:
1. The resulting bin set consists of the multiple contigs assigned to each bin, but the contigs remain independent sequences
2. A MAG obtained by binning is not the same as the genome of a species
4.2 Basis for binning
Coverage
TNFs: tetranucleotide frequencies
GC content
Taxonomy: species classification information
Distribution pattern of contigs
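Two of the composition features listed above, GC content and tetranucleotide frequencies, can be computed directly from a contig sequence. The sketch below is a minimal version with hypothetical function names; as far as I know, real binners usually also merge reverse-complement tetramers and combine these features with coverage.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def tetranucleotide_frequencies(seq):
    """Relative frequency of each of the 256 possible 4-mers,
    counted with a sliding window of step 1."""
    seq = seq.upper()
    all_tetramers = ["".join(p) for p in product("ACGT", repeat=4)]
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[t] for t in all_tetramers)  # ignores windows containing N
    if total == 0:
        return {t: 0.0 for t in all_tetramers}
    return {t: counts[t] / total for t in all_tetramers}


if __name__ == "__main__":
    contig = "ACGTACGTGGCC"  # made-up toy contig
    print(gc_content(contig))
    tnf = tetranucleotide_frequencies(contig)
    print({t: f for t, f in tnf.items() if f > 0})
```

Contigs from the same genome tend to have similar vectors of this kind, which is why they can be grouped by composition even without a reference.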
4.3 Binning quality assessment
Principle: use the set of universal single-copy marker genes (SCGs) for the species' phylogenetic lineage to evaluate the contamination and completeness of the MAGs obtained by binning;
Completeness: compare the genes in the MAG with the corresponding SCGs to check whether the full set is present. The higher the value, the better the bin
Contamination: the extent to which a single MAG contains multiple species. The lower the value, the better the bin
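The two scores can be sketched from marker gene copy counts: completeness as the share of expected single-copy markers found at least once, contamination as the share of extra copies. This is a simplified reading of the principle above with hypothetical names and made-up counts; dedicated assessment tools use lineage-specific marker sets and more refined formulas.

```python
def completeness_and_contamination(marker_copy_counts, expected_markers):
    """Return (completeness %, contamination %) for one bin.

    marker_copy_counts: dict mapping marker gene name -> copies found in the bin
    expected_markers:   list of single-copy marker genes expected for the lineage
    """
    n_expected = len(expected_markers)
    if n_expected == 0:
        raise ValueError("expected_markers must not be empty")
    present = sum(1 for m in expected_markers if marker_copy_counts.get(m, 0) >= 1)
    extra_copies = sum(max(marker_copy_counts.get(m, 0) - 1, 0) for m in expected_markers)
    completeness = 100.0 * present / n_expected
    contamination = 100.0 * extra_copies / n_expected
    return completeness, contamination


if __name__ == "__main__":
    expected = ["m1", "m2", "m3", "m4"]     # hypothetical marker set
    found = {"m1": 1, "m2": 2, "m3": 1}     # made-up copy counts for one bin
    print(completeness_and_contamination(found, expected))
```

A marker that should occur once but appears twice suggests that contigs from a second organism ended up in the same bin, which is what the contamination score captures.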
4.4 Integrating binning results
Three options
Co-assembly (consumes a lot of computing resources)
Separate binning (bin contamination is as high as with co-assembly)
dRep
Assembly followed by dereplication yields more and higher-quality bins
4.5 Further analysis of bins
Single organism
Display the gene distribution across the genome
Collinearity analysis with related species
Draw a cellular metabolic model
Analyze upstream and downstream genes
Community
Draw a community-specific cellular metabolic model
Build a more accurate phylogenetic tree from multiple conserved proteins
Analyze associations between species through metabolic networks
5. Summary
5.1 Comparison of analysis approaches

5.2 Limitations and challenges of the analysis
- The cost is high and consistency is hard to maintain
- The genomes obtained are incomplete and lack clear taxonomic information
- Sequencing results do not necessarily represent the active microbial populations
- The properties of the microorganisms affect the quantitative results, and the relative abundance information from a metagenome cannot reflect absolute abundance in the actual sample
5.3 Recommendations
- Choose reads-based or assembly-based analysis according to the scientific question
- Databases, alignment algorithms and assembly tools all require a balance between accuracy and speed
- Keep learning the methods and software of metagenome analysis, paying attention to the principle, scope of application and efficiency of each new method
- Learn to use the data in public databases and integrate it with your own data for analysis