November 16, 2021 [reading notes] - Metagenome analysis workflow
2022-06-30 07:37:00 【Muyiqing】
These are my notes from the MEG gene metagenomics course.
Contents
1. Quality control
Quality check
FastQC
Low-quality sequence filtering and adapter removal
Trimmomatic
Host contamination sequence filtering
FastQ Screen; Bowtie2
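As a rough illustration of what low-quality filtering does, here is a minimal Python sketch of a sliding-window quality trim. It is a simplified stand-in for the idea, not Trimmomatic's actual implementation; the function names, the window size of 4, the threshold of Q20, and the Phred+33 encoding are all assumptions made for this example.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def sliding_window_trim(sequence, quality_string, window=4, min_mean_quality=20):
    """Cut the read at the first window whose mean quality is below the threshold.

    Hypothetical helper: scans from the 5' end and keeps everything before
    the first failing window. Reads shorter than the window are left as-is.
    """
    scores = phred_scores(quality_string)
    cut = len(sequence)
    for start in range(max(len(scores) - window + 1, 0)):
        window_scores = scores[start:start + window]
        if sum(window_scores) / window < min_mean_quality:
            cut = start
            break
    return sequence[:cut], quality_string[:cut]


# Toy read: six high-quality bases ('I' = Q40) followed by four poor ones ('#' = Q2).
read = "ACGTACGTAC"
qual = "IIIIII####"
trimmed_seq, trimmed_qual = sliding_window_trim(read, qual)
print(trimmed_seq, trimmed_qual)
```

In practice the trimming is done by the dedicated tool on whole FASTQ files; the sketch only shows why the tail of a read disappears once its average quality drops.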
2. Reads-based analysis
Definition: the sequenced clean reads are aligned directly against the sequences in a database, and the alignment yields the detected abundance of known species or functional gene sequences.
2.1 Species classification and identification
Alignment against a genome database does not necessarily identify the species, because a read may match a conserved sequence.
Limitation of marker-based detection: the resolution of 16S databases is low, so species classification is ambiguous.
2.2 Revealing species diversity
Methods based on genome or protein databases may overestimate species diversity: they report many species at very low abundance, and those calls are not very reliable.
Methods based on 16S sequence alignment (EMIRGE, phyloFlash) give a diversity close to that of 16S sequencing, but the abundance ratios differ. In principle, the metagenomic results are closer to the true distribution.
2.3 Functional classification and identification

2.4 Choosing a database and alignment algorithm
The size of the database directly determines the accuracy of reads-based analysis results.
Large databases are slow to search, so reads are generally aligned against smaller general-purpose databases (KEGG/eggNOG/COG) or specialized databases (ARGs-OAP/NCycDB/CAZy).
The reference database is usually a protein database.
Alignment algorithm options:
BLASTX
DIAMOND
HMMER
Note: reads-based analysis usually requires multi-threaded parallel computing.
2.5 Principles and methods of sequence normalization

2.6 In-depth analysis of the data table
Alpha diversity
Beta diversity
Species composition
Differential comparison
Correlation analysis
Network analysis
Phylogenetic tree
Metabolic pathways and pathway enrichment
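To make the alpha and beta diversity items concrete, here is a minimal Python sketch that computes one common measure of each from an abundance table. The notes do not say which indices the course used, so the Shannon index and the Bray-Curtis dissimilarity are assumptions, and the sample counts are invented toy data.

```python
import math


def shannon_index(counts):
    """Alpha diversity: Shannon index (natural log) of one sample's counts."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


def bray_curtis(counts_a, counts_b):
    """Beta diversity: Bray-Curtis dissimilarity between two samples.

    Both lists must list the same taxa in the same order.
    """
    shared = sum(min(a, b) for a, b in zip(counts_a, counts_b))
    return 1 - 2 * shared / (sum(counts_a) + sum(counts_b))


# Toy abundance table: four taxa, two samples.
sample_a = [10, 10, 10, 10]   # perfectly even community
sample_b = [40, 0, 0, 0]      # a single dominant taxon

print(shannon_index(sample_a), shannon_index(sample_b))
print(bray_curtis(sample_a, sample_b))
```

An even community scores higher on the Shannon index than a community dominated by one taxon, and the Bray-Curtis value grows as the two samples share fewer counts.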

2.7 Analysis example
Biological replicates are available; select functional genes (genes related to the nitrogen cycle)
3. Contig assembly
3.1 Assembly algorithm: de Bruijn graph
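The de Bruijn graph idea can be sketched in a few lines of Python: every k-mer in the reads becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix, and a contig is read off by walking a path that never branches. This is a toy illustration under simplifying assumptions (error-free reads, forward strand only, no coverage information), not how any particular assembler is implemented; the function names and reads are made up for the example.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds one edge prefix -> suffix."""
    graph = defaultdict(list)
    seen_kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen_kmers:
                continue  # count each distinct k-mer once
            seen_kmers.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend a contig from `start` while the current node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in visited:
            break  # stop on a cycle
        visited.add(node)
        contig += node[-1]
    return contig


# Three overlapping toy reads drawn from one short sequence.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn_graph(reads, k=4)
contig = walk_unambiguous_path(graph, "ATG")
print(contig)
```

Real assemblers additionally handle sequencing errors, reverse complements, repeats, and uneven coverage, which is where most of the memory and runtime cost comes from.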

3.2 Contig assembly tools and strategy
Commonly used tools
metaSPAdes (high memory consumption)
SPAdes
MEGAHIT (good balance)
CLC
IDBA-UD
Contigs are usually assembled independently for each sample, but the clean reads of biological replicates can be pooled and assembled together if conditions permit.
3.3 Contigs-based analysis
Contigs can be aligned against databases of known genes;
ORF and protein prediction can be performed;
For specific genes and segments, contigs can be used to look for upstream and downstream regulatory sites or regional features.
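As a rough picture of what ORF prediction on a contig involves, here is a deliberately naive Python sketch that scans the three forward reading frames for stretches running from ATG to a stop codon. Dedicated gene predictors use statistical models and also scan the reverse strand; the function name, the minimum length of 3 codons, and the toy contig are assumptions for illustration only.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=3):
    """Return (start, end, orf) tuples for ATG...stop ORFs on the forward strand.

    Coordinates are 0-based with an exclusive end; the stop codon is included
    in the ORF and counted toward its length.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, sequence[start:i + 3]))
                start = None
    return orfs


# Toy contig containing one short ORF in the third reading frame.
contig = "CCATGAAATTTTAGGG"
orfs = find_orfs(contig)
print(orfs)
```

The coordinates returned for each ORF are also what make it possible to cut out the flanking sequence when looking for upstream or downstream features.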
4. Binning
4.1 MAGs: metagenome-assembled genomes
MAGs workflow
MetaWRAP: a new tool as of 2020, with a higher degree of integration
MAGs definition: genomes obtained by grouping similar contigs together (binning)
Notes:
1. Each resulting bin consists of the multiple contigs assigned to it, but the contigs remain independent sequences
2. A MAG obtained by binning is not the genome of a single species
4.2 Basis for binning
Coverage
TNFs: tetranucleotide frequencies
GC content
Taxonomy: species classification information
Distribution pattern of contigs
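Two of the features listed above, GC content and tetranucleotide frequencies, are simple to compute per contig. The Python sketch below shows one way to do it; the function names and the toy contig are assumptions, and unlike many binning tools it does not merge a tetramer with its reverse complement.

```python
from collections import Counter
from itertools import product


def gc_content(sequence):
    """Fraction of bases in the contig that are G or C."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def tetranucleotide_frequencies(sequence):
    """Relative frequency of each of the 256 possible 4-mers in the contig.

    Windows containing characters other than A, C, G, T (for example N)
    are ignored.
    """
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + 4] for i in range(len(sequence) - 3))
    all_tetramers = ["".join(p) for p in product("ACGT", repeat=4)]
    total = sum(counts[t] for t in all_tetramers)
    return {t: (counts[t] / total if total else 0.0) for t in all_tetramers}


contig = "ATGCATGCATGC"
gc = gc_content(contig)
tnf = tetranucleotide_frequencies(contig)
print(gc, tnf["ATGC"])
```

Contigs from the same genome tend to have similar values for these features, so each contig becomes a feature vector that can be combined with coverage and clustered into bins.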
4.3 Binning quality assessment
Principle: the completeness and contamination of a MAG after binning are evaluated against the set of universal single-copy marker genes (SCGs) expected from the species' position in the phylogeny;
Completeness: the genes in the MAG are compared with the corresponding SCGs to check whether the gene set is complete. The larger the value, the better the bin quality
Contamination: the extent to which one MAG contains multiple species. The smaller the value, the better the bin quality
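The two scores can be illustrated with a simplified Python calculation: completeness as the share of expected single-copy markers found at least once, and contamination as the share of extra marker copies. This is a sketch of the principle only; dedicated tools use lineage-specific marker sets and more refined formulas, and the marker names and counts here are invented.

```python
def assess_bin(marker_copy_counts, expected_markers):
    """Return (completeness %, contamination %) for one bin.

    marker_copy_counts: dict mapping marker name -> copies found in the bin.
    expected_markers: list of single-copy markers expected for the lineage.
    """
    n = len(expected_markers)
    found = sum(1 for m in expected_markers if marker_copy_counts.get(m, 0) >= 1)
    extra = sum(max(marker_copy_counts.get(m, 0) - 1, 0) for m in expected_markers)
    return 100 * found / n, 100 * extra / n


# Hypothetical marker set and the copies detected in one bin.
expected = ["scg1", "scg2", "scg3", "scg4", "scg5"]
observed = {"scg1": 1, "scg2": 2, "scg3": 1, "scg4": 0, "scg5": 1}

completeness, contamination = assess_bin(observed, expected)
print(completeness, contamination)
```

A missing marker lowers completeness, while a marker that should occur once but appears twice suggests that contigs from a second organism ended up in the bin.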
4.4 Integrating binning results
Three options
Co-assembly (consumes a lot of computing resources)
Binning each sample separately (bin contamination is as high as with co-assembly)
dRep
Assembly followed by dereplication yields more bins of higher quality
4.5 Further analysis after binning
Single bacterium
Display the gene distribution across the genome
Collinearity analysis with related species
Draw a cell metabolism model
Analyze upstream and downstream genes
Microbial community
Draw a specific cell metabolism model
Build a more accurate phylogenetic tree from multiple conserved proteins
Analyze species associations through metabolic networks
5. Summary
5.1 Comparison of analyses

5.2 Limitations and challenges of the analysis
- The cost is high and consistency is difficult to maintain
- The genomes obtained are incomplete and lack clear taxonomic information
- Sequencing results do not necessarily represent the active microbial groups
- The properties of the microorganisms affect the quantitative results, and the relative quantitative information in a metagenome cannot reflect absolute abundance in the actual sample
5.3 Suggestions for analysis
- Choose reads-based or assembly-based analysis according to the scientific question
- Databases, alignment algorithms, and assembly tools all require a balance between accuracy and speed
- Keep learning metagenome analysis methods and software, paying attention to each new method's principle, scope of application, and running efficiency
- Learn to use the data in public databases and integrate it with your own data for analysis
Welcome to join the group, or add the author on VX (bbplayer2021) for an invitation.