November 16, 2021 [Reading notes] - Metagenome analysis workflow

2022-06-30 07:37:00 Muyiqing

These are my notes from the MEG gene metagenomics course.

1. Quality control

Quality check: FastQC
Low-quality sequence filtering and adapter removal: Trimmomatic
Host contamination filtering: FastQ Screen; Bowtie2
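
To illustrate what low-quality filtering does, here is a minimal Python sketch of sliding-window trimming on Phred scores. It follows the general idea behind Trimmomatic's SLIDINGWINDOW step, but it is not Trimmomatic's implementation; the function names, window size, and threshold are assumptions chosen for this example.

```python
def phred33(qual_string):
    """Decode a FASTQ quality string (Phred+33 encoding) into integer scores."""
    return [ord(c) - 33 for c in qual_string]


def sliding_window_trim(seq, quals, window=4, min_avg=20):
    """Cut the read at the first window whose mean quality falls below min_avg.

    window and min_avg are illustrative defaults, not recommended settings.
    Reads shorter than the window are returned unchanged.
    """
    for start in range(len(seq) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < min_avg:
            return seq[:start], quals[:start]
    return seq, quals


# Toy read: six high-quality bases ('I' = Q40) followed by four poor ones ('#' = Q2)
seq = "ACGTACGTAC"
quals = phred33("IIIIII####")
trimmed_seq, trimmed_quals = sliding_window_trim(seq, quals)
print(trimmed_seq)  # ACGTA
```

The read is cut where the window average first drops below the threshold, so a single good base next to several bad ones is discarded along with them.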

2. Reads-based analysis

Definition: align the sequenced clean reads directly against the sequences in a database; the alignment yields the detected abundance of known species or functional gene sequences.

2.1 Species classification and identification

Aligning against a genome database does not necessarily identify the species, because a read may hit a conserved sequence.
Limitation of marker-based detection: 16S databases have limited resolution, so species-level classification is ambiguous.

2.2 Reveal species diversity

Methods based on genome or protein databases may overestimate species diversity: they report many species at very low abundance, and these calls are not very reliable;
Methods based on 16S sequence alignment (EMIRGE, phyloFlash) give a diversity close to that of 16S sequencing, although the abundance ratios differ. In principle, the metagenomic results are closer to the true distribution.

2.3 Functional classification and identification


2.4 Choosing the database and alignment algorithm

The size of the database directly determines the accuracy of reads-based analysis results.
Large databases are slow to search, so reads are generally aligned against smaller general-purpose databases (KEGG/eggNOG/COG) or specialized databases (ARGs-OAP/NCycDB/CAZy).

The reference database is usually a protein database.
Alignment algorithms to choose from:

BLASTX
DIAMOND
HMMER

Note: reads-based analysis usually requires multi-threaded parallel computing.

2.5 Principle and methods of sequence normalization
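
One common way to make gene abundances comparable across samples is a TPM-style scaling: divide each gene's read count by the gene length, then rescale so that every sample sums to one million. The sketch below is only an assumed example of normalization and not necessarily the method taught in the course; the gene names and numbers are made up.

```python
def tpm(counts, lengths_bp):
    """Convert raw read counts per gene into TPM-style values.

    counts:     dict gene -> number of mapped reads (hypothetical input)
    lengths_bp: dict gene -> gene length in base pairs
    """
    # Step 1: correct for gene length (reads per kilobase)
    rpk = {g: counts[g] / (lengths_bp[g] / 1000) for g in counts}
    # Step 2: correct for sequencing depth (values in a sample sum to 1e6)
    total = sum(rpk.values())
    return {g: v / total * 1e6 for g, v in rpk.items()}


# Two genes with the same read count but different lengths
counts = {"geneA": 100, "geneB": 100}
lengths = {"geneA": 1000, "geneB": 2000}
norm = tpm(counts, lengths)
```

With equal read counts, the shorter gene ends up with twice the normalized abundance of the longer one, which is the length bias the first step is meant to remove.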


2.6 In-depth analysis of the data table

Alpha diversity
Beta diversity
Species composition
Differential analysis
Correlation analysis
Network analysis
Phylogenetic tree
Metabolic pathways and pathway enrichment
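
For the first two items, a minimal sketch of one alpha-diversity measure (the Shannon index) and one beta-diversity measure (Bray-Curtis dissimilarity), computed from an abundance table. The notes do not say which indices the course uses, so these two are assumptions picked as familiar examples, and the sample counts are invented.

```python
import math


def shannon(counts):
    """Shannon index H' = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors over the same taxa."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    return 1 - 2 * shared / (sum(a) + sum(b))


# Each list holds the counts of the same four taxa in one sample
sample1 = [10, 10, 10, 10]   # perfectly even community
sample2 = [40, 0, 0, 0]      # a single dominant taxon

h1 = shannon(sample1)                 # ln(4) for four equally abundant taxa
bc = bray_curtis(sample1, sample2)    # 0.75
```

Alpha diversity is computed within one sample, while beta diversity compares two samples, which is why `shannon` takes one vector and `bray_curtis` takes two.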


2.7 Analysis examples

Biological replicates are included; functional genes are selected (genes related to the nitrogen cycle).

3. Contig assembly

3.1 Assembly algorithm: de Bruijn graph
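
To make the idea concrete, here is a minimal sketch of a de Bruijn graph built from a few toy reads, followed by a walk along unambiguous edges to spell out a contig. It assumes error-free reads and a single strand, and it ignores coverage, so it shows the principle only and not how assemblers such as MEGAHIT or metaSPAdes are implemented. The reads and the value of k are made up.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; every k-mer in a read adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk(graph, start):
    """Extend a contig from `start` while the current node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:          # stop instead of looping around a cycle
            break
        contig += nxt[-1]        # each step adds one new base
        seen.add(nxt)
        node = nxt
    return contig


# Three overlapping toy reads from the sequence ATGGCGTGCA
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn(reads, k=4)
contig = walk(graph, "ATG")
print(contig)  # ATGGCGTGCA
```

The walk stops at branches, which is where repeats or strain variation would split a real metagenomic assembly into separate contigs.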


3.2 Contig assembly algorithms and strategy

Commonly used tools

metaSPAdes (high memory consumption)
SPAdes
MEGAHIT (good balance)
CLC
IDBA-UD

Contig assembly is usually performed independently for each sample, but the clean reads of biological replicates can be pooled and assembled together if resources permit.

3.3 Contigs-based analysis

Contigs can be compared against databases of known genes;
ORF and protein prediction can be performed;
For specific genes and segments, contigs can be used to look for upstream and downstream regulatory sites or regional features.
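
As a rough illustration of ORF prediction on a contig, the sketch below scans the three forward reading frames for stretches that start with ATG and end at a stop codon. This is a naive assumption-laden example (forward strand only, standard stop codons, an arbitrary minimum length); dedicated gene-prediction tools use trained models and handle both strands. The contig sequence is invented.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, sequence) for ATG...stop ORFs in the three forward frames.

    Coordinates are 0-based, end-exclusive; the stop codon is included.
    min_codons is an illustrative cutoff, not a recommended value.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # open a candidate ORF
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None                    # close it and keep scanning
    return orfs


contig = "CCATGAAATTTTAGGG"
orfs = find_orfs(contig)
print(orfs)  # [(2, 14, 'ATGAAATTTTAG')]
```

Because the ORF coordinates are kept, the flanking regions of the contig (here `contig[:2]` and `contig[14:]`) are directly available for the upstream and downstream analysis mentioned above.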

4. Binning

4.1 MAGs: metagenome-assembled genomes

MAGs workflow
MetaWRAP: a new tool in 2020, more highly integrated
MAGs definition: the result of grouping similar contigs together, a process called binning
Notes:

1. The resulting bin set consists of the multiple contigs assigned to each bin, but the contigs remain independent of one another
2. A MAG obtained by binning is not the genome of a specific species

4.2 Basis for binning

    Coverage
    TNFs: tetranucleotide frequencies
    GC content
    Taxonomy: species classification information
    Contig distribution patterns
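
Two of the features above, GC content and tetranucleotide frequencies, are simple to compute per contig. The sketch below is a minimal assumed version: it counts all 256 possible 4-mers on one strand, whereas real binning tools commonly merge reverse-complement k-mers and combine the result with coverage. The contig sequence is made up.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def tetranucleotide_freqs(seq):
    """Return a 256-element frequency vector over all 4-mers (AAAA ... TTTT)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    total = sum(counts[k] for k in kmers)   # 4-mers with N or other symbols are ignored
    return [counts[k] / total if total else 0.0 for k in kmers]


contig = "ACGTACGTACGT"
gc = gc_content(contig)               # 0.5
tnf = tetranucleotide_freqs(contig)   # one vector per contig
```

Contigs from the same genome tend to have similar vectors, so a binner can cluster contigs by the distance between these vectors together with their coverage.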

4.3 Binning quality assessment

Principle: the contamination and completeness of a MAG after binning are evaluated against the set of universal single-copy marker genes (SCGs) expected in the species' phylogenetic lineage;
Completeness: the genes in the MAG are compared with the corresponding SCGs to check whether the gene set is complete; the larger the value, the better the bin quality
Contamination: the extent to which a single MAG contains multiple species; the smaller the value, the better the bin quality
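
The two scores can be sketched from marker-gene counts as follows. This is a simplified assumption of how such estimates work: completeness is the share of expected single-copy markers found at least once, and contamination is the number of extra copies relative to the marker set size. Tools such as CheckM refine this with lineage-specific and collocated marker sets. The marker names and counts are hypothetical.

```python
def bin_quality(marker_copies, marker_set):
    """Estimate completeness and contamination (in %) for one bin.

    marker_copies: dict marker gene -> number of copies found in the bin
    marker_set:    markers expected exactly once in a complete, pure genome
    """
    found = sum(1 for m in marker_set if marker_copies.get(m, 0) >= 1)
    extra = sum(max(marker_copies.get(m, 0) - 1, 0) for m in marker_set)
    completeness = 100 * found / len(marker_set)
    contamination = 100 * extra / len(marker_set)
    return completeness, contamination


markers = ["rpsC", "rplB", "recA", "gyrB"]    # hypothetical marker set
copies = {"rpsC": 1, "rplB": 2, "recA": 1}    # gyrB missing, rplB found twice
completeness, contamination = bin_quality(copies, markers)
print(completeness, contamination)  # 75.0 25.0
```

A missing marker lowers completeness, while a marker found more than once suggests that contigs from a second organism were placed in the same bin.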

4.4 Integrating binning results

Three options

Co-assembly (consumes a lot of computing resources)
Binning each sample separately (bin contamination is as high as with co-assembly)
dRep

Assembling and then de-replicating can yield more and higher-quality bins

4.5 Further analysis of bins

Single strain

Show the distribution of genes across the genome
Collinearity (synteny) analysis with related species
Draw a cell metabolism model
Analyze upstream and downstream genes

Microbial community

Draw a specialized cell metabolism model
Build a more accurate phylogenetic tree from multiple conserved proteins
Analyze species associations through metabolic networks

5. Summary

5.1 Comparison of the analysis approaches


5.2 Limitations and challenges of the analysis

  • High cost; consistency is difficult to maintain
  • The genomes obtained are incomplete and lack clear taxonomic information
  • Sequencing results do not necessarily represent the microbial groups of interest
  • The properties of the microorganisms affect quantification, and the relative abundances from a metagenome cannot reflect the absolute abundances in the actual sample

5.3 Suggestions for analysis

  • Choose reads-based or assembly-based analysis according to the scientific question
  • Databases, alignment algorithms, and assembly tools all require a balance between accuracy and speed
  • Keep learning metagenome analysis methods and software, paying attention to each new method's principle, scope of application, and running efficiency
  • Learn to use the data in public databases, and integrate and analyze it together with your own data


Copyright notice
This article was written by [Muyiqing]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/02/202202160539309499.html