ADMIXTURE Usage Cookbook

2022-06-27 14:57:00 【Analysis of breeding data】

Software introduction

In genomic selection, many families are often genotyped. To see how these families cluster, you can group them with software. STRUCTURE is commonly used, but it runs very slowly; admixture is much faster and has become the mainstream tool for this analysis. This post explains how to use admixture.

Official website

ADMIXTURE: http://software.genetics.ucla.edu/admixture/download.html

Software installation

Install the software with conda (the package is distributed through the bioconda channel):

        conda install -c bioconda admixture

After installation, type admixture. If the following is displayed, the installation succeeded:

        $ admixture
        ****                   ADMIXTURE Version 1.3.0                  ****
        ****                    Copyright 2008-2015                     ****
        ****           David Alexander, Suyash Shringarpure,            ****
        ****                John  Novembre, Ken Lange                   ****
        ****                                                            ****
        ****                 Please cite our paper!                     ****
        ****   Information at www.genetics.ucla.edu/software/admixture  ****

        Usage: admixture <input file> <K>
        See --help or manual for more advanced usage.


1. Quick start

1.1 Download sample data

Note (updated 2020-05-23): the sample data can no longer be downloaded from the official website. To get test data, follow the official account "Analysis of breeding data" and reply "admixture".

       
        wget http://software.genetics.ucla.edu/admixture/hapmap3-files.tar.gz

Once the download is complete, extract the archive:

        tar zxvf hapmap3-files.tar.gz

List the extracted files:

        $ ls
        hapmap3.bed  hapmap3.bim  hapmap3.fam  hapmap3-files.tar.gz  hapmap3.map

Alternatively, download the sample data (hapmap3-files.tar.gz) from the official website.

1.2 Input formats supported by admixture

PLINK .bed or .ped files
EIGENSTRAT .geno files

Note:
If your data is a PLINK .bed file, such as a.bed, then a.bim and a.fam must be present alongside it.
If your data is a PLINK .ped file, such as b.ped, then b.map must be present alongside it.
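
The companion-file rule above can be checked before starting a long run. The following is a minimal sketch, not part of admixture or plink: check_admixture_input is a hypothetical helper name, and it only covers the .bed and .ped cases described in the note.

```shell
# check_admixture_input FILE
# Hypothetical helper (not part of admixture or plink).
# Returns 0 if FILE and its companion files exist, 1 if any is missing,
# and 2 for extensions this sketch does not cover.
check_admixture_input() {
    input=$1
    prefix=${input%.*}                  # strip the extension: a.bed -> a
    case "$input" in
        *.bed) set -- "$input" "$prefix.bim" "$prefix.fam" ;;
        *.ped) set -- "$input" "$prefix.map" ;;
        *) echo "no rule for: $input" >&2; return 2 ;;
    esac
    status=0
    for f in "$@"; do
        # Report every missing file instead of stopping at the first one
        [ -f "$f" ] || { echo "missing: $f" >&2; status=1; }
    done
    return "$status"
}
```

For example, `check_admixture_input hapmap3.bed && admixture hapmap3.bed 3` would only start admixture when hapmap3.bim and hapmap3.fam are also present.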

1.3 Choosing an appropriate number of clusters K

admixture requires a K value. If you do not know how many groups your population splits into, test a range, for example K = 1 to 7, compare the cross-validation (CV) error of each run, and use the K with the lowest CV error.

1.4 Running admixture with K=3

Note that the input name here is hapmap3.bed, not hapmap3 (unlike plink, which takes the prefix without a suffix), and there is no --file parameter: pass the PLINK .bed file directly.

        admixture hapmap3.bed 3

Output:

        $ admixture hapmap3.bed 3
        ****                   ADMIXTURE Version 1.3.0                  ****
        ****                    Copyright 2008-2015                     ****
        ****           David Alexander, Suyash Shringarpure,            ****
        ****                John  Novembre, Ken Lange                   ****
        ****                                                            ****
        ****                 Please cite our paper!                     ****
        ****   Information at www.genetics.ucla.edu/software/admixture  ****

        Random seed: 43
        Point estimation method: Block relaxation algorithm
        Convergence acceleration algorithm: QuasiNewton, 3 secant conditions
        Point estimation will terminate when objective function delta < 0.0001
        Estimation of standard errors disabled; will compute point estimates only.
        Size of G: 324x13928
        Performing five EM steps to prime main algorithm
        1 (EM)  Elapsed: 0.318  Loglikelihood: -4.38757e+06 (delta): 2.87325e+06
        2 (EM)  Elapsed: 0.292  Loglikelihood: -4.25681e+06 (delta): 130762
        3 (EM)  Elapsed: 0.29   Loglikelihood: -4.21622e+06 (delta): 40582.9
        4 (EM)  Elapsed: 0.29   Loglikelihood: -4.19347e+06 (delta): 22748.2
        5 (EM)  Elapsed: 0.29   Loglikelihood: -4.17881e+06 (delta): 14663.1
        Initial loglikelihood: -4.17881e+06
        Starting main algorithm
        1 (QN/Block)   Elapsed: 0.741  Loglikelihood: -3.94775e+06 (delta): 231058
        2 (QN/Block)   Elapsed: 0.74   Loglikelihood: -3.8802e+06  (delta): 67554.6
        3 (QN/Block)   Elapsed: 0.852  Loglikelihood: -3.83232e+06 (delta): 47883.8
        4 (QN/Block)   Elapsed: 1.01   Loglikelihood: -3.81118e+06 (delta): 21138.2
        5 (QN/Block)   Elapsed: 0.903  Loglikelihood: -3.80682e+06 (delta): 4354.36
        6 (QN/Block)   Elapsed: 0.85   Loglikelihood: -3.80474e+06 (delta): 2085.65
        7 (QN/Block)   Elapsed: 0.856  Loglikelihood: -3.80362e+06 (delta): 1112.58
        8 (QN/Block)   Elapsed: 0.908  Loglikelihood: -3.80276e+06 (delta): 865.01
        9 (QN/Block)   Elapsed: 0.852  Loglikelihood: -3.80209e+06 (delta): 666.662
        10 (QN/Block)  Elapsed: 1.015  Loglikelihood: -3.80151e+06 (delta): 579.49
        11 (QN/Block)  Elapsed: 0.908  Loglikelihood: -3.80097e+06 (delta): 548.156
        12 (QN/Block)  Elapsed: 0.961  Loglikelihood: -3.80049e+06 (delta): 473.565
        13 (QN/Block)  Elapsed: 0.855  Loglikelihood: -3.80023e+06 (delta): 258.61
        14 (QN/Block)  Elapsed: 0.959  Loglikelihood: -3.80005e+06 (delta): 179.949
        15 (QN/Block)  Elapsed: 1.011  Loglikelihood: -3.79991e+06 (delta): 146.707
        16 (QN/Block)  Elapsed: 0.903  Loglikelihood: -3.79989e+06 (delta): 13.1942
        17 (QN/Block)  Elapsed: 1.01   Loglikelihood: -3.79989e+06 (delta): 4.60747
        18 (QN/Block)  Elapsed: 0.85   Loglikelihood: -3.79989e+06 (delta): 1.50012
        19 (QN/Block)  Elapsed: 0.851  Loglikelihood: -3.79989e+06 (delta): 0.128916
        20 (QN/Block)  Elapsed: 0.851  Loglikelihood: -3.79989e+06 (delta): 0.00182983
        21 (QN/Block)  Elapsed: 0.851  Loglikelihood: -3.79989e+06 (delta): 4.33805e-05
        Summary:
        Converged in 21 iterations (21.788 sec)
        Loglikelihood: -3799887.171935
        Fst divergences between estimated populations:
                Pop0    Pop1
        Pop0
        Pop1    0.163
        Pop2    0.073   0.156
        Writing output files.

Two files are generated: P (allele frequencies of the inferred ancestral populations) and Q (ancestry fractions of each individual):

        hapmap3.3.P  hapmap3.3.Q

1.5 Estimating standard errors when running admixture

Add the -B parameter (bootstrap) to the command to estimate standard errors. The run will be slower.

        admixture -B hapmap3.bed 3

Three files are generated: P, Q, and Q_se (the standard errors).

1.6 If you have many SNPs and the run is very slow

When choosing the best K, you can split the SNPs into subsets. For example, split 50k SNPs into 50 subsets of 1k SNPs each, choose the best K from a subset, and then run all the SNPs with that K.
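
One way to build such subsets is to split the variant IDs from the .bim file into lists that plink can read with --extract. This is a minimal sketch under stated assumptions: make_snp_subsets is a hypothetical helper name, the random shuffling is a choice made here (the text above does not say how the subsets are drawn), and it relies on GNU coreutils (shuf and split -d).

```shell
# make_snp_subsets BIM SIZE PREFIX
# Hypothetical helper; assumes GNU coreutils (shuf, split -d).
# Writes files PREFIX00, PREFIX01, ... each holding up to SIZE variant IDs.
make_snp_subsets() {
    # Column 2 of a PLINK .bim file is the variant ID.
    # shuf randomizes the order (an assumption of this sketch),
    # split cuts the list into chunks of SIZE lines with numeric suffixes.
    awk '{print $2}' "$1" | shuf | split -l "$2" -d - "$3"
}

# Possible follow-up for one subset (not run here):
#   make_snp_subsets hapmap3.bim 1000 subset_
#   plink --bfile hapmap3 --extract subset_00 --make-bed --out hapmap3_sub00
#   for K in 1 2 3 4 5; do admixture --cv hapmap3_sub00.bed $K | tee sub00_log${K}.out; done
```

The output names hapmap3_sub00 and sub00_log are illustrative only. Once a K is chosen on the subset, the full hapmap3.bed is run with that K as described above.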

1.7 Multithreading

If you have multiple threads (processors), you can add the parameter -jn, where n is the number of threads. For example, to run with 4 threads:

        admixture hapmap3.bed 3 -j4

2. Reference information

2.1 How to choose the right K value

Run admixture several times, each time with a different K, and turn on cross-validation with --cv. For example, to try K = 1, 2, 3, 4, 5:

        for K in 1 2 3 4 5; do admixture --cv hapmap3.bed $K | tee log${K}.out; done

This produces one .out log per K, alongside the P and Q files:

        hapmap3.1.P  hapmap3.1.Q  hapmap3.2.P  hapmap3.2.Q  hapmap3.3.P  hapmap3.3.Q  hapmap3.4.P  hapmap3.4.Q  hapmap3.5.P  hapmap3.5.Q  log1.out  log2.out  log3.out  log4.out  log5.out

Use grep to view the CV error (cross-validation error) in the .out files:

        $ grep -h CV *.out
        CV error (K=1): 0.55248
        CV error (K=2): 0.48190
        CV error (K=3): 0.47835
        CV error (K=4): 0.48236
        CV error (K=5): 0.49001

K=3 has the lowest CV error.
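
With many K values, picking the minimum by eye gets error-prone. The sketch below sorts the CV lines and prints the best one. best_k is a hypothetical helper name, and it assumes the log lines have exactly the "CV error (K=3): 0.47835" form shown above.

```shell
# best_k LOG...
# Hypothetical helper; assumes lines of the form "CV error (K=3): 0.47835".
# Prints "K cv_error" for the run with the smallest CV error.
best_k() {
    grep -h '^CV error' "$@" \
        | sed -E 's/^CV error \(K=([0-9]+)\): *([0-9.]+).*$/\1 \2/' \
        | LC_ALL=C sort -k2,2n \
        | head -n 1
}
```

Calling `best_k log*.out` in the directory holding the logs prints the K value followed by its CV error.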

2.2 How to plot the Q values

Use R:

        # Read the ancestry fractions: one row per individual, one column per cluster
        ta1 <- read.table("hapmap3.3.Q")
        head(ta1)

        # Stacked bar plot: one bar per individual, one color per cluster
        barplot(t(as.matrix(ta1)), col = rainbow(3),
                xlab = "Individual",
                ylab = "Ancestry",
                border = NA)
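
The .Q file has no individual IDs; its rows follow the order of the .fam file. As a quick text check next to the plot, the sketch below pairs each individual with the column that holds its largest ancestry fraction. dominant_cluster is a hypothetical helper name, and a single label is only a rough summary for admixed individuals.

```shell
# dominant_cluster FAM Q
# Hypothetical helper; assumes the rows of the .Q file follow the .fam row order.
# Prints: family_id individual_id column fraction, where column is the
# 1-based .Q column with the largest ancestry fraction for that individual.
dominant_cluster() {
    awk '{print $1, $2}' "$1" | paste -d ' ' - "$2" | awk '{
        best = 3                          # fields 1-2 are the IDs, Q starts at field 3
        for (i = 4; i <= NF; i++)
            if ($i + 0 > $best + 0) best = i
        print $1, $2, best - 2, $best
    }'
}
```

For example, `dominant_cluster hapmap3.fam hapmap3.3.Q | head` lists the first individuals with their dominant column. The column numbering is arbitrary and can change between runs.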

2.3 Do I need to remove some SNPs based on LD?

admixture does not take LD into account. If you want to prune on LD, you can use plink.

For example, run LD pruning on the PLINK .bed files:

        plink --bfile hapmap3 --indep-pairwise 50 10 0.1

The filter parameters mean:

50: the sliding window covers 50 SNPs
10: the window advances 10 SNPs at each step
0.1: the r² threshold; within each window, SNPs are removed until no pair has r² above 0.1
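
plink writes the SNPs it keeps to plink.prune.in and the ones it drops to plink.prune.out. Before going further, it can help to see how aggressive the threshold was. The following is a small sketch; prune_summary is a hypothetical helper name, and the default prefix "plink" assumes no --out was given to the pruning command.

```shell
# prune_summary [PREFIX]
# Hypothetical helper. Prints "kept=N removed=M" from PREFIX.prune.in and
# PREFIX.prune.out (PREFIX defaults to "plink", the default output prefix).
prune_summary() {
    p=${1:-plink}
    kept=$(wc -l < "$p.prune.in")
    removed=$(wc -l < "$p.prune.out")
    # Arithmetic expansion drops any padding that wc adds on some systems
    echo "kept=$((kept)) removed=$((removed))"
}
```

If very few SNPs are kept, a looser r² threshold may be worth trying.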

Then extract the retained SNPs into a new set of .bed files:

        plink --bfile hapmap3 --extract plink.prune.in --make-bed --out prunedData

The filtered output files are:

        prunedData.bed  prunedData.bim  prunedData.fam

Run admixture again on the filtered files:

        for K in 1 2 3 4 5; do admixture --cv prunedData.bed $K | tee log${K}.out; done

        $ grep -h CV *out
        CV error (K=1): 0.52305
        CV error (K=2): 0.48847
        CV error (K=3): 0.48509
        CV error (K=4): 0.49404
        CV error (K=5): 0.49828

K=3 again has the lowest CV error, so choose K=3.

Plot the result:

        # Read the ancestry fractions estimated from the LD-pruned data
        ta1 <- read.table("prunedData.3.Q")
        head(ta1)

        # Stacked bar plot: one bar per individual, one color per cluster
        barplot(t(as.matrix(ta1)), col = rainbow(3),
                xlab = "Individual",
                ylab = "Ancestry",
                border = NA)

3. Other

For everything else, see the official PDF manual.

If you are interested in data analysis and have questions about software operation, data organization, or interpreting results, please feel free to contact me.