Systematic learning + active exploration is the most comfortable way to get started!

2022-06-27 23:45:00 【Shengxin skill tree】

Our introductory bioinformatics course and our online live data mining course have been running for more than three years, and we have trained wave after wave of excellent students. The content shared in this issue is not something we covered in class. Instead, I set a beyond-the-syllabus exercise that students can reach by standing on tiptoe, to encourage them to learn actively instead of just waiting to be fed.

Systematic learning plus active exploration is the most comfortable way to get started!

Now let's look at what the excellent student Jia Nan shared:

R language beyond-the-syllabus exercises

(By Jia Nan, an excellent student of Shengxin skill tree)

Data mining (GEO, TCGA, single-cell), June 2022 session: a quick look at some bioinformatics application charts
Introduction to bioinformatics, June 2022 session: your first bioinformatics lesson

First, read the data with the read.table and read.csv functions and use dim to check the number of rows and columns. Because there are bound to be duplicates (a trap in the exercise), don't set row.names = 1 in read.csv. Read everything in first.

> exp=read.csv("exp.csv")
> dim(exp)
[1] 1000    7
> soft2 <- read.table("soft.txt",header = T,sep = "\t")
> dim(soft2)
[1] 1000    5
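
The reason for reading everything in first can be checked on a small made-up table. The sketch below assumes a hypothetical three-row CSV (csv_text, not the course's exp.csv) whose first column repeats a value. It shows that read.csv accepts the column as ordinary data but refuses it as row names.

```r
# Hypothetical toy CSV (not the course data): "geneA" appears twice
csv_text <- "X,s1,s2\ngeneA,1,2\ngeneB,3,4\ngeneA,5,6"

# Reading the first column as ordinary data always works
toy <- read.csv(text = csv_text)
dim(toy)

# Asking read.csv to use that column as row names fails on the duplicate
res <- tryCatch(
  read.csv(text = csv_text, row.names = 1),
  error = function(e) "error: duplicate row names"
)
res
```

Catching the error with tryCatch is only there so the sketch runs to the end. In an interactive session the error message itself is the useful signal.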

exp is the expression matrix, read in as a data frame.

soft2 (read from soft.txt) is a data frame in which the GeneName and ID columns correspond to each other. These are the two columns we need for the replacement.

I first use %in% to check that the IDs in exp and soft2 correspond to each other. However, identical returns FALSE when asked whether the two ID columns are exactly the same, which means their order differs, so the match function is needed to bring them into the same order. soft3 is soft2 reordered to follow the ID column of exp (exp$X). Running identical again returns TRUE, so the reordering is complete. Finally, assigning the GeneName column of soft3 directly to the X column of exp completes the replacement.

> table(exp$X %in% soft2$ID)
TRUE 
1000 
> table(soft2$ID %in% exp$X)
TRUE 
1000 
> identical(soft2$ID,exp$X)
[1] FALSE
> soft3=soft2[match(exp$X,soft2$ID),]
> identical(soft3$ID,exp$X)
[1] TRUE
> exp$X=soft3$GeneName
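
The check-then-reorder steps above can be sketched on a tiny example. The two data frames below (exp_toy and soft_toy) and their values are hypothetical stand-ins for exp and soft2. Only the column names X, ID and GeneName follow the original.

```r
# Hypothetical toy data: same IDs in both tables, but in a different order
exp_toy <- data.frame(X = c("p3", "p1", "p2"), s1 = c(30, 10, 20),
                      stringsAsFactors = FALSE)
soft_toy <- data.frame(ID = c("p1", "p2", "p3"),
                       GeneName = c("geneA", "geneB", "geneA"),
                       stringsAsFactors = FALSE)

all(exp_toy$X %in% soft_toy$ID)       # every ID is present
identical(soft_toy$ID, exp_toy$X)     # but the order differs

# match() returns, for each exp_toy$X, its row position in soft_toy
pos <- match(exp_toy$X, soft_toy$ID)
soft_sorted <- soft_toy[pos, ]
identical(soft_sorted$ID, exp_toy$X)  # now the two are aligned

# Replace the probe IDs with gene names
exp_toy$X <- soft_sorted$GeneName
exp_toy
```

The argument order in match matters: the first argument is the order you want to end up with, the second is the table being looked up.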

Next, remove the duplicated genes from the exp expression matrix, because duplicated gene names cannot be used as row names. First apply the duplicated function to the X column of exp. It returns TRUE for every repeated gene name, that is, every occurrence after the first. Negating this result and using it as a row index on exp extracts the subset without duplicates, which is assigned to a new expression matrix, exp1. The gene names in the X column of exp1 are now unique, so they can be set as row names with the rownames function. Finally, drop the redundant X column and assign the result to a new expression matrix, exp2. exp2 is the expression matrix we want.

> exp1=exp[!duplicated(exp$X),]
> rownames(exp1)=exp1$X
> exp2=exp1[,(-1)]
> View(exp2)
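
To see what duplicated actually marks, here is a minimal sketch on a made-up four-row table (toy is hypothetical, not the course data).

```r
# Hypothetical toy expression table with "geneA" appearing twice
toy <- data.frame(X = c("geneA", "geneB", "geneA", "geneC"),
                  s1 = c(1, 2, 3, 4), s2 = c(5, 6, 7, 8),
                  stringsAsFactors = FALSE)

# TRUE only for the second and later occurrences of a value
dup <- duplicated(toy$X)
dup

toy1 <- toy[!dup, ]        # keep the first occurrence of each gene
rownames(toy1) <- toy1$X   # names are unique now, so this works
toy2 <- toy1[, -1]         # drop the redundant X column
toy2
```

Note that this first method keeps whichever row happens to come first, regardless of its expression values.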

「The second solution: when multiple probes correspond to the same gene, take the average.」 The earlier steps are the same: adjust the order and replace the IDs with gene names. The focus is on how to handle the duplicated gene names with the aggregate function (https://www.jianshu.com/p/7912aac76d5f is a description of aggregate). aggregate is commonly used in data processing and is very powerful: it groups the data as required and then applies a summary such as sum or mean to each group. For details, run help("aggregate") to read the official documentation.

> ### Second method: average the expression values of duplicated gene names
> expr_mean=aggregate(.~X,data=exp,FUN=mean)
> rownames(expr_mean)=expr_mean$X
> expr_mean=expr_mean[,(-1)]
> View(expr_mean)
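
A small sketch of the averaging step, assuming a hypothetical three-row table (toy) in which geneA is measured by two probes:

```r
# Hypothetical toy table: two probes map to "geneA"
toy <- data.frame(X = c("geneA", "geneB", "geneA"),
                  s1 = c(1, 10, 3), s2 = c(5, 20, 7),
                  stringsAsFactors = FALSE)

# ". ~ X" means: summarise every other column, grouped by X
toy_mean <- aggregate(. ~ X, data = toy, FUN = mean)

rownames(toy_mean) <- toy_mean$X
toy_mean <- toy_mean[, -1]
toy_mean
```

With the formula interface, aggregate drops rows containing NA by default (na.action = na.omit), which is worth keeping in mind for real expression data.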

「The third method: for duplicated genes, keep the whole row with the largest row mean」

> #### Third method: keep the row with the largest mean expression
> # Calculate the row means and sort them in descending order
> index=order(rowMeans(exp[,-1]),decreasing = T)
> # Reorder the rows of exp accordingly
> expr_ordered=exp[index,]
> # For duplicated genes, keep the first occurrence, which is the one with the highest mean
> keep=!duplicated(expr_ordered$X)
> # Get the final processed expression matrix
> expr_max=expr_ordered[keep,]
> expr_max
> rownames(expr_max)=expr_max$X
> expr_max=expr_max[,(-1)]
> View(expr_max)
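
The sort-then-deduplicate idea can be traced on a made-up table (toy is hypothetical). Because the rows are sorted by mean first, the first occurrence that duplicated keeps is the one with the highest mean.

```r
# Hypothetical toy table: the second "geneA" row has the higher mean
toy <- data.frame(X = c("geneA", "geneB", "geneA"),
                  s1 = c(1, 2, 9), s2 = c(3, 4, 11),
                  stringsAsFactors = FALSE)

# Row means are 2, 3 and 10, so descending order is row 3, 2, 1
idx <- order(rowMeans(toy[, -1]), decreasing = TRUE)
toy_ordered <- toy[idx, ]

# After sorting, the first occurrence of each gene has the largest mean
keep <- !duplicated(toy_ordered$X)
toy_max <- toy_ordered[keep, ]

rownames(toy_max) <- toy_max$X
toy_max <- toy_max[, -1]
toy_max
```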

*「Xiao Jie adds」: tibble::column_to_rownames() converts a column directly into row names and can be used to tidy up the code, but it had not been covered when the students did this exercise. Beginners should get the code working first and polish it later. Also, a data frame does not allow duplicate row names, but a matrix does. Try converting exp to a matrix and see whether the code becomes simpler.*
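
As one possible way to follow up on the matrix suggestion, the sketch below uses a hypothetical toy table and base R's rowsum function, which is not part of the original exercise. It only shows that a matrix accepts duplicate row names and that duplicates can then be averaged by group.

```r
# Hypothetical toy table: "geneA" appears twice
toy <- data.frame(X = c("geneA", "geneB", "geneA"),
                  s1 = c(1, 2, 3), s2 = c(5, 6, 7),
                  stringsAsFactors = FALSE)

# A matrix accepts duplicated row names, unlike a data frame
mat <- as.matrix(toy[, -1])
rownames(mat) <- toy$X

# rowsum() adds up the rows of each group; dividing by the group
# sizes turns the sums into means
sums <- rowsum(mat, group = rownames(mat))
counts <- as.vector(table(rownames(mat))[rownames(sums)])
mat_mean <- sums / counts
mat_mean
```

Whether this is simpler than the aggregate version is a matter of taste. It does avoid the separate steps of setting row names and dropping the X column.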

Copyright notice
This article was created by [Shengxin skill tree]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/178/202206272105500413.html