Systematic learning + active exploration is the most comfortable way to get started!
2022-06-27 23:45:00 【Bioinformatics Skill Tree (生信技能树)】
Our Bioinformatics Introduction course and our online live data mining course have been running for more than three years, and they have trained wave after wave of excellent students. What we share in this issue was not covered in class. Instead, it is a beyond-the-syllabus exercise that students can reach by standing on tiptoe, meant to encourage them to learn actively rather than just wait to be fed.
Systematic learning + active exploration is the most comfortable way to get started!
Now let's take a look at what the excellent student Jia Nan shared:
R language beyond-the-syllabus exercise
(Jia Nan, outstanding student of Bioinformatics Skill Tree)
- Data mining (GEO, TCGA, single-cell), June 2022 session: get a quick look at some bioinformatics application charts
- Bioinformatics introduction, June 2022 session: your first bioinformatics lesson
First, read the data with the read.table and read.csv functions, and use dim to check the number of rows and columns. Because the data is bound to contain duplicates 【a trap in the exercise】, do not set row.names = 1 in read.csv; read everything in first.
> exp=read.csv("exp.csv")
> dim(exp)
[1] 1000 7
> soft2 <- read.table("soft.txt",header = T,sep = "\t")
> dim(soft2)
[1] 1000 5
exp: the expression matrix, read in as a data frame
soft2: a data frame in which GeneName and ID correspond to each other; these are the two columns we need for the replacement
I first use %in% to check that the IDs in exp and soft2 correspond to each other. However, identical returns FALSE, which means the order differs, so the match function is needed to bring the two into the same order. soft3 is soft2 reordered according to the ID column (X) of exp; checking again with identical returns TRUE, so the alignment is done. Finally, assigning the GeneName column of soft3 to the X column of exp completes the replacement.
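The reading step can be tried on a small example. The sketch below is only an illustration: `csv_text` is hypothetical toy data standing in for exp.csv, and the exact wording of the error message may vary between R versions.

```r
# Hypothetical toy data standing in for exp.csv (not the real file)
csv_text <- "X,s1,s2
geneA,1,2
geneB,3,4
geneA,5,6"

# Read everything first, keeping the identifier as an ordinary column
toy <- read.csv(text = csv_text)
dim(toy)

# With row.names = 1 the duplicated identifier makes the read fail,
# so we capture the error message instead of stopping
res <- tryCatch(
  read.csv(text = csv_text, row.names = 1),
  error = function(e) conditionMessage(e)
)
res
```

Keeping the identifier as a normal column means the duplicates can be dealt with explicitly later, instead of failing at read time.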
> table(exp$X %in% soft2$ID)
TRUE
1000
> table(soft2$ID %in% exp$X)
TRUE
1000
> identical(soft2$ID,exp$X)
[1] FALSE
> soft3=soft2[match(exp$X,soft2$ID),]
> identical(soft3$ID,exp$X)
[1] TRUE
> exp$X=soft3$GeneName
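The alignment with match can be followed on a few rows. This is a minimal sketch on made-up data; `exp_toy`, `soft_toy` and `soft_aligned` are hypothetical names, not objects from the exercise.

```r
# Toy stand-ins for exp and soft2 (hypothetical data)
exp_toy <- data.frame(X = c("p3", "p1", "p2"), s1 = c(3, 1, 2),
                      stringsAsFactors = FALSE)
soft_toy <- data.frame(ID = c("p1", "p2", "p3"),
                       GeneName = c("A", "B", "A"),
                       stringsAsFactors = FALSE)

# Same set of IDs, but in a different order
all(exp_toy$X %in% soft_toy$ID)
identical(soft_toy$ID, exp_toy$X)

# match() gives, for each exp_toy$X, its position in soft_toy$ID;
# indexing with it reorders the annotation to follow exp_toy
soft_aligned <- soft_toy[match(exp_toy$X, soft_toy$ID), ]
stopifnot(identical(soft_aligned$ID, exp_toy$X))

# Only after the order is verified is it safe to swap in the gene names
exp_toy$X <- soft_aligned$GeneName
exp_toy
```

The stopifnot guard is a design choice: assigning gene names by position is only correct once both tables are in the same order.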
Next, remove the duplicated genes from the expression matrix exp, because duplicated gene names cannot be used as row names. First use the duplicated function on the X column of exp; it returns TRUE for every repeated occurrence. Negating that result and using it as a row index extracts a subset without duplicates, which is assigned to a new expression matrix exp1. Then the now-unique gene names in the X column of exp1 are set as row names with rownames. Finally, the redundant X column is dropped and the result is assigned to exp2, which is the expression matrix we want.
> exp1=exp[!duplicated(exp$X),]
> rownames(exp1)=exp1$X
> exp2=exp1[,(-1)]
> View(exp2)
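The same three steps on toy data, as a sketch (the data frame `toy` is hypothetical). Note that this approach keeps whichever occurrence of a gene comes first in the file.

```r
# Hypothetical toy expression table with one duplicated gene name
toy <- data.frame(X = c("A", "B", "A", "C"),
                  s1 = c(1, 2, 3, 4),
                  s2 = c(5, 6, 7, 8),
                  stringsAsFactors = FALSE)

# duplicated() flags the second and later occurrences; negate to keep the first
dedup <- toy[!duplicated(toy$X), ]

# Gene names are now unique, so they can become row names
rownames(dedup) <- dedup$X

# Drop the name column; drop = FALSE keeps a data frame even with one sample
dedup <- dedup[, -1, drop = FALSE]
dedup
```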
「The second solution: when multiple probes correspond to the same gene, take the average.」 The earlier steps are the same: align the order and replace the IDs with gene names. The key point is how to handle the duplicated gene names, here with the aggregate function (a description of aggregate: https://www.jianshu.com/p/7912aac76d5f). aggregate is commonly used in data processing and is very powerful: it groups data as required and then applies a summary such as sum or mean to each group. For details, run help("aggregate") to read the official documentation.
> ### The second method: average the expression values of duplicated gene names
> expr_mean=aggregate(.~X,mean,data=exp)
> rownames(expr_mean)=expr_mean$X
> expr_mean=expr_mean[,(-1)]
> View(expr_mean)
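A small sketch of the averaging approach on hypothetical toy data. It passes FUN by name for readability; one caveat worth knowing is that the formula interface of aggregate drops rows containing NA by default.

```r
# Hypothetical toy table: gene A is measured by two probes
toy <- data.frame(X = c("A", "B", "A"),
                  s1 = c(1, 10, 3),
                  s2 = c(4, 20, 8),
                  stringsAsFactors = FALSE)

# ". ~ X" means: summarise every other column, grouped by X
avg <- aggregate(. ~ X, data = toy, FUN = mean)

# Move the (now unique) gene names into the row names
rownames(avg) <- avg$X
avg <- avg[, -1, drop = FALSE]
avg
```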
「The third method: for duplicated genes, keep the whole row with the largest row mean」
> #### The third method: keep the row with the highest mean expression
> # Calculate the row means and sort in descending order
> index=order(rowMeans(exp[,-1]),decreasing = T)
> # Reorder the genes in exp accordingly
> expr_ordered=exp[index,]
> # For duplicated genes, keep the first occurrence, which is the one with the highest mean
> keep=!duplicated(expr_ordered$X)
> # Get the final processed expression matrix
> expr_max=expr_ordered[keep,]
> expr_max
> rownames(expr_max)=expr_max$X
> expr_max=expr_max[,(-1)]
> View(expr_max)
「Xiao Jie adds」: tibble::column_to_rownames() can convert a column directly into row names, which would simplify the code. The student did not use it when solving this exercise; beginners should get the code working first and polish it afterwards. Also, a data frame does not allow duplicate row names, but a matrix does, so you can try whether converting exp to a matrix makes the code simpler.
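To see the third method and the remark about matrices in action, here is a minimal sketch on hypothetical toy data. It is one possible illustration, not the answer the exercise expects.

```r
# Hypothetical toy table: gene A appears twice
toy <- data.frame(X = c("A", "B", "A"),
                  s1 = c(1, 10, 3),
                  s2 = c(4, 20, 8),
                  stringsAsFactors = FALSE)

# Sort rows by mean expression, highest first
ord <- order(rowMeans(toy[, -1]), decreasing = TRUE)
toy_ord <- toy[ord, ]

# After sorting, the first occurrence of each gene is its highest-mean row
best <- toy_ord[!duplicated(toy_ord$X), ]
rownames(best) <- best$X
best <- best[, -1, drop = FALSE]
best

# A matrix, unlike a data frame, accepts duplicated row names
m <- as.matrix(toy[, -1])
rownames(m) <- toy$X
rownames(m)
```

With a matrix the gene names can be attached before deduplication, so the separate name column is no longer needed while filtering.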