当前位置:网站首页>[R tidyverse] use of select verb
[R tidyverse] use of select verb
2022-06-24 19:24:00 【Shengxin Xiaopeng】
R tidyverse

tidy Streaming data is becoming more and more popular , I think it's inconsistent with the pipeline %>% Use , Data processing verb , Has a very important relationship .
In the least amount of time , Solve the most important 、 The most common problem , I call this efficiency ; The remaining difficulties , I call it improvement .
tidyverse select The use of Verbs
The first thing to be clear is
filter Aiming at That's ok ** The operation of , select Is an operation on a column
Front learning filter The operation of , This study select operation
actual combat
Again ,select Filter by column name , And column names do not need quotation marks .
1. Data style
Still used nycflights13 The data in the package
flights
#> # A tibble: 336,776 x 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
2. Filter data
select Filtering data can use a single column name , Sequence symbols can also be used , You can also use “-”
# Select columns by name
select(flights, year, month, day)
#> # A tibble: 336,776 x 3
#> year month day
#> <int> <int> <int>
#> 1 2013 1 1
#> 2 2013 1 1
#> 3 2013 1 1
#> 4 2013 1 1
#> 5 2013 1 1
#> 6 2013 1 1
#> # … with 336,770 more rows
# Select all columns between year and day (inclusive)
select(flights, year:day)
#> # A tibble: 336,776 x 3
#> year month day
#> <int> <int> <int>
#> 1 2013 1 1
#> 2 2013 1 1
#> 3 2013 1 1
#> 4 2013 1 1
#> 5 2013 1 1
#> 6 2013 1 1
#> # … with 336,770 more rows
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
#> # A tibble: 336,776 x 16
#> dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
#> <int> <int> <dbl> <int> <int> <dbl> <chr>
#> 1 517 515 2 830 819 11 UA
#> 2 533 529 4 850 830 20 UA
#> 3 542 540 2 923 850 33 AA
#> 4 544 545 -1 1004 1022 -18 B6
#> 5 554 600 -6 812 837 -25 DL
#> 6 554 558 -4 740 728 12 UA
#> # … with 336,770 more rows, and 9 more variables: flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>
3. expand 1( Boolean operation )
“:” Used to select a series of continuous variables .
“!” Take the complement of a set of variables .
“&” and “|” Used to select the intersection or union of two sets of variables .
“c()” For combination selection
Here we use starwas, iris These two datasets demonstrate
starwars %>% select(name:mass)
#> # A tibble: 87 x 3
#> name height mass
#> <chr> <int> <dbl>
#> 1 Luke Skywalker 172 77
#> 2 C-3PO 167 75
#> 3 R2-D2 96 32
#> 4 Darth Vader 202 136
#> # ... with 83 more rows
“!" Operator negates selection :
starwars %>% select(!(name:mass))
#> # A tibble: 87 x 11
#> hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
#> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <list> <list> <list>
#> 1 blond fair blue 19 male masculine Tatooine Human <chr [5]> <chr [2]> <chr [2]>
#> 2 <NA> gold yellow 112 none masculine Tatooine Droid <chr [6]> <chr [0]> <chr [0]>
#> 3 <NA> white, blue red 33 none masculine Naboo Droid <chr [7]> <chr [0]> <chr [0]>
#> 4 none white yellow 41.9 male masculine Tatooine Human <chr [4]> <chr [0]> <chr [1]>
#> # ... with 83 more rows
iris %>% select(!c(Sepal.Length, Petal.Length))
#> # A tibble: 150 x 3
#> Sepal.Width Petal.Width Species
#> <dbl> <dbl> <fct>
#> 1 3.5 0.2 setosa
#> 2 3 0.2 setosa
#> 3 3.2 0.2 setosa
#> 4 3.1 0.2 setosa
#> # ... with 146 more rows
iris %>% select(!ends_with("Width"))
#> # A tibble: 150 x 3
#> Sepal.Length Petal.Length Species
#> <dbl> <dbl> <fct>
#> 1 5.1 1.4 setosa
#> 2 4.9 1.4 setosa
#> 3 4.7 1.3 setosa
#> 4 4.6 1.5 setosa
#> # ... with 146 more rows
“&” and “|” Take the intersection or union of two choices :
iris %>% select(starts_with("Petal") & ends_with("Width"))
#> # A tibble: 150 x 1
#> Petal.Width
#> <dbl>
#> 1 0.2
#> 2 0.2
#> 3 0.2
#> 4 0.2
#> # ... with 146 more rows
iris %>% select(starts_with("Petal") | ends_with("Width"))
#> # A tibble: 150 x 3
#> Petal.Length Petal.Width Sepal.Width
#> <dbl> <dbl> <dbl>
#> 1 1.4 0.2 3.5
#> 2 1.4 0.2 3
#> 3 1.3 0.2 3.2
#> 4 1.5 0.2 3.1
#> # ... with 146 more rows
Use a combination of
iris %>% select(starts_with("Petal") & !ends_with("Width"))
#> # A tibble: 150 x 1
#> Petal.Length
#> <dbl>
#> 1 1.4
#> 2 1.4
#> 3 1.3
#> 4 1.5
#> # ... with 146 more rows
Actually select Use , When used in combination with other functions, it can play a powerful role , This is another note .
4. expand 2
Actually select Use , When used in combination with other functions, it can play a powerful role
Combining functions last_col()
Select the penultimate column , The default is that the last column is 0
Take a look first iris What this dataset looks like
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
All in all 5 Column
iris %>% select(last_col())
#> # A tibble: 150 x 1
#> Species
#> <fct>
#> 1 setosa
#> 2 setosa
#> 3 setosa
#> 4 setosa
#> # ... with 146 more rows
You can see , Without any parameters , It selects the last column .
> iris %>% select(3:last_col(1)) %>% head()
Petal.Length Petal.Width
1 1.4 0.2
2 1.4 0.2
3 1.3 0.2
4 1.5 0.2
5 1.4 0.2
6 1.7 0.4
**select(3:last_col(1))** This parameter is to select the data from the third column to the penultimate column , It can also be extended to select the penultimate column to the penultimate column 4 Column , Method is the same as above. .
Combining functions everything() function
I usually use it in combination everything Function to rearrange a column .
for example , I want to put a data frame in the first 3,6,8, Put the column at the top , This is convenient for me to check , The rest of the order remains the same .
Original flights Take the data , It can be written like this
select(flights, time_hour, air_time, everything())
#> # A tibble: 336,776 x 19
#> time_hour air_time year month day dep_time sched_dep_time
#> <dttm> <dbl> <int> <int> <int> <int> <int>
#> 1 2013-01-01 05:00:00 227 2013 1 1 517 515
#> 2 2013-01-01 05:00:00 227 2013 1 1 533 529
#> 3 2013-01-01 05:00:00 160 2013 1 1 542 540
#> 4 2013-01-01 05:00:00 183 2013 1 1 544 545
#> 5 2013-01-01 06:00:00 116 2013 1 1 554 600
#> 6 2013-01-01 05:00:00 150 2013 1 1 554 558
#> # … with 336,770 more rows, and 12 more variables: dep_delay <dbl>,
#> # arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
#> # hour <dbl>, minute <dbl>
And starts_with() Function combination
iris %>% select(starts_with("Sepal"))
#> # A tibble: 150 x 2
#> Sepal.Length Sepal.Width
#> <dbl> <dbl>
#> 1 5.1 3.5
#> 2 4.9 3
#> 3 4.7 3.2
#> 4 4.6 3.1
#> # ... with 146 more rows
And ends_with() Function combination
iris %>% select(ends_with("Width"))
#> # A tibble: 150 x 2
#> Sepal.Width Petal.Width
#> <dbl> <dbl>
#> 1 3.5 0.2
#> 2 3 0.2
#> 3 3.2 0.2
#> 4 3.1 0.2
#> # ... with 146 more rows
The point of using these two functions tips: The contents of these two functions must be in string form , That is to add quotation marks , Without quotation marks , Cannot perform . Here I put an example of my own data .
The data is the row name TCGA The coding , in total 15 position , That's what it looks like .
The aim is to select only 14,15 Bit is less than 11 The data of .
# Correct input method
RCC_test <- expr_RCC %>% select(ends_with(c("01","05")))
# Wrong input mode
RCC_test <- expr_RCC %>% select(ends_with(c(01,05)))
Of course , At the very beginning , I use the foundation R
RCC_cancer <- expr_RCC[,str_sub(colnames(expr_RCC),14,15) < 11]
Both seem to be relatively simple .
And contains() Function combination
This is a bit like a wildcard or regular expression
iris %>% select(contains("al"))
#> # A tibble: 150 x 4
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <dbl> <dbl> <dbl> <dbl>
#> 1 5.1 3.5 1.4 0.2
#> 2 4.9 3 1.4 0.2
#> 3 4.7 3.2 1.3 0.2
#> 4 4.6 3.1 1.5 0.2
#> # ... with 146 more rows
And matches() Function combination
This is where regular expressions are used
iris %>% select(matches("[pt]al"))
#> # A tibble: 150 x 4
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <dbl> <dbl> <dbl> <dbl>
#> 1 5.1 3.5 1.4 0.2
#> 2 4.9 3 1.4 0.2
#> 3 4.7 3.2 1.3 0.2
#> 4 4.6 3.1 1.5 0.2
#> # ... with 146 more rows
combination where() function
where() It can be a function , That's great , You can give judgment statements
iris %>% select(where(is.factor))
#> # A tibble: 150 x 1
#> Species
#> <fct>
#> 1 setosa
#> 2 setosa
#> 3 setosa
#> 4 setosa
#> # ... with 146 more rows
combination which() function
##### Since you can use where(), which() It's OK, too
This is the method I found myself , Use your own data just now
RCC_test <- expr_RCC %>% select(which(str_sub(colnames(expr_RCC),14,15) < 11))
The results are the same ,nice
边栏推荐
- How to customize cursor position in wechat applet rotation chart
- High dimension low code: component rendering sub component
- Xiaodi class massive data processing business short chain platform
- Would you like to ask whether the same multiple tasks of the PgSQL CDC account will have an impact? I now have only one of the three tasks
- 建立自己的网站(8)
- Make track map
- 【计算讲谈社】第三讲:如何提出关键问题?
- Power efficiency test
- PingCAP 入选 2022 Gartner 云数据库“客户之声”,获评“卓越表现者”最高分
- 我链接mysql 报这个错 是啥意思呀?
猜你喜欢

Experience of MDM master data project implementation for manufacturing projects

特尔携手微软发挥边云协同势能,推动AI规模化部署

通过SCCM SQL生成计算机上一次登录用户账户报告

Unity移动端游戏性能优化简谱之 以引擎模块为划分的CPU耗时调优

Why are life science enterprises on the cloud in succession?

Introduction to alos satellite

一文详解|Go 分布式链路追踪实现原理

Why useevent is not good enough

Do you have all the basic embedded knowledge points that novices often ignore?

Unity mobile game performance optimization spectrum CPU time-consuming optimization divided by engine modules
随机推荐
[computer talk club] Lecture 3: how to raise key issues?
《Go题库·11》channel的应用场景
three. Basic framework created by JS
请教一个问题。adbhi支持保留一个ID最新100条数据库,类似这样的操作吗
What type of datetime in the CDC SQL table should be replaced
Vs2017 add header file path method
一次 MySQL 误操作导致的事故,高可用都不顶不住!
Mqtt protocol usage of LabVIEW
Working for 6 years with a monthly salary of 3W and a history of striving for one PM
Ask a question. Adbhi supports the retention of 100 databases with the latest IDs. Is this an operation like this
mysql binlog 数据源配置文档麻烦分享一下
Freeswitch uses origin to dialplan
php OSS文件读取和写入文件,workerman生成临时文件并输出浏览器下载
R语言 4.1.0软件安装包和安装教程
Volcano成Spark默认batch调度器
Interpreting harmonyos application and service ecology
Generate the last login user account report of the computer through SCCM SQL
thinkphp6中怎么使用jwt认证
Network security review office starts network security review on HowNet
subject may not be empty [subject-empty]