[R tidyverse] use of select verb

2022-06-24 19:24:00 【Shengxin Xiaopeng】

R tidyverse

Using the tidyverse select verb

R for Data Science

Tidy, pipeline-style data processing is becoming more and more popular, and I think this has a lot to do with the pipe operator %>% and the data-processing verbs.

Solving the most important and most common problems in the least amount of time is what I call efficiency; working through the remaining difficulties is what I call improvement.

The first thing to be clear about: filter operates on rows, while select operates on columns.

The previous note covered filter; this one covers select.
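To make the row/column distinction concrete, here is a minimal sketch on a small made-up data frame (df and its columns are hypothetical and not part of the original note). filter keeps some rows and all columns, while select keeps all rows and some columns.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
df <- data.frame(
  id    = 1:4,
  score = c(90, 55, 72, 38),
  group = c("a", "b", "a", "b")
)

# filter works on rows: fewer rows, every column kept
rows_kept <- df %>% filter(score > 60)

# select works on columns: every row kept, fewer columns
cols_kept <- df %>% select(id, score)

dim(rows_kept)
dim(cols_kept)
```

Comparing the two dim() results shows which dimension each verb changes.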

Hands-on practice

Again, select picks columns by name, and the column names do not need quotation marks.

1. The data

We again use the data from the nycflights13 package.

library(nycflights13)
library(tidyverse)

flights
#> # A tibble: 336,776 x 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

2. Selecting columns

select can pick columns by individual name, by a range written with “:”, or by exclusion with “-”.

# Select columns by name
select(flights, year, month, day)
#> # A tibble: 336,776 x 3
#> year month day
#> <int> <int> <int>
#> 1 2013 1 1
#> 2 2013 1 1
#> 3 2013 1 1
#> 4 2013 1 1
#> 5 2013 1 1
#> 6 2013 1 1
#> # … with 336,770 more rows
# Select all columns between year and day (inclusive)
select(flights, year:day)
#> # A tibble: 336,776 x 3
#> year month day
#> <int> <int> <int>
#> 1 2013 1 1
#> 2 2013 1 1
#> 3 2013 1 1
#> 4 2013 1 1
#> 5 2013 1 1
#> 6 2013 1 1
#> # … with 336,770 more rows
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
#> # A tibble: 336,776 x 16
#> dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
#> <int> <int> <dbl> <int> <int> <dbl> <chr> 
#> 1 517 515 2 830 819 11 UA 
#> 2 533 529 4 850 830 20 UA 
#> 3 542 540 2 923 850 33 AA 
#> 4 544 545 -1 1004 1022 -18 B6 
#> 5 554 600 -6 812 837 -25 DL 
#> 6 554 558 -4 740 728 12 UA 
#> # … with 336,770 more rows, and 9 more variables: flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>

3. Extension 1 (Boolean operations)

“:” selects a range of consecutive variables.
“!” takes the complement of a set of variables.
“&” and “|” select the intersection or union of two sets of variables.
“c()” combines selections.

Here we use the starwars and iris datasets to demonstrate. The iris output below is printed as a tibble, which assumes iris was first converted with as_tibble(iris).

starwars %>% select(name:mass)
#> # A tibble: 87 x 3
#> name height mass
#> <chr> <int> <dbl>
#> 1 Luke Skywalker 172 77
#> 2 C-3PO 167 75
#> 3 R2-D2 96 32
#> 4 Darth Vader 202 136
#> # ... with 83 more rows

The “!” operator negates a selection:

starwars %>% select(!(name:mass))
#> # A tibble: 87 x 11
#> hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
#> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <list> <list> <list> 
#> 1 blond fair blue 19 male masculine Tatooine Human <chr [5]> <chr [2]> <chr [2]>
#> 2 <NA> gold yellow 112 none masculine Tatooine Droid <chr [6]> <chr [0]> <chr [0]>
#> 3 <NA> white, blue red 33 none masculine Naboo Droid <chr [7]> <chr [0]> <chr [0]>
#> 4 none white yellow 41.9 male masculine Tatooine Human <chr [4]> <chr [0]> <chr [1]>
#> # ... with 83 more rows

iris %>% select(!c(Sepal.Length, Petal.Length))
#> # A tibble: 150 x 3
#> Sepal.Width Petal.Width Species
#> <dbl> <dbl> <fct> 
#> 1 3.5 0.2 setosa 
#> 2 3 0.2 setosa 
#> 3 3.2 0.2 setosa 
#> 4 3.1 0.2 setosa 
#> # ... with 146 more rows


iris %>% select(!ends_with("Width"))
#> # A tibble: 150 x 3
#> Sepal.Length Petal.Length Species
#> <dbl> <dbl> <fct> 
#> 1 5.1 1.4 setosa 
#> 2 4.9 1.4 setosa 
#> 3 4.7 1.3 setosa 
#> 4 4.6 1.5 setosa 
#> # ... with 146 more rows

“&” and “|” take the intersection or union of two selections:

iris %>% select(starts_with("Petal") & ends_with("Width"))
#> # A tibble: 150 x 1
#> Petal.Width
#> <dbl>
#> 1 0.2
#> 2 0.2
#> 3 0.2
#> 4 0.2
#> # ... with 146 more rows

iris %>% select(starts_with("Petal") | ends_with("Width"))
#> # A tibble: 150 x 3
#> Petal.Length Petal.Width Sepal.Width
#> <dbl> <dbl> <dbl>
#> 1 1.4 0.2 3.5
#> 2 1.4 0.2 3 
#> 3 1.3 0.2 3.2
#> 4 1.5 0.2 3.1
#> # ... with 146 more rows

The operators can also be combined:

iris %>% select(starts_with("Petal") & !ends_with("Width"))
#> # A tibble: 150 x 1
#> Petal.Length
#> <dbl>
#> 1 1.4
#> 2 1.4
#> 3 1.3
#> 4 1.5
#> # ... with 146 more rows
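The c() operator from the list above is not shown on its own, so here is a minimal sketch of it. It assumes only dplyr and the built-in iris data; wrapping the selections in c() gives the same columns as passing them to select as separate arguments.

```r
library(dplyr)

# Combine a bare column name and a helper inside c()
with_c <- iris %>% select(c(Species, starts_with("Sepal")))

# The same selection written as separate arguments
without_c <- iris %>% select(Species, starts_with("Sepal"))

colnames(with_c)
#> [1] "Species"      "Sepal.Length" "Sepal.Width"
identical(colnames(with_c), colnames(without_c))
#> [1] TRUE
```

c() is mostly useful when the combined selection has to be negated as a whole, as in the !c(Sepal.Length, Petal.Length) example earlier.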

In fact, select becomes much more powerful when combined with helper functions, which is the subject of the next part.

4. Extension 2

Combining with the last_col() function

last_col() selects a column by counting from the end. The default offset is 0, which is the last column, so last_col(1) is the second-to-last column.

First, take a look at what the iris dataset looks like.

head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

There are 5 columns in total.

iris %>% select(last_col())
#> # A tibble: 150 x 1
#> Species
#> <fct> 
#> 1 setosa 
#> 2 setosa 
#> 3 setosa 
#> 4 setosa 
#> # ... with 146 more rows

As you can see, without any arguments it selects the last column.

iris %>% select(3:last_col(1)) %>% head()
  Petal.Length Petal.Width
1          1.4         0.2
2          1.4         0.2
3          1.3         0.2
4          1.5         0.2
5          1.4         0.2
6          1.7         0.4

select(3:last_col(1)) selects from the third column to the second-to-last column. The same approach extends to other ranges counted from the end, for example from the fourth-to-last column to the second-to-last column.
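As a sketch of the extension just mentioned, a range counted entirely from the end can be written with two last_col() calls. This assumes only dplyr and the built-in iris data, whose 5 columns were listed above.

```r
library(dplyr)

# last_col(3) is the fourth-to-last column, last_col(1) the second-to-last
tail_cols <- iris %>% select(last_col(3):last_col(1))

colnames(tail_cols)
#> [1] "Sepal.Width"  "Petal.Length" "Petal.Width"
```

Because the offsets count from the end, the same call keeps working if more columns are later added at the front of the data frame.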

Combining with the everything() function

I usually combine select with everything() to move certain columns to the front.

For example, I may want to move a few columns of a data frame (say the 3rd, 6th and 8th) to the front so they are easier to check, while the remaining columns keep their order.
With the flights data, moving time_hour and air_time to the front can be written like this:

select(flights, time_hour, air_time, everything())
#> # A tibble: 336,776 x 19
#> time_hour air_time year month day dep_time sched_dep_time
#> <dttm> <dbl> <int> <int> <int> <int> <int>
#> 1 2013-01-01 05:00:00 227 2013 1 1 517 515
#> 2 2013-01-01 05:00:00 227 2013 1 1 533 529
#> 3 2013-01-01 05:00:00 160 2013 1 1 542 540
#> 4 2013-01-01 05:00:00 183 2013 1 1 544 545
#> 5 2013-01-01 06:00:00 116 2013 1 1 554 600
#> 6 2013-01-01 05:00:00 150 2013 1 1 554 558
#> # … with 336,770 more rows, and 12 more variables: dep_delay <dbl>,
#> # arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
#> # hour <dbl>, minute <dbl>
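The text talks about moving columns by position, while the flights example uses names. Here is a minimal sketch of the positional version, using the built-in mtcars data purely as a stand-in (that choice is an assumption of this example, not something from the original note).

```r
library(dplyr)

# Move the 3rd, 6th and 8th columns to the front;
# everything() then adds the remaining columns in their original order
front <- mtcars %>% select(3, 6, 8, everything())

colnames(front)
#>  [1] "disp" "wt"   "vs"   "mpg"  "cyl"  "hp"   "drat" "qsec" "am"   "gear"
#> [11] "carb"
```

Columns that were already selected are not repeated by everything(), so the column count stays at 11.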

Combining with the starts_with() function

iris %>% select(starts_with("Sepal"))
#> # A tibble: 150 x 2
#> Sepal.Length Sepal.Width
#> <dbl> <dbl>
#> 1 5.1 3.5
#> 2 4.9 3 
#> 3 4.7 3.2
#> 4 4.6 3.1
#> # ... with 146 more rows

Combining with the ends_with() function

iris %>% select(ends_with("Width"))
#> # A tibble: 150 x 2
#> Sepal.Width Petal.Width
#> <dbl> <dbl>
#> 1 3.5 0.2
#> 2 3 0.2
#> 3 3.2 0.2
#> 4 3.1 0.2
#> # ... with 146 more rows

A tip for these two functions: the pattern passed to them must be a character string, that is, it must be in quotation marks; without quotation marks the call will not run. Here is an example from my own data.

The sample names in this data are TCGA barcodes, 15 characters long; in the code below they are the column names of expr_RCC.

The aim is to select only the samples whose 14th and 15th characters form a number less than 11.

# Correct: the suffixes are passed as character strings
RCC_test <- expr_RCC %>% select(ends_with(c("01","05")))

# Wrong: unquoted numbers are not character strings
RCC_test <- expr_RCC %>% select(ends_with(c(01,05)))

Of course, at the very beginning I used base R indexing (str_sub itself comes from stringr):

RCC_cancer <- expr_RCC[,str_sub(colnames(expr_RCC),14,15) < 11]

Both approaches are fairly simple.
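Since expr_RCC is not available here, the comparison can be sketched on a made-up stand-in. The data frame expr_toy, its barcodes, and its values are all hypothetical. This version also converts characters 14 to 15 to integers before comparing, which is a small change from the code above, where the comparison is done on strings.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for expr_RCC: 15-character barcodes as column names
expr_toy <- data.frame(
  `TCGA-A3-3307-01` = c(1.2, 3.4),
  `TCGA-A3-3308-11` = c(0.5, 2.2),
  `TCGA-B0-4700-05` = c(4.1, 0.9),
  check.names = FALSE
)

# Approach 1: select by quoted string suffixes
by_suffix <- expr_toy %>% select(ends_with(c("01", "05")))

# Approach 2: index on the numeric value of characters 14-15
code <- as.integer(str_sub(colnames(expr_toy), 14, 15))
by_code <- expr_toy[, code < 11]

identical(colnames(by_suffix), colnames(by_code))
#> [1] TRUE
```

The two approaches agree here only because the toy suffixes below 11 happen to be exactly 01 and 05; with other codes below 11 in the data, the ends_with() call would need those suffixes added.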

Combining with the contains() function

contains() matches a literal substring of the column name. It feels a bit like a simple wildcard, but it does not interpret regular expressions.

iris %>% select(contains("al"))
#> # A tibble: 150 x 4
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <dbl> <dbl> <dbl> <dbl>
#> 1 5.1 3.5 1.4 0.2
#> 2 4.9 3 1.4 0.2
#> 3 4.7 3.2 1.3 0.2
#> 4 4.6 3.1 1.5 0.2
#> # ... with 146 more rows

Combining with the matches() function

This is where regular expressions are used.

iris %>% select(matches("[pt]al")) 
#> # A tibble: 150 x 4
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <dbl> <dbl> <dbl> <dbl>
#> 1 5.1 3.5 1.4 0.2
#> 2 4.9 3 1.4 0.2
#> 3 4.7 3.2 1.3 0.2
#> 4 4.6 3.1 1.5 0.2
#> # ... with 146 more rows
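The difference between the two helpers shows up when the same pattern is given to both. This minimal sketch assumes only dplyr and the built-in iris data: contains() looks for the text "[pt]al" literally, while matches() treats it as a regular expression.

```r
library(dplyr)

# Literal match: no iris column name contains the text "[pt]al"
literal <- iris %>% select(contains("[pt]al"))

# Regular expression: "pal" or "tal" occurs in the four measurement columns
regex <- iris %>% select(matches("[pt]al"))

ncol(literal)
#> [1] 0
ncol(regex)
#> [1] 4
```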

Combining with the where() function

where() takes a function, which is great: you can supply a predicate that is applied to each column, and the columns for which it returns TRUE are selected.

iris %>% select(where(is.factor))
#> # A tibble: 150 x 1
#> Species
#> <fct> 
#> 1 setosa 
#> 2 setosa 
#> 3 setosa 
#> 4 setosa 
#> # ... with 146 more rows
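The predicate does not have to be a ready-made function such as is.factor. As a sketch, assuming only dplyr and the built-in iris data, a custom function can combine a type check with a condition on the values (the 3.5 threshold is arbitrary and chosen only for illustration).

```r
library(dplyr)

# All numeric columns
num_cols <- iris %>% select(where(is.numeric))

# Numeric columns whose mean exceeds 3.5; the type check comes first so
# mean() is never called on the factor column Species
big_mean <- iris %>%
  select(where(function(x) is.numeric(x) && mean(x) > 3.5))

colnames(big_mean)
#> [1] "Sepal.Length" "Petal.Length"
```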

Combining with the which() function

Since where() works, which() works too.
This is a method I found myself, using the same data as before.

RCC_test <- expr_RCC %>% select(which(str_sub(colnames(expr_RCC),14,15) < 11))

The results are the same. Nice.
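The same idea can be tried on built-in data. This is a minimal sketch assuming only dplyr and iris: which() turns a logical test on the column names into integer positions, and select accepts those positions directly.

```r
library(dplyr)

# Positions of the columns whose names start with "Petal"
pos <- which(startsWith(colnames(iris), "Petal"))

by_position <- iris %>% select(which(startsWith(colnames(iris), "Petal")))
by_helper   <- iris %>% select(starts_with("Petal"))

pos
#> [1] 3 4
identical(colnames(by_position), colnames(by_helper))
#> [1] TRUE
```

The which() form is handy when the condition on the names is something the selection helpers cannot express directly, such as the numeric comparison on part of the name used above.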

Copyright notice
This article was written by [Shengxin Xiaopeng]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202211331319039.html