当前位置：网站首页>R for Data Science (note) -- data transformation (select basic use)

R for Data Science (note) -- data transformation (select basic use)

2022-06-24 19:23:00 【Shengxin Xiaopeng】

R for Data Science

tidy Stream processing data is fully used in scientific research , I think it's inconsistent with the pipeline %>% Use , Data processing verb , Has a very important relationship .

In the least amount of time , Solve the most important 、 The most common problem , I call this efficiency ; The remaining difficulties , I call it improvement .

select The use of Verbs

The first thing to be clear is
filter Aiming at That's ok The operation of , select Is an operation on a column

Front learning filter The operation of , This study select operation

### actual combat

Again ,select Filter by column name , And column names do not need quotation marks .
###1. Data style
Still used nycflights13 The data in the package

flights
#> # A tibble: 336,776 x 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

###2. Filter data

select Filtering data can use a single column name , Sequence symbols can also be used , You can also use “-”

# Select columns by name
select(flights, year, month, day)
#> # A tibble: 336,776 x 3
#> year month day
#> <int> <int> <int>
#> 1 2013 1 1
#> 2 2013 1 1
#> 3 2013 1 1
#> 4 2013 1 1
#> 5 2013 1 1
#> 6 2013 1 1
#> # … with 336,770 more rows
# Select all columns between year and day (inclusive)
select(flights, year:day)
#> # A tibble: 336,776 x 3
#> year month day
#> <int> <int> <int>
#> 1 2013 1 1
#> 2 2013 1 1
#> 3 2013 1 1
#> 4 2013 1 1
#> 5 2013 1 1
#> 6 2013 1 1
#> # … with 336,770 more rows
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
#> # A tibble: 336,776 x 16
#> dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
#> <int> <int> <dbl> <int> <int> <dbl> <chr> 
#> 1 517 515 2 830 819 11 UA 
#> 2 533 529 4 850 830 20 UA 
#> 3 542 540 2 923 850 33 AA 
#> 4 544 545 -1 1004 1022 -18 B6 
#> 5 554 600 -6 812 837 -25 DL 
#> 6 554 558 -4 740 728 12 UA 
#> # … with 336,770 more rows, and 9 more variables: flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>

###3. expand 1（ Boolean operation ）

“:” Used to select a series of continuous variables .
“!” Take the complement of a set of variables .
“&” and “|” Used to select the intersection or union of two sets of variables .
“c()” For combination selection

Here we use starwas, iris These two datasets demonstrate

starwars %>% select(name:mass)
#> # A tibble: 87 x 3
#> name height mass
#> <chr> <int> <dbl>
#> 1 Luke Skywalker 172 77
#> 2 C-3PO 167 75
#> 3 R2-D2 96 32
#> 4 Darth Vader 202 136
#> # ... with 83 more rows

“!" Operator negates selection ：

starwars %>% select(!(name:mass))
#> # A tibble: 87 x 11
#> hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
#> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <list> <list> <list> 
#> 1 blond fair blue 19 male masculine Tatooine Human <chr [5]> <chr [2]> <chr [2]>
#> 2 <NA> gold yellow 112 none masculine Tatooine Droid <chr [6]> <chr [0]> <chr [0]>
#> 3 <NA> white, blue red 33 none masculine Naboo Droid <chr [7]> <chr [0]> <chr [0]>
#> 4 none white yellow 41.9 male masculine Tatooine Human <chr [4]> <chr [0]> <chr [1]>
#> # ... with 83 more rows

iris %>% select(!c(Sepal.Length, Petal.Length))
#> # A tibble: 150 x 3
#> Sepal.Width Petal.Width Species
#> <dbl> <dbl> <fct> 
#> 1 3.5 0.2 setosa 
#> 2 3 0.2 setosa 
#> 3 3.2 0.2 setosa 
#> 4 3.1 0.2 setosa 
#> # ... with 146 more rows


iris %>% select(!ends_with("Width"))
#> # A tibble: 150 x 3
#> Sepal.Length Petal.Length Species
#> <dbl> <dbl> <fct> 
#> 1 5.1 1.4 setosa 
#> 2 4.9 1.4 setosa 
#> 3 4.7 1.3 setosa 
#> 4 4.6 1.5 setosa 
#> # ... with 146 more rows

“&” and “|” Take the intersection or union of two choices ：

iris %>% select(starts_with("Petal") & ends_with("Width"))
#> # A tibble: 150 x 1
#> Petal.Width
#> <dbl>
#> 1 0.2
#> 2 0.2
#> 3 0.2
#> 4 0.2
#> # ... with 146 more rows

iris %>% select(starts_with("Petal") | ends_with("Width"))
#> # A tibble: 150 x 3
#> Petal.Length Petal.Width Sepal.Width
#> <dbl> <dbl> <dbl>
#> 1 1.4 0.2 3.5
#> 2 1.4 0.2 3 
#> 3 1.3 0.2 3.2
#> 4 1.5 0.2 3.1
#> # ... with 146 more rows

Use a combination of

iris %>% select(starts_with("Petal") & !ends_with("Width"))
#> # A tibble: 150 x 1
#> Petal.Length
#> <dbl>
#> 1 1.4
#> 2 1.4
#> 3 1.3
#> 4 1.5
#> # ... with 146 more rows