R for Data Science (note) -- data transformation (select basic use)
2022-06-24 19:23:00 【Shengxin Xiaopeng】

Tidyverse-style data processing is widely used in scientific research. I think this has a lot to do with the pipe operator %>% and the data-processing verbs.
Solving the most important and most common problems in the least amount of time is what I call efficiency; working through the remaining difficulties is what I call improvement.
Using the select verb
The first thing to be clear about:
filter operates on rows, while select operates on columns.
The previous note covered filter; this one covers select.
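To make the row/column distinction concrete, here is a minimal sketch on a small made-up table. The `toy` tibble and its columns are assumptions for illustration only and are not part of nycflights13.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  id    = 1:4,
  group = c("a", "b", "a", "b"),
  value = c(10, 20, 30, 40)
)

# filter() keeps the rows that satisfy a condition; every column is kept
rows_kept <- filter(toy, value > 15)

# select() keeps the named columns; every row is kept
cols_kept <- select(toy, id, value)

dim(rows_kept)  # 3 rows, 3 columns
dim(cols_kept)  # 4 rows, 2 columns
```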
### Hands-on
Again, select picks columns by name, and the column names do not need quotation marks.
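As a small sketch of what "no quotation marks" means in practice: bare names work directly, and when the names are already stored as strings in a character vector, the `all_of()` helper can be used instead. The `toy` table and the `wanted` vector are hypothetical names introduced here for illustration.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(id = 1:3, group = c("a", "b", "c"), value = c(1.5, 2.5, 3.5))

# Bare (unquoted) column names
bare <- select(toy, id, value)

# Column names held in a character vector, selected with all_of()
wanted <- c("id", "value")
from_vector <- select(toy, all_of(wanted))

names(bare)
names(from_vector)
```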
###1. The data
We again use the data from the nycflights13 package.
flights
#> # A tibble: 336,776 x 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
###2. Selecting columns
select can pick columns by individual names, by a range written with “:”, or exclude columns with “-”.
# Select columns by name
select(flights, year, month, day)
#> # A tibble: 336,776 x 3
#> year month day
#> <int> <int> <int>
#> 1 2013 1 1
#> 2 2013 1 1
#> 3 2013 1 1
#> 4 2013 1 1
#> 5 2013 1 1
#> 6 2013 1 1
#> # … with 336,770 more rows
# Select all columns between year and day (inclusive)
select(flights, year:day)
#> # A tibble: 336,776 x 3
#> year month day
#> <int> <int> <int>
#> 1 2013 1 1
#> 2 2013 1 1
#> 3 2013 1 1
#> 4 2013 1 1
#> 5 2013 1 1
#> 6 2013 1 1
#> # … with 336,770 more rows
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
#> # A tibble: 336,776 x 16
#> dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
#> <int> <int> <dbl> <int> <int> <dbl> <chr>
#> 1 517 515 2 830 819 11 UA
#> 2 533 529 4 850 830 20 UA
#> 3 542 540 2 923 850 33 AA
#> 4 544 545 -1 1004 1022 -18 B6
#> 5 554 600 -6 812 837 -25 DL
#> 6 554 558 -4 740 728 12 UA
#> # … with 336,770 more rows, and 9 more variables: flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>
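The three forms above can also be tried without installing nycflights13. The sketch below uses a hypothetical miniature table, `mini`, that copies only a few flights-like columns, and also shows that the piped form is the same call written differently.

```r
library(dplyr)

# Hypothetical miniature of the flights table (a few columns, three rows)
mini <- tibble(
  year     = 2013L,
  month    = 1L,
  day      = 1:3,
  dep_time = c(517L, 533L, 542L),
  carrier  = c("UA", "UA", "AA")
)

by_name  <- select(mini, year, month, day)  # individual names
by_range <- select(mini, year:day)          # inclusive range
dropped  <- select(mini, -(year:day))       # everything except the range

# The pipe passes `mini` as the first argument of select()
piped <- mini %>% select(year:day)

names(by_name)
names(by_range)
names(piped)
names(dropped)
```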
###3. Extension 1 (Boolean operations)
“:” selects a range of consecutive variables.
“!” takes the complement of a set of variables.
“&” and “|” select the intersection or union of two sets of variables.
“c()” combines selections.
Here we use the starwars and iris datasets to demonstrate (the iris output below assumes iris was first converted with as_tibble(), since plain iris is a data.frame).
starwars %>% select(name:mass)
#> # A tibble: 87 x 3
#> name height mass
#> <chr> <int> <dbl>
#> 1 Luke Skywalker 172 77
#> 2 C-3PO 167 75
#> 3 R2-D2 96 32
#> 4 Darth Vader 202 136
#> # ... with 83 more rows
The “!” operator negates a selection:
starwars %>% select(!(name:mass))
#> # A tibble: 87 x 11
#> hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
#> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <list> <list> <list>
#> 1 blond fair blue 19 male masculine Tatooine Human <chr [5]> <chr [2]> <chr [2]>
#> 2 <NA> gold yellow 112 none masculine Tatooine Droid <chr [6]> <chr [0]> <chr [0]>
#> 3 <NA> white, blue red 33 none masculine Naboo Droid <chr [7]> <chr [0]> <chr [0]>
#> 4 none white yellow 41.9 male masculine Tatooine Human <chr [4]> <chr [0]> <chr [1]>
#> # ... with 83 more rows
iris %>% select(!c(Sepal.Length, Petal.Length))
#> # A tibble: 150 x 3
#> Sepal.Width Petal.Width Species
#> <dbl> <dbl> <fct>
#> 1 3.5 0.2 setosa
#> 2 3 0.2 setosa
#> 3 3.2 0.2 setosa
#> 4 3.1 0.2 setosa
#> # ... with 146 more rows
iris %>% select(!ends_with("Width"))
#> # A tibble: 150 x 3
#> Sepal.Length Petal.Length Species
#> <dbl> <dbl> <fct>
#> 1 5.1 1.4 setosa
#> 2 4.9 1.4 setosa
#> 3 4.7 1.3 setosa
#> 4 4.6 1.5 setosa
#> # ... with 146 more rows
“&” and “|” take the intersection or union of two selections:
iris %>% select(starts_with("Petal") & ends_with("Width"))
#> # A tibble: 150 x 1
#> Petal.Width
#> <dbl>
#> 1 0.2
#> 2 0.2
#> 3 0.2
#> 4 0.2
#> # ... with 146 more rows
iris %>% select(starts_with("Petal") | ends_with("Width"))
#> # A tibble: 150 x 3
#> Petal.Length Petal.Width Sepal.Width
#> <dbl> <dbl> <dbl>
#> 1 1.4 0.2 3.5
#> 2 1.4 0.2 3
#> 3 1.3 0.2 3.2
#> 4 1.5 0.2 3.1
#> # ... with 146 more rows
The operators can also be combined:
iris %>% select(starts_with("Petal") & !ends_with("Width"))
#> # A tibble: 150 x 1
#> Petal.Length
#> <dbl>
#> 1 1.4
#> 2 1.4
#> 3 1.3
#> 4 1.5
#> # ... with 146 more rows
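Of the operators listed in this section, c() is only shown in negated form above. As a minimal sketch, it can also combine a bare column name with a selection helper; `iris_tbl`, `combo`, and `rest` are illustrative names introduced here.

```r
library(dplyr)

# Convert to a tibble so it prints like the outputs in this note
iris_tbl <- as_tibble(iris)

# c() combines several selections into one
combo <- iris_tbl %>% select(c(Species, starts_with("Petal")))
names(combo)

# Negating the combined selection keeps the remaining columns
rest <- iris_tbl %>% select(!c(Species, starts_with("Petal")))
names(rest)
```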
In practice, select becomes especially powerful when combined with other functions; that will be covered in another note.