tidyverse Notes: the tidyr Package

2022-07-31 05:39:00 【高冷现充】

To be continued

tidyr: Tidy Messy Data

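After reading Hadley Wickham's R for Data Science, what stuck with me most from the tidyr chapter was its opening. The first line quotes Leo Tolstoy's famous sentence about happy and unhappy families. The second is the author's "clumsy" imitation of it, applied to tidy and messy datasets: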

Happy families are all alike; every unhappy family is unhappy in its own way.
—Leo Tolstoy
Tidy datasets are all alike, but every messy dataset is messy in its own way.
—Hadley Wickham

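Anyone who has wrestled with messy data will recognize this. We also have to accept a fact: once we start doing data analysis on our own, working with real-world data and real problems, what we meet is messy data most of the time.
Often we do not realize in advance that the data is messy, or we do not know where the mess is, and the problems only surface partway through the work.
A very capable teacher in the School of Mathematics once told me that data preprocessing is in fact very hard to do well. Today we try to use tidyr to solve part of that problem in a reasonably sound way.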

Common functions and what they do

Gather

Gather columns into key-value pairs
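That is how the help page describes it. In practice, it takes a group of variables of the same kind (the column names) and turns them into the values of a new column, converting "keys" into "values". Straight to an example: here we use relig_income, which counts survey respondents by religion and income bracket.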

> relig_income %>% print(n = 5)
# A tibble: 18 × 11
  religion   `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
  <chr>        <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
1 Agnostic        27        34        60        81        76       137        122
2 Atheist         12        27        37        52        35        70         73
3 Buddhist        27        21        30        34        33        58         62
4 Catholic       418       617       732       670       638      1116        949
5 Don’t kno…      15        14        15        11        10        35         21
# … with 13 more rows, and 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
# `Don't know/refused` <dbl>

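From the second column on, every variable is about income. So why not make income a new variable, with the different income brackets as its values? The first row, for example, would be split into 10 rows that all have religion Agnostic but a different value in the income variable. With gather() this looks like: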

relig_income %>% gather(2:11, key = "income", value = "count") %>% arrange(religion)

which gives

# A tibble: 180 × 3
   religion income             count
   <chr>    <chr>              <dbl>
 1 Agnostic <$10k                 27
 2 Agnostic $10-20k               34
 3 Agnostic $20-30k               60
 4 Agnostic $30-40k               81
 5 Agnostic $40-50k               76
 6 Agnostic $50-75k              137
 7 Agnostic $75-100k             122
 8 Agnostic $100-150k            109
 9 Agnostic >150k                 84
10 Agnostic Don't know/refused    96
# … with 170 more rows
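
In current tidyr (1.0.0 and later), gather() is marked as superseded and the documentation recommends pivot_longer() for new code. The following is a minimal sketch of the same reshaping with pivot_longer(), assuming tidyr >= 1.0.0 and dplyr are installed; the object names long_gather and long_pivot are illustrative only.

```r
library(tidyr)
library(dplyr)

# relig_income ships with tidyr; columns 2 to 11 are the income brackets
long_gather <- relig_income %>%
  gather(2:11, key = "income", value = "count")

# Same reshaping with pivot_longer(): keep religion, stack everything else
long_pivot <- relig_income %>%
  pivot_longer(cols = -religion, names_to = "income", values_to = "count")

# gather() stacks the data column by column, pivot_longer() row by row,
# so sort both results the same way before comparing them
all.equal(
  long_gather %>% arrange(religion, income),
  long_pivot %>% arrange(religion, income)
)
```

Here names_to plays the role of key and values_to the role of value, so the question "what is the key, what is the value?" carries over unchanged.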

Spread

Spread a key-value pair across multiple columns
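That is how the help page describes it. Put simply, it is the inverse of gather(): it creates new variables from the distinct values of one column, spreading the key-value pairs across multiple columns.
So the crux of using spread() and gather() correctly is to think it through: what is the key, and what is the value?
Using the same example again: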

relig_income_ <- relig_income %>% gather(2:11, key = "income", value = "count")
relig_income_ %>% spread(key = income, value = count) %>% print(n = 5)
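
To check that spread() really undoes gather(), one can compare the shape of the result with the original table. The sketch below also shows pivot_wider(), which the tidyr documentation recommends over spread() in new code. It assumes tidyr >= 1.0.0; the names long, wide_spread and wide_pivot are made up for this illustration.

```r
library(tidyr)
library(dplyr)

# Long form: one row per (religion, income bracket) pair
long <- relig_income %>%
  gather(2:11, key = "income", value = "count")

# Back to wide form, once with spread() and once with pivot_wider()
wide_spread <- long %>% spread(key = income, value = count)
wide_pivot  <- long %>% pivot_wider(names_from = income, values_from = count)

# Both should have the same shape as relig_income
dim(relig_income)
dim(wide_spread)
dim(wide_pivot)

# The column order is not guaranteed to match the original,
# so compare the column names as sets rather than position by position
setequal(names(wide_spread), names(relig_income))
setequal(names(wide_pivot), names(relig_income))
```

If an analysis depends on the income brackets appearing in their natural order, it is safer to select the columns explicitly after reshaping than to rely on whatever order the reshaping function returns.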