tidyfst包实例分析

我的R语言小伙伴最近分享了自己使用R来做工业级数据清洗的经验,最近我自己在不断测试我的新包tidyfst,因此就拿这个data.table的案例来尝试一下。

测试数据构造

本次测试,将不会加载data.table包,但是其实tidyfst里面无处不是data.table的元素,而且也导出了很多内置的data.table函数,比如as.data.table和data.table。所以这些代码在tidyfst中就可以自如地使用。

library(tidyfst)
#> 
#> Life's short, use R.
diamonds <- ggplot2::diamonds
n = 1e5  #如果想做工业级测试,可以继续增加数量
set.seed(2020)
dtranges <- seq.Date(from = as.Date("2011-01-01"),
                     to = as.Date("2020-01-01"),
                     by = 1)
n1 <- sample(nrow(diamonds), n, replace = TRUE)
dat1 <- as.data.table(diamonds[n1, ])
dat1[, "dt"] <- sample(dtranges, n, replace = TRUE)  # 增加dt列
n2 <- sample(nrow(dat1), nrow(dat1)/1000)
dat1[n2, "price"] <- NA # price列构造千分之一缺失值
dat2 <- data.table(dt = sample(dtranges, min(n/1000, length(dtranges))),
                   price1 = sample(1000, min(n/1000, length(dtranges)), replace = TRUE))

dat3 <- data.table(dt = sample(dtranges, min(n/1000, length(dtranges))),
                   price2 = sample(1000, min(n/1000, length(dtranges)), replace = TRUE))

print(dat1)
#>         carat       cut color clarity depth table price     x     y     z
#>         <num>     <ord> <ord>   <ord> <num> <num> <int> <num> <num> <num>
#>      1:  2.23   Premium     J     VS2  61.0    58 14867  8.39  8.36  5.11
#>      2:  0.46     Ideal     F     SI2  61.5    54   758  4.98  5.01  3.07
#>      3:  0.70     Ideal     H    VVS1  60.9    57  3611  5.70  5.78  3.50
#>      4:  1.50   Premium     F     VS2  61.5    58 14719  7.34  7.32  4.51
#>      5:  0.40 Very Good     G      IF  61.0    59  1154  4.75  4.79  2.91
#>     ---                                                                  
#>  99996:  1.14     Ideal     F     VS2  61.4    56  8017  6.73  6.78  4.14
#>  99997:  0.39     Ideal     F     SI1  61.8    55   886  4.70  4.72  2.91
#>  99998:  2.00      Good     F     SI1  63.0    64 17869  7.67  7.76  4.86
#>  99999:  1.00   Premium     H     VS2  59.7    59  5139  6.57  6.52  3.91
#> 100000:  0.30     Ideal     G    VVS2  60.6    57   878  4.35  4.33  2.63
#>                 dt
#>             <Date>
#>      1: 2016-09-29
#>      2: 2011-10-22
#>      3: 2013-03-06
#>      4: 2014-06-11
#>      5: 2015-02-23
#>     ---           
#>  99996: 2013-11-22
#>  99997: 2014-11-10
#>  99998: 2019-02-21
#>  99999: 2017-01-15
#> 100000: 2019-05-19

基础

小技巧

后面的分析,经常要根据日期进行计算。所以,先对日期进行排序,就能够提高运行速度。在tidyfst中,可以使用arrange_dt函数来对数据进行原位的各种操作,其中就包括排序。

那么,现在dat1的数据就按照日期排好序了。

聚合

1.求每种切割类型、每种颜色钻石的平均价格、中位数价格与最高价格

在tidyfst中,我设置了一个sys_time_print函数,可以方便地输出system.time()函数返回的结果。

sys_time_print({
  r1_1 <- dat1 %>% 
    summarise_dt(
      by = .(cut,color),
      mean_price = mean(price, na.rm = TRUE),
      median_price = median(price, na.rm = TRUE),
      max_price = max(price, na.rm = TRUE)
    )
})
#> [1] "Finished in 0.020s elapsed (0.050s cpu)"
r1_1
#>           cut color mean_price median_price max_price
#>         <ord> <ord>      <num>        <num>     <int>
#>  1: Very Good     H   4390.470       3394.0     18803
#>  2:      Good     G   3999.886       3303.0     18788
#>  3:     Ideal     F   3309.061       1751.0     18780
#>  4:   Premium     E   3481.628       1881.0     18426
#>  5:      Fair     D   4237.899       3205.0     16386
#>  6:   Premium     J   6416.235       5110.5     18706
#>  7:   Premium     F   4347.158       2874.0     18791
#>  8:     Ideal     D   2608.030       1550.5     18693
#>  9: Very Good     G   3900.558       2403.0     18818
#> 10: Very Good     E   3301.873       2051.0     18731
#> 11:   Premium     G   4543.181       2780.0     18741
#> 12:     Ideal     G   3751.035       1881.0     18806
#> 13:     Ideal     E   2558.501       1429.0     18729
#> 14:   Premium     H   5215.019       4451.0     18795
#> 15:   Premium     I   5921.498       4545.5     18823
#> 16: Very Good     F   3792.863       2473.0     18777
#> 17:      Good     D   3369.352       2661.0     18468
#> 18:      Fair     G   4186.403       2797.0     18574
#> 19:      Good     E   3293.289       2407.0     18236
#> 20:   Premium     D   3796.609       2348.0     18286
#> 21:     Ideal     H   3924.682       2320.0     18659
#> 22: Very Good     I   5475.587       4013.0     18500
#> 23:     Ideal     I   4434.030       2631.0     18779
#> 24:      Fair     J   5029.502       3422.0     18531
#> 25: Very Good     D   3437.133       2242.0     18526
#> 26:      Good     F   3518.548       2755.0     18686
#> 27:      Good     I   4966.745       3484.0     18707
#> 28:      Good     H   4098.280       3323.0     18640
#> 29:      Fair     H   5454.118       4134.5     18308
#> 30:      Fair     F   3790.703       2961.0     17995
#> 31:     Ideal     J   4972.801       4135.0     18508
#> 32: Very Good     J   5107.638       4126.5     18430
#> 33:      Fair     I   4278.831       3022.0     18242
#> 34:      Good     J   4578.490       3690.0     18325
#> 35:      Fair     E   3834.358       2996.5     15584
#>           cut color mean_price median_price max_price

tidyfst是永远不可能比data.table快的,但是如果你觉得上面的代码更容易掌握、更容易读懂,而在日常工作中多花零点几秒的运行时间没有太大问题(实际上节省了大家的交流时间,甚至就是节省将来自己再次读懂自己代码的时间),tidyfst就值得拥有。

join

长宽表转换

1.长表转宽表

sys_time_print({
  mean1 <- function(x) mean(x, na.rm = TRUE)
  max1 <- function(x) max(x, na.rm = TRUE)
  r3_1 <-dat1 %>% 
    wider_dt(cut,
             value = c("depth", "price"),
             name = "color",
             fun = list(mean1,max1))
})
#> [1] "Finished in 0.030s elapsed (0.030s cpu)"
r3_1
#> Key: <cut>
#>          cut depth_mean1_D depth_mean1_E depth_mean1_F depth_mean1_G
#>        <ord>         <num>         <num>         <num>         <num>
#> 1:      Fair      64.15688      63.09882      63.41129      64.33663
#> 2:      Good      62.32176      62.16065      62.18220      62.60212
#> 3: Very Good      61.70454      61.73751      61.73138      61.83043
#> 4:   Premium      61.18875      61.16850      61.25098      61.27894
#> 5:     Ideal      61.67241      61.68693      61.65616      61.70585
#>    depth_mean1_H depth_mean1_I depth_mean1_J price_mean1_D price_mean1_E
#>            <num>         <num>         <num>         <num>         <num>
#> 1:      64.44778      64.24571      64.75682      4237.899      3834.358
#> 2:      62.50826      62.56427      62.44288      3369.352      3293.289
#> 3:      61.97838      61.90052      61.94553      3437.133      3301.873
#> 4:      61.31882      61.34123      61.42092      3796.609      3481.628
#> 5:      61.72374      61.79243      61.80705      2608.030      2558.501
#>    price_mean1_F price_mean1_G price_mean1_H price_mean1_I price_mean1_J
#>            <num>         <num>         <num>         <num>         <num>
#> 1:      3790.703      4186.403      5454.118      4278.831      5029.502
#> 2:      3518.548      3999.886      4098.280      4966.745      4578.490
#> 3:      3792.863      3900.558      4390.470      5475.587      5107.638
#> 4:      4347.158      4543.181      5215.019      5921.498      6416.235
#> 5:      3309.061      3751.035      3924.682      4434.030      4972.801
#>    depth_max1_D depth_max1_E depth_max1_F depth_max1_G depth_max1_H
#>           <num>        <num>        <num>        <num>        <num>
#> 1:         71.6         79.0         70.8         72.9         71.8
#> 2:         67.0         65.9         66.5         65.8         66.1
#> 3:         64.7         64.4         64.2         64.9         64.7
#> 4:         63.0         63.0         63.0         63.0         63.0
#> 5:         64.5         65.5         65.3         64.1         65.1
#>    depth_max1_I depth_max1_J price_max1_D price_max1_E price_max1_F
#>           <num>        <num>        <int>        <int>        <int>
#> 1:         71.3         73.6        16386        15584        17995
#> 2:         65.9         66.0        18468        18236        18686
#> 3:         64.4         64.1        18526        18731        18777
#> 4:         63.0         63.0        18286        18426        18791
#> 5:         66.7         65.4        18693        18729        18780
#>    price_max1_G price_max1_H price_max1_I price_max1_J
#>           <int>        <int>        <int>        <int>
#> 1:        18574        18308        18242        18531
#> 2:        18788        18640        18707        18325
#> 3:        18818        18803        18500        18430
#> 4:        18741        18795        18823        18706
#> 5:        18806        18659        18779        18508

高阶

向上/下填充空值

对于填充空值来说,可以这样操作:

添加子维度聚合结果为新列

移动函数

系统参数

sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18362)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=C                              
#> [2] LC_CTYPE=Chinese (Simplified)_China.936   
#> [3] LC_MONETARY=Chinese (Simplified)_China.936
#> [4] LC_NUMERIC=C                              
#> [5] LC_TIME=Chinese (Simplified)_China.936    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] tidyfst_0.9.7
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.3         knitr_1.28         magrittr_1.5       tidyselect_1.0.0  
#>  [5] munsell_0.5.0      colorspace_1.4-1   R6_2.4.1           rlang_0.4.5       
#>  [9] dplyr_0.8.5        stringr_1.4.0      tools_3.6.3        parallel_3.6.3    
#> [13] grid_3.6.3         data.table_1.12.8  gtable_0.3.0       xfun_0.12         
#> [17] htmltools_0.4.0    assertthat_0.2.1   yaml_2.2.1         digest_0.6.25     
#> [21] tibble_2.1.3       lifecycle_0.2.0    crayon_1.3.4       purrr_0.3.3       
#> [25] ggplot2_3.3.0.9000 fst_0.9.2          glue_1.3.2         evaluate_0.14     
#> [29] rmarkdown_2.1      stringi_1.4.6      compiler_3.6.3     pillar_1.4.3      
#> [33] scales_1.1.0       pkgconfig_2.0.3