tidyfst包实例分析

The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

tidyfst包实例分析

我的R语言小伙伴最近分享了自己使用R来做工业级数据清洗的经验，最近我自己在不断测试我的新包tidyfst，因此就拿这个data.table的案例来尝试一下。

测试数据构造

本次测试，将不会加载data.table包，但是其实tidyfst里面无处不是data.table的元素，而且也导出了很多内置的data.table函数，比如as.data.table和data.table。所以这些代码在tidyfst中就可以自如地使用。

library(tidyfst)
diamonds <- ggplot2::diamonds
n = 1e5  #如果想做工业级测试，可以继续增加数量
set.seed(2020)
dtranges <- seq.Date(from = as.Date("2011-01-01"),
                     to = as.Date("2020-01-01"),
                     by = 1)
n1 <- sample(nrow(diamonds), n, replace = TRUE)
dat1 <- as.data.table(diamonds[n1, ])
dat1[, "dt"] <- sample(dtranges, n, replace = TRUE)  # 增加dt列
n2 <- sample(nrow(dat1), nrow(dat1)/1000)
dat1[n2, "price"] <- NA # price列构造千分之一缺失值
dat2 <- data.table(dt = sample(dtranges, min(n/1000, length(dtranges))),
                   price1 = sample(1000, min(n/1000, length(dtranges)), replace = TRUE))

dat3 <- data.table(dt = sample(dtranges, min(n/1000, length(dtranges))),
                   price2 = sample(1000, min(n/1000, length(dtranges)), replace = TRUE))

print(dat1)
#>         carat       cut color clarity depth table price     x     y
#>         <num>     <ord> <ord>   <ord> <num> <num> <int> <num> <num>
#>      1:  2.23   Premium     J     VS2  61.0    58 14867  8.39  8.36
#>      2:  0.46     Ideal     F     SI2  61.5    54   758  4.98  5.01
#>      3:  0.70     Ideal     H    VVS1  60.9    57  3611  5.70  5.78
#>      4:  1.50   Premium     F     VS2  61.5    58 14719  7.34  7.32
#>      5:  0.40 Very Good     G      IF  61.0    59  1154  4.75  4.79
#>     ---                                                            
#>  99996:  1.14     Ideal     F     VS2  61.4    56  8017  6.73  6.78
#>  99997:  0.39     Ideal     F     SI1  61.8    55   886  4.70  4.72
#>  99998:  2.00      Good     F     SI1  63.0    64 17869  7.67  7.76
#>  99999:  1.00   Premium     H     VS2  59.7    59  5139  6.57  6.52
#> 100000:  0.30     Ideal     G    VVS2  60.6    57   878  4.35  4.33
#> 2 variable(s) not shown: [z <num>, dt <Date>]

基础

小技巧

后面的分析，经常要根据日期进行计算。所以，先对日期进行排序，就能够提高运行速度。在tidyfst中，可以使用arrange_dt函数来对数据进行原位的各种操作，其中就包括排序。

dat1 = arrange_dt(dat1,dt)
dat1
#>         carat       cut color clarity depth table price     x     y
#>         <num>     <ord> <ord>   <ord> <num> <num> <int> <num> <num>
#>      1:  0.34 Very Good     H     VS2  62.4  60.0   537  4.41  4.44
#>      2:  0.30      Good     G     VS1  63.6  57.0   776  4.26  4.23
#>      3:  1.12     Ideal     F    VVS2  61.4  57.0  9634  6.69  6.66
#>      4:  1.00   Premium     E     SI1  59.1  58.0  4805  6.52  6.44
#>      5:  0.50      Fair     D    VVS1  65.9  64.0  1792  4.92  5.03
#>     ---                                                            
#>  99996:  0.32     Ideal     J     VS1  62.0  54.7   442  4.39  4.42
#>  99997:  0.33     Ideal     G     VS2  62.5  57.0   743  4.41  4.39
#>  99998:  0.28 Very Good     I     VS2  61.9  57.0   379  4.16  4.19
#>  99999:  0.30     Ideal     G     VS2  62.0  56.0   556  4.28  4.30
#> 100000:  0.32     Ideal     F     VS2  62.4  54.0   645  4.38  4.40
#> 2 variable(s) not shown: [z <num>, dt <Date>]

那么，现在dat1的数据就按照日期排好序了。

聚合

1.求每种切割类型、每种颜色钻石的平均价格、中位数价格与最高价格

在tidyfst中，我设置了一个sys_time_print函数，可以方便地输出system.time()函数返回的结果。

sys_time_print({
  r1_1 <- dat1 %>% 
    summarise_dt(
      by = .(cut,color),
      mean_price = mean(price, na.rm = TRUE),
      median_price = median(price, na.rm = TRUE),
      max_price = max(price, na.rm = TRUE)
    )
})
#> [1] "# Finished in 0.000s elapsed (0.000s cpu)"
r1_1
#>           cut color mean_price median_price max_price
#>         <ord> <ord>      <num>        <num>     <int>
#>  1: Very Good     H   4390.470       3394.0     18803
#>  2:      Good     G   3999.886       3303.0     18788
#>  3:     Ideal     F   3309.061       1751.0     18780
#>  4:   Premium     E   3481.628       1881.0     18426
#>  5:      Fair     D   4237.899       3205.0     16386
#>  6:   Premium     J   6416.235       5110.5     18706
#>  7:   Premium     F   4347.158       2874.0     18791
#>  8:     Ideal     D   2608.030       1550.5     18693
#>  9: Very Good     G   3900.558       2403.0     18818
#> 10: Very Good     E   3301.873       2051.0     18731
#> 11:   Premium     G   4543.181       2780.0     18741
#> 12:     Ideal     G   3751.035       1881.0     18806
#> 13:     Ideal     E   2558.501       1429.0     18729
#> 14:   Premium     H   5215.019       4451.0     18795
#> 15:   Premium     I   5921.498       4545.5     18823
#> 16: Very Good     F   3792.863       2473.0     18777
#> 17:      Good     D   3369.352       2661.0     18468
#> 18:      Fair     G   4186.403       2797.0     18574
#> 19:      Good     E   3293.289       2407.0     18236
#> 20:   Premium     D   3796.609       2348.0     18286
#> 21:     Ideal     H   3924.682       2320.0     18659
#> 22: Very Good     I   5475.587       4013.0     18500
#> 23:     Ideal     I   4434.030       2631.0     18779
#> 24:      Fair     J   5029.502       3422.0     18531
#> 25: Very Good     D   3437.133       2242.0     18526
#> 26:      Good     F   3518.548       2755.0     18686
#> 27:      Good     I   4966.745       3484.0     18707
#> 28:      Good     H   4098.280       3323.0     18640
#> 29:      Fair     H   5454.118       4134.5     18308
#> 30:      Fair     F   3790.703       2961.0     17995
#> 31:     Ideal     J   4972.801       4135.0     18508
#> 32: Very Good     J   5107.638       4126.5     18430
#> 33:      Fair     I   4278.831       3022.0     18242
#> 34:      Good     J   4578.490       3690.0     18325
#> 35:      Fair     E   3834.358       2996.5     15584
#>           cut color mean_price median_price max_price

tidyfst是永远不可能比data.table快的，但是如果你觉得上面的代码更容易掌握、更容易读懂，而在日常工作中多花零点几秒的运行时间没有太大问题（实际上节省了大家的交流时间，甚至就是节省将来自己再次读懂自己代码的时间），tidyfst就值得拥有。

2.求每天最高出售价格对应的那笔订单

sys_time_print({
  r1_2 <- dat1 %>% 
    arrange_dt(dt,-price) %>% 
    drop_na_dt(price) %>% 
    group_dt(
      by = dt,
      head(1)
    )
})
#> [1] "# Finished in 0.010s elapsed (0.050s cpu)"
r1_2
#>               dt carat       cut color clarity depth table price     x
#>           <Date> <num>     <ord> <ord>   <ord> <num> <num> <int> <num>
#>    1: 2011-01-01  2.10   Premium     J     VS2  58.3    54 12401  8.49
#>    2: 2011-01-02  2.02   Premium     I     VS2  59.0    59 17893  8.34
#>    3: 2011-01-03  1.61     Ideal     F     VS1  62.0    55 17414  7.55
#>    4: 2011-01-04  2.19   Premium     E     SI2  62.4    61 18232  8.31
#>    5: 2011-01-05  2.05   Premium     H     VS2  60.7    61 16235  8.25
#>   ---                                                                 
#> 3284: 2019-12-28  2.22     Ideal     J     SI2  62.8    57 13703  8.33
#> 3285: 2019-12-29  2.10 Very Good     J     VS2  61.3    59 14229  8.21
#> 3286: 2019-12-30  1.55 Very Good     H     VS2  63.3    56 10546  7.38
#> 3287: 2019-12-31  1.51 Very Good     F    VVS2  63.1    56 17317  7.32
#> 3288: 2020-01-01  2.04   Premium     J     VS2  63.0    58 13734  8.02
#> 2 variable(s) not shown: [y <num>, z <num>]

join

1.dat1与dat2以dt列左连接

实质上，merge函数已经优化得很好。tidyfst设计*_join系列函数的时候，只是为了一种不一样的语法结构来帮助实现不同的连接，因为它确实更加直观一些。但是实质上它还是merge.data.table函数的包装版本。

sys_time_print({
  r2_1 <- dat1 %>% 
    left_join_dt(dat2,by = "dt")
})
#> [1] "# Finished in 0.010s elapsed (0.000s cpu)"
r2_1
#> Key: <dt>
#>                 dt carat       cut color clarity depth table price     x
#>             <Date> <num>     <ord> <ord>   <ord> <num> <num> <int> <num>
#>      1: 2011-01-01  0.34 Very Good     H     VS2  62.4  60.0   537  4.41
#>      2: 2011-01-01  0.30      Good     G     VS1  63.6  57.0   776  4.26
#>      3: 2011-01-01  1.12     Ideal     F    VVS2  61.4  57.0  9634  6.69
#>      4: 2011-01-01  1.00   Premium     E     SI1  59.1  58.0  4805  6.52
#>      5: 2011-01-01  0.50      Fair     D    VVS1  65.9  64.0  1792  4.92
#>     ---                                                                 
#>  99996: 2020-01-01  0.32     Ideal     J     VS1  62.0  54.7   442  4.39
#>  99997: 2020-01-01  0.33     Ideal     G     VS2  62.5  57.0   743  4.41
#>  99998: 2020-01-01  0.28 Very Good     I     VS2  61.9  57.0   379  4.16
#>  99999: 2020-01-01  0.30     Ideal     G     VS2  62.0  56.0   556  4.28
#> 100000: 2020-01-01  0.32     Ideal     F     VS2  62.4  54.0   645  4.38
#> 3 variable(s) not shown: [y <num>, z <num>, price1 <int>]

2.多重join

sys_time_print({
  mymerge <- function(x, y) left_join_dt(x, y, by = "dt")
  r2_2 <- Reduce(mymerge, list(dat1, dat2, dat3))
})
#> [1] "# Finished in 0.060s elapsed (0.050s cpu)"
r2_2
#> Key: <dt>
#>                 dt carat       cut color clarity depth table price     x
#>             <Date> <num>     <ord> <ord>   <ord> <num> <num> <int> <num>
#>      1: 2011-01-01  0.34 Very Good     H     VS2  62.4  60.0   537  4.41
#>      2: 2011-01-01  0.30      Good     G     VS1  63.6  57.0   776  4.26
#>      3: 2011-01-01  1.12     Ideal     F    VVS2  61.4  57.0  9634  6.69
#>      4: 2011-01-01  1.00   Premium     E     SI1  59.1  58.0  4805  6.52
#>      5: 2011-01-01  0.50      Fair     D    VVS1  65.9  64.0  1792  4.92
#>     ---                                                                 
#>  99996: 2020-01-01  0.32     Ideal     J     VS1  62.0  54.7   442  4.39
#>  99997: 2020-01-01  0.33     Ideal     G     VS2  62.5  57.0   743  4.41
#>  99998: 2020-01-01  0.28 Very Good     I     VS2  61.9  57.0   379  4.16
#>  99999: 2020-01-01  0.30     Ideal     G     VS2  62.0  56.0   556  4.28
#> 100000: 2020-01-01  0.32     Ideal     F     VS2  62.4  54.0   645  4.38
#> 4 variable(s) not shown: [y <num>, z <num>, price1 <int>, price2 <int>]

长宽表转换

1.长表转宽表

sys_time_print({
  mean1 <- function(x) mean(x, na.rm = TRUE)
  max1 <- function(x) max(x, na.rm = TRUE)
  r3_1 <-dat1 %>% 
    wider_dt(cut,
             value = c("depth", "price"),
             name = "color",
             fun = list(mean1,max1))
})
#> [1] "# Finished in 0.020s elapsed (0.000s cpu)"
r3_1
#> Key: <cut>
#>          cut depth_mean1_D depth_mean1_E depth_mean1_F depth_mean1_G
#>        <ord>         <num>         <num>         <num>         <num>
#> 1:      Fair      64.15688      63.09882      63.41129      64.33663
#> 2:      Good      62.32176      62.16065      62.18220      62.60212
#> 3: Very Good      61.70454      61.73751      61.73138      61.83043
#> 4:   Premium      61.18875      61.16850      61.25098      61.27894
#> 5:     Ideal      61.67241      61.68693      61.65616      61.70585
#> 24 variable(s) not shown: [depth_mean1_H <num>, depth_mean1_I <num>, depth_mean1_J <num>, price_mean1_D <num>, price_mean1_E <num>, price_mean1_F <num>, price_mean1_G <num>, price_mean1_H <num>, price_mean1_I <num>, price_mean1_J <num>, ...]

2.宽表转长表

sys_time_print({
  r3_2 <-dat1 %>% 
    select_dt(cut,color,x,y,z) %>% 
    longer_dt(cut,color,
              name = "xyz",
              value = "xyzvalue")
})
#> [1] "# Finished in 0.000s elapsed (0.000s cpu)"

r3_2 
#>               cut color    xyz xyzvalue
#>             <ord> <ord> <fctr>    <num>
#>      1: Very Good     H      x     4.41
#>      2:      Good     G      x     4.26
#>      3:     Ideal     F      x     6.69
#>      4:   Premium     E      x     6.52
#>      5:      Fair     D      x     4.92
#>     ---                                
#> 299996:     Ideal     J      z     2.73
#> 299997:     Ideal     G      z     2.75
#> 299998: Very Good     I      z     2.58
#> 299999:     Ideal     G      z     2.66
#> 300000:     Ideal     F      z     2.74

高阶

向上/下填充空值

对于填充空值来说，可以这样操作：

sys_time_print({
  dat1 %>% fill_na_dt(price) -> dat1
})
#> [1] "# Finished in 0.000s elapsed (0.020s cpu)"
dat1
#>         carat       cut color clarity depth table price     x     y
#>         <num>     <ord> <ord>   <ord> <num> <num> <int> <num> <num>
#>      1:  0.34 Very Good     H     VS2  62.4  60.0   537  4.41  4.44
#>      2:  0.30      Good     G     VS1  63.6  57.0   776  4.26  4.23
#>      3:  1.12     Ideal     F    VVS2  61.4  57.0  9634  6.69  6.66
#>      4:  1.00   Premium     E     SI1  59.1  58.0  4805  6.52  6.44
#>      5:  0.50      Fair     D    VVS1  65.9  64.0  1792  4.92  5.03
#>     ---                                                            
#>  99996:  0.32     Ideal     J     VS1  62.0  54.7   442  4.39  4.42
#>  99997:  0.33     Ideal     G     VS2  62.5  57.0   743  4.41  4.39
#>  99998:  0.28 Very Good     I     VS2  61.9  57.0   379  4.16  4.19
#>  99999:  0.30     Ideal     G     VS2  62.0  56.0   556  4.28  4.30
#> 100000:  0.32     Ideal     F     VS2  62.4  54.0   645  4.38  4.40
#> 2 variable(s) not shown: [z <num>, dt <Date>]

添加子维度聚合结果为新列

1.以dat1为例，添加两列，一列为以cut、color聚合求price的均值，另一列是求标准差


sys_time_print({
  mutate_dt(dat1,
           mean_price = mean(price, na.rm = TRUE),
           sd_price = sd(price, na.rm = TRUE),
           by = .(cut, color))
})
#> [1] "# Finished in 0.060s elapsed (0.040s cpu)"

dat1
#>         carat       cut color clarity depth table price     x     y
#>         <num>     <ord> <ord>   <ord> <num> <num> <int> <num> <num>
#>      1:  0.34 Very Good     H     VS2  62.4  60.0   537  4.41  4.44
#>      2:  0.30      Good     G     VS1  63.6  57.0   776  4.26  4.23
#>      3:  1.12     Ideal     F    VVS2  61.4  57.0  9634  6.69  6.66
#>      4:  1.00   Premium     E     SI1  59.1  58.0  4805  6.52  6.44
#>      5:  0.50      Fair     D    VVS1  65.9  64.0  1792  4.92  5.03
#>     ---                                                            
#>  99996:  0.32     Ideal     J     VS1  62.0  54.7   442  4.39  4.42
#>  99997:  0.33     Ideal     G     VS2  62.5  57.0   743  4.41  4.39
#>  99998:  0.28 Very Good     I     VS2  61.9  57.0   379  4.16  4.19
#>  99999:  0.30     Ideal     G     VS2  62.0  56.0   556  4.28  4.30
#> 100000:  0.32     Ideal     F     VS2  62.4  54.0   645  4.38  4.40
#> 2 variable(s) not shown: [z <num>, dt <Date>]

2.以dat1为例，以dt分组添加一列序号id


sys_time_print({
  dat1 %>% 
  group_dt(
    by = dt,
    mutate_dt(id = seq(.N))
  ) -> dat1
})
#> [1] "# Finished in 0.860s elapsed (0.830s cpu)"
dat1
#>                 dt carat       cut color clarity depth table price     x
#>             <Date> <num>     <ord> <ord>   <ord> <num> <num> <int> <num>
#>      1: 2011-01-01  0.34 Very Good     H     VS2  62.4  60.0   537  4.41
#>      2: 2011-01-01  0.30      Good     G     VS1  63.6  57.0   776  4.26
#>      3: 2011-01-01  1.12     Ideal     F    VVS2  61.4  57.0  9634  6.69
#>      4: 2011-01-01  1.00   Premium     E     SI1  59.1  58.0  4805  6.52
#>      5: 2011-01-01  0.50      Fair     D    VVS1  65.9  64.0  1792  4.92
#>     ---                                                                 
#>  99996: 2020-01-01  0.32     Ideal     J     VS1  62.0  54.7   442  4.39
#>  99997: 2020-01-01  0.33     Ideal     G     VS2  62.5  57.0   743  4.41
#>  99998: 2020-01-01  0.28 Very Good     I     VS2  61.9  57.0   379  4.16
#>  99999: 2020-01-01  0.30     Ideal     G     VS2  62.0  56.0   556  4.28
#> 100000: 2020-01-01  0.32     Ideal     F     VS2  62.4  54.0   645  4.38
#> 3 variable(s) not shown: [y <num>, z <num>, id <int>]

移动函数


sys_time_print({
  dat1 %>% 
    group_dt(
      by = color,
      mutate_dt(
        MA10_price = frollmean(price, 10),
        MSD10_price = frollapply(price, 10, FUN = sd)
      )
    ) -> dat1
})
#> [1] "# Finished in 0.450s elapsed (0.420s cpu)"

dat1
#>         color         dt carat       cut clarity depth table price     x
#>         <ord>     <Date> <num>     <ord>   <ord> <num> <num> <int> <num>
#>      1:     H 2011-01-01  0.34 Very Good     VS2  62.4    60   537  4.41
#>      2:     H 2011-01-01  1.20   Premium     SI2  62.1    57  3965  6.84
#>      3:     H 2011-01-01  1.51   Premium     SI1  58.0    59 10063  7.57
#>      4:     H 2011-01-02  1.00   Premium     VS2  60.7    60  4930  6.55
#>      5:     H 2011-01-02  1.01     Ideal     SI2  61.1    56  4693  6.44
#>     ---                                                                 
#>  99996:     I 2019-12-30  0.54     Ideal     SI1  62.1    54  1057  5.24
#>  99997:     I 2019-12-31  0.90 Very Good     VS2  59.3    59  3992  6.27
#>  99998:     I 2019-12-31  0.30     Ideal     SI1  61.3    56   506  4.33
#>  99999:     I 2020-01-01  1.22   Premium     VS2  62.3    59  5832  6.79
#> 100000:     I 2020-01-01  0.28 Very Good     VS2  61.9    57   379  4.16
#> 5 variable(s) not shown: [y <num>, z <num>, id <int>, MA10_price <num>, MSD10_price <num>]

系统参数

sessionInfo()
#> R version 4.4.2 (2024-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26100)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=C                               
#> [2] LC_CTYPE=Chinese (Simplified)_China.utf8   
#> [3] LC_MONETARY=Chinese (Simplified)_China.utf8
#> [4] LC_NUMERIC=C                               
#> [5] LC_TIME=Chinese (Simplified)_China.utf8    
#> 
#> time zone: Asia/Shanghai
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] bench_1.1.3       dplyr_1.1.4       data.table_1.16.4 tidyfst_1.8.2    
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6      jsonlite_1.8.9    compiler_4.4.2    tidyselect_1.2.1 
#>  [5] Rcpp_1.0.13-1     stringr_1.5.1     parallel_4.4.2    jquerylib_0.1.4  
#>  [9] scales_1.3.0      yaml_2.3.10       fastmap_1.2.0     ggplot2_3.5.1    
#> [13] R6_2.5.1          generics_0.1.3    knitr_1.49        tibble_3.2.1     
#> [17] munsell_0.5.1     bslib_0.8.0       pillar_1.10.0     rlang_1.1.4      
#> [21] utf8_1.2.4        cachem_1.1.0      stringi_1.8.4     xfun_0.49        
#> [25] sass_0.4.9        cli_3.6.3         withr_3.0.2       magrittr_2.0.3   
#> [29] digest_0.6.37     grid_4.4.2        rstudioapi_0.17.1 fst_0.9.8        
#> [33] lifecycle_1.0.4   vctrs_0.6.5       evaluate_1.0.1    glue_1.8.0       
#> [37] colorspace_2.1-1  profmem_0.6.0     fstcore_0.9.18    rmarkdown_2.29   
#> [41] tools_4.4.2       pkgconfig_2.0.3   htmltools_0.5.8.1

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.