The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
我的R语言小伙伴最近分享了自己使用R来做工业级数据清洗的经验,最近我自己在不断测试我的新包tidyfst,因此就拿这个data.table的案例来尝试一下。
本次测试,将不会加载data.table包,但是其实tidyfst里面无处不是data.table的元素,而且也导出了很多内置的data.table函数,比如as.data.table和data.table。所以这些代码在tidyfst中就可以自如地使用。
library(tidyfst)
diamonds <- ggplot2::diamonds
n = 1e5 #如果想做工业级测试,可以继续增加数量
set.seed(2020)
dtranges <- seq.Date(from = as.Date("2011-01-01"),
to = as.Date("2020-01-01"),
by = 1)
n1 <- sample(nrow(diamonds), n, replace = TRUE)
dat1 <- as.data.table(diamonds[n1, ])
dat1[, "dt"] <- sample(dtranges, n, replace = TRUE) # 增加dt列
n2 <- sample(nrow(dat1), nrow(dat1)/1000)
dat1[n2, "price"] <- NA # price列构造千分之一缺失值
dat2 <- data.table(dt = sample(dtranges, min(n/1000, length(dtranges))),
price1 = sample(1000, min(n/1000, length(dtranges)), replace = TRUE))
dat3 <- data.table(dt = sample(dtranges, min(n/1000, length(dtranges))),
price2 = sample(1000, min(n/1000, length(dtranges)), replace = TRUE))
print(dat1)
#> carat cut color clarity depth table price x y
#> <num> <ord> <ord> <ord> <num> <num> <int> <num> <num>
#> 1: 2.23 Premium J VS2 61.0 58 14867 8.39 8.36
#> 2: 0.46 Ideal F SI2 61.5 54 758 4.98 5.01
#> 3: 0.70 Ideal H VVS1 60.9 57 3611 5.70 5.78
#> 4: 1.50 Premium F VS2 61.5 58 14719 7.34 7.32
#> 5: 0.40 Very Good G IF 61.0 59 1154 4.75 4.79
#> ---
#> 99996: 1.14 Ideal F VS2 61.4 56 8017 6.73 6.78
#> 99997: 0.39 Ideal F SI1 61.8 55 886 4.70 4.72
#> 99998: 2.00 Good F SI1 63.0 64 17869 7.67 7.76
#> 99999: 1.00 Premium H VS2 59.7 59 5139 6.57 6.52
#> 100000: 0.30 Ideal G VVS2 60.6 57 878 4.35 4.33
#> 2 variable(s) not shown: [z <num>, dt <Date>]
后面的分析,经常要根据日期进行计算。所以,先对日期进行排序,就能够提高运行速度。在tidyfst中,可以使用arrange_dt
函数来对数据进行原位的各种操作,其中就包括排序。
dat1 = arrange_dt(dat1,dt)
dat1
#> carat cut color clarity depth table price x y
#> <num> <ord> <ord> <ord> <num> <num> <int> <num> <num>
#> 1: 0.34 Very Good H VS2 62.4 60.0 537 4.41 4.44
#> 2: 0.30 Good G VS1 63.6 57.0 776 4.26 4.23
#> 3: 1.12 Ideal F VVS2 61.4 57.0 9634 6.69 6.66
#> 4: 1.00 Premium E SI1 59.1 58.0 4805 6.52 6.44
#> 5: 0.50 Fair D VVS1 65.9 64.0 1792 4.92 5.03
#> ---
#> 99996: 0.32 Ideal J VS1 62.0 54.7 442 4.39 4.42
#> 99997: 0.33 Ideal G VS2 62.5 57.0 743 4.41 4.39
#> 99998: 0.28 Very Good I VS2 61.9 57.0 379 4.16 4.19
#> 99999: 0.30 Ideal G VS2 62.0 56.0 556 4.28 4.30
#> 100000: 0.32 Ideal F VS2 62.4 54.0 645 4.38 4.40
#> 2 variable(s) not shown: [z <num>, dt <Date>]
那么,现在dat1的数据就按照日期排好序了。
在tidyfst中,我设置了一个sys_time_print
函数,可以方便地输出system.time()
函数返回的结果。
sys_time_print({
r1_1 <- dat1 %>%
summarise_dt(
by = .(cut,color),
mean_price = mean(price, na.rm = TRUE),
median_price = median(price, na.rm = TRUE),
max_price = max(price, na.rm = TRUE)
)
})
#> [1] "# Finished in 0.010s elapsed (0.000s cpu)"
r1_1
#> cut color mean_price median_price max_price
#> <ord> <ord> <num> <num> <int>
#> 1: Very Good H 4390.470 3394.0 18803
#> 2: Good G 3999.886 3303.0 18788
#> 3: Ideal F 3309.061 1751.0 18780
#> 4: Premium E 3481.628 1881.0 18426
#> 5: Fair D 4237.899 3205.0 16386
#> 6: Premium J 6416.235 5110.5 18706
#> 7: Premium F 4347.158 2874.0 18791
#> 8: Ideal D 2608.030 1550.5 18693
#> 9: Very Good G 3900.558 2403.0 18818
#> 10: Very Good E 3301.873 2051.0 18731
#> 11: Premium G 4543.181 2780.0 18741
#> 12: Ideal G 3751.035 1881.0 18806
#> 13: Ideal E 2558.501 1429.0 18729
#> 14: Premium H 5215.019 4451.0 18795
#> 15: Premium I 5921.498 4545.5 18823
#> 16: Very Good F 3792.863 2473.0 18777
#> 17: Good D 3369.352 2661.0 18468
#> 18: Fair G 4186.403 2797.0 18574
#> 19: Good E 3293.289 2407.0 18236
#> 20: Premium D 3796.609 2348.0 18286
#> 21: Ideal H 3924.682 2320.0 18659
#> 22: Very Good I 5475.587 4013.0 18500
#> 23: Ideal I 4434.030 2631.0 18779
#> 24: Fair J 5029.502 3422.0 18531
#> 25: Very Good D 3437.133 2242.0 18526
#> 26: Good F 3518.548 2755.0 18686
#> 27: Good I 4966.745 3484.0 18707
#> 28: Good H 4098.280 3323.0 18640
#> 29: Fair H 5454.118 4134.5 18308
#> 30: Fair F 3790.703 2961.0 17995
#> 31: Ideal J 4972.801 4135.0 18508
#> 32: Very Good J 5107.638 4126.5 18430
#> 33: Fair I 4278.831 3022.0 18242
#> 34: Good J 4578.490 3690.0 18325
#> 35: Fair E 3834.358 2996.5 15584
#> cut color mean_price median_price max_price
tidyfst是永远不可能比data.table快的,但是如果你觉得上面的代码更容易掌握、更容易读懂,而在日常工作中多花零点几秒的运行时间没有太大问题(实际上节省了大家的交流时间,甚至就是节省将来自己再次读懂自己代码的时间),tidyfst就值得拥有。
sys_time_print({
r1_2 <- dat1 %>%
arrange_dt(dt,-price) %>%
drop_na_dt(price) %>%
group_dt(
by = dt,
head(1)
)
})
#> [1] "# Finished in 0.010s elapsed (0.000s cpu)"
r1_2
#> dt carat cut color clarity depth table price x
#> <Date> <num> <ord> <ord> <ord> <num> <num> <int> <num>
#> 1: 2011-01-01 2.10 Premium J VS2 58.3 54 12401 8.49
#> 2: 2011-01-02 2.02 Premium I VS2 59.0 59 17893 8.34
#> 3: 2011-01-03 1.61 Ideal F VS1 62.0 55 17414 7.55
#> 4: 2011-01-04 2.19 Premium E SI2 62.4 61 18232 8.31
#> 5: 2011-01-05 2.05 Premium H VS2 60.7 61 16235 8.25
#> ---
#> 3284: 2019-12-28 2.22 Ideal J SI2 62.8 57 13703 8.33
#> 3285: 2019-12-29 2.10 Very Good J VS2 61.3 59 14229 8.21
#> 3286: 2019-12-30 1.55 Very Good H VS2 63.3 56 10546 7.38
#> 3287: 2019-12-31 1.51 Very Good F VVS2 63.1 56 17317 7.32
#> 3288: 2020-01-01 2.04 Premium J VS2 63.0 58 13734 8.02
#> 2 variable(s) not shown: [y <num>, z <num>]
实质上,merge函数已经优化得很好。tidyfst设计*_join
系列函数的时候,只是为了一种不一样的语法结构来帮助实现不同的连接,因为它确实更加直观一些。但是实质上它还是merge.data.table函数的包装版本。
sys_time_print({
r2_1 <- dat1 %>%
left_join_dt(dat2,by = "dt")
})
#> [1] "# Finished in 0.040s elapsed (0.000s cpu)"
r2_1
#> Key: <dt>
#> dt carat cut color clarity depth table price x
#> <Date> <num> <ord> <ord> <ord> <num> <num> <int> <num>
#> 1: 2011-01-01 0.34 Very Good H VS2 62.4 60.0 537 4.41
#> 2: 2011-01-01 0.30 Good G VS1 63.6 57.0 776 4.26
#> 3: 2011-01-01 1.12 Ideal F VVS2 61.4 57.0 9634 6.69
#> 4: 2011-01-01 1.00 Premium E SI1 59.1 58.0 4805 6.52
#> 5: 2011-01-01 0.50 Fair D VVS1 65.9 64.0 1792 4.92
#> ---
#> 99996: 2020-01-01 0.32 Ideal J VS1 62.0 54.7 442 4.39
#> 99997: 2020-01-01 0.33 Ideal G VS2 62.5 57.0 743 4.41
#> 99998: 2020-01-01 0.28 Very Good I VS2 61.9 57.0 379 4.16
#> 99999: 2020-01-01 0.30 Ideal G VS2 62.0 56.0 556 4.28
#> 100000: 2020-01-01 0.32 Ideal F VS2 62.4 54.0 645 4.38
#> 3 variable(s) not shown: [y <num>, z <num>, price1 <int>]
sys_time_print({
mymerge <- function(x, y) left_join_dt(x, y, by = "dt")
r2_2 <- Reduce(mymerge, list(dat1, dat2, dat3))
})
#> [1] "# Finished in 0.020s elapsed (0.000s cpu)"
r2_2
#> Key: <dt>
#> dt carat cut color clarity depth table price x
#> <Date> <num> <ord> <ord> <ord> <num> <num> <int> <num>
#> 1: 2011-01-01 0.34 Very Good H VS2 62.4 60.0 537 4.41
#> 2: 2011-01-01 0.30 Good G VS1 63.6 57.0 776 4.26
#> 3: 2011-01-01 1.12 Ideal F VVS2 61.4 57.0 9634 6.69
#> 4: 2011-01-01 1.00 Premium E SI1 59.1 58.0 4805 6.52
#> 5: 2011-01-01 0.50 Fair D VVS1 65.9 64.0 1792 4.92
#> ---
#> 99996: 2020-01-01 0.32 Ideal J VS1 62.0 54.7 442 4.39
#> 99997: 2020-01-01 0.33 Ideal G VS2 62.5 57.0 743 4.41
#> 99998: 2020-01-01 0.28 Very Good I VS2 61.9 57.0 379 4.16
#> 99999: 2020-01-01 0.30 Ideal G VS2 62.0 56.0 556 4.28
#> 100000: 2020-01-01 0.32 Ideal F VS2 62.4 54.0 645 4.38
#> 4 variable(s) not shown: [y <num>, z <num>, price1 <int>, price2 <int>]
sys_time_print({
mean1 <- function(x) mean(x, na.rm = TRUE)
max1 <- function(x) max(x, na.rm = TRUE)
r3_1 <-dat1 %>%
wider_dt(cut,
value = c("depth", "price"),
name = "color",
fun = list(mean1,max1))
})
#> [1] "# Finished in 0.020s elapsed (0.000s cpu)"
r3_1
#> Key: <cut>
#> cut depth_mean1_D depth_mean1_E depth_mean1_F depth_mean1_G
#> <ord> <num> <num> <num> <num>
#> 1: Fair 64.15688 63.09882 63.41129 64.33663
#> 2: Good 62.32176 62.16065 62.18220 62.60212
#> 3: Very Good 61.70454 61.73751 61.73138 61.83043
#> 4: Premium 61.18875 61.16850 61.25098 61.27894
#> 5: Ideal 61.67241 61.68693 61.65616 61.70585
#> 24 variable(s) not shown: [depth_mean1_H <num>, depth_mean1_I <num>, depth_mean1_J <num>, price_mean1_D <num>, price_mean1_E <num>, price_mean1_F <num>, price_mean1_G <num>, price_mean1_H <num>, price_mean1_I <num>, price_mean1_J <num>, ...]
sys_time_print({
r3_2 <-dat1 %>%
select_dt(cut,color,x,y,z) %>%
longer_dt(cut,color,
name = "xyz",
value = "xyzvalue")
})
#> [1] "# Finished in 0.000s elapsed (0.000s cpu)"
r3_2
#> cut color xyz xyzvalue
#> <ord> <ord> <fctr> <num>
#> 1: Very Good H x 4.41
#> 2: Good G x 4.26
#> 3: Ideal F x 6.69
#> 4: Premium E x 6.52
#> 5: Fair D x 4.92
#> ---
#> 299996: Ideal J z 2.73
#> 299997: Ideal G z 2.75
#> 299998: Very Good I z 2.58
#> 299999: Ideal G z 2.66
#> 300000: Ideal F z 2.74
对于填充空值来说,可以这样操作:
sys_time_print({
dat1 %>% fill_na_dt(price) -> dat1
})
#> [1] "# Finished in 0.020s elapsed (0.000s cpu)"
dat1
#> carat cut color clarity depth table price x y
#> <num> <ord> <ord> <ord> <num> <num> <int> <num> <num>
#> 1: 0.34 Very Good H VS2 62.4 60.0 537 4.41 4.44
#> 2: 0.30 Good G VS1 63.6 57.0 776 4.26 4.23
#> 3: 1.12 Ideal F VVS2 61.4 57.0 9634 6.69 6.66
#> 4: 1.00 Premium E SI1 59.1 58.0 4805 6.52 6.44
#> 5: 0.50 Fair D VVS1 65.9 64.0 1792 4.92 5.03
#> ---
#> 99996: 0.32 Ideal J VS1 62.0 54.7 442 4.39 4.42
#> 99997: 0.33 Ideal G VS2 62.5 57.0 743 4.41 4.39
#> 99998: 0.28 Very Good I VS2 61.9 57.0 379 4.16 4.19
#> 99999: 0.30 Ideal G VS2 62.0 56.0 556 4.28 4.30
#> 100000: 0.32 Ideal F VS2 62.4 54.0 645 4.38 4.40
#> 2 variable(s) not shown: [z <num>, dt <Date>]
sys_time_print({
mutate_dt(dat1,
mean_price = mean(price, na.rm = TRUE),
sd_price = sd(price, na.rm = TRUE),
by = .(cut, color))
})
#> [1] "# Finished in 0.070s elapsed (0.030s cpu)"
dat1
#> carat cut color clarity depth table price x y
#> <num> <ord> <ord> <ord> <num> <num> <int> <num> <num>
#> 1: 0.34 Very Good H VS2 62.4 60.0 537 4.41 4.44
#> 2: 0.30 Good G VS1 63.6 57.0 776 4.26 4.23
#> 3: 1.12 Ideal F VVS2 61.4 57.0 9634 6.69 6.66
#> 4: 1.00 Premium E SI1 59.1 58.0 4805 6.52 6.44
#> 5: 0.50 Fair D VVS1 65.9 64.0 1792 4.92 5.03
#> ---
#> 99996: 0.32 Ideal J VS1 62.0 54.7 442 4.39 4.42
#> 99997: 0.33 Ideal G VS2 62.5 57.0 743 4.41 4.39
#> 99998: 0.28 Very Good I VS2 61.9 57.0 379 4.16 4.19
#> 99999: 0.30 Ideal G VS2 62.0 56.0 556 4.28 4.30
#> 100000: 0.32 Ideal F VS2 62.4 54.0 645 4.38 4.40
#> 2 variable(s) not shown: [z <num>, dt <Date>]
sys_time_print({
dat1 %>%
group_dt(
by = dt,
mutate_dt(id = seq(.N))
) -> dat1
})
#> [1] "# Finished in 0.760s elapsed (0.160s cpu)"
dat1
#> dt carat cut color clarity depth table price x
#> <Date> <num> <ord> <ord> <ord> <num> <num> <int> <num>
#> 1: 2011-01-01 0.34 Very Good H VS2 62.4 60.0 537 4.41
#> 2: 2011-01-01 0.30 Good G VS1 63.6 57.0 776 4.26
#> 3: 2011-01-01 1.12 Ideal F VVS2 61.4 57.0 9634 6.69
#> 4: 2011-01-01 1.00 Premium E SI1 59.1 58.0 4805 6.52
#> 5: 2011-01-01 0.50 Fair D VVS1 65.9 64.0 1792 4.92
#> ---
#> 99996: 2020-01-01 0.32 Ideal J VS1 62.0 54.7 442 4.39
#> 99997: 2020-01-01 0.33 Ideal G VS2 62.5 57.0 743 4.41
#> 99998: 2020-01-01 0.28 Very Good I VS2 61.9 57.0 379 4.16
#> 99999: 2020-01-01 0.30 Ideal G VS2 62.0 56.0 556 4.28
#> 100000: 2020-01-01 0.32 Ideal F VS2 62.4 54.0 645 4.38
#> 3 variable(s) not shown: [y <num>, z <num>, id <int>]
sys_time_print({
dat1 %>%
group_dt(
by = color,
mutate_dt(
MA10_price = frollmean(price, 10),
MSD10_price = frollapply(price, 10, FUN = sd)
)
) -> dat1
})
#> [1] "# Finished in 0.450s elapsed (0.130s cpu)"
dat1
#> color dt carat cut clarity depth table price x
#> <ord> <Date> <num> <ord> <ord> <num> <num> <int> <num>
#> 1: H 2011-01-01 0.34 Very Good VS2 62.4 60 537 4.41
#> 2: H 2011-01-01 1.20 Premium SI2 62.1 57 3965 6.84
#> 3: H 2011-01-01 1.51 Premium SI1 58.0 59 10063 7.57
#> 4: H 2011-01-02 1.00 Premium VS2 60.7 60 4930 6.55
#> 5: H 2011-01-02 1.01 Ideal SI2 61.1 56 4693 6.44
#> ---
#> 99996: I 2019-12-30 0.54 Ideal SI1 62.1 54 1057 5.24
#> 99997: I 2019-12-31 0.90 Very Good VS2 59.3 59 3992 6.27
#> 99998: I 2019-12-31 0.30 Ideal SI1 61.3 56 506 4.33
#> 99999: I 2020-01-01 1.22 Premium VS2 62.3 59 5832 6.79
#> 100000: I 2020-01-01 0.28 Very Good VS2 61.9 57 379 4.16
#> 5 variable(s) not shown: [y <num>, z <num>, id <int>, MA10_price <num>, MSD10_price <num>]
sessionInfo()
#> R version 4.4.1 (2024-06-14 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 22631)
#>
#> Matrix products: default
#>
#>
#> locale:
#> [1] LC_COLLATE=C
#> [2] LC_CTYPE=Chinese (Simplified)_China.utf8
#> [3] LC_MONETARY=Chinese (Simplified)_China.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=Chinese (Simplified)_China.utf8
#>
#> time zone: Asia/Shanghai
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] bench_1.1.3 dplyr_1.1.4 data.table_1.16.0 tidyfst_1.8.1
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.5 jsonlite_1.8.8 compiler_4.4.1 tidyselect_1.2.1
#> [5] Rcpp_1.0.13 stringr_1.5.1 parallel_4.4.1 jquerylib_0.1.4
#> [9] scales_1.3.0 yaml_2.3.9 fastmap_1.2.0 ggplot2_3.5.1
#> [13] R6_2.5.1 generics_0.1.3 knitr_1.48 tibble_3.2.1
#> [17] munsell_0.5.1 bslib_0.7.0 pillar_1.9.0 rlang_1.1.4
#> [21] utf8_1.2.4 cachem_1.1.0 stringi_1.8.4 xfun_0.46
#> [25] sass_0.4.9 cli_3.6.3 withr_3.0.1 magrittr_2.0.3
#> [29] grid_4.4.1 digest_0.6.36 rstudioapi_0.16.0 fst_0.9.8
#> [33] lifecycle_1.0.4 vctrs_0.6.5 evaluate_0.24.0 glue_1.7.0
#> [37] colorspace_2.1-0 profmem_0.6.0 fansi_1.0.6 fstcore_0.9.18
#> [41] rmarkdown_2.27 tools_4.4.1 pkgconfig_2.0.3 htmltools_0.5.8.1
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.