The bulkreadr package in R includes specialized functions beyond bulk data reading, aimed at enhancing data analysis efficiency. These functions are designed to operate on individual vectors, except for inspect_na() and fill_missing_values(), which work on data frames.
pull_out() is similar to [. It acts on vectors, matrices, arrays and lists to extract or replace parts. It is pleasant to use with the magrittr (%>%) and base (|>) pipe operators.
library(bulkreadr)
library(dplyr)
top_10_richest_nig <- c("Aliko Dangote", "Mike Adenuga", "Femi Otedola", "Arthur Eze", "Abdulsamad Rabiu", "Cletus Ibeto", "Orji Uzor Kalu", "ABC Orjiakor", "Jimoh Ibrahim", "Tony Elumelu")
top_10_richest_nig %>%
  pull_out(c(1, 5, 2))
#> [1] "Aliko Dangote" "Abdulsamad Rabiu" "Mike Adenuga"
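Since pull_out() is described as similar to [, the same selection can be sketched in base R for comparison. This is only an illustration of the underlying indexing, not the package's implementation; the names picked and picked_fn are introduced here for the example.

```r
# Base R only: the same positions selected with `[` directly.
top_10_richest_nig <- c(
  "Aliko Dangote", "Mike Adenuga", "Femi Otedola", "Arthur Eze",
  "Abdulsamad Rabiu", "Cletus Ibeto", "Orji Uzor Kalu", "ABC Orjiakor",
  "Jimoh Ibrahim", "Tony Elumelu"
)

# Ordinary bracket indexing: elements come back in the order requested.
picked <- top_10_richest_nig[c(1, 5, 2)]

# `[` is itself a function, so it can be called in prefix form.
picked_fn <- `[`(top_10_richest_nig, c(1, 5, 2))

print(picked)
print(identical(picked, picked_fn))
```

The prefix form shows why a named wrapper reads more naturally at the end of a pipeline than the bracket operator does.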
convert_to_date() parses an input vector into a POSIXct date-time or Date object. It can also convert Excel serial date numbers such as 42370 into date values such as 2016-01-01.
## ** heterogeneous dates **
dates <- c(
44869, "22.09.2022", NA, "02/27/92", "01-19-2022",
"13-01- 2022", "2023", "2023-2", 41750.2, 41751.99,
"11 07 2023", "2023-4"
)
# Convert to POSIXct or Date object
convert_to_date(dates)
#> [1] "2022-11-04" "2022-09-22" NA "1992-02-27" "2022-01-19"
#> [6] "2022-01-13" "2023-01-01" "2023-02-01" "2014-04-21" "2014-04-22"
#> [11] "2023-07-11" "2023-04-01"
# It can also convert a date-time object to a Date object
convert_to_date(lubridate::now())
#> [1] "2024-05-26"
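The Excel serial numbers in the example can be checked by hand with base R. The sketch below assumes the Windows Excel 1900 date system, in which day 0 corresponds to 1899-12-30; that origin is an assumption of this illustration, and how convert_to_date() handles it internally is not stated here.

```r
# Base R only: convert Excel serial day numbers to Date values.
# Assumption: Excel 1900 date system, so the origin is 1899-12-30.
excel_serial <- c(42370, 44869, 41750.2, 41751.99)

# floor() drops the fractional part, which encodes the time of day.
converted <- as.Date(floor(excel_serial), origin = "1899-12-30")

print(format(converted))
```

The resulting dates agree with the corresponding entries in the convert_to_date() output above (2022-11-04, 2014-04-21 and 2014-04-22) and with the 42370 to 2016-01-01 example.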
inspect_na() summarizes the rate of missingness in each column of a data frame. For a grouped data frame, the rate of missingness is summarized separately for each group.
# dataframe summary
inspect_na(airquality)
#> # A tibble: 6 × 3
#> col_name cnt pcnt
#> <chr> <int> <dbl>
#> 1 Ozone 37 24.2
#> 2 Solar.R 7 4.58
#> 3 Wind 0 0
#> 4 Temp 0 0
#> 5 Month 0 0
#> # ℹ 1 more row
Grouped dataframe summary
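As a rough illustration of per-group missingness, the sketch below uses only base R rather than inspect_na(). It counts missing Ozone and Solar.R readings for each Month of airquality; the object names na_flags and na_by_month are introduced here for the example, and the layout differs from the tibble that inspect_na() returns.

```r
# Base R only: count missing values per group (Month) in airquality.
na_flags <- as.data.frame(is.na(airquality[c("Ozone", "Solar.R")]))

# Summing logical flags within each month gives the NA count per column.
na_by_month <- aggregate(
  na_flags,
  by = list(Month = airquality$Month),
  FUN = sum
)

print(na_by_month)

# The per-group counts add up to the ungrouped totals reported above.
print(colSums(na_by_month[c("Ozone", "Solar.R")]))
```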
fill_missing_values() addresses missing values in a data frame. It uses imputation by function, also known as column-based imputation: each missing value is replaced by a summary value computed from the same column. For continuous variables it supports several imputation methods, including minimum, maximum, mean, median, harmonic mean, and geometric mean. For categorical variables, missing values are replaced with the mode of the column. Because every replacement is derived from its own column, the result is a complete dataset whose filled values stay consistent with the rest of that column.
df <- tibble::tibble(
Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
Sepal.Width = c(4.1, 3.6, 3, 3, 2.9, 2.5, 2.4),
Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
Petal_Width = c(NA, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
Species = c("setosa", NA, "versicolor", "setosa",
NA, "virginica", "setosa"
)
)
df
#> # A tibble: 7 × 5
#> Sepal_Length Sepal.Width Petal_Length Petal_Width Species
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 5.2 4.1 1.5 NA setosa
#> 2 5 3.6 1.4 0.2 <NA>
#> 3 5.7 3 4.2 1.2 versicolor
#> 4 NA 3 1.4 0.2 setosa
#> 5 6.2 2.9 NA 1.3 <NA>
#> # ℹ 2 more rows
If you do not specify selected_variables (i.e., leave it as NULL), the function will impute missing values for all columns in the dataframe.
# Impute using the mean
result_df_mean <- fill_missing_values(df, method = "mean")
result_df_mean
#> # A tibble: 7 × 5
#> Sepal_Length Sepal.Width Petal_Length Petal_Width Species
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 5.2 4.1 1.5 0.94 setosa
#> 2 5 3.6 1.4 0.2 setosa
#> 3 5.7 3 4.2 1.2 versicolor
#> 4 5.72 3 1.4 0.2 setosa
#> 5 6.2 2.9 3 1.3 setosa
#> # ℹ 2 more rows
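To make the column-based idea concrete, the mean-imputed values above can be reproduced with a few lines of base R. This is a minimal sketch under stated assumptions, not the package's implementation: impute_column() is a hypothetical helper, and only three of the five columns are used to keep it short.

```r
# Base R only: mean for numeric columns, mode for character columns.
df <- data.frame(
  Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
  Petal_Width  = c(NA, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
  Species      = c("setosa", NA, "versicolor", "setosa",
                   NA, "virginica", "setosa"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of bulkreadr): fill NAs in one column.
impute_column <- function(x) {
  if (is.numeric(x)) {
    fill <- mean(x, na.rm = TRUE)             # mean of observed values
  } else {
    counts <- table(x)                        # table() ignores NA by default
    fill <- names(counts)[which.max(counts)]  # most frequent value (mode)
  }
  x[is.na(x)] <- fill
  x
}

imputed <- df
imputed[] <- lapply(df, impute_column)  # apply column by column

print(imputed)
```

Row 4 of Sepal_Length becomes 34.3 / 6 (about 5.72), row 1 of Petal_Width becomes 0.94, and the missing Species entries become "setosa", matching the output shown above.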
If you specify column names, only those columns will be imputed. For example, impute for variables Petal_Length and Petal_Width using the geometric mean.
result_df_geomean <- fill_missing_values(
  df,
  selected_variables = c("Petal_Length", "Petal_Width"),
  method = "geometric"
)
result_df_geomean
#> # A tibble: 7 × 5
#> Sepal_Length Sepal.Width Petal_Length Petal_Width Species
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 5.2 4.1 1.5 0.732 setosa
#> 2 5 3.6 1.4 0.2 <NA>
#> 3 5.7 3 4.2 1.2 versicolor
#> 4 NA 3 1.4 0.2 setosa
#> 5 6.2 2.9 2.22 1.3 <NA>
#> # ℹ 2 more rows
If you specify column positions, only the columns at those positions will be imputed.
# Impute using the maximum method
result_df_max <- fill_missing_values(df, selected_variables = c(2, 3), method = "max")
result_df_max
#> # A tibble: 7 × 5
#> Sepal_Length Sepal.Width Petal_Length Petal_Width Species
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 5.2 4.1 1.5 NA setosa
#> 2 5 3.6 1.4 0.2 <NA>
#> 3 5.7 3 4.2 1.2 versicolor
#> 4 NA 3 1.4 0.2 setosa
#> 5 6.2 2.9 5.8 1.3 <NA>
#> # ℹ 2 more rows
You can apply fill_missing_values() to a grouped data frame by combining it with grouping and map functions. Here is an example of how to do this:
sample_iris <- tibble::tibble(
Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
Petal_Width = c(0.3, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
Species = c("setosa", "setosa", "versicolor", "setosa",
"virginica", "virginica", "setosa")
)
sample_iris
#> # A tibble: 7 × 4
#> Sepal_Length Petal_Length Petal_Width Species
#> <dbl> <dbl> <dbl> <chr>
#> 1 5.2 1.5 0.3 setosa
#> 2 5 1.4 0.2 setosa
#> 3 5.7 4.2 1.2 versicolor
#> 4 NA 1.4 0.2 setosa
#> 5 6.2 NA 1.3 virginica
#> # ℹ 2 more rows
# map_df() comes from purrr, which is not attached above, so it is namespaced here
sample_iris %>%
  group_by(Species) %>%
  group_split() %>%
  purrr::map_df(fill_missing_values, method = "median")
#> # A tibble: 7 × 4
#> Sepal_Length Petal_Length Petal_Width Species
#> <dbl> <dbl> <dbl> <chr>
#> 1 5.2 1.5 0.3 setosa
#> 2 5 1.4 0.2 setosa
#> 3 5.2 1.4 0.2 setosa
#> 4 5.5 3.7 0.2 setosa
#> 5 5.7 4.2 1.2 versicolor
#> # ℹ 2 more rows
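For comparison, the same group-wise median imputation can be sketched in base R with ave(). This is an illustration only, not how fill_missing_values() works internally; fill_median() and filled are names introduced for the example. Unlike the group_split() pipeline, ave() keeps the rows in their original order.

```r
# Base R only: fill NAs with the median of the same Species group.
sample_iris <- data.frame(
  Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
  Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
  Petal_Width  = c(0.3, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
  Species      = c("setosa", "setosa", "versicolor", "setosa",
                   "virginica", "virginica", "setosa"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace NAs in a vector with its observed median.
fill_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

filled <- sample_iris
for (col in c("Sepal_Length", "Petal_Length", "Petal_Width")) {
  # ave() applies fill_median() within each Species group.
  filled[[col]] <- ave(sample_iris[[col]], sample_iris$Species,
                       FUN = fill_median)
}

print(filled)
```

The setosa row with a missing Sepal_Length receives 5.2 and the setosa row with a missing Petal_Width receives 0.2, the same values that appear in the grouped output above.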