The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
Tools for assessing data quality, performing exploratory analysis, and semi-automatic preprocessing of messy data with change tracking for integral dataset cleaning.
# The stable version of clickR can be installed from CRAN with:
install.packages("clickR")
# Alternatively, the development version can be installed from Github:
::install_github("David-hervas/clickR") remotes
clickR functions are divided in three groups:
Functions for assessing data quality and performing exploratory
analyses: peek()
descriptive()
cluster_var()
mine.plot()
outliers()
bivariate_outliers()
Functions for detecting and correcting errors:
nice_names()
fix_factors()
fix_dates()
fix_numerics()
fix_levels()
fix_NA()
fix_concat()
remove_empty()
Functions for reviewing and, potentially, restoring the changes
applied to the data. track_changes()
restore_changes()
manual_fix()
Each function has its corresponding help page, which can be accessed
by the standard procedure in R: typing ?name_of_function
in
the console or, in R Studio, by clicking on the function name and
pressing F1.
A simplified usage procedure would be:
# Explore the data with some of the exploratory analysis functions:
descriptive(mtcars_messy)
# Data frame with 32 observations and 13 variables.
# Numeric variables (7)
# Min 1st Q. Median 3rd Q. Max Mean SD Kurtosis Skewness Modes NAs Distribution
# cyl 4.0 4.000 6.00 8.0 8.0 6.188 1.786 -1.762 -0.175 2 0 [#############:##############]
# disp 71.1 120.825 196.30 326.0 472.0 230.722 123.939 -1.207 0.382 2 0 |---[####:########]----------|
# hp 52.0 96.500 123.00 180.0 335.0 146.688 68.563 -0.136 0.726 1 0 |----[#:#####]---------------|
# qsec 14.5 16.892 17.71 18.9 22.9 17.849 1.787 0.335 0.369 1 0 |-------[##:###]-------------|
# am 0.0 0.000 0.00 1.0 1.0 0.406 0.499 -1.925 0.364 2 0 :############################]
# nÂș Gears 3.0 3.000 4.00 4.0 5.0 3.688 0.738 -1.070 0.529 3 0 [#############:--------------|
# carb 1.0 2.000 2.00 4.0 8.0 2.812 1.615 1.257 1.051 2 0 |---:#######]----------------|
#
# Categorical variables (6)
# N. Classes Classes Mode Prop. mode Anti-mode Prop. Anti-mode NAs
# Mpg 28 10.4/15.2/21/30.4/19 10.4 0.062 19.2 0.031 0
# drat 22 3.07/3.92/2.76/3.08/ 3.07 0.094 - 0.031 0
# wt 29 3.44/3.57/1.513/1.61 3.44 0.094 1.513 0.031 0
# vs 4 0/1/?/NULL 0 0.5 ? 0.031 0
# date 32 06/25/1974/12/22/73/ 06/25/1974 0.031 06/25/1974 0.031 0
# maker 26 Mrc/Ft/AMC/Cdllc/Cmr Merc 0.188 AMC 0.031 0
# Use some of the fix functions to correct errors:
<- fix_NA(mtcars_messy)
mtcars_messy <- fix_dates(mtcars_mesy)
mtcars_messy <- fix_numerics(mtcars_messy)
mtcars_messy
# Review the performed changes
track_changes(mtcars_messy)
# variable observation original new fun
# drat Duster 360 <NA> fix_NA
# drat Honda Civic - <NA> fix_NA
# vs Cadillac Fleetwood ? <NA> fix_NA
# vs Porsche 914-2 NULL <NA> fix_NA
# Mpg all character numeric fix_numerics
# drat all character numeric fix_numerics
# wt all character numeric fix_numerics
# Mpg Datsun 710 22,8 22.8 fix_numerics
# Mpg Hornet 4 Drive 21.,4 21.4 fix_numerics
# Mpg Duster 360 14.3 mpg 14.3 fix_numerics
# Mpg Merc 280 19.2 19.2 fix_numerics
# Mpg Merc 280C 1.78e01 17.8 fix_numerics
# wt Lincoln Continental 5,424 5.424 fix_numerics
# wt Chrysler Imperial 5,345 5.345 fix_numerics
# Id needed, restore the unwanted changes:
<- restore_changes(track_changes(mtcars_messy, fun == "fix_numerics" & variable == "wt")) mtcars_messy
New versions of clickR provide parallelization for some of the data-cleaning functions via the future package.
library(future)
plan(multisession(workers=2)) #Set number of workers
fix_dates(mtcars_messy)
This functionality still needs some optimization. So be aware that, in some (rare) specific cases, parallelized tasks might take more time than non-parallelized tasks.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.