The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
library(validata)
library(tidyselect)In data analysis tasks we often have data sets with multiple possible ID columns, but it’s not always clear which combination uniquely identifies each row.
sample_data1 has 125 row with 3 ID type columns and 3 value columns.
head(sample_data1)
#> # A tibble: 6 x 6
#> ID_COL1 ID_COL2 ID_COL3 VAL1 VAL2 VAL3
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2413 1034 1014 -0.0639 -1.16 -0.302
#> 2 2413 1034 1322 0.363 1.62 0.165
#> 3 2413 1034 2999 -0.00466 1.23 0.819
#> 4 2413 1034 3544 1.83 -2.58 -0.525
#> 5 2413 1034 9901 0.837 -0.442 -0.341
#> 6 2413 1122 1014 -0.894 -1.11 0.768Let’s use confirm_distinct iteratively to find the uniquely identifying columns of sample_data1.
sample_data1 %>%
confirm_distinct(ID_COL1)
#> database has 120 duplicates at ID_COL1sample_data1 %>%
confirm_distinct(ID_COL1, ID_COL2)
#> database has 100 duplicates at ID_COL1, ID_COL2sample_data1 %>%
confirm_distinct(ID_COL1, ID_COL2, ID_COL3)
#> database is distinct at ID_COL1, ID_COL2, ID_COL3Here we can conclude that the combination of 3 ID columns is the primary key for the data.
These steps can be automated with the wrapper function determine distinct.
sample_data1 %>%
determine_distinct(matches("ID"))confirm_mapping tells you the mapping between two columns in a data frame:
confirm_mapping gives the option to view which type of mapping is associated with each individual row.
sample_data1 %>%
confirm_mapping(ID_COL1, ID_COL2, view = F)
#> many - many mapping between ID_COL1 and ID_COL2sample_data1 %>%
determine_mapping(everything())The overlap functions give a venn style description of the values in 2 columns. This is especially useful before performing a join function, and you want to confirm that the dataframes have matching keys.
confirm_overlap is different from the other confirm functions in that it takes 2 vectors as arguments, instead of a data frame. This is to allow the user to test overlap between different dataframes, or arbitrary vectors if necessary
confirm_overlap(iris$Sepal.Width, iris$Petal.Length) -> iris_overlap
#> # A tibble: 1 x 5
#> only_in_iris_Sepal.W… only_in_iris_Petal.… shared_names total_names pct_shared
#> <int> <int> <int> <int> <chr>
#> 1 12 32 11 55 20%confirm_overlap returns a summary data frame invisibly allowing you to access individual elements using the helper functions.
print(iris_overlap)
#> # A tibble: 55 x 4
#> x iris_Sepal.Width iris_Petal.Length both_flags
#> <dbl> <dbl> <dbl> <dbl>
#> 1 3.5 1 1 2
#> 2 3 1 1 2
#> 3 3.2 1 0 1
#> 4 3.1 1 0 1
#> 5 3.6 1 1 2
#> 6 3.9 1 1 2
#> 7 3.4 1 0 1
#> 8 2.9 1 0 1
#> 9 3.7 1 1 2
#> 10 4 1 1 2
#> # … with 45 more rowsFind the elements unique to the first column
iris_overlap %>%
co_find_only_in_1() %>%
head()
#> # A tibble: 6 x 1
#> iris_Sepal.Width
#> <dbl>
#> 1 3.2
#> 2 3.1
#> 3 3.4
#> 4 2.9
#> 5 2.3
#> 6 2.8Find the elements unique to the second column
iris_overlap %>%
co_find_only_in_2() %>%
head()
#> # A tibble: 6 x 1
#> iris_Petal.Length
#> <dbl>
#> 1 1.4
#> 2 1.3
#> 3 1.5
#> 4 1.7
#> 5 1.6
#> 6 1.1Find the elements shared by both columns
iris_overlap %>%
co_find_in_both() %>%
head()
#> # A tibble: 6 x 1
#> x
#> <dbl>
#> 1 3.5
#> 2 3
#> 3 3.6
#> 4 3.9
#> 5 3.7
#> 6 4determine_overlap takes a dataframe and a tidyselect specification, and returns a tibble summarizing all of the pairwise overlaps. Only pairs with matching types are tested.
Note that the overlap functions only test pairwise overlaps. For multi-column and large-scale overlap testing, see Complex Upset Plots
Get a frequency table of string lengths in a character column. Table is printed while the original df is returned invisibly with a column indicating the string lengths.
iris %>%
confirm_strlen(Species) -> species_len
#> Species_chr_len n percent
#> 6 50 33.3%
#> 9 50 33.3%
#> 10 50 33.3%output is a dataframe
head(species_len)
#> # A tibble: 6 x 6
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species_chr_len
#> <dbl> <dbl> <dbl> <dbl> <fct> <int>
#> 1 5.1 3.5 1.4 0.2 setosa 6
#> 2 4.9 3 1.4 0.2 setosa 6
#> 3 4.7 3.2 1.3 0.2 setosa 6
#> 4 4.6 3.1 1.5 0.2 setosa 6
#> 5 5 3.6 1.4 0.2 setosa 6
#> 6 5.4 3.9 1.7 0.4 setosa 6A helped function for the output of confirm_strlen that filters the database for chosen string lengths.
species_len %>%
choose_strlen(len = 6) %>%
head()
#> # A tibble: 6 x 6
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species_chr_len
#> <dbl> <dbl> <dbl> <dbl> <fct> <int>
#> 1 5.1 3.5 1.4 0.2 setosa 6
#> 2 4.9 3 1.4 0.2 setosa 6
#> 3 4.7 3.2 1.3 0.2 setosa 6
#> 4 4.6 3.1 1.5 0.2 setosa 6
#> 5 5 3.6 1.4 0.2 setosa 6
#> 6 5.4 3.9 1.7 0.4 setosa 6Reproduction of diagnose from the dlookr package. Usually a good choice for first analyzing a data set.
iris %>%
diagnose()
#> # A tibble: 5 x 6
#> variables types missing_count missing_percent unique_count unique_rate
#> <chr> <chr> <int> <dbl> <int> <dbl>
#> 1 Sepal.Length numeric 0 0 35 0.233
#> 2 Sepal.Width numeric 0 0 23 0.153
#> 3 Petal.Length numeric 0 0 43 0.287
#> 4 Petal.Width numeric 0 0 22 0.147
#> 5 Species factor 0 0 3 0.02These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.