library(validata)
library(tidyselect)
In data analysis tasks we often have data sets with multiple possible ID columns, but it’s not always clear which combination uniquely identifies each row.
sample_data1 has 125 rows, with 3 ID-type columns and 3 value columns.
head(sample_data1)
#> # A tibble: 6 x 6
#> ID_COL1 ID_COL2 ID_COL3 VAL1 VAL2 VAL3
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2413 1034 1014 -0.0639 -1.16 -0.302
#> 2 2413 1034 1322 0.363 1.62 0.165
#> 3 2413 1034 2999 -0.00466 1.23 0.819
#> 4 2413 1034 3544 1.83 -2.58 -0.525
#> 5 2413 1034 9901 0.837 -0.442 -0.341
#> 6 2413 1122 1014 -0.894 -1.11 0.768
Let’s use confirm_distinct iteratively to find the uniquely identifying columns of sample_data1.
sample_data1 %>%
  confirm_distinct(ID_COL1)
#> database has 120 duplicates at ID_COL1
sample_data1 %>%
  confirm_distinct(ID_COL1, ID_COL2)
#> database has 100 duplicates at ID_COL1, ID_COL2
sample_data1 %>%
  confirm_distinct(ID_COL1, ID_COL2, ID_COL3)
#> database is distinct at ID_COL1, ID_COL2, ID_COL3
Here we can conclude that the combination of all 3 ID columns is the primary key for the data.
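The idea behind this check can be sketched in base R. This is only an illustration of the concept, not how confirm_distinct is implemented; the toy data frame and the helper is_distinct are hypothetical names introduced for the example.

```r
# Toy data frame (hypothetical; not the package's sample_data1)
toy <- data.frame(
  id1 = c(1, 1, 1, 1),
  id2 = c(1, 1, 2, 2),
  id3 = c(1, 2, 1, 2)
)

# Hypothetical helper: TRUE when no combination of the given columns repeats.
# anyDuplicated() returns 0 when every row of the subset is unique.
is_distinct <- function(df, cols) {
  anyDuplicated(df[, cols, drop = FALSE]) == 0
}

is_distinct(toy, "id1")                   # FALSE
is_distinct(toy, c("id1", "id2"))         # FALSE
is_distinct(toy, c("id1", "id2", "id3"))  # TRUE
```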
These steps can be automated with the wrapper function determine_distinct.
sample_data1 %>%
  determine_distinct(matches("ID"))
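One way such a search could work is to try column combinations in order of increasing size and stop at the first one that is distinct. The sketch below is an assumption about the general approach, not a description of what determine_distinct does internally; find_key and the toy data are hypothetical.

```r
# Toy data frame (hypothetical)
toy <- data.frame(
  id1 = c(1, 1, 1, 1),
  id2 = c(1, 1, 2, 2),
  id3 = c(1, 2, 1, 2)
)

# Hypothetical helper: return the first smallest set of columns whose
# combined values are unique across rows, or NULL if none exists.
find_key <- function(df, cols) {
  for (k in seq_along(cols)) {
    # All combinations of k column names, as a list of character vectors
    for (combo in combn(cols, k, simplify = FALSE)) {
      if (anyDuplicated(df[, combo, drop = FALSE]) == 0) {
        return(combo)
      }
    }
  }
  NULL
}

key <- find_key(toy, c("id1", "id2", "id3"))
print(key)
```

Note that the number of combinations grows quickly with the number of candidate columns, so a brute-force search like this is only practical for a handful of ID columns.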
confirm_mapping tells you the mapping between two columns in a data frame. It also gives the option to view which type of mapping is associated with each individual row.
sample_data1 %>%
  confirm_mapping(ID_COL1, ID_COL2, view = FALSE)
#> many - many mapping between ID_COL1 and ID_COL2
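To make the mapping types concrete, here is a base R sketch that classifies the relationship between two vectors. It is an illustration only: mapping_type is a hypothetical helper, and the convention used (left side is "many" when some y value is paired with several x values, right side is "many" when some x value is paired with several y values) is an assumption that may differ from the package's.

```r
# Hypothetical helper: classify the mapping between two equal-length vectors
mapping_type <- function(x, y) {
  # Keep each observed (x, y) pair once
  pairs <- unique(data.frame(x = x, y = y))
  # A repeated y among unique pairs means several x values share that y
  left <- if (anyDuplicated(pairs$y) > 0) "many" else "one"
  # A repeated x among unique pairs means that x maps to several y values
  right <- if (anyDuplicated(pairs$x) > 0) "many" else "one"
  paste(left, "-", right)
}

mapping_type(c(1, 2, 3), c("a", "b", "c"))        # "one - one"
mapping_type(c(1, 1, 2), c("a", "b", "c"))        # "one - many"
mapping_type(c(1, 1, 2, 2), c("a", "b", "a", "b")) # "many - many"
```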
sample_data1 %>%
  determine_mapping(everything())
The overlap functions give a Venn-style description of the values in 2 columns. This is especially useful before performing a join, when you want to confirm that the data frames have matching keys.
confirm_overlap is different from the other confirm functions in that it takes 2 vectors as arguments, instead of a data frame. This allows the user to test overlap between different data frames, or arbitrary vectors if necessary.
confirm_overlap(iris$Sepal.Width, iris$Petal.Length) -> iris_overlap
#> # A tibble: 1 x 5
#> only_in_iris_Sepal.W… only_in_iris_Petal.… shared_names total_names pct_shared
#> <int> <int> <int> <int> <chr>
#> 1 12 32 11 55 20%
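The quantities in this summary correspond to ordinary set operations on the distinct values of each vector. A minimal base R sketch, using made-up vectors a and b rather than the iris columns, and assuming pct_shared is the shared count divided by the total number of distinct values (which matches 11 / 55 = 20% above):

```r
# Two hypothetical key vectors
a <- c(1, 2, 3, 4)
b <- c(3, 4, 5)

only_a <- setdiff(a, b)        # values found only in a
only_b <- setdiff(b, a)        # values found only in b
shared <- intersect(a, b)      # values found in both
total  <- length(union(a, b))  # number of distinct values overall

# Share of distinct values that appear in both vectors
pct_shared <- length(shared) / total

print(list(only_a = only_a, only_b = only_b, shared = shared,
           total = total, pct_shared = pct_shared))
```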
confirm_overlap invisibly returns a summary data frame, allowing you to access individual elements using the helper functions.
print(iris_overlap)
#> # A tibble: 55 x 4
#> x iris_Sepal.Width iris_Petal.Length both_flags
#> <dbl> <dbl> <dbl> <dbl>
#> 1 3.5 1 1 2
#> 2 3 1 1 2
#> 3 3.2 1 0 1
#> 4 3.1 1 0 1
#> 5 3.6 1 1 2
#> 6 3.9 1 1 2
#> 7 3.4 1 0 1
#> 8 2.9 1 0 1
#> 9 3.7 1 1 2
#> 10 4 1 1 2
#> # … with 45 more rows
Find the elements unique to the first column
iris_overlap %>%
  co_find_only_in_1() %>%
  head()
#> # A tibble: 6 x 1
#> iris_Sepal.Width
#> <dbl>
#> 1 3.2
#> 2 3.1
#> 3 3.4
#> 4 2.9
#> 5 2.3
#> 6 2.8
Find the elements unique to the second column
iris_overlap %>%
  co_find_only_in_2() %>%
  head()
#> # A tibble: 6 x 1
#> iris_Petal.Length
#> <dbl>
#> 1 1.4
#> 2 1.3
#> 3 1.5
#> 4 1.7
#> 5 1.6
#> 6 1.1
Find the elements shared by both columns
iris_overlap %>%
  co_find_in_both() %>%
  head()
#> # A tibble: 6 x 1
#> x
#> <dbl>
#> 1 3.5
#> 2 3
#> 3 3.6
#> 4 3.9
#> 5 3.7
#> 6 4
determine_overlap takes a data frame and a tidyselect specification, and returns a tibble summarizing all of the pairwise overlaps. Only pairs with matching types are tested.
Note that the overlap functions only test pairwise overlaps. For multi-column and large-scale overlap testing, see Complex Upset Plots.
Get a frequency table of string lengths in a character column. The table is printed, while the original data frame is returned invisibly with an added column indicating the string lengths.
iris %>%
  confirm_strlen(Species) -> species_len
#> Species_chr_len n percent
#> 6 50 33.3%
#> 9 50 33.3%
#> 10 50 33.3%
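The same frequency table can be approximated in base R with nchar and table. This is a sketch of the underlying computation, not the package's implementation; the variable names are illustrative.

```r
# Species is a factor in iris, so convert to character before measuring
species <- as.character(iris$Species)

# Count how many values have each string length
len_counts <- table(nchar(species))

# Express the counts as percentages of all rows
len_pct <- round(100 * prop.table(len_counts), 1)

print(len_counts)
print(len_pct)
```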
The output is a data frame.
head(species_len)
#> # A tibble: 6 x 6
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species_chr_len
#> <dbl> <dbl> <dbl> <dbl> <fct> <int>
#> 1 5.1 3.5 1.4 0.2 setosa 6
#> 2 4.9 3 1.4 0.2 setosa 6
#> 3 4.7 3.2 1.3 0.2 setosa 6
#> 4 4.6 3.1 1.5 0.2 setosa 6
#> 5 5 3.6 1.4 0.2 setosa 6
#> 6 5.4 3.9 1.7 0.4 setosa 6
A helper function for the output of confirm_strlen that filters the data frame for chosen string lengths.
species_len %>%
  choose_strlen(len = 6) %>%
  head()
#> # A tibble: 6 x 6
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species_chr_len
#> <dbl> <dbl> <dbl> <dbl> <fct> <int>
#> 1 5.1 3.5 1.4 0.2 setosa 6
#> 2 4.9 3 1.4 0.2 setosa 6
#> 3 4.7 3.2 1.3 0.2 setosa 6
#> 4 4.6 3.1 1.5 0.2 setosa 6
#> 5 5 3.6 1.4 0.2 setosa 6
#> 6 5.4 3.9 1.7 0.4 setosa 6
Reproduction of diagnose from the dlookr package. Usually a good choice for a first look at a data set.
iris %>%
  diagnose()
#> # A tibble: 5 x 6
#> variables types missing_count missing_percent unique_count unique_rate
#> <chr> <chr> <int> <dbl> <int> <dbl>
#> 1 Sepal.Length numeric 0 0 35 0.233
#> 2 Sepal.Width numeric 0 0 23 0.153
#> 3 Petal.Length numeric 0 0 43 0.287
#> 4 Petal.Width numeric 0 0 22 0.147
#> 5 Species factor 0 0 3 0.02
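For readers curious how the count columns in such a summary can be derived, here is a rough base R sketch. It is not the diagnose implementation from validata or dlookr; summary_tbl is an illustrative name, and unique_rate is assumed to be the unique count divided by the number of rows, which agrees with the values shown above (for example 35 / 150 = 0.233).

```r
# Per-column diagnostics for the built-in iris data
summary_tbl <- data.frame(
  variables = names(iris),
  # First class of each column, e.g. "numeric" or "factor"
  types = vapply(iris, function(col) class(col)[1], character(1)),
  # Number of NA values in each column
  missing_count = vapply(iris, function(col) sum(is.na(col)), integer(1)),
  # Number of distinct values in each column
  unique_count = vapply(iris, function(col) length(unique(col)), integer(1)),
  row.names = NULL
)

# Assumed definition: distinct values as a fraction of all rows
summary_tbl$unique_rate <- summary_tbl$unique_count / nrow(iris)

print(summary_tbl)
```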