The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
In data analysis tasks we often have data sets with multiple possible ID columns, but it’s not always clear which combination uniquely identifies each row.
sample_data1 has 125 row with 3 ID type columns and 3 value columns.
head(sample_data1)
#> # A tibble: 6 × 6
#> ID_COL1 ID_COL2 ID_COL3 VAL1 VAL2 VAL3
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2413 1034 1014 -0.0639 -1.16 -0.302
#> 2 2413 1034 1322 0.363 1.62 0.165
#> 3 2413 1034 2999 -0.00466 1.23 0.819
#> 4 2413 1034 3544 1.83 -2.58 -0.525
#> 5 2413 1034 9901 0.837 -0.442 -0.341
#> 6 2413 1122 1014 -0.894 -1.11 0.768Let’s use confirm_distinct iteratively to find the
uniquely identifying columns of sample_data1.
sample_data1 %>%
confirm_distinct(ID_COL1, ID_COL2)
#> database has 100 duplicates at ID_COL1, ID_COL2sample_data1 %>%
confirm_distinct(ID_COL1, ID_COL2, ID_COL3)
#> database is distinct at ID_COL1, ID_COL2, ID_COL3Here we can conclude that the combination of 3 ID columns is the primary key for the data.
These steps can be automated with the wrapper function
determine distinct.
sample_data1 %>%
determine_distinct(matches("ID"))
#> Warning: There was 1 warning in `dplyr::summarize()`.
#> ℹ In argument: `dplyr::across(.fns = ~dplyr::n_distinct(.) == rws)`.
#> Caused by warning:
#> ! Using `across()` without supplying `.cols` was deprecated in dplyr 1.1.0.
#> ℹ Please supply `.cols` instead.confirm_mapping tells you the mapping between two
columns in a data frame:
confirm_mapping gives the option to view which type of
mapping is associated with each individual row.
The overlap functions give a venn style description of
the values in 2 columns. This is especially useful before performing a
join function, and you want to confirm that the dataframes
have matching keys.
confirm_overlap is different from the other
confirm functions in that it takes 2 vectors as arguments,
instead of a data frame. This is to allow the user to test overlap
between different dataframes, or arbitrary vectors if necessary
confirm_overlap(iris$Sepal.Width, iris$Petal.Length) -> iris_overlap
#> # A tibble: 1 × 5
#> only_in_iris_Sepal.Width only_in_iris_Petal.Length shared_names total_names
#> <int> <int> <int> <int>
#> 1 12 32 11 55
#> # ℹ 1 more variable: pct_shared <chr>confirm_overlap returns a summary data frame invisibly
allowing you to access individual elements using the helper
functions.
print(iris_overlap)
#> # A tibble: 55 × 4
#> x iris_Sepal.Width iris_Petal.Length both_flags
#> <dbl> <dbl> <dbl> <dbl>
#> 1 3.5 1 1 2
#> 2 3 1 1 2
#> 3 3.2 1 0 1
#> 4 3.1 1 0 1
#> 5 3.6 1 1 2
#> 6 3.9 1 1 2
#> 7 3.4 1 0 1
#> 8 2.9 1 0 1
#> 9 3.7 1 1 2
#> 10 4 1 1 2
#> # ℹ 45 more rowsFind the elements unique to the first column
iris_overlap %>%
co_find_only_in_1() %>%
head()
#> # A tibble: 6 × 1
#> iris_Sepal.Width
#> <dbl>
#> 1 3.2
#> 2 3.1
#> 3 3.4
#> 4 2.9
#> 5 2.3
#> 6 2.8Find the elements unique to the second column
iris_overlap %>%
co_find_only_in_2() %>%
head()
#> # A tibble: 6 × 1
#> iris_Petal.Length
#> <dbl>
#> 1 1.4
#> 2 1.3
#> 3 1.5
#> 4 1.7
#> 5 1.6
#> 6 1.1Find the elements shared by both columns
determine_overlap takes a dataframe and a tidyselect
specification, and returns a tibble summarizing all of the pairwise
overlaps. Only pairs with matching types are tested.
Note that the overlap functions only test pairwise
overlaps. For multi-column and large-scale overlap testing, see Complex Upset
Plots
Get a frequency table of string lengths in a character column. Table is printed while the original df is returned invisibly with a column indicating the string lengths.
iris %>%
confirm_strlen(Species) -> species_len
#> Species_chr_len n percent
#> 6 50 33.3%
#> 9 50 33.3%
#> 10 50 33.3%output is a dataframe
head(species_len)
#> # A tibble: 6 × 6
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species_chr_len
#> <dbl> <dbl> <dbl> <dbl> <fct> <int>
#> 1 5.1 3.5 1.4 0.2 setosa 6
#> 2 4.9 3 1.4 0.2 setosa 6
#> 3 4.7 3.2 1.3 0.2 setosa 6
#> 4 4.6 3.1 1.5 0.2 setosa 6
#> 5 5 3.6 1.4 0.2 setosa 6
#> 6 5.4 3.9 1.7 0.4 setosa 6A helped function for the output of confirm_strlen that
filters the database for chosen string lengths.
species_len %>%
choose_strlen(len = 6) %>%
head()
#> # A tibble: 6 × 6
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species_chr_len
#> <dbl> <dbl> <dbl> <dbl> <fct> <int>
#> 1 5.1 3.5 1.4 0.2 setosa 6
#> 2 4.9 3 1.4 0.2 setosa 6
#> 3 4.7 3.2 1.3 0.2 setosa 6
#> 4 4.6 3.1 1.5 0.2 setosa 6
#> 5 5 3.6 1.4 0.2 setosa 6
#> 6 5.4 3.9 1.7 0.4 setosa 6Reproduction of diagnose from the dlookr package. Usually a good choice for first analyzing a data set.
iris %>%
diagnose()
#> # A tibble: 5 × 6
#> variables types missing_count missing_percent unique_count unique_rate
#> <chr> <chr> <int> <dbl> <int> <dbl>
#> 1 Sepal.Length numeric 0 0 35 0.233
#> 2 Sepal.Width numeric 0 0 23 0.153
#> 3 Petal.Length numeric 0 0 43 0.287
#> 4 Petal.Width numeric 0 0 22 0.147
#> 5 Species factor 0 0 3 0.02These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.