The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
You can use scrutiny to detect duplicate values in any dataset. Duplicates can go a long way in assessing the reliability of published research.
This vignette walks you through scrutiny’s tools for detecting,
counting, and summarizing duplicates. It uses the pigs4
dataset as a simple example:
pigs4
#> # A tibble: 5 × 3
#> snout tail wings
#> <chr> <chr> <chr>
#> 1 4.73 6.88 6.09
#> 2 8.13 7.33 8.27
#> 3 4.22 5.17 4.40
#> 4 4.22 7.57 5.92
#> 5 5.17 8.13 5.17duplicate_count()A good first step is to create a frequency table. To do so, use
duplicate_count():
pigs4 %>%
duplicate_count()
#> # A tibble: 11 × 4
#> value frequency locations locations_n
#> <chr> <int> <chr> <int>
#> 1 5.17 3 snout, tail, wings 3
#> 2 4.22 2 snout 1
#> 3 8.13 2 snout, tail 2
#> 4 4.73 1 snout 1
#> 5 6.88 1 tail 1
#> 6 7.33 1 tail 1
#> 7 7.57 1 tail 1
#> 8 4.40 1 wings 1
#> 9 5.92 1 wings 1
#> 10 6.09 1 wings 1
#> 11 8.27 1 wings 1It returns a tibble (data frame) that lists each unique
value. The tibble is ordered by the frequency
of values in the input data frame, so the values that appear most often
are at the top. The locations are the names of all the
columns in which a given value appears. They are counted by
locations_n.
For example, 5.17 is the most frequent value in
pigs4. It appears 3 times (frequency), namely
in the snout, tail, and wings
columns; so locations_n is also 3. The next
most frequent value is 4.22 which appears twice, but both
of these instances are in the snout column, so
locations_n is 1.
Run audit() after duplicate_count() to get
summary statistics for the two numeric columns, frequency
and locations_n:
duplicate_count_colpair()Sometimes, a sequence of data may be repeated in multiple columns.
duplicate_count_colpair() helps find such cases:
pigs4 %>%
duplicate_count_colpair()
#> # A tibble: 3 × 7
#> x y count total_x total_y rate_x rate_y
#> <chr> <chr> <int> <int> <int> <dbl> <dbl>
#> 1 snout tail 2 5 5 0.4 0.4
#> 2 snout wings 1 5 5 0.2 0.2
#> 3 tail wings 1 5 5 0.2 0.2x and y represent all combinations of
columns in pigs4. The count is the number of
values that appear in both respective columns. total_x and
total_y are the numbers of non-missing values in the
original columns listed under x and y.
Similarly, rate_x is the rate of x values that
also appear in y, and rate_y is the rate of
y values that also appear in x. If there are
no missing values, total_x is the same as
total_y, and rate_x is the same as
rate_y.
Here, snout and tail are the column pair
with the most overlap: 2 out of 5 values are the same, a duplication
rate of 0.4.
Again, you can call audit() for summary statistics:
pigs4 %>%
duplicate_count_colpair() %>%
audit()
#> # A tibble: 5 × 8
#> term mean sd median min max na_count na_rate
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 count 1.33 0.577 1 1 2 0 0
#> 2 total_x 5 0 5 5 5 0 0
#> 3 total_y 5 0 5 5 5 0 0
#> 4 rate_x 0.267 0.115 0.2 0.2 0.4 0 0
#> 5 rate_y 0.267 0.115 0.2 0.2 0.4 0 0duplicate_tally()Unlike the other two functions, duplicate_tally()
largely preserves the structure of the original data frame. It only adds
a column ending on _n next to each original column. The new
columns count how often the values to their left appear in the data
frame as a whole:
pigs4 %>%
duplicate_tally()
#> # A tibble: 5 × 6
#> snout snout_n tail tail_n wings wings_n
#> <chr> <int> <chr> <int> <chr> <int>
#> 1 4.73 1 6.88 1 6.09 1
#> 2 8.13 2 7.33 1 8.27 1
#> 3 4.22 2 5.17 3 4.40 1
#> 4 4.22 2 7.57 1 5.92 1
#> 5 5.17 3 8.13 2 5.17 3For example, 4.22 appears twice in snout
but not in any other column, so the corresponding entries in
snout_n are 2. Compare this to
8.13, which appears once in snout and once in
tail (but not in wings), so both observations
are marked 2 in the _n columns.
When following up duplicate_tally() with
audit(), it shows summary statistics for each
_n column. The last row summarizes all of these columns
together, with "term" saying ".total":
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.