Type: Package
Title: Turn Clean Data into Messy Data
Version: 0.1.1
Description: Take real or simulated data and salt it with errors commonly found in the wild, such as pseudo-OCR errors, Unicode problems, numeric fields with nonsensical punctuation, bad dates, etc.
License: MIT + file LICENSE
Depends: R (≥ 2.10)
Imports: assertthat, purrr, stringr
Suggests: charlatan, testthat (≥ 2.0.0), tibble, covr
Encoding: UTF-8
RoxygenNote: 7.3.2
URL: https://github.com/mdlincoln/salty
BugReports: https://github.com/mdlincoln/salty/issues
NeedsCompilation: no
Packaged: 2024-08-31 04:04:06 UTC; mlincoln
Author: Matthew Lincoln [aut, cre] (ORCID)
Maintainer: Matthew Lincoln <matthew.d.lincoln@gmail.com>
Repository: CRAN
Date/Publication: 2024-08-31 04:20:02 UTC

salty: Turn Clean Data into Messy Data

Description

Take real or simulated data and salt it with errors commonly found in the wild, such as pseudo-OCR errors, Unicode problems, numeric fields with nonsensical punctuation, bad dates, etc.

Author(s)

Maintainer: Matthew Lincoln matthew.d.lincoln@gmail.com (ORCID)

See Also

Useful links:

https://github.com/mdlincoln/salty

Report bugs at https://github.com/mdlincoln/salty/issues


Access the original source vector for a given shaker function

Description

Access the original source vector for a given shaker function

Usage

inspect_shaker(f)

Arguments

f

A shaker function

Value

A character vector

Examples

inspect_shaker(shaker$punctuation)

Sample a proportion of indices of a vector

Description

Sample a proportion of indices of a vector

Usage

p_indices(x, p)

Arguments

x

A vector

p

A numeric probability between 0 and 1

Value

An integer vector of indices.
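
As a rough illustration of what sampling a proportion of indices means, here is a minimal base-R sketch. The helper name sample_proportion_indices and the choice to round the count up with ceiling() are assumptions made for this example; they are not taken from the package source, and p_indices itself may behave differently.

```r
# Hypothetical stand-in for p_indices(); not the package's implementation.
sample_proportion_indices <- function(x, p) {
  stopifnot(is.numeric(p), length(p) == 1, p >= 0, p <= 1)
  # Assumption: select ceiling(p * length(x)) distinct positions.
  n_selected <- ceiling(length(x) * p)
  sample.int(length(x), size = n_selected, replace = FALSE)
}

set.seed(1)
idx <- sample_proportion_indices(letters, p = 0.25)
length(idx)   # 7, because ceiling(26 * 0.25) is 7
letters[idx]  # the values that would be selected for salting
```

Sampling without replacement ensures that no position is picked twice, so at most one salting pass touches each selected value.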


Salt vectors with common data problems

Description

These are easy-to-use wrapper functions that call either salt_insert (for including new characters) or salt_replace (for salting that requires replacement of specific characters) with sane defaults.

Usage

salt_punctuation(x, p = 0.2, n = 1)

salt_letters(x, p = 0.2, n = 1)

salt_whitespace(x, p = 0.2, n = 1)

salt_digits(x, p = 0.2, n = 1)

salt_ocr(x, p = 0.2, rep_p = 0.1)

salt_capitalization(x, p = 0.1, rep_p = 0.1)

salt_decimal_commas(x, p = 0.1, rep_p = 0.1)

Arguments

x

A vector. This will always be coerced to character during salting.

p

A number between 0 and 1. Proportion of values in x that should be salted.

n

A positive integer. Number of times to add new characters to the selected values in x.

rep_p

A number between 0 and 1. Probability that a given match should be replaced in one of the selected values.

Details

For more fine-grained control over how characters are added, see the documentation for salt_insert, salt_substitute, salt_replace, and salt_delete.
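
A short sketch of how the wrappers might be called, assuming the salty package is installed and attached. The prices vector is made up for this example, and because salting is random the output differs between runs unless a seed is set.

```r
library(salty)

# Illustrative input; any vector works because it is coerced to character.
prices <- c("1200.50", "87.25", "15300.00", "42.10", "999.99")

x <- c("Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
       "Nunc finibus tortor a elit eleifend interdum.",
       "Maecenas aliquam augue sit amet ultricies placerat.")

set.seed(42)  # salting is random; a seed only makes the run repeatable

# Insertion-style wrappers: p sets the share of values that are touched,
# n sets how many times new characters are added.
with_punctuation <- salt_punctuation(prices, p = 0.4, n = 2)
with_whitespace  <- salt_whitespace(prices, p = 0.4)

# Replacement-style wrapper: rep_p is the per-match replacement probability.
with_ocr <- salt_ocr(x, p = 1, rep_p = 0.5)

with_punctuation
with_whitespace
with_ocr
```

Each call returns a character vector of the same length as its input, so the salted result can be dropped back into the column it came from.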

Functions


Delete some characters from some values

Description

Delete some characters from some values

Usage

salt_delete(x, p = 0.2, n = 1)

Arguments

x

A vector. This will always be coerced to character during salting.

p

A number between 0 and 1. Proportion of values in x that should be salted.

n

A positive integer. Number of times to delete a character from the selected values in x.

Value

A character vector the same length as x

Examples

x <- c("Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
       "Nunc finibus tortor a elit eleifend interdum.",
       "Maecenas aliquam augue sit amet ultricies placerat.")

salt_delete(x, p = 0.5, n = 5)

salt_empty(x, p = 0.5)

salt_na(x, p = 0.5)

Insert new characters into some values in a vector

Description

Inserts a selection of characters into a percentage of values in the supplied vector.

Usage

salt_insert(x, insertions, p = 0.2, n = 1)

Arguments

x

A vector. This will always be coerced to character during salting.

insertions

A shaker function, or a character vector.

p

A number between 0 and 1. Proportion of values in x that should be salted.

n

A positive integer. Number of times to add new values from insertions into the selected values in x.

Value

A character vector the same length as x
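
The two forms of the insertions argument described above might be used as follows. This is a sketch that assumes salty is attached; the artists vector and the custom_noise character set are invented for illustration.

```r
library(salty)

artists <- c("Rembrandt van Rijn", "Judith Leyster", "Frans Hals", "Jan Steen")

set.seed(7)  # only to make this run repeatable

# 1. Use a built-in shaker function as the source of inserted characters.
salt_insert(artists, shaker$digits, p = 0.5, n = 2)

# 2. Supply your own character vector instead of a shaker.
custom_noise <- c("*", "#", "~")
salted <- salt_insert(artists, custom_noise, p = 0.5, n = 1)
salted
```

Passing a custom vector is useful when the errors you want to simulate are specific to your own data source.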


Remove entire values from a vector

Description

Remove entire values from a vector

Usage

salt_na(x, p = 0.2)

salt_empty(x, p = 0.2)

Arguments

x

A vector

p

A number between 0 and 1. Proportion of values to edit.

Value

A vector the same length as x
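
A brief sketch of both functions, assuming salty is attached. The cities vector is invented for this example, and the comments describe the behavior suggested by the function names rather than a guarantee about the implementation.

```r
library(salty)

cities <- c("Amsterdam", "Haarlem", "Leiden", "Delft", "Utrecht", "Gouda")

set.seed(3)  # only to make this run repeatable

with_na    <- salt_na(cities, p = 0.5)     # selected values are expected to become NA
with_empty <- salt_empty(cities, p = 0.5)  # selected values are expected to become ""

# Count how many values were blanked out in each case.
sum(is.na(with_na))
sum(with_empty == "")
```

Both results keep the length of the input, so they can replace the original column directly.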


Replace certain patterns in some values in a vector

Description

Replaces matched patterns in some values of x. Pair salt_replace with the named vectors in replacement_shaker, or supply your own named vector of replacements. The convenience functions salt_ocr, salt_capitalization, and salt_decimal_commas are light wrappers around salt_replace.

Usage

salt_replace(x, replacements, p = 0.1, rep_p = 0.5)

Arguments

x

A vector. This will always be coerced to character during salting.

replacements

A replacement_shaker function, or a named character vector of patterns and replacements.

p

A number between 0 and 1. Proportion of values in x that should be salted.

rep_p

A number between 0 and 1. Probability that a given match should be replaced in one of the selected values.

Value

A character vector the same length as x

Examples


x <- c("Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
       "Nunc finibus tortor a elit eleifend interdum.",
       "Maecenas aliquam augue sit amet ultricies placerat.")

salt_replace(x, replacement_shaker$capitalization, p = 0.5, rep_p = 0.2)

salt_ocr(x, p = 1, rep_p = 0.5)

Substitute certain characters in a vector

Description

Substitute certain characters in a vector

Usage

salt_substitute(x, substitutions, p = 0.2, n = 1)

Arguments

x

A vector. This will always be coerced to character during salting.

substitutions

Values to be substituted in

p

A number between 0 and 1. Proportion of values in x that should be salted.

n

A positive integer. Number of times to substitute new values from substitutions into the selected values in x.

Value

A character vector the same length as x

Examples

x <- c("Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
       "Nunc finibus tortor a elit eleifend interdum.",
       "Maecenas aliquam augue sit amet ultricies placerat.")

salt_substitute(x, shaker$digits, p = 0.5, n = 5)

Randomly swap out entire values in a vector

Description

Because swaps can be provided by either a character vector or a function that returns a character vector, salt_swap can be fruitfully used in conjunction with the charlatan package to intersperse real data with simulated data.

Usage

salt_swap(x, swaps, p = 0.2)

Arguments

x

A vector. This will always be coerced to character during salting.

swaps

Values to be swapped in. Either a character vector or a function that returns a character vector.

p

A number between 0 and 1. Proportion of values in x that should be salted.

Value

A character vector the same length as x

Examples

x <- c("Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
       "Nunc finibus tortor a elit eleifend interdum.",
       "Maecenas aliquam augue sit amet ultricies placerat.")

new_values <- c("foo", "bar", "baz")

salt_swap(x, swaps = new_values, p = 0.5)

salty: Turn Clean Data Into Messy Data

Description

Insert, delete, replace, and substitute bits of your data with messy values.

Details

Convenient wrappers such as salt_punctuation are provided for quick access to this package's functionality with simple defaults. For more fine-grained control, use one of the underlying salt_ functions, such as salt_insert, salt_substitute, salt_replace, or salt_delete.


Get a set of values to use in salt_ functions

Description

shaker contains various character sets to be added to your data using salt_insert and salt_substitute. replacement_shaker is for salt_replace, and contains pairlists that replace matched patterns in your data.

Usage

shaker

replacement_shaker

available_shakers()

Format

An object of class list of length 6.

An object of class list of length 3.

Value

A sampling function that will be called by salt_insert, salt_substitute, or salt_replace.

Examples

salt_insert(letters, shaker$punctuation)
available_shakers()
