charlatan makes fake data. It is inspired by, and borrows some code from, Python’s faker.

Why would you want to make fake data? Here are some possible use cases to give you a sense of what you can do with this package:

  • Students in a classroom setting learning any task that needs a dataset
  • People doing simulations or modeling who need some fake data
  • Generate a fake dataset of users for a database before actual users exist
  • Fill in missing spots in a dataset
  • Generate fake data to replace sensitive real data before public release
  • Create a random set of colors for visualization
  • Generate random coordinates for a map
  • Get a set of randomly generated DOIs (Digital Object Identifiers) to assign to fake scholarly artifacts
  • Generate fake taxonomic names for a biological dataset
  • Get a set of fake sequences to use to test code/software that uses sequence data
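
As a rough illustration of the use case of replacing sensitive data, the sketch below overwrites the identifying columns of a small data.frame with fake values. The `patients` data.frame and its columns are invented for this example; `ch_name()` and `ch_phone_number()` are the package functions shown later in this document.

```r
library("charlatan")

# Hypothetical data.frame with sensitive columns (invented for this sketch)
patients <- data.frame(
  name = c("Real Person A", "Real Person B", "Real Person C"),
  phone = c("555-0100", "555-0101", "555-0102"),
  visits = c(2, 5, 1),
  stringsAsFactors = FALSE
)

# Overwrite the identifying columns with fake values, one per row
patients$name <- ch_name(n = nrow(patients))
patients$phone <- ch_phone_number(n = nrow(patients))

# Non-sensitive columns such as visits are left untouched
patients
```

Fake values are drawn independently for each row, so any link between a name and a phone number in the real data is not preserved.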

Package API

  • Low level interfaces: These are all R6 objects that a user can initialize and then call methods on. They contain all the logic that the interfaces below use.
  • High level interfaces: Functions prefixed with ch_ wrap the low level interfaces. They are meant to be easier to use and provide an easy way to make many instances of a thing.
  • ch_generate(): generates a data.frame of fake data, letting you choose which columns to include from the data types provided in charlatan
  • fraudster(): a single interface to all fake data methods. It returns vectors or lists of data and wraps the ch_*() functions described above.

Install

Stable version from CRAN

install.packages("charlatan")

Development version from GitHub

devtools::install_github("ropensci/charlatan")
library("charlatan")

high level function

… for all fake data operations

x <- fraudster()
x$job()
#> [1] "Banker"
x$name()
#> [1] "Jo Dicki"
x$job()
#> [1] "Armed forces technical officer"
x$color_name()
#> [1] "SeaGreen"
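
Each call draws a new random value, which is why the two x$job() calls above return different jobs. If you need repeatable output, for example in a test, fixing the seed first may help. This is a sketch that assumes charlatan draws from R's global random number generator; that assumption is not stated in this document, so check that the result is TRUE on your installed version.

```r
library("charlatan")

# Assumption: charlatan uses R's global RNG, so a fixed seed
# should make the draws repeatable
set.seed(42)
first <- ch_job(n = 3)

set.seed(42)
second <- ch_job(n = 3)

# TRUE if both draws produced the same three jobs
identical(first, second)
```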

locale support

More locales are being added over time. For example:

Locale support for job data

ch_job(locale = "en_US", n = 3)
#> [1] "Architect"                  "Psychologist, forensic"    
#> [3] "Television camera operator"
ch_job(locale = "fr_FR", n = 3)
#> [1] "Ingénieur"                                      
#> [2] "Responsable de la collecte des déchets ménagers"
#> [3] "Fleuriste"
ch_job(locale = "hr_HR", n = 3)
#> [1] "Osoba stručno osposobljena za uzgoj riba i drugih morskih organizama"
#> [2] "Prvostupnik medicinsko- laboratorijske dijagnostike"                 
#> [3] "Viši knjižničar"
ch_job(locale = "uk_UA", n = 3)
#> [1] "Письменник" "Швачка"     "Біолог"
ch_job(locale = "zh_TW", n = 3)
#> [1] "領隊"             "外務/快遞/送貨" "砌磚工"

For colors:

ch_color_name(locale = "en_US", n = 3)
#> [1] "YellowGreen" "Chartreuse"  "NavajoWhite"
ch_color_name(locale = "uk_UA", n = 3)
#> [1] "Малахітовий"      "Жовто-персиковий" "Яскраво-зелений"

More coming soon …

generate a dataset

ch_generate()
#> # A tibble: 10 x 3
#>                 name                                         job
#>                <chr>                                       <chr>
#>  1    Joette Keeling               Development worker, community
#>  2       Wirt Rempel                          Purchasing manager
#>  3    Cherilyn Terry                   Higher education lecturer
#>  4   Garfield Torphy Psychologist, prison and probation services
#>  5      Sharif Stehr                           Recycling officer
#>  6        Alver Mraz                   Health and safety adviser
#>  7       Cleva Thiel                        Engineer, biomedical
#>  8        Red Skiles                  Nurse, learning disability
#>  9    Alfreda Jacobi                               Hotel manager
#> 10 Mrs. Karolyn Bode                     Data processing manager
#> # ... with 1 more variables: phone_number <chr>
ch_generate('job', 'phone_number', n = 30)
#> # A tibble: 30 x 2
#>                            job         phone_number
#>                          <chr>                <chr>
#>  1        Engineer, structural    230-418-4340x5974
#>  2          Furniture designer  (534)138-4659x98182
#>  3 Nature conservation officer  1-341-021-7161x8149
#>  4                 Tax adviser  (347)710-0498x04182
#>  5      Leisure centre manager         059.367.5785
#>  6       Scientist, biomedical 1-026-641-9920x73432
#>  7     Education administrator     +15(7)3523656979
#>  8             Naval architect   603-410-1279x66743
#>  9                Psychiatrist         664-458-5161
#> 10       Accommodation manager   1-438-457-6413x355
#> # ... with 20 more rows

Data types

person name

ch_name()
#> [1] "Cristina Heathcote-Hoeger"
ch_name(10)
#>  [1] "Harper Cassin"        "Arizona Feest"        "Danyel Harber"       
#>  [4] "Dr. Evelin Jones"     "Ms. Ally Reinger"     "Destiney Grant"      
#>  [7] "Adan Macejkovic"      "Nasir Abbott-Shields" "Cortney Thompson"    
#> [10] "Zachariah Littel"

phone number

ch_phone_number()
#> [1] "495-062-1294x75034"
ch_phone_number(10)
#>  [1] "482-502-9430x065"     "160.030.6101"         "895.692.4964x96989"  
#>  [4] "(857)282-4600"        "03506676221"          "1-045-259-2363x38019"
#>  [7] "(864)729-1156x3038"   "580-663-7118x052"     "(629)372-6175x5956"  
#> [10] "687.301.8353x530"

job

ch_job()
#> [1] "Exhibition designer"
ch_job(10)
#>  [1] "Aeronautical engineer"         "Engineer, civil (contracting)"
#>  [3] "Health promotion specialist"   "Newspaper journalist"         
#>  [5] "Science writer"                "Conference centre manager"    
#>  [7] "Special effects artist"        "Interpreter"                  
#>  [9] "Location manager"              "Podiatrist"

credit cards

ch_credit_card_provider()
#> [1] "Mastercard"
ch_credit_card_provider(n = 4)
#> [1] "JCB 15 digit"  "VISA 16 digit" "VISA 16 digit" "VISA 16 digit"
ch_credit_card_number()
#> [1] "180014495963802646"
ch_credit_card_number(n = 10)
#>  [1] "4341333219296046"    "3019347410625616"    "676210961178778"    
#>  [4] "4101721569265605"    "869976400881964937"  "3337693345342041531"
#>  [7] "675940381801501"     "6011119366631562934" "3528100212370176513"
#> [10] "180069364073399725"
ch_credit_card_security_code()
#> [1] "064"
ch_credit_card_security_code(10)
#>  [1] "181"  "6582" "358"  "021"  "322"  "538"  "117"  "013"  "322"  "4657"

Messy data

Real data is messy, and charlatan makes it easy to create messy data. This feature is still in its early stages, so it is not yet available for most data types and locales, but we’re working on it.

For example, create messy names:

ch_name(50, messy = TRUE)
#>  [1] "Dr Sim Hodkiewicz DVM"    "Clella Hills md"         
#>  [3] "Elam Dietrich DDS"        "Miss Nedra Mann"         
#>  [5] "Candido Green"            "Mrs Almedia Marquardt md"
#>  [7] "Elenor Hyatt"             "Jonnie Moore"            
#>  [9] "Fleda Anderson"           "Lazaro Waelchi-Hackett"  
#> [11] "Dr. Leslie Davis Sr."     "Danny Ledner"            
#> [13] "Blair Lindgren"           "Hebert Hoeger DVM"       
#> [15] "Mr Tollie Senger"         "Audie Hamill-Hettinger"  
#> [17] "Joanne Ziemann"           "Mr Aden Moore"           
#> [19] "Mr Boyce Champlin PhD"    "Dallas Langosh"          
#> [21] "Maynard Brown"            "Mellisa Casper"          
#> [23] "Luka Brekke"              "Wilhelm Hills d.d.s."    
#> [25] "Karma Hane"               "Leonore Stehr"           
#> [27] "Ms Margarita Rodriguez"   "Dr. Melissia Simonis"    
#> [29] "Darryl Olson Jr"          "Jannette Krajcik"        
#> [31] "Shante Becker PhD"        "Randolph Borer"          
#> [33] "Hermon Willms"            "Belinda Rippin"          
#> [35] "Arjun Pfannerstill"       "Djuana Schamberger"      
#> [37] "Miss Osa Terry"           "Edie Conroy"             
#> [39] "Greta Muller DDS"         "Ms. Debora Sporer"       
#> [41] "Aidyn Mayert"             "Dr Alexa Russel"         
#> [43] "Enos Eichmann"            "Lular Bechtelar PhD"     
#> [45] "Bess Hamill-Gleason"      "Erin Wilderman"          
#> [47] "Ambrose Rice"             "Dr Mayo Hoeger"          
#> [49] "Tabetha Schamberger"      "Lanette Rodriguez"

Right now, only prefixes and suffixes for names in the en_US locale are supported. Notice the variation in prefixes and suffixes above, such as "Dr" versus "Dr." and "DDS" versus "d.d.s.".
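
To see that variation in a larger sample, one option is to pull out the leading title and tabulate its spellings. This is only a sketch: the regular expression covers just the prefixes visible in the output above and is an assumption about what the package can produce, not a complete list.

```r
library("charlatan")

messy_names <- ch_name(200, messy = TRUE)

# Match a leading title, with or without a trailing period, followed by a space
# (pattern is an assumption based on the prefixes seen in the output above)
pattern <- "^(Dr|Mr|Mrs|Ms|Miss)\\.?(?= )"
hits <- regmatches(messy_names, regexpr(pattern, messy_names, perl = TRUE))

# Count each spelling, e.g. "Dr" versus "Dr."
table(hits)
```

Names without a matching prefix are dropped by regmatches(), so the table counts only names that carry one of these titles.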