This package is under constant development, and the author updates the documentation regularly at FOYI and uncovr.
Let us consider an industry example of generating transactional data for a retail store. The following steps will help in building such data.
Install the conjurer package using the following code. Since the package uses base R functions, it does not have any dependencies.
install.packages("conjurer")
A customer is identified by a unique customer identifier (ID). A customer ID is alphanumeric, with the prefix “cust” followed by a number. This number ranges from 1 to the number of customers passed as the argument to the function, and it is zero-padded so that every customer ID has the same length. For example, if there are 100 customers, the customer IDs range from cust001 to cust100. Customer IDs are built using the function buildCust, which takes one argument, “numOfCust”, that specifies the number of customer IDs to be built. For simplicity, let us assume that there are 100 customers and build their IDs using the following code.
library(conjurer)
customers <- buildCust(numOfCust = 100)
print(head(customers))
#> [1] "cust001" "cust002" "cust003" "cust004" "cust005" "cust006"
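The fixed-length property described above can be illustrated in base R. The sketch below is not the source of buildCust; it only shows one possible way to zero-pad the numeric part so that every ID has the same width. The helper name padIds is hypothetical.

```r
# Hypothetical helper (not part of conjurer): build zero-padded IDs.
padIds <- function(prefix, n) {
  n <- as.integer(n)
  # Number of digits needed for the largest ID, e.g. 3 for n = 100
  width <- nchar(as.character(n))
  paste0(prefix, formatC(seq_len(n), width = width, flag = "0"))
}

ids <- padIds("cust", 100)
print(head(ids, 3))
# Every ID should have the same number of characters
print(unique(nchar(ids)))
```

Padding to the width of the largest number is what keeps the IDs sortable as plain strings.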
A list of customer names for the 100 customer IDs can be generated in the following way.
custNames <- as.data.frame(buildNames(numOfNames = 100, minLength = 5, maxLength = 7))
#set column heading
colnames(custNames) <- c("customerName")
print(head(custNames))
#> customerName
#> 1 stelie
#> 2 nesta
#> 3 bertoni
#> 4 marlier
#> 5 nallarl
#> 6 ronna
Let us assign customer names to customer IDs. This is a random one to one mapping using the following code.
customer2name <- cbind(customers, custNames)
#inspect the output
print(head(customer2name))
#> customers customerName
#> 1 cust001 stelie
#> 2 cust002 nesta
#> 3 cust003 bertoni
#> 4 cust004 marlier
#> 5 cust005 nallarl
#> 6 cust006 ronna
A list of customer ages for the 100 customer IDs can be generated in the following way.
custAge <- as.data.frame(round(buildNum(n = 100, st = 23, en = 80, disp = 0.5, outliers = 1)))
#set column heading
colnames(custAge) <- c("customerAge")
print(head(custAge))
#> customerAge
#> 1 23
#> 2 45
#> 3 51
#> 4 62
#> 5 70
#> 6 76
Let us assign customer ages to customer IDs. This is a random one to one mapping using the following code.
customer2age <- cbind(customers, custAge)
#inspect the output
print(head(customer2age))
#> customers customerAge
#> 1 cust001 23
#> 2 cust002 45
#> 3 cust003 51
#> 4 cust004 62
#> 5 cust005 70
#> 6 cust006 76
A list of customer phone numbers for the 100 customer IDs can be generated in the following way.
parts <- list(c("+91","+44","+64"), c("("), c(491,324,211), c(")"), c(7821:8324))
probs <- list(c(0.25,0.25,0.50), c(1), c(0.30,0.60,0.10), c(1), c())
custPhoneNumbers <- as.data.frame(buildPattern(n=100,parts = parts, probs = probs))
head(custPhoneNumbers)
#> buildPattern(n = 100, parts = parts, probs = probs)
#> 1 +64(491)8149
#> 2 +44(324)8319
#> 3 +64(324)7866
#> 4 +64(324)7903
#> 5 +44(491)7882
#> 6 +91(491)8312
#set column heading
colnames(custPhoneNumbers) <- c("customerPhone")
print(head(custPhoneNumbers))
#> customerPhone
#> 1 +64(491)8149
#> 2 +44(324)8319
#> 3 +64(324)7866
#> 4 +64(324)7903
#> 5 +44(491)7882
#> 6 +91(491)8312
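To make the roles of parts and probs concrete, here is a minimal base R sketch of the general idea: draw one value from each part using the matching probabilities, then paste the pieces together. It is an illustration only and makes no claim about how buildPattern is implemented; in particular, treating an empty probability entry as "all values equally likely" is an assumption of this sketch.

```r
set.seed(123)  # only to make the sketch reproducible
n <- 5
parts <- list(c("+91", "+44", "+64"), "(", c(491, 324, 211), ")", 7821:8324)
# NULL stands in for the empty c() and means equal weights in sample()
probs <- list(c(0.25, 0.25, 0.50), 1, c(0.30, 0.60, 0.10), 1, NULL)

# Sample n values from every part, weighted by the matching probabilities
pieces <- Map(
  function(values, weights) {
    sample(as.character(values), size = n, replace = TRUE, prob = weights)
  },
  parts, probs
)

# Concatenate the sampled pieces position by position
phones <- do.call(paste0, pieces)
print(phones)
```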
Let us assign customer phone numbers to customer IDs. This is a random one to one mapping using the following code.
customer2phone <- cbind(customers, custPhoneNumbers)
#inspect the output
print(head(customer2phone))
#> customers customerPhone
#> 1 cust001 +64(491)8149
#> 2 cust002 +44(324)8319
#> 3 cust003 +64(324)7866
#> 4 cust004 +64(324)7903
#> 5 cust005 +44(491)7882
#> 6 cust006 +91(491)8312
The next step is building some products. A product is identified by a product ID. Similar to a customer ID, a product ID is alphanumeric, with the prefix “sku”, which signifies a stock keeping unit. This prefix is followed by a number ranging from 1 to the number of products passed as the argument to the function. For example, if there are 10 products, the product IDs range from sku01 to sku10. This ensures that the product ID is always of the same length. Besides the product ID, the product price range must be specified. Products are built using the function buildProd. This function takes 3 arguments: numOfProd, minPrice and maxPrice. For simplicity, let us assume that there are 10 products priced between 5 dollars and 50 dollars, and build them using the following code.
products <- buildProd(numOfProd = 10, minPrice = 5, maxPrice = 50)
print(head(products))
#> SKU Price
#> 1 sku01 16.12
#> 2 sku02 11.74
#> 3 sku03 30.32
#> 4 sku04 12.78
#> 5 sku05 36.92
#> 6 sku06 24.52
The products belong to various categories. Let’s start to build the product hierarchy. The 10 products belong to 2 categories, namely Food and Non-Food. These categories are further classified into 4 sub-categories, namely Beverages, Dairy, Sanitary and Household.
productHierarchy <- buildHierarchy(type = "equalSplit", splits = 2, numOfLevels = 2)
print(productHierarchy)
#> level1 level2
#> 1 Level_1_element_1 Level_2_element_1
#> 2 Level_1_element_2 Level_2_element_2
#> 3 Level_1_element_1 Level_2_element_3
#> 4 Level_1_element_2 Level_2_element_4
As you can see, the product hierarchy generated has default names for levels and elements. To make it more meaningful, it can be modified as follows.
#Rename the columns of the data frame
names(productHierarchy) <- c("category", "subcategory")
#Replace category with Food and Non-Food
productHierarchy$category <- gsub("Level_1_element_1", "Food", productHierarchy$category)
productHierarchy$category <- gsub("Level_1_element_2", "Non-Food", productHierarchy$category)
#Replace subCategories
productHierarchy$subcategory <- gsub("Level_2_element_1", "Beverages", productHierarchy$subcategory)
productHierarchy$subcategory <- gsub("Level_2_element_3", "Dairy", productHierarchy$subcategory)
productHierarchy$subcategory <- gsub("Level_2_element_2", "Sanitary", productHierarchy$subcategory)
productHierarchy$subcategory <- gsub("Level_2_element_4", "Household", productHierarchy$subcategory)
#Inspect the data to confirm the results
productHierarchy <- productHierarchy[order(productHierarchy$category),]
print(productHierarchy)
#> category subcategory
#> 1 Food Beverages
#> 3 Food Dairy
#> 2 Non-Food Sanitary
#> 4 Non-Food Household
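As a possible alternative to the repeated gsub calls, the renaming can be expressed with named lookup vectors. The sketch below uses a hand-built stand-in for the buildHierarchy output shown above, so it runs without the package; the names hier, catMap and subMap are illustrative.

```r
# Stand-in for the default hierarchy printed earlier (not generated by conjurer)
hier <- data.frame(
  level1 = c("Level_1_element_1", "Level_1_element_2",
             "Level_1_element_1", "Level_1_element_2"),
  level2 = paste0("Level_2_element_", 1:4),
  stringsAsFactors = FALSE
)

# Named vectors act as lookup tables: default name -> readable name
catMap <- c(Level_1_element_1 = "Food", Level_1_element_2 = "Non-Food")
subMap <- c(Level_2_element_1 = "Beverages", Level_2_element_2 = "Sanitary",
            Level_2_element_3 = "Dairy", Level_2_element_4 = "Household")

# Index the lookup tables by the default names to translate whole columns
hier$category <- unname(catMap[hier$level1])
hier$subcategory <- unname(subMap[hier$level2])
print(hier[order(hier$category), c("category", "subcategory")])
```

A lookup vector keeps the whole mapping in one place, which makes it harder to pair a sub-category with the wrong category by accident.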
Now that a group of customer IDs and products is built, the next step is to build transactions. Transactions are built using the function genTrans. This function takes 5 arguments; the call below sets four of them: cycles, spike, outliers and transactions.
Let us build transactions using the following code.
transactions <- genTrans(cycles = "y", spike = 12, outliers = 1, transactions = 10000)
Visualize the generated transactions using the following code.
TxnAggregated <- aggregate(transactions$transactionID, by = list(transactions$dayNum), length)
plot(TxnAggregated, type = "l", ann = FALSE)
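Bringing customers, products and transactions together is the final step of generating synthetic data. This process entails 3 steps: allocating transactions to customers, allocating transactions to products, and merging the results into the final data set.
The allocation of transactions is achieved with the help of the buildPareto function. This function takes 3 arguments: the two factors to be mapped to each other (here, the customer IDs and the transaction IDs) and pareto, a pair of percentages such as c(80,20) that defines how skewed the allocation is.
Let us first allocate transactions to customers using the following code.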
customer2transaction <- buildPareto(customers, transactions$transactionID, pareto = c(80,20))
Assign readable names to the output by using the following code.
names(customer2transaction) <- c('transactionID', 'customer')
#inspect the output
print(head(customer2transaction))
#> transactionID customer
#> 1 txn-135-22 cust078
#> 2 txn-350-24 cust029
#> 3 txn-292-37 cust029
#> 4 txn-140-05 cust064
#> 5 txn-344-40 cust034
#> 6 txn-315-33 cust057
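One way to check how concentrated an allocation is would be to compute the share of transactions held by the busiest customers. The sketch below uses a small hand-made vector rather than conjurer output, and the helper topShare is hypothetical; with real data one would pass customer2transaction$customer instead.

```r
# Hypothetical helper: share of rows that belong to the top `topFrac` of IDs
topShare <- function(ids, topFrac = 0.2) {
  counts <- sort(table(ids), decreasing = TRUE)
  # Number of IDs that make up the top fraction (at least one)
  k <- max(1, ceiling(topFrac * length(counts)))
  sum(counts[seq_len(k)]) / sum(counts)
}

# Toy allocation: 5 customers, one of them holds 80 of 100 transactions
toyCustomers <- rep(c("a", "b", "c", "d", "e"), times = c(80, 5, 5, 5, 5))
print(topShare(toyCustomers))
```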
Allocate the products to the product hierarchy. This can be achieved as follows.
#The first step is to ensure that the product hierarchy data frame has as many rows as there are products.
category <- productHierarchy$category
subcategory <- productHierarchy$subcategory
productHierarchy <- as.data.frame(cbind(category,subcategory,1:nrow(products)))
#> Warning in cbind(category, subcategory, 1:nrow(products)): number of rows of
#> result is not a multiple of vector length (arg 1)
#Assign the product hierarchy to the products, keeping only the category and subcategory columns so that the additional unused variable is dropped.
products <- cbind(products, productHierarchy[,c("category","subcategory")])
#inspect the output
print(head(products))
#> SKU Price category subcategory
#> 1 sku01 16.12 Food Beverages
#> 2 sku02 11.74 Food Dairy
#> 3 sku03 30.32 Non-Food Sanitary
#> 4 sku04 12.78 Non-Food Household
#> 5 sku05 36.92 Food Beverages
#> 6 sku06 24.52 Food Dairy
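The warning shown above appears because the 4 hierarchy rows do not divide evenly into the 10 products, so cbind recycles them with a remainder. A possible way to get the same cyclic assignment without the warning is to build the row index explicitly with rep_len. The sketch below uses hand-built stand-ins (hier, prods) for the hierarchy and product data frames.

```r
# Stand-ins for productHierarchy and products (not generated by conjurer)
hier <- data.frame(
  category = c("Food", "Food", "Non-Food", "Non-Food"),
  subcategory = c("Beverages", "Dairy", "Sanitary", "Household"),
  stringsAsFactors = FALSE
)
prods <- data.frame(SKU = sprintf("sku%02d", 1:10), stringsAsFactors = FALSE)

# Cycle through the hierarchy rows until every product has one: 1,2,3,4,1,2,...
idx <- rep_len(seq_len(nrow(hier)), nrow(prods))
prods$category <- hier$category[idx]
prods$subcategory <- hier$subcategory[idx]
print(head(prods))
```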
Now, using a similar step to the one above, allocate transactions to products using the following code.
product2transaction <- buildPareto(products$SKU,transactions$transactionID,pareto = c(70,30))
names(product2transaction) <- c('transactionID', 'SKU')
#inspect the output
print(head(product2transaction))
#> transactionID SKU
#> 1 txn-347-62 sku09
#> 2 txn-297-32 sku02
#> 3 txn-340-31 sku02
#> 4 txn-103-50 sku02
#> 5 txn-315-23 sku01
#> 6 txn-26-19 sku02
The following code brings together transactions, products and customers into one dataframe.
df1 <- merge(x = customer2transaction, y = product2transaction, by = "transactionID")
df2 <- merge(x = df1, y = transactions, by = "transactionID", all.x = TRUE)
#inspect the output
print(head(df2))
#> transactionID customer SKU dayNum mthNum
#> 1 txn-1-01 cust030 sku07 1 1
#> 2 txn-1-02 cust034 sku06 1 1
#> 3 txn-1-03 cust078 sku10 1 1
#> 4 txn-1-04 cust083 sku01 1 1
#> 5 txn-1-05 cust029 sku02 1 1
#> 6 txn-1-06 cust090 sku09 1 1
We can add additional data, such as customer name, age, phone number and product price, using the code below.
df3 <- merge(x = df2, y = customer2name, by.x = "customer", by.y = "customers", all.x = TRUE)
df4 <- merge(x = df3, y = customer2age, by.x = "customer", by.y = "customers", all.x = TRUE)
df5 <- merge(x = df4, y = customer2phone, by.x = "customer", by.y = "customers", all.x = TRUE)
df6 <- merge(x = df5, y = products, by = "SKU", all.x = TRUE)
dfFinal <- df6[,c("dayNum", "mthNum", "customer", "customerName", "customerAge", "customerPhone", "transactionID", "SKU", "Price", "category","subcategory")]
#inspect the output
print(head(dfFinal))
#> dayNum mthNum customer customerName customerAge customerPhone transactionID
#> 1 82 3 cust001 stelie 23 +64(491)8149 txn-82-09
#> 2 254 9 cust023 nesti 51 +64(324)8023 txn-254-88
#> 3 80 3 cust048 allin 80 +91(211)7999 txn-80-57
#> 4 331 11 cust019 jealie 78 +44(324)7954 txn-331-30
#> 5 104 4 cust030 renel 73 +64(491)8101 txn-104-15
#> 6 69 3 cust081 lander 23 +64(491)8157 txn-69-28
#> SKU Price category subcategory
#> 1 sku01 16.12 Food Beverages
#> 2 sku01 16.12 Food Beverages
#> 3 sku01 16.12 Food Beverages
#> 4 sku01 16.12 Food Beverages
#> 5 sku01 16.12 Food Beverages
#> 6 sku01 16.12 Food Beverages
Thus, we have the final data set with transactions, customers and products.
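After a chain of merges it can be worth confirming that no transaction was lost, duplicated or left without a match. The sketch below shows such checks on tiny hand-made stand-in tables; the object names are illustrative, and with the real data the same checks would be applied to dfFinal.

```r
# Tiny stand-ins for the mapping tables used above
cust2txn <- data.frame(
  transactionID = c("txn-1-01", "txn-1-02", "txn-2-01"),
  customer = c("cust001", "cust002", "cust001"),
  stringsAsFactors = FALSE
)
prod2txn <- data.frame(
  transactionID = c("txn-1-01", "txn-1-02", "txn-2-01"),
  SKU = c("sku01", "sku02", "sku01"),
  stringsAsFactors = FALSE
)
prods <- data.frame(SKU = c("sku01", "sku02"), Price = c(16.12, 11.74),
                    stringsAsFactors = FALSE)

# Same merge pattern as in the walkthrough: join on the shared key columns
merged <- merge(cust2txn, prod2txn, by = "transactionID")
merged <- merge(merged, prods, by = "SKU", all.x = TRUE)

# Sanity checks: one row per transaction, no unmatched keys
stopifnot(nrow(merged) == nrow(cust2txn))
stopifnot(!anyDuplicated(merged$transactionID))
stopifnot(!anyNA(merged))
print(merged)
```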
The column names of the final data frame can be interpreted as follows.
Let us visualize the results to understand the data distribution.
Below is a view of the number of transactions per day.
aggregatedDataDay <- aggregate(dfFinal$transactionID, by = list(dfFinal$dayNum), length)
plot(aggregatedDataDay, type = "l", ann = FALSE)
Below is a view of the number of transactions per month.
aggregatedDataMth <- aggregate(dfFinal$transactionID, by = list(dfFinal$mthNum), length)
aggregatedDataMthSorted <- aggregatedDataMth[order(aggregatedDataMth$Group.1),]
plot(aggregatedDataMthSorted, ann = FALSE)
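The aggregated data frames above carry the default column names Group.1 and x. Passing a named list to the by argument of aggregate gives the grouping column a readable name instead. The sketch below demonstrates this on randomly generated stand-in data (toy), not on dfFinal.

```r
set.seed(7)  # only to make the sketch reproducible
# Stand-in for dfFinal: 200 transactions spread over 12 months
toy <- data.frame(
  transactionID = paste0("txn-", 1:200),
  mthNum = sample(1:12, 200, replace = TRUE),
  stringsAsFactors = FALSE
)

# A named list in `by` names the grouping column; the count column is still "x"
byMonth <- aggregate(toy$transactionID, by = list(mthNum = toy$mthNum), FUN = length)
names(byMonth)[2] <- "numOfTxn"
print(byMonth)
```

The resulting byMonth data frame can be passed to plot in the same way as aggregatedDataMthSorted.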