Overview

This package is under active development, and the author updates the documentation regularly at FOYI and uncovr.

Steps to build synthetic data

Let us consider an industry example of generating transactional data for a retail store. The following steps will help in building such data.

Installation

Install the conjurer package using the following code. Because the package uses only base R functions, it has no dependencies.

install.packages("conjurer")

Build customers

A customer is identified by a unique customer identifier (ID). A customer ID is alphanumeric: the prefix “cust” followed by a number that ranges from 1 to the number of customers passed to the function. The number is zero-padded so that every customer ID has the same length. For example, if there are 100 customers, the customer IDs range from cust001 to cust100.

Customer IDs are built with the function buildCust. It takes one argument, “numOfCust”, which specifies the number of customer IDs to build. For simplicity, let us assume that there are 100 customers.

library(conjurer)
customers <- buildCust(numOfCust =  100)
print(head(customers))
#> [1] "cust001" "cust002" "cust003" "cust004" "cust005" "cust006"
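
The fixed-width numbering described above can be reproduced in base R. The sketch below only illustrates the zero-padding idea and is not the package's implementation; the helper name sketchCustIds is hypothetical.

```r
# Hypothetical helper (not part of conjurer): zero-padded IDs of equal length.
sketchCustIds <- function(numOfCust) {
  # The width of the numeric part is the number of digits in the largest ID.
  width <- nchar(as.character(as.integer(numOfCust)))
  # formatC pads each number with leading zeros up to that width.
  paste0("cust", formatC(seq_len(numOfCust), width = width, flag = "0"))
}

ids <- sketchCustIds(100)
print(head(ids, 3))
# Every ID has the same number of characters.
print(length(unique(nchar(ids))) == 1)
```

Deriving the width from the largest ID is what keeps the IDs sortable as plain text, whatever the number of customers.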

Build customer names

A list of customer names for the 100 customer IDs can be generated in the following way.

custNames <- as.data.frame(buildNames(numOfNames = 100, minLength = 5, maxLength = 7))

#set column heading
colnames(custNames) <- c("customerName")
print(head(custNames))
#>   customerName
#> 1        danna
#> 2       melarl
#> 3      caminda
#> 4      chellan
#> 5        emila
#> 6      kielian

Assign customer name to customer ID

Let us assign customer names to customer IDs. The following code creates a random one-to-one mapping.

customer2name <- cbind(customers, custNames)
#inspect the output
print(head(customer2name))
#>   customers customerName
#> 1   cust001        danna
#> 2   cust002       melarl
#> 3   cust003      caminda
#> 4   cust004      chellan
#> 5   cust005        emila
#> 6   cust006      kielian

Build customer age

A list of customer ages for the 100 customer IDs can be generated in the following way.

custAge <- as.data.frame(round(buildNum(n = 100, st = 23, en = 80, disp = 0.5, outliers = 1)))

#set column heading
colnames(custAge) <- c("customerAge")
print(head(custAge))
#>   customerAge
#> 1          23
#> 2          41
#> 3          56
#> 4          67
#> 5          75
#> 6          80

Assign customer age to customer ID

Let us assign customer ages to customer IDs. The following code creates a random one-to-one mapping.

customer2age <- cbind(customers, custAge)
#inspect the output
print(head(customer2age))
#>   customers customerAge
#> 1   cust001          23
#> 2   cust002          41
#> 3   cust003          56
#> 4   cust004          67
#> 5   cust005          75
#> 6   cust006          80

Build customer phone number

A list of customer phone numbers for the 100 customer IDs can be generated in the following way.

parts <- list(c("+91","+44","+64"), c("("), c(491,324,211), c(")"), c(7821:8324))
probs <- list(c(0.25,0.25,0.50), c(1), c(0.30,0.60,0.10), c(1), c())
custPhoneNumbers <- as.data.frame(buildPattern(n=100,parts = parts, probs = probs))
head(custPhoneNumbers)
#>   buildPattern(n = 100, parts = parts, probs = probs)
#> 1                                        +44(324)7949
#> 2                                        +91(491)8198
#> 3                                        +91(324)8309
#> 4                                        +64(324)8292
#> 5                                        +44(324)8041
#> 6                                        +64(324)7884

#set column heading
colnames(custPhoneNumbers) <- c("customerPhone")
print(head(custPhoneNumbers))
#>   customerPhone
#> 1  +44(324)7949
#> 2  +91(491)8198
#> 3  +91(324)8309
#> 4  +64(324)8292
#> 5  +44(324)8041
#> 6  +64(324)7884
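
To make the roles of parts and probs concrete, the following base R sketch builds each string by drawing one element from every part and concatenating the draws. It is a minimal illustration under stated assumptions, not buildPattern's actual code: the helper name sketchPattern is hypothetical, and treating an empty probability vector as "all elements equally likely" is an assumption.

```r
# Hypothetical helper (not conjurer's implementation).
sketchPattern <- function(n, parts, probs) {
  pieces <- vector("list", length(parts))
  for (i in seq_along(parts)) {
    part <- parts[[i]]
    prob <- probs[[i]]
    # Assumption: an empty probability vector means equal probabilities.
    if (length(prob) == 0) prob <- NULL
    # Sampling indices avoids sample()'s special case for a single number.
    idx <- sample.int(length(part), size = n, replace = TRUE, prob = prob)
    pieces[[i]] <- as.character(part[idx])
  }
  # Paste the i-th draw of every part together, element by element.
  do.call(paste0, pieces)
}

set.seed(123)  # only to make the sketch reproducible
parts <- list(c("+91","+44","+64"), c("("), c(491,324,211), c(")"), c(7821:8324))
probs <- list(c(0.25,0.25,0.50), c(1), c(0.30,0.60,0.10), c(1), c())
phones <- sketchPattern(n = 100, parts = parts, probs = probs)
print(head(phones, 3))
```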

Assign customer phone number to customer ID

Let us assign customer phone numbers to customer IDs. The following code creates a random one-to-one mapping.

customer2phone <- cbind(customers, custPhoneNumbers)
#inspect the output
print(head(customer2phone))
#>   customers customerPhone
#> 1   cust001  +44(324)7949
#> 2   cust002  +91(491)8198
#> 3   cust003  +91(324)8309
#> 4   cust004  +64(324)8292
#> 5   cust005  +44(324)8041
#> 6   cust006  +64(324)7884

Build products

The next step is to build some products. A product is identified by a product ID. Like a customer ID, a product ID is alphanumeric: the prefix “sku”, which signifies a stock keeping unit, followed by a number that ranges from 1 to the number of products passed to the function. The number is zero-padded so that every product ID has the same length. For example, if there are 10 products, the product IDs range from sku01 to sku10.

Besides the product IDs, the price range of the products must be specified. For simplicity, let us assume that there are 10 products priced between 5 and 50 dollars. Products are built with the function buildProd, which takes the 3 arguments given below.

numOfProd. This defines the number of product IDs to be generated.
minPrice. This is the minimum value of the price range.
maxPrice. This is the maximum value of the price range.

products <- buildProd(numOfProd = 10, minPrice = 5, maxPrice = 50)
print(head(products))
#>     SKU Price
#> 1 sku01 21.09
#> 2 sku02  6.23
#> 3 sku03 31.77
#> 4 sku04 28.44
#> 5 sku05 46.51
#> 6 sku06 13.06

Build product hierarchy

The products belong to various categories. Let’s start to build the product hierarchy. The 10 products belong to 2 categories, namely Food and Non-Food. These categories are further classified into 4 sub-categories, namely Beverages, Dairy, Sanitary and Household.

productHierarchy <- buildHierarchy(type = "equalSplit", splits = 2, numOfLevels = 2)
print(productHierarchy)
#>              level1            level2
#> 1 Level_1_element_1 Level_2_element_1
#> 2 Level_1_element_2 Level_2_element_2
#> 3 Level_1_element_1 Level_2_element_3
#> 4 Level_1_element_2 Level_2_element_4

As you can see, the product hierarchy generated has default names for levels and elements. To make it more meaningful, it can be modified as follows.

#Rename the dataframe
names(productHierarchy) <- c("category", "subcategory")

#Replace category with Food and Non-Food
productHierarchy$category <- gsub("Level_1_element_1", "Food", productHierarchy$category)
productHierarchy$category <- gsub("Level_1_element_2", "Non-Food", productHierarchy$category)

#Replace subCategories
productHierarchy$subcategory <- gsub("Level_2_element_1", "Beverages", productHierarchy$subcategory)
productHierarchy$subcategory <- gsub("Level_2_element_3", "Dairy", productHierarchy$subcategory)
productHierarchy$subcategory <- gsub("Level_2_element_2", "Sanitary", productHierarchy$subcategory)
productHierarchy$subcategory <- gsub("Level_2_element_4", "Household", productHierarchy$subcategory)

#Inspect the data to confirm the results 
productHierarchy <- productHierarchy[order(productHierarchy$category),]
print(productHierarchy)
#>   category subcategory
#> 1     Food   Beverages
#> 3     Food       Dairy
#> 2 Non-Food    Sanitary
#> 4 Non-Food   Household
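
As an alternative to the chain of gsub calls, the renaming can be done with named lookup vectors. This is a sketch on a hand-built copy of the default hierarchy shown earlier; the names hier, categoryMap and subcategoryMap are illustrative and not part of the package. A lookup matches whole labels only, whereas a gsub pattern such as "Level_2_element_1" would also match inside a label like "Level_2_element_10" in a larger hierarchy.

```r
# Hand-built copy of the default labels shown in the earlier output.
hier <- data.frame(
  level1 = c("Level_1_element_1", "Level_1_element_2",
             "Level_1_element_1", "Level_1_element_2"),
  level2 = paste0("Level_2_element_", 1:4),
  stringsAsFactors = FALSE
)

# Named vectors act as lookup tables: name = old label, value = new label.
categoryMap <- c(Level_1_element_1 = "Food", Level_1_element_2 = "Non-Food")
subcategoryMap <- c(Level_2_element_1 = "Beverages", Level_2_element_2 = "Sanitary",
                    Level_2_element_3 = "Dairy", Level_2_element_4 = "Household")

renamed <- data.frame(
  category = unname(categoryMap[hier$level1]),
  subcategory = unname(subcategoryMap[hier$level2]),
  stringsAsFactors = FALSE
)
renamed <- renamed[order(renamed$category), ]
print(renamed)
```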

Build transactions

Now that the customer IDs and products are built, the next step is to build transactions. Transactions are built using the function genTrans, which takes 5 arguments. They are described below.

cycles. This represents the cyclicality of data. It can take the following values:
- “y”. If cycles is set to the value “y”, it means that there is only one instance of a high number of transactions during the entire year. This is a very common situation for some retail clients where the highest number of sales are during the holiday period in December.
- “q”. If cycles is set to the value “q”, it means that there are 4 instances of a high number of transactions. This is generally noticed in the financial services industry where the financial statements are revised every quarter and have an impact on the equity transactions in the secondary market.
- “m”. If cycles is set to the value “m”, it means that there are 12 instances of a high number of transactions for a year. This means that the number of transactions increases once every month and then subsides for the rest of the month.
spike. This represents the seasonality of data. It can take any value from 1 to 12. These numbers represent the months of a year, from January to December respectively. For example, if spike is set to 12, it means that December has the highest number of transactions.
trend. This represents the slope of data distribution. It can take a value of 1 or -1.
- If the trend is set to value 1, then the aggregated monthly transactions will exhibit an upward trend from January to December and vice versa if it is set to -1.
outliers. This signifies the presence of outliers. If set to value 1, then outliers are generated randomly. If set to value 0, then no outliers are generated. The presence of outliers is a very common occurrence and hence setting the outliers to 1 is recommended. However, there are instances where outliers are not needed. For example, if the objective of data generation is solely for visualization purposes then outliers may not be needed.
transactions. This represents the number of transactions to be generated.

Let us build transactions using the following code.

transactions <- genTrans(cycles = "y", spike = 12, trend = 1, outliers = 1, transactions = 10000)

Visualize the generated transactions using the following code.

TxnAggregated <- aggregate(transactions$transactionID, by = list(transactions$dayNum), length)
plot(TxnAggregated, type = "l", ann = FALSE)
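
To get a feel for what a yearly cycle with a December spike looks like, here is a toy simulation in base R. It is not genTrans's algorithm: the day weights (December days four times as likely as other days) are invented purely for illustration, and the year 2021 is chosen arbitrarily as a non-leap year.

```r
set.seed(1)  # only to make the sketch reproducible
days <- 1:365
# Month number of each day of a non-leap year.
mth <- as.integer(format(as.Date(days - 1, origin = "2021-01-01"), "%m"))
# Assumed weights: December days are four times as likely as any other day.
w <- ifelse(mth == 12, 4, 1)
# Draw a day number for each of 10000 simulated transactions.
simDay <- sample(days, size = 10000, replace = TRUE, prob = w)
# Count the simulated transactions per month.
simMonthly <- table(factor(mth[simDay], levels = 1:12))
print(simMonthly)
```

Aggregating by month, as done here with table, is also a quick way to check that data generated with spike = 12 peaks where expected.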

Build final data

Bringing customers, products and transactions together is the final step of generating synthetic data. This process entails 3 steps as given below.

Allocate customers to transactions

The allocation is achieved with the help of the buildPareto function. This function takes 3 arguments as detailed below.

factor1 and factor2. These are factors to be mapped to each other. As the name suggests, they must be of data type factor.
pareto. This defines the percentage allocation and is a numeric data type. This argument takes the form of c(x,y) where x and y are numeric and their sum is 100. If we set pareto to c(80,20), it then allocates 80 percent of factor1 to 20 percent of factor2. This is based on the well-known Pareto principle.

Let us first allocate transactions to customers using the following code.

customer2transaction <- buildPareto(customers, transactions$transactionID, pareto = c(80,20))

Assign readable names to the output by using the following code.

names(customer2transaction) <- c('transactionID', 'customer')

#inspect the output
print(head(customer2transaction))
#>   transactionID customer
#> 1    txn-117-28  cust056
#> 2     txn-49-05  cust026
#> 3    txn-283-18  cust090
#> 4    txn-313-25  cust003
#> 5    txn-196-15  cust063
#> 6     txn-93-01  cust081
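
The idea behind a Pareto allocation can be sketched in base R. The code below assumes one common reading of c(80,20), namely that 20 percent of the customers account for 80 percent of the transactions. It is an illustration only: the helper name sketchPareto is hypothetical, and buildPareto's internals may differ.

```r
# Hypothetical helper (not conjurer's implementation).
sketchPareto <- function(ids, items, pareto = c(80, 20)) {
  stopifnot(sum(pareto) == 100)
  # Assumption: the first pareto[2] percent of ids form the "heavy" group.
  nHeavy <- max(1, round(length(ids) * pareto[2] / 100))
  # The heavy group receives pareto[1] percent of the items.
  nHeavyItems <- round(length(items) * pareto[1] / 100)
  heavyIds <- ids[seq_len(nHeavy)]
  lightIds <- ids[-seq_len(nHeavy)]
  owner <- c(
    sample(heavyIds, nHeavyItems, replace = TRUE),
    sample(lightIds, length(items) - nHeavyItems, replace = TRUE)
  )
  # Shuffle so that heavy and light owners are spread across the items.
  data.frame(item = items, id = sample(owner), stringsAsFactors = FALSE)
}

set.seed(7)  # only to make the sketch reproducible
toyCustomers <- paste0("cust", formatC(1:100, width = 3, flag = "0"))
toyTxns <- paste0("txn", 1:10000)
alloc <- sketchPareto(toyCustomers, toyTxns, pareto = c(80, 20))
# Share of transactions held by the first 20 customers.
print(mean(alloc$id %in% toyCustomers[1:20]))
```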

Allocate products to product hierarchy

Allocate the products to the product hierarchy. This can be achieved as follows.

#First step is to ensure that the product hierarchy data frame has the same number of rows as number of products.
#rep_len recycles the 4 hierarchy rows over the 10 products without a recycling warning.
category <- rep_len(productHierarchy$category, nrow(products))
subcategory <- rep_len(productHierarchy$subcategory, nrow(products))
productHierarchy <- data.frame(category, subcategory)

#Assign the product hierarchy to the products, cycling through the hierarchy rows in order.
products <- cbind(products, productHierarchy[,c("category","subcategory")])
#inspect the output
print(head(products))
#>     SKU Price category subcategory
#> 1 sku01 21.09     Food   Beverages
#> 2 sku02  6.23     Food       Dairy
#> 3 sku03 31.77 Non-Food    Sanitary
#> 4 sku04 28.44 Non-Food   Household
#> 5 sku05 46.51     Food   Beverages
#> 6 sku06 13.06     Food       Dairy

Allocate products to transactions

Now, using a similar step, allocate transactions to products using the following code.

product2transaction <- buildPareto(products$SKU,transactions$transactionID,pareto = c(70,30))
names(product2transaction) <- c('transactionID', 'SKU')

#inspect the output
print(head(product2transaction))
#>   transactionID   SKU
#> 1    txn-209-17 sku03
#> 2     txn-66-14 sku08
#> 3    txn-120-19 sku10
#> 4     txn-25-02 sku10
#> 5    txn-184-23 sku03
#> 6    txn-340-03 sku03

Combine customers and transactions data

The following code brings together transactions, products and customers into one dataframe.

df1 <- merge(x = customer2transaction, y = product2transaction, by = "transactionID")

df2 <- merge(x = df1, y = transactions, by = "transactionID", all.x = TRUE)

#inspect the output
print(head(df2))
#>   transactionID customer   SKU dayNum mthNum
#> 1      txn-1-01  cust086 sku03      1      1
#> 2      txn-1-02  cust071 sku08      1      1
#> 3      txn-1-03  cust063 sku10      1      1
#> 4      txn-1-04  cust096 sku10      1      1
#> 5      txn-1-05  cust018 sku08      1      1
#> 6      txn-1-06  cust018 sku08      1      1
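
Because both merges join on transactionID, each transaction should appear exactly once in the result; a duplicated key in any input would multiply rows. The following sketch shows that check on a few hand-made rows. The toy data frames c2t, p2t and txn are illustrative stand-ins, not output of the package.

```r
# Toy stand-ins for customer2transaction, product2transaction and transactions.
c2t <- data.frame(transactionID = c("txn-1-01", "txn-1-02", "txn-2-01"),
                  customer = c("cust002", "cust001", "cust002"),
                  stringsAsFactors = FALSE)
p2t <- data.frame(transactionID = c("txn-1-01", "txn-1-02", "txn-2-01"),
                  SKU = c("sku03", "sku01", "sku03"),
                  stringsAsFactors = FALSE)
txn <- data.frame(transactionID = c("txn-1-01", "txn-1-02", "txn-2-01"),
                  dayNum = c(1, 1, 2),
                  mthNum = c(1, 1, 1),
                  stringsAsFactors = FALSE)

# Same two-step join as above: inner join first, then a left join.
toy1 <- merge(x = c2t, y = p2t, by = "transactionID")
toy2 <- merge(x = toy1, y = txn, by = "transactionID", all.x = TRUE)

# Sanity checks: one row per transaction and no unmatched day numbers.
print(nrow(toy2) == nrow(txn))
print(anyDuplicated(toy2$transactionID) == 0)
print(!anyNA(toy2$dayNum))
```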

Final data

We can add further data, such as customer name and product price, using the code below.

df3 <- merge(x = df2, y = customer2name, by.x = "customer", by.y = "customers", all.x = TRUE)
df4 <- merge(x = df3, y = customer2age, by.x = "customer", by.y = "customers", all.x = TRUE)
df5 <- merge(x = df4, y = customer2phone, by.x = "customer", by.y = "customers", all.x = TRUE)
df6 <- merge(x = df5, y = products, by = "SKU", all.x = TRUE)
dfFinal <- df6[,c("dayNum", "mthNum", "customer", "customerName", "customerAge", "customerPhone", "transactionID", "SKU", "Price", "category","subcategory")]


#inspect the output
print(head(dfFinal))
#>   dayNum mthNum customer customerName customerAge customerPhone transactionID
#> 1      2      1  cust029       nallin          57  +64(491)7823      txn-2-07
#> 2    307     11  cust012       dorian          41  +64(491)8218    txn-307-34
#> 3     75      3  cust012       dorian          41  +64(491)8218     txn-75-08
#> 4     80      3  cust030      shennic          59  +64(491)7966     txn-80-28
#> 5     67      3  cust046       sandal          80  +91(324)8305     txn-67-33
#> 6    108      4  cust068       kierti          77  +64(324)8279    txn-108-10
#>     SKU Price category subcategory
#> 1 sku01 21.09     Food   Beverages
#> 2 sku01 21.09     Food   Beverages
#> 3 sku01 21.09     Food   Beverages
#> 4 sku01 21.09     Food   Beverages
#> 5 sku01 21.09     Food   Beverages
#> 6 sku01 21.09     Food   Beverages

Thus, we have the final data set with transactions, customers and products.

Interpret the results

The column names of the final data frame can be interpreted as follows.

Each row is a transaction, and the data frame has all the transactions for a year, i.e., 365 days.
dayNum is the day number in the year. There would be 365 unique values of dayNum in the data frame.
mthNum is the month number. This ranges from 1 to 12 and represents January to December respectively.
customer is the unique customer identifier. This is the customer who made that transaction.
customerName is the name of the customer.
customerAge is the age of the customer.
customerPhone is the phone number of the customer.
transactionID is the unique identifier for that transaction.
SKU is the product ID that was bought in that transaction.
Price is the price of the product.
category is the product category.
subcategory is the product subcategory.

Let us visualize the results to understand the data distribution.

Below is a view of the number of transactions on each day.

aggregatedDataDay <- aggregate(dfFinal$transactionID, by = list(dfFinal$dayNum), length)
plot(aggregatedDataDay, type = "l", ann = FALSE)

Below is a view of the number of transactions in each month.

aggregatedDataMth <- aggregate(dfFinal$transactionID, by = list(dfFinal$mthNum), length)
aggregatedDataMthSorted <- aggregatedDataMth[order(aggregatedDataMth$Group.1),]
plot(aggregatedDataMthSorted, ann = FALSE)

Industry Example