Introduction to conjurer

Summary

The package conjurer provides functions to generate synthetic data that resembles real data. To that end, the functions in this package generate distributions parametrically: the randomness of the data generation is preserved while the user defines the constraints on that randomness. Such controlled randomness aids in generating multiple data distributions that simulate real-world as well as unrealistic examples of data. This paper describes the functions and their usage in more detail than the package manual. Each function is presented in its own subsection, with an overview of its purpose and examples with source code.

Continuous data

Description

The function buildNum is used to generate a continuous data distribution. In the context of this package, continuous data refers to the float data type, not to continuity in the signal-processing sense. Although the generated distribution is of float type, it can be rounded to simulate a discrete data distribution. At its core, the function uses a modified form of a sine curve and therefore lends itself to manipulation, so that the dispersion of the data can be skewed on purpose. The dispersion is controlled by the parameter disp, which takes a value between (-pi/2) and (pi/2). To make the data more realistic, the parameter outliers can be set to 1. Note that outliers may produce values beyond the requested range, i.e. outside st and en. This functionality can be used to generate univariate distributions.

Usage

The following code illustrates the process of generating continuous data with and without outliers.

#invoke library
library("conjurer")

set.seed(123)
#same settings with outliers switched off (0) and on (1)
continuousData <- buildNum(n = 10, st = 0, en = 1, disp = (pi/3), outliers = 0)
continuousDataOutlier <- buildNum(n = 10, st = 0, en = 1, disp = (pi/3), outliers = 1)
#plot both distributions side by side
par(mfrow=c(1,2))
plot(continuousData)
plot(continuousDataOutlier)
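
The rounding mentioned in the description can be sketched as follows. This is a minimal example that assumes buildNum returns a plain numeric vector, as the plot calls above suggest. The range 0 to 100 and the variable names floatData and discreteData are chosen only for illustration.

```r
library("conjurer")

set.seed(123)
# Generate float data on a wider range so that rounding does not
# collapse every value to 0 or 1.
floatData <- buildNum(n = 10, st = 0, en = 100, disp = (pi/3), outliers = 0)

# Round to whole numbers to simulate a discrete distribution.
# Assumption: floatData is a numeric vector.
discreteData <- round(floatData)
```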

String data

Description

The function buildNames is used to generate string data. This function uses a probabilistic distribution of alphabet sequences. Unlike more advanced algorithms such as conditional random fields, it takes a more basic approach: the probability of a letter given the letter preceding it. To this end, the function takes a data frame of string data, from which the posterior probabilities are generated. Since the generation is based on posterior probabilities, the data frame needs to be large enough that all possible permutations of the letters are present. If no data frame is provided, a default data frame containing a predetermined set of baby names is used.
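
The idea of choosing each letter based on the one before it can be sketched in base R. The code below is an illustrative first-order transition model, not the implementation used by buildNames. The helper names buildTransitions and generateName and the training vector trainNames are hypothetical.

```r
# Hypothetical training data; NOT the default data shipped with conjurer.
trainNames <- c("anna", "hannah", "nathan", "ethan", "hana")

# Count how often each letter is followed by each other letter.
buildTransitions <- function(words) {
  symbols <- sort(unique(unlist(strsplit(words, ""))))
  counts <- matrix(0, nrow = length(symbols), ncol = length(symbols),
                   dimnames = list(symbols, symbols))
  for (w in words) {
    chars <- strsplit(w, "")[[1]]
    if (length(chars) < 2) next
    for (i in seq_len(length(chars) - 1)) {
      counts[chars[i], chars[i + 1]] <- counts[chars[i], chars[i + 1]] + 1
    }
  }
  counts
}

# Generate one sequence by repeatedly sampling the next letter in
# proportion to the observed transition counts.
generateName <- function(counts, len) {
  symbols <- rownames(counts)
  current <- sample(symbols, 1)
  out <- current
  while (length(out) < len) {
    rowCounts <- counts[current, ]
    if (sum(rowCounts) == 0) break  # no observed successor: stop early
    current <- sample(symbols, 1, prob = rowCounts)
    out <- c(out, current)
  }
  paste(out, collapse = "")
}

set.seed(123)
transitions <- buildTransitions(trainNames)
generated <- generateName(transitions, len = 5)
```

In this sketch, a letter with no observed successor ends the sequence early. That is one way in which sparse training data can lead to sequences shorter than the requested length.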

Usage

The following code illustrates the process of generating alphabet sequences, first from the default data frame provided in the package and then from a mocked-up data set of three short parts of a fictitious genome sequence.

#invoke library
library("conjurer")

set.seed(123)
buildNames(numOfNames = 3, minLength = 5, maxLength = 7)
#> [1] "jonnet" "jaceyn" "ronni"

d <- data.frame (first_column  = c("ATGACGAGAGAGAGCA", "ATGACGAGAGAGCAGAGA","TACTGCTCTCTCGTAAATCG"))
buildNames(dframe=d, numOfNames = 3, minLength = 5, maxLength = 5)
#> Warning in buildNames(dframe = d, numOfNames = 3, minLength = 5, maxLength =
#> 5): Training data is not large enough. Expect less than minimum length names
#> and/or names that do not seem like training data
#> [1] "tt"   "tt"   "aaac"

Note: Since the data frame of genome sequences is small, the package issues a warning that there is not enough training data, and the generated sequences are shorter than the requested length.

Alphanumeric data

Description

The function buildId is used to generate alphanumeric data. In its current state, the alphanumeric data is a sequence consisting of a string prefix followed by an incrementing number. This data can be used as a unique identifier of an element or, in the case of a database schema, as the primary key of a table.

Usage

The following code illustrates the process of generating a unique specimen id for a given number of elements.

#invoke library
library("conjurer")

buildId(numOfItems = 3, prefix = "specID")
#> [1] "specID1" "specID2" "specID3"
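
The output shown above can be reproduced in base R, which makes the prefix-plus-counter structure explicit. This is a sketch of the idea and not necessarily how buildId is implemented. The variable names are illustrative.

```r
# Base R sketch: string prefix followed by an incrementing number.
prefix <- "specID"
numOfItems <- 3
ids <- paste0(prefix, seq_len(numOfItems))

# A primary key must be unique; anyDuplicated() returns 0 when no
# duplicates exist.
isUnique <- anyDuplicated(ids) == 0
```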

Sequential data

Description

The function buildPattern is used to generate a sequence, i.e. a predetermined pattern of data. This function can be considered an intuitive form of a finite state automaton or a regular expression. A pattern is built as a probabilistic combination of parts.

Usage

The following code illustrates the process of generating patterns of IP addresses and phone numbers. Each part is drawn according to the respective probabilities given in probs.

#invoke library
library("conjurer")

set.seed(123)
parts <- list(c(172),c("."),c(16:31), c("."), c(0:255), c("."), c(0:255))
probs <- list(c(), c(),c(),c(), c(), c(), c())
buildPattern(n=5,parts = parts, probs = probs)
#> [1] "159.18.194.49" "118.20.13.152" "90.31.242.184" "92.24.98.25"  
#> [5] "7.24.210.77"

parts <- list(c("+11","+44","+64"), c("-"), c(491,324,211), c(7821:8324))
probs <- list(c(0.25,0.25,0.50), c(), c(0.30,0.60,0.10), c())
buildPattern(n=5,parts = parts, probs = probs)
#> [1] "+64-3248193" "+64-3248310" "+64-3248245" "+64-4918264" "+64-3248231"
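
The probabilistic combination of parts can be sketched in base R as follows. This is an illustrative sketch and not the implementation of buildPattern. The helpers pickOne and sketchPattern are hypothetical, and an empty probs entry is assumed here to mean equal probability. The sketch samples an index rather than the values themselves, because in base R sample(172, 1) draws from 1:172 instead of returning 172. That behaviour may explain why the first part of the IP addresses printed above is not 172, although this is an inference from the output and not a statement about the package source.

```r
# Hypothetical helper: draw one element of a part, optionally weighted.
pickOne <- function(part, prob = NULL) {
  # Sample a position, so a single number such as 172 is returned as-is.
  idx <- sample.int(length(part), size = 1, prob = prob)
  as.character(part[idx])
}

# Hypothetical helper: build n patterns by concatenating one draw per part.
sketchPattern <- function(n, parts, probs) {
  vapply(seq_len(n), function(i) {
    pieces <- vapply(seq_along(parts), function(j) {
      pickOne(parts[[j]], probs[[j]])
    }, character(1))
    paste(pieces, collapse = "")
  }, character(1))
}

set.seed(123)
ipParts <- list(c(172), c("."), c(16:31), c("."), c(0:255), c("."), c(0:255))
ipProbs <- list(c(), c(), c(), c(), c(), c(), c())
ips <- sketchPattern(n = 5, parts = ipParts, probs = ipProbs)
```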

Graph data

Description

The function buildHierarchy is used to generate graph data i.e. hierarchical data. Based on the number of levels and splits, the tree structure is built. The graph data is then presented in the form of a data frame.

Usage

The following code illustrates the process of generating a tree with two splits at each node and a depth of three levels.

#invoke library
library("conjurer")

buildHierarchy(splits = 2, numOfLevels = 3)
#>              level1            level2            level3
#> 1 Level_1_element_1 Level_2_element_1 Level_3_element_1
#> 2 Level_1_element_2 Level_2_element_2 Level_3_element_2
#> 3 Level_1_element_1 Level_2_element_3 Level_3_element_3
#> 4 Level_1_element_2 Level_2_element_4 Level_3_element_4
#> 5 Level_1_element_1 Level_2_element_1 Level_3_element_5
#> 6 Level_1_element_2 Level_2_element_2 Level_3_element_6
#> 7 Level_1_element_1 Level_2_element_3 Level_3_element_7
#> 8 Level_1_element_2 Level_2_element_4 Level_3_element_8
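
The table above has splits^numOfLevels = 8 rows, and the element number at level k cycles through 1 to splits^k. The base R sketch below reproduces that layout. The rule is inferred from the printed output only, so it should be read as an assumption about the pattern and not as the package's algorithm. The variable names are illustrative.

```r
splits <- 2
numOfLevels <- 3
nRows <- splits^numOfLevels

# For level k, label each row with an element number that cycles
# through 1..splits^k.
cols <- lapply(seq_len(numOfLevels), function(k) {
  paste0("Level_", k, "_element_", ((seq_len(nRows) - 1) %% splits^k) + 1)
})
names(cols) <- paste0("level", seq_len(numOfLevels))
sketch <- as.data.frame(cols, stringsAsFactors = FALSE)
```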

Relationship data

Description

The function buildPareto is used to map data elements to each other, i.e. to link variables. Such a mapping helps in multiple use cases, such as building a data frame from a set of variables or building the data distribution of one variable in relation to another.

Usage

The following code illustrates the process of generating a mapping between two factors such that 30 percent of one factor is linked to 70 percent of another factor.

#invoke library
library("conjurer")
set.seed(123)
f1 <- factor(c(1:10))
f2 <- factor(letters[1:12], labels = "f")

buildPareto(factor1 = f1, factor2 = f2, pareto = c(70,30))
#>    factor2 factor1
#> 1      f10       5
#> 2       f8       6
#> 3       f4       6
#> 4       f9       7
#> 5       f3       5
#> 6       f6       5
#> 7      f11       6
#> 8      f12       5
#> 9       f5       9
#> 10      f1      10
#> 11      f7       2
#> 12      f2       4
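
The concentration in the mapping above can be checked with a short tabulation. The sketch below retypes the printed output as a data frame and counts how many factor2 elements each factor1 level receives. The names mapping, counts, topThree and shareTopThree are illustrative.

```r
# The mapping printed above, retyped by hand.
mapping <- data.frame(
  factor2 = paste0("f", c(10, 8, 4, 9, 3, 6, 11, 12, 5, 1, 7, 2)),
  factor1 = c(5, 6, 6, 7, 5, 5, 6, 5, 9, 10, 2, 4)
)

# Number of factor2 elements linked to each factor1 level, largest first.
counts <- sort(table(mapping$factor1), decreasing = TRUE)

# Share of all factor2 elements carried by the three busiest factor1 levels.
topThree <- head(counts, 3)
shareTopThree <- sum(topThree) / nrow(mapping)
```

Here the three most frequent factor1 levels, i.e. 30 percent of the ten levels, receive 8 of the 12 factor2 elements, which is close to the 70 percent requested.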