--- title: "Introduction to conjurer" author: "Sidharth Macherla" date: "2023-01-14" bibliography: bibliography.bib link-citations: TRUE output: html_document: number_sections: FALSE vignette: > %\VignetteIndexEntry{Industry Example} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Summary {-} The package Conjurer offers synthetic data distribution functionality to generate data that seems real. To that extent, the functions in this package help generate distributions in a parametric method. This means that the randomness of the data generation is preserved while allowing the user to define the constraints of the randomness. Such a controlled randomness will aid in the generation of multiple data distributions to simulate real world as well as unrealistic examples of data. This paper provides insights and usage of the functions in a more detailed manner than provided in the manual of the package. This paper presents each function as a sub section and provides an overview of the purpose and details examples with source code. ## Continuous data ### Description The function *buildNum* is used to generate continuous data distribution. The continuous data in the context of this package relates to the float data type and not continuous in the context of signal processing. Although the data distribution generated is a float data type, this can be rounded off to simulate discrete data distribution. At the core, this function uses a modified form of sine curve and therefore lends itself to manipulation such that the dispersion of the data can be skewed on purpose. The dispersion of the data can be controlled by the parameter *disp* which takes a value between *(-pi/2)* and *(pi/2)*. In order to make the data more realistic, the parameter *outliers* can be set to *1*. It must be noted that the outliers may produce results where data could be beyond the range of the data requested i.e. *st* and *en* This functionality can be used to generate univariate distributions. ### Usage The following code illustrates the process of generating continuous data with and without outliers. ```{r, eval=TRUE, echo=TRUE, results='markup'} #invoke library library("conjurer") set.seed(123) continuousData <- buildNum(n = 10, st = 0, en = 1, disp = (pi/3), outliers = 0) continuousDataOutlier <- buildNum(n = 10, st = 0, en = 1, disp = (pi/3), outliers = 1) par(mfrow=c(1,2)) plot(continuousData) plot(continuousDataOutlier) ``` ## String data ### Description The function *buildName* is used to generate string data. This function uses probabilistic distribution of the alpabet sequences. Unlike more advanced algorithms such as conditional random fields, this function uses a more basic approach of probability of an alphabet given the probability of the alplhabet preceding it. To this extent, the function sources a data frame of string data based on which the posterior probabilities are generated. Since the generation is based on posterior probabilities, there needs to be sufficiently large data frame such that all possible permutations of the alphabets are present. If no data frame is provided, a default data frame of predetermined set of baby names is used. ### Usage The following code illustrates the process of generating of alphabet sequences based on the default data frame provided in the package as well as a mocked up data of three short parts of a ficticious genome sequence. ```{r, eval=TRUE, echo=TRUE, results='markup'} #invoke library library("conjurer") set.seed(123) buildNames(numOfNames = 3, minLength = 5, maxLength = 7) d <- data.frame (first_column = c("ATGACGAGAGAGAGCA", "ATGACGAGAGAGCAGAGA","TACTGCTCTCTCGTAAATCG")) buildNames(dframe=d, numOfNames = 3, minLength = 5, maxLength = 5) ``` *Note: It can be observed that since the data frame of genome sequences is small, the package throws a warning that there is not enough training data* ## Alpha Numeric data ### Description The function *buildId* is used to generate the alphanumeric. In its current state the alphanumeric is a sequence of data with a string prefix followed by an incremental numeric data. This data can be used a unique identifier of an element or in cases of database schema, this can be used as a primary key of a table. ### Usage The following code illustrates the process of generating a unique specimen id for a given number of elements. ```{r, eval=TRUE, echo=TRUE, results='markup'} #invoke library library("conjurer") buildId(numOfItems = 3, prefix = "specID") ``` ## Sequencial data ### Description The function *buildPattern* is used to generate a sequence i.e. a predetermined pattern of data. This function can be considered as an intuitive form of finite state automaton or a regular expression. A pattern is built as a probabilistic combination of *parts*. ### Usage The following code illustrates the process of generating a pattern of phone numbers and IP addresses. The *parts* are generated based on the respective probabilities given in the *probs*. ```{r, eval=TRUE, echo=TRUE, results='markup'} #invoke library library("conjurer") set.seed(123) parts <- list(c(172),c("."),c(16:31), c("."), c(0:255), c("."), c(0:255)) probs <- list(c(), c(),c(),c(), c(), c(), c()) buildPattern(n=5,parts = parts, probs = probs) parts <- list(c("+11","+44","+64"), c("-"), c(491,324,211), c(7821:8324)) probs <- list(c(0.25,0.25,0.50), c(), c(0.30,0.60,0.10), c()) buildPattern(n=5,parts = parts, probs = probs) ``` ## Graph data ### Description The function *buildHierarchy* is used to generate graph data i.e. hierarchical data. Based on the number of levels and splits, the tree structure is built. The graph data is then presented in the form of a data frame. ### Usage The following code illustrates the process of generating a tree with 2 splits at each node and a depth of three levels. ```{r, eval=TRUE, echo=TRUE, results='markup'} #invoke library library("conjurer") buildHierarchy(splits = 2, numOfLevels = 3) ``` ## Relationship data ### Description The function *buildPareto* is used to map data elements to each other. This function helps in mapping or linking variables. Such a linking or mapping helps in multiple use cases such as build a data frame from a set of variables, building data distribution of one variable in relation to another. ### Usage The following code illustrates the process of generating a mapping between two factors such that 30 percent of one factor is linked to 70 percent of another factor. ```{r, eval=TRUE, echo=TRUE, results='markup'} #invoke library library("conjurer") set.seed(123) f1 <- factor(c(1:10)) f2 <- factor(letters[1:12], labels = "f") buildPareto(factor1 = f1, factor2 = f2, pareto = c(70,30)) ```