--- title: Configuration output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Configuration} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- Datasets configuration can be provided in a `yaml` file or as a nested list. Below you can find a detailed description of possible options. ## Data frames in a dataset A single YAML file can include multiple data frames. Entry for each will be used as name of the data frame when it comes to generating data. ```yaml first_data_frame: ... second_data_frame: ... third_data_frame: ... ``` ## Data frame configuration Data frame configuration includes two sections: 1. `columns` - where you can describe columns of your data frame. 2. `default_size` - optional value that describes default size of the data frame. ### Columns Each column of your data frame should be described in a separate entry in columns section. Entry name will be used as column name. Currently there are three major types of columns implemented: 1. Built-in basic columns (integer, numeric, string, boolean and set) 2. Columns that use custom function to be generated. 3. Columns calculated from other columns. Type of column is set by choosing a proper `type` value in column description. Check following sections for more details. The order of columns will be the same as the order of entries in the configuration. #### Built-in columns Basic column types. For an example YAML configuration check [this](../examples/built_in_columns.yaml) ##### integer Random integers from a range Parameters: * `type: integer` - column type * `unique` (optional, default: FALSE) - boolean, should values be unique * `min` (optional, default: 0) - integer, minimum value to occur in the column. * `max` (optional, default: 999999) - integer, maximum value to occur in the column. Example: ```yaml data_frame: columns: integer_column: type: integer min: 2 max: 10 ``` ##### numeric Random float numbers from a range Parameters: * `type: numeric` - column type * `unique` (optional, default: FALSE) - boolean, should values be unique * `min` (optional, default: 0) - numeric, minimum value to occur in the column. * `max` (optional, default: 999999) - numeric, maximum value to occur in the column. Example: ```yaml data_frame: columns: numeric_column: type: numeric min: 2.12 max: 10.3 ``` ##### string Random string that follows given pattern Parameters: * `type: string` - column type * `unique` (optional, default: FALSE) - boolean, should values be unique * `length` (optional, default: NULL) - integer, string length. If NULL, string length will be random (see next parameters). * `min_length` (optional, default: 1) - integer, minimum length if length is random. * `max_length` (optional, default: 15) - integer, maximum length if length is random. * `pattern` (optional, default: "[A-Za-z0-9]") - string pattern, for details check [this](https://rdrr.io/cran/stringi/man/about_search_charclass.html). Example: ```yaml data_frame: columns: string_column: type: string length: 3 pattern: "[ACGT]" ``` ##### boolean Random boolean Parameters: * `type: boolean` - column type Example: ```yaml data_frame: columns: boolean_column: type: boolean ``` ##### set Column with elements from a set Parameters: * `type: set` - column type * `set` (optional, default: NULL) - set of possible values, if NULL, will use a random set. * `set_type` (optional, default: NULL) - type of random set, can be "integer", "numeric" or "string". * `set_size` (optional, default: NULL) - integer, size of random set * If set is random, you can add parameters required by type of set (e.g. min, max, pattern, etc.) Example: ```yaml data_frame: columns: set_column_one: type: set set: ["aardvark", "elephant", "hedgehog"] set_column_two: type: set set_type: integer set_size: 3 min: 2 max: 10 ``` ##### date Column with dates Parameters: * `type: date` - column type * `min_date` - beginning of the time interval to sample from * `max_date` - end of the time interval to sample from * `format` (optional, default: NULL) - date format, for details check [this](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/strptime) Example: ```yaml data_frame: columns: date_column: type: date min_date: 2012-03-31 max_date: 2015-12-23 ``` ##### time Column with times Parameters: * `type: time` - column type * `min_time` (optional, default: "00:00:00") - beginning of the time interval to sample from * `max_time` (optional, default: "23:59:59") - end of the time interval to sample from * `resolution` (optional, default: "seconds") - one of "seconds", "minutes", "hours", time resolution Example: ```yaml data_frame: columns: time_column: type: time min_time: "12:23:00" max_time: "15:48:32" resolution: "seconds" ``` ##### datetime Column with datetimes Parameters: * `type: datetime` - column type * `min_date` - beginning of the time interval to sample from * `max_date` - end of the time interval to sample from * `date_format` (optional, default: NULL) - date format, for details check [this](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/strptime) * `min_time` (optional, default: "00:00:00") - beginning of the time interval to sample from * `max_time` (optional, default: "23:59:59") - end of the time interval to sample from * `time_resolution` (optional, default: "seconds") - one of "seconds", "minutes", "hours", time resolution * `tz` (optional, default: "UTC") - time zone name Example: ```yaml data_frame: columns: time_column: type: datetime min_date: 2012-03-31 max_date: 2015-12-23 min_time: "12:23:00" max_time: "15:48:32" time_resolution: "seconds" ``` #### Special columns Special predefined types of columns. For an example YAML configuration check [this](https://github.com/jakubnowicki/fixtuRes/blob/master/examples/special_types.yaml) ##### id Id column - ordered integer that starts from defined value (default: 1). Parameters: * `type: id` - column type * `start` (optional, default: 1) - first value Example: ```yaml data_frame: columns: id_column: type: id start: 2 ``` ##### distribution Column filled with values that follow given statistical distribution. You can use one of the distributions available [here](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Distributions.html). You can use function name (e.g. `rnorm`) or regular distribution name (e.g. "Normal"). For available names, check [this file](https://github.com/jakubnowicki/fixtuRes/blob/master/inst/distributions.yaml). Parameters: * `type: distribution` - column type * `distribution_type` - distribution name * `...` - all arguments required by distribution function Example: ```yaml data_frame: columns: normal_distribution: type: distribution distribution_type: Gaussian bernoulli_distribution: type: distribution distribution_type: binomial size: 1 prob: 0.5 poisson_distribution: type: distribution distribution_type: Poisson lambda: 3 beta_distribution: type: distribution distribution_type: rbeta shape1: 20 shape2: 30 cauchy_distribution: type: distribution distribution_type: Cauchy-Lorentz ``` #### Custom columns There are two levels of custom generator that can be used. You can provide a function that generates a single value or a function that provides a whole column. For examples check [this configuration](https://github.com/jakubnowicki/fixtuRes/blob/master/examples/custom_columns.yaml) and [this R script with functions](https://github.com/jakubnowicki/fixtuRes/blob/master/examples/additional_functions.R). ##### custom value generator Generate column values using custom function available in your environment. Function should return a single value. Parameters: * `type: custom` - column type * `custom_generator` - name of the function that will provide values. * All parameters required by your custom function. Example: ```r return_sample_paste <- function(vector_of_values) { values <- sample(vector_of_values, 2) paste(values, collapse = "_") } ``` ```yaml data_frame: columns: custom_column: type: custom custom_generator: return_sample_paste vector_of_values: ["a", "b", "c", "d"] ``` ##### custom column generator Generate column using custom function available in your environment. Function should accept argument `size` and return a vector of length equal to it. Parameters: * `type: custom_column` - column type * `custom_column_generator` - name of the function that will generate column. * All parameters required by your custom function except `size`. Example: ```r return_repeated_value <- function(size, value) { rep(value, times = size) } ``` ```yaml data_frame: columns: custom_column: type: custom_column custom_column_generator: return_repeated_value value: "Ask me about trilobites!" ``` #### Calculated columns Calculate columns that depend on other columns. For examples check [this configuration](https://github.com/jakubnowicki/fixtuRes/blob/master/examples/calculated_columns.yaml) and [this R script with functions](https://github.com/jakubnowicki/fixtuRes/blob/master/examples/additional_functions.R). Parameters: * `type: calculated` - column type * `formula` - calculation that has to be performed to obtain column In general, formula can be a simple expression or a call of more complex function. In both cases formula has to include names of the columns required for the calculations. When using a function, make sure that it returns a vector of the same size as inputs. Example: ```r check_column <- function(column) { purrr::map_lgl(column, ~.x >= 10) } ``` ```yaml data_frame: columns: basic_column: type: integer min: 1 max: 10 second_basic_column: type: integer min: 1 max: 10 calculated_column: type: calculated formula: basic_column + second_basic_column second_calculated_column: type: calculated formula: check_column(calculated_column) ``` ### Default size Data frame can have a default number of rows that will be returned if size argument is not provided. Default size can be one of: * not provided - generator will return a random number of rows (from 5 to 50) * integer - single value, number of rows Example: ```yaml data_frame: columns: ... default_size: 10 ``` * random integer - you can provide arguments to `random_integer` function. Result can be a static value (if `static: TRUE` provided) or a random number generator. The first one will generate a number of rows just once ant that number will be used when data is refreshed (without providing a specific size). Example: ```yaml random_number_of_rows: columns: ... default_size: arguments: min: 10 max: 20 static_random_number_of_rows: columns: ... default_size: arguments: min: 5 max: 10 static: TRUE ``` For sample YAML configuration check [this](https://github.com/jakubnowicki/fixtuRes/blob/master/examples/default_size_examples.yaml). ### Arrange data frame Data frame can be arranged by columns by providing a list of column names as `arange` field. Example: ```yaml data_frame: columns: a: ... b: ... c: ... d: ... arrange: [a, c] ```