---
title: Configuration
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Configuration}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

Dataset configuration can be provided in a YAML file or as a nested list. The available options are described in detail below.

## Data frames in a dataset

A single YAML file can include multiple data frames. The name of each top-level entry is used as the name of the corresponding data frame when data is generated.

```yaml
first_data_frame:
  ...
second_data_frame:
  ...
third_data_frame:
  ...
```
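The same structure can also be written as a nested R list. The following is a minimal sketch, assuming the list mirrors the YAML keys one-to-one; the column names and values are illustrative and not taken from the package examples.

```r
# Nested-list form of a configuration with two data frames.
# Data frame names match the YAML above; columns are illustrative.
configuration <- list(
  first_data_frame = list(
    columns = list(
      id = list(type = "id"),
      score = list(type = "integer", min = 0, max = 100)
    ),
    default_size = 10
  ),
  second_data_frame = list(
    columns = list(
      flag = list(type = "boolean")
    )
  )
)

# Top-level names correspond to data frame names.
names(configuration)
```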

## Data frame configuration

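Data frame configuration includes the following sections:

1. `columns` - where you describe the columns of your data frame.
2. `default_size` - optional value that describes the default size of the data frame.
3. `arrange` - optional list of column names used to order the rows of the data frame.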

### Columns

Each column of your data frame should be described in a separate entry in
the `columns` section. The entry name will be used as the column name.

Currently there are four major types of columns implemented:

1. Built-in basic columns (integer, numeric, string, boolean, set, date, time and datetime).
2. Special predefined columns (id and distribution).
3. Columns generated by a custom function.
4. Columns calculated from other columns.

The type of a column is set by choosing the proper `type` value in the column
description. See the following sections for more details.

The order of columns will be the same as the order of entries in the configuration.

#### Built-in columns

Basic column types. For an example YAML configuration, check [this](../examples/built_in_columns.yaml).

##### integer

Random integers from a range

Parameters:

* `type: integer` - column type
* `unique` (optional, default: FALSE) - boolean, should values be unique
* `min` (optional, default: 0) - integer, minimum value to occur in the column.
* `max` (optional, default: 999999) - integer, maximum value to occur in the column.

Example:

```yaml
data_frame:
  columns:
    integer_column:
      type: integer
      min: 2
      max: 10
```

##### numeric

Random floating-point numbers from a range

Parameters:

* `type: numeric` - column type
* `unique` (optional, default: FALSE) - boolean, should values be unique
* `min` (optional, default: 0) - numeric, minimum value to occur in the column.
* `max` (optional, default: 999999) - numeric, maximum value to occur in the column.

Example:

```yaml
data_frame:
  columns:
    numeric_column:
      type: numeric
      min: 2.12
      max: 10.3
```

##### string

Random strings that follow a given pattern

Parameters:

* `type: string` - column type
* `unique` (optional, default: FALSE) - boolean, should values be unique
* `length` (optional, default: NULL) - integer, string length. If NULL, string length will be random (see next parameters).
* `min_length` (optional, default: 1) - integer, minimum length if length is random.
* `max_length` (optional, default: 15) - integer, maximum length if length is random.
* `pattern` (optional, default: "[A-Za-z0-9]") - string pattern, for details check [this](https://rdrr.io/cran/stringi/man/about_search_charclass.html).

Example:

```yaml
data_frame:
  columns:
    string_column:
      type: string
      length: 3
      pattern: "[ACGT]"
```
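
To make the `length` and `pattern` parameters concrete, the following base R sketch produces values of the same shape as the example above. It only illustrates the expected output and is not the package's implementation; `random_dna_string` is a hypothetical helper.

```r
# Hypothetical helper: builds one string of fixed length from an explicit
# alphabet, mimicking `length: 3` with `pattern: "[ACGT]"`.
random_dna_string <- function(length, alphabet = c("A", "C", "G", "T")) {
  paste(sample(alphabet, length, replace = TRUE), collapse = "")
}

values <- vapply(1:5, function(i) random_dna_string(3), character(1))

# Every value has 3 characters, each drawn from the alphabet.
all(grepl("^[ACGT]{3}$", values))
```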

##### boolean

Random boolean

Parameters:

* `type: boolean` - column type

Example:

```yaml
data_frame:
  columns:
    boolean_column:
      type: boolean
```

##### set

Column with elements from a set

Parameters:

* `type: set` - column type
* `set` (optional, default: NULL) - set of possible values, if NULL, will use a random set.
* `set_type` (optional, default: NULL) - type of random set, can be "integer", "numeric" or "string".
* `set_size` (optional, default: NULL) - integer, size of random set
* If the set is random, you can add the parameters required by the set type (e.g. `min`, `max`, `pattern`).

Example:

```yaml
data_frame:
  columns:
    set_column_one:
      type: set
      set: ["aardvark", "elephant", "hedgehog"]
    set_column_two:
      type: set
      set_type: integer
      set_size: 3
      min: 2
      max: 10
```
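
A random set can be read as a two-step process: first a set of `set_size` values is drawn, then the column is sampled from that set. Here is a base R sketch of this reading for `set_column_two`, assuming the set is drawn without repetition and the column is sampled with replacement (neither is stated above). The row count is an illustrative value.

```r
# Step 1: draw a random set of 3 integers between 2 and 10.
random_set <- sample(2:10, size = 3)

# Step 2: fill a column of 8 rows with values taken from that set.
set_column_two <- sample(random_set, size = 8, replace = TRUE)

# The column contains at most 3 distinct values, all from the set.
all(set_column_two %in% random_set)
```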

##### date

Column with dates

Parameters:

* `type: date` - column type
* `min_date` - beginning of the time interval to sample from
* `max_date` - end of the time interval to sample from
* `format` (optional, default: NULL) - date format, for details check [this](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/strptime)

Example:

```yaml
data_frame:
  columns:
    date_column:
      type: date
      min_date: 2012-03-31
      max_date: 2015-12-23
```

##### time

Column with times

Parameters:

* `type: time` - column type
* `min_time` (optional, default: "00:00:00") - beginning of the time interval to sample from
* `max_time` (optional, default: "23:59:59") - end of the time interval to sample from
* `resolution` (optional, default: "seconds") - time resolution, one of "seconds", "minutes", "hours"

Example:

```yaml
data_frame:
  columns:
    time_column:
      type: time
      min_time: "12:23:00"
      max_time: "15:48:32"
      resolution: "seconds"
```

##### datetime

Column with datetimes

Parameters:

* `type: datetime` - column type
* `min_date` - beginning of the time interval to sample from
* `max_date` - end of the time interval to sample from
* `date_format` (optional, default: NULL) - date format, for details check [this](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/strptime)
* `min_time` (optional, default: "00:00:00") - beginning of the time interval to sample from
* `max_time` (optional, default: "23:59:59") - end of the time interval to sample from
* `time_resolution` (optional, default: "seconds") - time resolution, one of "seconds", "minutes", "hours"
* `tz` (optional, default: "UTC") - time zone name

Example:

```yaml
data_frame:
  columns:
    datetime_column:
      type: datetime
      min_date: 2012-03-31
      max_date: 2015-12-23
      min_time: "12:23:00"
      max_time: "15:48:32"
      time_resolution: "seconds"
```

#### Special columns

Special predefined types of columns. For an example YAML configuration check [this](https://github.com/jakubnowicki/fixtuRes/blob/master/examples/special_types.yaml)

##### id

ID column - consecutive integers in ascending order, starting from a defined value (default: 1).

Parameters:

* `type: id` - column type
* `start` (optional, default: 1) - first value

Example:

```yaml
data_frame:
  columns:
    id_column:
      type: id
      start: 2
```
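
For a data frame with five rows, the example above would be expected to give the values below. This is a base R sketch that assumes the step between consecutive IDs is 1; the row count is illustrative.

```r
start <- 2
size <- 5  # illustrative number of rows

id_column <- seq(from = start, length.out = size)
id_column  # 2 3 4 5 6
```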

##### distribution

Column filled with values that follow a given statistical distribution.
You can use any of the distributions available [here](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Distributions.html). A distribution can be
referred to by its function name (e.g. `rnorm`) or by its regular name (e.g. "Normal").
For available names, check [this file](https://github.com/jakubnowicki/fixtuRes/blob/master/inst/distributions.yaml).

Parameters:

* `type: distribution` - column type
* `distribution_type` - distribution name
* `...` - all arguments required by distribution function

Example:

```yaml
data_frame:
  columns:
    normal_distribution:
      type: distribution
      distribution_type: Gaussian
    bernoulli_distribution:
      type: distribution
      distribution_type: binomial
      size: 1
      prob: 0.5
    poisson_distribution:
      type: distribution
      distribution_type: Poisson
      lambda: 3
    beta_distribution:
      type: distribution
      distribution_type: rbeta
      shape1: 20
      shape2: 30
    cauchy_distribution:
      type: distribution
      distribution_type: Cauchy-Lorentz
```
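
For orientation, these are the base R `stats` calls that the example columns appear to correspond to. The sketch assumes that each distribution name resolves to the matching `r*` function and that parameters left out of the configuration keep the function defaults. The row count `n` is an illustrative value.

```r
n <- 5  # illustrative number of rows

normal_distribution    <- rnorm(n)                       # mean = 0, sd = 1 by default
bernoulli_distribution <- rbinom(n, size = 1, prob = 0.5)
poisson_distribution   <- rpois(n, lambda = 3)
beta_distribution      <- rbeta(n, shape1 = 20, shape2 = 30)
cauchy_distribution    <- rcauchy(n)                     # location = 0, scale = 1 by default
```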

#### Custom columns

There are two levels of custom generators that can be used.
You can provide a function that generates a single value or
a function that generates a whole column. For examples, check
[this configuration](https://github.com/jakubnowicki/fixtuRes/blob/master/examples/custom_columns.yaml) and
[this R script with functions](https://github.com/jakubnowicki/fixtuRes/blob/master/examples/additional_functions.R).

##### custom value generator

Generate column values using a custom function available in your environment. The function should return a single value.

Parameters:

* `type: custom` - column type
* `custom_generator` - name of the function that will provide values.
* All parameters required by your custom function.

Example:

```r
return_sample_paste <- function(vector_of_values) {
  values <- sample(vector_of_values, 2)
  paste(values, collapse = "_")
}
```

```yaml
data_frame:
  columns:
    custom_column:
      type: custom
      custom_generator: return_sample_paste
      vector_of_values: ["a", "b", "c", "d"]
```
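
Because the function returns a single value, one call is needed per row. Below is a base R sketch of that idea, assuming the extra configuration entries are passed to the function as named arguments. How the package actually invokes the function is not described here, and the row count is illustrative.

```r
return_sample_paste <- function(vector_of_values) {
  values <- sample(vector_of_values, 2)
  paste(values, collapse = "_")
}

# Extra entries from the column configuration, as a named list.
arguments <- list(vector_of_values = c("a", "b", "c", "d"))

# One call per row; 4 rows here.
custom_column <- vapply(
  seq_len(4),
  function(i) do.call(return_sample_paste, arguments),
  character(1)
)
```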

##### custom column generator

Generate a column using a custom function available in your environment.
The function should accept a `size` argument and return a vector of that length.

Parameters:

* `type: custom_column` - column type
* `custom_column_generator` - name of the function that will generate column.
* All parameters required by your custom function except `size`.

Example:

```r
return_repeated_value <- function(size, value) {
  rep(value, times = size)
}
```

```yaml
data_frame:
  columns:
    custom_column:
      type: custom_column
      custom_column_generator: return_repeated_value
      value: "Ask me about trilobites!"
```
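
In contrast to a value generator, a column generator is called once and must honour the `size` contract itself. The sketch below assumes `size` is supplied by the caller and merged with the configured arguments; it shows the contract, not the package's internals.

```r
return_repeated_value <- function(size, value) {
  rep(value, times = size)
}

# Arguments taken from the column configuration (everything except size).
arguments <- list(value = "Ask me about trilobites!")

# A single call produces the whole column; size is added by the caller.
custom_column <- do.call(return_repeated_value, c(list(size = 3), arguments))

# The result must have exactly `size` elements.
length(custom_column) == 3
```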

#### Calculated columns

Calculate columns that depend on other columns. For examples check
[this configuration](https://github.com/jakubnowicki/fixtuRes/blob/master/examples/calculated_columns.yaml) and
[this R script with functions](https://github.com/jakubnowicki/fixtuRes/blob/master/examples/additional_functions.R).

Parameters:

* `type: calculated` - column type
* `formula` - calculation that has to be performed to obtain column

In general, a formula can be a simple expression or a call to a more complex
function. In both cases the formula has to include the names of the columns required for the calculation. When using a function, make sure that
it returns a vector of the same length as its inputs.

Example:

```r
check_column <- function(column) {
  purrr::map_lgl(column, ~.x >= 10)
}
```

```yaml
data_frame:
  columns:
    basic_column:
      type: integer
      min: 1
      max: 10
    second_basic_column:
      type: integer
      min: 1
      max: 10
    calculated_column:
      type: calculated
      formula: basic_column + second_basic_column
    second_calculated_column:
      type: calculated
      formula: check_column(calculated_column)
```
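
The example above can be traced by hand in base R. This sketch assumes that formulas are evaluated with the already generated columns in scope, in configuration order; `with()` stands in for whatever evaluation mechanism the package uses. A vectorized comparison replaces `purrr::map_lgl` to keep the sketch free of dependencies, and the input values are fixed for illustration.

```r
check_column <- function(column) {
  column >= 10  # vectorized, returns one logical per input element
}

df <- data.frame(
  basic_column = c(1L, 5L, 10L),
  second_basic_column = c(2L, 5L, 3L)
)

# Each formula sees the columns created before it.
df$calculated_column <- with(df, basic_column + second_basic_column)
df$second_calculated_column <- with(df, check_column(calculated_column))

df$calculated_column         # 3 10 13
df$second_calculated_column  # FALSE TRUE TRUE
```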

### Default size

A data frame can have a default number of rows, which is returned when the
`size` argument is not provided. Default size can be one of:

* not provided - the generator will return a random number of rows (from 5 to 50)
* integer - a single value, the number of rows

Example:

```yaml
data_frame:
  columns:
    ...
  default_size: 10
```

* random integer - you can provide arguments to the `random_integer` function. The result can be a static value (if `static: TRUE` is provided) or a random number generator. A static value is drawn just once, and that number is reused whenever data is refreshed without a specific size.

Example:

```yaml
random_number_of_rows:
  columns:
    ...
  default_size:
    arguments:
      min: 10
      max: 20
static_random_number_of_rows:
  columns:
    ...
  default_size:
    arguments:
      min: 5
      max: 10
    static: TRUE
```

For a sample YAML configuration, check [this](https://github.com/jakubnowicki/fixtuRes/blob/master/examples/default_size_examples.yaml).
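
The difference between the static and non-static variants can be sketched with a closure. This illustrates the described behaviour and is not the package's code; `make_default_size` is a hypothetical helper.

```r
# Hypothetical helper: returns a function that yields the default size.
make_default_size <- function(min, max, static = FALSE) {
  # Draw one integer from min..max (safe even when min == max).
  draw <- function() min + sample.int(max - min + 1, 1) - 1
  if (static) {
    fixed <- draw()   # drawn once, reused on every refresh
    function() fixed
  } else {
    draw              # drawn again on every refresh
  }
}

static_size <- make_default_size(5, 10, static = TRUE)
random_size <- make_default_size(10, 20)

static_size() == static_size()  # always TRUE
random_size()                   # between 10 and 20, may differ per call
```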

### Arrange data frame

The rows of a data frame can be ordered by one or more columns by providing a list of column names in the `arrange` field.

Example:

```yaml
data_frame:
  columns:
    a:
      ...
    b:
      ...
    c:
      ...
    d:
      ...
  arrange: [a, c]
```