yaml

Generate YAML data generation file from a DDL SQL file

dbworkload can assist with generating a stub of the data generation definition file, required by the csv command.

Example

Create a DDL file. Let's call it bank.ddl.

-- file: bank.ddl

CREATE TABLE ref_data (
    acc_no BIGINT PRIMARY KEY,
    external_ref_id UUID,
    created_time TIMESTAMPTZ,
    acc_details VARCHAR
);

CREATE TABLE orders (
    acc_no BIGINT NOT NULL,
    id UUID NOT NULL default gen_random_uuid(),
    status VARCHAR NOT NULL,
    amount DECIMAL(15, 2),
    ts TIMESTAMPTZ default now(),
    CONSTRAINT pk PRIMARY KEY (acc_no, id)
);

Run the yaml command, passing in bank.ddl as the input.
```
dbworkload util yaml -i bank.ddl
```

Open bank.yaml output and inspect YAML tree. This is a configuration file with instructions on what type of data to generate. Check section Types for a full list of available object types and arguments.

ref_data:
  - count: 100
    sort-by: []
    columns:
      acc_no:
        type: integer
        args:
          min: -9223372036854775807
          max: 9223372036854775807
          seed: 64
          null_pct: 0.3
          array: 0
      external_ref_id:
        type: uuid
        args:
          seed: 76
          null_pct: 0.25
          array: 0
      created_time:
        type: timestamp
        args:
          start: "2000-01-01"
          end: "2024-12-31"
          format: "%Y-%m-%d %H:%M:%S.%f"
          seed: 78
          null_pct: 0.37
          array: 0
      acc_details:
        type: string
        args:
          min: 10
          max: 30
          prefix: ""
          seed: 54
          null_pct: 0.38
          array: 0
orders:
  - count: 100
    sort-by: []
    columns:
      acc_no:
        type: integer
        args:
          min: -9223372036854775807
          max: 9223372036854775807
          seed: 83
          null_pct: 0.0
          array: 0
    id:
      type: uuid
      args:
        seed: 49
        null_pct: 0.0
        array: 0
    status:
      type: string
      args:
        min: 10
        max: 30
        prefix: ""
        seed: 58
        null_pct: 0.0
        array: 0
    amount:
      type: float
      args:
        min: 0
        max: 10000000000000
        round: 2
        seed: 11
        null_pct: 0.33
        array: 0
    ts:
      type: timestamp
      args:
        start: "2000-01-01"
        end: "2024-12-31"
        format: "%Y-%m-%d %H:%M:%S.%f"
        seed: 100
        null_pct: 0.26
        array: 0

Types

Below is a table with all available generators and their arguments

Generator	Description	Arguments	Default
constant	Returns `value` over and over	value `str`	"simplefaker"
sequence	Returns an `int` increased by 1 starting from `start`	start `int`	0
integer	Returns an `int` between `min` and `max`	min `int`	1
		max `int`	1,000,000,000
float	Returns a decimal between `min` and `max` with `round` precision	min `int`	1
		max `int`	1,000,000
		round `int`	2
string	Returns a `str` using chars `[A-Za-z0-9]` of length between `min` and `max` prefixed by `prefix`	min `int`	10
		max `int`	50
		prefix `str`	<blank>
json	Returns a JSON string of length between `min` and `max`	min `int`	10
		max `int`	50
choice	Picks an item `population`. Optionally set `[cum_]weights`. See docs	population `list[str]`	["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
		weights `list[int]`	empty list
		cum_weights `list[int]`	empty list
timestamp	Returns a random timestamp formatted as `format` between `start` and `end`.	start `str`	"2000-01-01"
	`start` and `end` must be valid date representation	end `str`	"2024-12-31"
	strftime quick ref	format `str`	"%Y-%m-%d %H:%M:%S.%f"
date		start `str`	"2000-01-01"
		end `str`	"2024-12-31"
		format `str`	"%Y-%m-%d"
time		start `str`	"07:30:00"
		end `str`	"22:30:00"
		micros `bool`	false
uuid	Returns the string representation of a UUIDv4
bool	Returns either 0 or 1
bit	Returns a `bit` of size `size`	size `int`	10
bytes	Returns `bytes` of size `size`	size `int`	10
custom	Uses a python class defined in `path`	path `str`
	See example usage

Furthermore, all but sequence take these common arguments.

Arguments	Description	Default
seed `float`	The random generator seed number	random
null_pct `float`	The percentange of NULL values, currently defined as an empty string `""`	0
array `int`	Size of the ARRAY	0

json does not take array.

constant does not take seed and array since it returns the same value.

Custom type generator

With a custom generator, you can define your own logic to generate any type of random data.

The custom generator is nothing more than a python class that creates a python generator, that is, it implements the __next__() function.

Below example illustrate how it works.

In the YAML file, you must pass path in the args section, with the path to your python file.

Optionally, you can pass a seed and you can declare as many args as you want. In this example, we only declare 1 arg, my_arg.

my_table:
- count: 5
  sort-by: []
  columns:
    my_custom_generator:
      type: custom
      args:
        seed: 1
        path: play/greetings.py
        my_arg: Hello

The file play/greetings.py yields a generated value using your arbitrarily complex logic

import random


class Greetings:
    def __init__(
        self,
        seed: float,
        my_arg: str = "",
        null_pct: float = 0,
        array: int = 0,
    ):
        self.array = array
        self.null_pct = null_pct
        self.rng: random.Random = random.Random(seed)
        self.my_arg = my_arg
        self.names = ['Joe', 'Xin', 'Sue', 'Mia', 'Ram']

    def __next__(self):
        if self.null_pct and self.rng.random() < self.null_pct:
            return ""
        else:
            return f"{self.my_arg} {self.rng.choice(self.names)}!"

Executing dbworkload util csv returns these rows

Hello Ram!
Hello Xin!
Hello Joe!
Hello Ram!
Hello Mia!

The file name must match the class name: file greetings.py declares class Greetings.

The __init__() function must accept seed, null_pct and array as arguments. You don't have to use them if you don't want to, but they are expected.

YAML Structure

The structure of the YAML file might look intuitive already, but here is a formal description.

For every CREATE TABLE statement in the input file, you'll find a key with the same name

ref_data:
  - count: 100
    sort-by: []
    columns:
    # truncated for brevity
orders:
  - count: 100
    sort-by: []
    columns:
    # truncated for brevity

The value type of the table key is a list: notice the - prefixing the first item of the list, count.

This is helpful if you want to generate detasets with different seed numbers for the same table, as shown in an example below. Mostly though, the list will contain only 1 item.

The items in the list are of type dict with 3 keys:

count: takes an int as a value and represents the desired count of rows in the dataset you want to generate.
sort-by: takes a list of str. Here you pass the exact column names you want to sort by, in ascending order.
columns: takes a dict as value. The dict uses the column names in the CREATE TABLE as keys.

Here's an example of generating 1000 rows sorted by acc_no, a column of type integer with default args.

ref_data:
  - count: 1000
    sort-by:
      - acc_no
    columns:
      acc_no:
        type: integer
        # notice the absence of the `args` key

Each key in columns has a dict as value with 2 keys:

type: the generator type you want to use.
args: yet another dictionary with the specific generator arguments and values.

Here is an example of column external_ref_id, which features all available arguments for its type.

ref_data:
  - count: 1000
    sort-by:
      - acc_no
    columns:
      acc_no:
        type: integer
      external_ref_id:
        type: uuid
        args:
          seed: 76
          null_pct: 0.25
          array: 0

See Types right above for a full list of all available generator types and their arguments.

Foreign Key relationship

A common scenario is to create datasets that must be linked, for example, because of a Foreign Key constraint in the table schemas.

Here is a quick example of how to use seed to generate same keys across 2 columns in different tables.

Example tables linked by a foreign key reletionship:

CREATE TABLE p (
    id UUID,
    v STRING,
    CONSTRAINT pk PRIMARY KEY (id)
);

CREATE TABLE c (
    id UUID,
    p_id UUID NOT NULL,
    v STRING,
    CONSTRAINT pk PRIMARY KEY (id),
    CONSTRAINT pid_in_p FOREIGN KEY (p_id) REFERENCES p (id)
);

Create a YAML file so that the data generated for the p.id column matches c.p_id. Notice how the seed number, 1, is the same for table p and some rows in c.

p:
  - count: 10
    sort-by: []
    columns:
      id:
        type: uuid
        args:
          seed: 1
      v:
        type: string
c:
- count: 2
  sort-by: []
  columns:
    id:
      type: uuid
    p_id:
      type: uuid
      args:
        seed: 1
    v:
      type: string
- count: 8
  sort-by: []
  columns:
    id:
      type: uuid
    p_id:
      type: uuid
    v:
      type: string

For table p, this will create:

a 10 rows dataset.

For table c, it will create:

a 2 rows dataset where c.p_id is the same as p.id;
a 8 rows dataset with random rows.

That gives a FK relationship of 2:10 between c and p.

# file: p.0_0_0.tsv

19578af7-2a49-4cdc-95ff-6a05e79ff16b    Tp5zbRSFtugZQsYJEIERXjLrozw
877c814b-66b2-48f3-8d2e-80cbbd74cdb7    NGFfjmbegU7n
ceff7bf0-0dde-4c54-b3f2-1bcf7a7fb908    4zAUXFkiHGTOlvvHk6bq
c3aba496-5410-41e6-8d48-8da450d302fc    OsKz6zG3dPYvEcqBMltDQGqrfpVUnSbsBeiH
071f951d-e4ae-4c90-9b64-eb57a915b273    OCWXgMcnpNBaODfEXRCHiHqjqD2ZGn
e277823f-c19a-497f-8582-f29ef3fb9957    vpC0wqGT53vBkwgGUNrTtw6pbLsV3PEy3BnsqEASk0IWj0pQbI
bbe37e2c-d874-4ca2-8978-4463078e4206    ksMTt13ZFXVmnt41hnjvDtdJyoZ
32ad9dbf-0a25-463e-ba1f-f248edf3a62a    QtLnVqUoUOauO
fa98f2c6-00b9-4570-b36c-9673225a7afd    yREqhAzY1YBx3cQ990
ba66edaf-04e7-4217-beb2-671a8faa1530    rv14QnjStbeUjtpIAdAeVuIvETSj

Below the 2 rows CSV generated for table c. The 2nd column, p_id, has matching values to the first column in above file.

# file: cat c.0_0_0.tsv 

5d037e2b-ce8f-4e7d-a99e-4f8efe6349f9    19578af7-2a49-4cdc-95ff-6a05e79ff16b    Q1OYAmNzDJHKrpCChkyzNQE7wuruIa8Hkb
ec8d601c-2500-4f6e-9fc6-697634f4bcd8    877c814b-66b2-48f3-8d2e-80cbbd74cdb7    VAY6PpDTDKxaDBdyVK6W02fMM6Eko492fe3pXTf9JEMgA