yaml
Generate YAML data generation file from a DDL SQL file
dbworkload
can assist with generating a stub of the data generation definition file, required by the csv
command.
Example
-
Create a DDL file. Let's call it
bank.ddl
.-- file: bank.ddl CREATE TABLE ref_data ( acc_no BIGINT PRIMARY KEY, external_ref_id UUID, created_time TIMESTAMPTZ, acc_details VARCHAR ); CREATE TABLE orders ( acc_no BIGINT NOT NULL, id UUID NOT NULL default gen_random_uuid(), status VARCHAR NOT NULL, amount DECIMAL(15, 2), ts TIMESTAMPTZ default now(), CONSTRAINT pk PRIMARY KEY (acc_no, id) );
-
Run the
yaml
command, passing inbank.ddl
as the input.dbworkload util yaml -i bank.ddl
-
Open
bank.yaml
output and inspect YAML tree. This is a configuration file with instructions on what type of data to generate. Check section Types for a full list of available object types and arguments.ref_data: - count: 100 sort-by: [] columns: acc_no: type: integer args: min: -9223372036854775807 max: 9223372036854775807 seed: 64 null_pct: 0.3 array: 0 external_ref_id: type: uuid args: seed: 76 null_pct: 0.25 array: 0 created_time: type: timestamp args: start: "2000-01-01" end: "2024-12-31" format: "%Y-%m-%d %H:%M:%S.%f" seed: 78 null_pct: 0.37 array: 0 acc_details: type: string args: min: 10 max: 30 prefix: "" seed: 54 null_pct: 0.38 array: 0 orders: - count: 100 sort-by: [] columns: acc_no: type: integer args: min: -9223372036854775807 max: 9223372036854775807 seed: 83 null_pct: 0.0 array: 0 id: type: uuid args: seed: 49 null_pct: 0.0 array: 0 status: type: string args: min: 10 max: 30 prefix: "" seed: 58 null_pct: 0.0 array: 0 amount: type: float args: min: 0 max: 10000000000000 round: 2 seed: 11 null_pct: 0.33 array: 0 ts: type: timestamp args: start: "2000-01-01" end: "2024-12-31" format: "%Y-%m-%d %H:%M:%S.%f" seed: 100 null_pct: 0.26 array: 0
Types
Below is a table with all available generators and their arguments
Generator | Description | Arguments | Default |
---|---|---|---|
constant | Returns value over and over |
value str |
"simplefaker" |
sequence | Returns an int increased by 1 starting from start |
start int |
0 |
integer | Returns an int between min and max |
min int |
1 |
max int |
1,000,000,000 | ||
float | Returns a decimal between min and max with round precision |
min int |
1 |
max int |
1,000,000 | ||
round int |
2 | ||
string | Returns a str using chars [A-Za-z0-9] of length between min and max prefixed by prefix |
min int |
10 |
max int |
50 | ||
prefix str |
<blank> | ||
json | Returns a JSON string of length between min and max |
min int |
10 |
max int |
50 | ||
choice | Picks an item population . Optionally set [cum_]weights . See docs |
population list[str] |
["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"] |
weights list[int] |
empty list | ||
cum_weights list[int] |
empty list | ||
timestamp | Returns a random timestamp formatted as format between start and end . |
start str |
"2000-01-01" |
start and end must be valid date representation |
end str |
"2024-12-31" | |
strftime quick ref | format str |
"%Y-%m-%d %H:%M:%S.%f" | |
date | start str |
"2000-01-01" | |
end str |
"2024-12-31" | ||
format str |
"%Y-%m-%d" | ||
time | start str |
"07:30:00" | |
end str |
"22:30:00" | ||
micros bool |
false | ||
uuid | Returns the string representation of a UUIDv4 | ||
bool | Returns either 0 or 1 | ||
bit | Returns a bit of size size |
size int |
10 |
bytes | Returns bytes of size size |
size int |
10 |
custom | Uses a python class defined in path |
path str |
|
See example usage |
Furthermore, all but sequence
take these common arguments.
Arguments | Description | Default |
---|---|---|
seed float |
The random generator seed number | random |
null_pct float |
The percentange of NULL values, currently defined as an empty string "" |
0 |
array int |
Size of the ARRAY | 0 |
json
does not take array
.
constant
does not take seed
and array
since it returns the same value.
Custom type generator
With a custom generator, you can define your own logic to generate any type of random data.
The custom generator is nothing more than a python class that creates a python generator,
that is, it implements the __next__()
function.
Below example illustrate how it works.
In the YAML file, you must pass path
in the args
section,
with the path to your python file.
Optionally, you can pass a seed
and you can declare as many args
as you want.
In this example, we only declare 1 arg, my_arg
.
my_table:
- count: 5
sort-by: []
columns:
my_custom_generator:
type: custom
args:
seed: 1
path: play/greetings.py
my_arg: Hello
The file play/greetings.py
yields a generated value using your arbitrarily complex logic
import random
class Greetings:
def __init__(
self,
seed: float,
my_arg: str = "",
null_pct: float = 0,
array: int = 0,
):
self.array = array
self.null_pct = null_pct
self.rng: random.Random = random.Random(seed)
self.my_arg = my_arg
self.names = ['Joe', 'Xin', 'Sue', 'Mia', 'Ram']
def __next__(self):
if self.null_pct and self.rng.random() < self.null_pct:
return ""
else:
return f"{self.my_arg} {self.rng.choice(self.names)}!"
Executing dbworkload util csv
returns these rows
Hello Ram!
Hello Xin!
Hello Joe!
Hello Ram!
Hello Mia!
The file name must match the class name: file greetings.py
declares class Greetings
.
The __init__()
function must accept seed
, null_pct
and array
as arguments.
You don't have to use them if you don't want to, but they are expected.
YAML Structure
The structure of the YAML file might look intuitive already, but here is a formal description.
For every CREATE TABLE
statement in the input file, you'll find a key with the same name
ref_data:
- count: 100
sort-by: []
columns:
# truncated for brevity
orders:
- count: 100
sort-by: []
columns:
# truncated for brevity
The value type of the table key is a list: notice the -
prefixing the first item of the list, count
.
This is helpful if you want to generate detasets with different seed numbers for the same table, as shown in an example below. Mostly though, the list will contain only 1 item.
The items in the list are of type dict with 3 keys:
count
: takes anint
as a value and represents the desired count of rows in the dataset you want to generate.sort-by
: takes a list ofstr
. Here you pass the exact column names you want to sort by, in ascending order.columns
: takes a dict as value. The dict uses the column names in theCREATE TABLE
as keys.
Here's an example of generating 1000 rows sorted by acc_no
, a column of type integer
with default args
.
ref_data:
- count: 1000
sort-by:
- acc_no
columns:
acc_no:
type: integer
# notice the absence of the `args` key
Each key in columns
has a dict as value with 2 keys:
type
: the generator type you want to use.args
: yet another dictionary with the specific generator arguments and values.
Here is an example of column external_ref_id
, which features all available arguments for its type.
ref_data:
- count: 1000
sort-by:
- acc_no
columns:
acc_no:
type: integer
external_ref_id:
type: uuid
args:
seed: 76
null_pct: 0.25
array: 0
See Types right above for a full list of all available generator types and their arguments.
Foreign Key relationship
A common scenario is to create datasets that must be linked, for example, because of a Foreign Key constraint in the table schemas.
Here is a quick example of how to use seed
to generate same keys across 2 columns in different tables.
Example tables linked by a foreign key reletionship:
CREATE TABLE p (
id UUID,
v STRING,
CONSTRAINT pk PRIMARY KEY (id)
);
CREATE TABLE c (
id UUID,
p_id UUID NOT NULL,
v STRING,
CONSTRAINT pk PRIMARY KEY (id),
CONSTRAINT pid_in_p FOREIGN KEY (p_id) REFERENCES p (id)
);
Create a YAML file so that the data generated for the p.id
column matches c.p_id
.
Notice how the seed
number, 1, is the same for table p
and some rows in c
.
p:
- count: 10
sort-by: []
columns:
id:
type: uuid
args:
seed: 1
v:
type: string
c:
- count: 2
sort-by: []
columns:
id:
type: uuid
p_id:
type: uuid
args:
seed: 1
v:
type: string
- count: 8
sort-by: []
columns:
id:
type: uuid
p_id:
type: uuid
v:
type: string
For table p
, this will create:
- a 10 rows dataset.
For table c
, it will create:
- a 2 rows dataset where
c.p_id
is the same asp.id
; - a 8 rows dataset with random rows.
That gives a FK relationship of 2:10 between c
and p
.
# file: p.0_0_0.tsv
19578af7-2a49-4cdc-95ff-6a05e79ff16b Tp5zbRSFtugZQsYJEIERXjLrozw
877c814b-66b2-48f3-8d2e-80cbbd74cdb7 NGFfjmbegU7n
ceff7bf0-0dde-4c54-b3f2-1bcf7a7fb908 4zAUXFkiHGTOlvvHk6bq
c3aba496-5410-41e6-8d48-8da450d302fc OsKz6zG3dPYvEcqBMltDQGqrfpVUnSbsBeiH
071f951d-e4ae-4c90-9b64-eb57a915b273 OCWXgMcnpNBaODfEXRCHiHqjqD2ZGn
e277823f-c19a-497f-8582-f29ef3fb9957 vpC0wqGT53vBkwgGUNrTtw6pbLsV3PEy3BnsqEASk0IWj0pQbI
bbe37e2c-d874-4ca2-8978-4463078e4206 ksMTt13ZFXVmnt41hnjvDtdJyoZ
32ad9dbf-0a25-463e-ba1f-f248edf3a62a QtLnVqUoUOauO
fa98f2c6-00b9-4570-b36c-9673225a7afd yREqhAzY1YBx3cQ990
ba66edaf-04e7-4217-beb2-671a8faa1530 rv14QnjStbeUjtpIAdAeVuIvETSj
Below the 2 rows CSV generated for table c
.
The 2nd column, p_id
, has matching values to the first column in above file.
# file: cat c.0_0_0.tsv
5d037e2b-ce8f-4e7d-a99e-4f8efe6349f9 19578af7-2a49-4cdc-95ff-6a05e79ff16b Q1OYAmNzDJHKrpCChkyzNQE7wuruIa8Hkb
ec8d601c-2500-4f6e-9fc6-697634f4bcd8 877c814b-66b2-48f3-8d2e-80cbbd74cdb7 VAY6PpDTDKxaDBdyVK6W02fMM6Eko492fe3pXTf9JEMgA