Generator Types
Phony provides seven generator types under a unified abstraction: four core types (Logic, List, Model, Template) and three advanced types (Statistical, Linked, Event Sequence). Each serves a specific purpose, and the types can be combined: a template, for example, can reference logic, list, and model generators.
Overview: When to Use Each Type
┌─────────────────────────────────────────────────────────────────────────┐
│ GENERATOR TYPE SELECTION GUIDE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Question: What kind of data do you need? │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐│
│ │ ││
│ │ Is it COMPUTED from rules/algorithms? ││
│ │ (UUIDs, random numbers, dates, sequences) ││
│ │ │ ││
│ │ └─▶ YES → LOGIC GENERATOR ││
│ │ ││
│ │ Is it from a FINITE, VALID set? ││
│ │ (Countries, HTTP codes, currencies - must be correct) ││
│ │ │ ││
│ │ └─▶ YES → LIST GENERATOR ││
│ │ ││
│ │ Should it FEEL NATURAL for a locale? ││
│ │ (Names, addresses - should sound Turkish/English/etc.) ││
│ │ │ ││
│ │ └─▶ YES → MODEL GENERATOR (N-gram) ││
│ │ ││
│ │ Is it COMPOSED from multiple sources? ││
│ │ (Full address, email, formatted output) ││
│ │ │ ││
│ │ └─▶ YES → TEMPLATE GENERATOR ││
│ │ ││
│ └─────────────────────────────────────────────────────────────────────┘│
│ │
│ KEY INSIGHT: │
│ ───────────── │
│ • CORRECT data → List (pick from valid options) │
│ • REALISTIC data → Model (statistically similar) │
│ • COMPUTED data → Logic (pure algorithm) │
│ • COMPOSED data → Template (combine sources) │
│ │
│ HTTP 404 must BE exactly 404 (List) │
│ A Turkish name should FEEL Turkish (Model) │
│ A UUID must be algorithmically valid (Logic) │
│ An email combines name + domain (Template) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Type 1: Logic Generators
Pure algorithmic generation that doesn't require external data sources.
Characteristics
- No data files needed
- Deterministic given same seed
- Language/locale independent
- Fastest execution
Use Cases
| Generator | Output | Example |
|---|---|---|
| uuid_v4 | UUID version 4 | 550e8400-e29b-41d4-a716-446655440000 |
| uuid_v7 | UUID version 7 (time-sortable) | 018d6e5c-5c3a-7f1e-8b5a-2c4d6e8f0a1b |
| ulid | ULID | 01ARZ3NDEKTSV4RRFFQ69G5FAV |
| int_between | Random integer | 42 |
| float_between | Random float | 3.14159 |
| boolean | Random boolean | true |
| datetime_between | Random datetime | 2024-03-15T14:30:00Z |
| date_between | Random date | 2024-03-15 |
| time_between | Random time | 14:30:00 |
| timestamp | Unix timestamp | 1710512400 |
PDL Syntax
{
"generators": {
"user_id": {
"type": "logic",
"algorithm": "uuid_v7"
},
"age": {
"type": "logic",
"algorithm": "int_between",
"params": { "min": 18, "max": 85 }
},
"price": {
"type": "logic",
"algorithm": "float_between",
"params": { "min": 0.01, "max": 9999.99, "precision": 2 }
},
"created_at": {
"type": "logic",
"algorithm": "datetime_between",
"params": { "start": "2023-01-01", "end": "now" }
},
"is_active": {
"type": "logic",
"algorithm": "boolean",
"params": { "probability": 0.85 }
}
}
}
Available Algorithms
| Algorithm | Params | Description |
|---|---|---|
| uuid_v4 | - | Random UUID |
| uuid_v7 | - | Time-sortable UUID |
| ulid | - | Universally Unique Lexicographically Sortable ID |
| nanoid | length | Nano ID |
| int_between | min, max | Random integer |
| float_between | min, max, precision | Random float |
| boolean | probability | Random boolean |
| datetime_between | start, end, format | Random datetime |
| date_between | start, end, format | Random date |
| time_between | start, end, format | Random time |
| timestamp | start, end | Unix timestamp |
| sequence | start, step | Auto-incrementing |
| gaussian | mean, stddev | Normal distribution |
| exponential | lambda | Exponential distribution |
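To make "deterministic given same seed" concrete, here is a minimal Python sketch of a few of the algorithms in the table. It is not Phony's implementation: the function names simply mirror the algorithm names, and `sample_row` and the use of one seeded `random.Random` per row are assumptions made for illustration.

```python
import random
from itertools import count

def int_between(rng, low, high):
    """Uniform integer in [low, high], mirroring the min/max params."""
    return rng.randint(low, high)

def float_between(rng, low, high, precision=2):
    """Uniform float in [low, high], rounded to `precision` decimals."""
    return round(rng.uniform(low, high), precision)

def boolean(rng, probability=0.5):
    """True with the given probability."""
    return rng.random() < probability

def sequence(start=1, step=1):
    """Auto-incrementing values: start, start + step, ..."""
    return count(start, step)

def sample_row(seed):
    # Hypothetical helper: one seeded RNG drives every field of the row.
    rng = random.Random(seed)
    return {
        "age": int_between(rng, 18, 85),
        "price": float_between(rng, 0.01, 9999.99, precision=2),
        "is_active": boolean(rng, probability=0.85),
    }
```

Calling `sample_row` twice with the same seed returns the same row, because no state exists outside the seeded generator.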
Type 2: List Generators
Selection from predefined, finite sets of valid values.
Characteristics
- Data must be correct (valid HTTP codes, real countries)
- Can be locale-specific or universal
- Fast O(1) random selection
- Supports weighted selection
Use Cases
| Category | Examples | Why List? |
|---|---|---|
| Standards | HTTP methods, status codes | Must be valid codes |
| Geography | Countries, currencies | ISO standards |
| Enums | Status values, categories | Application-specific |
| Lookup | Cities, districts | Real place names |
PDL Syntax
{
"generators": {
"city": {
"type": "list",
"source": "lists/geo/cities.json"
},
"http_method": {
"type": "list",
"source": "inline",
"values": ["GET", "POST", "PUT", "DELETE", "PATCH"],
"locale_independent": true
},
"order_status": {
"type": "list",
"source": "inline",
"values": [
{ "value": "completed", "weight": 60 },
{ "value": "pending", "weight": 25 },
{ "value": "cancelled", "weight": 10 },
{ "value": "refunded", "weight": 5 }
]
},
"country": {
"type": "list",
"source": "lists/geo/countries.json"
}
}
}
Note: List files returning objects (like countries with { name, code, phone_prefix }) allow accessing nested properties via {{country.code}}.
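The weighted `order_status` list above can be read as "pick each value in proportion to its weight". The following Python sketch shows one way to do that with the standard library; `pick_weighted` is a hypothetical helper, and treating plain values as weight 1 is an assumption rather than documented Phony behavior.

```python
import random

ORDER_STATUS = [
    {"value": "completed", "weight": 60},
    {"value": "pending", "weight": 25},
    {"value": "cancelled", "weight": 10},
    {"value": "refunded", "weight": 5},
]

def pick_weighted(entries, rng):
    """Pick one entry; plain values (no weight) are assumed to count as weight 1."""
    values, weights = [], []
    for entry in entries:
        if isinstance(entry, dict) and "value" in entry:
            values.append(entry["value"])
            weights.append(entry.get("weight", 1))
        else:
            values.append(entry)
            weights.append(1)
    return rng.choices(values, weights=weights, k=1)[0]
```

Over many draws, "completed" should appear in roughly 60% of results. Note that `random.choices` builds cumulative weights on each call, so this sketch is linear in the list size; a production implementation would likely precompute them.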
List File Formats
Simple Array (JSON):
["İstanbul", "Ankara", "İzmir", "Bursa", "Antalya"]
Weighted Array (JSON):
[
{ "value": "İstanbul", "weight": 40 },
{ "value": "Ankara", "weight": 20 },
{ "value": "İzmir", "weight": 15 },
{ "value": "Bursa", "weight": 10 },
{ "value": "Antalya", "weight": 15 }
]
Object with Metadata (JSON):
[
{ "name": "Turkey", "code": "TR", "phone_prefix": "+90", "currency": "TRY" },
{ "name": "Germany", "code": "DE", "phone_prefix": "+49", "currency": "EUR" },
{ "name": "United States", "code": "US", "phone_prefix": "+1", "currency": "USD" }
]
Accessing Nested Properties
{
"generators": {
"country": {
"type": "list",
"source": "lists/geo/countries.json"
},
"country_code": {
"type": "template",
"pattern": "{{country.code}}"
},
"phone_prefix": {
"type": "template",
"pattern": "{{country.phone_prefix}}"
}
}
}
Type 3: Model Generators (N-gram)
Statistical generation that produces realistic, locale-specific output.
Deep Dive: For complete N-gram model architecture, training configuration, and generation algorithms, see N-gram Models.
Characteristics
- Trained on real data samples
- Produces statistically similar output (not identical)
- Locale-specific (Turkish names sound Turkish)
- Supports constraints (length, prefix)
Use Cases
| Domain | Why Model? |
|---|---|
| Person names | Should feel natural for the locale |
| Company names | Follow language patterns |
| Street names | Locale-specific naming conventions |
| Product names | Natural language patterns |
| Usernames | Realistic patterns |
PDL Syntax
Model generators require a generation block specifying how to produce output:
{
"generators": {
"first_name": {
"type": "model",
"source": "models/person_names.ngram",
"generation": { "mode": "word" },
"constraints": { "min_length": 3, "max_length": 12 }
},
"username": {
"type": "model",
"source": "models/usernames.ngram",
"generation": { "mode": "word" },
"constraints": { "min_length": 4, "max_length": 16 }
},
"company_name": {
"type": "model",
"source": "models/company_names.ngram",
"generation": {
"mode": "word",
"params": { "starts_with": "A" }
}
},
"tagline": {
"type": "model",
"source": "models/slogans.ngram",
"generation": {
"mode": "sentence",
"params": {
"word_count": "{{number:3-8}}",
"punctuation": [".", "!"]
}
}
},
"bio": {
"type": "model",
"source": "models/text.ngram",
"generation": {
"mode": "text",
"params": {
"max_chars": 200,
"suffix": "..."
}
}
}
}
}
Generation Modes
| Mode | Output | Use Case |
|---|---|---|
| word | Single word | Names, usernames |
| sentence | Single sentence | Taglines, mottos |
| text | Truncated text | Bios, descriptions |
| paragraph | Full paragraph | Articles, reviews |
| poem | Poem with stanzas | Creative content |
| acrostic | Acrostic poem | Hidden messages |
| real_word | Pick from training | Known/real names |
Note: Generation parameters can reference other generators using {{generator_name}} or inline random syntax like {{number:3-8}}. See N-gram Models for details.
Training Models
Models are trained using the phony CLI:
# Character mode (default) - for names, usernames
phony train names.txt -o models/names.ngram
# Word mode - for company names, product names, sentences
phony train companies.txt -o models/companies.ngram --token-type word
# Full configuration example
phony train names.txt \
--output models/tr_TR/names.ngram \
--ngram-size 4 \
--token-type char \
--min-word-length 3 \
--position-depth 5 \
--fallback prefix \
--exclude-originals \
--lowercase
# Train from different sources
phony train customers.csv --column first_name -o models/names.ngram
phony train data.json --path "$.users[*].name" -o models/names.ngram
Token Modes
| Mode | Description | Use Case |
|---|---|---|
| char | Character-level N-grams | Names, usernames, made-up words |
| word | Word-level N-grams | Company names, product names, sentences |
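To illustrate what character-level N-grams mean in practice, the sketch below trains a tiny trigram table on a handful of names and walks it to produce a new word. It is a simplified illustration, not Phony's training or generation algorithm: the function names, the `^`/`$` padding markers, and the retry loop for `min_length` are all assumptions, and `n` must be at least 2.

```python
import random
from collections import defaultdict

START, END = "^", "$"   # hypothetical padding markers, not part of the .ngram format

def train_char_ngrams(words, n=3):
    """Map each (n-1)-character context to the characters observed after it."""
    table = defaultdict(list)
    for word in words:
        padded = START * (n - 1) + word + END
        for i in range(len(padded) - n + 1):
            context, nxt = padded[i:i + n - 1], padded[i + n - 1]
            table[context].append(nxt)
    return dict(table)

def generate_word(table, rng, n=3, min_length=3, max_length=12, max_attempts=100):
    """Walk the table from the start context until END or max_length is reached."""
    for _ in range(max_attempts):
        context, out = START * (n - 1), []
        while len(out) < max_length:
            nxt = rng.choice(table[context])   # duplicates in the list act as frequencies
            if nxt == END:
                break
            out.append(nxt)
            context = (context + nxt)[-(n - 1):]
        if len(out) >= min_length:
            return "".join(out)
    return None  # no candidate satisfied the length constraints

names = ["mehmet", "ahmet", "murat", "mustafa", "merve", "melek"]
table = train_char_ngrams(names)
word = generate_word(table, random.Random(7))
```

The output only ever uses character transitions seen in training, which is why it tends to resemble the source names without necessarily being one of them.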
Model File Structure
The .ngram file is a gzipped JSON file. See N-gram Models for the complete format specification.
Type 4: Template Generators
Composition of multiple generators with formatting and logic.
Characteristics
- Combines other generators
- Supports multiple variants (random template selection)
- Weighted variant selection
- Recursive resolution
- Post-processing operations
Use Cases
| Output | Composition |
|---|---|
| Email | lowercase(first_name) + "." + lowercase(last_name) + "@" + domain |
| Full name | first_name + " " + last_name |
| Address | street + " No:" + number + " " + city |
| Phone | prefix + " " + pattern(### ## ##) |
| SKU | category_code + "-" + pattern(####) |
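The recursive resolution and weighted variant selection listed under Characteristics can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not Phony's resolver: `resolve` and `render` are hypothetical names, non-template generators are stood in for by plain callables, values are cached per record so that `{{country.name}}` and `{{country.code}}` agree, and function-style placeholders such as `{{lowercase(first_name)}}` are not handled.

```python
import random
import re

PLACEHOLDER = re.compile(r"\{\{\s*(.+?)\s*\}\}")

def resolve(name, generators, rng, cache):
    """Produce a value for one generator, resolving templates recursively."""
    if name in cache:                 # assumed: one value per generator per record
        return cache[name]
    spec = generators[name]
    if callable(spec):                # stand-in for logic/list/model generators
        value = spec(rng)
    else:                             # template: list of (pattern, weight) variants
        patterns = [p for p, _ in spec]
        weights = [w for _, w in spec]
        pattern = rng.choices(patterns, weights=weights, k=1)[0]
        value = render(pattern, generators, rng, cache)
    cache[name] = value
    return value

def render(pattern, generators, rng, cache):
    """Replace every {{...}} placeholder in the pattern."""
    def substitute(match):
        token = match.group(1)
        if token.startswith("number:"):          # inline syntax, e.g. {{number:1-200}}
            low, high = token[len("number:"):].split("-")
            return str(rng.randint(int(low), int(high)))
        name, _, prop = token.partition(".")     # {{country.code}} style access
        value = resolve(name, generators, rng, cache)
        return str(value[prop] if prop else value)
    return PLACEHOLDER.sub(substitute, pattern)

generators = {
    "first_name": lambda rng: "Mehmet",
    "city": lambda rng: "İstanbul",
    "country": lambda rng: {"name": "Turkey", "code": "TR"},
    "street_name": [("{{first_name}} Sokak", 1)],
    "full_address": [("{{street_name}} No:{{number:1-200}} {{city}}", 1)],
}
address = resolve("full_address", generators, random.Random(0), {})
```

Here `address` has the shape "Mehmet Sokak No:<n> İstanbul", with the house number drawn from 1 to 200.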
PDL Syntax
{
"generators": {
"email": {
"type": "template",
"pattern": "{{lowercase(first_name)}}.{{lowercase(last_name)}}@{{pick(email_domains)}}",
"unique": true
},
"street_name": {
"type": "template",
"variants": [
{ "pattern": "{{first_name}} Sokak", "weight": 35 },
{ "pattern": "{{last_name}} Caddesi", "weight": 35 },
{ "pattern": "Atatürk Bulvarı", "weight": 10 },
{ "pattern": "{{number:1-2000}}. Sokak", "weight": 20 }
]
},
"full_address": {
"type": "template",
"variants": [
{ "pattern": "{{street_name}} No:{{number:1-200}} {{district}}/{{city}}", "weight": 45 },
{ "pattern": "{{neighborhood}} Mah. {{street_name}} {{city}}", "weight": 30 },
{ "pattern": "{{street_name}} {{number:1-200}}/{{number:1-20}} {{city}}", "weight": 25 }
]
},
"slug": {
"type": "template",
"pattern": "{{product_name}}",
"operations": ["slugify", "lowercase"]
}
}
}
Template Resolution Process
┌─────────────────────────────────────────────────────────────────────────┐
│ TEMPLATE RESOLUTION (Recursive) │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ INPUT: {{full_address}} │
│ │
│ STEP 1: Select variant (weighted random) │
│ ───────────────────────────────────────── │
│ Selected: "{{street_name}} No:{{number:1-200}} {{city}}" │
│ │
│ STEP 2: Resolve each placeholder (recursive) │
│ ───────────────────────────────────────────── │
│ │
│ {{street_name}} → Template generator │
│ │ │
│ ├─▶ Select variant: "{{first_name}} Sokak" │
│ │ │
│ └─▶ {{first_name}} → Model generator │
│ │ │
│ └─▶ N-gram generation → "Mehmet" │
│ │
│ Result: "Mehmet Sokak" │
│ │
│ {{number:1-200}} → Inline logic generator │
│ │ │
│ └─▶ Random int 1-200 → 47 │
│ │
│ {{city}} → List generator │
│ │ │
│ └─▶ Random from cities.json → "İstanbul" │
│ │
│ STEP 3: Combine │
│ ──────────────── │
│ "Mehmet Sokak" + " No:" + "47" + " " + "İstanbul" │
│ │
│ FINAL OUTPUT: "Mehmet Sokak No:47 İstanbul" │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Comparison Table
| Feature | Logic | List | Model | Template |
|---|---|---|---|---|
| Data source | Algorithm | File/Inline | Trained model | Other generators |
| Locale-specific | No | Optional | Yes | Depends on refs |
| Output variety | Controlled | Finite | Infinite | Controlled |
| Correctness | Guaranteed | Guaranteed | Statistical | Depends on refs |
| Speed | Fastest | Very fast | Fast | Depends on depth |
| Training needed | No | No | Yes | No |
| Typical use | IDs, numbers | Codes, enums | Names, text | Composition |
| Consistency support | Yes | Yes | Yes | Yes |
| Linking support | No | Yes | No | Yes |
Cross-Cutting Features
These features can be applied to multiple generator types.
Consistency
Consistency ensures that the same input key always produces the same output value within the configured scope, up to the entire dataset. This is critical for:
- Maintaining valid joins between tables
- Preserving data cardinality
- Keeping duplicated data consistent across databases
{
"generators": {
"company_name": {
"type": "model",
"source": "models/company_names.ngram",
"generation": { "mode": "word" },
"consistency": {
"enabled": true,
"key": "company_id",
"scope": "global"
}
}
}
}
| Scope | Behavior |
|---|---|
| table | Same key → same value within this table |
| entity | Same key → same value within this entity type |
| global | Same key → same value across entire dataset |
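One common way to get "same key → same value" without storing a lookup table is to derive the random seed from the key itself. The Python sketch below shows that idea; it is an assumption about how such a feature could work, not a description of Phony's internals, and `consistent_value`, `scope_id`, and `company_name` are hypothetical names.

```python
import hashlib
import random

def consistent_value(generate, key, scope_id="global"):
    """Derive the RNG seed from (scope_id, key) so equal keys give equal values."""
    digest = hashlib.sha256(f"{scope_id}:{key}".encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return generate(rng)

def company_name(rng):
    # Stand-in for a model generator; a real one would sample an N-gram model.
    return f"company-{rng.randrange(10**6):06d}"

first = consistent_value(company_name, key=17)
again = consistent_value(company_name, key=17)   # identical to `first`
```

Under this scheme, a table-level scope would pass something like the table name as `scope_id`, so the same key can map to different values in different tables while staying stable within each one.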
Note: Primary key generators have automatic consistency with format-preserving encryption (FPE). Foreign keys referencing them are automatically transformed consistently.
Linking
Linked generators ensure that related columns generate coherent data together. When columns have strong inter-dependencies, linking produces realistic combinations.
{
"generators": {
"location": {
"type": "linked",
"columns": ["city", "state", "country", "postal_code", "phone_prefix"],
"source": "lists/geo/locations.json"
}
},
"entities": {
"Address": {
"fields": {
"city": { "generator": "location.city" },
"country": { "generator": "location.country" },
"postal_code": { "generator": "location.postal_code" }
}
}
}
}
Without linking: city="Ankara", country="Japan", postal="90210" (invalid)
With linking: city="Ankara", country="Turkey", postal="06100" (valid)
Common linking patterns:
- Geographic: city + state + country + postal_code + coordinates
- Personal: first_name + gender + title (gender-appropriate names)
- Financial: salary + bonus + tax + net_income (mathematically related)
- Temporal: birth_date + hire_date + age_at_hire (logically consistent)
Deep Dive: For complete documentation on consistency, linking, and other advanced concepts, see Advanced Concepts.
Type 5: Statistical Generators
Statistical generators produce data matching real-world distributions, not just random values.
Characteristics
- Preserves frequency distributions (categorical)
- Matches statistical distributions (continuous)
- Maintains correlations between columns (multivariate)
- Supports differential privacy
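As background for the differential privacy item, the sketch below shows the textbook Laplace mechanism applied to a mean: values are clipped to a known range, and noise with scale sensitivity/epsilon is added. How Phony applies its `laplace` mechanism is not specified here, so treat `private_mean`, the clipping bounds, and the choice of statistic as assumptions.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_mean(values, lower, upper, epsilon, rng):
    """Mean of clipped values plus Laplace noise scaled to sensitivity / epsilon."""
    clipped = [min(max(v, lower), upper) for v in values]
    # Changing one record moves the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    return sum(clipped) / len(clipped) + laplace_noise(sensitivity / epsilon, rng)

salaries = [50000, 75000, 100000]
noisy = private_mean(salaries, lower=0, upper=200000, epsilon=1.0, rng=random.Random(11))
```

A smaller epsilon gives a larger noise scale and therefore stronger privacy at the cost of accuracy.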
Use Cases
| Generator Mode | Output | Example |
|---|---|---|
| categorical | Preserves value frequencies | 70% completed, 20% pending, 10% cancelled |
| continuous | Follows distribution | Normal(mean=35, stddev=12) for ages |
| algebraic | Preserves math relationships | total = subtotal + tax - discount |
| multivariate | Preserves correlations | price correlates with sqft |
PDL Syntax
{
"generators": {
"order_status": {
"type": "statistical",
"mode": "categorical",
"values": [
{ "value": "completed", "weight": 70 },
{ "value": "pending", "weight": 20 },
{ "value": "cancelled", "weight": 10 }
]
},
"customer_age": {
"type": "statistical",
"mode": "continuous",
"distribution": "normal",
"params": { "mean": 35, "stddev": 12 },
"constraints": { "min": 18, "max": 85 }
},
"income": {
"type": "statistical",
"mode": "continuous",
"distribution": "lognormal",
"params": { "mu": 10.5, "sigma": 0.8 }
},
"order_totals": {
"type": "statistical",
"mode": "algebraic",
"columns": ["subtotal", "tax", "discount", "total"],
"relationship": "total = subtotal + tax - discount"
}
}
}
Supported Distributions
| Distribution | Use Case | Parameters |
|---|---|---|
| normal | Age, height, test scores | mean, stddev |
| lognormal | Income, prices, durations | mu, sigma |
| exponential | Wait times, failure rates | lambda |
| uniform | Random selection | min, max |
| poisson | Event counts per period | lambda |
| beta | Probabilities, percentages | alpha, beta |
| gamma | Wait times, rainfall | shape, scale |
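A continuous generator with `constraints`, such as `customer_age` above, has to reconcile an unbounded distribution with hard limits. The Python sketch below uses rejection sampling for that; whether Phony rejects, clamps, or does something else is not stated, so this is only one plausible reading, and `bounded_normal` is a hypothetical name.

```python
import random

def bounded_normal(rng, mean, stddev, low, high, max_attempts=1000):
    """Rejection sampling: redraw until the value falls inside [low, high]."""
    for _ in range(max_attempts):
        value = rng.gauss(mean, stddev)
        if low <= value <= high:
            return value
    raise ValueError("constraints too tight for this distribution")

def lognormal(rng, mu, sigma):
    """Lognormal sample; mu and sigma describe the underlying normal."""
    return rng.lognormvariate(mu, sigma)

rng = random.Random(3)
ages = [bounded_normal(rng, mean=35, stddev=12, low=18, high=85) for _ in range(1000)]
income = lognormal(rng, mu=10.5, sigma=0.8)
```

Rejection keeps the shape of the distribution inside the bounds, whereas clamping out-of-range draws would pile extra mass exactly at 18 and 85.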
Differential Privacy Option
Add mathematical privacy guarantees:
{
"generators": {
"salary": {
"type": "statistical",
"mode": "continuous",
"distribution": "normal",
"params": { "mean": 75000, "stddev": 25000 },
"differential_privacy": {
"enabled": true,
"epsilon": 1.0,
"mechanism": "laplace"
}
}
}
}
Type 6: Linked Generators
Linked generators ensure related columns generate coherent data together.
Characteristics
- Multiple columns share a single selection
- All values come from the same source record
- Guarantees valid combinations
- Eliminates impossible data (e.g., "Ankara, Japan")
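The characteristics above boil down to one rule: select a source record once, then read every linked column from it. A minimal Python sketch, with a hypothetical `linked_row` helper and a two-record sample standing in for a locations file:

```python
import random

# Illustrative records; a real source file would hold many more.
LOCATIONS = [
    {"city": "Ankara", "country": "Turkey", "postal_code": "06100"},
    {"city": "Berlin", "country": "Germany", "postal_code": "10115"},
]

def linked_row(records, columns, rng):
    """Choose one record, then project the requested columns from it."""
    record = rng.choice(records)          # the single shared selection
    return {column: record[column] for column in columns}

row = linked_row(LOCATIONS, ["city", "country", "postal_code"], random.Random(4))
```

Because all three values come from the same record, a combination such as city "Ankara" with country "Germany" cannot occur.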
Use Cases
| Link Type | Columns | Why Link? |
|---|---|---|
| Geographic | city, state, country, postal, coords | Must be geographically valid |
| Personal | first_name, gender, title | Gender-appropriate names |
| Financial | salary, bonus, tax, net | Mathematically related |
| Product | category, subcategory, brand | Valid category hierarchy |
PDL Syntax
{
"generators": {
"geo_location": {
"type": "linked",
"columns": ["city", "district", "country", "postal_code", "latitude", "longitude", "phone_prefix", "currency"],
"source": "lists/geo/locations.json"
},
"person_info": {
"type": "linked",
"columns": ["first_name", "gender", "title"],
"source": "lists/person/names_with_gender.json",
"rules": {
"title": {
"male": ["Bay", "Mr."],
"female": ["Bayan", "Ms.", "Mrs."]
}
}
},
"financials": {
"type": "linked",
"columns": ["salary", "bonus", "tax", "net_income"],
"rules": {
"salary": { "type": "statistical", "distribution": "lognormal", "params": { "mu": 10, "sigma": 0.5 } },
"bonus": "salary * uniform(0.05, 0.20)",
"tax": "(salary + bonus) * 0.25",
"net_income": "salary + bonus - tax"
}
}
},
"entities": {
"Employee": {
"fields": {
"city": { "generator": "geo_location.city" },
"country": { "generator": "geo_location.country" },
"postal_code": { "generator": "geo_location.postal_code" },
"first_name": { "generator": "person_info.first_name" },
"gender": { "generator": "person_info.gender" },
"salary": { "generator": "financials.salary" },
"net_income": { "generator": "financials.net_income" }
}
}
}
}
Type 7: Event Sequence Generators
Event sequences generate chronologically valid date/time series.
Characteristics
- Dates respect logical order (order < ship < deliver)
- Configurable delays between events
- Probability-based optional events
- Supports business logic conditions
Use Cases
| Sequence | Events | Logic |
|---|---|---|
| Order lifecycle | created → paid → shipped → delivered | Each step after previous |
| User journey | signup → trial_start → trial_end → converted/churned | Branching paths |
| Employment | applied → interviewed → hired → onboarded → promoted | With probabilities |
PDL Syntax
{
"generators": {
"order_timeline": {
"type": "event_sequence",
"events": [
{
"name": "created_at",
"base": true,
"range": { "start": "-1year", "end": "now" }
},
{
"name": "paid_at",
"after": "created_at",
"delay": { "min": "0h", "max": "24h" },
"probability": 0.95
},
{
"name": "shipped_at",
"after": "paid_at",
"delay": { "min": "1d", "max": "3d" },
"probability": 0.90
},
{
"name": "delivered_at",
"after": "shipped_at",
"delay": { "min": "1d", "max": "7d" },
"probability": 0.85
}
]
}
},
"entities": {
"Order": {
"fields": {
"created_at": { "generator": "order_timeline.created_at" },
"paid_at": { "generator": "order_timeline.paid_at" },
"shipped_at": { "generator": "order_timeline.shipped_at" },
"delivered_at": { "generator": "order_timeline.delivered_at" }
}
}
}
}
Event Options
| Option | Description | Example |
|---|---|---|
| base | The anchor event, generated first | true |
| after | This event occurs after specified event | "created_at" |
| delay | Time range between events | { "min": "1d", "max": "7d" } |
| probability | Chance this event occurs; with a value below 1.0, the field can be null | 0.85 |
| distribution | Delay distribution | "exponential" |
| condition | Only generate if condition met | "status = 'shipped'" |
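The interplay of `after`, `delay`, and `probability` can be sketched as a single pass over the events in order. The Python below models the `order_timeline` example under several assumptions that the options table does not confirm: delays are drawn uniformly, "-1year" is treated as 365 days, and an event whose predecessor was skipped is itself null. `generate_timeline` and the `EVENTS` tuple layout are hypothetical.

```python
import random
from datetime import datetime, timedelta

EVENTS = [
    # (name, after, min_delay, max_delay, probability)
    ("paid_at", "created_at", timedelta(hours=0), timedelta(hours=24), 0.95),
    ("shipped_at", "paid_at", timedelta(days=1), timedelta(days=3), 0.90),
    ("delivered_at", "shipped_at", timedelta(days=1), timedelta(days=7), 0.85),
]

def generate_timeline(rng, now):
    # Base event: uniform over the past 365 days (approximating "-1year" .. "now").
    created = now - timedelta(seconds=rng.uniform(0, 365 * 24 * 3600))
    timeline = {"created_at": created}
    for name, after, low, high, probability in EVENTS:
        previous = timeline[after]
        if previous is None or rng.random() >= probability:
            timeline[name] = None      # skipped, or its predecessor never happened
            continue
        span = (high - low).total_seconds()
        timeline[name] = previous + low + timedelta(seconds=rng.uniform(0, span))
    return timeline

timeline = generate_timeline(random.Random(5), now=datetime(2024, 3, 15, 14, 30))
```

Each non-null timestamp is built by adding a non-negative delay to its predecessor, so chronological order holds by construction rather than by checking after the fact.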
Combining Generator Types
The real power comes from combining types:
{
"generators": {
"user_id": {
"type": "logic",
"algorithm": "uuid_v7"
},
"country": {
"type": "list",
"source": "lists/geo/countries.json"
},
"first_name": {
"type": "model",
"source": "models/tr_TR/person_names.ngram",
"generation": { "mode": "word" }
},
"user_profile": {
"type": "template",
"pattern": "ID: {{user_id}}\nName: {{first_name}} {{last_name}}\nCountry: {{country.name}} ({{country.code}})\nPhone: {{country.phone_prefix}} {{pattern:### ### ## ##}}"
}
}
}
This produces:
ID: 018d6e5c-5c3a-7f1e-8b5a-2c4d6e8f0a1b
Name: Mehmet Yılmaz
Country: Turkey (TR)
Phone: +90 532 847 23 91