Generator Types

Phony provides seven generator types under a unified abstraction: four core types (Logic, List, Model, Template) and three advanced types (Statistical, Linked, Event Sequence). Each type serves a distinct purpose, and generators of different types can reference one another.

Overview: When to Use Each Type

┌─────────────────────────────────────────────────────────────────────────┐
│           GENERATOR TYPE SELECTION GUIDE                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Question: What kind of data do you need?                               │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐│
│  │                                                                      ││
│  │  Is it COMPUTED from rules/algorithms?                              ││
│  │  (UUIDs, random numbers, dates, sequences)                          ││
│  │     │                                                                ││
│  │     └─▶ YES → LOGIC GENERATOR                                       ││
│  │                                                                      ││
│  │  Is it from a FINITE, VALID set?                                    ││
│  │  (Countries, HTTP codes, currencies - must be correct)              ││
│  │     │                                                                ││
│  │     └─▶ YES → LIST GENERATOR                                        ││
│  │                                                                      ││
│  │  Should it FEEL NATURAL for a locale?                               ││
│  │  (Names, addresses - should sound Turkish/English/etc.)             ││
│  │     │                                                                ││
│  │     └─▶ YES → MODEL GENERATOR (N-gram)                              ││
│  │                                                                      ││
│  │  Is it COMPOSED from multiple sources?                              ││
│  │  (Full address, email, formatted output)                            ││
│  │     │                                                                ││
│  │     └─▶ YES → TEMPLATE GENERATOR                                    ││
│  │                                                                      ││
│  └─────────────────────────────────────────────────────────────────────┘│
│                                                                          │
│  KEY INSIGHT:                                                            │
│  ─────────────                                                           │
│  • CORRECT data → List (pick from valid options)                        │
│  • REALISTIC data → Model (statistically similar)                       │
│  • COMPUTED data → Logic (pure algorithm)                               │
│  • COMPOSED data → Template (combine sources)                           │
│                                                                          │
│  HTTP 404 must BE exactly 404 (List)                                    │
│  A Turkish name should FEEL Turkish (Model)                             │
│  A UUID must be algorithmically valid (Logic)                           │
│  An email combines name + domain (Template)                             │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Type 1: Logic Generators

Pure algorithmic generation that doesn't require external data sources.

Characteristics

  • No data files needed
  • Deterministic given same seed
  • Language/locale independent
  • Fastest execution
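
The seed determinism listed above can be illustrated with a short Python sketch. This is not Phony's implementation: the function names simply mirror the algorithm names used in this section, and the explicit `rng` argument and the inclusive bounds are assumptions made for illustration.

```python
import random

def int_between(rng: random.Random, min_value: int, max_value: int) -> int:
    """Uniform integer in [min_value, max_value]; inclusive bounds are an assumption."""
    return rng.randint(min_value, max_value)

def float_between(rng: random.Random, min_value: float, max_value: float, precision: int) -> float:
    """Uniform float rounded to `precision` decimal places."""
    return round(rng.uniform(min_value, max_value), precision)

def boolean(rng: random.Random, probability: float) -> bool:
    """True with the given probability."""
    return rng.random() < probability

def generate_row(seed: int) -> dict:
    rng = random.Random(seed)  # all randomness flows from this single seed
    return {
        "age": int_between(rng, 18, 85),
        "price": float_between(rng, 0.01, 9999.99, 2),
        "is_active": boolean(rng, 0.85),
    }
```

Calling `generate_row` twice with the same seed returns the same row, which is the property that makes seeded test data reproducible.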

Use Cases

| Generator | Output | Example |
| --- | --- | --- |
| uuid_v4 | UUID version 4 | 550e8400-e29b-41d4-a716-446655440000 |
| uuid_v7 | UUID version 7 (time-sortable) | 018d6e5c-5c3a-7f1e-8b5a-2c4d6e8f0a1b |
| ulid | ULID | 01ARZ3NDEKTSV4RRFFQ69G5FAV |
| int_between | Random integer | 42 |
| float_between | Random float | 3.14159 |
| boolean | Random boolean | true |
| datetime_between | Random datetime | 2024-03-15T14:30:00Z |
| date_between | Random date | 2024-03-15 |
| time_between | Random time | 14:30:00 |
| timestamp | Unix timestamp | 1710512400 |

PDL Syntax

json
{
  "generators": {
    "user_id": {
      "type": "logic",
      "algorithm": "uuid_v7"
    },
    "age": {
      "type": "logic",
      "algorithm": "int_between",
      "params": { "min": 18, "max": 85 }
    },
    "price": {
      "type": "logic",
      "algorithm": "float_between",
      "params": { "min": 0.01, "max": 9999.99, "precision": 2 }
    },
    "created_at": {
      "type": "logic",
      "algorithm": "datetime_between",
      "params": { "start": "2023-01-01", "end": "now" }
    },
    "is_active": {
      "type": "logic",
      "algorithm": "boolean",
      "params": { "probability": 0.85 }
    }
  }
}

Available Algorithms

| Algorithm | Params | Description |
| --- | --- | --- |
| uuid_v4 | - | Random UUID |
| uuid_v7 | - | Time-sortable UUID |
| ulid | - | Universally Unique Lexicographically Sortable ID |
| nanoid | length | Nano ID |
| int_between | min, max | Random integer |
| float_between | min, max, precision | Random float |
| boolean | probability | Random boolean |
| datetime_between | start, end, format | Random datetime |
| date_between | start, end, format | Random date |
| time_between | start, end, format | Random time |
| timestamp | start, end | Unix timestamp |
| sequence | start, step | Auto-incrementing |
| gaussian | mean, stddev | Normal distribution |
| exponential | lambda | Exponential distribution |
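
As a rough illustration of the last three rows, the sketch below reproduces `sequence`, `gaussian`, and `exponential` with the Python standard library. The parameter names follow the table; the code itself is a stand-in for illustration, not Phony's implementation.

```python
import itertools
import random

def sequence(start: int = 1, step: int = 1):
    """Auto-incrementing values: start, start + step, start + 2 * step, ..."""
    return itertools.count(start, step)

rng = random.Random(0)

ids = sequence(start=1000, step=10)
first_three = [next(ids) for _ in range(3)]   # [1000, 1010, 1020]

gaussian_sample = rng.gauss(35, 12)           # mean=35, stddev=12
exponential_sample = rng.expovariate(0.5)     # lambda=0.5, so the mean is 1 / lambda = 2
```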

Type 2: List Generators

Selection from predefined, finite sets of valid values.

Characteristics

  • Data must be correct (valid HTTP codes, real countries)
  • Can be locale-specific or universal
  • Fast O(1) random selection
  • Supports weighted selection

Use Cases

| Category | Examples | Why List? |
| --- | --- | --- |
| Standards | HTTP methods, status codes | Must be valid codes |
| Geography | Countries, currencies | ISO standards |
| Enums | Status values, categories | Application-specific |
| Lookup | Cities, districts | Real place names |

PDL Syntax

json
{
  "generators": {
    "city": {
      "type": "list",
      "source": "lists/geo/cities.json"
    },
    "http_method": {
      "type": "list",
      "source": "inline",
      "values": ["GET", "POST", "PUT", "DELETE", "PATCH"],
      "locale_independent": true
    },
    "order_status": {
      "type": "list",
      "source": "inline",
      "values": [
        { "value": "completed", "weight": 60 },
        { "value": "pending", "weight": 25 },
        { "value": "cancelled", "weight": 10 },
        { "value": "refunded", "weight": 5 }
      ]
    },
    "country": {
      "type": "list",
      "source": "lists/geo/countries.json"
    }
  }
}

Note: List files returning objects (like countries with { name, code, phone_prefix }) allow accessing nested properties via {{country.code}}.

List File Formats

Simple Array (JSON):

json
["İstanbul", "Ankara", "İzmir", "Bursa", "Antalya"]

Weighted Array (JSON):

json
[
  { "value": "İstanbul", "weight": 40 },
  { "value": "Ankara", "weight": 20 },
  { "value": "İzmir", "weight": 15 },
  { "value": "Bursa", "weight": 10 },
  { "value": "Antalya", "weight": 15 }
]
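
A weighted array like the one above could be sampled as in the following Python sketch. It assumes weights are relative (they need not sum to 100) and uses `random.choices` from the standard library; how Phony performs the selection internally is not specified here.

```python
import json
import random

WEIGHTED_CITIES = json.loads("""
[
  { "value": "İstanbul", "weight": 40 },
  { "value": "Ankara", "weight": 20 },
  { "value": "İzmir", "weight": 15 },
  { "value": "Bursa", "weight": 10 },
  { "value": "Antalya", "weight": 15 }
]
""")

def pick_weighted(entries: list, rng: random.Random) -> str:
    """Pick one value with probability proportional to its weight."""
    values = [entry["value"] for entry in entries]
    weights = [entry["weight"] for entry in entries]
    return rng.choices(values, weights=weights, k=1)[0]

rng = random.Random(1)
samples = [pick_weighted(WEIGHTED_CITIES, rng) for _ in range(10_000)]
```

Over many draws, İstanbul should appear roughly twice as often as Ankara and four times as often as Bursa.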

Object with Metadata (JSON):

json
[
  { "name": "Turkey", "code": "TR", "phone_prefix": "+90", "currency": "TRY" },
  { "name": "Germany", "code": "DE", "phone_prefix": "+49", "currency": "EUR" },
  { "name": "United States", "code": "US", "phone_prefix": "+1", "currency": "USD" }
]

Accessing Nested Properties

json
{
  "generators": {
    "country": {
      "type": "list",
      "source": "lists/geo/countries.json"
    },
    "country_code": {
      "type": "template",
      "pattern": "{{country.code}}"
    },
    "phone_prefix": {
      "type": "template",
      "pattern": "{{country.phone_prefix}}"
    }
  }
}
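
The dotted placeholder syntax can be pictured as a path lookup on the selected record. The Python sketch below is a simplified assumption of how such resolution might work; `lookup` and `render` are hypothetical helper names, not part of Phony.

```python
import re

# Matches {{name}} and {{name.property}} placeholders.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")

def lookup(path: str, context: dict):
    """Walk a dotted path such as "country.code" through nested dicts."""
    value = context
    for part in path.split("."):
        value = value[part]
    return value

def render(pattern: str, context: dict) -> str:
    """Replace every placeholder with the value found at its path."""
    return PLACEHOLDER.sub(lambda match: str(lookup(match.group(1), context)), pattern)

# The context holds the single record picked by the "country" list generator,
# so every placeholder in a row reads from the same country.
context = {"country": {"name": "Turkey", "code": "TR", "phone_prefix": "+90", "currency": "TRY"}}

country_code = render("{{country.code}}", context)            # "TR"
phone_prefix = render("{{country.phone_prefix}}", context)    # "+90"
```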

Type 3: Model Generators (N-gram)

Statistical generation that produces realistic, locale-specific output.

Deep Dive: For complete N-gram model architecture, training configuration, and generation algorithms, see N-gram Models.

Characteristics

  • Trained on real data samples
  • Produces statistically similar output (not identical)
  • Locale-specific (Turkish names sound Turkish)
  • Supports constraints (length, prefix)
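
To make "statistically similar, not identical" concrete, here is a toy character-level trigram model in Python. It is a deliberately minimal sketch: the training list, the `train` and `generate` helpers, and the retry-based length constraint are all assumptions for illustration, and Phony's actual models (see N-gram Models) have more options than this.

```python
import random
from collections import defaultdict

START, END = "^", "$"
NAMES = ["Mehmet", "Ahmet", "Mustafa", "Murat", "Emre", "Kemal", "Hasan", "Osman", "Yusuf", "Hakan"]

def train(words, n=3):
    """Map each (n-1)-character context to the characters that followed it (n >= 2)."""
    model = defaultdict(list)
    for word in words:
        padded = START * (n - 1) + word.lower() + END
        for i in range(len(padded) - n + 1):
            model[padded[i:i + n - 1]].append(padded[i + n - 1])
    return model

def generate(model, rng, n=3, min_length=3, max_length=12, max_attempts=1000):
    """Sample characters until END; retry when the length constraints are not met."""
    for _ in range(max_attempts):
        context, out = START * (n - 1), []
        while len(out) <= max_length:
            nxt = rng.choice(model[context])  # duplicates in the list encode frequency
            if nxt == END:
                break
            out.append(nxt)
            context = (context + nxt)[-(n - 1):]
        if min_length <= len(out) <= max_length:
            return "".join(out).capitalize()
    raise RuntimeError("no word satisfied the length constraints")

model = train(NAMES)
generated = [generate(model, random.Random(seed)) for seed in range(20)]
```

Some generated words will coincide with training names and others will be new blends of their letter patterns, which is the behavior the `--exclude-originals` training flag shown later is presumably meant to control.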

Use Cases

| Domain | Why Model? |
| --- | --- |
| Person names | Should feel natural for the locale |
| Company names | Follow language patterns |
| Street names | Locale-specific naming conventions |
| Product names | Natural language patterns |
| Usernames | Realistic patterns |

PDL Syntax

Model generators require a generation block specifying how to produce output:

json
{
  "generators": {
    "first_name": {
      "type": "model",
      "source": "models/person_names.ngram",
      "generation": { "mode": "word" },
      "constraints": { "min_length": 3, "max_length": 12 }
    },
    "username": {
      "type": "model",
      "source": "models/usernames.ngram",
      "generation": { "mode": "word" },
      "constraints": { "min_length": 4, "max_length": 16 }
    },
    "company_name": {
      "type": "model",
      "source": "models/company_names.ngram",
      "generation": {
        "mode": "word",
        "params": { "starts_with": "A" }
      }
    },
    "tagline": {
      "type": "model",
      "source": "models/slogans.ngram",
      "generation": {
        "mode": "sentence",
        "params": {
          "word_count": "{{number:3-8}}",
          "punctuation": [".", "!"]
        }
      }
    },
    "bio": {
      "type": "model",
      "source": "models/text.ngram",
      "generation": {
        "mode": "text",
        "params": {
          "max_chars": 200,
          "suffix": "..."
        }
      }
    }
  }
}

Generation Modes

| Mode | Output | Use Case |
| --- | --- | --- |
| word | Single word | Names, usernames |
| sentence | Single sentence | Taglines, mottos |
| text | Truncated text | Bios, descriptions |
| paragraph | Full paragraph | Articles, reviews |
| poem | Poem with stanzas | Creative content |
| acrostic | Acrostic poem | Hidden messages |
| real_word | Pick from training | Known/real names |

Note: Generation parameters can reference other generators using {{generator_name}} or inline random syntax like {{number:3-8}}. See N-gram Models for details.

Training Models

Models are trained using the phony CLI:

bash
# Character mode (default) - for names, usernames
phony train names.txt -o models/names.ngram

# Word mode - for company names, product names, sentences
phony train companies.txt -o models/companies.ngram --token-type word

# Full configuration example
phony train names.txt \
  --output models/tr_TR/names.ngram \
  --ngram-size 4 \
  --token-type char \
  --min-word-length 3 \
  --position-depth 5 \
  --fallback prefix \
  --exclude-originals \
  --lowercase

# Train from different sources
phony train customers.csv --column first_name -o models/names.ngram
phony train data.json --path "$.users[*].name" -o models/names.ngram

Token Modes

| Mode | Description | Use Case |
| --- | --- | --- |
| char | Character-level N-grams | Names, usernames, made-up words |
| word | Word-level N-grams | Company names, product names, sentences |

Model File Structure

The .ngram file is a gzipped JSON file. See N-gram Models for the complete format specification.
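
Because the file is gzipped JSON, it can be inspected with standard tooling. The Python sketch below shows a generic round trip; the payload is a placeholder, since the real keys are defined in the format specification and are not reproduced here.

```python
import gzip
import json
import os
import tempfile

def save_gzipped_json(path: str, payload: dict) -> None:
    """Write a dict as gzip-compressed UTF-8 JSON."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(payload, handle, ensure_ascii=False)

def load_gzipped_json(path: str) -> dict:
    """Read gzip-compressed UTF-8 JSON back into a dict."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)

# Round trip with a placeholder payload (not the real .ngram schema).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.ngram")
    save_gzipped_json(demo_path, {"placeholder": True})
    loaded = load_gzipped_json(demo_path)
```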


Type 4: Template Generators

Composition of multiple generators with formatting and logic.

Characteristics

  • Combines other generators
  • Supports multiple variants (random template selection)
  • Weighted variant selection
  • Recursive resolution
  • Post-processing operations

Use Cases

| Output | Composition |
| --- | --- |
| Email | lowercase(first_name).lowercase(last_name)@domain |
| Full name | first_name + " " + last_name |
| Address | street + " No:" + number + " " + city |
| Phone | prefix + " " + pattern(### ## ##) |
| SKU | category_code + "-" + pattern(####) |

PDL Syntax

json
{
  "generators": {
    "email": {
      "type": "template",
      "pattern": "{{lowercase(first_name)}}.{{lowercase(last_name)}}@{{pick(email_domains)}}",
      "unique": true
    },
    "street_name": {
      "type": "template",
      "variants": [
        { "pattern": "{{first_name}} Sokak", "weight": 35 },
        { "pattern": "{{last_name}} Caddesi", "weight": 35 },
        { "pattern": "Atatürk Bulvarı", "weight": 10 },
        { "pattern": "{{number:1-2000}}. Sokak", "weight": 20 }
      ]
    },
    "full_address": {
      "type": "template",
      "variants": [
        { "pattern": "{{street_name}} No:{{number:1-200}} {{district}}/{{city}}", "weight": 45 },
        { "pattern": "{{neighborhood}} Mah. {{street_name}} {{city}}", "weight": 30 },
        { "pattern": "{{street_name}} {{number:1-200}}/{{number:1-20}} {{city}}", "weight": 25 }
      ]
    },
    "slug": {
      "type": "template",
      "pattern": "{{product_name}}",
      "operations": ["slugify", "lowercase"]
    }
  }
}
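
The `operations` array in the `slug` generator reads as an ordered post-processing pipeline. The Python sketch below assumes the operations are applied left to right to the rendered pattern; the `slugify` behavior shown (ASCII folding, hyphen separators) is an illustrative guess rather than Phony's documented definition.

```python
import re
import unicodedata

def slugify(text: str) -> str:
    """Fold accents to ASCII and join alphanumeric runs with hyphens."""
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^A-Za-z0-9]+", "-", ascii_text).strip("-")

# Only the two operations that appear in the example above are modeled.
OPERATIONS = {"slugify": slugify, "lowercase": str.lower}

def apply_operations(value: str, operations: list) -> str:
    """Apply each named operation in order to the rendered template value."""
    for name in operations:
        value = OPERATIONS[name](value)
    return value

slug = apply_operations("Blue Widget 2000!", ["slugify", "lowercase"])  # "blue-widget-2000"
```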

Template Resolution Process

┌─────────────────────────────────────────────────────────────────────────┐
│           TEMPLATE RESOLUTION (Recursive)                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  INPUT: {{full_address}}                                                │
│                                                                          │
│  STEP 1: Select variant (weighted random)                               │
│  ─────────────────────────────────────────                               │
│  Selected (simplified): "{{street_name}} No:{{number:1-200}} {{city}}"  │
│                                                                          │
│  STEP 2: Resolve each placeholder (recursive)                           │
│  ─────────────────────────────────────────────                           │
│                                                                          │
│  {{street_name}} → Template generator                                   │
│     │                                                                    │
│     ├─▶ Select variant: "{{first_name}} Sokak"                          │
│     │                                                                    │
│     └─▶ {{first_name}} → Model generator                                │
│            │                                                             │
│            └─▶ N-gram generation → "Mehmet"                             │
│                                                                          │
│     Result: "Mehmet Sokak"                                              │
│                                                                          │
│  {{number:1-200}} → Inline logic generator                              │
│     │                                                                    │
│     └─▶ Random int 1-200 → 47                                           │
│                                                                          │
│  {{city}} → List generator                                              │
│     │                                                                    │
│     └─▶ Random from cities.json → "İstanbul"                            │
│                                                                          │
│  STEP 3: Combine                                                        │
│  ────────────────                                                        │
│  "Mehmet Sokak" + " No:" + "47" + " " + "İstanbul"                      │
│                                                                          │
│  FINAL OUTPUT: "Mehmet Sokak No:47 İstanbul"                            │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
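
The recursive resolution shown in the diagram can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the `GENERATORS` table is a trimmed stand-in for a PDL file, `first_name` is backed by a small list instead of an N-gram model, and the depth limit is a hypothetical safeguard against circular references.

```python
import random
import re

PLACEHOLDER = re.compile(r"\{\{(.+?)\}\}")

GENERATORS = {
    "first_name": {"type": "list", "values": ["Mehmet", "Ayşe", "Emre"]},  # stand-in for a model
    "city": {"type": "list", "values": ["İstanbul", "Ankara", "İzmir"]},
    "street_name": {"type": "template", "variants": [
        {"pattern": "{{first_name}} Sokak", "weight": 35},
        {"pattern": "{{number:1-2000}}. Sokak", "weight": 20},
    ]},
    "full_address": {"type": "template",
                     "pattern": "{{street_name}} No:{{number:1-200}} {{city}}"},
}

def resolve(name: str, rng: random.Random, depth: int = 0) -> str:
    if depth > 10:
        raise RecursionError(f"template nesting too deep at {name!r}")
    inline = re.fullmatch(r"number:(\d+)-(\d+)", name)
    if inline:  # inline logic generator, e.g. {{number:1-200}}
        return str(rng.randint(int(inline.group(1)), int(inline.group(2))))
    spec = GENERATORS[name]
    if spec["type"] == "list":
        return rng.choice(spec["values"])
    if "variants" in spec:  # step 1: weighted variant selection
        variants = spec["variants"]
        pattern = rng.choices([v["pattern"] for v in variants],
                              weights=[v["weight"] for v in variants])[0]
    else:
        pattern = spec["pattern"]
    # steps 2 and 3: resolve each placeholder recursively and splice the results together
    return PLACEHOLDER.sub(lambda m: resolve(m.group(1).strip(), rng, depth + 1), pattern)

address = resolve("full_address", random.Random(0))
```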

Comparison Table

| Feature | Logic | List | Model | Template |
| --- | --- | --- | --- | --- |
| Data source | Algorithm | File/Inline | Trained model | Other generators |
| Locale-specific | No | Optional | Yes | Depends on refs |
| Output variety | Controlled | Finite | Infinite | Controlled |
| Correctness | Guaranteed | Guaranteed | Statistical | Depends on refs |
| Speed | Fastest | Very fast | Fast | Depends on depth |
| Training needed | No | No | Yes | No |
| Typical use | IDs, numbers | Codes, enums | Names, text | Composition |
| Consistency support | Yes | Yes | Yes | Yes |
| Linking support | No | Yes | No | Yes |

Cross-Cutting Features

These features can be applied to multiple generator types.

Consistency

Consistency ensures that the same input key always produces the same output value within the configured scope, up to the entire dataset. This is critical for:

  • Maintaining valid joins between tables
  • Preserving data cardinality
  • Keeping duplicated data consistent across databases

json
{
  "generators": {
    "company_name": {
      "type": "model",
      "source": "models/company_names.ngram",
      "generation": { "mode": "word" },
      "consistency": {
        "enabled": true,
        "key": "company_id",
        "scope": "global"
      }
    }
  }
}

| Scope | Behavior |
| --- | --- |
| table | Same key → same value within this table |
| entity | Same key → same value within this entity type |
| global | Same key → same value across entire dataset |

Note: Primary key generators have automatic consistency with format-preserving encryption (FPE). Foreign keys referencing them are automatically transformed consistently.
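
One common way to obtain "same key, same value" behavior is to derive the random seed from a hash of the key. The Python sketch below illustrates that idea for the three scopes; it is an assumption about a possible mechanism, not a description of Phony's internals, and `secret`, `scope_id`, and the sample company list are hypothetical.

```python
import hashlib
import random

FAKE_COMPANIES = ["Anadolu Lojistik", "Bosphorus Tech", "Ege Gıda", "Marmara Yapı"]

def consistent_pick(key, scope: str, scope_id: str, options: list, secret: str = "demo-secret") -> str:
    """Map a key to one option deterministically within the given scope."""
    # "global" ignores scope_id, so the same key maps to the same value everywhere.
    # "table" and "entity" mix in the table or entity name (scope_id).
    namespace = "" if scope == "global" else scope_id
    digest = hashlib.sha256(f"{secret}|{scope}|{namespace}|{key}".encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return rng.choice(options)

in_orders = consistent_pick(17, "global", "orders", FAKE_COMPANIES)
in_invoices = consistent_pick(17, "global", "invoices", FAKE_COMPANIES)
```

With `"global"` scope, `in_orders` and `in_invoices` are equal, so a join on `company_id` still lines up after generation.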

Linking

Linked generators ensure that related columns generate coherent data together. When columns have strong inter-dependencies, linking produces realistic combinations.

json
{
  "generators": {
    "location": {
      "type": "linked",
      "columns": ["city", "state", "country", "postal_code", "phone_prefix"],
      "source": "lists/geo/locations.json"
    }
  },
  "entities": {
    "Address": {
      "fields": {
        "city": { "generator": "location.city" },
        "country": { "generator": "location.country" },
        "postal_code": { "generator": "location.postal_code" }
      }
    }
  }
}

Without linking: city="Ankara", country="Japan", postal="90210" (invalid)

With linking: city="Ankara", country="Turkey", postal="06100" (valid)

Common linking patterns:

  • Geographic: city + state + country + postal_code + coordinates
  • Personal: first_name + gender + title (gender-appropriate names)
  • Financial: salary + bonus + tax + net_income (mathematically related)
  • Temporal: birth_date + hire_date + age_at_hire (logically consistent)

Deep Dive: For complete documentation on consistency, linking, and other advanced concepts, see Advanced Concepts.


Type 5: Statistical Generators

Statistical generators produce data matching real-world distributions, not just random values.

Characteristics

  • Preserves frequency distributions (categorical)
  • Matches statistical distributions (continuous)
  • Maintains correlations between columns (multivariate)
  • Supports differential privacy

Use Cases

| Generator Mode | Output | Example |
| --- | --- | --- |
| categorical | Preserves value frequencies | 70% completed, 20% pending, 10% cancelled |
| continuous | Follows distribution | Normal(mean=35, stddev=12) for ages |
| algebraic | Preserves math relationships | total = subtotal + tax - discount |
| multivariate | Preserves correlations | price correlates with sqft |

PDL Syntax

json
{
  "generators": {
    "order_status": {
      "type": "statistical",
      "mode": "categorical",
      "values": [
        { "value": "completed", "weight": 70 },
        { "value": "pending", "weight": 20 },
        { "value": "cancelled", "weight": 10 }
      ]
    },
    "customer_age": {
      "type": "statistical",
      "mode": "continuous",
      "distribution": "normal",
      "params": { "mean": 35, "stddev": 12 },
      "constraints": { "min": 18, "max": 85 }
    },
    "income": {
      "type": "statistical",
      "mode": "continuous",
      "distribution": "lognormal",
      "params": { "mu": 10.5, "sigma": 0.8 }
    },
    "order_totals": {
      "type": "statistical",
      "mode": "algebraic",
      "columns": ["subtotal", "tax", "discount", "total"],
      "relationship": "total = subtotal + tax - discount"
    }
  }
}
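
For the `customer_age` generator above, one plausible reading of `constraints` is that draws outside the range are rejected and resampled. The Python sketch below shows that interpretation; whether Phony resamples or clamps is not stated here, so treat the behavior as an assumption.

```python
import random

def constrained_normal(rng: random.Random, mean: float, stddev: float,
                       minimum: float, maximum: float, max_attempts: int = 1000) -> float:
    """Draw from Normal(mean, stddev), resampling until the value is within bounds."""
    for _ in range(max_attempts):
        value = rng.gauss(mean, stddev)
        if minimum <= value <= maximum:
            return value
    raise RuntimeError("constraints are too tight for this distribution")

rng = random.Random(0)
ages = [constrained_normal(rng, mean=35, stddev=12, minimum=18, maximum=85) for _ in range(1000)]
```

Resampling keeps the bell shape inside the bounds, whereas clamping would pile extra probability mass exactly at 18 and 85.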

Supported Distributions

| Distribution | Use Case | Parameters |
| --- | --- | --- |
| normal | Age, height, test scores | mean, stddev |
| lognormal | Income, prices, durations | mu, sigma |
| exponential | Wait times, failure rates | lambda |
| uniform | Random selection | min, max |
| poisson | Event counts per period | lambda |
| beta | Probabilities, percentages | alpha, beta |
| gamma | Wait times, rainfall | shape, scale |

Differential Privacy Option

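To add formal privacy guarantees, enable the differential_privacy block. The epsilon parameter is the privacy budget (smaller values add more noise and give stronger privacy), and mechanism selects the noise mechanism: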

json
{
  "generators": {
    "salary": {
      "type": "statistical",
      "mode": "continuous",
      "distribution": "normal",
      "params": { "mean": 75000, "stddev": 25000 },
      "differential_privacy": {
        "enabled": true,
        "epsilon": 1.0,
        "mechanism": "laplace"
      }
    }
  }
}
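
For readers unfamiliar with the Laplace mechanism, the Python sketch below shows the general technique: noise drawn from a Laplace distribution with scale `sensitivity / epsilon` is added to a value. The `sensitivity` parameter and the choice of which value is perturbed are assumptions, since this page does not specify how Phony applies the mechanism.

```python
import random

def laplace_noise(rng: random.Random, scale: float) -> float:
    """Laplace(0, scale) sample, built as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def privatize(value: float, sensitivity: float, epsilon: float, rng: random.Random) -> float:
    """Laplace mechanism: add noise with scale sensitivity / epsilon."""
    return value + laplace_noise(rng, sensitivity / epsilon)

def mean_abs_error(epsilon: float, trials: int = 5000) -> float:
    """Average distance between the true value and its privatized version."""
    rng = random.Random(0)
    return sum(abs(privatize(100.0, 1.0, epsilon, rng) - 100.0) for _ in range(trials)) / trials

error_strong_privacy = mean_abs_error(epsilon=0.1)  # more noise
error_weak_privacy = mean_abs_error(epsilon=1.0)    # less noise
```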

Type 6: Linked Generators

Linked generators ensure related columns generate coherent data together.

Characteristics

  • Multiple columns share a single selection
  • All values come from the same source record
  • Guarantees valid combinations
  • Eliminates impossible data (e.g., "Ankara, Japan")
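
The "single selection" idea can be shown in a few lines of Python. The records below are illustrative sample data, and `linked_row` and `unlinked_row` are hypothetical helper names used only to contrast the two behaviors.

```python
import random

# Illustrative records; a real source file would hold many more rows.
LOCATIONS = [
    {"city": "Ankara", "country": "Turkey", "postal_code": "06100", "phone_prefix": "+90"},
    {"city": "Berlin", "country": "Germany", "postal_code": "10115", "phone_prefix": "+49"},
    {"city": "Beverly Hills", "country": "United States", "postal_code": "90210", "phone_prefix": "+1"},
]
COLUMNS = ["city", "country", "postal_code"]

def linked_row(records: list, columns: list, rng: random.Random) -> dict:
    """Pick ONE record, then read every column from it."""
    record = rng.choice(records)
    return {column: record[column] for column in columns}

def unlinked_row(records: list, columns: list, rng: random.Random) -> dict:
    """Pick a record per column; combinations such as Ankara/Germany become possible."""
    return {column: rng.choice(records)[column] for column in columns}

rows = [linked_row(LOCATIONS, COLUMNS, random.Random(seed)) for seed in range(50)]
```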

Use Cases

| Link Type | Columns | Why Link? |
| --- | --- | --- |
| Geographic | city, state, country, postal, coords | Must be geographically valid |
| Personal | first_name, gender, title | Gender-appropriate names |
| Financial | salary, bonus, tax, net | Mathematically related |
| Product | category, subcategory, brand | Valid category hierarchy |

PDL Syntax

json
{
  "generators": {
    "geo_location": {
      "type": "linked",
      "columns": ["city", "district", "country", "postal_code", "latitude", "longitude", "phone_prefix", "currency"],
      "source": "lists/geo/locations.json"
    },
    "person_info": {
      "type": "linked",
      "columns": ["first_name", "gender", "title"],
      "source": "lists/person/names_with_gender.json",
      "rules": {
        "title": {
          "male": ["Bay", "Mr."],
          "female": ["Bayan", "Ms.", "Mrs."]
        }
      }
    },
    "financials": {
      "type": "linked",
      "columns": ["salary", "bonus", "tax", "net_income"],
      "rules": {
        "salary": { "type": "statistical", "distribution": "lognormal", "params": { "mu": 10, "sigma": 0.5 } },
        "bonus": "salary * uniform(0.05, 0.20)",
        "tax": "(salary + bonus) * 0.25",
        "net_income": "salary + bonus - tax"
      }
    }
  },
  "entities": {
    "Employee": {
      "fields": {
        "city": { "generator": "geo_location.city" },
        "country": { "generator": "geo_location.country" },
        "postal_code": { "generator": "geo_location.postal_code" },
        "first_name": { "generator": "person_info.first_name" },
        "gender": { "generator": "person_info.gender" },
        "salary": { "generator": "financials.salary" },
        "net_income": { "generator": "financials.net_income" }
      }
    }
  }
}
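
The `financials` rules above describe derived columns: one sampled base value and three values computed from it. A direct Python transcription of those four rules might look like the sketch below; it mirrors the expressions in the example and makes no claim about how Phony evaluates rule strings.

```python
import random

def financials(rng: random.Random) -> dict:
    """Evaluate the four rules in dependency order."""
    salary = rng.lognormvariate(10, 0.5)          # lognormal with mu=10, sigma=0.5
    bonus = salary * rng.uniform(0.05, 0.20)      # "salary * uniform(0.05, 0.20)"
    tax = (salary + bonus) * 0.25                 # "(salary + bonus) * 0.25"
    net_income = salary + bonus - tax             # "salary + bonus - tax"
    return {"salary": salary, "bonus": bonus, "tax": tax, "net_income": net_income}

employee = financials(random.Random(1))
```

Because every column is computed from the same `salary` draw, the identity `net_income = salary + bonus - tax` holds for every generated row.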

Type 7: Event Sequence Generators

Event sequences generate chronologically valid date/time series.

Characteristics

  • Dates respect logical order (order < ship < deliver)
  • Configurable delays between events
  • Probability-based optional events
  • Supports business logic conditions

Use Cases

| Sequence | Events | Logic |
| --- | --- | --- |
| Order lifecycle | created → paid → shipped → delivered | Each step after previous |
| User journey | signup → trial_start → trial_end → converted/churned | Branching paths |
| Employment | applied → interviewed → hired → onboarded → promoted | With probabilities |

PDL Syntax

json
{
  "generators": {
    "order_timeline": {
      "type": "event_sequence",
      "events": [
        {
          "name": "created_at",
          "base": true,
          "range": { "start": "-1year", "end": "now" }
        },
        {
          "name": "paid_at",
          "after": "created_at",
          "delay": { "min": "0h", "max": "24h" },
          "probability": 0.95
        },
        {
          "name": "shipped_at",
          "after": "paid_at",
          "delay": { "min": "1d", "max": "3d" },
          "probability": 0.90
        },
        {
          "name": "delivered_at",
          "after": "shipped_at",
          "delay": { "min": "1d", "max": "7d" },
          "probability": 0.85
        }
      ]
    }
  },
  "entities": {
    "Order": {
      "fields": {
        "created_at": { "generator": "order_timeline.created_at" },
        "paid_at": { "generator": "order_timeline.paid_at" },
        "shipped_at": { "generator": "order_timeline.shipped_at" },
        "delivered_at": { "generator": "order_timeline.delivered_at" }
      }
    }
  }
}

Event Options

| Option | Description | Example |
| --- | --- | --- |
| base | The anchor event, generated first | true |
| after | This event occurs after the specified event | "created_at" |
| delay | Time range between events | { "min": "1d", "max": "7d" } |
| probability | Chance this event occurs; with a value below 1.0 the field can be null | 0.85 |
| distribution | Delay distribution | "exponential" |
| condition | Only generate if the condition is met | "status = 'shipped'" |
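
The interplay of `base`, `after`, `delay`, and `probability` can be sketched in Python as follows. Several behaviors here are assumptions rather than documented semantics: delays are drawn uniformly, events are listed in dependency order, and an event whose predecessor is null is itself null. Delays are written in hours for simplicity instead of the "1d"/"24h" strings used in PDL.

```python
import random
from datetime import datetime, timedelta

EVENTS = [
    {"name": "created_at", "base": True},
    {"name": "paid_at", "after": "created_at", "delay_hours": (0, 24), "probability": 0.95},
    {"name": "shipped_at", "after": "paid_at", "delay_hours": (24, 72), "probability": 0.90},
    {"name": "delivered_at", "after": "shipped_at", "delay_hours": (24, 168), "probability": 0.85},
]
NOW = datetime(2024, 3, 15, 12, 0)

def generate_timeline(rng: random.Random, now: datetime) -> dict:
    timeline = {}
    for event in EVENTS:
        if event.get("base"):
            # Anchor event: uniform over the past year ("-1year" to "now").
            timeline[event["name"]] = now - timedelta(seconds=rng.uniform(0, 365 * 24 * 3600))
            continue
        previous = timeline[event["after"]]
        occurs = previous is not None and rng.random() < event["probability"]
        if not occurs:
            timeline[event["name"]] = None  # skipped events, and their successors, stay null
            continue
        low, high = event["delay_hours"]
        timeline[event["name"]] = previous + timedelta(hours=rng.uniform(low, high))
    return timeline

timelines = [generate_timeline(random.Random(seed), NOW) for seed in range(200)]
```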

Combining Generator Types

The real power comes from combining types:

json
{
  "generators": {
    "user_id": {
      "type": "logic",
      "algorithm": "uuid_v7"
    },
    "country": {
      "type": "list",
      "source": "lists/geo/countries.json"
    },
    "first_name": {
      "type": "model",
      "source": "models/tr_TR/person_names.ngram",
      "generation": { "mode": "word" }
    },
    "user_profile": {
      "type": "template",
      "pattern": "ID: {{user_id}}\nName: {{first_name}} {{last_name}}\nCountry: {{country.name}} ({{country.code}})\nPhone: {{country.phone_prefix}} {{pattern:### ### ## ##}}"
    }
  }
}

Assuming a last_name generator is defined elsewhere in the PDL file, this produces output such as:

ID: 018d6e5c-5c3a-7f1e-8b5a-2c4d6e8f0a1b
Name: Mehmet Yılmaz
Country: Turkey (TR)
Phone: +90 532 847 23 91
