Advanced Data Generation Concepts

This document covers advanced concepts for generating realistic, privacy-preserving, and statistically accurate synthetic data. These features differentiate Phony from simple faker libraries.

Overview: Beyond Random Data

┌─────────────────────────────────────────────────────────────────────────┐
│           ADVANCED DATA GENERATION CONCEPTS                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  BASIC FAKER                      PHONY ADVANCED                        │
│  ───────────                      ──────────────                        │
│  Random city: "Paris"             Linked: Paris + France + EUR + +33    │
│  Random date: 2024-03-15          Event sequence: order < ship < deliver│
│  Random number: 42                Statistical: follows real distribution │
│  Random text: "Lorem ipsum"       Privacy-preserving: differential priv │
│                                                                          │
│  Result: Unrealistic data         Result: Production-like data          │
│  Joins fail, logic breaks         Joins work, logic preserved           │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Concept 1: Consistency

Consistency ensures that the same input always produces the same output across your entire dataset, including across different tables or databases.

Why Consistency Matters

WITHOUT CONSISTENCY                 WITH CONSISTENCY
──────────────────                 ────────────────
Table: users                        Table: users
  company: "Acme Corp"                company: "Sunrise Ltd"

Table: invoices                     Table: invoices
  company: "Beta Inc"    ← Different!  company: "Sunrise Ltd"  ← Same!

Result: JOIN fails                  Result: JOIN works perfectly

Use Cases

Use Case	Without Consistency	With Consistency
Joins	Random values break FK relationships	Same company name across tables
Cardinality	Loses distribution patterns	Preserves approximate cardinality
Deduplication	Same person appears with different names	Consistent identity across records
Testing	Different output each run	Reproducible test data

PDL Syntax

json

{
  "generators": {
    "company_name": {
      "type": "model",
      "source": "models/company_names.ngram",
      "generation": { "mode": "word" },
      "consistency": {
        "enabled": true,
        "key": "company_id"
      }
    }
  }
}

With consistency enabled:

Input company_id: 123 always generates "Sunrise Ltd"
Input company_id: 456 always generates "Northern Tech"
Same company_id in any table → same company name
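
One way to get this behavior is to derive the value from a keyed hash of the consistency key instead of a fresh random draw. The sketch below is only an illustration, not Phony's implementation: `SECRET`, `COMPANY_NAMES`, and `consistent_pick` are hypothetical names, and a small fixed list stands in for the n-gram model.

```python
import hashlib
import hmac

# Hypothetical secret and candidate list; the real model-based generator is not shown here.
SECRET = b"demo-secret"
COMPANY_NAMES = ["Sunrise Ltd", "Northern Tech", "Blue Harbor", "Cedar Works"]

def consistent_pick(key: object, candidates: list[str], secret: bytes = SECRET) -> str:
    """Map a consistency key to one candidate deterministically."""
    digest = hmac.new(secret, str(key).encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(candidates)
    return candidates[index]

# The same company_id yields the same name no matter which table asks for it.
users_company = consistent_pick(123, COMPANY_NAMES)
invoices_company = consistent_pick(123, COMPANY_NAMES)
```

With a short candidate list, different keys can land on the same name; a generative model or a larger list reduces such collisions.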

Consistency Keys

json

{
  "generators": {
    "user_email": {
      "type": "template",
      "pattern": "{{lowercase(first_name)}}.{{lowercase(last_name)}}@example.com",
      "consistency": {
        "enabled": true,
        "key": "user_id",
        "scope": "global"
      }
    }
  }
}

Scope	Behavior
`table`	Same key → same value within this table
`entity`	Same key → same value within this entity type
`global`	Same key → same value across entire dataset
`database`	Same key → same value across all databases in sync

Primary Key Consistency

When applied to primary keys, consistency is automatic:

json

{
  "entities": {
    "User": {
      "fields": {
        "id": {
          "type": "logic",
          "algorithm": "uuid_v7",
          "primary_key": true
        }
      }
    },
    "Order": {
      "fields": {
        "user_id": {
          "ref": "User.id"
        }
      }
    }
  }
}

The system automatically:

Uses format-preserving encryption (FPE) for primary keys
Applies same transformation to all foreign key references
Maintains referential integrity across tables

Concept 2: Linked Generators

Linked generators ensure that related columns generate coherent data together. When multiple columns share a strong inter-dependency, linking them produces realistic combinations.

Why Linking Matters

WITHOUT LINKING                     WITH LINKING
───────────────                     ────────────
city: "Ankara"                      city: "Ankara"
state: "California"   ← Invalid!    state: null        ← Turkey has no states
country: "Japan"      ← Nonsense!   country: "Turkey"  ← Correct
postal: "90210"       ← Wrong!      postal: "06100"    ← Valid Ankara code
phone_prefix: "+44"   ← UK prefix!  phone_prefix: "+90" ← Turkey prefix
currency: "JPY"       ← Yen!        currency: "TRY"    ← Turkish Lira
lat: 35.6762                        lat: 39.9334       ← Actual Ankara
lng: 139.6503         ← Tokyo!      lng: 32.8597       ← Actual Ankara

PDL Syntax

json

{
  "generators": {
    "location": {
      "type": "linked",
      "columns": ["city", "state", "country", "postal_code", "phone_prefix", "currency"],
      "source": "lists/geo/locations.json"
    }
  },
  "entities": {
    "Address": {
      "fields": {
        "city": { "generator": "location.city" },
        "state": { "generator": "location.state" },
        "country": { "generator": "location.country" },
        "postal_code": { "generator": "location.postal_code" },
        "phone_prefix": { "generator": "location.phone_prefix" },
        "currency": { "generator": "location.currency" }
      }
    }
  }
}
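
The idea behind a linked generator can be sketched as picking one whole row from a reference list and reading every column from that row. This is a minimal sketch under stated assumptions: `LOCATIONS` is a hypothetical stand-in for the contents of `lists/geo/locations.json`, whose real schema is not shown in this document.

```python
import random

# Hypothetical rows; each row is one valid combination of the linked columns.
LOCATIONS = [
    {"city": "Ankara", "state": None, "country": "Turkey",
     "postal_code": "06100", "phone_prefix": "+90", "currency": "TRY"},
    {"city": "Paris", "state": None, "country": "France",
     "postal_code": "75001", "phone_prefix": "+33", "currency": "EUR"},
    {"city": "Los Angeles", "state": "California", "country": "United States",
     "postal_code": "90001", "phone_prefix": "+1", "currency": "USD"},
]

def linked_row(rng: random.Random) -> dict:
    """Pick one whole row so every linked column comes from the same place."""
    return dict(rng.choice(LOCATIONS))

rng = random.Random(7)
address = linked_row(rng)
# address["city"], address["country"], address["currency"] always belong together.
```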

Common Linking Patterns

Geographic Data

json

{
  "geo_location": {
    "type": "linked",
    "columns": ["city", "district", "postal_code", "latitude", "longitude"],
    "source": "lists/geo/tr_TR/locations.json"
  }
}

Personal Data

json

{
  "person": {
    "type": "linked",
    "columns": ["first_name", "gender", "title"],
    "rules": {
      "first_name.gender": "gender",
      "title": {
        "male": ["Bay", "Mr."],
        "female": ["Bayan", "Ms.", "Mrs."]
      }
    }
  }
}

Financial Data

json

{
  "financials": {
    "type": "linked",
    "columns": ["salary", "bonus", "tax", "net_income"],
    "rules": {
      "bonus": "salary * uniform(0.05, 0.20)",
      "tax": "(salary + bonus) * 0.25",
      "net_income": "salary + bonus - tax"
    }
  }
}

Time-Based Data

json

{
  "employment": {
    "type": "linked",
    "columns": ["birth_date", "hire_date", "age_at_hire"],
    "rules": {
      "hire_date": "birth_date + years(18-40)",
      "age_at_hire": "years_between(birth_date, hire_date)"
    }
  }
}

Concept 3: Statistical Generators

Statistical generators produce data that matches real-world distributions, not just random values.

Categorical Generator

Generates values maintaining the frequency distribution of the original data.

ORIGINAL DATA DISTRIBUTION          GENERATED DATA DISTRIBUTION
──────────────────────────          ───────────────────────────
status: completed  (70%)            status: completed  (70%)
status: pending    (20%)            status: pending    (20%)
status: cancelled  (10%)            status: cancelled  (10%)

json

{
  "generators": {
    "order_status": {
      "type": "statistical",
      "mode": "categorical",
      "source": "inline",
      "values": [
        { "value": "completed", "weight": 70 },
        { "value": "pending", "weight": 20 },
        { "value": "cancelled", "weight": 10 }
      ],
      "differential_privacy": {
        "enabled": true,
        "epsilon": 1.0
      }
    }
  }
}
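
The weighted selection itself can be sketched with the Python standard library. This sketch covers only the `values` and `weight` part of the config; the `differential_privacy` block is not modeled here.

```python
import random
from collections import Counter

VALUES = ["completed", "pending", "cancelled"]
WEIGHTS = [70, 20, 10]

rng = random.Random(42)
# random.choices draws with replacement, proportionally to the weights.
sample = rng.choices(VALUES, weights=WEIGHTS, k=10_000)
shares = {value: count / len(sample) for value, count in Counter(sample).items()}
```

Over a large sample, `shares` should land close to 0.70, 0.20, and 0.10.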

Continuous Generator

Generates numeric values following a statistical distribution.

json

{
  "generators": {
    "age": {
      "type": "statistical",
      "mode": "continuous",
      "distribution": "normal",
      "params": {
        "mean": 35,
        "stddev": 12
      },
      "constraints": {
        "min": 18,
        "max": 85
      }
    },
    "income": {
      "type": "statistical",
      "mode": "continuous",
      "distribution": "lognormal",
      "params": {
        "mu": 10.5,
        "sigma": 0.8
      }
    },
    "response_time": {
      "type": "statistical",
      "mode": "continuous",
      "distribution": "exponential",
      "params": {
        "lambda": 0.5
      }
    }
  }
}
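
The three generators above correspond to draws like the following. This is a sketch with an assumption: the document does not say how `constraints` are enforced, so the example redraws out-of-range values (rejection sampling) instead of clamping them.

```python
import random

def truncated_normal(rng, mean, stddev, lo, hi):
    """Redraw until the value falls inside [lo, hi] (rejection sampling)."""
    while True:
        value = rng.gauss(mean, stddev)
        if lo <= value <= hi:
            return value

rng = random.Random(1)
ages = [truncated_normal(rng, 35, 12, 18, 85) for _ in range(1000)]
incomes = [rng.lognormvariate(10.5, 0.8) for _ in range(1000)]   # mu, sigma
response_times = [rng.expovariate(0.5) for _ in range(1000)]     # lambda; mean is 1 / lambda
```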

Supported Distributions

Distribution	Use Case	Parameters
`normal`	Age, height, IQ	`mean`, `stddev`
`lognormal`	Income, prices	`mu`, `sigma`
`exponential`	Wait times, failure rates	`lambda`
`uniform`	Equally likely values within a range	`min`, `max`
`poisson`	Event counts	`lambda`
`beta`	Probabilities, percentages	`alpha`, `beta`
`gamma`	Wait times, rainfall	`shape`, `scale`

Algebraic Generator

Detects and preserves mathematical relationships between columns.

json

{
  "generators": {
    "order_total": {
      "type": "statistical",
      "mode": "algebraic",
      "columns": ["subtotal", "tax", "shipping", "discount", "total"],
      "relationship": "total = subtotal + tax + shipping - discount"
    }
  }
}

The generator:

Identifies the algebraic relationship
Generates values that satisfy the equation
Maintains realistic distributions for each component

Multivariate Generator (Correlated Data)

Preserves correlations between multiple numeric columns.

json

{
  "generators": {
    "real_estate": {
      "type": "statistical",
      "mode": "multivariate",
      "columns": ["price", "sqft", "bedrooms", "bathrooms", "lot_size"],
      "correlations": {
        "price-sqft": 0.85,
        "price-bedrooms": 0.65,
        "sqft-bedrooms": 0.70,
        "bedrooms-bathrooms": 0.80
      }
    }
  }
}

This ensures:

Larger houses tend to have higher prices (positive correlation)
More bedrooms correlate with more bathrooms
Realistic property listings, not random combinations
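
For two columns, a target correlation can be produced by mixing one shared normal draw with one independent draw. The sketch below shows that two-column special case; with more columns, the usual generalization is a Cholesky factor of the full correlation matrix. The means and spreads used for `sqft` and `price` are hypothetical.

```python
import math
import random

def correlated_pair(rng, rho):
    """Return two standard-normal values whose expected correlation is rho."""
    x = rng.gauss(0.0, 1.0)
    z = rng.gauss(0.0, 1.0)
    y = rho * x + math.sqrt(1.0 - rho * rho) * z
    return x, y

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var_x = sum((a - mx) ** 2 for a in xs)
    var_y = sum((b - my) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(3)
pairs = [correlated_pair(rng, 0.85) for _ in range(5000)]
# Rescale to illustrative units; linear rescaling does not change the correlation.
sqft = [1800 + 500 * x for x, _ in pairs]
price = [350_000 + 120_000 * y for _, y in pairs]
observed = pearson(sqft, price)
```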

Concept 4: Event Sequences

Event sequences generate chronologically valid date/time series where order matters.

The Problem

WITHOUT EVENT SEQUENCES                       WITH EVENT SEQUENCES
───────────────────────                       ────────────────────
order_date:    2024-03-15                     order_date:    2024-03-15
payment_date:  2024-03-10  ← Before order!    payment_date:  2024-03-15  ← Same day
ship_date:     2024-03-08  ← Before payment!  ship_date:     2024-03-17  ← 2 days later
delivery_date: 2024-03-20                     delivery_date: 2024-03-22  ← 5 days later

Result: Logically impossible                  Result: Realistic timeline

PDL Syntax

json

{
  "generators": {
    "order_timeline": {
      "type": "event_sequence",
      "events": [
        {
          "name": "created_at",
          "base": true,
          "range": { "start": "-1year", "end": "now" }
        },
        {
          "name": "paid_at",
          "after": "created_at",
          "delay": { "min": "0h", "max": "24h" },
          "probability": 0.95
        },
        {
          "name": "shipped_at",
          "after": "paid_at",
          "delay": { "min": "1d", "max": "3d" },
          "probability": 0.90
        },
        {
          "name": "delivered_at",
          "after": "shipped_at",
          "delay": { "min": "1d", "max": "7d" },
          "probability": 0.85
        },
        {
          "name": "reviewed_at",
          "after": "delivered_at",
          "delay": { "min": "1d", "max": "30d" },
          "probability": 0.30
        }
      ]
    }
  },
  "entities": {
    "Order": {
      "fields": {
        "created_at": { "generator": "order_timeline.created_at" },
        "paid_at": { "generator": "order_timeline.paid_at" },
        "shipped_at": { "generator": "order_timeline.shipped_at" },
        "delivered_at": { "generator": "order_timeline.delivered_at" },
        "reviewed_at": { "generator": "order_timeline.reviewed_at" }
      }
    }
  }
}
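
The chaining logic can be sketched as follows. It assumes uniformly distributed delays and assumes that when an event is skipped, every event that depends on it is null as well; the document does not state either behavior explicitly, and `generate_timeline` is a hypothetical name.

```python
import random
from datetime import datetime, timedelta

# (name, previous event, min delay, max delay, probability), mirroring the config above.
EVENTS = [
    ("paid_at", "created_at", timedelta(hours=0), timedelta(hours=24), 0.95),
    ("shipped_at", "paid_at", timedelta(days=1), timedelta(days=3), 0.90),
    ("delivered_at", "shipped_at", timedelta(days=1), timedelta(days=7), 0.85),
    ("reviewed_at", "delivered_at", timedelta(days=1), timedelta(days=30), 0.30),
]

def generate_timeline(rng, now):
    """Draw the base event, then chain each later event off its predecessor."""
    created = now - timedelta(seconds=rng.uniform(0, 365 * 24 * 3600))
    timeline = {"created_at": created}
    for name, after, lo, hi, probability in EVENTS:
        previous = timeline[after]
        # A missing predecessor or a failed probability draw leaves this event null.
        if previous is None or rng.random() >= probability:
            timeline[name] = None
            continue
        delay = rng.uniform(lo.total_seconds(), hi.total_seconds())
        timeline[name] = previous + timedelta(seconds=delay)
    return timeline

rng = random.Random(11)
timelines = [generate_timeline(rng, datetime(2024, 3, 15)) for _ in range(200)]
```

Because each timestamp is computed from its predecessor plus a non-negative delay, no event can precede the one it follows.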

Event Sequence Features

Feature	Description
`base`	The anchor event, generated first
`after`	This event occurs after the specified event
`delay`	Time range between events
`probability`	Chance that this event occurs; below 1.0, the field can be null
`distribution`	Delay distribution (uniform, exponential, etc.)
`condition`	Only generate if condition is met

Complex Event Patterns

json

{
  "generators": {
    "subscription_lifecycle": {
      "type": "event_sequence",
      "events": [
        { "name": "signup_at", "base": true },
        { "name": "trial_start", "after": "signup_at", "delay": "0d" },
        { "name": "trial_end", "after": "trial_start", "delay": "14d" },
        {
          "name": "converted_at",
          "after": "trial_end",
          "delay": { "min": "0d", "max": "7d" },
          "probability": 0.25
        },
        {
          "name": "churned_at",
          "after": "trial_end",
          "delay": { "min": "0d", "max": "30d" },
          "probability": 0.75,
          "condition": "converted_at IS NULL"
        },
        {
          "name": "renewed_at",
          "after": "converted_at",
          "delay": "30d",
          "probability": 0.85,
          "condition": "converted_at IS NOT NULL"
        }
      ]
    }
  }
}

Concept 5: Cross-Table Relationships

Generate values that correctly aggregate across related tables.

Cross Table Sum

json

{
  "entities": {
    "Store": {
      "fields": {
        "id": { "type": "logic", "algorithm": "uuid_v7", "primary_key": true },
        "name": { "generator": "store_name" },
        "total_sales": {
          "type": "cross_table",
          "operation": "sum",
          "from": "Transaction",
          "field": "amount",
          "where": "Transaction.store_id = Store.id"
        }
      }
    },
    "Transaction": {
      "fields": {
        "id": { "type": "logic", "algorithm": "uuid_v7", "primary_key": true },
        "store_id": { "ref": "Store.id" },
        "amount": { "generator": "price" }
      }
    }
  }
}

This ensures Store.total_sales actually equals the sum of all transactions for that store.
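
A simple way to guarantee this is to generate the child rows first and derive the parent value from them. The sketch below illustrates that ordering only; it uses hypothetical integer ids instead of UUIDv7 for brevity and does not reflect how Phony schedules generation internally.

```python
import random
from decimal import Decimal

rng = random.Random(5)
stores = [{"id": store_id, "total_sales": Decimal("0.00")} for store_id in (1, 2, 3)]

# Generate the child rows first.
transactions = []
for txn_id in range(1, 21):
    store = rng.choice(stores)
    amount = Decimal(rng.randint(100, 50_000)) / 100  # whole cents, exact decimal math
    transactions.append({"id": txn_id, "store_id": store["id"], "amount": amount})

# Then derive each parent's aggregate from its children, so the two cannot disagree.
for store in stores:
    store["total_sales"] = sum(
        (t["amount"] for t in transactions if t["store_id"] == store["id"]),
        Decimal("0.00"),
    )
```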

Cross Table Operations

Operation	Description	Example
`sum`	Sum of related values	Total order amount
`count`	Count of related rows	Number of orders
`avg`	Average of related values	Average rating
`min`	Minimum related value	First order date
`max`	Maximum related value	Last login date

Example: Order Totals

json

{
  "entities": {
    "Order": {
      "fields": {
        "id": { "type": "logic", "algorithm": "uuid_v7", "primary_key": true },
        "item_count": {
          "type": "cross_table",
          "operation": "count",
          "from": "OrderItem",
          "where": "OrderItem.order_id = Order.id"
        },
        "subtotal": {
          "type": "cross_table",
          "operation": "sum",
          "from": "OrderItem",
          "field": "line_total",
          "where": "OrderItem.order_id = Order.id"
        },
        "tax": {
          "computed": "subtotal * 0.18"
        },
        "total": {
          "computed": "subtotal + tax"
        }
      }
    },
    "OrderItem": {
      "fields": {
        "order_id": { "ref": "Order.id" },
        "quantity": { "generator": "quantity" },
        "unit_price": { "generator": "price" },
        "line_total": { "computed": "quantity * unit_price" }
      }
    }
  }
}

Concept 6: Differential Privacy

Differential privacy provides a mathematical bound on how much any single individual's record can influence the generated data, which limits what an attacker can learn about that individual from the output.

What is Differential Privacy?

┌─────────────────────────────────────────────────────────────────────────┐
│           DIFFERENTIAL PRIVACY                                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  GUARANTEE: Adding or removing ANY single individual from the           │
│  dataset does not significantly change the output distribution.         │
│                                                                          │
│  RESULT: Even with auxiliary information, an attacker cannot            │
│  determine if a specific person was in the original dataset.            │
│                                                                          │
│  CONTROLLED BY: Epsilon (ε) - the privacy budget                        │
│  • ε = 0.1  → Very private, more noise, less utility                   │
│  • ε = 1.0  → Balanced privacy and utility                             │
│  • ε = 10.0 → Less private, less noise, more utility                   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

PDL Syntax

json

{
  "generators": {
    "salary": {
      "type": "statistical",
      "mode": "continuous",
      "distribution": "normal",
      "params": { "mean": 75000, "stddev": 25000 },
      "differential_privacy": {
        "enabled": true,
        "epsilon": 1.0,
        "mechanism": "laplace"
      }
    },
    "age_group": {
      "type": "statistical",
      "mode": "categorical",
      "values": ["18-25", "26-35", "36-45", "46-55", "56+"],
      "differential_privacy": {
        "enabled": true,
        "epsilon": 0.5,
        "mechanism": "exponential"
      }
    }
  }
}

Privacy Mechanisms

Mechanism	Best For	How It Works
`laplace`	Numeric data	Adds Laplace-distributed noise
`gaussian`	Numeric data	Adds Gaussian-distributed noise
`exponential`	Categorical data	Randomizes selection with privacy guarantees
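
The effect of epsilon on the `laplace` mechanism can be sketched with a noisy count. This is an illustration under stated assumptions: the released value is a count with sensitivity 1, and `laplace_noise` and `private_count` are hypothetical helper names. How Phony applies the mechanism inside a generator is not described in this document.

```python
import random

def laplace_noise(rng, scale):
    """Laplace(0, scale) as the difference of two exponential draws with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    return true_count + laplace_noise(rng, sensitivity / epsilon)

def mean_abs_error(released, truth=1000):
    return sum(abs(value - truth) for value in released) / len(released)

rng = random.Random(0)
strict = [private_count(1000, 0.1, rng) for _ in range(2000)]   # small epsilon, more noise
loose = [private_count(1000, 10.0, rng) for _ in range(2000)]   # large epsilon, less noise
```

The expected absolute error equals the noise scale, so it is about 10 for epsilon 0.1 and about 0.1 for epsilon 10.0.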

Compliance Benefits

Regulation	Differential Privacy Benefit
GDPR	Helps meet anonymization requirements
HIPAA	Helps satisfy de-identification standards
CCPA	Reduces the risk of re-identification

Concept 7: Geo-Aware Generation

Generate geographic data with built-in privacy and validity.

Coordinate Fuzzing

json

{
  "generators": {
    "location": {
      "type": "geo",
      "mode": "coordinates",
      "source": "original",
      "privacy": {
        "method": "k_anonymity",
        "k": 5,
        "radius_km": 1.0
      }
    }
  }
}

The generator:

Takes original lat/long coordinates
Finds regions with at least k other points
Moves the point within that region
Adds additional random fuzzing within radius
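
The last step, random fuzzing within a radius, can be sketched as below. The region search for k-anonymity is omitted. The conversion from kilometers to degrees is a flat-earth approximation that is reasonable only for small radii away from the poles, and `fuzz_point` is a hypothetical name.

```python
import math
import random

KM_PER_DEGREE_LAT = 111.32  # approximate length of one degree of latitude

def fuzz_point(lat, lng, radius_km, rng):
    """Move a point to a uniformly random spot within radius_km of the original."""
    distance = radius_km * math.sqrt(rng.random())  # sqrt gives uniform density over the disc
    bearing = rng.uniform(0.0, 2.0 * math.pi)
    dlat = (distance * math.cos(bearing)) / KM_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    dlng = (distance * math.sin(bearing)) / (KM_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return lat + dlat, lng + dlng

rng = random.Random(9)
fuzzed = [fuzz_point(39.9334, 32.8597, 1.0, rng) for _ in range(500)]
```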

Valid Location Generation

json

{
  "generators": {
    "turkish_address": {
      "type": "geo",
      "mode": "address",
      "locale": "tr_TR",
      "constraints": {
        "country": "TR",
        "valid_postal": true,
        "valid_coordinates": true
      },
      "components": {
        "city": { "weight_by": "population" },
        "district": { "within": "city" },
        "postal_code": { "within": "district" },
        "coordinates": { "within": "postal_code" }
      }
    }
  }
}

HIPAA-Compliant Address Generation

For healthcare data, special rules apply:

json

{
  "generators": {
    "hipaa_address": {
      "type": "geo",
      "mode": "hipaa_safe_harbor",
      "rules": {
        "zip_truncation": true,
        "small_population_generalization": true,
        "coordinates": "disabled"
      }
    }
  }
}

HIPAA Safe Harbor requirements:

Truncate ZIP codes to the first 3 digits
If the 3-digit ZIP area contains 20,000 or fewer people, use "000"
Remove street address, city, and other subdivisions smaller than a state
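
The ZIP rule can be sketched as a lookup: keep the 3-digit prefix only when the area it covers holds more than 20,000 people, otherwise return "000". The population figures below are hypothetical placeholders; real values come from census data, and `safe_harbor_zip` is an illustrative name, not a Phony function.

```python
# Hypothetical population counts per 3-digit ZIP prefix.
ZIP3_POPULATION = {"902": 350_000, "036": 12_500}

def safe_harbor_zip(zip_code: str, populations: dict[str, int] = ZIP3_POPULATION) -> str:
    """Keep the 3-digit prefix only if its area holds more than 20,000 people."""
    prefix = zip_code.strip()[:3]
    # Unknown prefixes are treated as small areas, which is the conservative choice.
    if populations.get(prefix, 0) > 20_000:
        return prefix
    return "000"

large_area = safe_harbor_zip("90210")
small_area = safe_harbor_zip("03601")
```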

Concept 8: Format-Preserving Transformation (Scramble)

Transform data while preserving its format, useful for masking sensitive data.

Character Scramble

json

{
  "generators": {
    "masked_email": {
      "type": "scramble",
      "mode": "character",
      "preserve": ["@", "."],
      "rules": {
        "letters": "random_letter",
        "digits": "random_digit"
      }
    }
  }
}

Input	Output
`john.doe@example.com`	`xkpr.qwm@hdnvbzq.trm`
`jane_123@test.org`	`yznq_847@pqrs.vwx`
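
A character scramble of this kind can be sketched in a few lines. The sketch assumes that characters with no rule, such as `_`, pass through unchanged, which matches the second row of the table above; `scramble` is a hypothetical name.

```python
import random
import string

def scramble(text, rng, preserve="@."):
    """Swap letters for random letters and digits for random digits; keep everything else."""
    out = []
    for ch in text:
        if ch in preserve:
            out.append(ch)
        elif ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)  # no rule for this character, so it passes through
    return "".join(out)

rng = random.Random(2)
masked = scramble("jane_123@test.org", rng)
```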

Phone Number Scramble

json

{
  "generators": {
    "masked_phone": {
      "type": "scramble",
      "mode": "pattern",
      "preserve_format": true,
      "preserve_country_code": true
    }
  }
}

Input	Output
`+90 532 123 45 67`	`+90 847 956 23 18`
`+1 (555) 123-4567`	`+1 (555) 847-9382`

Credit Card Scramble (Luhn-Valid)

json

{
  "generators": {
    "masked_credit_card": {
      "type": "scramble",
      "mode": "credit_card",
      "preserve_bin": true,
      "luhn_valid": true
    }
  }
}

Input	Output
`4532-1234-5678-9012`	`4532-1247-2391-4856`

The first 6 digits (the BIN) are preserved, the remaining digits are scrambled, and the result passes the Luhn check.
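
A Luhn-valid scramble can be sketched by keeping the BIN, randomizing the middle digits, and recomputing the final check digit. The function names below are hypothetical and the sketch is not Phony's implementation.

```python
import random

def luhn_check_digit(payload: str) -> str:
    """Compute the digit that makes payload + digit pass the Luhn check."""
    total = 0
    # Walk right to left; the rightmost payload digit is doubled first.
    for offset, ch in enumerate(reversed(payload)):
        digit = int(ch)
        if offset % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return str((10 - total % 10) % 10)

def luhn_valid(number: str) -> bool:
    return luhn_check_digit(number[:-1]) == number[-1]

def scramble_card(card: str, rng: random.Random) -> str:
    """Keep the first 6 digits, randomize the middle, recompute the check digit."""
    digits = [ch for ch in card if ch.isdigit()]
    middle = [str(rng.randint(0, 9)) for _ in digits[6:-1]]
    payload = "".join(digits[:6] + middle)
    new_digits = iter(payload + luhn_check_digit(payload))
    # Re-insert the digits into the original layout so separators survive.
    return "".join(next(new_digits) if ch.isdigit() else ch for ch in card)

rng = random.Random(4)
masked_card = scramble_card("4532-1234-5678-9012", rng)
```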

Concept 9: Structured Data Masks

Transform data within structured formats (JSON, XML, CSV, HTML).

JSON Mask

json

{
  "generators": {
    "masked_json": {
      "type": "mask",
      "format": "json",
      "paths": {
        "$.user.email": { "generator": "masked_email" },
        "$.user.phone": { "generator": "masked_phone" },
        "$.user.ssn": { "generator": "masked_ssn" },
        "$.payments[*].card_number": { "generator": "masked_credit_card" }
      }
    }
  }
}

Input:

json

{
  "user": {
    "name": "John Doe",
    "email": "john@example.com",
    "phone": "+1-555-123-4567"
  },
  "payments": [
    { "card_number": "4532-1234-5678-9012" }
  ]
}

Output:

json

{
  "user": {
    "name": "John Doe",
    "email": "xkpr@hdnvbzq.trm",
    "phone": "+1-555-847-9382"
  },
  "payments": [
    { "card_number": "4532-1247-2391-4856" }
  ]
}

XML Mask

json

{
  "generators": {
    "masked_xml": {
      "type": "mask",
      "format": "xml",
      "paths": {
        "//customer/email": { "generator": "masked_email" },
        "//customer/ssn": { "generator": "masked_ssn" },
        "//payment/@card-number": { "generator": "masked_credit_card" }
      }
    }
  }
}

Regex Mask

For custom patterns:

json

{
  "generators": {
    "order_reference": {
      "type": "mask",
      "format": "regex",
      "pattern": "^(ORD-)(\\d{8})(-)(\\w{4})$",
      "groups": {
        "1": "passthrough",
        "2": { "generator": "random_digits", "length": 8 },
        "3": "passthrough",
        "4": { "generator": "random_alphanumeric", "length": 4 }
      }
    }
  }
}

Input	Output
`ORD-20240315-AB12`	`ORD-84729163-XK47`
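
The group handling can be sketched with Python's `re` module. The sketch assumes that values not matching the pattern are left untouched and that `random_alphanumeric` means uppercase letters and digits; neither detail is stated in this document.

```python
import random
import re
import string

PATTERN = re.compile(r"^(ORD-)(\d{8})(-)(\w{4})$")

def mask_order_reference(value, rng):
    """Keep groups 1 and 3 (passthrough), regenerate groups 2 and 4."""
    match = PATTERN.match(value)
    if match is None:
        return value  # assumption: non-matching values are returned unchanged
    digits = "".join(rng.choice(string.digits) for _ in range(8))
    alnum = "".join(rng.choice(string.ascii_uppercase + string.digits) for _ in range(4))
    return f"{match.group(1)}{digits}{match.group(3)}{alnum}"

rng = random.Random(6)
masked_ref = mask_order_reference("ORD-20240315-AB12", rng)
```

The masked value still matches the original pattern, so downstream validation that relies on the format keeps working.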

Comparison: Basic vs Advanced

Aspect	Basic Faker	Phony Advanced
City + Country	Random, may not match	Linked, always valid
Date sequences	Random, may be illogical	Event sequences, always chronological
Numeric distributions	Uniform random	Statistical, matches real patterns
Cross-table totals	Don't match	Computed, always correct
Privacy	None	Differential privacy, k-anonymity
Format preservation	Not supported	Scramble with format intact
Structured data	Not supported	JSON/XML/Regex masks

Next Steps

Generator Types - Core generator types
PDL Specification - Full schema language reference
N-gram Models - Statistical text generation
Execution Model - Runtime architecture
