Advanced Data Generation Concepts

This document covers advanced concepts for generating realistic, privacy-preserving, and statistically accurate synthetic data. These features differentiate Phony from simple faker libraries.

Overview: Beyond Random Data

┌─────────────────────────────────────────────────────────────────────────┐
│           ADVANCED DATA GENERATION CONCEPTS                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  BASIC FAKER                      PHONY ADVANCED                        │
│  ───────────                      ──────────────                        │
│  Random city: "Paris"             Linked: Paris + France + EUR + +33    │
│  Random date: 2024-03-15          Event sequence: order < ship < deliver│
│  Random number: 42                Statistical: follows real distribution │
│  Random text: "Lorem ipsum"       Privacy-preserving: differential priv │
│                                                                          │
│  Result: Unrealistic data         Result: Production-like data          │
│  Joins fail, logic breaks         Joins work, logic preserved           │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Concept 1: Consistency

Consistency ensures that the same input key always produces the same output value, whether that key appears in one table, in several tables, or in several databases.

Why Consistency Matters

WITHOUT CONSISTENCY                 WITH CONSISTENCY
──────────────────                 ────────────────
Table: users                        Table: users
  company: "Acme Corp"                company: "Sunrise Ltd"

Table: invoices                     Table: invoices
  company: "Beta Inc"    ← Different!  company: "Sunrise Ltd"  ← Same!

Result: JOIN fails                  Result: JOIN works perfectly

Use Cases

Use Case       Without Consistency                       With Consistency
────────       ───────────────────                       ────────────────
Joins          Random values break FK relationships      Same company name across tables
Cardinality    Loses distribution patterns               Preserves approximate cardinality
Deduplication  Same person appears with different names  Consistent identity across records
Testing        Different output each run                 Reproducible test data

PDL Syntax

json
{
  "generators": {
    "company_name": {
      "type": "model",
      "source": "models/company_names.ngram",
      "generation": { "mode": "word" },
      "consistency": {
        "enabled": true,
        "key": "company_id"
      }
    }
  }
}

With consistency enabled:

  • Input company_id: 123 always generates "Sunrise Ltd"
  • Input company_id: 456 always generates "Northern Tech"
  • Same company_id in any table → same company name
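
The behavior above can be sketched in Python. This is a minimal illustration that assumes consistency is achieved by hashing the key with a secret; the source does not say how Phony implements it. The word lists, the `secret` value, and the function name `consistent_company_name` are hypothetical.

```python
import hashlib
import hmac

# Hypothetical word lists; the PDL example draws names from an n-gram model instead.
PREFIXES = ["Sunrise", "Northern", "Blue", "Silver", "Green"]
SUFFIXES = ["Ltd", "Tech", "Group", "Labs", "Holdings"]


def consistent_company_name(company_id, secret=b"demo-secret"):
    """Derive a company name deterministically from the consistency key."""
    message = str(company_id).encode("utf-8")
    digest = hmac.new(secret, message, hashlib.sha256).digest()
    prefix = PREFIXES[digest[0] % len(PREFIXES)]
    suffix = SUFFIXES[digest[1] % len(SUFFIXES)]
    return f"{prefix} {suffix}"


# The same key yields the same name no matter which table asks for it.
users_row = {"company_id": 123, "company": consistent_company_name(123)}
invoices_row = {"company_id": 123, "company": consistent_company_name(123)}
```

Because the output depends only on the key and the secret, no lookup table has to be shared between tables or runs.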

Consistency Keys

json
{
  "generators": {
    "user_email": {
      "type": "template",
      "pattern": "{{lowercase(first_name)}}.{{lowercase(last_name)}}@example.com",
      "consistency": {
        "enabled": true,
        "key": "user_id",
        "scope": "global"
      }
    }
  }
}

Scope     Behavior
─────     ────────
table     Same key → same value within this table
entity    Same key → same value within this entity type
global    Same key → same value across entire dataset
database  Same key → same value across all databases in sync

Primary Key Consistency

When applied to primary keys, consistency is automatic:

json
{
  "entities": {
    "User": {
      "fields": {
        "id": {
          "type": "logic",
          "algorithm": "uuid_v7",
          "primary_key": true
        }
      }
    },
    "Order": {
      "fields": {
        "user_id": {
          "ref": "User.id"
        }
      }
    }
  }
}

The system automatically:

  1. Uses format-preserving encryption (FPE) for primary keys
  2. Applies the same transformation to all foreign key references
  3. Maintains referential integrity across tables
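
To illustrate why a keyed, format-preserving transformation keeps foreign keys aligned, here is a toy sketch in Python. It assumes 6-digit integer IDs (not the UUIDs in the PDL example) and uses a small Feistel network built on HMAC. It is not Phony's implementation and not a vetted scheme such as NIST FF1; the names `encrypt_id`, `decrypt_id`, and `KEY` are hypothetical.

```python
import hashlib
import hmac

HALF = 1000  # each half holds 3 decimal digits, so IDs live in 0..999_999


def _round_value(key, round_no, value):
    """Keyed pseudo-random round function, reduced into one half's range."""
    message = f"{round_no}:{value}".encode("utf-8")
    digest = hmac.new(key, message, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % HALF


def encrypt_id(key, value, rounds=8):
    """Map a 6-digit ID to another 6-digit ID; the mapping is a permutation."""
    left, right = divmod(value, HALF)
    for r in range(rounds):
        left, right = right, (left + _round_value(key, r, right)) % HALF
    return left * HALF + right


def decrypt_id(key, value, rounds=8):
    """Invert encrypt_id by running the rounds backwards."""
    left, right = divmod(value, HALF)
    for r in reversed(range(rounds)):
        left, right = (right - _round_value(key, r, left)) % HALF, left
    return left * HALF + right


KEY = b"demo-key"
masked_pk = encrypt_id(KEY, 4217)  # User.id
masked_fk = encrypt_id(KEY, 4217)  # Order.user_id referencing the same user
```

Since the mapping is a permutation of the ID space, distinct keys never collide, and applying it to both the primary key and every reference keeps joins intact.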

Concept 2: Linked Generators

Linked generators ensure that related columns generate coherent data together. When multiple columns share a strong inter-dependency, linking them produces realistic combinations.

Why Linking Matters

WITHOUT LINKING                     WITH LINKING
───────────────                     ────────────
city: "Ankara"                      city: "Ankara"
state: "California"   ← Invalid!    state: null        ← Turkey has no states
country: "Japan"      ← Nonsense!   country: "Turkey"  ← Correct
postal: "90210"       ← Wrong!      postal: "06100"    ← Valid Ankara code
phone_prefix: "+44"   ← UK prefix!  phone_prefix: "+90" ← Turkey prefix
currency: "JPY"       ← Yen!        currency: "TRY"    ← Turkish Lira
lat: 35.6762                        lat: 39.9334       ← Actual Ankara
lng: 139.6503         ← Tokyo!      lng: 32.8597       ← Actual Ankara

PDL Syntax

json
{
  "generators": {
    "location": {
      "type": "linked",
      "columns": ["city", "state", "country", "postal_code", "phone_prefix", "currency"],
      "source": "lists/geo/locations.json"
    }
  },
  "entities": {
    "Address": {
      "fields": {
        "city": { "generator": "location.city" },
        "state": { "generator": "location.state" },
        "country": { "generator": "location.country" },
        "postal_code": { "generator": "location.postal_code" },
        "phone_prefix": { "generator": "location.phone_prefix" },
        "currency": { "generator": "location.currency" }
      }
    }
  }
}
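
One way to picture a linked generator is as a row picker: choose one source row, then emit every linked column from that same row. The Python sketch below assumes that model; the three rows stand in for `lists/geo/locations.json`, whose real structure is not shown in this document, and `generate_address` is a hypothetical name.

```python
import random

COLUMNS = ["city", "state", "country", "postal_code", "phone_prefix", "currency"]

# Hypothetical rows standing in for the linked source file.
LOCATIONS = [
    {"city": "Ankara", "state": None, "country": "Turkey",
     "postal_code": "06100", "phone_prefix": "+90", "currency": "TRY"},
    {"city": "Paris", "state": None, "country": "France",
     "postal_code": "75001", "phone_prefix": "+33", "currency": "EUR"},
    {"city": "Tokyo", "state": None, "country": "Japan",
     "postal_code": "100-0001", "phone_prefix": "+81", "currency": "JPY"},
]


def generate_address(rng):
    """Pick one source row and take every linked column from it."""
    row = rng.choice(LOCATIONS)
    return {column: row[column] for column in COLUMNS}


address = generate_address(random.Random(42))
```

Because no column is sampled independently, a mismatched combination such as an Ankara city with a Japanese currency cannot occur.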

Common Linking Patterns

Geographic Data

json
{
  "geo_location": {
    "type": "linked",
    "columns": ["city", "district", "postal_code", "latitude", "longitude"],
    "source": "lists/geo/tr_TR/locations.json"
  }
}

Personal Data

json
{
  "person": {
    "type": "linked",
    "columns": ["first_name", "gender", "title"],
    "rules": {
      "first_name.gender": "gender",
      "title": {
        "male": ["Bay", "Mr."],
        "female": ["Bayan", "Ms.", "Mrs."]
      }
    }
  }
}

Financial Data

json
{
  "financials": {
    "type": "linked",
    "columns": ["salary", "bonus", "tax", "net_income"],
    "rules": {
      "bonus": "salary * uniform(0.05, 0.20)",
      "tax": "(salary + bonus) * 0.25",
      "net_income": "salary + bonus - tax"
    }
  }
}

Time-Based Data

json
{
  "employment": {
    "type": "linked",
    "columns": ["birth_date", "hire_date", "age_at_hire"],
    "rules": {
      "hire_date": "birth_date + years(18-40)",
      "age_at_hire": "years_between(birth_date, hire_date)"
    }
  }
}

Concept 3: Statistical Generators

Statistical generators produce data that matches real-world distributions, not just random values.

Categorical Generator

Generates values maintaining the frequency distribution of the original data.

ORIGINAL DATA DISTRIBUTION          GENERATED DATA DISTRIBUTION
──────────────────────────          ───────────────────────────
status: completed  (70%)            status: completed  (70%)
status: pending    (20%)            status: pending    (20%)
status: cancelled  (10%)            status: cancelled  (10%)

json
{
  "generators": {
    "order_status": {
      "type": "statistical",
      "mode": "categorical",
      "source": "inline",
      "values": [
        { "value": "completed", "weight": 70 },
        { "value": "pending", "weight": 20 },
        { "value": "cancelled", "weight": 10 }
      ],
      "differential_privacy": {
        "enabled": true,
        "epsilon": 1.0
      }
    }
  }
}
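
The weighted selection itself can be sketched with Python's standard library. This example covers only the frequency-preserving part and leaves out the differential privacy noise; `sample_statuses` is a hypothetical helper name.

```python
import random
from collections import Counter

VALUES = ["completed", "pending", "cancelled"]
WEIGHTS = [70, 20, 10]


def sample_statuses(n, seed=0):
    """Draw n statuses so that long-run frequencies follow WEIGHTS."""
    rng = random.Random(seed)
    return rng.choices(VALUES, weights=WEIGHTS, k=n)


counts = Counter(sample_statuses(10_000))
shares = {value: counts[value] / 10_000 for value in VALUES}
```

With 10,000 draws the observed shares land close to 70/20/10, though never exactly on them.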

Continuous Generator

Generates numeric values following a statistical distribution.

json
{
  "generators": {
    "age": {
      "type": "statistical",
      "mode": "continuous",
      "distribution": "normal",
      "params": {
        "mean": 35,
        "stddev": 12
      },
      "constraints": {
        "min": 18,
        "max": 85
      }
    },
    "income": {
      "type": "statistical",
      "mode": "continuous",
      "distribution": "lognormal",
      "params": {
        "mu": 10.5,
        "sigma": 0.8
      }
    },
    "response_time": {
      "type": "statistical",
      "mode": "continuous",
      "distribution": "exponential",
      "params": {
        "lambda": 0.5
      }
    }
  }
}
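
The three generators above map onto standard-library samplers in Python. The sketch assumes that `constraints` are enforced by redrawing out-of-range values (rejection sampling); the document does not say whether Phony redraws or clamps. The function names are hypothetical.

```python
import random


def sample_age(rng, mean=35, stddev=12, low=18, high=85):
    """Normal draw; values outside [low, high] are rejected and redrawn."""
    while True:
        value = rng.gauss(mean, stddev)
        if low <= value <= high:
            return value


def sample_income(rng, mu=10.5, sigma=0.8):
    """Lognormal draw: exp of a normal with mean mu and stddev sigma."""
    return rng.lognormvariate(mu, sigma)


def sample_response_time(rng, lam=0.5):
    """Exponential draw with rate lam, so the mean is 1 / lam."""
    return rng.expovariate(lam)


rng = random.Random(7)
ages = [sample_age(rng) for _ in range(5000)]
incomes = [sample_income(rng) for _ in range(5000)]
response_times = [sample_response_time(rng) for _ in range(5000)]
```

Redrawing keeps the shape of the distribution inside the allowed range, whereas clamping would pile values up at the bounds.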

Supported Distributions

Distribution  Use Case                     Parameters
────────────  ────────                     ──────────
normal        Age, height, IQ              mean, stddev
lognormal     Income, prices               mu, sigma
exponential   Wait times, failure rates    lambda
uniform       Random selection             min, max
poisson       Event counts                 lambda
beta          Probabilities, percentages   alpha, beta
gamma         Wait times, rainfall         shape, scale

Algebraic Generator

Detects and preserves mathematical relationships between columns.

json
{
  "generators": {
    "order_total": {
      "type": "statistical",
      "mode": "algebraic",
      "columns": ["subtotal", "tax", "shipping", "discount", "total"],
      "relationship": "total = subtotal + tax + shipping - discount"
    }
  }
}

The generator:

  1. Identifies the algebraic relationship
  2. Generates values that satisfy the equation
  3. Maintains realistic distributions for each component
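
A simple way to satisfy such an equation is to draw the independent components first and derive the dependent one. The Python sketch below does that for the relationship above, working in integer cents so the equality is exact. The value ranges and the 18% tax rate are illustrative assumptions, and `generate_order_total` is a hypothetical name.

```python
import random


def generate_order_total(rng):
    """Draw each component in integer cents, then derive total from the equation."""
    subtotal = rng.randint(500, 50_000)
    tax = round(subtotal * 0.18)            # assumed tax rate
    shipping = rng.choice([0, 499, 999])
    discount = rng.randint(0, subtotal // 10)
    total = subtotal + tax + shipping - discount
    return {
        "subtotal": subtotal,
        "tax": tax,
        "shipping": shipping,
        "discount": discount,
        "total": total,
    }


rng = random.Random(1)
orders = [generate_order_total(rng) for _ in range(100)]
```

Using integers avoids floating-point rounding, which could otherwise make `total` differ from the sum of its parts by a fraction of a cent.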

Multivariate Generator (Correlated Data)

Preserves correlations between multiple numeric columns.

json
{
  "generators": {
    "real_estate": {
      "type": "statistical",
      "mode": "multivariate",
      "columns": ["price", "sqft", "bedrooms", "bathrooms", "lot_size"],
      "correlations": {
        "price-sqft": 0.85,
        "price-bedrooms": 0.65,
        "sqft-bedrooms": 0.70,
        "bedrooms-bathrooms": 0.80
      }
    }
  }
}

This ensures:

  • Larger houses have higher prices (positive correlation)
  • More bedrooms correlate with more bathrooms
  • Realistic property listings, not random combinations
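
For two columns, a target correlation can be produced by mixing two independent normal draws. The Python sketch below applies this to `sqft` and `price` with the 0.85 correlation from the example. The means and spreads are made-up values, the helper names are hypothetical, and a full multivariate generator would need a matrix method (for example a Cholesky factor) to handle all five columns at once.

```python
import math
import random


def correlated_pair(rng, rho):
    """Two standard normal values whose correlation is rho."""
    x = rng.gauss(0.0, 1.0)
    z = rng.gauss(0.0, 1.0)
    y = rho * x + math.sqrt(1.0 - rho * rho) * z
    return x, y


def generate_listings(n, rho=0.85, seed=0):
    """Scale the correlated pair into hypothetical sqft and price ranges."""
    rng = random.Random(seed)
    listings = []
    for _ in range(n):
        x, y = correlated_pair(rng, rho)
        listings.append((1800 + 600 * x, 350_000 + 120_000 * y))
    return listings


def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


listings = generate_listings(5000)
observed = pearson([s for s, _ in listings], [p for _, p in listings])
```

Shifting and scaling each column does not change the correlation, so the observed value stays near 0.85.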

Concept 4: Event Sequences

Event sequences generate chronologically valid date/time series where order matters.

The Problem

WITHOUT EVENT SEQUENCES                       WITH EVENT SEQUENCES
───────────────────────                       ────────────────────
order_date:    2024-03-15                     order_date:    2024-03-15
payment_date:  2024-03-10  ← Before order!    payment_date:  2024-03-15  ← Same day
ship_date:     2024-03-08  ← Before payment!  ship_date:     2024-03-17  ← 2 days later
delivery_date: 2024-03-20                     delivery_date: 2024-03-22  ← 5 days later

Result: Logically impossible                  Result: Realistic timeline

PDL Syntax

json
{
  "generators": {
    "order_timeline": {
      "type": "event_sequence",
      "events": [
        {
          "name": "created_at",
          "base": true,
          "range": { "start": "-1year", "end": "now" }
        },
        {
          "name": "paid_at",
          "after": "created_at",
          "delay": { "min": "0h", "max": "24h" },
          "probability": 0.95
        },
        {
          "name": "shipped_at",
          "after": "paid_at",
          "delay": { "min": "1d", "max": "3d" },
          "probability": 0.90
        },
        {
          "name": "delivered_at",
          "after": "shipped_at",
          "delay": { "min": "1d", "max": "7d" },
          "probability": 0.85
        },
        {
          "name": "reviewed_at",
          "after": "delivered_at",
          "delay": { "min": "1d", "max": "30d" },
          "probability": 0.30
        }
      ]
    }
  },
  "entities": {
    "Order": {
      "fields": {
        "created_at": { "generator": "order_timeline.created_at" },
        "paid_at": { "generator": "order_timeline.paid_at" },
        "shipped_at": { "generator": "order_timeline.shipped_at" },
        "delivered_at": { "generator": "order_timeline.delivered_at" },
        "reviewed_at": { "generator": "order_timeline.reviewed_at" }
      }
    }
  }
}

Event Sequence Features

Feature       Description
───────       ───────────
base          The anchor event, generated first
after         This event occurs after the specified event
delay         Time range between events
probability   Chance this event occurs (nullable if < 1.0)
distribution  Delay distribution (uniform, exponential, etc.)
condition     Only generate if condition is met
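
The `base`, `after`, `delay`, and `probability` features can be sketched in Python for the order timeline. The sketch assumes uniform delays and assumes that an event is null whenever the event it follows is null; neither detail is confirmed by this document. `EVENTS` and `generate_timeline` are hypothetical names.

```python
import random
from datetime import datetime, timedelta

# (name, after, min delay, max delay, probability), mirroring the PDL example.
EVENTS = [
    ("paid_at", "created_at", timedelta(hours=0), timedelta(hours=24), 0.95),
    ("shipped_at", "paid_at", timedelta(days=1), timedelta(days=3), 0.90),
    ("delivered_at", "shipped_at", timedelta(days=1), timedelta(days=7), 0.85),
    ("reviewed_at", "delivered_at", timedelta(days=1), timedelta(days=30), 0.30),
]


def generate_timeline(rng, now):
    """Generate the base event, then each later event relative to its predecessor."""
    one_year = 365 * 24 * 3600
    timeline = {"created_at": now - timedelta(seconds=rng.uniform(0, one_year))}
    for name, after, min_delay, max_delay, probability in EVENTS:
        previous = timeline[after]
        if previous is None or rng.random() >= probability:
            timeline[name] = None  # event did not occur, so later ones cannot
            continue
        span = (max_delay - min_delay).total_seconds()
        timeline[name] = previous + min_delay + timedelta(seconds=rng.uniform(0, span))
    return timeline


timeline = generate_timeline(random.Random(3), datetime(2024, 3, 15))
```

Each timestamp is computed as an offset from its predecessor, so an out-of-order timeline is impossible by construction.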

Complex Event Patterns

json
{
  "generators": {
    "subscription_lifecycle": {
      "type": "event_sequence",
      "events": [
        { "name": "signup_at", "base": true },
        { "name": "trial_start", "after": "signup_at", "delay": "0d" },
        { "name": "trial_end", "after": "trial_start", "delay": "14d" },
        {
          "name": "converted_at",
          "after": "trial_end",
          "delay": { "min": "0d", "max": "7d" },
          "probability": 0.25
        },
        {
          "name": "churned_at",
          "after": "trial_end",
          "delay": { "min": "0d", "max": "30d" },
          "probability": 0.75,
          "condition": "converted_at IS NULL"
        },
        {
          "name": "renewed_at",
          "after": "converted_at",
          "delay": "30d",
          "probability": 0.85,
          "condition": "converted_at IS NOT NULL"
        }
      ]
    }
  }
}

Concept 5: Cross-Table Relationships

Generate values that correctly aggregate across related tables.

Cross Table Sum

json
{
  "entities": {
    "Store": {
      "fields": {
        "id": { "type": "logic", "algorithm": "uuid_v7", "primary_key": true },
        "name": { "generator": "store_name" },
        "total_sales": {
          "type": "cross_table",
          "operation": "sum",
          "from": "Transaction",
          "field": "amount",
          "where": "Transaction.store_id = Store.id"
        }
      }
    },
    "Transaction": {
      "fields": {
        "id": { "type": "logic", "algorithm": "uuid_v7", "primary_key": true },
        "store_id": { "ref": "Store.id" },
        "amount": { "generator": "price" }
      }
    }
  }
}

This ensures Store.total_sales actually equals the sum of all transactions for that store.
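
One way to get aggregates that always match is to generate the child rows first and compute the parent field from them. The Python sketch below assumes that ordering and uses integer IDs and integer cents for brevity; `generate_stores_and_transactions` is a hypothetical name.

```python
import random
from collections import defaultdict


def generate_stores_and_transactions(rng, store_count=3, transaction_count=20):
    """Create transactions first, then derive each store's total from them."""
    transactions = [
        {
            "id": i,
            "store_id": rng.randrange(store_count),
            "amount_cents": rng.randint(100, 20_000),
        }
        for i in range(transaction_count)
    ]
    totals = defaultdict(int)
    for tx in transactions:
        totals[tx["store_id"]] += tx["amount_cents"]
    stores = [{"id": s, "total_sales_cents": totals[s]} for s in range(store_count)]
    return stores, transactions


stores, transactions = generate_stores_and_transactions(random.Random(5))
```

A store with no transactions gets a total of 0 rather than a random figure.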

Cross Table Operations

Operation  Description                Example
─────────  ───────────                ───────
sum        Sum of related values      Total order amount
count      Count of related rows      Number of orders
avg        Average of related values  Average rating
min        Minimum related value      First order date
max        Maximum related value      Last login date

Example: Order Totals

json
{
  "entities": {
    "Order": {
      "fields": {
        "id": { "type": "logic", "algorithm": "uuid_v7", "primary_key": true },
        "item_count": {
          "type": "cross_table",
          "operation": "count",
          "from": "OrderItem",
          "where": "OrderItem.order_id = Order.id"
        },
        "subtotal": {
          "type": "cross_table",
          "operation": "sum",
          "from": "OrderItem",
          "field": "line_total",
          "where": "OrderItem.order_id = Order.id"
        },
        "tax": {
          "computed": "subtotal * 0.18"
        },
        "total": {
          "computed": "subtotal + tax"
        }
      }
    },
    "OrderItem": {
      "fields": {
        "order_id": { "ref": "Order.id" },
        "quantity": { "generator": "quantity" },
        "unit_price": { "generator": "price" },
        "line_total": { "computed": "quantity * unit_price" }
      }
    }
  }
}

Concept 6: Differential Privacy

Differential privacy provides a mathematical guarantee that limits how much any single individual's record can influence the generated data, which in turn bounds what an attacker can learn about that individual.

What is Differential Privacy?

┌─────────────────────────────────────────────────────────────────────────┐
│           DIFFERENTIAL PRIVACY                                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  GUARANTEE: Adding or removing ANY single individual from the           │
│  dataset does not significantly change the output distribution.         │
│                                                                          │
│  RESULT: Even with auxiliary information, an attacker's ability to      │
│  tell whether a specific person was in the dataset is bounded by ε.     │
│                                                                          │
│  CONTROLLED BY: Epsilon (ε) - the privacy budget                        │
│  • ε = 0.1  → Very private, more noise, less utility                   │
│  • ε = 1.0  → Balanced privacy and utility                             │
│  • ε = 10.0 → Less private, less noise, more utility                   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

PDL Syntax

json
{
  "generators": {
    "salary": {
      "type": "statistical",
      "mode": "continuous",
      "distribution": "normal",
      "params": { "mean": 75000, "stddev": 25000 },
      "differential_privacy": {
        "enabled": true,
        "epsilon": 1.0,
        "mechanism": "laplace"
      }
    },
    "age_group": {
      "type": "statistical",
      "mode": "categorical",
      "values": ["18-25", "26-35", "36-45", "46-55", "56+"],
      "differential_privacy": {
        "enabled": true,
        "epsilon": 0.5,
        "mechanism": "exponential"
      }
    }
  }
}

Privacy Mechanisms

Mechanism    Best For          How It Works
─────────    ────────          ────────────
laplace      Numeric data      Adds Laplace-distributed noise
gaussian     Numeric data      Adds Gaussian-distributed noise
exponential  Categorical data  Randomizes selection with privacy guarantees
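
The link between epsilon and noise is easiest to see in the Laplace mechanism, where the noise scale is sensitivity / epsilon. The Python sketch below applies it to a count query, where one person changes the result by at most 1. How Phony applies the mechanism inside its generators is not described here; `laplace_noise` and `private_count` are hypothetical names.

```python
import random


def laplace_noise(rng, scale):
    """Laplace(0, scale) sample: the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(rng, true_count, epsilon, sensitivity=1.0):
    """Release a count with noise scaled to sensitivity / epsilon."""
    return true_count + laplace_noise(rng, sensitivity / epsilon)


rng = random.Random(11)
# Smaller epsilon means a larger noise scale and a less accurate release.
very_private = [private_count(rng, 1000, epsilon=0.1) for _ in range(5000)]
less_private = [private_count(rng, 1000, epsilon=10.0) for _ in range(5000)]
```

At epsilon 0.1 the released counts are typically off by about 10, while at epsilon 10.0 they are off by about 0.1, which matches the trade-off listed in the box above.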

Compliance Benefits

Regulation  Differential Privacy Benefit
──────────  ────────────────────────────
GDPR        Helps support anonymization claims
HIPAA       Can support de-identification under Expert Determination
CCPA        Lowers the risk that data can be re-identified

Concept 7: Geo-Aware Generation

Generate geographic data with built-in privacy and validity.

Coordinate Fuzzing

json
{
  "generators": {
    "location": {
      "type": "geo",
      "mode": "coordinates",
      "source": "original",
      "privacy": {
        "method": "k_anonymity",
        "k": 5,
        "radius_km": 1.0
      }
    }
  }
}

The generator:

  1. Takes original lat/long coordinates
  2. Finds regions with at least k other points
  3. Moves the point within that region
  4. Adds further random fuzzing within radius_km
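
The final fuzzing step can be sketched in Python as a uniform random offset inside a circle. The sketch covers only that step, not the k-anonymity region search, and uses a flat-earth approximation that is adequate for radii of a few kilometres. `fuzz_coordinates` and `haversine_km` are hypothetical helper names.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0


def fuzz_coordinates(rng, lat, lng, radius_km):
    """Move a point by a uniformly distributed offset within radius_km."""
    distance = radius_km * math.sqrt(rng.random())  # sqrt gives uniform area density
    bearing = rng.uniform(0.0, 2.0 * math.pi)
    d_lat = distance * math.cos(bearing) / EARTH_RADIUS_KM
    d_lng = distance * math.sin(bearing) / (EARTH_RADIUS_KM * math.cos(math.radians(lat)))
    return lat + math.degrees(d_lat), lng + math.degrees(d_lng)


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    d_phi = p2 - p1
    d_lambda = math.radians(lng2 - lng1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


rng = random.Random(9)
ankara = (39.9334, 32.8597)
fuzzed = [fuzz_coordinates(rng, *ankara, radius_km=1.0) for _ in range(1000)]
```

Taking the square root of the random draw prevents points from clustering near the original location, which would make the true position easier to guess.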

Valid Location Generation

json
{
  "generators": {
    "turkish_address": {
      "type": "geo",
      "mode": "address",
      "locale": "tr_TR",
      "constraints": {
        "country": "TR",
        "valid_postal": true,
        "valid_coordinates": true
      },
      "components": {
        "city": { "weight_by": "population" },
        "district": { "within": "city" },
        "postal_code": { "within": "district" },
        "coordinates": { "within": "postal_code" }
      }
    }
  }
}

HIPAA-Compliant Address Generation

For healthcare data, special rules apply:

json
{
  "generators": {
    "hipaa_address": {
      "type": "geo",
      "mode": "hipaa_safe_harbor",
      "rules": {
        "zip_truncation": true,
        "small_population_generalization": true,
        "coordinates": "disabled"
      }
    }
  }
}

HIPAA Safe Harbor requirements:

  • Truncate ZIP codes to the first 3 digits
  • If the area covered by that 3-digit ZIP contains 20,000 or fewer people, use "000"
  • Remove street address, city, and county; keep only the state
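
The ZIP truncation rule can be sketched in Python as follows. The sketch treats 20,000 or fewer people as the cutoff, following the Safe Harbor wording. The population figures are invented for illustration (real ones come from Census data), and `safe_harbor_zip` is a hypothetical name.

```python
# Hypothetical population per 3-digit ZIP prefix; not real Census figures.
ZIP3_POPULATION = {"902": 450_000, "036": 12_000}


def safe_harbor_zip(zip_code, populations=ZIP3_POPULATION, threshold=20_000):
    """Keep the first 3 digits only when the area is populous enough."""
    prefix = zip_code[:3]
    if populations.get(prefix, 0) <= threshold:
        return "000"  # sparse or unknown areas are generalized
    return prefix


beverly_hills = safe_harbor_zip("90210")
sparse_area = safe_harbor_zip("03601")
```

Unknown prefixes fall back to "000" here, which errs on the side of generalizing too much rather than too little.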

Concept 8: Format-Preserving Transformation (Scramble)

Transform data while preserving its format, useful for masking sensitive data.

Character Scramble

json
{
  "generators": {
    "masked_email": {
      "type": "scramble",
      "mode": "character",
      "preserve": ["@", "."],
      "rules": {
        "letters": "random_letter",
        "digits": "random_digit"
      }
    }
  }
}

Input                 Output
─────                 ──────
john.doe@example.com  xkpr.qwm@hdnvbzq.trm
jane_123@test.org     yznq_847@pqrs.vwx
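
A character scramble of this kind can be sketched in Python as a per-character substitution. The sketch assumes that characters which are neither letters nor digits (such as `_`) pass through unchanged, as the second example row suggests. `scramble` is a hypothetical name.

```python
import random
import string


def scramble(text, rng, preserve=("@", ".")):
    """Replace letters with random letters and digits with random digits."""
    out = []
    for ch in text:
        if ch in preserve:
            out.append(ch)
        elif ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            letters = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(letters))
        else:
            out.append(ch)  # other punctuation is left as-is
    return "".join(out)


masked = scramble("jane_123@test.org", random.Random(0))
```

The output keeps the length, the separators, and the letter/digit layout of the input, so format validators still accept it.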

Phone Number Scramble

json
{
  "generators": {
    "masked_phone": {
      "type": "scramble",
      "mode": "pattern",
      "preserve_format": true,
      "preserve_country_code": true
    }
  }
}

Input              Output
─────              ──────
+90 532 123 45 67  +90 847 956 23 18
+1 (555) 123-4567  +1 (555) 847-9382

Credit Card Scramble (Luhn-Valid)

json
{
  "generators": {
    "masked_credit_card": {
      "type": "scramble",
      "mode": "credit_card",
      "preserve_bin": true,
      "luhn_valid": true
    }
  }
}

Input                Output
─────                ──────
4532-1234-5678-9012  4532-1284-7239-1489

The first 6 digits (BIN) are preserved, the remaining digits are scrambled, and the result passes the Luhn check.
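
A Luhn-valid scramble can be sketched in Python by randomizing the middle digits and then computing the final check digit. This is an illustration of the technique, not Phony's code; `luhn_check_digit`, `luhn_valid`, and `scramble_card` are hypothetical names.

```python
import random


def luhn_check_digit(payload):
    """Return the digit that makes payload + digit pass the Luhn check."""
    total = 0
    for index, ch in enumerate(reversed(payload)):
        digit = int(ch)
        if index % 2 == 0:  # these positions are doubled once the check digit is appended
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return str((10 - total % 10) % 10)


def luhn_valid(number):
    """Check a card number, ignoring separators such as dashes."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def scramble_card(rng, card):
    """Keep the 6-digit BIN, randomize the rest, and fix up the check digit."""
    digits = [ch for ch in card if ch.isdigit()]
    middle = [str(rng.randrange(10)) for _ in digits[6:-1]]
    payload = "".join(digits[:6] + middle)
    masked = iter(payload + luhn_check_digit(payload))
    # Re-insert the original separators at their original positions.
    return "".join(next(masked) if ch.isdigit() else ch for ch in card)


masked_card = scramble_card(random.Random(5), "4532-1234-5678-9012")
```

Only the last digit is constrained by the checksum, so the middle digits can be fully random without a retry loop.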


Concept 9: Structured Data Masks

Transform data within structured formats (JSON, XML, CSV, HTML).

JSON Mask

json
{
  "generators": {
    "masked_json": {
      "type": "mask",
      "format": "json",
      "paths": {
        "$.user.email": { "generator": "masked_email" },
        "$.user.phone": { "generator": "masked_phone" },
        "$.user.ssn": { "generator": "masked_ssn" },
        "$.payments[*].card_number": { "generator": "masked_credit_card" }
      }
    }
  }
}

Input:

json
{
  "user": {
    "name": "John Doe",
    "email": "john@example.com",
    "phone": "+1-555-123-4567"
  },
  "payments": [
    { "card_number": "4532-1234-5678-9012" }
  ]
}

Output:

json
{
  "user": {
    "name": "John Doe",
    "email": "xkpr@hdnvbzq.trm",
    "phone": "+1-555-847-9382"
  },
  "payments": [
    { "card_number": "4532-1284-7239-1489" }
  ]
}

XML Mask

json
{
  "generators": {
    "masked_xml": {
      "type": "mask",
      "format": "xml",
      "paths": {
        "//customer/email": { "generator": "masked_email" },
        "//customer/ssn": { "generator": "masked_ssn" },
        "//payment/@card-number": { "generator": "masked_credit_card" }
      }
    }
  }
}

Regex Mask

For custom patterns:

json
{
  "generators": {
    "order_reference": {
      "type": "mask",
      "format": "regex",
      "pattern": "^(ORD-)(\\d{8})(-)(\\w{4})$",
      "groups": {
        "1": "passthrough",
        "2": { "generator": "random_digits", "length": 8 },
        "3": "passthrough",
        "4": { "generator": "random_alphanumeric", "length": 4 }
      }
    }
  }
}

Input              Output
─────              ──────
ORD-20240315-AB12  ORD-84729163-XK47
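
The group-by-group behavior can be sketched with Python's `re` module. The sketch assumes that values which do not match the pattern pass through unchanged and that the alphanumeric group is filled from uppercase letters and digits; `mask_order_reference` is a hypothetical name.

```python
import random
import re
import string

PATTERN = re.compile(r"^(ORD-)(\d{8})(-)(\w{4})$")


def mask_order_reference(rng, value):
    """Keep groups 1 and 3, regenerate groups 2 and 4."""
    match = PATTERN.match(value)
    if match is None:
        return value  # assumption: non-matching input is passed through
    digits = "".join(rng.choice(string.digits) for _ in range(8))
    alphabet = string.ascii_uppercase + string.digits
    suffix = "".join(rng.choice(alphabet) for _ in range(4))
    return f"{match.group(1)}{digits}{match.group(3)}{suffix}"


masked_reference = mask_order_reference(random.Random(2), "ORD-20240315-AB12")
```

The masked value still matches the original pattern, so downstream parsers that expect this reference format keep working.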

Comparison: Basic vs Advanced

Aspect                 Basic Faker               Phony Advanced
──────                 ───────────               ──────────────
City + Country         Random, may not match     Linked, always valid
Date sequences         Random, may be illogical  Event sequences, always chronological
Numeric distributions  Uniform random            Statistical, matches real patterns
Cross-table totals     Don't match               Computed, always correct
Privacy                None                      Differential privacy, k-anonymity
Format preservation    Not supported             Scramble with format intact
Structured data        Not supported             JSON/XML/Regex masks

Next Steps

Phony Cloud Platform Specification