Advanced Data Generation Concepts
This document covers advanced concepts for generating realistic, privacy-preserving, and statistically accurate synthetic data. These features differentiate Phony from simple faker libraries.
Overview: Beyond Random Data
┌─────────────────────────────────────────────────────────────────────────┐
│                    ADVANCED DATA GENERATION CONCEPTS                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  BASIC FAKER                      PHONY ADVANCED                        │
│  ───────────                      ──────────────                        │
│  Random city: "Paris"             Linked: Paris + France + EUR + +33    │
│  Random date: 2024-03-15          Event sequence: order < ship < deliver│
│  Random number: 42                Statistical: follows real distribution│
│  Random text: "Lorem ipsum"       Privacy-preserving: differential priv │
│                                                                         │
│  Result: Unrealistic data         Result: Production-like data          │
│  Joins fail, logic breaks         Joins work, logic preserved           │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
Concept 1: Consistency
Consistency ensures that the same input always produces the same output across your entire dataset, even across different tables or databases.
Why Consistency Matters
WITHOUT CONSISTENCY                     WITH CONSISTENCY
───────────────────                     ────────────────
Table: users                            Table: users
  company: "Acme Corp"                    company: "Sunrise Ltd"
Table: invoices                         Table: invoices
  company: "Beta Inc"  ← Different!       company: "Sunrise Ltd"  ← Same!
Result: JOIN fails                      Result: JOIN works perfectly
Use Cases
| Use Case | Without Consistency | With Consistency |
|---|---|---|
| Joins | Random values break FK relationships | Same company name across tables |
| Cardinality | Loses distribution patterns | Preserves approximate cardinality |
| Deduplication | Same person appears with different names | Consistent identity across records |
| Testing | Different output each run | Reproducible test data |
PDL Syntax
{
"generators": {
"company_name": {
"type": "model",
"source": "models/company_names.ngram",
"generation": { "mode": "word" },
"consistency": {
"enabled": true,
"key": "company_id"
}
}
}
}
With consistency enabled:
- Input company_id: 123 always generates "Sunrise Ltd"
- Input company_id: 456 always generates "Northern Tech"
- Same company_id in any table → same company name
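One way to obtain this behavior is to derive the output deterministically from the consistency key, for example with a keyed hash. The following is a minimal sketch of that idea, not Phony's actual implementation; `consistent_pick`, the secret, and the candidate list are hypothetical names introduced for illustration.

```python
import hashlib
import hmac

# Hypothetical candidate pool; a real generator would draw from a trained model.
COMPANY_NAMES = ["Sunrise Ltd", "Northern Tech", "Blue Harbor", "Atlas Works"]

def consistent_pick(key: str, candidates: list[str], secret: bytes = b"demo-secret") -> str:
    """Map a consistency key to one candidate, deterministically."""
    # The keyed hash depends only on the key and the secret, never on the table.
    digest = hmac.new(secret, key.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(candidates)
    return candidates[index]

# The same key yields the same name no matter which table asks for it.
users_company = consistent_pick("company_id:123", COMPANY_NAMES)
invoices_company = consistent_pick("company_id:123", COMPANY_NAMES)
```

Because the result depends only on the key and the secret, repeated runs are reproducible as long as both stay fixed.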
Consistency Keys
{
"generators": {
"user_email": {
"type": "template",
"pattern": "{{lowercase(first_name)}}.{{lowercase(last_name)}}@example.com",
"consistency": {
"enabled": true,
"key": "user_id",
"scope": "global"
}
}
}
}
| Scope | Behavior |
|---|---|
| table | Same key → same value within this table |
| entity | Same key → same value within this entity type |
| global | Same key → same value across entire dataset |
| database | Same key → same value across all databases in sync |
Primary Key Consistency
When applied to primary keys, consistency is automatic:
{
"entities": {
"User": {
"fields": {
"id": {
"type": "logic",
"algorithm": "uuid_v7",
"primary_key": true
}
}
},
"Order": {
"fields": {
"user_id": {
"ref": "User.id"
}
}
}
}
}
The system automatically:
- Uses format-preserving encryption (FPE) for primary keys
- Applies same transformation to all foreign key references
- Maintains referential integrity across tables
Concept 2: Linked Generators
Linked generators ensure that related columns generate coherent data together. When multiple columns share a strong inter-dependency, linking them produces realistic combinations.
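One way to picture linking: each generated record draws a single row from the source list and reads every linked column from that same row. The sketch below assumes that model; the `LOCATIONS` rows and `generate_address` are hypothetical stand-ins, not the contents of any Phony source file.

```python
import random

# Hypothetical rows standing in for a linked source list.
LOCATIONS = [
    {"city": "Ankara", "state": None, "country": "Turkey",
     "postal_code": "06100", "phone_prefix": "+90", "currency": "TRY"},
    {"city": "Paris", "state": None, "country": "France",
     "postal_code": "75001", "phone_prefix": "+33", "currency": "EUR"},
]

def generate_address(rng: random.Random) -> dict:
    """Pick one source row and take every linked column from it."""
    row = rng.choice(LOCATIONS)
    return dict(row)  # copy, so callers cannot mutate the source list

address = generate_address(random.Random(7))
```

Since all columns come from one row, a city can never be paired with another country's currency or phone prefix.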
Why Linking Matters
WITHOUT LINKING                       WITH LINKING
───────────────                       ────────────
city: "Ankara"                        city: "Ankara"
state: "California"  ← Invalid!       state: null          ← Turkey has no states
country: "Japan"     ← Nonsense!      country: "Turkey"    ← Correct
postal: "90210"      ← Wrong!         postal: "06100"      ← Valid Ankara code
phone_prefix: "+44"  ← UK prefix!     phone_prefix: "+90"  ← Turkey prefix
currency: "JPY"      ← Yen!           currency: "TRY"      ← Turkish Lira
lat: 35.6762                          lat: 39.9334         ← Actual Ankara
lng: 139.6503        ← Tokyo!         lng: 32.8597         ← Actual Ankara
PDL Syntax
{
"generators": {
"location": {
"type": "linked",
"columns": ["city", "state", "country", "postal_code", "phone_prefix", "currency"],
"source": "lists/geo/locations.json"
}
},
"entities": {
"Address": {
"fields": {
"city": { "generator": "location.city" },
"state": { "generator": "location.state" },
"country": { "generator": "location.country" },
"postal_code": { "generator": "location.postal_code" },
"phone_prefix": { "generator": "location.phone_prefix" },
"currency": { "generator": "location.currency" }
}
}
}
}
Common Linking Patterns
Geographic Data
{
"geo_location": {
"type": "linked",
"columns": ["city", "district", "postal_code", "latitude", "longitude"],
"source": "lists/geo/tr_TR/locations.json"
}
}
Personal Data
{
"person": {
"type": "linked",
"columns": ["first_name", "gender", "title"],
"rules": {
"first_name.gender": "gender",
"title": {
"male": ["Bay", "Mr."],
"female": ["Bayan", "Ms.", "Mrs."]
}
}
}
}
Financial Data
{
"financials": {
"type": "linked",
"columns": ["salary", "bonus", "tax", "net_income"],
"rules": {
"bonus": "salary * uniform(0.05, 0.20)",
"tax": "(salary + bonus) * 0.25",
"net_income": "salary + bonus - tax"
}
}
}
Time-Based Data
{
"employment": {
"type": "linked",
"columns": ["birth_date", "hire_date", "age_at_hire"],
"rules": {
"hire_date": "birth_date + years(18-40)",
"age_at_hire": "years_between(birth_date, hire_date)"
}
}
}
Concept 3: Statistical Generators
Statistical generators produce data that matches real-world distributions, not just random values.
Categorical Generator
Generates values maintaining the frequency distribution of the original data.
ORIGINAL DATA DISTRIBUTION        GENERATED DATA DISTRIBUTION
──────────────────────────        ───────────────────────────
status: completed (70%)           status: completed (70%)
status: pending   (20%)           status: pending   (20%)
status: cancelled (10%)           status: cancelled (10%)
{
"generators": {
"order_status": {
"type": "statistical",
"mode": "categorical",
"source": "inline",
"values": [
{ "value": "completed", "weight": 70 },
{ "value": "pending", "weight": 20 },
{ "value": "cancelled", "weight": 10 }
],
"differential_privacy": {
"enabled": true,
"epsilon": 1.0
}
}
}
}
Continuous Generator
Generates numeric values following a statistical distribution.
{
"generators": {
"age": {
"type": "statistical",
"mode": "continuous",
"distribution": "normal",
"params": {
"mean": 35,
"stddev": 12
},
"constraints": {
"min": 18,
"max": 85
}
},
"income": {
"type": "statistical",
"mode": "continuous",
"distribution": "lognormal",
"params": {
"mu": 10.5,
"sigma": 0.8
}
},
"response_time": {
"type": "statistical",
"mode": "continuous",
"distribution": "exponential",
"params": {
"lambda": 0.5
}
}
}
}
Supported Distributions
| Distribution | Use Case | Parameters |
|---|---|---|
| normal | Age, height, IQ | mean, stddev |
| lognormal | Income, prices | mu, sigma |
| exponential | Wait times, failure rates | lambda |
| uniform | Random selection | min, max |
| poisson | Event counts | lambda |
| beta | Probabilities, percentages | alpha, beta |
| gamma | Wait times, rainfall | shape, scale |
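To make the parameters concrete, here is a minimal sketch that samples the three distributions from the example with Python's standard library. It assumes that min/max constraints are enforced by redrawing out-of-range values (rejection sampling); the document does not state how Phony applies constraints, so treat that choice and the helper name `sample_truncated_normal` as assumptions.

```python
import random

def sample_truncated_normal(rng, mean, stddev, lo, hi):
    """Draw from a normal distribution, redrawing until the value is in [lo, hi]."""
    while True:
        value = rng.gauss(mean, stddev)
        if lo <= value <= hi:
            return value

rng = random.Random(42)

# age: normal(mean=35, stddev=12) constrained to [18, 85]
ages = [round(sample_truncated_normal(rng, 35, 12, 18, 85)) for _ in range(1000)]

# income: lognormal(mu=10.5, sigma=0.8)
incomes = [rng.lognormvariate(10.5, 0.8) for _ in range(1000)]

# response_time: exponential(lambda=0.5), so the mean is 1 / 0.5 = 2
response_times = [rng.expovariate(0.5) for _ in range(1000)]
```

Redrawing keeps the bell shape inside the bounds, whereas clamping would pile up values at exactly 18 and 85.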
Algebraic Generator
Detects and preserves mathematical relationships between columns.
{
"generators": {
"order_total": {
"type": "statistical",
"mode": "algebraic",
"columns": ["subtotal", "tax", "shipping", "discount", "total"],
"relationship": "total = subtotal + tax + shipping - discount"
}
}
}
The generator:
- Identifies the algebraic relationship
- Generates values that satisfy the equation
- Maintains realistic distributions for each component
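A simple way to satisfy such an equation is to generate the independent components first and derive the dependent column last. The sketch below does that in integer cents to avoid floating-point drift; the value ranges, the 18% tax rate, and `generate_order_totals` are illustrative assumptions rather than Phony behavior.

```python
import random

def generate_order_totals(rng: random.Random) -> dict:
    """Generate components in cents, then derive total from the relationship."""
    subtotal = rng.randint(1_000, 50_000)    # 10.00 to 500.00
    tax = subtotal * 18 // 100               # assumed flat 18% rate
    shipping = rng.randint(0, 2_000)
    discount = rng.randint(0, subtotal // 5) # at most 20% off
    # total = subtotal + tax + shipping - discount
    total = subtotal + tax + shipping - discount
    return {"subtotal": subtotal, "tax": tax, "shipping": shipping,
            "discount": discount, "total": total}

rng = random.Random(0)
orders = [generate_order_totals(rng) for _ in range(100)]
```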
Multivariate Generator (Correlated Data)
Preserves correlations between multiple numeric columns.
{
"generators": {
"real_estate": {
"type": "statistical",
"mode": "multivariate",
"columns": ["price", "sqft", "bedrooms", "bathrooms", "lot_size"],
"correlations": {
"price-sqft": 0.85,
"price-bedrooms": 0.65,
"sqft-bedrooms": 0.70,
"bedrooms-bathrooms": 0.80
}
}
}
}
This ensures:
- Larger houses have higher prices (positive correlation)
- More bedrooms correlate with more bathrooms
- Realistic property listings, not random combinations
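For two columns, a target correlation can be produced by mixing two independent standard normal draws. The sketch below shows that textbook construction for a single pair with correlation 0.85; it is not a description of how Phony handles the full correlation matrix, and `correlated_pair` and `pearson` are hypothetical helpers.

```python
import math
import random

def correlated_pair(rng, rho):
    """Return two standard normal values whose correlation is rho."""
    x = rng.gauss(0, 1)
    z = rng.gauss(0, 1)
    y = rho * x + math.sqrt(1 - rho ** 2) * z
    return x, y

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var_x = sum((a - mx) ** 2 for a in xs)
    var_y = sum((b - my) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(1)
pairs = [correlated_pair(rng, 0.85) for _ in range(5000)]
sqft_z, price_z = zip(*pairs)   # standardized sqft and price
observed = pearson(sqft_z, price_z)
```

The standardized values can then be rescaled to realistic units, since shifting and scaling by positive constants leaves the correlation unchanged.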
Concept 4: Event Sequences
Event sequences generate chronologically valid date/time series where order matters.
The Problem
WITHOUT EVENT SEQUENCES                       WITH EVENT SEQUENCES
───────────────────────                       ────────────────────
order_date:    2024-03-15                     order_date:    2024-03-15
payment_date:  2024-03-10  ← Before order!    payment_date:  2024-03-15  ← Same day
ship_date:     2024-03-08  ← Before payment!  ship_date:     2024-03-17  ← 2 days later
delivery_date: 2024-03-20                     delivery_date: 2024-03-22  ← 5 days later
Result: Logically impossible                  Result: Realistic timeline
PDL Syntax
{
"generators": {
"order_timeline": {
"type": "event_sequence",
"events": [
{
"name": "created_at",
"base": true,
"range": { "start": "-1year", "end": "now" }
},
{
"name": "paid_at",
"after": "created_at",
"delay": { "min": "0h", "max": "24h" },
"probability": 0.95
},
{
"name": "shipped_at",
"after": "paid_at",
"delay": { "min": "1d", "max": "3d" },
"probability": 0.90
},
{
"name": "delivered_at",
"after": "shipped_at",
"delay": { "min": "1d", "max": "7d" },
"probability": 0.85
},
{
"name": "reviewed_at",
"after": "delivered_at",
"delay": { "min": "1d", "max": "30d" },
"probability": 0.30
}
]
}
},
"entities": {
"Order": {
"fields": {
"created_at": { "generator": "order_timeline.created_at" },
"paid_at": { "generator": "order_timeline.paid_at" },
"shipped_at": { "generator": "order_timeline.shipped_at" },
"delivered_at": { "generator": "order_timeline.delivered_at" },
"reviewed_at": { "generator": "order_timeline.reviewed_at" }
}
}
}
}
Event Sequence Features
| Feature | Description |
|---|---|
| base | The anchor event, generated first |
| after | This event occurs after the specified event |
| delay | Time range between events |
| probability | Chance this event occurs (nullable if < 1.0) |
| distribution | Delay distribution (uniform, exponential, etc.) |
| condition | Only generate if condition is met |
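The features above can be sketched as a loop that walks the events in order, adding a random delay to the predecessor's timestamp. This is a minimal illustration with uniform delays, and it assumes that a skipped event leaves every later event null; the document does not confirm that rule, and `generate_timeline` is a hypothetical name.

```python
import random
from datetime import datetime, timedelta

# (name, predecessor, min delay, max delay, probability), mirroring order_timeline.
EVENTS = [
    ("paid_at", "created_at", timedelta(hours=0), timedelta(hours=24), 0.95),
    ("shipped_at", "paid_at", timedelta(days=1), timedelta(days=3), 0.90),
    ("delivered_at", "shipped_at", timedelta(days=1), timedelta(days=7), 0.85),
    ("reviewed_at", "delivered_at", timedelta(days=1), timedelta(days=30), 0.30),
]
ORDER = ["created_at"] + [name for name, *_ in EVENTS]

def generate_timeline(rng, now):
    # Base event: uniformly within the last year.
    timeline = {"created_at": now - timedelta(seconds=rng.uniform(0, 365 * 86400))}
    for name, after, lo, hi, probability in EVENTS:
        previous = timeline[after]
        # Assumption: if the predecessor never happened, neither does this event.
        if previous is None or rng.random() >= probability:
            timeline[name] = None
            continue
        timeline[name] = previous + lo + (hi - lo) * rng.random()
    return timeline

rng = random.Random(3)
timelines = [generate_timeline(rng, datetime(2024, 3, 15)) for _ in range(200)]
```

Every non-null timestamp is built on top of its predecessor, so chronological order holds by construction rather than by checking after the fact.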
Complex Event Patterns
{
"generators": {
"subscription_lifecycle": {
"type": "event_sequence",
"events": [
{ "name": "signup_at", "base": true },
{ "name": "trial_start", "after": "signup_at", "delay": "0d" },
{ "name": "trial_end", "after": "trial_start", "delay": "14d" },
{
"name": "converted_at",
"after": "trial_end",
"delay": { "min": "0d", "max": "7d" },
"probability": 0.25
},
{
"name": "churned_at",
"after": "trial_end",
"delay": { "min": "0d", "max": "30d" },
"probability": 0.75,
"condition": "converted_at IS NULL"
},
{
"name": "renewed_at",
"after": "converted_at",
"delay": "30d",
"probability": 0.85,
"condition": "converted_at IS NOT NULL"
}
]
}
}
}
Concept 5: Cross-Table Relationships
Generate values that correctly aggregate across related tables.
Cross Table Sum
{
"entities": {
"Store": {
"fields": {
"id": { "type": "logic", "algorithm": "uuid_v7", "primary_key": true },
"name": { "generator": "store_name" },
"total_sales": {
"type": "cross_table",
"operation": "sum",
"from": "Transaction",
"field": "amount",
"where": "Transaction.store_id = Store.id"
}
}
},
"Transaction": {
"fields": {
"id": { "type": "logic", "algorithm": "uuid_v7", "primary_key": true },
"store_id": { "ref": "Store.id" },
"amount": { "generator": "price" }
}
}
}
}
This ensures that Store.total_sales equals the sum of all transaction amounts for that store.
Cross Table Operations
| Operation | Description | Example |
|---|---|---|
| sum | Sum of related values | Total order amount |
| count | Count of related rows | Number of orders |
| avg | Average of related values | Average rating |
| min | Minimum related value | First order date |
| max | Maximum related value | Last login date |
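These operations imply an ordering: child rows have to exist before the parent's aggregate field can be filled in. The sketch below assumes that order of generation and shows the sum case with plain dictionaries; the store IDs and field names are hypothetical.

```python
import random
from collections import defaultdict

rng = random.Random(5)
store_ids = ["store-1", "store-2", "store-3"]

# Step 1: generate the child rows (amounts in cents to keep sums exact).
transactions = [
    {"store_id": rng.choice(store_ids), "amount_cents": rng.randint(100, 10_000)}
    for _ in range(50)
]

# Step 2: aggregate per parent key, like SUM(amount) grouped by store_id.
totals = defaultdict(int)
for tx in transactions:
    totals[tx["store_id"]] += tx["amount_cents"]

# Step 3: fill the parent's cross-table field from the aggregate.
stores = [{"id": sid, "total_sales_cents": totals[sid]} for sid in store_ids]
```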
Example: Order Totals
{
"entities": {
"Order": {
"fields": {
"id": { "type": "logic", "algorithm": "uuid_v7", "primary_key": true },
"item_count": {
"type": "cross_table",
"operation": "count",
"from": "OrderItem",
"where": "OrderItem.order_id = Order.id"
},
"subtotal": {
"type": "cross_table",
"operation": "sum",
"from": "OrderItem",
"field": "line_total",
"where": "OrderItem.order_id = Order.id"
},
"tax": {
"computed": "subtotal * 0.18"
},
"total": {
"computed": "subtotal + tax"
}
}
},
"OrderItem": {
"fields": {
"order_id": { "ref": "Order.id" },
"quantity": { "generator": "quantity" },
"unit_price": { "generator": "price" },
"line_total": { "computed": "quantity * unit_price" }
}
}
}
}
Concept 6: Differential Privacy
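Differential privacy provides a mathematical guarantee that limits how much any single individual's record can influence the generated data, which in turn bounds what an attacker can learn about that individual from the output.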
What is Differential Privacy?
┌─────────────────────────────────────────────────────────────────────────┐
│ DIFFERENTIAL PRIVACY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ GUARANTEE: Adding or removing ANY single individual from the │
│ dataset does not significantly change the output distribution. │
│ │
│ RESULT: Even with auxiliary information, an attacker cannot │
│ determine if a specific person was in the original dataset. │
│ │
│ CONTROLLED BY: Epsilon (ε) - the privacy budget │
│ • ε = 0.1 → Very private, more noise, less utility │
│ • ε = 1.0 → Balanced privacy and utility │
│ • ε = 10.0 → Less private, less noise, more utility │
│ │
└─────────────────────────────────────────────────────────────────────────┘
PDL Syntax
{
"generators": {
"salary": {
"type": "statistical",
"mode": "continuous",
"distribution": "normal",
"params": { "mean": 75000, "stddev": 25000 },
"differential_privacy": {
"enabled": true,
"epsilon": 1.0,
"mechanism": "laplace"
}
},
"age_group": {
"type": "statistical",
"mode": "categorical",
"values": ["18-25", "26-35", "36-45", "46-55", "56+"],
"differential_privacy": {
"enabled": true,
"epsilon": 0.5,
"mechanism": "exponential"
}
}
}
}
Privacy Mechanisms
| Mechanism | Best For | How It Works |
|---|---|---|
| laplace | Numeric data | Adds Laplace-distributed noise |
| gaussian | Numeric data | Adds Gaussian-distributed noise |
| exponential | Categorical data | Randomizes selection with privacy guarantees |
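As an illustration of the Laplace mechanism, the sketch below adds noise with scale sensitivity / ε to a count. It uses the fact that the difference of two independent exponential draws with the same rate is Laplace-distributed. The function names and the sensitivity of 1 are assumptions for the example, not Phony's API.

```python
import random

def laplace_noise(rng, scale):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    return true_count + laplace_noise(rng, sensitivity / epsilon)

rng = random.Random(8)

def mean_abs_error(epsilon, trials=2000):
    return sum(abs(private_count(100, epsilon, rng) - 100) for _ in range(trials)) / trials

strict_error = mean_abs_error(0.1)   # small epsilon: more noise
loose_error = mean_abs_error(10.0)   # large epsilon: less noise
```

The expected absolute noise equals the scale, so moving ε from 0.1 to 10 shrinks the typical error by roughly a factor of 100.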
Compliance Benefits
| Regulation | Differential Privacy Benefit |
|---|---|
| GDPR | Supports anonymization requirements |
| HIPAA | Supports de-identification standards |
| CCPA | Reduces the risk of re-identification |
Concept 7: Geo-Aware Generation
Generate geographic data with built-in privacy and validity.
Coordinate Fuzzing
{
"generators": {
"location": {
"type": "geo",
"mode": "coordinates",
"source": "original",
"privacy": {
"method": "k_anonymity",
"k": 5,
"radius_km": 1.0
}
}
}
}
The generator:
- Takes original lat/long coordinates
- Finds regions with at least k other points
- Moves the point within that region
- Adds additional random fuzzing within radius
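The last step, random fuzzing within a radius, can be sketched on its own. The code below picks a uniformly random point inside a disc around the original coordinate, using a flat-earth approximation of about 111.32 km per degree of latitude. It does not implement the k-anonymity region search, and `fuzz_coordinates` is a hypothetical helper.

```python
import math
import random

KM_PER_DEGREE = 111.32  # approximate length of one degree of latitude

def fuzz_coordinates(rng, lat, lng, radius_km):
    """Move a point to a uniformly random location within radius_km."""
    # sqrt makes the point uniform over the disc's area, not clustered at the center.
    distance = radius_km * math.sqrt(rng.random())
    bearing = rng.uniform(0, 2 * math.pi)
    dlat = distance * math.cos(bearing) / KM_PER_DEGREE
    dlng = distance * math.sin(bearing) / (KM_PER_DEGREE * math.cos(math.radians(lat)))
    return lat + dlat, lng + dlng

def offset_km(lat, lng, new_lat, new_lng):
    """Distance between the points under the same flat-earth approximation."""
    dy = (new_lat - lat) * KM_PER_DEGREE
    dx = (new_lng - lng) * KM_PER_DEGREE * math.cos(math.radians(lat))
    return math.hypot(dx, dy)

rng = random.Random(4)
fuzzed = [fuzz_coordinates(rng, 39.9334, 32.8597, 1.0) for _ in range(500)]
```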
Valid Location Generation
{
"generators": {
"turkish_address": {
"type": "geo",
"mode": "address",
"locale": "tr_TR",
"constraints": {
"country": "TR",
"valid_postal": true,
"valid_coordinates": true
},
"components": {
"city": { "weight_by": "population" },
"district": { "within": "city" },
"postal_code": { "within": "district" },
"coordinates": { "within": "postal_code" }
}
}
}
}
HIPAA-Compliant Address Generation
For healthcare data, special rules apply:
{
"generators": {
"hipaa_address": {
"type": "geo",
"mode": "hipaa_safe_harbor",
"rules": {
"zip_truncation": true,
"small_population_generalization": true,
"coordinates": "disabled"
}
}
}
}
HIPAA Safe Harbor requirements:
- Truncate ZIP codes to the first 3 digits
- If the 3-digit ZIP area contains 20,000 or fewer people, use "000"
- Remove street address, city, and any other geographic unit smaller than a state
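The ZIP rule can be sketched as a small function. The population table here is hypothetical sample data; a real implementation would load counts per 3-digit ZIP area from Census data. The threshold follows the Safe Harbor rule of 20,000 or fewer people, and unknown prefixes are treated as small areas, which is a conservative assumption.

```python
# Hypothetical population counts keyed by 3-digit ZIP prefix.
ZIP3_POPULATION = {"902": 500_000, "036": 12_000}

def safe_harbor_zip(zip_code: str, populations: dict[str, int]) -> str:
    """Return the 3-digit ZIP prefix, or "000" for sparsely populated areas."""
    prefix = zip_code[:3]
    # Unknown prefixes default to 0 and are generalized to "000".
    if populations.get(prefix, 0) <= 20_000:
        return "000"
    return prefix

large_area = safe_harbor_zip("90210", ZIP3_POPULATION)
small_area = safe_harbor_zip("03601", ZIP3_POPULATION)
```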
Concept 8: Format-Preserving Transformation (Scramble)
Transform data while preserving its format, useful for masking sensitive data.
Character Scramble
{
"generators": {
"masked_email": {
"type": "scramble",
"mode": "character",
"preserve": ["@", "."],
"rules": {
"letters": "random_letter",
"digits": "random_digit"
}
}
}
}
| Input | Output |
|---|---|
| john.doe@example.com | xkpr.qwm@hdnvbzq.trm |
| jane_123@test.org | yznq_847@pqrs.vwx |
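A character scramble of this kind can be sketched as a per-character substitution: letters become random letters, digits become random digits, and everything else passes through. The function below is a hypothetical illustration, and passing through unlisted punctuation such as "_" is an assumption drawn from the example output.

```python
import random
import string

def scramble(text, rng, preserve=("@", ".")):
    """Replace letters and digits with random ones, keeping the overall shape."""
    out = []
    for ch in text:
        if ch in preserve:
            out.append(ch)
        elif ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)  # other punctuation passes through unchanged
    return "".join(out)

masked = scramble("jane_123@test.org", random.Random(11))
```

Unlike the consistency feature, this sketch is not keyed, so the same input maps to a different output whenever the random state differs.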
Phone Number Scramble
{
"generators": {
"masked_phone": {
"type": "scramble",
"mode": "pattern",
"preserve_format": true,
"preserve_country_code": true
}
}
}
| Input | Output |
|---|---|
| +90 532 123 45 67 | +90 847 956 23 18 |
| +1 (555) 123-4567 | +1 (555) 847-9382 |
Credit Card Scramble (Luhn-Valid)
{
"generators": {
"masked_credit_card": {
"type": "scramble",
"mode": "credit_card",
"preserve_bin": true,
"luhn_valid": true
}
}
}
| Input | Output |
|---|---|
| 4532-1234-5678-9012 | 4532-1247-2391-4856 |
First 6 digits (BIN) preserved, rest scrambled, Luhn checksum valid.
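A BIN-preserving, Luhn-valid scramble can be sketched in three steps: keep the first six digits, randomize the middle digits, and recompute the final check digit. The helpers below are hypothetical and only illustrate the technique.

```python
import random

def luhn_check_digit(payload: str) -> str:
    """Compute the digit that makes payload + digit pass the Luhn check."""
    total = 0
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:  # these positions are doubled once the check digit is appended
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return str((10 - total % 10) % 10)

def luhn_valid(number: str) -> bool:
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # every second digit from the right is doubled
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0

def scramble_card(card: str, rng: random.Random) -> str:
    digits = [c for c in card if c.isdigit()]
    middle = [str(rng.randint(0, 9)) for _ in digits[6:-1]]
    payload = "".join(digits[:6] + middle)             # BIN + random middle
    new_digits = iter(payload + luhn_check_digit(payload))
    # Re-insert separators at their original positions.
    return "".join(next(new_digits) if c.isdigit() else c for c in card)

masked_card = scramble_card("4532-1234-5678-9012", random.Random(2))
```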
Concept 9: Structured Data Masks
Transform data within structured formats (JSON, XML, CSV, HTML).
JSON Mask
{
"generators": {
"masked_json": {
"type": "mask",
"format": "json",
"paths": {
"$.user.email": { "generator": "masked_email" },
"$.user.phone": { "generator": "masked_phone" },
"$.user.ssn": { "generator": "masked_ssn" },
"$.payments[*].card_number": { "generator": "masked_credit_card" }
}
}
}
}
Input:
{
"user": {
"name": "John Doe",
"email": "john@example.com",
"phone": "+1-555-123-4567"
},
"payments": [
{ "card_number": "4532-1234-5678-9012" }
]
}
Output:
{
"user": {
"name": "John Doe",
"email": "xkpr@hdnvbzq.trm",
"phone": "+1-555-847-9382"
},
"payments": [
{ "card_number": "4532-1247-2391-4856" }
]
}
XML Mask
{
"generators": {
"masked_xml": {
"type": "mask",
"format": "xml",
"paths": {
"//customer/email": { "generator": "masked_email" },
"//customer/ssn": { "generator": "masked_ssn" },
"//payment/@card-number": { "generator": "masked_credit_card" }
}
}
}
}
Regex Mask
For custom patterns:
{
"generators": {
"order_reference": {
"type": "mask",
"format": "regex",
"pattern": "^(ORD-)(\\d{8})(-)(\\w{4})$",
"groups": {
"1": "passthrough",
"2": { "generator": "random_digits", "length": 8 },
"3": "passthrough",
"4": { "generator": "random_alphanumeric", "length": 4 }
}
}
}
}
| Input | Output |
|---|---|
| ORD-20240315-AB12 | ORD-84729163-XK47 |
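The group-by-group behavior can be sketched with Python's `re` module: groups 1 and 3 pass through, while groups 2 and 4 are regenerated. `mask_order_reference` is a hypothetical function, and returning non-matching input unchanged is an assumption that a production masker may want to replace with an error.

```python
import random
import re
import string

PATTERN = re.compile(r"^(ORD-)(\d{8})(-)(\w{4})$")

def mask_order_reference(value: str, rng: random.Random) -> str:
    match = PATTERN.match(value)
    if match is None:
        return value  # assumption: values that do not match are left alone
    prefix, _, separator, _ = match.groups()
    # Group 2: eight random digits. Group 4: four random uppercase alphanumerics.
    digits = "".join(rng.choice(string.digits) for _ in range(8))
    suffix = "".join(rng.choice(string.ascii_uppercase + string.digits) for _ in range(4))
    return prefix + digits + separator + suffix

masked_ref = mask_order_reference("ORD-20240315-AB12", random.Random(9))
```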
Comparison: Basic vs Advanced
| Aspect | Basic Faker | Phony Advanced |
|---|---|---|
| City + Country | Random, may not match | Linked, always valid |
| Date sequences | Random, may be illogical | Event sequences, always chronological |
| Numeric distributions | Uniform random | Statistical, matches real patterns |
| Cross-table totals | Don't match | Computed, always correct |
| Privacy | None | Differential privacy, k-anonymity |
| Format preservation | Not supported | Scramble with format intact |
| Structured data | Not supported | JSON/XML/Regex masks |
Next Steps
- Generator Types - Core generator types
- PDL Specification - Full schema language reference
- N-gram Models - Statistical text generation
- Execution Model - Runtime architecture