Data Generation Architecture
Scope: This section covers the core data generation system: the PDL schema language, generator types, N-gram models, and the portable .phony package format. For cloud platform features (sync, mock API, registry), see Cloud Platform Architecture.
Overview
Phony's data generation architecture is built on the principle of "Data Generation as Code": synthetic data generation is treated with the same rigor as infrastructure management.
┌─────────────────────────────────────────────────────────────────────────┐
│                      DATA GENERATION ARCHITECTURE                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│                               PDL Schema                                │
│                            (schema.pdl.json)                            │
│                                    │                                    │
│                ┌─────────┬─────────┼─────────┬─────────┐                │
│                ▼         ▼         ▼         ▼         ▼                │
│            ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐            │
│            │ Logic │ │ List  │ │ Model │ │Statis-│ │Linked │            │
│            │       │ │       │ │N-gram │ │tical  │ │       │            │
│            │UUIDs, │ │Codes, │ │Names, │ │Distri-│ │City+  │            │
│            │Numbers│ │Enums  │ │Text   │ │butions│ │Country│            │
│            └───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘            │
│                └─────────┴─────────┼─────────┴─────────┘                │
│                                    ▼                                    │
│                ┌───────────────────────────────────────┐                │
│                │          Template Generator           │                │
│                │     (Compose, Format, Operations)     │                │
│                └───────────────────┬───────────────────┘                │
│                                    │                                    │
│                ┌───────────────────┼───────────────────┐                │
│                ▼                   ▼                   ▼                │
│           ┌─────────┐       ┌─────────────┐       ┌─────────┐           │
│           │  Event  │       │ Cross-Table │       │ Privacy │           │
│           │Sequence │       │ Operations  │       │ Features│           │
│           │         │       │             │       │         │           │
│           │Chrono-  │       │Sum, Count,  │       │Diff Priv│           │
│           │logical  │       │Aggregates   │       │Geo-Anon │           │
│           └────┬────┘       └──────┬──────┘       └────┬────┘           │
│                └───────────────────┼───────────────────┘                │
│                                    ▼                                    │
│                            ┌───────────────┐                            │
│                            │    .phony     │                            │
│                            │    Package    │                            │
│                            │  (Portable)   │                            │
│                            └───────────────┘                            │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
Core Components
| Component | Description | Documentation |
|---|---|---|
| Generator Types | Seven unified generator abstractions (Logic, List, Model, Template, Statistical, Linked, Event Sequence) | Generator Types |
| Advanced Concepts | Consistency, linking, statistical generation, differential privacy, format-preserving transformation | Advanced Concepts |
| N-gram Models | Statistical text generation using Markov chains | N-gram Models |
| PDL Specification | Declarative JSON schema for data generation | PDL Specification |
| Expression Language | Template syntax (PEL) for composition | Expression Language |
| Package Format | Portable, self-contained .phony packages | Package Format |
| Locale System | Multi-layer inheritance for i18n | Locale System |
| Execution Model | OSS vs Cloud runtime differences | Execution Model |
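To make the N-gram Models row more concrete, here is a minimal sketch of Markov-chain text generation at the character level. It is an illustration only, not Phony's implementation: the function names `train_ngram` and `generate`, the `^`/`$` padding markers, and the training names are all assumptions made for this example.

```python
import random
from collections import defaultdict


def train_ngram(words, n=2):
    """Map each n-character context to the characters observed after it."""
    model = defaultdict(list)
    for word in words:
        # "^" pads the start so the first letters have a context; "$" marks the end.
        padded = "^" * n + word.lower() + "$"
        for i in range(len(padded) - n):
            model[padded[i:i + n]].append(padded[i + n])
    return model


def generate(model, n=2, seed=None, max_len=12):
    """Walk the chain from the start context until the end marker is drawn."""
    rng = random.Random(seed)  # a private RNG keeps the output reproducible per seed
    context, out = "^" * n, []
    while len(out) < max_len:
        nxt = rng.choice(model[context])
        if nxt == "$":
            break
        out.append(nxt)
        context = context[1:] + nxt  # slide the context window by one character
    return "".join(out).capitalize()


# Hypothetical training data; a real model would be trained on a much larger corpus.
model = train_ngram(["anna", "anton", "nina", "antonia"])
print(generate(model, seed=7))
```

Because repeated characters are stored as repeated list entries, `rng.choice` samples each continuation in proportion to how often it followed that context in the training data.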
Key Principles
- Declarative - Define WHAT data you need, not HOW to generate it
- Portable - The same .phony package runs on the CLI, PHP, Python, and Cloud
- Composable - Mix N-gram models + templates + lists + logic seamlessly
- Deterministic - The same seed produces the same output on every runtime
- Consistent - The same input value maps to the same generated value across tables and databases
- Statistically Accurate - Generated data matches real-world distributions
- Privacy-Preserving - Differential privacy and k-anonymity support
- Relationship-Aware - Linked generators ensure valid data combinations
- Open - OSS core under the MIT license, with Cloud for enterprise features
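Three of the principles above (Deterministic, Consistent, Relationship-Aware) are easier to tell apart in code. The sketch below is a rough illustration under stated assumptions, not Phony's API: `generate_rows`, `consistent_pick`, the `CITY_COUNTRY` list, and the `demo-secret` key are hypothetical names and values invented for this example. It also only demonstrates determinism within one Python runtime; matching output across runtimes would need a PRNG algorithm specified independently of any one language.

```python
import hashlib
import random

# Hypothetical linked list: each entry keeps a valid city/country pair together.
CITY_COUNTRY = [("Lyon", "France"), ("Porto", "Portugal"), ("Osaka", "Japan")]


def generate_rows(seed, count):
    """Deterministic: the same seed yields the same rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(count):
        # Relationship-aware: city and country are drawn as one unit,
        # so an invalid combination such as Lyon/Japan cannot occur.
        city, country = rng.choice(CITY_COUNTRY)
        rows.append({"id": rng.randrange(10**6), "city": city, "country": country})
    return rows


def consistent_pick(value, options, secret="demo-secret"):
    """Consistent: the same input value maps to the same option everywhere."""
    digest = hashlib.sha256(f"{secret}:{value}".encode()).digest()
    return options[int.from_bytes(digest[:8], "big") % len(options)]


names = ["Ava", "Ben", "Cleo"]
print(generate_rows(seed=42, count=2))
print(consistent_pick("alice@example.com", names))
```

The design difference is where the stability comes from: determinism depends on the seed and the order of draws, while consistency depends only on the input value, so a replaced email resolves to the same fake name in every table regardless of row order.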