Database Sync Architecture

Phony's Database Sync engine enables secure, scalable synchronization of production databases to non-production environments with automatic PII anonymization.

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────┐
│                    DATABASE SYNC ARCHITECTURE                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  CUSTOMER ENVIRONMENT                    PHONY CLOUD                    │
│  ════════════════════                    ═══════════                    │
│                                                                          │
│  ┌─────────────────┐                    ┌─────────────────────────────┐ │
│  │  Production DB  │                    │      SYNC ENGINE (Go)       │ │
│  │  (MySQL/PG)     │◄── Secure ────────►│                             │ │
│  │                 │    Tunnel          │  ┌─────────────────────────┐│ │
│  │  100GB / PII    │                    │  │    Connection Pool      ││ │
│  └─────────────────┘                    │  │    (pgx / go-mysql)     ││ │
│                                          │  └───────────┬─────────────┘│ │
│                                          │              │              │ │
│                                          │  ┌───────────▼─────────────┐│ │
│                                          │  │    Schema Analyzer      ││ │
│                                          │  │    • FK Detection       ││ │
│                                          │  │    • PII Detection      ││ │
│                                          │  │    • Type Mapping       ││ │
│                                          │  └───────────┬─────────────┘│ │
│                                          │              │              │ │
│                                          │  ┌───────────▼─────────────┐│ │
│                                          │  │   Transform Pipeline    ││ │
│                                          │  │    • Streaming Read     ││ │
│                                          │  │    • Batch Transform    ││ │
│                                          │  │    • Parallel Write     ││ │
│                                          │  └───────────┬─────────────┘│ │
│                                          │              │              │ │
│                                          │  ┌───────────▼─────────────┐│ │
│  ┌─────────────────┐                    │  │    Rust Core (FFI)      ││ │
│  │  Staging DB     │◄── Secure ────────►│  │    • N-gram Generate    ││ │
│  │  (MySQL/PG)     │    Tunnel          │  │    • 5M records/sec     ││ │
│  │                 │                    │  └─────────────────────────┘│ │
│  │  100GB / No PII │                    └─────────────────────────────┘ │
│  └─────────────────┘                                                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Connection Management

Secure Tunnels

Customer databases are never exposed to the public internet. Phony reaches them through one of three private connection options:

┌─────────────────────────────────────────────────────────────────────────┐
│                    CONNECTION OPTIONS                                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  OPTION 1: SSH Tunnel (Recommended)                                     │
│  ═══════════════════════════════════                                    │
│                                                                          │
│  Customer VPC          │         Phony Cloud                            │
│  ┌──────────────┐      │      ┌──────────────┐                          │
│  │   Bastion    │◄─────┼──────│  Sync Engine │                          │
│  │   (SSH Key)  │  SSH │      │              │                          │
│  └──────┬───────┘      │      └──────────────┘                          │
│         │              │                                                 │
│  ┌──────▼───────┐      │                                                │
│  │   Database   │      │                                                │
│  │  (Private)   │      │                                                │
│  └──────────────┘      │                                                │
│                                                                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  OPTION 2: VPC Peering (Enterprise)                                     │
│  ═══════════════════════════════════                                    │
│                                                                          │
│  Customer VPC          │         Phony VPC                              │
│  ┌──────────────┐      │      ┌──────────────┐                          │
│  │   Database   │◄─────┼──────│  Sync Engine │                          │
│  │              │ VPC  │      │              │                          │
│  │              │Peering      │              │                          │
│  └──────────────┘      │      └──────────────┘                          │
│                                                                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  OPTION 3: Direct Connect (On-premise)                                  │
│  ═════════════════════════════════════                                  │
│                                                                          │
│  On-Premise DC         │         Phony Cloud                            │
│  ┌──────────────┐      │      ┌──────────────┐                          │
│  │   Database   │◄─────┼──────│  Sync Engine │                          │
│  │              │ IPSec│      │              │                          │
│  │              │  VPN │      │              │                          │
│  └──────────────┘      │      └──────────────┘                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Connection Configuration

json

{
  "source": {
    "type": "postgresql",
    "host": "db.internal.company.com",
    "port": 5432,
    "database": "production",
    "credentials": {
      "type": "ssh_tunnel",
      "bastion_host": "bastion.company.com",
      "bastion_user": "phony-sync",
      "private_key_ref": "vault://secrets/company/ssh-key"
    },
    "ssl": {
      "mode": "verify-full",
      "ca_cert_ref": "vault://secrets/company/ca-cert"
    }
  }
}
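
A configuration like the one above can be checked before any connection is attempted. The following is a minimal sketch, not Phony's actual validator: the function name `validate_source`, the `"mysql"` type string, and the rule that secrets must be `vault://` references are assumptions made for illustration.

```python
import json

CONFIG = """
{
  "source": {
    "type": "postgresql",
    "host": "db.internal.company.com",
    "port": 5432,
    "database": "production",
    "credentials": {
      "type": "ssh_tunnel",
      "bastion_host": "bastion.company.com",
      "bastion_user": "phony-sync",
      "private_key_ref": "vault://secrets/company/ssh-key"
    },
    "ssl": {"mode": "verify-full", "ca_cert_ref": "vault://secrets/company/ca-cert"}
  }
}
"""

# Assumption: type strings derived from the "MySQL/PG" label in the diagram.
SUPPORTED_TYPES = {"postgresql", "mysql"}
SSH_FIELDS = ("bastion_host", "bastion_user", "private_key_ref")


def validate_source(config: dict) -> list[str]:
    """Return a list of problems found in the source block (empty if none)."""
    source = config["source"]
    problems = []
    if source.get("type") not in SUPPORTED_TYPES:
        problems.append(f"unsupported type: {source.get('type')!r}")
    creds = source.get("credentials", {})
    if creds.get("type") == "ssh_tunnel":
        problems += [f"missing credentials.{f}" for f in SSH_FIELDS if f not in creds]
    ssl = source.get("ssl", {})
    refs = (("private_key_ref", creds.get("private_key_ref")),
            ("ca_cert_ref", ssl.get("ca_cert_ref")))
    for name, ref in refs:
        # Assumed policy: secrets are referenced, never pasted inline.
        if ref is not None and not ref.startswith("vault://"):
            problems.append(f"{name} must be a vault:// reference")
    return problems


problems = validate_source(json.loads(CONFIG))
print(problems or "config looks OK")
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass.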

Schema Analysis

Automatic PII Detection

The Schema Analyzer uses multiple strategies to detect PII columns:

┌─────────────────────────────────────────────────────────────────────────┐
│                    PII DETECTION PIPELINE                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. COLUMN NAME HEURISTICS (Fast)                                       │
│  ════════════════════════════════                                       │
│  Pattern matching on column names:                                      │
│  • email, mail, e_mail → EMAIL type                                     │
│  • phone, tel, mobile, gsm → PHONE type                                 │
│  • first_name, fname, ad → NAME type                                    │
│  • ssn, tc_kimlik, national_id → NATIONAL_ID type                      │
│  • address, adres, street → ADDRESS type                                │
│  • credit_card, cc_number → CREDIT_CARD type                           │
│                                                                          │
│  2. DATA SAMPLING (Accurate)                                            │
│  ════════════════════════════                                           │
│  Sample N rows and analyze content:                                     │
│  • Regex patterns (email format, phone format)                          │
│  • Statistical analysis (entropy, uniqueness)                           │
│  • Named entity recognition (names, addresses)                          │
│                                                                          │
│  3. ML CLASSIFICATION (Enterprise)                                      │
│  ══════════════════════════════════                                     │
│  Fine-tuned model for domain-specific PII:                              │
│  • Healthcare: MRN, diagnosis codes                                     │
│  • Finance: account numbers, transactions                               │
│  • Custom training on customer data patterns                            │
│                                                                          │
│  OUTPUT:                                                                │
│  ════════                                                               │
│  {                                                                      │
│    "users.email": { "pii_type": "EMAIL", "confidence": 0.98 },         │
│    "users.phone": { "pii_type": "PHONE", "confidence": 0.95 },         │
│    "users.bio": { "pii_type": "FREE_TEXT", "confidence": 0.72 }        │
│  }                                                                      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
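
The first two stages of the detection pipeline can be sketched roughly as follows. The keyword lists are taken from the diagram; the token-boundary matching, the e-mail regex, and the function names `guess_from_name` and `match_ratio` are illustrative assumptions, not the analyzer's real implementation.

```python
import re

# Keyword lists copied from the column-name heuristics above.
NAME_RULES = [
    ("EMAIL", ("email", "mail", "e_mail")),
    ("PHONE", ("phone", "tel", "mobile", "gsm")),
    ("NAME", ("first_name", "fname", "ad")),
    ("NATIONAL_ID", ("ssn", "tc_kimlik", "national_id")),
    ("ADDRESS", ("address", "adres", "street")),
    ("CREDIT_CARD", ("credit_card", "cc_number")),
]

# Deliberately loose e-mail shape check (assumption, not a full RFC parser).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def guess_from_name(column: str):
    """Stage 1: match keywords on whole underscore-separated tokens only,
    so that 'ad' does not fire on 'address' or 'created_at'."""
    normalized = "_" + re.sub(r"[^a-z0-9]+", "_", column.lower()).strip("_") + "_"
    for pii_type, keywords in NAME_RULES:
        if any(f"_{kw}_" in normalized for kw in keywords):
            return pii_type
    return None


def match_ratio(samples, pattern=EMAIL_RE) -> float:
    """Stage 2: fraction of non-null sampled values that match a pattern."""
    values = [v for v in samples if v is not None]
    if not values:
        return 0.0
    return sum(bool(pattern.match(v)) for v in values) / len(values)


for col in ("user_email", "billing_address", "created_at", "tc_kimlik_no"):
    print(col, "->", guess_from_name(col))
print(match_ratio(["a@x.com", "not-an-email", None, "b@y.org"]))
```

A ratio from the sampling stage could feed the `confidence` field shown in the output, although how Phony actually computes confidence is not specified here.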

Foreign Key Detection

Preserving referential integrity is critical, so the analyzer builds a dependency graph of foreign keys from three sources:

sql

-- Automatic FK detection via:
-- 1. Database metadata (information_schema)
-- 2. Naming conventions (user_id, order_id)
-- 3. Data analysis (matching value distributions)

-- Result: dependency graph (child column → parent column)
-- orders.user_id       → users.id
-- order_items.order_id → orders.id
-- payments.order_id    → orders.id
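
The naming-convention strategy (item 2 above) can be approximated in a few lines. This sketch assumes that a column named `<singular>_id` points at a table named `<singular>s` with an `id` primary key; both the rule and the function name `infer_fks` are illustrative, and real schemas need metadata and data analysis as well.

```python
def infer_fks(schema: dict[str, list[str]]) -> dict[str, str]:
    """Guess foreign keys from '<singular>_id' column names.

    Assumes parent tables are the plural of the prefix (user_id -> users)
    and expose an 'id' column.
    """
    fks = {}
    for table, columns in schema.items():
        for col in columns:
            if not col.endswith("_id"):
                continue
            parent = col[: -len("_id")] + "s"
            if parent in schema and "id" in schema[parent]:
                fks[f"{table}.{col}"] = f"{parent}.id"
    return fks


schema = {
    "users": ["id", "email"],
    "orders": ["id", "user_id"],
    "order_items": ["id", "order_id"],
    "payments": ["id", "order_id"],
}
for child, parent in infer_fks(schema).items():
    print(f"{child} -> {parent}")
```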

Transform Pipeline

Streaming Architecture

Large databases are processed in streams, never loaded fully into memory:

┌─────────────────────────────────────────────────────────────────────────┐
│                    STREAMING TRANSFORM PIPELINE                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  SOURCE DB              SYNC ENGINE                    TARGET DB        │
│  ═════════              ═══════════                    ═════════        │
│                                                                          │
│  ┌────────┐    ┌────────────────────────────────┐    ┌────────┐        │
│  │        │    │                                │    │        │        │
│  │ users  │───►│  ┌──────┐  ┌──────┐  ┌──────┐ │───►│ users  │        │
│  │ 10M    │    │  │ Read │  │Trans-│  │ Write│ │    │ 10M    │        │
│  │ rows   │    │  │Batch │─►│form  │─►│Batch │ │    │ rows   │        │
│  │        │    │  │ 10K  │  │      │  │ 10K  │ │    │        │        │
│  └────────┘    │  └──────┘  └──────┘  └──────┘ │    └────────┘        │
│                │       ▲                   │    │                       │
│                │       │    ┌──────────┐   │    │                       │
│                │       └────│  Buffer  │◄──┘    │                       │
│                │            │  Queue   │        │                       │
│                │            │ (memory) │        │                       │
│                │            └──────────┘        │                       │
│                │                                │                       │
│                │  Parallel workers: 4-16        │                       │
│                │  Batch size: 1K-100K           │                       │
│                │  Memory limit: 512MB-4GB       │                       │
│                │                                │                       │
│                └────────────────────────────────┘                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
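
The read, transform, and write stages with a bounded in-memory buffer can be sketched as below. The real engine is written in Go, so this Python version only illustrates the shape of the pipeline; `run_pipeline` and its parameters are hypothetical names, and error handling for failing workers is left out.

```python
import queue
import threading
from itertools import islice


def read_batches(rows, batch_size):
    """Yield lists of at most batch_size rows without materializing the source."""
    it = iter(rows)
    while batch := list(islice(it, batch_size)):
        yield batch


def run_pipeline(rows, transform, write, batch_size=10_000, workers=4, max_buffered=8):
    """Stream rows through transform() and hand transformed batches to write().

    The bounded queue gives backpressure: the reader blocks once
    max_buffered batches are waiting, which caps memory use.
    """
    buf = queue.Queue(maxsize=max_buffered)

    def worker():
        while (batch := buf.get()) is not None:
            write([transform(row) for row in batch])

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for batch in read_batches(rows, batch_size):
        buf.put(batch)  # blocks while the buffer is full
    for _ in threads:
        buf.put(None)  # one stop signal per worker
    for t in threads:
        t.join()


written, lock = [], threading.Lock()


def write(batch):
    with lock:  # the sink must be safe to call from several workers
        written.extend(batch)


source = ({"id": i, "email": f"user{i}@corp.test"} for i in range(2_500))
run_pipeline(source, lambda r: {**r, "email": f"{r['id']}@example.com"}, write,
             batch_size=1_000, workers=4)
print(len(written), "rows written")
```

Batches may complete out of order across workers, which is acceptable for bulk inserts but matters if the target expects ordered input.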

Transform Types

json

{
  "tables": {
    "users": {
      "columns": {
        "id": { "transform": "KEEP" },
        "email": {
          "transform": "ANONYMIZE",
          "generator": "template",
          "pattern": "{{uuid}}@example.com"
        },
        "first_name": {
          "transform": "ANONYMIZE",
          "generator": "model",
          "source": "tr_TR/first_names"
        },
        "phone": {
          "transform": "MASK",
          "pattern": "+90 5** *** **{last2}"
        },
        "password_hash": { "transform": "KEEP" },
        "created_at": { "transform": "KEEP" },
        "notes": { "transform": "NULL" }
      }
    }
  }
}
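
To make the semantics of these transform types concrete, here is a small interpreter for the rules above. It is a sketch under assumptions: `fake_model` is a hypothetical stand-in for the model-backed generator, NULL inputs are assumed to stay NULL, and columns without a rule are rejected rather than copied.

```python
import re
import uuid

RULES = {
    "id": {"transform": "KEEP"},
    "email": {"transform": "ANONYMIZE", "generator": "template",
              "pattern": "{{uuid}}@example.com"},
    "first_name": {"transform": "ANONYMIZE", "generator": "model",
                   "source": "tr_TR/first_names"},
    "phone": {"transform": "MASK", "pattern": "+90 5** *** **{last2}"},
    "notes": {"transform": "NULL"},
}


def fake_model(source: str) -> str:
    """Hypothetical placeholder for the model-backed name generator."""
    return f"<generated:{source}>"


def transform_value(value, rule):
    kind = rule["transform"]
    if value is None or kind == "NULL":
        return None  # assumption: NULLs are preserved as NULLs
    if kind == "KEEP":
        return value
    if kind == "MASK":
        digits = re.sub(r"\D", "", value)  # keep digits only
        return rule["pattern"].replace("{last2}", digits[-2:])
    if kind == "ANONYMIZE" and rule["generator"] == "template":
        return rule["pattern"].replace("{{uuid}}", str(uuid.uuid4()))
    if kind == "ANONYMIZE" and rule["generator"] == "model":
        return fake_model(rule["source"])
    raise ValueError(f"unsupported rule: {rule}")


def transform_row(row: dict, rules: dict) -> dict:
    # Fail closed: a column with no rule raises KeyError instead of leaking.
    return {col: transform_value(value, rules[col]) for col, value in row.items()}


row = {"id": 7, "email": "a@x.com", "first_name": "Ayşe",
       "phone": "+90 532 123 45 67", "notes": "call back Monday"}
print(transform_row(row, RULES))
```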

Sync Modes

Full Sync

Complete table copy with transformation:

┌─────────────────────────────────────────────────────────────────────────┐
│                    FULL SYNC PROCESS                                     │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. SCHEMA SYNC                                                         │
│     • Drop existing tables (if they exist)                              │
│     • Create tables with same schema                                    │
│     • Create indexes after data load                                    │
│                                                                          │
│  2. DATA SYNC (per table, ordered by FK dependencies)                   │
│     • Disable FK constraints                                            │
│     • Truncate target table                                             │
│     • Stream + Transform + Bulk insert                                  │
│     • Enable FK constraints                                             │
│                                                                          │
│  3. POST-SYNC                                                           │
│     • Rebuild indexes                                                   │
│     • Update statistics                                                 │
│     • Verify row counts                                                 │
│                                                                          │
│  Duration: ~1-2 hours per 100GB (depends on transforms)                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
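
Step 2 orders tables by FK dependencies so that parents are loaded before children. One way to derive such an order is a topological sort over the dependency graph. The sketch below uses Python's standard `graphlib` module; `load_order` is a hypothetical helper, and the FK map reuses the example from the Foreign Key Detection section.

```python
from graphlib import TopologicalSorter

FKS = {
    "orders.user_id": "users.id",
    "order_items.order_id": "orders.id",
    "payments.order_id": "orders.id",
}


def load_order(fks: dict[str, str]) -> list[str]:
    """Return table names with every parent table before its children."""
    graph: dict[str, set[str]] = {}
    for child_col, parent_col in fks.items():
        child = child_col.split(".")[0]
        parent = parent_col.split(".")[0]
        graph.setdefault(parent, set())
        deps = graph.setdefault(child, set())
        if parent != child:  # ignore self-references such as a manager_id column
            deps.add(parent)
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


order = load_order(FKS)
print(order)
```

Circular foreign keys cannot be ordered this way, which is one reason a sync may still need to disable constraints during the load.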

Incremental Sync

Syncs only the changes made since the previous sync:

┌─────────────────────────────────────────────────────────────────────────┐
│                    INCREMENTAL SYNC STRATEGIES                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  STRATEGY 1: Timestamp-based (Simple)                                   │
│  ══════════════════════════════════════                                 │
│  Requires: updated_at column on tables                                  │
│                                                                          │
│  SELECT * FROM users                                                    │
│  WHERE updated_at > '2024-01-15 10:00:00'                              │
│                                                                          │
│  Pros: Simple, works everywhere                                         │
│  Cons: Misses deletes, clock skew issues                               │
│                                                                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  STRATEGY 2: CDC (Change Data Capture) - Recommended                    │
│  ═════════════════════════════════════════════════════                  │
│  Uses database replication log:                                         │
│  • PostgreSQL: Logical replication slots                                │
│  • MySQL: Binary log (binlog)                                           │
│                                                                          │
│  Captures: INSERT, UPDATE, DELETE                                       │
│  Latency: Near real-time (seconds)                                      │
│                                                                          │
│  Pros: Complete, no missed changes                                      │
│  Cons: Requires DB config, more complex                                 │
│                                                                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  STRATEGY 3: Soft Delete Tracking                                       │
│  ═════════════════════════════════                                      │
│  Requires: deleted_at column + no hard deletes                         │
│                                                                          │
│  Combined with timestamp-based for complete picture                     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
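
The timestamp-based strategy, including its blind spot for deletes, can be demonstrated with an in-memory SQLite database. SQLite is only a stand-in here for the MySQL or PostgreSQL source, and `changed_since` is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "a@x.com", "2024-01-15 09:00:00"),
        (2, "b@x.com", "2024-01-15 11:30:00"),
        (3, "c@x.com", "2024-01-16 08:00:00"),
    ],
)


def changed_since(conn, watermark: str):
    """Rows modified after the watermark, oldest first."""
    cur = conn.execute(
        "SELECT id, email, updated_at FROM users "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    )
    return cur.fetchall()


watermark = "2024-01-15 10:00:00"
rows = changed_since(conn, watermark)
# Advance the watermark to the newest change that was actually seen.
new_watermark = rows[-1][2] if rows else watermark
print([r[0] for r in rows], new_watermark)

# A hard delete leaves no row behind, so the next poll cannot see it.
conn.execute("DELETE FROM users WHERE id = 1")
after_delete = changed_since(conn, new_watermark)
print(after_delete)
```

The empty result after the delete is why the diagram pairs timestamp polling with soft-delete tracking, or recommends CDC instead.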

Subset Sync

Sync a representative subset for development:

json

{
  "subset": {
    "strategy": "percentage",
    "percentage": 1,
    "preserve_integrity": true,
    "seed_tables": ["users"],
    "rules": {
      "users": {
        "sample": "1%",
        "filter": "created_at > '2024-01-01'"
      },
      "orders": {
        "sample": "follow_fk",
        "parent": "users"
      },
      "order_items": {
        "sample": "follow_fk",
        "parent": "orders"
      }
    }
  }
}
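
The `follow_fk` rule means that child rows are selected only if their parent row was sampled, which keeps the subset referentially intact. A rough in-memory sketch follows; `sample_rows` and `follow_fk` are illustrative helpers, the data is synthetic, and the `filter` clause from the config is omitted.

```python
import random


def sample_rows(rows, percent, seed=0):
    """Pick roughly percent% of rows (at least one), reproducibly via seed."""
    rng = random.Random(seed)
    k = max(1, round(len(rows) * percent / 100))
    return rng.sample(rows, k)


def follow_fk(children, fk_col, parents, pk="id"):
    """Keep only the child rows whose parent made it into the subset."""
    keep = {p[pk] for p in parents}
    return [c for c in children if c[fk_col] in keep]


# Synthetic data: 200 users, 3 orders per user, 2 items per order.
users = [{"id": i} for i in range(1, 201)]
orders = [{"id": 1000 + i, "user_id": i % 200 + 1} for i in range(600)]
order_items = [{"id": i, "order_id": 1000 + i % 600} for i in range(1200)]

users_subset = sample_rows(users, percent=1)                       # seed table
orders_subset = follow_fk(orders, "user_id", users_subset)         # parent: users
items_subset = follow_fk(order_items, "order_id", orders_subset)   # parent: orders

print(len(users_subset), len(orders_subset), len(items_subset))
```

Sampling only the seed table and deriving everything else from it is what makes `preserve_integrity` achievable without a second pass.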

Consistency Guarantees

Referential Integrity

Foreign keys stay valid because key columns are copied unchanged; only PII columns are transformed:

┌─────────────────────────────────────────────────────────────────────────┐
│                    FK CONSISTENCY                                        │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  SOURCE:                           TARGET:                              │
│  ════════                          ═══════                              │
│                                                                          │
│  users                             users                                │
│  ┌─────┬───────────┐              ┌─────┬───────────┐                  │
│  │ id  │ email     │              │ id  │ email     │                  │
│  ├─────┼───────────┤   Transform  ├─────┼───────────┤                  │
│  │ 1   │ a@x.com   │ ──────────►  │ 1   │ xx@ex.com │                  │
│  │ 2   │ b@x.com   │              │ 2   │ yy@ex.com │                  │
│  └─────┴───────────┘              └─────┴───────────┘                  │
│                                                                          │
│  orders                            orders                               │
│  ┌─────┬─────────┐                ┌─────┬─────────┐                    │
│  │ id  │ user_id │                │ id  │ user_id │                    │
│  ├─────┼─────────┤   ID preserved ├─────┼─────────┤                    │
│  │ 101 │ 1       │ ──────────────►│ 101 │ 1       │  ✓ FK intact      │
│  │ 102 │ 2       │                │ 102 │ 2       │                    │
│  └─────┴─────────┘                └─────┴─────────┘                    │
│                                                                          │
│  KEY INSIGHT: IDs are NEVER transformed, only PII columns               │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
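
A post-sync check can confirm that no child row points at a missing parent. The sketch below counts such orphans with a LEFT JOIN, using in-memory SQLite as a stand-in for the target database; `orphan_count` is a hypothetical helper, and it interpolates identifiers directly, so it should only ever receive trusted table and column names.

```python
import sqlite3


def orphan_count(conn, child, fk, parent, pk="id") -> int:
    """Count child rows whose non-null FK has no matching parent row."""
    sql = (
        f"SELECT COUNT(*) FROM {child} AS c "
        f"LEFT JOIN {parent} AS p ON c.{fk} = p.{pk} "
        f"WHERE c.{fk} IS NOT NULL AND p.{pk} IS NULL"
    )
    return conn.execute(sql).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "xx@ex.com"), (2, "yy@ex.com")])
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(101, 1), (102, 2)])

intact = orphan_count(conn, "orders", "user_id", "users")
print("orphans after sync:", intact)

# Simulate a broken sync: an order that references a user that was never copied.
conn.execute("INSERT INTO orders VALUES (103, 99)")
broken = orphan_count(conn, "orders", "user_id", "users")
print("orphans after bad row:", broken)
```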

Transaction Consistency

Sync jobs are atomic at the table level:

• If a table sync fails, it's rolled back
• Other tables remain consistent
• Retry mechanism with exponential backoff
• Dead letter queue for persistent failures
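
Retrying with exponential backoff and parking persistent failures in a dead letter queue can be sketched as follows. The attempt count, base delay, and the name `sync_with_retry` are assumptions for illustration; the engine's actual limits are not documented here.

```python
import time


def sync_with_retry(table, sync_fn, dead_letters,
                    max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run sync_fn(table), doubling the wait after each failure.

    Returns True on success. After max_attempts failures the table is
    appended to dead_letters and False is returned.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            sync_fn(table)
            return True
        except Exception as exc:  # a real engine would catch narrower errors
            if attempt == max_attempts:
                dead_letters.append((table, repr(exc)))
                return False
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...


# Demo with a recorded sleep so the example finishes instantly.
delays, dead_letters = [], []
calls = {"n": 0}


def flaky(table):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("tunnel dropped")


def always_fails(table):
    raise TimeoutError("target unreachable")


ok = sync_with_retry("users", flaky, dead_letters, sleep=delays.append)
failed = sync_with_retry("orders", always_fails, dead_letters, sleep=delays.append)
print(ok, failed, delays, dead_letters)
```

Injecting `sleep` as a parameter keeps the backoff schedule testable without waiting in real time.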

Performance

Database Size    Full Sync    Incremental    Subset (1%)
1 GB             ~2 min       ~10 sec        ~5 sec
10 GB            ~15 min      ~30 sec        ~20 sec
100 GB           ~2 hours     ~2 min         ~2 min
1 TB             ~20 hours    ~10 min        ~10 min

Optimization levers:

• Parallel workers (up to 16)
• Batch size tuning
• Index-free loading (rebuild after)
• Connection pooling

Scheduling

json

{
  "schedule": {
    "type": "cron",
    "expression": "0 2 * * *",
    "timezone": "Europe/Istanbul",
    "mode": "incremental",
    "fallback_to_full": {
      "enabled": true,
      "after_days": 7
    }
  },
  "notifications": {
    "on_success": ["slack://channel"],
    "on_failure": ["email://team@company.com", "pagerduty://service"]
  }
}
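
The `fallback_to_full` block is not explained further, so the sketch below rests on one plausible reading: run a full sync instead of the configured mode when the last successful full sync is at least `after_days` old (or has never happened). That interpretation and the helper name `choose_mode` are assumptions, not confirmed scheduler behavior.

```python
from datetime import datetime, timedelta

SCHEDULE = {
    "type": "cron",
    "expression": "0 2 * * *",
    "timezone": "Europe/Istanbul",
    "mode": "incremental",
    "fallback_to_full": {"enabled": True, "after_days": 7},
}


def choose_mode(schedule: dict, last_full_sync, now: datetime) -> str:
    """Pick the sync mode for this run.

    Assumed semantics: fall back to a full sync when the last full sync
    is missing or at least after_days old.
    """
    fallback = schedule.get("fallback_to_full", {})
    if fallback.get("enabled"):
        max_age = timedelta(days=fallback["after_days"])
        if last_full_sync is None or now - last_full_sync >= max_age:
            return "full"
    return schedule["mode"]


now = datetime(2024, 1, 20, 2, 0)
print(choose_mode(SCHEDULE, datetime(2024, 1, 18, 2, 0), now))  # recent full sync
print(choose_mode(SCHEDULE, datetime(2024, 1, 10, 2, 0), now))  # stale full sync
print(choose_mode(SCHEDULE, None, now))                         # never synced
```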
