N-gram Language Models

Phony uses character-level N-gram Markov chains for generating realistic, locale-specific text. This document describes the model architecture, training process, and generation algorithms.

Core Concepts

What is a Character-Level N-gram?

Unlike word-level N-grams (which model word sequences), Phony uses character-level N-grams to model character sequences within words:

Word: "sphinx" (n=4)
─────────────────────────
s p h i n x
└─────┘         → "sphi" (first n-gram)
  └─────┘       → "phin" (middle n-gram)
    └─────┘     → "hinx" (last n-gram)

This approach enables:

  • Infinite vocabulary: Generate words never seen in training
  • Locale characteristics: Model learns "Turkish-sounding" or "English-sounding" patterns
  • Compact storage: Model file much smaller than word lists

Markov Chain Graph

Each N-gram forms a node in a directed graph with weighted edges:

┌─────────────────────────────────────────────────────────────────────────┐
│                    MARKOV CHAIN STRUCTURE                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  "sphi" ─────(w:4)────▶ "phin" ─────(w:4)────▶ "hinx"                   │
│    │                      │                      │                       │
│    │                      │                      │                       │
│    ▼                      ▼                      ▼                       │
│  children              children              lastChildren                │
│  (mid-word)            (mid-word)            (word-end)                  │
│                                                                          │
│  Key insight:                                                            │
│  • children (c)      → transitions within word                          │
│  • lastChildren (lc) → transitions that end the word                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Model Data Structure

LookupList Format

The core data structure for weighted random selection:

json
{
  "e": ["sphi", "phin", "hinx", "quar"],
  "w": [4, 3, 2, 5],
  "cw": [4, 7, 9, 14],
  "i": {"sphi": 0, "phin": 1, "hinx": 2, "quar": 3},
  "sw": 14,
  "c": 4
}

Field  Name                Description
e      elements            N-gram strings array
w      weights             Occurrence counts (frequency)
cw     cumulative weights  Running sum for binary search
i      index               Reverse lookup (ngram → array index)
sw     sum of weights      Total weight
c      count               Number of elements

Why cumulative weights? Enables O(log n) weighted random selection via binary search instead of O(n) linear scan.

Element (Graph Node) Format

json
{
  "sphi": {
    "c": {
      "e": ["phin"],
      "w": [4],
      "cw": [4],
      "sw": 4,
      "c": 1
    },
    "lc": {
      "e": [],
      "w": [],
      "cw": [],
      "sw": 0,
      "c": 0
    }
  }
}

Field  Name          Description
c      children      Next n-grams (mid-word transitions)
lc     lastChildren  Terminal n-grams (word-end transitions)
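
The short keys map naturally onto named struct fields. Below is a minimal serde sketch of these two structures (an illustration assuming the serde crate, not Phony's actual type definitions); T is String in the standalone example above and a numeric n-gram index inside the full model's graph.

rust
use serde::Deserialize;
use std::collections::HashMap;

// Illustrative mapping of the short JSON keys to named fields.
#[derive(Deserialize)]
struct LookupList<T> {
    #[serde(rename = "e")]
    elements: Vec<T>,
    #[serde(rename = "w")]
    weights: Vec<u64>,
    #[serde(rename = "cw")]
    cumulative_weights: Vec<u64>,
    #[serde(rename = "i", default)]
    index: HashMap<String, usize>, // reverse lookup; omitted in some contexts
    #[serde(rename = "sw", default)]
    sum_of_weights: u64,
    #[serde(rename = "c", default)]
    count: usize,
}

#[derive(Deserialize)]
struct Element<T> {
    #[serde(rename = "c")]
    children: LookupList<T>, // mid-word transitions
    #[serde(rename = "lc")]
    last_children: LookupList<T>, // word-ending transitions
}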

Complete Model Format

json
{
  "version": 1,
  "metadata": {
    "locale": "tr_TR",
    "domain": "person_names",
    "ngram_order": 4,
    "token_type": "char",
    "trained_on": "2024-03-15T10:00:00Z",
    "sample_count": 50000
  },
  "config": {
    "min_word_length": 3,
    "exclude_originals": true,
    "include_real_words": true,
    "position_depth": 5,
    "fallback": "prefix",
    "tokenizer": {
      "word_separator": "\\s",
      "sentence_separator": "[.!?:;]",
      "word_filter": "/[^a-zçğıöşü]+/u",
      "to_lowercase": true
    }
  },
  "ngrams": ["mehm", "ehme", "hmet", "ayşe", "sphi", "phin", "hinx", ...],
  "distributions": {
    "word_lengths": {
      "e": [4, 5, 6, 7, 8],
      "w": [120, 350, 280, 150, 100],
      "cw": [120, 470, 750, 900, 1000]
    },
    "sentence_lengths": {
      "e": [5, 6, 7, 8, 9, 10],
      "w": [50, 120, 200, 180, 100, 50],
      "cw": [50, 170, 370, 550, 650, 700]
    }
  },
  "first": { "e": [0, 3, 4, ...], "w": [...], "cw": [...] },
  "positions": {
    "1": { "e": [0, 4, 12, ...], "w": [...], "cw": [...] },
    "-1": { "e": [2, 6, 15, ...], "w": [...], "cw": [...] }
  },
  "graph": {
    "0": { "c": {"e": [1], "w": [4], "cw": [4]}, "lc": {"e": [2], "w": [3], "cw": [3]} },
    "1": { "c": {"e": [2], "w": [4], "cw": [4]}, "lc": {"e": [], "w": [], "cw": []} }
  },
  "words": {
    "excluded": [
      [0, 1, 2],
      [3]
    ],
    "real": [
      [0, 1, 2],
      [3],
      [4, 5, 6]
    ]
  }
}

N-gram Only String Storage

The only strings in the model are N-grams. Words are represented as sequences of N-gram indices:

┌─────────────────────────────────────────────────────────────────────────┐
│                    WORD AS NGRAM SEQUENCE                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ngrams: ["mehm", "ehme", "hmet", "ayşe", ...]                          │
│             0       1       2       3                                    │
│                                                                          │
│  Word "mehmet" = ngram sequence [0, 1, 2]                               │
│                                                                          │
│  Reconstruction:                                                         │
│  ────────────────                                                        │
│  ngrams[0]         = "mehm"                                             │
│  ngrams[0] + [1]   = "mehm" + "e" = "mehme"                             │
│  ngrams[0] + [1,2] = "mehme" + "t" = "mehmet" ✓                         │
│                                                                          │
│  Algorithm:                                                              │
│  ──────────                                                              │
│  word = ngrams[sequence[0]]                                             │
│  for i in sequence[1:]:                                                 │
│      word += ngrams[i][-1]   // append last char                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Field           Description
ngrams          All unique N-gram strings (single source of truth)
words.excluded  N-gram index sequences for deduplication
words.real      N-gram index sequences for real_word generation mode

Benefits:

  • Single source of truth: Only N-grams stored as strings
  • No duplication: Words derived from existing N-grams
  • Smaller file size: Integer arrays vs repeated strings
  • Consistent: All data uses numeric indices
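
A minimal sketch of the reconstruction algorithm above (function and parameter names are illustrative):

rust
// Rebuild a word from an n-gram index sequence: take the first n-gram whole,
// then append only the last character of each following n-gram.
fn reconstruct(ngrams: &[&str], sequence: &[usize]) -> Option<String> {
    let mut word = String::from(*ngrams.get(*sequence.first()?)?);
    for &idx in &sequence[1..] {
        word.push(ngrams.get(idx)?.chars().last()?);
    }
    Some(word)
}

// Example: with ngrams ["mehm", "ehme", "hmet"], the sequence [0, 1, 2] yields "mehmet".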

Positional N-grams

A key innovation: tracking N-gram distributions by sentence position.

┌─────────────────────────────────────────────────────────────────────────┐
│                    POSITIONAL N-GRAM CONCEPT                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Sentence: "The quick brown fox jumps"                                  │
│             ▲   ▲     ▲     ▲   ▲                                       │
│             │   │     │     │   │                                       │
│           pos:1 pos:2  ...  pos:-2 pos:-1                               │
│                                                                          │
│  positions[1]  → N-grams that START sentences                           │
│  positions[2]  → N-grams common in 2nd position                         │
│  positions[-1] → N-grams that END sentences                             │
│  positions[-2] → N-grams common in 2nd-to-last position                 │
│                                                                          │
│  WHY THIS MATTERS:                                                       │
│  ─────────────────                                                       │
│  In Turkish: "bir" commonly starts sentences                            │
│              "oldu" commonly ends sentences                              │
│  In English: "The" commonly starts sentences                            │
│              "today" commonly ends sentences                             │
│                                                                          │
│  This produces more natural sentence structure.                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Configuration

json
{
  "config": {
    "position_depth": 5
  }
}

When position_depth: 5:

  • Tracks positions: 1, 2, 3, 4, 5, -5, -4, -3, -2, -1
  • Middle words use generic first distribution
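
One way to picture this: each word index in a sentence maps either to a tracked position key or to the generic first distribution. The sketch below illustrates that mapping; the exact rule (and which side wins in very short sentences) is an assumption here, not taken from Phony's source.

rust
// Map a 0-based word index to a tracked position key (1..depth or -depth..-1),
// or None for middle words, which fall back to the generic `first` distribution.
// Front positions are given precedence for short sentences (an assumption).
fn position_key(word_index: usize, word_count: usize, depth: usize) -> Option<i32> {
    let from_end = word_count - word_index; // 1 = last word, 2 = 2nd-to-last, ...
    if word_index < depth {
        Some((word_index + 1) as i32)
    } else if from_end <= depth {
        Some(-(from_end as i32))
    } else {
        None
    }
}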

Tokenization

Regex-Based Boundaries

Sentence and word boundaries are defined via regular expressions:

json
{
  "config": {
    "tokenizer": {
      "word_separator": "\\s",
      "sentence_separator": "[.!?:;]",
      "word_filter": "/[^a-zçğıöşü]+/u",
      "to_lowercase": true
    }
  }
}

Field               Description               Example
word_separator      Regex to split words      \s (whitespace)
sentence_separator  Regex to split sentences  [.!?:;]
word_filter         Regex to clean words      /[^a-z]+/ (keep only letters)
to_lowercase        Normalize case            true
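
The sketch below shows how this configuration could be applied with Rust's regex crate (an illustration, not Phony's tokenizer): the /…/u delimiters of the stored pattern are stripped, since Rust regexes are Unicode-aware by default, and min_word_length is assumed to be 3 as in the earlier config example.

rust
use regex::Regex;

// Split text into sentences, split sentences into words, then filter and normalize words.
fn tokenize(text: &str) -> Vec<Vec<String>> {
    let sentence_sep = Regex::new(r"[.!?:;]").unwrap(); // sentence_separator
    let word_sep = Regex::new(r"\s+").unwrap(); // word_separator
    let word_filter = Regex::new(r"[^a-zçğıöşü]+").unwrap(); // word_filter, delimiters stripped

    sentence_sep
        .split(text)
        .map(|sentence| {
            word_sep
                .split(sentence)
                .map(|w| word_filter.replace_all(&w.to_lowercase(), "").into_owned())
                .filter(|w| w.chars().count() >= 3) // min_word_length
                .collect::<Vec<String>>()
        })
        .filter(|words| !words.is_empty())
        .collect()
}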

Predefined Filter Types

Type          Pattern                           Use Case
ALPHABETICAL  /[^a-z]+/                         English basic
FRENCH        /[^a-zéèëêúüûùœàáäâæíïìîóöôòç]+/  French
TURKISH       /[^a-zçğıöşü]+/u                  Turkish
JAPANESE      Unicode ranges                    Japanese (hiragana, katakana, kanji)
CHINESE       Unicode ranges                    Chinese (CJK)
NO_SYMBOLS    /[^ \p{L}]+/u                     Any Unicode letter

Token Type: Character vs Word

Terminology Note: Don't confuse token_type (how the model was trained) with generation.mode (how output is produced). They are independent concepts:

  • token_type: Set at training time. Determines what a "token" is (character or word).
  • generation.mode: Set at generation time in PDL. Determines output type (word, sentence, text, etc.).

Character Token Type (Default)

N-grams are character sequences. The model learns character-by-character patterns.

Training: ["sphinx", "quarter", "quick"]
N-grams:  "sphi" → "phin" → "hinx"
          "quar" → "uart" → "arte" → "rter"
          "quic" → "uick"

Output: "sphinter" (new word, never in training!)

Use for: Names, usernames, made-up words

Word Token Type

N-grams are word sequences. The model learns word-by-word patterns (for multi-word outputs like company names).

Training corpus (company names):
  "Yılmaz Holding A.Ş."
  "Koç Holding A.Ş."
  "Sabancı Holding A.Ş."
  "Yılmaz Gıda Sanayi"
  "Koç Otomotiv Sanayi"

Word bigrams (n=2):
  ["Yılmaz", "Holding"] → ["Holding", "A.Ş."]
  ["Koç", "Holding"] → ["Holding", "A.Ş."]
  ["Sabancı", "Holding"] → ["Holding", "A.Ş."]
  ["Yılmaz", "Gıda"] → ["Gıda", "Sanayi"]
  ["Koç", "Otomotiv"] → ["Otomotiv", "Sanayi"]

Generated outputs:
  "Yılmaz Otomotiv Sanayi"  (new combination!)
  "Koç Gıda Sanayi"         (new combination!)
  "Sabancı Holding A.Ş."    (from training)

Use for: Company names, product names, multi-word phrases

Training with Word Token Type

bash
# Company names (word bigrams)
phony train companies.txt \
  -o models/companies.ngram \
  --token-type word \
  --ngram-size 2

# Product names (word trigrams for more context)
phony train products.txt \
  -o models/products.ngram \
  --token-type word \
  --ngram-size 3

Configuration

json
{
  "metadata": {
    "token_type": "char"
  }
}

Value  Description                N-gram Example
char   Character-level (default)  "meh" → "ehm" → "hme"
word   Word-level                 ["Koç", "Holding"] → ["Holding", "A.Ş."]

Note: token_type is a model property set at training time. It cannot be changed when using the model. Use --token-type word during training for multi-word phrase generation (company names, product names).


Design Rationale

Why Single N-gram Order?

Phony uses a single N-gram order (e.g., only 4-grams) rather than multi-order approaches. This was a deliberate design choice after evaluating academic smoothing methods.

┌─────────────────────────────────────────────────────────────────────────┐
│                    THE LOREM IPSUM PRINCIPLE                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Phony generates FAKE data that LOOKS LIKE real data.                   │
│  It doesn't need to BE real data.                                       │
│                                                                          │
│  Priority 1: Speed (generate millions of records fast)                  │
│  Priority 2: Plausibility (output looks locale-appropriate)             │
│  Priority 3: NOT coverage (we don't need every possible word)           │
│                                                                          │
│  Academic smoothing methods optimize for coverage.                       │
│  We optimize for speed and plausibility.                                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Evaluated But Rejected

We evaluated several academic approaches before deciding on simplicity:

Method                What It Does                                               Why Not For Phony
Kneser-Ney Smoothing  Redistributes probability mass using continuation counts  Complexity overhead, diminishing returns for fake data
Modified Kneser-Ney   Three discount parameters (D₁, D₂, D₃₊)                    Even more complexity, marginal quality gain
Interpolation         λ-weighted mix of all n-gram orders                        Requires storing all orders (2x-4x model size)
Stupid Backoff        Fixed discount (0.4) without normalization                 Still requires multi-order storage

References:

  • Chen & Goodman (1999). "An Empirical Study of Smoothing Techniques for Language Modeling"
  • Kneser & Ney (1995). "Improved backing-off for M-gram language modeling"

Our Approach: KISS

Single n-gram order + Simple prefix-match fallback
────────────────────────────────────────────────────
✓ One model file (not separate files per order)
✓ Smaller memory footprint
✓ Faster generation (no interpolation math)
✓ Good enough for fake data

Locale-Based N-gram Defaults

Different languages have different optimal N-gram sizes:

Locale  Default N  Rationale
en_*    3          Shorter words, common patterns
tr_TR   4          Agglutinative, longer morphemes
de_DE   4          Compound words, longer stems
ja_JP   2          Character-based (hiragana/katakana)
zh_*    2          Character-based (hanzi)
ar_*    3          Root-pattern morphology

These are defaults; users can override with --ngram-size flag.


Fallback Strategy

The Sparse Data Problem

When an N-gram has no recorded transitions:

Request: word starting with "xyz"
Model: No N-grams starting with "xyz"
Result: null (failure)

Prefix-Match Fallback

Since we store only single-order N-grams, we use prefix matching as fallback:

┌─────────────────────────────────────────────────────────────────────────┐
│                    PREFIX-MATCH FALLBACK                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Looking for 4-gram starting with "xyz"                                 │
│      ↓                                                                   │
│  Search: Any 4-gram matching "xyz*" → found "xyza" → use it            │
│      ↓ (if not found)                                                   │
│  Search: Any 4-gram matching "xy**" → found "xyab" → use it            │
│      ↓ (if not found)                                                   │
│  Search: Any 4-gram matching "x***" → found "xabc" → use it            │
│      ↓ (if not found)                                                   │
│  Use random from `first` distribution                                   │
│                                                                          │
│  Key insight: We stay within single n-gram order.                       │
│  No 3-grams or 2-grams needed.                                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
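
A minimal sketch of this fallback, assuming the model's n-gram strings are available as a slice (a real implementation would pick a weighted random element among the matches rather than the first one):

rust
// Try the full prefix first, then progressively shorter prefixes, staying within
// the single n-gram order. Returns None when no n-gram shares even the first character.
fn find_by_prefix<'a>(ngrams: &'a [String], wanted: &str) -> Option<&'a str> {
    let chars: Vec<char> = wanted.chars().collect();
    for len in (1..=chars.len()).rev() {
        let prefix: String = chars[..len].iter().collect();
        if let Some(hit) = ngrams.iter().find(|g| g.starts_with(&prefix)) {
            return Some(hit.as_str());
        }
    }
    None // caller falls back to a random pick from the `first` distribution
}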

Configuration

json
{
  "config": {
    "fallback": "prefix"
  }
}

Value   Description
none    Return null on miss (strict mode)
prefix  Prefix-match within same n-gram order (default)

Generation Modes

Word Generation

word(lengthHint?, position?, startsWith?) → string | null

Parameter   Type    Description
lengthHint  int     Target length (soft constraint)
position    int     Sentence position (1, 2, -1, -2)
startsWith  string  Required prefix

Algorithm:

  1. Select starting N-gram (filtered by startsWith if provided)
  2. Select target length from word_lengths distribution
  3. Loop: pick weighted random child, append last character
  4. When length ≥ hint and lastChildren exists, terminate
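
The loop can be sketched as follows; the Node type and uniform child selection are simplifications (the real model stores weighted LookupLists and uses the weighted selection shown in the Binary Search Weighted Random section):

rust
use std::collections::HashMap;
use rand::Rng;

// Simplified graph node mirroring the "c"/"lc" split (weights omitted for brevity).
struct Node {
    children: Vec<String>,      // mid-word transitions
    last_children: Vec<String>, // word-ending transitions
}

// Walk the chain from a starting n-gram, appending one character per step,
// and prefer a word-ending transition once the length hint is reached.
fn generate_word(
    graph: &HashMap<String, Node>,
    start: &str,
    length_hint: usize,
    rng: &mut impl Rng,
) -> String {
    let mut current = start.to_string();
    let mut word = current.clone();

    loop {
        let node = match graph.get(&current) {
            Some(n) => n,
            None => return word, // unknown n-gram: terminate
        };
        if word.chars().count() >= length_hint && !node.last_children.is_empty() {
            let last = &node.last_children[rng.gen_range(0..node.last_children.len())];
            word.push(last.chars().last().unwrap());
            return word;
        }
        if node.children.is_empty() {
            return word; // dead end: terminate early
        }
        let next = node.children[rng.gen_range(0..node.children.len())].clone();
        word.push(next.chars().last().unwrap());
        current = next;
    }
}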

Sentence Generation

sentence(numberOfWords?, startsWith?, endingPunctuation?) → string

Parameter          Type          Description
numberOfWords      int           Word count
startsWith         string        First word prefix
endingPunctuation  string|array  "." or [".", "!", "?"]

Algorithm:

  1. Generate position-aware words for first N positions
  2. Generate generic words for middle
  3. Generate position-aware words for last N positions
  4. Capitalize first letter, append punctuation

Text Generation

text(maxNumberOfCharacters?, sentenceEndingPunctuation?, suffix?) → string

Parameter                  Type          Description
maxNumberOfCharacters      int           Maximum output length
sentenceEndingPunctuation  string|array  Punctuation options
suffix                     string        Truncation suffix ("...")

Poem Generation

poem(numberOfVerses?, stanzaLength?, maximumNumberOfWords?, endingPunctuation?) → string

Parameter             Type          Description
numberOfVerses        int           Total lines
stanzaLength          int           Lines per stanza (4 = quatrain)
maximumNumberOfWords  int           Max words per line
endingPunctuation     string|array  Line endings

Acrostic Generation

acrosticPoem(initials, maximumNumberOfWords?, endingPunctuation?) → string

Parameter             Type          Description
initials              string        Word to spell (required)
maximumNumberOfWords  int           Max words per line
endingPunctuation     string|array  Line endings

Example:

Input: initials = "HELLO"

Output:
Happiness fills the morning air.
Every bird sings a joyful tune.
Laughter echoes through the valley.
Light dances on the water.
Open your heart to the world.

Constrained Generation

Phony supports multiple constraint types for controlled output:

Constraint  Description           Example
startsWith  Prefix constraint     startsWith: "A" → "Ahmet"
lengthHint  Soft length target    lengthHint: 6 → ~6 chars
min_length  Hard minimum          min_length: 4 → ≥4 chars
max_length  Hard maximum          max_length: 12 → ≤12 chars
endsWith    Suffix constraint     endsWith: "oğlu"
contains    Substring constraint  contains: "han"
position    Sentence position     position: 1 → sentence-start style

Soft vs Hard Constraints

  • Soft (lengthHint): Guides generation, may overshoot
  • Hard (min_length, max_length): Enforced strictly, may return null
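
Hard constraints can be pictured as rejection sampling around any generation mode; the sketch below illustrates that idea (the retry limit and names are assumptions, not Phony's actual behavior):

rust
// Retry the generator until the hard length constraints are satisfied,
// returning None if they cannot be met within the attempt budget.
fn generate_with_constraints(
    mut generate: impl FnMut() -> Option<String>,
    min_length: usize,
    max_length: usize,
    max_attempts: usize,
) -> Option<String> {
    for _ in 0..max_attempts {
        if let Some(candidate) = generate() {
            let len = candidate.chars().count();
            if len >= min_length && len <= max_length {
                return Some(candidate);
            }
        }
    }
    None // hard constraints could not be satisfied
}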

Binary Search Weighted Random

The key algorithm for O(log n) selection:

┌─────────────────────────────────────────────────────────────────────────┐
│                    WEIGHTED RANDOM SELECTION                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Elements: ["a", "b", "c", "d"]                                         │
│  Weights:  [1,   2,   3,   4]                                           │
│  Cumulative: [1, 3, 6, 10]                                              │
│                                                                          │
│  Random(0-10) = 5                                                        │
│                                                                          │
│  Binary search for first cw ≥ 5:                                        │
│                                                                          │
│  low=0, high=3                                                          │
│  mid=1, cw[1]=3 < 5 → low=2                                             │
│  mid=2, cw[2]=6 ≥ 5 → found!                                            │
│                                                                          │
│  Return: elements[2] = "c"                                              │
│                                                                          │
│  Probability distribution:                                               │
│  a: 10%, b: 20%, c: 30%, d: 40%                                         │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
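
A self-contained sketch of this selection over a cw array, assuming the rand crate (names are illustrative):

rust
use rand::Rng;

// Draw r in [1, total] and return the element at the first index where cumulative[i] >= r.
// An element with weight w is therefore chosen with probability w / total.
fn weighted_pick<'a, T>(elements: &'a [T], cumulative: &[u64], rng: &mut impl Rng) -> Option<&'a T> {
    let total = *cumulative.last()?;
    if total == 0 {
        return None;
    }
    let r = rng.gen_range(1..=total);
    let idx = cumulative.partition_point(|&cw| cw < r); // binary search: O(log n)
    elements.get(idx)
}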

Training Configuration

CLI Training

bash
# Basic training
phony train names.txt -o models/names.ngram

# With options
phony train names.txt \
  --output models/tr_TR/names.ngram \
  --ngram-size 4 \
  --min-word-length 3 \
  --position-depth 5 \
  --token-type char \
  --locale tr_TR \
  --exclude-originals \
  --include-real-words \
  --fallback prefix \
  --word-separator '\s' \
  --sentence-separator '[.!?]' \
  --word-filter '/[^a-zçğıöşü]+/u' \
  --lowercase

Training Flags

Flag                         Description
--ngram-size N               N-gram order (default: 4)
--token-type char|word       Token type (char=character-level, word=word-level)
--min-word-length N          Filter short words
--position-depth N           Sentence position tracking depth
--exclude-originals          Store training words for deduplication
--include-real-words         Embed training words for real_word mode
--fallback none|prefix       Fallback strategy for sparse data
--word-separator REGEX       Word boundary regex
--sentence-separator REGEX   Sentence boundary regex
--word-filter REGEX          Character filter regex
--lowercase                  Normalize to lowercase

Training from Different Sources

bash
# From CSV column
phony train customers.csv --column first_name -o models/names.ngram

# From JSON path
phony train data.json --path "$.users[*].name" -o models/names.ngram

# From multiple files
phony train names/*.txt -o models/names.ngram

PDL Integration

Model generators must specify a generation block. There is no default mode; omitting it causes a validation error.

Generation Block Syntax

json
{
  "type": "model",
  "source": "path/to/model.ngram",
  "generation": {
    "mode": "word | sentence | text | paragraph | poem | acrostic | real_word",
    "params": { ... }
  },
  "constraints": { ... }
}

Structure:

  • generation.mode (required): How to produce output
  • generation.params (optional): Mode-specific parameters (e.g., word_count, starts_with)
  • constraints (optional): Output validation (e.g., min_length, max_length - rejects if not met)

Available Modes

Mode       Output              Required Params  Optional Params
word       Single word         -                length_hint, starts_with, position
words      Multiple words      -                count, length_hint, starts_with
sentence   Single sentence     -                word_count, starts_with, punctuation
sentences  Multiple sentences  -                count, word_count, punctuation
text       Truncated text      -                max_chars, suffix, punctuation
paragraph  Paragraph           -                sentence_count, punctuation
poem       Poem with stanzas   -                verses, stanza_length, max_words, punctuation
acrostic   Acrostic poem       initials         max_words, punctuation
real_word  Pick from training  -                starts_with

Dynamic Parameters with Other Generators

Generation parameters can reference other generators using {{generator_name}}:

json
{
  "generators": {
    "word_count": {
      "type": "logic",
      "algorithm": "int_between",
      "params": { "min": 3, "max": 8 }
    },
    "sentence_count": {
      "type": "logic",
      "algorithm": "int_between",
      "params": { "min": 2, "max": 5 }
    },
    "bio": {
      "type": "model",
      "source": "models/text.ngram",
      "generation": {
        "mode": "sentence",
        "params": {
          "word_count": "{{word_count}}",
          "punctuation": [".", "!"]
        }
      }
    },
    "description": {
      "type": "model",
      "source": "models/text.ngram",
      "generation": {
        "mode": "paragraph",
        "params": {
          "sentence_count": "{{sentence_count}}"
        }
      }
    }
  }
}

Inline Random Ranges

Parameters also support inline random syntax:

json
{
  "generators": {
    "tagline": {
      "type": "model",
      "source": "models/slogans.ngram",
      "generation": {
        "mode": "sentence",
        "params": {
          "word_count": "{{number:3-8}}",
          "punctuation": "{{pick(['!', '.', '...'])}}"
        }
      }
    }
  }
}

Complete Examples

Simple word generation:

json
{
  "first_name": {
    "type": "model",
    "source": "models/names.ngram",
    "generation": { "mode": "word" },
    "constraints": { "min_length": 3, "max_length": 12 }
  }
}

Sentence with dynamic word count:

json
{
  "motto": {
    "type": "model",
    "source": "models/quotes.ngram",
    "generation": {
      "mode": "sentence",
      "params": {
        "word_count": "{{number:5-12}}",
        "starts_with": "{{pick(['A', 'B', 'C'])}}",
        "punctuation": [".", "!"]
      }
    }
  }
}

Poem with configurable structure:

json
{
  "haiku": {
    "type": "model",
    "source": "models/poetry.ngram",
    "generation": {
      "mode": "poem",
      "params": {
        "verses": 3,
        "stanza_length": 0,
        "max_words": "{{number:3-5}}",
        "punctuation": ""
      }
    }
  }
}

Acrostic poem:

json
{
  "welcome_poem": {
    "type": "model",
    "source": "models/poetry.ngram",
    "generation": {
      "mode": "acrostic",
      "params": {
        "initials": "WELCOME",
        "max_words": "{{number:4-8}}",
        "punctuation": "."
      }
    }
  }
}

Pick real word from training data:

json
{
  "known_name": {
    "type": "model",
    "source": "models/names.ngram",
    "generation": {
      "mode": "real_word",
      "params": {
        "starts_with": "A"
      }
    }
  }
}

Mode Selection Guide

┌─────────────────────────────────────────────────────────────────────────┐
│                    GENERATION MODE SELECTION                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Need a single generated item?                                          │
│  ├── Name, username, made-up word    → mode: "word"                     │
│  ├── Company name (multi-word)       → mode: "word" + word-mode model   │
│  └── Known/real name from data       → mode: "real_word"                │
│                                                                          │
│  Need text content?                                                      │
│  ├── Short tagline, motto            → mode: "sentence"                 │
│  ├── Bio, description (fixed length) → mode: "text" + max_chars         │
│  └── Article, review                 → mode: "paragraph"                │
│                                                                          │
│  Need creative/structured text?                                          │
│  ├── Poem with stanzas               → mode: "poem"                     │
│  └── Hidden message in initials      → mode: "acrostic"                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Performance Characteristics

Operation        Complexity    Notes
Training (feed)  O(n × k)      n = chars, k = ngram size
Calculate        O(m log m)    m = unique ngrams (sorting)
Weighted random  O(log m)      Binary search
Word generation  O(length)     Linear in output length
Model size       100KB - 10MB  Depends on training corpus

File Format

N-gram models are stored as .ngram files:

model.ngram
├── Gzip compressed
└── JSON structure (as described above)

The .ngram extension indicates:

  • Machine-trained statistical model
  • Not human-editable (unlike lists)
  • Locale-specific content
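
Reading such a file back is a matter of decompressing and parsing. A minimal sketch, assuming the flate2 and serde_json crates (this is an illustration, not Phony's loader):

rust
use flate2::read::GzDecoder;
use std::fs::File;

// Open a .ngram file: gzip-decompress, then parse the JSON payload.
fn load_model(path: &str) -> Result<serde_json::Value, Box<dyn std::error::Error>> {
    let file = File::open(path)?;
    let decoder = GzDecoder::new(file);
    // A real loader would deserialize into typed structs instead of a generic Value.
    let model: serde_json::Value = serde_json::from_reader(decoder)?;
    Ok(model)
}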

Future Optimizations

FlatBuffers for Zero-Copy Loading

Current: JSON + Gzip requires parsing and memory allocation during load.

Future: FlatBuffers format for instant model loading in Cloud engine.

┌─────────────────────────────────────────────────────────────────────────┐
│                    ZERO-COPY DESERIALIZATION                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Current (JSON + Gzip):                                                 │
│  ──────────────────────                                                 │
│  1. Read file from disk        (~10ms)                                  │
│  2. Decompress gzip            (~20ms)                                  │
│  3. Parse JSON text            (~50-100ms) ← Bottleneck                 │
│  4. Build data structures      (~10ms)                                  │
│  Total: ~100-150ms                                                      │
│                                                                          │
│  Future (FlatBuffers):                                                  │
│  ─────────────────────                                                  │
│  1. mmap() file into memory    (~1ms)                                   │
│  2. Done. No parsing needed.                                            │
│  Total: ~1-5ms                                                          │
│                                                                          │
│  WHY IT WORKS:                                                          │
│  ─────────────                                                          │
│  FlatBuffers file format IS the memory format.                          │
│  No conversion needed - just point to the data.                         │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Proposed dual-format support:

Extension  Format       Use Case
.ngram     JSON + Gzip  OSS libraries, debugging, human inspection
.ngram.fb  FlatBuffers  Cloud engine, high-performance scenarios

Benefits for Cloud (5M rec/sec):

  • Instant model switching between locales
  • Lower memory pressure (no parsing overhead)
  • Better cache utilization


Streaming Training for Large Files

Current: Entire training file loaded into memory before processing.

Future: Stream-based training for arbitrarily large files.

┌─────────────────────────────────────────────────────────────────────────┐
│                    STREAMING TRAINING PIPELINE                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Current (Batch):                                                       │
│  ─────────────────                                                      │
│  [Read entire file] → [Process all] → [Build model]                    │
│  Memory: O(file_size)                                                   │
│  Problem: 10GB file = 10GB+ RAM needed                                  │
│                                                                          │
│  Future (Streaming):                                                    │
│  ──────────────────                                                     │
│  [Read chunk 1] → [Update model] ─┐                                    │
│  [Read chunk 2] → [Update model] ─┤→ [Finalize model]                  │
│  [Read chunk 3] → [Update model] ─┘                                    │
│  Memory: O(chunk_size + model_size)                                     │
│  Benefit: 10GB file with 64MB chunks = ~100MB RAM                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Implementation approach:

rust
// Pseudocode for streaming trainer
fn train_streaming(path: &str, chunk_size: usize) -> Model {
    let mut model = Model::new();

    for chunk in read_chunks(path, chunk_size) {
        let partial_counts = count_ngrams(&chunk);
        model.merge_counts(partial_counts);
    }

    model.finalize();  // Calculate cumulative weights, etc.
    model
}

Key considerations:

  • N-grams at chunk boundaries: Overlap chunks by n-1 characters
  • Memory for counts: Hash map grows with unique N-grams (bounded)
  • Finalization: Single pass to compute cumulative weights
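
The boundary-overlap consideration above can be sketched as a chunked reader that carries the last n-1 characters of each chunk into the next one (illustrative names; UTF-8 boundary handling is deliberately simplified):

rust
use std::fs::File;
use std::io::{BufReader, Read};

// Feed each chunk to a callback, prepending the previous chunk's last n-1 characters
// so that n-grams spanning a chunk boundary are not lost.
fn for_each_chunk(
    path: &str,
    chunk_size: usize,
    n: usize,
    mut handle: impl FnMut(&str),
) -> std::io::Result<()> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut buf = vec![0u8; chunk_size];
    let mut carry = String::new();

    loop {
        let read = reader.read(&mut buf)?;
        if read == 0 {
            break;
        }
        // NOTE: assumes chunk boundaries never split a multi-byte UTF-8 character;
        // a real implementation must buffer incomplete code points.
        let chunk = format!("{}{}", carry, String::from_utf8_lossy(&buf[..read]));
        handle(&chunk);
        carry = chunk
            .chars()
            .rev()
            .take(n.saturating_sub(1))
            .collect::<Vec<_>>()
            .into_iter()
            .rev()
            .collect();
    }
    Ok(())
}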

Parallel Training

Future enhancement: Multi-threaded training for very large corpora.

┌─────────────────────────────────────────────────────────────────────────┐
│                    PARALLEL TRAINING                                     │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Thread 1: [Chunk 1] → [Count N-grams] ─┐                               │
│  Thread 2: [Chunk 2] → [Count N-grams] ─┼→ [Merge] → [Model]           │
│  Thread 3: [Chunk 3] → [Count N-grams] ─┤                               │
│  Thread 4: [Chunk 4] → [Count N-grams] ─┘                               │
│                                                                          │
│  Speedup: ~linear with cores (N-gram counting is embarrassingly parallel)│
│                                                                          │
│  MERGE STRATEGY:                                                        │
│  ───────────────                                                        │
│  • Each thread produces partial frequency counts                        │
│  • Main thread merges counts (simple addition)                          │
│  • Finalize: compute cumulative weights once                            │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
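
The merge strategy lends itself to a map/reduce shape. A sketch using the rayon crate (an assumption about a possible implementation, not Phony's code):

rust
use rayon::prelude::*;
use std::collections::HashMap;

// Count character n-grams per chunk in parallel, then merge the partial
// frequency maps by simple addition.
fn count_char_ngrams(chunks: &[String], n: usize) -> HashMap<String, u64> {
    chunks
        .par_iter()
        .map(|chunk| {
            let mut counts = HashMap::new();
            for word in chunk.split_whitespace() {
                let chars: Vec<char> = word.chars().collect();
                for window in chars.windows(n) {
                    *counts.entry(window.iter().collect::<String>()).or_insert(0u64) += 1;
                }
            }
            counts
        })
        .reduce(HashMap::new, |mut merged, partial| {
            for (ngram, count) in partial {
                *merged.entry(ngram).or_insert(0) += count;
            }
            merged
        })
}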

CLI flag (future):

bash
# Parallel training with 8 threads
phony train large-corpus.txt \
  -o models/large.ngram \
  --parallel 8 \
  --chunk-size 64MB

When to use:

  • Training data > 100MB
  • Multi-core machine available
  • One-time model building (not interactive)

Optimization Priority

Optimization         Effort  Impact                      Priority
Streaming training   Medium  High (enables large files)  P1
Parallel training    Medium  Medium (faster training)    P2
FlatBuffers (Cloud)  High    High (instant loading)      P2
FlatBuffers (OSS)    High    Low (one-time load)         P3
