N-gram Language Models

Phony uses character-level N-gram Markov chains for generating realistic, locale-specific text. This document describes the model architecture, training process, and generation algorithms.

Core Concepts

What is a Character-Level N-gram?

Unlike word-level N-grams (which model word sequences), Phony uses character-level N-grams to model character sequences within words:

Word: "sphinx" (n=4)
─────────────────────────
s p h i n x
└─────┘         → "sphi" (first n-gram)
  └─────┘       → "phin" (middle n-gram)
    └─────┘     → "hinx" (last n-gram)

This approach enables:

  • Infinite vocabulary: Generate words never seen in training
  • Locale characteristics: Model learns "Turkish-sounding" or "English-sounding" patterns
  • Compact storage: Model file much smaller than word lists

Markov Chain Graph

Each N-gram forms a node in a directed graph with weighted edges:

┌─────────────────────────────────────────────────────────────────────────┐
│                    MARKOV CHAIN STRUCTURE                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  "sphi" ─────(w:4)────▶ "phin" ─────(w:4)────▶ "hinx"                   │
│    │                      │                      │                       │
│    │                      │                      │                       │
│    ▼                      ▼                      ▼                       │
│  children              children              lastChildren                │
│  (mid-word)            (mid-word)            (word-end)                  │
│                                                                          │
│  Key insight:                                                            │
│  • children (c)      → transitions within word                          │
│  • lastChildren (lc) → transitions that end the word                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Model Data Structure

LookupList Format

The core data structure for weighted random selection:

json
{
  "e": ["sphi", "phin", "hinx", "quar"],
  "w": [4, 3, 2, 5],
  "cw": [4, 7, 9, 14],
  "i": {"sphi": 0, "phin": 1, "hinx": 2, "quar": 3},
  "sw": 14,
  "c": 4
}

Field  Name                Description
e      elements            N-gram strings array
w      weights             Occurrence counts (frequency)
cw     cumulative weights  Running sum for binary search
i      index               Reverse lookup (ngram → array index)
sw     sum of weights      Total weight
c      count               Number of elements

Why cumulative weights? Enables O(log n) weighted random selection via binary search instead of O(n) linear scan.

Element (Graph Node) Format

json
{
  "sphi": {
    "c": {
      "e": ["phin"],
      "w": [4],
      "cw": [4],
      "sw": 4,
      "c": 1
    },
    "lc": {
      "e": [],
      "w": [],
      "cw": [],
      "sw": 0,
      "c": 0
    }
  }
}

Field  Name          Description
c      children      Next n-grams (mid-word transitions)
lc     lastChildren  Terminal n-grams (word-end transitions)
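
The short keys map naturally onto named struct fields. Below is a minimal serde sketch of these two structures (an illustration assuming the serde crate, not Phony's actual type definitions); T is String in the standalone example above and a numeric n-gram index inside the full model's graph.

rust
use serde::Deserialize;
use std::collections::HashMap;

// Illustrative mapping of the short JSON keys to named fields.
#[derive(Deserialize)]
struct LookupList<T> {
    #[serde(rename = "e")]
    elements: Vec<T>,
    #[serde(rename = "w")]
    weights: Vec<u64>,
    #[serde(rename = "cw")]
    cumulative_weights: Vec<u64>,
    #[serde(rename = "i", default)]
    index: HashMap<String, usize>, // reverse lookup; omitted in some contexts
    #[serde(rename = "sw", default)]
    sum_of_weights: u64,
    #[serde(rename = "c", default)]
    count: usize,
}

#[derive(Deserialize)]
struct Element<T> {
    #[serde(rename = "c")]
    children: LookupList<T>, // mid-word transitions
    #[serde(rename = "lc")]
    last_children: LookupList<T>, // word-ending transitions
}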

Complete Model Format

json
{
  "version": 1,
  "metadata": {
    "locale": "tr_TR",
    "domain": "person_names",
    "ngram_order": 4,
    "token_type": "char",
    "trained_on": "2024-03-15T10:00:00Z",
    "sample_count": 50000
  },
  "config": {
    "min_word_length": 3,
    "exclude_originals": true,
    "include_real_words": true,
    "position_depth": 5,
    "fallback": "prefix",
    "tokenizer": {
      "word_separator": "\\s",
      "sentence_separator": "[.!?:;]",
      "word_filter": "/[^a-zçğıöşü]+/u",
      "to_lowercase": true
    }
  },
  "ngrams": ["mehm", "ehme", "hmet", "ayşe", "sphi", "phin", "hinx", ...],
  "distributions": {
    "word_lengths": {
      "e": [4, 5, 6, 7, 8],
      "w": [120, 350, 280, 150, 100],
      "cw": [120, 470, 750, 900, 1000]
    },
    "sentence_lengths": {
      "e": [5, 6, 7, 8, 9, 10],
      "w": [50, 120, 200, 180, 100, 50],
      "cw": [50, 170, 370, 550, 650, 700]
    }
  },
  "first": { "e": [0, 3, 4, ...], "w": [...], "cw": [...] },
  "positions": {
    "1": { "e": [0, 4, 12, ...], "w": [...], "cw": [...] },
    "-1": { "e": [2, 6, 15, ...], "w": [...], "cw": [...] }
  },
  "graph": {
    "0": { "c": {"e": [1], "w": [4], "cw": [4]}, "lc": {"e": [2], "w": [3], "cw": [3]} },
    "1": { "c": {"e": [2], "w": [4], "cw": [4]}, "lc": {"e": [], "w": [], "cw": []} }
  },
  "words": {
    "excluded": [
      [0, 1, 2],
      [3]
    ],
    "real": [
      [0, 1, 2],
      [3],
      [4, 5, 6]
    ]
  }
}

N-gram Only String Storage

The only strings in the model are N-grams. Words are represented as sequences of N-gram indices:

┌─────────────────────────────────────────────────────────────────────────┐
│                    WORD AS NGRAM SEQUENCE                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ngrams: ["mehm", "ehme", "hmet", "ayşe", ...]                          │
│             0       1       2       3                                    │
│                                                                          │
│  Word "mehmet" = ngram sequence [0, 1, 2]                               │
│                                                                          │
│  Reconstruction:                                                         │
│  ────────────────                                                        │
│  ngrams[0]         = "mehm"                                             │
│  ngrams[0] + [1]   = "mehm" + "e" = "mehme"                             │
│  ngrams[0] + [1,2] = "mehme" + "t" = "mehmet" ✓                         │
│                                                                          │
│  Algorithm:                                                              │
│  ──────────                                                              │
│  word = ngrams[sequence[0]]                                             │
│  for i in sequence[1:]:                                                 │
│      word += ngrams[i][-1]   // append last char                        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Field           Description
ngrams          All unique N-gram strings (single source of truth)
words.excluded  N-gram index sequences for deduplication
words.real      N-gram index sequences for real_word generation mode

Benefits:

  • Single source of truth: Only N-grams stored as strings
  • No duplication: Words derived from existing N-grams
  • Smaller file size: Integer arrays vs repeated strings
  • Consistent: All data uses numeric indices
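
A minimal sketch of the reconstruction algorithm above (function and parameter names are illustrative):

rust
// Rebuild a word from an n-gram index sequence: take the first n-gram whole,
// then append only the last character of each following n-gram.
fn reconstruct(ngrams: &[&str], sequence: &[usize]) -> Option<String> {
    let mut word = String::from(*ngrams.get(*sequence.first()?)?);
    for &idx in &sequence[1..] {
        word.push(ngrams.get(idx)?.chars().last()?);
    }
    Some(word)
}

// Example: with ngrams ["mehm", "ehme", "hmet"], the sequence [0, 1, 2] yields "mehmet".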

Positional N-grams

A key innovation: tracking N-gram distributions by sentence position.

┌─────────────────────────────────────────────────────────────────────────┐
│                    POSITIONAL N-GRAM CONCEPT                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Sentence: "The quick brown fox jumps"                                  │
│             ▲   ▲     ▲     ▲   ▲                                       │
│             │   │     │     │   │                                       │
│           pos:1 pos:2  ...  pos:-2 pos:-1                               │
│                                                                          │
│  positions[1]  → N-grams that START sentences                           │
│  positions[2]  → N-grams common in 2nd position                         │
│  positions[-1] → N-grams that END sentences                             │
│  positions[-2] → N-grams common in 2nd-to-last position                 │
│                                                                          │
│  WHY THIS MATTERS:                                                       │
│  ─────────────────                                                       │
│  In Turkish: "bir" commonly starts sentences                            │
│              "oldu" commonly ends sentences                              │
│  In English: "The" commonly starts sentences                            │
│              "today" commonly ends sentences                             │
│                                                                          │
│  This produces more natural sentence structure.                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Configuration

json
{
  "config": {
    "position_depth": 5
  }
}

When position_depth: 5:

  • Tracks positions: 1, 2, 3, 4, 5, -5, -4, -3, -2, -1
  • Middle words use generic first distribution
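
One way to picture this: each word index in a sentence maps either to a tracked position key or to the generic first distribution. The sketch below illustrates that mapping; the exact rule (and which side wins in very short sentences) is an assumption here, not taken from Phony's source.

rust
// Map a 0-based word index to a tracked position key (1..depth or -depth..-1),
// or None for middle words, which fall back to the generic `first` distribution.
// Front positions are given precedence for short sentences (an assumption).
fn position_key(word_index: usize, word_count: usize, depth: usize) -> Option<i32> {
    let from_end = word_count - word_index; // 1 = last word, 2 = 2nd-to-last, ...
    if word_index < depth {
        Some((word_index + 1) as i32)
    } else if from_end <= depth {
        Some(-(from_end as i32))
    } else {
        None
    }
}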

Tokenization

Regex-Based Boundaries

Sentence and word boundaries are defined via regular expressions:

json
{
  "config": {
    "tokenizer": {
      "word_separator": "\\s",
      "sentence_separator": "[.!?:;]",
      "word_filter": "/[^a-zçğıöşü]+/u",
      "to_lowercase": true
    }
  }
}

Field               Description               Example
word_separator      Regex to split words      \s (whitespace)
sentence_separator  Regex to split sentences  [.!?:;]
word_filter         Regex to clean words      /[^a-z]+/ (keep only letters)
to_lowercase        Normalize case            true
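
The sketch below shows how this configuration could be applied with Rust's regex crate (an illustration, not Phony's tokenizer): the /…/u delimiters of the stored pattern are stripped, since Rust regexes are Unicode-aware by default, and min_word_length is assumed to be 3 as in the earlier config example.

rust
use regex::Regex;

// Split text into sentences, split sentences into words, then filter and normalize words.
fn tokenize(text: &str) -> Vec<Vec<String>> {
    let sentence_sep = Regex::new(r"[.!?:;]").unwrap(); // sentence_separator
    let word_sep = Regex::new(r"\s+").unwrap(); // word_separator
    let word_filter = Regex::new(r"[^a-zçğıöşü]+").unwrap(); // word_filter, delimiters stripped

    sentence_sep
        .split(text)
        .map(|sentence| {
            word_sep
                .split(sentence)
                .map(|w| word_filter.replace_all(&w.to_lowercase(), "").into_owned())
                .filter(|w| w.chars().count() >= 3) // min_word_length
                .collect::<Vec<String>>()
        })
        .filter(|words| !words.is_empty())
        .collect()
}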

Predefined Filter Types

Type          Pattern                           Use Case
ALPHABETICAL  /[^a-z]+/                         English basic
FRENCH        /[^a-zéèëêúüûùœàáäâæíïìîóöôòç]+/  French
TURKISH       /[^a-zçğıöşü]+/u                  Turkish
JAPANESE      Unicode ranges                    Japanese (hiragana, katakana, kanji)
CHINESE       Unicode ranges                    Chinese (CJK)
NO_SYMBOLS    /[^ \p{L}]+/u                     Any Unicode letter

Token Type: Character vs Word

Terminology Note: Don't confuse token_type (how the model was trained) with generation.mode (how output is produced). They are independent concepts:

  • token_type: Set at training time. Determines what a "token" is (character or word).
  • generation.mode: Set at generation time in PDL. Determines output type (word, sentence, text, etc.).

Character Token Type (Default)

N-grams are character sequences. The model learns character-by-character patterns.

Training: ["sphinx", "quarter", "quick"]
N-grams:  "sphi" → "phin" → "hinx"
          "quar" → "uart" → "arte" → "rter"
          "quic" → "uick"

Output: "sphinter" (new word, never in training!)

Use for: Names, usernames, made-up words

Word Token Type

N-grams are word sequences. The model learns word-by-word patterns (for multi-word outputs like company names).

Training corpus (company names):
  "Yılmaz Holding A.Ş."
  "Koç Holding A.Ş."
  "Sabancı Holding A.Ş."
  "Yılmaz Gıda Sanayi"
  "Koç Otomotiv Sanayi"

Word bigrams (n=2):
  ["Yılmaz", "Holding"] → ["Holding", "A.Ş."]
  ["Koç", "Holding"] → ["Holding", "A.Ş."]
  ["Sabancı", "Holding"] → ["Holding", "A.Ş."]
  ["Yılmaz", "Gıda"] → ["Gıda", "Sanayi"]
  ["Koç", "Otomotiv"] → ["Otomotiv", "Sanayi"]

Generated outputs:
  "Yılmaz Otomotiv Sanayi"  (new combination!)
  "Koç Gıda Sanayi"         (new combination!)
  "Sabancı Holding A.Ş."    (from training)

Use for: Company names, product names, multi-word phrases

Training with Word Token Type

bash
# Company names (word bigrams)
phony train companies.txt \
  -o models/companies.ngram \
  --token-type word \
  --ngram-size 2

# Product names (word trigrams for more context)
phony train products.txt \
  -o models/products.ngram \
  --token-type word \
  --ngram-size 3

Configuration

json
{
  "metadata": {
    "token_type": "char"
  }
}

Value  Description                N-gram Example
char   Character-level (default)  "meh" → "ehm" → "hme"
word   Word-level                 ["Koç", "Holding"] → ["Holding", "A.Ş."]

Note: token_type is a model property set at training time. It cannot be changed when using the model. Use --token-type word during training for multi-word phrase generation (company names, product names).


Design Rationale

Why Single N-gram Order?

Phony uses a single N-gram order (e.g., only 4-grams) rather than multi-order approaches. This was a deliberate design choice after evaluating academic smoothing methods.

┌─────────────────────────────────────────────────────────────────────────┐
│                    THE LOREM IPSUM PRINCIPLE                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Phony generates FAKE data that LOOKS LIKE real data.                   │
│  It doesn't need to BE real data.                                       │
│                                                                          │
│  Priority 1: Speed (generate millions of records fast)                  │
│  Priority 2: Plausibility (output looks locale-appropriate)             │
│  Priority 3: NOT coverage (we don't need every possible word)           │
│                                                                          │
│  Academic smoothing methods optimize for coverage.                       │
│  We optimize for speed and plausibility.                                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Evaluated But Rejected

We evaluated several academic approaches before deciding on simplicity:

Method                What It Does                                               Why Not For Phony
Kneser-Ney Smoothing  Redistributes probability mass using continuation counts  Complexity overhead, diminishing returns for fake data
Modified Kneser-Ney   Three discount parameters (D₁, D₂, D₃₊)                    Even more complexity, marginal quality gain
Interpolation         λ-weighted mix of all n-gram orders                        Requires storing all orders (2x-4x model size)
Stupid Backoff        Fixed discount (0.4) without normalization                 Still requires multi-order storage

References:

  • Chen & Goodman (1999). "An Empirical Study of Smoothing Techniques for Language Modeling"
  • Kneser & Ney (1995). "Improved backing-off for M-gram language modeling"

Our Approach: KISS

Single n-gram order + Simple prefix-match fallback
────────────────────────────────────────────────────
✓ One model file (not separate files per order)
✓ Smaller memory footprint
✓ Faster generation (no interpolation math)
✓ Good enough for fake data

Locale-Based N-gram Defaults

Different languages have different optimal N-gram sizes:

Locale  Default N  Rationale
en_*    3          Shorter words, common patterns
tr_TR   4          Agglutinative, longer morphemes
de_DE   4          Compound words, longer stems
ja_JP   2          Character-based (hiragana/katakana)
zh_*    2          Character-based (hanzi)
ar_*    3          Root-pattern morphology

These are defaults; users can override with --ngram-size flag.


Fallback Strategy

The Sparse Data Problem

When an N-gram has no recorded transitions:

Request: word starting with "xyz"
Model: No N-grams starting with "xyz"
Result: null (failure)

Prefix-Match Fallback

Since we store only single-order N-grams, we use prefix matching as fallback:

┌─────────────────────────────────────────────────────────────────────────┐
│                    PREFIX-MATCH FALLBACK                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Looking for 4-gram starting with "xyz"                                 │
│      ↓                                                                   │
│  Search: Any 4-gram matching "xyz*" → found "xyza" → use it            │
│      ↓ (if not found)                                                   │
│  Search: Any 4-gram matching "xy**" → found "xyab" → use it            │
│      ↓ (if not found)                                                   │
│  Search: Any 4-gram matching "x***" → found "xabc" → use it            │
│      ↓ (if not found)                                                   │
│  Use random from `first` distribution                                   │
│                                                                          │
│  Key insight: We stay within single n-gram order.                       │
│  No 3-grams or 2-grams needed.                                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
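
A minimal sketch of this fallback, assuming the model's n-gram strings are available as a slice (a real implementation would pick a weighted random element among the matches rather than the first one):

rust
// Try the full prefix first, then progressively shorter prefixes, staying within
// the single n-gram order. Returns None when no n-gram shares even the first character.
fn find_by_prefix<'a>(ngrams: &'a [String], wanted: &str) -> Option<&'a str> {
    let chars: Vec<char> = wanted.chars().collect();
    for len in (1..=chars.len()).rev() {
        let prefix: String = chars[..len].iter().collect();
        if let Some(hit) = ngrams.iter().find(|g| g.starts_with(&prefix)) {
            return Some(hit.as_str());
        }
    }
    None // caller falls back to a random pick from the `first` distribution
}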

Configuration

json
{
  "config": {
    "fallback": "prefix"
  }
}

Value   Description
none    Return null on miss (strict mode)
prefix  Prefix-match within same n-gram order (default)

Generation Modes

Word Generation

word(lengthHint?, position?, startsWith?) → string | null

Parameter   Type    Description
lengthHint  int     Target length (soft constraint)
position    int     Sentence position (1, 2, -1, -2)
startsWith  string  Required prefix

Algorithm:

  1. Select starting N-gram (filtered by startsWith if provided)
  2. Select target length from word_lengths distribution
  3. Loop: pick weighted random child, append last character
  4. When length ≥ hint and lastChildren exists, terminate
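
The loop can be sketched as follows; the Node type and uniform child selection are simplifications (the real model stores weighted LookupLists and uses the weighted selection shown in the Binary Search Weighted Random section):

rust
use std::collections::HashMap;
use rand::Rng;

// Simplified graph node mirroring the "c"/"lc" split (weights omitted for brevity).
struct Node {
    children: Vec<String>,      // mid-word transitions
    last_children: Vec<String>, // word-ending transitions
}

// Walk the chain from a starting n-gram, appending one character per step,
// and prefer a word-ending transition once the length hint is reached.
fn generate_word(
    graph: &HashMap<String, Node>,
    start: &str,
    length_hint: usize,
    rng: &mut impl Rng,
) -> String {
    let mut current = start.to_string();
    let mut word = current.clone();

    loop {
        let node = match graph.get(&current) {
            Some(n) => n,
            None => return word, // unknown n-gram: terminate
        };
        if word.chars().count() >= length_hint && !node.last_children.is_empty() {
            let last = &node.last_children[rng.gen_range(0..node.last_children.len())];
            word.push(last.chars().last().unwrap());
            return word;
        }
        if node.children.is_empty() {
            return word; // dead end: terminate early
        }
        let next = node.children[rng.gen_range(0..node.children.len())].clone();
        word.push(next.chars().last().unwrap());
        current = next;
    }
}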

Sentence Generation

sentence(numberOfWords?, startsWith?, endingPunctuation?) → string

Parameter          Type          Description
numberOfWords      int           Word count
startsWith         string        First word prefix
endingPunctuation  string|array  "." or [".", "!", "?"]

Algorithm:

  1. Generate position-aware words for first N positions
  2. Generate generic words for middle
  3. Generate position-aware words for last N positions
  4. Capitalize first letter, append punctuation

Text Generation

text(maxNumberOfCharacters?, sentenceEndingPunctuation?, suffix?) → string

Parameter                  Type          Description
maxNumberOfCharacters      int           Maximum output length
sentenceEndingPunctuation  string|array  Punctuation options
suffix                     string        Truncation suffix ("...")

Poem Generation

poem(numberOfVerses?, stanzaLength?, maximumNumberOfWords?, endingPunctuation?) → string

Parameter             Type          Description
numberOfVerses        int           Total lines
stanzaLength          int           Lines per stanza (4 = quatrain)
maximumNumberOfWords  int           Max words per line
endingPunctuation     string|array  Line endings

Acrostic Generation

acrosticPoem(initials, maximumNumberOfWords?, endingPunctuation?) → string

Parameter             Type          Description
initials              string        Word to spell (required)
maximumNumberOfWords  int           Max words per line
endingPunctuation     string|array  Line endings

Example:

Input: initials = "HELLO"

Output:
Happiness fills the morning air.
Every bird sings a joyful tune.
Laughter echoes through the valley.
Light dances on the water.
Open your heart to the world.

Constrained Generation

Phony supports multiple constraint types for controlled output:

Constraint  Description           Example
startsWith  Prefix constraint     startsWith: "A" → "Ahmet"
lengthHint  Soft length target    lengthHint: 6 → ~6 chars
min_length  Hard minimum          min_length: 4 → ≥4 chars
max_length  Hard maximum          max_length: 12 → ≤12 chars
endsWith    Suffix constraint     endsWith: "oğlu"
contains    Substring constraint  contains: "han"
position    Sentence position     position: 1 → sentence-start style

Soft vs Hard Constraints

  • Soft (lengthHint): Guides generation, may overshoot
  • Hard (min_length, max_length): Enforced strictly, may return null
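
Hard constraints can be pictured as rejection sampling around any generation mode; the sketch below illustrates that idea (the retry limit and names are assumptions, not Phony's actual behavior):

rust
// Retry the generator until the hard length constraints are satisfied,
// returning None if they cannot be met within the attempt budget.
fn generate_with_constraints(
    mut generate: impl FnMut() -> Option<String>,
    min_length: usize,
    max_length: usize,
    max_attempts: usize,
) -> Option<String> {
    for _ in 0..max_attempts {
        if let Some(candidate) = generate() {
            let len = candidate.chars().count();
            if len >= min_length && len <= max_length {
                return Some(candidate);
            }
        }
    }
    None // hard constraints could not be satisfied
}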

Binary Search Weighted Random

The key algorithm for O(log n) selection:

┌─────────────────────────────────────────────────────────────────────────┐
│                    WEIGHTED RANDOM SELECTION                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Elements: ["a", "b", "c", "d"]                                         │
│  Weights:  [1,   2,   3,   4]                                           │
│  Cumulative: [1, 3, 6, 10]                                              │
│                                                                          │
│  Random(0-10) = 5                                                        │
│                                                                          │
│  Binary search for first cw ≥ 5:                                        │
│                                                                          │
│  low=0, high=3                                                          │
│  mid=1, cw[1]=3 < 5 → low=2                                             │
│  mid=2, cw[2]=6 ≥ 5 → found!                                            │
│                                                                          │
│  Return: elements[2] = "c"                                              │
│                                                                          │
│  Probability distribution:                                               │
│  a: 10%, b: 20%, c: 30%, d: 40%                                         │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
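
A self-contained sketch of this selection over a cw array, assuming the rand crate (names are illustrative):

rust
use rand::Rng;

// Draw r in [1, total] and return the element at the first index where cumulative[i] >= r.
// An element with weight w is therefore chosen with probability w / total.
fn weighted_pick<'a, T>(elements: &'a [T], cumulative: &[u64], rng: &mut impl Rng) -> Option<&'a T> {
    let total = *cumulative.last()?;
    if total == 0 {
        return None;
    }
    let r = rng.gen_range(1..=total);
    let idx = cumulative.partition_point(|&cw| cw < r); // binary search: O(log n)
    elements.get(idx)
}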

Training Configuration

CLI Training

bash
# Basic training
phony train names.txt -o models/names.ngram

# With options
phony train names.txt \
  --output models/tr_TR/names.ngram \
  --ngram-size 4 \
  --min-word-length 3 \
  --position-depth 5 \
  --token-type char \
  --locale tr_TR \
  --exclude-originals \
  --include-real-words \
  --fallback prefix \
  --word-separator '\s' \
  --sentence-separator '[.!?]' \
  --word-filter '/[^a-zçğıöşü]+/u' \
  --lowercase

Training Flags

Flag                         Description
--ngram-size N               N-gram order (default: 4)
--token-type char|word       Token type (char=character-level, word=word-level)
--min-word-length N          Filter short words
--position-depth N           Sentence position tracking depth
--exclude-originals          Store training words for deduplication
--include-real-words         Embed training words for real_word mode
--fallback none|prefix       Fallback strategy for sparse data
--word-separator REGEX       Word boundary regex
--sentence-separator REGEX   Sentence boundary regex
--word-filter REGEX          Character filter regex
--lowercase                  Normalize to lowercase

Training from Different Sources

bash
# From CSV column
phony train customers.csv --column first_name -o models/names.ngram

# From JSON path
phony train data.json --path "$.users[*].name" -o models/names.ngram

# From multiple files
phony train names/*.txt -o models/names.ngram

PDL Integration

Model generators must specify a generation block. There is no default mode; omitting it causes a validation error.

Generation Block Syntax

json
{
  "type": "model",
  "source": "path/to/model.ngram",
  "generation": {
    "mode": "word | sentence | text | paragraph | poem | acrostic | real_word",
    "params": { ... }
  },
  "constraints": { ... }
}

Structure:

  • generation.mode (required): How to produce output
  • generation.params (optional): Mode-specific parameters (e.g., word_count, starts_with)
  • constraints (optional): Output validation (e.g., min_length, max_length - rejects if not met)

Available Modes

Mode       Output              Required Params  Optional Params
word       Single word         -                length_hint, starts_with, position
words      Multiple words      -                count, length_hint, starts_with
sentence   Single sentence     -                word_count, starts_with, punctuation
sentences  Multiple sentences  -                count, word_count, punctuation
text       Truncated text      -                max_chars, suffix, punctuation
paragraph  Paragraph           -                sentence_count, punctuation
poem       Poem with stanzas   -                verses, stanza_length, max_words, punctuation
acrostic   Acrostic poem       initials         max_words, punctuation
real_word  Pick from training  -                starts_with

Dynamic Parameters with Other Generators

Generation parameters can reference other generators using {{generator_name}}:

json
{
  "generators": {
    "word_count": {
      "type": "logic",
      "algorithm": "int_between",
      "params": { "min": 3, "max": 8 }
    },
    "sentence_count": {
      "type": "logic",
      "algorithm": "int_between",
      "params": { "min": 2, "max": 5 }
    },
    "bio": {
      "type": "model",
      "source": "models/text.ngram",
      "generation": {
        "mode": "sentence",
        "params": {
          "word_count": "{{word_count}}",
          "punctuation": [".", "!"]
        }
      }
    },
    "description": {
      "type": "model",
      "source": "models/text.ngram",
      "generation": {
        "mode": "paragraph",
        "params": {
          "sentence_count": "{{sentence_count}}"
        }
      }
    }
  }
}

Inline Random Ranges

Parameters also support inline random syntax:

json
{
  "generators": {
    "tagline": {
      "type": "model",
      "source": "models/slogans.ngram",
      "generation": {
        "mode": "sentence",
        "params": {
          "word_count": "{{number:3-8}}",
          "punctuation": "{{pick(['!', '.', '...'])}}"
        }
      }
    }
  }
}

Complete Examples

Simple word generation:

json
{
  "first_name": {
    "type": "model",
    "source": "models/names.ngram",
    "generation": { "mode": "word" },
    "constraints": { "min_length": 3, "max_length": 12 }
  }
}

Sentence with dynamic word count:

json
{
  "motto": {
    "type": "model",
    "source": "models/quotes.ngram",
    "generation": {
      "mode": "sentence",
      "params": {
        "word_count": "{{number:5-12}}",
        "starts_with": "{{pick(['A', 'B', 'C'])}}",
        "punctuation": [".", "!"]
      }
    }
  }
}

Poem with configurable structure:

json
{
  "haiku": {
    "type": "model",
    "source": "models/poetry.ngram",
    "generation": {
      "mode": "poem",
      "params": {
        "verses": 3,
        "stanza_length": 0,
        "max_words": "{{number:3-5}}",
        "punctuation": ""
      }
    }
  }
}

Acrostic poem:

json
{
  "welcome_poem": {
    "type": "model",
    "source": "models/poetry.ngram",
    "generation": {
      "mode": "acrostic",
      "params": {
        "initials": "WELCOME",
        "max_words": "{{number:4-8}}",
        "punctuation": "."
      }
    }
  }
}

Pick real word from training data:

json
{
  "known_name": {
    "type": "model",
    "source": "models/names.ngram",
    "generation": {
      "mode": "real_word",
      "params": {
        "starts_with": "A"
      }
    }
  }
}

Mode Selection Guide

┌─────────────────────────────────────────────────────────────────────────┐
│                    GENERATION MODE SELECTION                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Need a single generated item?                                          │
│  ├── Name, username, made-up word    → mode: "word"                     │
│  ├── Company name (multi-word)       → mode: "word" + word-mode model   │
│  └── Known/real name from data       → mode: "real_word"                │
│                                                                          │
│  Need text content?                                                      │
│  ├── Short tagline, motto            → mode: "sentence"                 │
│  ├── Bio, description (fixed length) → mode: "text" + max_chars         │
│  └── Article, review                 → mode: "paragraph"                │
│                                                                          │
│  Need creative/structured text?                                          │
│  ├── Poem with stanzas               → mode: "poem"                     │
│  └── Hidden message in initials      → mode: "acrostic"                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Performance Characteristics

Operation        Complexity    Notes
Training (feed)  O(n × k)      n = chars, k = ngram size
Calculate        O(m log m)    m = unique ngrams (sorting)
Weighted random  O(log m)      Binary search
Word generation  O(length)     Linear in output length
Model size       100KB - 10MB  Depends on training corpus

File Format

N-gram models are stored as .ngram files:

model.ngram
├── Gzip compressed
└── JSON structure (as described above)

The .ngram extension indicates:

  • Machine-trained statistical model
  • Not human-editable (unlike lists)
  • Locale-specific content
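
Reading such a file back is a matter of decompressing and parsing. A minimal sketch, assuming the flate2 and serde_json crates (this is an illustration, not Phony's loader):

rust
use flate2::read::GzDecoder;
use std::fs::File;

// Open a .ngram file: gzip-decompress, then parse the JSON payload.
fn load_model(path: &str) -> Result<serde_json::Value, Box<dyn std::error::Error>> {
    let file = File::open(path)?;
    let decoder = GzDecoder::new(file);
    // A real loader would deserialize into typed structs instead of a generic Value.
    let model: serde_json::Value = serde_json::from_reader(decoder)?;
    Ok(model)
}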

Future Optimizations

FlatBuffers for Zero-Copy Loading

Current: JSON + Gzip requires parsing and memory allocation during load.

Future: FlatBuffers format for instant model loading in Cloud engine.

┌─────────────────────────────────────────────────────────────────────────┐
│                    ZERO-COPY DESERIALIZATION                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Current (JSON + Gzip):                                                 │
│  ──────────────────────                                                 │
│  1. Read file from disk        (~10ms)                                  │
│  2. Decompress gzip            (~20ms)                                  │
│  3. Parse JSON text            (~50-100ms) ← Bottleneck                 │
│  4. Build data structures      (~10ms)                                  │
│  Total: ~100-150ms                                                      │
│                                                                          │
│  Future (FlatBuffers):                                                  │
│  ─────────────────────                                                  │
│  1. mmap() file into memory    (~1ms)                                   │
│  2. Done. No parsing needed.                                            │
│  Total: ~1-5ms                                                          │
│                                                                          │
│  WHY IT WORKS:                                                          │
│  ─────────────                                                          │
│  FlatBuffers file format IS the memory format.                          │
│  No conversion needed - just point to the data.                         │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Proposed dual-format support:

Extension  Format       Use Case
.ngram     JSON + Gzip  OSS libraries, debugging, human inspection
.ngram.fb  FlatBuffers  Cloud engine, high-performance scenarios

Benefits for Cloud (5M rec/sec):

  • Instant model switching between locales
  • Lower memory pressure (no parsing overhead)
  • Better cache utilization


Streaming Training for Large Files

Current: Entire training file loaded into memory before processing.

Future: Stream-based training for arbitrarily large files.

┌─────────────────────────────────────────────────────────────────────────┐
│                    STREAMING TRAINING PIPELINE                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Current (Batch):                                                       │
│  ─────────────────                                                      │
│  [Read entire file] → [Process all] → [Build model]                    │
│  Memory: O(file_size)                                                   │
│  Problem: 10GB file = 10GB+ RAM needed                                  │
│                                                                          │
│  Future (Streaming):                                                    │
│  ──────────────────                                                     │
│  [Read chunk 1] → [Update model] ─┐                                    │
│  [Read chunk 2] → [Update model] ─┤→ [Finalize model]                  │
│  [Read chunk 3] → [Update model] ─┘                                    │
│  Memory: O(chunk_size + model_size)                                     │
│  Benefit: 10GB file with 64MB chunks = ~100MB RAM                       │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Implementation approach:

rust
// Pseudocode for streaming trainer
fn train_streaming(path: &str, chunk_size: usize) -> Model {
    let mut model = Model::new();

    for chunk in read_chunks(path, chunk_size) {
        let partial_counts = count_ngrams(&chunk);
        model.merge_counts(partial_counts);
    }

    model.finalize();  // Calculate cumulative weights, etc.
    model
}

Key considerations:

  • N-grams at chunk boundaries: Overlap chunks by n-1 characters
  • Memory for counts: Hash map grows with unique N-grams (bounded)
  • Finalization: Single pass to compute cumulative weights
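
The boundary-overlap consideration above can be sketched as a chunked reader that carries the last n-1 characters of each chunk into the next one (illustrative names; UTF-8 boundary handling is deliberately simplified):

rust
use std::fs::File;
use std::io::{BufReader, Read};

// Feed each chunk to a callback, prepending the previous chunk's last n-1 characters
// so that n-grams spanning a chunk boundary are not lost.
fn for_each_chunk(
    path: &str,
    chunk_size: usize,
    n: usize,
    mut handle: impl FnMut(&str),
) -> std::io::Result<()> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut buf = vec![0u8; chunk_size];
    let mut carry = String::new();

    loop {
        let read = reader.read(&mut buf)?;
        if read == 0 {
            break;
        }
        // NOTE: assumes chunk boundaries never split a multi-byte UTF-8 character;
        // a real implementation must buffer incomplete code points.
        let chunk = format!("{}{}", carry, String::from_utf8_lossy(&buf[..read]));
        handle(&chunk);
        carry = chunk
            .chars()
            .rev()
            .take(n.saturating_sub(1))
            .collect::<Vec<_>>()
            .into_iter()
            .rev()
            .collect();
    }
    Ok(())
}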

Parallel Training

Future enhancement: Multi-threaded training for very large corpora.

┌─────────────────────────────────────────────────────────────────────────┐
│                    PARALLEL TRAINING                                     │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Thread 1: [Chunk 1] → [Count N-grams] ─┐                               │
│  Thread 2: [Chunk 2] → [Count N-grams] ─┼→ [Merge] → [Model]           │
│  Thread 3: [Chunk 3] → [Count N-grams] ─┤                               │
│  Thread 4: [Chunk 4] → [Count N-grams] ─┘                               │
│                                                                          │
│  Speedup: ~linear with cores (N-gram counting is embarrassingly parallel)│
│                                                                          │
│  MERGE STRATEGY:                                                        │
│  ───────────────                                                        │
│  • Each thread produces partial frequency counts                        │
│  • Main thread merges counts (simple addition)                          │
│  • Finalize: compute cumulative weights once                            │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
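
The merge strategy lends itself to a map/reduce shape. A sketch using the rayon crate (an assumption about a possible implementation, not Phony's code):

rust
use rayon::prelude::*;
use std::collections::HashMap;

// Count character n-grams per chunk in parallel, then merge the partial
// frequency maps by simple addition.
fn count_char_ngrams(chunks: &[String], n: usize) -> HashMap<String, u64> {
    chunks
        .par_iter()
        .map(|chunk| {
            let mut counts = HashMap::new();
            for word in chunk.split_whitespace() {
                let chars: Vec<char> = word.chars().collect();
                for window in chars.windows(n) {
                    *counts.entry(window.iter().collect::<String>()).or_insert(0u64) += 1;
                }
            }
            counts
        })
        .reduce(HashMap::new, |mut merged, partial| {
            for (ngram, count) in partial {
                *merged.entry(ngram).or_insert(0) += count;
            }
            merged
        })
}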

CLI flag (future):

bash
# Parallel training with 8 threads
phony train large-corpus.txt \
  -o models/large.ngram \
  --parallel 8 \
  --chunk-size 64MB

When to use:

  • Training data > 100MB
  • Multi-core machine available
  • One-time model building (not interactive)

Optimization Priority

Optimization         Effort  Impact                      Priority
Streaming training   Medium  High (enables large files)  P1
Parallel training    Medium  Medium (faster training)    P2
FlatBuffers (Cloud)  High    High (instant loading)      P2
FlatBuffers (OSS)    High    Low (one-time load)         P3
