N-gram Language Models
Phony uses character-level N-gram Markov chains for generating realistic, locale-specific text. This document describes the model architecture, training process, and generation algorithms.
Core Concepts
What is a Character-Level N-gram?
Unlike word-level N-grams (which model word sequences), Phony uses character-level N-grams to model character sequences within words:
Word: "sphinx" (n=4)
─────────────────────────
s p h i n x
└───┘ → "sphi" (first n-gram)
└───┘ → "phin" (middle n-gram)
└───┘ → "hinx" (last n-gram)

This approach enables:
- Infinite vocabulary: Generate words never seen in training
- Locale characteristics: Model learns "Turkish-sounding" or "English-sounding" patterns
- Compact storage: Model file much smaller than word lists
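To make the sliding window concrete, here is a minimal Rust sketch (Rust being the language of the pseudocode later in this document) that extracts consecutive character 4-grams from a word; the function name and signature are illustrative, not Phony's API.

```rust
/// Slide a window of n characters across a word: "sphinx", n = 4
/// → ["sphi", "phin", "hinx"]. Consecutive pairs become graph edges;
/// the final n-gram also feeds the word-ending (lastChildren) counts.
fn char_ngrams(word: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    if n == 0 || chars.len() < n {
        return Vec::new();
    }
    chars.windows(n).map(|w| w.iter().collect()).collect()
}

fn main() {
    assert_eq!(char_ngrams("sphinx", 4), vec!["sphi", "phin", "hinx"]);
}
```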
Markov Chain Graph
Each N-gram forms a node in a directed graph with weighted edges:
┌─────────────────────────────────────────────────────────────────────────┐
│ MARKOV CHAIN STRUCTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ "sphi" ─────(w:4)────▶ "phin" ─────(w:4)────▶ "hinx" │
│ │ │ │ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ children children lastChildren │
│ (mid-word) (mid-word) (word-end) │
│ │
│ Key insight: │
│ • children (c) → transitions within word │
│ • lastChildren (lc) → transitions that end the word │
│ │
└─────────────────────────────────────────────────────────────────────────┘

Model Data Structure
LookupList Format
The core data structure for weighted random selection:
{
"e": ["sphi", "phin", "hinx", "quar"],
"w": [4, 3, 2, 5],
"cw": [4, 7, 9, 14],
"i": {"sphi": 0, "phin": 1, "hinx": 2, "quar": 3},
"sw": 14,
"c": 4
}

| Field | Name | Description |
|---|---|---|
| e | elements | N-gram strings array |
| w | weights | Occurrence counts (frequency) |
| cw | cumulative weights | Running sum for binary search |
| i | index | Reverse lookup (ngram → array index) |
| sw | sum of weights | Total weight |
| c | count | Number of elements |
Why cumulative weights? Enables O(log n) weighted random selection via binary search instead of O(n) linear scan.
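A minimal sketch of building such a structure from (n-gram, count) pairs; the struct and field names simply mirror the JSON keys above, while the builder itself is illustrative rather than Phony's actual code.

```rust
use std::collections::HashMap;

/// In-memory mirror of the LookupList fields: e, w, cw, i, sw, c.
struct LookupList {
    e: Vec<String>,            // elements (n-gram strings)
    w: Vec<u64>,               // weights (occurrence counts)
    cw: Vec<u64>,              // cumulative weights, for binary search
    i: HashMap<String, usize>, // reverse index: n-gram → position in `e`
    sw: u64,                   // sum of weights
    c: usize,                  // element count
}

fn build_lookup_list(counts: &[(&str, u64)]) -> LookupList {
    let mut list = LookupList {
        e: Vec::new(),
        w: Vec::new(),
        cw: Vec::new(),
        i: HashMap::new(),
        sw: 0,
        c: 0,
    };
    for (ngram, weight) in counts {
        list.i.insert((*ngram).to_string(), list.e.len());
        list.e.push((*ngram).to_string());
        list.w.push(*weight);
        list.sw += *weight;      // the running total doubles as the cumulative entry
        list.cw.push(list.sw);
    }
    list.c = list.e.len();
    list
}

fn main() {
    let list = build_lookup_list(&[("sphi", 4), ("phin", 3), ("hinx", 2), ("quar", 5)]);
    assert_eq!(list.cw, vec![4, 7, 9, 14]); // matches the example above
    assert_eq!(list.sw, 14);
}
```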
Element (Graph Node) Format
{
"sphi": {
"c": {
"e": ["phin"],
"w": [4],
"cw": [4],
"sw": 4,
"c": 1
},
"lc": {
"e": [],
"w": [],
"cw": [],
"sw": 0,
"c": 0
}
}
}

| Field | Name | Description |
|---|---|---|
| c | children | Next n-grams (mid-word transitions) |
| lc | lastChildren | Terminal n-grams (word-end transitions) |
Complete Model Format
{
"version": 1,
"metadata": {
"locale": "tr_TR",
"domain": "person_names",
"ngram_order": 4,
"token_type": "char",
"trained_on": "2024-03-15T10:00:00Z",
"sample_count": 50000
},
"config": {
"min_word_length": 3,
"exclude_originals": true,
"include_real_words": true,
"position_depth": 5,
"fallback": "prefix",
"tokenizer": {
"word_separator": "\\s",
"sentence_separator": "[.!?:;]",
"word_filter": "/[^a-zçğıöşü]+/u",
"to_lowercase": true
}
},
"ngrams": ["mehm", "ehme", "hmet", "ayşe", "sphi", "phin", "hinx", ...],
"distributions": {
"word_lengths": {
"e": [4, 5, 6, 7, 8],
"w": [120, 350, 280, 150, 100],
"cw": [120, 470, 750, 900, 1000]
},
"sentence_lengths": {
"e": [5, 6, 7, 8, 9, 10],
"w": [50, 120, 200, 180, 100, 50],
"cw": [50, 170, 370, 550, 650, 700]
}
},
"first": { "e": [0, 3, 4, ...], "w": [...], "cw": [...] },
"positions": {
"1": { "e": [0, 4, 12, ...], "w": [...], "cw": [...] },
"-1": { "e": [2, 6, 15, ...], "w": [...], "cw": [...] }
},
"graph": {
"0": { "c": {"e": [1], "w": [4], "cw": [4]}, "lc": {"e": [2], "w": [3], "cw": [3]} },
"1": { "c": {"e": [2], "w": [4], "cw": [4]}, "lc": {"e": [], "w": [], "cw": []} }
},
"words": {
"excluded": [
[0, 1, 2],
[3]
],
"real": [
[0, 1, 2],
[3],
[4, 5, 6]
]
}
}

N-gram Only String Storage
The only strings in the model are N-grams. Words are represented as sequences of N-gram indices:
┌─────────────────────────────────────────────────────────────────────────┐
│ WORD AS NGRAM SEQUENCE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ngrams: ["mehm", "ehme", "hmet", "ayşe", ...] │
│ 0 1 2 3 │
│ │
│ Word "mehmet" = ngram sequence [0, 1, 2] │
│ │
│ Reconstruction: │
│ ──────────────── │
│ ngrams[0] = "mehm" │
│ ngrams[0] + [1] = "mehm" + "e" = "mehme" │
│ ngrams[0] + [1,2] = "mehme" + "t" = "mehmet" ✓ │
│ │
│ Algorithm: │
│ ────────── │
│ word = ngrams[sequence[0]] │
│ for i in sequence[1:]: │
│ word += ngrams[i][-1] // append last char │
│ │
└─────────────────────────────────────────────────────────────────────────┘

| Field | Description |
|---|---|
| ngrams | All unique N-gram strings (single source of truth) |
| words.excluded | N-gram index sequences for deduplication |
| words.real | N-gram index sequences for real_word generation mode |
Benefits:
- Single source of truth: Only N-grams stored as strings
- No duplication: Words derived from existing N-grams
- Smaller file size: Integer arrays vs repeated strings
- Consistent: All data uses numeric indices
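The reconstruction algorithm above, written out as a small Rust sketch (the function name is illustrative):

```rust
/// Rebuild a word from its n-gram index sequence: take the first n-gram
/// whole, then append only the last character of each subsequent n-gram.
fn reconstruct(ngrams: &[&str], sequence: &[usize]) -> Option<String> {
    let mut word = String::from(*ngrams.get(*sequence.first()?)?);
    for &idx in &sequence[1..] {
        word.push(ngrams.get(idx)?.chars().last()?);
    }
    Some(word)
}

fn main() {
    let ngrams = ["mehm", "ehme", "hmet", "ayşe"];
    assert_eq!(reconstruct(&ngrams, &[0, 1, 2]).as_deref(), Some("mehmet"));
}
```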
Positional N-grams
A key innovation: tracking N-gram distributions by sentence position.
┌─────────────────────────────────────────────────────────────────────────┐
│ POSITIONAL N-GRAM CONCEPT │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Sentence: "The quick brown fox jumps" │
│ ▲ ▲ ▲ ▲ ▲ │
│ │ │ │ │ │ │
│ pos:1 pos:2 ... pos:-2 pos:-1 │
│ │
│ positions[1] → N-grams that START sentences │
│ positions[2] → N-grams common in 2nd position │
│ positions[-1] → N-grams that END sentences │
│ positions[-2] → N-grams common in 2nd-to-last position │
│ │
│ WHY THIS MATTERS: │
│ ───────────────── │
│ In Turkish: "bir" commonly starts sentences │
│ "oldu" commonly ends sentences │
│ In English: "The" commonly starts sentences │
│ "today" commonly ends sentences │
│ │
│ This produces more natural sentence structure. │
│ │
└─────────────────────────────────────────────────────────────────────────┘

Configuration
{
"config": {
"position_depth": 5
}
}

When position_depth: 5:
- Tracks positions: 1, 2, 3, 4, 5, -5, -4, -3, -2, -1
- Middle words use the generic first distribution
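A small sketch of how a word's index inside a sentence could map to its positional bucket under position_depth; the tie-breaking for very short sentences (start buckets taking priority over end buckets) is an assumption made for illustration, not documented behavior.

```rust
/// Map a 0-based word index to its positional bucket: 1..=depth from the
/// start, -depth..=-1 from the end, or None for the generic middle
/// (which falls back to the `first` distribution).
fn position_key(index: usize, sentence_len: usize, depth: usize) -> Option<i32> {
    let from_start = index + 1;                         // 1-based from the front
    let from_end = index as i64 - sentence_len as i64;  // -1 for the last word
    if from_start <= depth {
        Some(from_start as i32)
    } else if (-from_end) as usize <= depth {
        Some(from_end as i32)
    } else {
        None
    }
}

fn main() {
    // "The quick brown fox jumps" with depth 2 → 1, 2, None, -2, -1
    let keys: Vec<_> = (0..5).map(|i| position_key(i, 5, 2)).collect();
    assert_eq!(keys, vec![Some(1), Some(2), None, Some(-2), Some(-1)]);
}
```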
Tokenization
Regex-Based Boundaries
Sentence and word boundaries are defined via regular expressions:
{
"config": {
"tokenizer": {
"word_separator": "\\s",
"sentence_separator": "[.!?:;]",
"word_filter": "/[^a-zçğıöşü]+/u",
"to_lowercase": true
}
}
}

| Field | Description | Example |
|---|---|---|
| word_separator | Regex to split words | \s (whitespace) |
| sentence_separator | Regex to split sentences | [.!?:;] |
| word_filter | Regex to clean words | /[^a-z]+/ (keep only letters) |
| to_lowercase | Normalize case | true |
Predefined Filter Types
| Type | Pattern | Use Case |
|---|---|---|
| ALPHABETICAL | /[^a-z]+/ | English basic |
| FRENCH | /[^a-zéèëêúüûùœàáäâæíïìîóöôòç]+/ | French |
| TURKISH | /[^a-zçğıöşü]+/u | Turkish |
| JAPANESE | Unicode ranges | Japanese (hiragana, katakana, kanji) |
| CHINESE | Unicode ranges | Chinese (CJK) |
| NO_SYMBOLS | /[^ \p{L}]+/u | Any Unicode letter |
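A minimal tokenizer sketch using the Rust regex crate, wired to the Turkish patterns from the example config above; Phony's actual tokenizer may differ, so treat the function and its hard-coded patterns as illustrative.

```rust
use regex::Regex; // external assumption: the regex crate

/// Split text into sentences, then words, then filter and lowercase each
/// word, mirroring the tokenizer config fields (sentence_separator,
/// word_separator, word_filter, to_lowercase).
fn tokenize(text: &str) -> Vec<Vec<String>> {
    let sentence_sep = Regex::new(r"[.!?:;]").unwrap();
    let word_sep = Regex::new(r"\s+").unwrap();
    let word_filter = Regex::new(r"[^a-zçğıöşü]+").unwrap(); // strip disallowed chars

    sentence_sep
        .split(text)
        .map(|sentence| {
            word_sep
                .split(sentence)
                .map(|w| word_filter.replace_all(&w.to_lowercase(), "").to_string())
                .filter(|w| !w.is_empty())
                .collect::<Vec<String>>()
        })
        .filter(|words| !words.is_empty())
        .collect()
}

fn main() {
    let sentences = tokenize("Ayşe geldi. Mehmet gitti!");
    assert_eq!(sentences, vec![vec!["ayşe", "geldi"], vec!["mehmet", "gitti"]]);
}
```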
Token Type: Character vs Word
Terminology Note: Don't confuse token_type (how the model was trained) with generation.mode (how output is produced). They are independent concepts:
- token_type: Set at training time. Determines what a "token" is (character or word).
- generation.mode: Set at generation time in PDL. Determines output type (word, sentence, text, etc.).
Character Token Type (Default)
N-grams are character sequences. The model learns character-by-character patterns.
Training: ["sphinx", "quarter", "quick"]
N-grams: "sphi" → "phin" → "hinx"
"quar" → "uart" → "arte" → "rter"
"quic" → "uick"
Output: "sphinter" (new word, never in training!)

Use for: Names, usernames, made-up words
Word Token Type
N-grams are word sequences. The model learns word-by-word patterns (for multi-word outputs like company names).
Training corpus (company names):
"Yılmaz Holding A.Ş."
"Koç Holding A.Ş."
"Sabancı Holding A.Ş."
"Yılmaz Gıda Sanayi"
"Koç Otomotiv Sanayi"
Word bigrams (n=2):
["Yılmaz", "Holding"] → ["Holding", "A.Ş."]
["Koç", "Holding"] → ["Holding", "A.Ş."]
["Yılmaz", "Gıda"] → ["Gıda", "Sanayi"]
["Koç", "Otomotiv"] → ["Otomotiv", "Sanayi"]
Generated outputs:
"Yılmaz Otomotiv Sanayi" (new combination!)
"Koç Gıda Sanayi" (new combination!)
"Sabancı Holding A.Ş." (from training)

Use for: Company names, product names, multi-word phrases
Training with Word Token Type
# Company names (word bigrams)
phony train companies.txt \
-o models/companies.ngram \
--token-type word \
--ngram-size 2
# Product names (word trigrams for more context)
phony train products.txt \
-o models/products.ngram \
--token-type word \
--ngram-size 3

Configuration
{
"metadata": {
"token_type": "char"
}
}

| Value | Description | N-gram Example |
|---|---|---|
| char | Character-level (default) | "meh" → "ehm" → "hme" |
| word | Word-level | ["Koç", "Holding"] → ["Holding", "A.Ş."] |
Note: token_type is a model property set at training time. It cannot be changed when using the model. Use --token-type word during training for multi-word phrase generation (company names, product names).
Design Rationale
Why Single N-gram Order?
Phony uses a single N-gram order (e.g., only 4-grams) rather than multi-order approaches. This was a deliberate design choice after evaluating academic smoothing methods.
┌─────────────────────────────────────────────────────────────────────────┐
│ THE LOREM IPSUM PRINCIPLE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Phony generates FAKE data that LOOKS LIKE real data. │
│ It doesn't need to BE real data. │
│ │
│ Priority 1: Speed (generate millions of records fast) │
│ Priority 2: Plausibility (output looks locale-appropriate) │
│ Priority 3: NOT coverage (we don't need every possible word) │
│ │
│ Academic smoothing methods optimize for coverage. │
│ We optimize for speed and plausibility. │
│ │
└─────────────────────────────────────────────────────────────────────────┘

Evaluated But Rejected
We evaluated several academic approaches before deciding on simplicity:
| Method | What It Does | Why Not For Phony |
|---|---|---|
| Kneser-Ney Smoothing | Redistributes probability mass using continuation counts | Complexity overhead, diminishing returns for fake data |
| Modified Kneser-Ney | Three discount parameters (D₁, D₂, D₃₊) | Even more complexity, marginal quality gain |
| Interpolation | λ-weighted mix of all n-gram orders | Requires storing all orders (2x-4x model size) |
| Stupid Backoff | Fixed discount (0.4) without normalization | Still requires multi-order storage |
References:
- Chen & Goodman (1999). "An Empirical Study of Smoothing Techniques for Language Modeling"
- Kneser & Ney (1995). "Improved backing-off for M-gram language modeling"
Our Approach: KISS
Single n-gram order + Simple prefix-match fallback
────────────────────────────────────────────────────
✓ One model file (not separate files per order)
✓ Smaller memory footprint
✓ Faster generation (no interpolation math)
✓ Good enough for fake data

Locale-Based N-gram Defaults
Different languages have different optimal N-gram sizes:
| Locale | Default N | Rationale |
|---|---|---|
| en_* | 3 | Shorter words, common patterns |
| tr_TR | 4 | Agglutinative, longer morphemes |
| de_DE | 4 | Compound words, longer stems |
| ja_JP | 2 | Character-based (hiragana/katakana) |
| zh_* | 2 | Character-based (hanzi) |
| ar_* | 3 | Root-pattern morphology |
These are defaults; users can override them with the --ngram-size flag.
Fallback Strategy
The Sparse Data Problem
When an N-gram has no recorded transitions:
Request: word starting with "xyz"
Model: No N-grams starting with "xyz"
Result: null (failure)

Prefix-Match Fallback
Since we store only single-order N-grams, we use prefix matching as a fallback:
┌─────────────────────────────────────────────────────────────────────────┐
│ PREFIX-MATCH FALLBACK │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Looking for 4-gram starting with "xyz" │
│ ↓ │
│ Search: Any 4-gram matching "xyz*" → found "xyza" → use it │
│ ↓ (if not found) │
│ Search: Any 4-gram matching "xy**" → found "xyzb" → use it │
│ ↓ (if not found) │
│ Search: Any 4-gram matching "x***" → found "xabc" → use it │
│ ↓ (if not found) │
│ Use random from `first` distribution │
│ │
│ Key insight: We stay within single n-gram order. │
│ No 3-grams or 2-grams needed. │
│ │
└─────────────────────────────────────────────────────────────────────────┘

Configuration
{
"config": {
"fallback": "prefix"
}
}

| Value | Description |
|---|---|
| none | Return null on miss (strict mode) |
| prefix | Prefix-match within same n-gram order (default) |
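A minimal sketch of the prefix-match fallback described above: shorten the requested prefix one character at a time and take an n-gram of the same order that still matches. A real model would draw a weighted random match rather than the first one; names here are illustrative.

```rust
/// Progressively shorten `prefix` and return the first same-order n-gram
/// that still starts with it; None means the caller should fall back to
/// the `first` distribution.
fn prefix_fallback<'a>(ngrams: &[&'a str], prefix: &str) -> Option<&'a str> {
    let chars: Vec<char> = prefix.chars().collect();
    for len in (1..=chars.len()).rev() {
        let shortened: String = chars[..len].iter().collect();
        if let Some(hit) = ngrams.iter().find(|g| g.starts_with(shortened.as_str())) {
            return Some(*hit);
        }
    }
    None
}

fn main() {
    let ngrams = ["quar", "sphi", "xabc"];
    // "xyz" and "xy" match nothing; "x" matches "xabc".
    assert_eq!(prefix_fallback(&ngrams, "xyz"), Some("xabc"));
}
```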
Generation Modes
Word Generation
word(lengthHint?, position?, startsWith?) → string | null

| Parameter | Type | Description |
|---|---|---|
| lengthHint | int | Target length (soft constraint) |
| position | int | Sentence position (1, 2, -1, -2) |
| startsWith | string | Required prefix |
Algorithm:
- Select starting N-gram (filtered by startsWith if provided)
- Select target length from the word_lengths distribution
- Loop: pick weighted random child, append last character
- When length ≥ hint and lastChildren exists, terminate
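A minimal Rust sketch of this loop, assuming the graph has already been decoded into in-memory child lists keyed by n-gram. It uses the rand crate and a plain linear weighted pick for brevity (the model itself uses the binary search described below); names and signatures are illustrative, not Phony's API.

```rust
use std::collections::HashMap;

use rand::Rng; // external assumption: the rand crate

/// Linear weighted pick over (ngram, weight) pairs; a stand-in for the
/// cumulative-weight binary search used by the real LookupList.
fn pick<'a, R: Rng>(rng: &mut R, options: &'a [(String, u32)]) -> Option<&'a str> {
    let total: u32 = options.iter().map(|(_, w)| w).sum();
    if total == 0 {
        return None;
    }
    let mut roll = rng.gen_range(0..total);
    for (ngram, w) in options {
        if roll < *w {
            return Some(ngram.as_str());
        }
        roll -= w;
    }
    None
}

/// Word loop: start from a weighted-random first n-gram, append the last
/// character of each chosen child, and stop once the length hint is met
/// and a word-ending (lastChildren) transition is available.
fn word<R: Rng>(
    rng: &mut R,
    first: &[(String, u32)],
    children: &HashMap<String, Vec<(String, u32)>>,
    last_children: &HashMap<String, Vec<(String, u32)>>,
    length_hint: usize,
) -> Option<String> {
    let mut current = pick(rng, first)?.to_string();
    let mut out = current.clone();
    for _ in 0..length_hint + 16 {
        // Safety cap so a cyclic graph cannot loop forever.
        if out.chars().count() >= length_hint {
            if let Some(lc) = last_children.get(&current) {
                if let Some(end) = pick(rng, lc) {
                    out.push(end.chars().last()?);
                    return Some(out);
                }
            }
        }
        // A dead end (no children) yields None, matching word()'s nullable return.
        let next = pick(rng, children.get(&current)?)?.to_string();
        out.push(next.chars().last()?);
        current = next;
    }
    None
}
```

Swapping the linear pick for the binary-search version changes only the selection step; the loop structure stays the same.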
Sentence Generation
sentence(numberOfWords?, startsWith?, endingPunctuation?) → string

| Parameter | Type | Description |
|---|---|---|
| numberOfWords | int | Word count |
| startsWith | string | First word prefix |
| endingPunctuation | string\|array | "." or [".", "!", "?"] |
Algorithm:
- Generate position-aware words for first N positions
- Generate generic words for middle
- Generate position-aware words for last N positions
- Capitalize first letter, append punctuation
Text Generation
text(maxNumberOfCharacters?, sentenceEndingPunctuation?, suffix?) → string

| Parameter | Type | Description |
|---|---|---|
| maxNumberOfCharacters | int | Maximum output length |
| sentenceEndingPunctuation | string\|array | Punctuation options |
| suffix | string | Truncation suffix ("...") |
Poem Generation
poem(numberOfVerses?, stanzaLength?, maximumNumberOfWords?, endingPunctuation?) → string

| Parameter | Type | Description |
|---|---|---|
| numberOfVerses | int | Total lines |
| stanzaLength | int | Lines per stanza (4 = quatrain) |
| maximumNumberOfWords | int | Max words per line |
| endingPunctuation | string\|array | Line endings |
Acrostic Generation
acrosticPoem(initials, maximumNumberOfWords?, endingPunctuation?) → string

| Parameter | Type | Description |
|---|---|---|
| initials | string | Word to spell (required) |
| maximumNumberOfWords | int | Max words per line |
| endingPunctuation | string\|array | Line endings |
Example:
Input: initials = "HELLO"
Output:
Happiness fills the morning air.
Every bird sings a joyful tune.
Laughter echoes through the valley.
Light dances on the water.
Open your heart to the world.

Constrained Generation
Phony supports multiple constraint types for controlled output:
| Constraint | Description | Example |
|---|---|---|
| startsWith | Prefix constraint | startsWith: "A" → "Ahmet" |
| lengthHint | Soft length target | lengthHint: 6 → ~6 chars |
| min_length | Hard minimum | min_length: 4 → ≥4 chars |
| max_length | Hard maximum | max_length: 12 → ≤12 chars |
| endsWith | Suffix constraint | endsWith: "oğlu" |
| contains | Substring constraint | contains: "han" |
| position | Sentence position | position: 1 → sentence-start style |
Soft vs Hard Constraints
- Soft (lengthHint): Guides generation, may overshoot
- Hard (min_length, max_length): Enforced strictly, may return null
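One plausible way to honor both kinds at once is rejection sampling: generate with the soft hint, then discard anything that violates the hard bounds. Whether Phony retries internally or returns null on the first miss is not stated here, so treat this as a sketch; generate stands in for a hypothetical word(lengthHint) call.

```rust
/// Retry generation until the hard length constraints hold, or give up.
fn constrained_word(
    mut generate: impl FnMut(usize) -> Option<String>, // soft: lengthHint only guides
    length_hint: usize,
    min_length: usize,
    max_length: usize,
    max_attempts: usize,
) -> Option<String> {
    for _ in 0..max_attempts {
        if let Some(w) = generate(length_hint) {
            let len = w.chars().count();
            if (min_length..=max_length).contains(&len) {
                return Some(w); // hard constraints hold
            }
        }
    }
    None // hard constraints could not be satisfied → null
}
```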
Binary Search Weighted Random
The key algorithm for O(log n) selection:
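A minimal Rust rendering of that selection, assuming r is drawn uniformly from 0..sw; partition_point performs the binary search. The diagram below traces the same pick by hand.

```rust
/// Return the element whose cumulative-weight interval contains r,
/// i.e. the first index where cw[i] > r (with r drawn uniformly from 0..sw).
fn weighted_pick<'a>(elements: &[&'a str], cw: &[u64], r: u64) -> &'a str {
    let idx = cw.partition_point(|&c| c <= r); // binary search, O(log n)
    elements[idx]
}

fn main() {
    let elements = ["a", "b", "c", "d"];
    let cw = [1, 3, 6, 10]; // weights 1, 2, 3, 4
    assert_eq!(weighted_pick(&elements, &cw, 5), "c"); // 3 ≤ 5 < 6 → "c"
}
```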
┌─────────────────────────────────────────────────────────────────────────┐
│ WEIGHTED RANDOM SELECTION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Elements: ["a", "b", "c", "d"] │
│ Weights: [1, 2, 3, 4] │
│ Cumulative: [1, 3, 6, 10] │
│ │
│ Random(0-10) = 5 │
│ │
│ Binary search for first cw ≥ 5: │
│ │
│ low=0, high=3 │
│ mid=1, cw[1]=3 < 5 → low=2 │
│ mid=2, cw[2]=6 ≥ 5 → found! │
│ │
│ Return: elements[2] = "c" │
│ │
│ Probability distribution: │
│ a: 10%, b: 20%, c: 30%, d: 40% │
│ │
└─────────────────────────────────────────────────────────────────────────┘Training Configuration
CLI Training
# Basic training
phony train names.txt -o models/names.ngram
# With options
phony train names.txt \
--output models/tr_TR/names.ngram \
--ngram-size 4 \
--min-word-length 3 \
--position-depth 5 \
--token-type char \
--locale tr_TR \
--exclude-originals \
--include-real-words \
--fallback prefix \
--word-separator '\s' \
--sentence-separator '[.!?]' \
--word-filter '/[^a-zçğıöşü]+/u' \
--lowercase

Training Flags
| Flag | Description |
|---|---|
| --ngram-size N | N-gram order (default: 4) |
| --token-type char\|word | Token type (char=character-level, word=word-level) |
| --min-word-length N | Filter short words |
| --position-depth N | Sentence position tracking depth |
| --exclude-originals | Store training words for deduplication |
| --include-real-words | Embed training words for real_word mode |
| --fallback none\|prefix | Fallback strategy for sparse data |
| --word-separator REGEX | Word boundary regex |
| --sentence-separator REGEX | Sentence boundary regex |
| --word-filter REGEX | Character filter regex |
| --lowercase | Normalize to lowercase |
Training from Different Sources
# From CSV column
phony train customers.csv --column first_name -o models/names.ngram
# From JSON path
phony train data.json --path "$.users[*].name" -o models/names.ngram
# From multiple files
phony train names/*.txt -o models/names.ngram

PDL Integration
Model generators must specify a generation block. There is no default mode; omitting it causes a validation error.
Generation Block Syntax
{
"type": "model",
"source": "path/to/model.ngram",
"generation": {
"mode": "word | sentence | text | paragraph | poem | acrostic | real_word",
"params": { ... }
},
"constraints": { ... }
}

Structure:
- generation.mode (required): How to produce output
- generation.params (optional): Mode-specific parameters (e.g., word_count, starts_with)
- constraints (optional): Output validation (e.g., min_length, max_length - rejects if not met)
Available Modes
| Mode | Output | Required Params | Optional Params |
|---|---|---|---|
| word | Single word | - | length_hint, starts_with, position |
| words | Multiple words | - | count, length_hint, starts_with |
| sentence | Single sentence | - | word_count, starts_with, punctuation |
| sentences | Multiple sentences | - | count, word_count, punctuation |
| text | Truncated text | - | max_chars, suffix, punctuation |
| paragraph | Paragraph | - | sentence_count, punctuation |
| poem | Poem with stanzas | - | verses, stanza_length, max_words, punctuation |
| acrostic | Acrostic poem | initials | max_words, punctuation |
| real_word | Pick from training | - | starts_with |
Dynamic Parameters with Other Generators
Generation parameters can reference other generators using {{generator_name}}:
{
"generators": {
"word_count": {
"type": "logic",
"algorithm": "int_between",
"params": { "min": 3, "max": 8 }
},
"sentence_count": {
"type": "logic",
"algorithm": "int_between",
"params": { "min": 2, "max": 5 }
},
"bio": {
"type": "model",
"source": "models/text.ngram",
"generation": {
"mode": "sentence",
"params": {
"word_count": "{{word_count}}",
"punctuation": [".", "!"]
}
}
},
"description": {
"type": "model",
"source": "models/text.ngram",
"generation": {
"mode": "paragraph",
"params": {
"sentence_count": "{{sentence_count}}"
}
}
}
}
}

Inline Random Ranges
Parameters also support inline random syntax:
{
"generators": {
"tagline": {
"type": "model",
"source": "models/slogans.ngram",
"generation": {
"mode": "sentence",
"params": {
"word_count": "{{number:3-8}}",
"punctuation": "{{pick(['!', '.', '...'])}}"
}
}
}
}
}

Complete Examples
Simple word generation:
{
"first_name": {
"type": "model",
"source": "models/names.ngram",
"generation": { "mode": "word" },
"constraints": { "min_length": 3, "max_length": 12 }
}
}

Sentence with dynamic word count:
{
"motto": {
"type": "model",
"source": "models/quotes.ngram",
"generation": {
"mode": "sentence",
"params": {
"word_count": "{{number:5-12}}",
"starts_with": "{{pick(['A', 'B', 'C'])}}",
"punctuation": [".", "!"]
}
}
}
}

Poem with configurable structure:
{
"haiku": {
"type": "model",
"source": "models/poetry.ngram",
"generation": {
"mode": "poem",
"params": {
"verses": 3,
"stanza_length": 0,
"max_words": "{{number:3-5}}",
"punctuation": ""
}
}
}
}

Acrostic poem:
{
"welcome_poem": {
"type": "model",
"source": "models/poetry.ngram",
"generation": {
"mode": "acrostic",
"params": {
"initials": "WELCOME",
"max_words": "{{number:4-8}}",
"punctuation": "."
}
}
}
}

Pick real word from training data:
{
"known_name": {
"type": "model",
"source": "models/names.ngram",
"generation": {
"mode": "real_word",
"params": {
"starts_with": "A"
}
}
}
}

Mode Selection Guide
┌─────────────────────────────────────────────────────────────────────────┐
│ GENERATION MODE SELECTION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Need a single generated item? │
│ ├── Name, username, made-up word → mode: "word" │
│ ├── Company name (multi-word) → mode: "word" + word-mode model │
│ └── Known/real name from data → mode: "real_word" │
│ │
│ Need text content? │
│ ├── Short tagline, motto → mode: "sentence" │
│ ├── Bio, description (fixed length) → mode: "text" + max_chars │
│ └── Article, review → mode: "paragraph" │
│ │
│ Need creative/structured text? │
│ ├── Poem with stanzas → mode: "poem" │
│ └── Hidden message in initials → mode: "acrostic" │
│ │
└─────────────────────────────────────────────────────────────────────────┘Performance Characteristics
| Operation | Complexity | Notes |
|---|---|---|
| Training (feed) | O(n × k) | n = chars, k = ngram size |
| Calculate | O(m log m) | m = unique ngrams (sorting) |
| Weighted random | O(log m) | Binary search |
| Word generation | O(length) | Linear in output length |
| Model size | 100KB - 10MB | Depends on training corpus |
File Format
N-gram models are stored as .ngram files:
model.ngram
├── Gzip compressed
└── JSON structure (as described above)

The .ngram extension indicates:
- Machine-trained statistical model
- Not human-editable (unlike lists)
- Locale-specific content
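For reference, a minimal loading sketch based on the format described above; the crate choices (flate2 for the gzip layer, serde_json for the JSON payload) and the file path are illustrative, not something the Phony project prescribes.

```rust
use std::fs::File;
use std::io::BufReader;

use flate2::read::GzDecoder; // gzip layer
use serde_json::Value;       // untyped JSON, enough for inspection

fn load_ngram_model(path: &str) -> Result<Value, Box<dyn std::error::Error>> {
    let file = File::open(path)?;
    let decoder = GzDecoder::new(BufReader::new(file));    // decompress on the fly
    let model: Value = serde_json::from_reader(decoder)?;  // parse the JSON structure
    Ok(model)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let model = load_ngram_model("models/names.ngram")?; // hypothetical path
    println!("locale: {}", model["metadata"]["locale"]);
    println!("ngrams: {}", model["ngrams"].as_array().map_or(0, |a| a.len()));
    Ok(())
}
```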
Future Optimizations
FlatBuffers for Zero-Copy Loading
Current: JSON + Gzip requires parsing and memory allocation during load.
Future: FlatBuffers format for instant model loading in Cloud engine.
┌─────────────────────────────────────────────────────────────────────────┐
│ ZERO-COPY DESERIALIZATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Current (JSON + Gzip): │
│ ────────────────────── │
│ 1. Read file from disk (~10ms) │
│ 2. Decompress gzip (~20ms) │
│ 3. Parse JSON text (~50-100ms) ← Bottleneck │
│ 4. Build data structures (~10ms) │
│ Total: ~100-150ms │
│ │
│ Future (FlatBuffers): │
│ ───────────────────── │
│ 1. mmap() file into memory (~1ms) │
│ 2. Done. No parsing needed. │
│ Total: ~1-5ms │
│ │
│ WHY IT WORKS: │
│ ───────────── │
│ FlatBuffers file format IS the memory format. │
│ No conversion needed - just point to the data. │
│ │
└─────────────────────────────────────────────────────────────────────────┘

Proposed dual-format support:
| Extension | Format | Use Case |
|---|---|---|
| .ngram | JSON + Gzip | OSS libraries, debugging, human inspection |
| .ngram.fb | FlatBuffers | Cloud engine, high-performance scenarios |
Benefits for Cloud (5M rec/sec):
- Instant model switching between locales
- Lower memory pressure (no parsing overhead)
- Better cache utilization
References:
- FlatBuffers
- Cap'n Proto (alternative)
Streaming Training for Large Files
Current: Entire training file loaded into memory before processing.
Future: Stream-based training for arbitrarily large files.
┌─────────────────────────────────────────────────────────────────────────┐
│ STREAMING TRAINING PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Current (Batch): │
│ ───────────────── │
│ [Read entire file] → [Process all] → [Build model] │
│ Memory: O(file_size) │
│ Problem: 10GB file = 10GB+ RAM needed │
│ │
│ Future (Streaming): │
│ ────────────────── │
│ [Read chunk 1] → [Update model] ─┐ │
│ [Read chunk 2] → [Update model] ─┤→ [Finalize model] │
│ [Read chunk 3] → [Update model] ─┘ │
│ Memory: O(chunk_size + model_size) │
│ Benefit: 10GB file with 64MB chunks = ~100MB RAM │
│ │
└─────────────────────────────────────────────────────────────────────────┘

Implementation approach:
// Pseudocode for streaming trainer (read_chunks and count_ngrams are sketches;
// read_chunks must overlap consecutive chunks by n-1 characters, see below)
fn train_streaming(path: &str, chunk_size: usize) -> Model {
    let mut model = Model::new();
    for chunk in read_chunks(path, chunk_size) {
        let partial_counts = count_ngrams(&chunk); // per-chunk frequency counts
        model.merge_counts(partial_counts);        // counts simply add up
    }
    model.finalize(); // Calculate cumulative weights, etc.
    model
}

Key considerations:
- N-grams at chunk boundaries: Overlap chunks by n-1 characters
- Memory for counts: Hash map grows with unique N-grams (bounded)
- Finalization: Single pass to compute cumulative weights
Parallel Training
Future enhancement: Multi-threaded training for very large corpora.
┌─────────────────────────────────────────────────────────────────────────┐
│ PARALLEL TRAINING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Thread 1: [Chunk 1] → [Count N-grams] ─┐ │
│ Thread 2: [Chunk 2] → [Count N-grams] ─┼→ [Merge] → [Model] │
│ Thread 3: [Chunk 3] → [Count N-grams] ─┤ │
│ Thread 4: [Chunk 4] → [Count N-grams] ─┘ │
│ │
│ Speedup: ~linear with cores (N-gram counting is embarrassingly parallel)│
│ │
│ MERGE STRATEGY: │
│ ─────────────── │
│ • Each thread produces partial frequency counts │
│ • Main thread merges counts (simple addition) │
│ • Finalize: compute cumulative weights once │
│ │
└─────────────────────────────────────────────────────────────────────────┘

CLI flag (future):
# Parallel training with 8 threads
phony train large-corpus.txt \
-o models/large.ngram \
--parallel 8 \
--chunk-size 64MB

When to use:
- Training data > 100MB
- Multi-core machine available
- One-time model building (not interactive)
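A minimal sketch of the count-and-merge strategy from the diagram above, assuming the corpus has already been split into one word per element (so chunk-boundary overlap is not an issue at this level); the helper names and the use of std::thread are illustrative.

```rust
use std::collections::HashMap;
use std::thread;

/// Count character n-grams (n ≥ 1) in one chunk of words.
fn count_ngrams(chunk: &[String], n: usize) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    for word in chunk {
        let chars: Vec<char> = word.chars().collect();
        for win in chars.windows(n) {
            *counts.entry(win.iter().collect::<String>()).or_insert(0) += 1;
        }
    }
    counts
}

/// Fan the corpus out to worker threads, then merge by simple addition.
fn parallel_counts(words: Vec<String>, n: usize, threads: usize) -> HashMap<String, u64> {
    let chunk_len = (words.len() / threads.max(1)).max(1);
    let mut handles = Vec::new();
    for chunk in words.chunks(chunk_len) {
        let chunk = chunk.to_vec(); // move an owned copy into the worker
        handles.push(thread::spawn(move || count_ngrams(&chunk, n)));
    }
    let mut merged = HashMap::new();
    for handle in handles {
        for (ngram, count) in handle.join().expect("worker panicked") {
            *merged.entry(ngram).or_insert(0) += count; // partial counts just add up
        }
    }
    merged
}
```

Finalization (sorting the n-grams and computing cumulative weights) then happens once on the merged counts, as the diagram notes.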
Optimization Priority
| Optimization | Effort | Impact | Priority |
|---|---|---|---|
| Streaming training | Medium | High (enables large files) | P1 |
| Parallel training | Medium | Medium (faster training) | P2 |
| FlatBuffers (Cloud) | High | High (instant loading) | P2 |
| FlatBuffers (OSS) | High | Low (one-time load) | P3 |