NEW SRI-LM LIBRARY AND TOOLS -- OVERVIEW

Design Goals

- coverage of state-of-the-art LM methods
- extensible vehicle for LM research
- code reusability
- tool versatility
- speed

Implementation language: C++ (GNU compiler)

LM CLASS HIERARCHY

LM
    Ngram -- arbitrary-order N-gram backoff models
        DFNgram -- N-grams including disfluency model
        VarNgram -- variable-order N-grams
        TaggedNgram -- word/tag N-grams
    CacheLM -- unigram from recent history
    DynamicLM -- changes as a function of external info
    BayesMix -- mixture of LMs with contextual adaptation

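The hierarchy above can be illustrated with a small, self-contained sketch. This is not the library's code: ToyLM, ToyUniformLM, ToyCacheLM and interpolate are hypothetical names, and the context argument is left out to keep the example short. It only shows the idea of one abstract base class whose subclasses can be combined through the common interface, with a cache model that estimates a unigram from recent history.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical stand-in for an abstract LM base class (not the library's declaration).
class ToyLM {
public:
    virtual ~ToyLM() = default;
    // Probability of `word`; a real model would also take a context.
    virtual double wordProb(const std::string &word) const = 0;
};

// Assigns the same probability to every word of a fixed-size vocabulary.
class ToyUniformLM : public ToyLM {
public:
    explicit ToyUniformLM(std::size_t vocabSize) : vocabSize_(vocabSize) {
        assert(vocabSize > 0);
    }
    double wordProb(const std::string &) const override {
        return 1.0 / static_cast<double>(vocabSize_);
    }
private:
    std::size_t vocabSize_;
};

// Unigram estimated from the most recent `maxHistory` observed words.
class ToyCacheLM : public ToyLM {
public:
    explicit ToyCacheLM(std::size_t maxHistory) : maxHistory_(maxHistory) {
        assert(maxHistory > 0);
    }
    // Appends a word to the history, dropping the oldest word when full.
    void observe(const std::string &word) {
        history_.push_back(word);
        if (history_.size() > maxHistory_) history_.pop_front();
    }
    // Relative frequency of `word` in the history (0 if the history is empty).
    double wordProb(const std::string &word) const override {
        if (history_.empty()) return 0.0;
        std::size_t hits = 0;
        for (const std::string &w : history_) {
            if (w == word) ++hits;
        }
        return static_cast<double>(hits) / static_cast<double>(history_.size());
    }
private:
    std::size_t maxHistory_;
    std::deque<std::string> history_;
};

// Fixed-weight linear interpolation of two models via the base interface.
double interpolate(const ToyLM &a, const ToyLM &b, double lambda,
                   const std::string &word) {
    return lambda * a.wordProb(word) + (1.0 - lambda) * b.wordProb(word);
}
```

Because interpolate only sees the base class, any two models derived from it can be mixed without changes to either one.
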
OTHER CLASSES

Vocab -- word string/index mapping
    TaggedVocab -- same for word/tag pairs

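A word string/index mapping of the kind listed for Vocab can be sketched as follows. ToyVocab and its member names are assumptions made for illustration, not the library's interface; the sketch only shows the two-way lookup between strings and dense integer indices.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical two-way word/index table (not the library's Vocab class).
class ToyVocab {
public:
    using Index = std::size_t;

    // Returns the index of `word`, assigning the next free index if it is new.
    Index addWord(const std::string &word) {
        auto it = index_.find(word);
        if (it != index_.end()) return it->second;
        Index id = words_.size();
        words_.push_back(word);
        index_.emplace(word, id);
        return id;
    }

    // Looks up `word` without inserting; returns false if it is unknown.
    bool getIndex(const std::string &word, Index &out) const {
        auto it = index_.find(word);
        if (it == index_.end()) return false;
        out = it->second;
        return true;
    }

    // Maps an index back to its word string.
    const std::string &getWord(Index id) const {
        assert(id < words_.size());
        return words_[id];
    }

    std::size_t size() const { return words_.size(); }

private:
    std::vector<std::string> words_;                 // index -> string
    std::unordered_map<std::string, Index> index_;   // string -> index
};
```

Working on indices rather than strings keeps N-gram keys small and makes comparisons cheap, which is presumably why the interface offers both VocabIndex and VocabString variants.
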
LMStats -- statistics for LM estimation
    NgramStats -- N-gram counts
        TaggedNgramStats -- word/tag N-gram counts

Discount -- backoff probability discounting
    GoodTuring -- standard
    ConstDiscount -- Ney's method
    NaturalDiscount -- Ristad's method

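As a rough illustration of what a discounting method computes, the sketch below gives two textbook discount coefficients. It assumes the plain Turing estimate d_r = (r+1) n_{r+1} / (r n_r), where n_r is the number of distinct N-grams seen exactly r times, and absolute discounting with a constant D, d_r = (r - D) / r. The function names are hypothetical, and the library's GoodTuring and ConstDiscount classes may differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Textbook Good-Turing discount coefficient for count r.
// countOfCounts[r] = number of distinct N-grams seen exactly r times.
// Returns 1.0 (no discounting) when the needed counts-of-counts are missing.
double goodTuringDiscount(const std::vector<unsigned long> &countOfCounts,
                          std::size_t r) {
    assert(r >= 1);
    if (r + 1 >= countOfCounts.size() ||
        countOfCounts[r] == 0 || countOfCounts[r + 1] == 0) {
        return 1.0;
    }
    return (static_cast<double>(r + 1) * static_cast<double>(countOfCounts[r + 1])) /
           (static_cast<double>(r) * static_cast<double>(countOfCounts[r]));
}

// Absolute (constant) discounting: subtract d from every count r >= 1.
double constantDiscount(unsigned long r, double d) {
    assert(r >= 1 && d >= 0.0 && d <= 1.0);
    return (static_cast<double>(r) - d) / static_cast<double>(r);
}
```

In both cases the discounted count is the coefficient times r, and the probability mass removed this way is what a backoff model redistributes to unseen N-grams.
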
HELPER LIBRARIES

libdstruct -- template data structures

    Array -- self-extending arrays

    Map
        SArray -- sorted arrays
        LHash -- linear hash tables

    Trie -- index trees (based on a Map type)

    MemStats -- memory usage tracking

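The Trie entry above describes index trees built on a Map type. A minimal sketch of that idea, used here to hold N-gram counts keyed by word indices, might look as follows. ToyCountTrie and countNgrams are hypothetical names, and std::map stands in for the library's own Map types.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <vector>

// Hypothetical trie node: one count plus a map from word index to child node.
struct ToyCountTrie {
    unsigned long count = 0;
    std::map<unsigned, std::unique_ptr<ToyCountTrie>> children;

    // Returns the child for key k, creating it if it does not exist yet.
    ToyCountTrie &child(unsigned k) {
        std::unique_ptr<ToyCountTrie> &slot = children[k];
        if (!slot) slot.reset(new ToyCountTrie());
        return *slot;
    }

    // Follows `key` from this node; returns nullptr if the path is absent.
    const ToyCountTrie *find(const std::vector<unsigned> &key) const {
        const ToyCountTrie *node = this;
        for (unsigned k : key) {
            auto it = node->children.find(k);
            if (it == node->children.end()) return nullptr;
            node = it->second.get();
        }
        return node;
    }
};

// Counts every N-gram of order 1..maxOrder in `words`.
void countNgrams(ToyCountTrie &root, const std::vector<unsigned> &words,
                 std::size_t maxOrder) {
    for (std::size_t start = 0; start < words.size(); ++start) {
        ToyCountTrie *node = &root;
        for (std::size_t len = 1;
             len <= maxOrder && start + len <= words.size(); ++len) {
            node = &node->child(words[start + len - 1]);
            ++node->count;  // this node spells words[start .. start+len-1]
        }
    }
}
```

Since the per-node container is the only part that depends on the map implementation, a sorted array or a hash table could be swapped in for std::map without touching the traversal code.
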
libmisc -- convenience functions:
    option parsing,
    compressed file i/o,
    object debugging

TOOLS

ngram-count -- N-gram counting and model estimation
ngram-merge -- N-gram count merging
ngram -- N-gram model scoring, perplexity,
         sentence generation, mixing and
         interpolation

LM INTERFACE

LogP wordProb(VocabIndex word, const VocabIndex *context)
LogP wordProb(VocabString word, const VocabString *context)

LogP sentenceProb(const VocabIndex *sentence, TextStats &stats)
LogP sentenceProb(const VocabString *sentence, TextStats &stats)

unsigned pplFile(File &file, TextStats &stats, const char *escapeString = 0)
setState(const char *state);

wordProbSum(const VocabIndex *context)

VocabIndex generateWord(const VocabIndex *context)
VocabIndex *generateSentence(unsigned maxWords, VocabIndex *sentence)
VocabString *generateSentence(unsigned maxWords, VocabString *sentence)

Boolean isNonWord(VocabIndex word)
Boolean read(File &file);
void write(File &file);

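How a sentence probability and a perplexity can be derived from a wordProb-style function is sketched below. Everything here is an assumption made for illustration: ToyTextStats, WordLogProb, sentenceLogProb and perplexity are hypothetical names, log probabilities are taken to be base 10, the context is passed most-recent-word first, and "<s>" and "</s>" are used as sentence boundary markers. Out-of-vocabulary and zero-probability words are ignored.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical accumulator for corpus-level statistics.
struct ToyTextStats {
    double logProb = 0.0;          // summed log10 probability
    std::size_t numWords = 0;      // words scored (excluding end markers)
    std::size_t numSentences = 0;  // sentences scored
};

// Returns log10 P(word | context); context[0] is the most recent word.
using WordLogProb =
    std::function<double(const std::string &, const std::vector<std::string> &)>;

// Scores one sentence word by word, including the end-of-sentence event,
// and adds the result to `stats`.
double sentenceLogProb(const WordLogProb &wordLogProb,
                       const std::vector<std::string> &sentence,
                       ToyTextStats &stats) {
    std::vector<std::string> context{"<s>"};
    double total = 0.0;
    for (const std::string &w : sentence) {
        total += wordLogProb(w, context);
        context.insert(context.begin(), w);  // newest word goes first
    }
    total += wordLogProb("</s>", context);
    stats.logProb += total;
    stats.numWords += sentence.size();
    stats.numSentences += 1;
    return total;
}

// Perplexity over all scored events: one per word plus one per sentence end.
double perplexity(const ToyTextStats &stats) {
    std::size_t events = stats.numWords + stats.numSentences;
    assert(events > 0);
    return std::pow(10.0, -stats.logProb / static_cast<double>(events));
}
```

With a model that gives every event probability 0.1, a three-word sentence scores a log probability of -4 over four events, so the perplexity is 10.
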
EXTENSIBILITY/REUSABILITY

THINGS TO DO

- Node array interface
- General interpolated LMs
- LM "shell" for interactive model manipulation and use (Tcl based)