b2txt25/language_model/srilm-1.7.3/doc/overview
2025-07-02 12:18:09 -07:00

NEW SRI-LM LIBRARY AND TOOLS -- OVERVIEW

Design Goals
    - coverage of state-of-the-art LM methods
    - extensible vehicle for LM research
    - code reusability
    - tool versatility
    - speed

Implementation language: C++ (GNU compiler)

LM CLASS HIERARCHY

LM
    Ngram -- arbitrary-order N-gram backoff models
        DFNgram -- N-grams including disfluency model
        VarNgram -- variable-order N-grams
        TaggedNgram -- word/tag N-grams
    CacheLM -- unigram from recent history
    DynamicLM -- changes as a function of external info
    BayesMix -- mixture of LMs with contextual adaptation
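
The hierarchy above can be pictured as ordinary C++ inheritance from a
common LM base class. The following is a minimal sketch of that idea only,
not SRILM's actual headers: the sketch namespace, the vector-based wordProb
signature, and the two toy models (UniformLM, RecentHistoryLM) are all
hypothetical stand-ins chosen for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

namespace sketch {  // hypothetical namespace; nothing here is SRILM code

typedef unsigned VocabIndex;  // stand-in for a word index type
typedef double LogP;          // log10 probability (an assumption of this sketch)

// Abstract base: every model answers "log P(word | context)".
class LM {
public:
    virtual ~LM() {}
    virtual LogP wordProb(VocabIndex word,
                          const std::vector<VocabIndex> &context) const = 0;
};

// Toy model that ignores context: uniform over a fixed vocabulary size.
class UniformLM : public LM {
public:
    explicit UniformLM(unsigned vocabSize) : size_(vocabSize) {
        assert(vocabSize > 0);
    }
    LogP wordProb(VocabIndex, const std::vector<VocabIndex> &) const {
        return -std::log10(static_cast<double>(size_));
    }
private:
    unsigned size_;
};

// Toy "unigram from recent history": relative frequency of the word
// among the context words supplied by the caller.
class RecentHistoryLM : public LM {
public:
    LogP wordProb(VocabIndex word,
                  const std::vector<VocabIndex> &context) const {
        std::size_t hits = 0;
        for (std::size_t i = 0; i < context.size(); ++i) {
            if (context[i] == word) ++hits;
        }
        if (hits == 0) return -HUGE_VAL;  // unseen word: log of zero
        return std::log10(static_cast<double>(hits) / context.size());
    }
};

}  // namespace sketch
```

Code written against the base class (for example a scoring loop taking a
const sketch::LM &) works unchanged with either toy model, which is the
kind of reuse the hierarchy is meant to allow.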

OTHER CLASSES

Vocab -- word string/index mapping
    TaggedVocab -- same for word/tag pairs
LMStats -- statistics for LM estimation
    NgramStats -- N-gram counts
        TaggedNgramStats -- word/tag N-gram counts
Discount -- backoff probability discounting
    GoodTuring -- standard
    ConstDiscount -- Ney's method
    NaturalDiscount -- Ristad's method
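
To make the role of a discounting class concrete, here is a small sketch of
two textbook discount coefficients computed from count-of-count statistics:
Good-Turing, and absolute (constant) discounting in the style usually
attributed to Ney et al. The function names and the CountOfCounts typedef
are hypothetical and are not SRILM's Discount API; how SRILM smooths or
thresholds these values is not covered here.

```cpp
#include <cassert>
#include <map>

// Hypothetical helper type: countOfCounts[r] = number of distinct
// N-grams observed exactly r times.
typedef std::map<unsigned, unsigned> CountOfCounts;

// Good-Turing: adjusted count r* = (r + 1) * n_{r+1} / n_r.
// Returns the coefficient d = r* / r; falls back to 1.0 (no discount)
// when the statistics needed for r are missing.
double goodTuringDiscount(unsigned r, const CountOfCounts &n) {
    assert(r > 0);
    CountOfCounts::const_iterator nr = n.find(r);
    CountOfCounts::const_iterator nr1 = n.find(r + 1);
    if (nr == n.end() || nr1 == n.end() ||
        nr->second == 0 || nr1->second == 0) {
        return 1.0;
    }
    return (r + 1.0) * nr1->second / (static_cast<double>(r) * nr->second);
}

// Absolute discounting: subtract a constant D from every nonzero count,
// so the coefficient for count r is (r - D) / r.
double constDiscount(unsigned r, double D) {
    assert(r > 0 && D >= 0.0 && D <= 1.0);
    return (r - D) / r;
}

// Common estimate of the constant: D = n1 / (n1 + 2 * n2).
double estimateAbsoluteDiscount(const CountOfCounts &n) {
    CountOfCounts::const_iterator n1 = n.find(1);
    CountOfCounts::const_iterator n2 = n.find(2);
    double c1 = (n1 == n.end()) ? 0.0 : n1->second;
    double c2 = (n2 == n.end()) ? 0.0 : n2->second;
    assert(c1 + c2 > 0.0);
    return c1 / (c1 + 2.0 * c2);
}
```

In both cases the probability mass removed from seen N-grams is what a
backoff model redistributes to lower-order estimates.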

HELPER LIBRARIES

libdstruct -- template data structures
    Array -- self-extending arrays
    Map
        SArray -- sorted arrays
        LHash -- linear hash tables
    Trie -- index trees (based on a Map type)
    MemStats -- memory usage tracking
libmisc -- convenience functions:
    option parsing,
    compressed file i/o,
    object debugging
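
The idea of a Trie built on a Map type can be sketched with the standard
library: each node holds a value plus a map from keys to child nodes, and an
N-gram is a path of keys from the root. TrieSketch and countNgrams below are
hypothetical names for illustration; they use std::map rather than
libdstruct's own SArray or LHash, and they do not reproduce the real Trie
interface.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical trie: children are stored in a map keyed by Key.
template <class Key, class Data>
class TrieSketch {
public:
    TrieSketch() : data_() {}

    // Follow (and create as needed) the path given by keys; return its node.
    TrieSketch &insert(const std::vector<Key> &keys) {
        TrieSketch *node = this;
        for (std::size_t i = 0; i < keys.size(); ++i) {
            std::unique_ptr<TrieSketch> &child = node->children_[keys[i]];
            if (!child) child.reset(new TrieSketch());
            node = child.get();
        }
        return *node;
    }

    // Follow the path without creating nodes; null pointer if it is absent.
    const TrieSketch *find(const std::vector<Key> &keys) const {
        const TrieSketch *node = this;
        for (std::size_t i = 0; i < keys.size(); ++i) {
            typename Children::const_iterator it =
                node->children_.find(keys[i]);
            if (it == node->children_.end()) return 0;
            node = it->second.get();
        }
        return node;
    }

    Data &value() { return data_; }
    const Data &value() const { return data_; }

private:
    typedef std::map<Key, std::unique_ptr<TrieSketch> > Children;
    Data data_;
    Children children_;
};

// Count every N-gram of order 1..maxOrder in a word sequence.
void countNgrams(const std::vector<std::string> &words, unsigned maxOrder,
                 TrieSketch<std::string, unsigned> &counts) {
    assert(maxOrder > 0);
    for (std::size_t start = 0; start < words.size(); ++start) {
        std::vector<std::string> ngram;
        for (unsigned n = 0; n < maxOrder && start + n < words.size(); ++n) {
            ngram.push_back(words[start + n]);
            counts.insert(ngram).value() += 1;
        }
    }
}
```

Because an N-gram shares its trie path with every N-gram it is a prefix of,
counts for all orders live in one structure, which is presumably why count
and model classes can share a single index-tree template.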

TOOLS

ngram-count -- N-gram counting and model estimation
ngram-merge -- N-gram count merging
ngram       -- N-gram model scoring, perplexity,
               sentence generation, mixing and
               interpolation

LM INTERFACE

LogP wordProb(VocabIndex word, const VocabIndex *context)
LogP wordProb(VocabString word, const VocabString *context)
LogP sentenceProb(const VocabIndex *sentence, TextStats &stats)
LogP sentenceProb(const VocabString *sentence, TextStats &stats)
unsigned pplFile(File &file, TextStats &stats, const char *escapeString = 0)
setState(const char *state)
wordProbSum(const VocabIndex *context)
VocabIndex generateWord(const VocabIndex *context)
VocabIndex *generateSentence(unsigned maxWords, VocabIndex *sentence)
VocabString *generateSentence(unsigned maxWords, VocabString *sentence)
Boolean isNonWord(VocabIndex word)
Boolean read(File &file)
void write(File &file)
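
As an illustration of how a sentence-level score can be built on top of a
wordProb-style call, the sketch below sums per-word log probabilities and
converts the total to a perplexity. It is self-contained and does not use
SRILM: ToyLM, kNone, sentenceLogProb and perplexity are hypothetical names,
and the conventions assumed here (log10 values, a context array listing the
most recent word first and ending in a sentinel, no sentence-boundary
tokens) are assumptions of this sketch rather than statements made by this
overview.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

typedef unsigned VocabIndex;  // stand-in for a word index type
typedef double LogP;          // log10 probability (assumed)
const VocabIndex kNone = static_cast<VocabIndex>(-1);  // hypothetical sentinel

// Toy bigram-like model: after word w, word (w + 1) % vocabSize gets
// probability 0.5 and the rest share 0.5; with no history it is uniform.
struct ToyLM {
    unsigned vocabSize;
    LogP wordProb(VocabIndex word, const VocabIndex *context) const {
        assert(vocabSize > 1 && word < vocabSize);
        if (context[0] == kNone) {
            return -std::log10(static_cast<double>(vocabSize));
        }
        VocabIndex favored = (context[0] + 1) % vocabSize;
        double p = (word == favored) ? 0.5 : 0.5 / (vocabSize - 1);
        return std::log10(p);
    }
};

// Sum log P(w_i | history) over the sentence. The history is handed to the
// model reversed (most recent word first) and terminated by kNone.
LogP sentenceLogProb(const ToyLM &lm, const std::vector<VocabIndex> &sentence) {
    std::vector<VocabIndex> context(sentence.size() + 1, kNone);
    LogP total = 0.0;
    for (std::size_t i = 0; i < sentence.size(); ++i) {
        for (std::size_t j = 0; j < i; ++j) {
            context[j] = sentence[i - 1 - j];
        }
        context[i] = kNone;
        total += lm.wordProb(sentence[i], &context[0]);
    }
    return total;
}

// Perplexity from a total log10 probability over numWords scored words.
double perplexity(LogP totalLogProb, std::size_t numWords) {
    assert(numWords > 0);
    return std::pow(10.0, -totalLogProb / numWords);
}
```

For the three-word sentence 0 1 3 with a vocabulary of 4, the toy model
assigns probabilities 1/4, 1/2 and 1/6, so the sentence probability is 1/48
and the perplexity is the cube root of 48.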

EXTENSIBILITY/REUSABILITY

THINGS TO DO

- Node array interface
- General interpolated LMs
- LM "shell" for interactive model manipulation and use (Tcl based)