Version History
|
|
|
|
0.90 29 Jun 95 first working code, n-gram models only
|
|
|
|
0.91 02 Aug 95 snapshot for fosler@icsi, minor bug fixes
|
|
|
|
0.92 13 Aug 95 added BayesMix, VarNgram LMs
|
|
|
|
0.93 27 Aug 95 included all LM95 code
|
|
|
|
0.94 13 Oct 95
|
|
* new directory structure mirroring DECIPHER layout.
|
|
* man pages added
|
|
* added support for Decipher N-best list rescoring
|
|
* added Null LM
|
|
* added new utility scripts
|
|
* bug fixes
|
|
|
|
0.95 08 Sep 96 as of WS96
|
|
* added Trellis class, disambig program
|
|
* added support for pause tokens (-pau-) in sentences
|
|
(these are ignored for sentence prob computation)
|
|
* added -tolower mapping
|
|
* added word reversal
|
|
* made Ngram model reading much faster (optimized floating point parsing)
|
|
* added template class for ngram count tries (to use either integer or
|
|
float count value)
|
|
* added optional noise tag skipping
|
|
* added SkipNgram model
|
|
* added Witten-Bell backoff
|
|
* ported to native Sun and SGI C++ compilers (see doc/c++porting-notes)
|
|
* suppress log10(0.0) warnings
|
|
|
|
0.96 05 Jun 97
|
|
* Honor -gtNmin parameter even when discounting of higher counts
|
|
is effectively disabled. (Allows building maximum likelihood LMs
|
|
smoothed only by low-count ngram elimination.)
|
|
* Ignore pauses and noise in nbest-lattice alignments (also added
|
|
-noise option).
|
|
* ngram now supports mixtures of up to 6 ngram models.
|
|
* added HiddenSNgram LM.
|
|
* warn about multiple uses of '-' file for input or output
|
|
* zio now handles incomplete reading of compressed file without error
|
|
* Fixed interaction between deletion and iterations
|
|
* Fixed handling of OOVs in cache model
|
|
* Fixed decipher N-best rescoring: we now duplicate even the
|
|
roundoff errors incurred by bytelogs. Also added -decipher flag
|
|
to ngram to allow replication of recognizer LM scores.
|
|
Also, takes into account that Decipher (incorrectly) applies WTW
|
|
even to pauses.
|
|
* Enhanced decipher-rescore script to deal with NBestList2.0 format,
|
|
with -bytelog and -nodecipherlm options.
|
|
* Added tools to convert bigram and trigram backoff LMs into
|
|
Decipher PFSG format (pfsg-from-ngram).
|
|
* Enable DecipherNgram models of order higher than bigram
|
|
(ngram -decipher-order flag). Default is still bigram.
|
|
* Fixed bug that caused float command line arguments to be parsed
|
|
incorrectly on SunOS4 systems (missing declaration in system header).
|
|
|
|
0.97 30 Aug 97 as of WS97
|
|
* New programs: segment and segment-nbest (moved here from
|
|
development code).
|
|
* Made low-level NgramLM access functions public
|
|
(findProb, findBOW, insertProb, insertBOW).
|
|
* Fixed nbest-lattice to use normalized posterior word
|
|
probabilities in lattice.
|
|
* NBest, nbest-lattice: added N-best error computation.
|
|
* WordLattice, nbest-lattice: added lattice error computation.
|
|
* WordLattice: base all alignments on edit distance costs defined
|
|
in WordAlign.h.
|
|
* contextID() now also returns length of context used.
|
|
Added contextID() implementations for NullLM and BayesMix.
|
|
* Fixed contextID() for Ngram: don't truncate context if BOW = 1.
|
|
* Fixed SArray, LHash to avoid assignment operator on remove().
|
|
* Fixed add-ppls, subtract-ppls to handle -ppl -debug 2 output.
|
|
* Lots of memory management fixes.
|
|
* SArrayIter and LHashIter now work even while underlying object is
|
|
being moved (as when containing data structure is enlarged).
|
|
* Added HTK Lattice tool interface (htk/ directory).
|
|
* Made Trellis into a template class.
|
|
* Allow arbitrary n-gram orders with disambig(1).
|
|
* Added forward-backward decoding and posterior probability computation
|
|
to disambig(1).
|
|
* Added disambig -lmw and -mapw options.
|
|
* Added HMMofNGrams model (ngram -hmm option).
|
|
* VocabMap reader now warns about duplicate entries
|
|
|
|
0.98 18 April 98
|
|
* Allow ngram to disable Decipher LM backoff hack, for rescoring
|
|
new exact lattices (ngram -decipher-nobackoff).
|
|
* N-best list vocabulary is now always expanded dynamically
|
|
(no more OOVs in N-best lists).
|
|
* Added wrapper script for nbest-lattice to compute N-best error rate
|
|
(nbest-error).
|
|
* Skip ngrams exceeding model order when reading.
|
|
* Fixed memory bug in generateSentence().
|
|
* Changed libmisc to work with Tcl version > 7.
|
|
* Compute word error correctly for empty N-best list.
|
|
* Added ngram pruning based on model perplexity change
|
|
(ngram-count -prune and ngram -prune).
|
|
* Old ngram -prune option renamed -varprune.
|
|
* New lattice word error minimization (nbest-lattice -lattice-wer).
|
|
* Fixed ngram -gen bug due to omissions in SunOS4 header files.
|
|
* merge-batch-counts removes merged source files
|
|
* Added ngram -prune-lowprobs function to do the work of
|
|
remove-lowprob-ngrams, but much faster and using less memory.
|
|
* Added support for new Decipher NBestList2.0 format.
|
|
* Added word error count and posterior probability fields to NBestHyp
|
|
structure.
|
|
* Added optional factor argument to countSentence() (convenient
|
|
to compute fractional sufficient statistics for alternative
|
|
training methods).
|
|
* Don't make special symbols (<s>, </s>, <unk>) member of SubVocab
|
|
by default.
|
|
* Ported to gcc 2.8.1.
|
|
|
|
0.99 31 July 1999
|
|
* Added hidden-ngram (word-boundary tagger).
|
|
* Removed line length limit for File object.
|
|
* Added disambig -continuous flag.
|
|
* Fixed backward computation in disambig (again).
|
|
* Generalized compute-best-mix to N > 2 models
|
|
* Added AdaptiveMix LM class
|
|
* Added nbest-mix utility (interpolation of N-best posteriors)
|
|
* Added ngram -unk flag to handle open-class LMs
|
|
* Added disambig and hidden-ngram -text-map option
|
|
* Script enhancements:
|
|
- New script to convert nbest-lattice word graphs to PFSG
|
|
(wlat-to-pfsg)
|
|
- Added switches to include probabilities in wlat-to-dot and pfsg-to-dot
|
|
output.
|
|
- Conversion to/from AT&T FSM format: fsm-to-pfsg and pfsg-to-fsm
|
|
* ngram -rescore and associated scripts no longer set a hyp
|
|
probability to zero if it contains OOVs. Instead, the probability
|
|
is computed ignoring those words (more useful in practice).
|
|
A warning is output as always.
|
|
* Added ngram-count -float-counts option.
|
|
* Added build support for Linux/i686 platform.
|
|
|
|
1.00 8 June 2000
|
|
* Added ClassNgram class and ngram -classes option.
|
|
* Capability to convert class ngrams into word ngrams.
|
|
* New program ngram-class for automatic word class induction.
|
|
* Fixed interaction of ngram -mix-lm -bayes with non-standard n-grams:
|
|
can now build an interpolation of the non-standard (hidden-event,
|
|
class-based, etc.) n-gram with the additional, standard n-grams.
|
|
* Replaced LM.noiseTag with LM.noiseVocab (list of noise tags to
|
|
be ignored). Tools now take -noise-vocab option (as well as -noise
|
|
for backward compatibility).
|
|
* Made ngram -counts work for non-n-gram models.
|
|
* Added nbest-lattice -posterior-{amw,lmw,wtw} options to compute
|
|
word posteriors with different weightings from the one used in
|
|
hypothesis ranking. Also added -deletion-bias flag for explicit
|
|
control of del/ins errors (-use-mesh mode only).
|
|
* NBest rescoring methods now have optional acoustic model weight
|
|
(defaulting to 1 as before).
|
|
* New class RefList (list of reference transcripts).
|
|
* New class NBestSet (set of N-Best lists).
|
|
* NBest, NBestSet, and nbest-lattice optionally split multiwords into
|
|
their components on reading (-multiwords option).
|
|
* New nbest-optimize tool for finding near-optimal score combination
|
|
weights for word error minimizing N-best rescoring.
|
|
* New anti-ngram program, for computing posterior-weighted N-gram
|
|
counts from N-best lists.
|
|
* New nbest-rover script allows ROVER-style combination of hypotheses
|
|
from multiple N-best lists.
|
|
* New rescore-decipher -norescore option, to reformat N-best lists
|
|
without LM rescoring.
|
|
* Fixed bugs related to missing <s> and </s> in change-lm-vocab and
|
|
make-ngram-pfsg.
|
|
* Significant speedups in LMs involving dynamic programming
|
|
(HiddenNgram, DFNgram, HMMofNgrams) when interpolating with other
|
|
models or running in "ngram -debug 2" mode.
|
|
* Allow absolute discounting on fractional counts, for more
|
|
effective construction of models from fractional counts.
|
|
* Added ngram-merge -float-counts option, and allow "-" (stdin) as
|
|
input file.
|
|
* ngram-count ensures <s> unigram (with prob 0) is defined to avoid
|
|
breaking other programs.
|
|
* Added make-abs-discount script to compute absolute discounting
|
|
constants from Good-Turing statistics.
|
|
* compute-sclite and compare-sclite now take -multiwords option to
|
|
split compound words prior to scoring.
|
|
* Changed option handling so that unsigned option arguments are forced
|
|
to be non-negative.
|
|
* Added Map2 (2D Map) class to libdstruct.
|
|
* Much better string hash function (borrowed from Tcl).
|
|
* New man pages: training-scripts(1), lm-scripts(1), ppl-scripts(1),
|
|
pfsg-scripts(1), nbest-scripts(1), lm-format(5), classes-format(5),
|
|
pfsg-format(5), nbest-format(5).
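
The make-abs-discount entry above mentions deriving absolute discounting constants from Good-Turing statistics. As an illustration only, the sketch below uses the commonly cited estimate D = n1 / (n1 + 2*n2), where n1 and n2 are the numbers of N-grams seen exactly once and twice. Whether the script uses exactly this formula is an assumption, and the function name `abs_discount` is hypothetical.

```python
def abs_discount(n1, n2):
    """Estimate an absolute discounting constant from counts-of-counts.

    n1: number of distinct N-grams observed exactly once
    n2: number of distinct N-grams observed exactly twice
    Assumed estimate (not confirmed against the script): D = n1 / (n1 + 2*n2).
    """
    denom = n1 + 2 * n2
    if denom <= 0:
        raise ValueError("need at least one N-gram with count 1 or 2")
    return n1 / denom
```

The result lies between 0 and 1 and would be subtracted from every nonzero count before normalization.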
|
|
|
|
1.0.1 12 July 2000
|
|
|
|
Functionality:
|
|
|
|
* wordError() and nbest-lattice -dump-errors now also output the
|
|
location of deletions in the alignment (NOTE: possible code
|
|
incompatibility).
|
|
* New reverse-ngram-counts script.
|
|
|
|
Bug fixes:
|
|
|
|
* Workarounds for shortcomings in Linux gcc, math library, and linker.
|
|
* make-ngram-pfsg: don't ignore bigram states with zero BOW (bugfix).
|
|
* nbest-rover: fixed problem with handling of + lines.
|
|
|
|
1.1 21 May 2001
|
|
|
|
Functionality:
|
|
|
|
* HiddenNgram class generalized to deal with disfluency-type events
|
|
that manipulate the N-gram context.
|
|
* rescore-reweight script now accepts additional score directories
|
|
(and associated score weights) for combination of an arbitrary number
|
|
of knowledge sources.
|
|
* Enhanced rescore-decipher functionality:
|
|
- Option -lm-only to produce output containing LM scores only
|
|
- Option -pretty to perform word mapping on the fly.
|
|
- Warn about and handle LM scores that are NaN.
|
|
* New class VocabMultiMap, implementing dictionary-style mappings of
|
|
words to strings from another vocabulary.
|
|
* Added support for pronunciation-based word alignments in
|
|
WordMesh and nbest-lattice -use-mesh.
|
|
* Added nbest-lattice -keep-noise option to preserve pauses and noises
|
|
in alignments.
|
|
* Support for multiwords:
- make-multiword-pfsg expands PFSGs to use
|
|
multiwords (using AT&T FSM tools).
|
|
- multi-ngram expands N-gram LM to include multiwords.
|
|
* Added support for Decipher Intlog scaled log probabilities.
|
|
* Added ngram -seed option to initialize random sentence generation
|
|
(contributed by Eric Fosler).
|
|
* New add-pauses-to-pfsg pause= and version= options to allow
|
|
generation of Nuance-compatible PFSGs (see man page for details).
|
|
* The NBest class and scripts handle NBestList2.0 format containing
|
|
phone and/or state backtraces (by ignoring them).
|
|
* Added Amoeba search option to nbest-optimize (contributed by
|
|
Dimitra Vergyri).
|
|
* Added standard 1-best optimization mode to nbest-optimize.
|
|
* wlat-to-pfsg script now also processes confusion networks output by
|
|
nbest-lattice -use-mesh.
|
|
|
|
Bug fixes:
|
|
|
|
* ngram -decipher-nobackoff now applies to the -lm ngram as well if
|
|
option -decipher is also specified.
|
|
* ngram -expand-classes no longer dumps core when handling
|
|
"context-free" class expansions (though those aren't supported).
|
|
* gawk path in scripts is now adjusted prior to installation
|
|
(/usr/bin/gawk for Linux, /usr/local/bin/gawk elsewhere).
|
|
* Fixed numerical problems in nbest-rover/nbest-posteriors.
|
|
* ngram-count -float-counts behaved differently from equivalent
|
|
integer-count estimation; both integer and float counts now use
|
|
the same estimation code.
|
|
* Reduced memory requirements of nbest-optimize by about 25%.
|
|
* Minor changes for gcc-2.95.3.
|
|
|
|
1.1.1 20 July 2001
|
|
|
|
Functionality:
|
|
|
|
* WordMesh: new interface to record reference word string in alignment.
|
|
* nbest-lattice: confusion networks can now record reference words
|
|
if specified with -reference, and are preserved by -write/-read.
|
|
* replace-words-with-classes now has option to process ngram count
|
|
files (have_counts=1).
|
|
* merge-nbest: new utility to merge N-best hyps from multiple lists.
|
|
* wlat-stats: new utility to compute statistics of word posterior
|
|
lattices.
|
|
|
|
Bug fixes:
|
|
|
|
* GT discounting: fixed anomaly due to different floating point
|
|
precision on x86 platforms.
|
|
* anti-ngram(1): documented options previously omitted.
|
|
* WordMesh: reading/writing of confusion networks now preserves
|
|
total posterior mass.
|
|
* Changed the hypothesis alignment order in nbest-optimize to be
|
|
more compatible with decoding in nbest-lattice: first align nbest
|
|
hyps in order of decreasing (initial) scores, then align reference.
|
|
nbest-optimize -no-reorder keeps the old behavior (with references
|
|
anchoring the alignment). All scores and initial lambdas are now
|
|
used to compute initial posterior hyp probabilities to guide the
|
|
hypothesis alignment; thus, it now makes sense to restart an
|
|
optimization with partially optimized weights to revise the
|
|
alignments.
|
|
* nbest-optimize now warns about missing or incomplete score files.
|
|
* Fixed a memory access error in nbest-optimize -1best.
|
|
* Fixed weight normalization in nbest-optimize when first element is 0.
|
|
* Miscellaneous fixes for compile under RH Linux 7.0.
|
|
|
|
1.2 20 November 2001
|
|
|
|
Functionality:
|
|
|
|
* nbest-lattice -dictionary allows word alignments to be guided by
|
|
dictionary pronunciations.
|
|
* nbest-lattice -use-mesh -record-hyps records the rank of N-best hyps
|
|
contributing to each word hypothesis in the confusion network.
|
|
* nbest-lattice -no-rescore and -decipher-format options make it
|
|
more convenient as an N-best format conversion tool.
|
|
* VocabDistance: new class and subclasses to represent distance metrics
|
|
(e.g., phonetic distance) over vocabularies.
|
|
* WordMesh: output word hyps in order of decreasing posteriors.
|
|
* WordMesh: reading/writing of confusion networks now includes hyp IDs
|
|
from alignment.
|
|
* NBest/MultiAlign/WordMesh: support for keeping extra word-level
|
|
information (NBestWordInfo).
|
|
* nbest-lattice: unified single and multiple file processing.
|
|
New option -write-dir to write multiple output lattices.
|
|
New option -refs to supply multiple references.
|
|
Options -nbest-errors and -lattice-errors are replaced by
|
|
switches -nbest-error/-lattice-error, in conjunction with
|
|
-references/-refs. Outputs are now prefixed by utterance IDs
|
|
when processing multiple files.
|
|
* nbest-lattice -nbest-backtrace enables processing of backtrace
|
|
information from N-best lists; combined with -use-mesh this produces
|
|
sausages that contain word-level scores and alignment information,
|
|
as well as phone backtraces (see new wlat-format(5) man page).
|
|
* wlat-stats script now also computes error statistics when processing
|
|
confusion networks with references.
|
|
* nbest-rover now handles N-best lists in Decipher format.
|
|
* hidden-ngram and disambig: new option -fw-only to use only forward
|
|
probabilities for posterior computation.
|
|
* rescore-decipher -filter option to apply textual rewriting filters
|
|
to hypotheses before rescoring.
|
|
* segment-nbest -write-nbest-dir option for dumping rescored N-best
|
|
lists to a directory instead of to stdout.
|
|
* segment-nbest -start-tag and -end-tag options to insert tags at
|
|
margins of N-best hyps.
|
|
|
|
Bug fixes:
|
|
|
|
* WordMesh: computation of deletion costs using a dictionary distance
|
|
was completely bogus (only affected undocumented nbest-lattice
|
|
-dictionary option).
|
|
* nbest-lattice: correctly process -nbest-files using -dictionary in
|
|
alignment.
|
|
* nbest-rover: fixed to work on Linux
|
|
* hidden-ngram: don't abort when an event posterior is 0.
|
|
* hidden-ngram: avoid abort when *noevent* occurs in -hidden-vocab list.
|
|
* segment-nbest: now correctly uses ngram contexts longer than trigram.
|
|
* segment-nbest: optimized -bias 0 case by disallowing sentence
|
|
boundary states altogether.
|
|
* multi-ngram -prune-unseen-ngrams prevents insertion of multiword
|
|
N-grams whose component N-grams were not in the original model.
|
|
* ngram: fixed computation of mixture lambda for second LM when three
|
|
or more models are interpolated.
|
|
* nbest-posterior (and thus nbest-rover) no longer split multiwords by
|
|
themselves. To split multiwords with nbest-rover, append the
|
|
-multiwords option to the argument list, which is passed on to
|
|
nbest-lattice to achieve the desired effect.
|
|
* ngram -renorm now applies BEFORE class expansion or pruning of
|
|
model (in case input model is unnormalized).
|
|
* make-nbest-pfsg bug involving transition into final node fixed.
|
|
* Minor script changes to avoid warnings with gawk 3.1.0.
|
|
|
|
1.3 11 February 2002
|
|
|
|
Functionality:
|
|
|
|
* Trellis class, disambig and hidden-ngram tools: added support for
|
|
N-best decoding (contributed by Anand Venkataraman).
|
|
|
|
* MultiwordLM wrapper LM class as a convenient way to split multiwords
|
|
prior to LM evaluation.
|
|
|
|
* New MultiwordVocab class to support MultiwordLM.
|
|
|
|
* Added ngram -multiwords option (based on MultiwordLM wrapper).
|
|
|
|
* Added support for Chen & Goodman's Modified Kneser-Ney smoothing
|
|
and interpolated backoff estimates. See ngram-count options
|
|
-kndiscount[1-6], -kn[1-6], and -interpolate[1-6].
|
|
|
|
* New library and tool for lattice manipulation: lattice-tool.
|
|
|
|
* New nbest-mix -set-am-scores and -set-lm-scores options. These allow
|
|
setting either the AM or the LM scores in the N-best output to simulate
|
|
the combined posteriors, while preserving the other scores.
|
|
|
|
* Added some regression tests (test/ subdirectory).
|
|
|
|
* Support for Windows via CYGWIN porting layer (MACHINE_TYPE=cygwin).
|
|
See doc/README.windows for details.
|
|
|
|
Bug fixes:
|
|
|
|
* Trellis: deallocate old trellis nodes on demand in init(), rather
|
|
than preemptively in clear(). Greatly speeds up forward computation
|
|
for trellis-based LMs (e.g., ClassNgram).
|
|
|
|
* Textstats: fix to handle zero denominator in ppl computation.
|
|
|
|
* disambig: fixed off-by-one error indexing into trellis.
|
|
|
|
* Miscellaneous small fixes for compilation and operation under Windows
|
|
(using the CYGWIN environment).
|
|
|
|
Warning: See doc/README.x86 about a gcc compiler bug that might
|
|
affect you on Intel platforms.
|
|
|
|
1.3.1 25 June 2002
|
|
|
|
Functionality:
|
|
|
|
* nbest-optimize -write-rover-control option conveniently dumps a
|
|
control file for nbest-rover that encodes the optimized parameters.
|
|
* New regression tests for nbest-rover (i.e., nbest-lattice) and
|
|
nbest-optimize.
|
|
* nbest-posteriors, combine-acoustic-scores now all handle and
|
|
preserve Decipher N-best formats. This allows nbest-rover to
|
|
generate sausages with backtrace information if input N-best lists
|
|
contain it (using -nbest-backtrace option).
|
|
* New tool nbest-pron-score for computing pronunciation and pause LM
|
|
scores from N-best hypotheses.
|
|
* Added disambig -totals option to compute total string probabilities
|
|
(same as in hidden-ngram).
|
|
* reverse-lm: simple filter to reverse a bigram backoff LM.
|
|
* lattice-tool -collapse-same-words reduces lattices by merging all
|
|
nodes with identical words (but also creates new paths in lattice).
|
|
* nbest-lattice -prime-with-refs option uses reference strings
|
|
to improve sausage alignment.
|
|
* compute-best-sentence-mix: new script to optimize sentence-level
|
|
interpolation of LMs.
|
|
* nbest-lattice -lattice-files option to align multiple word lattices;
|
|
currently only works with -use-mesh (sausages).
|
|
* hidden-ngram now supports mixture and class N-gram LMs.
|
|
* New class SimpleClassNgram, a more efficient implementation of
|
|
ClassNgrams where each word is assumed to belong to at most one
|
|
class and class expansions are exactly one word long.
|
|
Enabled by -simple-classes switch in ngram, lattice-tool, and
|
|
hidden-ngram.
|
|
* ngram -counts now handles escaped input lines and LM state change
|
|
directives embedded in the input.
|
|
* New tool nbest-pron-score for scoring pronunciations and pauses in
|
|
N-best hypotheses.
|
|
* NgramStats::parseNgram() new function to parse N-gram counts from
|
|
a character string.
|
|
* LM::pplCountsFile() new function to evaluate LM on counts read from
|
|
a file.
|
|
|
|
Bug fixes:
|
|
|
|
* make-ngram-pfsg is no longer limited to trigram models.
|
|
* Avoid NaN values in disambig and hidden-ngram, in cases where lmw or
|
|
mapw are zero and the corresponding log probabilities are -Infinity.
|
|
* Avoid numerical problems in N-best posterior computation by using
|
|
AddLogP() to compute normalizer.
|
|
* anti-ngram no longer requires -refs argument with -all-ngrams.
|
|
* Fixed bug removing noise from N-best lists with backtrace.
|
|
* Code fixes for clean compiles with gcc 3.x.
|
|
* nbest-rover more efficient by using a single invocation of
|
|
nbest-lattice for all input N-best lists.
|
|
* ClassNgram: fixed handling of words that appear as members of a class
|
|
with zero probability, or have zero membership probability.
|
|
* nbest-lattice -record-hyps now outputs hyp ids according to the
|
|
original N-best order, rather than the sorted one.
|
|
* make-hiddens-lm now gives proper unigram probability to hidden-S tag.
|
|
* Compute acoustic scores in Decipher N-best-2 format by subtracting
|
|
token LM scores from total score. This deals correctly with cases where
|
|
the total scores have been adjusted by summing merged hyps, and are no
|
|
longer the sum of all AC and LM word scores.
|
|
* Gawk scripts that test for alphabetic or lowercase characters are
|
|
more portable and handle non-ascii and multibyte characters.
|
|
|
|
The package now includes a paper on SRILM, to appear in ICSLP-2002,
|
|
that gives an overview of the software and its design (doc/paper.ps).
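
The 1.3.1 fix for N-best posterior computation relies on adding probabilities in the log domain (AddLogP()). The following is a minimal sketch of that idea, not the library's code: it assumes natural-log scores for simplicity, and the names `add_log` and `posteriors` are hypothetical.

```python
import math

def add_log(x, y):
    """Return log(exp(x) + exp(y)) without leaving the log domain."""
    if x == -math.inf:
        return y
    if y == -math.inf:
        return x
    hi, lo = max(x, y), min(x, y)
    # Factor out the larger term so the exponent is never positive.
    return hi + math.log1p(math.exp(lo - hi))

def posteriors(log_scores):
    """Turn log hypothesis scores into posterior probabilities."""
    log_norm = -math.inf
    for s in log_scores:
        log_norm = add_log(log_norm, s)
    if log_norm == -math.inf:
        # Every hypothesis has zero probability; nothing to normalize.
        return [0.0 for _ in log_scores]
    return [math.exp(s - log_norm) for s in log_scores]
```

Exponentiating scores such as -1000 directly would underflow to zero and make the normalizer zero; accumulating the normalizer in the log domain avoids that.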
|
|
|
|
1.3.2 3 September 2002
|
|
|
|
New functionality:
|
|
|
|
* Added ngram and ngram-count -nonevents option to specify a
|
|
subset of words that are to be non-events, i.e., tokens that can only
|
|
occur in contexts (such as <s>).
|
|
* Extended ngram-count discounting options for up to 9-grams.
|
|
* Added support in Vocab and Ngram classes for processing meta-counts
|
|
(counts-of-counts).
|
|
* Added ngram-count -meta-tag and -kn-counts-modified options to
|
|
support make-big-lm.
|
|
* Added ngram-count -read-with-mincounts flag to suppress counts
|
|
below cutoff thresholds at reading time. This dramatically lowers
|
|
memory consumption, and speeds up make-big-lm operation (which used
|
|
to use a gawk script for the same purpose).
|
|
* Added option to specify vocabulary to add-pauses-to-pfsg for cases
|
|
where heuristics fail.
|
|
* lattice-tool can now handle arbitrary order LMs for expanding
|
|
lattices. The old trigram expansion algorithm is still available
|
|
with -old-expansion; the compact trigram algorithm is unchanged with
|
|
-compact-expansion.
|
|
* To better support lattice expansion, two new functions have been
|
|
added to the LM interface: contextID() takes an optional word
|
|
argument, to compute the context needed to predict a specific word,
|
|
and contextBOW() is a new interface to compute the backoff weight
|
|
associated with truncating a history.
|
|
* Added makefile support to generate executable versions that use
|
|
"compact" data structures. See item 9 in INSTALL for details, and
|
|
doc/time-space-tradeoff for a simple benchmark result.
|
|
|
|
Bug fixes:
|
|
|
|
* Convert pseudo-log(0) value (-99) in DARPA backoff models back to
|
|
true log(0) on reading. This ensures that non-event words in the
|
|
input are treated as zeroprobs (by the perplexity computation and
|
|
otherwise).
|
|
* Avoid NaN floating point results in N-best rescoring and
|
|
nbest-optimize, by handling 0 * log(0) more carefully.
|
|
* Handle -Inf AM and LM scores in SRILM N-best format.
|
|
* make-big-lm was reworked to support KN in addition to GT discounting.
|
|
Warning: the modified lower-order counts for KN are created using
|
|
merge-batch-counts and can get almost as big as the original counts.
|
|
Beware of the additional disk space and run time requirement!
|
|
* Clear out old parameters before reading or estimating N-gram models.
|
|
* Reading in new class definitions into ClassNgram object now deletes
|
|
old definitions (unless classes file is empty).
|
|
* Destructors for Ngram and ClassNgram now free N-gram and class
|
|
definition memory.
|
|
* nbest-pron-score: avoid core dump when pronunciation information is
|
|
missing from N-best list.
|
|
* make-ngram-pfsg: fixed generation of unigram PFSGs.
|
|
* Avoid use of toupper() in add-pauses-to-pfsg.
|
|
* Handle ngram-count -order 0 and print warning.
|
|
* Avoid using zcat in scripts since it behaves differently on different
|
|
systems and depending on PATH setting.
|
|
* nbest-lattice and nbest-optimize no longer strip a filename part
|
|
following '.' to derive utterance ids; only known file suffixes
|
|
are removed.
|
|
* Fixed bugs in member declarations that were preventing TaggedVocab,
|
|
TaggedNgramStats, and StopNgramStats from working correctly.
|
|
* compute-sclite now ignores utterances with a reference of
|
|
"ignore_time_segment_in_scoring", consistent with NIST STM scoring.
|
|
* Vocab.h now defines SArray_compareKey() for strings over VocabIndex,
|
|
allowing use as keys in sorted arrays.
|
|
* ClassNgram now uses the processed words as the context after an OOV.
|
|
This works better when the input contains context cue tags.
|
|
* i386-solaris platform was not being detected by machine-type script.
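
The 1.3.2 fix for NaN results in N-best rescoring concerns the product 0 * log(0), which IEEE floating point evaluates to NaN. One way to handle it, sketched below under the assumption that a zero weight should simply drop its score, is to skip zero-weighted terms. The function name `weighted_log_score` is hypothetical and this is not the toolkit's actual code.

```python
import math

def weighted_log_score(weights, log_scores):
    """Combine log scores as sum(w * s), treating 0 * log(0) as 0."""
    total = 0.0
    for w, s in zip(weights, log_scores):
        if w == 0.0:
            # A zero weight removes the score entirely, even if it is -inf.
            continue
        total += w * s
    return total
```

With a nonzero weight, a -inf score still drives the total to -inf, which is the intended result for a hypothesis of probability zero.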
|
|
|
|
1.3.3 2 March 2003
|
|
|
|
New functionality:
|
|
|
|
* Increased maximum number of interpolated LMs in ngram, hidden-ngram,
|
|
and lattice-tool to 10.
|
|
* ngram now computes static interpolation (N-gram merging) of up to 10
|
|
input LMs (consistent with handling of dynamic interpolation).
|
|
* ngram and lattice-tool -limit-vocab option limits LM reading to
|
|
those parameters that pertain to words specified by -vocab.
|
|
The LM::read() function got an optional second argument for this
|
|
purpose.
|
|
ngram -limit-vocab -renorm now effectively does the same as the
|
|
change-lm-vocab script. However, the main purpose of -limit-vocab
|
|
is to save memory by discarding N-grams that are not relevant to a
|
|
test set.
|
|
* rescore-decipher -limit-vocab precomputes the vocabulary used by
|
|
N-best lists and invokes ngram -limit-vocab to allow rescoring with
|
|
very large models on machines with little memory.
|
|
* Ngram::mixProbs() now has version that destructively merges an Ngram
|
|
into an existing model. ngram -mix-lm now uses this version, instead
|
|
of the old, non-destructive one, thereby achieving considerable time
|
|
and space savings (only two models, rather than 3, have to be kept in
|
|
memory at a time).
|
|
* ngram-count and ngram -map-unk option, to change the "unknown" word
|
|
token string.
|
|
* compute-sclite, compare-sclite now understand multiple -S options to
|
|
specify intersections of several utterance subsets for scoring.
|
|
* make-batch-counts now ignores lines in input file list that start
|
|
with # (allowing comments in the file list).
|
|
* Added replace-words-with-classes partial=1 option to prevent
|
|
multi-word replacements that include multiple whitespace characters
|
|
(i.e., "a b" is only replaced with a single space between the words).
|
|
* New LM script: sort-lm, reorders N-grams lexicographically, as
|
|
required by some other software (e.g., Sphinx3, pointed out by
|
|
Mikko Kurimo <mikkok@james.hut.fi>).
|
|
* New training script: reverse-text, reverses word order in text file.
|
|
* New pfsg script: pfsg-vocab, extracts vocabulary used in PFSGs.
|
|
|
|
Bug fixes:
|
|
|
|
* disambig and hidden-ngram -keep-unk now also causes LM to be
|
|
treated as open-vocabulary.
|
|
* HiddenNgram class (debug level 2) was omitting the event after
|
|
the last word from the Viterbi backtrace.
|
|
* ngram -expand-classes was including -pau- word in expanded LM.
|
|
* Made backoff computation in Ngram::wordProbBO() more efficient,
|
|
avoiding multiple lookups in the context trie. Gives about a 30%
|
|
speedup in ngram -debug 3 -ppl.
|
|
* ngram -lm reading is faster by about 8% due to a code optimization.
|
|
* ngram-count -order 2 -kndiscount3 no longer aborts with an error.
|
|
The -order option effectively limits the discounting parameters
|
|
computed, so that the model order can be changed without having to
|
|
adjust the smoothing options.
|
|
* make-big-lm -trust-totals option is ignored with KN discounting, since
|
|
they don't work well together.
|
|
* make-big-lm now checks that input counts files are not stdin.
|
|
* Reading N-best lists in Decipher format now sets the number-of-words
|
|
score, so that weight rescoring, optimization etc. can use them.
|
|
* ngram-count normalizes the N-gram probabilities for a context to 1
|
|
if the backoff distribution for that context has probability mass 0.
|
|
The latter can happen e.g. if all N-grams for a context have been
|
|
observed and received discounted probabilities. The fix ensures that
|
|
the overall distribution is normalized in this case.
|
|
* rescore-reweight now accepts Decipher N-best lists.
|
|
* nbest-posteriors and nbest-rover now handle Decipher version 2
|
|
N-best lists better (allowing LM and WT weights to be applied).
|
|
* Initialize locale in all top-level programs. disambig, hidden-ngram,
|
|
segment, and segment-nbest were missing it, causing potential problems
|
|
with non-ASCII characters.
|
|
* nbest-lattice -write-vocab option to find vocabulary used in N-best
|
|
list.
|
|
* nbest-pron-score now uses idFromFilename() function to avoid
|
|
over-truncating filenames when inferring sentence ids.
|
|
* Added more strippable filename suffixes in idFromFilename() function.
|
|
* NBest: correctly read in phone backtraces that are time-reversed.
|
|
* compute-oov-rate ignores -pau- tokens.
|
|
* Various N-best scripts now process input directories containing links
|
|
(rather than plain files) correctly.
|
|
* Lattice class takes care to limit range of intlog transition
|
|
probabilities in PFSG output, so as to avoid overflow when converting
|
|
to bytelog scale.
|
|
* make-ngram-pfsg removes temporary file (now placed in /tmp) even
|
|
when killed by signal.
|
|
* Hidden-event and DF N-gram models are documented in detail in ngram
|
|
man page.
|
|
* Test suite result comparisons against reference output now use a
|
|
script that ignores small numerical discrepancies, so as to produce
|
|
fewer false alarms.
|
|
|
|
Portability:
|
|
|
|
* Compiles under MacOS X (MACHINE_TYPE=macosx), thanks to help from
|
|
wooters@icsi.berkeley.edu and jean-philippe.demoulin@enst.fr.
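
The 1.3.3 ngram-count fix above renormalizes a context whose backoff distribution has zero probability mass. The sketch below illustrates the situation with a simplified backoff model; the function `finalize_context` and its dictionary arguments are hypothetical and do not reflect the actual estimation code.

```python
def finalize_context(discounted, lower_order):
    """Return (probs, backoff_weight) for one N-gram context.

    discounted:  word -> discounted probability of N-grams seen in this context
    lower_order: word -> probability under the backoff (lower-order) distribution
    """
    seen_mass = sum(discounted.values())
    # Mass the backoff distribution assigns to words NOT seen in this context.
    unseen_mass = sum(p for w, p in lower_order.items() if w not in discounted)
    if unseen_mass <= 0.0:
        # Every word was observed: the discounted mass cannot be handed to
        # backoff, so rescale the explicit probabilities to sum to 1.
        probs = {w: p / seen_mass for w, p in discounted.items()}
        return probs, 0.0
    backoff_weight = (1.0 - seen_mass) / unseen_mass
    return dict(discounted), backoff_weight
```

In the first branch the leftover mass created by discounting would otherwise be lost, leaving the context's distribution summing to less than 1.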
|
|
|
|
1.4 14 February 2004
|
|
|
|
New functionality:
|
|
|
|
* Added support for factored language models, developed by Katrin
|
|
Kirchhoff and Jeff Bilmes, and implemented by Jeff Bilmes.
|
|
A new library, libflm.a, and two new tools, fngram-count and fngram
|
|
are built in the flm/ directory. A conference paper and a technical
|
|
report are included as documentation in flm/doc/. Questions and bug
|
|
reports should be directed to bilmes@ee.washington.edu.
|
|
FLM support has also been integrated into some of the standard
|
|
tools (ngram and hidden-ngram) and is enabled by the -factored option.
|
|
|
|
* Added support in lattice-tool to read/write and rescore HTK lattices.
|
|
See lattice-tool man page for details.
|
|
* The lattice expansion algorithm for general LMs now preserves
|
|
pause and null nodes. Consequently, lattice-tool no longer eliminates
|
|
pause and null nodes prior to applying this algorithm, unless
|
|
-no-pause or -compact-pause was specified.
|
|
* Implemented a new algorithm to build word meshes (confusion networks,
|
|
sausages) from lattices, that is faster than the original Mangu et al.
|
|
method. lattice-tool -posterior-decode uses this to extract 1-best
|
|
word hypotheses, and lattice-tool -write-mesh allows writing of
|
|
sausages to file.
|
|
* The "compact" lattice expansion algorithm that uses backoff nodes
|
|
(described in Weng et al. 1998) has been generalized to handle
|
|
LMs of arbitrary order. As before, this algorithm is triggered by
|
|
lattice-tool -compact-expansion. (To get the old version, which
|
|
handles only trigrams and produces non-identical results, use
|
|
lattice-tool -compact-expansion -old-expansion.)
|
|
* lattice-tool -density allows pruning of lattices to a specified
|
|
density (in addition to the posterior threshold).
|
|
* lattice-tool -multi-char option allows designating characters other
|
|
than underscore as multiword delimiters.
|
|
* Added a "LatticeLM" class that emulates a language model using the
|
|
transition probabilities in a lattice. This is useful for debugging
|
|
and comparing the probabilities assigned by lattices to corresponding
|
|
LM probabilities. A new option lattice-tool -ppl makes use of this
|
|
class (analogous to ngram -ppl).
|
|
* lattice-tool lattice algebra operations (or, concatenate) can now
|
|
be applied to multiple input lattices, always using the same lattice
|
|
as second operand.
|
|
|
|
* ngram has enhanced N-best rescoring functionality, allowing
|
|
multiple input lists to be rescored (-nbest-files, -write-nbest-dir,
|
|
-decipher-nbest, -no-reorder, -split-multiwords).
|
|
* rescore-decipher -fast enables a faster rescoring mode that uses
|
|
only the built-in functions of ngram, thus running much faster.
|
|
* New option ngram -rescore-ngram to recompute the probabilities in
|
|
an N-gram model using an arbitrary other LM.
|
|
|
|
* Added original (unmodified) Kneser-Ney discounting (ngram-count
|
|
-ukndiscountN options). Contributed by Jeff Bilmes.
|
|
* New disambig -classes option to read vocabulary maps in
|
|
classes-format(5).
|
|
* New disambig -write-counts option to output word/class substitution
|
|
bigram counts (useful to reestimate class membership probabilities).
|
|
* nbest-pron-score -pause-score-weight creates a weighted combination
|
|
of pronunciation and pause LM scores.
|
|
* compute-sclite -noperiods option to delete periods from hyps
|
|
for scoring purposes.
|
|
* New script empty-sentence-lm to modify existing LM to allow
|
|
the empty sentence with a given probability.
|
|
* compute-sclite handles CTM files in RT-03 format.
|
|
* ngram-class -debug 2 prints the initial word-to-class assignments,
|
|
so that the entire class tree can be reconstructed from the output.
|
|
* RefList class has option to read and look up reference words without
|
|
associated ID strings (indexed by integers).
|
|
* Enhanced WordMesh and WordLattice classes to have an optional
|
|
"name" field, used to record utterance ids.
|
|
* New select-vocab command to implement likelihood-optimizing
|
|
vocabulary selection from multiple corpora. Contributed by
|
|
Anand Venkataraman and Wen Wang. See man page for details.
|
|
|
|
Bug fixes:
|
|
|
|
* ngram avoids reading classes file multiple times if -limit-vocab
|
|
is not being used (otherwise it is unavoidable, and will lead to
|
|
errors if the reading is from stdin).
|
|
* Fixed some bugs in compare-sclite and compute-sclite.
|
|
* Modified ngram and compute-best-mix so that the latter works
|
|
with ngram -counts output. ngram -counts now outputs the count
|
|
values != 1 for each N-gram so that compute-best-mix can take them
|
|
into account in the optimization.
|
|
* rescore-reweight and nbest-rover were not handling Decipher N-best
|
|
lists correctly when additional score directories are given.
|
|
* nbest-rover -wer disables use of nbest-lattice -use-mesh option,
|
|
so nbest-rover can be used for old-style word error minimization
|
|
(or even 1-best rescoring, by also specifying -max-rescore 1).
|
|
* lattice-tool -ref-file and -ref-list were being ignored when
|
|
processing only a single input lattice. Fixed so that lattice error
|
|
can now be computed with either -input-lattice or -input-lattice-list.
|
|
* Enhanced MultiwordLM class with new contextID() and contextBOW()
|
|
versions that better reflect the backoff behavior of the wrapped LM
|
|
class. Makes it much more efficient to use the lattice-tool -multiword
|
|
option, i.e., expand a multiword lattice with a non-multiword LM.
|
|
* rescore-decipher -pretty had a bug that caused mapping to be applied
|
|
to the score fields as well, potentially corrupting the format.
|
|
* Fixed bugs in mixture lambda computation (ngram, hidden-ngram,
|
|
lattice-tool), triggered by more than one lambda being zero, or using
|
|
more than 5 mixtures.
|
|
* lattice-tool algebra operations used to crash if operand lattices
|
|
contained NULL nodes.
|
|
* Non-compressed files ending in .gz can now be read successfully.
|
|
* Catch a possible 0/0 problem in the Good-Turing discount estimator.
|
|
* Fixed memory management for strings returned by TaggedVocab::getWord()
|
|
thereby avoiding garbled results.
|
|
* lattice-tool -pre-reduce-iterate and -post-reduce-iterate arguments
|
|
were not being used to control the number of lattice reduction iterations.
|
|
* Fixed an uninitialized memory bug that could produce random results
|
|
in posterior probability computation (and hence in lattice pruning).
|
|
* Fixed a bug in lattice pruning triggered by unnormalized posteriors
|
|
greater than 1.
|
|
|
|
Portability:
|
|
|
|
* Fixed some problems compiling with gcc-3.2.2; eliminated compile-
|
|
time warnings about division by zero in constant definitions.
|
|
* Rewrote some code to work around limitations and warnings in the
|
|
Intel C++ compiler. (In return, got compiled code that runs 10-20%
|
|
faster!) For processor-specific optimizations, use
|
|
make MACHINE_TYPE=i686-p4 .
|
|
* Fixed some script problems that surfaced in latest gawk version.
|
|
* Fixed some problems compiling with Tcl/Tk-8.4.1.
|
|
* FreeBSD support (contributed by Zhang Le <ejoy@peoplemail.com.cn>).
|
|
* Updated Nuance-related features in PFSG scripts and man page.
|
|
* Note: Integration of FLM support required some changes to the
|
|
Vocab and Ngram class interface. In particular, several member
|
|
variables (e.g., Boolean Vocab::unkIndex) have been replaced by virtual
|
|
member functions that return references to the variables (e.g.,
|
|
Boolean &Vocab::unkIndex()). This requires changes, albeit trivial ones,
|
|
to any client code that accesses these variables.
|
|
|
|
1.4.1 9 May 2004
|
|
|
|
Functionality:
|
|
|
|
* New option lattice-tool -htk-quotes to enable the HTK quoting
|
|
mechanism that allows whitespace and non-printable characters to be
|
|
used in word labels. (This is disabled by default since other SRILM
|
|
tools don't allow such word strings.)
|
|
* New option lattice-tool -add-refs to add a path corresponding to
|
|
the reference word string to each lattice.
|
|
* New option ngram -counts-entropy to compute entropy (log probabilities
|
|
weighted by joint N-gram probability) from counts.
|
|
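The -counts-entropy item above describes entropy as log probabilities
weighted by joint N-gram probability. A minimal Python sketch of that
sum follows. It is only an illustration, not the ngram implementation:
the function name counts_entropy, the callables joint_prob and
cond_prob, the (context, word) layout, and the choice of base-2 logs
with a positive sign are all assumptions made for this example.

```python
import math

def counts_entropy(ngrams, joint_prob, cond_prob):
    """Return -sum over listed N-grams of P(h, w) * log2 P(w | h), in bits.

    ngrams:     iterable of (context_tuple, word) pairs (hypothetical layout)
    joint_prob: callable (context_tuple, word) -> joint probability P(h, w)
    cond_prob:  callable (context_tuple, word) -> conditional probability P(w | h)
    """
    entropy = 0.0
    for context, word in ngrams:
        # Each log probability is weighted by the joint probability of the N-gram.
        entropy -= joint_prob(context, word) * math.log2(cond_prob(context, word))
    return entropy

# Toy model: two equally likely continuations of the context ("a",).
toy_ngrams = [(("a",), "b"), (("a",), "c")]
toy_entropy = counts_entropy(toy_ngrams,
                             joint_prob=lambda h, w: 0.5,
                             cond_prob=lambda h, w: 0.5)
print(toy_entropy)
```

With two equiprobable events the sketch yields one bit, which is a quick
sanity check on the weighting.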
|
|
Bugs fixed:
|
|
|
|
* nbest-lattice could core dump if references were not supplied.
|
|
* FLM/ProductVocab: fixed problems with mapping of <s> and </s> to
|
|
factored form.
|
|
* Lattice algebra operations (or, concatenate) now preserve HTK link
|
|
information and lattice names.
|
|
* Fixed LM::contextProb() handling of <s> and other non-event tokens.
|
|
This also allowed Ngram::computeContextProb() to be eliminated.
|
|
* LatticeFollowIter iterator no longer takes lookahead parameter --
|
|
lookahead is unlimited and cycles are avoided by keeping a table of
|
|
visited nodes. This also greatly speeds up lattice expansion in
|
|
some cases.
|
|
* Detect negative discounts in modified Kneser-Ney method, arising
|
|
from non-monotonic counts-of-counts.
|
|
* Fixed various debugging output messages in the Lattice class.
|
|
|
|
Portability:
|
|
|
|
* Matthias Thomae <thomae@ei.tum.de> found that make-ngram-pfsg
|
|
(and probably other gawk scripts) may not work correctly with recent
|
|
versions of gawk unless the environment is set to LC_NUMERIC=C.
|
|
|
|
1.4.2 19 October 2004
|
|
|
|
Functionality:
|
|
|
|
* lattice-tool -factored option to handle factored LMs (analogous
|
|
to ngram and hidden-ngram).
|
|
* lattice-tool -nbest-decode generates N-best lists from lattices
|
|
(contributed by Dustin Hillard, University of Washington).
|
|
* lattice-tool -output-ctm option to generate CTM-formatted 1-best
|
|
output, either with -viterbi-decode or with -posterior-decode.
|
|
Of course this requires HTK input lattices containing timemarks.
|
|
* Added version of WordMesh::minimizeWordError() that returns acoustic
|
|
information in a NBestWordInfo array, to support the above.
|
|
* lattice-tool -insert-pause option to insert optional pause nodes in
|
|
lattices.
|
|
* lattice-tool -unk will map unknown words to <unk> instead of
|
|
automatically augmenting the vocabulary (the -map-unk option allows
|
|
the mapping of unknown words to be customized).
|
|
* lattice-tool -acoustic-mesh records word times, scores, and phone
|
|
alignments when confusion networks are built.
|
|
* lattice-tool -ignore-vocab option to define the set of words that
|
|
are ignored in LM processing (like pause nodes).
|
|
* lattice-tool -write-ngrams option to compute expected N-gram counts
|
|
from lattices.
|
|
* HTK lattices now support up to three "extra" score fields (x1..x3),
|
|
which can be used to rescore hypotheses with arbitrary non-standard
|
|
knowledge sources.
|
|
* Added support for the "s" key in HTK lattices (used to encode
|
|
state alignment info).
|
|
* anti-ngram -min-count option to prune N-grams with expected frequency
|
|
below specified threshold.
|
|
* ngram -adapt-marginals and related options to trigger use of
|
|
unigram marginals adaptation, following Kneser et al. (Eurospeech 97).
|
|
* New LM class AdaptMarginals to support the above.
|
|
* nbest-lattice and lattice-tool -hidden-vocab option allows specifying
|
|
a subvocabulary that should not be aligned with regular words when
|
|
building confusion networks.
|
|
* New VocabDistance subclass SubvocabDistance, to support the above.
|
|
* nbest-optimize -combine-linear and -non-negative options, useful to
|
|
optimize linear combinations of posterior probability scores.
|
|
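The -adapt-marginals item above refers to unigram marginals adaptation
following Kneser et al. A minimal Python sketch of the general idea is
given here, under stated assumptions: each background conditional
probability is scaled by the ratio of adapted to background unigram
probability raised to an exponent beta, and the result is renormalized
over the vocabulary. The function name adapt_marginals, the dictionary
layout, and the default beta are illustrative and are not taken from the
AdaptMarginals class.

```python
def adapt_marginals(cond_probs, base_unigram, adapt_unigram, beta=0.5):
    """Rescale P_bg(w | h) for one fixed history h and renormalize.

    cond_probs:    dict word -> background conditional probability P_bg(w | h)
    base_unigram:  dict word -> background unigram probability (must be > 0)
    adapt_unigram: dict word -> adapted (target) unigram probability
    beta:          exponent applied to the unigram ratio (assumed parameter)
    """
    scaled = {
        w: p * (adapt_unigram[w] / base_unigram[w]) ** beta
        for w, p in cond_probs.items()
    }
    z = sum(scaled.values())  # normalizer for this history
    return {w: s / z for w, s in scaled.items()}

# Toy example: the adapted marginals favor "a" over "b".
background = {"a": 0.5, "b": 0.5}
adapted = adapt_marginals(cond_probs=background,
                          base_unigram={"a": 0.5, "b": 0.5},
                          adapt_unigram={"a": 0.8, "b": 0.2},
                          beta=1.0)
print(adapted)
```

The normalizer depends on the history, so a real implementation has to
sum over the vocabulary for each context it evaluates.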
|
|
Bugs fixed:
|
|
|
|
* lattice-tool: Avoid disconnecting lattice in density pruning.
|
|
* Utility script installation was not working for Cygwin hosts.
|
|
* ProductNgram::contextID() now returns hash code of context used,
|
|
instead of zero, and limits context-used length to order-1.
|
|
* HTK lattice output was omitting wdpenalty value.
|
|
* Improved collision-prone hash function for VocabIndex arrays.
|
|
* Documented order of operations in lattice-tool(1).
|
|
* Fixed excessive /tmp space usage in nbest-rover script, so as to
|
|
avoid frequent incomplete output with large N-best data as a result
|
|
of running out of disk space.
|
|
* Fixed bug in compute-sclite that would garble STM references without
|
|
the optional 6th field.
|
|
* Fixed bug in Trie::insert(), which would always set foundP = true,
|
|
even if a new entry was created.
|
|
* Preserve Lattice::limitIntlogs flags in lattice algebra operations.
|
|
* Use sorted node map iteration in lattice-tool expansion algorithms,
|
|
so that results are not subject to pseudo-random hash table ordering.
|
|
* HTK lattice output no longer has more nodes/links than input
|
|
(provided -no-htk-nulls, -htk-scores-on-nodes, or -htk-words-on-nodes
|
|
are NOT used).
|
|
* Take default lattice name from input filename, rather than output
|
|
filename (which may not be defined), however:
|
|
* The embedded names of output lattices from binary lattice operations
|
|
are derived from the output file name.
|
|
* Fixed bug in reading of word meshes (confusion networks) introduced
|
|
in release 1.4.
|
|
* Fixed a bug in alignments of multiple confusion networks, affecting
|
|
cases where the inputs have posterior masses != 1.
|
|
|
|
1.4.3 3 December 2004
|
|
|
|
Functionality:
|
|
|
|
* Increased the number of extra scores supported in HTK lattices
|
|
(x1, x2, ... x9).
|
|
* lattice-tool -nbest-viterbi option to use Viterbi N-best algorithm,
|
|
which uses less memory (contributed by Jing Zheng).
|
|
* Added nbest-lattice -output-ctm analogous to lattice-tool.
|
|
* Make -output-ctm output word posteriors in the confidence field.
|
|
* Extend the meaning of the nbest-lattice -max-rescore option so that,
|
|
in lattice mode, it limits the number of hypotheses that are aligned.
|
|
(The meaning of -max-rescore was previously only defined in N-best
|
|
rescoring mode).
|
|
* Added -version option to all top-level programs.
|
|
|
|
Bug fixes:
|
|
|
|
* Improved efficiency and duplicate elimination in A-star N-best
|
|
generation (contributed by Jing Zheng).
|
|
* Worked around a problem with gawk scripts in Linux handling of
|
|
/dev/stderr device which can cause a file to be truncated if stderr is
|
|
redirected to it.
|
|
* MultiAlign::addWords() was not preserving NBestWordInfo.
|
|
|
|
Other:
|
|
|
|
* Various small code changes for compilation with gcc 3.4.3.
|
|
* Maintenance scripts moved to $SRILM/sbin/.
|
|
* Support for commercial releases excluding third-party code
|
|
contributions.
|
|
|
|
1.4.4 6 May 2005
|
|
|
|
Functionality:
|
|
|
|
* ngram-count now allows use of -wbdiscount, -kndiscount, etc.,
|
|
without a specified N-gram order, to set the default discounting
|
|
method for all N-gram orders. As before, this can be overridden by
|
|
-wbdiscount[1-9], -kndiscount[1-9], etc., for specific N-gram
|
|
lengths (suggested by Anand).
|
|
* lattice-tool -keep-pause has additional side-effects if used with
|
|
-nonevents and -ignore-vocab (making pauses behave like regular words).
|
|
* lattice-tool -dictionary-align option triggers use of dictionary
|
|
pronunciations for word mesh alignment (contributed by Dustin Hillard).
|
|
* New option lattice-tool -nbest-duplicates allows control over the
|
|
number of duplicate word hypotheses to output (from Dustin Hillard).
|
|
* Update to the FLM tools from Kevin Duh, to make fngram-count use the
|
|
-vocab option to limit the vocabulary of the estimated model.
|
|
* Added nbest-optimize -hidden-vocab option to constrain the alignment
|
|
of a subvocabulary (analogous to nbest-lattice -hidden-vocab).
|
|
* wlat-stats computes the posterior expected number of words in the
|
|
input lattice.
|
|
|
|
Bug fixes:
|
|
|
|
* ngram -unk maps unknown words in N-best hyps to <unk> instead of
|
|
adding them to the vocabulary.
|
|
* lattice-tool: Don't punt when encountering a NULL word node with
|
|
pronunciation, output a warning instead.
|
|
* lattice-tool -nbest-decode now uses a double-ended heap data
|
|
structure, and -nbest-max-stack drops hypotheses from the bottom
|
|
of the heap instead of the top (contributed by Dustin Hillard).
|
|
* lattice-tool -nbest-decode now does more thorough duplicate removal
|
|
(not just adjacent duplicates are removed).
|
|
* lattice-tool no longer gives an error if input lattice has posteriors
|
|
specified on nodes (even though they are effectively ignored).
|
|
* select-vocab: miscellaneous bug fixes from Anand.
|
|
* nbest-lattice: fixed various bugs with -nbest-backtrace option.
|
|
* compute-sclite: work around bug in csrfilt.sh -dh affecting waveform
|
|
names containing hyphens.
|
|
* Minor tweaks for MacOSX build.
|
|
|
|
1.4.5 28 August 2005
|
|
|
|
Functionality:
|
|
|
|
* ngram -debug 0 -ppl now outputs statistics for each input section
|
|
delimited by escape lines, in addition to overall results (based on
|
|
a modification by Dustin Hillard). ngram -debug 1 and higher behave as
|
|
before.
|
|
* ngram -loglinear-mix implements log-linear mixture LMs.
|
|
* LoglinearMix: new class to support the above.
|
|
* VocabMap: added remove(.) method to remove all entries for given
|
|
source word.
|
|
* WordMesh: added wordColumn() function to return confusion set at
|
|
given position (contributed by Dustin).
|
|
* Lattice: added readMesh() function to read in confusion networks
|
|
(from Dustin).
|
|
* lattice-tool -read-mesh allows handling of input in confusion network format
|
|
(from Dustin).
|
|
* nbest-optimize -1best-first implements a heuristic strategy whereby
|
|
the relative score weights are first optimized in -1best mode, followed
|
|
by full optimization together with posterior scale.
|
|
* nbest-optimize -max-time forces search to time out if new best
|
|
weights aren't found within a certain number of seconds.
|
|
* New script combine-rover-controls to merge multiple nbest-rover
|
|
control files for system combination.
|
|
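The -loglinear-mix item above refers to log-linear mixture LMs. As a
rough Python sketch of the general technique (not the LoglinearMix
class): component probabilities for the same context are combined as a
weighted sum in the log domain and then renormalized over the
vocabulary. The name loglinear_mix and the dictionary layout are
assumptions, and all component probabilities are assumed to be nonzero.

```python
import math

def loglinear_mix(dists, weights):
    """Combine distributions as p(w) proportional to prod_i p_i(w) ** weight_i.

    dists:   list of dicts word -> probability, all over the same vocabulary
    weights: list of mixture exponents, one per distribution
    """
    vocab = list(dists[0].keys())
    # Weighted sum of log probabilities for every word.
    log_scores = {
        w: sum(lam * math.log(d[w]) for d, lam in zip(dists, weights))
        for w in vocab
    }
    # Renormalize so the mixed scores form a distribution again.
    z = sum(math.exp(s) for s in log_scores.values())
    return {w: math.exp(s) / z for w, s in log_scores.items()}

p1 = {"a": 0.9, "b": 0.1}
p2 = {"a": 0.5, "b": 0.5}
mixed = loglinear_mix([p1, p2], [0.5, 0.5])
print(mixed)
```

Unlike linear interpolation, the result is not automatically normalized,
which is why the explicit sum over the vocabulary is needed.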
|
|
Bug fixes:
|
|
|
|
* disambig clears old map entries when encountering a duplicate
|
|
definition for a source word.
|
|
* nbest-optimize: posterior scaling of fixed weights was broken.
|
|
* WordMesh, nbest-lattice: do better error checking on reading
|
|
confusion network files, handle numalign and posterior specs out of
|
|
order.
|
|
* lattice-tool had a bug in the handling of HTK format lattices that
|
|
do not contain an explicit specification of initial/final nodes.
|
|
* Added proper copy constructors and assignment operators for
|
|
Array, SArray, and LHash classes. This in turn makes the copy
|
|
constructor for NgramLM and other classes work properly.
|
|
(Assignment still doesn't work for some higher-level classes because
|
|
of reference (&) variable members.)
|
|
* Fixed minor bug in the ngram -skipoovs implementation, found by
|
|
Alexandre Patry.
|
|
|
|
Portability:
|
|
|
|
* Port to win32-mingw platform (by Jing Zheng). Doesn't support
|
|
compressed file i/o, or the -max-time options in nbest-optimize and
|
|
lattice-tool.
|
|
* Minor tweaks for compilation with gcc-4.0.1.
|
|
* Renamed HTKLink class to HTKWordInfo, which is more appropriate and
|
|
avoids a naming conflict with SRI's Decipher software.
|
|
|
|
1.4.6 20 January 2006
|
|
|
|
Functionality:
|
|
|
|
* Added support for reading/writing files compressed with bzip2
|
|
(file suffix .bz2). Requires that the bzip2/bunzip2 binaries be
|
|
installed.
|
|
|
|
Bug fixes:
|
|
|
|
* Lattice class now creates completely empty lattices (no nodes).
|
|
This avoids having to first remove a node when reading an actual
|
|
lattice. Empty lattices can be output, but not read (because at
|
|
least an initial/final node has to be defined).
|
|
* lattice-tool -ignore-vocab was not being used in conjunction with
|
|
-viterbi-decode, -posterior-decode, -collapse-same-words, and lattice
|
|
error computation. Words to be ignored are now treated the same as
|
|
-noise-vocab in those operations.
|
|
* Fixed a bug in lattice expansion whereby backoff weights were
|
|
dropped at NULL nodes (problem noticed by Teemu Hirsimaki).
|
|
* Fixed bug in reading of node-specific posterior probabilities
|
|
in word meshes.
|
|
* Fixed a bug in lattice-tool -read-mesh, which was not creating
|
|
sentence initial/final tags on initial/final lattice nodes.
|
|
* Fixed a bug in the LatticeFollowIter class that could cause incorrect
|
|
results in LatticeLM (lattice-tool -ppl).
|
|
* When outputting PFSG lattices in HTK format, map PFSG weights to
|
|
HTK acoustic scores. (But, as before, LM rescoring discards input
|
|
PFSG weights and causes the probabilities to be output as LM scores.)
|
|
* Scale wdpenalty values specified in lattice according to log-base.
|
|
Also, scale -htk-wdpenalty specified on command line according to
|
|
-htk-logbase (or default 10).
|
|
* Correctly handle HTK score output with -htk-logbase 0.
|
|
|
|
Portability:
|
|
|
|
* Added workaround for compilers that don't support arrays of
|
|
non-constant size (such as SunStudio and Visual C++). On these
|
|
systems, Array will be used instead.
|
|
|
|
* Added a new compilation option "_s" that triggers use of 2-byte
|
|
integers for vocabulary indices and counts. With compilers that
|
|
implement __attribute__((packed)) correctly, this causes N-gram counts
|
|
to use 1/3 less memory than with the default option, at the cost of some limitations
|
|
in functionality. First, only vocabularies of up to 64k words may
|
|
be used. Second, only up to 32k counts exceeding 32k may be stored.
|
|
The latter is typically not a problem because in most natural data
|
|
the number of very frequent words is small.
|
|
Unfortunately, gcc does not currently handle __attribute__((packed))
|
|
correctly, but Intel's icc does.
|
|
|
|
* Tested on Linux for PowerPC-64bit.
|
|
|
|
* Tested on Linux for x86_64, using gcc.
|
|
|
|
* Minor tweaks for Intel icc 8.0.
|
|
|
|
* Tested on Solaris-x86 using Sun Studio 11 compiler.
|
|
Compilation still generates lots of warnings, but the resulting
|
|
binaries work correctly.
|
|
|
|
* Ported to Microsoft Visual C 7.0 (by Jing Zheng).
|
|
See doc/README.windows-msvc.
|
|
|
|
* gcc versions older than 3.4.3 are no longer supported, though
|
|
they might still work.
|
|
|
|
1.5.0 31 July 2006
|
|
|
|
Functionality:
|
|
|
|
* Added support for a binary data format for N-gram backoff models
|
|
which speeds up the reading of model files by a factor of 2
|
|
for full models, and by an order of magnitude if -limit-vocab is used.
|
|
Note that the binary format is machine architecture dependent.
|
|
See the ngram -write-bin-lm option (contributed by Jing Zheng).
|
|
|
|
* disambig now supports Bayesian or standard interpolation of up to
|
|
10 LMs, just like ngram and hidden-ngram.
|
|
|
|
* Added disambig -factored option to support factored hidden tag LMs.
|
|
|
|
* Added disambig -escape option to pass information unprocessed to
|
|
the output, similar to hidden-ngram.
|
|
|
|
* New utility script: split-tagged-ngrams, see training-scripts(1)
|
|
man page.
|
|
|
|
* New function Vocab::checkWords() for more efficient implementation
|
|
of the ngram -limit-vocab functionality.
|
|
|
|
* Modified compute-sclite to support scoring of overlapped speech
|
|
with asclite program.
|
|
|
|
* New NgramCountLM class implementing a mixture of count-based
|
|
maximum-likelihood estimators (aka deleted interpolation aka
|
|
Jelinek-Mercer smoothing).
|
|
|
|
* ngram-count and ngram -count-lm options to implement deleted
|
|
estimation and evaluation of NgramCountLM models.
|
|
This option is also supported by hidden-ngram, disambig, and
|
|
lattice-tool.
|
|
|
|
* Added support for ngram counts stored in an indexed directory
|
|
structure, based on a format developed by Thorsten Brants for data
|
|
delivered to LDC by Google. This data format can be used in
|
|
conjunction with the NgramCountLM class, and may be generated
|
|
from standard ngram count files using the make-google-ngrams script
|
|
(see training-scripts(1)).
|
|
|
|
* Added NgramStats::clear() function.
|
|
|
|
* Added the limitVocab option to the NgramStats::read() function.
|
|
In conjunction with NgramCountLM, this allows use of arbitrarily
|
|
large N-gram statistics on limited test sets.
|
|
|
|
* Added ngram-count -limit-vocab option.
|
|
|
|
* Added hidden-ngram -vocab and -limit-vocab options.
|
|
Possible incompatibility: the -hidden-vocab wordlist must not contain
|
|
the *noevent* word; it is added implicitly.
|
|
|
|
* Added lattice-tool -write-vocab option to extract vocabulary from
|
|
lattice files.
|
|
|
|
* Added lattice-tool -init-mesh option to align lattice to preexisting
|
|
confusion network.
|
|
|
|
* Added an interface for vocabulary aliasing (name mapping) to
|
|
the Vocab class, and the option -vocab-aliases to the programs
|
|
disambig, hidden-ngram, lattice-tool, nbest-lattice,
|
|
ngram-count, and ngram. This allows direct use of LMs with
|
|
slightly mismatched vocabularies relative to some test data.
|
|
Also, added handling of the -vocab-aliases option to the
|
|
rescore-decipher script, so that large name mapping files can
|
|
be subsetted when -limit-vocab is in effect (so that only the
|
|
relevant portions of an LM are loaded).
|
|
|
|
* disambig now automatically limits LM reading to the words found in
|
|
the map file (suggested by Jing Zheng).
|
|
|
|
* hidden-ngram -bayes and -bayes-length options added to give more
|
|
control over interpolation.
|
|
|
|
* The default count type is now "unsigned long" instead of
|
|
"unsigned int". This makes no difference on 32-bit platforms,
|
|
but on 64-bit platforms it allows the handling of data upwards of
|
|
4.3 billion tokens (which would cause integer overflow on 32-bit
|
|
machines).
|
|
|
|
* For 32-bit platforms, added a compile option "_l", which triggers
|
|
use of 64-bit "long long" integers for count storage.
|
|
This uses the XCount class to avoid needing extra memory for count
|
|
storage, assuming that large count values will be sparse.
|
|
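The NgramCountLM item above describes a mixture of count-based
maximum-likelihood estimators (deleted interpolation, also known as
Jelinek-Mercer smoothing). The following Python sketch shows the basic
idea only. It uses one fixed weight per N-gram order, which is a
simplification chosen for this example; the function name jm_prob, the
layout of the counts dictionary, and the uniform floor term are
assumptions rather than a description of the NgramCountLM class.

```python
def jm_prob(word, context, counts, lambdas, vocab_size):
    """Interpolate a uniform distribution with ML estimates of each order.

    counts:     dict mapping N-gram tuples to counts; the empty tuple ()
                holds the total token count (hypothetical layout)
    lambdas:    weights [uniform, unigram, bigram, ...], summing to 1
    vocab_size: number of words, used for the uniform term
    """
    prob = lambdas[0] / vocab_size              # uniform floor
    for order in range(1, len(lambdas)):
        hist_len = order - 1
        if hist_len > len(context):
            break                               # not enough history available
        hist = tuple(context[len(context) - hist_len:])
        denom = counts.get(hist, 0)
        if denom > 0:
            # Maximum-likelihood estimate count(h, w) / count(h) for this order.
            prob += lambdas[order] * counts.get(hist + (word,), 0) / denom
    return prob

toy_counts = {(): 4, ("a",): 2, ("b",): 2, ("a", "b"): 2}
toy_lambdas = [0.2, 0.3, 0.5]
p_b = jm_prob("b", ["a"], toy_counts, toy_lambdas, vocab_size=2)
p_a = jm_prob("a", ["a"], toy_counts, toy_lambdas, vocab_size=2)
print(p_b, p_a)
```

In this sketch, the weight of an order whose history was never observed
is simply dropped, so the result is only normalized when all histories
have nonzero counts.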
|
|
Bug fixes:
|
|
|
|
* Fixed a bug in the handling of -mix-lm[789] options in ngram,
|
|
hidden-ngram and lattice-tool. (With the -bayes option in effect,
|
|
the -mix-lm6 argument was used for -mix-lm[789].)
|
|
|
|
* Fixed memory management in the XCount implementation, which was
|
|
giving incorrect results when compiling with OPTION=_s.
|
|
|
|
* disambig no longer adds <s> and </s> tokens if input already
|
|
contains them (consistent with ngram).
|
|
|
|
* lattice-tool -read-mesh was broken in the previous release, now
|
|
works again.
|
|
|
|
* lattice-tool -density-prune and -nodes-prune now work without
|
|
-posterior-prune being specified.
|
|
|
|
* The -debug option was being ignored with ngram -null .
|
|
|
|
* Fixed a bug in Vocab::remove(VocabString) that could be triggered by
|
|
interactions between ngram -vocab and -vocab-aliases .
|
|
|
|
* Tweaks to MACHINE_TYPE=msvc compilation. Updated documentation in
|
|
doc/README.windows-cygwin and doc/README.windows-msvc.
|
|
|
|
* Tweaked compiler flags for Solaris to handle files larger than 2^31.
|
|
|
|
* Prevent possible NaN probabilities in ClassNgram.
|
|
|
|
* Fixed a problem in make-ngram-pfsg triggered by a word named "BO".
|
|
|
|
* Support long int key values in data structures.
|
|
|
|
* rescore-decipher -filter option now works correctly in conjunction
|
|
with -limit-vocab.
|
|
|
|
1.5.1 20 November 2006
|
|
|
|
Functionality:
|
|
|
|
* ngram-count -write-binary is a new option to create binary count
|
|
files, which load much faster. They are recognized automatically by
|
|
ngram-count -read, and can be used in count-based LMs.
|
|
|
|
* Revised binary backoff LM format (ngram -write-bin-lm) to use only
|
|
a single data file and be machine-independent and somewhat more
|
|
compact. Reading the 1.5.0 binary format is still supported, but not
|
|
writing it.
|
|
|
|
* Added lattice-tool -bayes and -bayes-scale options for compatibility
|
|
with ngram and other programs.
|
|
|
|
* New lattice-tool -write-ngram-index option to generate an index of
|
|
N-gram occurrences in a lattice.
|
|
|
|
* New lattice-tool -multiword-dictionary option enables accurate
|
|
handling of acoustic information (timestamps, pronunciations) when the
|
|
-split-multiwords option is used (contributed by Dustin Hillard).
|
|
|
|
* New nbest-optimize -insertion-weight and -word-weights options to
|
|
implement weighted forms of word error optimization.
|
|
|
|
* New option make-ngram-pfsg no_empty_bo=1 to disallow an empty (null)
|
|
path through the PFSG via the unigram backoff.
|
|
|
|
* New script get-unigram-probs to extract unigram probabilities from
|
|
an LM file.
|
|
|
|
Bug fixes:
|
|
|
|
* Enabled large-file (64bit offsets) handling for Linux 32bit
|
|
compilation.
|
|
|
|
* Fixed utility and test scripts to support platforms that don't
|
|
support compressed file I/O. Check test/README for instructions.
|
|
|
|
* Fixed bug in compute-sclite that could lead to failure if
|
|
waveform names contain hyphens, or sort differently after mapping to
|
|
lowercase.
|
|
|
|
* Fixed another bug in compute-sclite that was preventing
|
|
compare-sclite from working.
|
|
|
|
* Fixed a typo-bug in Ngram::estimate that could cause problems in
|
|
handling discounting errors, but in practice seems to have been
|
|
harmless (from Federico Cesari).
|
|
|
|
* Improved MSVC portability:
|
|
- fixed header file usage
|
|
- enabled binary file i/o for binary LMs
|
|
- fixed miscellaneous compiler warnings
|
|
- simplified build (see doc/README.windows-msvc)
|
|
- workaround in WordMesh.cc to avoid a compiler bug (from
|
|
Federico Cesari).
|
|
|
|
* Fixed win32 (Windows gcc, not cygwin) build.
|
|
|
|
1.5.2 6 March 2007
|
|
|
|
Functionality:
|
|
|
|
* Support binary LM formats (based on Ngram binary format) for most
|
|
LM classes.
|
|
|
|
* New lattice-tool -htk-logzero option to set a dummy score to
|
|
replace zero scores found in HTK lattices.
|
|
|
|
Bug fixes:
|
|
|
|
* Make sure Google ngrams can be read in both compressed and
|
|
uncompressed format if platform supports both.
|
|
|
|
* Make sure the file pointer is updated when reading binary Ngram LM.
|
|
This enables reading multiple LMs from one file, and avoids errors
|
|
reading binary class-LMs.
|
|
|
|
* Avoid NaN values when a lattice score is infinity and the
|
|
corresponding scale factor is 0 (the score is ignored in that case).
|
|
|
|
* Avoid degenerate decoding results if lattice hypotheses contain
|
|
-infinity scores. (Effectively, -infinity is replaced by a large
|
|
negative log score, thus allowing the decoder to rank hypotheses based
|
|
on their non-infinity components.)
|
|
|
|
* Updated lattice-tool man page to clarify the interaction of
|
|
LM rescoring and lattice decoding.
|
|
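Two of the fixes above concern infinite scores: an infinite score
multiplied by a zero scale factor yields NaN, and -infinity scores can
make hypotheses unrankable. A small Python sketch of both workarounds
follows. The function name combine_scores and the constant LOG_FLOOR
are hypothetical; the actual replacement value used by lattice-tool is
not stated in these notes.

```python
import math

LOG_FLOOR = -1.0e10  # assumed stand-in for "a large negative log score"

def combine_scores(scores, weights, floor=LOG_FLOOR):
    """Weighted sum of log scores that avoids NaN and -infinity results."""
    total = 0.0
    for score, weight in zip(scores, weights):
        if weight == 0.0:
            continue                 # score is ignored; avoids inf * 0 = NaN
        if score == -math.inf:
            score = floor            # keeps the remaining components comparable
        total += weight * score
    return total

# Naive arithmetic gives NaN; the guarded version ignores the zero-weight score.
naive = -math.inf * 0.0
guarded = combine_scores([-math.inf, -2.0], [0.0, 1.0])

# Two hypotheses that both contain -infinity can still be ranked.
better = combine_scores([-math.inf, -2.0], [1.0, 1.0])
worse = combine_scores([-math.inf, -5.0], [1.0, 1.0])
print(naive, guarded, better > worse)
```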
|
|
Portability:
|
|
|
|
* Added configuration for Solaris amd64 platform with
|
|
Sun C compiler (amd64-solaris_spro).
|
|
|
|
* Updated instructions for MSVC build (see doc/README.windows-msvc),
|
|
based on input from Mike Frandsen.
|
|
Merge MSVC .manifest files into binary before installation.
|
|
|
|
1.5.3 28 July 2007
|
|
|
|
Functionality:
|
|
|
|
* New ngram-count -write-binary-lm option to output LM in binary format
|
|
(avoids the need to dump ascii format first, and then convert to
|
|
binary using ngram tool).
|
|
|
|
* New make-google-ngrams yahoo=1 option to read Yahoo ngram corpus
|
|
(which needs to be sorted first, however).
|
|
|
|
* New make-big-lm -ngram-filter option to pipe input counts through
|
|
an arbitrary filter program (e.g., for format conversion).
|
|
|
|
* The make-kn-discount utility will now try to estimate missing
|
|
counts-of-counts based on their global statistics, using an empirical
|
|
law: log f(k) - log f(k+1) = C / k for some constant C.
|
|
Note this functionality is not implemented in the C++ code for KN
|
|
discounting. Therefore, it is only available when building LMs with
|
|
make-big-lm.
|
|
|
|
* New scripts tolower-ngram-counts and uniq-ngram-counts to help
|
|
manipulate counts files.
|
|
|
|
* New option ngram-count -write-vocab-index (for debugging).
|
|
|
|
* Vocab.h: Increased maxWordLength constant from 256 to 1024.
|
|
|
|
* Trie class can now initialize root node size with optional constructor
|
|
argument (similar to other container classes).
|
|
|
|
* LHash and SArray classes have a new function to preallocate space
|
|
following construction (but before any data is inserted).
|
|
|
|
* The platform "i686-p4" has been renamed "i686-icc" (Linux x86 with
|
|
Intel compiler) for consistency.
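
The empirical law quoted above for missing counts-of-counts, log f(k) - log f(k+1) = C / k, can be turned into a small extrapolation routine. The sketch below is an assumption about how such an estimate could be computed (averaging C over adjacent known pairs); it is not taken from the gawk script, and the function names are made up for the example.

```python
import math

def estimate_c(counts_of_counts):
    """Average k * (log f(k) - log f(k+1)) over adjacent known pairs."""
    values = []
    for k, fk in counts_of_counts.items():
        fk1 = counts_of_counts.get(k + 1)
        if fk > 0 and fk1 is not None and fk1 > 0:
            values.append(k * (math.log(fk) - math.log(fk1)))
    if not values:
        raise ValueError("need at least one adjacent pair f(k), f(k+1)")
    return sum(values) / len(values)

def fill_missing(counts_of_counts, max_k):
    """Fill missing f(k), k <= max_k, using f(k+1) = f(k) * exp(-C / k)."""
    c = estimate_c(counts_of_counts)
    filled = dict(counts_of_counts)
    for k in range(1, max_k):
        if k in filled and k + 1 not in filled:
            filled[k + 1] = filled[k] * math.exp(-c / k)
    return filled

# Synthetic counts-of-counts that follow the law exactly with C = 1.
known = {1: 1000.0}
known[2] = known[1] * math.exp(-1.0 / 1)
known[3] = known[2] * math.exp(-1.0 / 2)
filled = fill_missing(known, max_k=5)
```

Here f(4) and f(5) are missing from the input and are extrapolated from f(3) using the estimated constant.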
|
|
|
|
Bugs:
|
|
|
|
* Fixed a buffer overrun problem triggered by nbest rescoring of
|
|
empty hypotheses.
|
|
|
|
* Fixed problem in compute-sclite with extraction of speaker labels
|
|
from ctm files.
|
|
|
|
* NBest class (affecting nbest-pron-score): strip Decipher-specific
|
|
phone diacritic labels separated by underscores from pronunciation
|
|
strings.
|
|
|
|
* Fixed memory leak in Trie::removeTrie(). This was causing a leak
|
|
in NgramLM deallocation.
|
|
|
|
* Fixed a performance bug which caused the building of unigram
|
|
hash tables to have quadratic time complexity (due to an unfortunate
|
|
interaction between hash table iterators and hash functions).
|
|
|
|
* Made make-big-lm detect missing -read option and print usage message.
|
|
Also, handles degenerate -kndiscount with -order 1 now.
|
|
|
|
* Workaround for icc compiler error: optimization disabled for some
|
|
files when using MACHINE_TYPE=i686-m64-icc.
|
|
|
|
1.5.4 2 November 2007
|
|
|
|
Functionality:
|
|
|
|
* New option ngram-count -addsmooth for additive smoothing.
|
|
A corresponding new discounting subclass "AddSmooth" is defined in
|
|
Discount.h.
|
|
|
|
* New option ngram -server-port to start a "probability server"
|
|
(based on a contribution by Elad Dinur).
|
|
|
|
* WordLattice: print lattice name in warning messages.
|
|
|
|
* lattice-tool -keep-unk option to preserve labels of OOV words in
|
|
LM rescoring (currently works only for HTK lattices).
|
|
|
|
* New option nbest-optimize -anti-refs and -anti-ref-weight to
|
|
decorrelate errors with another set of hypotheses.
|
|
|
|
* New support in nbest-optimize for BLEU optimization and Powell search
|
|
(from Jing Zheng).
|
|
|
|
* New option ngram-class -save-maxclasses to start the saving of
|
|
intermediate results when a specified number of classes is reached
|
|
(suggested by Shlomo Wavrow and Mats Svenson).
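
For readers unfamiliar with the additive smoothing introduced by -addsmooth above, the textbook form is P(w|h) = (c(h,w) + delta) / (c(h) + delta * |V|). The sketch below implements that textbook formula only; it makes no claim about the details of the AddSmooth class, and the function name and arguments are assumptions for illustration.

```python
from collections import Counter

def add_smooth_prob(word, context_counts, vocab, delta=1.0):
    """Additively smoothed probability of `word` in one fixed context.

    context_counts maps each word to its count after the context;
    vocab is the full vocabulary, including unseen words.
    """
    total = sum(context_counts[w] for w in vocab)
    return (context_counts[word] + delta) / (total + delta * len(vocab))

counts = Counter({"a": 3, "b": 1})   # "c" was never seen in this context
vocab = ["a", "b", "c"]
probs = {w: add_smooth_prob(w, counts, vocab, delta=1.0) for w in vocab}
```

With delta = 1 the unseen word "c" receives probability 1/7, and the three probabilities still sum to one.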
|
|
|
|
Bugs:
|
|
|
|
* Fixed incorrect reference output for test "nbest-rover-acoustic".
|
|
|
|
* Fixed a possible problem with tests "ngram-class" and
|
|
"ngram-count-lm-limit-vocab" in non-C locales.
|
|
|
|
* nbest-lattice: Avoid aligning reference words with -dump-errors or
|
|
-wer, which would cause crash because no lattice is being generated
|
|
internally.
|
|
|
|
* make-batch-counts, merge-batch-counts: be more portable by dynamically
|
|
finding the right options to use with xargs.
|
|
|
|
* add-pauses-to-pfsg: Avoid using a regular expression construct that
|
|
causes a gawk error in UTF-8 locales. However, to ensure this works
|
|
correctly a gawk version of 3.1.5 should be used. See note in
|
|
doc/README.linux. If the test "make-ngram-pfsg" fails a workaround is
|
|
to set LANG=C or LANG=en_US and avoid UTF-8.
|
|
|
|
* Fixes an uninitialized member variable in the unary constructor for
|
|
class File, which was causing garbage to be returned on the first
|
|
getline().
|
|
|
|
* common/Makefile.machine.macos: Updated Tcl linking instructions
|
|
(from Chuck Wooters).
|
|
|
|
* Makefile: exit immediately if any of the subdirectories result in
|
|
build errors.
|
|
|
|
1.5.5 6 November 2007
|
|
|
|
Bug fixes:
|
|
|
|
* Fixed Makefile problem in binaries depending on libraries that was
|
|
preventing executables being generated on some platforms.
|
|
|
|
* Fixed a compilation problem with MSVC for nbest-optimize.
|
|
|
|
* Use MSVC _getpid() in ngram -generate random seed initialization.
|
|
|
|
1.5.6 2 January 2008
|
|
|
|
Functionality:
|
|
|
|
* New ngram -use-server option to run the client side of a network LM
|
|
server as implemented by ngram -server-port. Optionally, probabilities
|
|
may be cached in the client (option -cache-served-ngrams).
|
|
Mixtures of one or more network and file-based LMs are also possible.
|
|
|
|
* Likewise, disambig, hidden-gram, and lattice-tool understand the
|
|
-use-server option.
|
|
|
|
* New LMClient class to implement the above (a stub LM subclass that
|
|
queries a server for LM probabilities).
|
|
|
|
* ngram -server-port now behaves like a true server daemon: it handles
|
|
multiple simultaneous or sequential clients, and never exits (unless
|
|
killed). The number of simultaneous clients may be limited with the
|
|
-server-maxclients option.
|
|
|
|
* Support for 7-zip compressed files (suggested by Alexy Khrabrov).
|
|
|
|
* lattice-tool -split-multiwords will now print a warning message
|
|
about multiwords that were not split because their LM probability was
|
|
non-zero.
|
|
|
|
* LoglinearMix LM class supports n-way mixtures directly, giving more
|
|
efficient implementation for n > 2 than recursive object construction
|
|
in ngram (contributed by Tanel Alumae).
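
As a rough illustration of the n-way log-linear mixture mentioned above, the sketch below combines several word distributions for a single context as P(w) proportional to the product of P_i(w) raised to lambda_i, then renormalizes. It assumes every component assigns nonzero probability to every word, and it is not the LoglinearMix implementation; the names are made up for the example.

```python
import math

def loglinear_mix(dists, weights):
    """Combine distributions over the same vocabulary log-linearly.

    dists: list of dicts mapping word -> probability (all > 0).
    weights: one exponent (lambda) per distribution.
    """
    vocab = list(dists[0].keys())
    unnorm = {
        w: math.exp(sum(lam * math.log(d[w]) for d, lam in zip(dists, weights)))
        for w in vocab
    }
    z = sum(unnorm.values())  # normalizer for this context
    return {w: v / z for w, v in unnorm.items()}

p1 = {"a": 0.5, "b": 0.5}
p2 = {"a": 0.9, "b": 0.1}
mixed = loglinear_mix([p1, p2], [1.0, 1.0])
```

Handling all n components in one pass needs a single normalization per context, which is presumably where the efficiency gain over nested two-way mixtures comes from.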
|
|
|
|
Bug fixes:
|
|
|
|
* MultiwordLM now implicitly adds all words to the vocabulary, so that
|
|
previously unseen multiwords get split. This has the side effect that
|
|
OOVs will appear as zeroprob words.
|
|
|
|
Documentation:
|
|
|
|
* The doc/FAQ file has been expanded and reformatted as a man page.
|
|
It can be viewed with "man srilm-faq" or online at
|
|
http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.html .
|
|
The major content additions are questions about the build
|
|
process, how to build a "Google N-gram LM", smoothing issues,
|
|
and OOV-handling (the latter by Deniz Yuret). Corrections and
|
|
additions to this document are most welcome!
|
|
|
|
* A new manual page ngram-discount(7) gives a detailed overview of
|
|
smoothing methods found in SRILM (contributed by Deniz Yuret).
|
|
|
|
* The conversion of man pages to html has been enhanced to better
|
|
handle code samples and nested itemized lists.
|
|
|
|
1.5.7 14 October 2008
|
|
|
|
Functionality:
|
|
|
|
* make-big-lm -text option allows building of LMs that only contain
|
|
N-gram contexts that are needed for a given test set, thus saving
|
|
space.
|
|
|
|
* ngram-count -intersect option allows reading of counts to be
|
|
restricted to an N-gram subset.
|
|
|
|
* NgramStats added a Boolean switch "intersect" and a method
|
|
setCount(), used for implementing the above.
|
|
|
|
* Allow changing the character used to compound multiwords, using the
|
|
new option -multi-char with ngram, anti-ngram, nbest-lattice,
|
|
nbest-optimize, nbest-pron-score, and several of the nbest-scripts.
|
|
|
|
* New options -no-sos and -no-eos for ngram-count and ngram tools,
|
|
to control the insertion of <s> and </s> tokens around sentences.
|
|
|
|
* New lattice-tool -no-expansion option to decode a lattice with a
|
|
new LM without first expanding the lattice (contributed by Jing Zheng).
|
|
|
|
* New CachedMem mix-in class to implement a caching memory allocator
|
|
(contributed by Jing Zheng).
|
|
|
|
* Added lattice-tool -print-sent-tags option to preserve <s> and </s>
|
|
tags in lattice output format, instead of mapping them to null nodes.
|
|
|
|
Documentation:
|
|
|
|
* Added redirecting http links to non-SRILM program documentation
|
|
in manual pages.
|
|
|
|
Portability:
|
|
|
|
* Removed SRI-specific paths etc. from common/Makefile.machine.* .
|
|
Added a mechanism that allows site-specific customizations to be
|
|
recorded in common/Makefile.site.$MACHINE_TYPE to override definitions
|
|
in common/Makefile.machine.$MACHINE_TYPE, without a need to change the
|
|
latter.
|
|
|
|
Bug fixes:
|
|
|
|
* Always output the elements of binary count files and ngram LMs
|
|
in index-sorted order (same as the _c program version). This avoids
|
|
poor performance when reading the data back in.
|
|
|
|
* Fixed LMClient.h so it compiles on win32 and msvc platforms (even
|
|
though it still doesn't do anything, since Unix sockets are not
|
|
supported).
|
|
|
|
* Process ngram-count -writeN options after applying count smoothing,
|
|
so that the effect of any count modifications (e.g., by KN) is seen,
|
|
and consistent with the -write option.
|
|
|
|
* Fixed the timestamps on initial and final nodes of lattice-tool
|
|
-operation or (bug found by gaojie@hccl.ioa.ac.cn).
|
|
|
|
* NgramLM: Handle cases where interpolated discounting leaves no
|
|
backoff probability mass.
|
|
|
|
* AdaptiveMarginals: Now handles words that are added after LM was
|
|
created. This can happen in N-best rescoring and would previously
|
|
cause an assertion failure.
|
|
|
|
* Fixed bugs in IntervalHeap memory allocation, which could
|
|
cause problems in N-best generation from lattices (from Jing Zheng).
|
|
|
|
* Set LC_NUMERIC=C in make-big-lm to avoid problems with non-C
|
|
locales for gawk scripts that compute discounting parameters.
|
|
|
|
1.5.8 10 May 2009
|
|
|
|
Functionality:
|
|
|
|
* merge-batch-counts -float-counts option for merging of fractional
|
|
counts.
|
|
|
|
* compare-sclite now includes statistical significance computation
|
|
based on a matched-pair Sign test.
|
|
|
|
* Added a Perl tool to compute the cumulative binomial distribution,
|
|
contributed by Brett Kessler and David Gelbart.
|
|
|
|
* Don't output LM server banner message for ngram -use-server -debug 0.
|
|
|
|
* The LM::generateSentence() function now takes an optional argument to
|
|
specify a sentence prefix that is to be used to condition subsequent
|
|
word generation (suggested by Alexy Khrabrov). The default is to
|
|
condition on <s> as before, or an empty context if no start-of-sentence
|
|
tag is defined.
|
|
|
|
* A new option ngram -gen-prefixes to read conditioning prefixes
|
|
from a file, and generate random sentences based on them.
|
|
|
|
* New options in nbest-optimize that modify -print-hyps output so that
|
|
only unique hypotheses are included (-print-unique-hyps), and to print
|
|
the original ranks of hypotheses (-print-old-ranks) (from Jing Zheng).
|
|
|
|
* The -version option reports whether support for compressed files
|
|
is available.
|
|
|
|
* Added merge-batch-counts -l option to control how many files to merge
|
|
in each iteration.
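
The matched-pair Sign test and the cumulative binomial distribution mentioned above fit together as follows. This is a generic sketch of the statistical test, not the compare-sclite or Perl code; whether the script reports a one-sided or two-sided value is not stated here, so the two-sided form below is an assumption, as are the function names.

```python
from math import comb

def cumulative_binomial(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def sign_test(errors_a, errors_b):
    """Two-sided sign test on paired per-utterance error counts.

    Ties are discarded; under the null hypothesis each remaining pair
    is equally likely to favor either system.
    """
    wins_a = sum(a < b for a, b in zip(errors_a, errors_b))
    wins_b = sum(a > b for a, b in zip(errors_a, errors_b))
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    return min(1.0, 2.0 * cumulative_binomial(min(wins_a, wins_b), n))

errors_a = [1, 0, 2, 1, 0, 3]
errors_b = [2, 1, 3, 2, 1, 3]   # last pair is a tie and is ignored
p_value = sign_test(errors_a, errors_b)
```

In this example system A wins five of the five non-tied pairs, giving a two-sided p-value of 2 * 0.5^5.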
|
|
|
|
Bug fixes:
|
|
|
|
* ngram-count, NgramLM: disable the Doug Paul smoothing hack (add one
|
|
to denominator when smoothing results in 0 backoff mass) in contexts
|
|
where the entire vocabulary has been observed.
|
|
|
|
* nbest-optimize fixes to the -minimum-bleu-reference functionality
|
|
(from Jing Zheng).
|
|
|
|
* Fixed nbest-optimize bug that was causing incorrect log output with
|
|
gcc 4.x.
|
|
|
|
* Output vocabulary index map in binary ngram count and LM format
|
|
in numerical index order. This avoids a performance bug whereby
|
|
reading the data structures back into _c binary version could take
|
|
a long time due to inefficient insertion order.
|
|
|
|
* Fix ngram -counts with -use-server (from Ergun Bicici).
|
|
|
|
* Fixed memory allocation bug in FLM tag vocabulary handling that could
|
|
lead to crash when interpolating several FLMs.
|
|
|
|
* Rewrote make-batch-counts scripts to
|
|
- avoid problems with limits on command line length
|
|
- support systems that don't have compressed file I/O.
|
|
|
|
* Modified merge-batch-counts script to
|
|
- ensure that unmerged files are always merged in the next iteration,
|
|
to avoid file size imbalance (suggested by Alex Marin)
|
|
- support systems that don't have compressed file I/O.
|
|
|
|
* Fixed a portability issue with Intel icc version 7.0.
|
|
|
|
* compute-sclite fixed to invoke csrfilt.sh script with -t option.
|
|
|
|
1.5.9 24 August 2009
|
|
|
|
Functionality:
|
|
|
|
* Added ngram-count -text-has-weights option to scale counts on a
|
|
per-sentence basis.
|
|
|
|
* LMStats::countString() and NgramStats::countSentence() methods
|
|
generalized to take optional weight string argument (to support the
|
|
above change).
|
|
|
|
* Added compile-time option to generate position-independent code
|
|
(make MAKE_PIC=yes, see INSTALL file).
|
|
|
|
* Added support for xz-compressed files (.xz files offer better
|
|
compression than .gz at the expense of time and memory).
|
|
The xz tool has to be installed separately (http://tukaani.org/xz).
|
|
|
|
Bug fixes:
|
|
|
|
* wlat-to-pfsg generates NULL output labels for initial/final nodes
|
|
with sentence start/end tags (because PFSGs encode those implicitly).
|
|
|
|
* TaggedVocab: check and report if number of tags/words exceeds max.
|
|
Make number of bits allocated for tags/words proportional to
|
|
word size. Parse word/tag strings such that last (not the first)
|
|
slash (/) character is treated as the delimiter.
|
|
|
|
* Documented the lattice-tool -ngrams-time-tolerance option that had
|
|
been previously implemented but omitted from the man page.
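
The TaggedVocab change above, treating the last rather than the first slash as the word/tag delimiter, matters for words that themselves contain slashes. A minimal sketch of that parsing rule follows; the function name is hypothetical and this is not the TaggedVocab code.

```python
def split_word_tag(token):
    """Split 'word/tag' at the LAST slash; return (word, tag).

    Tokens without any slash are returned with tag None.
    """
    word, sep, tag = token.rpartition("/")
    if not sep:
        return token, None
    return word, tag

pairs = [split_word_tag(t) for t in ["hello/NN", "1/2/NUM", "plain"]]
```

Splitting at the first slash would instead have parsed "1/2/NUM" as word "1" with tag "2/NUM".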
|
|
|
|
1.5.10 7 Jan 2010
|
|
|
|
Functionality:
|
|
|
|
* New option ngram -float-counts to allow the -counts option to
|
|
process fractional counts.
|
|
|
|
* The LM::pplCountsFile() and LM::countsProb() have been templatized
|
|
(as a function of count type), and the TextStats class now uses double
|
|
float counts, all in support of the above change.
|
|
|
|
* New option lattice-tool -word-posteriors-for-sentences for computing
|
|
word posteriors based on confusion networks (contributed by Jing Zheng).
|
|
|
|
* lattice-tool now performs confusion network decoding and ngram
|
|
computation AFTER rescoring or expansion with LMs. Therefore the two
|
|
operations can be combined in a single run where previously two
|
|
invocations were necessary.
|
|
|
|
* Added fsm-to-pfsg map_epsilon= option, to translate FSM <eps> symbols
|
|
to another label.
|
|
|
|
* New script filter-event-counts to preprocess a count file for use
|
|
with ngram -counts .
|
|
|
|
* lattice-tool continues processing when one of the lattices specified
|
|
with -in-lattice-list cannot be opened.
|
|
|
|
* Regression tests have been moved to module subdirectories
|
|
(lm/test, flm/test, lattice/test) and can now be run from the
|
|
top-level with "make test". Decompression of data files for platforms
|
|
that don't support compressed file I/O is now automatic.
|
|
|
|
Documentation:
|
|
|
|
* Added new FAQ items covering handling of OOVs and zeroprob words,
|
|
based on input from Nitin Madnani.
|
|
|
|
* Correction to the man page description of the ngram -count-order
|
|
option: It limits the maximal order of processed ngrams.
|
|
|
|
* Corrected and updated ordered list of processing steps in
|
|
lattice-tool man page.
|
|
|
|
Bug fixes:
|
|
|
|
* Use double precision to record log probs in TextStats object.
|
|
|
|
* Workaround for a deficiency in Intel's 7.00 C++ compiler.
|
|
|
|
* lattice-tool was not handling PFSG lattices in (1best or N-best)
|
|
decoding with a LM.
|
|
|
|
* lattice-tool will exit with a non-zero status if any of the lattice
|
|
operations fail.
|
|
|
|
* Fixed some format string/argument mismatches that could bite on
|
|
64-bit platforms.
|
|
|
|
* Updated usage of sort with key specification to conform to latest
|
|
POSIX standard. The old syntax was no longer working with recent
|
|
GNU sort versions.
|
|
|
|
1.5.11 16 June 2010
|
|
|
|
Functionality:
|
|
|
|
* New program "maxalloc" to find the maximum amount of memory that
|
|
can be allocated by a user process in the current environment.
|
|
May be useful to debug out-of-memory conditions.
|
|
|
|
Bug fixes:
|
|
|
|
* Avoid deleting low-posterior null tokens when aligning lattices into
|
|
word meshes.
|
|
|
|
* Map explicit start/end-of-sentence tags in HTK lattices to null,
|
|
since they are already implicitly attached to the start/end nodes
|
|
of the lattice (LM scoring gives anomalous results on repeated tags).
|
|
|
|
* option.[ch]: fixed declaration issues to avoid compiler warnings.
|
|
|
|
* Moved man page for the option library functions to misc/doc.
|
|
|
|
Bug fixes:
|
|
|
|
* Fixes to compile cleanly with gcc -Wall -Wno-unused-variable
|
|
-Wno-uninitialized.
|
|
* Fixed a problem with gcc-4.4 compiles.
|
|
* Fixed a problem with macro definition of fseeko() and ftello().
|
|
* Fixed a problem with the lm/ngram-count-wb-subset test, which could
|
|
fail after the test data is uncompressed.
|
|
* Use gzip -d to read gzipped files, avoids shell wrapper overhead.
|
|
|
|
1.5.12 20 Jan 2011
|
|
|
|
Functionality:
|
|
|
|
* Enable lattice-tool -old-decoding if -nbest-duplicates is specified
|
|
(and warn about it).
|
|
* Support make-big-lm -wbdiscount option.
|
|
* New option ngram -prune-history-lm, for specifying a separate LM that
|
|
computes the history marginal probabilities needed for N-gram pruning
|
|
purposes. Inspired by C. Chelba et al., "Study on Interaction Between
|
|
Entropy Pruning and Kneser-Ney Smoothing", Proc. Interspeech-2010.
|
|
* Added optional limitVocab argument to VocabMultiMap::read() function.
|
|
This is now used by lattice-tool -limit-vocab to avoid reading parts of
|
|
the dictionary that are not used in the input.
|
|
* Added an option -zeroprob-word to ngram and lattice-tool. It
|
|
specifies a word that should be used as a replacement if the current
|
|
word has probability zero. This is different from -map-unk which only
|
|
applies to OOV words and actually replaces the word label in the output
|
|
lattice, if any.
|
|
* Added new wrapper LM class NonzeroLM, to implement the above.
|
|
|
|
Portability:
|
|
|
|
* New MACHINE_TYPE values for Android-ARM platform: android-armeabi and
|
|
android-armeabi-v7a (from Mike Frandsen).
|
|
* Deleted the htk directory from distribution; it was obsolete and not
|
|
documented.
|
|
|
|
Bug fixes:
|
|
|
|
* Prob.h: guard against under/overflow in intlog and bytelog
|
|
conversions.
|
|
* Replaced gunzip with gzip -d in all scripts (for efficiency).
|
|
* Better option checking in make-big-lm, disallowing mixing of
|
|
discounting methods and use of discounting flags that are not supported.
|
|
* Undefine max() macro in Trellis.h to avoid conflict with some system
|
|
header files.
|
|
* Better support for recent MSVC versions in
|
|
common/Makefile.machine.msvc (from Mike Frandsen).
|
|
* add-pauses-to-pfsg: prevent existing pause nodes from being processed.
|
|
|
|
1.6.0 8 December 2011
|
|
|
|
Functionality:
|
|
|
|
* Added lattice-tool -loglinear-mix option.
|
|
* Add platform-independent strtok_r() function, and replaced all
|
|
instances of strtok().
|
|
Eventual goal is thread safety and re-entrance.
|
|
* Modified File object to allow I/O to/from strings as well as files.
|
|
* Modified code for reading and writing HTK lattices and NBest lists to
|
|
enable I/O to/from strings as well as files, for in-memory processing.
|
|
* Added special-purpose malloc/free implementation for SArray and LHash
|
|
data structures, to reduce overhead for small allocation chunks. Also
|
|
added some allocation statistics reporting (enabled by ngram -memuse
|
|
-debug 1).
|
|
* Added the metadb config file lookup tool.
|
|
* Cumulative binomial script (cumbin) command accepts optional 3rd
|
|
argument to set p parameter.
|
|
|
|
Bug fixes:
|
|
|
|
* Correctly handle lattice-tool -use-server when generating nbest lists
|
|
(server-based LM was previously ignored).
|
|
* lattice-tool -split-multiwords no longer splits words appearing in
|
|
-ignore-vocab.
|
|
* lattice-tool allowed to operate on HTK lattices containing unrecognized
|
|
header fields (but warn about them).
|
|
* Updated reference output for many build platforms to avoid spurious
|
|
test failures.
|
|
* Avoid abnormal backoff weights when lower-order probabilities sum to
|
|
almost one.
|
|
* Avoid test failures for merge-batch-counts and make-ngram-pfsg due to
|
|
locale differences.
|
|
* Fix maxalloc for 64bit systems where "long" is still 32 bits.
|
|
|
|
Building:
|
|
|
|
* Added Microsoft Visual Studio 2005 projects, see
|
|
doc/README.windows-msvc-visual-studio for more information.
|
|
* Added new Makefile targets superclean and pristine to return
|
|
SRILM to pre-build state.
|
|
* Add Makefiles for MACHINE_TYPE macosx-m32 and macosx-m64 to
|
|
allow explicit 32- or 64-bit compilation on MacOS X 10.6. Updated
|
|
GAWK location to allow tests to succeed.
|
|
* Replaced various C-shell helper scripts in sbin/ with Bourne-shell
|
|
versions, for greater portability.
|
|
* New MACHINE_TYPE=msvc64 for 64bit builds with Visual Studio.
|
|
|
|
Documentation:
|
|
|
|
* Added doc/asru2011-srilm.pdf, a paper describing SRILM updates since
|
|
2002. Old ICSLP paper renamed to doc/icslp2002-srilm.pdf .
|
|
|
|
1.7.0 23 December 2012
|
|
|
|
Functionality:
|
|
|
|
* ngram -codebook option for reading of Ngram LMs with quantized parameters
|
|
(contributed by Microsoft).
|
|
* ngram -msweb-lm option for obtaining LM probabilities from the Microsoft
|
|
Web N-gram service (web-ngram.research.microsoft.com). You need to obtain
|
|
a user ID to use this service, see man ngram for details (contributed by
|
|
Microsoft).
|
|
* Added support for dictionary-induced word distance metrics to
|
|
nbest-optimize (-dictionary option).
|
|
* Added support for matrix-defined word distance metrics to
|
|
nbest-optimize (-distances option).
|
|
* ngram -debug 4 -ppl outputs ranking statistics (number of times correct
|
|
word was in top 1, 5, 10), as well as quadratic and absolute loss averages
|
|
(based on code from Omid Madani).
|
|
* nbest-optimize accepts n-best list in SRInterp format and generates
|
|
SRInterp format rover-control file (weights file), when -srinterp-format
|
|
is specified.
|
|
* nbest-optimize accepts SRInterp counts file that contains BLEU and TER
|
|
counts info.
|
|
* lattice-tool -read-mesh will try to preserve acoustic information
|
|
(times, scores, pronunciations) if they are encoded in the input confusion
|
|
network.
|
|
* Support reading of text files in UTF-8 and UTF-16 encodings. All string
|
|
data is internally represented, and output, as ASCII/UTF-8 (contributed
|
|
by Microsoft).
|
|
This feature uses the iconv library. Support for this feature can be
|
|
disabled by compiling with "NO_ICONV=anything" on the make command line.
|
|
|
|
Portability:
|
|
|
|
* Ported LM client/server code to Winsock API (native socket library in
|
|
Windows), enabling this functionality for mingw and MSVC platforms
|
|
(contributed by Microsoft).
|
|
* Let machine-type script return 64bit platform names for Linux and Solaris
|
|
x86 when appropriate. This implies that 64bit binaries are built by
|
|
default on machines that support them.
|
|
* Array.h tweak for clang compiler (from kutlak.roman@gmail.com).
|
|
* Work around a namespace problem in C++11 (from kutlak.roman@gmail.com).
|
|
* Use size_t for hash codes to ensure word width matches pointer type.
|
|
* Fixes for mingw32 build, using Windows APIs for sockets and UTF
|
|
conversion (contributed by Microsoft).
|
|
* Support for 64bit mingw build (MACHINE_TYPE=win64).
|
|
* Updates for MacOSX (MACHINE_TYPE=macosx, thanks to Chuck Wooters).
|
|
* Deal with nonportability of isfinite() and isnan().
|
|
* Changes for thread-safety (by Kyle McIntyre). See doc/README-THREADS
|
|
for details.
|
|
- Modified the remove() methods in various container classes to return
|
|
Boolean instead of a pointer to the removed element. The removed element
|
|
can be gotten with an optional reference argument. This eliminates the
|
|
need for a global static variable.
|
|
- Use STL sort() instead of qsort() in LHash and SArray sorted iterations.
|
|
- Replaced all static variables with thread-local storage via the TLSWrapper
|
|
class, requiring the pthread library. This is available on most platforms,
|
|
but can be disabled at compile-time with -DNO_TLS.
|
|
|
|
Bug fixes:
|
|
|
|
* NgramLM backoff computation fixed to avoid spurious insertion of nonzero
|
|
unigram probabilities and non-unity backoff weights (resulting from
|
|
numerator/denominator values below Prob_Epsilon).
|
|
* lattice-tool does a better job inferring the lattice basename from the
|
|
UTTERANCE string embedded in HTK lattices.
|
|
* Trellis class: use a secondary sorting criterion to make N-best output
|
|
deterministic.
|
|
* WordMesh class: use posterior word probability to decide which acoustic
|
|
information to keep when merging hyps, instead of duration-normalized
|
|
acoustic scores as before. This leads to fewer words with out-of-order
|
|
timestamps when extracting one-best from confusion networks.
|
|
* fix-ctm script: Check for out-of-order word timestamps and adjust them
|
|
minimally as needed to produce a monotonic sequence, as required for
|
|
CTM sorting.
|
|
* Fixed bug in NgramCountLM estimation procedure reported by ariya@jhu.edu.
|
|
* Allow ngram -hidden-vocab to read hidden event properties described in
|
|
man page.
|
|
* Fixed bug in ngram -hidden-vocab -write-lm output.
|
|
* Avoid crash when ngram -hidden-not -ppl is used with debug level 2.
|
|
* Fixed (very rare) bug by which ngram -prune might remove all ngrams
|
|
sharing a common context.
|
|
* Improved ngram -prune-lowprobs by also removing backoff weights that
|
|
have become useless (suggested by Arlo Faria).
|
|
* Check for successful search for HTK lattice start/end nodes, if not
|
|
explicitly specified (reported by nshmyrev@yandex.ru).
|
|
* Handle infinity scores in lattice rescoring, and catch NaN scores when
|
|
reading HTK lattices.
|
|
* make-kn-discounts checks for negative discount values and reports
|
|
error if appropriate.
|
|
* nbest-optimize accepts combined BLEU and error rate objective via switch
|
|
-error-bleu-ratio R (R specifies the error rate weight).
|
|
* lattice-tool -timeout option now uses sigsetjmp/siglongjmp to handle
|
|
timeout alarms. This is necessary in Linux-compatible (including cygwin)
|
|
systems to handle alarms repeatedly.
|
|
* Fixed a bug reading NBestList2.0 format without phone information (led
|
|
to malformed confusion network output).
|
|
* Fixed a bug in Ngram::contextID() that was causing incorrect expansion
|
|
of lattices with pruned backoff models.
|
|
* Fixed a bug in the lattice-tool -keep-unk implementation that was
|
|
sometimes allowing an OOV word label to be output as <unk>.
|
|
* Removed some pseudo-randomness in ngram-class so that results are more
|
|
invariant to OPTION setting and platform properties.
|
|
* Avoid differences due to machine arithmetic in word mesh alignment,
|
|
making confusion network building and posterior decoding more stable
|
|
across platforms.
|
|
* Exclude metatags when writing out the vocabulary of binary Ngram LMs.
|
|
* Fixed some missing dependencies in Visual Studio solution file.
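
The fix-ctm item above mentions adjusting out-of-order word timestamps to obtain a monotonic sequence. One simple way to do this, shown here purely as an assumption about what such an adjustment could look like, is to raise each offending start time to the latest start time seen so far and leave all other times unchanged. The function name is made up for the example, and the actual script may use a different strategy.

```python
def make_monotonic(start_times):
    """Return start times adjusted to be non-decreasing.

    Only times that fall before an earlier word's start are changed;
    each is raised to the latest start time seen so far.
    """
    fixed = []
    latest = float("-inf")
    for t in start_times:
        latest = max(latest, t)
        fixed.append(latest)
    return fixed

times = [0.0, 0.5, 0.4, 0.9]
fixed = make_monotonic(times)
```

Only the third word is moved (from 0.4 to 0.5); the rest of the sequence is untouched.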
|
|
|
|
1.7.1 4 June 2014
|
|
|
|
* Updated INSTALL, Copyright. Added ACKNOWLEDGEMENTS.
|
|
|
|
Functionality:
|
|
|
|
* Integrated the maximum entropy extension by Tanel Alumae, described
|
|
at http://www.phon.ioc.ee/~tanela/srilm-me/ .
|
|
Please cite Tanel's paper (copied here in doc/is2010-maxent.pdf) if you
|
|
use this functionality in your research.
|
|
* Enable LM server to process multiple commands in a single message
|
|
(separated by newlines). This capability was never documented, but
|
|
existed in the first implementation that used read/write system calls,
|
|
but was lost when we switched to recv/send calls.
|
|
* Generalized the BayesMix LM class to allow an arbitrary number of
|
|
mixture components, similar to LoglinearMix.
|
|
* Added the ngram -context-priors option to read context-dependent
|
|
mixture weight priors from a file.
|
|
* Added the ngram -read-mix-lms option to read the list of interpolated
|
|
LMs, weights and options from a file, specified by the -lm option.
|
|
* Use zlib for I/O from/to gzipped files. Benefits are: (a) works with
|
|
native Windows binaries, (b) avoids subprocess, (c) allows reading
|
|
(though still not writing) of gzipped binary LM and count files.
|
|
* ngram-count -gtNmin options accept floating point values for more
|
|
flexibility with LM estimation from fractional counts.
|
|
* Added lattice-tool -set-lattice-names option to preserve input
|
|
filenames inside lattices.
|
|
* New script replace-unk-words, for replacing OOV words relative to
|
|
a vocabulary with <unk> tag.
|
|
* Added new lattice-tool options -hyp-list -hyp-file -hyp2-list
|
|
-hyp2-file -add-hyps to add ASR hypotheses into word mesh (confusion
|
|
network). The added options are similar to -ref-list -ref-file -add-refs,
|
|
except that the added hypothesized words will not be indicated as
|
|
reference words in the word mesh.
|
|
* Added a function in WordMesh to compute slot-to-slot alignment
|
|
between two confusion networks.
|
|
* Added ngram-class option to limit number of words per class (from
|
|
seppo.enarvi@aalto.fi).
|
|
|
|
Portability:
|
|
|
|
* Added support for 64bit cygwin builds (MACHINE_TYPE=cygwin64).
|
|
|
|
Bug fixes:
|
|
|
|
* ngram -rescore-ngram was not setting the handling of special word
|
|
tokens (<s>, </s>) if the rescored LM was being evaluated in the same
|
|
run.
|
|
* ngram-count -skip needs to read counts one order higher than specified
|
|
by -order .
|
|
* SkipNgram will now try to reestimate the discounting parameters from
|
|
expected counts on each EM iteration (but fall back on initial parameters
|
|
if that fails, e.g., for discounting methods that cannot handle float
|
|
counts).
|
|
* SubVocab instances' handling of metatags and nonevent words is now
|
|
tied to the base Vocab instance.
|
|
* Avoid anomalies in random word generation due to nonzero probabilities
|
|
for nonwords.
|
|
* Cleaned-up select-vocab script from Anand Venkataraman. Now works
|
|
with perl 5.12 and gives consistent results on different platforms.
|
|
Added a test case.
|
|
* Fixed removeTrie() bug that was leading to memory leak in Ngram
|
|
destructor.
|
|
* Fixed bug in LHash iterator that led to potential double enumeration
|
|
of items after deletions, and could affect Ngram pruning results.
|
|
* Allow number of ngrams in ARPA LM to exceed 2^31. (Vocabulary size
|
|
is still limited to 2^32.)
|
|
* Initialize key and data objects in SArray and LHash containers after
|
|
allocation.
|
|
* Pass Trellis state parameters by reference to avoid copying of
|
|
potentially complex objects.
|
|
* Fixed memory access error in Ngram::clear() for order-1 models.
|
|
* Fixed a problem handling null string states in Trellis.
|
|
* Fix to preserve double precision in NBest acoustic and LM scores.
|
|
* Fixed an error concerning the use of -gtNmin options in the srilm-faq(7)
|
|
man page pointed out by dugast@systran.fr.
|
|
* If a lattice-tool input lattice is a word mesh, avoid calling
|
|
alignLattice() since the input is already a word mesh.
|
|
* Fixes to reading/writing of quantization codebook files.
|
|
* Fixed header comment and test program for Map2::remove().
|
|
|
|
1.7.2 9 November 2016
|
|
|
|
Functionality:
|
|
|
|
* Added interfaces to Lattice and WordMesh that allow external programs
|
|
to map sausage nodes to their original lattice nodes.
|
|
* New VocabDistance subclass StemDistance, comparing words only based on
|
|
their stems.
|
|
* New lattice-tool option -stem-dist triggers StemDistance use in
|
|
confusion network alignments, including -add-hyps and -add-refs processing.
|
|
* Add optional support for keyword spotting (in Lattice.h and
|
|
LatticeIndex.cc) when writing a 1-gram index.
|
|
* Added new File field NBestOptions::nbestRttm2; if it exists, then write
|
|
(an approximation to) the NBestList2.0 format output.
|
|
* Added simple Trellis pruning based on relative thresholding of forward
|
|
probabilities (Trellis::prune()).
|
|
* make-big-lm now understands the -ukndiscount option. The make-kn-discounts
|
|
helper script has an option to compute unmodified KN discounts.
|
|
* The -version option now reports the compiler version used.
|
|
* Added ngram-count -write-text option to test conversion of UTF-16 files
|
|
to ASCII/UTF-8.
|
|
* Added ngram -text-has-weights option to allow weighting sentences in ppl
|
|
computation.
|
|
* Added scripts nbest-words and compute-sclite-nbest for conveniently
|
|
computing nbest-optimize -errors information using sclite.
|
|
* Added the nbest-optimize -xval-files option to support cross-validation.
|
|
* Added script search-rover-combo for searching for best combination among
|
|
a list of systems.
|
|
* Added confidence value fields to NBestWordInfo class.
|
|
* Added check to compute-best-mix to warn about word label mismatches between
|
|
input files.
|
|
|
|
Portability:
|
|
|
|
* Honor TMPDIR environment variable in various scripts.
|
|
* Miscellaneous Mac OS X fixes.
|
|
* Include BSD rand48 functions so that random sentence generation gives same
|
|
result on all platforms.
|
|
|
|
Bug fixes:
|
|
|
|
* Avoid leaky backoff by mapping very small probability sums to 0 in BOW
|
|
computation. Otherwise unseen ngrams may end up with nonzero probabilities
|
|
in unsmoothed LMs.
|
|
* Fixed compare-ppls, compute-best-mix, compute-best-sentence-mix, and ppl-from-log
|
|
to recognize the MSVC representation of -infinity.
|
|
* Fixed a bug in the handling of zero prefix probabilities in ClassNgram,
|
|
HiddenNgram and HMMofNgrams.
|
|
* Fixed a memory allocation bug that caused the ngram-count-maxent test
|
|
to crash.
|
|
* Fixes to lattice-tool rttm nbest output.
|
|
* Fix for possible endless loop in lattice-tool -posterior-prune due to
|
|
limited float precision (from Seppo Enarvi).
|
|
* Fixed a problem with declaration of Map_nokeyP() that takes reference
|
|
arguments and was missing "const"; was causing crash in segment tool.
|
|
* Workaround for what looks like an optimizer bug in gcc >= 4.9 that can
|
|
cause ngram -prune to core dump.
|
|
* Output TextStats quantities (sentence/word counts, log probs, perplexities),
|
|
model parameters, nbest and lattice scores, and other quantities with full
|
|
precision so as to avoid loss of information.
|
|
* nbest-optimize -1best now outputs a rover-control file that simulates
|
|
Viterbi decoding (by using a small posterior scale).
|
|
* nbest-optimize -errors now tolerates a varying number of reference words
|
|
for the same sentence. This can arise from sclite references with alternate
|
|
word strings.
|
|
* Fixed a stupid bug in uniform-classes.gawk script.
|
|
* Allow combine-rover-controls to merge control files with the same systems
|
|
in them, adding their weights.
|
|
* Updated zlib to version 1.2.8. This fixes a bug whereby gzipped output files
|
|
could end up with zero size (instead of a legal gzipped file that results in a
|
|
zero-length file when decompressed).
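
The leaky-backoff entry above can be illustrated with a minimal sketch. This
is not the SRILM code: the function name backoffWeight, the threshold
kProbEpsilon and its value, and the choice to return 0 when the denominator
vanishes are all assumptions made for illustration. In an unsmoothed model the
explicit probabilities in a context sum to 1, so the leftover mass should be
exactly 0, but floating-point roundoff can leave a tiny positive remainder
that would give unseen ngrams nonzero probability.

```cpp
#include <cmath>

// Hypothetical threshold below which a probability mass counts as zero.
// The value is an assumption, not the constant SRILM uses.
const double kProbEpsilon = 1e-9;

// Sketch of a backoff weight in the linear (non-log) domain.
//   sumHigh: sum of explicit higher-order probabilities in this context
//   sumLow:  sum of lower-order probabilities of the same words
inline double backoffWeight(double sumHigh, double sumLow)
{
    double numerator = 1.0 - sumHigh;
    double denominator = 1.0 - sumLow;

    // Map very small leftover mass to exactly 0 so that roundoff
    // does not leak probability to unseen ngrams.
    if (std::fabs(numerator) < kProbEpsilon) {
        numerator = 0.0;
    }
    if (std::fabs(denominator) < kProbEpsilon) {
        denominator = 0.0;
    }

    // Simplification for this sketch: with no mass left to distribute,
    // or nothing to distribute it over, return 0.
    if (numerator == 0.0 || denominator == 0.0) {
        return 0.0;
    }
    return numerator / denominator;
}
```

With this guard, a context whose explicit probabilities sum to 1 up to
roundoff gets a backoff weight of exactly 0.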
|
|
|
|
1.7.3 9 September 2019
|
|
|
|
Functionality:
|
|
|
|
* Added nbest-oov-counts script to generate OOV counts for nbest hypotheses.
|
|
* Added a simple mechanism for weight tying in nbest-rover control files. A
|
|
system weight of = indicates that it should be tied to the previously listed
|
|
system. This is useful for reducing the number of free parameters when
|
|
searching for good system combinations (search-rover-combo).
|
|
* Add Map_noKey() and Map_noKeyP() for unsigned long long type, to enable use
|
|
with size_t on Windows MSVC.
|
|
* Output from -version now includes compile-time options.
|
|
* Added option ngram -minbackoff to fix up models that have unnormalized
|
|
probabilities or that are not smoothed.
|
|
* Added option ngram -unk-probs to override unknown word probabilities.
|
|
* Added nbest-optimize-args-from-rover-control script, convenient for
|
|
extracting initialization parameters for nbest-optimize from existing
|
|
nbest-rover control file.
|
|
* Added ngram-count -text-has-weights-last option to allow text input with
|
|
count values at ends of lines.
|
|
* Added nbest-rover -missing-nbest option to treat missing nbest lists as if
|
|
an empty hypothesis (no words) had been output, rather than simply skipping
|
|
that nbest list.
|
|
* Added nbest-lattice -time-penalty option, implementing a soft constraint
|
|
on time stamps (when present) during confusion network building and alignment.
|
|
* Added nbest-lattice -average-times option, to average word times instead
|
|
of picking the timing of the highest posterior hypothesis.
|
|
* Added nbest-lattice -suppress-vocab option to disallow certain words in
|
|
posterior decoding.
|
|
* New script concat-sausages for chaining word confusion networks together.
|
|
* Added nbest-lattice -dump-lattice-alignments option to output mappings
|
|
between sausage positions and alignment costs.
|
|
* Updated Android build for 64-bit development for armv8 using NDK r20 and clang.
|
|
This almost certainly breaks the 32-bit build for armv7. The last known good 32-bit
|
|
build is in common/Makefile.core.android.r11c, last built using NDK r11c. To use this,
|
|
copy Makefile.core.android.r11c to Makefile.core.android. See doc/README.android.
|
|
|
|
Bug fixes:
|
|
|
|
* Added a new tool nbest-rover-helper that combines the functions of the
|
|
combine-acoustic-scores and nbest-posteriors scripts, doing these computations
|
|
in double precision and faster. nbest-rover now uses this tool (except when
|
|
certain options like -nbest-backtrace are used).
|
|
* nbest-rover strips DOS end-of-line CR characters from the control file, so
|
|
they no longer mess up the parsing of the file.
|
|
* Rationalize the way ties are broken when decoding word confusion networks.
|
|
The word with the lowest internal index is now preferred (and the *DELETE* token
|
|
always comes before all other words), unless the new nbest-lattice option
|
|
-random-tie-break is given. The output order of alternative word hypotheses
|
|
to sausage files is always by probability rank first, then by internal index.
|
|
* The reverse-ngram-counts script now replaces <s> with </s> and vice-versa,
|
|
as required for training reverse-direction LMs, and consistent with reverse-text.
|
|
* Handle comment lines starting with '##' and empty lines in nbest-rover control
|
|
files the same way as in File::getline(), i.e., ignore them.
|
|
* Fixed the syntax for the nbest-optimize -dynamic-random-series option (now
|
|
starts with single dash, as described in man page).
|
|
* Don't let compute-best-mix complain about word mismatches if <unk> is involved.
|
|
* Cast input to isspace() to (unsigned char) to guarantee input is non-negative.
|
|
* Fixed memory management problems in MEModel.
|
|
* Work around a bug in zlib's gzprintf() printing of very long %s arguments; was
|
|
causing long word strings not to be output into .gz files.
|
|
* Removed word string length limit.
|
|
* Removed limit on total line length in outputting ngram count files.
|
|
* Zlib updated to version 1.2.11.
|
|
* nbest-posteriors ensures that bytelog scores are output in fixed-point format.
|
|
* Allow floating point values when parsing bytelog scores in nbest lists.
|
|
* More robustness to word sausage input files that have missing data for some
|
|
position.
|
|
* Fixed a performance bug when nbest-rover is invoked with -output-ctm option.
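
The isspace() entry above reflects a general C/C++ rule: the <cctype>
classification functions require an argument representable as unsigned char
(or EOF), so passing a negative plain char is undefined behavior. A minimal
sketch follows; the helper countSpaces is hypothetical and not part of SRILM.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not an SRILM function): counts whitespace characters.
// Where plain char is signed, bytes >= 0x80 (e.g. in UTF-8 or Latin-1 text)
// are negative, so each char is cast to unsigned char before the call.
inline std::size_t countSpaces(const std::string &s)
{
    std::size_t n = 0;
    for (char c : s) {
        if (std::isspace(static_cast<unsigned char>(c))) {
            n++;
        }
    }
    return n;
}
```

The cast changes nothing for ASCII input; it only makes the call well defined
for bytes with the high bit set.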
|
|
|
|
$Date: 2019/09/09 23:09:32 $
|
|
|
|
|