<! $Id: srilm-faq.7,v 1.13 2019/09/09 22:35:37 stolcke Exp $>
<HTML>
<HEAD>
<TITLE>SRILM-FAQ</TITLE>
</HEAD>
<BODY>
<H1>SRILM-FAQ</H1>
<H2> NAME </H2>
SRILM-FAQ - Frequently asked questions about SRI LM tools
<H2> SYNOPSIS </H2>
<PRE>
man srilm-faq
</PRE>
<H2> DESCRIPTION </H2>
This document tries to answer some of the most frequently asked questions
about SRILM.
<H3> Build issues </H3>
<DL>
<DT><B> A1) I ran ``make World'' but the $SRILM/bin/$MACHINE_TYPE directory is empty. </B>
<DD>
Building the binaries can fail for a variety of reasons.
Check the following:
<DL>
<DT>
a)
<DD>
Make sure the SRILM environment variable is set, or specified on the
make command line, e.g.:
<PRE>
make SRILM=$PWD
</PRE>
<DT>
b)
<DD>
Make sure the
<B> $SRILM/sbin/machine-type </B>
script returns a valid string for the platform you are trying to build on.
Known platforms have machine-specific makefiles called
<PRE>
$SRILM/common/Makefile.machine.$MACHINE_TYPE
</PRE>
If
<B> machine-type </B>
does not work for some reason, you can override its output on the command line:
<PRE>
make MACHINE_TYPE=xyz
</PRE>
If you are building for an unsupported platform, create a new machine-specific
makefile and mail it to stolcke@speech.sri.com.
<DT>
c)
<DD>
Make sure your compiler works and is invoked correctly.
You will probably have to edit the
<B> CC </B>
and
<B> CXX </B>
variables in the platform-specific makefile.
If you have questions about compiler invocation and the best options,
consult a local expert; these things differ widely between sites.
<DT>
d)
<DD>
The default is to compile with Tcl support.
This is in fact only used for some testing binaries (which are
not built by default),
so it can be turned off if Tcl is not available or presents problems.
Edit the machine-specific makefile accordingly.
To use Tcl, locate the
<B> tcl.h </B>
header file and the library itself, and set (for example)
<PRE>
TCL_INCLUDE = -I/path/to/include
TCL_LIBRARY = -L/path/to/lib -ltcl8.4
</PRE>
To disable Tcl support, set
<PRE>
NO_TCL = X
TCL_INCLUDE =
TCL_LIBRARY =
</PRE>
<DT>
e)
<DD>
Make sure you have the C-shell (/bin/csh) installed on your system.
Otherwise you will see something like
<PRE>
make: /sbin/machine-type: Command not found
</PRE>
early in the build process.
On Ubuntu Linux and Cygwin systems "csh" or "tcsh" needs to be installed
as an optional package.
<DT>
f)
<DD>
If you cannot get SRILM to build, save the make output to a file
<PRE>
make World >& make.output
</PRE>
and look for messages indicating errors.
If you still cannot figure out what the problem is, send the error message
and the immediately preceding lines to the srilm-user list.
Also include information about your operating system ("uname -a" output)
and compiler version ("gcc -v" or equivalent for other compilers).
</DD>
</DL>
<DT><B> A2) The regression test outputs differ for all tests. What did I do wrong? </B>
<DD>
Most likely the binaries didn't get built or aren't executable
for some reason.
Check issue A1).
<DT><B> A3) I get differing outputs for some of the regression tests. Is that OK? </B>
<DD>
It might be.
The comparison of reference to actual output allows for small numerical
differences, but
some of the algorithms make hard decisions based on floating-point computations
that can result in different outputs as a result of different compiler
optimizations, machine floating-point precisions (Intel versus IEEE format),
and math libraries.
Tests of this nature include
<B>ngram-class</B>,
<B>disambig</B>,
and
<B>nbest-rover</B>.
When encountering differences, diff the output in the
$SRILM/test/outputs/<I>TEST</I>.$MACHINE_TYPE.stdout file against the corresponding
$SRILM/test/reference/<I>TEST</I>.stdout file, where
<I> TEST </I>
is the name of the test that failed.
Also compare the corresponding .stderr files;
differences there usually indicate operating-system related problems.
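For example, for the <B>ngram-class</B> test the comparison might look as follows
(a sketch; substitute the name of the failing test and your $MACHINE_TYPE value):
<PRE>
diff $SRILM/test/outputs/ngram-class.$MACHINE_TYPE.stdout \
	$SRILM/test/reference/ngram-class.stdout
</PRE>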
</DD>
</DL>
<H3> Large data and memory issues </H3>
<DL>
<DT><B> B1) I'm getting a message saying ``Assertion `body != 0' failed.'' </B>
<DD>
You are running out of memory.
See the subsequent questions, depending on what you are trying to do.
<DT>
Note:
<DD>
The above message means you are running
out of "virtual" memory on your computer, which could be because of
limits in swap space, administrative resource limits, or limitations of
the machine architecture (a 32-bit machine cannot address more than
4GB no matter how many resources your system has).
Another symptom of not enough memory is that your program runs, but
very, very slowly, i.e., it is "paging" or "swapping" as it tries to
use more memory than the machine has RAM installed.
<DT><B> B2) I am trying to count N-grams in a text file and am running out of memory. </B>
<DD>
Don't use
<B> ngram-count </B>
directly to count N-grams.
Instead, use the
<B> make-batch-counts </B>
and
<B> merge-batch-counts </B>
scripts described in
<A HREF="training-scripts.1.html">training-scripts(1)</A>.
That way you can create N-gram counts limited only by the maximum file size
on your system.
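A typical sequence of commands might look roughly like this (a sketch with
placeholder file names; see
<A HREF="training-scripts.1.html">training-scripts(1)</A>
for the exact argument order and defaults):
<PRE>
# file-list names the text files to be counted, 10 files per batch,
# /bin/cat as a no-op text filter, counts/ as the output directory
make-batch-counts file-list 10 /bin/cat counts -order 3
# merge the per-batch count files in counts/ into a single file
merge-batch-counts counts
</PRE>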
<DT><B> B3) I am trying to build an N-gram LM and ngram-count runs out of memory. </B>
<DD>
You are running out of memory either because of the size of the N-gram counts
or because of the size of the LM being built. The following are strategies for reducing the
memory requirements for training LMs.
<DL>
<DT>
a)
<DD>
Assuming you are using Good-Turing or Kneser-Ney discounting, don't use
<B> ngram-count </B>
in "raw" form.
Instead, use the
<B> make-big-lm </B>
wrapper script described in the
<A HREF="training-scripts.1.html">training-scripts(1)</A>
man page.
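An invocation might look roughly like this (a sketch with placeholder file names;
<B>make-big-lm</B>
accepts the usual
<B>ngram-count</B>
estimation options, plus a
<B>-name</B>
prefix for its auxiliary files):
<PRE>
make-big-lm -read corpus.counts.gz -name biglm -order 3 \
	-kndiscount -interpolate -unk -lm big.lm.gz
</PRE>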
<DT>
b)
<DD>
Switch to using the "_c" or "_s" versions of the SRI binaries.
For instructions on how to build them, see the INSTALL file.
Once built, set your executable search path accordingly, and try
<B> make-big-lm </B>
again.
<DT>
c)
<DD>
Raise the minimum counts for N-grams included in the LM, i.e.,
the values of the options
<B>-gt2min</B>,
<B>-gt3min</B>,
<B>-gt4min</B>,
etc.
The higher-order N-grams typically get higher minimum counts.
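For example, the cutoffs could be raised along the following lines (the values
shown are purely illustrative):
<PRE>
ngram-count -order 4 -read corpus.counts.gz \
	-gt2min 2 -gt3min 3 -gt4min 4 \
	-kndiscount -interpolate -lm small.lm.gz
</PRE>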
<DT>
d)
<DD>
Get a machine with more memory.
If you are hitting the limitations of a 32-bit machine architecture,
get a 64-bit machine and recompile SRILM to take advantage of the expanded
address space.
(The MACHINE_TYPE=i686-m64 setting is for systems based on
64-bit AMD processors, as well as recent compatibles from Intel.)
Note that 64-bit pointers will require a memory overhead in
themselves, so you will need a machine with significantly, not just a
little, more memory than 4GB.
</DD>
</DL>
<DT><B> B4) I am trying to apply a large LM to some data and am running out of memory. </B>
<DD>
Again, there are several strategies to reduce memory requirements.
<DL>
<DT>
a)
<DD>
Use the "_c" or "_s" versions of the SRI binaries.
See B3-b) above.
<DT>
b)
<DD>
Precompute the vocabulary of your test data and use the
<B> ngram -limit-vocab </B>
option to load only the N-gram parameters relevant to your data.
This approach should allow you to use arbitrarily
large LMs, provided the data is divided into small enough chunks.
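As an illustration, this could look as follows (file names are placeholders):
<PRE>
ngram-count -text test.data -write-vocab test.vocab
ngram -lm big.lm.gz -limit-vocab -vocab test.vocab -ppl test.data
</PRE>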
<DT>
c)
<DD>
If the LM can be built on a large machine, but then is to be used on
machines with limited memory, use
<B> ngram -prune </B>
to remove the less important parameters of the model.
This usually gives huge size reductions with relatively modest performance
degradation.
The tradeoff is adjustable by varying the pruning parameter.
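For example, a pruned copy of the model might be written out roughly like this
(the threshold value is illustrative; larger values prune more aggressively):
<PRE>
ngram -order 3 -lm big.lm.gz -prune 1e-8 -write-lm pruned.lm.gz
</PRE>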
</DD>
</DL>
<DT><B> B5) How can I reduce the time it takes to load large LMs into memory? </B>
<DD>
The techniques described in B4-b) and B4-c) above also reduce the load time
of the LM.
Additional steps to try are:
<DL>
<DT>
a)
<DD>
Convert the LM into binary format, using
<PRE>
ngram -order <I>N</I> -lm <I>OLDLM</I> -write-bin-lm <I>NEWLM</I>
</PRE>
(This is currently only supported for N-gram-based LMs.)
You can also generate the LM directly in binary format, using
<PRE>
ngram-count ... -lm <I>NEWLM</I> -write-binary-lm
</PRE>
The resulting
<I> NEWLM </I>
file (which should not be compressed) can be used
in place of a textual LM file with all compiled SRILM tools
(but not with
<A HREF="lm-scripts.1.html">lm-scripts(1)</A>).
The format is machine-independent, i.e., it can be read on machines with
different word sizes or byte orders.
Loading binary LMs is faster because
(1) it reduces the overhead of parsing the input data, and
(2) in combination with
<B> -limit-vocab </B>
(see B4-b)
it is much faster to skip sections of the LM that are out-of-vocabulary.
<DT>
Note:
<DD>
There is also a binary format for N-gram counts.
It can be generated using
<PRE>
ngram-count -write-binary <I>COUNTS</I>
</PRE>
and has advantages similar to those of binary LM files.
<DT>
b)
<DD>
Start a "probability server" that loads the LM ahead of time, and
then have "LM clients" query the server instead of computing the
probabilities themselves.
<BR>
The server is started on a machine named
<I> HOST </I>
using
<PRE>
ngram <I>LMOPTIONS</I> -server-port <I>P</I> &amp;
</PRE>
where
<I> P </I>
is an integer &lt; 2^16 that specifies the TCP/IP port number the
server will listen on, and
<I> LMOPTIONS </I>
are whatever options are necessary to define the LM to be used.
<BR>
One or more clients (programs such as
<A HREF="ngram.1.html">ngram(1)</A>,
<A HREF="disambig.1.html">disambig(1)</A>,
<A HREF="lattice-tool.1.html">lattice-tool(1)</A>)
can then query the server using the options
<PRE>
-use-server <I>P</I>@<I>HOST</I> -cache-served-ngrams
</PRE>
instead of the usual "-lm <I>FILE</I>".
The
<B> -cache-served-ngrams </B>
option is not required, but it often speeds things up dramatically by
saving the results of server lookups in the client for reuse.
Server-based LMs may be combined with file-based LMs by interpolation;
see
<A HREF="ngram.1.html">ngram(1)</A>
for details.
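Concretely, the two sides might look like this (the host name "lmhost" and the
port number are placeholders):
<PRE>
# on the machine holding the LM
ngram -lm big.lm.gz -server-port 30000 &amp;
# on a client machine
ngram -use-server 30000@lmhost -cache-served-ngrams -ppl test.data
</PRE>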
</DD>
</DL>
<DT><B> B6) How can I use the Google Web N-gram corpus to build an LM? </B>
<DD>
Google has made a corpus of 5-grams extracted from 1 tera-words of web data
available via the LDC.
However, the data is too large to build a standard backoff N-gram model, even
using the techniques described above.
Instead, we recommend a "count-based" LM smoothed with deleted interpolation.
Such an LM computes probabilities on the fly from the counts, of which only
the subsets needed for a given test set need to be loaded into memory.
LM construction proceeds in the following steps:
<DL>
<DT>
a)
<DD>
Make sure you have built SRI binaries either for a 64-bit machine
(e.g., MACHINE_TYPE=i686-m64 OPTION=_c) or using 64-bit counts (OPTION=_l).
This is necessary because the data contains N-gram counts exceeding
the range of 32-bit integers.
Be sure to invoke all commands below using the path to the appropriate
binary executable directory.
<DT>
b)
<DD>
Prepare a mapping file for some vocabulary mismatches and call it
<B>google.aliases</B>:
<PRE>
&lt;S&gt; &lt;s&gt;
&lt;/S&gt; &lt;/s&gt;
&lt;UNK&gt; &lt;unk&gt;
</PRE>
<DT>
c)
<DD>
Prepare an initial count-LM parameter file
<B>google.countlm.0</B>:
<PRE>
order 5
vocabsize 13588391
totalcount 1024908267229
countmodulus 40
mixweights 15
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
google-counts <I>PATH</I>
</PRE>
where
<I> PATH </I>
points to the location of the Google N-grams, i.e., the directory containing
the subdirectories "1gms", "2gms", etc.
Note that the
<B> vocabsize </B>
and
<B> totalcount </B>
values were obtained from the 1gms/vocab.gz and 1gms/total files, respectively.
(Check that they match and modify them as needed.)
For an explanation of the parameters see the
<A HREF="ngram.1.html">ngram(1)</A>
<B> -count-lm </B>
option.
<DT>
d)
<DD>
Prepare a text file
<B> tune.text </B>
containing data for estimating the mixture weights.
This data should be representative of, but different from, your test data.
Compute the vocabulary of this data using
<PRE>
ngram-count -text tune.text -write-vocab tune.vocab
</PRE>
The vocabulary size should not exceed a few thousand words to keep the memory
requirements in the following steps manageable.
<DT>
e)
<DD>
Estimate the mixture weights:
<PRE>
ngram-count -debug 1 -order 5 -count-lm \
	-text tune.text -vocab tune.vocab \
	-vocab-aliases google.aliases \
	-limit-vocab \
	-init-lm google.countlm.0 \
	-em-iters 100 \
	-lm google.countlm
</PRE>
This will write the estimated LM to
<B>google.countlm</B>.
The output will be identical to the initial LM file, except for the
updated interpolation weights.
<DT>
f)
<DD>
Prepare a test data file
<B>test.text</B>,
and its vocabulary
<B> test.vocab </B>
as in Step d) above.
Then apply the LM to the test data:
<PRE>
ngram -debug 2 -order 5 -count-lm \
	-lm google.countlm \
	-vocab test.vocab \
	-vocab-aliases google.aliases \
	-limit-vocab \
	-ppl test.text > test.ppl
</PRE>
The perplexity output will appear in
<B>test.ppl</B>.
<DT>
g)
<DD>
Note that the Google data uses mixed-case spellings.
To apply the LM to lowercase data one needs to prepare a much more
extensive vocabulary mapping table for the
<B> -vocab-aliases </B>
option, namely, one that maps all
upper- and mixed-case spellings to lowercase strings.
This mapping file should be restricted to the words appearing in
<B> tune.text </B>
and
<B>test.text</B>,
respectively, to avoid defeating the effect of
<B>-limit-vocab</B>.
</DD>
</DL>
</DL>
<H3> Smoothing issues </H3>
<DL>
<DT><B> C1) What is smoothing and discounting all about? </B>
<DD>
<I> Smoothing </I>
refers to methods that assign probabilities to events (N-grams) that
do not occur in the training data.
According to a pure maximum-likelihood estimator these events would have
probability zero, which is plainly wrong, since previously unseen events
in general do occur in independent test data.
Because the probability mass is redistributed away from the seen events
toward the unseen events, the resulting model is "smoother" (closer to uniform)
than the ML model.
<I> Discounting </I>
refers to the approach used by many smoothing methods of adjusting the
empirical counts of seen events downwards.
The ML estimator (count divided by total number of events) is then applied
to the discounted count, resulting in a smoother estimate.
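As a purely illustrative arithmetic example (using absolute discounting with a
discount constant of 0.5, which is not necessarily the method you will use):
if a particular bigram was seen 3 times among 10 bigrams sharing the same
one-word context, its discounted estimate becomes
<PRE>
p = (3 - 0.5) / 10 = 0.25    (instead of the ML estimate 3/10 = 0.3)
</PRE>
and the mass removed from all the seen bigrams in this way is what gets
redistributed to the unseen ones.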
<DT><B> C2) What smoothing methods are there? </B>
<DD>
There are many, and SRILM implements a fairly large selection of the
most popular ones.
A detailed discussion of these is found in a separate document,
<A HREF="ngram-discount.7.html">ngram-discount(7)</A>.
<DT><B> C3) Why am I getting errors or warnings from the smoothing method I'm using? </B>
<DD>
The Good-Turing and Kneser-Ney smoothing methods rely on statistics called
"counts-of-counts", the numbers of words occurring once, twice, three times, etc.
The formulae for these methods become undefined if the counts-of-counts
are zero, or not strictly decreasing.
Some conditions are fatal (such as when the count of singleton words is zero),
others lead to less smoothing (and warnings).
To avoid these problems, check for the following possibilities:
<DL>
<DT>
a)
<DD>
The data could be very sparse, i.e., the training corpus very small.
Try using the Witten-Bell discounting method.
<DT>
b)
<DD>
The vocabulary could be very small, such as when training an LM based on
characters or parts-of-speech.
Smoothing is less of an issue in those cases, and the Witten-Bell method
should work well.
<DT>
c)
<DD>
The data was manipulated in some way, or artificially generated.
For example, duplicating data eliminates the odd-numbered counts-of-counts.
<DT>
d)
<DD>
The vocabulary is limited during counts collection using the
<B>ngram-count</B>
<B> -vocab </B>
option, with the effect that many low-frequency N-grams are eliminated.
The proper approach is to compute smoothing parameters on the full vocabulary.
This happens automatically in the
<B> make-big-lm </B>
wrapper script, which is preferable to direct use of
<B>ngram-count</B>
for other reasons (see issue B3-a above).
<DT>
e)
<DD>
You are estimating an LM from N-gram counts that have been truncated beforehand,
e.g., by removing singleton events.
If you cannot go back to the original data and recompute the counts,
there is a heuristic to extrapolate low counts-of-counts from higher ones.
The heuristic is invoked automatically (and an informational message is output)
when
<B> make-big-lm </B>
is used to estimate LMs with Kneser-Ney smoothing.
For details see the paper by W. Wang et al. in ASRU-2007, listed under
"SEE ALSO".
</DD>
</DL>
<DT><B> C4) How does discounting work in the case of unigrams? </B>
<DD>
First, unigrams are discounted with the specified method, just like the
higher-order N-grams.
The probability mass freed up in this way
is then either spread evenly over all word types
that would otherwise have zero probability (this essentially
simulates a backoff to zero-grams), or,
if all unigrams already have non-zero probabilities, the
left-over mass is added to
<I> all </I>
unigrams.
In either case all unigram probabilities will sum to 1.
An informational message from
<B> ngram-count </B>
will tell you which case applies.
<DT><B> C5) Why do I get a different number of trigrams when building a 4gram model compared to just a trigram model? </B>
<DD>
This can happen when Kneser-Ney smoothing is used and the trigram cut-off
(<B>-gt3min</B>)
is greater than 1 (as it is with the default value of 2).
The count cutoffs are applied to the modified counts generated as part of KN smoothing,
so in the case of a 4gram model the trigram counts are modified and the set of N-grams above the cutoff will change.
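The underlying reason (a standard property of Kneser-Ney smoothing, stated here
only for clarity) is that the count of each lower-order N-gram is replaced by
the number of distinct words that precede it in the training data:
<PRE>
c'(w2 w3 w4) = number of distinct words w1 such that c(w1 w2 w3 w4) > 0
</PRE>
In a 4gram model the trigrams are lower-order N-grams and receive these modified
counts; in a trigram model they are the highest order and keep their raw counts,
so a different set of trigrams survives the same cutoff.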
</DD>
</DL>
<H3> Out-of-vocabulary, zeroprob, and `unknown' words </H3>
<DL>
<DT><B> D1) What is the perplexity of an OOV (out of vocabulary) word? </B>
<DD>
By default, any word not observed in the training data is considered
OOV, and OOV words are silently ignored by
<A HREF="ngram.1.html">ngram(1)</A>
during perplexity (ppl) calculation.
For example:
<PRE>

$ ngram-count -text turkish.train -lm turkish.lm
$ ngram -lm turkish.lm -ppl turkish.test
file turkish.test: 61031 sentences, 1000015 words, 34153 OOVs
0 zeroprobs, logprob= -3.20177e+06 ppl= 1311.97 ppl1= 2065.09

</PRE>
The statistics printed in the last two lines have the following meanings:
<DL>
<DT><B> 34153 OOVs </B>
<DD>
This is the number of unknown word tokens, i.e., tokens
that appear in
<B> turkish.test </B>
but not in
<B> turkish.train </B>
from which
<B> turkish.lm </B>
was generated.
<DT><B> logprob= -3.20177e+06 </B>
<DD>
This gives us the total logprob ignoring the 34153 unknown word tokens.
The logprob does include the probabilities
assigned to &lt;/s&gt; tokens, which are introduced by
<A HREF="ngram-count.1.html">ngram-count(1)</A>.
Thus the total number of tokens on which this logprob is based is
<PRE>
words - OOVs + sentences = 1000015 - 34153 + 61031
</PRE>
<DT><B> ppl= 1311.97 </B>
<DD>
This gives us the geometric average of 1/probability of
each token, i.e., the perplexity.
The exact expression is:
<PRE>
ppl = 10^(-logprob / (words - OOVs + sentences))
</PRE>
<DT><B> ppl1= 2065.09 </B>
<DD>
This gives us the average perplexity per word excluding the &lt;/s&gt; tokens.
The exact expression is:
<PRE>
ppl1 = 10^(-logprob / (words - OOVs))
</PRE>
</DD>
</DL>
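Plugging the numbers from the example into these formulas confirms the reported
values (up to rounding):
<PRE>
ppl  = 10^(3201770 / (1000015 - 34153 + 61031)) = 10^(3201770 / 1026893) ~ 1312
ppl1 = 10^(3201770 / (1000015 - 34153))         = 10^(3201770 / 965862)  ~ 2065
</PRE>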
You can verify these numbers by running the
<B> ngram </B>
program with the
<B> -debug 2 </B>
option, which gives the probability assigned to each token.
<DT><B> D2) What happens when the OOV word is in the context of an N-gram? </B>
<DD>
Exact details depend on the discounting algorithm used, but typically
the backed-off probability from a lower-order N-gram is used. If the
<B> -unk </B>
option is used as explained below, an &lt;unk&gt; token is assumed to
take the place of the OOV word, and no backoff may be necessary
if a corresponding N-gram containing &lt;unk&gt; is found in the LM.
<DT><B> D3) Isn't it wrong to assign 0 logprob to OOV words? </B>
<DD>
That depends on the application.
If you are comparing multiple language
models which all consider the same set of words as OOV, it may be OK to
ignore OOV words.
Note that perplexity comparisons are only ever meaningful
if the vocabularies of all LMs are the same.
Therefore, to compare LMs with different sets of OOV words
(such as when using different tokenization strategies for morphologically
complex languages), it becomes important
to take into account the true cost of the OOV words, or to model all words,
including OOVs.
<DT><B> D4) How do I take into account the true cost of the OOV words? </B>
<DD>
A simple strategy is to "explode" the OOV words, i.e., split them into
characters in the training and test data.
Typically, words that appear more than once in the training data are
considered to be vocabulary words.
All other words are split into their characters, and the
individual characters are considered tokens.
Assuming that all characters occur at least once in the training data, there
will be no OOV tokens in the test data.
Note that this strategy changes the number of tokens in the data set,
so even though the logprob is meaningful, be careful when reporting ppl results.
<DT><B> D5) What if I want to model the OOV words explicitly? </B>
<DD>
Maybe a better strategy is to have a separate "letter" model for OOV words.
This can easily be created with SRILM by using a training
file listing the OOV words one per line, with their characters
separated by spaces.
The
<B> ngram-count </B>
options
<B> -ukndiscount </B>
and
<B> -order 7 </B>
seem to work well for this purpose.
The final logprob results are obtained in two steps.
First, do regular training and testing on your data using the
<B> -vocab </B>
and
<B> -unk </B>
options.
The resulting logprob will include the cost of the vocabulary words and of an
&lt;unk&gt; token for each OOV word.
Then apply the letter model to each OOV word in the test set.
Add the logprobs.
Here is an example:
<PRE>

# Determine vocabulary:
ngram-count -text turkish.train -write-order 1 -write turkish.train.1cnt
awk '$2>1' turkish.train.1cnt | cut -f1 | sort > turkish.train.vocab
awk '$2==1' turkish.train.1cnt | cut -f1 | sort > turkish.train.oov

# Word model:
ngram-count -kndiscount -interpolate -order 4 -vocab turkish.train.vocab -unk -text turkish.train -lm turkish.train.model
ngram -order 4 -unk -lm turkish.train.model -ppl turkish.test > turkish.test.ppl

# Letter model:
perl -C -lne 'print join(" ", split(""))' turkish.train.oov > turkish.train.oov.split
ngram-count -ukndiscount -interpolate -order 7 -text turkish.train.oov.split -lm turkish.train.oov.model
perl -pe 's/\s+/\n/g' turkish.test | sort > turkish.test.words
comm -23 turkish.test.words turkish.train.vocab > turkish.test.oov
perl -C -lne 'print join(" ", split(""))' turkish.test.oov > turkish.test.oov.split
ngram -order 7 -ppl turkish.test.oov.split -lm turkish.train.oov.model > turkish.test.oov.ppl

# Add the logprobs in turkish.test.ppl and turkish.test.oov.ppl.

</PRE>
Again, perplexities are not directly meaningful as computed by SRILM, but you
can recompute them by hand using the combined logprob value and the number of
original word tokens in the test set.
<DT><B> D6) What are zeroprob words and when do they occur? </B>
<DD>
In-vocabulary words that get zero probability are counted as
"zeroprobs" in the ppl output.
Just like OOV words, they are excluded from the perplexity
computation, since otherwise the perplexity value would be infinity.
There are three reasons why zeroprobs could occur in a
closed-vocabulary setting (the default for SRILM):
<DL>
<DT>
a)
<DD>
If the same vocabulary is used at test time as was used during
training, and smoothing is enabled, then the occurrence of zeroprobs
indicates an anomalous condition and, possibly, a broken language model.
<DT>
b)
<DD>
If smoothing has been disabled (e.g., by using the option
<B>-cdiscount 0</B>),
then the LM will use maximum-likelihood estimates for
the N-grams, and any unseen N-gram is a zeroprob.
<DT>
c)
<DD>
If a different vocabulary file is specified at test time than
the one used in training, then the definition of what counts as an OOV
will change.
In particular, a word that wasn't seen in the training data (but is in the
test vocabulary) will
<I> not </I>
be mapped to
<B> &lt;unk&gt; </B>
and, therefore, not
count as an OOV in the perplexity computation.
However, it will still get zero probability and, therefore, be tallied
as a zeroprob.
</DD>
</DL>
<DT><B> D7) What is the point of using the &lt;unk&gt; token? </B>
<DD>
Using
<B> &lt;unk&gt; </B>
is a practical convenience employed by SRILM.
Words not in the specified vocabulary are mapped to
<B>&lt;unk&gt;</B>,
which is equivalent to performing the same mapping
in a data pre-processing step outside of SRILM.
Other than that,
for both LM estimation and evaluation purposes,
<B> &lt;unk&gt; </B>
is treated like any other word.
(In particular, in the computation of discounted probabilities
there is no special handling of
<B>&lt;unk&gt;</B>.)
<DT><B> D8) So how do I train an open-vocabulary LM with &lt;unk&gt;? </B>
<DD>
First, make sure to use the
<B> ngram-count </B>
<B> -unk </B>
option, which simply indicates that the
<B> &lt;unk&gt; </B>
word should be included in the LM vocabulary, as required for an
open-vocabulary LM.
Without this option, N-grams containing
<B> &lt;unk&gt; </B>
would simply be discarded.
An "open vocabulary" LM is simply one that contains
<B>&lt;unk&gt;</B>,
and can therefore (by virtue of the mapping of OOVs to
<B>&lt;unk&gt;</B>)
assign a non-zero probability to them.
Next, we need to ensure there are actual occurrences of
<B> &lt;unk&gt; </B>
N-grams
in the training data, so we can obtain meaningful probability estimates
for them
(otherwise
<B> &lt;unk&gt; </B>
would only get probability via unigram discounting; see item C4).
To get a proper estimate
of the
<B> &lt;unk&gt; </B>
probability, we need to explicitly specify a vocabulary that is not
a superset of the training data.
One way to do that is to extract the vocabulary from an independent
data set, or to include only words with some minimum count (greater than 1)
in the training data.
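For example, a vocabulary of words seen at least twice could be built and used
roughly as follows (a sketch patterned after the example in D5; file names are
placeholders):
<PRE>
ngram-count -text train.txt -write-order 1 -write train.1cnt
awk '$2>1' train.1cnt | cut -f1 > train.vocab
ngram-count -text train.txt -vocab train.vocab -unk \
	-kndiscount -interpolate -order 3 -lm open-vocab.lm.gz
</PRE>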
<DT><B> D9) Doesn't ngram-count -addsmooth deal with OOV words by adding a constant to occurrence counts? </B>
<DD>
No, all smoothing is applied when building the LM at training time,
so it must use the
<B> &lt;unk&gt; </B>
mechanism to assign probability to words that are first seen in the
test data.
Furthermore, even add-constant smoothing requires a fixed, finite
vocabulary to compute the denominator of its estimator.
</DD>
</DL>
<H2> SEE ALSO </H2>
<A HREF="ngram.1.html">ngram(1)</A>, <A HREF="ngram-count.1.html">ngram-count(1)</A>, <A HREF="training-scripts.1.html">training-scripts(1)</A>, <A HREF="ngram-discount.7.html">ngram-discount(7)</A>.
<BR>
$SRILM/INSTALL
<BR>
<a href="http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/">http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/</a>
<BR>
<a href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13">http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13</a>
<BR>
W. Wang, A. Stolcke, &amp; J. Zheng,
Reranking Machine Translation Hypotheses With Structured and Web-based Language Models. Proc. IEEE Automatic Speech Recognition and Understanding Workshop, pp. 159-164, Kyoto, 2007.
<a href="http://www.speech.sri.com/cgi-bin/run-distill?papers/asru2007-mt-lm.ps.gz">http://www.speech.sri.com/cgi-bin/run-distill?papers/asru2007-mt-lm.ps.gz</a>
<H2> BUGS </H2>
This document is a work in progress.
<H2> AUTHOR </H2>
Andreas Stolcke &lt;andreas.stolcke@microsoft.com&gt;,
Deniz Yuret &lt;dyuret@ku.edu.tr&gt;,
Nitin Madnani &lt;nmadnani@umiacs.umd.edu&gt;
<BR>
Copyright (c) 2007-2010 SRI International
<BR>
Copyright (c) 2011-2017 Andreas Stolcke
<BR>
Copyright (c) 2011-2017 Microsoft Corp.
</BODY>
</HTML>