<! $Id: srilm-faq.7,v 1.13 2019/09/09 22:35:37 stolcke Exp $>
<HTML>
<HEAD>
<TITLE>SRILM-FAQ</TITLE>
</HEAD>
<BODY>
<H1>SRILM-FAQ</H1>
<H2> NAME </H2>
SRILM-FAQ - Frequently asked questions about SRI LM tools
<H2> SYNOPSIS </H2>
<PRE>
man srilm-faq
</PRE>
<H2> DESCRIPTION </H2>
This document tries to answer some of the most frequently asked questions
about SRILM.
<H3> Build issues </H3>
<DL>
<DT><B> A1) I ran ``make World'' but the $SRILM/bin/$MACHINE_TYPE directory is empty. </B>
<DD>
Building the binaries can fail for a variety of reasons.
Check the following:
<DL>
<DT>
a)
<DD>
Make sure the SRILM environment variable is set, or specified on the
make command line, e.g.:
<PRE>
make SRILM=$PWD
</PRE>
<DT>
b)
<DD>
Make sure the
<B> $SRILM/sbin/machine-type </B>
script returns a valid string for the platform you are trying to build on.
Known platforms have machine-specific makefiles called
<PRE>
$SRILM/common/Makefile.machine.$MACHINE_TYPE
</PRE>
If
<B> machine-type </B>
does not work for some reason, you can override its output on the command line:
<PRE>
make MACHINE_TYPE=xyz
</PRE>
If you are building for an unsupported platform, create a new machine-specific
makefile and mail it to stolcke@speech.sri.com.
<DT>
c)
<DD>
Make sure your compiler works and is invoked correctly.
You will probably have to edit the
<B> CC </B>
and
<B> CXX </B>
variables in the platform-specific makefile.
If you have questions about compiler invocation and the best options,
consult a local expert; these things differ widely between sites.
<DT>
d)
<DD>
The default is to compile with Tcl support.
This is in fact only used for some testing binaries (which are
not built by default),
so it can be turned off if Tcl is not available or presents problems.
Edit the machine-specific makefile accordingly.
To use Tcl, locate the
<B> tcl.h </B>
header file and the library itself, and set (for example)
<PRE>
TCL_INCLUDE = -I/path/to/include
TCL_LIBRARY = -L/path/to/lib -ltcl8.4
</PRE>
To disable Tcl support, set
<PRE>
NO_TCL = X
TCL_INCLUDE =
TCL_LIBRARY =
</PRE>
<DT>
e)
<DD>
Make sure you have the C-shell (/bin/csh) installed on your system.
Otherwise you will see something like
<PRE>
make: /sbin/machine-type: Command not found
</PRE>
early in the build process.
On Ubuntu Linux and Cygwin systems "csh" or "tcsh" needs to be installed
as an optional package.
<DT>
f)
<DD>
If you cannot get SRILM to build, save the make output to a file
<PRE>
make World >& make.output
</PRE>
and look for messages indicating errors.
If you still cannot figure out what the problem is, send the error message
and the immediately preceding lines to the srilm-user list.
Also include information about your operating system ("uname -a" output)
and compiler version ("gcc -v" or equivalent for other compilers).
</DD>
</DL>
<DT><B> A2) The regression test outputs differ for all tests. What did I do wrong? </B>
<DD>
Most likely the binaries didn't get built or aren't executable
for some reason.
Check issue A1).
<DT><B> A3) I get differing outputs for some of the regression tests. Is that OK? </B>
<DD>
It might be.
The comparison of reference to actual output allows for small numerical
differences, but
some of the algorithms make hard decisions based on floating-point computations
that can result in different outputs as a result of different compiler
optimizations, machine floating-point precisions (Intel versus IEEE format),
and math libraries.
Tests of this nature include
<B>ngram-class</B>,
<B>disambig</B>,
and
<B>nbest-rover</B>.
When encountering differences, diff the output in the
$SRILM/test/outputs/<I>TEST</I>.$MACHINE_TYPE.stdout file against the corresponding
$SRILM/test/reference/<I>TEST</I>.stdout file, where
<I> TEST </I>
is the name of the test that failed.
Also compare the corresponding .stderr files;
differences there usually indicate operating-system related problems.
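For example, for the <B>ngram-class</B> test the comparison might look as follows
(a sketch; substitute the name of the failing test and your $MACHINE_TYPE value):
<PRE>
diff $SRILM/test/outputs/ngram-class.$MACHINE_TYPE.stdout \
	$SRILM/test/reference/ngram-class.stdout
</PRE>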
</DD>
</DL>
<H3> Large data and memory issues </H3>
<DL>
<DT><B> B1) I'm getting a message saying ``Assertion `body != 0' failed.'' </B>
<DD>
You are running out of memory.
See the subsequent questions, depending on what you are trying to do.
<DT>
Note:
<DD>
The above message means you are running
out of "virtual" memory on your computer, which could be because of
limits in swap space, administrative resource limits, or limitations of
the machine architecture (a 32-bit machine cannot address more than
4GB no matter how many resources your system has).
Another symptom of not enough memory is that your program runs, but
very, very slowly, i.e., it is "paging" or "swapping" as it tries to
use more memory than the machine has RAM installed.
<DT><B> B2) I am trying to count N-grams in a text file and am running out of memory. </B>
<DD>
Don't use
<B> ngram-count </B>
directly to count N-grams.
Instead, use the
<B> make-batch-counts </B>
and
<B> merge-batch-counts </B>
scripts described in
<A HREF="training-scripts.1.html">training-scripts(1)</A>.
That way you can create N-gram counts limited only by the maximum file size
on your system.
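A typical sequence of commands might look roughly like this (a sketch with
placeholder file names; see
<A HREF="training-scripts.1.html">training-scripts(1)</A>
for the exact argument order and defaults):
<PRE>
# file-list names the text files to be counted, 10 files per batch,
# /bin/cat as a no-op text filter, counts/ as the output directory
make-batch-counts file-list 10 /bin/cat counts -order 3
# merge the per-batch count files in counts/ into a single file
merge-batch-counts counts
</PRE>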
<DT><B> B3) I am trying to build an N-gram LM and ngram-count runs out of memory. </B>
<DD>
You are running out of memory either because of the size of the N-gram counts
or because of the size of the LM being built. The following are strategies for reducing the
memory requirements for training LMs.
<DL>
<DT>
a)
<DD>
Assuming you are using Good-Turing or Kneser-Ney discounting, don't use
<B> ngram-count </B>
in "raw" form.
Instead, use the
<B> make-big-lm </B>
wrapper script described in the
<A HREF="training-scripts.1.html">training-scripts(1)</A>
man page.
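An invocation might look roughly like this (a sketch with placeholder file names;
<B>make-big-lm</B>
accepts the usual
<B>ngram-count</B>
estimation options, plus a
<B>-name</B>
prefix for its auxiliary files):
<PRE>
make-big-lm -read corpus.counts.gz -name biglm -order 3 \
	-kndiscount -interpolate -unk -lm big.lm.gz
</PRE>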
<DT>
b)
<DD>
Switch to using the "_c" or "_s" versions of the SRI binaries.
For instructions on how to build them, see the INSTALL file.
Once built, set your executable search path accordingly, and try
<B> make-big-lm </B>
again.
<DT>
c)
<DD>
Raise the minimum counts for N-grams included in the LM, i.e.,
the values of the options
<B>-gt2min</B>,
<B>-gt3min</B>,
<B>-gt4min</B>,
etc.
The higher-order N-grams typically get higher minimum counts.
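For example, the cutoffs could be raised along the following lines (the values
shown are purely illustrative):
<PRE>
ngram-count -order 4 -read corpus.counts.gz \
	-gt2min 2 -gt3min 3 -gt4min 4 \
	-kndiscount -interpolate -lm small.lm.gz
</PRE>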
<DT>
d)
<DD>
Get a machine with more memory.
If you are hitting the limitations of a 32-bit machine architecture,
get a 64-bit machine and recompile SRILM to take advantage of the expanded
address space.
(The MACHINE_TYPE=i686-m64 setting is for systems based on
64-bit AMD processors, as well as recent compatibles from Intel.)
Note that 64-bit pointers will require a memory overhead in
themselves, so you will need a machine with significantly, not just a
little, more memory than 4GB.
</DD>
</DL>
<DT><B> B4) I am trying to apply a large LM to some data and am running out of memory. </B>
<DD>
Again, there are several strategies to reduce memory requirements.
<DL>
<DT>
a)
<DD>
Use the "_c" or "_s" versions of the SRI binaries.
See B3-b) above.
<DT>
b)
<DD>
Precompute the vocabulary of your test data and use the
<B> ngram -limit-vocab </B>
option to load only the N-gram parameters relevant to your data.
This approach should allow you to use arbitrarily
large LMs, provided the data is divided into small enough chunks.
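As an illustration, this could look as follows (file names are placeholders):
<PRE>
ngram-count -text test.data -write-vocab test.vocab
ngram -lm big.lm.gz -limit-vocab -vocab test.vocab -ppl test.data
</PRE>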
<DT>
c)
<DD>
If the LM can be built on a large machine, but then is to be used on
machines with limited memory, use
<B> ngram -prune </B>
to remove the less important parameters of the model.
This usually gives huge size reductions with relatively modest performance
degradation.
The tradeoff is adjustable by varying the pruning parameter.
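For example, a pruned copy of the model might be written out roughly like this
(the threshold value is illustrative; larger values prune more aggressively):
<PRE>
ngram -order 3 -lm big.lm.gz -prune 1e-8 -write-lm pruned.lm.gz
</PRE>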
</DD>
</DL>
<DT><B> B5) How can I reduce the time it takes to load large LMs into memory? </B>
<DD>
The techniques described in B4-b) and B4-c) above also reduce the load time
of the LM.
Additional steps to try are:
<DL>
<DT>
a)
<DD>
Convert the LM into binary format, using
<PRE>
ngram -order <I>N</I> -lm <I>OLDLM</I> -write-bin-lm <I>NEWLM</I>
</PRE>
(This is currently only supported for N-gram-based LMs.)
You can also generate the LM directly in binary format, using
<PRE>
ngram-count ... -lm <I>NEWLM</I> -write-binary-lm
</PRE>
The resulting
<I> NEWLM </I>
file (which should not be compressed) can be used
in place of a textual LM file with all compiled SRILM tools
(but not with
<A HREF="lm-scripts.1.html">lm-scripts(1)</A>).
The format is machine-independent, i.e., it can be read on machines with
different word sizes or byte orders.
Loading binary LMs is faster because
(1) it reduces the overhead of parsing the input data, and
(2) in combination with
<B> -limit-vocab </B>
(see B4-b)
it is much faster to skip sections of the LM that are out-of-vocabulary.
<DT>
Note:
<DD>
There is also a binary format for N-gram counts.
It can be generated using
<PRE>
ngram-count -write-binary <I>COUNTS</I>
</PRE>
and has advantages similar to those of binary LM files.
<DT>
b)
<DD>
Start a "probability server" that loads the LM ahead of time, and
then have "LM clients" query the server instead of computing the
probabilities themselves.
<BR>
The server is started on a machine named
<I> HOST </I>
using
<PRE>
ngram <I>LMOPTIONS</I> -server-port <I>P</I> &amp;
</PRE>
where
<I> P </I>
is an integer &lt; 2^16 that specifies the TCP/IP port number the
server will listen on, and
<I> LMOPTIONS </I>
are whatever options are necessary to define the LM to be used.
<BR>
One or more clients (programs such as
<A HREF="ngram.1.html">ngram(1)</A>,
<A HREF="disambig.1.html">disambig(1)</A>,
<A HREF="lattice-tool.1.html">lattice-tool(1)</A>)
can then query the server using the options
<PRE>
-use-server <I>P</I>@<I>HOST</I> -cache-served-ngrams
</PRE>
instead of the usual "-lm <I>FILE</I>".
The
<B> -cache-served-ngrams </B>
option is not required, but it often speeds things up dramatically by
saving the results of server lookups in the client for reuse.
Server-based LMs may be combined with file-based LMs by interpolation;
see
<A HREF="ngram.1.html">ngram(1)</A>
for details.
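Concretely, the two sides might look like this (the host name "lmhost" and the
port number are placeholders):
<PRE>
# on the machine holding the LM
ngram -lm big.lm.gz -server-port 30000 &amp;
# on a client machine
ngram -use-server 30000@lmhost -cache-served-ngrams -ppl test.data
</PRE>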
</DD>
</DL>
<DT><B> B6) How can I use the Google Web N-gram corpus to build an LM? </B>
<DD>
Google has made a corpus of 5-grams extracted from 1 tera-words of web data
available via the LDC.
However, the data is too large to build a standard backoff N-gram model, even
using the techniques described above.
Instead, we recommend a "count-based" LM smoothed with deleted interpolation.
Such an LM computes probabilities on the fly from the counts, of which only
the subsets needed for a given test set need to be loaded into memory.
LM construction proceeds in the following steps:
<DL>
<DT>
a)
<DD>
Make sure you have built SRI binaries either for a 64-bit machine
(e.g., MACHINE_TYPE=i686-m64 OPTION=_c) or using 64-bit counts (OPTION=_l).
This is necessary because the data contains N-gram counts exceeding
the range of 32-bit integers.
Be sure to invoke all commands below using the path to the appropriate
binary executable directory.
<DT>
b)
<DD>
Prepare a mapping file for some vocabulary mismatches and call it
<B>google.aliases</B>:
<PRE>
&lt;S&gt; &lt;s&gt;
&lt;/S&gt; &lt;/s&gt;
&lt;UNK&gt; &lt;unk&gt;
</PRE>
<DT>
c)
<DD>
Prepare an initial count-LM parameter file
<B>google.countlm.0</B>:
<PRE>
order 5
vocabsize 13588391
totalcount 1024908267229
countmodulus 40
mixweights 15
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
google-counts <I>PATH</I>
</PRE>
where
<I> PATH </I>
points to the location of the Google N-grams, i.e., the directory containing
the subdirectories "1gms", "2gms", etc.
Note that the
<B> vocabsize </B>
and
<B> totalcount </B>
values were obtained from the 1gms/vocab.gz and 1gms/total files, respectively.
(Check that they match and modify them as needed.)
For an explanation of the parameters see the
<A HREF="ngram.1.html">ngram(1)</A>
<B> -count-lm </B>
option.
<DT>
d)
<DD>
Prepare a text file
<B> tune.text </B>
containing data for estimating the mixture weights.
This data should be representative of, but different from, your test data.
Compute the vocabulary of this data using
<PRE>
ngram-count -text tune.text -write-vocab tune.vocab
</PRE>
The vocabulary size should not exceed a few thousand words to keep the memory
requirements in the following steps manageable.
<DT>
e)
<DD>
Estimate the mixture weights:
<PRE>
ngram-count -debug 1 -order 5 -count-lm \
	-text tune.text -vocab tune.vocab \
	-vocab-aliases google.aliases \
	-limit-vocab \
	-init-lm google.countlm.0 \
	-em-iters 100 \
	-lm google.countlm
</PRE>
This will write the estimated LM to
<B>google.countlm</B>.
The output will be identical to the initial LM file, except for the
updated interpolation weights.
<DT>
f)
<DD>
Prepare a test data file
<B>test.text</B>,
and its vocabulary
<B> test.vocab </B>
as in Step d) above.
Then apply the LM to the test data:
<PRE>
ngram -debug 2 -order 5 -count-lm \
	-lm google.countlm \
	-vocab test.vocab \
	-vocab-aliases google.aliases \
	-limit-vocab \
	-ppl test.text > test.ppl
</PRE>
The perplexity output will appear in
<B>test.ppl</B>.
<DT>
g)
<DD>
Note that the Google data uses mixed-case spellings.
To apply the LM to lowercase data one needs to prepare a much more
extensive vocabulary mapping table for the
<B> -vocab-aliases </B>
option, namely, one that maps all
upper- and mixed-case spellings to lowercase strings.
This mapping file should be restricted to the words appearing in
<B> tune.text </B>
and
<B>test.text</B>,
respectively, to avoid defeating the effect of
<B>-limit-vocab</B>.
</DD>
</DL>
</DL>
<H3> Smoothing issues </H3>
<DL>
<DT><B> C1) What is smoothing and discounting all about? </B>
<DD>
<I> Smoothing </I>
refers to methods that assign probabilities to events (N-grams) that
do not occur in the training data.
According to a pure maximum-likelihood estimator these events would have
probability zero, which is plainly wrong, since previously unseen events
in general do occur in independent test data.
Because the probability mass is redistributed away from the seen events
toward the unseen events, the resulting model is "smoother" (closer to uniform)
than the ML model.
<I> Discounting </I>
refers to the approach used by many smoothing methods of adjusting the
empirical counts of seen events downwards.
The ML estimator (count divided by total number of events) is then applied
to the discounted count, resulting in a smoother estimate.
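As a purely illustrative arithmetic example (using absolute discounting with a
discount constant of 0.5, which is not necessarily the method you will use):
if a particular bigram was seen 3 times among 10 bigrams sharing the same
one-word context, its discounted estimate becomes
<PRE>
p = (3 - 0.5) / 10 = 0.25    (instead of the ML estimate 3/10 = 0.3)
</PRE>
and the mass removed from all the seen bigrams in this way is what gets
redistributed to the unseen ones.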
<DT><B> C2) What smoothing methods are there? </B>
<DD>
There are many, and SRILM implements a fairly large selection of the
most popular ones.
A detailed discussion of these is found in a separate document,
<A HREF="ngram-discount.7.html">ngram-discount(7)</A>.
<DT><B> C3) Why am I getting errors or warnings from the smoothing method I'm using? </B>
<DD>
The Good-Turing and Kneser-Ney smoothing methods rely on statistics called
"counts-of-counts", the numbers of words occurring once, twice, three times, etc.
The formulae for these methods become undefined if the counts-of-counts
are zero, or not strictly decreasing.
Some conditions are fatal (such as when the count of singleton words is zero),
others lead to less smoothing (and warnings).
To avoid these problems, check for the following possibilities:
<DL>
<DT>
a)
<DD>
The data could be very sparse, i.e., the training corpus very small.
Try using the Witten-Bell discounting method.
<DT>
b)
<DD>
The vocabulary could be very small, such as when training an LM based on
characters or parts-of-speech.
Smoothing is less of an issue in those cases, and the Witten-Bell method
should work well.
<DT>
c)
<DD>
The data was manipulated in some way, or artificially generated.
For example, duplicating data eliminates the odd-numbered counts-of-counts.
<DT>
d)
<DD>
The vocabulary is limited during counts collection using the
<B>ngram-count</B>
<B> -vocab </B>
option, with the effect that many low-frequency N-grams are eliminated.
The proper approach is to compute smoothing parameters on the full vocabulary.
This happens automatically in the
<B> make-big-lm </B>
wrapper script, which is preferable to direct use of
<B>ngram-count</B>
for other reasons (see issue B3-a above).
<DT>
e)
<DD>
You are estimating an LM from N-gram counts that have been truncated beforehand,
e.g., by removing singleton events.
If you cannot go back to the original data and recompute the counts,
there is a heuristic to extrapolate low counts-of-counts from higher ones.
The heuristic is invoked automatically (and an informational message is output)
when
<B> make-big-lm </B>
is used to estimate LMs with Kneser-Ney smoothing.
For details see the paper by W. Wang et al. in ASRU-2007, listed under
"SEE ALSO".
</DD>
</DL>
<DT><B> C4) How does discounting work in the case of unigrams? </B>
<DD>
First, unigrams are discounted with the specified method, just like the
higher-order N-grams.
The probability mass freed up in this way
is then either spread evenly over all word types
that would otherwise have zero probability (this essentially
simulates a backoff to zero-grams), or,
if all unigrams already have non-zero probabilities, the
left-over mass is added to
<I> all </I>
unigrams.
In either case all unigram probabilities will sum to 1.
An informational message from
<B> ngram-count </B>
will tell you which case applies.
<DT><B> C5) Why do I get a different number of trigrams when building a 4gram model compared to just a trigram model? </B>
<DD>
This can happen when Kneser-Ney smoothing is used and the trigram cut-off
(<B>-gt3min</B>)
is greater than 1 (as it is with the default value of 2).
The count cutoffs are applied to the modified counts generated as part of KN smoothing,
so in the case of a 4gram model the trigram counts are modified and the set of N-grams above the cutoff will change.
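The underlying reason (a standard property of Kneser-Ney smoothing, stated here
only for clarity) is that the count of each lower-order N-gram is replaced by
the number of distinct words that precede it in the training data:
<PRE>
c'(w2 w3 w4) = number of distinct words w1 such that c(w1 w2 w3 w4) > 0
</PRE>
In a 4gram model the trigrams are lower-order N-grams and receive these modified
counts; in a trigram model they are the highest order and keep their raw counts,
so a different set of trigrams survives the same cutoff.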
</DD>
</DL>
<H3> Out-of-vocabulary, zeroprob, and `unknown' words </H3>
<DL>
<DT><B> D1) What is the perplexity of an OOV (out of vocabulary) word? </B>
<DD>
By default, any word not observed in the training data is considered
OOV, and OOV words are silently ignored by
<A HREF="ngram.1.html">ngram(1)</A>
during perplexity (ppl) calculation.
For example:
<PRE>

$ ngram-count -text turkish.train -lm turkish.lm
$ ngram -lm turkish.lm -ppl turkish.test
file turkish.test: 61031 sentences, 1000015 words, 34153 OOVs
0 zeroprobs, logprob= -3.20177e+06 ppl= 1311.97 ppl1= 2065.09

</PRE>
The statistics printed in the last two lines have the following meanings:
<DL>
<DT><B> 34153 OOVs </B>
<DD>
This is the number of unknown word tokens, i.e., tokens
that appear in
<B> turkish.test </B>
but not in
<B> turkish.train </B>
from which
<B> turkish.lm </B>
was generated.
<DT><B> logprob= -3.20177e+06 </B>
<DD>
This gives us the total logprob ignoring the 34153 unknown word tokens.
The logprob does include the probabilities
assigned to &lt;/s&gt; tokens, which are introduced by
<A HREF="ngram-count.1.html">ngram-count(1)</A>.
Thus the total number of tokens on which this logprob is based is
<PRE>
words - OOVs + sentences = 1000015 - 34153 + 61031
</PRE>
<DT><B> ppl= 1311.97 </B>
<DD>
This gives us the geometric average of 1/probability of
each token, i.e., the perplexity.
The exact expression is:
<PRE>
ppl = 10^(-logprob / (words - OOVs + sentences))
</PRE>
<DT><B> ppl1= 2065.09 </B>
<DD>
This gives us the average perplexity per word excluding the &lt;/s&gt; tokens.
The exact expression is:
<PRE>
ppl1 = 10^(-logprob / (words - OOVs))
</PRE>
</DD>
</DL>
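Plugging the numbers from the example into these formulas confirms the reported
values (up to rounding):
<PRE>
ppl  = 10^(3201770 / (1000015 - 34153 + 61031)) = 10^(3201770 / 1026893) ~ 1312
ppl1 = 10^(3201770 / (1000015 - 34153))         = 10^(3201770 / 965862)  ~ 2065
</PRE>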
You can verify these numbers by running the
<B> ngram </B>
program with the
<B> -debug 2 </B>
option, which gives the probability assigned to each token.
<DT><B> D2) What happens when the OOV word is in the context of an N-gram? </B>
<DD>
Exact details depend on the discounting algorithm used, but typically
the backed-off probability from a lower-order N-gram is used. If the
<B> -unk </B>
option is used as explained below, an &lt;unk&gt; token is assumed to
take the place of the OOV word, and no backoff may be necessary
if a corresponding N-gram containing &lt;unk&gt; is found in the LM.
<DT><B> D3) Isn't it wrong to assign 0 logprob to OOV words? </B>
<DD>
That depends on the application.
If you are comparing multiple language
models which all consider the same set of words as OOV, it may be OK to
ignore OOV words.
Note that perplexity comparisons are only ever meaningful
if the vocabularies of all LMs are the same.
Therefore, to compare LMs with different sets of OOV words
(such as when using different tokenization strategies for morphologically
complex languages), it becomes important
to take into account the true cost of the OOV words, or to model all words,
including OOVs.
<DT><B> D4) How do I take into account the true cost of the OOV words? </B>
<DD>
A simple strategy is to "explode" the OOV words, i.e., split them into
characters in the training and test data.
Typically, words that appear more than once in the training data are
considered to be vocabulary words.
All other words are split into their characters, and the
individual characters are considered tokens.
Assuming that all characters occur at least once in the training data, there
will be no OOV tokens in the test data.
Note that this strategy changes the number of tokens in the data set,
so even though the logprob is meaningful, be careful when reporting ppl results.
<DT><B> D5) What if I want to model the OOV words explicitly? </B>
<DD>
Maybe a better strategy is to have a separate "letter" model for OOV words.
This can easily be created with SRILM by using a training
file listing the OOV words one per line, with their characters
separated by spaces.
The
<B> ngram-count </B>
options
<B> -ukndiscount </B>
and
<B> -order 7 </B>
seem to work well for this purpose.
The final logprob results are obtained in two steps.
First, do regular training and testing on your data using the
<B> -vocab </B>
and
<B> -unk </B>
options.
The resulting logprob will include the cost of the vocabulary words and of an
&lt;unk&gt; token for each OOV word.
Then apply the letter model to each OOV word in the test set.
Add the logprobs.
Here is an example:
<PRE>

# Determine vocabulary:
ngram-count -text turkish.train -write-order 1 -write turkish.train.1cnt
awk '$2>1' turkish.train.1cnt | cut -f1 | sort > turkish.train.vocab
awk '$2==1' turkish.train.1cnt | cut -f1 | sort > turkish.train.oov

# Word model:
ngram-count -kndiscount -interpolate -order 4 -vocab turkish.train.vocab -unk -text turkish.train -lm turkish.train.model
ngram -order 4 -unk -lm turkish.train.model -ppl turkish.test > turkish.test.ppl

# Letter model:
perl -C -lne 'print join(" ", split(""))' turkish.train.oov > turkish.train.oov.split
ngram-count -ukndiscount -interpolate -order 7 -text turkish.train.oov.split -lm turkish.train.oov.model
perl -pe 's/\s+/\n/g' turkish.test | sort > turkish.test.words
comm -23 turkish.test.words turkish.train.vocab > turkish.test.oov
perl -C -lne 'print join(" ", split(""))' turkish.test.oov > turkish.test.oov.split
ngram -order 7 -ppl turkish.test.oov.split -lm turkish.train.oov.model > turkish.test.oov.ppl

# Add the logprobs in turkish.test.ppl and turkish.test.oov.ppl.

</PRE>
Again, perplexities are not directly meaningful as computed by SRILM, but you
can recompute them by hand using the combined logprob value and the number of
original word tokens in the test set.
<DT><B> D6) What are zeroprob words and when do they occur? </B>
<DD>
In-vocabulary words that get zero probability are counted as
"zeroprobs" in the ppl output.
Just like OOV words, they are excluded from the perplexity
computation, since otherwise the perplexity value would be infinity.
There are three reasons why zeroprobs could occur in a
closed-vocabulary setting (the default for SRILM):
<DL>
<DT>
a)
<DD>
If the same vocabulary is used at test time as was used during
training, and smoothing is enabled, then the occurrence of zeroprobs
indicates an anomalous condition and, possibly, a broken language model.
<DT>
b)
<DD>
If smoothing has been disabled (e.g., by using the option
<B>-cdiscount 0</B>),
then the LM will use maximum-likelihood estimates for
the N-grams, and any unseen N-gram is a zeroprob.
<DT>
c)
<DD>
If a different vocabulary file is specified at test time than
the one used in training, then the definition of what counts as an OOV
will change.
In particular, a word that wasn't seen in the training data (but is in the
test vocabulary) will
<I> not </I>
be mapped to
<B> &lt;unk&gt; </B>
and, therefore, not
count as an OOV in the perplexity computation.
However, it will still get zero probability and, therefore, be tallied
as a zeroprob.
</DD>
</DL>
<DT><B> D7) What is the point of using the &lt;unk&gt; token? </B>
<DD>
Using
<B> &lt;unk&gt; </B>
is a practical convenience employed by SRILM.
Words not in the specified vocabulary are mapped to
<B>&lt;unk&gt;</B>,
which is equivalent to performing the same mapping
in a data pre-processing step outside of SRILM.
Other than that,
for both LM estimation and evaluation purposes,
<B> &lt;unk&gt; </B>
is treated like any other word.
(In particular, in the computation of discounted probabilities
there is no special handling of
<B>&lt;unk&gt;</B>.)
<DT><B> D8) So how do I train an open-vocabulary LM with &lt;unk&gt;? </B>
<DD>
First, make sure to use the
<B> ngram-count </B>
<B> -unk </B>
option, which simply indicates that the
<B> &lt;unk&gt; </B>
word should be included in the LM vocabulary, as required for an
open-vocabulary LM.
Without this option, N-grams containing
<B> &lt;unk&gt; </B>
would simply be discarded.
An "open vocabulary" LM is simply one that contains
<B>&lt;unk&gt;</B>,
and can therefore (by virtue of the mapping of OOVs to
<B>&lt;unk&gt;</B>)
assign a non-zero probability to them.
Next, we need to ensure there are actual occurrences of
<B> &lt;unk&gt; </B>
N-grams
in the training data, so we can obtain meaningful probability estimates
for them
(otherwise
<B> &lt;unk&gt; </B>
would only get probability via unigram discounting; see item C4).
To get a proper estimate
of the
<B> &lt;unk&gt; </B>
probability, we need to explicitly specify a vocabulary that is not
a superset of the training data.
One way to do that is to extract the vocabulary from an independent
data set, or to include only words with some minimum count (greater than 1)
in the training data.
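For example, a vocabulary of words seen at least twice could be built and used
roughly as follows (a sketch patterned after the example in D5; file names are
placeholders):
<PRE>
ngram-count -text train.txt -write-order 1 -write train.1cnt
awk '$2>1' train.1cnt | cut -f1 > train.vocab
ngram-count -text train.txt -vocab train.vocab -unk \
	-kndiscount -interpolate -order 3 -lm open-vocab.lm.gz
</PRE>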
<DT><B> D9) Doesn't ngram-count -addsmooth deal with OOV words by adding a constant to occurrence counts? </B>
<DD>
No, all smoothing is applied when building the LM at training time,
so it must use the
<B> &lt;unk&gt; </B>
mechanism to assign probability to words that are first seen in the
test data.
Furthermore, even add-constant smoothing requires a fixed, finite
vocabulary to compute the denominator of its estimator.
</DD>
</DL>
<H2> SEE ALSO </H2>
<A HREF="ngram.1.html">ngram(1)</A>, <A HREF="ngram-count.1.html">ngram-count(1)</A>, <A HREF="training-scripts.1.html">training-scripts(1)</A>, <A HREF="ngram-discount.7.html">ngram-discount(7)</A>.
<BR>
$SRILM/INSTALL
<BR>
<a href="http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/">http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/</a>
<BR>
<a href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13">http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13</a>
<BR>
W. Wang, A. Stolcke, &amp; J. Zheng,
Reranking Machine Translation Hypotheses With Structured and Web-based Language Models. Proc. IEEE Automatic Speech Recognition and Understanding Workshop, pp. 159-164, Kyoto, 2007.
<a href="http://www.speech.sri.com/cgi-bin/run-distill?papers/asru2007-mt-lm.ps.gz">http://www.speech.sri.com/cgi-bin/run-distill?papers/asru2007-mt-lm.ps.gz</a>
<H2> BUGS </H2>
This document is a work in progress.
<H2> AUTHOR </H2>
Andreas Stolcke &lt;andreas.stolcke@microsoft.com&gt;,
Deniz Yuret &lt;dyuret@ku.edu.tr&gt;,
Nitin Madnani &lt;nmadnani@umiacs.umd.edu&gt;
<BR>
Copyright (c) 2007-2010 SRI International
<BR>
Copyright (c) 2011-2017 Andreas Stolcke
<BR>
Copyright (c) 2011-2017 Microsoft Corp.
</BODY>
</HTML>