competition update
language_model/srilm-1.7.3/man/html/ngram-discount.7.html (new file)
@@ -0,0 +1,690 @@
<!-- $Id: ngram-discount.7,v 1.5 2019/09/09 22:35:37 stolcke Exp $ -->
<HTML>
<HEAD>
<TITLE>ngram-discount</TITLE>
</HEAD>
<BODY>
<H1>ngram-discount</H1>
<H2> NAME </H2>
ngram-discount - notes on the N-gram smoothing implementations in SRILM
<H2> NOTATION </H2>
<DL>
<DT><I>a</I>_<I>z</I>
<DD>
An N-gram where <I>a</I> is the first word, <I>z</I> is the last word, and "_"
represents 0 or more words in between.
<DT><I>p</I>(<I>a</I>_<I>z</I>)
<DD>
The estimated conditional probability of the <I>n</I>th word <I>z</I>
given the first <I>n</I>-1 words (<I>a</I>_) of an N-gram.
<DT><I>a</I>_
<DD>
The <I>n</I>-1 word prefix of the N-gram <I>a</I>_<I>z</I>.
<DT>_<I>z</I>
<DD>
The <I>n</I>-1 word suffix of the N-gram <I>a</I>_<I>z</I>.
<DT><I>c</I>(<I>a</I>_<I>z</I>)
<DD>
The count of N-gram <I>a</I>_<I>z</I> in the training data.
<DT><I>n</I>(*_<I>z</I>)
<DD>
The number of unique N-grams that match a given pattern; "(*)" represents
a wildcard matching a single word.
<DT><I>n1</I>, <I>n</I>[1]
<DD>
The number of unique N-grams with count = 1.
</DD>
</DL>
<H2> DESCRIPTION </H2>
<P>
N-gram models try to estimate the probability of a word <I>z</I> in the
context of the previous <I>n</I>-1 words (<I>a</I>_), i.e.,
<I>Pr</I>(<I>z</I>|<I>a</I>_).
We will denote this conditional probability using <I>p</I>(<I>a</I>_<I>z</I>)
for convenience.
One way to estimate <I>p</I>(<I>a</I>_<I>z</I>) is to look at the number of
times word <I>z</I> has followed the previous <I>n</I>-1 words (<I>a</I>_):
<PRE>

    (1)  <I>p</I>(<I>a</I>_<I>z</I>) = <I>c</I>(<I>a</I>_<I>z</I>) / <I>c</I>(<I>a</I>_)

</PRE>
This is known as the maximum likelihood (ML) estimate.
Unfortunately it does not work very well because it assigns zero probability to
N-grams that have not been observed in the training data.
To avoid the zero probabilities, we take some probability mass from the observed
N-grams and distribute it to unobserved N-grams.
Such redistribution is known as smoothing or discounting.
<P>
Most existing smoothing algorithms can be described by the following equation:
<PRE>

    (2)  <I>p</I>(<I>a</I>_<I>z</I>) = (<I>c</I>(<I>a</I>_<I>z</I>) > 0) ? <I>f</I>(<I>a</I>_<I>z</I>) : bow(<I>a</I>_) <I>p</I>(_<I>z</I>)

</PRE>
If the N-gram <I>a</I>_<I>z</I> has been observed in the training data, we use
the distribution <I>f</I>(<I>a</I>_<I>z</I>).
Typically <I>f</I>(<I>a</I>_<I>z</I>) is discounted to be less than the ML
estimate, so we have some leftover probability for the <I>z</I> words unseen
in the context (<I>a</I>_).
Different algorithms mainly differ in how they discount the ML estimate to get
<I>f</I>(<I>a</I>_<I>z</I>).
<P>
If the N-gram <I>a</I>_<I>z</I> has not been observed in the training data, we
use the lower order distribution <I>p</I>(_<I>z</I>).
If the context has never been observed (<I>c</I>(<I>a</I>_) = 0),
we can use the lower order distribution directly (bow(<I>a</I>_) = 1).
Otherwise we need to compute a backoff weight (bow) to
make sure probabilities are normalized:
<PRE>

    Sum_<I>z</I> <I>p</I>(<I>a</I>_<I>z</I>) = 1

</PRE>
<P>
Let <I>Z</I> be the set of all words in the vocabulary,
<I>Z0</I> be the set of all words with <I>c</I>(<I>a</I>_<I>z</I>) = 0, and
<I>Z1</I> be the set of all words with <I>c</I>(<I>a</I>_<I>z</I>) > 0.
Given <I>f</I>(<I>a</I>_<I>z</I>), bow(<I>a</I>_) can be determined as follows:
<PRE>

    (3)  Sum_<I>Z</I> <I>p</I>(<I>a</I>_<I>z</I>) = 1
         Sum_<I>Z1</I> <I>f</I>(<I>a</I>_<I>z</I>) + Sum_<I>Z0</I> bow(<I>a</I>_) <I>p</I>(_<I>z</I>) = 1
         bow(<I>a</I>_) = (1 - Sum_<I>Z1</I> <I>f</I>(<I>a</I>_<I>z</I>)) / Sum_<I>Z0</I> <I>p</I>(_<I>z</I>)
                  = (1 - Sum_<I>Z1</I> <I>f</I>(<I>a</I>_<I>z</I>)) / (1 - Sum_<I>Z1</I> <I>p</I>(_<I>z</I>))
                  = (1 - Sum_<I>Z1</I> <I>f</I>(<I>a</I>_<I>z</I>)) / (1 - Sum_<I>Z1</I> <I>f</I>(_<I>z</I>))

</PRE>
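<P>
To make Equation 3 concrete, here is a minimal Python sketch (an illustration,
not part of SRILM) that computes bow(<I>a</I>_) from already-discounted
probabilities; the dictionaries <I>f_high</I> and <I>f_low</I> and their
values are made-up toy inputs.
<PRE>
# Hypothetical sketch of Equation 3: computing the backoff weight bow(a_).
# f_high maps words z seen after the context (a_) to f(a_z);
# f_low maps every word z in the vocabulary to the lower-order f(_z).

def backoff_weight(f_high, f_low):
    seen = set(f_high)                            # Z1: words with c(a_z) > 0
    num = 1.0 - sum(f_high[z] for z in seen)      # leftover mass at this order
    den = 1.0 - sum(f_low[z] for z in seen)       # lower-order mass outside Z1
    return num / den

# Toy example: two words seen after the context, three-word vocabulary.
f_high = {"b": 0.5, "c": 0.2}
f_low = {"a": 0.3, "b": 0.4, "c": 0.2}
print(backoff_weight(f_high, f_low))              # (1 - 0.7) / (1 - 0.6) = 0.75
</PRE>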
<P>
Smoothing is generally done in one of two ways.
Backoff models compute <I>p</I>(<I>a</I>_<I>z</I>) based on the N-gram counts
<I>c</I>(<I>a</I>_<I>z</I>) when <I>c</I>(<I>a</I>_<I>z</I>) > 0, and
only consider lower order counts <I>c</I>(_<I>z</I>) when
<I>c</I>(<I>a</I>_<I>z</I>) = 0.
Interpolated models take lower order counts into account when
<I>c</I>(<I>a</I>_<I>z</I>) > 0 as well.
A common way to express an interpolated model is:
<PRE>

    (4)  <I>p</I>(<I>a</I>_<I>z</I>) = <I>g</I>(<I>a</I>_<I>z</I>) + bow(<I>a</I>_) <I>p</I>(_<I>z</I>)

</PRE>
Here <I>g</I>(<I>a</I>_<I>z</I>) = 0 when <I>c</I>(<I>a</I>_<I>z</I>) = 0, and
<I>g</I>(<I>a</I>_<I>z</I>) is discounted to be less than the ML estimate when
<I>c</I>(<I>a</I>_<I>z</I>) > 0, to reserve some probability mass for the unseen
<I>z</I> words.
Given <I>g</I>(<I>a</I>_<I>z</I>), bow(<I>a</I>_) can be determined as follows:
<PRE>

    (5)  Sum_<I>Z</I> <I>p</I>(<I>a</I>_<I>z</I>) = 1
         Sum_<I>Z1</I> <I>g</I>(<I>a</I>_<I>z</I>) + Sum_<I>Z</I> bow(<I>a</I>_) <I>p</I>(_<I>z</I>) = 1
         bow(<I>a</I>_) = 1 - Sum_<I>Z1</I> <I>g</I>(<I>a</I>_<I>z</I>)

</PRE>
<P>
An interpolated model can also be expressed in the form of equation
(2), which is the way it is represented in the ARPA format model files
in SRILM:
<PRE>

    (6)  <I>f</I>(<I>a</I>_<I>z</I>) = <I>g</I>(<I>a</I>_<I>z</I>) + bow(<I>a</I>_) <I>p</I>(_<I>z</I>)
         <I>p</I>(<I>a</I>_<I>z</I>) = (<I>c</I>(<I>a</I>_<I>z</I>) > 0) ? <I>f</I>(<I>a</I>_<I>z</I>) : bow(<I>a</I>_) <I>p</I>(_<I>z</I>)

</PRE>
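<P>
The following minimal Python sketch (hypothetical, not SRILM code) illustrates
the conversion in Equation 6: folding an interpolated model's <I>g</I> and
bow(<I>a</I>_) into the <I>f</I> values stored in ARPA files, and checking that
the resulting distribution still sums to one. All values are toy inputs.
<PRE>
# Hypothetical sketch of Equation 6: interpolated form folded into backoff form.
# g maps seen words z to g(a_z); p_low maps every word z to p(_z).

def interpolated_to_backoff(g, p_low):
    bow = 1.0 - sum(g.values())                    # Eqn. 5
    f = {z: g[z] + bow * p_low[z] for z in g}      # Eqn. 6
    return f, bow

g = {"b": 0.45, "c": 0.15}
p_low = {"a": 0.3, "b": 0.4, "c": 0.3}
f, bow = interpolated_to_backoff(g, p_low)
# Seen words use f(a_z); unseen words use bow(a_) * p(_z).  Total mass is 1.
total = sum(f.values()) + bow * sum(p for z, p in p_low.items() if z not in f)
print(f, bow, total)
</PRE>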
<P>
Most algorithms in SRILM have both backoff and interpolated versions.
Empirically, interpolated algorithms usually do better than the backoff
ones, and Kneser-Ney does better than the others.

<H2> OPTIONS </H2>
<P>
This section describes the formulation of each discounting option in
<A HREF="ngram-count.1.html">ngram-count(1)</A>.
After giving the motivation for each discounting method,
we will give expressions for <I>f</I>(<I>a</I>_<I>z</I>) and bow(<I>a</I>_)
of Equation 2 in terms of the counts.
Note that some counts may not be included in the model file because of the
<B>-gtmin</B> options; see Warning 4 in the next section.
<P>
Backoff versions are the default, but interpolated versions of most
models are available using the <B>-interpolate</B> option.
In this case we will express <I>g</I>(<I>a</I>_<I>z</I>) and bow(<I>a</I>_)
of Equation 4 in terms of the counts as well.
Note that the ARPA format model files store the interpolated
models and the backoff models the same way, using <I>f</I>(<I>a</I>_<I>z</I>)
and bow(<I>a</I>_); see Warning 3 in the next section.
The conversion between backoff and interpolated formulations is given in
Equation 6.
<P>
The discounting options may be followed by a digit (1-9) to indicate
that only specific N-gram orders are affected.
See <A HREF="ngram-count.1.html">ngram-count(1)</A> for more details.
<DL>
<DT><B>-cdiscount</B> <I>D</I>
<DD>
Ney's absolute discounting using <I>D</I> as the constant to subtract.
<I>D</I> should be between 0 and 1.
If <I>Z1</I> is the set of all words <I>z</I> with <I>c</I>(<I>a</I>_<I>z</I>) > 0:
<PRE>

    <I>f</I>(<I>a</I>_<I>z</I>) = (<I>c</I>(<I>a</I>_<I>z</I>) - <I>D</I>) / <I>c</I>(<I>a</I>_)
    <I>p</I>(<I>a</I>_<I>z</I>) = (<I>c</I>(<I>a</I>_<I>z</I>) > 0) ? <I>f</I>(<I>a</I>_<I>z</I>) : bow(<I>a</I>_) <I>p</I>(_<I>z</I>)    ; Eqn. 2
    bow(<I>a</I>_) = (1 - Sum_<I>Z1</I> <I>f</I>(<I>a</I>_<I>z</I>)) / (1 - Sum_<I>Z1</I> <I>f</I>(_<I>z</I>))    ; Eqn. 3

</PRE>
With the <B>-interpolate</B> option we have:
<PRE>

    <I>g</I>(<I>a</I>_<I>z</I>) = max(0, <I>c</I>(<I>a</I>_<I>z</I>) - <I>D</I>) / <I>c</I>(<I>a</I>_)
    <I>p</I>(<I>a</I>_<I>z</I>) = <I>g</I>(<I>a</I>_<I>z</I>) + bow(<I>a</I>_) <I>p</I>(_<I>z</I>)    ; Eqn. 4
    bow(<I>a</I>_) = 1 - Sum_<I>Z1</I> <I>g</I>(<I>a</I>_<I>z</I>)    ; Eqn. 5
             = <I>D</I> <I>n</I>(<I>a</I>_*) / <I>c</I>(<I>a</I>_)

</PRE>
The suggested discount factor is:
<PRE>

    <I>D</I> = <I>n1</I> / (<I>n1</I> + 2*<I>n2</I>)

</PRE>
where <I>n1</I> and <I>n2</I> are the total number of N-grams with exactly one
and two counts, respectively.
Different discounting constants can be specified for different N-gram orders
using the options <B>-cdiscount1</B>, <B>-cdiscount2</B>, etc.
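<P>
As an illustration only, here is a minimal Python sketch of the interpolated
absolute-discounting computation above; the counts and lower-order
probabilities are made-up toy values, and the function is not SRILM code.
<PRE>
# Hypothetical sketch of -cdiscount with -interpolate: absolute discounting.
from collections import Counter

def absolute_discount(counts, p_low, D):
    """counts: c(a_z) for one context (a_); p_low: lower-order p(_z)."""
    total = sum(counts.values())                   # c(a_)
    g = {z: max(0, c - D) / total for z, c in counts.items()}
    bow = D * len(counts) / total                  # D n(a_*) / c(a_), per Eqn. 5
    return {z: g.get(z, 0.0) + bow * p_low[z] for z in p_low}   # Eqn. 4

counts = Counter({"b": 3, "c": 1})
p_low = {"a": 0.2, "b": 0.5, "c": 0.3}
print(absolute_discount(counts, p_low, D=0.5))     # sums to 1 over {a, b, c}
</PRE>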
<DT><B>-kndiscount</B> and <B>-ukndiscount</B>
<DD>
Kneser-Ney discounting.
This is similar to absolute discounting in that the discounted probability
is computed by subtracting a constant <I>D</I> from the N-gram count.
The options <B>-kndiscount</B> and <B>-ukndiscount</B> differ as to how this
constant is computed.
<BR>
The main idea of Kneser-Ney is to use a modified probability estimate
for the lower order N-grams used for backoff.
Specifically, the modified probability for a lower order N-gram is taken to be
proportional to the number of unique words that precede it in the training
data.
With discounting and normalization we get:
<PRE>

    <I>f</I>(<I>a</I>_<I>z</I>) = (<I>c</I>(<I>a</I>_<I>z</I>) - <I>D0</I>) / <I>c</I>(<I>a</I>_)    ;; for highest order N-grams
    <I>f</I>(_<I>z</I>)  = (<I>n</I>(*_<I>z</I>) - <I>D1</I>) / <I>n</I>(*_*)    ;; for lower order N-grams

</PRE>
where the <I>n</I>(*_<I>z</I>) notation represents the number of unique N-grams
that match a given pattern, with (*) used as a wildcard for a single word.
<I>D0</I> and <I>D1</I> represent two different discounting constants, as each
N-gram order uses a different discounting constant.
The resulting conditional probability and the backoff weight are calculated as
given in equations (2) and (3):
<PRE>

    <I>p</I>(<I>a</I>_<I>z</I>) = (<I>c</I>(<I>a</I>_<I>z</I>) > 0) ? <I>f</I>(<I>a</I>_<I>z</I>) : bow(<I>a</I>_) <I>p</I>(_<I>z</I>)    ; Eqn. 2
    bow(<I>a</I>_) = (1 - Sum_<I>Z1</I> <I>f</I>(<I>a</I>_<I>z</I>)) / (1 - Sum_<I>Z1</I> <I>f</I>(_<I>z</I>))    ; Eqn. 3

</PRE>
The option <B>-interpolate</B> is used to create the interpolated versions of
<B>-kndiscount</B> and <B>-ukndiscount</B>.
In this case we have:
<PRE>

    <I>p</I>(<I>a</I>_<I>z</I>) = <I>g</I>(<I>a</I>_<I>z</I>) + bow(<I>a</I>_) <I>p</I>(_<I>z</I>)    ; Eqn. 4

</PRE>
Let <I>Z1</I> be the set {<I>z</I>: <I>c</I>(<I>a</I>_<I>z</I>) > 0}.
For highest order N-grams we have:
<PRE>

    <I>g</I>(<I>a</I>_<I>z</I>) = max(0, <I>c</I>(<I>a</I>_<I>z</I>) - <I>D</I>) / <I>c</I>(<I>a</I>_)
    bow(<I>a</I>_) = 1 - Sum_<I>Z1</I> <I>g</I>(<I>a</I>_<I>z</I>)
             = 1 - Sum_<I>Z1</I> <I>c</I>(<I>a</I>_<I>z</I>) / <I>c</I>(<I>a</I>_) + Sum_<I>Z1</I> <I>D</I> / <I>c</I>(<I>a</I>_)
             = <I>D</I> <I>n</I>(<I>a</I>_*) / <I>c</I>(<I>a</I>_)

</PRE>
Let <I>Z2</I> be the set {<I>z</I>: <I>n</I>(*_<I>z</I>) > 0}.
For lower order N-grams we have:
<PRE>

    <I>g</I>(_<I>z</I>) = max(0, <I>n</I>(*_<I>z</I>) - <I>D</I>) / <I>n</I>(*_*)
    bow(_) = 1 - Sum_<I>Z2</I> <I>g</I>(_<I>z</I>)
           = 1 - Sum_<I>Z2</I> <I>n</I>(*_<I>z</I>) / <I>n</I>(*_*) + Sum_<I>Z2</I> <I>D</I> / <I>n</I>(*_*)
           = <I>D</I> <I>n</I>(_*) / <I>n</I>(*_*)

</PRE>
The original Kneser-Ney discounting (<B>-ukndiscount</B>) uses one discounting
constant for each N-gram order.
These constants are estimated as
<PRE>

    <I>D</I> = <I>n1</I> / (<I>n1</I> + 2*<I>n2</I>)

</PRE>
where <I>n1</I> and <I>n2</I> are the total number of N-grams with exactly one
and two counts, respectively.
<BR>
Chen and Goodman's modified Kneser-Ney discounting (<B>-kndiscount</B>)
uses three discounting constants for each N-gram order: one for one-count
N-grams, one for two-count N-grams, and one for three-plus-count N-grams:
<PRE>

    <I>Y</I>   = <I>n1</I> / (<I>n1</I> + 2*<I>n2</I>)
    <I>D1</I>  = 1 - 2<I>Y</I>(<I>n2</I>/<I>n1</I>)
    <I>D2</I>  = 2 - 3<I>Y</I>(<I>n3</I>/<I>n2</I>)
    <I>D3+</I> = 3 - 4<I>Y</I>(<I>n4</I>/<I>n3</I>)

</PRE>
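<P>
For illustration, a small Python sketch (not SRILM code) of how the modified
Kneser-Ney constants above can be computed from count-of-count statistics;
the example values of <I>n1</I>..<I>n4</I> are invented.
<PRE>
# Hypothetical sketch of the Chen-Goodman modified Kneser-Ney discounts.

def modified_kn_discounts(n1, n2, n3, n4):
    Y = n1 / (n1 + 2 * n2)
    D1 = 1 - 2 * Y * (n2 / n1)
    D2 = 2 - 3 * Y * (n3 / n2)
    D3plus = 3 - 4 * Y * (n4 / n3)
    return D1, D2, D3plus

# Made-up count-of-counts: 1000 one-count N-grams, 400 two-count, ...
print(modified_kn_discounts(1000, 400, 200, 120))
</PRE>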
<DT><B>Warning:</B>
<DD>
SRILM implements Kneser-Ney discounting by actually modifying the
counts of the lower order N-grams.
Thus, when the <B>-write</B> option is used to write the counts with
<B>-kndiscount</B> or <B>-ukndiscount</B>,
only the highest order N-grams and N-grams that start with &lt;s&gt; will have
their regular counts <I>c</I>(<I>a</I>_<I>z</I>); all others will have the
modified counts <I>n</I>(*_<I>z</I>) instead.
See Warning 2 in the next section.
<DT><B>-wbdiscount</B>
<DD>
Witten-Bell discounting.
The intuition is that the weight given to the lower order model should be
proportional to the probability of observing an unseen word in the current
context (<I>a</I>_).
Witten-Bell computes this weight as:
<PRE>

    bow(<I>a</I>_) = <I>n</I>(<I>a</I>_*) / (<I>n</I>(<I>a</I>_*) + <I>c</I>(<I>a</I>_))

</PRE>
Here <I>n</I>(<I>a</I>_*) represents the number of unique words following the
context (<I>a</I>_) in the training data.
Witten-Bell is originally an interpolated discounting method,
so with the <B>-interpolate</B> option we get:
<PRE>

    <I>g</I>(<I>a</I>_<I>z</I>) = <I>c</I>(<I>a</I>_<I>z</I>) / (<I>n</I>(<I>a</I>_*) + <I>c</I>(<I>a</I>_))
    <I>p</I>(<I>a</I>_<I>z</I>) = <I>g</I>(<I>a</I>_<I>z</I>) + bow(<I>a</I>_) <I>p</I>(_<I>z</I>)    ; Eqn. 4

</PRE>
Without the <B>-interpolate</B> option we have the backoff version, which is
implemented by taking <I>f</I>(<I>a</I>_<I>z</I>) to be the same as the
interpolated <I>g</I>(<I>a</I>_<I>z</I>):
<PRE>

    <I>f</I>(<I>a</I>_<I>z</I>) = <I>c</I>(<I>a</I>_<I>z</I>) / (<I>n</I>(<I>a</I>_*) + <I>c</I>(<I>a</I>_))
    <I>p</I>(<I>a</I>_<I>z</I>) = (<I>c</I>(<I>a</I>_<I>z</I>) > 0) ? <I>f</I>(<I>a</I>_<I>z</I>) : bow(<I>a</I>_) <I>p</I>(_<I>z</I>)    ; Eqn. 2
    bow(<I>a</I>_) = (1 - Sum_<I>Z1</I> <I>f</I>(<I>a</I>_<I>z</I>)) / (1 - Sum_<I>Z1</I> <I>f</I>(_<I>z</I>))    ; Eqn. 3

</PRE>
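<P>
A minimal Python sketch (an illustration, not the SRILM implementation) of the
interpolated Witten-Bell estimate above, using toy counts and a toy
lower-order distribution:
<PRE>
# Hypothetical sketch of -wbdiscount with -interpolate for one context (a_).
from collections import Counter

def witten_bell(counts, p_low):
    total = sum(counts.values())                   # c(a_)
    types = len(counts)                            # n(a_*): unique words after (a_)
    bow = types / (types + total)
    g = {z: c / (types + total) for z, c in counts.items()}
    return {z: g.get(z, 0.0) + bow * p_low[z] for z in p_low}   # Eqn. 4

counts = Counter({"b": 3, "c": 1})
p_low = {"a": 0.2, "b": 0.5, "c": 0.3}
p = witten_bell(counts, p_low)
print(p, sum(p.values()))                          # probabilities sum to 1
</PRE>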
<DT><B>-ndiscount</B>
<DD>
Ristad's natural discounting law.
See Ristad's technical report "A natural law of succession" for a
justification of the discounting factor.
The <B>-interpolate</B> option has no effect; only a backoff version has been
implemented.
<PRE>

             <I>c</I>(<I>a</I>_<I>z</I>)   <I>c</I>(<I>a</I>_) (<I>c</I>(<I>a</I>_) + 1) + <I>n</I>(<I>a</I>_*) (1 - <I>n</I>(<I>a</I>_*))
    <I>f</I>(<I>a</I>_<I>z</I>) = -------   ---------------------------------------
             <I>c</I>(<I>a</I>_)          <I>c</I>(<I>a</I>_)^2 + <I>c</I>(<I>a</I>_) + 2 <I>n</I>(<I>a</I>_*)

    <I>p</I>(<I>a</I>_<I>z</I>) = (<I>c</I>(<I>a</I>_<I>z</I>) > 0) ? <I>f</I>(<I>a</I>_<I>z</I>) : bow(<I>a</I>_) <I>p</I>(_<I>z</I>)    ; Eqn. 2
    bow(<I>a</I>_) = (1 - Sum_<I>Z1</I> <I>f</I>(<I>a</I>_<I>z</I>)) / (1 - Sum_<I>Z1</I> <I>f</I>(_<I>z</I>))    ; Eqn. 3

</PRE>
<DT><B>-count-lm</B>
<DD>
Estimate a count-based interpolated LM using Jelinek-Mercer smoothing
(Chen &amp; Goodman, 1998), also known as "deleted interpolation."
Note that this does not produce a backoff model; instead, a count-LM
parameter file in the format described in
<A HREF="ngram.1.html">ngram(1)</A>
needs to be specified using <B>-init-lm</B>, and a reestimated file in the
same format is produced.
In the process, the mixture weights that interpolate the ML estimates
at all levels of N-grams are estimated using an expectation-maximization (EM)
algorithm.
The options <B>-em-iters</B> and <B>-em-delta</B> control termination of the
EM algorithm.
Note that the N-gram counts used to estimate the maximum-likelihood
estimates are specified in the <B>-init-lm</B> model file.
The counts specified with <B>-read</B> or <B>-text</B> are used only to
estimate the interpolation weights.
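<P>
As a rough illustration of the EM re-estimation described above, the following
Python sketch re-estimates a single interpolation weight between one
higher-order ML estimate and one lower-order estimate on held-out events.
This is a simplification for exposition (SRILM estimates a full set of mixture
weights), and the probabilities are made up.
<PRE>
# Hypothetical sketch of one-weight Jelinek-Mercer EM:
#   p(z|a_) = lam * p_ml(z|a_) + (1 - lam) * p_low(z)

def em_weight(events, lam=0.5, iters=20):
    """events: list of (p_ml, p_low) pairs for held-out N-gram tokens."""
    for _ in range(iters):
        # E-step: posterior that the higher-order component generated each token.
        post = [lam * pm / (lam * pm + (1 - lam) * pl) for pm, pl in events]
        # M-step: the new weight is the average posterior.
        lam = sum(post) / len(post)
    return lam

events = [(0.4, 0.1), (0.0, 0.05), (0.2, 0.02)]    # made-up held-out probabilities
print(em_weight(events))
</PRE>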
<DT><B>-addsmooth</B> <I>D</I>
<DD>
Smooth by adding <I>D</I> to each N-gram count.
This is usually a poor smoothing method, included mainly for instructional
purposes.
<PRE>

    <I>p</I>(<I>a</I>_<I>z</I>) = (<I>c</I>(<I>a</I>_<I>z</I>) + <I>D</I>) / (<I>c</I>(<I>a</I>_) + <I>D</I> <I>n</I>(*))

</PRE>
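<P>
A one-function Python sketch of the add-<I>D</I> estimate above (an
illustration with made-up counts, not SRILM code):
<PRE>
# Hypothetical sketch of -addsmooth: add D to every count before normalizing.
from collections import Counter

def add_smooth(counts, vocab, D=1.0):
    total = sum(counts.values())                   # c(a_)
    V = len(vocab)                                 # n(*): vocabulary size
    return {z: (counts.get(z, 0) + D) / (total + D * V) for z in vocab}

print(add_smooth(Counter({"b": 3, "c": 1}), vocab={"a", "b", "c"}, D=0.5))
</PRE>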
<DT>default
<DD>
If the user does not specify any discounting options,
<B>ngram-count</B> uses Good-Turing discounting (a.k.a. Katz smoothing)
by default.
The Good-Turing estimate states that for any N-gram that occurs <I>r</I> times,
we should pretend that it occurs <I>r</I>' times, where
<PRE>

    <I>r</I>' = (<I>r</I>+1) <I>n</I>[<I>r</I>+1] / <I>n</I>[<I>r</I>]

</PRE>
Here <I>n</I>[<I>r</I>] is the number of N-grams that occur exactly <I>r</I>
times in the training data.
<BR>
Large counts are taken to be reliable, and thus they are not subject to any
discounting.
By default, unigram counts larger than 1 and other N-gram counts larger than 7
are taken to be reliable, and maximum likelihood estimates are used.
These limits can be modified using the <B>-gt</B><I>n</I><B>max</B> options.
<PRE>

    <I>f</I>(<I>a</I>_<I>z</I>) = <I>c</I>(<I>a</I>_<I>z</I>) / <I>c</I>(<I>a</I>_)    if <I>c</I>(<I>a</I>_<I>z</I>) > <I>gtmax</I>

</PRE>
The lower counts are discounted proportionally to the Good-Turing estimate,
with a small correction <I>A</I> to account for the high-count N-grams not
being discounted.
If 1 &lt;= <I>c</I>(<I>a</I>_<I>z</I>) &lt;= <I>gtmax</I>:
<PRE>

                        <I>n</I>[<I>gtmax</I> + 1]
    <I>A</I> = (<I>gtmax</I> + 1) ----------------
                           <I>n</I>[1]

                               <I>n</I>[<I>c</I>(<I>a</I>_<I>z</I>) + 1]
    <I>c</I>'(<I>a</I>_<I>z</I>) = (<I>c</I>(<I>a</I>_<I>z</I>) + 1) ---------------
                                 <I>n</I>[<I>c</I>(<I>a</I>_<I>z</I>)]

             <I>c</I>(<I>a</I>_<I>z</I>)   (<I>c</I>'(<I>a</I>_<I>z</I>) / <I>c</I>(<I>a</I>_<I>z</I>) - <I>A</I>)
    <I>f</I>(<I>a</I>_<I>z</I>) = -------   ------------------------
             <I>c</I>(<I>a</I>_)            (1 - <I>A</I>)

</PRE>
The <B>-interpolate</B> option has no effect in this case; only a backoff
version has been implemented, thus:
<PRE>

    <I>p</I>(<I>a</I>_<I>z</I>) = (<I>c</I>(<I>a</I>_<I>z</I>) > 0) ? <I>f</I>(<I>a</I>_<I>z</I>) : bow(<I>a</I>_) <I>p</I>(_<I>z</I>)    ; Eqn. 2
    bow(<I>a</I>_) = (1 - Sum_<I>Z1</I> <I>f</I>(<I>a</I>_<I>z</I>)) / (1 - Sum_<I>Z1</I> <I>f</I>(_<I>z</I>))    ; Eqn. 3

</PRE>
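<P>
For illustration, a minimal Python sketch (not SRILM code) of the discounted
estimate above; the count-of-counts table <I>n_of</I> is made up, and counts
above <I>gtmax</I> fall back to the plain ML estimate.
<PRE>
# Hypothetical sketch of the default Good-Turing discounted f(a_z).

def good_turing_f(c_az, c_a, n_of, gtmax=7):
    """n_of[r] is the number of N-grams occurring exactly r times."""
    if c_az > gtmax:
        return c_az / c_a                          # large counts: ML estimate
    A = (gtmax + 1) * n_of[gtmax + 1] / n_of[1]
    c_prime = (c_az + 1) * n_of[c_az + 1] / n_of[c_az]
    return (c_az / c_a) * (c_prime / c_az - A) / (1 - A)

# Made-up count-of-counts table n[1]..n[8].
n_of = {r: max(1, 1000 // r**2) for r in range(1, 9)}
print(good_turing_f(c_az=2, c_a=50, n_of=n_of))
</PRE>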
</DD>
</DL>
<H2> FILE FORMATS </H2>
SRILM can generate simple N-gram counts from plain text files with the
following command:
<PRE>
    ngram-count -order <I>N</I> -text <I>file.txt</I> -write <I>file.cnt</I>
</PRE>
The <B>-order</B> option determines the maximum length of the N-grams.
The file <I>file.txt</I> should contain one sentence per line with tokens
separated by whitespace.
The output <I>file.cnt</I> contains the N-gram tokens followed by a tab and a
count on each line:
<PRE>

    <I>a</I>_<I>z</I> &lt;tab&gt; <I>c</I>(<I>a</I>_<I>z</I>)

</PRE>
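<P>
The following Python sketch roughly mimics this counting step (it is an
illustration, not the <B>ngram-count</B> implementation): it pads each line
with &lt;s&gt; and &lt;/s&gt; and emits tab-separated counts for all orders up
to the maximum.
<PRE>
# Hypothetical sketch of plain N-gram counting with sentence boundary tags.
from collections import Counter

def count_ngrams(lines, order=3):
    counts = Counter()
    for line in lines:
        words = ["&lt;s&gt;"] + line.split() + ["&lt;/s&gt;"]
        for n in range(1, order + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts

# Write in the file.cnt layout: words, a tab, then the count.
for gram, c in sorted(count_ngrams(["a b c", "a b d"], order=2).items()):
    print(" ".join(gram) + "\t" + str(c))
</PRE>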
A couple of warnings:
<DL>
<DT><B>Warning 1</B>
<DD>
SRILM implicitly assumes an &lt;s&gt; token at the beginning of each line and
an &lt;/s&gt; token at the end of each line, and counts N-grams that start
with &lt;s&gt; and end with &lt;/s&gt;.
You do not need to include these tags in <I>file.txt</I>.
<DT><B>Warning 2</B>
<DD>
When the <B>-kndiscount</B> or <B>-ukndiscount</B> options are used, the count
file contains modified counts.
Specifically, all N-grams of the maximum order, and all N-grams that start
with &lt;s&gt;, have their regular counts <I>c</I>(<I>a</I>_<I>z</I>), but
shorter N-grams that do not start with &lt;s&gt; have the number of unique
words preceding them, <I>n</I>(*<I>a</I>_<I>z</I>), instead.
See the description of <B>-kndiscount</B> and <B>-ukndiscount</B> for details.
</DD>
</DL>
<P>
For most smoothing methods (except <B>-count-lm</B>)
SRILM generates and uses N-gram model files in the ARPA format.
A typical command to generate a model file would be:
<PRE>
    ngram-count -order <I>N</I> -text <I>file.txt</I> -lm <I>file.lm</I>
</PRE>
The ARPA format output <I>file.lm</I> will contain the following information
about an N-gram on each line:
<PRE>

    log10(<I>f</I>(<I>a</I>_<I>z</I>)) &lt;tab&gt; <I>a</I>_<I>z</I> &lt;tab&gt; log10(bow(<I>a</I>_<I>z</I>))

</PRE>
Based on Equation 2, the first entry represents the base-10 logarithm
of the conditional probability (logprob) for the N-gram <I>a</I>_<I>z</I>.
This is followed by the actual words in the N-gram, separated by spaces.
The last and optional entry is the base-10 logarithm of the backoff weight
for (<I>n</I>+1)-grams starting with <I>a</I>_<I>z</I>.
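<P>
As an illustration of how Equation 2 is applied with these quantities, here is
a minimal Python sketch (hypothetical, not SRILM code) that looks up a log
probability with backoff from toy logprob and log-bow tables keyed by word
tuples:
<PRE>
# Hypothetical sketch of backoff lookup (Eqn. 2) in log10 space.

def log_prob(ngram, logprob, logbow):
    """Return log10 p(z | context) for ngram = (..., z), backing off as needed."""
    if ngram in logprob:
        return logprob[ngram]
    # Back off: add the context's log bow (0 if absent) and drop the first word.
    context = ngram[:-1]
    return logbow.get(context, 0.0) + log_prob(ngram[1:], logprob, logbow)

# Toy tables; a real model would cover all unigrams, so the recursion ends.
logprob = {("a",): -0.7, ("b",): -0.5, ("a", "b"): -0.2}
logbow = {("a",): -0.1}
print(10 ** log_prob(("a", "b"), logprob, logbow))   # stored bigram
print(10 ** log_prob(("a", "a"), logprob, logbow))   # backs off to bow(a) p(a)
</PRE>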
<DL>
<DT><B>Warning 3</B>
<DD>
Both backoff and interpolated models are represented in the same format.
This means interpolation is done during model building and represented in the
ARPA format with logprob and backoff weight using equation (6).
<DT><B>Warning 4</B>
<DD>
Not all N-grams in the count file necessarily end up in the model file.
The options <B>-gtmin</B>, <B>-gt1min</B>, ..., <B>-gt9min</B> specify the
minimum counts for N-grams to be included in the LM (not only for Good-Turing
discounting but for the other methods as well).
By default all unigrams and bigrams are included, but for higher order N-grams
only those with count >= 2 are included.
Some exceptions arise, because if one N-gram is included in the model file,
all its prefix N-grams have to be included as well.
This causes some higher order 1-count N-grams to be included when using
KN discounting, which uses modified counts as described in Warning 2.
<DT><B>Warning 5</B>
<DD>
Not all N-grams in the model file have backoff weights.
The highest order N-grams do not need a backoff weight.
For lower order N-grams, backoff weights are only recorded for those that
appear as the prefix of a longer N-gram included in the model.
For other lower order N-grams, the backoff weight is implicitly 1
(or 0, in log representation).
</DD>
</DL>
<H2> SEE ALSO </H2>
<A HREF="ngram.1.html">ngram(1)</A>, <A HREF="ngram-count.1.html">ngram-count(1)</A>, <A HREF="ngram-format.5.html">ngram-format(5)</A>,
<BR>
S. F. Chen and J. Goodman, "An Empirical Study of Smoothing Techniques for
Language Modeling," TR-10-98, Computer Science Group, Harvard Univ., 1998.
<H2> BUGS </H2>
Work in progress.
<H2> AUTHOR </H2>
Deniz Yuret &lt;dyuret@ku.edu.tr&gt;,
Andreas Stolcke &lt;stolcke@icsi.berkeley.edu&gt;
<BR>
Copyright (c) 2007 SRI International
</BODY>
</HTML>