competition update

language_model/srilm-1.7.3/man/cat7/ngram-discount.7 (new file, 414 lines)
ngram-discount(7)        Miscellaneous Information Manual        ngram-discount(7)


NAME
       ngram-discount - notes on the N-gram smoothing implementations in SRILM
NOTATION
       a_z      An N-gram where a is the first word, z is the last word, and
                "_" represents 0 or more words in between.

       p(a_z)   The estimated conditional probability of the nth word z given
                the first n-1 words (a_) of an N-gram.

       a_       The n-1 word prefix of the N-gram a_z.

       _z       The n-1 word suffix of the N-gram a_z.

       c(a_z)   The count of N-gram a_z in the training data.

       n(*_z)   The number of unique N-grams that match a given pattern.
                "(*)" represents a wildcard matching a single word.

       n1,n[1]  The number of unique N-grams with count = 1.
DESCRIPTION
       N-gram models try to estimate the probability of a word z in the
       context of the previous n-1 words (a_), i.e., Pr(z|a_).  We will
       denote this conditional probability using p(a_z) for convenience.
       One way to estimate p(a_z) is to look at the number of times word z
       has followed the previous n-1 words (a_):

              (1)    p(a_z) = c(a_z)/c(a_)

       This is known as the maximum likelihood (ML) estimate.  Unfortunately
       it does not work very well because it assigns zero probability to
       N-grams that have not been observed in the training data.  To avoid
       the zero probabilities, we take some probability mass from the
       observed N-grams and distribute it to unobserved N-grams.  Such
       redistribution is known as smoothing or discounting.

       Most existing smoothing algorithms can be described by the following
       equation:

              (2)    p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)

       If the N-gram a_z has been observed in the training data, we use the
       distribution f(a_z).  Typically f(a_z) is discounted to be less than
       the ML estimate so we have some leftover probability for the z words
       unseen in the context (a_).  Different algorithms mainly differ on
       how they discount the ML estimate to get f(a_z).

       If the N-gram a_z has not been observed in the training data, we use
       the lower order distribution p(_z).  If the context has never been
       observed (c(a_) = 0), we can use the lower order distribution
       directly (bow(a_) = 1).  Otherwise we need to compute a backoff
       weight (bow) to make sure probabilities are normalized:

              Sum_z p(a_z) = 1

       Let Z be the set of all words in the vocabulary, Z0 be the set of all
       words with c(a_z) = 0, and Z1 be the set of all words with
       c(a_z) > 0.  Given f(a_z), bow(a_) can be determined as follows:

              (3)    Sum_Z  p(a_z) = 1
                     Sum_Z1 f(a_z) + Sum_Z0 bow(a_) p(_z) = 1
                     bow(a_) = (1 - Sum_Z1 f(a_z)) / Sum_Z0 p(_z)
                             = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 p(_z))
                             = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))
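       As a purely illustrative sketch (not SRILM code), Equation 3 can be
       evaluated for a single context given the discounted estimates of its
       observed continuations; the dictionaries f_az and f_z below are
       hypothetical stand-ins for f(a_z) and f(_z) restricted to the set Z1.

           # Sketch of Equation 3 for one context (illustrative only).
           # f_az: f(a_z) for words z with c(a_z) > 0 (the set Z1)
           # f_z : lower order f(_z) for those same words
           def backoff_weight(f_az, f_z):
               num = 1.0 - sum(f_az.values())           # mass left for unseen words
               den = 1.0 - sum(f_z[z] for z in f_az)    # lower order mass of unseen words
               return num / den                         # bow(a_)

           # e.g. two observed continuations of some context:
           print(backoff_weight({"dog": 0.4, "cat": 0.3},
                                {"dog": 0.2, "cat": 0.1}))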
       Smoothing is generally done in one of two ways.  The backoff models
       compute p(a_z) based on the N-gram counts c(a_z) when c(a_z) > 0, and
       only consider lower order counts c(_z) when c(a_z) = 0.  Interpolated
       models take lower order counts into account when c(a_z) > 0 as well.
       A common way to express an interpolated model is:

              (4)    p(a_z) = g(a_z) + bow(a_) p(_z)

       Where g(a_z) = 0 when c(a_z) = 0 and it is discounted to be less than
       the ML estimate when c(a_z) > 0 to reserve some probability mass for
       the unseen z words.  Given g(a_z), bow(a_) can be determined as
       follows:

              (5)    Sum_Z  p(a_z) = 1
                     Sum_Z1 g(a_z) + Sum_Z bow(a_) p(_z) = 1
                     bow(a_) = 1 - Sum_Z1 g(a_z)

       An interpolated model can also be expressed in the form of equation
       (2), which is the way it is represented in the ARPA format model
       files in SRILM:

              (6)    f(a_z) = g(a_z) + bow(a_) p(_z)
                     p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)
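       The following minimal sketch (an illustration under assumed values,
       not part of SRILM) shows how Equation 6 folds an interpolated
       estimate into the backoff form stored in ARPA files; g_az, p_z, and
       bow are hypothetical values for a single context.

           # Sketch of Equation 6: convert interpolated g(a_z) into f(a_z).
           def to_backoff_form(g_az, p_z, bow):
               # f(a_z) = g(a_z) + bow(a_) p(_z) for every observed z
               return {z: g_az[z] + bow * p_z[z] for z in g_az}

           g_az = {"dog": 0.35, "cat": 0.25}     # discounted estimates
           bow = 1.0 - sum(g_az.values())        # Eqn 5
           f_az = to_backoff_form(g_az, {"dog": 0.2, "cat": 0.1}, bow)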
       Most algorithms in SRILM have both backoff and interpolated versions.
       Empirically, interpolated algorithms usually do better than the
       backoff ones, and Kneser-Ney does better than others.
OPTIONS
       This section describes the formulation of each discounting option in
       ngram-count(1).  After giving the motivation for each discounting
       method, we will give expressions for f(a_z) and bow(a_) of Equation 2
       in terms of the counts.  Note that some counts may not be included in
       the model file because of the -gtmin options; see Warning 4 in the
       next section.

       Backoff versions are the default but interpolated versions of most
       models are available using the -interpolate option.  In this case we
       will express g(a_z) and bow(a_) of Equation 4 in terms of the counts
       as well.  Note that the ARPA format model files store the
       interpolated models and the backoff models the same way using f(a_z)
       and bow(a_); see Warning 3 in the next section.  The conversion
       between backoff and interpolated formulations is given in Equation 6.

       The discounting options may be followed by a digit (1-9) to indicate
       that only specific N-gram orders be affected.  See ngram-count(1) for
       more details.
       -cdiscount D
              Ney's absolute discounting using D as the constant to
              subtract.  D should be between 0 and 1.  If Z1 is the set of
              all words z with c(a_z) > 0:

                     f(a_z)  = (c(a_z) - D) / c(a_)
                     p(a_z)  = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)     ; Eqn.2
                     bow(a_) = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))  ; Eqn.3

              With the -interpolate option we have:

                     g(a_z)  = max(0, c(a_z) - D) / c(a_)
                     p(a_z)  = g(a_z) + bow(a_) p(_z)   ; Eqn.4
                     bow(a_) = 1 - Sum_Z1 g(a_z)        ; Eqn.5
                             = D n(a_*) / c(a_)

              The suggested discount factor is:

                     D = n1 / (n1 + 2*n2)

              where n1 and n2 are the total number of N-grams with exactly
              one and two counts, respectively.  Different discounting
              constants can be specified for different N-gram orders using
              options -cdiscount1, -cdiscount2, etc.
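              As a rough illustration of the interpolated formulation above
              (made-up counts and names, not SRILM code), absolute
              discounting for a single context can be written as:

                  # Sketch: interpolated absolute discounting, Eqns 4-5.
                  def absolute_discount(counts, p_lower, D=0.75):
                      c_a = sum(counts.values())       # c(a_)
                      bow = D * len(counts) / c_a      # D n(a_*) / c(a_)
                      p = {z: max(0, c - D) / c_a + bow * p_lower[z]
                           for z, c in counts.items()}
                      return p, bow

                  p, bow = absolute_discount({"dog": 3, "cat": 1},
                                             {"dog": 0.2, "cat": 0.1})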
       -kndiscount and -ukndiscount
              Kneser-Ney discounting.  This is similar to absolute
              discounting in that the discounted probability is computed by
              subtracting a constant D from the N-gram count.  The options
              -kndiscount and -ukndiscount differ as to how this constant is
              computed.

              The main idea of Kneser-Ney is to use a modified probability
              estimate for lower order N-grams used for backoff.
              Specifically, the modified probability for a lower order
              N-gram is taken to be proportional to the number of unique
              words that precede it in the training data.  With discounting
              and normalization we get:

                     f(a_z) = (c(a_z) - D0) / c(a_)   ;; for highest order N-grams
                     f(_z)  = (n(*_z) - D1) / n(*_*)  ;; for lower order N-grams

              where the n(*_z) notation represents the number of unique
              N-grams that match a given pattern with (*) used as a wildcard
              for a single word.  D0 and D1 represent two different
              discounting constants, as each N-gram order uses a different
              discounting constant.  The resulting conditional probability
              and the backoff weight is calculated as given in equations (2)
              and (3):

                     p(a_z)  = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)     ; Eqn.2
                     bow(a_) = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))  ; Eqn.3

              The option -interpolate is used to create the interpolated
              versions of -kndiscount and -ukndiscount.  In this case we
              have:

                     p(a_z) = g(a_z) + bow(a_) p(_z)   ; Eqn.4

              Let Z1 be the set {z: c(a_z) > 0}.  For highest order N-grams
              we have:

                     g(a_z)  = max(0, c(a_z) - D) / c(a_)
                     bow(a_) = 1 - Sum_Z1 g(a_z)
                             = 1 - Sum_Z1 c(a_z) / c(a_) + Sum_Z1 D / c(a_)
                             = D n(a_*) / c(a_)

              Let Z2 be the set {z: n(*_z) > 0}.  For lower order N-grams we
              have:

                     g(_z)  = max(0, n(*_z) - D) / n(*_*)
                     bow(_) = 1 - Sum_Z2 g(_z)
                            = 1 - Sum_Z2 n(*_z) / n(*_*) + Sum_Z2 D / n(*_*)
                            = D n(_*) / n(*_*)

              The original Kneser-Ney discounting (-ukndiscount) uses one
              discounting constant for each N-gram order.  These constants
              are estimated as

                     D = n1 / (n1 + 2*n2)

              where n1 and n2 are the total number of N-grams with exactly
              one and two counts, respectively.

              Chen and Goodman's modified Kneser-Ney discounting
              (-kndiscount) uses three discounting constants for each N-gram
              order, one for one-count N-grams, one for two-count N-grams,
              and one for three-plus-count N-grams:

                     Y   = n1 / (n1 + 2*n2)
                     D1  = 1 - 2Y(n2/n1)
                     D2  = 2 - 3Y(n3/n2)
                     D3+ = 3 - 4Y(n4/n3)

              Warning:
                     SRILM implements Kneser-Ney discounting by actually
                     modifying the counts of the lower order N-grams.  Thus,
                     when the -write option is used to write the counts with
                     -kndiscount or -ukndiscount, only the highest order
                     N-grams and N-grams that start with <s> will have their
                     regular counts c(a_z); all others will have the
                     modified counts n(*_z) instead.  See Warning 2 in the
                     next section.
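              The central quantity in the modified lower order estimates
              above is n(*_z), the number of distinct words preceding _z in
              the training data.  A small illustrative sketch (hypothetical
              bigram list, not SRILM code) of how such continuation counts
              can be collected:

                  # Sketch: continuation counts n(*_z) from a bigram list.
                  from collections import defaultdict

                  def continuation_counts(bigrams):
                      preceders = defaultdict(set)
                      for w1, w2 in bigrams:
                          preceders[w2].add(w1)     # distinct predecessors
                      return {z: len(s) for z, s in preceders.items()}

                  print(continuation_counts(
                      [("the", "dog"), ("a", "dog"), ("the", "cat")]))
                  # -> {'dog': 2, 'cat': 1}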
       -wbdiscount
              Witten-Bell discounting.  The intuition is that the weight
              given to the lower order model should be proportional to the
              probability of observing an unseen word in the current context
              (a_).  Witten-Bell computes this weight as:

                     bow(a_) = n(a_*) / (n(a_*) + c(a_))

              Here n(a_*) represents the number of unique words following
              the context (a_) in the training data.  Witten-Bell is
              originally an interpolated discounting method.  So with the
              -interpolate option we get:

                     g(a_z) = c(a_z) / (n(a_*) + c(a_))
                     p(a_z) = g(a_z) + bow(a_) p(_z)   ; Eqn.4

              Without the -interpolate option we have the backoff version
              which is implemented by taking f(a_z) to be the same as the
              interpolated g(a_z).

                     f(a_z)  = c(a_z) / (n(a_*) + c(a_))
                     p(a_z)  = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)     ; Eqn.2
                     bow(a_) = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))  ; Eqn.3
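              A toy example of the Witten-Bell quantities above (counts
              invented for illustration, not SRILM code):

                  # Sketch: Witten-Bell weight and interpolated g(a_z).
                  def witten_bell(counts):
                      c_a = sum(counts.values())   # c(a_)
                      u = len(counts)              # n(a_*)
                      bow = u / (u + c_a)
                      g = {z: c / (u + c_a) for z, c in counts.items()}
                      return g, bow

                  g, bow = witten_bell({"dog": 3, "cat": 1})   # bow = 2/6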
       -ndiscount
              Ristad's natural discounting law.  See Ristad's technical
              report "A natural law of succession" for a justification of
              the discounting factor.  The -interpolate option has no
              effect; only a backoff version has been implemented.

                              c(a_z)   c(a_) (c(a_) + 1) + n(a_*) (1 - n(a_*))
                     f(a_z) = ------   ---------------------------------------
                              c(a_)           c(a_)^2 + c(a_) + 2 n(a_*)

                     p(a_z)  = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)     ; Eqn.2
                     bow(a_) = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))  ; Eqn.3
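              The discount formula above can be transcribed directly; the
              following sketch uses invented counts purely for illustration:

                  # Sketch: Ristad's natural-law estimate f(a_z).
                  def ristad_f(c_az, c_a, n_a_star):
                      num = c_a * (c_a + 1) + n_a_star * (1 - n_a_star)
                      den = c_a ** 2 + c_a + 2 * n_a_star
                      return (c_az / c_a) * (num / den)

                  print(ristad_f(c_az=3, c_a=10, n_a_star=4))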
       -count-lm
              Estimate a count-based interpolated LM using Jelinek-Mercer
              smoothing (Chen & Goodman, 1998), also known as "deleted
              interpolation."  Note that this does not produce a backoff
              model; instead a count-LM parameter file in the format
              described in ngram(1) needs to be specified using -init-lm,
              and a reestimated file in the same format is produced.  In the
              process, the mixture weights that interpolate the ML estimates
              at all levels of N-grams are estimated using an
              expectation-maximization (EM) algorithm.  The options
              -em-iters and -em-delta control termination of the EM
              algorithm.  Note that the N-gram counts used to estimate the
              maximum-likelihood estimates are specified in the -init-lm
              model file.  The counts specified with -read or -text are used
              only to estimate the interpolation weights.
       -addsmooth D
              Smooth by adding D to each N-gram count.  This is usually a
              poor smoothing method, included mainly for instructional
              purposes.

                     p(a_z) = (c(a_z) + D) / (c(a_) + D n(*))
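              For illustration only (not SRILM code), the add-D estimate for
              a single context, with vocab_size standing in for n(*):

                  # Sketch: additive smoothing (c(a_z)+D)/(c(a_)+D n(*)).
                  def add_smooth(counts, vocab_size, D=1.0):
                      c_a = sum(counts.values())
                      return {z: (c + D) / (c_a + D * vocab_size)
                              for z, c in counts.items()}

                  print(add_smooth({"dog": 3, "cat": 1}, vocab_size=1000))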
       default
              If the user does not specify any discounting options,
              ngram-count uses Good-Turing discounting (aka Katz smoothing)
              by default.  The Good-Turing estimate states that for any
              N-gram that occurs r times, we should pretend that it occurs
              r' times where

                     r' = (r+1) n[r+1] / n[r]

              Here n[r] is the number of N-grams that occur exactly r times
              in the training data.

              Large counts are taken to be reliable; thus they are not
              subject to any discounting.  By default unigram counts larger
              than 1 and other N-gram counts larger than 7 are taken to be
              reliable and maximum likelihood estimates are used.  These
              limits can be modified using the -gtnmax options.

                     f(a_z) = c(a_z) / c(a_)   if c(a_z) > gtmax

              The lower counts are discounted proportional to the
              Good-Turing estimate with a small correction A to account for
              the high-count N-grams not being discounted.  If
              1 <= c(a_z) <= gtmax:

                     A = (gtmax + 1) n[gtmax + 1] / n[1]

                     c'(a_z) = (c(a_z) + 1) n[c(a_z) + 1] / n[c(a_z)]

                     f(a_z) = (c(a_z) / c(a_)) (c'(a_z) / c(a_z) - A) / (1 - A)

              The -interpolate option has no effect in this case; only a
              backoff version has been implemented, thus:

                     p(a_z)  = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)     ; Eqn.2
                     bow(a_) = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))  ; Eqn.3
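              To make the Good-Turing adjustment above concrete, the
              following sketch computes r' from a hypothetical
              count-of-counts table n[r] (illustrative values, not SRILM
              code):

                  # Sketch: Good-Turing adjusted count r' = (r+1) n[r+1]/n[r].
                  def adjusted_count(r, n):
                      return (r + 1) * n.get(r + 1, 0) / n[r]

                  n = {1: 1000, 2: 400, 3: 150}   # made-up n[r] values
                  print(adjusted_count(1, n))     # 1-count N-grams -> 0.8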
FILE FORMATS
       SRILM can generate simple N-gram counts from plain text files with
       the following command:

              ngram-count -order N -text file.txt -write file.cnt

       The -order option determines the maximum length of the N-grams.  The
       file file.txt should contain one sentence per line with tokens
       separated by whitespace.  The output file.cnt contains the N-gram
       tokens followed by a tab and a count on each line:

              a_z <tab> c(a_z)
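       The following pure-Python sketch mimics this count file format for
       small inputs; it is only an illustration of the format, not a
       replacement for ngram-count, and the helper name write_counts is
       made up:

           # Sketch: write "ngram<tab>count" lines like file.cnt above.
           from collections import Counter

           def write_counts(sentences, order, path):
               counts = Counter()
               for line in sentences:
                   toks = ["<s>"] + line.split() + ["</s>"]
                   for n in range(1, order + 1):
                       for i in range(len(toks) - n + 1):
                           counts[" ".join(toks[i:i + n])] += 1
               with open(path, "w") as out:
                   for ngram, c in sorted(counts.items()):
                       out.write(f"{ngram}\t{c}\n")

           write_counts(["the dog barks", "the cat sleeps"], 2, "file.cnt")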
       A couple of warnings:

       Warning 1
              SRILM implicitly assumes an <s> token in the beginning of each
              line and an </s> token at the end of each line and counts
              N-grams that start with <s> and end with </s>.  You do not
              need to include these tags in file.txt.

       Warning 2
              When the -kndiscount or -ukndiscount options are used, the
              count file contains modified counts.  Specifically, all
              N-grams of the maximum order, and all N-grams that start with
              <s> have their regular counts c(a_z), but shorter N-grams that
              do not start with <s> have the number of unique words
              preceding them, n(*a_z), instead.  See the description of
              -kndiscount and -ukndiscount for details.
       For most smoothing methods (except -count-lm) SRILM generates and
       uses N-gram model files in the ARPA format.  A typical command to
       generate a model file would be:

              ngram-count -order N -text file.txt -lm file.lm

       The ARPA format output file.lm will contain the following information
       about an N-gram on each line:

              log10(f(a_z)) <tab> a_z <tab> log10(bow(a_z))

       Based on Equation 2, the first entry represents the base-10 logarithm
       of the conditional probability (logprob) for the N-gram a_z.  This is
       followed by the actual words in the N-gram separated by spaces.  The
       last and optional entry is the base-10 logarithm of the backoff
       weight for (n+1)-grams starting with a_z.
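       As an illustration of this line format (not an official parser), one
       N-gram line can be split into its three fields as follows; the
       function name parse_arpa_line is hypothetical:

           # Sketch: split one ARPA N-gram line into its fields.
           def parse_arpa_line(line):
               fields = line.rstrip("\n").split("\t")
               logprob = float(fields[0])                 # log10 f(a_z)
               words = fields[1].split()                  # the N-gram a_z
               # optional backoff weight; log10 bow defaults to 0 (bow = 1)
               log_bow = float(fields[2]) if len(fields) > 2 else 0.0
               return words, 10 ** logprob, 10 ** log_bow

           print(parse_arpa_line("-0.30103\tthe dog\t-0.1"))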
       Warning 3
              Both backoff and interpolated models are represented in the
              same format.  This means interpolation is done during model
              building and represented in the ARPA format with logprob and
              backoff weight using equation (6).

       Warning 4
              Not all N-grams in the count file necessarily end up in the
              model file.  The options -gtmin, -gt1min, ..., -gt9min specify
              the minimum counts for N-grams to be included in the LM (not
              only for Good-Turing discounting but for the other methods as
              well).  By default all unigrams and bigrams are included, but
              for higher order N-grams only those with count >= 2 are
              included.  Some exceptions arise, because if one N-gram is
              included in the model file, all its prefix N-grams have to be
              included as well.  This causes some higher order 1-count
              N-grams to be included when using KN discounting, which uses
              modified counts as described in Warning 2.

       Warning 5
              Not all N-grams in the model file have backoff weights.  The
              highest order N-grams do not need a backoff weight.  For lower
              order N-grams backoff weights are only recorded for those that
              appear as the prefix of a longer N-gram included in the model.
              For other lower order N-grams the backoff weight is implicitly
              1 (or 0, in log representation).
SEE ALSO
       ngram(1), ngram-count(1), ngram-format(5),
       S. F. Chen and J. Goodman, "An Empirical Study of Smoothing
       Techniques for Language Modeling," TR-10-98, Computer Science Group,
       Harvard Univ., 1998.

BUGS
       Work in progress.
AUTHOR
       Deniz Yuret <dyuret@ku.edu.tr>, Andreas Stolcke
       <stolcke@icsi.berkeley.edu>
       Copyright (c) 2007 SRI International


SRILM Miscellaneous        $Date: 2019/09/09 22:35:37 $        ngram-discount(7)