ngram-discount(7) Miscellaneous Information Manual ngram-discount(7)

NAME
    ngram-discount - notes on the N-gram smoothing implementations in SRILM

NOTATION
    a_z      An N-gram where a is the first word, z is the last word, and
             "_" represents 0 or more words in between.

    p(a_z)   The estimated conditional probability of the nth word z given
             the first n-1 words (a_) of an N-gram.

    a_       The n-1 word prefix of the N-gram a_z.

    _z       The n-1 word suffix of the N-gram a_z.

    c(a_z)   The count of N-gram a_z in the training data.

    n(*_z)   The number of unique N-grams that match a given pattern.
             "(*)" represents a wildcard matching a single word.

    n1,n[1]  The number of unique N-grams with count = 1.

DESCRIPTION
    N-gram models try to estimate the probability of a word z in the context
    of the previous n-1 words (a_), i.e., Pr(z|a_).  We will denote this
    conditional probability using p(a_z) for convenience.  One way to
    estimate p(a_z) is to look at the number of times word z has followed the
    previous n-1 words (a_):

        (1)  p(a_z) = c(a_z) / c(a_)

    This is known as the maximum likelihood (ML) estimate.  Unfortunately it
    does not work very well because it assigns zero probability to N-grams
    that have not been observed in the training data.  To avoid the zero
    probabilities, we take some probability mass from the observed N-grams
    and distribute it to unobserved N-grams.  Such redistribution is known as
    smoothing or discounting.

    Most existing smoothing algorithms can be described by the following
    equation:

        (2)  p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)

    If the N-gram a_z has been observed in the training data, we use the
    distribution f(a_z).  Typically f(a_z) is discounted to be less than the
    ML estimate, so we have some leftover probability for the z words unseen
    in the context (a_).  Different algorithms mainly differ in how they
    discount the ML estimate to get f(a_z).

    If the N-gram a_z has not been observed in the training data, we use the
    lower order distribution p(_z).  If the context has never been observed
    (c(a_) = 0), we can use the lower order distribution directly
    (bow(a_) = 1).  Otherwise we need to compute a backoff weight (bow) to
    make sure probabilities are normalized:

        Sum_z p(a_z) = 1

    Let Z be the set of all words in the vocabulary, Z0 be the set of all
    words with c(a_z) = 0, and Z1 be the set of all words with c(a_z) > 0.
    Given f(a_z), bow(a_) can be determined as follows:

        (3)  Sum_Z p(a_z) = 1
             Sum_Z1 f(a_z) + Sum_Z0 bow(a_) p(_z) = 1
             bow(a_) = (1 - Sum_Z1 f(a_z)) / Sum_Z0 p(_z)
                     = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 p(_z))
                     = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))

    Smoothing is generally done in one of two ways.  Backoff models compute
    p(a_z) based on the N-gram counts c(a_z) when c(a_z) > 0, and only
    consider lower order counts c(_z) when c(a_z) = 0.  Interpolated models
    take lower order counts into account when c(a_z) > 0 as well.  A common
    way to express an interpolated model is:

        (4)  p(a_z) = g(a_z) + bow(a_) p(_z)

    where g(a_z) = 0 when c(a_z) = 0, and g(a_z) is discounted to be less
    than the ML estimate when c(a_z) > 0 in order to reserve some probability
    mass for the unseen z words.  Given g(a_z), bow(a_) can be determined as
    follows:

        (5)  Sum_Z p(a_z) = 1
             Sum_Z1 g(a_z) + Sum_Z bow(a_) p(_z) = 1
             bow(a_) = 1 - Sum_Z1 g(a_z)

    An interpolated model can also be expressed in the form of equation (2),
    which is the way it is represented in the ARPA format model files in
    SRILM:

        (6)  f(a_z) = g(a_z) + bow(a_) p(_z)
             p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)

    Most algorithms in SRILM have both backoff and interpolated versions.
    Empirically, interpolated algorithms usually do better than the backoff
    ones, and Kneser-Ney does better than the others.
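
    As an illustration of equation (2), the following Python sketch evaluates
    a backoff model stored as two plain dictionaries, f and bow (these names
    and the dictionary layout are assumptions made for the example, not SRILM
    data structures):

        # Minimal sketch of backoff lookup per equation (2).
        def backoff_prob(ngram, f, bow):
            """Return p(a_z) for a tuple of words, backing off to suffixes."""
            if len(ngram) == 1:
                # Unigram base case; assumes every vocabulary word has an
                # f() entry.
                return f[ngram]
            if ngram in f:                    # c(a_z) > 0: discounted estimate
                return f[ngram]
            context = ngram[:-1]              # a_
            weight = bow.get(context, 1.0)    # bow(a_) = 1 if context unseen
            return weight * backoff_prob(ngram[1:], f, bow)  # back off to p(_z)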

OPTIONS
    This section describes the formulation of each discounting option in
    ngram-count(1).  After giving the motivation for each discounting method,
    we will give expressions for f(a_z) and bow(a_) of Equation 2 in terms of
    the counts.  Note that some counts may not be included in the model file
    because of the -gtmin options; see Warning 4 in the next section.

    Backoff versions are the default, but interpolated versions of most
    models are available using the -interpolate option.  In this case we will
    express g(a_z) and bow(a_) of Equation 4 in terms of the counts as well.
    Note that the ARPA format model files store the interpolated models and
    the backoff models the same way, using f(a_z) and bow(a_); see Warning 3
    in the next section.  The conversion between backoff and interpolated
    formulations is given in Equation 6.

    The discounting options may be followed by a digit (1-9) to indicate that
    only specific N-gram orders be affected.  See ngram-count(1) for more
    details.

    -cdiscount D
        Ney's absolute discounting using D as the constant to subtract.  D
        should be between 0 and 1.  If Z1 is the set of all words z with
        c(a_z) > 0:

            f(a_z) = (c(a_z) - D) / c(a_)
            p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)         ; Eqn. 2
            bow(a_) = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))     ; Eqn. 3

        With the -interpolate option we have:

            g(a_z) = max(0, c(a_z) - D) / c(a_)
            p(a_z) = g(a_z) + bow(a_) p(_z)                        ; Eqn. 4
            bow(a_) = 1 - Sum_Z1 g(a_z)                            ; Eqn. 5
                    = D n(a_*) / c(a_)

        The suggested discount factor is:

            D = n1 / (n1 + 2*n2)

        where n1 and n2 are the total number of N-grams with exactly one and
        two counts, respectively.  Different discounting constants can be
        specified for different N-gram orders using options -cdiscount1,
        -cdiscount2, etc.
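
        As a rough illustration (not SRILM code), the following Python sketch
        computes the backoff form of absolute discounting for bigrams from
        plain dictionaries of counts; the function and argument names are
        made up for the example:

            from collections import Counter

            def absolute_discount_bigrams(bigram_counts, unigram_probs, D=0.5):
                """f(a_z) and bow(a_) per -cdiscount and Eqn. 3 (bigram case).
                bigram_counts: {(a, z): c(a_z)}; unigram_probs: {z: p(_z)}."""
                context_totals = Counter()
                seen = {}                            # context a -> Z1, the seen z's
                for (a, z), c in bigram_counts.items():
                    context_totals[a] += c           # c(a_)
                    seen.setdefault(a, set()).add(z)
                f = {(a, z): (c - D) / context_totals[a]
                     for (a, z), c in bigram_counts.items()}
                bow = {}
                for a, zs in seen.items():
                    # Eqn. 3, using the lower-order p(_z) in the denominator.
                    left = 1.0 - sum(f[(a, z)] for z in zs)
                    bow[a] = left / (1.0 - sum(unigram_probs[z] for z in zs))
                return f, bow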

    -kndiscount and -ukndiscount
        Kneser-Ney discounting.  This is similar to absolute discounting in
        that the discounted probability is computed by subtracting a constant
        D from the N-gram count.  The options -kndiscount and -ukndiscount
        differ as to how this constant is computed.

        The main idea of Kneser-Ney is to use a modified probability estimate
        for the lower order N-grams used for backoff.  Specifically, the
        modified probability for a lower order N-gram is taken to be
        proportional to the number of unique words that precede it in the
        training data.  With discounting and normalization we get:

            f(a_z) = (c(a_z) - D0) / c(a_)    ;; for highest order N-grams
            f(_z)  = (n(*_z) - D1) / n(*_*)   ;; for lower order N-grams

        where the n(*_z) notation represents the number of unique N-grams
        that match a given pattern, with (*) used as a wildcard for a single
        word.  D0 and D1 represent two different discounting constants, as
        each N-gram order uses a different discounting constant.  The
        resulting conditional probability and the backoff weight are
        calculated as given in equations (2) and (3):

            p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)         ; Eqn. 2
            bow(a_) = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))     ; Eqn. 3

        The option -interpolate is used to create the interpolated versions
        of -kndiscount and -ukndiscount.  In this case we have:

            p(a_z) = g(a_z) + bow(a_) p(_z)                        ; Eqn. 4

        Let Z1 be the set {z: c(a_z) > 0}.  For highest order N-grams we
        have:

            g(a_z) = max(0, c(a_z) - D) / c(a_)
            bow(a_) = 1 - Sum_Z1 g(a_z)
                    = 1 - Sum_Z1 c(a_z) / c(a_) + Sum_Z1 D / c(a_)
                    = D n(a_*) / c(a_)

        Let Z2 be the set {z: n(*_z) > 0}.  For lower order N-grams we have:

            g(_z) = max(0, n(*_z) - D) / n(*_*)
            bow(_) = 1 - Sum_Z2 g(_z)
                   = 1 - Sum_Z2 n(*_z) / n(*_*) + Sum_Z2 D / n(*_*)
                   = D n(_*) / n(*_*)

        The original Kneser-Ney discounting (-ukndiscount) uses one
        discounting constant for each N-gram order.  These constants are
        estimated as

            D = n1 / (n1 + 2*n2)

        where n1 and n2 are the total number of N-grams with exactly one and
        two counts, respectively.

        Chen and Goodman's modified Kneser-Ney discounting (-kndiscount) uses
        three discounting constants for each N-gram order: one for one-count
        N-grams, one for two-count N-grams, and one for three-plus-count
        N-grams:

            Y   = n1 / (n1 + 2*n2)
            D1  = 1 - 2Y(n2/n1)
            D2  = 2 - 3Y(n3/n2)
            D3+ = 3 - 4Y(n4/n3)

        Warning:
            SRILM implements Kneser-Ney discounting by actually modifying the
            counts of the lower order N-grams.  Thus, when the -write option
            is used to write the counts with -kndiscount or -ukndiscount,
            only the highest order N-grams and N-grams that start with <s>
            will have their regular counts c(a_z); all others will have the
            modified counts n(*_z) instead.  See Warning 2 in the next
            section.
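
        As an illustration, the modified Kneser-Ney discounting constants
        above can be computed from a table of count-of-counts; a minimal
        Python sketch with made-up names (it assumes n1 through n3 are
        nonzero):

            def modified_kn_discounts(count_of_counts):
                """Return (D1, D2, D3+) from Chen & Goodman's formulas.
                count_of_counts[r] = number of N-grams seen exactly r times."""
                n1, n2, n3, n4 = (count_of_counts.get(r, 0) for r in (1, 2, 3, 4))
                Y = n1 / (n1 + 2 * n2)
                return (1 - 2 * Y * (n2 / n1),     # D1
                        2 - 3 * Y * (n3 / n2),     # D2
                        3 - 4 * Y * (n4 / n3))     # D3+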

    -wbdiscount
        Witten-Bell discounting.  The intuition is that the weight given to
        the lower order model should be proportional to the probability of
        observing an unseen word in the current context (a_).  Witten-Bell
        computes this weight as:

            bow(a_) = n(a_*) / (n(a_*) + c(a_))

        Here n(a_*) represents the number of unique words following the
        context (a_) in the training data.  Witten-Bell is originally an
        interpolated discounting method.  So with the -interpolate option we
        get:

            g(a_z) = c(a_z) / (n(a_*) + c(a_))
            p(a_z) = g(a_z) + bow(a_) p(_z)                        ; Eqn. 4

        Without the -interpolate option we have the backoff version, which is
        implemented by taking f(a_z) to be the same as the interpolated
        g(a_z):

            f(a_z) = c(a_z) / (n(a_*) + c(a_))
            p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)         ; Eqn. 2
            bow(a_) = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))     ; Eqn. 3
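
        A minimal sketch of the interpolated Witten-Bell estimate for one
        context, assuming the continuation counts for that context and the
        lower order probabilities are available as plain dictionaries
        (illustrative names, not SRILM internals):

            def witten_bell_interpolate(cont_counts, lower_probs):
                """p(a_z) for a fixed context a_ per the -wbdiscount formulas.
                cont_counts: {z: c(a_z)} for seen z; lower_probs: {z: p(_z)}."""
                total = sum(cont_counts.values())    # c(a_)
                unique = len(cont_counts)            # n(a_*)
                bow = unique / (unique + total)      # weight of lower order model
                return {z: cont_counts.get(z, 0) / (unique + total) + bow * p
                        for z, p in lower_probs.items()}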

    -ndiscount
        Ristad's natural discounting law.  See Ristad's technical report "A
        natural law of succession" for a justification of the discounting
        factor.  The -interpolate option has no effect; only a backoff
        version has been implemented.

                      c(a_z)   c(a_) (c(a_) + 1) + n(a_*) (1 - n(a_*))
            f(a_z) = -------- -----------------------------------------
                       c(a_)        c(a_)^2 + c(a_) + 2 n(a_*)

            p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)         ; Eqn. 2
            bow(a_) = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))     ; Eqn. 3
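
        The f(a_z) expression above translates directly into code; a minimal
        sketch with illustrative argument names:

            def ristad_f(c_az, c_a, n_a):
                """Ristad's natural law: c_az = c(a_z), c_a = c(a_), n_a = n(a_*)."""
                return (c_az / c_a) * (c_a * (c_a + 1) + n_a * (1 - n_a)) \
                       / (c_a * c_a + c_a + 2 * n_a)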

    -count-lm
        Estimate a count-based interpolated LM using Jelinek-Mercer smoothing
        (Chen & Goodman, 1998), also known as "deleted interpolation."  Note
        that this does not produce a backoff model; instead, a count-LM
        parameter file in the format described in ngram(1) needs to be
        specified using -init-lm, and a reestimated file in the same format
        is produced.  In the process, the mixture weights that interpolate
        the ML estimates at all levels of N-grams are estimated using an
        expectation-maximization (EM) algorithm.  The options -em-iters and
        -em-delta control termination of the EM algorithm.  Note that the
        N-gram counts used to compute the maximum-likelihood estimates are
        specified in the -init-lm model file.  The counts specified with
        -read or -text are used only to estimate the interpolation weights.
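
        As a rough illustration of Jelinek-Mercer interpolation only (it does
        not read or write SRILM count-LM parameter files, and all names are
        made up), the smoothed probability is a weighted mixture of ML
        estimates at every order; here the mixture weights are assumed to be
        given rather than EM-trained:

            def jelinek_mercer(ngram, counts, vocab_size, lambdas):
                """p(w | history) as a mixture of ML estimates of all orders.
                counts[k]: {k-gram tuple: count}, with counts[0][()] = total
                number of tokens; lambdas[k]: weight of order k."""
                k = len(ngram)
                if k == 0:
                    return 1.0 / vocab_size          # uniform base distribution
                hist_count = counts[k - 1].get(ngram[:-1], 0)
                ml = counts[k].get(ngram, 0) / hist_count if hist_count else 0.0
                return (lambdas[k] * ml +
                        (1 - lambdas[k]) * jelinek_mercer(ngram[1:], counts,
                                                          vocab_size, lambdas))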

    -addsmooth D
        Smooth by adding D to each N-gram count.  This is usually a poor
        smoothing method, included mainly for instructional purposes.

            p(a_z) = (c(a_z) + D) / (c(a_) + D n(*))
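
        A one-line sketch of the estimate above, where n_star stands for the
        vocabulary size n(*):

            def add_d_prob(c_az, c_a, n_star, D=1.0):
                """Additive smoothing: (c(a_z) + D) / (c(a_) + D * n(*))."""
                return (c_az + D) / (c_a + D * n_star)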

    default
        If the user does not specify any discounting options, ngram-count
        uses Good-Turing discounting (also known as Katz smoothing) by
        default.  The Good-Turing estimate states that for any N-gram that
        occurs r times, we should pretend that it occurs r' times, where

            r' = (r+1) n[r+1] / n[r]

        Here n[r] is the number of N-grams that occur exactly r times in the
        training data.

        Large counts are taken to be reliable, thus they are not subject to
        any discounting.  By default unigram counts larger than 1 and other
        N-gram counts larger than 7 are taken to be reliable, and maximum
        likelihood estimates are used.  These limits can be modified using
        the -gtnmax options, where n is the N-gram order.

            f(a_z) = c(a_z) / c(a_)    if c(a_z) > gtmax

        The lower counts are discounted proportionally to the Good-Turing
        estimate, with a small correction A to account for the high-count
        N-grams not being discounted.  If 1 <= c(a_z) <= gtmax:

                                 n[gtmax + 1]
            A = (gtmax + 1)  ----------------
                                   n[1]

                                      n[c(a_z) + 1]
            c'(a_z) = (c(a_z) + 1)  -----------------
                                       n[c(a_z)]

                      c(a_z)    (c'(a_z) / c(a_z) - A)
            f(a_z) = --------  ------------------------
                       c(a_)            (1 - A)

        The -interpolate option has no effect in this case; only a backoff
        version has been implemented, thus:

            p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)         ; Eqn. 2
            bow(a_) = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))     ; Eqn. 3
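
        A minimal sketch of the discounted f(a_z) above, with the
        count-of-counts n[r] passed in as a dictionary (illustrative names;
        it assumes the needed n[r] entries are present and nonzero):

            def good_turing_f(c_az, c_a, n, gtmax=7):
                """Discounted f(a_z); n[r] = number of N-grams with count r."""
                if c_az > gtmax:
                    return c_az / c_a                  # large counts: plain ML
                A = (gtmax + 1) * n[gtmax + 1] / n[1]  # correction term
                c_adj = (c_az + 1) * n[c_az + 1] / n[c_az]
                return (c_az / c_a) * (c_adj / c_az - A) / (1 - A)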

FILE FORMATS
    SRILM can generate simple N-gram counts from plain text files with the
    following command:

        ngram-count -order N -text file.txt -write file.cnt

    The -order option determines the maximum length of the N-grams.  The
    file file.txt should contain one sentence per line with tokens separated
    by whitespace.  The output file.cnt contains the N-gram tokens followed
    by a tab and a count on each line:

        a_z <tab> c(a_z)
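
    Reading such a count file back into a program is straightforward; a
    minimal Python sketch (the function name is just an example):

        def read_counts(path):
            """Parse an ngram-count -write file: N-gram, a tab, then the count."""
            counts = {}
            with open(path, encoding="utf-8") as fh:
                for line in fh:
                    ngram, count = line.rstrip("\n").split("\t")
                    counts[tuple(ngram.split())] = int(count)
            return counts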

    A couple of warnings:

    Warning 1
        SRILM implicitly assumes an <s> token at the beginning of each line
        and an </s> token at the end of each line, and counts N-grams that
        start with <s> and end with </s>.  You do not need to include these
        tags in file.txt.

    Warning 2
        When the -kndiscount or -ukndiscount options are used, the count file
        contains modified counts.  Specifically, all N-grams of the maximum
        order, and all N-grams that start with <s>, have their regular counts
        c(a_z), but shorter N-grams that do not start with <s> have the
        number of unique words preceding them, n(*a_z), instead.  See the
        description of -kndiscount and -ukndiscount for details.

    For most smoothing methods (except -count-lm) SRILM generates and uses
    N-gram model files in the ARPA format.  A typical command to generate a
    model file would be:

        ngram-count -order N -text file.txt -lm file.lm

    The ARPA format output file.lm will contain the following information
    about an N-gram on each line:

        log10(f(a_z)) <tab> a_z <tab> log10(bow(a_z))

    Based on Equation 2, the first entry represents the base-10 logarithm of
    the conditional probability (logprob) for the N-gram a_z.  This is
    followed by the actual words in the N-gram, separated by spaces.  The
    last and optional entry is the base-10 logarithm of the backoff weight
    for (n+1)-grams starting with a_z.
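
    A minimal sketch of splitting one such N-gram line into its fields; the
    missing backoff weight defaults to 1 (log10 = 0, see Warning 5 below):

        def parse_arpa_ngram_line(line):
            """Return (words, prob, bow) from a line of an N-gram section."""
            fields = line.rstrip("\n").split("\t")
            words = tuple(fields[1].split())
            prob = 10.0 ** float(fields[0])
            bow = 10.0 ** float(fields[2]) if len(fields) > 2 else 1.0
            return words, prob, bow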

    Warning 3
        Both backoff and interpolated models are represented in the same
        format.  This means interpolation is done during model building and
        represented in the ARPA format with logprob and backoff weight using
        equation (6).

    Warning 4
        Not all N-grams in the count file necessarily end up in the model
        file.  The options -gtmin, -gt1min, ..., -gt9min specify the minimum
        counts for N-grams to be included in the LM (not only for Good-Turing
        discounting but for the other methods as well).  By default all
        unigrams and bigrams are included, but for higher order N-grams only
        those with count >= 2 are included.  Some exceptions arise, because
        if one N-gram is included in the model file, all its prefix N-grams
        have to be included as well.  This causes some higher order 1-count
        N-grams to be included when using KN discounting, which uses modified
        counts as described in Warning 2.

    Warning 5
        Not all N-grams in the model file have backoff weights.  The highest
        order N-grams do not need a backoff weight.  For lower order N-grams
        backoff weights are only recorded for those that appear as the prefix
        of a longer N-gram included in the model.  For other lower order
        N-grams the backoff weight is implicitly 1 (or 0, in log
        representation).

SEE ALSO
    ngram(1), ngram-count(1), ngram-format(5),
    S. F. Chen and J. Goodman, "An Empirical Study of Smoothing Techniques
    for Language Modeling," TR-10-98, Computer Science Group, Harvard Univ.,
    1998.

BUGS
    Work in progress.

AUTHOR
    Deniz Yuret <dyuret@ku.edu.tr>, Andreas Stolcke <stolcke@icsi.berkeley.edu>

    Copyright (c) 2007 SRI International

SRILM Miscellaneous          $Date: 2019/09/09 22:35:37 $         ngram-discount(7)