ngram-discount(7) Miscellaneous Information Manual ngram-discount(7)

NAME
    ngram-discount - notes on the N-gram smoothing implementations in SRILM

NOTATION
    a_z      An N-gram where a is the first word, z is the last word, and
             "_" represents 0 or more words in between.

    p(a_z)   The estimated conditional probability of the nth word z given
             the first n-1 words (a_) of an N-gram.

    a_       The n-1 word prefix of the N-gram a_z.

    _z       The n-1 word suffix of the N-gram a_z.

    c(a_z)   The count of N-gram a_z in the training data.

    n(*_z)   The number of unique N-grams that match a given pattern.
             "(*)" represents a wildcard matching a single word.

    n1,n[1]  The number of unique N-grams with count = 1.

DESCRIPTION
    N-gram models try to estimate the probability of a word z in the context
    of the previous n-1 words (a_), i.e., Pr(z|a_).  We will denote this
    conditional probability using p(a_z) for convenience.  One way to
    estimate p(a_z) is to look at the number of times word z has followed the
    previous n-1 words (a_):

        (1)  p(a_z) = c(a_z) / c(a_)

    This is known as the maximum likelihood (ML) estimate.  Unfortunately it
    does not work very well because it assigns zero probability to N-grams
    that have not been observed in the training data.  To avoid the zero
    probabilities, we take some probability mass from the observed N-grams
    and distribute it to unobserved N-grams.  Such redistribution is known as
    smoothing or discounting.

    Most existing smoothing algorithms can be described by the following
    equation:

        (2)  p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)

    If the N-gram a_z has been observed in the training data, we use the
    distribution f(a_z).  Typically f(a_z) is discounted to be less than the
    ML estimate, so we have some leftover probability for the z words unseen
    in the context (a_).  Different algorithms mainly differ in how they
    discount the ML estimate to get f(a_z).

    If the N-gram a_z has not been observed in the training data, we use the
    lower order distribution p(_z).  If the context has never been observed
    (c(a_) = 0), we can use the lower order distribution directly
    (bow(a_) = 1).  Otherwise we need to compute a backoff weight (bow) to
    make sure probabilities are normalized:

        Sum_z p(a_z) = 1

    Let Z be the set of all words in the vocabulary, Z0 be the set of all
    words with c(a_z) = 0, and Z1 be the set of all words with c(a_z) > 0.
    Given f(a_z), bow(a_) can be determined as follows:

        (3)  Sum_Z p(a_z) = 1
             Sum_Z1 f(a_z) + Sum_Z0 bow(a_) p(_z) = 1
             bow(a_) = (1 - Sum_Z1 f(a_z)) / Sum_Z0 p(_z)
                     = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 p(_z))
                     = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))

    Smoothing is generally done in one of two ways.  Backoff models compute
    p(a_z) based on the N-gram counts c(a_z) when c(a_z) > 0, and only
    consider lower order counts c(_z) when c(a_z) = 0.  Interpolated models
    take lower order counts into account when c(a_z) > 0 as well.  A common
    way to express an interpolated model is:

        (4)  p(a_z) = g(a_z) + bow(a_) p(_z)

    where g(a_z) = 0 when c(a_z) = 0, and g(a_z) is discounted to be less
    than the ML estimate when c(a_z) > 0 in order to reserve some probability
    mass for the unseen z words.  Given g(a_z), bow(a_) can be determined as
    follows:

        (5)  Sum_Z p(a_z) = 1
             Sum_Z1 g(a_z) + Sum_Z bow(a_) p(_z) = 1
             bow(a_) = 1 - Sum_Z1 g(a_z)

    An interpolated model can also be expressed in the form of equation (2),
    which is the way it is represented in the ARPA format model files in
    SRILM:

        (6)  f(a_z) = g(a_z) + bow(a_) p(_z)
             p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)

    Most algorithms in SRILM have both backoff and interpolated versions.
    Empirically, interpolated algorithms usually do better than the backoff
    ones, and Kneser-Ney does better than the others.
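
    As an illustration of equation (2), the following Python sketch evaluates
    a backoff model stored as two plain dictionaries, f and bow (these names
    and the dictionary layout are assumptions made for the example, not SRILM
    data structures):

        # Minimal sketch of backoff lookup per equation (2).
        def backoff_prob(ngram, f, bow):
            """Return p(a_z) for a tuple of words, backing off to suffixes."""
            if len(ngram) == 1:
                # Unigram base case; assumes every vocabulary word has an
                # f() entry.
                return f[ngram]
            if ngram in f:                    # c(a_z) > 0: discounted estimate
                return f[ngram]
            context = ngram[:-1]              # a_
            weight = bow.get(context, 1.0)    # bow(a_) = 1 if context unseen
            return weight * backoff_prob(ngram[1:], f, bow)  # back off to p(_z)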

OPTIONS
    This section describes the formulation of each discounting option in
    ngram-count(1).  After giving the motivation for each discounting method,
    we will give expressions for f(a_z) and bow(a_) of Equation 2 in terms of
    the counts.  Note that some counts may not be included in the model file
    because of the -gtmin options; see Warning 4 in the next section.

    Backoff versions are the default, but interpolated versions of most
    models are available using the -interpolate option.  In this case we will
    express g(a_z) and bow(a_) of Equation 4 in terms of the counts as well.
    Note that the ARPA format model files store the interpolated models and
    the backoff models the same way, using f(a_z) and bow(a_); see Warning 3
    in the next section.  The conversion between backoff and interpolated
    formulations is given in Equation 6.

    The discounting options may be followed by a digit (1-9) to indicate that
    only specific N-gram orders be affected.  See ngram-count(1) for more
    details.

    -cdiscount D
        Ney's absolute discounting using D as the constant to subtract.  D
        should be between 0 and 1.  If Z1 is the set of all words z with
        c(a_z) > 0:

            f(a_z) = (c(a_z) - D) / c(a_)
            p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)         ; Eqn. 2
            bow(a_) = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))     ; Eqn. 3

        With the -interpolate option we have:

            g(a_z) = max(0, c(a_z) - D) / c(a_)
            p(a_z) = g(a_z) + bow(a_) p(_z)                        ; Eqn. 4
            bow(a_) = 1 - Sum_Z1 g(a_z)                            ; Eqn. 5
                    = D n(a_*) / c(a_)

        The suggested discount factor is:

            D = n1 / (n1 + 2*n2)

        where n1 and n2 are the total number of N-grams with exactly one and
        two counts, respectively.  Different discounting constants can be
        specified for different N-gram orders using options -cdiscount1,
        -cdiscount2, etc.
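
        As a rough illustration (not SRILM code), the following Python sketch
        computes the backoff form of absolute discounting for bigrams from
        plain dictionaries of counts; the function and argument names are
        made up for the example:

            from collections import Counter

            def absolute_discount_bigrams(bigram_counts, unigram_probs, D=0.5):
                """f(a_z) and bow(a_) per -cdiscount and Eqn. 3 (bigram case).
                bigram_counts: {(a, z): c(a_z)}; unigram_probs: {z: p(_z)}."""
                context_totals = Counter()
                seen = {}                            # context a -> Z1, the seen z's
                for (a, z), c in bigram_counts.items():
                    context_totals[a] += c           # c(a_)
                    seen.setdefault(a, set()).add(z)
                f = {(a, z): (c - D) / context_totals[a]
                     for (a, z), c in bigram_counts.items()}
                bow = {}
                for a, zs in seen.items():
                    # Eqn. 3, using the lower-order p(_z) in the denominator.
                    left = 1.0 - sum(f[(a, z)] for z in zs)
                    bow[a] = left / (1.0 - sum(unigram_probs[z] for z in zs))
                return f, bow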

    -kndiscount and -ukndiscount
        Kneser-Ney discounting.  This is similar to absolute discounting in
        that the discounted probability is computed by subtracting a constant
        D from the N-gram count.  The options -kndiscount and -ukndiscount
        differ as to how this constant is computed.

        The main idea of Kneser-Ney is to use a modified probability estimate
        for the lower order N-grams used for backoff.  Specifically, the
        modified probability for a lower order N-gram is taken to be
        proportional to the number of unique words that precede it in the
        training data.  With discounting and normalization we get:

            f(a_z) = (c(a_z) - D0) / c(a_)    ;; for highest order N-grams
            f(_z)  = (n(*_z) - D1) / n(*_*)   ;; for lower order N-grams

        where the n(*_z) notation represents the number of unique N-grams
        that match a given pattern, with (*) used as a wildcard for a single
        word.  D0 and D1 represent two different discounting constants, as
        each N-gram order uses a different discounting constant.  The
        resulting conditional probability and the backoff weight are
        calculated as given in equations (2) and (3):

            p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)         ; Eqn. 2
            bow(a_) = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))     ; Eqn. 3

        The option -interpolate is used to create the interpolated versions
        of -kndiscount and -ukndiscount.  In this case we have:

            p(a_z) = g(a_z) + bow(a_) p(_z)                        ; Eqn. 4

        Let Z1 be the set {z: c(a_z) > 0}.  For highest order N-grams we
        have:

            g(a_z) = max(0, c(a_z) - D) / c(a_)
            bow(a_) = 1 - Sum_Z1 g(a_z)
                    = 1 - Sum_Z1 c(a_z) / c(a_) + Sum_Z1 D / c(a_)
                    = D n(a_*) / c(a_)

        Let Z2 be the set {z: n(*_z) > 0}.  For lower order N-grams we have:

            g(_z) = max(0, n(*_z) - D) / n(*_*)
            bow(_) = 1 - Sum_Z2 g(_z)
                   = 1 - Sum_Z2 n(*_z) / n(*_*) + Sum_Z2 D / n(*_*)
                   = D n(_*) / n(*_*)

        The original Kneser-Ney discounting (-ukndiscount) uses one
        discounting constant for each N-gram order.  These constants are
        estimated as

            D = n1 / (n1 + 2*n2)

        where n1 and n2 are the total number of N-grams with exactly one and
        two counts, respectively.

        Chen and Goodman's modified Kneser-Ney discounting (-kndiscount) uses
        three discounting constants for each N-gram order: one for one-count
        N-grams, one for two-count N-grams, and one for three-plus-count
        N-grams:

            Y   = n1 / (n1 + 2*n2)
            D1  = 1 - 2Y(n2/n1)
            D2  = 2 - 3Y(n3/n2)
            D3+ = 3 - 4Y(n4/n3)

        Warning:
            SRILM implements Kneser-Ney discounting by actually modifying the
            counts of the lower order N-grams.  Thus, when the -write option
            is used to write the counts with -kndiscount or -ukndiscount,
            only the highest order N-grams and N-grams that start with <s>
            will have their regular counts c(a_z); all others will have the
            modified counts n(*_z) instead.  See Warning 2 in the next
            section.
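
        As an illustration, the modified Kneser-Ney discounting constants
        above can be computed from a table of count-of-counts; a minimal
        Python sketch with made-up names (it assumes n1 through n3 are
        nonzero):

            def modified_kn_discounts(count_of_counts):
                """Return (D1, D2, D3+) from Chen & Goodman's formulas.
                count_of_counts[r] = number of N-grams seen exactly r times."""
                n1, n2, n3, n4 = (count_of_counts.get(r, 0) for r in (1, 2, 3, 4))
                Y = n1 / (n1 + 2 * n2)
                return (1 - 2 * Y * (n2 / n1),     # D1
                        2 - 3 * Y * (n3 / n2),     # D2
                        3 - 4 * Y * (n4 / n3))     # D3+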

    -wbdiscount
        Witten-Bell discounting.  The intuition is that the weight given to
        the lower order model should be proportional to the probability of
        observing an unseen word in the current context (a_).  Witten-Bell
        computes this weight as:

            bow(a_) = n(a_*) / (n(a_*) + c(a_))

        Here n(a_*) represents the number of unique words following the
        context (a_) in the training data.  Witten-Bell is originally an
        interpolated discounting method.  So with the -interpolate option we
        get:

            g(a_z) = c(a_z) / (n(a_*) + c(a_))
            p(a_z) = g(a_z) + bow(a_) p(_z)                        ; Eqn. 4

        Without the -interpolate option we have the backoff version, which is
        implemented by taking f(a_z) to be the same as the interpolated
        g(a_z):

            f(a_z) = c(a_z) / (n(a_*) + c(a_))
            p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)         ; Eqn. 2
            bow(a_) = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))     ; Eqn. 3
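
        A minimal sketch of the interpolated Witten-Bell estimate for one
        context, assuming the continuation counts for that context and the
        lower order probabilities are available as plain dictionaries
        (illustrative names, not SRILM internals):

            def witten_bell_interpolate(cont_counts, lower_probs):
                """p(a_z) for a fixed context a_ per the -wbdiscount formulas.
                cont_counts: {z: c(a_z)} for seen z; lower_probs: {z: p(_z)}."""
                total = sum(cont_counts.values())    # c(a_)
                unique = len(cont_counts)            # n(a_*)
                bow = unique / (unique + total)      # weight of lower order model
                return {z: cont_counts.get(z, 0) / (unique + total) + bow * p
                        for z, p in lower_probs.items()}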

    -ndiscount
        Ristad's natural discounting law.  See Ristad's technical report "A
        natural law of succession" for a justification of the discounting
        factor.  The -interpolate option has no effect; only a backoff
        version has been implemented.

                      c(a_z)   c(a_) (c(a_) + 1) + n(a_*) (1 - n(a_*))
            f(a_z) = -------- -----------------------------------------
                       c(a_)        c(a_)^2 + c(a_) + 2 n(a_*)

            p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)         ; Eqn. 2
            bow(a_) = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))     ; Eqn. 3
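
        The f(a_z) expression above translates directly into code; a minimal
        sketch with illustrative argument names:

            def ristad_f(c_az, c_a, n_a):
                """Ristad's natural law: c_az = c(a_z), c_a = c(a_), n_a = n(a_*)."""
                return (c_az / c_a) * (c_a * (c_a + 1) + n_a * (1 - n_a)) \
                       / (c_a * c_a + c_a + 2 * n_a)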

    -count-lm
        Estimate a count-based interpolated LM using Jelinek-Mercer smoothing
        (Chen & Goodman, 1998), also known as "deleted interpolation."  Note
        that this does not produce a backoff model; instead, a count-LM
        parameter file in the format described in ngram(1) needs to be
        specified using -init-lm, and a reestimated file in the same format
        is produced.  In the process, the mixture weights that interpolate
        the ML estimates at all levels of N-grams are estimated using an
        expectation-maximization (EM) algorithm.  The options -em-iters and
        -em-delta control termination of the EM algorithm.  Note that the
        N-gram counts used to compute the maximum-likelihood estimates are
        specified in the -init-lm model file.  The counts specified with
        -read or -text are used only to estimate the interpolation weights.
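
        As a rough illustration of Jelinek-Mercer interpolation only (it does
        not read or write SRILM count-LM parameter files, and all names are
        made up), the smoothed probability is a weighted mixture of ML
        estimates at every order; here the mixture weights are assumed to be
        given rather than EM-trained:

            def jelinek_mercer(ngram, counts, vocab_size, lambdas):
                """p(w | history) as a mixture of ML estimates of all orders.
                counts[k]: {k-gram tuple: count}, with counts[0][()] = total
                number of tokens; lambdas[k]: weight of order k."""
                k = len(ngram)
                if k == 0:
                    return 1.0 / vocab_size          # uniform base distribution
                hist_count = counts[k - 1].get(ngram[:-1], 0)
                ml = counts[k].get(ngram, 0) / hist_count if hist_count else 0.0
                return (lambdas[k] * ml +
                        (1 - lambdas[k]) * jelinek_mercer(ngram[1:], counts,
                                                          vocab_size, lambdas))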

    -addsmooth D
        Smooth by adding D to each N-gram count.  This is usually a poor
        smoothing method, included mainly for instructional purposes.

            p(a_z) = (c(a_z) + D) / (c(a_) + D n(*))
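
        A one-line sketch of the estimate above, where n_star stands for the
        vocabulary size n(*):

            def add_d_prob(c_az, c_a, n_star, D=1.0):
                """Additive smoothing: (c(a_z) + D) / (c(a_) + D * n(*))."""
                return (c_az + D) / (c_a + D * n_star)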

    default
        If the user does not specify any discounting options, ngram-count
        uses Good-Turing discounting (also known as Katz smoothing) by
        default.  The Good-Turing estimate states that for any N-gram that
        occurs r times, we should pretend that it occurs r' times, where

            r' = (r+1) n[r+1] / n[r]

        Here n[r] is the number of N-grams that occur exactly r times in the
        training data.

        Large counts are taken to be reliable, thus they are not subject to
        any discounting.  By default unigram counts larger than 1 and other
        N-gram counts larger than 7 are taken to be reliable, and maximum
        likelihood estimates are used.  These limits can be modified using
        the -gtnmax options, where n is the N-gram order.

            f(a_z) = c(a_z) / c(a_)    if c(a_z) > gtmax

        The lower counts are discounted proportionally to the Good-Turing
        estimate, with a small correction A to account for the high-count
        N-grams not being discounted.  If 1 <= c(a_z) <= gtmax:

                                 n[gtmax + 1]
            A = (gtmax + 1)  ----------------
                                   n[1]

                                      n[c(a_z) + 1]
            c'(a_z) = (c(a_z) + 1)  -----------------
                                       n[c(a_z)]

                      c(a_z)    (c'(a_z) / c(a_z) - A)
            f(a_z) = --------  ------------------------
                       c(a_)            (1 - A)

        The -interpolate option has no effect in this case; only a backoff
        version has been implemented, thus:

            p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)         ; Eqn. 2
            bow(a_) = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))     ; Eqn. 3
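
        A minimal sketch of the discounted f(a_z) above, with the
        count-of-counts n[r] passed in as a dictionary (illustrative names;
        it assumes the needed n[r] entries are present and nonzero):

            def good_turing_f(c_az, c_a, n, gtmax=7):
                """Discounted f(a_z); n[r] = number of N-grams with count r."""
                if c_az > gtmax:
                    return c_az / c_a                  # large counts: plain ML
                A = (gtmax + 1) * n[gtmax + 1] / n[1]  # correction term
                c_adj = (c_az + 1) * n[c_az + 1] / n[c_az]
                return (c_az / c_a) * (c_adj / c_az - A) / (1 - A)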

FILE FORMATS
    SRILM can generate simple N-gram counts from plain text files with the
    following command:

        ngram-count -order N -text file.txt -write file.cnt

    The -order option determines the maximum length of the N-grams.  The
    file file.txt should contain one sentence per line with tokens separated
    by whitespace.  The output file.cnt contains the N-gram tokens followed
    by a tab and a count on each line:

        a_z <tab> c(a_z)
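
    Reading such a count file back into a program is straightforward; a
    minimal Python sketch (the function name is just an example):

        def read_counts(path):
            """Parse an ngram-count -write file: N-gram, a tab, then the count."""
            counts = {}
            with open(path, encoding="utf-8") as fh:
                for line in fh:
                    ngram, count = line.rstrip("\n").split("\t")
                    counts[tuple(ngram.split())] = int(count)
            return counts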

    A couple of warnings:

    Warning 1
        SRILM implicitly assumes an <s> token at the beginning of each line
        and an </s> token at the end of each line, and counts N-grams that
        start with <s> and end with </s>.  You do not need to include these
        tags in file.txt.

    Warning 2
        When the -kndiscount or -ukndiscount options are used, the count file
        contains modified counts.  Specifically, all N-grams of the maximum
        order, and all N-grams that start with <s>, have their regular counts
        c(a_z), but shorter N-grams that do not start with <s> have the
        number of unique words preceding them, n(*a_z), instead.  See the
        description of -kndiscount and -ukndiscount for details.

    For most smoothing methods (except -count-lm) SRILM generates and uses
    N-gram model files in the ARPA format.  A typical command to generate a
    model file would be:

        ngram-count -order N -text file.txt -lm file.lm

    The ARPA format output file.lm will contain the following information
    about an N-gram on each line:

        log10(f(a_z)) <tab> a_z <tab> log10(bow(a_z))

    Based on Equation 2, the first entry represents the base-10 logarithm of
    the conditional probability (logprob) for the N-gram a_z.  This is
    followed by the actual words in the N-gram, separated by spaces.  The
    last and optional entry is the base-10 logarithm of the backoff weight
    for (n+1)-grams starting with a_z.
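
    A minimal sketch of splitting one such N-gram line into its fields; the
    missing backoff weight defaults to 1 (log10 = 0, see Warning 5 below):

        def parse_arpa_ngram_line(line):
            """Return (words, prob, bow) from a line of an N-gram section."""
            fields = line.rstrip("\n").split("\t")
            words = tuple(fields[1].split())
            prob = 10.0 ** float(fields[0])
            bow = 10.0 ** float(fields[2]) if len(fields) > 2 else 1.0
            return words, prob, bow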

    Warning 3
        Both backoff and interpolated models are represented in the same
        format.  This means interpolation is done during model building and
        represented in the ARPA format with logprob and backoff weight using
        equation (6).

    Warning 4
        Not all N-grams in the count file necessarily end up in the model
        file.  The options -gtmin, -gt1min, ..., -gt9min specify the minimum
        counts for N-grams to be included in the LM (not only for Good-Turing
        discounting but for the other methods as well).  By default all
        unigrams and bigrams are included, but for higher order N-grams only
        those with count >= 2 are included.  Some exceptions arise, because
        if one N-gram is included in the model file, all its prefix N-grams
        have to be included as well.  This causes some higher order 1-count
        N-grams to be included when using KN discounting, which uses modified
        counts as described in Warning 2.

    Warning 5
        Not all N-grams in the model file have backoff weights.  The highest
        order N-grams do not need a backoff weight.  For lower order N-grams
        backoff weights are only recorded for those that appear as the prefix
        of a longer N-gram included in the model.  For other lower order
        N-grams the backoff weight is implicitly 1 (or 0, in log
        representation).

SEE ALSO
    ngram(1), ngram-count(1), ngram-format(5),
    S. F. Chen and J. Goodman, "An Empirical Study of Smoothing Techniques
    for Language Modeling," TR-10-98, Computer Science Group, Harvard Univ.,
    1998.

BUGS
    Work in progress.

AUTHOR
    Deniz Yuret <dyuret@ku.edu.tr>, Andreas Stolcke <stolcke@icsi.berkeley.edu>

    Copyright (c) 2007 SRI International

SRILM Miscellaneous          $Date: 2019/09/09 22:35:37 $         ngram-discount(7)