<! $Id: ngram-class.1,v 1.10 2019/09/09 22:35:37 stolcke Exp $>
<HTML>
<HEAD>
<TITLE>ngram-class</TITLE>
</HEAD>
<BODY>
<H1>ngram-class</H1>
<H2> NAME </H2>
ngram-class - induce word classes from N-gram statistics
<H2> SYNOPSIS </H2>
<PRE>
<B>ngram-class</B> [ <B>-help</B> ] <I>option</I> ...
</PRE>
<H2> DESCRIPTION </H2>
<B> ngram-class </B>
induces word classes from distributional statistics,
so as to minimize the perplexity of a class-based N-gram model
given the provided word N-gram counts.
Presently, only bigram statistics are used, i.e., the induced classes
are best suited for a class-bigram language model.
<P>
The program generates the class N-gram counts and class expansions
needed by
<A HREF="ngram-count.1.html">ngram-count(1)</A>
and
<A HREF="ngram.1.html">ngram(1)</A>,
respectively, to train and to apply the class N-gram model.
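<P>
As a sketch of that pipeline (illustrative filenames; the ngram-count(1) and ngram(1) invocations follow typical SRILM usage and are not mandated by this page):
<P>

```shell
# 1. Induce 100 word classes from a text corpus, writing both the
#    class definitions and the class bigram counts.
ngram-class -text corpus.txt -numclasses 100 \
    -classes corpus.classes -class-counts corpus.ccounts

# 2. Train a class bigram LM from the class counts
#    (typical ngram-count usage; see ngram-count(1)).
ngram-count -read corpus.ccounts -order 2 -lm class.lm

# 3. Apply the class LM, expanding classes back to words
#    via the -classes option of ngram(1).
ngram -lm class.lm -classes corpus.classes -order 2 -ppl test.txt
```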
<H2> OPTIONS </H2>
<P>
Each filename argument can be an ASCII file, a
compressed file (name ending in .Z or .gz), or ``-'' to indicate
stdin/stdout.
<DL>
<DT><B> -help </B>
<DD>
Print option summary.
<DT><B> -version </B>
<DD>
Print version information.
<DT><B>-debug</B><I> level</I>
<DD>
Set debugging output at
<I>level</I>.
Level 0 means no debugging.
Debugging messages are written to stderr.
A useful level for tracing the formation of classes is 2.
</DD>
</DL>
<H3> Input Options </H3>
<DL>
<DT><B>-vocab</B><I> file</I>
<DD>
Read a vocabulary from
<I>file</I>.
Subsequently, out-of-vocabulary words in both counts and text are
replaced with the unknown-word token.
If this option is not specified, all words found are implicitly added
to the vocabulary.
<DT><B> -tolower </B>
<DD>
Map the vocabulary to lowercase.
<DT><B>-counts</B><I> file</I>
<DD>
Read N-gram counts from
<I>file</I>.
Each line contains an N-gram of
words, followed by an integer count, all separated by whitespace.
Repeated counts for the same N-gram are added.
Counts collected by
<B> -text </B>
and
<B> -counts </B>
are additive as well.
<BR>
Note that the input should contain consistent lower- and higher-order
counts (i.e., unigrams and bigrams), as would be generated by
<A HREF="ngram-count.1.html">ngram-count(1)</A>.
<DT><B>-text</B><I> textfile</I>
<DD>
Generate N-gram counts from
<I>textfile</I>,
which should contain one sentence unit per line.
Begin/end sentence tokens are added if not already present.
Empty lines are ignored.
</DD>
</DL>
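<P>
The <B>-counts</B> input described above is plain whitespace-separated N-grams, each followed by its integer count. An illustrative fragment with consistent unigram and bigram counts (invented words and numbers):
<P>

```
the 100
cat 35
dog 28
the cat 30
the dog 25
```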
<H3> Class Merging </H3>
<DL>
<DT><B>-numclasses</B><I> C</I>
<DD>
The target number of classes to induce.
A zero argument suppresses automatic class merging altogether
(e.g., for use with
<B>-interact</B>).
<DT><B> -full </B>
<DD>
Perform full greedy merging over all classes, starting with one class per
word.
This is the O(V^3) algorithm described in Brown et al. (1992).
<DT><B> -incremental </B>
<DD>
Perform incremental greedy merging, starting with
one class each for the
<I> C </I>
most frequent words, and then adding one word at a time.
This is the O(V*C^2) algorithm described in Brown et al. (1992);
it is the default.
<DT><B>-maxwordsperclass</B><I> M</I>
<DD>
Limits the number of words in a class to
<I>M</I>
in incremental merging.
By default there is no such limit.
<DT><B> -interact </B>
<DD>
Enter a primitive interactive interface when automatic class
induction is done, allowing manual specification of additional merging steps.
<DT><B>-noclass-vocab</B><I> file</I>
<DD>
Read a list of vocabulary items from
<I> file </I>
that are to be excluded from classes.
These words or tags do not undergo class merging, but their
N-gram counts still affect the optimization of model perplexity.
<BR>
The default is to exclude the sentence begin/end tags (&lt;s&gt; and &lt;/s&gt;)
from class merging; this can be suppressed by specifying
<B>-noclass-vocab /dev/null</B>.
<DT><B>-read</B><I> file</I>
<DD>
Read initial class memberships from
<I>file</I>.
Class memberships need to be stored in
<A HREF="classes-format.5.html">classes-format(5)</A>,
with the additional conditions that probabilities are obligatory
and that each membership definition must specify exactly one word.
</DD>
</DL>
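<P>
For example, the default incremental algorithm with a cap on class size might be invoked as follows (illustrative filenames):
<P>

```shell
# Incremental greedy merging (the default): one initial class for each
# of the 200 most frequent words, adding the rest one word at a time,
# with at most 50 words per class.
ngram-class -text corpus.txt -numclasses 200 -incremental \
    -maxwordsperclass 50 -classes corpus.classes
```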
<H3> Output Options </H3>
<DL>
<DT><B>-class-counts</B><I> file</I>
<DD>
Write class N-gram counts to
<I> file </I>
when done.
The format is the same as for word N-gram counts, and can be
read by
<A HREF="ngram-count.1.html">ngram-count(1)</A>
to estimate a class N-gram model.
<DT><B>-classes</B><I> file</I>
<DD>
Write class definitions (member words and their probabilities) to
<I> file </I>
when done.
The output format is the same as required by the
<B> -classes </B>
option of
<A HREF="ngram.1.html">ngram(1)</A>.
<DT><B>-save</B><I> S</I>
<DD>
Save the class counts and/or class definitions every
<I> S </I>
iterations during induction.
The filenames are obtained from the
<B> -class-counts </B>
and
<B> -classes </B>
options, respectively, by appending the iteration number.
This is convenient for producing sets of classes at different granularities
during the same run.
The saved class memberships can also be used with the
<B> -read </B>
option to restart class merging at a later time.
<I>S</I>=0
(the default) suppresses the saving actions.
<DT><B>-save-maxclasses</B><I> K</I>
<DD>
Modifies the action of
<B> -save </B>
so as to start saving only once the number of classes reaches
<I>K</I>.
(The iteration numbers embedded in filenames will start at 0 from that point.)
</DD>
</DL>
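<P>
A save-and-restart run combining these options could look like this (illustrative filenames; the exact suffix appended to saved filenames is whatever <B>-save</B> produces, shown here as ".2" for concreteness):
<P>

```shell
# Save class definitions every 100 iterations, but only once the
# number of classes has dropped to 500, yielding several
# granularities from one run.
ngram-class -text corpus.txt -numclasses 100 \
    -classes corpus.classes -save 100 -save-maxclasses 500

# Later, restart merging from one of the saved snapshots via -read,
# continuing down to 50 classes.
ngram-class -text corpus.txt -read corpus.classes.2 \
    -numclasses 50 -classes final.classes
```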
<H2> SEE ALSO </H2>
<A HREF="ngram-count.1.html">ngram-count(1)</A>, <A HREF="ngram.1.html">ngram(1)</A>, <A HREF="classes-format.5.html">classes-format(5)</A>.
<BR>
P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai and R. L. Mercer,
``Class-Based n-gram Models of Natural Language,''
<I>Computational Linguistics</I> 18(4), 467-479, 1992.
<H2> BUGS </H2>
Classes are optimized only for bigram models at present.
<H2> AUTHOR </H2>
Andreas Stolcke &lt;stolcke@icsi.berkeley.edu&gt;,
Seppo Enarvi &lt;seppo.enarvi@aalto.fi&gt;
<BR>
Copyright (c) 1999-2010 SRI International
<BR>
Copyright (c) 2013-2014 Seppo Enarvi
<BR>
Copyright (c) 2011-2014 Andreas Stolcke
<BR>
Copyright (c) 2012-2014 Microsoft Corp.
</BODY>
</HTML>