<! $Id: ngram-class.1,v 1.10 2019/09/09 22:35:37 stolcke Exp $>
<HTML>
<HEAD>
<TITLE>ngram-class</TITLE>
</HEAD>
<BODY>
<H1>ngram-class</H1>
<H2> NAME </H2>
ngram-class - induce word classes from N-gram statistics
<H2> SYNOPSIS </H2>
<PRE>
<B>ngram-class</B> [ <B>-help</B> ] <I>option</I> ...
</PRE>
<H2> DESCRIPTION </H2>
<B> ngram-class </B>
induces word classes from distributional statistics,
so as to minimize the perplexity of a class-based N-gram model
given the provided word N-gram counts.
Presently, only bigram statistics are used, i.e., the induced classes
are best suited for a class-bigram language model.
<P>
The program generates the class N-gram counts and class expansions
needed by
<A HREF="ngram-count.1.html">ngram-count(1)</A>
and
<A HREF="ngram.1.html">ngram(1)</A>,
respectively, to train and to apply the class N-gram model.
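<P>
As a sketch of that pipeline (illustrative filenames; the ngram-count(1) and ngram(1) invocations follow typical SRILM usage and are not mandated by this page):
<P>

```shell
# 1. Induce 100 word classes from a text corpus, writing both the
#    class definitions and the class bigram counts.
ngram-class -text corpus.txt -numclasses 100 \
    -classes corpus.classes -class-counts corpus.ccounts

# 2. Train a class bigram LM from the class counts
#    (typical ngram-count usage; see ngram-count(1)).
ngram-count -read corpus.ccounts -order 2 -lm class.lm

# 3. Apply the class LM, expanding classes back to words
#    via the -classes option of ngram(1).
ngram -lm class.lm -classes corpus.classes -order 2 -ppl test.txt
```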
<H2> OPTIONS </H2>
<P>
Each filename argument can be an ASCII file, a
compressed file (name ending in .Z or .gz), or ``-'' to indicate
stdin/stdout.
<DL>
<DT><B> -help </B>
<DD>
Print option summary.
<DT><B> -version </B>
<DD>
Print version information.
<DT><B>-debug</B><I> level</I>
<DD>
Set debugging output at
<I>level</I>.
Level 0 means no debugging.
Debugging messages are written to stderr.
A useful level for tracing the formation of classes is 2.
</DD>
</DL>
<H3> Input Options </H3>
<DL>
<DT><B>-vocab</B><I> file</I>
<DD>
Read a vocabulary from
<I>file</I>.
Subsequently, out-of-vocabulary words in both counts and text are
replaced with the unknown-word token.
If this option is not specified, all words found are implicitly added
to the vocabulary.
<DT><B> -tolower </B>
<DD>
Map the vocabulary to lowercase.
<DT><B>-counts</B><I> file</I>
<DD>
Read N-gram counts from
<I>file</I>.
Each line contains an N-gram of
words, followed by an integer count, all separated by whitespace.
Repeated counts for the same N-gram are added.
Counts collected by
<B> -text </B>
and
<B> -counts </B>
are additive as well.
<BR>
Note that the input should contain consistent lower- and higher-order
counts (i.e., unigrams and bigrams), as would be generated by
<A HREF="ngram-count.1.html">ngram-count(1)</A>.
<DT><B>-text</B><I> textfile</I>
<DD>
Generate N-gram counts from
<I>textfile</I>,
which should contain one sentence unit per line.
Begin/end sentence tokens are added if not already present.
Empty lines are ignored.
</DD>
</DL>
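<P>
The <B>-counts</B> input described above is plain whitespace-separated N-grams, each followed by its integer count. An illustrative fragment with consistent unigram and bigram counts (invented words and numbers):
<P>

```
the 100
cat 35
dog 28
the cat 30
the dog 25
```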
<H3> Class Merging </H3>
<DL>
<DT><B>-numclasses</B><I> C</I>
<DD>
The target number of classes to induce.
A zero argument suppresses automatic class merging altogether
(e.g., for use with
<B>-interact</B>).
<DT><B> -full </B>
<DD>
Perform full greedy merging over all classes, starting with one class per
word.
This is the O(V^3) algorithm described in Brown et al. (1992).
<DT><B> -incremental </B>
<DD>
Perform incremental greedy merging, starting with
one class each for the
<I> C </I>
most frequent words, and then adding one word at a time.
This is the O(V*C^2) algorithm described in Brown et al. (1992);
it is the default.
<DT><B>-maxwordsperclass</B><I> M</I>
<DD>
Limits the number of words in a class to
<I>M</I>
in incremental merging.
By default there is no such limit.
<DT><B> -interact </B>
<DD>
Enter a primitive interactive interface when automatic class
induction is done, allowing manual specification of additional merging steps.
<DT><B>-noclass-vocab</B><I> file</I>
<DD>
Read a list of vocabulary items from
<I> file </I>
that are to be excluded from classes.
These words or tags do not undergo class merging, but their
N-gram counts still affect the optimization of model perplexity.
<BR>
The default is to exclude the sentence begin/end tags (&lt;s&gt; and &lt;/s&gt;)
from class merging; this can be suppressed by specifying
<B>-noclass-vocab /dev/null</B>.
<DT><B>-read</B><I> file</I>
<DD>
Read initial class memberships from
<I>file</I>.
Class memberships need to be stored in
<A HREF="classes-format.5.html">classes-format(5)</A>,
with the additional conditions that probabilities are obligatory
and that each membership definition must specify exactly one word.
</DD>
</DL>
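<P>
For example, the default incremental algorithm with a cap on class size might be invoked as follows (illustrative filenames):
<P>

```shell
# Incremental greedy merging (the default): one initial class for each
# of the 200 most frequent words, adding the rest one word at a time,
# with at most 50 words per class.
ngram-class -text corpus.txt -numclasses 200 -incremental \
    -maxwordsperclass 50 -classes corpus.classes
```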
<H3> Output Options </H3>
<DL>
<DT><B>-class-counts</B><I> file</I>
<DD>
Write class N-gram counts to
<I> file </I>
when done.
The format is the same as for word N-gram counts, and can be
read by
<A HREF="ngram-count.1.html">ngram-count(1)</A>
to estimate a class N-gram model.
<DT><B>-classes</B><I> file</I>
<DD>
Write class definitions (member words and their probabilities) to
<I> file </I>
when done.
The output format is the same as required by the
<B> -classes </B>
option of
<A HREF="ngram.1.html">ngram(1)</A>.
<DT><B>-save</B><I> S</I>
<DD>
Save the class counts and/or class definitions every
<I> S </I>
iterations during induction.
The filenames are obtained from the
<B> -class-counts </B>
and
<B> -classes </B>
options, respectively, by appending the iteration number.
This is convenient for producing sets of classes at different granularities
during the same run.
The saved class memberships can also be used with the
<B> -read </B>
option to restart class merging at a later time.
<I>S</I>=0
(the default) suppresses the saving actions.
<DT><B>-save-maxclasses</B><I> K</I>
<DD>
Modifies the action of
<B> -save </B>
so as to start saving only once the number of classes reaches
<I>K</I>.
(The iteration numbers embedded in filenames will start at 0 from that point.)
</DD>
</DL>
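<P>
A save-and-restart run combining these options could look like this (illustrative filenames; the exact suffix appended to saved filenames is whatever <B>-save</B> produces, shown here as ".2" for concreteness):
<P>

```shell
# Save class definitions every 100 iterations, but only once the
# number of classes has dropped to 500, yielding several
# granularities from one run.
ngram-class -text corpus.txt -numclasses 100 \
    -classes corpus.classes -save 100 -save-maxclasses 500

# Later, restart merging from one of the saved snapshots via -read,
# continuing down to 50 classes.
ngram-class -text corpus.txt -read corpus.classes.2 \
    -numclasses 50 -classes final.classes
```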
<H2> SEE ALSO </H2>
<A HREF="ngram-count.1.html">ngram-count(1)</A>, <A HREF="ngram.1.html">ngram(1)</A>, <A HREF="classes-format.5.html">classes-format(5)</A>.
<BR>
P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai and R. L. Mercer,
``Class-Based n-gram Models of Natural Language,''
<I>Computational Linguistics</I> 18(4), 467-479, 1992.
<H2> BUGS </H2>
Classes are optimized only for bigram models at present.
<H2> AUTHOR </H2>
Andreas Stolcke &lt;stolcke@icsi.berkeley.edu&gt;,
Seppo Enarvi &lt;seppo.enarvi@aalto.fi&gt;
<BR>
Copyright (c) 1999-2010 SRI International
<BR>
Copyright (c) 2013-2014 Seppo Enarvi
<BR>
Copyright (c) 2011-2014 Andreas Stolcke
<BR>
Copyright (c) 2012-2014 Microsoft Corp.
</BODY>
</HTML>