<!-- $Id: ngram-class.1,v 1.10 2019/09/09 22:35:37 stolcke Exp $ -->
<HTML>
<HEAD>
<TITLE>ngram-class</TITLE>
</HEAD>
<BODY>
<H1>ngram-class</H1>
<H2> NAME </H2>
ngram-class - induce word classes from N-gram statistics
<H2> SYNOPSIS </H2>
<PRE>
<B>ngram-class</B> [ <B>-help</B> ] <I>option</I> ...
</PRE>
<H2> DESCRIPTION </H2>
<B>ngram-class</B>
induces word classes from distributional statistics,
so as to minimize the perplexity of a class-based N-gram model
given the provided word N-gram counts.
Presently, only bigram statistics are used, i.e., the induced classes
are best suited for a class-bigram language model.
<P>
The program generates the class N-gram counts and class expansions
needed by
<A HREF="ngram-count.1.html">ngram-count(1)</A>
and
<A HREF="ngram.1.html">ngram(1)</A>,
respectively, to train and to apply the class N-gram model.
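<P>
A typical workflow (all file names are illustrative, and smoothing or other
estimation options are omitted) induces the classes, trains a class bigram
model from the class counts with
<A HREF="ngram-count.1.html">ngram-count(1)</A>,
and applies it with
<A HREF="ngram.1.html">ngram(1)</A>:
<PRE>
	ngram-class -text corpus.txt -numclasses 200 \
		-class-counts class.counts -classes classes.txt
	ngram-count -order 2 -read class.counts -lm class.2bo
	ngram -order 2 -lm class.2bo -classes classes.txt -ppl test.txt
</PRE>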
<H2> OPTIONS </H2>
<P>
Each filename argument can be an ASCII file, or a
compressed file (name ending in .Z or .gz), or ``-'' to indicate
stdin/stdout.
<DL>
<DT><B> -help </B>
<DD>
Print option summary.
<DT><B> -version </B>
<DD>
Print version information.
<DT><B>-debug</B><I> level</I>
<DD>
Set debugging output at
<I>level</I>.
Level 0 means no debugging.
Debugging messages are written to stderr.
A useful level to trace the formation of classes is 2.
</DD>
</DL>
<H3> Input Options </H3>
<DL>
<DT><B>-vocab</B><I> file</I>
<DD>
Read a vocabulary from
<I>file</I>.
Subsequently, out-of-vocabulary words in counts or text input are
replaced with the unknown-word token.
If this option is not specified, all words found are implicitly added
to the vocabulary.
<DT><B> -tolower </B>
<DD>
Map the vocabulary to lowercase.
<DT><B>-counts</B><I> file</I>
<DD>
Read N-gram counts from
<I>file</I>.
Each line contains an N-gram of
words, followed by an integer count, all separated by whitespace.
Repeated counts for the same N-gram are added.
Counts collected by
<B>-text</B>
and
<B>-counts</B>
are additive as well.
<BR>
Note that the input should contain consistent lower- and higher-order
counts (i.e., unigrams and bigrams), as would be generated by
<A HREF="ngram-count.1.html">ngram-count(1)</A>;
see the example following this option list.
<DT><B>-text</B><I> textfile</I>
<DD>
Generate N-gram counts from
<I>textfile</I>,
which should contain one sentence unit per line.
Begin/end sentence tokens are added if not already present.
Empty lines are ignored.
</DD>
</DL>
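<P>
For illustration, a counts file suitable for
<B>-counts</B>
contains matching unigram and bigram counts, one N-gram and its count per line
(the words and counts below are made up):
<PRE>
	the	1021
	the	cat	12
	cat	47
	cat	sat	9
	sat	35
</PRE>
Such a file can be produced from raw text with
<A HREF="ngram-count.1.html">ngram-count(1)</A>
and then passed to
<B>ngram-class</B>
(file names are illustrative):
<PRE>
	ngram-count -order 2 -text corpus.txt -write word.counts
	ngram-class -counts word.counts -numclasses 200 -classes classes.txt
</PRE>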
<H3> Class Merging </H3>
<DL>
<DT><B>-numclasses</B><I> C</I>
<DD>
The target number of classes to induce.
A zero argument suppresses automatic class merging altogether
(e.g., for use with
<B>-interact</B>).
<DT><B> -full </B>
<DD>
Perform full greedy merging over all classes, starting with one class per
word.
This is the O(V^3) algorithm described in Brown et al. (1992).
<DT><B> -incremental </B>
<DD>
Perform incremental greedy merging, starting with
one class each for the
<I>C</I>
most frequent words, and then adding one word at a time.
This is the O(V*C^2) algorithm described in Brown et al. (1992);
it is the default.
<DT><B>-maxwordsperclass</B><I> M</I>
<DD>
Limits the number of words in a class to
<I>M</I>
in incremental merging.
By default there is no such limit.
<DT><B> -interact </B>
<DD>
Enter a primitive interactive interface when done with automatic class
induction, allowing manual specification of additional merging steps.
<DT><B>-noclass-vocab</B><I> file</I>
<DD>
Read a list of vocabulary items from
<I>file</I>
that are to be excluded from classes.
These words or tags do not undergo class merging, but their
N-gram counts still affect the optimization of model perplexity.
<BR>
The default is to exclude the sentence begin/end tags (&lt;s&gt; and &lt;/s&gt;)
from class merging; this can be suppressed by specifying
<B>-noclass-vocab /dev/null</B>.
<DT><B>-read</B><I> file</I>
<DD>
Read initial class memberships from
<I>file</I>.
Class memberships need to be stored in
<A HREF="classes-format.5.html">classes-format(5)</A>,
with the additional conditions that probabilities are obligatory
and that each membership definition must specify exactly one word;
see the sketch following this option list.
</DD>
</DL>
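<P>
As a sketch (class names, probabilities, and file names are made up), an
initial membership file for
<B>-read</B>
lists one word per line with an explicit probability, following
<A HREF="classes-format.5.html">classes-format(5)</A>:
<PRE>
	CLASS1	0.5	monday
	CLASS1	0.5	tuesday
	CLASS2	1.0	red
</PRE>
Such a file can be used to seed or restart class merging:
<PRE>
	ngram-class -counts word.counts -read init.classes \
		-numclasses 100 -classes classes.txt
</PRE>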
<H3> Output Options </H3>
<DL>
<DT><B>-class-counts</B><I> file</I>
<DD>
Write class N-gram counts to
<I>file</I>
when done.
The format is the same as for word N-gram counts, and can be
read by
<A HREF="ngram-count.1.html">ngram-count(1)</A>
to estimate a class N-gram model.
<DT><B>-classes</B><I> file</I>
<DD>
Write class definitions (member words and their probabilities) to
<I>file</I>
when done.
The output format is the same as required by the
<B>-classes</B>
option of
<A HREF="ngram.1.html">ngram(1)</A>.
<DT><B>-save</B><I> S</I>
<DD>
Save the class counts and/or class definitions every
<I>S</I>
iterations during induction.
The filenames are obtained from the
<B>-class-counts</B>
and
<B>-classes</B>
options, respectively, by appending the iteration number.
This is convenient for producing sets of classes at different granularities
during the same run (see the example following this option list).
The saved class memberships can also be used with the
<B>-read</B>
option to restart class merging at a later time.
<I>S</I>=0
(the default) suppresses the saving actions.
<DT><B>-save-maxclasses</B><I> K</I>
<DD>
Modifies the action of
<B>-save</B>
so as to only start saving once the number of classes reaches
<I>K</I>.
(The iteration numbers embedded in filenames will start at 0 from that point.)
</DD>
</DL>
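<P>
For instance (file names are illustrative), the following full merging run
additionally saves intermediate class definitions and counts every 25
iterations, starting once the number of classes has dropped to 500:
<PRE>
	ngram-class -counts word.counts -full -numclasses 100 \
		-class-counts class.counts -classes classes.txt \
		-save 25 -save-maxclasses 500
</PRE>
The intermediate files are named after the
<B>-class-counts</B>
and
<B>-classes</B>
arguments with the iteration number appended.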
<H2> SEE ALSO </H2>
<A HREF="ngram-count.1.html">ngram-count(1)</A>, <A HREF="ngram.1.html">ngram(1)</A>, <A HREF="classes-format.5.html">classes-format(5)</A>.
<BR>
P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai and R. L. Mercer,
``Class-Based n-gram Models of Natural Language,''
<I>Computational Linguistics</I> 18(4), 467-479, 1992.
<H2> BUGS </H2>
Classes are optimized only for bigram models at present.
<H2> AUTHOR </H2>
Andreas Stolcke &lt;stolcke@icsi.berkeley.edu&gt;,
Seppo Enarvi &lt;seppo.enarvi@aalto.fi&gt;
<BR>
Copyright (c) 1999-2010 SRI International
<BR>
Copyright (c) 2013-2014 Seppo Enarvi
<BR>
Copyright (c) 2011-2014 Andreas Stolcke
<BR>
Copyright (c) 2012-2014 Microsoft Corp.
</BODY>
</HTML>