| <! $Id: ngram-class.1,v 1.10 2019/09/09 22:35:37 stolcke Exp $>
 | |
| <HTML>
 | |
| <HEADER>
 | |
| <TITLE>ngram-class</TITLE>
 | |
| <BODY>
 | |
| <H1>ngram-class</H1>
 | |
| <H2> NAME </H2>
 | |
| ngram-class - induce word classes from N-gram statistics
 | |
| <H2> SYNOPSIS </H2>
 | |
| <PRE>
 | |
| <B>ngram-class</B> [ <B>-help</B> ] <I>option</I> ...
 | |
| </PRE>
 | |
| <H2> DESCRIPTION </H2>
 | |
| <B> ngram-class </B>
 | |
| induces word classes from distributional statistics,
 | |
| so as to minimize perplexity of a class-based N-gram model
 | |
| given the provided word N-gram counts.
 | |
| Presently, only bigram statistics are used, i.e., the induced classes
 | |
| are best suited for a class-bigram language model.
 | |
| <P>
 | |
| The program generates the class N-gram counts and class expansions
 | |
| needed by
 | |
| <A HREF="ngram-count.1.html">ngram-count(1)</A>
 | |
| and
 | |
| <A HREF="ngram.1.html">ngram(1)</A>,
 | |
| respectively to train and to apply the class N-gram model.
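<P>
The three-step workflow described above can be sketched as follows. This is a minimal illustration, not a recipe from the SRILM distribution: the file names are hypothetical, nothing is executed, and the <B>ngram-count</B> and <B>ngram</B> options used here (<B>-order</B>, <B>-read</B>, <B>-lm</B>, <B>-ppl</B>) are taken from their own manual pages and should be checked there.

```python
import shlex


def build_pipeline(train_text, test_text, num_classes, prefix="demo"):
    """Return the three command lines as argument lists (nothing is run).

    All file names derived from `prefix` are hypothetical placeholders.
    """
    classes = f"{prefix}.classes"
    class_counts = f"{prefix}.class-counts"
    lm = f"{prefix}.class-lm"
    return [
        # 1. Induce classes; write class definitions and class N-gram counts.
        ["ngram-class", "-text", train_text, "-numclasses", str(num_classes),
         "-classes", classes, "-class-counts", class_counts],
        # 2. Estimate a class bigram model from the class counts.
        ["ngram-count", "-order", "2", "-read", class_counts, "-lm", lm],
        # 3. Apply the model; the class definitions expand classes to words.
        ["ngram", "-order", "2", "-lm", lm, "-classes", classes,
         "-ppl", test_text],
    ]


commands = build_pipeline("train.txt", "test.txt", 100)
for argv in commands:
    print(shlex.join(argv))
```

The order is fixed at 2 in this sketch because the induced classes are optimized for bigram models.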
 | |
| <H2> OPTIONS </H2>
 | |
| <P>
 | |
| Each filename argument can be an ASCII file, or a 
 | |
| compressed file (name ending in .Z or .gz), or ``-'' to indicate
 | |
| stdin/stdout.
 | |
| <DL>
 | |
| <DT><B> -help </B>
 | |
| <DD>
 | |
| Print option summary.
 | |
| <DT><B> -version </B>
 | |
| <DD>
 | |
| Print version information.
 | |
| <DT><B>-debug</B><I> level</I>
 | |
| <DD>
 | |
| Set debugging output at
 | |
| <I>level</I>.
 | |
| Level 0 means no debugging.
 | |
| Debugging messages are written to stderr.
 | |
| A useful level to trace the formation of classes is 2.
 | |
| </DD>
 | |
| </DL>
 | |
| <H3> Input Options </H3>
 | |
| <DL>
 | |
| <DT><B>-vocab</B><I> file</I>
 | |
| <DD>
 | |
| Read a vocabulary from file.
 | |
| Subsequently, out-of-vocabulary words in both counts and text are
 | |
| replaced with the unknown-word token.
 | |
| If this option is not specified, all words found are implicitly added
 | |
| to the vocabulary.
 | |
| <DT><B> -tolower </B>
 | |
| <DD>
 | |
| Map the vocabulary to lowercase.
 | |
| <DT><B>-counts</B><I> file</I>
 | |
| <DD>
 | |
| Read N-gram counts from a file.
 | |
| Each line contains an N-gram of 
 | |
| words, followed by an integer count, all separated by whitespace.
 | |
| Repeated counts for the same N-gram are added.
 | |
| Counts collected by 
 | |
| <B> -text </B>
 | |
| and 
 | |
| <B> -counts </B>
 | |
| are additive as well.
 | |
| <BR>
 | |
| Note that the input should contain consistent lower- and higher-order
 | |
| counts (i.e., unigrams and bigrams), as would be generated by
 | |
| <A HREF="ngram-count.1.html">ngram-count(1)</A>.
 | |
| <DT><B>-text</B><I> textfile</I>
 | |
| <DD>
 | |
| Generate N-gram counts from a text file.
 | |
| <I> textfile </I>
 | |
| should contain one sentence unit per line.
 | |
| Begin/end sentence tokens are added if not already present.
 | |
| Empty lines are ignored.
 | |
| </DD>
 | |
| </DL>
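<P>
The two input conventions above can be illustrated with a short sketch. It is not the SRILM implementation: the function names are made up for this example, and skipping count lines with fewer than two fields is an assumption. It shows repeated N-grams being summed, sentence boundary tokens being added only when absent, empty lines being ignored, and counts from both sources being added together.

```python
from collections import Counter


def read_counts(lines):
    """Parse count lines: the words of an N-gram, then an integer count."""
    counts = Counter()
    for line in lines:
        fields = line.split()          # any whitespace separates fields
        if len(fields) < 2:
            continue                   # assumption: skip blank/wordless lines
        *words, n = fields
        counts[tuple(words)] += int(n)  # repeated N-grams are summed
    return counts


def count_text(lines):
    """Collect unigram and bigram counts, one sentence unit per line."""
    counts = Counter()
    for line in lines:
        words = line.split()
        if not words:
            continue                   # empty lines are ignored
        if words[0] != "<s>":
            words.insert(0, "<s>")     # add begin tag only if missing
        if words[-1] != "</s>":
            words.append("</s>")       # add end tag only if missing
        for w in words:
            counts[(w,)] += 1
        for pair in zip(words, words[1:]):
            counts[pair] += 1
    return counts


from_file = read_counts(["the cat 2", "the cat 1", "the 3", "cat 3"])
from_text = count_text(["the cat", "", "<s> the dog </s>"])
total = from_file + from_text          # counts from both sources add up
```

Collecting unigrams together with bigrams keeps the lower- and higher-order counts consistent, as the note on <B>-counts</B> asks.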
 | |
| <H3> Class Merging </H3>
 | |
| <DL>
 | |
| <DT><B>-numclasses</B><I> C</I>
 | |
| <DD>
 | |
| The target number of classes to induce.
 | |
| A zero argument suppresses automatic class merging altogether
 | |
| (e.g., for use with 
 | |
| <B>-interact</B>).
 | |
| <DT><B> -full </B>
 | |
| <DD>
 | |
| Perform full greedy merging over all classes starting with one class per
 | |
| word.
 | |
| This is the O(V^3) algorithm described in Brown et al. (1992).
 | |
| <DT><B> -incremental </B>
 | |
| <DD>
 | |
| Perform incremental greedy merging, starting with 
 | |
| one class each for the 
 | |
| <I> C </I>
 | |
| most frequent words, and then adding one word at a time.
 | |
| This is the O(V*C^2) algorithm described in Brown et al. (1992);
 | |
| it is the default.
 | |
| <DT><B>-maxwordsperclass</B><I> M</I>
 | |
| <DD>
 | |
| Limits the number of words in a class to
 | |
| <I> M </I>
 | |
| in incremental merging.
 | |
| By default there is no such limit.
 | |
| <DT><B> -interact </B>
 | |
| <DD>
 | |
| Enter a primitive interactive interface when done with automatic class
 | |
| induction, allowing manual specification of additional merging steps.
 | |
| <DT><B>-noclass-vocab</B><I> file</I>
 | |
| <DD>
 | |
| Read a list of vocabulary items from
 | |
| <I> file </I>
 | |
| that are to be excluded from classes.
 | |
| These words or tags do not undergo class merging, but their 
 | |
| N-gram counts still affect the optimization of model perplexity.
 | |
| <BR>
 | |
| The default is to exclude the sentence begin/end tags (&lt;s&gt; and &lt;/s&gt;)
 | |
| from class merging; this can be suppressed by specifying
 | |
| <B>-noclass-vocab /dev/null</B>.
 | |
| <DT><B>-read</B><I> file</I>
 | |
| <DD>
 | |
| Read initial class memberships from 
 | |
| <I>file</I>.
 | |
| Class memberships need to be stored in 
 | |
| <A HREF="classes-format.5.html">classes-format(5)</A>
 | |
| with the additional condition that probabilities are obligatory
 | |
| and that each membership definition must specify exactly one word.
 | |
| </DD>
 | |
| </DL>
 | |
| <H3> Output Options </H3>
 | |
| <DL>
 | |
| <DT><B>-class-counts</B><I> file</I>
 | |
| <DD>
 | |
| Write class N-gram counts to
 | |
| <I> file </I>
 | |
| when done.
 | |
| The format is the same as for word N-gram counts, and can be
 | |
| read by
 | |
| <A HREF="ngram-count.1.html">ngram-count(1)</A>
 | |
| to estimate a class-N-gram model.
 | |
| <DT><B>-classes</B><I> file</I>
 | |
| <DD>
 | |
| Write class definitions (member words and their probabilities) to
 | |
| <I> file </I>
 | |
| when done.
 | |
| The output format is the same as required by the
 | |
| <B> -classes </B>
 | |
| option of 
 | |
| <A HREF="ngram.1.html">ngram(1)</A>.
 | |
| <DT><B>-save</B><I> S</I>
 | |
| <DD>
 | |
| Save the class counts and/or class definitions every
 | |
| <I> S </I>
 | |
| iterations during induction.
 | |
| The filenames are obtained from the
 | |
| <B> -class-counts </B>
 | |
| and
 | |
| <B> -classes </B>
 | |
| options, respectively, by appending the iteration number.
 | |
| This is convenient for producing sets of classes at different granularities
 | |
| during the same run.
 | |
| The saved class memberships can also be used with the
 | |
| <B> -read </B>
 | |
| option to restart class merging at a later time.
 | |
| <I>S</I>=0
 | |
| (the default) suppresses the saving actions.
 | |
| <DT><B>-save-maxclasses</B><I> K</I>
 | |
| <DD>
 | |
| Modifies the action of
 | |
| <B> -save </B>
 | |
| so as to only start saving once the number of classes reaches
 | |
| <I>K</I>.
 | |
| (The iteration numbers embedded in filenames will start at 0 from that point.)
 | |
| </DD>
 | |
| </DL>
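<P>
As an illustration of how class definitions might be consumed by other tools, the sketch below reads the restricted form that <B>-read</B> requires: a probability on every line and exactly one word per definition. The line layout "class probability word" is an assumption based on classes-format(5), and the class names and function name are hypothetical; consult that page for the authoritative format.

```python
from collections import defaultdict


def read_classes(lines):
    """Map each class name to a {word: probability} dictionary.

    Assumes one membership per line in the form: class probability word.
    """
    members = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if not fields:
            continue                       # tolerate blank lines
        if len(fields) != 3:
            raise ValueError(f"expected 'class prob word': {line!r}")
        name, prob, word = fields
        members[name][word] = float(prob)
    return dict(members)


# Hypothetical class names; real output files may name classes differently.
sample = ["ANIMAL 0.5 cat", "ANIMAL 0.5 dog", "VERB 1 runs"]
definitions = read_classes(sample)
```

A simple sanity check on such a file is that the membership probabilities within each class sum to one.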
 | |
| <H2> SEE ALSO </H2>
 | |
| <A HREF="ngram-count.1.html">ngram-count(1)</A>, <A HREF="ngram.1.html">ngram(1)</A>, <A HREF="classes-format.5.html">classes-format(5)</A>.
 | |
| <BR>
 | |
| P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai and R. L. Mercer,
 | |
| ``Class-Based n-gram Models of Natural Language,''
 | |
| <I>Computational Linguistics</I> 18(4), 467-479, 1992.
 | |
| <H2> BUGS </H2>
 | |
| Classes are optimized only for bigram models at present.
 | |
| <H2> AUTHOR </H2>
 | |
| Andreas Stolcke &lt;stolcke@icsi.berkeley.edu&gt;,
 | |
| Seppo Enarvi &lt;seppo.enarvi@aalto.fi&gt;
 | |
| <BR>
 | |
| Copyright (c) 1999-2010 SRI International
 | |
| <BR>
 | |
| Copyright (c) 2013-2014 Seppo Enarvi
 | |
| <BR>
 | |
| Copyright (c) 2011-2014 Andreas Stolcke
 | |
| <BR>
 | |
| Copyright (c) 2012-2014 Microsoft Corp.
 | |
| </BODY>
 | |
| </HTML>
 | 
