<!-- $Id: ngram-class.1,v 1.10 2019/09/09 22:35:37 stolcke Exp $ -->
<HTML>
<HEAD>
<TITLE>ngram-class</TITLE>
</HEAD>
<BODY>
<H1>ngram-class</H1>
<H2> NAME </H2>
ngram-class - induce word classes from N-gram statistics
<H2> SYNOPSIS </H2>
<PRE>
<B>ngram-class</B> [ <B>-help</B> ] <I>option</I> ...
</PRE>
<H2> DESCRIPTION </H2>
<B>ngram-class</B>
induces word classes from distributional statistics,
so as to minimize the perplexity of a class-based N-gram model
given the provided word N-gram counts.
Presently, only bigram statistics are used, i.e., the induced classes
are best suited for a class-bigram language model.
<P>
The program generates the class N-gram counts and class expansions
needed by
<A HREF="ngram-count.1.html">ngram-count(1)</A>
and
<A HREF="ngram.1.html">ngram(1)</A>,
respectively, to train and to apply the class N-gram model.
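<P>
A typical workflow (all file names are illustrative, and smoothing or other
estimation options are omitted) induces the classes, trains a class bigram
model from the class counts with
<A HREF="ngram-count.1.html">ngram-count(1)</A>,
and applies it with
<A HREF="ngram.1.html">ngram(1)</A>:
<PRE>
	ngram-class -text corpus.txt -numclasses 200 \
		-class-counts class.counts -classes classes.txt
	ngram-count -order 2 -read class.counts -lm class.2bo
	ngram -order 2 -lm class.2bo -classes classes.txt -ppl test.txt
</PRE>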
<H2> OPTIONS </H2>
<P>
Each filename argument can be an ASCII file, or a
compressed file (name ending in .Z or .gz), or ``-'' to indicate
stdin/stdout.
<DL>
<DT><B> -help </B>
<DD>
Print option summary.
<DT><B> -version </B>
<DD>
Print version information.
<DT><B>-debug</B><I> level</I>
<DD>
Set debugging output at
<I>level</I>.
Level 0 means no debugging.
Debugging messages are written to stderr.
A useful level to trace the formation of classes is 2.
</DD>
</DL>
<H3> Input Options </H3>
<DL>
<DT><B>-vocab</B><I> file</I>
<DD>
Read a vocabulary from
<I>file</I>.
Subsequently, out-of-vocabulary words in counts or text input are
replaced with the unknown-word token.
If this option is not specified, all words found are implicitly added
to the vocabulary.
<DT><B> -tolower </B>
<DD>
Map the vocabulary to lowercase.
<DT><B>-counts</B><I> file</I>
<DD>
Read N-gram counts from
<I>file</I>.
Each line contains an N-gram of
words, followed by an integer count, all separated by whitespace.
Repeated counts for the same N-gram are added.
Counts collected by
<B>-text</B>
and
<B>-counts</B>
are additive as well.
<BR>
Note that the input should contain consistent lower- and higher-order
counts (i.e., unigrams and bigrams), as would be generated by
<A HREF="ngram-count.1.html">ngram-count(1)</A>;
see the example following this option list.
<DT><B>-text</B><I> textfile</I>
<DD>
Generate N-gram counts from
<I>textfile</I>,
which should contain one sentence unit per line.
Begin/end sentence tokens are added if not already present.
Empty lines are ignored.
</DD>
</DL>
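<P>
For illustration, a counts file suitable for
<B>-counts</B>
contains matching unigram and bigram counts, one N-gram and its count per line
(the words and counts below are made up):
<PRE>
	the	1021
	the	cat	12
	cat	47
	cat	sat	9
	sat	35
</PRE>
Such a file can be produced from raw text with
<A HREF="ngram-count.1.html">ngram-count(1)</A>
and then passed to
<B>ngram-class</B>
(file names are illustrative):
<PRE>
	ngram-count -order 2 -text corpus.txt -write word.counts
	ngram-class -counts word.counts -numclasses 200 -classes classes.txt
</PRE>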
<H3> Class Merging </H3>
<DL>
<DT><B>-numclasses</B><I> C</I>
<DD>
The target number of classes to induce.
A zero argument suppresses automatic class merging altogether
(e.g., for use with
<B>-interact</B>).
<DT><B> -full </B>
<DD>
Perform full greedy merging over all classes, starting with one class per
word.
This is the O(V^3) algorithm described in Brown et al. (1992).
<DT><B> -incremental </B>
<DD>
Perform incremental greedy merging, starting with
one class each for the
<I>C</I>
most frequent words, and then adding one word at a time.
This is the O(V*C^2) algorithm described in Brown et al. (1992);
it is the default.
<DT><B>-maxwordsperclass</B><I> M</I>
<DD>
Limits the number of words in a class to
<I>M</I>
in incremental merging.
By default there is no such limit.
<DT><B> -interact </B>
<DD>
Enter a primitive interactive interface when done with automatic class
induction, allowing manual specification of additional merging steps.
<DT><B>-noclass-vocab</B><I> file</I>
<DD>
Read a list of vocabulary items from
<I>file</I>
that are to be excluded from classes.
These words or tags do not undergo class merging, but their
N-gram counts still affect the optimization of model perplexity.
<BR>
The default is to exclude the sentence begin/end tags (&lt;s&gt; and &lt;/s&gt;)
from class merging; this can be suppressed by specifying
<B>-noclass-vocab /dev/null</B>.
<DT><B>-read</B><I> file</I>
<DD>
Read initial class memberships from
<I>file</I>.
Class memberships need to be stored in
<A HREF="classes-format.5.html">classes-format(5)</A>,
with the additional conditions that probabilities are obligatory
and that each membership definition must specify exactly one word;
see the sketch following this option list.
</DD>
</DL>
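<P>
As a sketch (class names, probabilities, and file names are made up), an
initial membership file for
<B>-read</B>
lists one word per line with an explicit probability, following
<A HREF="classes-format.5.html">classes-format(5)</A>:
<PRE>
	CLASS1	0.5	monday
	CLASS1	0.5	tuesday
	CLASS2	1.0	red
</PRE>
Such a file can be used to seed or restart class merging:
<PRE>
	ngram-class -counts word.counts -read init.classes \
		-numclasses 100 -classes classes.txt
</PRE>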
<H3> Output Options </H3>
<DL>
<DT><B>-class-counts</B><I> file</I>
<DD>
Write class N-gram counts to
<I>file</I>
when done.
The format is the same as for word N-gram counts, and can be
read by
<A HREF="ngram-count.1.html">ngram-count(1)</A>
to estimate a class N-gram model.
<DT><B>-classes</B><I> file</I>
<DD>
Write class definitions (member words and their probabilities) to
<I>file</I>
when done.
The output format is the same as required by the
<B>-classes</B>
option of
<A HREF="ngram.1.html">ngram(1)</A>.
<DT><B>-save</B><I> S</I>
<DD>
Save the class counts and/or class definitions every
<I>S</I>
iterations during induction.
The filenames are obtained from the
<B>-class-counts</B>
and
<B>-classes</B>
options, respectively, by appending the iteration number.
This is convenient for producing sets of classes at different granularities
during the same run (see the example following this option list).
The saved class memberships can also be used with the
<B>-read</B>
option to restart class merging at a later time.
<I>S</I>=0
(the default) suppresses the saving actions.
<DT><B>-save-maxclasses</B><I> K</I>
<DD>
Modifies the action of
<B>-save</B>
so as to only start saving once the number of classes reaches
<I>K</I>.
(The iteration numbers embedded in filenames will start at 0 from that point.)
</DD>
</DL>
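<P>
For instance (file names are illustrative), the following full merging run
additionally saves intermediate class definitions and counts every 25
iterations, starting once the number of classes has dropped to 500:
<PRE>
	ngram-class -counts word.counts -full -numclasses 100 \
		-class-counts class.counts -classes classes.txt \
		-save 25 -save-maxclasses 500
</PRE>
The intermediate files are named after the
<B>-class-counts</B>
and
<B>-classes</B>
arguments with the iteration number appended.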
<H2> SEE ALSO </H2>
<A HREF="ngram-count.1.html">ngram-count(1)</A>, <A HREF="ngram.1.html">ngram(1)</A>, <A HREF="classes-format.5.html">classes-format(5)</A>.
<BR>
P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai and R. L. Mercer,
``Class-Based n-gram Models of Natural Language,''
<I>Computational Linguistics</I> 18(4), 467-479, 1992.
<H2> BUGS </H2>
Classes are optimized only for bigram models at present.
<H2> AUTHOR </H2>
Andreas Stolcke &lt;stolcke@icsi.berkeley.edu&gt;,
Seppo Enarvi &lt;seppo.enarvi@aalto.fi&gt;
<BR>
Copyright (c) 1999-2010 SRI International
<BR>
Copyright (c) 2013-2014 Seppo Enarvi
<BR>
Copyright (c) 2011-2014 Andreas Stolcke
<BR>
Copyright (c) 2012-2014 Microsoft Corp.
</BODY>
</HTML>