|   | <! $Id: multi-ngram.1,v 1.5 2019/09/09 22:35:36 stolcke Exp $> | ||
|  | <HTML> | ||
|  | <HEADER> | ||
|  | <TITLE>multi-ngram</TITLE> | ||
|  | <BODY> | ||
|  | <H1>multi-ngram</H1> | ||
|  | <H2> NAME </H2> | ||
|  | multi-ngram - build multiword N-gram models | ||
|  | <H2> SYNOPSIS </H2> | ||
|  | <PRE> | ||
|  | <B>multi-ngram</B> [ <B>-help</B> ] <I>option</I> ... | ||
|  | </PRE> | ||
|  | <H2> DESCRIPTION </H2> | ||
|  | <B> multi-ngram </B> | ||
|  | builds N-gram language models that contain multiwords, i.e., compound words | ||
|  | that are a concatenation of words from some prior given model. | ||
|  | It will optionally generate multiword N-grams and insert them into | ||
|  | an existing reference N-gram model, so as to cover multiwords occurring | ||
|  | in a specified vocabulary. | ||
|  | It will then assign probabilities to the multiword N-grams so that word | ||
|  | strings containing multiwords have the same probabilities as the strings | ||
|  | of component words in the reference model. | ||
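<P>
The equivalence stated above can be illustrated with a minimal Python sketch.
It is not the program's implementation: it ignores backoff weights entirely,
and the toy bigram table <I>REF_BIGRAM</I>, its probability values, and the
helper names are assumptions made up for this example.
It only shows the chain-rule idea that a multiword is scored as the product
of the reference probabilities of its component words.

```python
import math

# Hypothetical reference bigram model: P(word | previous word).
# The values are invented for illustration only.
REF_BIGRAM = {
    ("<s>", "new"): 0.2,
    ("new", "york"): 0.5,
    ("york", "city"): 0.4,
}


def split_multiword(token, multi_char="_"):
    """Split a token into component words (underscore is the documented default)."""
    return token.split(multi_char)


def multiword_prob(token, prev_token, bigram=REF_BIGRAM, multi_char="_"):
    """P(token | prev_token) as a chain-rule product over component words.

    Only the last component of prev_token matters for a bigram reference model.
    """
    prev = split_multiword(prev_token, multi_char)[-1]
    prob = 1.0
    for word in split_multiword(token, multi_char):
        prob *= bigram[(prev, word)]
        prev = word
    return prob


def sentence_prob(tokens, bigram=REF_BIGRAM, multi_char="_"):
    """Probability of a token sequence that may contain multiwords."""
    prob = 1.0
    prev = "<s>"
    for token in tokens:
        prob *= multiword_prob(token, prev, bigram, multi_char)
        prev = token
    return prob


with_multiword = sentence_prob(["new_york", "city"])
without_multiword = sentence_prob(["new", "york", "city"])
# Both segmentations of the same word string receive the same probability.
print(math.isclose(with_multiword, without_multiword))
```

In this sketch the multiword sequence and the plain word sequence receive the
same score by construction, which is the property the description asks of the
multiword model.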
|  | <P> | ||
|  | Note that the inverse operation (expanding a multiword N-gram to contain | ||
|  | only regular words) is subsumed by the  | ||
|  | <B> ngram -expand-classes </B> | ||
|  | function. | ||
|  | <H2> OPTIONS </H2> | ||
|  | Each filename argument can be an ASCII file, or a  | ||
|  | compressed file (name ending in .Z or .gz), or "-" to indicate | ||
|  | stdin/stdout. | ||
|  | <DL> | ||
|  | <DT><B> -help </B> | ||
|  | <DD> | ||
|  | Print option summary. | ||
|  | <DT><B> -version </B> | ||
|  | <DD> | ||
|  | Print version information. | ||
|  | <DT><B>-order</B><I> n</I> | ||
|  | <DD> | ||
|  | Set the maximal N-gram order to be used from the reference model. | ||
|  | NOTE: The order of the model is not set automatically when a model | ||
|  | file is read, so the same file can be used at various orders. | ||
|  | To use models of order higher than 3 it is always necessary to specify this | ||
|  | option. | ||
|  | <DT><B>-multi-order</B><I> n</I> | ||
|  | <DD> | ||
|  | The maximal N-gram order in the multiword-based model. | ||
|  | <DT><B>-debug</B><I> level</I> | ||
|  | <DD> | ||
|  | Set the debugging output level (0 means no debugging output). | ||
|  | <DT><B>-vocab</B><I> file</I> | ||
|  | <DD> | ||
|  | Words to be added to the model. | ||
|  | In particular, this should include all the multiwords to be added. | ||
|  | <DT><B>-multi-char</B><I> C</I> | ||
|  | <DD> | ||
|  | Character used to delimit component words in multiwords | ||
|  | (an underscore character by default). | ||
|  | <DT><B>-lm</B><I> file</I> | ||
|  | <DD> | ||
|  | Reference N-gram model. | ||
|  | <DT><B>-multi-lm</B><I> file</I> | ||
|  | <DD> | ||
|  | Model containing multiwords; the N-grams in this model will be assigned | ||
|  | new probabilities based on the reference model. | ||
|  | If this option is  | ||
|  | <I> not </I> | ||
|  | given then the multiword model will be generated by adding multiword | ||
|  | N-grams to the reference model. | ||
|  | <DT><B> -prune-unseen-ngrams </B> | ||
|  | <DD> | ||
|  | This option prevents the insertion of multiword N-grams whose component | ||
|  | N-grams are not contained in the reference model. | ||
|  | For example, for a multiword bigram "a_b c_d" to be inserted, a trigram | ||
|  | reference model must contain the trigrams "a b c" and "b c d". | ||
|  | If the reference model were a bigram LM, it would have to contain | ||
|  | "a b", "b c", and "c d". | ||
|  | This option is important to control the size of the multiword LM for | ||
|  | large vocabularies. | ||
|  | <DT><B>-write-lm</B><I> file</I> | ||
|  | <DD> | ||
|  | Output location of the generated multiword model. | ||
|  | </DD> | ||
|  | </DL> | ||
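<P>
The check described under <B>-prune-unseen-ngrams</B> can be sketched in Python as follows.
This is an illustration derived from the "a_b c_d" example only, not the
program's actual code; the function names are hypothetical.
The handling of expansions shorter than the reference order (a single N-gram
covering all component words) is an assumption that the documentation does not
confirm.

```python
def component_ngrams(multiword_ngram, order, multi_char="_"):
    """List the reference N-grams that a multiword N-gram is built from.

    The multiword N-gram is expanded into its component words, and every
    window of `order` consecutive words is returned. If the expansion has
    fewer than `order` words, the whole expansion is returned as one N-gram
    (an assumption, see the text above).
    """
    words = []
    for token in multiword_ngram.split():
        words.extend(token.split(multi_char))
    n = min(order, len(words))
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def would_insert(multiword_ngram, reference_ngrams, order, multi_char="_"):
    """True if every component N-gram is present in the reference model."""
    return all(
        ngram in reference_ngrams
        for ngram in component_ngrams(multiword_ngram, order, multi_char)
    )


# Trigram reference model: both "a b c" and "b c d" are required.
print(component_ngrams("a_b c_d", 3))
# Bigram reference model: "a b", "b c", and "c d" are required.
print(component_ngrams("a_b c_d", 2))
# Hypothetical reference model that lacks "b c d", so the N-gram is pruned.
print(would_insert("a_b c_d", {"a b c"}, 3))
```

The two expansions reproduce the trigram and bigram cases given in the option
description.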
|  | <H2> SEE ALSO </H2> | ||
|  | <A HREF="ngram.1.html">ngram(1)</A>, <A HREF="ngram-format.5.html">ngram-format(5)</A>. | ||
|  | <H2> BUGS </H2> | ||
|  | This program is a hack for cases where the original training data is | ||
|  | not available and a multiword model has to be generated from an existing | ||
|  | model. | ||
|  | <BR> | ||
|  | The resulting model is no longer properly normalized, since the  | ||
|  | same word string can potentially be represented with or without multiwords. | ||
|  | <BR> | ||
|  | The generation of multiword N-grams uses a heuristic algorithm that  | ||
|  | works well for bigrams and trigrams, but is not exhaustive. | ||
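<P>
The normalization problem can be made concrete with a short Python sketch that
enumerates every way a word string can be written using a given multiword
vocabulary.
The function name and the example vocabulary are hypothetical.
Because each segmentation is assigned the same probability as the plain word
string, a string with more than one segmentation receives that probability
more than once.

```python
def segmentations(words, multiwords, multi_char="_"):
    """Return all token sequences that spell out `words`.

    A run of consecutive words may be merged into one token only if the
    joined form appears in the multiword vocabulary `multiwords`.
    """
    if not words:
        return [[]]
    results = []
    for end in range(1, len(words) + 1):
        token = multi_char.join(words[:end])
        # A single word is always allowed; longer runs must be known multiwords.
        if end == 1 or token in multiwords:
            for rest in segmentations(words[end:], multiwords, multi_char):
                results.append([token] + rest)
    return results


# Hypothetical vocabulary containing one multiword.
for tokens in segmentations(["new", "york", "city"], {"new_york"}):
    print(" ".join(tokens))
```

With the single multiword "new_york" in the vocabulary, the three-word string
has two representations, so its probability mass is counted twice in the
multiword model.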
|  | <H2> AUTHOR </H2> | ||
|  | Andreas Stolcke &lt;stolcke@icsi.berkeley.edu&gt; | ||
|  | <BR> | ||
|  | Copyright (c) 2000-2004 SRI International | ||
|  | </BODY> | ||
|  | </HTML> |