<!-- $Id: multi-ngram.1,v 1.5 2019/09/09 22:35:36 stolcke Exp $ -->
<HTML>
<HEAD>
<TITLE>multi-ngram</TITLE>
</HEAD>
<BODY>
<H1>multi-ngram</H1>
<H2> NAME </H2>
multi-ngram - build multiword N-gram models
<H2> SYNOPSIS </H2>
<PRE>
<B>multi-ngram</B> [ <B>-help</B> ] <I>option</I> ...
</PRE>
<H2> DESCRIPTION </H2>
<B> multi-ngram </B>
builds N-gram language models that contain multiwords, i.e., compound words
formed by concatenating words from a given reference model.
It optionally generates multiword N-grams and inserts them into
an existing reference N-gram model, so as to cover the multiwords occurring
in a specified vocabulary.
It then assigns probabilities to the multiword N-grams such that word
strings containing multiwords have the same probabilities as the
corresponding strings of component words in the reference model.
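<P>
For illustration, a hypothetical invocation that builds a multiword trigram
model from a reference trigram model could look as follows; the file names
are purely illustrative, and the vocabulary file is assumed to list the
regular words together with the desired multiwords (e.g. "going_to"):
<PRE>
	# generate multiword N-grams and add them to the reference model
	# (illustrative file names)
	multi-ngram -order 3 -multi-order 3 \
		-vocab multiwords.vocab \
		-lm ref.3bo.gz \
		-write-lm multi.3bo.gz
</PRE>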
<P>
Note that the inverse operation (expanding a multiword N-gram to contain
only regular words) is subsumed by the
<B> ngram -expand-classes </B>
function.
<H2> OPTIONS </H2>
Each filename argument can be an ASCII file, or a
compressed file (name ending in .Z or .gz), or ``-'' to indicate
stdin/stdout.
<DL>
<DT><B> -help </B>
<DD>
Print option summary.
<DT><B> -version </B>
<DD>
Print version information.
<DT><B>-order</B><I> n</I>
<DD>
Set the maximal N-gram order to be used from the reference model.
NOTE: The order of the model is not set automatically when a model
file is read, so the same file can be used at various orders.
To use models of order higher than 3 it is always necessary to specify this
option.
<DT><B>-multi-order</B><I> n</I>
<DD>
The maximal N-gram order in the multiword-based model.
<DT><B>-debug</B><I> level</I>
<DD>
Set the debugging output level (0 means no debugging output).
<DT><B>-vocab</B><I> file</I>
<DD>
File listing the words to be added to the model.
In particular, this should include all the multiwords to be added.
<DT><B>-multi-char</B><I> C</I>
<DD>
Character used to delimit component words in multiwords
(an underscore character by default).
<DT><B>-lm</B><I> file</I>
<DD>
Reference N-gram model.
<DT><B>-multi-lm</B><I> file</I>
<DD>
Model containing multiwords; the N-grams in this model will be assigned
new probabilities based on the reference model.
If this option is
<I> not </I>
given, the multiword model is generated by adding multiword
N-grams to the reference model
(see the examples following this option list).
<DT><B> -prune-unseen-ngrams </B>
<DD>
This option prevents the insertion of multiword N-grams whose component
N-grams are not contained in the reference model.
For example, for a multiword bigram "a_b c_d" to be inserted, a trigram
reference model must contain the trigrams "a b c" and "b c d".
If the reference model were a bigram LM, it would have to contain
"a b", "b c", and "c d".
This option is important for controlling the size of the multiword LM for
large vocabularies (see the examples following this list).
<DT><B>-write-lm</B><I> file</I>
<DD>
Output location of the generated multiword model.
</DL>
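<P>
Two further hypothetical invocations (file names are again purely
illustrative): the first restricts insertion to multiword N-grams whose
component N-grams are covered by the reference model, and the second
reassigns probabilities to an already existing multiword model instead of
generating one:
<PRE>
	# generation mode, pruning multiword N-grams unseen in the reference model
	# (illustrative file names)
	multi-ngram -order 3 -multi-order 3 \
		-vocab multiwords.vocab \
		-lm ref.3bo.gz \
		-prune-unseen-ngrams \
		-write-lm multi.3bo.gz

	# rescoring mode: reassign probabilities in an existing multiword LM
	multi-ngram -order 3 -multi-order 3 \
		-lm ref.3bo.gz \
		-multi-lm old-multi.3bo.gz \
		-write-lm new-multi.3bo.gz
</PRE>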
<H2> SEE ALSO </H2>
<A HREF="ngram.1.html">ngram(1)</A>, <A HREF="ngram-format.5.html">ngram-format(5)</A>.
<H2> BUGS </H2>
This program is a hack for cases where the original training data is
not available and a multiword model has to be generated from an existing
model.
<BR>
The resulting model is no longer properly normalized, since the
same word string can potentially be represented with or without multiwords.
<BR>
The generation of multiword N-grams uses a heuristic algorithm that
works well for bigrams and trigrams, but is not exhaustive.
<H2> AUTHOR </H2>
Andreas Stolcke &lt;stolcke@icsi.berkeley.edu&gt;
<BR>
Copyright (c) 2000-2004 SRI International
</BODY>
</HTML>