b2txt25/language_model/srilm-1.7.3/man/html/multi-ngram.1.html

<! $Id: multi-ngram.1,v 1.5 2019/09/09 22:35:36 stolcke Exp $>
<HTML>
<HEADER>
<TITLE>multi-ngram</TITLE>
<BODY>
<H1>multi-ngram</H1>
<H2> NAME </H2>
multi-ngram - build multiword N-gram models
<H2> SYNOPSIS </H2>
<PRE>
<B>multi-ngram</B> [ <B>-help</B> ] <I>option</I> ...
</PRE>
<H2> DESCRIPTION </H2>
<B> multi-ngram </B>
builds N-gram language models that contain multiwords, i.e., compound words
that are a concatenation of words from some prior given model.
It will optionally generate multiword N-grams and insert them into
an existing, reference N-gram model, so as to cover multiwords occuring 
in a specified vocabulary.
It will then assign probabilities to the multiword N-grams so that word
strings containing multiwords have the same probabilities as the strings
of component words in the reference model.
<P>
Note that the inverse operation (expanding a multiword N-gram to contain
only regular words) is subsumed by the 
<B> ngram -expand-classes </B>
function.
<H2> OPTIONS </H2>
Each filename argument can be an ASCII file, or a 
compressed file (name ending in .Z or .gz), or ``-'' to indicate
stdin/stdout.
<DL>
<DT><B> -help </B>
<DD>
Print option summary.
<DT><B> -version </B>
<DD>
Print version information.
<DT><B>-order</B><I> n</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Set the maximal N-gram order to be used from the reference model.
NOTE: The order of the model is not set automatically when a model
file is read, so the same file can be used at various orders.
To use models of order higher than 3 it is always necessary to specify this
option.
<DT><B>-multi-order</B><I> n</I><B></B><I></I><B></B><I></I><B></B>
<DD>
The maximal N-gram order in the multiword-based model.
<DT><B>-debug</B><I> level</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Set the debugging output level (0 means no debugging output).
<DT><B>-vocab</B><I> file</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Words to be added to the model.
In particular, this should include all the multiwords to be added.
<DT><B>-multi-char</B><I> C</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Character used to delimit component words in multiwords
(an underscore character by default).
<DT><B>-lm</B><I> file</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Reference N-gram model.
<DT><B>-multi-lm</B><I> file</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Model containing multiwords; the N-grams in this model will be assigned
new probabilities based on the reference model.
If this option is 
<I> not </I>
given then the multiword model will be generated by adding multiword
N-grams to the reference model.
<DT><B> -prune-unseen-ngrams </B>
<DD>
This option prevents the insertion of multiword N-grams whose component
N-grams are not contained in the reference model.
For example, for a multiword bigram "a_b c_d" to be inserted, a trigram
reference model must contain the trigrams "a b c" and "b c d".
If the reference model were a bigram LM, it would have to contain
"a b", "b c", and "c d".
This option is important to control the size of the multiword LM for
large vocabularies.
<DT><B>-write-lm</B><I> file</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Output location of the generated multiword model.
</DD>
</DL>
<H2> SEE ALSO </H2>
<A HREF="ngram.1.html">ngram(1)</A>, <A HREF="ngram-format.5.html">ngram-format(5)</A>.
<H2> BUGS </H2>
This program is a hack for cases were the original training data is 
not available and a multiword model has to be generated from an existing
model.
<BR>
The resulting model is no longer properly normalized, since the 
same word string can potentially be represented with or without multiwords.
<BR>
The generation of multiword N-grams uses a heuristic algorithm that 
works well for bigrams and trigrams, but is not exhaustive.
<H2> AUTHOR </H2>
Andreas Stolcke &lt;stolcke@icsi.berkeley.edu&gt;
<BR>
Copyright (c) 2000-2004 SRI International
</BODY>
</HTML>
competition update 2025-07-02 12:18:09 -07:00			`<! $Id: multi-ngram.1,v 1.5 2019/09/09 22:35:36 stolcke Exp $>`
			`<HTML>`
			`<HEADER>`
			`<TITLE>multi-ngram</TITLE>`
			`<BODY>`
			`<H1>multi-ngram</H1>`
			`<H2> NAME </H2>`
			`multi-ngram - build multiword N-gram models`
			`<H2> SYNOPSIS </H2>`
			`<PRE>`
			`<B>multi-ngram</B> [ <B>-help</B> ] <I>option</I> ...`
			`</PRE>`
			`<H2> DESCRIPTION </H2>`
			`<B> multi-ngram </B>`
			`builds N-gram language models that contain multiwords, i.e., compound words`
			`that are a concatenation of words from some prior given model.`
			`It will optionally generate multiword N-grams and insert them into`
			`an existing, reference N-gram model, so as to cover multiwords occuring`
			`in a specified vocabulary.`
			`It will then assign probabilities to the multiword N-grams so that word`
			`strings containing multiwords have the same probabilities as the strings`
			`of component words in the reference model.`
			`<P>`
			`Note that the inverse operation (expanding a multiword N-gram to contain`
			`only regular words) is subsumed by the`
			`<B> ngram -expand-classes </B>`
			`function.`
			`<H2> OPTIONS </H2>`
			`Each filename argument can be an ASCII file, or a`
			compressed file (name ending in .Z or .gz), or ``-'' to indicate
			`stdin/stdout.`
			`<DL>`
			`<DT><B> -help </B>`
			`<DD>`
			`Print option summary.`
			`<DT><B> -version </B>`
			`<DD>`
			`Print version information.`
			`<DT><B>-order</B><I> n</I><B></B><I></I><B></B><I></I><B></B>`
			`<DD>`
			`Set the maximal N-gram order to be used from the reference model.`
			`NOTE: The order of the model is not set automatically when a model`
			`file is read, so the same file can be used at various orders.`
			`To use models of order higher than 3 it is always necessary to specify this`
			`option.`
			`<DT><B>-multi-order</B><I> n</I><B></B><I></I><B></B><I></I><B></B>`
			`<DD>`
			`The maximal N-gram order in the multiword-based model.`
			`<DT><B>-debug</B><I> level</I><B></B><I></I><B></B><I></I><B></B>`
			`<DD>`
			`Set the debugging output level (0 means no debugging output).`
			`<DT><B>-vocab</B><I> file</I><B></B><I></I><B></B><I></I><B></B>`
			`<DD>`
			`Words to be added to the model.`
			`In particular, this should include all the multiwords to be added.`
			`<DT><B>-multi-char</B><I> C</I><B></B><I></I><B></B><I></I><B></B>`
			`<DD>`
			`Character used to delimit component words in multiwords`
			`(an underscore character by default).`
			`<DT><B>-lm</B><I> file</I><B></B><I></I><B></B><I></I><B></B>`
			`<DD>`
			`Reference N-gram model.`
			`<DT><B>-multi-lm</B><I> file</I><B></B><I></I><B></B><I></I><B></B>`
			`<DD>`
			`Model containing multiwords; the N-grams in this model will be assigned`
			`new probabilities based on the reference model.`
			`If this option is`
			`<I> not </I>`
			`given then the multiword model will be generated by adding multiword`
			`N-grams to the reference model.`
			`<DT><B> -prune-unseen-ngrams </B>`
			`<DD>`
			`This option prevents the insertion of multiword N-grams whose component`
			`N-grams are not contained in the reference model.`
			`For example, for a multiword bigram "a_b c_d" to be inserted, a trigram`
			`reference model must contain the trigrams "a b c" and "b c d".`
			`If the reference model were a bigram LM, it would have to contain`
			`"a b", "b c", and "c d".`
			`This option is important to control the size of the multiword LM for`
			`large vocabularies.`
			`<DT><B>-write-lm</B><I> file</I><B></B><I></I><B></B><I></I><B></B>`
			`<DD>`
			`Output location of the generated multiword model.`
			`</DD>`
			`</DL>`
			`<H2> SEE ALSO </H2>`
			`<A HREF="ngram.1.html">ngram(1)</A>, <A HREF="ngram-format.5.html">ngram-format(5)</A>.`
			`<H2> BUGS </H2>`
			`This program is a hack for cases were the original training data is`
			`not available and a multiword model has to be generated from an existing`
			`model.`
			`<BR>`
			`The resulting model is no longer properly normalized, since the`
			`same word string can potentially be represented with or without multiwords.`
			`<BR>`
			`The generation of multiword N-grams uses a heuristic algorithm that`
			`works well for bigrams and trigrams, but is not exhaustive.`
			`<H2> AUTHOR </H2>`
			`Andreas Stolcke <stolcke@icsi.berkeley.edu>`
			`<BR>`
			`Copyright (c) 2000-2004 SRI International`
			`</BODY>`
			`</HTML>`