<!-- $Id: lm-scripts.1,v 1.9 2019/09/09 22:35:36 stolcke Exp $ -->
<HTML>
<HEAD>
<TITLE>lm-scripts</TITLE>
</HEAD>
<BODY>
<H1>lm-scripts</H1>
<H2> NAME </H2>
lm-scripts, add-dummy-bows, change-lm-vocab, empty-sentence-lm, get-unigram-probs, make-hiddens-lm, make-lm-subset, make-sub-lm, remove-lowprob-ngrams, reverse-lm, sort-lm - manipulate N-gram language models
<H2> SYNOPSIS </H2>
<PRE>
<B>add-dummy-bows</B> [ <I>lm-file</I> ] <B>&gt;</B> <I>new-lm-file</I>
<B>change-lm-vocab</B> <B>-vocab</B> <I>vocab</I> <B>-lm</B> <I>lm-file</I> <B>-write-lm</B> <I>new-lm-file</I> \
	[ <B>-tolower</B> ] [ <B>-subset</B> ] [ <I>ngram-options</I> ... ]
<B>empty-sentence-lm</B> <B>-prob</B> <I>p</I> <B>-lm</B> <I>lm-file</I> <B>-write-lm</B> <I>new-lm-file</I> \
	[ <I>ngram-options</I> ... ]
<B>get-unigram-probs</B> [ <B>linear=1</B> ] [ <I>lm-file</I> ]
<B>make-hiddens-lm</B> [ <I>lm-file</I> ] <B>&gt;</B> <I>hiddens-lm-file</I>
<B>make-lm-subset</B> <I>count-file</I>|<B>-</B> [ <I>lm-file</I> |<B>-</B> ] <B>&gt;</B> <I>new-lm-file</I>
<B>make-sub-lm</B> [ <B>maxorder=</B><I>N</I> ] [ <I>lm-file</I> ] <B>&gt;</B> <I>new-lm-file</I>
<B>remove-lowprob-ngrams</B> [ <I>lm-file</I> ] <B>&gt;</B> <I>new-lm-file</I>
<B>reverse-lm</B> [ <I>lm-file</I> ] <B>&gt;</B> <I>new-lm-file</I>
<B>sort-lm</B> [ <I>lm-file</I> ] <B>&gt;</B> <I>sorted-lm-file</I>
</PRE>
<H2> DESCRIPTION </H2>
These scripts perform various useful manipulations on N-gram language models
in their textual representation.
Most operate on backoff N-gram models in ARPA
<A HREF="ngram-format.5.html">ngram-format(5)</A>.
<P>
Because these tools are implemented as scripts, they do not automatically
handle compressed model files, unlike the main SRILM tools.
However, since most of them read from standard input or write to
standard output (when the file argument is left out or specified
as ``-''), it is easy to combine them with
<A HREF="gunzip.1.html">gunzip(1)</A>
or
<A HREF="gzip.1.html">gzip(1)</A>
on the command line.
<P>
Also note that many of the scripts take their options in the
<A HREF="gawk.1.html">gawk(1)</A>
syntax
<I>option</I><B>=</B><I>value</I>
instead of the more common
<B>-</B><I>option</I> <I>value</I>.
<P>
<B> add-dummy-bows </B>
adds dummy backoff weights to N-grams, even where they
are not required, to satisfy some broken software that expects
backoff weights on all N-grams (except those of the highest order).
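For illustration only (the actual script is a small gawk program), the effect can be sketched in Python over a list of ARPA-format lines; the function name and the toy model in the test are hypothetical stand-ins:

```python
def add_dummy_bows(arpa_lines):
    """Append a dummy backoff weight (log10 weight 0, i.e. weight 1) to
    every N-gram line that lacks one, except in the highest-order section."""
    # The \data\ header tells us the highest order present.
    max_order = max((int(l.split("=")[0].split()[1])
                     for l in arpa_lines if l.startswith("ngram ")), default=0)
    out, order = [], 0
    for line in arpa_lines:
        s = line.strip()
        if s.startswith("\\") and s.endswith("-grams:"):
            order = int(s[1:s.index("-")])   # entering a \N-grams: section
        elif 0 < order < max_order:
            fields = s.split()
            # log prob + N words and no backoff weight -> add the dummy one
            if len(fields) == 1 + order:
                line = line + "\t0"
        out.append(line)
    return out
```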
<P>
<B> change-lm-vocab </B>
modifies the vocabulary of an LM to be that in
<I>vocab</I>.
Any N-grams containing out-of-vocabulary words are removed,
new words receive a unigram probability, and the model
is renormalized.
The
<B> -tolower </B>
option causes case distinctions to be ignored.
<B> -subset </B>
only removes words from the LM vocabulary, without adding any.
Any remaining
<I> ngram-options </I>
are passed to
<A HREF="ngram.1.html">ngram(1)</A>,
and can be used to set the debugging level, N-gram order, etc.
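The out-of-vocabulary pruning step alone can be sketched as follows (a Python stand-in, not the script itself; adding unigrams for new words and renormalizing are left to ngram):

```python
def prune_oov_ngrams(arpa_lines, vocab):
    """Drop every N-gram line containing a word outside vocab;
    all header and separator lines pass through unchanged."""
    out, order = [], 0
    for line in arpa_lines:
        s = line.strip()
        if s.startswith("\\") and s.endswith("-grams:"):
            order = int(s[1:s.index("-")])
        elif order:
            fields = s.split()
            if len(fields) >= 1 + order and \
               not all(w in vocab for w in fields[1:1 + order]):
                continue  # an out-of-vocabulary word occurs in this N-gram
        out.append(line)
    return out
```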
<P>
<B> empty-sentence-lm </B>
modifies an LM so that it allows the empty sentence with
probability
<I>p</I>.
This is useful for modifying existing LMs that were trained on non-empty
sentences only.
<I> ngram-options </I>
are passed to
<A HREF="ngram.1.html">ngram(1)</A>,
and can be used to set the debugging level, N-gram order, etc.
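The intended distribution is easy to state: the empty sentence receives probability <I>p</I>, and every other sentence is scaled by 1&nbsp;-&nbsp;<I>p</I>. A minimal sketch of that arithmetic (illustrative of the result, not of the script's internals):

```python
import math

def modified_sentence_logprob(orig_logprob, p, empty=False):
    """Log10 sentence probability under the modified LM: the empty
    sentence gets probability p, all others are scaled by (1 - p)."""
    if empty:
        return math.log10(p)
    return math.log10(1.0 - p) + orig_logprob
```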
<P>
<B> make-hiddens-lm </B>
constructs an N-gram model that can be used with the
<B> ngram -hiddens </B>
option.
The new model contains intra-utterance sentence boundary
tags ``&lt;#s&gt;'' with the same probability as the original model's
final sentence tags &lt;/s&gt;.
Also, utterance-initial words are not conditioned on &lt;s&gt;, and
there is no penalty associated with the utterance-final &lt;/s&gt;.
Such a model might work better if the test corpus is segmented
at places other than proper &lt;s&gt; boundaries.
<P>
<B> make-lm-subset </B>
forms a new LM containing only the N-grams found in the
<I>count-file</I>,
in
<A HREF="ngram-count.1.html">ngram-count(1)</A>
format.
The result still needs to be renormalized with
<B> ngram -renorm </B>
(which will also adjust the N-gram counts in the header).
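The filtering step can be sketched like this (a hypothetical Python stand-in; as noted above, the output still needs <B>ngram -renorm</B>):

```python
def lm_subset(arpa_lines, count_lines):
    """Keep only N-grams that appear in an ngram-count(1) style file
    ("w1 w2 ... count" per line); everything else passes through."""
    keep = {" ".join(l.split()[:-1]) for l in count_lines if l.split()}
    out, order = [], 0
    for line in arpa_lines:
        s = line.strip()
        if s.startswith("\\") and s.endswith("-grams:"):
            order = int(s[1:s.index("-")])
        elif order:
            fields = s.split()
            if len(fields) >= 1 + order and \
               " ".join(fields[1:1 + order]) not in keep:
                continue  # this N-gram is not in the count file
        out.append(line)
    return out
```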
<P>
<B> make-sub-lm </B>
removes N-grams of order exceeding
<I>N</I>.
This function is now redundant, since
all SRILM tools can do this implicitly (without extra memory
and with very little time overhead) when reading N-gram models
with the appropriate
<B> -order </B>
parameter.
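The truncation amounts to dropping the higher-order sections and their header entries; a simplified Python sketch (it does not clean up backoff weights left on what becomes the highest order):

```python
def sub_lm(arpa_lines, max_order):
    """Drop all N-gram sections of order exceeding max_order, and the
    corresponding "ngram N=..." entries in the \\data\\ header."""
    out, skip = [], False
    for line in arpa_lines:
        s = line.strip()
        if s.startswith("\\") and s.endswith("-grams:"):
            skip = int(s[1:s.index("-")]) > max_order
        elif s == "\\end\\":
            skip = False
        if skip:
            continue
        if s.startswith("ngram ") and \
           int(s.split("=")[0].split()[1]) > max_order:
            continue  # header entry for a removed order
        out.append(line)
    return out
```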
<P>
<B> remove-lowprob-ngrams </B>
eliminates N-grams whose probability is lower than that which they
would receive through backoff.
This is useful when building finite-state networks for N-gram
models.
However, this function is now performed much faster by
<A HREF="ngram.1.html">ngram(1)</A>
with the
<B> -prune-lowprobs </B>
option.
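The criterion can be illustrated on a toy dict-based model (log10 throughout; this is a sketch of the condition, not the script's code): an explicit N-gram is dropped when its stored log prob is below bow(context) + log&nbsp;P(word&nbsp;|&nbsp;shortened context).

```python
def remove_lowprob(ngrams, bows):
    """ngrams maps word tuples to explicit log10 probs; bows maps context
    tuples to log10 backoff weights (a missing context has log weight 0).
    Drop N-grams whose explicit prob is below the backoff estimate."""
    def logp(gram):
        if gram in ngrams:
            return ngrams[gram]
        # back off: bow of the shortened context plus the shorter N-gram
        return bows.get(gram[:-1], 0.0) + logp(gram[1:])
    kept = {}
    for gram, lp in ngrams.items():
        if len(gram) == 1:
            kept[gram] = lp          # unigrams have no backoff path
        elif lp >= bows.get(gram[:-1], 0.0) + logp(gram[1:]):
            kept[gram] = lp
    return kept
```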
<P>
<B> reverse-lm </B>
produces a new LM that assigns each sentence the same probability
that the input model assigns to its reversal.
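For a bigram model this amounts to Bayes inversion of the conditionals (a sketch of the underlying identity, not the script): P(w1&nbsp;|&nbsp;w2) in the reversed model equals P(w2&nbsp;|&nbsp;w1)&nbsp;P(w1)&nbsp;/&nbsp;P(w2), with the sentence-boundary tags swapped. In log10 terms:

```python
def reversed_bigram_logprob(logp_w2_given_w1, logp_w1, logp_w2):
    """log10 P_rev(w1 | w2) = log10 P(w2 | w1) + log10 P(w1) - log10 P(w2).
    With consistent unigram marginals the product over a sentence
    telescopes to the original model's probability of the reversal;
    inconsistent marginals yield improper estimates (see BUGS)."""
    return logp_w2_given_w1 + logp_w1 - logp_w2
```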
<P>
<B> sort-lm </B>
sorts the N-grams in an LM in lexicographic order (left-most words being
the most significant).
This is not a requirement for SRILM, but might be necessary for some
other LM software.
(The LMs output by SRILM are sorted somewhat differently, reflecting
the internal data structures used; that is also the order that should give
the best cache utilization when using SRILM to read models.)
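A Python sketch of the sort (illustrative only): within each section, lines are ordered by their word fields, left-most word most significant, while headers and counts pass through unchanged.

```python
def sort_lm(arpa_lines):
    """Sort the N-gram lines of each \\N-grams: section lexicographically
    by their words (the fields after the log prob)."""
    out, section = [], []
    def flush():
        section.sort(key=lambda l: l.split()[1:])
        out.extend(section)
        section.clear()
    for line in arpa_lines:
        s = line.strip()
        if not s or s.startswith("\\") or s.startswith("ngram "):
            flush()                  # section boundary: emit sorted lines
            out.append(line)
        else:
            section.append(line)
    flush()
    return out
```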
<P>
<B> get-unigram-probs </B>
extracts the unigram probabilities in a simple table format
from a backoff language model.
The
<B> linear=1 </B>
option causes probabilities to be output on a linear (instead of log) scale.
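The extraction can be sketched in a few lines of Python (a hypothetical stand-in for the gawk script; the <B>linear=1</B> behavior is modeled by a keyword argument):

```python
def unigram_probs(arpa_lines, linear=False):
    """Extract {word: prob} from the \\1-grams: section of an ARPA model.
    Probabilities are log10 unless linear=True."""
    probs, in_unigrams = {}, False
    for line in arpa_lines:
        s = line.strip()
        if s.startswith("\\"):
            in_unigrams = (s == "\\1-grams:")
        elif in_unigrams and s:
            fields = s.split()       # log prob, word, optional backoff weight
            logp = float(fields[0])
            probs[fields[1]] = 10.0 ** logp if linear else logp
    return probs
```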
<H2> SEE ALSO </H2>
<A HREF="ngram-format.5.html">ngram-format(5)</A>, <A HREF="ngram.1.html">ngram(1)</A>.
<H2> BUGS </H2>
These are quick-and-dirty scripts, what do you expect?
<BR>
<B> reverse-lm </B>
supports only bigram LMs, and can produce improper probability estimates
as a result of inconsistent marginals in the input model.
<H2> AUTHOR </H2>
Andreas Stolcke &lt;stolcke@icsi.berkeley.edu&gt;
<BR>
Copyright (c) 1995-2006 SRI International
</BODY>
</HTML>