|   | <! $Id: ppl-scripts.1,v 1.9 2019/09/09 22:35:37 stolcke Exp $> | ||
|  | <HTML> | ||
|  | <HEAD> | ||
|  | <TITLE>ppl-scripts</TITLE> | ||
|  | </HEAD> | ||
|  | <BODY> | ||
|  | <H1>ppl-scripts</H1> | ||
|  | <H2> NAME </H2> | ||
|  | ppl-scripts, add-ppls, compare-ppls, compute-best-mix, compute-best-sentence-mix, filter-event-counts, hits-from-log, ppl-from-log, subtract-ppls - manipulate perplexities | ||
|  | <H2> SYNOPSIS </H2> | ||
|  | <PRE> | ||
|  | <B>add-ppls</B> [ <I>ppl-file</I> ... ] | ||
|  | <B>subtract-ppls</B> <I>ppl-file1</I> [ <I>ppl-file2</I> ... ] | ||
|  | <B>ppl-from-log</B> [ <I>ppl-file</I> ... ] | ||
|  | <B>hits-from-log</B> [ <I>ppl-file</I> ... ] | ||
|  | <B>compare-ppls</B> [ <B>mindelta=</B><I>D</I> ] <I>ppl-file1</I> <I>ppl-file2</I> | ||
|  | <B>compute-best-mix</B> [ <B>lambda='</B><I>l1 l2</I> ...<B>'</B> ] [ <B>precision=</B><I>P</I> ] \ | ||
|  | 	<I>ppl-file1</I> [ <I>ppl-file2</I> ... ] | ||
|  | <B>compute-best-sentence-mix</B> [ <B>lambda='</B><I>l1 l2</I> ...<B>'</B> ] [ <B>precision=</B><I>P</I> ] \ | ||
|  | 	[ <B>addone=</B><I>c</I> ] <I>ppl-file1</I> [ <I>ppl-file2</I> ... ] | ||
|  | <B>filter-event-counts</B> [ <B>order=</B><I>N</I> ] [ <B>escape=</B><I>string</I> ] [ <I>counts</I> ... ] | ||
|  | </PRE> | ||
|  | <H2> DESCRIPTION </H2> | ||
|  | These scripts process the output of the  | ||
|  | <A HREF="ngram.1.html">ngram(1)</A> | ||
|  | option | ||
|  | <B> -ppl </B> | ||
|  | to extract various useful information. | ||
|  | They are particularly convenient for analyzing the performance (perplexity) of  | ||
|  | language models on specific subsets of the test data, | ||
|  | and for comparing and combining multiple models. | ||
|  | <P> | ||
|  | <B> add-ppls </B> | ||
|  | takes several ppl output files and computes an aggregate perplexity and | ||
|  | corpus statistics. | ||
|  | Its output is suitable for subsequent manipulation by | ||
|  | <B> add-ppls </B> | ||
|  | or | ||
|  | <B>subtract-ppls</B>. | ||
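<P>
As a rough illustration of the aggregation, the following Python sketch sums the statistics of several ppl files and recomputes the perplexities. The PplStats class and the function names are hypothetical. The formulas (base-10 log probabilities, OOVs and zero-probability words excluded from the denominator, sentence ends counted for ppl but not for ppl1) are assumptions about the conventions of <B>ngram -ppl</B>, not something this page specifies.

```python
from dataclasses import dataclass


@dataclass
class PplStats:
    """Hypothetical container for the totals reported for one ppl file."""
    sentences: int = 0
    words: int = 0
    oovs: int = 0
    zeroprobs: int = 0
    logprob: float = 0.0  # assumed to be a total log10 probability


def add_stats(files):
    """Sum counts and log probabilities over several PplStats records."""
    total = PplStats()
    for f in files:
        total.sentences += f.sentences
        total.words += f.words
        total.oovs += f.oovs
        total.zeroprobs += f.zeroprobs
        total.logprob += f.logprob
    return total


def perplexity(stats, count_sentence_ends=True):
    """Perplexity over scored tokens; count_sentence_ends=False gives ppl1."""
    scored = stats.words - stats.oovs - stats.zeroprobs
    if count_sentence_ends:
        scored += stats.sentences
    if scored <= 0:
        raise ValueError("no scored tokens")
    return 10.0 ** (-stats.logprob / scored)


total = add_stats([PplStats(2, 10, 0, 0, -12.0), PplStats(1, 5, 1, 0, -5.0)])
print(total, perplexity(total), perplexity(total, count_sentence_ends=False))
```

Removing a file's statistics, as <B>subtract-ppls</B> does, would be the same computation with the counts and log probability subtracted instead of added.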
|  | <P> | ||
|  | <B> subtract-ppls </B> | ||
|  | similarly computes an aggregate perplexity by removing the | ||
|  | statistics of zero or more | ||
|  | <I> ppl-file2 </I> | ||
|  | from those in | ||
|  | <I>ppl-file1</I>. | ||
|  | Its output is suitable for subsequent manipulation by | ||
|  | <B> add-ppls </B> | ||
|  | or | ||
|  | <B>subtract-ppls</B>. | ||
|  | <P> | ||
|  | <B> ppl-from-log </B> | ||
|  | recomputes the total perplexities and statistics from individual | ||
|  | lines in | ||
|  | <B> ngram -debug 2 -ppl </B> | ||
|  | output. | ||
|  | Combined with some filtering of that output, this allows computing  | ||
|  | perplexities on interesting subsets of words. | ||
|  | <P> | ||
|  | <B> hits-from-log </B> | ||
|  | computes N-gram hit rates from | ||
|  | <B> ngram -debug 2 -ppl </B> | ||
|  | output. | ||
|  | <P> | ||
|  | <B> compare-ppls </B> | ||
|  | tallies the number of words for which two language models produce the same, | ||
|  | higher, or lower probabilities. | ||
|  | The input files should be  | ||
|  | <B> ngram -debug 2 -ppl </B> | ||
|  | output for the two models on the same test set. | ||
|  | The parameter | ||
|  | <I> D </I> | ||
|  | is the minimum absolute difference for two log probabilities to be  | ||
|  | considered different (the default is 0). | ||
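<P>
The tally can be sketched in Python as follows, assuming the per-word log probabilities of both models have already been extracted from the two output files into parallel lists. The function name, the dictionary keys, and the treatment of a difference exactly equal to the threshold (counted as "same") are assumptions made for illustration.

```python
def tally_differences(logprobs1, logprobs2, mindelta=0.0):
    """Count words where model 2 scores the same, higher, or lower than model 1."""
    if len(logprobs1) != len(logprobs2):
        raise ValueError("both models must be scored on the same test set")
    counts = {"same": 0, "higher": 0, "lower": 0}
    for lp1, lp2 in zip(logprobs1, logprobs2):
        diff = lp2 - lp1
        if abs(diff) <= mindelta:
            counts["same"] += 1      # difference too small to matter
        elif diff > 0:
            counts["higher"] += 1    # model 2 assigns the higher probability
        else:
            counts["lower"] += 1     # model 2 assigns the lower probability
    return counts


model1 = [-1.0, -2.0, -3.0]
model2 = [-1.0, -1.5, -3.5]
print(tally_differences(model1, model2))
print(tally_differences(model1, model2, mindelta=0.5))
```

Raising the threshold moves small differences into the "same" bucket, which is useful when the two models differ only by rounding noise on most words.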
|  | <P> | ||
|  | <B> compute-best-mix </B> | ||
|  | takes the output of several | ||
|  | <B> ngram -debug 2 -ppl </B> | ||
|  | runs on the same test set and computes the optimal interpolation  | ||
|  | weights for the corresponding models, | ||
|  | i.e., the weights that minimize the perplexity of an interpolated model. | ||
|  | Initial weights may be specified with the <B>lambda=</B> parameter as | ||
|  | <I>l1 l2 ...</I>. | ||
|  | The computation is iterative and stops when the interpolation weights | ||
|  | change by less than | ||
|  | <I> P </I> | ||
|  | (default 0.001). | ||
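<P>
One standard way to find such weights is expectation-maximization (EM) over the per-word probabilities. The Python sketch below illustrates that technique under stated assumptions: the input is a list of per-model lists of linear-domain word probabilities (already extracted from the log files), the weights start out uniform unless given, and the names best_mix and max_iter are hypothetical. Whether the script follows these steps exactly is not stated on this page.

```python
def best_mix(probs, lambdas=None, precision=0.001, max_iter=1000):
    """Estimate interpolation weights by EM.

    probs[i][k] is the probability model i assigns to word k.
    """
    n_models = len(probs)
    n_words = len(probs[0])
    if lambdas is None:
        lambdas = [1.0 / n_models] * n_models
    for _ in range(max_iter):
        posterior_totals = [0.0] * n_models
        used = 0
        for k in range(n_words):
            joint = [lam * p[k] for lam, p in zip(lambdas, probs)]
            mix = sum(joint)
            if mix == 0.0:
                continue  # no model predicts this word; skip it
            used += 1
            for i in range(n_models):
                # E-step: posterior probability that model i produced word k
                posterior_totals[i] += joint[i] / mix
        if used == 0:
            return lambdas
        # M-step: new weight is the average posterior of each model
        new_lambdas = [t / used for t in posterior_totals]
        change = max(abs(n - o) for n, o in zip(new_lambdas, lambdas))
        lambdas = new_lambdas
        if change < precision:
            break
    return lambdas


strong = [0.5, 0.5, 0.5]
weak = [0.1, 0.1, 0.1]
print(best_mix([strong, weak]))
```

In this toy input the first model is better on every word, so nearly all of the weight shifts to it. With real models each one usually wins on some words, and the weights settle at an interior point.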
|  | <P> | ||
|  | <B> compute-best-sentence-mix </B> | ||
|  | similarly optimizes the weights for sentence-level interpolation of LMs. | ||
|  | It requires input files generated by | ||
|  | <B>ngram -debug 1 -ppl</B>. | ||
|  | (Sentence-level mixtures can be implemented using the  | ||
|  | <B> ngram -hmm </B> | ||
|  | option, by constructing a suitable HMM structure.) | ||
|  | The  | ||
|  | <B>addone=</B><I>c</I> | ||
|  | option performs Laplace smoothing by adding  | ||
|  | <I> c </I> | ||
|  | to the estimated posterior counts for each model. | ||
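<P>
A minimal Python sketch of sentence-level weight estimation with this kind of smoothing is given below. It assumes the input is a list of per-model lists of total log10 sentence probabilities, uses EM as the estimation technique, and adds the constant to each model's posterior count before normalizing. The function and parameter names other than addone and precision are hypothetical, and the details are an illustration rather than a description of the script's internals.

```python
def best_sentence_mix(logprobs, lambdas=None, precision=0.001,
                      addone=0.0, max_iter=1000):
    """Estimate sentence-level mixture weights by EM.

    logprobs[i][s] is the log10 probability model i assigns to sentence s.
    """
    n_models = len(logprobs)
    n_sents = len(logprobs[0])
    if lambdas is None:
        lambdas = [1.0 / n_models] * n_models
    for _ in range(max_iter):
        counts = [float(addone)] * n_models  # Laplace smoothing constant
        for s in range(n_sents):
            # Scale by the best model's score so the powers do not underflow.
            top = max(lp[s] for lp in logprobs)
            joint = [lam * 10.0 ** (lp[s] - top)
                     for lam, lp in zip(lambdas, logprobs)]
            total = sum(joint)
            if total == 0.0:
                continue
            for i in range(n_models):
                counts[i] += joint[i] / total  # posterior count for model i
        norm = sum(counts)
        if norm == 0.0:
            return lambdas
        new_lambdas = [c / norm for c in counts]
        change = max(abs(n - o) for n, o in zip(new_lambdas, lambdas))
        lambdas = new_lambdas
        if change < precision:
            break
    return lambdas


always_better = [-10.0, -10.0]
always_worse = [-30.0, -30.0]
print(best_sentence_mix([always_better, always_worse], addone=1.0))
```

Without smoothing, a model that loses on every sentence is driven toward zero weight. With a positive constant it keeps a floor that depends on the constant and on the number of sentences.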
|  | <P> | ||
|  | <B> filter-event-counts </B> | ||
|  | prepares a count file for perplexity computation. | ||
|  | It removes counts that do not represent events to the LM. | ||
|  | The  | ||
|  | <B>order=</B><I>N</I> | ||
|  | option specifies the maximal N-gram order to use. | ||
|  | The effect of filtering is such that | ||
|  | <PRE> | ||
|  | 	ngram -order <I>N</I> -lm <I>LM</I> -ppl <I>TEXT</I> | ||
|  | </PRE> | ||
|  | and | ||
|  | <PRE> | ||
|  | 	ngram-count -order <I>N</I> -text <I>TEXT</I> -write - | \ | ||
|  | 	filter-event-counts order=<I>N</I> | \ | ||
|  | 	ngram -order <I>N</I> -lm <I>LM</I> -counts - | ||
|  | </PRE> | ||
|  | yield the same result. | ||
|  | The  | ||
|  | <B> escape= </B> | ||
|  | option specifies a string that causes all input lines beginning with  | ||
|  | that string to be passed through | ||
|  | (useful in combination with | ||
|  | <B>ngram -escape</B>). | ||
|  | <H2> SEE ALSO </H2> | ||
|  | <A HREF="ngram.1.html">ngram(1)</A>, <A HREF="ngram-count.1.html">ngram-count(1)</A>. | ||
|  | <H2> BUGS </H2> | ||
|  | Most scripts depend on the idiosyncrasies of | ||
|  | <B> ngram -ppl </B> | ||
|  | output. | ||
|  | <H2> AUTHOR </H2> | ||
|  | Andreas Stolcke &lt;stolcke@icsi.berkeley.edu&gt; | ||
|  | <BR> | ||
|  | Copyright (c) 1995-2009 SRI International | ||
|  | <BR> | ||
|  | Copyright (c) 2011-2016 Andreas Stolcke | ||
|  | <BR> | ||
|  | Copyright (c) 2011-2016 Microsoft Corp. | ||
|  | </BODY> | ||
|  | </HTML> |