<! $Id: ppl-scripts.1,v 1.9 2019/09/09 22:35:37 stolcke Exp $>
<HTML>
<HEADER>
<TITLE>ppl-scripts</TITLE>
<BODY>
<H1>ppl-scripts</H1>
<H2> NAME </H2>
ppl-scripts, add-ppls, compare-ppls, compute-best-mix, compute-best-sentence-mix, filter-event-counts, hits-from-log, ppl-from-log, subtract-ppls - manipulate perplexities
<H2> SYNOPSIS </H2>
<PRE>
<B>add-ppls</B> [ <I>ppl-file</I> ... ]
<B>subtract-ppls</B> <I>ppl-file1</I> [ <I>ppl-file2</I> ... ]
<B>ppl-from-log</B> [ <I>ppl-file</I> ... ]
<B>hits-from-log</B> [ <I>ppl-file</I> ... ]
<B>compare-ppls</B> [ <B>mindelta=</B><I>D</I> ] <I>ppl-file1</I> <I>ppl-file2</I>
<B>compute-best-mix</B> [ <B>lambda='</B><I>l1 l2</I> ...<B>'</B> ] [ <B>precision=</B><I>P</I> ] \
	<I>ppl-file1</I> [ <I>ppl-file2</I> ... ]
<B>compute-best-sentence-mix</B> [ <B>lambda='</B><I>l1 l2</I> ...<B>'</B> ] [ <B>precision=</B><I>P</I> ] \
	[ <B>addone=</B><I>c</I> ] <I>ppl-file1</I> [ <I>ppl-file2</I> ... ]
<B>filter-event-counts</B> [ <B>order=</B><I>N</I> ] [ <B>escape=</B><I>string</I> ] [ <I>counts</I> ... ]
</PRE>
<H2> DESCRIPTION </H2>
These scripts process the output of the
<A HREF="ngram.1.html">ngram(1)</A>
option
<B> -ppl </B>
to extract various useful information.
They are particularly convenient for analyzing the performance (perplexity) of
language models on specific subsets of the test data,
or for comparing and combining multiple models.
<P>
<B> add-ppls </B>
takes several ppl output files and computes an aggregate perplexity and
corpus statistics.
Its output is suitable for subsequent manipulation by
<B> add-ppls </B>
or
<B>subtract-ppls</B>.
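<P>
The aggregation described above can be sketched as follows. This is an illustration, not the script's actual code: the dictionary keys are hypothetical names for the per-file statistics, and the perplexity formulas (base-10 log probabilities, with OOVs and zero-probability words excluded from the denominator) are assumed to match what
<B> ngram -ppl </B>
reports.

```python
def aggregate_ppl(stats_list):
    """Sum corpus statistics and recompute perplexity from the totals."""
    total = {"sentences": 0, "words": 0, "oovs": 0, "zeroprobs": 0, "logprob": 0.0}
    for stats in stats_list:
        for key in total:
            total[key] += stats[key]
    # Tokens that actually received a probability from the model.
    scored_words = total["words"] - total["oovs"] - total["zeroprobs"]
    # ppl counts one end-of-sentence token per sentence; ppl1 does not.
    total["ppl"] = 10 ** (-total["logprob"] / (scored_words + total["sentences"]))
    total["ppl1"] = 10 ** (-total["logprob"] / scored_words)
    return total
```

Perplexities cannot be averaged directly, which is why the sketch sums the log probabilities and counts first and only then recomputes the perplexity. Subtracting statistics would follow the same pattern with the sign reversed.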
<P>
<B> subtract-ppls </B>
similarly computes an aggregate perplexity by removing the
statistics of zero or more
<I> ppl-file2 </I>
from those in
<I>ppl-file1</I>.
Its output is suitable for subsequent manipulation by
<B> add-ppls </B>
or
<B>subtract-ppls</B>.
<P>
<B> ppl-from-log </B>
recomputes the total perplexities and statistics from individual
lines in
<B> ngram -debug 2 -ppl </B>
output.
Combined with some filtering of that output, this allows computing
perplexities on interesting subsets of words.
<P>
<B> hits-from-log </B>
computes N-gram hit rates from
<B> ngram -debug 2 -ppl </B>
output.
<P>
<B> compare-ppls </B>
tallies the number of words for which two language models produce the same,
higher, or lower probabilities.
The input files should be
<B> ngram -debug 2 -ppl </B>
output for the two models on the same test set.
The parameter
<I> D </I>
is the minimum absolute difference for two log probabilities to be
considered different (the default is 0).
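<P>
As a rough illustration of the tally, the sketch below assumes the per-word log probabilities have already been extracted from the two log files into parallel lists. The function name, the list inputs, and the treatment of a difference exactly equal to
<I> D </I>
as "same" are assumptions made for this example, not details taken from the script.

```python
def tally_differences(logprobs1, logprobs2, mindelta=0.0):
    """Count words where model 2 scores the same, higher, or lower than model 1."""
    counts = {"same": 0, "higher": 0, "lower": 0}
    for lp1, lp2 in zip(logprobs1, logprobs2):
        diff = lp2 - lp1
        if abs(diff) <= mindelta:
            # Differences within the threshold are treated as ties.
            counts["same"] += 1
        elif diff > 0:
            counts["higher"] += 1
        else:
            counts["lower"] += 1
    return counts
```

A nonzero threshold keeps small numerical differences between the two models from being counted as wins or losses.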
<P>
<B> compute-best-mix </B>
takes the output of several
<B> ngram -debug 2 -ppl </B>
runs on the same test set and computes the optimal interpolation
weights for the corresponding models,
i.e., the weights that minimize the perplexity of an interpolated model.
Initial weights may be specified with
<B>lambda='</B><I>l1 l2</I> ...<B>'</B>.
The computation is iterative and stops when the interpolation weights
change by less than
<I> P </I>
(default 0.001).
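<P>
One standard way to carry out such an iterative weight estimate is expectation-maximization (EM). The sketch below shows that technique under stated assumptions: the per-word probabilities from each model are taken to be already extracted into rows of a list, and the function name, the
<I>max_iter</I>
safeguard, and the uniform starting weights are hypothetical choices for this example. It is not a transcription of the script itself.

```python
def best_mix(probs, lambdas=None, precision=0.001, max_iter=1000):
    """Estimate interpolation weights by EM.

    probs[i][j] is the probability that model j assigns to word i.
    """
    num_models = len(probs[0])
    if lambdas is None:
        lambdas = [1.0 / num_models] * num_models  # assumed uniform start
    for _ in range(max_iter):
        posteriors = [0.0] * num_models
        for word_probs in probs:
            mix = sum(l * p for l, p in zip(lambdas, word_probs))
            if mix == 0.0:
                continue  # skip words that no model can score
            # E-step: share of this word's probability owed to each model.
            for j, p in enumerate(word_probs):
                posteriors[j] += lambdas[j] * p / mix
        # M-step: renormalize the accumulated posteriors into new weights.
        total = sum(posteriors)
        new_lambdas = [c / total for c in posteriors]
        converged = max(abs(n - o) for n, o in zip(new_lambdas, lambdas)) < precision
        lambdas = new_lambdas
        if converged:
            break
    return lambdas
```

Each iteration cannot increase the perplexity of the interpolated model on the given data, so stopping once the weights change by less than the precision is a reasonable convergence test.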
<P>
<B> compute-best-sentence-mix </B>
similarly optimizes the weights for sentence-level interpolation of LMs.
It requires input files generated by
<B>ngram -debug 1 -ppl</B>.
(Sentence-level mixtures can be implemented using the
<B> ngram -hmm </B>
option, by constructing a suitable HMM structure.)
The
<B>addone=</B><I>c</I>
option performs Laplace smoothing by adding
<I> c </I>
to the estimated posterior counts for each model.
<P>
<B> filter-event-counts </B>
prepares a count file for perplexity computation.
It removes counts that do not represent events to the LM.
The
<B>order=</B><I>N</I>
option specifies the maximal N-gram order to use.
The effect of filtering is such that
<PRE>
ngram -order <I>N</I> -lm <I>LM</I> -ppl <I>TEXT</I>
</PRE>
and
<PRE>
ngram-count -order <I>N</I> -text <I>TEXT</I> -write - | \
	filter-event-counts order=<I>N</I> | \
	ngram -order <I>N</I> -lm <I>LM</I> -counts -
</PRE>
yield the same result.
The
<B>escape=</B><I>string</I>
option specifies a string that causes all input lines beginning with
that string to be passed through unchanged
(useful in combination with
<B>ngram -escape</B>).
<H2> SEE ALSO </H2>
<A HREF="ngram.1.html">ngram(1)</A>, <A HREF="ngram-count.1.html">ngram-count(1)</A>.
<H2> BUGS </H2>
Most scripts depend on the idiosyncrasies of
<B> ngram -ppl </B>
output.
<H2> AUTHOR </H2>
Andreas Stolcke &lt;stolcke@icsi.berkeley.edu&gt;
<BR>
Copyright (c) 1995-2009 SRI International
<BR>
Copyright (c) 2011-2016 Andreas Stolcke
<BR>
Copyright (c) 2011-2016 Microsoft Corp.
</BODY>
</HTML>