147 lines
5.2 KiB
HTML
147 lines
5.2 KiB
HTML
<! $Id: anti-ngram.1,v 1.9 2019/09/09 22:35:36 stolcke Exp $>
|
|
<HTML>
|
|
<HEADER>
|
|
<TITLE>anti-ngram</TITLE>
|
|
<BODY>
|
|
<H1>anti-ngram</H1>
|
|
<H2> NAME </H2>
|
|
anti-ngram - count posterior-weighted N-grams in N-best lists
|
|
<H2> SYNOPSIS </H2>
|
|
<PRE>
|
|
<B>anti-ngram</B> [ <B>-help</B> ] <I>option</I> ...
|
|
</PRE>
|
|
<H2> DESCRIPTION </H2>
|
|
<B> anti-ngram </B>
|
|
counts the N-grams in a set of N-best hypotheses lists.
|
|
The N-gram counts are weighted by the posterior probabilities of the
|
|
hypotheses they occur in.
|
|
Thus,
|
|
<B> anti-ngram </B>
|
|
can be used to construct language models of word sequences
|
|
that are acoustically confusable with correct hypotheses.
|
|
The counts output should be processed with
|
|
<B> ngram-count -float-counts </B>
|
|
to estimate a language model.
|
|
<H2> OPTIONS </H2>
|
|
<P>
|
|
Each filename argument can be an ASCII file, or a
|
|
compressed file (name ending in .Z or .gz), or ``-'' to indicate
|
|
stdin/stdout.
|
|
<DL>
|
|
<DT><B> -help </B>
|
|
<DD>
|
|
Print option summary.
|
|
<DT><B> -version </B>
|
|
<DD>
|
|
Print version information.
|
|
<DT><B>-refs</B><I> file</I><B></B><I></I><B></B><I></I><B></B>
|
|
<DD>
|
|
Read the reference transcripts from
|
|
<I>file</I>.<I></I><I></I><I></I>
|
|
Each line should contain an utterance ID followed by the transcript words.
|
|
<DT><B>-nbest-files</B><I> file</I><B></B><I></I><B></B><I></I><B></B>
|
|
<DD>
|
|
List of N-best files.
|
|
The base components of filenames must correspond to the utterance IDs found
|
|
in the reference file.
|
|
<DT><B>-max-nbest</B><I> n</I><B></B><I></I><B></B><I></I><B></B>
|
|
<DD>
|
|
Limits the number of hypotheses read from each N-best list to the first
|
|
<I>n</I>.<I></I><I></I><I></I>
|
|
<DT><B>-order</B><I> n</I><B></B><I></I><B></B><I></I><B></B>
|
|
<DD>
|
|
Set the maximal order (length) of N-grams to count.
|
|
The default order is 3.
|
|
<DT><B>-lm</B><I> file</I><B></B><I></I><B></B><I></I><B></B>
|
|
<DD>
|
|
Reads an ARPA language model from
|
|
<I> file </I>
|
|
and rescores the N-best lists with it prior to counting N-grams.
|
|
<DT><B>-classes</B><I> file</I><B></B><I></I><B></B><I></I><B></B>
|
|
<DD>
|
|
Interpret the LM as a class-based N-gram and read class definitions
|
|
in
|
|
<A HREF="classes-format.5.html">classes-format(5)</A>
|
|
from
|
|
<I>file</I>.<I></I><I></I><I></I>
|
|
<DT><B> -tolower </B>
|
|
<DD>
|
|
Map vocabulary to lowercase, eliminating case distinctions.
|
|
<DT><B> -multiwords </B>
|
|
<DD>
|
|
Split multiwords (words joined by '_') into their components when
|
|
reading N-best lists.
|
|
<DT><B>-multi-char</B><I> C</I><B></B><I></I><B></B><I></I><B></B>
|
|
<DD>
|
|
Character used to delimit component words in multiwords
|
|
(an underscore character by default).
|
|
<DT><B>-rescore-lmw</B><I> lmw</I><B></B><I></I><B></B><I></I><B></B>
|
|
<DD>
|
|
Sets the language model weight used in combining the language model log
|
|
probabilities with acoustic log probabilities
|
|
(only relevant if separate scores are given in the N-best input).
|
|
<DT><B>-rescore-wtw</B><I> wtw</I><B></B><I></I><B></B><I></I><B></B>
|
|
<DD>
|
|
Sets the word transition weight used to weight the number of words relative to
|
|
the acoustic log probabilities
|
|
(only relevant if separate scores are given in the N-best input).
|
|
<DT><B>-posterior-scale</B><I> scale</I><B></B><I></I><B></B><I></I><B></B>
|
|
<DD>
|
|
Divide the total weighted log score by
|
|
<I> scale </I>
|
|
when computing normalized posterior probabilities.
|
|
This controls the peakedness of the posterior distribution.
|
|
The default value is whatever was chosen for
|
|
<B>-rescore-lmw</B>,<B></B><B></B><B></B>
|
|
so that language model scores are scaled to have weight 1,
|
|
and acoustic scores have weight 1/<I>lmw</I>.
|
|
<DT><B> -all-ngrams </B>
|
|
<DD>
|
|
Causes even N-grams that occur in the reference string to be counted.
|
|
By default N-best N-grams that also occur in the corresponding reference
|
|
are excluded.
|
|
<DT><B>-min-count</B><I> C</I><B></B><I></I><B></B><I></I><B></B>
|
|
<DD>
|
|
Prune all N-grams from the output that have counts less than
|
|
<I>C</I>.<I></I><I></I><I></I>
|
|
<DT><B>-read-counts</B><I> countsfile</I><B></B><I></I><B></B><I></I><B></B>
|
|
<DD>
|
|
Read N-gram counts from a file.
|
|
Each line contains an N-gram of
|
|
words, followed by an integer or fractional count, all separated by whitespace.
|
|
Repeated counts for the same N-gram are added.
|
|
N-grams from N-best lists are added to those read with this option.
|
|
<DT><B>-write-counts</B><I> countsfile</I><B></B><I></I><B></B><I></I><B></B>
|
|
<DD>
|
|
Writes total N-gram counts to
|
|
<I>countsfile</I>.<I></I><I></I><I></I>
|
|
The default is to write to stdout.
|
|
<DT><B> -sort </B>
|
|
<DD>
|
|
Output counts in lexicographic order, as required for
|
|
<A HREF="ngram-merge.1.html">ngram-merge(1)</A>.
|
|
<DT><B>-debug</B><I> level</I><B></B><I></I><B></B><I></I><B></B>
|
|
<DD>
|
|
Set debugging output level.
|
|
Level 0 means no debugging.
|
|
Debugging messages are written to stderr.
|
|
</DD>
|
|
</DL>
|
|
<H2> SEE ALSO </H2>
|
|
<A HREF="ngram.1.html">ngram(1)</A>, <A HREF="ngram-merge.1.html">ngram-merge(1)</A>, <A HREF="ngram-count.1.html">ngram-count(1)</A>, <A HREF="nbest-scripts.1.html">nbest-scripts(1)</A>,
|
|
<A HREF="classes-format.5.html">classes-format(5)</A>,
|
|
<BR>
|
|
A. Stolcke et al., "The SRI March 2000 Hub-5 Conversational Speech
|
|
Transcription System",
|
|
<I>Proc. NIST Speech Transcription Workshop</I>, College Park, MD, 2000.
|
|
<H2> BUGS </H2>
|
|
There is no
|
|
<B> -vocab </B>
|
|
option to limit the vocabulary.
|
|
<H2> AUTHOR </H2>
|
|
Andreas Stolcke <stolcke@icsi.berkeley.edu>
|
|
<BR>
|
|
Copyright (c) 2000-2008 SRI International
|
|
</BODY>
|
|
</HTML>
|