.\" $Id: anti-ngram.1,v 1.9 2019/09/09 22:35:36 stolcke Exp $
.TH anti-ngram 1 "$Date: 2019/09/09 22:35:36 $" "SRILM Tools"
.SH NAME
anti-ngram \- count posterior-weighted N-grams in N-best lists
.SH SYNOPSIS
.nf
\fBanti-ngram\fP [ \fB\-help\fP ] \fIoption\fP ...
.fi
.SH DESCRIPTION
.B anti-ngram
counts the N-grams in a set of N-best hypothesis lists.
The N-gram counts are weighted by the posterior probabilities of the
hypotheses they occur in.
Thus,
.B anti-ngram
can be used to construct language models of word sequences
that are acoustically confusable with correct hypotheses.
The output counts should be processed with
.B "ngram-count \-float-counts"
to estimate a language model.
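.PP
The weighting described above can be illustrated with a minimal Python sketch.
It is not the tool's implementation: the function names
.B ngrams
and
.BR anti_ngram_counts ,
the in-memory input layout, and the toy data are assumptions made for
illustration, and sentence-boundary tags, multiwords, and rescoring are ignored.
.PP
.nf
```python
from collections import defaultdict


def ngrams(words, max_order):
    """Yield every N-gram of length 1..max_order as a tuple of words."""
    for n in range(1, max_order + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])


def anti_ngram_counts(hyps, reference, max_order=3, all_ngrams=False):
    """Accumulate posterior-weighted N-gram counts for one utterance.

    hyps is a list of (posterior, words) pairs (hypothetical layout);
    reference is the list of reference transcript words.
    """
    ref_ngrams = set(ngrams(reference, max_order))
    counts = defaultdict(float)
    for posterior, words in hyps:
        for gram in ngrams(words, max_order):
            # Skip N-grams found in the reference unless all_ngrams is set.
            if all_ngrams or gram not in ref_ngrams:
                counts[gram] += posterior
    return dict(counts)


# Toy data: two hypotheses with posteriors 0.7 and 0.3.
hyps = [(0.7, ["a", "b"]), (0.3, ["a", "c"])]
counts = anti_ngram_counts(hyps, ["a", "b"], max_order=2)
```
.fi
.PP
With this toy input, only the unigram "c" and the bigram "a c" are absent
from the reference, so each receives a fractional count of 0.3.
Passing
.B all_ngrams=True
in the sketch mirrors the effect described for
.BR \-all-ngrams .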
.SH OPTIONS
.PP
Each filename argument can be an ASCII file, or a
compressed file (name ending in .Z or .gz), or ``-'' to indicate
stdin/stdout.
.TP
.B \-help
Print option summary.
.TP
.B \-version
Print version information.
.TP
.BI \-refs " file"
Read the reference transcripts from
.IR file .
Each line should contain an utterance ID followed by the transcript words.
.TP
.BI \-nbest-files " file"
Read the list of N-best filenames from
.IR file .
The base components of filenames must correspond to the utterance IDs found
in the reference file.
.TP
.BI \-max-nbest " n"
Limit the number of hypotheses read from each N-best list to the first
.IR n .
.TP
.BI \-order " n"
Set the maximal order (length) of N-grams to count.
The default order is 3.
.TP
.BI \-lm " file"
Read an ARPA language model from
.I file
and rescore the N-best lists with it prior to counting N-grams.
.TP
.BI \-classes " file"
Interpret the LM as a class-based N-gram and read class definitions
in
.BR classes-format (5)
from
.IR file .
.TP
.B \-tolower
Map vocabulary to lowercase, eliminating case distinctions.
.TP
.B \-multiwords
Split multiwords (words joined by '_') into their components when
reading N-best lists.
.TP
.BI \-multi-char " C"
Character used to delimit component words in multiwords
(an underscore character by default).
.TP
.BI \-rescore-lmw " lmw"
Set the language model weight used in combining the language model log
probabilities with acoustic log probabilities
(only relevant if separate scores are given in the N-best input).
.TP
.BI \-rescore-wtw " wtw"
Set the word transition weight used to weight the number of words relative to
the acoustic log probabilities
(only relevant if separate scores are given in the N-best input).
.TP
.BI \-posterior-scale " scale"
Divide the total weighted log score by
.I scale
when computing normalized posterior probabilities.
This controls the peakedness of the posterior distribution.
The default value is whatever was chosen for
.BR \-rescore-lmw ,
so that language model scores are scaled to have weight 1,
and acoustic scores have weight 1/\fIlmw\fP.
.TP
.B \-all-ngrams
Count all N-grams, including those that occur in the reference string.
By default, N-grams from the N-best lists that also occur in the
corresponding reference are excluded.
.TP
.BI \-min-count " C"
Prune all N-grams from the output that have counts less than
.IR C .
.TP
.BI \-read-counts " countsfile"
Read N-gram counts from a file.
Each line contains an N-gram of
words, followed by an integer or fractional count, all separated by whitespace.
Repeated counts for the same N-gram are added.
N-grams from N-best lists are added to those read with this option.
.TP
.BI \-write-counts " countsfile"
Write total N-gram counts to
.IR countsfile .
The default is to write to stdout.
.TP
.B \-sort
Output counts in lexicographic order, as required for
.BR ngram-merge (1).
.TP
.BI \-debug " level"
Set debugging output level.
Level 0 means no debugging.
Debugging messages are written to stderr.
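.SH EXAMPLES
The interaction of
.BR \-rescore-lmw ,
.BR \-rescore-wtw ,
and
.B \-posterior-scale
described under OPTIONS can be sketched in Python as follows.
This is an illustration under stated assumptions, not the tool's code:
the function name
.BR posteriors ,
the (acoustic, lm, num_words) input layout, and the use of base-10 log
scores are all assumptions.
.PP
.nf
```python
def posteriors(hyps, lmw, wtw, scale=None):
    """Return normalized posteriors for (acoustic, lm, num_words) triples.

    acoustic and lm are log scores; base 10 is assumed here.
    """
    if scale is None:
        # Documented default: the value chosen for -rescore-lmw.
        scale = lmw
    # Total weighted log score of each hypothesis, divided by the scale.
    totals = [(ac + lmw * lm + wtw * nw) / scale for ac, lm, nw in hyps]
    top = max(totals)  # subtract the maximum for numerical stability
    weights = [10.0 ** (t - top) for t in totals]
    norm = sum(weights)
    return [w / norm for w in weights]


# Toy scores: the second hypothesis has a worse acoustic score.
toy = [(-10.0, -1.0, 2), (-12.0, -1.0, 2)]
sharp = posteriors(toy, lmw=2.0, wtw=0.0)
flat = posteriors(toy, lmw=2.0, wtw=0.0, scale=20.0)
```
.fi
.PP
In the sketch, a larger
.I scale
shrinks the differences between the scaled scores, so
.B flat
is closer to uniform than
.BR sharp .
This is the sense in which the scale controls the peakedness of the
posterior distribution.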
.SH "SEE ALSO"
ngram(1), ngram-merge(1), ngram-count(1), nbest-scripts(1),
classes-format(5),
.br
A. Stolcke et al., "The SRI March 2000 Hub-5 Conversational Speech
Transcription System",
\fIProc. NIST Speech Transcription Workshop\fP, College Park, MD, 2000.
.SH BUGS
There is no
.B \-vocab
option to limit the vocabulary.
.SH AUTHOR
Andreas Stolcke <stolcke@icsi.berkeley.edu>
.br
Copyright (c) 2000\-2008 SRI International