138 lines
		
	
	
		
			4.0 KiB
		
	
	
	
		
			Groff
		
	
	
	
	
	
			
		
		
	
	
			138 lines
		
	
	
		
			4.0 KiB
		
	
	
	
		
			Groff
		
	
	
	
	
	
| .\" $Id: anti-ngram.1,v 1.9 2019/09/09 22:35:36 stolcke Exp $
 | |
| .TH anti-ngram 1 "$Date: 2019/09/09 22:35:36 $" "SRILM Tools"
 | |
| .SH NAME
 | |
| anti-ngram \- count posterior-weighted N-grams in N-best lists
 | |
| .SH SYNOPSIS
 | |
| .nf
 | |
| \fBanti-ngram\fP [ \fB\-help\fP ] \fIoption\fP ...
 | |
| .fi
 | |
| .SH DESCRIPTION
 | |
| .B anti-ngram
 | |
| counts the N-grams in a set of N-best hypotheses lists.
 | |
| The N-gram counts are weighted by the posterior probabilities of the
 | |
| hypotheses they occur in.
 | |
| Thus, 
 | |
| .B anti-ngram 
 | |
| can be used to construct language models of word sequences
 | |
| that are acoustically confusable with correct hypotheses.
 | |
| The counts output should be processed with
 | |
| .B "ngram-count \-float-counts"
 | |
| to estimate a language model.
 | |
| .SH OPTIONS
 | |
| .PP
 | |
| Each filename argument can be an ASCII file, or a 
 | |
| compressed file (name ending in .Z or .gz), or ``-'' to indicate
 | |
| stdin/stdout.
 | |
| .TP
 | |
| .B \-help
 | |
| Print option summary.
 | |
| .TP
 | |
| .B \-version
 | |
| Print version information.
 | |
| .TP
 | |
| .BI \-refs " file"
 | |
| Read the reference transcripts from 
 | |
| .IR file .
 | |
| Each line should contain an utterance ID followed by the transcript words.
 | |
| .TP
 | |
| .BI \-nbest-files " file"
 | |
| List of N-best files.
 | |
| The base components of filenames must correspond to the utterance IDs found
 | |
| in the reference file.
 | |
| .TP
 | |
| .BI \-max-nbest " n"
 | |
| Limits the number of hypotheses read from each N-best list to the first
 | |
| .IR n .
 | |
| .TP
 | |
| .BI \-order " n"
 | |
| Set the maximal order (length) of N-grams to count.
 | |
| The default order is 3.
 | |
| .TP
 | |
| .BI \-lm " file"
 | |
| Reads an ARPA language model from 
 | |
| .I file
 | |
| and rescores the N-best lists with it prior to counting N-grams.
 | |
| .TP
 | |
| .BI \-classes " file"
 | |
| Interpret the LM as a class-based N-gram and read class definitions
 | |
| in 
 | |
| .BR classes-format (5)
 | |
| from
 | |
| .IR file .
 | |
| .TP
 | |
| .B \-tolower
 | |
| Map vocabulary to lowercase, eliminating case distinctions.
 | |
| .TP
 | |
| .B \-multiwords
 | |
| Split multiwords (words joined by '_') into their components when
 | |
| reading N-best lists.
 | |
| .TP
 | |
| .BI \-multi-char " C"
 | |
| Character used to delimit component words in multiwords
 | |
| (an underscore character by default).
 | |
| .TP
 | |
| .BI \-rescore-lmw " lmw"
 | |
| Sets the language model weight used in combining the language model log
 | |
| probabilities with acoustic log probabilities
 | |
| (only relevant if separate scores are given in the N-best input).
 | |
| .TP
 | |
| .BI \-rescore-wtw " wtw"
 | |
| Sets the word transition weight used to weight the number of words relative to
 | |
| the acoustic log probabilities
 | |
| (only relevant if separate scores are given in the N-best input).
 | |
| .TP
 | |
| .BI \-posterior-scale " scale"
 | |
| Divide the total weighted log score by 
 | |
| .I scale
 | |
| when computing normalized posterior probabilities.
 | |
| This controls the peakedness of the posterior distribution. 
 | |
| The default value is whatever was chosen for 
 | |
| .BR \-rescore-lmw , 
 | |
| so that language model scores are scaled to have weight 1,
 | |
| and acoustic scores have weight 1/\fIlmw\fP.
 | |
| .TP
 | |
| .B \-all-ngrams
 | |
| Causes even N-grams that occur in the reference string to be counted.
 | |
| By default N-best N-grams that also occur in the corresponding reference 
 | |
| are excluded.
 | |
| .TP
 | |
| .BI \-min-count " C"
 | |
| Prune all N-grams from the output that have counts less than
 | |
| .IR C .
 | |
| .TP
 | |
| .BI \-read-counts " countsfile"
 | |
| Read N-gram counts from a file.
 | |
| Each line contains an N-gram of 
 | |
| words, followed by an integer or fractional count, all separated by whitespace.
 | |
| Repeated counts for the same N-gram are added.
 | |
| N-grams from N-best lists are added to those read with this option.
 | |
| .TP
 | |
| .BI \-write-counts " countsfile"
 | |
| Writes total N-gram counts to
 | |
| .IR countsfile .
 | |
| The default is to write to stdout.
 | |
| .TP
 | |
| .B \-sort
 | |
| Output counts in lexicographic order, as required for
 | |
| .BR ngram-merge (1).
 | |
| .TP
 | |
| .BI \-debug " level"
 | |
| Set debugging output level.
 | |
| Level 0 means no debugging.
 | |
| Debugging messages are written to stderr.
 | |
| .SH "SEE ALSO"
 | |
| ngram(1), ngram-merge(1), ngram-count(1), nbest-scripts(1),
 | |
| classes-format(5),
 | |
| .br
 | |
| A. Stolcke et al., "The SRI March 2000 Hub-5 Conversational Speech
 | |
| Transcription System",
 | |
| \fIProc. NIST Speech Transcription Workshop\fP, College Park, MD, 2000.
 | |
| .SH BUGS
 | |
| There is no
 | |
| .B \-vocab
 | |
| option to limit the vocabulary.
 | |
| .SH AUTHOR
 | |
| Andreas Stolcke <stolcke@icsi.berkeley.edu>
 | |
| .br
 | |
| Copyright (c) 2000\-2008 SRI International
 | 
