.\" $Id: hidden-ngram.1,v 1.34 2019/09/09 22:35:36 stolcke Exp $
|
|
.TH hidden-ngram 1 "$Date: 2019/09/09 22:35:36 $" "SRILM Tools"
|
|
.SH NAME
|
|
hidden-ngram \- tag hidden events between words
|
|
.SH SYNOPSIS
|
|
.nf
|
|
\fBhidden-ngram\fP [ \fB\-help\fP ] \fIoption\fP ...
|
|
.fi
|
|
.SH DESCRIPTION
|
|
.B hidden-ngram
|
|
tags a stream of word tokens with hidden events occurring between words.
|
|
For example, an unsegmented text could be tagged for sentence boundaries
|
|
(the hidden events in this case being `boundary' and `no-boundary').
|
|
The most likely hidden tag sequence consistent with the given word
|
|
sequence is found according to an N-gram language model over both
|
|
words and hidden tags.
|
|
.PP
|
|
.B hidden-ngram
|
|
is a generalization of
|
|
.BR segment (1).
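.PP
As an illustration, a minimal (hypothetical) invocation for
sentence-boundary tagging, with placeholder file names, could be
.nf
    hidden-ngram -lm tagged.lm -hidden-vocab events.txt -text words.txt
.fi
where \fItagged.lm\fP is an N-gram model trained on text that includes
the hidden tags, and \fIevents.txt\fP lists those tags (see the
corresponding options below).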
.SH OPTIONS
.PP
Each filename argument can be an ASCII file, or a
compressed file (name ending in .Z or .gz), or ``-'' to indicate
stdin/stdout.
.TP
.B \-help
Print option summary.
.TP
.B \-version
Print version information.
.TP
.BI \-text " file"
Specifies the file containing the word sequences to be tagged
(one sentence per line).
Start- and end-of-sentence tags are
.I not
added by the program, but should be included in the input if the
language model uses them.
.TP
.BI \-escape " string"
Set an ``escape string.''
Input lines starting with
.I string
are not processed and are instead passed unchanged to stdout.
This allows associated information to be passed to scoring scripts etc.
.TP
.BI \-text\-map " file"
Read the input words from a map file containing both the words and
additional likelihoods of events following each word.
Each line contains one input word, plus optional hidden-event/likelihood
pairs in the format
.nf
\fIw\fP \fIe1\fP [\fIp1\fP] \fIe2\fP [\fIp2\fP] ...
.fi
If a \fIp\fP value is omitted a likelihood of 1 is assumed.
All events not explicitly listed are given likelihood 0, and are
hence excluded for that word.
In particular, the label
.B *noevent*
must be listed to allow absence of a hidden event.
Input word strings are assembled from multiple lines of
.B \-text\-map
input until either an end-of-sentence token </s> is found, or an escaped
line (see
.BR \-escape )
is encountered.
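For example, assuming a hypothetical hidden tag \fBBOUNDARY\fP, a
map file fragment could look like
.nf
    the     *noevent*
    end     *noevent* 0.4 BOUNDARY 0.6
    begins  *noevent*
.fi
where the second line permits a boundary event after ``end'' with
likelihood 0.6, while the other lines allow no event at all.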
.TP
.B \-logmap
Interpret numeric values in the
.B \-text\-map
file as log probabilities, rather
than probabilities.
.TP
.BI \-lm " file"
Specifies the word/tag language model as a standard ARPA N-gram backoff model
file in
.BR ngram-format (5).
.TP
.BI \-use-server " S"
Use a network LM server (typically implemented by
.BR ngram (1)
with the
.B \-server-port
option) as the main model.
The server specification
.I S
can be an unsigned integer port number (referring to a server port on
the local host),
a hostname (referring to default port 2525 on the named host),
or a string of the form
.IR port @ host ,
where
.I port
is a port number and
.I host
is either a hostname ("dukas.speech.sri.com")
or an IP number in dotted-quad format ("140.44.1.15").
.br
For server-based LMs, the
.B \-order
option limits the context length of N-grams queried by the client
(with 0 denoting unlimited length).
Hence, the effective LM order is the minimum of the client-specified value
and any limit implemented in the server.
.br
When
.B \-use-server
is specified, the arguments to the options
.BR \-mix-lm ,
.BR \-mix-lm2 ,
etc. are also interpreted as network LM server specifications provided
they contain a '@' character and do not contain a '/' character.
This allows the creation of mixtures of several file- and/or
network-based LMs.
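For example, a (hypothetical) invocation using a
.IR port @ host
server specification:
.nf
    hidden-ngram -use-server 2525@lmhost.example.com -text words.txt
.fi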
.TP
.B \-cache-served-ngrams
Enables client-side caching of N-gram probabilities to eliminate duplicate
network queries, in conjunction with
.BR \-use-server .
This can result in a substantial speedup
but requires memory in the client that may grow linearly with the
amount of data processed.
.TP
.BI \-order " n"
Set the effective N-gram order used by the language model to
.IR n .
Default is 3 (use a trigram model).
.TP
.BI \-classes " file"
Interpret the LM as an N-gram over word classes.
The expansions of the classes are given in
.I file
in
.BR classes-format (5).
Tokens in the LM that are not defined as classes in
.I file
are assumed to be plain words, so that the LM can contain mixed N-grams over
both words and word classes.
.TP
.B \-simple-classes
Assume a "simple" class model: each word is a member of at most one word class,
and class expansions are exactly one word long.
.TP
.BI \-mix-lm " file"
Read a second N-gram model for interpolation purposes.
The second and any additional interpolated models can also be class N-grams
(using the same
.B \-classes
definitions).
.TP
.B \-factored
Interpret the files specified by
.BR \-lm ,
.BR \-mix-lm ,
etc. as factored N-gram model specifications.
See
.BR ngram (1)
for more details.
.TP
.BI \-lambda " weight"
Set the weight of the main model when interpolating with
.BR \-mix-lm .
Default value is 0.5.
.TP
.BI \-mix-lm2 " file"
.TP
.BI \-mix-lm3 " file"
.TP
.BI \-mix-lm4 " file"
.TP
.BI \-mix-lm5 " file"
.TP
.BI \-mix-lm6 " file"
.TP
.BI \-mix-lm7 " file"
.TP
.BI \-mix-lm8 " file"
.TP
.BI \-mix-lm9 " file"
Up to 9 more N-gram models can be specified for interpolation.
.TP
.BI \-mix-lambda2 " weight"
.TP
.BI \-mix-lambda3 " weight"
.TP
.BI \-mix-lambda4 " weight"
.TP
.BI \-mix-lambda5 " weight"
.TP
.BI \-mix-lambda6 " weight"
.TP
.BI \-mix-lambda7 " weight"
.TP
.BI \-mix-lambda8 " weight"
.TP
.BI \-mix-lambda9 " weight"
These are the weights for the additional mixture components, corresponding
to
.B \-mix-lm2
through
.BR \-mix-lm9 .
The weight for the
.B \-mix-lm
model is 1 minus the sum of
.B \-lambda
and
.B \-mix-lambda2
through
.BR \-mix-lambda9 .
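For example, with the (hypothetical) option combination
.nf
    -lambda 0.5 -mix-lm lm2 -mix-lm2 lm3 -mix-lambda2 0.2
.fi
the main model receives weight 0.5, \fIlm3\fP receives weight 0.2, and
\fIlm2\fP receives the remaining 1 \- 0.5 \- 0.2 = 0.3.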
.TP
.BI \-context-priors " file"
Read context-dependent mixture weight priors from
.IR file .
Each line in
.I file
should contain a context N-gram (most recent word first) followed by a vector
of mixture weights whose length matches the number of LMs being interpolated.
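For example, when three LMs are interpolated, a (hypothetical) line
.nf
    york new    0.6 0.3 0.1
.fi
assigns the three models the weights 0.6, 0.3, and 0.1 in contexts
ending in ``new york'' (note that the most recent word comes first).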
.TP
.BI \-bayes " length"
Interpolate models using posterior probabilities
based on the likelihoods of local N-gram contexts of length
.IR length .
The
.B \-lambda
values are used as prior mixture weights in this case.
This option can also be combined with
.BR \-context-priors ,
in which case the
.I length
parameter also controls how many words of context are maximally used to look up
mixture weights.
If
.B \-context-priors
is used without
.BR \-bayes ,
the context length used is set by the
.B \-order
option and Bayesian interpolation is disabled, as when
.I scale
(see next) is zero.
.TP
.BI \-bayes-scale " scale"
Set the exponential scale factor on the context likelihood in conjunction
with the
.B \-bayes
function.
Default value is 1.0.
.TP
.BI \-lmw " W"
Scales the language model probabilities by a factor
.IR W .
Default language model weight is 1.
.TP
.BI \-mapw " W"
Scales the likelihood map probability by a factor
.IR W .
Default map weight is 1.
.TP
.B \-tolower
Map vocabulary to lowercase, removing case distinctions.
.TP
.BI \-vocab " file"
Initialize the vocabulary for the LM from
.IR file .
This is useful in conjunction with
.BR \-limit-vocab .
.TP
.BI \-vocab-aliases " file"
Reads vocabulary alias definitions from
.IR file ,
consisting of lines of the form
.nf
\fIalias\fP \fIword\fP
.fi
This causes all tokens
.I alias
to be mapped to
.IR word .
.TP
.BI \-hidden-vocab " file"
Read the list of hidden tags from
.IR file .
Note that the hidden tags must be a subset of the vocabulary contained
in the language model.
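For example, for sentence-boundary tagging the file could contain the
(hypothetical) tags
.nf
    BOUNDARY
    NO-BOUNDARY
.fi
listed one per line, matching the tag names used in the LM training data.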
.TP
.B \-limit-vocab
Discard LM parameters on reading that do not pertain to the words
specified in the vocabulary, either by
.B \-vocab
or
.BR \-hidden-vocab .
The default is that words used in the LM are automatically added to the
vocabulary.
This option can be used to reduce the memory requirements for large LMs
that are going to be evaluated only on a small vocabulary subset.
.TP
.B \-force-event
Forces a non-default event after every word.
This is useful for language models that represent the default event
explicitly with a tag, rather than implicitly by the absence of a tag
between words (which is the default).
.TP
.B \-keep-unk
Do not map unknown input words to the <unk> token.
Instead, output the input word unchanged.
Also, with this option the LM is assumed to be open-vocabulary
(the default is closed-vocabulary).
.TP
.B \-fb
Perform forward-backward decoding of the input token sequence.
Outputs the tags that have the highest posterior probability,
for each position.
The default is to use Viterbi decoding, i.e., the output is the
tag sequence with the highest joint posterior probability.
.TP
.B \-fw-only
Similar to
.BR \-fb ,
but uses only the forward probabilities for computing posteriors.
This may be used to simulate on-line prediction of tags, without the
benefit of future context.
.TP
.B \-continuous
Process all words in the input as one sequence of words, irrespective of
line breaks.
Normally each line is processed separately as a sentence.
Input tokens are output one-per-line, followed by event tokens.
.TP
.B \-posteriors
Output the table of posterior probabilities for each
tag position.
If
.B \-fb
is also specified the posterior probabilities will be computed using
forward-backward probabilities; otherwise an approximation will be used
that is based on the probability of the most likely path containing
a given tag at a given position.
.TP
.B \-totals
Output the total string probability for each input sentence.
If
.B \-fb
is also specified this probability is obtained by summing over all
hidden event sequences; otherwise it is calculated (i.e., underestimated)
using the most probable hidden event sequence.
.TP
.BI \-nbest " N"
Output the
.I N
best hypotheses instead of just the first best when
doing Viterbi search.
If
.IR N >1,
then each hypothesis is prefixed by the tag
.BI NBEST_ n " " x ,
where
.I n
is the rank of the hypothesis in the N-best list and
.I x
its score, the negative log of the combined probability of transitions
and observations of the corresponding HMM path.
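For example, a (hypothetical) invocation requesting the 5 best tag
sequences for each sentence:
.nf
    hidden-ngram -lm tagged.lm -text words.txt -nbest 5
.fi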
.TP
.BI \-write-counts " file"
Write the posterior weighted counts of n-grams, including those
with hidden tags, summed over the entire input data, to
.IR file .
The posterior probabilities should normally be computed with the
forward-backward algorithm (instead of Viterbi), so the
.B \-fb
option is usually also specified.
Only n-grams whose contexts occur in the language model are output.
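For example, a (hypothetical) invocation:
.nf
    hidden-ngram -lm tagged.lm -text words.txt -fb -write-counts counts.out
.fi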
.TP
.BI \-unk-prob " L"
Specifies that unknown words and other words having zero probability in
the language model be assigned a log probability of
.IR L .
This is -100 by default but might be set to 0, e.g., to compute
perplexities excluding unknown words.
.TP
.B \-debug
Sets debugging output level.
.SH BUGS
The
.B \-continuous
and
.B \-text\-map
options effectively disable
.BR \-keep-unk ,
i.e., unknown input words are always mapped to <unk>.
Also,
.B \-continuous
doesn't preserve the positions of escaped input lines relative to
the input.
.br
The dynamic programming for event decoding is not efficiently interleaved
with that required to evaluate class N-gram models;
therefore, the state space generated
in decoding with
.B \-classes
quickly becomes infeasibly large unless
.B \-simple-classes
is also specified.
.PP
The file given by
.B \-classes
is read multiple times if
.B \-limit-vocab
is in effect or if a mixture of LMs is specified.
This will lead to incorrect behavior if the argument of
.B \-classes
is stdin (``-'').
.SH "SEE ALSO"
ngram(1), ngram-count(1), disambig(1), segment(1),
ngram-format(5), classes-format(5).
.br
A. Stolcke et al., ``Automatic Detection of Sentence Boundaries and
Disfluencies based on Recognized Words,''
\fIProc. ICSLP\fP, 2247\-2250, Sydney, 1998.
.SH AUTHORS
Andreas Stolcke <stolcke@icsi.berkeley.edu>,
Anand Venkataraman <anand@speech.sri.com>
.br
Copyright (c) 1998\-2006 SRI International