.\" $Id: hidden-ngram.1,v 1.34 2019/09/09 22:35:36 stolcke Exp $
|
|
.TH hidden-ngram 1 "$Date: 2019/09/09 22:35:36 $" "SRILM Tools"
|
|
.SH NAME
|
|
hidden-ngram \- tag hidden events between words
|
|
.SH SYNOPSIS
|
|
.nf
|
|
\fBhidden-ngram\fP [ \fB\-help\fP ] \fIoption\fP ...
|
|
.fi
|
|
.SH DESCRIPTION
|
|
.B hidden-ngram
|
|
tags a stream of word tokens with hidden events occurring between words.
|
|
For example, an unsegmented text could be tagged for sentence boundaries
|
|
(the hidden events in this case being `boundary' and `no-boundary').
|
|
The most likely hidden tag sequence consistent with the given word
|
|
sequence is found according to an N-gram language model over both
|
|
words and hidden tags.
|
|
.PP
|
|
.B hidden-ngram
|
|
is a generalization of
|
|
.BR segment (1).
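.PP
As an illustration, a minimal (hypothetical) invocation for
sentence-boundary tagging, with placeholder file names, could be
.nf
    hidden-ngram -lm tagged.lm -hidden-vocab events.txt -text words.txt
.fi
where \fItagged.lm\fP is an N-gram model trained on text that includes
the hidden tags, and \fIevents.txt\fP lists those tags (see the
corresponding options below).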
.SH OPTIONS
.PP
Each filename argument can be an ASCII file, or a
compressed file (name ending in .Z or .gz), or ``-'' to indicate
stdin/stdout.
.TP
.B \-help
Print option summary.
.TP
.B \-version
Print version information.
.TP
.BI \-text " file"
Specifies the file containing the word sequences to be tagged
(one sentence per line).
Start- and end-of-sentence tags are
.I not
added by the program, but should be included in the input if the
language model uses them.
.TP
.BI \-escape " string"
Set an ``escape string.''
Input lines starting with
.I string
are not processed and are instead passed unchanged to stdout.
This allows associated information to be passed to scoring scripts etc.
.TP
.BI \-text\-map " file"
Read the input words from a map file containing both the words and
additional likelihoods of events following each word.
Each line contains one input word, plus optional hidden-event/likelihood
pairs in the format
.nf
\fIw\fP \fIe1\fP [\fIp1\fP] \fIe2\fP [\fIp2\fP] ...
.fi
If a \fIp\fP value is omitted a likelihood of 1 is assumed.
All events not explicitly listed are given likelihood 0, and are
hence excluded for that word.
In particular, the label
.B *noevent*
must be listed to allow absence of a hidden event.
Input word strings are assembled from multiple lines of
.B \-text\-map
input until either an end-of-sentence token </s> is found, or an escaped
line (see
.BR \-escape )
is encountered.
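For example, assuming a hypothetical hidden tag \fBBOUNDARY\fP, a
map file fragment could look like
.nf
    the     *noevent*
    end     *noevent* 0.4 BOUNDARY 0.6
    begins  *noevent*
.fi
where the second line permits a boundary event after ``end'' with
likelihood 0.6, while the other lines allow no event at all.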
.TP
.B \-logmap
Interpret numeric values in the
.B \-text\-map
file as log probabilities, rather
than probabilities.
.TP
.BI \-lm " file"
Specifies the word/tag language model as a standard ARPA N-gram backoff model
file in
.BR ngram-format (5).
.TP
.BI \-use-server " S"
Use a network LM server (typically implemented by
.BR ngram (1)
with the
.B \-server-port
option) as the main model.
The server specification
.I S
can be an unsigned integer port number (referring to a server port on
the local host),
a hostname (referring to default port 2525 on the named host),
or a string of the form
.IR port @ host ,
where
.I port
is a port number and
.I host
is either a hostname ("dukas.speech.sri.com")
or an IP number in dotted-quad format ("140.44.1.15").
.br
For server-based LMs, the
.B \-order
option limits the context length of N-grams queried by the client
(with 0 denoting unlimited length).
Hence, the effective LM order is the minimum of the client-specified value
and any limit implemented in the server.
.br
When
.B \-use-server
is specified, the arguments to the options
.BR \-mix-lm ,
.BR \-mix-lm2 ,
etc. are also interpreted as network LM server specifications provided
they contain a '@' character and do not contain a '/' character.
This allows the creation of mixtures of several file- and/or
network-based LMs.
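For example, a (hypothetical) invocation using a
.IR port @ host
server specification:
.nf
    hidden-ngram -use-server 2525@lmhost.example.com -text words.txt
.fi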
.TP
.B \-cache-served-ngrams
Enables client-side caching of N-gram probabilities to eliminate duplicate
network queries, in conjunction with
.BR \-use-server .
This can result in a substantial speedup
but requires memory in the client that may grow linearly with the
amount of data processed.
.TP
.BI \-order " n"
Set the effective N-gram order used by the language model to
.IR n .
Default is 3 (use a trigram model).
.TP
.BI \-classes " file"
Interpret the LM as an N-gram over word classes.
The expansions of the classes are given in
.I file
in
.BR classes-format (5).
Tokens in the LM that are not defined as classes in
.I file
are assumed to be plain words, so that the LM can contain mixed N-grams over
both words and word classes.
.TP
.B \-simple-classes
Assume a "simple" class model: each word is a member of at most one word class,
and class expansions are exactly one word long.
.TP
.BI \-mix-lm " file"
Read a second N-gram model for interpolation purposes.
The second and any additional interpolated models can also be class N-grams
(using the same
.B \-classes
definitions).
.TP
.B \-factored
Interpret the files specified by
.BR \-lm ,
.BR \-mix-lm ,
etc. as factored N-gram model specifications.
See
.BR ngram (1)
for more details.
.TP
.BI \-lambda " weight"
Set the weight of the main model when interpolating with
.BR \-mix-lm .
Default value is 0.5.
.TP
.BI \-mix-lm2 " file"
.TP
.BI \-mix-lm3 " file"
.TP
.BI \-mix-lm4 " file"
.TP
.BI \-mix-lm5 " file"
.TP
.BI \-mix-lm6 " file"
.TP
.BI \-mix-lm7 " file"
.TP
.BI \-mix-lm8 " file"
.TP
.BI \-mix-lm9 " file"
Up to 9 more N-gram models can be specified for interpolation.
.TP
.BI \-mix-lambda2 " weight"
.TP
.BI \-mix-lambda3 " weight"
.TP
.BI \-mix-lambda4 " weight"
.TP
.BI \-mix-lambda5 " weight"
.TP
.BI \-mix-lambda6 " weight"
.TP
.BI \-mix-lambda7 " weight"
.TP
.BI \-mix-lambda8 " weight"
.TP
.BI \-mix-lambda9 " weight"
These are the weights for the additional mixture components, corresponding
to
.B \-mix-lm2
through
.BR \-mix-lm9 .
The weight for the
.B \-mix-lm
model is 1 minus the sum of
.B \-lambda
and
.B \-mix-lambda2
through
.BR \-mix-lambda9 .
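For example, with the (hypothetical) option combination
.nf
    -lambda 0.5 -mix-lm lm2 -mix-lm2 lm3 -mix-lambda2 0.2
.fi
the main model receives weight 0.5, \fIlm3\fP receives weight 0.2, and
\fIlm2\fP receives the remaining 1 \- 0.5 \- 0.2 = 0.3.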
.TP
.BI \-context-priors " file"
Read context-dependent mixture weight priors from
.IR file .
Each line in
.I file
should contain a context N-gram (most recent word first) followed by a vector
of mixture weights whose length matches the number of LMs being interpolated.
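For example, when three LMs are interpolated, a (hypothetical) line
.nf
    york new    0.6 0.3 0.1
.fi
assigns the three models the weights 0.6, 0.3, and 0.1 in contexts
ending in ``new york'' (note that the most recent word comes first).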
.TP
.BI \-bayes " length"
Interpolate models using posterior probabilities
based on the likelihoods of local N-gram contexts of length
.IR length .
The
.B \-lambda
values are used as prior mixture weights in this case.
This option can also be combined with
.BR \-context-priors ,
in which case the
.I length
parameter also controls how many words of context are maximally used to look up
mixture weights.
If
.B \-context-priors
is used without
.BR \-bayes ,
the context length used is set by the
.B \-order
option and Bayesian interpolation is disabled, as when
.I scale
(see next) is zero.
.TP
.BI \-bayes-scale " scale"
Set the exponential scale factor on the context likelihood in conjunction
with the
.B \-bayes
function.
Default value is 1.0.
.TP
.BI \-lmw " W"
Scales the language model probabilities by a factor
.IR W .
Default language model weight is 1.
.TP
.BI \-mapw " W"
Scales the likelihood map probability by a factor
.IR W .
Default map weight is 1.
.TP
.B \-tolower
Map vocabulary to lowercase, removing case distinctions.
.TP
.BI \-vocab " file"
Initialize the vocabulary for the LM from
.IR file .
This is useful in conjunction with
.BR \-limit-vocab .
.TP
.BI \-vocab-aliases " file"
Reads vocabulary alias definitions from
.IR file ,
consisting of lines of the form
.nf
\fIalias\fP \fIword\fP
.fi
This causes all tokens
.I alias
to be mapped to
.IR word .
.TP
.BI \-hidden-vocab " file"
Read the list of hidden tags from
.IR file .
Note that the hidden tags must be a subset of the vocabulary contained
in the language model.
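For example, for sentence-boundary tagging the file could contain the
(hypothetical) tags
.nf
    BOUNDARY
    NO-BOUNDARY
.fi
listed one per line, matching the tag names used in the LM training data.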
.TP
.B \-limit-vocab
Discard LM parameters on reading that do not pertain to the words
specified in the vocabulary, either by
.B \-vocab
or
.BR \-hidden-vocab .
The default is that words used in the LM are automatically added to the
vocabulary.
This option can be used to reduce the memory requirements for large LMs
that are going to be evaluated only on a small vocabulary subset.
.TP
.B \-force-event
Forces a non-default event after every word.
This is useful for language models that represent the default event
explicitly with a tag, rather than implicitly by the absence of a tag
between words (which is the default).
.TP
.B \-keep-unk
Do not map unknown input words to the <unk> token.
Instead, output the input word unchanged.
Also, with this option the LM is assumed to be open-vocabulary
(the default is closed-vocabulary).
.TP
.B \-fb
Perform forward-backward decoding of the input token sequence.
Outputs the tags that have the highest posterior probability,
for each position.
The default is to use Viterbi decoding, i.e., the output is the
tag sequence with the highest joint posterior probability.
.TP
.B \-fw-only
Similar to
.BR \-fb ,
but uses only the forward probabilities for computing posteriors.
This may be used to simulate on-line prediction of tags, without the
benefit of future context.
.TP
.B \-continuous
Process all words in the input as one sequence of words, irrespective of
line breaks.
Normally each line is processed separately as a sentence.
Input tokens are output one-per-line, followed by event tokens.
.TP
.B \-posteriors
Output the table of posterior probabilities for each
tag position.
If
.B \-fb
is also specified the posterior probabilities will be computed using
forward-backward probabilities; otherwise an approximation will be used
that is based on the probability of the most likely path containing
a given tag at a given position.
.TP
.B \-totals
Output the total string probability for each input sentence.
If
.B \-fb
is also specified this probability is obtained by summing over all
hidden event sequences; otherwise it is calculated (i.e., underestimated)
using the most probable hidden event sequence.
.TP
.BI \-nbest " N"
Output the
.I N
best hypotheses instead of just the first best when
doing Viterbi search.
If
.IR N >1,
then each hypothesis is prefixed by the tag
.BI NBEST_ n " " x ,
where
.I n
is the rank of the hypothesis in the N-best list and
.I x
its score, the negative log of the combined probability of transitions
and observations of the corresponding HMM path.
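For example, a (hypothetical) invocation requesting the 5 best tag
sequences for each sentence:
.nf
    hidden-ngram -lm tagged.lm -text words.txt -nbest 5
.fi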
.TP
.BI \-write-counts " file"
Write the posterior weighted counts of n-grams, including those
with hidden tags, summed over the entire input data, to
.IR file .
The posterior probabilities should normally be computed with the
forward-backward algorithm (instead of Viterbi), so the
.B \-fb
option is usually also specified.
Only n-grams whose contexts occur in the language model are output.
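For example, a (hypothetical) invocation:
.nf
    hidden-ngram -lm tagged.lm -text words.txt -fb -write-counts counts.out
.fi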
.TP
.BI \-unk-prob " L"
Specifies that unknown words and other words having zero probability in
the language model be assigned a log probability of
.IR L .
This is -100 by default but might be set to 0, e.g., to compute
perplexities excluding unknown words.
.TP
.B \-debug
Sets debugging output level.
.SH BUGS
The
.B \-continuous
and
.B \-text\-map
options effectively disable
.BR \-keep-unk ,
i.e., unknown input words are always mapped to <unk>.
Also,
.B \-continuous
doesn't preserve the positions of escaped input lines relative to
the input.
.br
The dynamic programming for event decoding is not efficiently interleaved
with that required to evaluate class N-gram models;
therefore, the state space generated
in decoding with
.B \-classes
quickly becomes infeasibly large unless
.B \-simple-classes
is also specified.
.PP
The file given by
.B \-classes
is read multiple times if
.B \-limit-vocab
is in effect or if a mixture of LMs is specified.
This will lead to incorrect behavior if the argument of
.B \-classes
is stdin (``-'').
.SH "SEE ALSO"
ngram(1), ngram-count(1), disambig(1), segment(1),
ngram-format(5), classes-format(5).
.br
A. Stolcke et al., ``Automatic Detection of Sentence Boundaries and
Disfluencies based on Recognized Words,''
\fIProc. ICSLP\fP, 2247\-2250, Sydney, 1998.
.SH AUTHORS
Andreas Stolcke <stolcke@icsi.berkeley.edu>,
Anand Venkataraman <anand@speech.sri.com>
.br
Copyright (c) 1998\-2006 SRI International