<! $Id: wlat-format.5,v 1.10 2019/02/06 09:53:12 stolcke Exp $>
<HTML>
<HEADER>
<TITLE>wlat-format</TITLE>
<BODY>
<H1>wlat-format</H1>
<H2> NAME </H2>
wlat-format - File format for SRILM word posterior lattices
<H2> SYNOPSIS </H2>
Word lattices:
<PRE>
<B>version 2</B>
<B>name</B> <I>s</I>
<B>initial</B> <I>i</I>
<B>final</B> <I>f</I>
<B>node</B> <I>n</I> <I>w</I> <I>a</I> <I>p</I> <I>n1</I> <I>p1</I> <I>n2</I> <I>p2</I> ...
...
</PRE>
<P>
Word meshes (confusion networks):
<PRE>
<B>name</B> <I>s</I>
<B>numaligns</B> <I>N</I>
<B>posterior</B> <I>P</I>
<B>align</B> <I>a</I> <I>w1</I> <I>p1</I> <I>w2</I> <I>p2</I> ...
<B>reference</B> <I>a</I> <I>w</I>
<B>hyps</B> <I>a</I> <I>w</I> <I>h1</I> <I>h2</I> ...
<B>info</B> <I>a</I> <I>w</I> <I>start</I> <I>dur</I> <I>ascore</I> <I>gscore</I> <I>phones</I> <I>phonedurs</I>
<B>time</B> <I>a</I> <I>t</I>
...
</PRE>
<H2> DESCRIPTION </H2>
Word posterior lattices and meshes are lattices generated by aligning
N-best hypotheses with
<A HREF="nbest-lattice.1.html">nbest-lattice(1)</A>,
or by aligning PFSG or HTK lattices with
<A HREF="lattice-tool.1.html">lattice-tool(1)</A>.
They compactly encode possible word hypothesis sequences and their
posterior probabilities.
(Word meshes have become generally known as "confusion networks" or
"sausages.")
<P>
A word lattice is a partially ordered directed graph with nodes representing
word hypotheses.
Nodes are identified by non-negative integers.
The file format specifies the initial node
<I>i</I>,
the final node
<I>f</I>,
and any number of additional nodes
<I>n</I>.
For each node
<I> n </I>
the following associated information is given on the same line:
the word identity
<I> w </I>
(the string "NULL" is used with initial and final nodes),
the alignment position
<I> a </I>
(identical values in this field identify hypotheses that occur at the
same position),
and the word posterior probability
<I>p</I>.
Following these values, zero or more transitions to successor nodes
are specified, each given by the node index
<I> ni </I>
and the transition posterior probability
<I>pi</I>.
In a properly normalized word lattice the transition posteriors
<I> pi </I>
sum up to the node posterior
<I>p</I>.
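<P>
The node lines described above can be read with a few lines of code.
The following is a minimal sketch, not part of SRILM: the function names
<B>parse_word_lattice</B> and <B>unnormalized_nodes</B> and the sample lattice
are hypothetical and invented for illustration.
It assumes whitespace-separated fields as shown in the synopsis, and it skips
nodes without outgoing transitions when checking normalization.

```python
import math

# Hypothetical sample lattice following the synopsis; the words and
# probabilities are invented for illustration only.
SAMPLE_LATTICE = """\
version 2
name utt1
initial 0
final 3
node 0 NULL 0 1 1 0.6 2 0.4
node 1 hello 1 0.6 3 0.6
node 2 hollow 1 0.4 3 0.4
node 3 NULL 2 1
"""


def parse_word_lattice(text):
    """Parse word lattice text into header fields plus a dict of nodes."""
    lattice = {"name": None, "initial": None, "final": None, "nodes": {}}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        keyword = fields[0]
        if keyword == "name" and len(fields) > 1:
            lattice["name"] = fields[1]
        elif keyword in ("initial", "final"):
            lattice[keyword] = int(fields[1])
        elif keyword == "node":
            rest = fields[5:]
            lattice["nodes"][int(fields[1])] = {
                "word": fields[2],
                "align": int(fields[3]),
                "posterior": float(fields[4]),
                # Remaining fields alternate: successor index, transition posterior.
                "transitions": [(int(rest[k]), float(rest[k + 1]))
                                for k in range(0, len(rest), 2)],
            }
    return lattice


def unnormalized_nodes(lattice, tol=1e-6):
    """List nodes whose transition posteriors do not sum to the node posterior.

    Assumption: nodes without outgoing transitions (such as the final
    node) are skipped, since they have nothing to sum.
    """
    bad = []
    for index, node in sorted(lattice["nodes"].items()):
        if not node["transitions"]:
            continue
        total = sum(p for _, p in node["transitions"])
        if not math.isclose(total, node["posterior"], abs_tol=tol):
            bad.append(index)
    return bad


lattice = parse_word_lattice(SAMPLE_LATTICE)
print(lattice["nodes"][0]["transitions"])
print(unnormalized_nodes(lattice))
```

An empty list from <B>unnormalized_nodes</B> indicates that every node with
successors satisfies the normalization property within the given tolerance.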
<P>
Word meshes represent a more constrained lattice format in which
word hypotheses are in a total order.
A mesh contains a number of alignment positions, and a set of
mutually exclusive word hypotheses in each position (the "confusion sets").
The word mesh represents all sentence hypotheses that can be
generated by freely combining word hypotheses at each position.
The file format specifies the number of alignment positions
<I>N</I>
and the total posterior probability mass
<I> P </I>
contained in the lattice,
followed by one or more confusion set specifications.
For each alignment position
<I>a</I>,
the hypothesized words
<I> wi </I>
and their posterior probabilities
<I> pi </I>
are listed in alternation.
The pseudo-word string
<B> *DELETE* </B>
represents an empty hypothesis.
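<P>
As an illustration of the confusion set structure, the sketch below reads the
<B>align</B> lines of a mesh, picks the highest-posterior word at each position,
and drops <B>*DELETE*</B> entries from the result.
It is not SRILM code: the function names and the sample mesh are hypothetical,
and the parser assumes whitespace-separated fields as shown in the synopsis.

```python
import math

DELETE = "*DELETE*"

# Hypothetical sample mesh following the synopsis; the words and
# probabilities are invented for illustration only.
SAMPLE_MESH = """\
name utt1
numaligns 3
posterior 1
align 0 the 0.9 a 0.1
align 1 cat 0.5 cap 0.3 *DELETE* 0.2
align 2 sat 1
"""


def parse_word_mesh(text):
    """Parse the header and align lines of a word mesh; other lines are ignored."""
    mesh = {"name": None, "numaligns": None, "posterior": None, "aligns": {}}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        keyword = fields[0]
        if keyword == "name" and len(fields) > 1:
            mesh["name"] = fields[1]
        elif keyword == "numaligns":
            mesh["numaligns"] = int(fields[1])
        elif keyword == "posterior":
            mesh["posterior"] = float(fields[1])
        elif keyword == "align":
            rest = fields[2:]
            # Remaining fields alternate: word, posterior probability.
            mesh["aligns"][int(fields[1])] = {
                rest[k]: float(rest[k + 1]) for k in range(0, len(rest), 2)
            }
    return mesh


def best_hypothesis(mesh):
    """Pick the highest-posterior word per position, omitting empty hypotheses."""
    words = []
    for position in sorted(mesh["aligns"]):
        confusion_set = mesh["aligns"][position]
        best = max(confusion_set, key=confusion_set.get)
        if best != DELETE:
            words.append(best)
    return words


def count_combinations(mesh):
    """Number of ways to combine one entry from each confusion set."""
    return math.prod(len(s) for s in mesh["aligns"].values())


mesh = parse_word_mesh(SAMPLE_MESH)
print(best_hypothesis(mesh))
print(count_combinations(mesh))
```

The product computed by <B>count_combinations</B> reflects the free combination
of entries across positions; distinct combinations may still yield the same word
string once <B>*DELETE*</B> entries are removed.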
<P>
Optionally, the word mesh format encodes additional information about
the hypothesis alignment from which it resulted.
The keyword
<B> reference </B>
specifies the correct word
<I> w </I>
that was aligned at position
<I>a</I>.
The keyword
<B> hyps </B>
is used to list the sentence hypotheses of which a certain word
hypothesis was a part.
The word hypothesis is identified by an alignment position
<I> a </I>
and the word string
<I>w</I>,
and is followed by the integer IDs
<I> hi </I>
(typically, the N-best ranks)
of the associated sentence hypotheses.
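<P>
The <B>reference</B> and <B>hyps</B> lines can be collected into lookup tables
keyed by alignment position.
The sketch below is hypothetical and not part of SRILM: the function names and
the sample lines are invented, and the fields are assumed to be
whitespace-separated as in the synopsis.

```python
# Hypothetical sample lines following the synopsis; the words and
# hypothesis IDs are invented for illustration only.
SAMPLE_LINES = """\
reference 0 the
hyps 0 the 0 1 3
hyps 0 a 2
reference 1 cat
hyps 1 cat 0 2
"""


def parse_alignment_details(text):
    """Return (references, hyps) parsed from reference and hyps lines.

    references maps position to the reference word;
    hyps maps (position, word) to the list of sentence hypothesis IDs.
    """
    references = {}
    hyps = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "reference":
            references[int(fields[1])] = fields[2]
        elif fields[0] == "hyps":
            hyps[(int(fields[1]), fields[2])] = [int(h) for h in fields[3:]]
    return references, hyps


def hyps_with_reference_word(references, hyps):
    """For each position, list the hypothesis IDs containing the reference word."""
    return {a: hyps.get((a, w), []) for a, w in references.items()}


references, hyps = parse_alignment_details(SAMPLE_LINES)
print(hyps_with_reference_word(references, hyps))
```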
<P>
As another optional element, the word mesh can contain word-level acoustic and
temporal information,
following the keyword
<B>info</B>,
the alignment position
<I>a</I>,
and the word identity
<I>w</I>.
This information is derived by
<A HREF="nbest-lattice.1.html">nbest-lattice(1)</A>
from word- and phone-level backtraces of N-best
hypotheses (as represented in Decipher NBestList2.0 format).
The details of this information are defined in the SRILM class
<B> NBestWordInfo </B>
and subject to change, but currently include the following.
<DL>
<DT><I>start</I>
<DD>
word start time (in seconds from the beginning of the waveform)
<DT><I>dur</I>
<DD>
word duration (in seconds)
<DT><I>ascore</I>
<DD>
acoustic model likelihood (log base 10)
<DT><I>gscore</I>
<DD>
grammar (LM and pronunciation) score (log base 10)
<DT><I>phones</I>
<DD>
sequence of phones in word (separated by colons)
<DT><I>phonedurs</I>
<DD>
sequence of phone durations (in numbers of frames, separated by colons)
</DL>
When word meshes are derived from HTK format lattices, the pronunciation
(<I>phones</I>) field consists of the HTK phone alignment information, which
encodes both the phone sequence and the phone durations; the phone duration
(<I>phonedurs</I>) field is then used to encode the duration model scores,
if present.
<B> Note: </B>
The encoded information pertains to the word hypothesis with the highest
posterior probability among all hypotheses of the same word aligned
to a given word mesh position.
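<P>
A single <B>info</B> line can be split into its fields as sketched below.
This is an illustration only, not SRILM code: the <B>WordInfo</B> class, the
<B>parse_info_line</B> function, and the sample line are hypothetical.
The sketch assumes the field layout listed above, which is subject to change,
and it assumes integer frame counts in the phone duration field, so it applies
to meshes derived from N-best lists rather than from HTK lattices.

```python
from dataclasses import dataclass


@dataclass
class WordInfo:
    position: int     # alignment position a
    word: str         # word identity w
    start: float      # start time in seconds
    dur: float        # duration in seconds
    ascore: float     # acoustic score, log base 10
    gscore: float     # grammar score, log base 10
    phones: list      # phone labels
    phonedurs: list   # phone durations in frames


def parse_info_line(line):
    """Split one info line into a WordInfo record."""
    fields = line.split()
    if len(fields) != 9 or fields[0] != "info":
        raise ValueError("not an info line: " + line)
    return WordInfo(
        position=int(fields[1]),
        word=fields[2],
        start=float(fields[3]),
        dur=float(fields[4]),
        ascore=float(fields[5]),
        gscore=float(fields[6]),
        # Both sequences are colon-separated.
        phones=fields[7].split(":"),
        phonedurs=[int(d) for d in fields[8].split(":")],
    )


# Hypothetical sample line; all values are invented for illustration.
info = parse_info_line("info 1 cat 0.42 0.31 -1234.5 -8.2 k:ae:t 9:12:10")
print(info.word, info.phones, sum(info.phonedurs))
```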
<P>
The
<B> time </B>
keyword is used for debugging purposes and encodes the estimated timestamp
<I> t </I>
of an alignment position
<I> a </I>
when the input contains backtrace information.
It is ignored when reading in word meshes.
<P>
Both formats optionally encode the associated utterance IDs in the
<B> name </B>
field.
Word lattices and meshes can be converted to PFSG format using
the script
<B>wlat-to-pfsg</B>.
<H2> SEE ALSO </H2>
<A HREF="nbest-lattice.1.html">nbest-lattice(1)</A>, <A HREF="lattice-tool.1.html">lattice-tool(1)</A>,
<A HREF="pfsg-scripts.1.html">pfsg-scripts(1)</A>, <A HREF="pfsg-format.5.html">pfsg-format(5)</A>, <A HREF="nbest-format.5.html">nbest-format(5)</A>.
<BR>
L. Mangu, E. Brill, &amp; A. Stolcke, "Finding consensus in speech recognition:
word error minimization and other applications of confusion networks,"
<I>Computer Speech and Language</I> 14(4), 373-400, 2000.
<H2> BUGS </H2>
Detailed alignment and acoustic information is so far only implemented
for word meshes, although conceptually it would apply equally to word lattices.
<H2> AUTHOR </H2>
Andreas Stolcke &lt;andreas.stolcke@microsoft.com&gt;
<BR>
Copyright 2001-2011 SRI International
<BR>
Copyright 2011-2019 Microsoft Corp.
</BODY>
</HTML>