<! $Id: wlat-format.5,v 1.10 2019/02/06 09:53:12 stolcke Exp $>
<HTML>
<HEADER>
<TITLE>wlat-format</TITLE>
<BODY>
<H1>wlat-format</H1>
<H2> NAME </H2>
wlat-format - File format for SRILM word posterior lattices
<H2> SYNOPSIS </H2>
Word lattices:
<PRE>
<B>version 2</B>
<B>name</B> <I>s</I>
<B>initial</B> <I>i</I>
<B>final</B> <I>f</I>
<B>node</B> <I>n</I> <I>w</I> <I>a</I> <I>p</I> <I>n1</I> <I>p1</I> <I>n2</I> <I>p2</I> ...
...
</PRE>
<P>
Word meshes (confusion networks):
<PRE>
<B>name</B> <I>s</I>
<B>numaligns</B> <I>N</I>
<B>posterior</B> <I>P</I>
<B>align</B> <I>a</I> <I>w1</I> <I>p1</I> <I>w2</I> <I>p2</I> ...
<B>reference</B> <I>a</I> <I>w</I>
<B>hyps</B> <I>a</I> <I>w</I> <I>h1</I> <I>h2</I> ...
<B>info</B> <I>a</I> <I>w</I> <I>start</I> <I>dur</I> <I>ascore</I> <I>gscore</I> <I>phones</I> <I>phonedurs</I>
<B>time</B> <I>a</I> <I>t</I>
...
</PRE>
<H2> DESCRIPTION </H2>
Word posterior lattices and meshes are lattices generated by aligning
N-best hypotheses with
<A HREF="nbest-lattice.1.html">nbest-lattice(1)</A>,
or by aligning PFSG or HTK lattices with
<A HREF="lattice-tool.1.html">lattice-tool(1)</A>.
They compactly encode possible word hypothesis sequences and their
posterior probabilities.
(Word meshes have become generally known as ``confusion networks'' or
``sausages.'')
<P>
A word lattice is a partially ordered directed graph with nodes representing
word hypotheses.
Nodes are identified by non-negative integers.
The file format specifies the initial node
<I>i</I>,
the final node
<I>f</I>,
and any number of additional nodes
<I>n</I>.
For each node
<I> n </I>
the following associated information is given on the same line:
the word identity
<I> w </I>
(the string ``NULL'' is used with initial and final nodes),
the alignment position
<I> a </I>
(identical values in this field identify hypotheses that occur at the
same position),
and the word posterior probability
<I>p</I>.
Following these values, zero or more transitions to successor nodes
are specified, each given by the node index
<I> ni </I>
and the transition posterior probability
<I>pi</I>.
In a properly normalized word lattice the transition posteriors
<I> pi </I>
sum up to the node posterior
<I>p</I>.
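<P>
The node-line layout described above can be read with a few lines of code.
The following is a minimal sketch in Python, not part of SRILM; the names
<I>LatticeNode</I>, <I>parse_wlat</I>, <I>unnormalized_nodes</I> and the sample
lattice are hypothetical and chosen for illustration only.
It assumes whitespace-separated fields, integer alignment positions, and that
the normalization check does not apply to the final node, which has no
outgoing transitions.

```python
from dataclasses import dataclass, field

# Hypothetical sample lattice, written by hand to follow the SYNOPSIS layout.
SAMPLE_WLAT = """\
version 2
name utt1
initial 0
final 3
node 0 NULL 0 1 1 0.6 2 0.4
node 1 hello 1 0.6 3 0.6
node 2 yellow 1 0.4 3 0.4
node 3 NULL 2 1
"""


@dataclass
class LatticeNode:
    word: str            # word identity w ("NULL" for initial/final nodes)
    align: int           # alignment position a (assumed to be an integer)
    posterior: float     # word posterior p
    transitions: list = field(default_factory=list)  # (successor, posterior)


def parse_wlat(text):
    """Parse a word lattice; returns (name, initial, final, nodes)."""
    name = initial = final = None
    nodes = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        key, args = fields[0], fields[1:]
        if key == "name":
            name = args[0]
        elif key == "initial":
            initial = int(args[0])
        elif key == "final":
            final = int(args[0])
        elif key == "node":
            node = LatticeNode(args[1], int(args[2]), float(args[3]))
            rest = args[4:]
            # Remaining fields alternate: successor index, transition posterior.
            for succ, prob in zip(rest[0::2], rest[1::2]):
                node.transitions.append((int(succ), float(prob)))
            nodes[int(args[0])] = node
        # Other lines (e.g. "version 2") are ignored in this sketch.
    return name, initial, final, nodes


def unnormalized_nodes(nodes, final, tol=1e-6):
    """Return non-final nodes whose transition posteriors do not sum to p."""
    bad = []
    for index, node in nodes.items():
        if index == final:
            continue  # assumption: the final node has no outgoing transitions
        total = sum(prob for _, prob in node.transitions)
        if abs(total - node.posterior) > tol:
            bad.append(index)
    return bad


name, initial, final, nodes = parse_wlat(SAMPLE_WLAT)
print(name, initial, final, len(nodes))
print(unnormalized_nodes(nodes, final))
```

An empty list from <I>unnormalized_nodes</I> indicates that every non-final
node satisfies the normalization condition within the chosen tolerance.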
<P>
Word meshes represent a more constrained lattice format in which
word hypotheses are in a total order.
A mesh contains a number of alignment positions, and a set of
mutually exclusive word hypotheses in each position (the ``confusion sets'').
The word mesh represents all sentence hypotheses that can be
generated by freely combining word hypotheses at each position.
The file format specifies the number of alignment positions
<I>N</I>
and the total posterior probability mass
<I> P </I>
contained in the lattice,
followed by one or more confusion set specifications.
For each alignment position
<I>a</I>,
the hypothesized words
<I> wi </I>
and their posterior probabilities
<I> pi </I>
are listed in alternation.
The pseudo-word string
<B> *DELETE* </B>
represents an empty hypothesis.
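<P>
As an illustration of the confusion-set layout, the sketch below reads the
<B>align</B> lines of a word mesh and picks the highest-posterior word at each
position, dropping positions where <B>*DELETE*</B> wins.
It is a minimal Python sketch, not SRILM code; <I>parse_mesh</I>,
<I>best_hypothesis</I> and the sample mesh are hypothetical, and it assumes
that sorting alignment positions numerically reproduces the mesh's total order.

```python
# Hypothetical sample mesh, written by hand to follow the SYNOPSIS layout.
SAMPLE_MESH = """\
name utt1
numaligns 3
posterior 1
align 0 the 0.9 a 0.1
align 1 *DELETE* 0.7 big 0.3
align 2 cat 0.6 hat 0.4
"""


def parse_mesh(text):
    """Parse the header and align lines of a word mesh into a dict."""
    mesh = {"name": None, "numaligns": None, "posterior": None, "aligns": {}}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        key, args = fields[0], fields[1:]
        if key == "name":
            mesh["name"] = args[0]
        elif key == "numaligns":
            mesh["numaligns"] = int(args[0])
        elif key == "posterior":
            mesh["posterior"] = float(args[0])
        elif key == "align":
            rest = args[1:]
            # Remaining fields alternate: word, posterior probability.
            mesh["aligns"][int(args[0])] = {
                word: float(prob) for word, prob in zip(rest[0::2], rest[1::2])
            }
        # reference, hyps, info and time lines are skipped in this sketch.
    return mesh


def best_hypothesis(mesh):
    """Pick the top word per position; *DELETE* contributes no word."""
    words = []
    for position in sorted(mesh["aligns"]):
        confusion_set = mesh["aligns"][position]
        best = max(confusion_set, key=confusion_set.get)
        if best != "*DELETE*":
            words.append(best)
    return words


mesh = parse_mesh(SAMPLE_MESH)
print(best_hypothesis(mesh))
```

For the sample above, the empty hypothesis has the highest posterior at
position 1, so the selected word sequence has two words rather than three.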
<P>
Optionally, the word mesh format encodes additional information about
the hypothesis alignment from which it resulted.
The keyword
<B> reference </B>
specifies the correct word
<I> w </I>
that was aligned at position
<I>a</I>.
The keyword
<B> hyps </B>
is used to list the sentence hypotheses of which a certain word
hypothesis was a part.
The word hypothesis is identified by an alignment position
<I> a </I>
and the word string
<I>w</I>,
and is followed by the integer IDs
<I> hi </I>
(typically, the N-best ranks)
of the associated sentence hypotheses.
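<P>
The <B>reference</B> and <B>hyps</B> lines can be collected into two lookup
tables, for example to find which sentence hypotheses contained the correct
word at a given position.
The following Python sketch is illustrative only; the function name
<I>parse_alignment_lines</I> and the sample lines are hypothetical, and the
hypothesis IDs shown are made up.

```python
# Hypothetical sample lines, written by hand to follow the SYNOPSIS layout.
SAMPLE_LINES = """\
reference 0 the
hyps 0 the 1 2 4
hyps 0 a 3
reference 1 big
hyps 1 big 2
"""


def parse_alignment_lines(text):
    """Collect reference words and per-word hypothesis ID lists."""
    references = {}  # alignment position -> reference word
    hyps = {}        # (alignment position, word) -> list of hypothesis IDs
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        key = fields[0]
        if key == "reference":
            references[int(fields[1])] = fields[2]
        elif key == "hyps":
            hyps[(int(fields[1]), fields[2])] = [int(h) for h in fields[3:]]
        # All other keywords are ignored in this sketch.
    return references, hyps


references, hyps = parse_alignment_lines(SAMPLE_LINES)
# Hypotheses that contained the reference word at position 0.
print(hyps[(0, references[0])])
```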
<P>
As another optional element, the word mesh can contain word-level acoustic and
temporal information,
following the keyword
<B>info</B>,
the alignment position
<I>a</I>,
and the word identity
<I>w</I>.
This information is derived by
<A HREF="nbest-lattice.1.html">nbest-lattice(1)</A>
from word- and phone-level backtraces of N-best
hypotheses (as represented in Decipher NBestList2.0 format).
The details of this information are defined in the SRILM class
<B> NBestWordInfo </B>
and are subject to change, but currently include the following:
<I>start</I>:
word start time (in seconds from the beginning of the waveform);
<I>dur</I>:
word duration (in seconds);
<I>ascore</I>:
acoustic model likelihood (log base 10);
<I>gscore</I>:
grammar (LM and pronunciation) score (log base 10);
<I>phones</I>:
sequence of phones in word (separated by colons);
<I>phonedurs</I>:
sequence of phone durations (in numbers of frames, separated by colons).
When word meshes are derived from HTK format lattices, the pronunciation
(<I>phones</I>) field
will consist of the HTK phone alignment information, which encodes both
phone sequence and durations; the phone duration (<I>phonedurs</I>) field in turn is used
to encode the duration model scores, if present.
<B> Note: </B>
The encoded information pertains to the word hypothesis with the highest
posterior probability among all hypotheses of the same word aligned
to a given word mesh position.
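<P>
A single <B>info</B> line can be unpacked into a small record as sketched
below.
This is a minimal Python sketch, not SRILM code: the names <I>WordInfo</I> and
<I>parse_info_line</I> and the sample line are hypothetical.
It assumes the six fields listed above in that order and meshes derived from
N-best lists, so that <I>phonedurs</I> holds integer frame counts; for meshes
derived from HTK lattices the last two fields carry different content and
would need different handling.

```python
from dataclasses import dataclass

# Hypothetical sample line, written by hand to follow the SYNOPSIS layout.
SAMPLE_INFO = "info 2 cat 0.52 0.31 -1234.5 -8.2 k:ae:t 9:14:8"


@dataclass
class WordInfo:
    align: int         # alignment position a
    word: str          # word identity w
    start: float       # start time in seconds
    dur: float         # duration in seconds
    ascore: float      # acoustic model likelihood (log base 10)
    gscore: float      # grammar score (log base 10)
    phones: list       # phone labels
    phone_durs: list   # phone durations in frames (N-best-derived meshes)


def parse_info_line(line):
    """Parse one info line; extra trailing fields are ignored."""
    fields = line.split()
    if len(fields) < 9 or fields[0] != "info":
        raise ValueError("not a complete info line: %r" % line)
    return WordInfo(
        align=int(fields[1]),
        word=fields[2],
        start=float(fields[3]),
        dur=float(fields[4]),
        ascore=float(fields[5]),
        gscore=float(fields[6]),
        phones=fields[7].split(":"),
        phone_durs=[int(d) for d in fields[8].split(":")],
    )


info = parse_info_line(SAMPLE_INFO)
print(info.word, info.phones, sum(info.phone_durs))
print("end time:", info.start + info.dur)
```

Because the field list is subject to change, the sketch reads only the first
six values after the word and ignores anything that follows.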
<P>
The
<B> time </B>
keyword is used for debugging purposes and encodes the estimated timestamp
<I> t </I>
of an alignment position
<I> a </I>
when the input contains backtrace information.
It is ignored when reading in word meshes.
<P>
Both formats optionally encode the associated utterance IDs in the
<B> name </B>
field.
Word lattices and meshes can be converted to PFSG format using
the script
<B>wlat-to-pfsg</B>.
<H2> SEE ALSO </H2>
<A HREF="nbest-lattice.1.html">nbest-lattice(1)</A>, <A HREF="lattice-tool.1.html">lattice-tool(1)</A>,
<A HREF="pfsg-scripts.1.html">pfsg-scripts(1)</A>, <A HREF="pfsg-format.5.html">pfsg-format(5)</A>, <A HREF="nbest-format.5.html">nbest-format(5)</A>.
<BR>
L. Mangu, E. Brill, &amp; A. Stolcke, ``Finding consensus in speech recognition:
word error minimization and other applications of confusion networks,''
<I>Computer Speech and Language</I> 14(4), 373-400, 2000.
<H2> BUGS </H2>
Detailed alignment and acoustic information is so far only implemented
for word meshes, although conceptually it would apply equally to word lattices.
<H2> AUTHOR </H2>
Andreas Stolcke &lt;andreas.stolcke@microsoft.com&gt;
<BR>
Copyright 2001-2011 SRI International
<BR>
Copyright 2011-2019 Microsoft Corp.
</BODY>
</HTML>