<! $Id: Vocab.3,v 1.3 2019/09/09 22:35:37 stolcke Exp $>
<HTML>
<HEADER>
<TITLE>Vocab</TITLE>
<BODY>
<H1>Vocab</H1>
<H2> NAME </H2>
Vocab - Vocabulary indexing for SRILM
<H2> SYNOPSIS </H2>
<PRE>
<B> #include &lt;Vocab.h&gt; </B>
</PRE>
<H2> DESCRIPTION </H2>
The
<B> Vocab </B>
class represents sets of string tokens as typically used for vocabularies,
word class names, etc. Additionally, Vocab provides a mapping from
such string tokens (type <B>VocabString</B>) to integers (type <B>VocabIndex</B>).
VocabIndex values are typically used to index words in language models to
conserve space and speed up comparisons etc. Thus,
<B>Vocab</B> essentially
implements a symbol table into which strings can be ``interned.''
<H2> TYPES </H2>
<DL>
<DT><B> VocabIndex </B>
<DD>
A non-negative integer for representing a string internally.
<DT><B> VocabString </B>
<DD>
A character array representing a vocabulary item (e.g., a word).
</DD>
</DL>
<H2> CONSTANTS </H2>
<DL>
<DT><B> maxWordLength </B>
<DD>
Maximum number of characters in a VocabString.
<DT><B> Vocab_None </B>
<DD>
A special VocabIndex used to denote no vocabulary item and to
terminate VocabIndex arrays.
<DT><B> Vocab_Unknown </B>
<DD>
<DT><B> Vocab_SentStart </B>
<DD>
<DT><B> Vocab_SentEnd </B>
<DD>
<DT><B> Vocab_Pause </B>
<DD>
Default VocabString values for some common, predefined vocabulary items:
unknown word, sentence begin, sentence end, and pause, respectively.
</DD>
</DL>
<H2> CLASS MEMBERS </H2>
<DL>
<DT><B> Vocab(VocabIndex <I>start</I> = 0, VocabIndex <I>end</I> = 0x7fffffff) </B>
<DD>
When initializing a Vocab object,
<I>start</I> and <I>end</I> optionally set the minimum and maximum VocabIndex
values assigned by the vocabulary.
Indices are allocated in increasing order starting at <I>start</I>.
<DT><B> VocabIndex addWord(VocabString <I>name</I>) </B>
<DD>
Looks up the index of a word string <I>name</I>, adding the word if not already
part of the vocabulary.
<DT><B> VocabString getWord(VocabIndex <I>index</I>) </B>
<DD>
Returns the VocabString for <I>index</I>, or 0 if the index isn't defined.
<DT><B> VocabIndex getIndex(VocabString <I>name</I>) </B>
<DD>
Returns the VocabIndex for word <I>name</I>, or
<B> Vocab_None </B>
if the word isn't defined.
(Unlike <B>addWord()</B>,
this will not extend the vocabulary if the word is undefined.)
<DT><B> void remove(VocabString <I>name</I>) </B>
<DD>
<DT><B> void remove(VocabIndex <I>index</I>) </B>
<DD>
Deletes a vocabulary item, either by name or by index.
<DT><B> unsigned int numWords() </B>
<DD>
Returns the number of current vocabulary entries.
<DT><B> VocabIndex highIndex() </B>
<DD>
Returns the highest VocabIndex value assigned so far.
The next word added will receive an index that is one greater.
When allocating various meaningful vocabulary subsets into
contiguous ranges, this function can be used to determine the
corresponding boundaries in VocabIndex space, and then use these
values to test subset membership etc.
<DT><B> VocabIndex unkIndex </B>
<DD>
The index of the unknown word (by default assigned to <B>Vocab_Unknown</B>).
<DT><B> VocabIndex ssIndex </B>
<DD>
The index of the sentence-start tag (by default assigned to <B>Vocab_SentStart</B>).
<DT><B> VocabIndex seIndex </B>
<DD>
The index of the sentence-end tag (by default assigned to <B>Vocab_SentEnd</B>).
<DT><B> VocabIndex pauseIndex </B>
<DD>
The index of the pause tag (by default assigned to <B>Vocab_Pause</B>).
<DT><B> Boolean unkIsWord </B>
<DD>
When <B>true</B>,
the unknown word is considered a regular word (default <B>false</B>).
<DT><B> Boolean toLower </B>
<DD>
When <B>true</B>, all word strings are mapped to lowercase.
This is convenient to combine vocabularies, language models, etc.,
whose vocabularies differ only in the case convention
(default <B>false</B>).
<DT><B> Boolean isNonEvent(VocabString <I>word</I>) </B>
<DD>
<DT><B> Boolean isNonEvent(VocabIndex <I>word</I>) </B>
<DD>
Tests a word string or index for being a ``non-event'', i.e., a
token that is not assigned probability in a language model.
By default, sentence-start, pauses, and unknown words are non-events.
<DT><B> unsigned read(File &<I>file</I>) </B>
<DD>
Reads word strings from a file and adds them to the vocabulary.
For convenience, only the first word on each line is significant
(so extra information could be contained in such a file).
Returns the number of words read.
<DT><B> void write(File &<I>file</I>, Boolean <I>sorted</I> = true) </B>
<DD>
Writes the vocabulary strings to a file in a format compatible with
<B>read()</B>.
The <I>sorted</I> argument controls whether the output is
lexicographically sorted.
</DD>
</DL>
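<P>
As a rough illustration of the interning behavior described for
<B>addWord()</B>, <B>getIndex()</B>, and <B>getWord()</B>, the following
self-contained sketch maps strings to small integers and back.
It does not include Vocab.h: <B>SketchVocab</B>, <B>SketchIndex</B>, and
<B>Sketch_None</B> are hypothetical stand-ins, and the hash-map storage is an
assumption for illustration, not a description of the SRILM implementation.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for VocabIndex and Vocab_None (values assumed).
typedef unsigned int SketchIndex;
const SketchIndex Sketch_None = (SketchIndex)-1;

// Minimal symbol table: every distinct string gets one stable integer.
class SketchVocab {
public:
    // Returns the existing index of name, or assigns the next free one.
    SketchIndex addWord(const std::string &name) {
        auto it = byName.find(name);
        if (it != byName.end()) return it->second;
        SketchIndex idx = (SketchIndex)byIndex.size();
        byIndex.push_back(name);
        byName.emplace(name, idx);
        return idx;
    }

    // Lookup only: never extends the table.
    SketchIndex getIndex(const std::string &name) const {
        auto it = byName.find(name);
        return it == byName.end() ? Sketch_None : it->second;
    }

    // Returns a null pointer for an index that was never assigned.
    const char *getWord(SketchIndex idx) const {
        return idx < byIndex.size() ? byIndex[idx].c_str() : nullptr;
    }

    unsigned numWords() const { return (unsigned)byIndex.size(); }

private:
    std::unordered_map<std::string, SketchIndex> byName;
    // A deque keeps stored strings in place as the table grows, so the
    // pointers handed out by getWord() stay valid.
    std::deque<std::string> byIndex;
};
```

Adding the same string twice yields the same index, so later comparisons
between words reduce to integer comparisons.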
<P>
Often one wants to manipulate not single vocabulary items, but
strings of them, e.g., to represent sentences.
Word strings are represented as self-delimiting arrays of type
<B> VocabString * </B>
or
<B>VocabIndex *</B>.
The last element in a string is 0 or <B>Vocab_None</B>, respectively.
<DL>
<DT><B> unsigned getWords(const VocabIndex *<I>wids</I>, VocabString *<I>words</I>, unsigned <I>max</I>) </B>
<DD>
Extends <B>getWord()</B> to strings of words.
The result is placed in <I>words</I>, which must have room for at least
<I>max</I> words.
Returns the actual number of indices in <I>wids</I>.
<DT><B> unsigned addWords(const VocabString *<I>words</I>, VocabIndex *<I>wids</I>, unsigned <I>max</I>) </B>
<DD>
Extends <B>addWord()</B> to strings of words.
The result is placed in <I>wids</I>, which must have room for at least
<I>max</I> indices.
Returns the actual number of words in <I>words</I>.
<DT><B> unsigned getIndices(const VocabString *<I>words</I>, VocabIndex *<I>wids</I>, unsigned <I>max</I>) </B>
<DD>
Extends <B>getIndex()</B> to strings of words.
The result is placed in <I>wids</I>, which must have room for at least
<I>max</I> indices.
Returns the actual number of words in <I>words</I>.
</DD>
</DL>
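<P>
The self-delimiting array convention can be sketched without Vocab.h as
follows: an index array ends with a sentinel value, and a string array ends
with a null pointer, so no separate length needs to be passed around.
<B>SketchIndex</B>, <B>SketchString</B>, <B>Sketch_None</B>, and
<B>sketchLength()</B> are hypothetical names, and the sentinel value chosen
here is an assumption.

```cpp
#include <cassert>

typedef unsigned int SketchIndex;                 // stand-in for VocabIndex
typedef const char *SketchString;                 // stand-in for VocabString
const SketchIndex Sketch_None = (SketchIndex)-1;  // assumed terminator value

// Counts the indices that precede the terminating Sketch_None.
unsigned sketchLength(const SketchIndex *wids) {
    assert(wids != nullptr);
    unsigned n = 0;
    while (wids[n] != Sketch_None) n++;
    return n;
}

// Counts the strings that precede the terminating null pointer.
unsigned sketchLength(const SketchString *words) {
    assert(words != nullptr);
    unsigned n = 0;
    while (words[n] != nullptr) n++;
    return n;
}
```

A buffer meant to hold up to <I>n</I> items under this convention needs
<I>n</I> + 1 slots, the extra one being for the terminator.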
<H2> FUNCTIONS </H2>
The following static member functions are utilities to manipulate strings of
vocabulary items, independent of a particular vocabulary.
<DL>
<DT><B> unsigned parseWords(char *<I>line</I>, VocabString *<I>words</I>, unsigned <I>max</I>) </B>
<DD>
Parses a character string <I>line</I> into whitespace-delimited words.
On return, <I>words</I> contains pointers to null-terminated substrings of
<I>line</I> (whose contents are modified in the process).
<I>words</I> must have room for at least <I>max</I> pointers.
Returns the actual number of words parsed.
<DT><B> unsigned length(const VocabIndex *<I>words</I>) </B>
<DD>
<DT><B> unsigned length(const VocabString *<I>words</I>) </B>
<DD>
Returns the number of items in a word string.
<DT><B> Boolean contains(const VocabIndex *<I>words</I>, VocabIndex <I>word</I>) </B>
<DD>
Returns <B>true</B> if <I>word</I> occurs among <I>words</I>.
<DT><B> VocabIndex *reverse(VocabIndex *<I>words</I>) </B>
<DD>
<DT><B> VocabString *reverse(VocabString *<I>words</I>) </B>
<DD>
Reverses a string of words in place (and returns it as a result).
<DT><B> void write(File &<I>file</I>, const VocabString *<I>words</I>) </B>
<DD>
Writes a string of space-delimited words to a file.
<DT><B> int compare(VocabIndex <I>word1</I>, VocabIndex <I>word2</I>) </B>
<DD>
<DT><B> int compare(VocabString <I>word1</I>, VocabString <I>word2</I>) </B>
<DD>
Compares two vocabulary items lexicographically.
Returns -1, 0, +1 for less than, equal, or greater than, respectively.
<DT><B> int compare(const VocabIndex *<I>words1</I>, const VocabIndex *<I>words2</I>) </B>
<DD>
<DT><B> int compare(const VocabString *<I>words1</I>, const VocabString *<I>words2</I>) </B>
<DD>
Extends the order of <I>compare()</I> to strings of words.
</DD>
</DL>
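<P>
To make the in-place behavior of <B>parseWords()</B> and <B>reverse()</B>
concrete, here is a minimal sketch that does not depend on Vocab.h.
<B>sketchParseWords()</B>, <B>sketchReverse()</B>, and <B>SketchString</B>
are hypothetical names. What the real function does when a line holds more
than <I>max</I> words is not stated above, so the overflow handling shown
here is an assumption.

```cpp
#include <cassert>
#include <cctype>
#include <cstring>

typedef const char *SketchString;   // stand-in for VocabString

// Splits line in place on whitespace. Stores at most max pointers in words
// and, if room remains, a terminating null pointer (assumed behavior).
unsigned sketchParseWords(char *line, SketchString *words, unsigned max) {
    assert(line != nullptr && words != nullptr);
    unsigned n = 0;
    char *p = line;
    while (n < max) {
        // Skip leading whitespace.
        while (*p != '\0' && std::isspace((unsigned char)*p)) p++;
        if (*p == '\0') break;
        words[n++] = p;                       // start of the next word
        while (*p != '\0' && !std::isspace((unsigned char)*p)) p++;
        if (*p != '\0') *p++ = '\0';          // overwrite the delimiter
    }
    if (n < max) words[n] = nullptr;
    return n;
}

// Reverses a null-terminated word string in place and returns it.
SketchString *sketchReverse(SketchString *words) {
    assert(words != nullptr);
    unsigned len = 0;
    while (words[len] != nullptr) len++;
    if (len > 1) {
        for (unsigned i = 0, j = len - 1; i < j; i++, j--) {
            SketchString tmp = words[i];
            words[i] = words[j];
            words[j] = tmp;
        }
    }
    return words;
}
```

Because the word pointers refer into <I>line</I>, the buffer has to stay
alive and unmodified for as long as the parsed words are in use.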
<P>
For compatibility with the C library calling conventions, <B>compare()</B>
cannot be a member function of a Vocab object.
For index-based comparisons the associated vocabulary needs to be
set globally.
This is achieved by calling the <B>compareIndex()</B> member function
of a Vocab object.
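<P>
The reason for the global setting can be illustrated with a plain C library
sort: a <B>qsort()</B> comparator receives only the two elements, so any
lookup table it needs must be reachable some other way.
The sketch below is self-contained and does not use Vocab.h;
<B>sketchCompareTable</B>, <B>sketchUseForCompare()</B>, and
<B>sketchCompare()</B> are hypothetical names, and the actual signatures in
SRILM may differ.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <vector>

typedef unsigned int SketchIndex;   // stand-in for VocabIndex

// Hypothetical global: the word table consulted by index comparisons.
static const std::vector<const char *> *sketchCompareTable = nullptr;

// Selects the table used by all subsequent index comparisons.
void sketchUseForCompare(const std::vector<const char *> &table) {
    sketchCompareTable = &table;
}

// qsort()-style comparator: orders indices by the strings they stand for.
int sketchCompare(const void *a, const void *b) {
    assert(sketchCompareTable != nullptr);
    SketchIndex i = *(const SketchIndex *)a;
    SketchIndex j = *(const SketchIndex *)b;
    int c = std::strcmp((*sketchCompareTable)[i], (*sketchCompareTable)[j]);
    return (c > 0) - (c < 0);       // normalize to -1, 0, +1
}
```

With a single global table, comparisons that involve two different
vocabularies at once have to reselect the table between calls.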
<DL>
<DT><B> ostream &operator&lt;&lt; (ostream &, const VocabString *<I>words</I>) </B>
<DD>
<DT><B> ostream &operator&lt;&lt; (ostream &, const VocabIndex *<I>words</I>) </B>
<DD>
These operators output strings of words to a stream.
For the second variant, the Vocab object used for interpreting indices
needs to be identified globally by calling the <I>use()</I> member function
on the object.
</DD>
</DL>
<H2> ITERATORS </H2>
The
<B> VocabIter </B>
class provides iteration over vocabularies.
An iteration returns the elements of a Vocab in some unspecified,
but deterministic order.
<P>
When copied or used in initialization of other objects,
VocabIter objects retain the current ``position'' in an iteration.
This allows nested iterations that enumerate all pairs of distinct elements,
etc.
<P>
NOTE: While an iteration over a Vocab object is ongoing, no modifications
are allowed to the object, <I>except</I> removal of the
``current'' vocabulary item.
<DL>
<DT><B> VocabIter(Vocab &<I>vocab</I>, Boolean <I>sorted</I> = false) </B>
<DD>
Creates an iteration over <I>vocab</I>.
If <I>sorted</I> is set to <B>true</B> the vocabulary items will
be enumerated in lexicographic order.
<DT><B> void init() </B>
<DD>
Reinitializes the iteration to its beginning.
<DT><B> VocabString next() </B>
<DD>
<DT><B> VocabString next(VocabIndex &<I>index</I>) </B>
<DD>
Steps the iteration and returns the next word string.
Optionally, the associated word index is returned in <I>index</I>.
Returns 0 if the vocabulary is exhausted.
</DD>
</DL>
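<P>
The iteration protocol (call <B>next()</B> until it returns 0, and copy an
iterator to fork the current position) can be sketched as follows.
This is a self-contained approximation over a fixed word list, not the
VocabIter class itself; <B>SketchWords</B>, <B>SketchIter</B>, and
<B>sketchPairs()</B> are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a vocabulary: a fixed list of words.
typedef std::vector<const char *> SketchWords;

class SketchIter {
public:
    explicit SketchIter(const SketchWords &w) : words(&w), pos(0) {}

    // Restarts the iteration from the beginning.
    void init() { pos = 0; }

    // Returns the next word, or a null pointer once the list is exhausted.
    const char *next() {
        return pos < words->size() ? (*words)[pos++] : nullptr;
    }

private:
    const SketchWords *words;
    std::size_t pos;   // copied along with the iterator
};

// Collects every unordered pair of distinct words as "a-b".
std::vector<std::string> sketchPairs(const SketchWords &w) {
    std::vector<std::string> out;
    SketchIter outer(w);
    while (const char *a = outer.next()) {
        SketchIter inner(outer);   // the copy resumes right after a
        while (const char *b = inner.next()) {
            out.push_back(std::string(a) + "-" + b);
        }
    }
    return out;
}
```

Copying the outer iterator is what lets the inner loop visit only the
elements that follow the current one, so each pair is produced once.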
<H2> SEE ALSO </H2>
<A HREF="LM.3.html">LM(3)</A>, <A HREF="File.3.html">File(3)</A>
<H2> BUGS </H2>
There is no good way to synchronize VocabIndex values across
multiple Vocab objects.
<H2> AUTHOR </H2>
Andreas Stolcke &lt;stolcke@icsi.berkeley.edu&gt;
<BR>
Copyright (c) 1995-1996 SRI International
</BODY>
</HTML>