# LM for WeNet

WeNet uses an n-gram statistical language model together with the WFST framework to support custom language models.
Note that the LM is only supported in the WeNet runtime.

## Motivation

Why an n-gram based LM? This may be the first question many people ask.
Now that LMs based on RNNs and Transformers are in full swing, why does WeNet go backward?
The reason is simple: productivity.
The n-gram based language model has mature and complete training tools,
it can be trained on any amount of corpus, training is very fast, hotfixes are easy,
and it has a wide range of mature applications in real products.

Why WFST? This may be the second question many people ask.
Both industry and research have been working hard to abandon traditional speech recognition,
especially its complex decoding technology, so why does WeNet go back to it?
The reason is also very simple: productivity.
WFST is a standard and powerful tool in traditional speech recognition,
and on top of it we have mature and complete bug-fix and product solutions;
for example, we can use the replace operation in WFST for class-based personalization such as contact recognition.

Therefore, just like WeNet's design goal "Production first and Production Ready",
the LM in WeNet also puts productivity first,
so it draws on many productive tools and solutions accumulated in traditional speech recognition.
The differences from traditional speech recognition are:

1. The training in WeNet is purely end to end.
2. As described below, the LM is optional in decoding; you can choose whether to use an LM according to your needs and application scenario.

## System Design

The whole system is shown in the picture below. There are two ways to generate the N-best list.



1. Without an LM, we use CTC prefix beam search to generate the N-best list.
2. With an LM, we use CTC WFST search to generate the N-best list; CTC WFST search is the traditional WFST based decoder, as illustrated below.
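
To make the choice concrete, here is a hedged sketch of how the two paths are selected from the runtime decoding script. The with-LM command is the aishell command from the "How to use?" section below; the claim that omitting `--fst_path` falls back to CTC prefix beam search is an assumption about `tools/decode.sh`, so check the script in your checkout.

``` sh
# With LM: pass the TLG graph and its word table, so CTC WFST search
# (the traditional WFST decoder) produces the N-best list.
./tools/decode.sh --nj 16 \
  --beam 15.0 --lattice_beam 7.5 --max_active 7000 \
  --blank_skip_thresh 0.98 --ctc_weight 0.5 --rescoring_weight 1.0 \
  --fst_path data/lang_test/TLG.fst \
  data/test/wav.scp data/test/text $dir/final.zip \
  data/lang_test/words.txt $dir/lm_with_runtime

# Without LM (assumption): run the same script without --fst_path, so the
# N-best list comes from CTC prefix beam search instead.
```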

There are two main parts in the CTC WFST based search.

The first part is building the decoding graph, which composes the model unit T, the lexicon L, and the language model G into one unified graph TLG (sketched below), where:

1. T is the model unit in E2E training. Typically it is a char in Chinese, and a char or a BPE unit in English.
2. L is the lexicon. The lexicon is very simple: we just split a word into its modeling unit sequence.
   For example, the word "我们" is split into the two chars "我 们", and the word "APPLE" is split into the five letters "A P P L E".
   As you can see, there are no phonemes, and there is no need to design pronunciations on purpose.
3. G is the language model, namely the n-gram LM compiled into the standard WFST representation.
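
The graph building step can be illustrated with the migrated OpenFst/Kaldi tools. This is only a hedged sketch that assumes T.fst, L.fst, and G.fst have already been generated; the authoritative recipe is tools/fst/compile_lexicon_token_fst.sh plus tools/fst/make_tlg.sh, as used in the "How to use?" section below.

``` sh
# Lexicon entries map a word to its modeling-unit sequence, e.g.
#   我们    我 们
#   APPLE   A P P L E
#
# Hedged sketch of the TLG composition (file names and options are
# illustrative, not the exact contents of make_tlg.sh):
fsttablecompose L.fst G.fst | fstdeterminizestar --use-log=true | \
  fstminimizeencoded > LG.fst
fsttablecompose T.fst LG.fst > TLG.fst
```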

The second part is the decoder, which is the same as a traditional decoder: it uses the standard Viterbi beam search algorithm in decoding.
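
The usual beam search pruning knobs are exposed through the runtime decoding script. The values below are taken from the aishell command in "How to use?"; the descriptions are a hedged summary of their common meaning in Kaldi-style decoders rather than WeNet-specific documentation.

``` sh
# Hedged summary of the Viterbi beam search pruning knobs exposed by the
# runtime decoding script (values from the aishell example below):
beam=15.0          # main search beam: drop paths far behind the current best
lattice_beam=7.5   # beam for the lattice from which the N-best is extracted
max_active=7000    # upper bound on active search states per frame
```
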
## Implementation

WeNet draws on the decoder and related tools in Kaldi to support LM and WFST based decoding.
For ease of use and to keep WeNet independent, we directly migrated the decoding-related code from Kaldi into [this directory](https://github.com/wenet-e2e/wenet/tree/main/runtime/core/kaldi) of the WeNet runtime,
and modified and organized it according to the following principles:

1. To minimize changes, the migrated code keeps the same directory structure as the original.
2. We use GLOG to replace the logging system in Kaldi.
3. We modified the code format to meet the lint requirements of WeNet's code style.

The core code is https://github.com/wenet-e2e/wenet/blob/main/runtime/core/decoder/ctc_wfst_beam_search.cc,
which wraps the LatticeFasterDecoder in Kaldi.
We also use blank frame skipping to speed up decoding (see the --blank_skip_thresh option in the decoding example below).

In addition, WeNet also migrated the related tools for building the decoding graph,
such as arpa2fst, fstdeterminizestar, fsttablecompose, fstminimizeencoded, and others.
So all the LM related tools are built in and can be used out of the box, for example:
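
Below is a hedged sketch of calling the migrated arpa2fst directly to compile a trained ARPA n-gram LM into G.fst; the paths and options are illustrative, and in the recipes this step is wrapped by tools/fst/make_tlg.sh.

``` sh
# Compile an ARPA n-gram LM into its WFST form (G.fst). Paths and options
# are illustrative; the exact invocation inside make_tlg.sh may differ.
arpa2fst --read-symbol-table=data/lang_test/words.txt \
  data/local/lm/lm.arpa data/lang_test/G.fst
```
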
## Results

We get a consistent gain (3%~10%) on different datasets,
including aishell, aishell2, and librispeech;
please go to the corresponding example of each dataset for details.

## How to use?

Here is an example from aishell, which shows how to prepare the dictionary, how to train the LM,
how to build the decoding graph, and how to decode with the runtime.

``` sh
# 7.1 Prepare dict
unit_file=$dict
mkdir -p data/local/dict
cp $unit_file data/local/dict/units.txt
tools/fst/prepare_dict.py $unit_file ${data}/resource_aishell/lexicon.txt \
  data/local/dict/lexicon.txt

# 7.2 Train lm
lm=data/local/lm
mkdir -p $lm
tools/filter_scp.pl data/train/text \
  $data/data_aishell/transcript/aishell_transcript_v0.8.txt > $lm/text
local/aishell_train_lms.sh

# 7.3 Build decoding TLG
tools/fst/compile_lexicon_token_fst.sh \
  data/local/dict data/local/tmp data/local/lang
tools/fst/make_tlg.sh data/local/lm data/local/lang data/lang_test || exit 1;

# 7.4 Decoding with runtime
./tools/decode.sh --nj 16 \
  --beam 15.0 --lattice_beam 7.5 --max_active 7000 \
  --blank_skip_thresh 0.98 --ctc_weight 0.5 --rescoring_weight 1.0 \
  --fst_path data/lang_test/TLG.fst \
  data/test/wav.scp data/test/text $dir/final.zip \
  data/lang_test/words.txt $dir/lm_with_runtime
```