# LM for WeNet

WeNet uses an n-gram based statistical language model and the WFST framework to support custom language models.
Note that the LM is only supported in the WeNet runtime.

## Motivation

Why an n-gram based LM? This may be the first question many people ask.
Now that LMs based on RNNs and Transformers are in full swing, why does WeNet go backward?
The reason is simple: productivity.
The n-gram based language model has mature and complete training tools,
it can be trained on any amount of corpus, training is very fast, hotfixes are easy,
and it has a wide range of mature applications in real products.

Why WFST? This may be the second question many people ask.
Both industry and research have been working hard to move away from traditional speech recognition,
especially its complex decoding technology, so why does WeNet go back to it?
The reason is also very simple: productivity.
WFST is a standard and powerful tool in traditional speech recognition,
and this solution comes with mature and complete bug-fix and product solutions;
for example, we can use the replace operation in WFST for class-based personalization such as contact recognition.

Therefore, just like WeNet's design goal "Production first and Production Ready",
the LM in WeNet also puts productivity first,
so it draws on many highly productive tools and solutions accumulated in traditional speech recognition.
The differences from traditional speech recognition are:

1. The training in WeNet is purely end to end.
2. As described below, the LM is optional in decoding; you can choose whether to use it according to your needs and application scenarios.

## System Design

The whole system is shown in the following picture. There are two ways to generate the N-best:

1. Without LM, we use CTC prefix beam search to generate the N-best.
2. With LM, we use CTC WFST search to generate the N-best; CTC WFST search is the traditional WFST based decoder. A sketch of how the two modes are selected is given below.
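
As a rough sketch of how the two modes are selected at decoding time, the only difference is whether a TLG graph is supplied. This assumes (and it is only an assumption) that `tools/decode.sh` switches to CTC WFST search when a graph is passed via `--fst_path` and falls back to CTC prefix beam search otherwise; the output directories are illustrative.

``` sh
# With LM: pass the compiled TLG graph, so CTC WFST search is used.
./tools/decode.sh --fst_path data/lang_test/TLG.fst \
    data/test/wav.scp data/test/text $dir/final.zip \
    data/lang_test/words.txt $dir/with_lm

# Without LM: omit --fst_path, so CTC prefix beam search is used.
# We assume the unit table units.txt serves as the symbol table in this mode.
./tools/decode.sh \
    data/test/wav.scp data/test/text $dir/final.zip \
    data/local/dict/units.txt $dir/without_lm
```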

There are two main parts of the CTC WFST based search.

The first is building the decoding graph, which composes the model unit T, the lexicon L, and the language model G into one unified graph TLG (see the sketch after this list), where:
1. T is the model unit used in E2E training. Typically it is a char in Chinese, and a char or BPE unit in English.
2. L is the lexicon. The lexicon is very simple: we just split a word into its modeling unit sequence.
For example, the word "我们" is split into the two chars "我 们", and the word "APPLE" is split into the five letters "A P P L E".
Note that there are no phonemes, so there is no need to design pronunciations by hand.
3. G is the language model, namely the n-gram LM compiled into the standard WFST representation.
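
The sketch below illustrates both the lexicon format and the composition. It is a simplified, assumed view of what `tools/fst/make_tlg.sh` (used in "How to use?" below) does: the tool names come from the migrated Kaldi tools listed in the Implementation section, but the exact flags and file paths here are illustrative.

``` sh
# Illustrative lexicon.txt entries: each word maps to its modeling-unit sequence.
#   我们 我 们
#   APPLE A P P L E

# G: compile the n-gram LM (ARPA format) into a WFST.
arpa2fst --read-symbol-table=data/lang_test/words.txt \
    data/local/lm/lm.arpa data/lang_test/G.fst

# LG: compose the lexicon with the LM, then optimize the result.
fsttablecompose data/lang_test/L.fst data/lang_test/G.fst | \
    fstdeterminizestar --use-log=true | \
    fstminimizeencoded > data/lang_test/LG.fst

# TLG: compose the token (model unit) FST with LG.
fsttablecompose data/lang_test/T.fst data/lang_test/LG.fst > data/lang_test/TLG.fst
```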

The second part is the decoder, which is the same as the traditional decoder and uses the standard Viterbi beam search algorithm in decoding.

## Implementation

WeNet draws on the decoder and related tools in Kaldi to support LM and WFST based decoding.
For ease of use and to keep WeNet self-contained, we directly migrated the decoding-related code from Kaldi to [this directory](https://github.com/wenet-e2e/wenet/tree/main/runtime/core/kaldi) in the WeNet runtime,
and modified and organized it according to the following principles:
1. To minimize changes, the migrated code keeps the same directory structure as the original.
2. We use GLOG to replace the logging system in Kaldi.
3. We modified the code format to meet the lint requirements of the WeNet code style.

The core code is [ctc_wfst_beam_search.cc](https://github.com/wenet-e2e/wenet/blob/main/runtime/core/decoder/ctc_wfst_beam_search.cc),
which wraps the LatticeFasterDecoder in Kaldi,
and we use blank frame skipping to speed up decoding.

In addition, WeNet also migrated the related tools for building the decoding graph,
such as arpa2fst, fstdeterminizestar, fsttablecompose, fstminimizeencoded, and other tools,
so all the LM related tools are built in and can be used out of the box.

## Results

We get consistent gains (3%~10%) on different datasets,
including aishell, aishell2, and librispeech;
please refer to the corresponding example for details.

## How to use?

Here is an example from aishell, which shows how to prepare the dictionary, how to train the LM,
how to build the graph, and how to decode with the runtime.

``` sh
# 7.1 Prepare dict
unit_file=$dict
mkdir -p data/local/dict
cp $unit_file data/local/dict/units.txt
tools/fst/prepare_dict.py $unit_file ${data}/resource_aishell/lexicon.txt \
    data/local/dict/lexicon.txt
# 7.2 Train lm
lm=data/local/lm
mkdir -p $lm
tools/filter_scp.pl data/train/text \
     $data/data_aishell/transcript/aishell_transcript_v0.8.txt > $lm/text
local/aishell_train_lms.sh
# 7.3 Build decoding TLG
tools/fst/compile_lexicon_token_fst.sh \
    data/local/dict data/local/tmp data/local/lang
tools/fst/make_tlg.sh data/local/lm data/local/lang data/lang_test || exit 1;
# 7.4 Decoding with runtime
./tools/decode.sh --nj 16 \
    --beam 15.0 --lattice_beam 7.5 --max_active 7000 \
    --blank_skip_thresh 0.98 --ctc_weight 0.5 --rescoring_weight 1.0 \
    --fst_path data/lang_test/TLG.fst \
    data/test/wav.scp data/test/text $dir/final.zip \
    data/lang_test/words.txt $dir/lm_with_runtime
```
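
As a usage note tied to the motivation above, an LM hotfix only requires steps 7.2 and 7.3 to be re-run; the E2E model itself stays untouched. Below is a minimal sketch assuming the same directory layout as above; `extra_corpus.txt` is purely an illustrative file name.

``` sh
# Hypothetical LM hotfix: append new text, retrain the n-gram LM, rebuild TLG.
cat extra_corpus.txt >> data/local/lm/text   # extra_corpus.txt is illustrative only
local/aishell_train_lms.sh                   # 7.2 retrain the n-gram LM
tools/fst/make_tlg.sh data/local/lm data/local/lang data/lang_test   # 7.3 rebuild TLG
# Then re-run step 7.4 with the refreshed data/lang_test/TLG.fst.
```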
