## Tutorial

If you run into any problems while going through this tutorial, please feel free to ask in the GitHub [issues](https://github.com/mobvoi/wenet/issues). Any kind of feedback is appreciated.

### Setup environment

- Clone the repo

```sh
git clone https://github.com/mobvoi/wenet.git
```

- Install Conda

https://docs.conda.io/en/latest/miniconda.html

- Create Conda env

PyTorch 1.6.0 is recommended. We ran into NCCL errors when using 1.7.0 on 2080 Ti cards.

```sh
conda create -n wenet python=3.8
conda activate wenet
pip install -r requirements.txt
conda install pytorch==1.6.0 cudatoolkit=10.1 torchaudio=0.6.0 -c pytorch
```

### First Experiment

We provide a recipe `example/aishell/s0/run.sh` on aishell-1 data.

The recipe is simple and we suggest you run each stage one by one manually and check the result to understand the whole process.

```sh
cd example/aishell/s0
bash run.sh --stage -1 --stop-stage -1
bash run.sh --stage 0 --stop-stage 0
bash run.sh --stage 1 --stop-stage 1
bash run.sh --stage 2 --stop-stage 2
bash run.sh --stage 3 --stop-stage 3
bash run.sh --stage 4 --stop-stage 4
bash run.sh --stage 5 --stop-stage 5
bash run.sh --stage 6 --stop-stage 6
```

You could also just run the whole script:

```sh
bash run.sh --stage -1 --stop-stage 6
```

#### Stage -1: Download data

This stage downloads the aishell-1 data to the local path `$data`. This may take several hours. If you have already downloaded the data, please change the `$data` variable in `run.sh` and start from `--stage 0`.

#### Stage 0: Prepare training data

In this stage, `local/aishell_data_prep.sh` organizes the original aishell-1 data into two files:

* **wav.scp**: each line records two tab-separated columns: `wav_id` and `wav_path`
* **text**: each line records two tab-separated columns: `wav_id` and `text_label`

**wav.scp**
```
BAC009S0002W0122 /export/data/asr-data/OpenSLR/33/data_aishell/wav/train/S0002/BAC009S0002W0122.wav
BAC009S0002W0123 /export/data/asr-data/OpenSLR/33/data_aishell/wav/train/S0002/BAC009S0002W0123.wav
BAC009S0002W0124 /export/data/asr-data/OpenSLR/33/data_aishell/wav/train/S0002/BAC009S0002W0124.wav
BAC009S0002W0125 /export/data/asr-data/OpenSLR/33/data_aishell/wav/train/S0002/BAC009S0002W0125.wav
...
```

**text**
```
BAC009S0002W0122 而对楼市成交抑制作用最大的限购
BAC009S0002W0123 也成为地方政府的眼中钉
BAC009S0002W0124 自六月底呼和浩特市率先宣布取消限购后
BAC009S0002W0125 各地政府便纷纷跟进
...
```

If you want to train using your customized data, just organize the data into two files `wav.scp` and `text`, and start from `stage 1`.
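
For custom data, these two files can be produced with a few lines of Python. The layout below (wav files named `<utt_id>.wav` under a `my_corpus/` directory, and an in-memory dict of transcripts) is only a hypothetical example:

```python
from pathlib import Path

# Hypothetical transcripts: utt_id -> text label.
transcripts = {
    "utt_0001": "而对楼市成交抑制作用最大的限购",
    "utt_0002": "也成为地方政府的眼中钉",
}

wav_dir = Path("my_corpus")  # hypothetical directory of <utt_id>.wav files
with open("wav.scp", "w", encoding="utf-8") as wav_scp, \
     open("text", "w", encoding="utf-8") as text_file:
    for utt_id, label in sorted(transcripts.items()):
        wav_path = (wav_dir / f"{utt_id}.wav").resolve()
        wav_scp.write(f"{utt_id}\t{wav_path}\n")
        text_file.write(f"{utt_id}\t{label}\n")
```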

#### Stage 1: Extract optional CMVN features

`example/aishell/s0` uses raw wav as input and [TorchAudio](https://pytorch.org/audio/stable/index.html) to extract the features just-in-time in the dataloader, so in this step we just copy the training `wav.scp` and `text` files into the `raw_wav/train/` dir.

`tools/compute_cmvn_stats.py` is used to extract global CMVN (cepstral mean and variance normalization) statistics. These statistics are used to normalize the acoustic features. Setting `cmvn=false` will skip this step.
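
Conceptually, global CMVN just normalizes every feature dimension with a mean and standard deviation accumulated over the whole training set. A minimal NumPy sketch of the idea (not the actual `tools/compute_cmvn_stats.py` implementation):

```python
import numpy as np

def compute_global_cmvn(feats):
    """feats: list of (num_frames, feat_dim) arrays, one per training utterance."""
    frames = np.concatenate(feats, axis=0)
    return frames.mean(axis=0), frames.std(axis=0)

def apply_cmvn(feat, mean, std, eps=1e-8):
    # Normalize each feature dimension to roughly zero mean and unit variance.
    return (feat - mean) / (std + eps)
```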

#### Stage 2: Generate label token dictionary

The dict is a map between label tokens (we use characters for Aishell-1) and integer indices.

An example dict is as follows:

```
<blank> 0
<unk> 1
一 2
丁 3
...
龚 4230
龟 4231
<sos/eos> 4232
```

* `<blank>` denotes the blank symbol for CTC.
* `<unk>` denotes the unknown token; any out-of-vocabulary token is mapped to it.
* `<sos/eos>` denotes the start-of-speech and end-of-speech symbols for attention-based encoder-decoder training, and they share the same id.
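
To see how the dict is used, here is a minimal sketch that maps a character-level label to token ids, sending out-of-vocabulary characters to `<unk>` (the dict path is illustrative):

```python
def load_dict(path):
    token2id = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            token, idx = line.split()
            token2id[token] = int(idx)
    return token2id

token2id = load_dict("dict.txt")  # illustrative path to the generated dict
label = "而对楼市成交抑制作用最大的限购"
token_ids = [token2id.get(ch, token2id["<unk>"]) for ch in label]
```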

#### Stage 3: Prepare WeNet data format

This stage generates a single WeNet format file including all the input/output information needed by neural network training/evaluation.

See the generated training feature file in `raw_wav/train/format.data`.

In the WeNet format file, each line records one data sample as seven tab-separated columns. For example, a line looks as follows (tabs replaced with newlines here):

```
utt:BAC009S0764W0121
feat:/export/data/asr-data/OpenSLR/33/data_aishell/wav/test/S0764/BAC009S0764W0121.wav
feat_shape:4.2039375
text:甚至出现交易几乎停滞的情况
token:甚 至 出 现 交 易 几 乎 停 滞 的 情 况
tokenid:2474 3116 331 2408 82 1684 321 47 235 2199 2553 1319 307
token_shape:13,4233
```

`feat_shape` is the duration (in seconds) of the wav.
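
A minimal sketch of parsing one such line into a Python dict (field names follow the example above):

```python
def parse_format_line(line):
    # Each line holds tab-separated "key:value" fields (utt, feat, feat_shape, ...).
    sample = {}
    for field in line.rstrip("\n").split("\t"):
        key, value = field.split(":", 1)
        sample[key] = value
    return sample

with open("raw_wav/train/format.data", encoding="utf-8") as f:
    first = parse_format_line(next(f))
    print(first["utt"], float(first["feat_shape"]))  # utterance id and duration in seconds
```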

#### Stage 4: Neural Network training

The NN model is trained in this step.

- Multi-GPU mode

If using DDP mode for multi-GPU training, we suggest using `dist_backend="nccl"`. If NCCL does not work, try `gloo` or use `torch==1.6.0`.
Set the GPU ids in `CUDA_VISIBLE_DEVICES`. For example, set `export CUDA_VISIBLE_DEVICES="0,1,2,3,6,7"` to use cards 0, 1, 2, 3, 6 and 7.

- Resume training

If your experiment is terminated after running several epochs for some reason (e.g. the GPU was accidentally used by other people and ran out of memory), you can continue training from a checkpoint model. Just find the last finished epoch in `exp/your_exp/`, set `checkpoint=exp/your_exp/$n.pt`, and run `run.sh --stage 4`. Training will then continue from epoch $n+1.
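
Under the hood, each `$n.pt` checkpoint is essentially a saved PyTorch state dict, so resuming amounts to loading it back into the model before training continues. A toy sketch of the idea (the placeholder model and file name are not the recipe's actual code):

```python
import torch
import torch.nn as nn

model = nn.Linear(80, 4233)                 # placeholder model, not the real WeNet model
torch.save(model.state_dict(), "10.pt")     # saving a checkpoint boils down to this
model.load_state_dict(torch.load("10.pt"))  # resuming from epoch 10 boils down to this
```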

- Config

The neural network structure, optimization parameters, loss parameters, and dataset settings are configured in a YAML format file.

In `conf/`, we provide several model configurations such as transformer and conformer. See `conf/train_conformer.yaml` for reference.
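
The training script consumes this file with standard YAML tooling; a minimal sketch of inspecting a config (the top-level key names depend on the recipe's YAML files, so nothing specific is assumed here):

```python
import yaml  # PyYAML

with open("conf/train_conformer.yaml") as f:
    config = yaml.safe_load(f)

print(list(config.keys()))  # e.g. model, dataset and optimizer related sections
```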

- Use TensorBoard

The training takes several hours; the actual time depends on the number and type of your GPU cards. On an 8-card 2080 Ti machine, 50 epochs take less than one day.
You can use TensorBoard to monitor the loss:

```sh
tensorboard --logdir tensorboard/$your_exp_name/ --port 12598 --bind_all
```

#### Stage 5: Recognize wav using the trained model

This stage shows how to recognize a set of wavs as text. It also shows how to do model averaging.

- Average model

If `${average_checkpoint}` is set to `true`, the best `${average_num}` models on the cross-validation set are averaged to produce a boosted model, which is then used for recognition.
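
Averaging here simply means taking the element-wise mean of the parameters of the selected checkpoints. A minimal sketch of the idea (the checkpoint names are hypothetical; the recipe has its own averaging script):

```python
import torch

checkpoints = ["exp/your_exp/28.pt", "exp/your_exp/31.pt", "exp/your_exp/35.pt"]  # hypothetical best-N

avg = None
for path in checkpoints:
    state = torch.load(path, map_location="cpu")
    if avg is None:
        avg = {k: v.clone() for k, v in state.items()}
    else:
        for k in avg:
            avg[k] += state[k]

avg = {k: torch.true_divide(v, len(checkpoints)) for k, v in avg.items()}
torch.save(avg, "exp/your_exp/avg_3.pt")
```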

- Decoding

Recognition is also called decoding or inference. The neural network is applied to the input acoustic feature sequence to output a sequence of text.

Four decoding methods are provided in WeNet:

* `ctc_greedy_search`: encoder + CTC greedy search
* `ctc_prefix_beam_search`: encoder + CTC prefix beam search
* `attention`: encoder + attention-based decoder decoding
* `attention_rescoring`: rescoring the CTC candidates from CTC prefix beam search with the attention-based decoder, using the encoder output

In general, `attention_rescoring` is the best method. Please see the [U2 paper](https://arxiv.org/pdf/2012.05481.pdf) for the details of these algorithms.

`--beam_size` is a tunable parameter: a larger beam size may give better results but also costs more computation.

`--batch_size` can be greater than 1 for the `ctc_greedy_search` and `attention` decoding modes, and must be 1 for the `ctc_prefix_beam_search` and `attention_rescoring` decoding modes.
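
To make the simplest of these concrete, here is a minimal CTC greedy search sketch over per-frame posteriors (an illustration of the algorithm, not WeNet's implementation; the blank id follows the dict above):

```python
import torch

def ctc_greedy_search(log_probs, blank_id=0):
    """log_probs: (num_frames, vocab_size) CTC log-posteriors for one utterance."""
    best = log_probs.argmax(dim=-1).tolist()
    hyp, prev = [], None
    for token in best:
        # Collapse consecutive repeats, then drop blanks.
        if token != prev and token != blank_id:
            hyp.append(token)
        prev = token
    return hyp

log_probs = torch.randn(5, 4233).log_softmax(dim=-1)  # toy 5-frame example
print(ctc_greedy_search(log_probs))
```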

- WER evaluation

`tools/compute-wer.py` calculates the word (or character) error rate of the result. If you run the recipe without any change, you should get a WER of about 5%.
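
The error rate is the edit distance between hypothesis and reference divided by the reference length. A minimal character-level sketch (the hypothesis string is made up for illustration; this is not the actual `tools/compute-wer.py`):

```python
def edit_distance(ref, hyp):
    # Levenshtein distance via a rolling dynamic-programming row.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev_diag, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(dp[j] + 1,              # deletion
                      dp[j - 1] + 1,          # insertion
                      prev_diag + (r != h))   # substitution (or match)
            prev_diag, dp[j] = dp[j], cur
    return dp[-1]

ref = "甚至出现交易几乎停滞的情况"
hyp = "甚至出现交易几乎停止的情况"   # made-up hypothesis with one substituted character
print(edit_distance(ref, hyp) / len(ref))  # per-utterance character error rate
```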

#### Stage 6: Export the trained model

`wenet/bin/export_jit.py` exports the trained model using LibTorch (TorchScript). The exported model files can easily be used for inference in other programming languages such as C++.
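
The exported file can be loaded back with the TorchScript API, for example in Python (or with the LibTorch C++ API); the file name below is only an example of where the recipe might place it:

```python
import torch

model = torch.jit.load("exp/your_exp/final.zip")  # example path to the exported model
model.eval()
```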