## Tutorial

If you run into any problems while going through this tutorial, please feel free to ask in the GitHub [issues](https://github.com/mobvoi/wenet/issues). Any kind of feedback is appreciated.

### Setup environment

- Clone the repo

```sh
git clone https://github.com/mobvoi/wenet.git
```

- Install Conda

https://docs.conda.io/en/latest/miniconda.html

- Create Conda env

PyTorch 1.6.0 is recommended. We ran into NCCL errors when using 1.7.0 on 2080 Ti cards.

```sh
conda create -n wenet python=3.8
conda activate wenet
pip install -r requirements.txt
conda install pytorch==1.6.0 cudatoolkit=10.1 torchaudio=0.6.0 -c pytorch
```

### First Experiment

We provide a recipe `example/aishell/s0/run.sh` on aishell-1 data.

The recipe is simple and we suggest you run each stage one by one manually and check the result to understand the whole process.

```sh
cd example/aishell/s0
bash run.sh --stage -1 --stop-stage -1
bash run.sh --stage 0 --stop-stage 0
bash run.sh --stage 1 --stop-stage 1
bash run.sh --stage 2 --stop-stage 2
bash run.sh --stage 3 --stop-stage 3
bash run.sh --stage 4 --stop-stage 4
bash run.sh --stage 5 --stop-stage 5
bash run.sh --stage 6 --stop-stage 6
```

You could also just run the whole script:

```sh
bash run.sh --stage -1 --stop-stage 6
```

#### Stage -1: Download data

This stage downloads the aishell-1 data to the local path `$data`. This may take several hours. If you have already downloaded the data, please change the `$data` variable in `run.sh` and start from `--stage 0`.

#### Stage 0: Prepare training data

In this stage, `local/aishell_data_prep.sh` organizes the original aishell-1 data into two files:

* **wav.scp**: each line records two tab-separated columns: `wav_id` and `wav_path`
* **text**: each line records two tab-separated columns: `wav_id` and `text_label`

**wav.scp**
```
BAC009S0002W0122 /export/data/asr-data/OpenSLR/33/data_aishell/wav/train/S0002/BAC009S0002W0122.wav
BAC009S0002W0123 /export/data/asr-data/OpenSLR/33/data_aishell/wav/train/S0002/BAC009S0002W0123.wav
BAC009S0002W0124 /export/data/asr-data/OpenSLR/33/data_aishell/wav/train/S0002/BAC009S0002W0124.wav
BAC009S0002W0125 /export/data/asr-data/OpenSLR/33/data_aishell/wav/train/S0002/BAC009S0002W0125.wav
...
```

**text**
```
BAC009S0002W0122 而对楼市成交抑制作用最大的限购
BAC009S0002W0123 也成为地方政府的眼中钉
BAC009S0002W0124 自六月底呼和浩特市率先宣布取消限购后
BAC009S0002W0125 各地政府便纷纷跟进
...
```

If you want to train using your customized data, just organize the data into two files `wav.scp` and `text`, and start from `stage 1`.
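
For custom data, these two files can be produced with a few lines of Python. The layout below (wav files named `<utt_id>.wav` under a `my_corpus/` directory, and an in-memory dict of transcripts) is only a hypothetical example:

```python
from pathlib import Path

# Hypothetical transcripts: utt_id -> text label.
transcripts = {
    "utt_0001": "而对楼市成交抑制作用最大的限购",
    "utt_0002": "也成为地方政府的眼中钉",
}

wav_dir = Path("my_corpus")  # hypothetical directory of <utt_id>.wav files
with open("wav.scp", "w", encoding="utf-8") as wav_scp, \
     open("text", "w", encoding="utf-8") as text_file:
    for utt_id, label in sorted(transcripts.items()):
        wav_path = (wav_dir / f"{utt_id}.wav").resolve()
        wav_scp.write(f"{utt_id}\t{wav_path}\n")
        text_file.write(f"{utt_id}\t{label}\n")
```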

#### Stage 1: Extract optional CMVN features

`example/aishell/s0` uses raw wav as input and [TorchAudio](https://pytorch.org/audio/stable/index.html) to extract the features just-in-time in the dataloader, so in this step we just copy the training `wav.scp` and `text` files into the `raw_wav/train/` dir.

`tools/compute_cmvn_stats.py` is used to extract global CMVN (cepstral mean and variance normalization) statistics. These statistics are used to normalize the acoustic features. Setting `cmvn=false` will skip this step.
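
Conceptually, global CMVN just normalizes every feature dimension with a mean and standard deviation accumulated over the whole training set. A minimal NumPy sketch of the idea (not the actual `tools/compute_cmvn_stats.py` implementation):

```python
import numpy as np

def compute_global_cmvn(feats):
    """feats: list of (num_frames, feat_dim) arrays, one per training utterance."""
    frames = np.concatenate(feats, axis=0)
    return frames.mean(axis=0), frames.std(axis=0)

def apply_cmvn(feat, mean, std, eps=1e-8):
    # Normalize each feature dimension to roughly zero mean and unit variance.
    return (feat - mean) / (std + eps)
```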

#### Stage 2: Generate label token dictionary

The dict is a map between label tokens (we use characters for Aishell-1) and integer indices.

An example dict is as follows:

```
<blank> 0
<unk> 1
一 2
丁 3
...
龚 4230
龟 4231
<sos/eos> 4232
```

* `<blank>` denotes the blank symbol for CTC.
* `<unk>` denotes the unknown token; any out-of-vocabulary token is mapped to it.
* `<sos/eos>` denotes the start-of-speech and end-of-speech symbols for attention-based encoder-decoder training, and they share the same id.
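
To see how the dict is used, here is a minimal sketch that maps a character-level label to token ids, sending out-of-vocabulary characters to `<unk>` (the dict path is illustrative):

```python
def load_dict(path):
    token2id = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            token, idx = line.split()
            token2id[token] = int(idx)
    return token2id

token2id = load_dict("dict.txt")  # illustrative path to the generated dict
label = "而对楼市成交抑制作用最大的限购"
token_ids = [token2id.get(ch, token2id["<unk>"]) for ch in label]
```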

#### Stage 3: Prepare WeNet data format

This stage generates a single WeNet format file including all the input/output information needed by neural network training/evaluation.

See the generated training feature file in `raw_wav/train/format.data`.

In the WeNet format file, each line records one data sample as seven tab-separated columns. For example, a line looks as follows (tabs replaced with newlines here):

```
utt:BAC009S0764W0121
feat:/export/data/asr-data/OpenSLR/33/data_aishell/wav/test/S0764/BAC009S0764W0121.wav
feat_shape:4.2039375
text:甚至出现交易几乎停滞的情况
token:甚 至 出 现 交 易 几 乎 停 滞 的 情 况
tokenid:2474 3116 331 2408 82 1684 321 47 235 2199 2553 1319 307
token_shape:13,4233
```

`feat_shape` is the duration (in seconds) of the wav.
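
A minimal sketch of parsing one such line into a Python dict (field names follow the example above):

```python
def parse_format_line(line):
    # Each line holds tab-separated "key:value" fields (utt, feat, feat_shape, ...).
    sample = {}
    for field in line.rstrip("\n").split("\t"):
        key, value = field.split(":", 1)
        sample[key] = value
    return sample

with open("raw_wav/train/format.data", encoding="utf-8") as f:
    first = parse_format_line(next(f))
    print(first["utt"], float(first["feat_shape"]))  # utterance id and duration in seconds
```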

#### Stage 4: Neural Network training

The NN model is trained in this step.

- Multi-GPU mode

If using DDP mode for multi-GPU training, we suggest using `dist_backend="nccl"`. If NCCL does not work, try `gloo` or use `torch==1.6.0`.
Set the GPU ids in `CUDA_VISIBLE_DEVICES`. For example, set `export CUDA_VISIBLE_DEVICES="0,1,2,3,6,7"` to use cards 0, 1, 2, 3, 6 and 7.

- Resume training

If your experiment is terminated after running several epochs for some reason (e.g. the GPU was accidentally used by other people and ran out of memory), you can continue training from a checkpoint model. Just find the last finished epoch in `exp/your_exp/`, set `checkpoint=exp/your_exp/$n.pt`, and run `run.sh --stage 4`. Training will then continue from epoch $n+1.
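
Under the hood, each `$n.pt` checkpoint is essentially a saved PyTorch state dict, so resuming amounts to loading it back into the model before training continues. A toy sketch of the idea (the placeholder model and file name are not the recipe's actual code):

```python
import torch
import torch.nn as nn

model = nn.Linear(80, 4233)                 # placeholder model, not the real WeNet model
torch.save(model.state_dict(), "10.pt")     # saving a checkpoint boils down to this
model.load_state_dict(torch.load("10.pt"))  # resuming from epoch 10 boils down to this
```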

- Config

The neural network structure, optimization parameters, loss parameters, and dataset settings are configured in a YAML format file.

In `conf/`, we provide several model configurations such as transformer and conformer. See `conf/train_conformer.yaml` for reference.
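
The training script consumes this file with standard YAML tooling; a minimal sketch of inspecting a config (the top-level key names depend on the recipe's YAML files, so nothing specific is assumed here):

```python
import yaml  # PyYAML

with open("conf/train_conformer.yaml") as f:
    config = yaml.safe_load(f)

print(list(config.keys()))  # e.g. model, dataset and optimizer related sections
```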

- Use TensorBoard

The training takes several hours; the actual time depends on the number and type of your GPU cards. On an 8-card 2080 Ti machine, 50 epochs take less than one day.
You can use TensorBoard to monitor the loss:

```sh
tensorboard --logdir tensorboard/$your_exp_name/ --port 12598 --bind_all
```

#### Stage 5: Recognize wav using the trained model

This stage shows how to recognize a set of wavs as text. It also shows how to do model averaging.

- Average model

If `${average_checkpoint}` is set to `true`, the best `${average_num}` models on the cross-validation set are averaged to produce a boosted model, which is then used for recognition.
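
Averaging here simply means taking the element-wise mean of the parameters of the selected checkpoints. A minimal sketch of the idea (the checkpoint names are hypothetical; the recipe has its own averaging script):

```python
import torch

checkpoints = ["exp/your_exp/28.pt", "exp/your_exp/31.pt", "exp/your_exp/35.pt"]  # hypothetical best-N

avg = None
for path in checkpoints:
    state = torch.load(path, map_location="cpu")
    if avg is None:
        avg = {k: v.clone() for k, v in state.items()}
    else:
        for k in avg:
            avg[k] += state[k]

avg = {k: torch.true_divide(v, len(checkpoints)) for k, v in avg.items()}
torch.save(avg, "exp/your_exp/avg_3.pt")
```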

- Decoding

Recognition is also called decoding or inference. The neural network is applied to the input acoustic feature sequence to output a sequence of text.

Four decoding methods are provided in WeNet:

* `ctc_greedy_search`: encoder + CTC greedy search
* `ctc_prefix_beam_search`: encoder + CTC prefix beam search
* `attention`: encoder + attention-based decoder decoding
* `attention_rescoring`: rescoring the CTC candidates from CTC prefix beam search with the attention-based decoder, using the encoder output

In general, `attention_rescoring` is the best method. Please see the [U2 paper](https://arxiv.org/pdf/2012.05481.pdf) for the details of these algorithms.

`--beam_size` is a tunable parameter: a larger beam size may give better results but also costs more computation.

`--batch_size` can be greater than 1 for the `ctc_greedy_search` and `attention` decoding modes, and must be 1 for the `ctc_prefix_beam_search` and `attention_rescoring` decoding modes.
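
To make the simplest of these concrete, here is a minimal CTC greedy search sketch over per-frame posteriors (an illustration of the algorithm, not WeNet's implementation; the blank id follows the dict above):

```python
import torch

def ctc_greedy_search(log_probs, blank_id=0):
    """log_probs: (num_frames, vocab_size) CTC log-posteriors for one utterance."""
    best = log_probs.argmax(dim=-1).tolist()
    hyp, prev = [], None
    for token in best:
        # Collapse consecutive repeats, then drop blanks.
        if token != prev and token != blank_id:
            hyp.append(token)
        prev = token
    return hyp

log_probs = torch.randn(5, 4233).log_softmax(dim=-1)  # toy 5-frame example
print(ctc_greedy_search(log_probs))
```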

- WER evaluation

`tools/compute-wer.py` calculates the word (or character) error rate of the result. If you run the recipe without any change, you should get a WER of about 5%.
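
The error rate is the edit distance between hypothesis and reference divided by the reference length. A minimal character-level sketch (the hypothesis string is made up for illustration; this is not the actual `tools/compute-wer.py`):

```python
def edit_distance(ref, hyp):
    # Levenshtein distance via a rolling dynamic-programming row.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev_diag, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(dp[j] + 1,              # deletion
                      dp[j - 1] + 1,          # insertion
                      prev_diag + (r != h))   # substitution (or match)
            prev_diag, dp[j] = dp[j], cur
    return dp[-1]

ref = "甚至出现交易几乎停滞的情况"
hyp = "甚至出现交易几乎停止的情况"   # made-up hypothesis with one substituted character
print(edit_distance(ref, hyp) / len(ref))  # per-utterance character error rate
```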

#### Stage 6: Export the trained model

`wenet/bin/export_jit.py` exports the trained model using LibTorch (TorchScript). The exported model files can easily be used for inference in other programming languages such as C++.
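
The exported file can be loaded back with the TorchScript API, for example in Python (or with the LibTorch C++ API); the file name below is only an example of where the recipe might place it:

```python
import torch

model = torch.jit.load("exp/your_exp/final.zip")  # example path to the exported model
model.eval()
```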