
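Parts of this project's code are modified from the baseline repository.
- The dataset is downloaded with `download_data.py`.
- Code repositories (`dev2` branch):
    - Personal Gitea repository (GitHub limits upload file size, sigh, although I later deleted the large files here as well): http://zchens.cn:3000/zchen/b2txt25/src/branch/dev2
    - GitHub repository: https://github.com/ZH-CEN/nejm-brain-to-text/tree/dev2 (this one does not seem to have been maintained since)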

# Idea
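This model is not recorded in the paper or the slides, because I only thought of it very late. Before that I was working on the build-a-tree-during-generation idea (all I can say is that the logic is implementable; where is the code? No idea =-=). The main code for this model is now finished and it can be trained in a GPU environment. However, the parameter count is a little larger than the baseline's. With a reduced batch_size it trains on a P100, but it is far too slow, and Kaggle's TPU v5e-8 is very awkward to use. Even if I switched to a 5090 and got results (the parameter count grew by roughly 40%, so an optimistic estimate is at least 7 hours of training), there would be no time left for tuning, and even the evaluation code is not ready, so I let it go. I still think the model design is pretty good, but I strongly suspect someone has done it before. After all, learning the noise seems to be something Teacher Ma mentioned in a lecture; at the time I wondered how one would learn noise, and only now have I figured it out. Someone has probably done this already.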

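The model lives in the `model_training_nnn` folder. The main changes are in `rnn_trainer.py` and `rnn_model.py`. The other files are untouched, and so is the README.md.
To train, just run `rnn_trainer.py`; the config file `rnn.yaml` may need to be changed to use GPU acceleration. The TPU environment is not working yet, hhhhh. `evaluate_model.py` also still needs some adjustment.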

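The noise-separation adversarial model proposed in this project (whether it is really adversarial I have not sorted out yet; my brain is mush, never mind) may well have been proposed before, since the change is fairly small. I really did not have time to track down a paper, though. I had already come up with several ideas before this one, and most of them turned out to have existing related papers. For example, for the build-a-tree-during-generation model I thought of during this project (modeled on ACT dynamic adaptive RNNs and RNN tree construction), I found **one after another** that the simple experiments had already been done by others. The full version of that model is too complex to design, and weighing my own ability, there really was not enough time. So I implemented this freshly conceived noise model first, although I think designing noise separation on top of an RNN still requires modifying a lot of low-level code.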
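## Core idea
- Three-model architecture inside the RNN:
    - Speech recognition model: takes the residual between the raw data and the noise model's output as input; the training objective is to maximize classification accuracy.
    - Noisy-speech model: takes the noise model's output as input; the training objective is to maximize classification accuracy.
    - Noise model: during training, the gradients from the outputs of the two models above are connected directly (with the gradient from the noise branch negated) and passed back into the noise model.
- At inference:
    - The data goes into the noise model to produce output A; the raw data minus A gives the residual B, i.e. the clean signal.
    - Residual B goes into the speech recognition model for recognition, which gives the final output.

That's it. It is a fairly simple modification; whether it can actually be implemented and whether it works is unknown. As long as it runs, haha.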
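
The three-model idea above can be sketched in PyTorch as follows. This is a minimal illustration, not the code in `rnn_model.py` or `rnn_trainer.py`: the names `GradReverse` and `NoiseSeparationRNN`, the layer sizes, the class count, the random data, and the per-frame cross-entropy loss are all assumptions. The negated gradient is modeled as a gradient-reversal layer on the noise model's output before it enters the noisy-speech branch, which is one possible reading of the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates the gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output


class NoiseSeparationRNN(nn.Module):
    """Hypothetical three-branch model: noise, speech recognition, noisy speech."""

    def __init__(self, n_features=512, hidden=32, n_classes=41):
        super().__init__()
        self.noise_rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.noise_out = nn.Linear(hidden, n_features)
        self.speech_rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.speech_out = nn.Linear(hidden, n_classes)
        self.noisy_rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.noisy_out = nn.Linear(hidden, n_classes)

    def forward(self, x):
        h, _ = self.noise_rnn(x)
        noise = self.noise_out(h)        # output A
        residual = x - noise             # residual B
        s, _ = self.speech_rnn(residual)
        speech_logits = self.speech_out(s)
        if not self.training:
            return speech_logits, None   # inference uses only the speech branch
        # The noisy-speech branch learns normally, but the gradient it sends
        # back into the noise model is negated.
        z, _ = self.noisy_rnn(GradReverse.apply(noise))
        return speech_logits, self.noisy_out(z)


torch.manual_seed(0)
n_classes = 41                            # assumed class count
model = NoiseSeparationRNN(n_classes=n_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(4, 50, 512)               # (batch, time bins, features), random stand-in
y = torch.randint(0, n_classes, (4, 50))  # random per-frame labels

# One training step: both classifiers minimize their own classification loss.
model.train()
speech_logits, noisy_logits = model(x)
loss = (F.cross_entropy(speech_logits.reshape(-1, n_classes), y.reshape(-1))
        + F.cross_entropy(noisy_logits.reshape(-1, n_classes), y.reshape(-1)))
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Inference: noise model -> residual -> speech recognition model.
model.eval()
with torch.no_grad():
    logits, _ = model(x)
```

With this wiring, the noise model is pushed to make its own output hard to classify while keeping the residual easy to classify. Whether that matches the intended training signal is exactly the open question noted above.
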
# An Accurate and Rapidly Calibrating Speech Neuroprosthesis
*The New England Journal of Medicine* (2024)

Nicholas S. Card, Maitreyee Wairagkar, Carrina Iacobacci,
Xianda Hou, Tyler Singer-Clark, Francis R. Willett,
Erin M. Kunz, Chaofei Fan, Maryam Vahdati Nia,
Darrel R. Deo, Aparna Srinivasan, Eun Young Choi,
Matthew F. Glasser, Leigh R. Hochberg,
Jaimie M. Henderson, Kiarash Shahlaie,
Sergey D. Stavisky*, and David M. Brandman*.

<span style="font-size:0.8em;">\* denotes co-senior authors</span>

![Speech neuroprosthesis overview](b2txt_methods_overview.png)

## Overview
This repository contains the code and data necessary to reproduce the results of the paper ["*An Accurate and Rapidly Calibrating Speech Neuroprosthesis*" by Card et al. (2024), *N Engl J Med*](https://www.nejm.org/doi/full/10.1056/NEJMoa2314132).

The code is organized into five main directories: `utils`, `analyses`, `data`, `model_training`, and `language_model`:
- The `utils` directory contains utility functions used throughout the code.
- The `analyses` directory contains the code necessary to reproduce results shown in the main text and supplemental appendix.
- The `data` directory contains the data necessary to reproduce the results in the paper. Download it from Dryad using the link in the Data section below and place it in this directory.
- The `model_training` directory contains the code necessary to train and evaluate the brain-to-text model. See the README.md in that folder for more detailed instructions.
- The `language_model` directory contains the ngram language model implementation and a pretrained 1gram language model. Pretrained 3gram and 5gram language models can be downloaded [here](https://datadryad.org/dataset/doi:10.5061/dryad.x69p8czpq) (`languageModel.tar.gz` and `languageModel_5gram.tar.gz`). See [`language_model/README.md`](language_model/README.md) for more information.

## Competition
This repository also includes baseline model training and evaluation code for the [Brain-to-Text '25 Competition](https://www.kaggle.com/competitions/brain-to-text-25). The competition is hosted on Kaggle, and the code in this repository is designed to help participants train and evaluate their own models for the competition. The baseline model provided here is a custom PyTorch implementation of the RNN model used in the paper, which can be trained and evaluated using the provided data.

## Data
### Data Overview
The data used in this repository (which can be downloaded from [Dryad](https://datadryad.org/stash/dataset/doi:10.5061/dryad.dncjsxm85), either manually from the website, or using `download_data.py`) consists of various datasets for recreating figures and training/evaluating the brain-to-text model:
- `t15_copyTask.pkl`: This file contains the online Copy Task results required for generating Figure 2.
- `t15_personalUse.pkl`: This file contains the Conversation Mode data required for generating Figure 4.
- `t15_copyTask_neuralData.zip`: This dataset contains the neural data for the Copy Task.
    - There are 10,948 sentences from 45 sessions spanning 20 months. Each trial of data includes: 
        - The session date, block number, and trial number
        - 512 neural features (2 features [-4.5 RMS threshold crossings and spike band power] per electrode, 256 electrodes), binned at 20 ms resolution. The data were recorded from the speech motor cortex via four high-density microelectrode arrays (64 electrodes each). The 512 features are ordered as follows in all data files: 
            - 1-64: ventral 6v threshold crossings (ranges in this list are 1-indexed, 64 features per array)
            - 65-128: area 4 threshold crossings
            - 129-192: 55b threshold crossings
            - 193-256: dorsal 6v threshold crossings
            - 257-320: ventral 6v spike band power
            - 321-384: area 4 spike band power
            - 385-448: 55b spike band power
            - 449-512: dorsal 6v spike band power
        - The ground truth sentence label
        - The ground truth phoneme sequence label
    - The data is split into training, validation, and test sets. The test set does not include ground truth sentence or phoneme labels.
    - Data for each session/split is stored in `.hdf5` files. An example of how to load this data using the Python `h5py` library is provided in the [`model_training/evaluate_model_helpers.py`](model_training/evaluate_model_helpers.py) file in the `load_h5py_file()` function.
    - Each block of data contains sentences drawn from a range of corpora (Switchboard, OpenWebText2, a 50-word corpus, a custom frequent-word corpus, and a corpus of random word sequences). Furthermore, most of the data was recorded during attempted vocalized speaking, but some was recorded during attempted silent speaking. [`data/t15_copyTaskData_description.csv`](data/t15_copyTaskData_description.csv) contains a block-by-block description of the Copy Task data, including the session date, block number, number of trials, the corpus used, and which split the data is in (train, val, or test). The speaking strategy for each block is intentionally not listed here.
- `t15_pretrained_rnn_baseline.zip`: This dataset contains the pretrained RNN baseline model checkpoint and args. An example of how to load this model and use it for inference is provided in the [`model_training/evaluate_model.py`](model_training/evaluate_model.py) file.
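
As a rough aid for working with the 512-feature layout described above, the sketch below builds 0-indexed Python slices per array and feature type. It assumes the listed ranges mean eight consecutive blocks of 64 features in the order given; the names `ARRAYS`, `FEATURE_TYPES`, and `feature_slices` are hypothetical and do not come from this repository.

```python
ARRAYS = ["ventral 6v", "area 4", "55b", "dorsal 6v"]
FEATURE_TYPES = ["threshold crossings", "spike band power"]
ELECTRODES_PER_ARRAY = 64


def feature_slices():
    """Map (array, feature type) to a 0-indexed slice into the 512 features."""
    slices = {}
    start = 0
    # All threshold-crossing blocks come first, then all spike-band-power blocks.
    for feature_type in FEATURE_TYPES:
        for array in ARRAYS:
            slices[(array, feature_type)] = slice(start, start + ELECTRODES_PER_ARRAY)
            start += ELECTRODES_PER_ARRAY
    return slices


SLICES = feature_slices()

# Stand-in for one 20 ms bin: the value at each position is its own index.
frame = list(range(512))
area4_power = frame[SLICES[("area 4", "spike band power")]]
```

The same slices can be applied along the feature axis of an array loaded from the `.hdf5` files, for example to train on a single cortical area.
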

### Data Directory Structure
Please download these datasets from [Dryad](https://datadryad.org/stash/dataset/doi:10.5061/dryad.dncjsxm85) and place them in the `data` directory. Be sure to unzip `t15_copyTask_neuralData.zip` and place the resulting `hdf5_data_final` folder into the `data` directory. Likewise, unzip `t15_pretrained_rnn_baseline.zip` and place the resulting `t15_pretrained_rnn_baseline` folder into the `data` directory. The final directory structure should look like this:
```
data/
├── t15_copyTask.pkl
├── t15_personalUse.pkl
├── hdf5_data_final/
│   ├── t15.2023.08.11/
│   │   ├── data_train.hdf5
│   ├── t15.2023.08.13/
│   │   ├── data_train.hdf5
│   │   ├── data_val.hdf5
│   │   ├── data_test.hdf5
│   ├── ...
├── t15_pretrained_rnn_baseline/
│   ├── checkpoint/
│   │   ├── args.yaml
│   │   ├── best_checkpoint
│   ├── training_log
```
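
To confirm that the data was unpacked into the layout shown above, a small check along these lines may help. It is a sketch that relies only on the folder and file names from the tree above; the function name `list_sessions` is hypothetical.

```python
from pathlib import Path

SPLITS = ("train", "val", "test")


def list_sessions(data_dir):
    """Map each session folder under hdf5_data_final to the splits present on disk."""
    root = Path(data_dir) / "hdf5_data_final"
    sessions = {}
    for session_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        found = [s for s in SPLITS if (session_dir / f"data_{s}.hdf5").is_file()]
        sessions[session_dir.name] = found
    return sessions
```

Calling `list_sessions("data")` from the repository root returns a dictionary keyed by session folder name. Not every session has all three splits, as the tree above shows, so a session with only `["train"]` is not necessarily an error.
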

## Dependencies
- The code has only been tested on Ubuntu 22.04 with two NVIDIA RTX 4090 GPUs.
- We recommend using a conda environment to manage the dependencies. To install miniconda, follow the instructions [here](https://docs.anaconda.com/miniconda/miniconda-install/).
- Redis is required for communication between python processes. To install redis on Ubuntu:
    - https://redis.io/docs/getting-started/installation/install-redis-on-linux/
    - In terminal:
        ```bash
        curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
        
        echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
        
        sudo apt-get update
        sudo apt-get install redis
        ```
    - Turn off autorestarting for the redis server in terminal:
        - `sudo systemctl disable redis-server`
- `CMake >= 3.14` and `gcc >= 10.1` are required for the ngram language model installation. You can install these on Linux with `sudo apt-get install cmake` and `sudo apt-get install build-essential`.
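
The version requirements above can be checked before running the setup scripts with a sketch like the following. It is not part of this repository: the helper names are hypothetical, and it assumes that `cmake --version` and `gcc --version` print a dotted version number on their first line.

```python
import re
import shutil
import subprocess

MINIMUMS = {"cmake": (3, 14), "gcc": (10, 1)}


def parse_version(text):
    """Return the last dotted version number in `text` as a tuple of ints."""
    matches = re.findall(r"\d+(?:\.\d+)+", text)
    if not matches:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in matches[-1].split("."))


def installed_version(tool):
    """Parse the first line of `<tool> --version`, or return None if not on PATH."""
    if shutil.which(tool) is None:
        return None
    result = subprocess.run([tool, "--version"], capture_output=True, text=True, check=True)
    lines = result.stdout.splitlines()
    return parse_version(lines[0] if lines else "")


def check_build_tools():
    """Map each required tool to True if it is installed and new enough."""
    report = {}
    for tool, minimum in MINIMUMS.items():
        found = installed_version(tool)
        report[tool] = found is not None and found >= minimum
    return report
```

`check_build_tools()` returns a dictionary such as `{"cmake": True, "gcc": False}`; any `False` entry means the tool is missing or older than the stated minimum.
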

## Python environment setup for model training and evaluation
To create a conda environment with the necessary dependencies, run the following command from the root directory of this repository:
```bash
./setup.sh
```

Verify it worked by activating the conda environment with the command `conda activate b2txt25`.

## Python environment setup for ngram language model and OPT rescoring
We use an ngram language model plus rescoring via the [Facebook OPT 6.7b](https://huggingface.co/facebook/opt-6.7b) LLM. A pretrained 1gram language model is included in this repository at [`language_model/pretrained_language_models/openwebtext_1gram_lm_sil`](language_model/pretrained_language_models/openwebtext_1gram_lm_sil). Pretrained 3gram and 5gram language models are available for download [here](https://datadryad.org/dataset/doi:10.5061/dryad.x69p8czpq) (`languageModel.tar.gz` and `languageModel_5gram.tar.gz`). Note that the 3gram model requires ~60GB of RAM, and the 5gram model requires ~300GB of RAM. Furthermore, OPT 6.7b requires a GPU with at least ~12.4 GB of VRAM to load for inference.

Our Kaldi-based ngram implementation requires a different version of torch than our model training pipeline, so running the ngram language models requires an additional, separate Python conda environment. To create this conda environment, run the following command from the root directory of this repository. For more detailed instructions, see the README.md in the [`language_model`](language_model) subdirectory.
```bash
./setup_lm.sh
```

Verify it worked by activating the conda environment with the command `conda activate b2txt25_lm`.