This repository contains a speech recognition model implementation using PyTorch. The model uses CTC (Connectionist Temporal Classification) loss for training and beam search decoding for inference.
.
├── README.md
├── requirements.txt
├── model.py # Model architecture
├── dataset.py # Data loading and preprocessing
├── train.py # Training script
├── utils.py # Utility functions
└── train.ipynb # Training notebook
- Install dependencies:
    pip install -r requirements.txt
- Install ctcdecode:
    git clone --recursive https://github.com/parlance/ctcdecode.git
    cd ctcdecode
    pip install .
    cd ..

You can train the model either using the Python script or the Jupyter notebook:
- Python script:
    python train.py
- Jupyter notebook:
    - Open train.ipynb
    - Adjust the configuration parameters if needed
    - Run all cells
The model uses a bidirectional LSTM architecture with the following components:
- Multiple LSTM layers
- Dropout for regularization
- Linear projection layer
- CTC loss for training
- Beam search decoding for inference
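The components listed above can be sketched roughly as follows. This is a minimal illustration, not the contents of model.py: the class name `SpeechRecognizer`, the tensor sizes, and the choice of index 0 as the CTC blank are all assumptions made for the example. Beam search decoding is not shown here.

```python
import torch
import torch.nn as nn


class SpeechRecognizer(nn.Module):
    """Hypothetical BiLSTM acoustic model; the real model.py may differ."""

    def __init__(self, input_dim, hidden_dim, num_layers, num_classes, dropout=0.1):
        super().__init__()
        self.lstm = nn.LSTM(
            input_dim,
            hidden_dim,
            num_layers=num_layers,
            # nn.LSTM only applies dropout between stacked layers.
            dropout=dropout if num_layers > 1 else 0.0,
            bidirectional=True,
            batch_first=True,
        )
        self.dropout = nn.Dropout(dropout)
        # A bidirectional LSTM concatenates both directions: 2 * hidden_dim.
        self.proj = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, features):
        # features: (batch, time, input_dim)
        out, _ = self.lstm(features)
        logits = self.proj(self.dropout(out))
        # Per-frame log-probabilities: (batch, time, num_classes)
        return logits.log_softmax(dim=-1)


torch.manual_seed(0)
# Illustrative sizes only (assumed, not taken from the repository).
model = SpeechRecognizer(input_dim=40, hidden_dim=64, num_layers=2, num_classes=29)
features = torch.randn(4, 50, 40)  # 4 utterances, 50 frames each
log_probs = model(features)

# nn.CTCLoss expects (time, batch, classes); blank index 0 is an assumption.
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
targets = torch.randint(1, 29, (4, 10))  # dummy label sequences without blanks
input_lengths = torch.full((4,), 50, dtype=torch.long)
target_lengths = torch.full((4,), 10, dtype=torch.long)
loss = ctc_loss(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
```

The transpose before the loss call is needed because the sketch uses `batch_first=True` in the LSTM, while `nn.CTCLoss` takes time-major input.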
The main configuration parameters are:
- input_dim: Input feature dimension
- hidden_dim: LSTM hidden dimension
- num_layers: Number of LSTM layers
- num_classes: Number of output classes
- dropout: Dropout rate
- batch_size: Training batch size
- learning_rate: Initial learning rate
- epochs: Number of training epochs
- patience: Patience for learning rate scheduler
- beam_width: Beam width for decoding
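One way to keep these parameters together is a small dataclass. The sketch below is only an illustration: the class name `Config` is hypothetical, and every default value is a placeholder chosen for the example, not a setting taken from train.py or train.ipynb.

```python
from dataclasses import asdict, dataclass


@dataclass
class Config:
    # All defaults are illustrative placeholders, not the repository's values.
    input_dim: int = 40          # input feature dimension
    hidden_dim: int = 256        # LSTM hidden dimension
    num_layers: int = 3          # number of LSTM layers
    num_classes: int = 29        # number of output classes
    dropout: float = 0.1         # dropout rate
    batch_size: int = 32         # training batch size
    learning_rate: float = 1e-3  # initial learning rate
    epochs: int = 50             # number of training epochs
    patience: int = 3            # patience for the learning rate scheduler
    beam_width: int = 10         # beam width for decoding


# Override only the fields that differ from the defaults.
config = Config(hidden_dim=512, batch_size=16)
config_dict = asdict(config)
```

A dataclass keeps the field names in one place and catches misspelled parameter names at construction time, which a plain dictionary would not.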
The model expects:
- Features in .npy format
- CSV files with training/validation data information
- Directory structure:
data/
├── features/
│   ├── sample1.npy
│   ├── sample2.npy
│   └── ...
├── train.csv
└── val.csv
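
A layout like this could be read along the following lines. The README does not specify the CSV columns, so the column names `file` and `transcript` are assumptions for illustration, as are the helper names `load_manifest` and `pad_batch`; dataset.py may work differently.

```python
import csv
from pathlib import Path

import numpy as np


def load_manifest(csv_path, feature_root):
    """Read a CSV manifest and load each utterance's .npy feature matrix.

    The 'file' and 'transcript' columns are assumed, not documented.
    """
    samples = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            features = np.load(Path(feature_root) / row["file"])
            samples.append((features, row["transcript"]))
    return samples


def pad_batch(feature_list):
    """Zero-pad variable-length (time, dim) arrays to a common length."""
    lengths = np.array([f.shape[0] for f in feature_list])
    dim = feature_list[0].shape[1]
    batch = np.zeros((len(feature_list), lengths.max(), dim), dtype=np.float32)
    for i, f in enumerate(feature_list):
        batch[i, : f.shape[0]] = f
    return batch, lengths
```

The original lengths are returned alongside the padded batch because CTC loss needs the unpadded length of every input sequence.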