achrafElFaiq/RNN-MachineTranslation

🧠 Machine Translation using Seq2Seq models

For this project, David, Achraf, and Hector trained several deep learning models for machine translation from English to Spanish. The models vary either at the architecture level or in the embeddings used to feed data into them:

  • LSTM (Character-Level, Word-Level: Manual (Trainable Layer), BERT (Frozen), Word2Vec (Frozen))
  • GRU (With and Without Attention)
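All of the variants above share the same encoder-decoder skeleton. As a rough illustration (not the project's actual code), here is a minimal PyTorch LSTM seq2seq sketch; the class names, layer sizes, and vocabulary sizes are assumptions chosen for the example.

```python
# Minimal LSTM encoder-decoder sketch (illustrative only).
# All names and dimensions here are assumptions, not the repo's code.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len) of source-token ids
        _, (h, c) = self.lstm(self.embed(src))
        return h, c  # final hidden/cell state seeds the decoder

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tgt, state):
        # tgt: (batch, tgt_len) of target-token ids (teacher forcing)
        output, state = self.lstm(self.embed(tgt), state)
        return self.out(output), state  # logits: (batch, tgt_len, vocab)

# One forward pass with toy shapes
enc, dec = Encoder(vocab_size=50), Decoder(vocab_size=60)
src = torch.randint(0, 50, (8, 12))   # batch of 8 source sequences
tgt = torch.randint(0, 60, (8, 10))
logits, _ = dec(tgt, enc(src))
```

Swapping `nn.LSTM` for `nn.GRU` (whose state is a single tensor rather than an `(h, c)` pair), replacing the trainable `nn.Embedding` with frozen BERT or Word2Vec vectors, or adding an attention layer over encoder outputs yields the other variants listed above.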


The main file is a mini dashboard in the terminal that makes it easy to choose which architecture to train; more details about the setup are in the next section.
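A launcher like this typically maps the sequence of menu answers to a training script and runs it. The sketch below is a hypothetical illustration of that dispatch: only the `MANUAL` script path is taken from the example run later in this README; every other file name is an assumed placeholder.

```python
# Hypothetical sketch of a menu-driven launcher like main.py.
# Only the MANUAL path below is confirmed by the README's example run;
# the other script names are assumed placeholders.
MENU = {
    "1": {                                   # LSTM
        "1": "MODELS/LSTM/CARACTER LEVEL/char_lstm.py",                 # assumed name
        "2": {                               # Word-Level
            "1": "MODELS/LSTM/WORDLEVEL/MANUAL/word_level_manual_lstm.py",
            "2": "MODELS/LSTM/WORDLEVEL/BERT/bert_lstm.py",             # assumed name
            "3": "MODELS/LSTM/WORDLEVEL/WORD2VEC/w2v_lstm.py",          # assumed name
        },
    },
    "2": {                                   # GRU
        "1": "MODELS/GRU/ATTENTION/gru_attention.py",                   # assumed name
        "2": "MODELS/GRU/NO ATTENTION/gru_no_attention.py",             # assumed name
    },
}

def resolve(choices):
    """Follow a sequence of menu answers down to a training-script path."""
    node = MENU
    for choice in choices:
        node = node[choice]
    return node

# Answers 1 -> 2 -> 1 select the word-level manual-embedding LSTM
path = resolve(("1", "2", "1"))
print("Running:", path)
```

The real launcher would gather the answers interactively (e.g. with `input()`) and then execute the resolved script, for instance via `subprocess.run`.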

📦 Setup

Clone the Repository

git clone https://github.com/ML-DL-Teaching/deep-learning-project-2025-dl_team_16.git
cd deep-learning-project-2025-dl_team_16

Project Structure

├── main.py                         # Interactive CLI to run any model
├── environment.yml                 # Conda dependencies
├── data/                           # Input data and processing scripts
│   ├── process_data.py             # Preprocessing script
│   ├── spa.txt                     # Raw input text
│   └── processed.txt               # Preprocessed output (generated)
├── EMBEDDINGS/                     # Word/character embedding modules
│   ├── CARACTERLEVEL/
│   └── WORDLEVEL/
│       ├── BERT/
│       ├── MANUAL/
│       └── WORD2VEC/
├── MODELS/                          # Deep learning model scripts
│   ├── LSTM/
│   │   ├── CARACTER LEVEL/
│   │   └── WORDLEVEL/
│   └── GRU/
│       ├── ATTENTION/
│       └── NO ATTENTION/

Create and Activate Conda Environment

conda env create -f environment.yml
conda activate your-env-name

You also need to install PyTorch in the environment with CUDA support (CUDA 12.1 in our case):

pip install torch==2.4.0+cu121 torchaudio==2.4.0+cu121 torchvision==0.19.0+cu121 \
  --index-url https://download.pytorch.org/whl/cu121

Prepare Data

  1. Download Required Resources

If you're using Word2Vec, download GoogleNews-vectors-negative300.bin.gz and place it in the data/ directory. The spa.txt file is already in the repository's data/ folder.

  2. Run the Data Preprocessing Script

python data/process_data.py

This will generate processed.txt for training.
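As a rough sketch of what such a preprocessing step might do, assuming spa.txt stores tab-separated English/Spanish sentence pairs (the common Tatoeba export format) — the actual cleaning steps in process_data.py may differ:

```python
# Illustrative preprocessing sketch, assuming tab-separated
# English<TAB>Spanish rows in spa.txt; not the repo's actual script.
import re

def normalize(text: str) -> str:
    # Lowercase and collapse runs of whitespace
    return re.sub(r"\s+", " ", text.lower().strip())

def preprocess(raw_path: str, out_path: str) -> int:
    """Write cleaned tab-separated pairs to out_path; return the pair count."""
    pairs = 0
    with open(raw_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for row in fin:
            cols = row.rstrip("\n").split("\t")
            if len(cols) < 2:
                continue  # skip malformed rows
            eng, spa = normalize(cols[0]), normalize(cols[1])
            fout.write(f"{eng}\t{spa}\n")
            pairs += 1
    return pairs
```
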

Run the Launcher Script

python main.py

Follow the interactive prompts to choose and run the model you want.

You will be prompted to select:

  • Model Type: LSTM or GRU
  • Subtype (for LSTM): Character-Level or Word-Level
  • Embedding Strategy (for Word-Level): Manual, BERT, or Word2Vec

After your selections, the corresponding training script will automatically be executed.

📝 Example run:

Select Model Type:
1. LSTM
2. GRU
> 1

LSTM selected. Choose level:
1. Character Level
2. Word Level
> 2

Choose embedding type:
1. Manual
2. BERT
3. Word2Vec
> 1

Running: MODELS/LSTM/WORDLEVEL/MANUAL/word_level_manual_lstm.py

About

Benchmarking LSTM and GRU architectures with multiple embeddings (BERT, Word2Vec, character & word-level) for English to Spanish machine translation.
