njszym/orion

ORION: Optimal Rare-event Inference ON-demand

ORION is a machine-learning pipeline for predicting atomic transport in crystalline materials. Given a crystal structure and a mobile element (e.g., Ag in AgCl), it builds a dataset of possible atomic hops, trains three neural networks, and saves the trained models for inference.

The current pipeline handles only two types of defects:

  • Vacancies — a lattice site is empty; a neighboring atom can jump into it.
  • Interstitials — an extra atom sits between lattice sites; it can hop to neighboring interstitial positions.

How It Works

The pipeline is broken into three sequential training stages, each building on the previous.

Stage 1 — Mobility Classification

Question answered: Given a crystal structure with some defects, which atoms are likely to move from their original positions via diffusion hops (not just thermal vibrations)?

A graph neural network (GNN) takes a crystal structure as input and assigns each atom a probability between 0 and 1, where 1 means "likely mobile." This narrows down the search space for the later stages.

  • Input: Atomic structure (as a graph, with atoms as nodes and bonds as edges)
  • Output: Per-atom mobility probability (one number per atom)
  • Training labels: An atom is labeled mobile (1) if it moves more than --mobility-threshold Angstroms in any of the possible outcome structures in the dataset
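The labeling rule above can be sketched in a few lines. This is an illustration of the rule, not ORION's actual code; the function name and plain-tuple positions are hypothetical.

```python
import math

def mobility_labels(initial, finals, threshold=1.0):
    """Label each atom 1 (mobile) if it moves more than `threshold`
    Angstroms in ANY of the outcome structures, else 0.

    initial: list of (x, y, z) Cartesian positions (Angstroms)
    finals:  list of outcome structures, each a list of (x, y, z)
    """
    labels = []
    for i, p0 in enumerate(initial):
        moved = any(math.dist(p0, final[i]) > threshold for final in finals)
        labels.append(1 if moved else 0)
    return labels

# Two atoms; in the single outcome, the first hops 2.0 A, the second barely vibrates.
initial = [(0.0, 0.0, 0.0), (2.8, 0.0, 0.0)]
finals = [[(2.0, 0.0, 0.0), (2.85, 0.0, 0.0)]]
print(mobility_labels(initial, finals))  # [1, 0]
```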

Stage 2 — Displacement Prediction (Multi-Hop)

Question answered: Given a specific group of mobile atoms, where do they end up?

A separate MLP (multi-layer perceptron) is trained for each hop "size" (that is, how many atoms are involved in one concerted hop: 1, 2, or 3). Each model takes the MACE embeddings and fractional coordinates of the selected atoms as input and predicts their displacement vectors.

  • Input: Concatenated MACE embeddings + fractional position features for the chosen atoms
  • Output: 3D displacement vector (in Angstroms) for each atom in the group
  • Multi-modal: When multiple distinct hop destinations exist (e.g., an interstitial can jump to any of several neighboring sites), the model predicts multiple displacement "modes" simultaneously
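One standard way to train a multi-modal regressor is a winner-takes-all loss, where only the predicted mode closest to the observed displacement is penalized. The sketch below shows that generic technique; it is not necessarily ORION's exact loss formulation.

```python
def min_mode_loss(pred_modes, target):
    """Squared error of the predicted mode closest to the target displacement.

    pred_modes: list of K candidate displacement vectors (x, y, z)
    target:     the observed displacement vector
    Only the best-matching mode is penalized, so different modes are free
    to specialize on different hop destinations.
    """
    def sq_err(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(sq_err(mode, target) for mode in pred_modes)

# Two candidate hop directions; the target matches the second one exactly.
modes = [(2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
print(min_mode_loss(modes, (0.0, 2.0, 0.0)))  # 0.0
```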

Stage 3 — Path Prediction

Question answered: What does the intermediate atomic trajectory look like between the initial and final structures? And what is the migration barrier (energy change) along this path?

An MLP takes the initial and final positions of the mobile atoms (plus their immediate neighbors) and predicts the positions (and optionally energies) at each intermediate step.

  • Input: MACE embeddings + position features for mobile atoms and nearby neighbors, for both initial and final configurations
  • Output: Atom positions at each intermediate image (e.g., 7 frames between start and end), and optionally the energy at each image
  • Training data: Paths generated by IDPP interpolation or full NEB relaxation
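Both IDPP and NEB refine an initial guess that is (roughly) a linear interpolation between the endpoint geometries. The sketch below shows only that plain linear version, to make the "intermediate images" idea concrete; real IDPP paths additionally optimize against an image-dependent pair-distance objective.

```python
def linear_images(initial, final, n_images=7):
    """Generate n_images intermediate geometries by linear interpolation.

    initial, final: lists of (x, y, z) atom positions
    Returns a list of n_images structures strictly between the endpoints.
    """
    images = []
    for k in range(1, n_images + 1):
        t = k / (n_images + 1)
        images.append([
            tuple(p0 + t * (p1 - p0) for p0, p1 in zip(a, b))
            for a, b in zip(initial, final)
        ])
    return images

imgs = linear_images([(0.0, 0.0, 0.0)], [(4.0, 0.0, 0.0)], n_images=7)
print(len(imgs))      # 7
print(imgs[3][0][0])  # 2.0  (midpoint image)
```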

What Are MACE Embeddings?

For now, all three stages use MACE-MP-0 (a universal pretrained machine-learning interatomic potential) to convert each atom's local chemical environment into a 640-dimensional feature vector. These embeddings encode bonding, coordination, and chemistry without requiring hand-crafted descriptors. The MACE model itself is not trained here — it is used as a fixed feature extractor.

What Is In This Repo

scripts/generate_data.py          Build the dataset of initial/final structure pairs
scripts/train_mobility.py         Train Stage 1: mobility classifier (GNN)
scripts/train_multi_hop_models.py Train Stage 2: displacement models (MLP per hop size)
scripts/train_path_models.py      Train Stage 3: path prediction models (MLP)
vacancy_workflow.sh               End-to-end example for vacancy transport
interstitial_workflow.sh          End-to-end example for interstitial transport
combined_workflow.sh              End-to-end example combining both defect types
data/AgCl.cif                     Example crystal structure (silver chloride)

Installation

pip install -e .

COMET dependency

ORION uses COMET for two things:

  1. Ranking interstitial sites by energy (--defect-type interstitial)
  2. Running NEB relaxations to get physical migration paths and energies (--path-method neb)

Quick Start

Run one of the provided end-to-end workflow scripts from the orion/ directory:

# Vacancy-only pipeline
bash vacancy_workflow.sh

# Interstitial-only pipeline
bash interstitial_workflow.sh

# Combined vacancy + interstitial pipeline
bash combined_workflow.sh

These workflow scripts call the same Python entry points documented in detail below. Read on for a full walkthrough of each step.

Step-by-Step Walkthrough

The following example uses AgCl with vacancy transport. See the interstitial and combined sections afterward for the differences.

Step 1: Generate the Dataset

python scripts/generate_data.py \
  --structure data/AgCl.cif \   ## Path to the input crystal structure (CIF or POSCAR)
  --element Ag \                ## The element that will hop (removed for vacancy, inserted for interstitial)
  --supercell 2 2 2 \           ## Build a 2×2×2 supercell before generating defects (larger = more unique environments)
  --defect-type vacancy \       ## vacancy | interstitial | both
  --max-distance 4.0 \          ## Only consider vacancy hops shorter than this distance (Angstroms)
  --output-dir output/vacancy_data \  ## Where to write the dataset
  --generate-paths \            ## Also generate intermediate images for Stage 3 path training
  --path-method neb \           ## idpp (fast geometry interpolation) or neb (physically relaxed with COMET)
  --path-neb-calculator orb \   ## Which COMET calculator to use for NEB: orb, mace, chgnet, uma, ...
  --path-n-images 7             ## Number of intermediate structures to generate between start and end

This script enumerates all symmetry-distinct vacancy positions and all their symmetry-distinct hop destinations. For each initial vacancy configuration it creates a "group": one starting structure and every possible final structure (one per hop destination). If --generate-paths is set, it also runs path interpolation for each initial→final pair.

For interstitial datasets, additional arguments are used:

python scripts/generate_data.py \
  --defect-type interstitial \
  --energy-threshold 5.0 \     ## Only keep interstitial sites within this many eV of the lowest-energy site
  --max-pair-distance 3.0 \    ## Only pair interstitial sites closer than this (Angstroms)
  --max-calculations 200 \     ## How many site energy evaluations to run for ranking (more = slower but more thorough)
  --min-neighbors 6 \          ## Warn if a candidate site has fewer neighbors than this within the cutoff
  --calculator-type orb \      ## COMET calculator for energy ranking: orb, mace, chgnet, ...
  ...

Step 2: Train the Mobility Classifier (Stage 1)

python scripts/train_mobility.py \
  --data-dir output/vacancy_data \    ## Dataset folder from generate_data.py
  --output-dir output/mobility \      ## Where to save trained model files and logs
  --epochs 25 \                       ## Number of full passes through the training data
  --batch-size 8 \                    ## Structures processed per gradient update
  --lr 1e-3                           ## Learning rate for the Adam optimizer

What this trains: A GNN with 4 message-passing layers and hidden dimension 128. Each layer lets atoms "talk to" their neighbors within a 5 Å cutoff, mixing MACE embeddings with interatomic distances. After all message-passing layers, a small MLP maps each atom's feature vector to a single mobility probability.

Optional arguments:

  --cutoff 5.0 \        ## Neighbor cutoff radius for building the atom graph (Angstroms)
  --hidden-dim 128 \    ## Width of each GNN layer
  --num-layers 4 \      ## Number of message-passing rounds
  --no-mace \           ## Fall back to simple atomic-number embeddings instead of MACE (faster, less accurate)
  --no-focal            ## Use standard binary cross-entropy instead of focal loss
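Focal loss (the default, disabled by --no-focal) down-weights easy examples so training concentrates on hard, misclassified atoms, which helps when most atoms in a structure are immobile. Below is the standard binary focal loss formulation for a single atom; it illustrates the technique, not ORION's exact implementation or hyperparameters.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one atom.

    p: predicted mobility probability in (0, 1)
    y: true label, 0 or 1
    The (1 - p_t)**gamma factor shrinks the loss for confident, correct
    predictions, so gradients focus on hard examples.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confident correct prediction contributes far less than an unsure one.
print(focal_loss(0.95, 1) < focal_loss(0.6, 1))  # True
```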

Step 3: Train Displacement Models (Stage 2)

python scripts/train_multi_hop_models.py \
  --data-dir output/vacancy_data \    ## Same dataset folder
  --output-dir output/multi_hop \     ## Where to save models and logs
  --epochs 100 \                      ## Training epochs per hop-size model (1-atom, 2-atom, 3-atom)
  --batch-size 64 \
  --mobility-threshold 1.0 \          ## An atom counts as "moved" if it displaced more than this (Angstroms)
  --save-predictions \                ## After training, export predicted CIF structures for the test split
  --pred-max-groups 10 \              ## Export predictions for at most this many test groups
  --pred-max-per-group 20 \           ## At most this many predicted structures per group
  --pred-min-disp 0.1                 ## Skip predictions where the average atomic displacement is below this (Angstroms)

What this trains: Three separate MLPs — one for 1-atom hops, one for 2-atom hops, one for 3-atom hops. The model for hop size N takes as input the concatenated MACE embeddings and fractional position Fourier features of the N atoms being considered, and outputs N displacement vectors (one per atom). If multiple distinct hop destinations exist for the same set of atoms, the model learns to predict all of them simultaneously (multi-modal output).

Training data is built by exhaustively enumerating all combinations of mobile atoms from each group, then looking up the target displacement from the recorded outcome structures.
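The exhaustive enumeration amounts to taking all size-1, size-2, and size-3 combinations of the group's mobile atoms. A minimal sketch (the atom indices are hypothetical; in ORION they come from the Stage 1 mobility labels):

```python
from itertools import combinations

# Hypothetical mobile-atom indices for one group.
mobile_atoms = [3, 7, 12]

# Every candidate 1-, 2-, and 3-atom hop for this group.
candidates = {n: list(combinations(mobile_atoms, n)) for n in (1, 2, 3)}
print(candidates[2])  # [(3, 7), (3, 12), (7, 12)]
```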

Optional arguments:

  --hidden-dim 512 \             ## MLP width
  --num-layers 3 \               ## Number of hidden layers
  --zero-weight 0.5 \            ## Loss weight for atom combinations that do NOT hop (reduces false positives)
  --num-fourier-features 2 \     ## How many Fourier frequency components to include in positional encoding
  --num-modes 4 \                ## Fix the number of predicted displacement modes (auto-estimated if not set)
  --max-modes 12                 ## Upper limit on the auto-estimated number of modes
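The Fourier positional encoding referenced by --num-fourier-features can be sketched as sin/cos pairs of the fractional coordinates. This is the generic technique; the exact frequencies and ordering ORION uses may differ. The advantage over raw coordinates is periodicity: sin/cos of 2*pi*k*x are continuous across the cell boundary, matching periodic boundary conditions.

```python
import math

def fourier_features(frac_coord, num_frequencies=2):
    """Encode fractional coordinates in [0, 1) as sin/cos features.

    Returns 2 * num_frequencies features per coordinate component.
    """
    feats = []
    for x in frac_coord:
        for k in range(1, num_frequencies + 1):
            feats.append(math.sin(2 * math.pi * k * x))
            feats.append(math.cos(2 * math.pi * k * x))
    return feats

f = fourier_features((0.25, 0.5, 0.0), num_frequencies=2)
print(len(f))  # 12  (3 coords x 2 frequencies x sin+cos)
```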

Step 4: Train Path Prediction Models (Stage 3)

python scripts/train_path_models.py \
  --data-dir output/vacancy_data \    ## Same dataset folder (must have paths/ subdirectories)
  --output-dir output/path_models \   ## Where to save models and logs
  --epochs 100 \
  --batch-size 64 \
  --mobility-threshold 1.0 \
  --num-images 7 \                    ## Must match --path-n-images used in generate_data.py
  --save-predictions \                ## Export predicted path CIFs for validation data
  --pred-max-paths 10                 ## Export at most this many predicted paths

What this trains: One MLP per hop size, similar in structure to Stage 2 but larger input. The input includes MACE embeddings and positional features for both the initial and final structures (for the mobile atoms and their nearby neighbors within --neighbor-radius). The output is a sequence of N atom position snapshots — the predicted intermediate images along the migration path. If NEB energy data is available, the model also learns to predict the energy at each image (controlled by --energy-loss-weight).

Optional arguments:

  --neighbor-radius 3.0 \       ## Include non-mobile atoms within this radius (Å) of any mobile atom in both
                                ##   initial and final structures — adds environmental context to the path model
  --mobile-weight 2.0 \         ## Multiply the loss for mobile atoms by this factor (focus training on them)
  --energy-loss-weight 0.1 \    ## Weight of the NEB energy profile loss relative to the position loss
                                ##   (set to 0 if no NEB data, or to disable energy prediction)
  --no-save-predictions         ## Disable exporting predicted path CIFs (saves disk space)

Output Folders — What Gets Saved

Dataset (generate_data.py output)

output/vacancy_data/
  train/
    group_0000/
      initial.cif          ← Starting structure (supercell with vacancy, before any hop)
      final_0.cif          ← Outcome 0: structure after the atom hopped to destination 0
      final_1.cif          ← Outcome 1: hop to destination 1 (different neighbor)
      ...                  ← One final_*.cif per symmetry-distinct hop destination
      mobility_labels.pt   ← PyTorch tensor, shape [N_atoms], values 0 or 1.
                           ←   1 means this atom moved more than the threshold in at least one outcome.
      metadata.json        ← Group info: element, defect type, lattice, which atoms are mobile, etc.
      paths/               ← Created only if --generate-paths was used
        path_0000/         ← Path for the initial → final_0 transition
          00_initial.cif
          01_interpolated.cif
          02_interpolated.cif
          ...
          07_final.cif     ← One file per image (00 = initial, NN = final, rest are intermediate)
          metadata.json    ← Which atoms moved, hop size, path length
          energy_profile.txt  ← NEB energies (eV) at each image, if --path-method neb was used
          neb/             ← Raw COMET NEB output files (only present for neb method)
        path_0001/         ← Path for the initial → final_1 transition
          ...
    group_0001/
      ...
  test/
    group_XXXX/            ← Same structure, held out for evaluation
      ...
  dataset_summary.json     ← High-level statistics: total groups, outcomes per group, mobile atoms per group

Each group_XXXX folder represents one unique initial configuration (one vacancy or interstitial position). The multiple final_*.cif files are the distinct places the mobile atom could end up — different hop directions, different distances, etc.
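A downstream script can walk this layout with nothing but pathlib globs. The sketch below builds a miniature fake tree matching the layout above and enumerates it the way a loader might; it is an illustration only, not ORION's loader.

```python
from pathlib import Path
import tempfile

# Build a tiny fake dataset tree: two groups, two outcomes each.
root = Path(tempfile.mkdtemp()) / "vacancy_data" / "train"
for g in ("group_0000", "group_0001"):
    (root / g).mkdir(parents=True)
    for name in ("initial.cif", "final_0.cif", "final_1.cif"):
        (root / g / name).touch()

# Walk it: one initial structure and N outcome structures per group.
for group in sorted(root.glob("group_*")):
    finals = sorted(group.glob("final_*.cif"))
    print(group.name, len(finals))
# group_0000 2
# group_0001 2
```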


Mobility Model (train_mobility.py output)

output/mobility/
  best_model_loss.pt       ← Saved weights from the epoch with the lowest validation loss
  best_model_f1.pt         ← Saved weights from the epoch with the highest validation F1 score
  final_model.pt           ← Weights after the last training epoch
  latest.pt                ← Full training state: weights + optimizer state + history.
                           ←   Used to resume interrupted training (the script loads this automatically if it exists)
  epoch_checkpoints/
    epoch_001.pt           ← Snapshot after epoch 1
    epoch_002.pt
    ...
  training_history.json    ← Train/val loss and F1 at every epoch (useful for plotting learning curves)
  train.log                ← Console output of the training run

Multi-Hop Displacement Models (train_multi_hop_models.py output)

One subdirectory per hop size:

output/multi_hop/
  hop_1/                   ← Model for single-atom hops
    best_model.pt          ← Weights with lowest validation loss
    final_model.pt         ← Weights after last epoch
    latest.pt              ← Full state for resuming
    training_history.json  ← Loss and MAE (mean absolute error in Angstroms) per epoch
    train.log
  hop_2/                   ← Model for two-atom cooperative hops
    ...
  hop_3/                   ← Model for three-atom hops
    ...
  predictions/             ← Created only if --save-predictions was used
    group_0000/
      initial.cif          ← The starting structure for this test group
      pred_hop1_combo0000_mode0_idx3.cif
                           ← Predicted outcome: 1-atom hop, atom index 3, displacement mode 0
      pred_hop2_combo0001_mode0_idx3-7.cif
                           ← Predicted outcome: 2-atom hop, atoms 3 and 7, mode 0
      ...                  ← Many CIFs, one per (hop size, atom combination, mode)
    group_0001/
      ...

The pos_mae values logged during training are the mean absolute error of predicted displacements (in Angstroms) for atom combinations that do hop. The neg_mae measures how close to zero the predictions are for combinations that don't hop.


Path Prediction Models (train_path_models.py output)

output/path_models/
  hop_1/
    best_model.pt
    final_model.pt
    latest.pt
    training_history.json
    train.log
  hop_2/
    ...
  predictions/             ← Created by default (disable with --no-save-predictions)
    path_0000_hop1/        ← Predicted path for test path 0, hop size 1
      00_initial.cif       ← Starting structure
      01_pred.cif          ← Predicted intermediate image 1
      01_target.cif        ← Reference intermediate image 1 (from IDPP or NEB)
      02_pred.cif
      02_target.cif
      ...
      07_final.cif         ← Final structure (same as in the dataset)
      energy_pred.txt      ← Predicted energy at each image (only if NEB data was available)
      metadata.json        ← Which atoms moved, hop indices, path source
    path_0001_hop1/
      ...

Comparing the pred and target CIFs (for example, side by side in VESTA) lets you visually inspect how well the model reproduces the intermediate structures.

The energy_pred.txt files are also useful, as they provide a direct comparison to the NEB-calculated energy profiles.
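Given either energy profile, the forward migration barrier is just the peak energy relative to the starting image. The values below are made up for illustration; real profiles come from energy_profile.txt or energy_pred.txt.

```python
# Hypothetical energies (eV) at each image along one path,
# in file order: initial, intermediates, final.
energies = [0.00, 0.12, 0.31, 0.45, 0.38, 0.20, 0.05, 0.00]

# Forward migration barrier = peak energy minus the initial energy.
barrier = max(energies) - energies[0]
print(round(barrier, 2))  # 0.45
```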

Notes

  • Stages 2 and 3 require MACE (pip install mace-torch). Stage 1 can run without MACE using --no-mace (uses atomic number embeddings instead, which is less accurate).
  • NEB path generation requires COMET and a valid calculator name (--path-neb-calculator). For quick geometry-only paths, use --path-method idpp (no COMET required).
  • The latest.pt file in each output directory is used for automatic resume: if you re-run a training script pointing to the same --output-dir, it will pick up from where it left off.
  • All scripts auto-detect CUDA. Training runs on GPU if available.
