Skip to content

sachdevalab/MODIFI

Repository files navigation

🧬 MODIFI: DNA Modification Detection and ECE-host linkage inference from PacBio Metagenomic Data

License Python Platform PacBio

MODIFI is a software package for detecting DNA base modifications and inferring host–mobile genetic element (MGE) linkages from PacBio metagenomic sequencing data. It enables precise modification calling, motif discovery, and host–MGE association in complex microbial communities, supporting both subreads and HiFi read types.

Table of contents


How it works

MODIFI expects a PacBio BAM with IPD kinetics and a reference FASTA (metagenome assembly or isolate). If you pass an unaligned BAM, reads are aligned with pbmm2 while preserving kinetic tags. The pipeline splits work by contig, optionally accumulates control k-mer IPD statistics (from the same run or external databases), and normalizes sample IPDs against those controls. It then calls modifications, discovers motifs, and builds per-contig methylation profiles. With an MGE table (e.g. from geNomad) and optionally a binning file, it merges motif evidence and scores host–MGE linkages. Typical outputs are motif calls, strand-specific profiles, base-level GFF annotations, and host_summary.csv when linkage is enabled.


🧩 Installation

Option A: Install via Bioconda

conda create -n modifi -c bioconda modifi
conda activate modifi

If SMRT tools are installed in your system PATH, MODIFI will detect and use them automatically; otherwise, it defaults to its built-in alternatives.

Option B: Install from Source

1️⃣ Clone repository and install

git clone https://github.com/sachdevalab/MODIFI.git
cd MODIFI/
conda env create -n modifi -f env.yml
conda activate modifi
pip install .

2️⃣ Configure SMRT Link tools OPTIONAL

You can set your SMRT Link tools path here. You can also skip this step.

Configuration priority (in order):

  1. Config file first – MODIFI will use the path specified in smrt.config.yaml
  2. System PATH – If smrt.config.yaml is not found or incomplete, MODIFI checks system PATH
  3. Built-in alternative – Use built-in MultiMotifMaker and pbmm2 as fallback

To configure via smrt.config.yaml (only if needed):

Edit smrt.config.yaml in the MODIFI directory:

smrtlink_bin: /path/to/smrtlink/private/bin/

Verify installation

Test that the command-line entry point is available:

modifi --help

And run the built-in test dataset:

cd test/hifi/
bash test_hifi.sh

[Optional] Setup for subreads

If you need to process subreads, create a separate environment with pbcore (requires Python 3.9 and numpy 1.22.4):

conda env create -n MODIFI_subreads -f subreads.env.yml
conda activate MODIFI_subreads
pip install git+https://github.com/PacificBiosciences/pbcore.git
pip install .

⚡ Quick Start

Input Requirements

MODIFI supports both HiFi reads and subreads. Your PacBio BAM file must contain kinetics (IPD) tags:

Read Type Required IPD Tags Description
HiFi fi, ri Forward/Reverse IPD (codec V1)
Subreads ip IPD (raw frames or codec V1)

⚠️ Important:

  • FASTQ files do not contain kinetics information and cannot be used
  • The BAM file stores kinetic information; it does not need to be pre-aligned
  • See PacBio BAM format specifications for details
  • Reference should be indexed using samtools faidx before running.

Example

Run MODIFI with unaligned BAM containing kinetics:

samtools faidx assembly.fasta 
modifi \
    -o /path/to/output \
    -b raw.hifi_reads.bam \
    -r assembly.fasta \
    --read_type hifi

📖 Detailed Usage

Input BAM Options

Option 1: Unaligned BAM

MODIFI will automatically align reads using pbmm2:

modifi \
    -o /path/to/output \
    -b raw.hifi_reads.bam \
    -r assembly.fasta \
    --read_type hifi

Option 2: Pre-aligned BAM

If already aligned with pbmm2 (with kinetics preserved):

modifi \
    -o /path/to/output \
    --aligned_bam aligned.sorted.bam \
    -r assembly.fasta \
    --read_type hifi

Option 3: Subreads

For subread data (requires MODIFI_subreads environment):

modifi \
    -o /path/to/output \
    -b raw.subreads.bam \
    -r assembly.fasta \
    --read_type subreads

MGE-Host Linkage Inference

Provide an MGE table to automatically infer MGE-host linkages:

modifi \
    -o /path/to/output \
    -b raw.hifi_reads.bam \
    -r assembly.fasta \
    --read_type hifi \
    --mge_file MGE_list.tab

MGE table format: Tab-separated file with at least a seq_name column. Can be output from geNomad or manually created:

seq_name	length	topology	n_genes	genetic_code	plasmid_score	fdr
ERR6535514_2_L	6291	DTR	8	11	0.9997	0.0003
...

💡 Tip: Once motif detection has finished, you can re-run linkage only using the modifi-linkage helper instead of the full pipeline (see Linkage-only mode).

Incorporating Binning Results

Assign MGEs to host bins using binning results:

modifi \
    -o /path/to/output \
    -b raw.hifi_reads.bam \
    -r assembly.fasta \
    --read_type hifi \
    --mge_file MGE_list.tab \
    --bin_file binning.tab

Bin file format: Tab-separated, two columns (no header): contig name and bin name

scaffold_53	maxbin.017
scaffold_66	maxbin.017
scaffold_73	maxbin.017
...

Using Control Databases

For isolate genomes or low-complexity metagenomes, use a control database for better normalization.

Pre-built database:

modifi \
    -o /path/to/output \
    -b raw.hifi_reads.bam \
    -r assembly.fasta \
    --read_type hifi \
    --kmer_mean_db control/control_db.up7.down3.mean.dat \
    --kmer_num_db control/control_db.up7.down3.num.dat

Custom database:

Use control database from your own high-complexity metagenome runs (automatically generated in <output>/control/):

control/control_db.up7.down3.mean.dat
control/control_db.up7.down3.num.dat

🔬 Complete Metagenomics Workflow

End-to-end example from MGE prediction to host linkage inference.

Prerequisites:

  • Raw PacBio BAM with kinetics
  • Metagenomic assembly (FASTA)

Step 1: Predict MGEs with geNomad

genomad end-to-end \
    --relaxed \
    --cleanup \
    --enable-score-calibration \
    --threads 32 \
    --sensitivity 7.0 \
    --force-auto \
    assembly.fasta \
    output/genomad/ \
    /path/to/genomad_db

Step 2: Combine virus and plasmid predictions

cat output/genomad/assembly_summary/assembly_virus_summary.tsv \
    output/genomad/assembly_summary/assembly_plasmid_summary.tsv \
    > all_MGEs.tsv

Step 3: Run MODIFI

modifi \
    -o output/MODIFI \
    -b raw.hifi_reads.bam \
    -r assembly.fasta \
    --read_type hifi \
    --mge_file all_MGEs.tsv \
    --threads 32

Note: Proviruses are automatically excluded from analysis.


Linkage-only mode

If you have already run MODIFI and generated motif profiles in a given -o, you can re-run just the MGE-host linkage step (for example, with a different MGE table or binning file) without re-running modification detection:

modifi-linkage \
    -o /path/to/output \
    -r assembly.fasta \
    --mge_file all_MGEs.tsv \
    --bin_file binning.tab \
    --threads 32

⚙️ Command-line options

For the full, up-to-date list of arguments and defaults:

modifi --help
modifi-linkage --help

Commonly tuned options (modifi):

Option Role
-o Output directory for all pipeline artifacts
-b / --aligned_bam Unaligned BAM (-b/--bam) or pre-aligned BAM (--aligned_bam) with kinetics
-r Reference FASTA (contigs)
--read_type hifi (default) or subreads
--threads Parallelism (default 64)
--min_iden Minimum alignment identity (default 0.97 for HiFi, 0.85 for subreads if unset)
--mge_file / --bin_file MGE table (e.g. geNomad) and optional contig→bin map for linkage
--kmer_mean_db / --kmer_num_db Optional control k-mer IPD databases; window must match --up / --down
--up / --down Upstream / downstream k-mer flank length around the modified base (defaults 7 / 3)
--min_len, --min_cov, --min_ctg_cov Contig and site coverage filters
--min_frac, --min_sites, --min_score Motif retention and modification calling thresholds
--segment Depth-based contig segmentation (more recall, more runtime)
--run_steps Restrict to a subset of steps: split, load, control, compare, motif, profile, merge, host, anno
--clean Enable cleaning step
--visu_ipd Write IPD distribution figures under figs/
--detect_misassembly Misassembly-related IPD ratio outputs
--annotate_rm / --rm_gene_file RM-system annotation (MicrobeMod; testing)
--binning Methylation-based binning (testing)

📁 Output Files

Main Output Files

File Description
all.motifs.merged.csv Motifs detected from all the contigs
all.reprocess.gff Modified sites detected from all the contigs
motif_profile.csv Methylation fraction matrix: motifs × contigs
host_summary.csv Best predicted host for each MGE
mean_depth.csv Sequencing depth and length per contig
summary.csv Overall modification detection statistics
motif_heatmap.pdf Heatmap of motif profiles (empty for large datasets)

Per-Genome Results

Folder Contents
gffs/ *.reprocess.gff – Base-level methylation annotations
motifs/ *.motifs.csv – Detected motifs and metrics
profiles/ *.motifs.profile.csv – Strand-specific methylation ratios
hosts/ *.host_prediction.csv – Linkage scores for all candidate hosts
hosts/ *.motif_data.csv – Detailed motif data for scoring

Optional Output

Folder Contents Enabled by
RM_systems/ RM gene annotations --annotate_rm
figs/ IPD distribution plots --visu_ipd
ipd_ratio/ IPD ratio line plots --detect_misassembly

Output Format: Motif File

File: motifs/*.motifs.csv

Column Description
motifString DNA motif sequence (e.g., GATC, TCGAG)
centerPos Modified base position within motif (1-indexed)
modificationType Modification type (e.g., modified_base)
fraction Fraction of modified sites (0–1)
nDetected Number of modified sites detected
nGenome Total motif sites in genome
groupTag Motif group ID for clustering
partnerMotifString Complementary motif (if paired)
meanScore Mean modification score
meanIpdRatio Mean IPD ratio
meanCoverage Mean coverage at motif sites
objectiveScore Overall detection confidence score

Output Format: MGE-Host Linkage

File: host_summary.csv

Column Description
MGE MGE contig name
MGE_len MGE length (bp)
host Predicted host contig/bin
final_score Linkage confidence (0–1, higher is better)
specificity Association specificity
MGE_gc MGE GC content (%)
host_gc Host GC content (%)
cos_sim 4-mer frequency cosine similarity
MGE_cov MGE sequencing depth
host_cov Host sequencing depth
host_motif_num Number of motifs in host
total_sites Total modified sites analyzed
motif_info Detailed motif data (format: motif:length:sites:cov:score:pos)

Filtering recommendations:

  • General: final_score > 0.5 AND specificity < 0.01
  • Very high confidence: final_score > 0.8 AND specificity < 0.001

Output Format: GFF Annotation

Files: gffs/*.reprocess.gff

Standard GFF3 format with 9 columns:

Column Content
1 Contig name
2 Method (MODIFI)
3 Feature type (modified_base)
4–5 Position (start–end, 1-indexed)
6 Modification score
7 Strand (+/-)
8 Phase (not used)
9 Attributes (coverage, context, IPD ratio)

Quality Control Plots

Files: figs/*.png (generated with --visu_ipd)

2×2 subplot grid showing IPD normalization:

Position Content
(0,0) Alignment coverage distribution
(0,1) Raw IPD value distribution
(1,0) Control IPD distribution
(1,1) Normalized IPD ratio distribution

🧠 Control Database

For isolate genomes or low-complexity metagenomes, using a control database improves IPD normalization and modification calling accuracy.

Pre-built Database

MODIFI includes a pre-built control database:

control/control_db.up7.down3.mean.dat
control/control_db.up7.down3.num.dat

Use with --kmer_mean_db and --kmer_num_db flags (see Using Control Databases).

Custom Database

For better results, generate a control database from your own high-complexity metagenomes:

  1. Run MODIFI on high-complexity metagenomic samples
  2. Control files are automatically generated in <output>/control/
  3. Reuse these files for isolate/low-complexity samples

Important Notes

  • The k-mer window size (e.g., up7.down3) must match between sample and control
  • Control database accumulates more accurate statistics with larger/more diverse datasets
  • Recommended for: isolates, enriched cultures, low-diversity communities

Notes

Resource requirements (examples for reference)

In 59 metagenomics from nine habitats, the mean wall-clock time was 4.5 hours (range: 0.2–19 hours, std: 3.9), while the mean CPU time was 120 hours (range: 1–1,161 hours, std: 215), with a mean peak memory usage of 18 GB (range: 0.6–62 GB, std: 14).

Performance

  • Coverage >500× is automatically subsampled for computational efficiency
  • For large metagenomes, use --threads to parallelize processing
  • --segment flag increases recall for low-depth contigs but requires more time

Benchmarks and reproducibility

Driver scripts and plotting code used for analyses in development live under benchmark/.

Best Practices

  • Use HiFi reads instead of subreads when possible (better accuracy, faster)
  • Use control database for isolate genomes and low-complexity samples
  • Filter MGE-host linkages: final_score > 0.5 and specificity < 0.01

📚 Citation

If you use MODIFI in your research, please cite:

Metagenomic strain-resolved DNA modification patterns link extrachromosomal genetic elements to host strains Shuai Wang, Allison K. Guitor, Luis E. Valentin-Alvarado, Rebecca Garner, Pengfan Zhang, Ming Yan, Ling-Dong Shi, Marie C. Schoelmerich, Holly M. Steininger, Daniel M. Portik, Siyuan Zhang, Jeremy E. Wilkinson, Susan Lynch, Michael J. Morowitz, Matthias Hess, Spencer Diamond, Jillian F. Banfield, Rohan Sachdeva bioRxiv 2026.03.27.714056; doi: https://doi.org/10.64898/2026.03.27.714056


💬 Getting Help

For questions or bug reports:

We’ll respond as soon as possible.

About

MODIFI is for detecting DNA base modifications and linking ECEs to their microbial hosts using PacBio metagenomic data. It performs modification calling, motif discovery, and host–ECE association analysis, supporting both HiFi and subread data.

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors