MODIFI is a software package for detecting DNA base modifications and inferring host–mobile genetic element (MGE) linkages from PacBio metagenomic sequencing data. It enables precise modification calling, motif discovery, and host–MGE association in complex microbial communities, supporting both subreads and HiFi read types.
- How it works
- Installation
- Quick Start
- Detailed Usage
- Complete Metagenomics Workflow
- Linkage-only mode
- Command-line options
- Output Files
- Control Database
- Important Notes
- Citation
- Getting Help
MODIFI expects a PacBio BAM with IPD kinetics and a reference FASTA (metagenome assembly or isolate). If you pass an unaligned BAM, reads are aligned with pbmm2 while preserving kinetic tags. The pipeline splits work by contig, optionally accumulates control k-mer IPD statistics (from the same run or external databases), and normalizes sample IPDs against those controls. It then calls modifications, discovers motifs, and builds per-contig methylation profiles. With an MGE table (e.g. from geNomad) and optionally a binning file, it merges motif evidence and scores host–MGE linkages. Typical outputs are motif calls, strand-specific profiles, base-level GFF annotations, and host_summary.csv when linkage is enabled.
conda create -n modifi -c bioconda modifi
conda activate modifiIf SMRT tools are installed in your system PATH, MODIFI will detect and use them automatically; otherwise, it defaults to its built-in alternatives.
git clone https://github.com/sachdevalab/MODIFI.git
cd MODIFI/
conda env create -n modifi -f env.yml
conda activate modifi
pip install .You can set your SMRT Link tools path here. You can also skip this step.
Configuration priority (in order):
- Config file first – MODIFI will use the path specified in
smrt.config.yaml - System PATH – If
smrt.config.yamlis not found or incomplete, MODIFI checks system PATH - Built-in alternative – Use built-in
MultiMotifMakerandpbmm2as fallback
To configure via smrt.config.yaml (only if needed):
Edit smrt.config.yaml in the MODIFI directory:
smrtlink_bin: /path/to/smrtlink/private/bin/Test that the command-line entry point is available:
modifi --helpAnd run the built-in test dataset:
cd test/hifi/
bash test_hifi.shIf you need to process subreads, create a separate environment with pbcore (requires Python 3.9 and numpy 1.22.4):
conda env create -n MODIFI_subreads -f subreads.env.yml
conda activate MODIFI_subreads
pip install git+https://github.com/PacificBiosciences/pbcore.git
pip install .MODIFI supports both HiFi reads and subreads. Your PacBio BAM file must contain kinetics (IPD) tags:
| Read Type | Required IPD Tags | Description |
|---|---|---|
| HiFi | fi, ri |
Forward/Reverse IPD (codec V1) |
| Subreads | ip |
IPD (raw frames or codec V1) |
⚠️ Important:
- FASTQ files do not contain kinetics information and cannot be used
- The BAM file stores kinetic information; it does not need to be pre-aligned
- See PacBio BAM format specifications for details
- Reference should be indexed using
samtools faidxbefore running.
Run MODIFI with unaligned BAM containing kinetics:
samtools faidx assembly.fasta
modifi \
-o /path/to/output \
-b raw.hifi_reads.bam \
-r assembly.fasta \
--read_type hifiMODIFI will automatically align reads using pbmm2:
modifi \
-o /path/to/output \
-b raw.hifi_reads.bam \
-r assembly.fasta \
--read_type hifiIf already aligned with pbmm2 (with kinetics preserved):
modifi \
-o /path/to/output \
--aligned_bam aligned.sorted.bam \
-r assembly.fasta \
--read_type hifiFor subread data (requires MODIFI_subreads environment):
modifi \
-o /path/to/output \
-b raw.subreads.bam \
-r assembly.fasta \
--read_type subreadsProvide an MGE table to automatically infer MGE-host linkages:
modifi \
-o /path/to/output \
-b raw.hifi_reads.bam \
-r assembly.fasta \
--read_type hifi \
--mge_file MGE_list.tabMGE table format: Tab-separated file with at least a seq_name column. Can be output from geNomad or manually created:
seq_name length topology n_genes genetic_code plasmid_score fdr
ERR6535514_2_L 6291 DTR 8 11 0.9997 0.0003
...💡 Tip: Once motif detection has finished, you can re-run linkage only using the
modifi-linkagehelper instead of the full pipeline (see Linkage-only mode).
Assign MGEs to host bins using binning results:
modifi \
-o /path/to/output \
-b raw.hifi_reads.bam \
-r assembly.fasta \
--read_type hifi \
--mge_file MGE_list.tab \
--bin_file binning.tabBin file format: Tab-separated, two columns (no header): contig name and bin name
scaffold_53 maxbin.017
scaffold_66 maxbin.017
scaffold_73 maxbin.017
...For isolate genomes or low-complexity metagenomes, use a control database for better normalization.
modifi \
-o /path/to/output \
-b raw.hifi_reads.bam \
-r assembly.fasta \
--read_type hifi \
--kmer_mean_db control/control_db.up7.down3.mean.dat \
--kmer_num_db control/control_db.up7.down3.num.datUse control database from your own high-complexity metagenome runs (automatically generated in <output>/control/):
control/control_db.up7.down3.mean.dat
control/control_db.up7.down3.num.dat
End-to-end example from MGE prediction to host linkage inference.
Prerequisites:
- Raw PacBio BAM with kinetics
- Metagenomic assembly (FASTA)
genomad end-to-end \
--relaxed \
--cleanup \
--enable-score-calibration \
--threads 32 \
--sensitivity 7.0 \
--force-auto \
assembly.fasta \
output/genomad/ \
/path/to/genomad_dbcat output/genomad/assembly_summary/assembly_virus_summary.tsv \
output/genomad/assembly_summary/assembly_plasmid_summary.tsv \
> all_MGEs.tsvmodifi \
-o output/MODIFI \
-b raw.hifi_reads.bam \
-r assembly.fasta \
--read_type hifi \
--mge_file all_MGEs.tsv \
--threads 32Note: Proviruses are automatically excluded from analysis.
If you have already run MODIFI and generated motif profiles in a given -o, you can re-run just the MGE-host linkage step (for example, with a different MGE table or binning file) without re-running modification detection:
modifi-linkage \
-o /path/to/output \
-r assembly.fasta \
--mge_file all_MGEs.tsv \
--bin_file binning.tab \
--threads 32For the full, up-to-date list of arguments and defaults:
modifi --help
modifi-linkage --helpCommonly tuned options (modifi):
| Option | Role |
|---|---|
-o |
Output directory for all pipeline artifacts |
-b / --aligned_bam |
Unaligned BAM (-b/--bam) or pre-aligned BAM (--aligned_bam) with kinetics |
-r |
Reference FASTA (contigs) |
--read_type |
hifi (default) or subreads |
--threads |
Parallelism (default 64) |
--min_iden |
Minimum alignment identity (default 0.97 for HiFi, 0.85 for subreads if unset) |
--mge_file / --bin_file |
MGE table (e.g. geNomad) and optional contig→bin map for linkage |
--kmer_mean_db / --kmer_num_db |
Optional control k-mer IPD databases; window must match --up / --down |
--up / --down |
Upstream / downstream k-mer flank length around the modified base (defaults 7 / 3) |
--min_len, --min_cov, --min_ctg_cov |
Contig and site coverage filters |
--min_frac, --min_sites, --min_score |
Motif retention and modification calling thresholds |
--segment |
Depth-based contig segmentation (more recall, more runtime) |
--run_steps |
Restrict to a subset of steps: split, load, control, compare, motif, profile, merge, host, anno |
--clean |
Enable cleaning step |
--visu_ipd |
Write IPD distribution figures under figs/ |
--detect_misassembly |
Misassembly-related IPD ratio outputs |
--annotate_rm / --rm_gene_file |
RM-system annotation (MicrobeMod; testing) |
--binning |
Methylation-based binning (testing) |
| File | Description |
|---|---|
all.motifs.merged.csv |
Motifs detected from all the contigs |
all.reprocess.gff |
Modified sites detected from all the contigs |
motif_profile.csv |
Methylation fraction matrix: motifs × contigs |
host_summary.csv |
Best predicted host for each MGE |
mean_depth.csv |
Sequencing depth and length per contig |
summary.csv |
Overall modification detection statistics |
motif_heatmap.pdf |
Heatmap of motif profiles (empty for large datasets) |
| Folder | Contents |
|---|---|
gffs/ |
*.reprocess.gff – Base-level methylation annotations |
motifs/ |
*.motifs.csv – Detected motifs and metrics |
profiles/ |
*.motifs.profile.csv – Strand-specific methylation ratios |
hosts/ |
*.host_prediction.csv – Linkage scores for all candidate hosts |
hosts/ |
*.motif_data.csv – Detailed motif data for scoring |
| Folder | Contents | Enabled by |
|---|---|---|
RM_systems/ |
RM gene annotations | --annotate_rm |
figs/ |
IPD distribution plots | --visu_ipd |
ipd_ratio/ |
IPD ratio line plots | --detect_misassembly |
File: motifs/*.motifs.csv
| Column | Description |
|---|---|
motifString |
DNA motif sequence (e.g., GATC, TCGAG) |
centerPos |
Modified base position within motif (1-indexed) |
modificationType |
Modification type (e.g., modified_base) |
fraction |
Fraction of modified sites (0–1) |
nDetected |
Number of modified sites detected |
nGenome |
Total motif sites in genome |
groupTag |
Motif group ID for clustering |
partnerMotifString |
Complementary motif (if paired) |
meanScore |
Mean modification score |
meanIpdRatio |
Mean IPD ratio |
meanCoverage |
Mean coverage at motif sites |
objectiveScore |
Overall detection confidence score |
File: host_summary.csv
| Column | Description |
|---|---|
MGE |
MGE contig name |
MGE_len |
MGE length (bp) |
host |
Predicted host contig/bin |
final_score |
Linkage confidence (0–1, higher is better) |
specificity |
Association specificity |
MGE_gc |
MGE GC content (%) |
host_gc |
Host GC content (%) |
cos_sim |
4-mer frequency cosine similarity |
MGE_cov |
MGE sequencing depth |
host_cov |
Host sequencing depth |
host_motif_num |
Number of motifs in host |
total_sites |
Total modified sites analyzed |
motif_info |
Detailed motif data (format: motif:length:sites:cov:score:pos) |
Filtering recommendations:
- General:
final_score > 0.5ANDspecificity < 0.01- Very high confidence:
final_score > 0.8ANDspecificity < 0.001
Files: gffs/*.reprocess.gff
Standard GFF3 format with 9 columns:
| Column | Content |
|---|---|
| 1 | Contig name |
| 2 | Method (MODIFI) |
| 3 | Feature type (modified_base) |
| 4–5 | Position (start–end, 1-indexed) |
| 6 | Modification score |
| 7 | Strand (+/-) |
| 8 | Phase (not used) |
| 9 | Attributes (coverage, context, IPD ratio) |
Files: figs/*.png (generated with --visu_ipd)
2×2 subplot grid showing IPD normalization:
| Position | Content |
|---|---|
| (0,0) | Alignment coverage distribution |
| (0,1) | Raw IPD value distribution |
| (1,0) | Control IPD distribution |
| (1,1) | Normalized IPD ratio distribution |
For isolate genomes or low-complexity metagenomes, using a control database improves IPD normalization and modification calling accuracy.
MODIFI includes a pre-built control database:
control/control_db.up7.down3.mean.dat
control/control_db.up7.down3.num.dat
Use with --kmer_mean_db and --kmer_num_db flags (see Using Control Databases).
For better results, generate a control database from your own high-complexity metagenomes:
- Run MODIFI on high-complexity metagenomic samples
- Control files are automatically generated in
<output>/control/ - Reuse these files for isolate/low-complexity samples
- The k-mer window size (e.g.,
up7.down3) must match between sample and control - Control database accumulates more accurate statistics with larger/more diverse datasets
- Recommended for: isolates, enriched cultures, low-diversity communities
In 59 metagenomics from nine habitats, the mean wall-clock time was 4.5 hours (range: 0.2–19 hours, std: 3.9), while the mean CPU time was 120 hours (range: 1–1,161 hours, std: 215), with a mean peak memory usage of 18 GB (range: 0.6–62 GB, std: 14).
- Coverage >500× is automatically subsampled for computational efficiency
- For large metagenomes, use
--threadsto parallelize processing --segmentflag increases recall for low-depth contigs but requires more time
Driver scripts and plotting code used for analyses in development live under benchmark/.
- Use HiFi reads instead of subreads when possible (better accuracy, faster)
- Use control database for isolate genomes and low-complexity samples
- Filter MGE-host linkages:
final_score > 0.5andspecificity < 0.01
If you use MODIFI in your research, please cite:
Metagenomic strain-resolved DNA modification patterns link extrachromosomal genetic elements to host strains Shuai Wang, Allison K. Guitor, Luis E. Valentin-Alvarado, Rebecca Garner, Pengfan Zhang, Ming Yan, Ling-Dong Shi, Marie C. Schoelmerich, Holly M. Steininger, Daniel M. Portik, Siyuan Zhang, Jeremy E. Wilkinson, Susan Lynch, Michael J. Morowitz, Matthias Hess, Spencer Diamond, Jillian F. Banfield, Rohan Sachdeva bioRxiv 2026.03.27.714056; doi: https://doi.org/10.64898/2026.03.27.714056
For questions or bug reports:
- 📧 Email: wshuai@berkeley.edu
- 🐛 GitHub Issues: open an issue in this repository
We’ll respond as soon as possible.