AMPidentifier

A Python toolkit for Antimicrobial Peptide (AMP) prediction and physicochemical assessment

////////////////////////////////////////////////////////////////////////
//                                                                    //
//      _    __  __ ____  _     _            _   _  __ _              //
//     / \  |  \/  |  _ \(_) __| | ___ _ __ | |_(_)/ _(_) ___ _ __    //
//    / _ \ | |\/| | |_) | |/ _` |/ _ \ '_ \| __| | |_| |/ _ \ '__|   //
//   / ___ \| |  | |  __/| | (_| |  __/ | | | |_| |  _| |  __/ |      //
//  /_/   \_\_|  |_|_|   |_|\__,_|\___|_| |_|\__|_|_| |_|\___|_|      //
//                                                                    //
////////////////////////////////////////////////////////////////////////

About

AMPidentifier is an open-source, modular Python toolkit for predicting Antimicrobial Peptides (AMPs) from amino acid sequences. It combines three pre-trained Machine Learning models (Random Forest, SVM, Gradient Boosting) with an ensemble voting system, and computes dozens of physicochemical descriptors via modlamp.

Users can run predictions with the built-in models, combine them in ensemble mode, or integrate external .pkl models for side-by-side comparison.

AMPidentifier is published on the Python Package Index (PyPI) at https://pypi.org/project/ampidentifier/ and can be installed with pip install ampidentifier. Each PyPI release is versioned and indexed, which supports reproducibility in scientific workflows: researchers can cite a specific version and install that same version later to rerun an analysis.

Related Projects

Project	Description	Link
AMPidentifier CLI	Full command-line version with training scripts, benchmarking, and extended documentation	github.com/madsondeluna/AMPidentifier
AMPidentifier Web Server	Browser-based interface for AMP prediction (no installation required)	github.com/madsondeluna/AMPidentifierServerBETA

Installation

pip install ampidentifier

We recommend using a virtual environment:

python3 -m venv venv
source venv/bin/activate   # macOS/Linux
# venv\Scripts\activate    # Windows
pip install ampidentifier

Available on PyPI: https://pypi.org/project/ampidentifier/

Quick Start

# Single model (Random Forest, default)
ampidentifier --input my_sequences.fasta --output_dir ./results

# Ensemble voting (recommended)
ampidentifier --input my_sequences.fasta --output_dir ./results --ensemble

# Compare SVM with an external model
ampidentifier --input my_sequences.fasta --output_dir ./results --model svm --external_models /path/to/my_model.pkl

Usage Examples

The examples below use this sample FASTA file (test_peptides.fasta) containing known AMPs and non-AMP peptides for demonstration:

>Magainin-2|Xenopus_laevis|Cationic_amphipathic_helix
GIGKFLHSAKKFGKAFVGEIMNS
>LL-37|Homo_sapiens|Cathelicidin_family
LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES
>Melittin|Apis_mellifera|Venom_peptide
GIGAVLKVLTTGLPALISWIKRKRQQ
>Insulin_Chain_B|Homo_sapiens|Peptide_hormone
FVNQHLCGSHLVEALYLVCGERGFFYTPKT
>Glucagon|Homo_sapiens|Peptide_hormone
HSQGTFTSDYSKYLDSRRAQDFVQWLMNT
>Vasoactive_intestinal_peptide|Homo_sapiens|Neuropeptide
HSDAVFTDNYTRLRKQMAVKKYLNSILN
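
Each header in this file packs three pipe-separated fields (peptide name, source organism, and a class or family label). As a rough illustration of how such a file can be read, here is a minimal standard-library parser. It is a sketch only, not the package's own reader in data_io.py; the function name parse_fasta and the variable names are hypothetical.

```python
from typing import Iterator, List, Optional, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


sample = """>Magainin-2|Xenopus_laevis|Cationic_amphipathic_helix
GIGKFLHSAKKFGKAFVGEIMNS
>Glucagon|Homo_sapiens|Peptide_hormone
HSQGTFTSDYSKYLDSRRAQDFVQWLMNT
"""

records = list(parse_fasta(sample))
for header, seq in records:
    name, organism, label = header.split("|")
    print(f"{name} ({organism}, {label}): {len(seq)} residues")
```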

Usage Example - Google Colab / Jupyter Notebook

A demo notebook can be opened in Colab from the badge in the repository README. Alternatively, run the cells below manually in any Colab or Jupyter notebook:

# Cell 1: Install
!pip install ampidentifier

# Cell 2: Create the example FASTA file
fasta_content = """>Magainin-2|Xenopus_laevis|Cationic_amphipathic_helix
GIGKFLHSAKKFGKAFVGEIMNS
>LL-37|Homo_sapiens|Cathelicidin_family
LLGDFFRKSKEKIGKEFKRIVQRIKDFLRNLVPRTES
>Melittin|Apis_mellifera|Venom_peptide
GIGAVLKVLTTGLPALISWIKRKRQQ
>Insulin_Chain_B|Homo_sapiens|Peptide_hormone
FVNQHLCGSHLVEALYLVCGERGFFYTPKT
>Glucagon|Homo_sapiens|Peptide_hormone
HSQGTFTSDYSKYLDSRRAQDFVQWLMNT
>Vasoactive_intestinal_peptide|Homo_sapiens|Neuropeptide
HSDAVFTDNYTRLRKQMAVKKYLNSILN
"""

with open("test_peptides.fasta", "w") as f:
    f.write(fasta_content)

print("FASTA file created with 6 sequences (3 known AMPs + 3 non-AMPs)")

# Cell 3: Run with default model (Random Forest)
# Import the pipeline function directly from the package
import os
from amp_identifier.core import run_prediction_pipeline

os.makedirs("./results_rf", exist_ok=True)

run_prediction_pipeline(
    input_file="test_peptides.fasta",
    output_dir="./results_rf",
    internal_model_type="rf",   # Random Forest: best single-model AUC-ROC (0.9503)
    use_ensemble=False,
    external_model_paths=[],
)

# Cell 4: Run with ensemble mode (recommended)
# Combines RF + SVM + GB via majority voting for maximum robustness
import os
from amp_identifier.core import run_prediction_pipeline

os.makedirs("./results_ensemble", exist_ok=True)

run_prediction_pipeline(
    input_file="test_peptides.fasta",
    output_dir="./results_ensemble",
    internal_model_type="rf",   # ignored when use_ensemble=True
    use_ensemble=True,          # activates majority vote across all three models
    external_model_paths=[],
)

# Cell 5: Inspect results
# Runs ensemble first if output does not exist yet, then displays results
import os
import pandas as pd
from amp_identifier.core import run_prediction_pipeline

report_path   = "./results_ensemble/prediction_comparison_report.csv"
features_path = "./results_ensemble/physicochemical_features.csv"

if not os.path.exists(report_path):
    os.makedirs("./results_ensemble", exist_ok=True)
    run_prediction_pipeline(
        input_file="test_peptides.fasta",
        output_dir="./results_ensemble",
        internal_model_type="rf",
        use_ensemble=True,
        external_model_paths=[],
    )

report = pd.read_csv(report_path)
print("=== Ensemble Prediction Report ===")
print(report.to_string(index=False))

features = pd.read_csv(features_path)
print("\n=== Physicochemical Features ===")
print(f"Shape: {features.shape[0]} sequences x {features.shape[1]} descriptors")
print(features[['ID', 'Length', 'Charge', 'HydrophRatio']].to_string(index=False))

# Cell 6: Compare all three internal models individually
import os
import pandas as pd
from amp_identifier.core import run_prediction_pipeline

for model in ["rf", "svm", "gb"]:
    os.makedirs(f"./results_{model}", exist_ok=True)

    run_prediction_pipeline(
        input_file="test_peptides.fasta",
        output_dir=f"./results_{model}",
        internal_model_type=model,
        use_ensemble=False,
        external_model_paths=[],
    )

    report = pd.read_csv(f"./results_{model}/prediction_comparison_report.csv")
    pred_col = [c for c in report.columns if c.startswith("pred_")][0]
    amp_count = int(report[pred_col].sum())
    print(f"[{model.upper()}] Predicted AMPs: {amp_count}/{len(report)}")

Arguments

Argument	Description	Required	Default
`-i, --input`	Path to the input FASTA file	Yes	none
`-o, --output_dir`	Path to the output directory	Yes	none
`-m, --model`	Internal model to use: `rf`, `svm`, `gb`	No	`rf`
`--ensemble`	Enable majority-vote ensemble across all internal models	No	Flag
`-e, --external_models`	One or more paths to external `.pkl` models for comparison (comma-separated)	No	none
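
For readers who want to see how the table maps onto a parser, the sketch below rebuilds the same options with argparse. It is a reconstruction from the table, not the package's actual parser: the helper names split_paths and build_parser are hypothetical, and splitting --external_models on commas is an assumption based on the "comma-separated" note.

```python
import argparse
from typing import List


def split_paths(value: str) -> List[str]:
    """Split a comma-separated string of paths into a clean list."""
    return [p.strip() for p in value.split(",") if p.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ampidentifier")
    parser.add_argument("-i", "--input", required=True,
                        help="Path to the input FASTA file")
    parser.add_argument("-o", "--output_dir", required=True,
                        help="Path to the output directory")
    parser.add_argument("-m", "--model", choices=["rf", "svm", "gb"],
                        default="rf", help="Internal model to use")
    parser.add_argument("--ensemble", action="store_true",
                        help="Enable majority-vote ensemble")
    parser.add_argument("-e", "--external_models", type=split_paths,
                        default=[], help="Comma-separated .pkl model paths")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration
args = build_parser().parse_args(
    ["-i", "my_sequences.fasta", "-o", "./results", "--ensemble"]
)
print(args.model, args.ensemble, args.external_models)
```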

Key Features

Three pre-trained ML models: Random Forest, Gradient Boosting, SVM
Ensemble voting: Majority vote across all models for improved robustness
External model support: Load custom .pkl models for comparison
Physicochemical descriptors: Compute and export an extensive set of sequence features via modlamp
Fully open-source and modular: Each component can be used independently
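
To give a feel for what a physicochemical descriptor is, the sketch below computes three very simple ones in plain Python. It does not reproduce modlamp's calculations: the function name simple_descriptors is hypothetical, the charge is a crude count of K/R minus D/E that ignores pH and termini, and the hydrophobic residue set (A, C, F, I, L, M, V) is an assumed convention.

```python
from typing import Dict

# Assumed set of hydrophobic residues for this illustration
HYDROPHOBIC = set("ACFILMV")


def simple_descriptors(seq: str) -> Dict[str, float]:
    """Return length, a crude net charge, and the hydrophobic fraction."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    positive = sum(seq.count(aa) for aa in "KR")  # basic residues
    negative = sum(seq.count(aa) for aa in "DE")  # acidic residues
    hydrophobic = sum(aa in HYDROPHOBIC for aa in seq)
    return {
        "length": len(seq),
        "crude_charge": positive - negative,
        "hydrophobic_fraction": hydrophobic / len(seq),
    }


# Magainin-2 from the sample FASTA file
magainin = simple_descriptors("GIGKFLHSAKKFGKAFVGEIMNS")
print(magainin)
```

Cationic, partly hydrophobic peptides such as Magainin-2 come out with a positive crude charge here, which is the kind of signal the real descriptors capture in far more detail.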

Pre-Trained Model Performance

Random Forest achieves the best value on every metric.

Metric	Random Forest (RF)	SVM	Gradient Boosting (GB)
Accuracy	0.8845	0.8740	0.8585
Precision	0.8910	0.8880	0.8665
Recall	0.8762	0.8558	0.8475
F1-Score	0.8836	0.8716	0.8569
MCC	0.7692	0.7484	0.7172
AUC-ROC	0.9503	0.9356	0.9289
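
Because F1 is the harmonic mean of precision and recall, the F1-Score row can be cross-checked against the two rows above it. The short sketch below does this with the values copied from the table; small differences in the last digit are expected, since the published precision and recall are already rounded. The names reported and f1_score are illustrative.

```python
# Precision, recall, and F1 as listed in the table above
reported = {
    "RF": {"precision": 0.8910, "recall": 0.8762, "f1": 0.8836},
    "SVM": {"precision": 0.8880, "recall": 0.8558, "f1": 0.8716},
    "GB": {"precision": 0.8665, "recall": 0.8475, "f1": 0.8569},
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for name, m in reported.items():
    recomputed = f1_score(m["precision"], m["recall"])
    print(f"{name}: reported F1 = {m['f1']:.4f}, recomputed = {recomputed:.4f}")
```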

Recommended: use --ensemble for the most robust predictions (ensemble accuracy: 87.47%, sensitivity: 85.96%, specificity: 88.98%).
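
The idea behind majority voting can be sketched in a few lines. This is a generic illustration, not the package's implementation, which may handle probabilities or ties differently; the function name majority_vote and the example labels are hypothetical. With three models and binary labels, a tie cannot occur.

```python
from typing import Dict, List


def majority_vote(predictions: Dict[str, List[int]]) -> List[int]:
    """Return 1 for each sequence that more than half of the models label 1."""
    per_model = list(predictions.values())
    n_models = len(per_model)
    # zip(*per_model) groups the votes of all models for one sequence
    return [int(2 * sum(votes) > n_models) for votes in zip(*per_model)]


# Hypothetical binary labels (1 = AMP, 0 = non-AMP) for four peptides
votes = {
    "rf": [1, 1, 0, 0],
    "svm": [1, 0, 0, 1],
    "gb": [1, 1, 0, 0],
}
consensus = majority_vote(votes)
print(consensus)
```

A single dissenting model is outvoted, which is why the ensemble tends to be less sensitive to the errors of any one classifier.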

Outputs

File	Description
`physicochemical_features.csv`	Computed physicochemical descriptors for each input sequence
`prediction_comparison_report.csv`	AMP/non-AMP predictions with confidence scores per model and consensus
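
The report can also be summarized without pandas. The sketch below counts predicted AMPs per prediction column using only the standard library. It assumes, as in the notebook cells above, that prediction columns start with pred_ and hold numeric 0/1 labels; the function name count_predicted_amps is hypothetical.

```python
import csv
from typing import Dict


def count_predicted_amps(csv_path: str) -> Dict[str, int]:
    """Count rows labeled 1 in every column whose name starts with 'pred_'."""
    counts: Dict[str, int] = {}
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            for column, value in row.items():
                # Skip non-prediction columns and empty cells
                if not column or not column.startswith("pred_"):
                    continue
                if value in (None, ""):
                    continue
                counts[column] = counts.get(column, 0) + int(float(value))
    return counts
```

For example, count_predicted_amps("./results_ensemble/prediction_comparison_report.csv") would return a dictionary mapping each pred_ column to its number of predicted AMPs, once that file has been generated.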

Project Structure

amp_identifier/
├── __init__.py
├── core.py               # Main prediction workflow
├── data_io.py            # FASTA input reader
├── feature_extraction.py # Physicochemical descriptor computation
├── prediction.py         # Model loading and inference
└── reporting.py          # CSV report generation

Contributors

Name	Role	Affiliation
Madson A. de Luna-Aragão, MSc	Lead developer; architecture; ML; docs	UFMG
Rafael L. da Silva, BSc	Collaborator; preprocessing; pipeline testing	UFPE
Ana M. Benko-Iseppon, PhD	Advisor; study design; biological validation	UFPE
João Pacífico, PhD	Co-Advisor; computational review; evaluation	UPE
Carlos A. dos Santos-Silva, PhD	Co-Advisor; pipeline testing; review	CESMAC

Funding & Acknowledgments

Officially registered under UFPE - Universidade Federal de Pernambuco, Brazil
Supported by FACEPE - Fundação de Amparo à Pesquisa do Estado de Pernambuco
INPI Registration: BR 51 2025 005859-4

How to Cite

Luna-Aragão, M. A., da Silva, R. L., Pacífico, J., Santos-Silva, C. A. & Benko-Iseppon, A. M.
(2025). AMPidentifier: A Python toolkit for predicting antimicrobial peptides using ensemble
machine learning and physicochemical descriptors.
https://github.com/madsondeluna/AMPidentifier
