CytoBulk is a toolkit for bulk and spatial transcriptomics deconvolution and mapping.
- CytoBulk has been tested on WSL2 and Linux systems.
- On Windows and macOS, many packages listed in `environment.yml` may not have matching versions (or may be unavailable), so Docker is the recommended first-choice installation/runtime method.
- If local installation fails and the issue cannot be resolved, please use Docker.
Core functions:
- `bulk_deconv`
- `st_deconv`
- `st_mapping`
- `bulk_mapping`
- `he_mapping`
For reproducing results from the paper, please refer to:
Run commands first:

```bash
conda env create -f environment.yml
conda activate cytobulk
pip install -e .
```

Most common dependencies are included in `environment.yml`, but installing Giotto may still require manually installing additional packages.
Then install Giotto in R (required for marker detection with Giotto):

```R
library(devtools) # if not installed: install.packages('devtools')
library(remotes)  # if not installed: install.packages('remotes')
remotes::install_github("RubD/Giotto")
```

Giotto reference:
Run commands first:

```bash
docker --version
docker info
docker pull kristawang/cytobulk:1.0.0
docker images | grep cytobulk
```

If Docker runs into the OOM killer or other out-of-memory issues, add a memory limit to the Docker command, for example:

```bash
docker run --memory=16g ...
```

Note on Docker logs: in some environments, runtime logs may appear in batches (or mostly at the end) due to output buffering. If you do not see real-time logs, the program may still be running normally.
If you encounter an error during `conda env create -f environment.yml` that looks like:

```
ERROR: Failed to build 'rpy2' when getting requirements to build wheel
...
FileNotFoundError: [Errno 2] No such file or directory: 'gcc'
...
distutils.compilers.C.errors.CompileError: command 'gcc' failed: No such file or directory
```

Cause: the system lacks the C compiler (gcc) required to build rpy2 from source.

Solution: install the compilers using conda:

```bash
conda install -c conda-forge compilers make pkg-config
```

After installation, verify that gcc is available:

```bash
which gcc && gcc --version
```

You should see the path to gcc and its version information. Then retry the environment creation:

```bash
conda env create -f environment.yml
```

- Most inputs are `.h5ad` (AnnData) files.
- `bulk_mapping` requires `bulk_adata.uns['deconv']` generated by `bulk_deconv`.
- `st_mapping` requires `st_adata.uns['deconv']` generated by `st_deconv`.
- For `he_mapping`, `lr_data` must contain `ligand` and `receptor` columns.
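The `lr_data` requirement above amounts to a plain two-column table. A minimal sketch with pandas (the gene pairs below are illustrative placeholders, not a curated ligand-receptor resource):

```python
import pandas as pd

# Minimal ligand-receptor table for he_mapping; it only needs
# 'ligand' and 'receptor' columns. These pairs are placeholders.
lr_data = pd.DataFrame({
    "ligand":   ["CXCL12", "TGFB1",  "VEGFA"],
    "receptor": ["CXCR4",  "TGFBR1", "KDR"],
})

# Sanity-check the required columns before passing lr_data to he_mapping.
assert {"ligand", "receptor"}.issubset(lr_data.columns)
```

In practice you would load such a table from a CSV (as in the `he_mapping` example below) rather than building it inline.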
```python
import cytobulk as ct
import scanpy as sc

sc_adata = sc.read_h5ad("/path/to/sc_adata.h5ad")
bulk_adata = sc.read_h5ad("/path/to/bulk_adata.h5ad")

deconv_result, bulk_out = ct.tl.bulk_deconv(
    bulk_data=bulk_adata,
    sc_adata=sc_adata,
    annotation_key="celltype_minor",
    dataset_name="my_bulk",
    out_dir="/path/to/output",
    n_cell=500,
)
```

```bash
docker run --rm -it \
  -v /path/to/input:/inputs:ro \
  -v /path/to/output:/outputs \
  kristawang/cytobulk:1.0.0 \
  bulk_deconv \
  --sc /inputs/sc_adata.h5ad \
  --bulk /inputs/bulk_adata.h5ad \
  --annotation_key celltype_minor \
  --out_dir /outputs \
  --dataset_name my_bulk \
  --n_cell 500 \
  --seed 64
```

- `bulk_data`: bulk `AnnData`.
- `sc_adata`: single-cell `AnnData` reference.
- `annotation_key`: cell type column in `sc_adata.obs`.
- `dataset_name` (default: `""`): output file prefix.
- `out_dir` (default: `"."`): output directory.
- `n_cell` (default: `2000`): pseudo-bulk cell number per synthetic sample group.
- `top_k` (default: `50`): number of eigen components used in graph deconvolution.
- `use_adversarial` (default: `True`): enable adversarial training in the deconvolution model.
- `specificity` (default: `False`): whether to generate additional cell-type-specific simulated bulk mixtures.
  - Recommendation: keep `False` for randomly simulated bulk data.
  - Recommendation: consider `True` for real cohorts where dominant-cell-type simulation is beneficial.
- `high_purity` (default: `False`): only meaningful when `specificity=True`; generates higher dominant-cell-type purity in simulation.
  - Recommendation: set `True` for high tumor-purity cohorts (for example, TCGA-like settings).
- `bulk_hvg` (default: `True`): whether to also keep highly variable genes (HVGs) in bulk data.
- `reproduce` (default: `False`): enable strict reproduction mode; requires pretrained files in `out_dir/model` and a batch-effect file under `out_dir/model/batch_effect`.
Additional preprocessing kwargs commonly used:

- `downsampling` (default in preprocessing: `False`): downsample per-cell-type reference cells before marker/HVG steps. Recommended to set `True` for large single-cell datasets.
- `giotto_gene_num` (default in preprocessing: `150`): marker-gene count for Giotto-based marker detection.
- `skip_find_markers` (default in preprocessing: `False`): skip marker discovery and use overlapping genes directly.
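Before enabling the strict reproduction mode described above, it can help to verify that the expected files are actually in place. A small sketch using only the directory layout stated in this README (the helper name and the "non-empty directory" check are assumptions, not CytoBulk API):

```python
import os

def reproduce_files_present(out_dir: str) -> bool:
    """Check the layout strict reproduction mode expects:
    pretrained files in out_dir/model and a batch-effect file
    under out_dir/model/batch_effect. Illustrative helper only."""
    model_dir = os.path.join(out_dir, "model")
    batch_dir = os.path.join(model_dir, "batch_effect")
    return (
        os.path.isdir(model_dir)
        and os.path.isdir(batch_dir)
        and any(os.scandir(model_dir))  # model dir is not empty
    )

# Example: only request reproduce=True when the files exist.
out_dir = "/path/to/output"
reproduce = reproduce_files_present(out_dir)
```

This avoids a failed run when `reproduce=True` is requested against an output directory that was never populated with pretrained files.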
- `deconv_result` (`pandas.DataFrame`): predicted cell-type fractions; also stored in `bulk_out.uns['deconv']`.
- `bulk_out` (`anndata.AnnData`): original bulk `AnnData` with `uns['deconv']` added; saved to `out_dir/output/{dataset_name}_bulk_adata.h5ad`.
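Since `deconv_result` is a plain samples-by-cell-types fraction table, downstream inspection is ordinary pandas. A sketch with made-up fractions standing in for real output (cell-type names and values are illustrative):

```python
import pandas as pd

# Stand-in for deconv_result: rows are bulk samples, columns are cell
# types, values are predicted fractions (made-up numbers).
deconv_result = pd.DataFrame(
    {"T cells": [0.50, 0.20], "B cells": [0.30, 0.10], "Tumor": [0.20, 0.70]},
    index=["sample_1", "sample_2"],
)

# Dominant cell type per sample.
dominant = deconv_result.idxmax(axis=1)

# The same table is available from the saved AnnData as
# bulk_out.uns['deconv'], per the output description above.
print(dominant)
```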
We provide one runnable demo input in `demo/`:

- `demo/NSCLC_GSE127471.h5ad` (single-cell reference)
- `demo/NSCLC_GSE127471_bulk.h5ad` (bulk input)

Use `annotation_key="Celltype_minor"` for this demo.

For randomly simulated data (for example, NSCLC_GSE127471), we recommend `specificity=False`; otherwise, set `specificity=True` or keep it unset (use the default behavior).
Docker version:

```bash
DATASET_DIR="/absolute/path/to/CytoBulk/demo"
DATASET_OUT="/absolute/path/to/output_dir"
DATASET_NAME="NSCLC_GSE127471"

docker run --rm -it \
  -e PYTHONUNBUFFERED=1 \
  -e HOST_UID="$(id -u)" \
  -e HOST_GID="$(id -g)" \
  -v "${DATASET_DIR}":/inputs:ro \
  -v "${DATASET_OUT}":/outputs \
  kristawang/cytobulk:1.0.0 \
  bulk_deconv \
  --sc "/inputs/${DATASET_NAME}.h5ad" \
  --bulk "/inputs/${DATASET_NAME}_bulk.h5ad" \
  --annotation_key "Celltype_minor" \
  --out_dir "/outputs/" \
  --dataset_name "${DATASET_NAME}" \
  --n_cell 100 \
  --seed 64 \
  --specificity False
```

Path definitions for the Docker mounts:

- `DATASET_DIR`: local folder containing demo input files; mounted read-only at container path `/inputs`.
- `DATASET_OUT`: local output folder; mounted at container path `/outputs` for writing results.
- `--sc` and `--bulk`: container-internal input paths under `/inputs`.
- `--out_dir`: container-internal output path (`/outputs/`).
Conda version:

```python
import os
import warnings

import cytobulk as ct
from scanpy import read_h5ad

warnings.filterwarnings("ignore")

dataset_name = "NSCLC_GSE127471"
annotation_key = "Celltype_minor"
sc_adata_path = "demo/NSCLC_GSE127471.h5ad"
bulk_adata_path = "demo/NSCLC_GSE127471_bulk.h5ad"
out_dir = "demo_output"

sc_adata = read_h5ad(sc_adata_path)
bulk_adata = read_h5ad(bulk_adata_path)
os.makedirs(out_dir, exist_ok=True)

ct.tl.bulk_deconv(
    bulk_data=bulk_adata,
    sc_adata=sc_adata,
    annotation_key=annotation_key,
    out_dir=out_dir,
    dataset_name=dataset_name,
    n_cell=100,
    specificity=False,
)
```

Note: due to repository storage constraints, only `bulk_deconv` demo data is provided in this repository. For more comprehensive demo cases and use cases for the other functions (`st_deconv`, `st_mapping`, `bulk_mapping`, `he_mapping`), please refer to the CytoBulk_paper repository.
```python
import cytobulk as ct
import scanpy as sc

sc_adata = sc.read_h5ad("/path/to/sc_adata.h5ad")
st_adata = sc.read_h5ad("/path/to/st_adata.h5ad")

deconv_result, st_out = ct.tl.st_deconv(
    st_adata=st_adata,
    sc_adata=sc_adata,
    annotation_key="cell_type",
    dataset_name="my_st",
    out_dir="/path/to/output",
    n_cell=8,
)
```

```bash
docker run --rm -it \
  -v /path/to/input:/inputs:ro \
  -v /path/to/output:/outputs \
  kristawang/cytobulk:1.0.0 \
  st_deconv \
  --sc /inputs/sc_adata.h5ad \
  --st /inputs/st_adata.h5ad \
  --annotation_key cell_type \
  --out_dir /outputs \
  --dataset_name my_st \
  --n_cell 8 \
  --seed 64
```

- `st_adata`: spatial transcriptomics `AnnData`.
- `sc_adata`: single-cell `AnnData` reference.
- `annotation_key`: cell type column in `sc_adata.obs`.
- `dataset_name` (default: `""`): output file prefix.
- `out_dir` (default: `"."`): output directory.
- `n_cell` (default: `10`): base number of cells per simulated spot.
- `top_k` (default: `50`): graph deconvolution eigen components.
- `skip_find_markers` (default: `False`): skip marker detection (and use all overlapping genes).
- `use_adversarial` (default: `True`): adversarial model training toggle.
- `st_hvg` (default: `True`): whether to keep HVGs for ST data.
- `reproduce` (default: `False`): requires pretrained files in `out_dir/st_model` and a batch-effect file under `out_dir/st_model/batch_effect`.
- `deconv_result` (`pandas.DataFrame`): predicted cell-type fractions; also stored in `st_out.uns['deconv']`.
- `st_out` (`anndata.AnnData`): original ST `AnnData` with `uns['deconv']` added; saved to `out_dir/output/{dataset_name}_st_adata.h5ad`.
```python
import cytobulk as ct
import scanpy as sc

sc_adata = sc.read_h5ad("/path/to/sc_adata.h5ad")
st_adata = sc.read_h5ad("/path/to/output/my_st_st_adata.h5ad")

reconstructed_sc, reconstructed_adata = ct.tl.st_mapping(
    st_adata=st_adata,
    sc_adata=sc_adata,
    out_dir="/path/to/output",
    project="my_st",
    annotation_key="cell_type",
    seed=64,
)
```

In the Docker reproduction scripts, this step is exposed as `st_reconstruction` (functionally corresponding to `ct.tl.st_mapping`).
```bash
docker run --rm -it \
  -v /path/to/input:/inputs:ro \
  -v /path/to/output:/outputs \
  kristawang/cytobulk:1.0.0 \
  st_reconstruction \
  --sc /inputs/sc_adata.h5ad \
  --st /outputs/output/my_st_st_adata.h5ad \
  --annotation_key cell_type \
  --out_dir /outputs \
  --dataset_name my_st \
  --seed 64
```

- `st_adata`: deconvolved ST `AnnData` with `uns['deconv']`.
- `sc_adata`: single-cell `AnnData` reference.
- `out_dir`: output directory.
- `project`: output prefix/tag.
- `annotation_key`: cell type column in `sc_adata.obs`.
- `seed` (default: `0`): random seed.
- `sc_downsample` (default: `False`): whether to downsample scRNA-seq counts before matching.
- `scRNA_max_transcripts_per_cell` (default: `1500`): transcript cap when `sc_downsample=True`.
- `mean_cell_numbers` (default: `8`): used to estimate cells per spot if `st_adata.obsm['cell_num']` is absent.
- `save_reconstructed_st` (default: `True`): save the reconstructed ST `AnnData`.
- `reconstructed_sc` (`pandas.DataFrame`): spot-to-cell mapping table with columns `spot_id` and `cell_id`.
- `reconstructed_adata` (`anndata.AnnData`): reconstructed ST expression `AnnData` (contains the reconstructed expression, with the original ST in layer `original_st`).
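Because `reconstructed_sc` is just a `spot_id`/`cell_id` table, per-spot summaries are one-liners. A sketch with a toy table standing in for real `st_mapping` output:

```python
import pandas as pd

# Toy stand-in for the spot-to-cell mapping table described above.
reconstructed_sc = pd.DataFrame({
    "spot_id": ["spot_1", "spot_1", "spot_1", "spot_2"],
    "cell_id": ["c1", "c2", "c3", "c4"],
})

# Number of mapped cells per spot.
cells_per_spot = reconstructed_sc.groupby("spot_id")["cell_id"].count()
print(cells_per_spot)
```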
```python
import cytobulk as ct
import scanpy as sc

sc_adata = sc.read_h5ad("/path/to/sc_adata.h5ad")
bulk_adata = sc.read_h5ad("/path/to/output/my_bulk_bulk_adata.h5ad")

reconstructed_cell, reconstructed_bulk = ct.tl.bulk_mapping(
    bulk_adata=bulk_adata,
    sc_adata=sc_adata,
    annotation_key="celltype_minor",
    out_dir="/path/to/output",
    project="my_bulk",
    n_cell=500,
    multiprocessing=False,
)
```

```bash
docker run --rm -it \
  -v /path/to/input:/inputs:ro \
  -v /path/to/output:/outputs \
  kristawang/cytobulk:1.0.0 \
  bulk_mapping \
  --sc /inputs/sc_adata.h5ad \
  --bulk /outputs/output/my_bulk_bulk_adata.h5ad \
  --annotation_key celltype_minor \
  --out_dir /outputs \
  --dataset_name my_bulk \
  --n_cell 500 \
  --seed 64
```

- `bulk_adata`: deconvolved bulk `AnnData` with `uns['deconv']`.
- `sc_adata`: single-cell `AnnData` reference.
- `n_cell` (default: `100`): number of mapped single cells per bulk sample.
- `annotation_key` (default: `"curated_cell_type"`): cell type column in `sc_adata.obs`.
- `bulk_layer` (default: `None`): layer key used as the bulk expression matrix.
- `sc_layer` (default: `None`): layer key used as the single-cell expression matrix.
- `reorder` (default: `True`): reorder genes to enforce a consistent gene order between bulk and single-cell data.
- `multiprocessing` (default: `True`): parallel mapping.
- `cpu_num` (default: `cpu_count()-4`): worker count when multiprocessing is enabled.
- `normalization` (default: `True`): apply CPM + log normalization before mapping.
- `filter_gene` (default: `True`): filter genes by cosine similarity between original and reconstructed bulk expression.
- `save` (default: `True`): write mapping outputs to disk.
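To illustrate what a cosine-similarity gene filter of the kind `filter_gene` describes looks like, here is a hedged NumPy sketch: per-gene cosine similarity between original and reconstructed bulk profiles, thresholded to keep well-reconstructed genes. The threshold, helper name, and arrays are illustrative; CytoBulk's internal criterion may differ.

```python
import numpy as np

def cosine_filter(original, reconstructed, threshold=0.8):
    """Keep genes whose expression profile across samples is similar
    (cosine similarity >= threshold) between original and reconstructed
    bulk. Both arrays are genes x samples. Illustrative helper only."""
    num = (original * reconstructed).sum(axis=1)
    denom = np.linalg.norm(original, axis=1) * np.linalg.norm(reconstructed, axis=1)
    # Guard against all-zero gene rows (cosine undefined -> treat as 0).
    cos = np.divide(num, denom, out=np.zeros_like(num, dtype=float), where=denom > 0)
    return cos >= threshold

# Toy data: gene 0 is reconstructed well, gene 1 is not.
original = np.array([[1.0, 2.0, 3.0], [1.0, 0.0, 0.0]])
reconstructed = np.array([[1.1, 1.9, 3.2], [0.0, 1.0, 1.0]])
keep = cosine_filter(original, reconstructed)
```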
- `reconstructed_cell` (`pandas.DataFrame`): mapping table with columns `sample_id` and `cell_id`.
- `reconstructed_bulk` (`anndata.AnnData`): bulk `AnnData` containing mapping-related layers/fields (for example `layers['mapping']`, `layers['mapping_ori']`, `obsm['cell_number']`).
```python
import cytobulk as ct
import scanpy as sc
import pandas as pd

# Optional: create tiles from a .svs image
ct.pp.process_svs_image(
    svs_path="/path/to/sample.svs",
    output_dir="/path/to/tiles",
    crop_size=224,
    magnification=1,
    center_x=21000,
    center_y=11200,
    fold_width=10,
    fold_height=10,
    enable_cropping=True,
)

sc_adata = sc.read_h5ad("/path/to/sc_adata.h5ad")
lr_data = pd.read_csv("/path/to/lrpairs.csv")

cell_coordinates, mapping_df = ct.tl.he_mapping(
    image_dir="/path/to/tiles",
    out_dir="/path/to/output",
    project="my_he",
    lr_data=lr_data,
    sc_adata=sc_adata,
    annotation_key="cell_type",
    k_neighbor=30,
    alpha="auto_compute",
    batch_size=10000,
    mapping_sc=True,
    return_adata=False,
)
```

```bash
docker run --rm -it \
  -v /path/to/input:/inputs:ro \
  -v /path/to/output:/outputs \
  kristawang/cytobulk:1.0.0 \
  he_mapping \
  --svs_path /inputs/sample.svs \
  --image_out_dir /outputs/tiles \
  --enable_cropping 1 \
  --crop_size 224 \
  --magnification 1 \
  --center_x 21000 \
  --center_y 11200 \
  --fold_width 10 \
  --fold_height 10 \
  --sc /inputs/sc_adata.h5ad \
  --lr_csv /inputs/lrpairs.csv \
  --annotation_key cell_type \
  --out_dir /outputs/he_result \
  --project my_he \
  --k_neighbor 30 \
  --batch_size 10000 \
  --mapping_sc 1 \
  --return_adata 1 \
  --seed 20230602
```

For full H&E-to-scRNA mapping (`mapping_sc=True`):
- `image_dir`: folder containing processed image tiles.
- `out_dir`: output directory.
- `project`: output prefix/tag.
- `sc_adata`: single-cell `AnnData` reference.
- `lr_data`: ligand-receptor table (`ligand`, `receptor`).
- `annotation_key`: cell type column in `sc_adata.obs`.
If you run SVS preprocessing (`ct.pp.process_svs_image` or the Docker flags):

- `enable_cropping=True` (or `--enable_cropping 1`): you must provide the crop-region parameters:
  - `center_x`, `center_y`
  - `fold_width`, `fold_height`
  - (and usually set `crop_size` and `magnification` explicitly for reproducible tiling)
- `enable_cropping=False` (or `--enable_cropping 0`): process the whole slide by default; no crop-region parameters are required.
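To reason about which region the cropping parameters select, one possible reading (an assumption, not confirmed against the CytoBulk source) is that the crop covers a `fold_width` x `fold_height` grid of `crop_size`-pixel tiles centered at (`center_x`, `center_y`). Under that assumption, the pixel bounding box would be:

```python
def crop_bounds(center_x, center_y, crop_size, fold_width, fold_height):
    """Hypothetical bounding box for enable_cropping=True, assuming the
    crop is a fold_width x fold_height grid of crop_size-pixel tiles
    centered at (center_x, center_y). Illustrative reading of the
    parameters, not CytoBulk's actual implementation."""
    half_w = crop_size * fold_width // 2
    half_h = crop_size * fold_height // 2
    return (center_x - half_w, center_y - half_h,
            center_x + half_w, center_y + half_h)

# Demo values from this README: crop_size=224, 10x10 grid at (21000, 11200).
x0, y0, x1, y1 = crop_bounds(21000, 11200, 224, 10, 10)
```

Sketching the bounds this way makes it easy to check that the requested region actually lies inside the slide's pixel dimensions before tiling.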
- `enable_cropping` (default: `False`): whether to crop a local region before tiling.
  - `True`: crop around the specified region (`center_x`, `center_y`, `fold_width`, `fold_height`).
  - `False`: process the whole image; region parameters are ignored/not required.
- `crop_size` (default: `224`): tile size in pixels.
- `magnification` (default: `1`): magnification factor for the cropped/read region.
- `center_x`, `center_y` (example: `21000`, `11200`): crop center coordinates used when `enable_cropping=True`.
- `fold_width`, `fold_height` (default: `10`, `10`): crop grid size used when `enable_cropping=True`.
- `annotation_key` (default: `"curated_celltype"`): cell type label column.
- `k_neighbor` (default: `30`): graph neighbor size for image-cell graph construction.
- `alpha` (default: `"auto_compute"`): FGW trade-off between structure and feature matching.
  - `"auto_compute"`: automatically estimate alpha from the image cell-type distribution.
  - float in `0~1`: manually set alpha.
- `mapping_sc` (default: `True`): if `False`, only return H&E cell type prediction without scRNA mapping.
- `batch_size` (default: `3000`): number of image cells processed per batch.
- `downsampling` (default: `False`): downsample the scRNA reference for mapping.
- `return_adata` (default: `False`): return/save the mapped, filtered `AnnData`.
- `sc_st` (default: `False`): use a looser filtering/normalization path for spatial-like single-cell input.
- `anchor_expression` (default: `None`): optional anchor expression `AnnData` aligned to image coordinates.
- `expression_weight` (default: `0`): expression term weight in the cost matrix when anchor expression is provided.
- `skip_filtering` (default: `False`): skip scRNA filtering in this function.
- When `mapping_sc=False`: returns only `cell_coordinates` (`pandas.DataFrame`; H&E-inferred cell coordinates and predicted cell types).
- When `mapping_sc=True` and `return_adata=False`: returns `(cell_coordinates, mapping_df)`.
- When `mapping_sc=True` and `return_adata=True`: returns `(cell_coordinates, mapping_df, matched_adata)`.
  - `mapping_df` is the H&E-to-scRNA matching table.
  - `matched_adata` is the matched/filtered single-cell `AnnData`.
If you encounter the following error while running `ct.tl.he_mapping`:

```
_pickle.UnpicklingError: invalid load key, '<'.
```

This error usually means the pretrained model file was not fully downloaded (a corrupted or incomplete file). To resolve it, manually download the model file and place it in the package's pretrained-model directory.

Download `DeepCMorph_Datasets_Combined_41_classes_acc_8159.pth` from:

Then place it at:

`cytobulk/tools/model/pretrained_models/DeepCMorph_Datasets_Combined_41_classes_acc_8159.pth`
Large model files are not committed by default. If needed, place `DeepCMorph_Datasets_Combined_41_classes_acc_8159.pth` into `cytobulk/tools/model/pretrained_models/`.