CytoBulk

CytoBulk is a toolkit for bulk and spatial transcriptomics deconvolution and mapping.

  • CytoBulk has been tested on WSL2 and Linux systems.
  • On Windows and macOS, many packages listed in environment.yml may be unavailable or lack matching versions, so Docker is the recommended installation and runtime method on those platforms.
  • If local installation fails and the issue cannot be resolved, please use Docker.

Core functions:

  • bulk_deconv
  • st_deconv
  • st_mapping
  • bulk_mapping
  • he_mapping

Paper Reproduction

For reproducing results from the paper, please refer to the CytoBulk_paper repository.


1) Installation

1.1 Conda installation

Run the following commands:

conda env create -f environment.yml
conda activate cytobulk
pip install -e .

Most common dependencies are included in environment.yml, but installing Giotto may still require manually installing additional packages.

Then install Giotto in R (required for marker detection with Giotto):

library(devtools)  # if not installed: install.packages('devtools')
library(remotes)   # if not installed: install.packages('remotes')
remotes::install_github("RubD/Giotto")

Giotto reference: https://github.com/RubD/Giotto

1.2 Docker installation

Run the following commands:

docker --version
docker info
docker pull kristawang/cytobulk:1.0.0
docker images | grep cytobulk

If Docker runs into OOM killer or other out-of-memory issues, add a memory limit to the Docker command, for example:

docker run --memory=16g ...

Note on Docker logs: In some environments, runtime logs may appear in batches (or mostly at the end) due to output buffering. If you do not see real-time logs, the program may still be running normally; adding -e PYTHONUNBUFFERED=1 to the docker run command (as in the demo below) forces unbuffered Python output.

1.3 Troubleshooting

Issue: rpy2 installation fails with "command 'gcc' failed"

If you encounter an error during conda env create -f environment.yml that looks like:

ERROR: Failed to build 'rpy2' when getting requirements to build wheel
...
FileNotFoundError: [Errno 2] No such file or directory: 'gcc'
...
distutils.compilers.C.errors.CompileError: command 'gcc' failed: No such file or directory

Cause: This error occurs because the system lacks required C compilers (gcc) needed to build rpy2 from source.

Solution: Install the compilers using conda:

conda install -c conda-forge compilers make pkg-config

After installation, verify that gcc is available:

which gcc && gcc --version

You should see the path to gcc and its version information. Then retry the environment creation:

conda env create -f environment.yml

2) General I/O conventions

  • Most inputs are .h5ad (AnnData) files.
  • bulk_mapping requires bulk_adata.uns['deconv'] generated by bulk_deconv.
  • st_mapping requires st_adata.uns['deconv'] generated by st_deconv.
  • For he_mapping, lr_data must contain ligand and receptor columns.
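The conventions above can be expressed as a small pre-flight check. The helper below (its name, and the SimpleNamespace stand-ins for AnnData/DataFrame) is illustrative only, not part of the CytoBulk API; it only verifies that uns['deconv'] exists and that lr_data has the required columns.

```python
from types import SimpleNamespace

def check_mapping_inputs(adata, lr_data=None):
    """Collect problems before calling bulk_mapping / st_mapping / he_mapping.

    Works with an AnnData or any object exposing .uns (and .columns for lr_data).
    """
    problems = []
    if "deconv" not in getattr(adata, "uns", {}):
        problems.append("uns['deconv'] missing; run the matching *_deconv step first")
    if lr_data is not None:
        missing = {"ligand", "receptor"} - set(lr_data.columns)
        if missing:
            problems.append("lr_data missing columns: %s" % sorted(missing))
    return problems

# Lightweight stand-ins for AnnData / DataFrame:
ready = SimpleNamespace(uns={"deconv": object()})
print(check_mapping_inputs(ready))  # → []
```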

3) bulk_deconv

3.1 Command (Conda / Python API)

import cytobulk as ct
import scanpy as sc

sc_adata = sc.read_h5ad("/path/to/sc_adata.h5ad")
bulk_adata = sc.read_h5ad("/path/to/bulk_adata.h5ad")

deconv_result, bulk_out = ct.tl.bulk_deconv(
		bulk_data=bulk_adata,
		sc_adata=sc_adata,
		annotation_key="celltype_minor",
		dataset_name="my_bulk",
		out_dir="/path/to/output",
		n_cell=500,
)

3.2 Command (Docker)

docker run --rm -it \
	-v /path/to/input:/inputs:ro \
	-v /path/to/output:/outputs \
	kristawang/cytobulk:1.0.0 \
	bulk_deconv \
	--sc /inputs/sc_adata.h5ad \
	--bulk /inputs/bulk_adata.h5ad \
	--annotation_key celltype_minor \
	--out_dir /outputs \
	--dataset_name my_bulk \
	--n_cell 500 \
	--seed 64

3.3 Required parameters

  • bulk_data: bulk AnnData.
  • sc_adata: single-cell AnnData reference.
  • annotation_key: cell type column in sc_adata.obs.

3.4 Common optional parameters (defaults and meanings)

  • dataset_name (default: ""): output file prefix.
  • out_dir (default: "."): output directory.
  • n_cell (default: 2000): pseudo-bulk cell number per synthetic sample group.
  • top_k (default: 50): number of eigen components used in graph deconvolution.
  • use_adversarial (default: True): enable adversarial training in the deconvolution model.
  • specificity (default: False): whether to generate additional cell-type-specific simulated bulk mixtures.
    • Recommendation: keep False for randomly simulated bulk data.
    • Recommendation: consider True for real cohorts where dominant-cell-type simulation is beneficial.
  • high_purity (default: False): only meaningful when specificity=True; generates higher dominant-cell-type purity in simulation.
    • Recommendation: set True for high tumor-purity cohorts (for example, TCGA-like settings).
  • bulk_hvg (default: True): whether to also keep highly variable genes (HVGs) in bulk data.
  • reproduce (default: False): enable strict reproduction mode; requires pretrained files in out_dir/model and batch-effect file under out_dir/model/batch_effect.

Additional preprocessing kwargs commonly used:

  • downsampling (default in preprocessing: False): downsample per-cell-type reference cells before marker/HVG steps. Recommended to set True for large single-cell datasets.
  • giotto_gene_num (default in preprocessing: 150): marker-gene count for Giotto-based marker detection.
  • skip_find_markers (default in preprocessing: False): skip marker discovery and use overlapping genes directly.

3.5 Output

  • deconv_result (pandas.DataFrame): predicted cell-type fractions; also stored in bulk_out.uns['deconv'].
  • bulk_out (anndata.AnnData): original bulk AnnData with uns['deconv'] added; saved to out_dir/output/{dataset_name}_bulk_adata.h5ad.
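A quick sanity check on the returned fractions: rows should sum to roughly 1 per sample. The table below is illustrative data, assuming (as above) rows are samples and columns are cell types.

```python
import numpy as np
import pandas as pd

# Illustrative fractions table shaped like deconv_result / uns['deconv']:
deconv = pd.DataFrame(
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    index=["sample_1", "sample_2"],
    columns=["T cells", "B cells", "Myeloid"],
)
row_sums = deconv.sum(axis=1)
assert np.allclose(row_sums, 1.0), "fractions should sum to ~1 per sample"
top_type = deconv.idxmax(axis=1)  # dominant cell type per sample
print(top_type.to_dict())  # → {'sample_1': 'T cells', 'sample_2': 'B cells'}
```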

3.6 Demo case (bulk_deconv)

We provide one runnable demo input in demo/:

  • demo/NSCLC_GSE127471.h5ad (single-cell reference)
  • demo/NSCLC_GSE127471_bulk.h5ad (bulk input)

Use annotation_key="Celltype_minor" for this demo.

For randomly simulated bulk data (for example, NSCLC_GSE127471), we recommend specificity=False (the default); for real cohorts where dominant-cell-type simulation is beneficial, set specificity=True.

Docker version:

DATASET_DIR="/absolute/path/to/CytoBulk/demo"
DATASET_OUT="/absolute/path/to/output_dir"
DATASET_NAME="NSCLC_GSE127471"

docker run --rm -it \
	-e PYTHONUNBUFFERED=1 \
	-e HOST_UID="$(id -u)" \
	-e HOST_GID="$(id -g)" \
	-v "${DATASET_DIR}":/inputs:ro \
	-v "${DATASET_OUT}":/outputs \
	kristawang/cytobulk:1.0.0 \
	bulk_deconv \
	--sc "/inputs/${DATASET_NAME}.h5ad" \
	--bulk "/inputs/${DATASET_NAME}_bulk.h5ad" \
	--annotation_key "Celltype_minor" \
	--out_dir "/outputs/" \
	--dataset_name "${DATASET_NAME}" \
	--n_cell 100 \
	--seed 64 \
	--specificity False

Path definition for Docker mounts:

  • DATASET_DIR: local folder containing demo input files; mounted to container path /inputs as read-only.
  • DATASET_OUT: local output folder; mounted to container path /outputs for writing results.
  • --sc and --bulk: container-internal input paths under /inputs.
  • --out_dir: container-internal output path (/outputs/).

Conda version:

import os
import cytobulk as ct
from scanpy import read_h5ad
import warnings

warnings.filterwarnings("ignore")

dataset_name = "NSCLC_GSE127471"
annotation_key = "Celltype_minor"

sc_adata_path = "demo/NSCLC_GSE127471.h5ad"
bulk_adata_path = "demo/NSCLC_GSE127471_bulk.h5ad"
out_dir = "demo_output"


sc_adata = read_h5ad(sc_adata_path)
bulk_adata = read_h5ad(bulk_adata_path)

os.makedirs(out_dir, exist_ok=True)

ct.tl.bulk_deconv(
		bulk_data=bulk_adata,
		sc_adata=sc_adata,
		annotation_key=annotation_key,
		out_dir=out_dir,
		dataset_name=dataset_name,
		n_cell=100,
		specificity=False
)

Note: Due to repository storage constraints, only bulk_deconv demo data is provided in this repository. For more comprehensive demo cases and use cases for other functions (st_deconv, st_mapping, bulk_mapping, he_mapping), please refer to the CytoBulk_paper repository.


4) st_deconv

4.1 Command (Conda / Python API)

import cytobulk as ct
import scanpy as sc

sc_adata = sc.read_h5ad("/path/to/sc_adata.h5ad")
st_adata = sc.read_h5ad("/path/to/st_adata.h5ad")

deconv_result, st_out = ct.tl.st_deconv(
		st_adata=st_adata,
		sc_adata=sc_adata,
		annotation_key="cell_type",
		dataset_name="my_st",
		out_dir="/path/to/output",
		n_cell=8,
)

4.2 Command (Docker)

docker run --rm -it \
	-v /path/to/input:/inputs:ro \
	-v /path/to/output:/outputs \
	kristawang/cytobulk:1.0.0 \
	st_deconv \
	--sc /inputs/sc_adata.h5ad \
	--st /inputs/st_adata.h5ad \
	--annotation_key cell_type \
	--out_dir /outputs \
	--dataset_name my_st \
	--n_cell 8 \
	--seed 64

4.3 Required parameters

  • st_adata: spatial transcriptomics AnnData.
  • sc_adata: single-cell AnnData reference.
  • annotation_key: cell type column in sc_adata.obs.

4.4 Common optional parameters (defaults and meanings)

  • dataset_name (default: ""): output file prefix.
  • out_dir (default: "."): output directory.
  • n_cell (default: 10): base number of cells per simulated spot.
  • top_k (default: 50): graph deconvolution eigen components.
  • skip_find_markers (default: False): skip marker detection and use all overlapping genes.
  • use_adversarial (default: True): adversarial model training toggle.
  • st_hvg (default: True): whether to keep HVGs for ST data.
  • reproduce (default: False): requires pretrained files in out_dir/st_model and batch-effect file under out_dir/st_model/batch_effect.

4.5 Output

  • deconv_result (pandas.DataFrame): predicted cell-type fractions; also stored in st_out.uns['deconv'].
  • st_out (anndata.AnnData): original ST AnnData with uns['deconv'] added; saved to out_dir/output/{dataset_name}_st_adata.h5ad.
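From the spot-level fractions you can summarize the overall tissue composition, for example by averaging across spots. The values below are illustrative, assuming (as above) rows are spots and columns are cell types.

```python
import pandas as pd

# Hypothetical per-spot fractions shaped like st_out.uns['deconv']:
fractions = pd.DataFrame(
    {"Tumor": [0.7, 0.2, 0.4], "Stroma": [0.2, 0.5, 0.4], "Immune": [0.1, 0.3, 0.2]},
    index=["spot_1", "spot_2", "spot_3"],
)
composition = fractions.mean(axis=0).sort_values(ascending=False)
print(composition.round(2).to_dict())
# → {'Tumor': 0.43, 'Stroma': 0.37, 'Immune': 0.2}
```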

5) st_mapping

5.1 Command (Conda / Python API)

import cytobulk as ct
import scanpy as sc

sc_adata = sc.read_h5ad("/path/to/sc_adata.h5ad")
st_adata = sc.read_h5ad("/path/to/output/my_st_st_adata.h5ad")

reconstructed_sc, reconstructed_adata = ct.tl.st_mapping(
		st_adata=st_adata,
		sc_adata=sc_adata,
		out_dir="/path/to/output",
		project="my_st",
		annotation_key="cell_type",
		seed=64,
)

5.2 Command (Docker)

In the Docker reproduction scripts, this step is exposed as st_reconstruction (functionally corresponding to ct.tl.st_mapping).

docker run --rm -it \
	-v /path/to/input:/inputs:ro \
	-v /path/to/output:/outputs \
	kristawang/cytobulk:1.0.0 \
	st_reconstruction \
	--sc /inputs/sc_adata.h5ad \
	--st /outputs/output/my_st_st_adata.h5ad \
	--annotation_key cell_type \
	--out_dir /outputs \
	--dataset_name my_st \
	--seed 64

5.3 Required parameters

  • st_adata: deconvolved ST AnnData with uns['deconv'].
  • sc_adata: single-cell AnnData reference.
  • out_dir: output directory.
  • project: output prefix/tag.
  • annotation_key: cell type column in sc_adata.obs.

5.4 Common optional parameters (defaults and meanings)

  • seed (default: 0): random seed.
  • sc_downsample (default: False): whether to downsample scRNA-seq counts before matching.
  • scRNA_max_transcripts_per_cell (default: 1500): transcript cap when sc_downsample=True.
  • mean_cell_numbers (default: 8): used to estimate cells per spot if st_adata.obsm['cell_num'] is absent.
  • save_reconstructed_st (default: True): save reconstructed ST AnnData.

5.5 Output

  • reconstructed_sc (pandas.DataFrame): spot-to-cell mapping table with columns spot_id and cell_id.
  • reconstructed_adata (anndata.AnnData): reconstructed ST expression AnnData (contains reconstructed expression and original ST in layer original_st).
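Since reconstructed_sc is a plain spot-to-cell table with spot_id and cell_id columns, standard pandas group-bys apply, for example to count how many cells were mapped to each spot (the table below is illustrative data):

```python
import pandas as pd

# Hypothetical mapping table shaped like reconstructed_sc (spot_id, cell_id):
mapping = pd.DataFrame({
    "spot_id": ["spot_1", "spot_1", "spot_2", "spot_2", "spot_2"],
    "cell_id": ["c1", "c2", "c3", "c4", "c5"],
})
cells_per_spot = mapping.groupby("spot_id")["cell_id"].size()
print(cells_per_spot.to_dict())  # → {'spot_1': 2, 'spot_2': 3}
```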

6) bulk_mapping

6.1 Command (Conda / Python API)

import cytobulk as ct
import scanpy as sc

sc_adata = sc.read_h5ad("/path/to/sc_adata.h5ad")
bulk_adata = sc.read_h5ad("/path/to/output/my_bulk_bulk_adata.h5ad")

reconstructed_cell, reconstructed_bulk = ct.tl.bulk_mapping(
		bulk_adata=bulk_adata,
		sc_adata=sc_adata,
		annotation_key="celltype_minor",
		out_dir="/path/to/output",
		project="my_bulk",
		n_cell=500,
		multiprocessing=False,
)

6.2 Command (Docker)

docker run --rm -it \
	-v /path/to/input:/inputs:ro \
	-v /path/to/output:/outputs \
	kristawang/cytobulk:1.0.0 \
	bulk_mapping \
	--sc /inputs/sc_adata.h5ad \
	--bulk /outputs/output/my_bulk_bulk_adata.h5ad \
	--annotation_key celltype_minor \
	--out_dir /outputs \
	--dataset_name my_bulk \
	--n_cell 500 \
	--seed 64

6.3 Required parameters

  • bulk_adata: deconvolved bulk AnnData with uns['deconv'].
  • sc_adata: single-cell AnnData reference.

6.4 Common optional parameters (defaults and meanings)

  • n_cell (default: 100): number of mapped single cells per bulk sample.
  • annotation_key (default: "curated_cell_type"): cell type column in sc_adata.obs.
  • bulk_layer (default: None): layer key used as bulk expression matrix.
  • sc_layer (default: None): layer key used as single-cell expression matrix.
  • reorder (default: True): reorder genes to enforce consistent gene order between bulk/sc.
  • multiprocessing (default: True): parallel mapping.
  • cpu_num (default: cpu_count()-4): worker count when multiprocessing is enabled.
  • normalization (default: True): apply CPM + log normalization before mapping.
  • filter_gene (default: True): filter genes by cosine similarity between original and reconstructed bulk expression.
  • save (default: True): write mapping outputs to disk.
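The normalization step named above (CPM followed by log) can be sketched as below. This is a generic reimplementation for illustration, not CytoBulk's internal code:

```python
import numpy as np

def cpm_log(counts):
    """Counts-per-million scaling followed by log1p, one row per sample."""
    lib_size = counts.sum(axis=1, keepdims=True)  # total counts per sample
    return np.log1p(counts / lib_size * 1e6)

x = np.array([[10.0, 90.0], [50.0, 50.0]])  # 2 samples x 2 genes
print(cpm_log(x).round(2))
```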

6.5 Output

  • reconstructed_cell (pandas.DataFrame): mapping table with columns sample_id and cell_id.
  • reconstructed_bulk (anndata.AnnData): bulk AnnData containing mapping-related layers/fields (for example layers['mapping'], layers['mapping_ori'], obsm['cell_number']).
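The filter_gene behavior described in 6.4 (keeping genes whose original and reconstructed bulk profiles agree) can be approximated with a per-gene cosine similarity. The function and the 0.8 cutoff below are illustrative; CytoBulk's actual implementation and threshold may differ.

```python
import numpy as np

def cosine_per_gene(original, reconstructed):
    """Cosine similarity per gene (columns) across samples (rows)."""
    num = (original * reconstructed).sum(axis=0)
    den = np.linalg.norm(original, axis=0) * np.linalg.norm(reconstructed, axis=0)
    return num / np.maximum(den, 1e-12)  # guard against zero-expression genes

orig = np.array([[1.0, 2.0], [2.0, 0.0]])   # samples x genes
recon = np.array([[1.0, 0.0], [2.0, 2.0]])
sims = cosine_per_gene(orig, recon)
keep = sims > 0.8  # hypothetical cutoff
print(sims, keep)
```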

7) he_mapping

7.1 Command (Conda / optional SVS preprocessing + mapping)

import cytobulk as ct
import scanpy as sc
import pandas as pd

# Optional: create tiles from a .svs image
ct.pp.process_svs_image(
		svs_path="/path/to/sample.svs",
		output_dir="/path/to/tiles",
		crop_size=224,
		magnification=1,
		center_x=21000,
		center_y=11200,
		fold_width=10,
		fold_height=10,
		enable_cropping=True,
)

sc_adata = sc.read_h5ad("/path/to/sc_adata.h5ad")
lr_data = pd.read_csv("/path/to/lrpairs.csv")

cell_coordinates, mapping_df = ct.tl.he_mapping(
		image_dir="/path/to/tiles",
		out_dir="/path/to/output",
		project="my_he",
		lr_data=lr_data,
		sc_adata=sc_adata,
		annotation_key="cell_type",
		k_neighbor=30,
		alpha="auto_compute",
		batch_size=10000,
		mapping_sc=True,
		return_adata=False,
)

7.2 Command (Docker)

docker run --rm -it \
	-v /path/to/input:/inputs:ro \
	-v /path/to/output:/outputs \
	kristawang/cytobulk:1.0.0 \
	he_mapping \
	--svs_path /inputs/sample.svs \
	--image_out_dir /outputs/tiles \
	--enable_cropping 1 \
	--crop_size 224 \
	--magnification 1 \
	--center_x 21000 \
	--center_y 11200 \
	--fold_width 10 \
	--fold_height 10 \
	--sc /inputs/sc_adata.h5ad \
	--lr_csv /inputs/lrpairs.csv \
	--annotation_key cell_type \
	--out_dir /outputs/he_result \
	--project my_he \
	--k_neighbor 30 \
	--batch_size 10000 \
	--mapping_sc 1 \
	--return_adata 1 \
	--seed 20230602

7.3 Required parameters and preprocessing rules

For full H&E-to-scRNA mapping (mapping_sc=True):

  • image_dir: folder containing processed image tiles.
  • out_dir: output directory.
  • project: output prefix/tag.
  • sc_adata: single-cell AnnData reference.
  • lr_data: ligand-receptor table (ligand, receptor).
  • annotation_key: cell type column in sc_adata.obs.

If you run SVS preprocessing (ct.pp.process_svs_image or Docker flags):

  • enable_cropping=True (or --enable_cropping 1): must provide crop region parameters:
    • center_x, center_y
    • fold_width, fold_height (and usually set crop_size, magnification explicitly for reproducible tiling)
  • enable_cropping=False (or --enable_cropping 0): process the whole slide by default; no crop region parameters are required.

7.4 Common optional parameters (defaults and meanings)

  • enable_cropping (default: False): whether to crop a local region before tiling.
    • True: crop around the specified region (center_x, center_y, fold_width, fold_height).
    • False: process the whole image; region parameters are ignored/not required.
  • crop_size (default: 224): tile size in pixels.
  • magnification (default: 1): magnification factor for cropped/read region.
  • center_x, center_y (example: 21000, 11200): crop center coordinates used when enable_cropping=True.
  • fold_width, fold_height (default: 10, 10): crop grid size used when enable_cropping=True.
  • annotation_key (default: "curated_celltype"): cell type label column.
  • k_neighbor (default: 30): graph neighbor size for image-cell graph construction.
  • alpha (default: "auto_compute"): FGW trade-off between structure and feature matching.
    • "auto_compute": automatically estimate alpha from image cell-type distribution.
    • a float in [0, 1]: manually set alpha.
  • mapping_sc (default: True): if False, only return H&E cell type prediction without scRNA mapping.
  • batch_size (default: 3000): number of image cells processed per batch.
  • downsampling (default: False): downsample scRNA reference for mapping.
  • return_adata (default: False): return/save mapped filtered AnnData.
  • sc_st (default: False): use looser filtering/normalization path for spatial-like sc input.
  • anchor_expression (default: None): optional anchor expression AnnData aligned to image coordinates.
  • expression_weight (default: 0): expression term weight in cost matrix when anchor expression is provided.
  • skip_filtering (default: False): skip scRNA filtering in this function.

7.5 Output

  • When mapping_sc=False: returns only cell_coordinates (pandas.DataFrame, H&E inferred cell coordinates and predicted cell types).
  • When mapping_sc=True and return_adata=False: returns (cell_coordinates, mapping_df).
  • When mapping_sc=True and return_adata=True: returns (cell_coordinates, mapping_df, matched_adata).
    • mapping_df is the H&E-to-scRNA matching table.
    • matched_adata is the matched/filtered single-cell AnnData.
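Since cell_coordinates is a plain DataFrame, summarizing the predicted H&E cell types is a one-liner; the column names and data below are illustrative stand-ins:

```python
import pandas as pd

# Hypothetical table shaped like cell_coordinates (x, y, predicted type):
coords = pd.DataFrame({
    "x": [110, 204, 317, 402],
    "y": [58, 61, 90, 12],
    "cell_type": ["Tumor", "Tumor", "Tumor", "Immune"],
})
counts = coords["cell_type"].value_counts()
print(counts.to_dict())  # → {'Tumor': 3, 'Immune': 1}
```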

7.6 Troubleshooting: model file loading error

If you encounter the following error while running ct.tl.he_mapping:

_pickle.UnpicklingError: invalid load key, '<'.

This error usually means the pretrained model file was not fully downloaded (corrupted/incomplete file). To resolve it, manually download the model file and place it in the package pretrained-model directory.

Download DeepCMorph_Datasets_Combined_41_classes_acc_8159.pth from:

Then place it at:

  • cytobulk/tools/model/pretrained_models/DeepCMorph_Datasets_Combined_41_classes_acc_8159.pth
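The "invalid load key, '<'" error typically means an HTML error page was saved in place of the binary model file. You can detect this before retrying; the helper below is an illustrative check, not part of CytoBulk:

```python
import os
import tempfile

def looks_like_html(path):
    """Return True if the file starts with '<' (an HTML error page saved in
    place of the binary model, which triggers the UnpicklingError above)."""
    with open(path, "rb") as f:
        return f.read(1) == b"<"

# Demo with a fake corrupted download:
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".pth")
tmp.write(b"<html>404 Not Found</html>")
tmp.close()
print(looks_like_html(tmp.name))  # → True
os.unlink(tmp.name)
```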

8) Pretrained model note

Large model files are not committed by default. If needed, place DeepCMorph_Datasets_Combined_41_classes_acc_8159.pth into cytobulk/tools/model/pretrained_models/.

9) Repository

GitHub: https://github.com/deepomicslab/CytoBulk
