Python library for characterising dataset difficulty in binary classification tasks. Wraps PyCol (Python Class Overlap Library) with a clean metric interface and adds a configurable experiment framework for studying how complexity metrics correlate with ML classifier performance.
A central question in machine learning is: what makes a dataset hard to classify? This library provides quantitative tools to answer that question by computing 35+ complexity metrics — measures of feature overlap, class boundary geometry, instance-level neighbourhood structure, and statistical distribution similarity — and correlating them with the generalisation performance of standard classifiers under controlled dataset manipulations (varying class separation, noise, imbalance, dimensionality, and dataset geometry).
Requires Python 3.13. Uses PDM for dependency management.

```bash
pdm install
```
```python
import numpy as np

from data_complexity.data_metrics.metrics import ComplexityMetrics

dataset = {"X": np.random.randn(200, 2), "y": np.array([0] * 100 + [1] * 100)}
cm = ComplexityMetrics(dataset=dataset)

# All 35+ metrics as a flat dict of scalars
print(cm.get_all_metrics_scalar())
```

Run a pre-defined experiment by name:

```python
from data_complexity.experiments.pipeline import run_experiment

exp = run_experiment("moons_noise")  # runs, saves plots and CSVs to results/
```

Or configure a custom experiment programmatically:

```python
from data_complexity.experiments.pipeline import (
    Experiment, ExperimentConfig, DatasetSpec, ParameterSpec, datasets_from_sweep
)

config = ExperimentConfig(
    datasets=datasets_from_sweep(
        DatasetSpec("Gaussian", {"cov_type": "spherical", "class_separation": 4.0}),
        ParameterSpec("cov_scale", [0.5, 1.0, 2.0, 4.0], label_format="scale={value}"),
    ),
    x_label="cov_scale",
    cv_folds=5,
    ml_metrics=["accuracy", "f1"],
    name="my_gaussian_variance",
)

exp = Experiment(config)
exp.run(verbose=True, n_jobs=-1)
exp.compute_distances()
exp.plot()
exp.save()
```

The 35+ complexity metrics are grouped into six categories; each captures a distinct aspect of classification difficulty.
| Category | Count | What it measures | Metrics |
|---|---|---|---|
| Feature Overlap | 6 | Linear separability of individual features and best linear projections | F1, F1v, F2, F3, F4, IN |
| Instance Overlap | 9 | Fraction of instances in ambiguous neighbourhood regions | Raug, N3, kDN, CM, Borderline, … |
| Structural Overlap | 9 | Topology of the class boundary; cluster fragmentation | N1, T1, Clust, ONB, DBC, … |
| Multiresolution Overlap | 5 | Class purity aggregated across multiple spatial resolutions | MRCA, C1, C2, Purity, … |
| Classical Measures | 1 | Dataset-level statistics independent of class geometry | IR (Imbalance Ratio) |
| Distributional Measures | 5 | Statistical distribution overlap and decision-boundary geometry | Silhouette, Bhattacharyya, Wasserstein, … |
Full metric reference with equations and citations: `data_complexity/data_metrics/README.md`
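For intuition about what these measures compute, here is a minimal standalone sketch of one feature-overlap measure, Fisher's maximum discriminant ratio (F1). This is an illustrative NumPy re-derivation under one common formulation (complexity reported as 1/(1 + max ratio)), not the library's implementation:

```python
import numpy as np

def f1_fisher_ratio(X: np.ndarray, y: np.ndarray) -> float:
    """Illustrative F1: maximum Fisher's discriminant ratio over features.

    Per feature f: r_f = (mu_f0 - mu_f1)^2 / (var_f0 + var_f1).
    Reported as 1 / (1 + max_f r_f), so values near 1 indicate heavy
    feature overlap (hard) and values near 0 indicate easy separation.
    """
    X0, X1 = X[y == 0], X[y == 1]
    ratios = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2 / (
        X0.var(axis=0) + X1.var(axis=0)
    )
    return float(1.0 / (1.0 + ratios.max()))

# Well-separated clusters score low; overlapping clusters score high
X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 4.0])
y = np.array([0] * 100 + [1] * 100)
print(f1_fisher_ratio(X, y))
```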
Experiments use five parametric synthetic generators. All produce binary classification datasets (unless otherwise noted).
| Type | Key parameters | Geometry |
|---|---|---|
| Gaussian | `class_separation`, `cov_type`, `cov_scale`, `cov_correlation`, `minority_reduce_scaler` | Two Gaussian clusters in 2D |
| Moons | `moons_noise` | Two interleaving crescents (non-linear boundary) |
| Circles | `circles_noise` | Two concentric circles (closed, non-linear boundary) |
| Blobs | `blobs_features` | Isotropic Gaussian clusters in arbitrary dimensions |
| XOR | — | Four-quadrant checkerboard (linearly inseparable) |
Real datasets (sklearn classic, medical, UCI) are also used in comparison experiments.
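The Moons, Circles, and Blobs geometries mirror scikit-learn's standard generators, and the XOR checkerboard is straightforward to build by hand. A sketch of the Moons and XOR geometries, assuming scikit-learn is available; the library's own generators are configured through the `DatasetSpec` parameters above:

```python
import numpy as np
from sklearn.datasets import make_moons

# Two interleaving crescents; `noise` plays the role of moons_noise above
X_moons, y_moons = make_moons(n_samples=200, noise=0.2, random_state=0)

# Four-quadrant XOR checkerboard: the label is the XOR of the two
# coordinate signs, so no single linear boundary separates the classes
rng = np.random.default_rng(0)
X_xor = rng.uniform(-1.0, 1.0, size=(200, 2))
y_xor = ((X_xor[:, 0] > 0) ^ (X_xor[:, 1] > 0)).astype(int)
```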
All experiment scripts live in `data_complexity/experiments/runs/` and are documented there in full.
| Family | Scripts | Studies |
|---|---|---|
| Pairwise complexity | 16 | How complexity metrics co-vary under controlled manipulations (separation, variance, imbalance, dimensionality, geometry) |
| Averaged / grouped | 2 | Metric co-variation averaged across multiple dataset geometries for geometry-agnostic conclusions |
| Complexity + ML | 10 | Correlation between complexity metrics and classifier accuracy under the same manipulations |
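At their core, the "Complexity + ML" studies reduce to: sweep one dataset parameter, record a complexity metric and a cross-validated classifier score at each setting, and correlate the two series. A minimal hand-rolled sketch of that idea (the `Experiment` pipeline above automates all of this; the metric key `"N3"` and the model choice here are illustrative assumptions, not the library's defaults):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

from data_complexity.data_metrics.metrics import ComplexityMetrics

complexities, accuracies = [], []
for noise in [0.05, 0.1, 0.2, 0.3, 0.4]:
    X, y = make_moons(n_samples=400, noise=noise, random_state=0)
    metrics = ComplexityMetrics(dataset={"X": X, "y": y}).get_all_metrics_scalar()
    complexities.append(metrics["N3"])  # assumes "N3" is a key in the flat dict
    accuracies.append(cross_val_score(LogisticRegression(), X, y, cv=5).mean())

# Monotone relationship between complexity and accuracy across the sweep
rho, p = spearmanr(complexities, accuracies)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```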
Pre-defined experiment configurations (runnable by name):
```python
from data_complexity.experiments.pipeline import list_configs, run_experiment

print(list_configs())
# ['gaussian_variance', 'gaussian_separation', 'gaussian_correlation',
#  'gaussian_imbalance', 'moons_noise', 'circles_noise', 'blobs_features']

run_experiment("gaussian_separation")
```

| Module | Description | Docs |
|---|---|---|
| `data_complexity/data_metrics/` | Complexity metric implementations (all 35+) | `data_metrics/README.md` |
| `data_complexity/experiments/pipeline/` | Generic experiment framework: config, run, analyse, save/load | `pipeline/README.md` |
| `data_complexity/experiments/classification/` | ML model evaluation: models, metrics, evaluators, pipeline | `classification/README.md` |
| `data_complexity/experiments/runs/` | Executable experiment scripts (28 studies) | `runs/README.md` |
```bash
pdm run pytest tests/ -v
pdm run pytest tests/ -v -k "metric"   # filter by name
```

Test modules cover: complexity metrics, ML models, evaluation metrics, evaluators, pipeline orchestration, experiment framework, grouped experiments.