data_complexity_analysis

Python library for characterising dataset difficulty in binary classification tasks. Wraps PyCol (Python Class Overlap Library) with a clean metric interface and adds a configurable experiment framework for studying how complexity metrics correlate with ML classifier performance.

What it studies

A central question in machine learning is: what makes a dataset hard to classify? This library provides quantitative tools to answer that question by computing 35+ complexity metrics — measures of feature overlap, class boundary geometry, instance-level neighbourhood structure, and statistical distribution similarity — and correlating them with the generalisation performance of standard classifiers under controlled dataset manipulations (varying class separation, noise, imbalance, dimensionality, and dataset geometry).
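The core workflow can be sketched without the library itself: sweep a dataset manipulation, compute a complexity proxy alongside a classifier score, then correlate the two. Below is a minimal illustration using plain scikit-learn; the 1-NN error rate is an illustrative stand-in for the N3 metric, not this library's implementation, and the library's own metrics and pipeline (shown below) replace both halves of the loop.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

complexity, accuracy = [], []
for noise in [0.05, 0.15, 0.25, 0.35]:
    X, y = make_moons(n_samples=400, noise=noise, random_state=0)
    # 1-NN error rate: a crude stand-in for the N3 instance-overlap metric
    complexity.append(1 - cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=5).mean())
    accuracy.append(cross_val_score(SVC(), X, y, cv=5).mean())

# Harder data (higher complexity) should mean lower accuracy: expect r < 0
print(np.corrcoef(complexity, accuracy)[0, 1])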


Installation

Requires Python 3.13, with PDM for dependency management:

pdm install


Quick start

Compute complexity metrics on a dataset

from data_complexity.data_metrics.metrics import ComplexityMetrics
import numpy as np

dataset = {"X": np.random.randn(200, 2), "y": np.array([0] * 100 + [1] * 100)}
cm = ComplexityMetrics(dataset=dataset)

# All 35+ metrics as a flat dict of scalars
print(cm.get_all_metrics_scalar())
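Since the result is a flat dict of metric names to scalar values, individual metrics can be pulled out or filtered directly. The key names are assumed to follow the metric table below:

scores = cm.get_all_metrics_scalar()
for name, value in sorted(scores.items()):
    # e.g. F1, N3, kDN, ... (names per the metric table below)
    print(f"{name}: {value:.4f}")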

Run a pre-defined experiment

from data_complexity.experiments.pipeline import run_experiment

exp = run_experiment("moons_noise")   # runs, saves plots and CSVs to results/

Build a custom experiment

from data_complexity.experiments.pipeline import (
    Experiment, ExperimentConfig, DatasetSpec, ParameterSpec, datasets_from_sweep
)

config = ExperimentConfig(
    datasets=datasets_from_sweep(
        DatasetSpec("Gaussian", {"cov_type": "spherical", "class_separation": 4.0}),
        ParameterSpec("cov_scale", [0.5, 1.0, 2.0, 4.0], label_format="scale={value}"),
    ),
    x_label="cov_scale",
    cv_folds=5,
    ml_metrics=["accuracy", "f1"],
    name="my_gaussian_variance",
)

exp = Experiment(config)
exp.run(verbose=True, n_jobs=-1)
exp.compute_distances()
exp.plot()
exp.save()
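The same pattern works for any generator/parameter pair from the synthetic dataset table below. For instance, a Moons noise sweep; this config is a hypothetical one assembled from the documented moons_noise parameter, not a pre-defined experiment:

config = ExperimentConfig(
    datasets=datasets_from_sweep(
        DatasetSpec("Moons", {}),
        ParameterSpec("moons_noise", [0.05, 0.15, 0.25, 0.35], label_format="noise={value}"),
    ),
    x_label="moons_noise",
    cv_folds=5,
    ml_metrics=["accuracy", "f1"],
    name="my_moons_noise",
)

exp = Experiment(config)
exp.run(verbose=True, n_jobs=-1)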

Complexity metrics

35+ metrics grouped into six categories. Each captures a distinct aspect of classification difficulty.

| Category | Count | What it measures | Metrics |
| --- | --- | --- | --- |
| Feature Overlap | 6 | Linear separability of individual features and best linear projections | F1, F1v, F2, F3, F4, IN |
| Instance Overlap | 9 | Fraction of instances in ambiguous neighbourhood regions | Raug, N3, kDN, CM, Borderline, … |
| Structural Overlap | 9 | Topology of the class boundary; cluster fragmentation | N1, T1, Clust, ONB, DBC, … |
| Multiresolution Overlap | 5 | Class purity aggregated across multiple spatial resolutions | MRCA, C1, C2, Purity, … |
| Classical Measures | 1 | Dataset-level statistics independent of class geometry | IR (Imbalance Ratio) |
| Distributional Measures | 5 | Statistical distribution overlap and decision-boundary geometry | Silhouette, Bhattacharyya, Wasserstein, … |

Full metric reference with equations and citations: data_complexity/data_metrics/README.md
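As a concrete example of the simplest category, the Imbalance Ratio (IR) under its usual definition is just the majority-to-minority class-size ratio. A standalone sketch of that formula, not the library's implementation:

import numpy as np

def imbalance_ratio(y: np.ndarray) -> float:
    """IR = n_majority / n_minority; equals 1.0 for perfectly balanced classes."""
    counts = np.bincount(y)
    counts = counts[counts > 0]   # ignore absent class labels
    return counts.max() / counts.min()

y = np.array([0] * 150 + [1] * 50)
print(imbalance_ratio(y))         # 3.0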


Synthetic dataset types

Experiments use five parametric synthetic generators, all producing binary (two-class) classification datasets unless otherwise noted.

| Type | Key parameters | Geometry |
| --- | --- | --- |
| Gaussian | class_separation, cov_type, cov_scale, cov_correlation, minority_reduce_scaler | Two Gaussian clusters in 2D |
| Moons | moons_noise | Two interleaving crescents (non-linear boundary) |
| Circles | circles_noise | Two concentric circles (closed, non-linear boundary) |
| Blobs | blobs_features | Isotropic Gaussian clusters in arbitrary dimensions |
| XOR | (none) | Four-quadrant checkerboard (linearly inseparable) |
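For reference, the XOR geometry is easy to reproduce by hand. A minimal numpy sketch of a four-quadrant checkerboard, not the library's generator:

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
# Label by quadrant parity: opposite quadrants share a class,
# so no single line separates the two classes (XOR pattern)
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)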

Real datasets (sklearn classic, medical, UCI) are also used in comparison experiments.


Experiments

All experiment scripts live in data_complexity/experiments/runs/ and are documented there in full.

| Family | Scripts | Studies |
| --- | --- | --- |
| Pairwise complexity | 16 | How complexity metrics co-vary under controlled manipulations (separation, variance, imbalance, dimensionality, geometry) |
| Averaged / grouped | 2 | Metric co-variation averaged across multiple dataset geometries, for geometry-agnostic conclusions |
| Complexity + ML | 10 | Correlation between complexity metrics and classifier accuracy under the same manipulations |

Pre-defined experiment configurations (runnable by name):

from data_complexity.experiments.pipeline import list_configs, run_experiment

print(list_configs())
# ['gaussian_variance', 'gaussian_separation', 'gaussian_correlation',
#  'gaussian_imbalance', 'moons_noise', 'circles_noise', 'blobs_features']

run_experiment("gaussian_separation")

Module overview

| Module | Description | Docs |
| --- | --- | --- |
| data_complexity/data_metrics/ | Complexity metric implementations (all 35+) | data_metrics/README.md |
| data_complexity/experiments/pipeline/ | Generic experiment framework: config, run, analyse, save/load | pipeline/README.md |
| data_complexity/experiments/classification/ | ML model evaluation: models, metrics, evaluators, pipeline | classification/README.md |
| data_complexity/experiments/runs/ | Executable experiment scripts (28 studies) | runs/README.md |

Testing

pdm run pytest tests/ -v
pdm run pytest tests/ -v -k "metric"   # filter by name

Test modules cover: complexity metrics, ML models, evaluation metrics, evaluators, pipeline orchestration, experiment framework, grouped experiments.
