data_complexity_analysis

Python library for characterising dataset difficulty in binary classification tasks. Wraps PyCol (Python Class Overlap Library) with a clean metric interface and adds a configurable experiment framework for studying how complexity metrics correlate with ML classifier performance.

What it studies

A central question in machine learning is: what makes a dataset hard to classify? This library provides quantitative tools to answer that question by computing 35+ complexity metrics — measures of feature overlap, class boundary geometry, instance-level neighbourhood structure, and statistical distribution similarity — and correlating them with the generalisation performance of standard classifiers under controlled dataset manipulations (varying class separation, noise, imbalance, dimensionality, and dataset geometry).
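The core workflow can be sketched without the library itself: sweep a dataset manipulation, compute a complexity proxy alongside a classifier score, then correlate the two. Below is a minimal illustration using plain scikit-learn; the 1-NN error rate is an illustrative stand-in for the N3 metric, not this library's implementation, and the library's own metrics and pipeline (shown below) replace both halves of the loop.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

complexity, accuracy = [], []
for noise in [0.05, 0.15, 0.25, 0.35]:
    X, y = make_moons(n_samples=400, noise=noise, random_state=0)
    # 1-NN error rate: a crude stand-in for the N3 instance-overlap metric
    complexity.append(1 - cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=5).mean())
    accuracy.append(cross_val_score(SVC(), X, y, cv=5).mean())

# Harder data (higher complexity) should mean lower accuracy: expect r < 0
print(np.corrcoef(complexity, accuracy)[0, 1])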


Installation

Requires Python 3.13, with PDM for dependency management:

pdm install


Quick start

Compute complexity metrics on a dataset

from data_complexity.data_metrics.metrics import ComplexityMetrics
import numpy as np

dataset = {"X": np.random.randn(200, 2), "y": np.array([0] * 100 + [1] * 100)}
cm = ComplexityMetrics(dataset=dataset)

# All 35+ metrics as a flat dict of scalars
print(cm.get_all_metrics_scalar())
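Since the result is a flat dict of metric names to scalar values, individual metrics can be pulled out or filtered directly. The key names are assumed to follow the metric table below:

scores = cm.get_all_metrics_scalar()
for name, value in sorted(scores.items()):
    # e.g. F1, N3, kDN, ... (names per the metric table below)
    print(f"{name}: {value:.4f}")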

Run a pre-defined experiment

from data_complexity.experiments.pipeline import run_experiment

exp = run_experiment("moons_noise")   # runs, saves plots and CSVs to results/

Build a custom experiment

from data_complexity.experiments.pipeline import (
    Experiment, ExperimentConfig, DatasetSpec, ParameterSpec, datasets_from_sweep
)

config = ExperimentConfig(
    datasets=datasets_from_sweep(
        DatasetSpec("Gaussian", {"cov_type": "spherical", "class_separation": 4.0}),
        ParameterSpec("cov_scale", [0.5, 1.0, 2.0, 4.0], label_format="scale={value}"),
    ),
    x_label="cov_scale",
    cv_folds=5,
    ml_metrics=["accuracy", "f1"],
    name="my_gaussian_variance",
)

exp = Experiment(config)
exp.run(verbose=True, n_jobs=-1)
exp.compute_distances()
exp.plot()
exp.save()
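The same pattern works for any generator/parameter pair from the synthetic dataset table below. For instance, a Moons noise sweep; this config is a hypothetical one assembled from the documented moons_noise parameter, not a pre-defined experiment:

config = ExperimentConfig(
    datasets=datasets_from_sweep(
        DatasetSpec("Moons", {}),
        ParameterSpec("moons_noise", [0.05, 0.15, 0.25, 0.35], label_format="noise={value}"),
    ),
    x_label="moons_noise",
    cv_folds=5,
    ml_metrics=["accuracy", "f1"],
    name="my_moons_noise",
)

exp = Experiment(config)
exp.run(verbose=True, n_jobs=-1)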

Complexity metrics

35+ metrics grouped into six categories. Each captures a distinct aspect of classification difficulty.

| Category | Count | What it measures | Metrics |
| --- | --- | --- | --- |
| Feature Overlap | 6 | Linear separability of individual features and best linear projections | F1, F1v, F2, F3, F4, IN |
| Instance Overlap | 9 | Fraction of instances in ambiguous neighbourhood regions | Raug, N3, kDN, CM, Borderline, … |
| Structural Overlap | 9 | Topology of the class boundary; cluster fragmentation | N1, T1, Clust, ONB, DBC, … |
| Multiresolution Overlap | 5 | Class purity aggregated across multiple spatial resolutions | MRCA, C1, C2, Purity, … |
| Classical Measures | 1 | Dataset-level statistics independent of class geometry | IR (Imbalance Ratio) |
| Distributional Measures | 5 | Statistical distribution overlap and decision-boundary geometry | Silhouette, Bhattacharyya, Wasserstein, … |

Full metric reference with equations and citations: data_complexity/data_metrics/README.md
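As a concrete example of the simplest category, the Imbalance Ratio (IR) under its usual definition is just the majority-to-minority class-size ratio. A standalone sketch of that formula, not the library's implementation:

import numpy as np

def imbalance_ratio(y: np.ndarray) -> float:
    """IR = n_majority / n_minority; equals 1.0 for perfectly balanced classes."""
    counts = np.bincount(y)
    counts = counts[counts > 0]   # ignore absent class labels
    return counts.max() / counts.min()

y = np.array([0] * 150 + [1] * 50)
print(imbalance_ratio(y))         # 3.0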


Synthetic dataset types

Experiments use five parametric synthetic generators, all producing binary (two-class) classification datasets unless otherwise noted.

| Type | Key parameters | Geometry |
| --- | --- | --- |
| Gaussian | class_separation, cov_type, cov_scale, cov_correlation, minority_reduce_scaler | Two Gaussian clusters in 2D |
| Moons | moons_noise | Two interleaving crescents (non-linear boundary) |
| Circles | circles_noise | Two concentric circles (closed, non-linear boundary) |
| Blobs | blobs_features | Isotropic Gaussian clusters in arbitrary dimensions |
| XOR | (none) | Four-quadrant checkerboard (linearly inseparable) |
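For reference, the XOR geometry is easy to reproduce by hand. A minimal numpy sketch of a four-quadrant checkerboard, not the library's generator:

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
# Label by quadrant parity: opposite quadrants share a class,
# so no single line separates the two classes (XOR pattern)
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)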

Real datasets (sklearn classic, medical, UCI) are also used in comparison experiments.


Experiments

All experiment scripts live in data_complexity/experiments/runs/ and are documented there in full.

| Family | Scripts | Studies |
| --- | --- | --- |
| Pairwise complexity | 16 | How complexity metrics co-vary under controlled manipulations (separation, variance, imbalance, dimensionality, geometry) |
| Averaged / grouped | 2 | Metric co-variation averaged across multiple dataset geometries, for geometry-agnostic conclusions |
| Complexity + ML | 10 | Correlation between complexity metrics and classifier accuracy under the same manipulations |

Pre-defined experiment configurations (runnable by name):

from data_complexity.experiments.pipeline import list_configs, run_experiment

print(list_configs())
# ['gaussian_variance', 'gaussian_separation', 'gaussian_correlation',
#  'gaussian_imbalance', 'moons_noise', 'circles_noise', 'blobs_features']

run_experiment("gaussian_separation")

Module overview

| Module | Description | Docs |
| --- | --- | --- |
| data_complexity/data_metrics/ | Complexity metric implementations (all 35+) | data_metrics/README.md |
| data_complexity/experiments/pipeline/ | Generic experiment framework: config, run, analyse, save/load | pipeline/README.md |
| data_complexity/experiments/classification/ | ML model evaluation: models, metrics, evaluators, pipeline | classification/README.md |
| data_complexity/experiments/runs/ | Executable experiment scripts (28 studies) | runs/README.md |

Testing

pdm run pytest tests/ -v
pdm run pytest tests/ -v -k "metric"   # filter by name

Test modules cover: complexity metrics, ML models, evaluation metrics, evaluators, pipeline orchestration, experiment framework, grouped experiments.
