ProsopyBase

Raw data, error correction, and preprocessing pipeline for the Prosobab Neo-Babylonian prosopography database.

This repository is the foundational data layer for the ProsopyEstimation and other projects. It produces a clean, standardized attestation dataset (preprocessed_whole_data.csv) that downstream analyses build on.

Repository Structure

ProsopyBase/
├── data/
│   ├── raw/
│   │   └── initial_prosobab_data.csv          # Raw export from Prosobab database
│   ├── corrections/
│   │   ├── final-fixes.csv                    # 25 documented date corrections
│   │   ├── unique_kings_verification.csv      # Known regnal year ranges (from eBL)
│   │   ├── duplicate_attestations_from_initial.csv
│   │   ├── julian_dates_calculator_mismatches.csv
│   │   └── initial_prosobab_inconsistencies_julian_date.csv
│   └── processed/
│       └── preprocessed_whole_data.csv        # Output: clean, corrected dataset
└── scripts/
    ├── helpers.py                             # Shared utility functions
    ├── error_correction.py                    # Applies documented date corrections
    └── preprocessing.py                       # Main preprocessing pipeline

Data Description

`data/raw/initial_prosobab_data.csv`

The raw export from the Prosobab database — a prosopography of individuals attested in Neo-Babylonian and Achaemenid-period legal documents (c. 625–330 BCE). Each row is one attestation of one person in one tablet.

Key columns: PID, Tablet ID, Julian date, Date, Role, ind.Name, ind.Patronym, ind.Family name, Archive, Place of writing

`data/corrections/`

Results of the data-curation investigation that preceded regular preprocessing:

File	Contents
`final-fixes.csv`	25 tablet-level Julian year corrections (see below)
`unique_kings_verification.csv`	Ruling year ranges for each attested Babylonian king, sourced from eBL
`duplicate_attestations_from_initial.csv`	Diagnostic: attestations appearing more than once (explained by differing "Type and objects" values, not errors)
`julian_dates_calculator_mismatches.csv`	Diagnostic: tablets where the calculated Julian year (from regnal year + king start) differs from the recorded value
`initial_prosobab_inconsistencies_julian_date.csv`	Diagnostic: tablets whose Julian date falls outside the known reign of their attested king

About the corrections (final-fixes.csv): 25 tablets contain clear typographic errors in their Julian year (e.g. 51 instead of 531). These were identified by cross-referencing king regnal ranges against attested dates. The corrections are applied programmatically by error_correction.py; the raw data file is never modified.

`data/processed/preprocessed_whole_data.csv`

The output of the preprocessing pipeline:

~15,000+ attestations of ~11,500+ individuals across ~4,000 tablets
Structural normalization applied (split-year entries resolved, PID-less rows removed)
All 25 documented date corrections applied
This file is the input for ProsopyEstimation and ProsopyWitness

Preprocessing Pipeline

data/raw/initial_prosobab_data.csv
        |
        v
[preprocessing.py]
  1. Fill missing Julian date fields
  2. Extract primary year from split-year entries (e.g. "517/516" → "517")
  3. Remove attestations without a valid PID
        |
        v
[error_correction.py]
  Apply 25 documented corrections from data/corrections/final-fixes.csv
        |
        v
data/processed/preprocessed_whole_data.csv

Requirements

pip install -r requirements.txt

Dependencies: pandas, numpy

Usage

Run the full preprocessing pipeline from the scripts/ directory:

cd scripts
python preprocessing.py

Output: data/processed/preprocessed_whole_data.csv

Expected output summary:

Preprocessed data written to ../data/processed/preprocessed_whole_data.csv
  Rows: ~15128
  Unique tablets: ~4005
  Unique people (PIDs): ~11496

To verify a specific correction (e.g. tablet 432, corrected from 527 → 536):

import pandas as pd
df = pd.read_csv('data/processed/preprocessed_whole_data.csv')
print(df[df['Tablet ID'] == 432]['Split_Julian_dates'].unique())
# Expected: [536.0]

Data Source

Waerzeggers, C., Groß, M., et al. (2019). Prosobab: Prosopography of Babylonia (c. 620–330 BCE). Leiden University. https://prosobab.leidenuniv.nl.

Waerzeggers, C., & Groß, M. (2022). Prosobab (version 1.0). DANS Data Station Archaeology. https://doi.org/10.17026/dans-zvn-eece.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
data		data
scripts		scripts
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ProsopyBase

Repository Structure

Data Description

`data/raw/initial_prosobab_data.csv`

`data/corrections/`

`data/processed/preprocessed_whole_data.csv`

Preprocessing Pipeline

Requirements

Usage

Data Source

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

DigitalPasts/ProsopyBase

Folders and files

Latest commit

History

Repository files navigation

ProsopyBase

Repository Structure

Data Description

data/raw/initial_prosobab_data.csv

data/corrections/

data/processed/preprocessed_whole_data.csv

Preprocessing Pipeline

Requirements

Usage

Data Source

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`data/raw/initial_prosobab_data.csv`

`data/corrections/`

`data/processed/preprocessed_whole_data.csv`

Packages