Ontology-Based Data Generation for Neuro-Symbolic Reasoning.
Vincent Van Schependom, Cas Proost, Pieter Bonte
Department of Computer Science, KU Leuven campus Kulak Kortrijk
Knowledge Graph Reasoning (KGR) involves deriving new, implicit knowledge from a Knowledge Graph (KG) and its accompanying ontology. Traditionally this is done by symbolic reasoners, which execute ontology rules with perfect soundness and completeness — but are sensitive to noise and computationally expensive on real-world KGs.
Neuro-symbolic reasoners have emerged as a scalable alternative: instead of executing rules at inference time, a neural model is trained to imitate them (a paradigm known as approximate reasoning). This shift comes with a critical prerequisite: the model must be trained on a dataset that faithfully reflects the target ontology. Real-world KGs (e.g. DBpedia, Freebase) are too noisy and incomplete for this, and existing synthetic data pipelines - which we refer to as Unguided Deductive Materialization (UDM) - fall short in two ways:
- Structurally shallow data. UDM generates base facts without ontology guidance, then materializes targets via a forward-chaining reasoner. The resulting graphs are dominated by shallow inferences; the deep, multi-hop derivations a model is actually meant to learn occur only by accident.
- Trivial negatives. UDM relies on random or constrained corruption, which produces negatives that are easy to reject from surface features alone - collapsing the training signal to pattern matching rather than genuine reasoning.
A further practical issue is scalability: forward-chaining materializers must hold the entire deductive closure in memory, causing them to fail on large ontologies, a ceiling we call the Reasoning Wall.
Synthology addresses all of the above via backward-chaining proof construction: given any OWL 2 RL ontology, it purposefully engineers training samples by constructing proof trees for target triples, guaranteeing multi-hop derivations by design. The three main contributions are:
- Synthology: the first ontology-agnostic, backward-chaining synthetic data generator for OWL 2 RL. Any supported ontology can now be used to train a neuro-symbolic reasoner without expensive data gathering.
- Proof-based negative sampling: hard negatives are constructed directly from proof trees, producing near-miss facts that require genuine multi-hop reasoning to correctly classify.
- Empirical evaluation: a comparative study across two ontologies (Family Tree, OWL2Bench) demonstrating significant advantages in hop distribution, predicate coverage, negative-sample quality, and scalability over UDM baselines. Synthology also avoids the Reasoning Wall that UDM hits at scale.
As a supporting deliverable, we release an open-source PyTorch Lightning reimplementation of the Recursive Reasoning Network (RRN), a neuro-symbolic link-prediction model, used as the evaluation architecture throughout.
- Introduction
- Features
- Installation
- Reproducibility
- Training RRN model
- Data generation
- Visual verification
- Hyperparameter Optimization (WandB Sweeps)
- Custom configurations
- Experiment Protocols
- OWL2 RL Profile Coverage and Appendix Tables
- Configuration Parameters
- Development
- Known issues
Don't worry if the repository looks a bit overwhelming :) I value reproducibility of scientific experiments very highly, so:
- I created a sophisticated uv monorepo, i.e. a single repository containing multiple packages as 'subprojects', each with their own dependencies and configurations.
- I added a Linux devcontainer for easy setup on any OS (including Windows, which is not Unix-based like Linux or macOS).
The subprojects (located in apps/) include the core Synthology generator (ont_generator), the UDM/Jena baseline pipeline (udm_baseline), the RRN code (RRN), the ASP-based Family Tree generator (asp_generator), and supporting scripts for visualization and hyperparameter optimization.
The uv nature of this repo makes it possible to easily manage dependencies between these subprojects. Furthermore, it provides a task runner (invoke) to run common tasks (e.g., generating datasets, training models, running experiments) from the project root. Use the following command to see all available tasks:
uv run invoke --list        # list all available tasks
uv run invoke <task-name>   # run a specific task

This project uses uv for dependency management and invoke for task automation.
Make sure you have cloned the repo and are in the project root directory.
On Unix systems, you can locally run all commands as-is. As an alternative, follow the Windows instructions to use the devcontainer. Below are the steps to set up the project on your own macOS or Linux machine without using the devcontainer.
If you don't already have uv installed, install it first, e.g. on macOS with Homebrew:
brew install uv

Or on Linux, using the official installation script:
curl -LsSf https://astral.sh/uv/install.sh | sh

Then, install the project dependencies:
uv sync

As you can see, with uv, installing dependencies is as easy as running a single command! No contradictory requirements.txt files or anything like that :)
The family tree data generator uses the DLV system to perform symbolic reasoning over family trees by means of the ontology mentioned above.
If you are running the project on your own Linux machine, you can use the provided installation script to download and set up DLV automatically:
bash install-dlv-linux.sh

If you are running the project on your own macOS machine, you have to download the DLV executable for your platform from the official website.
After you have downloaded and extracted the DLV executable, change the permissions to make it executable:
chmod +x /path/to/dlv/executable

Finally, update the configuration file configs/asp_generator/config.yaml to point to the DLV executable you just downloaded:
# configs/asp_generator/config.yaml
# ...
dlv: /path/to/dlv/executable # <- change this!
# ...

Some workflows (notably OWL2Bench generation and Jena-backed materialization) rely on files in vendor/.
By default, this repo keeps these folders out of git history (see .gitignore) to avoid committing large third-party artifacts.
From the project root, set them up as follows:
mkdir -p vendor
# OWL2Bench Java generator source (required for gen-owl2bench* tasks)
git clone https://github.com/kracr/owl2bench.git vendor/OWL2Bench
# Apache Jena distribution (required by UDM/Jena materialization helper)
curl -L -o /tmp/apache-jena-6.0.0.tar.gz \
https://archive.apache.org/dist/jena/binaries/apache-jena-6.0.0.tar.gz
tar -xzf /tmp/apache-jena-6.0.0.tar.gz -C vendor

After cloning OWL2Bench, ensure the RL ontology path exists at:
ontologies/UNIV-BENCH-OWL2RL.owl
If needed, copy it from the cloned vendor folder:
mkdir -p ontologies
cp vendor/OWL2Bench/UNIV-BENCH-OWL2RL.owl ontologies/

For the easiest use, you should open the devcontainer, which I included in .devcontainer/, for example using VS Code:
- I assume you are in the project root directory.
- Click the >< icon in the bottom-left corner of VS Code.
- Select Reopen in Container.
The (Linux) devcontainer will be built using Dockerfile and post_create.sh will take care of installing uv, as well as syncing the project dependencies and setting up the config files.
After the installation is complete, VS Code might prompt you with "Press any key to exit".
Once you actually press a key, a new terminal will open in the devcontainer, but the virtual environment might not be activated yet.
Close the terminal and open a new one (CMD + J or Terminal > Create New Terminal). This new terminal should now have the virtual environment activated automatically.
You should always see (synthology) > at the beginning of the terminal prompt when working in the devcontainer, which indicates that the virtual environment is active.
You don't need to install DLV manually (like on macOS/Linux), as it is already installed in the devcontainer.
See the Development section for instructions on setting up development tools like ruff and ty (using VS Code extensions is recommended).
I only ran experiments on an LSF cluster. You can use the provided job scripts in jobscripts/ as templates. Make sure to adjust the resource requests and module loads according to your cluster's specifications.
The same dependencies apply as for the local installation (Python, uv, Java, Maven, OWL2Bench, Apache Jena).
If you're on an LSF cluster, you can load Java and Maven modules as follows:
# Load Java 21 (required by Jena 5.x)
module load openjdk/21
# Verify Java is available and correct version
which java && java -version
# Now install Maven
./install-mvn.sh
# Verify Maven is available
which mvn && mvn -v

The exact sequence of invoke commands needed to reproduce our results is located in the 3 experiment-specific markdown files:
- experiments/exp1.md
- experiments/exp2.md
- experiments/exp3.md
To train the Recursive Reasoning Network (RRN) model on the generated family tree datasets, use the following invoke task:
uv run invoke train-rrn
The relevant configuration files are laid out as follows:

    configs/rrn/
      config.yaml
      data/
        default.yaml
        dataset/
          asp.yaml
          ont.yaml
      model/
        default.yaml
      hyperparams/
        default.yaml

To tweak the parameters, please refer to the configuration section. This also applies to all data generation methods.
All ontologies that were used for data generation are located in the ontologies/ folder.
All generators output data in a standardized format.
Each split (train, val, test) contains:
- facts.csv: Base facts (explicit relations/memberships).
- targets.csv: All facts (base + inferred) and negative samples.
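As a quick sanity check, you can load a generated split with pandas (assuming it is available in your environment). The snippet below is only an illustration: the exact column layout of the CSVs is defined by the generators, and the path refers to the ontology-based family tree dataset described further down.

```python
import pandas as pd

# Illustrative only: inspect one generated split.
# The directory is produced by the ontology-based family tree generator
# (see the data generation commands below); adjust the path for other datasets.
split_dir = "data/ont/family/train"

facts = pd.read_csv(f"{split_dir}/facts.csv")      # base (explicit) facts
targets = pd.read_csv(f"{split_dir}/targets.csv")  # base + inferred facts and negative samples

print(f"{len(facts)} base facts, {len(targets)} target rows")
print(targets.head())
```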
Below, I describe how to generate the reldata Family Tree dataset based on the ASP solver by Patrick Hohenecker.
Quick Start (generates and converts to standard format):
uv run invoke gen-ft-asp

This command generates raw reldata output in data/asp/out-reldata and then automatically converts it to the standard format (facts.csv and targets.csv) in data/asp/family_tree/{train,val,test}.
To use the backward-chaining ontology-based generator (which outputs the standard format):
uv run invoke gen-ft-ont

Or run it directly:
uv run --package ont_generator python -m ont_generator.create_data

This generates facts.csv and targets.csv in data/ont/family/{train,val,test}.
You can run hyperparameter sweeps that span both the ontology data generation and the RRN model training. This allows you to find the optimal combination of dataset characteristics (e.g., complexity, size, negative sampling ratio) and model hyperparameters.
A wrapper script scripts/sweep_ont_rrn.py handles the coordination between the generator and the model.
- Define your sweep configuration: Create a YAML file (e.g., configs/my_sweep.yaml) defining the parameters to tune. Use the prefix gen. for generator parameters and rrn. for RRN parameters. Example (configs/sweep_sample.yaml):

      program: scripts/sweep_ont_rrn.py
      method: bayes
      metric:
        name: val_loss
        goal: minimize
      parameters:
        # Generator Parameters
        gen.dataset.n_train:
          values: [1000, 2000]
        gen.neg_sampling.ratio:
          min: 0.5
          max: 2.0
        # Model Parameters
        rrn.hyperparams.learning_rate:
          min: 0.0001
          max: 0.01

- Initialize the sweep:

      uv run wandb sweep configs/sweep_sample.yaml

  This will output a sweep ID (e.g., username/project/sweep_id).

- Start the agent:

      uv run wandb agent <SWEEP_ID>
The script automatically generates a temporary dataset for each run, trains the model on it, reports metrics to WandB, and cleans up the data afterwards.
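For reference, the sketch below illustrates the kind of coordination such a wrapper performs; it is not the actual contents of scripts/sweep_ont_rrn.py. The module invocations are taken from the commands shown elsewhere in this README, while the Hydra override names for the dataset directory (dataset.out_dir, data.dir) are hypothetical.

```python
"""Simplified sketch of a WandB sweep wrapper (not the real scripts/sweep_ont_rrn.py)."""
import subprocess
import tempfile

import wandb


def main() -> None:
    run = wandb.init()  # the agent injects the sampled gen.* and rrn.* parameters here

    # Split sweep parameters by prefix and turn them into Hydra-style overrides.
    gen_overrides = [f"{k[4:]}={v}" for k, v in run.config.items() if k.startswith("gen.")]
    rrn_overrides = [f"{k[4:]}={v}" for k, v in run.config.items() if k.startswith("rrn.")]

    with tempfile.TemporaryDirectory() as tmp_dir:
        # 1) Generate a temporary dataset with the sampled generator parameters.
        #    "dataset.out_dir" is a hypothetical override name.
        subprocess.run(
            ["uv", "run", "--package", "ont_generator", "python", "-m",
             "ont_generator.create_data", f"dataset.out_dir={tmp_dir}", *gen_overrides],
            check=True,
        )
        # 2) Train the RRN on that dataset with the sampled hyperparameters;
        #    the training run reports val_loss to WandB.
        #    "data.dir" is likewise a hypothetical override name.
        subprocess.run(
            ["uv", "run", "--package", "rrn", "python", "-m", "rrn.train",
             f"data.dir={tmp_dir}", *rrn_overrides],
            check=True,
        )
    # 3) The temporary dataset is deleted when the context manager exits.


if __name__ == "__main__":
    main()
```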
This repo uses Hydra for configuration management.
You can modify the default configurations in 2 ways:
All configurations -- for the link-prediction models and the data generators -- are stored in the configs/ folder.
You can create your own configuration files by copying and modifying the existing ones.
For example, create a hyperparams2.yaml file in configs/rrn/hyperparams/ and modify configs/rrn/config.yaml to use it:
defaults:
- data: default
- model: default
- hyperparams: hyperparams2 # <- your custom hyperparameters
- _self_
# rest of config...

You can also override specific configuration options directly from the command line.
(note that this only works when running the packages directly, not via invoke)
uv run --package ont_generator python -m ont_generator.create_data \
dataset.n_train=500 \
dataset.n_val=100 \
dataset.n_test=100

Another example, for training the RRN model with custom (hyper)parameters:
uv run --package rrn python -m rrn.train \
data/dataset=asp

This section documents what is currently implemented in the ontology parser/chainer and what is not yet implemented.
The current implementation supports the following core axioms and property types:
- rdfs:subClassOf
- rdfs:subPropertyOf
- rdfs:domain
- rdfs:range (object-class ranges; datatype ranges are currently skipped as inference rules)
- owl:inverseOf
- owl:propertyChainAxiom for chain lengths 1 and 2
- owl:disjointWith (as a consistency constraint)
- rdf:type handling for:
  - owl:SymmetricProperty
  - owl:TransitiveProperty
  - owl:ReflexiveProperty
  - owl:IrreflexiveProperty (constraint)
  - owl:AsymmetricProperty (constraint)
  - owl:FunctionalProperty (constraint)
Important OWL2 RL constructs that are not yet fully supported include:
- Restriction-heavy constructs encoded with blank nodes, such as combinations of:
  - owl:onProperty
  - owl:someValuesFrom
  - owl:allValuesFrom
  - owl:hasValue
  - qualified cardinality variants
- Equivalence and identity constructs:
  - owl:equivalentClass
  - owl:equivalentProperty
  - owl:sameAs closure/rewrite behavior
- Set/boolean class constructors:
  - owl:intersectionOf
  - owl:unionOf
  - owl:complementOf
  - owl:oneOf
- Disjointness/group constructs such as:
  - owl:propertyDisjointWith
  - owl:AllDisjointClasses
  - owl:AllDifferent
Design note: this is an implementation scope choice, not an architectural limitation. New support can be added incrementally through parser handlers and rule templates.
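To make the notion of rule templates concrete, here is a minimal, self-contained sketch (all class and rule names are illustrative, not the actual Synthology code) of how an axiom such as rdfs:subClassOf or owl:TransitiveProperty can be compiled into a head/body template that a backward chainer expands recursively when constructing a proof tree for a goal atom.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    predicate: str
    args: tuple  # variable names like "?x" or concrete individuals


@dataclass(frozen=True)
class Rule:
    head: Atom    # the atom the rule derives
    body: tuple   # atoms that must hold for the head to follow


# Illustrative rule templates (not the actual Synthology encoding):
#   rdfs:subClassOf(Parent, Person)    =>  type(?x, Person) <- type(?x, Parent)
#   owl:TransitiveProperty(ancestorOf) =>  ancestorOf(?x, ?z) <- ancestorOf(?x, ?y), ancestorOf(?y, ?z)
RULES = [
    Rule(Atom("type", ("?x", "Person")), (Atom("type", ("?x", "Parent")),)),
    Rule(Atom("ancestorOf", ("?x", "?z")),
         (Atom("ancestorOf", ("?x", "?y")), Atom("ancestorOf", ("?y", "?z")))),
]


def build_proof(goal: Atom, depth: int = 0, max_depth: int = 3) -> dict:
    """Backward-chain a single proof tree for `goal`.

    The real generator also unifies the goal with the rule head and samples
    individuals from a pool for unbound variables; here only the recursive
    proof structure is shown. Goals that are not expanded further become
    base facts (proof-tree leaves).
    """
    applicable = [r for r in RULES if r.head.predicate == goal.predicate]
    if depth >= max_depth or not applicable or random.random() < 0.3:
        return {"goal": goal, "base_fact": True, "children": []}
    rule = random.choice(applicable)
    return {
        "goal": goal,
        "base_fact": False,
        "children": [build_proof(atom, depth + 1, max_depth) for atom in rule.body],
    }


if __name__ == "__main__":
    print(build_proof(Atom("ancestorOf", ("alice", "dana"))))
```

In the proof-based negative sampling strategy described earlier, near-miss negatives are then derived from such trees, for example by corrupting a proof leaf (cf. corrupt_base_facts in the configuration table below).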
| YAML Parameter | Symbol | Type | Default | Description |
|---|---|---|---|---|
| min_individuals | | int | 1 | Lower acceptance bound on sample size: graphs with fewer individuals are rejected. |
| max_individuals | | int | 1000 | Upper acceptance bound on sample size: graphs with more individuals are rejected. |
| min_rules | | int | 1 | Minimum number of ontology rules selected per generated sample before proof generation. |
| max_rules | | int | 5 | Maximum number of ontology rules selected per generated sample. |
| target_min_proofs_rule | | int | 5 | Target lower bound on proofs kept per selected rule; effectively bounded by how many valid proofs exist. |
| seed | | int | 23 | Seed for pseudorandom sampling (rule selection, proof-root counts, and corruption choices), improving reproducibility. |
| max_recursion | | int | 3 | Per-sample recursion cap for rule reuse in backward chaining; deeper recursion allows longer inference chains. |
| global_max_depth | | int | 10 | Absolute depth limit for recursive proof search; branches beyond this depth are pruned. |
| max_proofs_per_atom | | int | 5 | Hard cap on the number of proofs emitted for one goal atom, preventing combinatorial explosion. |
| individual_pool_size | $\mathcal{U}$ | int | 60 | Target size of the reusable individual pool used when instantiating variables during proof construction. |
| individual_reuse_prob | | float | 0.7 | Probability of reusing an existing individual from the pool rather than creating a new one. |
| use_signature_sampling | | bool | true | If enabled, generated proofs are grouped by structural signature and one representative per group is sampled, improving diversity and reducing redundant Cartesian combinations. |
| min_proof_roots | | int | 5 | Minimum number of independent root-generation cycles attempted per selected rule. |
| max_proof_roots | | int | 20 | Maximum number of independent root-generation cycles attempted per selected rule. |
| always_generate_base | | bool | false | If true, emits a base proof even when derivation rules apply; if false, base proofs are mainly used when no matching rule exists. |
| min_lcc_ratio | | float | 0.8 | Validation threshold for graph connectivity: the largest connected component must cover at least this fraction of individuals. |
| strategy | | enum | proof_based | Negative sampling mode used in the thesis experiments: random, constrained, proof_based. |
| ratio | | float | 1.0 | Target negative-to-positive ratio for generated examples. |
| corrupt_base_facts | | bool | false | Enables corruption of proof-leaf base facts in proof-based logic; this controls whether propagated counterfactual negatives are produced in that branch. |
Creating a new subproject:
uv init apps/my-new-app --package
uv sync

Adding new dependencies only to a specific subproject:
uv add <dependency> --package my-new-app

In case the terminal doesn't show real-time updates, try setting the following environment variable:
export PYTHONUNBUFFERED=1

This forces Python to flush its output buffer immediately.
If you encounter an error related to mvn not being found, make sure you have Apache Maven installed and that the mvn command is available in your system's PATH. You can verify this by running:
which mvn

To add Maven to your PATH, you can follow these steps:
- Download and install Apache Maven from the official website: https://maven.apache.org/download.cgi
- Extract the downloaded archive to a directory of your choice
- Add the bin directory of the extracted Maven folder to your system's PATH environment variable.
E.g.
export MAVEN_EXECUTABLE="$PWD/apache-maven-3.9.13/bin/mvn"
export PATH="$PWD/apache-maven-3.9.13/bin:$PATH"