AskBeacon Code Base

🛑 AskBeacon is meant to operate in conjunction with sBeacon. Detailed instructions for setting up sBeacon with Terraform templates are available in this repository: https://github.com/aehrc/terraform-aws-serverless-beacon. To ensure the connection between sBeacon and AskBeacon is robust and secure, please get in touch via http://bioinformatics.csiro.au to obtain tailored (i.e. security-sensitive) instructions. In the meantime, we encourage you to get familiar with the system through the demo set up here. 🛑

Demo Instance

We have hosted a demo instance to allow potential researchers to perform basic evaluation of the system. The demo instance uses 1000 Genomes Phase 3 data. You can access the demo instance using the following link.

🛑 Variant queries against the 1000 Genomes Phase 3 data are associated with the GRCH37 assembly. Hence, queries must take the following form:

Get me all the individuals having a variant in first chromosome between 11856377 to 11856379 under assembly id "GRCH37"

Launch the demo

Please use the following credentials

username: demo@example.com
password: demo1234

Please note that this AskBeacon instance demonstrates examples 2 and 3 from the paper (i.e. subsets of the 1000 Genomes Project). Interactive access to complete datasets, such as the Parkinson's Progression Markers Initiative (example 1), can only be made available by the original data custodians, so this demo is meant solely to showcase AskBeacon's functionality rather than to support research.

AskBeacon Web Interface

AskBeacon is accessible from the Analytics tab by selecting the AskBeacon analysis.

AskBeacon can also be accessed through the Query tab by clicking the chat icon. This capability automatically populates the sBeacon UI so that you can interact with the canned sBeacon visualization options.

AskBeacon Core Logic

The AskBeacon core logic is presented in this repository, which contains the following Jupyter notebooks.

1 - Extract Information

This notebook outlines how we extract information pertaining to a Beacon query from the user. It evaluates LLM performance on query understanding and information extraction tasks. The notebook initialises LLM models (GPT-4 via Azure OpenAI as well as open-source alternatives such as Mistral, Llama, Gemma and Qwen via Ollama) and extracts four key elements from a user's natural language query:

Scope – determines the target data scope (e.g. g_variants, cohorts, individuals)
Filters – identifies disease or condition filters and their associated scope
Granularity – determines the query type (e.g. count vs record)
Variants – parses genomic coordinates (chromosome, start/end positions) and related medical filters

The notebook reads test queries and expected answers from Queries-and-answers.xlsx and produces JSON-formatted predictions for each model so that accuracy can be assessed against the ground truth.
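
The comparison against ground truth can be pictured with a minimal sketch. It assumes a hypothetical `ExtractedQuery` container for the four elements and a hypothetical `field_accuracy` helper; neither name comes from the notebook, and the sample values are made up for illustration rather than taken from Queries-and-answers.xlsx.

```python
from dataclasses import asdict, dataclass, field, replace

# The four elements the notebook extracts from a natural language query.
FIELDS = ("scope", "granularity", "filters", "variants")


@dataclass
class ExtractedQuery:
    """Hypothetical container for one extraction result (not the repo's model)."""

    scope: str  # e.g. "g_variants", "cohorts", "individuals"
    granularity: str  # e.g. "count" or "record"
    filters: list = field(default_factory=list)
    variants: dict = field(default_factory=dict)


def field_accuracy(predictions, expected):
    """Return, per field, the fraction of queries whose prediction matches exactly."""
    if len(predictions) != len(expected):
        raise ValueError("predictions and expected must have the same length")
    hits = {name: 0 for name in FIELDS}
    for predicted, truth in zip(predictions, expected):
        p, t = asdict(predicted), asdict(truth)
        for name in FIELDS:
            hits[name] += int(p[name] == t[name])
    total = len(expected)
    return {name: (hits[name] / total if total else 0.0) for name in FIELDS}


# Illustrative ground truth (assumed values, not from the spreadsheet).
expected = [
    ExtractedQuery(
        scope="individuals",
        granularity="record",
        variants={"assembly_id": "GRCH37", "chromosome": "1",
                  "start": [11856377], "end": [11856379]},
    ),
    ExtractedQuery(
        scope="individuals",
        granularity="count",
        filters=[{"term": "renal failure", "scope": "individuals"}],
    ),
]

# Illustrative model output: the second query gets the granularity wrong.
predictions = [replace(expected[0]), replace(expected[1], granularity="record")]

scores = field_accuracy(predictions, expected)
print(scores)
```

Scoring each element separately makes it visible which part of the extraction a given model struggles with, rather than collapsing everything into a single pass or fail per query.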

2 - Ontology Retrieval

This notebook demonstrates the ontology indexing and retrieval process based on extracted information. It builds a semantic search system that matches user-supplied terms to biomedical ontology terms using text embeddings. This uses the following files.

`terms.csv`

This comes directly from the sBeacon backend, which keeps a terms table in the following format.

term	label	scope
OBI:0000070	genotyping assay	cohorts
OBI:0000070	genotyping assay	cohorts

`embeddings.csv`

This file is the result of embedding the ontology terms. It is loaded into memory and indexed using DocArray for ontology lookup.

term	label	scope	embedding
OBI:0000070	genotyping assay	cohorts	[-0.06088845431804657, ...]
OBI:0000070	genotyping assay	cohorts	[-0.06088845431804657, ...]

The notebook steps are:

Embed ontology terms – ontology terms from terms.csv are embedded using Azure OpenAI text embeddings and saved to embeddings.csv
Index embeddings – the embeddings are loaded into an in-memory vector database (InMemoryExactNNIndex from DocArray)
Semantic retrieval – a user-supplied natural language term (e.g. "renal failure") is embedded and the nearest ontology terms are returned with similarity scores (e.g. SNOMED:42399005 – Renal failure – score 0.91)
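
The index and retrieval steps above can be sketched with the standard library alone. This is a simplified stand-in, not the notebook's code: it replaces DocArray's InMemoryExactNNIndex with a plain list and brute-force cosine similarity, and the three-dimensional vectors, the `nearest_terms` helper and the scopes given for the SNOMED rows are assumptions made purely for illustration (real embeddings come from Azure OpenAI and are much longer).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_terms(query_embedding, index, limit=3):
    """Exact nearest-neighbour search: score every row, return the best matches."""
    scored = [(cosine_similarity(query_embedding, row["embedding"]), row) for row in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(row["term"], row["label"], score) for score, row in scored[:limit]]


# Rows shaped like embeddings.csv; the vectors are toy values, not real embeddings.
index = [
    {"term": "OBI:0000070", "label": "genotyping assay", "scope": "cohorts",
     "embedding": [1.0, 0.0, 0.0]},
    {"term": "SNOMED:42399005", "label": "Renal failure", "scope": "individuals",
     "embedding": [0.0, 1.0, 0.0]},
    {"term": "SNOMED:49049000", "label": "Parkinson's disease", "scope": "individuals",
     "embedding": [0.0, 0.0, 1.0]},
]

# Stand-in for the embedding of a user-supplied term such as "renal failure".
query_embedding = [0.1, 0.9, 0.1]

results = nearest_terms(query_embedding, index, limit=2)
for term, label, score in results:
    print(f"{term} - {label} - score {score:.2f}")
```

An exact search scores every row, which is acceptable for a terms table that fits in memory; an approximate index would only become worthwhile for much larger ontologies.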

3 - Extraction

This is the extraction portion of the AskBeacon analytics facility. The notebook outlines how we generate the extractor code using the Beacon V2 SDK. It translates a natural language user query into executable Python code that constructs a Beacon V2 API request.

The notebook steps are:

Load the ontology vector database produced in Notebook 2
Extract query components (scope, filters, variants, granularity) using an LLM (see Notebook 1)
Map each extracted filter term to a formal ontology identifier via semantic search (confidence threshold > 0.9)
Generate and format Python code (using Black) that instantiates a BeaconV2() SDK object with the appropriate variant parameters, scope, and ontology-mapped filters

For example, the query "Individuals with Parkinson's with variants in first chromosome from 10000–15000 bases" produces:

```python
data = (
    BeaconV2()
    .with_g_variant("GRCH38", "N", "N", [10000], [15000], "1")
    .with_scope("individuals")
    .with_filter("ontology", "SNOMED:49049000", "individuals")
    .load()
)
```
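
How the mapping and code generation steps might fit together is sketched below. The helper names `map_filters` and `generate_extractor_code`, the `toy_lookup` function and its scores are hypothetical; the sketch only builds the source text of a call chain like the one above and does not import or run the Beacon V2 SDK. The positional arguments to `with_g_variant` are copied from that example without interpreting them, and the Black formatting step is omitted.

```python
import json

CONFIDENCE_THRESHOLD = 0.9  # filters are kept only when the score exceeds this


def map_filters(filter_terms, lookup):
    """Map extracted filter terms to ontology ids, dropping low-confidence matches.

    `lookup` is any callable returning (ontology_id, score) for a term.
    """
    mapped = []
    for item in filter_terms:
        ontology_id, score = lookup(item["term"])
        if score > CONFIDENCE_THRESHOLD:
            mapped.append((ontology_id, item["scope"]))
    return mapped


def generate_extractor_code(scope, variant_args, filters):
    """Build the source text of a BeaconV2() call chain as a string."""
    lines = ["data = (", "    BeaconV2()"]
    if variant_args:
        # json.dumps yields double-quoted strings and bracketed lists,
        # which are also valid Python literals for these simple values.
        args = ", ".join(json.dumps(value) for value in variant_args)
        lines.append(f"    .with_g_variant({args})")
    lines.append(f"    .with_scope({json.dumps(scope)})")
    for ontology_id, filter_scope in filters:
        lines.append(
            f'    .with_filter("ontology", {json.dumps(ontology_id)}, {json.dumps(filter_scope)})'
        )
    lines += ["    .load()", ")"]
    return "\n".join(lines)


def toy_lookup(term):
    """Stand-in for the semantic search; the scores are made up."""
    table = {
        "parkinson's": ("SNOMED:49049000", 0.95),
        "feeling unwell": ("SNOMED:42399005", 0.50),
    }
    return table[term]


extracted_filters = [
    {"term": "parkinson's", "scope": "individuals"},
    {"term": "feeling unwell", "scope": "individuals"},  # dropped: score <= 0.9
]
filters = map_filters(extracted_filters, toy_lookup)
code = generate_extractor_code(
    "individuals", ["GRCH38", "N", "N", [10000], [15000], "1"], filters
)
print(code)
```

Dropping matches at or below the threshold means an ambiguous term produces a query without that filter rather than a query with the wrong ontology identifier.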

4 - Execution

This notebook shows how we generate the code that we will be executing to perform the analysis. It takes a natural language analytics request and produces executable Python code (using pandas/matplotlib) that operates on the data retrieved in Notebook 3. Note that this notebook relies on the following file.

`metadata.json`

This file holds summary information about the data extracted in the previous step.

```json
{
  "table_names": ["data"],
  "table_metadata": [
    [
      [
        "id",
        "diseases",
        ...
      ],
      [
        "int",
        "list",
        "dict",
        ...
      ]
    ]
  ]
}
```

The notebook steps are:

Load table schema information from metadata.json (table names, column names, and column types)
Format the metadata into a structured description that is passed to the LLM as context
Use the LLM to generate executable Python analytics code based on the user's natural language request and the available schema
The generated code includes the analytics logic (e.g. plotting, aggregation, CSV export), the file paths for saved outputs, assumptions made about the data structure, and feedback for potential data issues
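
The first two steps can be sketched as follows. The `describe_tables` helper and the layout of the description are assumptions, not the repository's implementation, and the metadata is a cut-down two-column example in the shape of metadata.json rather than the real file.

```python
import json


def describe_tables(metadata):
    """Turn table names, column names and column types into a text description."""
    sections = []
    for name, (columns, types) in zip(metadata["table_names"], metadata["table_metadata"]):
        lines = [f"Table: {name}"]
        for column, column_type in zip(columns, types):
            lines.append(f"  - {column} ({column_type})")
        sections.append("\n".join(lines))
    return "\n\n".join(sections)


# Cut-down stand-in for metadata.json (the real file lists more columns).
raw = '{"table_names": ["data"], "table_metadata": [[["id", "diseases"], ["int", "list"]]]}'
metadata = json.loads(raw)

description = describe_tables(metadata)
print(description)
```

A description like this would then be placed in the LLM prompt next to the user's request, so that the generated code refers only to tables and columns that actually exist.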

For example, a request such as "Plot frequency of karyotypic sex in pie chart" produces code that creates a matplotlib pie chart of the sex distribution and exports a CSV of individuals with their id, ethnicity, and sex.
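
To give a feel for the shape of such output, here is a standard-library sketch of the aggregation and export parts. It is not actual LLM output: the records, the `karyotypic_sex` and `ethnicity` column names and the helper names are assumptions, and the plotting step is left as a comment so the sketch runs without matplotlib.

```python
import csv
import io
from collections import Counter

# Made-up records standing in for the `data` table loaded by the extractor.
records = [
    {"id": 1, "ethnicity": "group-a", "karyotypic_sex": "XX"},
    {"id": 2, "ethnicity": "group-b", "karyotypic_sex": "XY"},
    {"id": 3, "ethnicity": "group-a", "karyotypic_sex": "XX"},
]


def sex_frequencies(rows):
    """Count how many individuals fall under each karyotypic sex value."""
    return Counter(row["karyotypic_sex"] for row in rows)


def export_csv(rows, handle):
    """Write id, ethnicity and sex for every individual to a file-like object."""
    writer = csv.DictWriter(handle, fieldnames=["id", "ethnicity", "karyotypic_sex"])
    writer.writeheader()
    writer.writerows(rows)


counts = sex_frequencies(records)

# With matplotlib available, the pie chart could be drawn from `counts`, e.g.
# plt.pie(list(counts.values()), labels=list(counts.keys()))

buffer = io.StringIO()  # a real run would write to a file path instead
export_csv(records, buffer)
print(counts)
print(buffer.getvalue())
```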

Utils

The utilities we used can be found in the following directories.

`utils`

This contains the pydantic models, templates, vector database and code sanitisers used in our implementation.

`analytics_utils`

This contains the runners for the extractor and execution code. It also shows how we generate the metadata for metadata.json.
