anotherchudov/toku

About

A simple RAG system for answering questions about a small knowledge base.

Architecture

There are 2 variants:

  • In-context: KB goes into the system prompt, single LLM call
  • Iterative search: up to 3 vector searches before responding
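
The in-context variant can be sketched as follows. This is a minimal illustration only: `build_in_context_prompt`, `answer_in_context`, the prompt wording, and the KB item shape are all hypothetical, not the actual code in src/handlers.

```python
def build_in_context_prompt(knowledge_base: list[dict]) -> str:
    """Pack the whole KB into the system prompt (hypothetical wording)."""
    kb_text = "\n\n".join(
        f"[{item['id']}] {item['text']}" for item in knowledge_base
    )
    return (
        "Answer using only the knowledge base below. "
        "Abstain if the answer is not present.\n\n" + kb_text
    )

def answer_in_context(question: str, knowledge_base: list[dict], call_llm) -> str:
    """Single LLM call: the system prompt carries the KB, the user message
    is the question. `call_llm` stands in for the real Mistral client."""
    return call_llm(system=build_in_context_prompt(knowledge_base), user=question)
```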

Iterative search looks roughly like this:

┌───────────┐     ┌───────────────┐     ┌─────────────────┐     ┌──────────┐
│  Client   │────▶│  FastAPI      │────▶│  Ask Handler    │────▶│  LLM     │
│  (HTTP)   │◀────│  POST /ask    │◀────│                 │◀────│ (Mistral)│
└───────────┘     └───────────────┘     └────────┬────────┘     └──────────┘
                                                 │
                                        ┌────────▼──────────┐
                                        │ Vector Retriever  │
                                        │ (Qdrant in-mem)   │
                                        │ + embedding model │
                                        └───────────────────┘

Notes

The in-context setup (or, more generally, the LLM reading the KB directly) could actually be rather practical due to prompt caching. It may also be decent latency-wise.

The loopy setup uses a single function with verbally described structural requirements (see src/handlers/iterative_search.py).

Two other alternatives were considered: two separate functions, and an object with a nested enum object. The current setup forces a function call. The other options were not tried, since the model could follow the current function-call structure fine and time was short.

The same LLM is used throughout, and there is no separate context for different parts of the pipeline. This is the simpler approach and avoids issues with context loss. Evaluation might be biased, though, as the same model is used both for generating candidate answers and for evaluating them.

The order of the output fields in the function call is designed to encourage deciding on the set of relevant KB items before generating the response. Whether the output is actually generated in the desired order (top to bottom) was not verified.
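
To illustrate the field-ordering idea, a tool schema might look like the sketch below. The real schema lives in src/handlers/iterative_search.py; the name `respond`, the field names, and the descriptions here are assumptions for illustration only.

```python
# Hypothetical tool schema: `relevant_kb_items` is declared before `response`
# to nudge the model into selecting sources before drafting the answer.
ASK_TOOL = {
    "type": "function",
    "function": {
        "name": "respond",
        "parameters": {
            "type": "object",
            "properties": {
                "relevant_kb_items": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "IDs of KB entries the answer relies on; "
                                   "decide this before writing the response.",
                },
                "response": {
                    "type": "string",
                    "description": "Final answer grounded in the items above, "
                                   "or an explicit abstention.",
                },
            },
            "required": ["relevant_kb_items", "response"],
        },
    },
}
```

Since Python dicts (and the JSON serialized from them) preserve insertion order, the properties are emitted in the intended top-to-bottom order.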

Retrieval quality (recall) is not great; I was going to try a ColBERT-style model, but time was short, once again.

The service is synchronous: it blocks until the response is ready. This is primarily due to not wanting to deal with multiple processes/services. There is no observability. There is no thread memory; each request is independent and is not persisted.

File structure

.
├── data
│   ├── evaluation_queries.json
│   └── knowledge_base.json
├── evals
│   ├── metrics                 # contains the evaluation metrics
│   └── run_evals.py            # entry point to run evals
├── predictions                 # predictions from local eval run
│   ├── in_context.pkl
│   └── iterative_search.pkl
├── src
│   ├── handlers                # logic for processing requests
│   ├── type_definitions 
│   ├── main.py                 # entry point to run the server
│   ├── retrieval.py            # deals with the vector search
│   └── utils.py
└── README.md

Setup

  • Python ≥ 3.12
  • uv
  • A Mistral API key (I have some credits there, hence the model choice; a free-tier API key should also work)

Install & run

# Install dependencies
uv venv && uv sync

# Set your API key
export MISTRAL_API_KEY="your-key-here"

# handler can be 'iterative_search' or 'in_context'
# the first run might take a couple of minutes
uv run python src/main.py --host 127.0.0.1 --port 8000 --handler iterative_search

Query the API

curl -X POST http://127.0.0.1:8000/ \
  -H "Content-Type: application/json" \
  -d '{"question": "Your question here"}'

Evaluation

There is some evaluation code in evals/run_evals.py. Existing outputs of both handlers are in ./predictions; by default those will be used for evaluation. If you'd rather rerun everything from scratch, change PREDICTIONS_PATH in evals/run_evals.py.

To run the evaluation:

# from project root
uv run python -m evals.run_evals

Eval results

The results are likely biased due to the choice of evaluation model and the overall evaluation setup.

Metrics definition:

  • Retrieval quality: precision and recall of the retrieved sources. Ground truth is assumed to be the set of sources used by the in-context handler. Refer to the code for how empty sets are handled.
  • Groundedness: checks whether all the claims in the response are supported by the retrieved sources. If the model abstains from answering, the check is skipped.
  • Correctness: checks whether the answer is correct based on the information available in the knowledge base. If the model should abstain from answering, the check is skipped.
  • Abstention: checks when the model abstains from answering. If the model should abstain and indeed does, it is counted as a true positive. If the model should not abstain but does, it is counted as a false positive.
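
A minimal sketch of how two of these metrics could be computed, under assumptions: the function names are hypothetical, and the empty-set convention shown here (empty retrieved vs. empty ground truth counts as a perfect match) is one possible choice — the repo's exact handling is in evals/metrics.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> dict[str, float]:
    """Set-based precision/recall of retrieved source IDs against ground truth."""
    if not retrieved and not relevant:
        return {"precision": 1.0, "recall": 1.0}  # assumed convention
    hits = len(retrieved & relevant)
    return {
        "precision": hits / len(retrieved) if retrieved else 0.0,
        "recall": hits / len(relevant) if relevant else 0.0,
    }

def abstention_counts(cases: list[tuple[bool, bool]]) -> dict[str, int]:
    """Tally TP/FP/TN/FN from (should_abstain, did_abstain) pairs,
    treating abstention as the positive class."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for should, did in cases:
        if did:
            counts["TP" if should else "FP"] += 1
        else:
            counts["FN" if should else "TN"] += 1
    return counts
```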

Retrieval quality:

{'precision': 1.0, 'recall': 0.71}

Metric                       In-Context        Iterative Search
Correctness                  14 / 14           11 / 14
Groundedness                 14 / 14           8 / 11
Abstention (TP/FP/TN/FN)     1 / 0 / 14 / 0    1 / 3 / 11 / 0
