anotherchudov/toku

About

A simple RAG system for answering questions about a small knowledge base.

Architecture

There are 2 variants:

  • In-context: KB goes into the system prompt, single LLM call
  • Iterative search: up to 3 vector searches before responding
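
The in-context variant can be sketched as follows. This is a minimal illustration only: `build_in_context_prompt`, `answer_in_context`, the prompt wording, and the KB item shape are all hypothetical, not the actual code in src/handlers.

```python
def build_in_context_prompt(knowledge_base: list[dict]) -> str:
    """Pack the whole KB into the system prompt (hypothetical wording)."""
    kb_text = "\n\n".join(
        f"[{item['id']}] {item['text']}" for item in knowledge_base
    )
    return (
        "Answer using only the knowledge base below. "
        "Abstain if the answer is not present.\n\n" + kb_text
    )

def answer_in_context(question: str, knowledge_base: list[dict], call_llm) -> str:
    """Single LLM call: the system prompt carries the KB, the user message
    is the question. `call_llm` stands in for the real Mistral client."""
    return call_llm(system=build_in_context_prompt(knowledge_base), user=question)
```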

Iterative search looks roughly like this:

┌───────────┐     ┌───────────────┐     ┌─────────────────┐     ┌──────────┐
│  Client   │────▶│  FastAPI      │────▶│  Ask Handler    │────▶│  LLM     │
│  (HTTP)   │◀────│  POST /ask    │◀────│                 │◀────│ (Mistral)│
└───────────┘     └───────────────┘     └────────┬────────┘     └──────────┘
                                                 │
                                        ┌────────▼──────────┐
                                        │ Vector Retriever  │
                                        │ (Qdrant in-mem)   │
                                        │ + embedding model │
                                        └───────────────────┘

Notes

The in-context setup (or, more generally, the LLM reading the KB directly) could actually be rather practical due to prompt caching. It may also be decent latency-wise.

The loopy setup uses a single function with verbally described structural requirements (see src/handlers/iterative_search.py).

Two other alternatives were considered: two separate functions, and an object with a nested enum object. The current setup forces a function call. The other options were not tried, since the model could follow the current function-call structure fine and time was short.

The same LLM is used throughout, and there is no separate context for different parts of the pipeline. This is the simpler approach and avoids issues with context loss. Evaluation might be biased, though, as the same model is used both for generating candidate answers and for evaluating them.

The order of the output fields in the function call is designed to encourage deciding on the set of relevant KB items before generating the response. Whether the output is actually generated in the desired order (top to bottom) was not verified.
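
To illustrate the field-ordering idea, a tool schema might look like the sketch below. The real schema lives in src/handlers/iterative_search.py; the name `respond`, the field names, and the descriptions here are assumptions for illustration only.

```python
# Hypothetical tool schema: `relevant_kb_items` is declared before `response`
# to nudge the model into selecting sources before drafting the answer.
ASK_TOOL = {
    "type": "function",
    "function": {
        "name": "respond",
        "parameters": {
            "type": "object",
            "properties": {
                "relevant_kb_items": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "IDs of KB entries the answer relies on; "
                                   "decide this before writing the response.",
                },
                "response": {
                    "type": "string",
                    "description": "Final answer grounded in the items above, "
                                   "or an explicit abstention.",
                },
            },
            "required": ["relevant_kb_items", "response"],
        },
    },
}
```

Since Python dicts (and the JSON serialized from them) preserve insertion order, the properties are emitted in the intended top-to-bottom order.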

Retrieval quality (recall) is not great; I was going to try a ColBERT-style model, but time was short, once again.

The service is synchronous: it blocks until the response is ready. This is primarily due to not wanting to deal with multiple processes/services. There is no observability. There is no thread memory; each request is independent and is not persisted.

File structure

.
├── data
│   ├── evaluation_queries.json
│   └── knowledge_base.json
├── evals
│   ├── metrics                 # contains the evaluation metrics
│   └── run_evals.py            # entry point to run evals
├── predictions                 # predictions from local eval run
│   ├── in_context.pkl
│   └── iterative_search.pkl
├── src
│   ├── handlers                # logic for processing requests
│   ├── type_definitions 
│   ├── main.py                 # entry point to run the server
│   ├── retrieval.py            # deals with the vector search
│   └── utils.py
└── README.md

Setup

  • Python ≥ 3.12
  • uv
  • A Mistral API key (I have some credits there, hence the model choice; a free-tier API key should also work)

Install & run

# Install dependencies
uv venv && uv sync

# Set your API key
export MISTRAL_API_KEY="your-key-here"

# handler can be 'iterative_search' or 'in_context'
# the first run might take a couple of minutes
uv run python src/main.py --host 127.0.0.1 --port 8000 --handler iterative_search

Query the API

curl -X POST http://127.0.0.1:8000/ \
  -H "Content-Type: application/json" \
  -d '{"question": "Your question here"}'

Evaluation

There is some evaluation code in evals/run_evals.py. Existing outputs of both handlers are in ./predictions; by default those will be used for evaluation. If you'd rather rerun everything from scratch, change PREDICTIONS_PATH in evals/run_evals.py.

To run the evaluation:

# from project root
uv run python -m evals.run_evals

Eval results

The results are likely biased due to the choice of evaluation model and the overall evaluation setup.

Metrics definition:

  • Retrieval quality: precision and recall of the retrieved sources. Ground truth is assumed to be the set of sources used by the in-context handler. Refer to the code for how empty sets are handled.
  • Groundedness: checks whether all the claims in the response are supported by the retrieved sources. If the model abstains from answering, the check is skipped.
  • Correctness: checks whether the answer is correct based on the information available in the knowledge base. If the model should abstain from answering, the check is skipped.
  • Abstention: checks when the model abstains from answering. If the model should abstain and indeed does, it is counted as a true positive. If the model should not abstain but does, it is counted as a false positive.
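
A minimal sketch of how two of these metrics could be computed, under assumptions: the function names are hypothetical, and the empty-set convention shown here (empty retrieved vs. empty ground truth counts as a perfect match) is one possible choice — the repo's exact handling is in evals/metrics.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> dict[str, float]:
    """Set-based precision/recall of retrieved source IDs against ground truth."""
    if not retrieved and not relevant:
        return {"precision": 1.0, "recall": 1.0}  # assumed convention
    hits = len(retrieved & relevant)
    return {
        "precision": hits / len(retrieved) if retrieved else 0.0,
        "recall": hits / len(relevant) if relevant else 0.0,
    }

def abstention_counts(cases: list[tuple[bool, bool]]) -> dict[str, int]:
    """Tally TP/FP/TN/FN from (should_abstain, did_abstain) pairs,
    treating abstention as the positive class."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for should, did in cases:
        if did:
            counts["TP" if should else "FP"] += 1
        else:
            counts["FN" if should else "TN"] += 1
    return counts
```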

Retrieval quality:

{'precision': 1.0, 'recall': 0.71}

Metric                       In-Context        Iterative Search
Correctness                  14 / 14           11 / 14
Groundedness                 14 / 14           8 / 11
Abstention (TP/FP/TN/FN)     1 / 0 / 14 / 0    1 / 3 / 11 / 0
