Changes from all commits (135 commits)
a6f8b21
Split README into separate MD files in dir docs/
Mar 14, 2026
cd06366
Fix output-keys.md file name, link to newest RAGAS
Mar 14, 2026
6788341
installation.md fille name without `0-` prefix
Mar 14, 2026
bba05aa
Fix link from steps-score.md to recall@k computation
Mar 14, 2026
4451ac7
Move license back to README
Mar 25, 2026
2c717f4
docs/installation: Fix wording to answer correctness metrics
Mar 25, 2026
15acf06
docs/steps-score: typos & grammar
Mar 25, 2026
bb1b1b7
Sentence: retrieval eval using IDs ignores content
Mar 25, 2026
61f06a5
Example code: rename variable; improve comments
Mar 25, 2026
c274ca2
Intro: consolidate LLM-based metrics usage explanation
Mar 25, 2026
6cde6bc
WIP new structure
Mar 26, 2026
b1c0235
Rename output-keys.md to output.md
Mar 27, 2026
0e52be7
Rename reference-qa-data.md to example-reference-data.md
Mar 27, 2026
fc8f08c
Rename resposes-to-evaluate.md example-target-data.md
Mar 27, 2026
bfec745
Move output intro from evaluation-results.md to output.md
Mar 27, 2026
fabfd98
Rename evaluation-results.md to example-output.md
Mar 27, 2026
b480edf
Move example files to `examples/` directory
Mar 27, 2026
bd21b21
Intro: remove prefix `0-`, link to docs sections
Mar 27, 2026
c20a0d5
Rename `into.md` to `metrics.md`; link docs sections from README
Mar 27, 2026
b08493e
Change examples/ files from MD to YAML, JSON; fix refs
Mar 27, 2026
8ebf9a4
input.md: long lines
Mar 27, 2026
f32219e
steps-score.md: Fix "тhen"
Mar 30, 2026
a515e4b
Revamp steps-score.md
Mar 30, 2026
d753897
steps-score.md: Fix math formulas formatting
Mar 30, 2026
ed4399a
steps-score.md: Fix math formulas formatting
Mar 30, 2026
94f9244
steps-score.md: Fix math formulas formatting
Mar 30, 2026
462695a
steps-score.md: Fix math formulas formatting
Mar 30, 2026
471d589
steps-score.md: Fix math formulas formatting
Mar 30, 2026
6acca40
steps-score.md: Fix math formulas formatting
Mar 30, 2026
1014749
steps-score.md: Fix math formulas formatting
Mar 30, 2026
a546060
steps-score.md: Fix math formulas formatting
Mar 30, 2026
663bd9d
steps-score.md: Fix math formulas formatting
Mar 30, 2026
eb9f6b1
steps-score.md: Fix math formulas formatting
Mar 30, 2026
73f78e4
steps-score.md: Fix math formulas formatting
Mar 30, 2026
835c67d
steps-score.md: Fix math formulas formatting
Mar 30, 2026
26391b5
steps-score.md: Fix math formulas formatting
Mar 30, 2026
1bc3c3a
steps-score.md: Fix math formulas formatting
Mar 30, 2026
3633e09
steps-score.md: Fix math formulas formatting
Mar 30, 2026
991301b
steps-score.md: Fix math formulas formatting
Mar 30, 2026
6c8e440
steps-score.md: Fix math formulas formatting
Mar 30, 2026
3bc85f1
steps-score.md: Fix math formulas formatting
Mar 30, 2026
54d0cc3
steps-score.md: Fix math formulas formatting
Mar 30, 2026
52ff494
steps-score.md: Fix math formulas formatting
Mar 30, 2026
2c2a110
steps-score.md: Fix math formulas formatting
Mar 30, 2026
85065f5
steps-score.md: Fix math formulas formatting
Mar 30, 2026
55b5917
steps-score.md: Fix math formulas formatting
Mar 30, 2026
2319ab5
steps-score.md: Fix math formulas formatting
Mar 30, 2026
af839d6
steps-score.md: Fix math formulas formatting
Mar 30, 2026
9ed6c0e
steps-score.md: Fix math formulas formatting
Mar 30, 2026
6bf55fe
steps-score.md: Fix math formulas formatting
Mar 30, 2026
5f60b78
steps-score.md: Fix math formulas formatting
Mar 30, 2026
f605fde
steps-score.md: Fix math formulas formatting
Mar 30, 2026
fe781f4
Fix links from README to docs
Mar 30, 2026
48011fa
usage.md: Typo
Mar 30, 2026
f386440
usage.md: consistent headings case; typo
Mar 30, 2026
4aff6f3
usage.md: add links; reword for clarity
Mar 30, 2026
4a8fbde
configuration.md: typo
Mar 30, 2026
15c37b9
metrics.md: anchor typo
Mar 30, 2026
c5d8d45
output.md: typo
Mar 30, 2026
5a9590a
output.md: typo
Mar 30, 2026
889c4b3
steps-score.md: typo
Mar 30, 2026
6bfd388
usage.md: linked file name extension (.md -> .yaml)
Mar 30, 2026
21985e1
Shorten docs file names
Mar 30, 2026
3924e65
README.md: minor grammar; break lines
Mar 30, 2026
0a26488
configuration.md: fix too-minor section heading
Mar 30, 2026
88f46b2
custom.md#Recommendations: reword for clarity
Mar 30, 2026
af82b96
custom.md: Shorten recommendation section heading
Mar 30, 2026
5079560
steps.md: revamp match score computation explanation
Mar 30, 2026
152754d
Typos
Mar 30, 2026
387d620
steps.md: clearer explanation of required columns
Mar 30, 2026
a6e4ede
steps.md: `steps_score` -> steps score
Mar 30, 2026
3bcea0a
steps.md: intro, reorder, ASCII diagram
Mar 30, 2026
86530b7
README: fix link to configuration.md
Mar 30, 2026
5a8be36
steps.md: link title typo
Mar 30, 2026
2781c1f
steps.md: more links to sections
Mar 30, 2026
e38a5e4
input.md: clearer example reference and target responses
Mar 30, 2026
69a2f8b
installation.md: fix link heading to output.md
Mar 30, 2026
17a23e0
llm.md: fix link to custom.md
Mar 30, 2026
3051ed3
custom.md: link title = full section name
Mar 30, 2026
8454857
metrics.md: link to specific section Steps score
Mar 30, 2026
f4a9e90
metrics.md: fix link title to output.md
Mar 30, 2026
b8ed943
output.md: link specific, link titles, minor reword
Mar 30, 2026
b5dc1c3
Rename section "Aggregates keys" to "Aggregate metrics"
Mar 30, 2026
5f00e48
Rename installation.md to install.md
Mar 30, 2026
fa76c81
Rename configuration.md to configure.md
Mar 30, 2026
b1330c7
install.md: link to metrics.md instead of output.md
Apr 1, 2026
8b84ec5
install.md: clarify that link to metrics.md applies to all
Apr 1, 2026
941830f
usage.md: add top-level section Usage, move down others
Apr 1, 2026
e5ebd96
README: fix link to renamed config.md
Apr 1, 2026
61c52b2
steps.md: fix link and title to section Steps score
Apr 1, 2026
baca76f
steps.md: fix link to retrieval-ids.md section
Apr 1, 2026
5c4e525
config.md: change metric link from output.md to metrics.md
Apr 1, 2026
19412c5
usage.md: Closing paren
Apr 1, 2026
4211946
usage.md: start all link titles with "section"
Apr 1, 2026
307f449
usage.md: same case, delimiting for section links
Apr 1, 2026
0c4f85b
steps.md: grammar, clarification
Apr 1, 2026
b368b55
install.md: extra `)`
Apr 1, 2026
80790b8
usage.md: capitalize section heading in link
Apr 1, 2026
d02e2fa
Consistent case of "section" in links
Apr 1, 2026
13908b6
usage.md: link title Input -> Inputs, like heading
Apr 1, 2026
073a259
steps.md: fix link to section Context Recall@k
Apr 1, 2026
55e3fb8
Fix level-2 heading "Command-line use"
Apr 1, 2026
9ad1122
steps.md: Fix Steps score "where" #3 LaTeX
Apr 1, 2026
40cc73b
steps.md: Capitalize long bullets
Apr 1, 2026
81bb8d0
metrics.md: remove "chat bot", leave "agent"
Apr 1, 2026
6076e0f
Break lines to 80 characters, as elsewhere
Apr 1, 2026
2333643
output.md: fix retrieval_context_{recall,precision} definitions
Apr 1, 2026
19214b6
Aggregate metrics intro
Apr 1, 2026
70c1084
Steps matching: clarify non-constraint
Apr 1, 2026
514c3f0
Improve section SPARQL comparison
Apr 1, 2026
114dedc
Clarify match score : steps_score relation
Apr 1, 2026
d7f8eb8
Improve match score rules intro sentence
Apr 1, 2026
3176cfa
Make links absolute URLs
Apr 1, 2026
2e55d16
Re-join long lines
Apr 1, 2026
be62945
Start bullet sentences with capital letter
Apr 1, 2026
770361b
config.md: show key types as code
Apr 1, 2026
199131d
config.md rewording
Apr 1, 2026
31543d2
Consistently use `-` (not `*`) for bullets
Apr 1, 2026
c29a04a
Consistently indent sub-bullets by 2 spaces
Apr 1, 2026
349c02e
retrieval-ids.md reformat like other pages
Apr 1, 2026
c5f0dcf
Fix absolute URLs: /blob/main
Apr 2, 2026
0b784db
config.md: typo: repeated word
Apr 2, 2026
dff10e9
input.md: Clarify output_media_type; formatting
Apr 2, 2026
a7fbd34
output.md: consisent answer_relevance, steps_score
Apr 2, 2026
e13af69
output.md: Extensive rewording
Apr 3, 2026
bf4af7f
output.md: aggregates: remove repetitious intro
Apr 3, 2026
f12c54d
usage.md: consistent bullet case
Apr 3, 2026
f54f019
steps.md: shorten Match score rules intro
Apr 3, 2026
c7650b2
Move metrics definitions to section Metrics
pgan002 Apr 6, 2026
b4cacb8
steps.md: detail ASK queries and edge cases
pgan002 Apr 6, 2026
5c36a43
steps.md: replace ASCII art diagram by image
pgan002 Apr 6, 2026
3a89aa4
Consistent section titles
pgan002 Apr 6, 2026
6749ebc
Replace "section" by the section symbol §
pgan002 Apr 6, 2026
2176182
metrics.md: bold metric names
pgan002 Apr 6, 2026
05d56fa
usage.md: ! fix formatting
pgan002 Apr 8, 2026
1,288 changes: 12 additions & 1,276 deletions README.md

Large diffs are not rendered by default.

117 changes: 117 additions & 0 deletions docs/config.md
@@ -0,0 +1,117 @@
# Configuration

`run_evaluation()` and `compute_aggregates()` are configured using a YAML file whose path is passed as a parameter (a minimal loading sketch follows the key list below). The configuration has the following structure:

- `llm`: required for [LLM-based metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/llm.md). Keys:
  - `generation`: required. The following keys are required:
    - `provider`: (`str`) name of the organization providing the generation model, as supported by LiteLLM
    - `model`: (`str`) name of the generation model
    - `temperature`: (`float` in the range [0.0, 2.0]) sampling temperature for generation
    - `max_tokens`: (`int` > 0) maximum number of tokens to generate
    - Optional keys: parameters to be passed to LiteLLM for generation (for [`answer_correctness`](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/metrics.md) and [custom evaluation](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/custom.md)). Examples:
      - `base_url`: (`str`) base URL for the generation model, as an alternative to the provider's default URL
      - `api_key`: (`str`) API key for the generation model, as an alternative to setting the provider's environment variable (e.g. `OPENAI_API_KEY` for OpenAI, `AZURE_OPENAI_API_KEY` for Azure)
  - `embedding`: required for [`answer_relevance`](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/metrics.md). Keys:
    - `provider`: (`str`) name of the organization providing the embedding model
    - `model`: (`str`) name of the embedding model
- `custom_evaluations`: (nonempty list of maps) required for [custom evaluation](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/custom.md). Each map has keys:
  - `name`: (`str`) name of the evaluation
  - `inputs`: (`list[str]`) input variables. Any combination of:
    - `question`
    - `reference_answer`
    - `reference_steps`
    - `actual_answer`
    - `actual_steps`
  - `steps_keys`: (`list[str]`; required if `inputs` contains `actual_steps` or `reference_steps`) one or both of:
    - `args`
    - `output`
  - `steps_name`: (`str`; required if `inputs` contains `actual_steps` or `reference_steps`) the type (name) of steps to include in the evaluation
  - `instructions`: (`str`) instructions for the evaluation
  - `outputs`: (`map[str, str]`) output variable names and their descriptions
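
To make the required structure concrete, here is a minimal loading sketch. The file path is hypothetical and the check covers only the required `llm.generation` keys; the package performs its own validation.

```python
# Illustrative sketch: load an evaluation config and verify the
# required `llm.generation` keys listed above.
import yaml

REQUIRED_GENERATION_KEYS = {"provider", "model", "temperature", "max_tokens"}

with open("my_project/eval_config.yaml") as f:  # hypothetical path
    config = yaml.safe_load(f)

generation = config["llm"]["generation"]
missing = REQUIRED_GENERATION_KEYS - generation.keys()
if missing:
    raise ValueError(f"Config is missing generation keys: {sorted(missing)}")
```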

## Example configuration file with LLM configuration

Below is a YAML file that configures the LLM generation (for [metrics that require an LLM](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/llm.md)) and embedding (for [`answer_relevance`](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/metrics.md)). It assumes that the environment variable `OPENAI_API_KEY` is set to your OpenAI API key.

```YAML
llm:
  generation:
    provider: openai
    model: gpt-4o-mini
    temperature: 0.0
    max_tokens: 65536
  embedding:
    provider: openai
    model: text-embedding-3-small
```

## Example configuration file with LLM configuration and API keys

Below is a YAML file that configures the LLM generation (for [metrics that require an LLM](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/llm.md)) and embedding (for [`answer_relevance`](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/metrics.md)), with API keys given in the file instead of via environment variables.

```YAML
llm:
  generation:
    provider: azure
    model: graphrag-eval-system-tests-gpt-5.2
    base_url: https://my-generator.openai.azure.com
    temperature: 0.0
    max_tokens: 8192
    api_key: ...
  embedding:
    provider: azure
    model: graphrag-eval-system-tests-text-embedding-3-small
    api_base: https://my-embedder.openai.azure.com
    api_key: ...
```

## Example configuration file with custom evaluations

Below is a YAML file that defines two custom evaluations:
1. A simple relevance evaluation
1. A SPARQL retrieval evaluation using the reference answer

This example illustrates the format; it may not produce accurate evaluations.

```YAML
llm:
  generation:
    provider: openai
    model: gpt-4o-mini
    temperature: 0.0
    max_tokens: 65536
  embedding:
    provider: openai
    model: text-embedding-3-small
custom_evaluations:
  -
    name: my_answer_relevance
    inputs:
      - question
      - actual_answer
    instructions: |
      Evaluate how relevant the answer is to the question.
    outputs:
      my_answer_relevance: Fraction between 0 and 1
      my_answer_relevance_reason: Reason for your evaluation
  -
    name: sparql_llm_evaluation
    inputs:
      - question
      - reference_answer
      - actual_steps
    steps_keys:
      - output
    steps_name: sparql
    instructions: |
      Divide the reference answer into claims and try to match each claim to the
      SPARQL query results. Count the:
      - reference claims
      - SPARQL results
      - matching claims
    outputs:
      sparql_recall: Number of matching claims as a fraction of reference claims (fraction 0-1)
      sparql_precision: Number of matching claims as a fraction of SPARQL results (fraction 0-1)
      sparql_reason: Reason for your evaluation
```
49 changes: 49 additions & 0 deletions docs/custom.md
@@ -0,0 +1,49 @@
# Custom evaluation (custom metrics)

You can define your own metrics, evaluated on system outputs by an LLM. To do this, specify each evaluation's name, inputs, outputs and instructions in a YAML file and pass the file path as a parameter to `run_evaluation()`. The call returns your output metrics alongside the standard metrics described in the previous sections.

One configuration file can define multiple custom evaluations, each of which will be done as a separate query to the LLM. Each evaluation can have multiple outputs. The format is shown in the example sections below.

See [§ Example configuration file with custom evaluations](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/config.md#example-configuration-file-with-custom-evaluations).

## Example call to evaluate using custom metrics

```python
# Assumes run_evaluation is exported at the package's top level.
from graphrag_eval import run_evaluation

evaluation_results = run_evaluation(
    reference_qa_dataset,
    chat_responses,
    "my_project/custom_eval.yaml"
)
```

## Example output for custom SPARQL evaluation

With the [custom SPARQL evaluation](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/config.md#example-configuration-file-with-custom-evaluations), the output is the same as in [§ Example evaluation results](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/examples/output.yaml), except for the following additional keys, shown with example values:

```yaml
my_answer_relevance: 0.9
my_answer_relevance_reason: The answer contains relevant information except for the sentence about total revenue
sparql_recall: 0.75
sparql_precision: 0.6
sparql_reason: The reference answer has 4 claims; there are 5 SPARQL results; 3 claims match
```
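
Because custom outputs appear alongside the standard per-question results, you can post-process them like any other metric. Below is a minimal sketch, assuming `evaluation_results` is a list of per-question maps as returned by `run_evaluation()`:

```python
# Sketch: average a custom metric over all evaluated questions,
# skipping entries where it is null (e.g. after an evaluation error).
scores = [
    r["my_answer_relevance"]
    for r in evaluation_results
    if r.get("my_answer_relevance") is not None
]
mean_relevance = sum(scores) / len(scores) if scores else None
print(f"Mean my_answer_relevance: {mean_relevance}")
```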

## Output in case of evaluation error

If there is an error during evaluation:
- The configured output keys will have value `null`
- There will be an additional key `{name}_error` explaining the error, where `name` is the custom evaluation name

There are three types of error (a detection sketch follows this list):
1. The reference input is missing keys requested in the custom evaluation configuration.
- Example: `custom_1_error: Reference missing key 'reference_steps'`
1. The actual output to be evaluated is missing keys requested in the custom evaluation configuration.
- Example: `custom_1_error: Actual output missing 'actual_steps'`
1. The evaluating LLM output does not conform to the custom evaluation configuration.
- Example: `custom_1_error: "Expected 6 tab-separated values, got: 0.1\tCustom answer reason"`

## Recommendations

To improve custom evaluation accuracy:
1. Specify only a few outputs in each evaluation
1. Specify outputs that explain any quantities the LLM must count or estimate. You can request one explanation per quantity or one shared explanation for several quantities.