Changes from all commits (135 commits)
a6f8b21
Split README into separate MD files in dir docs/
Mar 14, 2026
cd06366
Fix output-keys.md file name, link to newest RAGAS
Mar 14, 2026
6788341
installation.md fille name without `0-` prefix
Mar 14, 2026
bba05aa
Fix link from steps-score.md to recall@k computation
Mar 14, 2026
4451ac7
Move license back to README
Mar 25, 2026
2c717f4
docs/installation: Fix wording to answer correctness metrics
Mar 25, 2026
15acf06
docs/steps-score: typos & grammar
Mar 25, 2026
bb1b1b7
Sentence: retrieval eval using IDs ignores content
Mar 25, 2026
61f06a5
Example code: rename variable; improve comments
Mar 25, 2026
c274ca2
Intro: consolidate LLM-based metrics usage explanation
Mar 25, 2026
6cde6bc
WIP new structure
Mar 26, 2026
b1c0235
Rename output-keys.md to output.md
Mar 27, 2026
0e52be7
Rename reference-qa-data.md to example-reference-data.md
Mar 27, 2026
fc8f08c
Rename resposes-to-evaluate.md example-target-data.md
Mar 27, 2026
bfec745
Move output intro from evaluation-results.md to output.md
Mar 27, 2026
fabfd98
Rename evaluation-results.md to example-output.md
Mar 27, 2026
b480edf
Move example files to `examples/` directory
Mar 27, 2026
bd21b21
Intro: remove prefix `0-`, link to docs sections
Mar 27, 2026
c20a0d5
Rename `into.md` to `metrics.md`; link docs sections from README
Mar 27, 2026
b08493e
Change examples/ files from MD to YAML, JSON; fix refs
Mar 27, 2026
8ebf9a4
input.md: long lines
Mar 27, 2026
f32219e
steps-score.md: Fix "тhen"
Mar 30, 2026
a515e4b
Revamp steps-score.md
Mar 30, 2026
d753897
steps-score.md: Fix math formulas formatting
Mar 30, 2026
ed4399a
steps-score.md: Fix math formulas formatting
Mar 30, 2026
94f9244
steps-score.md: Fix math formulas formatting
Mar 30, 2026
462695a
steps-score.md: Fix math formulas formatting
Mar 30, 2026
471d589
steps-score.md: Fix math formulas formatting
Mar 30, 2026
6acca40
steps-score.md: Fix math formulas formatting
Mar 30, 2026
1014749
steps-score.md: Fix math formulas formatting
Mar 30, 2026
a546060
steps-score.md: Fix math formulas formatting
Mar 30, 2026
663bd9d
steps-score.md: Fix math formulas formatting
Mar 30, 2026
eb9f6b1
steps-score.md: Fix math formulas formatting
Mar 30, 2026
73f78e4
steps-score.md: Fix math formulas formatting
Mar 30, 2026
835c67d
steps-score.md: Fix math formulas formatting
Mar 30, 2026
26391b5
steps-score.md: Fix math formulas formatting
Mar 30, 2026
1bc3c3a
steps-score.md: Fix math formulas formatting
Mar 30, 2026
3633e09
steps-score.md: Fix math formulas formatting
Mar 30, 2026
991301b
steps-score.md: Fix math formulas formatting
Mar 30, 2026
6c8e440
steps-score.md: Fix math formulas formatting
Mar 30, 2026
3bc85f1
steps-score.md: Fix math formulas formatting
Mar 30, 2026
54d0cc3
steps-score.md: Fix math formulas formatting
Mar 30, 2026
52ff494
steps-score.md: Fix math formulas formatting
Mar 30, 2026
2c2a110
steps-score.md: Fix math formulas formatting
Mar 30, 2026
85065f5
steps-score.md: Fix math formulas formatting
Mar 30, 2026
55b5917
steps-score.md: Fix math formulas formatting
Mar 30, 2026
2319ab5
steps-score.md: Fix math formulas formatting
Mar 30, 2026
af839d6
steps-score.md: Fix math formulas formatting
Mar 30, 2026
9ed6c0e
steps-score.md: Fix math formulas formatting
Mar 30, 2026
6bf55fe
steps-score.md: Fix math formulas formatting
Mar 30, 2026
5f60b78
steps-score.md: Fix math formulas formatting
Mar 30, 2026
f605fde
steps-score.md: Fix math formulas formatting
Mar 30, 2026
fe781f4
Fix links from README to docs
Mar 30, 2026
48011fa
usage.md: Typo
Mar 30, 2026
f386440
usage.md: consistent headings case; typo
Mar 30, 2026
4aff6f3
usage.md: add links; reword for clarity
Mar 30, 2026
4a8fbde
configuration.md: typo
Mar 30, 2026
15c37b9
metrics.md: anchor typo
Mar 30, 2026
c5d8d45
output.md: typo
Mar 30, 2026
5a9590a
output.md: typo
Mar 30, 2026
889c4b3
steps-score.md: typo
Mar 30, 2026
6bfd388
usage.md: linked file name extension (.md -> .yaml)
Mar 30, 2026
21985e1
Shorten docs file names
Mar 30, 2026
3924e65
README.md: minor grammar; break lines
Mar 30, 2026
0a26488
configuration.md: fix too-minor section heading
Mar 30, 2026
88f46b2
custom.md#Recommendations: reword for clarity
Mar 30, 2026
af82b96
custom.md: Shorten recommendation section heading
Mar 30, 2026
5079560
steps.md: revamp match score computation explanation
Mar 30, 2026
152754d
Typos
Mar 30, 2026
387d620
steps.md: clearer explanation of required columns
Mar 30, 2026
a6e4ede
steps.md: `steps_score` -> steps score
Mar 30, 2026
3bcea0a
steps.md: intro, reorder, ASCII diagram
Mar 30, 2026
86530b7
README: fix link to configuration.md
Mar 30, 2026
5a8be36
steps.md: link title typo
Mar 30, 2026
2781c1f
steps.md: more links to sections
Mar 30, 2026
e38a5e4
input.md: clearer example reference and target responses
Mar 30, 2026
69a2f8b
installation.md: fix link heading to output.md
Mar 30, 2026
17a23e0
llm.md: fix link to custom.md
Mar 30, 2026
3051ed3
custom.md: link title = full section name
Mar 30, 2026
8454857
metrics.md: link to specific section Steps score
Mar 30, 2026
f4a9e90
metrics.md: fix link title to output.md
Mar 30, 2026
b8ed943
output.md: link specific, link titles, minor reword
Mar 30, 2026
b5dc1c3
Rename section "Aggregates keys" to "Aggregate metrics"
Mar 30, 2026
5f00e48
Rename installation.md to install.md
Mar 30, 2026
fa76c81
Rename configuration.md to configure.md
Mar 30, 2026
b1330c7
install.md: link to metrics.md instead of output.md
Apr 1, 2026
8b84ec5
install.md: clarify that link to metrics.md applies to all
Apr 1, 2026
941830f
usage.md: add top-level section Usage, move down others
Apr 1, 2026
e5ebd96
README: fix link to renamed config.md
Apr 1, 2026
61c52b2
steps.md: fix link and title to section Steps score
Apr 1, 2026
baca76f
steps.md: fix link to retrieval-ids.md section
Apr 1, 2026
5c4e525
config.md: change metric link from output.md to metrics.md
Apr 1, 2026
19412c5
usage.md: Closing paren
Apr 1, 2026
4211946
usage.md: start all link titles with "section"
Apr 1, 2026
307f449
usage.md: same case, delimiting for section links
Apr 1, 2026
0c4f85b
steps.md: grammar, clarification
Apr 1, 2026
b368b55
install.md: extra `)`
Apr 1, 2026
80790b8
usage.md: capitalize section heading in link
Apr 1, 2026
d02e2fa
Consistent case of "section" in links
Apr 1, 2026
13908b6
usage.md: link title Input -> Inputs, like heading
Apr 1, 2026
073a259
steps.md: fix link to section Context Recall@k
Apr 1, 2026
55e3fb8
Fix level-2 heading "Command-line use"
Apr 1, 2026
9ad1122
steps.md: Fix Steps score "where" #3 LaTeX
Apr 1, 2026
40cc73b
steps.md: Capitalize long bullets
Apr 1, 2026
81bb8d0
metrics.md: remove "chat bot", leave "agent"
Apr 1, 2026
6076e0f
Break lines to 80 characters, as elsewhere
Apr 1, 2026
2333643
output.md: fix retrieval_context_{recall,precision} definitions
Apr 1, 2026
19214b6
Aggregate metrics intro
Apr 1, 2026
70c1084
Steps matching: clarify non-constraint
Apr 1, 2026
514c3f0
Improve section SPARQL comparison
Apr 1, 2026
114dedc
Clarify match score : steps_score relation
Apr 1, 2026
d7f8eb8
Improve match score rules intro sentence
Apr 1, 2026
3176cfa
Make links absolute URLs
Apr 1, 2026
2e55d16
Re-join long lines
Apr 1, 2026
be62945
Start bullet sentences with capital letter
Apr 1, 2026
770361b
config.md: show key types as code
Apr 1, 2026
199131d
config.md rewording
Apr 1, 2026
31543d2
Consistently use `-` (not `*`) for bullets
Apr 1, 2026
c29a04a
Consistently indent sub-bullets by 2 spaces
Apr 1, 2026
349c02e
retrieval-ids.md reformat like other pages
Apr 1, 2026
c5f0dcf
Fix absolute URLs: /blob/main
Apr 2, 2026
0b784db
config.md: typo: repeated word
Apr 2, 2026
dff10e9
input.md: Clarify output_media_type; formatting
Apr 2, 2026
a7fbd34
output.md: consisent answer_relevance, steps_score
Apr 2, 2026
e13af69
output.md: Extensive rewording
Apr 3, 2026
bf4af7f
output.md: aggregates: remove repetitious intro
Apr 3, 2026
f12c54d
usage.md: consistent bullet case
Apr 3, 2026
f54f019
steps.md: shorten Match score rules intro
Apr 3, 2026
c7650b2
Move metrics definitions to section Metrics
pgan002 Apr 6, 2026
b4cacb8
steps.md: detail ASK queries and edge cases
pgan002 Apr 6, 2026
5c36a43
steps.md: replace ASCII art diagram by image
pgan002 Apr 6, 2026
3a89aa4
Consistent section titles
pgan002 Apr 6, 2026
6749ebc
Replace "section" by the section symbol §
pgan002 Apr 6, 2026
2176182
metrics.md: bold metric names
pgan002 Apr 6, 2026
05d56fa
usage.md: ! fix formatting
pgan002 Apr 8, 2026
1,288 changes: 12 additions & 1,276 deletions README.md

Large diffs are not rendered by default.

117 changes: 117 additions & 0 deletions docs/config.md
@@ -0,0 +1,117 @@
# Configuration

`run_evaluation()` and `compute_aggregates()` are configured using a YAML file whose path is passed as a parameter (a minimal loading sketch follows the key list below). The configuration has the following structure:

- `llm`: required for [LLM-based metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/llm.md). Keys:
  - `generation`: required. The following keys are required:
    - `provider`: (`str`) name of the organization providing the generation model, as supported by LiteLLM
    - `model`: (`str`) name of the generation model
    - `temperature`: (`float` in the range [0.0, 2.0]) sampling temperature for generation
    - `max_tokens`: (`int` > 0) maximum number of tokens to generate
    - Optional keys: parameters to be passed to LiteLLM for generation (for [`answer_correctness`](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/metrics.md) and [custom evaluation](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/custom.md)). Examples:
      - `base_url`: (`str`) base URL for the generation model, as an alternative to the provider's default URL
      - `api_key`: (`str`) API key for the generation model, as an alternative to setting the provider's environment variable (e.g. `OPENAI_API_KEY` for OpenAI, `AZURE_OPENAI_API_KEY` for Azure)
  - `embedding`: required for [`answer_relevance`](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/metrics.md). Keys:
    - `provider`: (`str`) name of the organization providing the embedding model
    - `model`: (`str`) name of the embedding model
- `custom_evaluations`: (nonempty list of maps) required for [custom evaluation](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/custom.md). Each map has keys:
  - `name`: (`str`) name of the evaluation
  - `inputs`: (`list[str]`) input variables. Any combination of:
    - `question`
    - `reference_answer`
    - `reference_steps`
    - `actual_answer`
    - `actual_steps`
  - `steps_keys`: (`list[str]`; required if `inputs` contains `actual_steps` or `reference_steps`) one or both of:
    - `args`
    - `output`
  - `steps_name`: (`str`; required if `inputs` contains `actual_steps` or `reference_steps`) the type (name) of steps to include in the evaluation
  - `instructions`: (`str`) instructions for the evaluation
  - `outputs`: (`map[str, str]`) output variable names and their descriptions
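
To make the required structure concrete, here is a minimal loading sketch. The file path is hypothetical and the check covers only the required `llm.generation` keys; the package performs its own validation.

```python
# Illustrative sketch: load an evaluation config and verify the
# required `llm.generation` keys listed above.
import yaml

REQUIRED_GENERATION_KEYS = {"provider", "model", "temperature", "max_tokens"}

with open("my_project/eval_config.yaml") as f:  # hypothetical path
    config = yaml.safe_load(f)

generation = config["llm"]["generation"]
missing = REQUIRED_GENERATION_KEYS - generation.keys()
if missing:
    raise ValueError(f"Config is missing generation keys: {sorted(missing)}")
```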

## Example configuration file with LLM configuration

Below is a YAML file that configures the LLM generation (for [metrics that require an LLM](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/llm.md)) and embedding (for [`answer_relevance`](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/metrics.md)). It assumes that the environment variable `OPENAI_API_KEY` is set to your OpenAI API key.

```YAML
llm:
  generation:
    provider: openai
    model: gpt-4o-mini
    temperature: 0.0
    max_tokens: 65536
  embedding:
    provider: openai
    model: text-embedding-3-small
```

## Example configuration file with LLM configuration and API keys

Below is a YAML file that configures the LLM generation (for [metrics that require an LLM](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/llm.md)) and embedding (for [`answer_relevance`](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/metrics.md)), with API keys given in the file instead of via environment variables.

```YAML
llm:
  generation:
    provider: azure
    model: graphrag-eval-system-tests-gpt-5.2
    base_url: https://my-generator.openai.azure.com
    temperature: 0.0
    max_tokens: 8192
    api_key: ...
  embedding:
    provider: azure
    model: graphrag-eval-system-tests-text-embedding-3-small
    api_base: https://my-embedder.openai.azure.com
    api_key: ...
```

## Example configuration file with custom evaluations

Below is a YAML file that defines two custom evaluations:
1. A simple relevance evaluation
1. A SPARQL retrieval evaluation using the reference answer

This example illustrates the format; it may not produce accurate evaluations.

```YAML
llm:
  generation:
    provider: openai
    model: gpt-4o-mini
    temperature: 0.0
    max_tokens: 65536
  embedding:
    provider: openai
    model: text-embedding-3-small
custom_evaluations:
  -
    name: my_answer_relevance
    inputs:
      - question
      - actual_answer
    instructions: |
      Evaluate how relevant the answer is to the question.
    outputs:
      my_answer_relevance: Fraction between 0 and 1
      my_answer_relevance_reason: Reason for your evaluation
  -
    name: sparql_llm_evaluation
    inputs:
      - question
      - reference_answer
      - actual_steps
    steps_keys:
      - output
    steps_name: sparql
    instructions: |
      Divide the reference answer into claims and try to match each claim to the
      SPARQL query results. Count the:
      - reference claims
      - SPARQL results
      - matching claims
    outputs:
      sparql_recall: Number of matching claims as a fraction of reference claims (fraction 0-1)
      sparql_precision: Number of matching claims as a fraction of SPARQL results (fraction 0-1)
      sparql_reason: Reason for your evaluation
```
49 changes: 49 additions & 0 deletions docs/custom.md
@@ -0,0 +1,49 @@
# Custom evaluation (custom metrics)

You can define your own metrics, evaluated on system outputs by an LLM. To do this, specify each evaluation's name, inputs, outputs and instructions in a YAML file and pass the file path as a parameter to `run_evaluation()`. The call returns your output metrics alongside the standard metrics described in the previous sections.

One configuration file can define multiple custom evaluations, each of which will be done as a separate query to the LLM. Each evaluation can have multiple outputs. The format is shown in the example sections below.

See [§ Example configuration file with custom evaluations](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/config.md#example-configuration-file-with-custom-evaluations).

## Example call to evaluate using custom metrics

```python
# Assumes run_evaluation is exported at the package's top level.
from graphrag_eval import run_evaluation

evaluation_results = run_evaluation(
    reference_qa_dataset,
    chat_responses,
    "my_project/custom_eval.yaml"
)
```

## Example output for custom SPARQL evaluation

With the [custom SPARQL evaluation](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/config.md#example-configuration-file-with-custom-evaluations), the output is the same as in [§ Example evaluation results](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/examples/output.yaml), except for the following additional keys, shown with example values:

```yaml
my_answer_relevance: 0.9
my_answer_relevance_reason: The answer contains relevant information except for the sentence about total revenue
sparql_recall: 0.75
sparql_precision: 0.6
sparql_reason: The reference answer has 4 claims; there are 5 SPARQL results; 3 claims match
```
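
Because custom outputs appear alongside the standard per-question results, you can post-process them like any other metric. Below is a minimal sketch, assuming `evaluation_results` is a list of per-question maps as returned by `run_evaluation()`:

```python
# Sketch: average a custom metric over all evaluated questions,
# skipping entries where it is null (e.g. after an evaluation error).
scores = [
    r["my_answer_relevance"]
    for r in evaluation_results
    if r.get("my_answer_relevance") is not None
]
mean_relevance = sum(scores) / len(scores) if scores else None
print(f"Mean my_answer_relevance: {mean_relevance}")
```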

## Output in case of evaluation error

If there is an error during evaluation:
- The configured output keys will have value `null`
- There will be an additional key `{name}_error` explaining the error, where `name` is the custom evaluation name

There are three types of error (a detection sketch follows this list):
1. The reference input is missing keys requested in the custom evaluation configuration.
- Example: `custom_1_error: Reference missing key 'reference_steps'`
1. The actual output to be evaluated is missing keys requested in the custom evaluation configuration.
- Example: `custom_1_error: Actual output missing 'actual_steps'`
1. The evaluating LLM output does not conform to the custom evaluation configuration.
- Example: `custom_1_error: "Expected 6 tab-separated values, got: 0.1\tCustom answer reason"`

## Recommendations

To improve custom evaluation accuracy:
1. Specify only a few outputs in each evaluation
1. Specify outputs that explain any quantities the LLM must count or estimate. You can request one explanation per quantity or one shared explanation for several quantities.