You can define your own metrics of system outputs, to be evaluated using an LLM. To do this, specify each evaluation's name, inputs, outputs and instructions in a YAML file and pass the file path as a parameter to run_evaluation(). The results will then include your custom output metrics alongside the standard metrics described in previous sections.
One configuration file can define multiple custom evaluations, each of which will be done as a separate query to the LLM. Each evaluation can have multiple outputs. The format is shown in the example sections below.
See § Example configuration file with custom evaluation.
evaluation_results = run_evaluation(
    reference_qa_dataset,
    chat_responses,
    "my_project/custom_eval.yaml"
)
With the custom SPARQL evaluation, the output is as for § Example evaluation results, except that it has the following additional keys and example values:
my_answer_relevance: 0.9
my_answer_relevance_eval_reason: The answer contains relevant information except for the sentence about total revenue
sparql_recall: 0.75
sparql_precision: 0.6
sparql_eval_reason: The reference answer has 4 claims; there are 5 SPARQL results; 3 claims match
If there is an error during evaluation:
- The configured output keys will have value null
- There will be an additional key explaining the error. The key will be {name}_error where name is the custom evaluation name.
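The error convention above can be checked programmatically. The following is a minimal sketch, not part of the library: it assumes each evaluation result is a flat dict, that null is loaded as None, and that no standard metric key ends in _error. The helper name collect_custom_eval_errors and the output keys custom_1_score and custom_1_reason are hypothetical.

```python
from typing import Any

ERROR_SUFFIX = "_error"


def collect_custom_eval_errors(result: dict[str, Any]) -> dict[str, str]:
    """Map each failed custom evaluation name to its error message.

    Assumption: `result` is one flat result dict, and only custom
    evaluation errors use keys ending in "_error".
    """
    return {
        key[: -len(ERROR_SUFFIX)]: str(value)
        for key, value in result.items()
        if key.endswith(ERROR_SUFFIX) and value is not None
    }


# Hypothetical result for one question: the configured outputs are None
# (null) and the {name}_error key explains why.
example_result = {
    "custom_1_score": None,
    "custom_1_reason": None,
    "custom_1_error": "Reference missing key 'reference_steps'",
}

errors = collect_custom_eval_errors(example_result)
for name, message in errors.items():
    print(f"custom evaluation {name!r} failed: {message}")
```

Checking for the error key rather than for null outputs avoids confusing a failed evaluation with an output that is legitimately empty.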
There are three types of error:
- The reference input is missing keys requested in the custom evaluation configuration.
- Example:
custom_1_error: Reference missing key 'reference_steps'
- The actual output to be evaluated is missing keys requested in the custom evaluation configuration.
- Example:
custom_1_error: Actual output missing 'actual_steps'
- The evaluating LLM output does not conform to the custom evaluation configuration.
- Example:
custom_1_error: "Expected 6 tab-separated values, got: 0.1\tCustom answer reason"
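The third error type suggests that the evaluating LLM is expected to return one tab-separated value per configured output. A check of that kind could look roughly like the sketch below. This is an illustration under that assumption, not the library's actual implementation; the function name parse_llm_outputs is hypothetical.

```python
def parse_llm_outputs(raw: str, output_keys: list[str]) -> dict[str, str]:
    """Split one tab-separated LLM reply into the configured output keys.

    Raises ValueError when the number of values does not match the
    number of configured outputs.
    """
    # Strip only line endings so that an empty trailing value is kept.
    values = raw.strip("\r\n").split("\t")
    if len(values) != len(output_keys):
        raise ValueError(
            f"Expected {len(output_keys)} tab-separated values, got: {raw}"
        )
    return dict(zip(output_keys, values))


keys = ["sparql_recall", "sparql_precision", "sparql_eval_reason"]

# A conforming reply yields one value per configured output.
parsed = parse_llm_outputs("0.75\t0.6\t3 of 4 claims match", keys)

# A reply with too few values is rejected with an explanatory message.
try:
    parse_llm_outputs("0.1\tCustom answer reason", keys)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Under this assumption, fewer outputs per evaluation means fewer values the LLM has to emit in the right order, which is one reason to keep each evaluation small.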
To improve custom evaluation accuracy:
- Specify only a few outputs in each evaluation
- Specify outputs explaining any quantities that the LLM must count or estimate. You can request one explanation per quantity or one shared explanation for several quantities.
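An explanation output also makes the scores checkable. Using the example values shown earlier in this section (4 reference claims, 5 SPARQL results, 3 matches), the sketch below recomputes the two scores. It assumes recall is matches divided by reference claims and precision is matches divided by results. These definitions are inferred from the example values, not stated by the library, and the function name is hypothetical.

```python
import math


def recall_and_precision(
    reference_claims: int, results: int, matches: int
) -> tuple[float, float]:
    """Recompute recall and precision from the counted quantities.

    Assumed definitions: recall = matches / reference_claims,
    precision = matches / results. A zero denominator yields 0.0.
    """
    recall = matches / reference_claims if reference_claims else 0.0
    precision = matches / results if results else 0.0
    return recall, precision


# Counts taken from the example sparql_eval_reason value.
recall, precision = recall_and_precision(
    reference_claims=4, results=5, matches=3
)

# These agree with the example sparql_recall and sparql_precision values.
assert math.isclose(recall, 0.75)
assert math.isclose(precision, 0.6)
```

If the recomputed values disagree with the reported scores, the LLM either miscounted or miscalculated, and the explanation shows which.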