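Most common position in a pipeline	On its own or in an evaluation pipeline. To be used after a separate pipeline that has generated the inputs for the Evaluator.
Mandatory init variables	"instructions": The prompt instructions string "inputs": The expected inputs "outputs": The output names of the evaluation results "examples": Few-shot examples conforming to the input and output format
Mandatory run variables	"inputs": Defined by the user, for example, questions or responses
Output variables	"results": A list of dictionaries, one per evaluated input, each containing keys defined by the user, such as score
API reference	Evaluators
GitHub link	https://github.com/deepset-ai/haystack/blob/main/haystack/components/evaluators/llm_evaluator.py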

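Overview

The LLMEvaluator component can evaluate answers, documents, or any other outputs of a Haystack pipeline based on a user-defined aspect. The component combines the instructions, examples, and expected output names into one prompt. It is meant for calculating user-defined model-based evaluation metrics. If you are looking for pre-defined model-based evaluators that work out of the box, have a look at Haystack's FaithfulnessEvaluator and ContextRelevanceEvaluator components.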
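
To make the idea of combining instructions, examples, and output names into one prompt more concrete, here is a rough standalone sketch. The function build_prompt and the prompt layout are hypothetical and only illustrate the principle; the template that LLMEvaluator actually uses may look different.

```python
import json
from typing import Any, Dict, List


def build_prompt(
    instructions: str,
    outputs: List[str],
    examples: List[Dict[str, Any]],
    inputs: Dict[str, Any],
) -> str:
    """Hypothetical helper: assemble one evaluation prompt as plain text."""
    lines = [
        "Instructions:",
        instructions,
        "",
        "Respond with a JSON object that has the following keys:",
        json.dumps(outputs),
        "",
        "Examples:",
    ]
    # Each few-shot example contributes one inputs/outputs pair.
    for example in examples:
        lines.append("Inputs: " + json.dumps(example["inputs"]))
        lines.append("Outputs: " + json.dumps(example["outputs"]))
    # The item to evaluate comes last; the model is expected to complete "Outputs:".
    lines.append("")
    lines.append("Inputs: " + json.dumps(inputs))
    lines.append("Outputs:")
    return "\n".join(lines)


prompt = build_prompt(
    instructions="Is this answer problematic for children?",
    outputs=["score"],
    examples=[
        {"inputs": {"responses": "Football is the most popular sport."}, "outputs": {"score": 0}},
    ],
    inputs={"responses": "Python language was created by Guido van Rossum."},
)
print(prompt)
```

The sketch also shows why every additional example makes a request more expensive: each one adds two more lines to every prompt that is sent.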

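Parameters

The default model for this evaluator is gpt-4o-mini. You can override the model during initialization with the chat_generator parameter. This needs to be a chat generator instance configured to return a JSON object. For example, when using the OpenAIChatGenerator, you should pass {"response_format": {"type": "json_object"}} in its generation_kwargs.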
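
As a minimal sketch of this override, the helper below builds an OpenAIChatGenerator in JSON mode and hands it to the evaluator. The names JSON_MODE_KWARGS and build_evaluator are hypothetical, and the imports sit inside the function only so that defining the helper requires neither Haystack nor an API key. Calling it assumes a Haystack version that supports the chat_generator parameter and a valid OPENAI_API_KEY.

```python
from typing import List

# Generation settings that ask the OpenAI API for a JSON object, as described above.
JSON_MODE_KWARGS = {"response_format": {"type": "json_object"}}


def build_evaluator():
    """Hypothetical helper: create an LLMEvaluator with an explicitly configured chat generator."""
    # Imported lazily; calling this function requires Haystack and OPENAI_API_KEY.
    from haystack.components.evaluators import LLMEvaluator
    from haystack.components.generators.chat import OpenAIChatGenerator

    chat_generator = OpenAIChatGenerator(
        model="gpt-4o-mini",
        generation_kwargs=JSON_MODE_KWARGS,
    )
    return LLMEvaluator(
        instructions="Is this answer problematic for children?",
        inputs=[("responses", List[str])],
        outputs=["score"],
        examples=[
            {"inputs": {"responses": "Damn, this is straight outta hell!!!"}, "outputs": {"score": 1}},
            {"inputs": {"responses": "Football is the most popular sport."}, "outputs": {"score": 0}},
        ],
        chat_generator=chat_generator,
    )
```

To use a different model or provider, swap in another chat generator, as long as it is configured to return a JSON object.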

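Unless you initialize the evaluator with a chat generator for a provider other than OpenAI, a valid OpenAI API key must be set as the OPENAI_API_KEY environment variable. For details, see our documentation page on secret management.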
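
A quick way to catch a missing key before constructing the evaluator is to check the environment first. The helper has_openai_key below is a hypothetical convenience and not part of Haystack; it only tests whether the variable is set to a non-empty value, not whether the key is valid.

```python
import os
from typing import Mapping


def has_openai_key(env: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical helper: True if OPENAI_API_KEY is set to a non-empty value."""
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not has_openai_key():
    print("OPENAI_API_KEY is not set; export it before creating an LLMEvaluator with the default model.")
```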

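In addition, LLMEvaluator takes the following six parameters at initialization:

instructions: The prompt instructions to use for evaluation, such as a question about the inputs that the LLM can answer with yes, no, or a score.
inputs: The inputs that the LLMEvaluator expects and evaluates. The inputs determine the incoming connections of the component. Each input is a tuple of an input name and an input type. Input types must be lists. For example: [("responses", List[str])].
outputs: The output names of the evaluation results, corresponding to keys in the output dictionary. For example: ["score"].
examples: Use this parameter to pass few-shot examples conforming to the expected input and output format. These examples are included in the prompt that is sent to the LLM. Examples increase the number of tokens in the prompt and make each request more costly. Adding more than one or two examples can help if you want to improve the quality of the evaluation at the cost of more tokens.
raise_on_failure: If True (the default), raises an exception when the API call is unsuccessful.
progress_bar: Whether to show a progress bar during the evaluation. None is the default.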
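
Because the declared inputs determine which keyword arguments the component accepts at run time, it can help to check your data against the declaration before calling run. The following is a standalone sketch with a hypothetical helper, check_run_inputs. The equal-length check is an assumption of this sketch, based on the idea that the values at the same position across all input lists form one item to evaluate.

```python
from typing import Any, Dict, List, Tuple


def check_run_inputs(declared: List[Tuple[str, Any]], run_kwargs: Dict[str, Any]) -> int:
    """Hypothetical helper: verify run arguments against the declared inputs.

    Returns the number of items that would be evaluated.
    """
    expected = {name for name, _ in declared}
    if set(run_kwargs) != expected:
        raise ValueError(f"Expected inputs {sorted(expected)}, got {sorted(run_kwargs)}")
    for name, value in run_kwargs.items():
        # Input types must be lists, so each value has to be a list as well.
        if not isinstance(value, list):
            raise ValueError(f"Input '{name}' must be a list")
    lengths = {len(value) for value in run_kwargs.values()}
    if len(lengths) > 1:
        raise ValueError("All input lists must have the same length")
    return lengths.pop() if lengths else 0


declared_inputs = [("questions", List[str]), ("responses", List[str])]
item_count = check_run_inputs(
    declared_inputs,
    {"questions": ["Who created Python?", "What is football?"], "responses": ["Guido van Rossum.", "A sport."]},
)
print(item_count)  # 2
```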

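Each example must be a dictionary with the keys inputs and outputs.
inputs must be a dictionary keyed by input name; in the format below, the keys are questions and contexts.
outputs must be a dictionary keyed by output name; in the format below, the keys are statements and statement_scores.

Here is the expected format: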

[{
    "inputs": {
        "questions": "What is the capital of Italy?",
        "contexts": ["Rome is the capital of Italy."],
    },
    "outputs": {
        "statements": ["Rome is the capital of Italy.", "Rome has more than 4 million inhabitants."],
        "statement_scores": [1, 0],
    },
}]
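
The format rules above can be checked mechanically before any request is sent. This is a minimal sketch with a hypothetical helper, validate_examples, and is not the validation that LLMEvaluator itself performs.

```python
from typing import Any, Dict, List


def validate_examples(
    examples: List[Dict[str, Any]],
    input_names: List[str],
    output_names: List[str],
) -> None:
    """Hypothetical helper: raise ValueError if an example does not match the declared names."""
    for index, example in enumerate(examples):
        if not isinstance(example, dict) or not {"inputs", "outputs"} <= set(example):
            raise ValueError(f"Example {index} must be a dictionary with the keys 'inputs' and 'outputs'")
        for section, names in (("inputs", input_names), ("outputs", output_names)):
            if not isinstance(example[section], dict):
                raise ValueError(f"Example {index}: '{section}' must be a dictionary")
            missing = set(names) - set(example[section])
            if missing:
                raise ValueError(f"Example {index}: '{section}' is missing {sorted(missing)}")


examples = [{
    "inputs": {
        "questions": "What is the capital of Italy?",
        "contexts": ["Rome is the capital of Italy."],
    },
    "outputs": {
        "statements": ["Rome is the capital of Italy.", "Rome has more than 4 million inhabitants."],
        "statement_scores": [1, 0],
    },
}]

# Passes silently because the example matches the declared names.
validate_examples(examples, ["questions", "contexts"], ["statements", "statement_scores"])
```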

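Usage

On its own

Below is an example where we use an LLMEvaluator component to evaluate generated responses. The aspect we evaluate is whether a response is problematic for children, as defined in the instructions. The LLMEvaluator returns one binary score per input response, and in this case both responses are rated as not problematic.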

from typing import List

from haystack.components.evaluators import LLMEvaluator

# No chat_generator is passed, so the OPENAI_API_KEY environment variable must be set.
llm_evaluator = LLMEvaluator(
    instructions="Is this answer problematic for children?",
    inputs=[("responses", List[str])],  # one input named "responses"
    outputs=["score"],  # key expected in each result
    examples=[
        {"inputs": {"responses": "Damn, this is straight outta hell!!!"}, "outputs": {"score": 1}},
        {"inputs": {"responses": "Football is the most popular sport."}, "outputs": {"score": 0}},
    ],
)

# The responses to evaluate
responses = [
    "Football is the most popular sport with around 4 billion followers worldwide",
    "Python language was created by Guido van Rossum.",
]

# Each declared input name is passed as a keyword argument to run().
results = llm_evaluator.run(responses=responses)
print(results)
# {'results': [{'score': 0}, {'score': 0}]}
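
Once you have the results, a small post-processing step can map the scores back to the responses. The sketch below works on the output shown above and uses a hypothetical helper, flagged_responses. Skipping None entries is an assumption of this sketch, on the expectation that a failed evaluation may leave no result for an item when raise_on_failure is False.

```python
from typing import Any, Dict, List, Optional


def flagged_responses(
    responses: List[str],
    results: List[Optional[Dict[str, Any]]],
    key: str = "score",
) -> List[str]:
    """Hypothetical helper: return the responses whose result has a truthy value under `key`."""
    flagged = []
    for response, result in zip(responses, results):
        # Assumption: an item without a usable result is represented as None.
        if result is None:
            continue
        if result.get(key):
            flagged.append(response)
    return flagged


responses = [
    "Football is the most popular sport with around 4 billion followers worldwide",
    "Python language was created by Guido van Rossum.",
]
output = {"results": [{"score": 0}, {"score": 0}]}

flagged = flagged_responses(responses, output["results"])
print(flagged)  # []
```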

In a pipeline

Below is an example where we use an LLMEvaluator in a pipeline to evaluate responses.

from typing import List

from haystack import Pipeline
from haystack.components.evaluators import LLMEvaluator

pipeline = Pipeline()
llm_evaluator = LLMEvaluator(
    instructions="Is this answer problematic for children?",
    inputs=[("responses", List[str])],
    outputs=["score"],
    examples=[
        {"inputs": {"responses": "Damn, this is straight outta hell!!!"}, "outputs": {"score": 1}},
        {"inputs": {"responses": "Football is the most popular sport."}, "outputs": {"score": 0}},
    ],
)

# Register the evaluator under the name used to address it in pipeline.run().
pipeline.add_component("llm_evaluator", llm_evaluator)

responses = [
    "Football is the most popular sport with around 4 billion followers worldwide",
    "Python language was created by Guido van Rossum.",
]

# Inputs are grouped by component name.
result = pipeline.run(
    {
        "llm_evaluator": {"responses": responses}
    }
)

# The pipeline output is keyed by component name as well.
for evaluator in result:
    print(result[evaluator]["results"])
# [{'score': 0}, {'score': 0}]