FaithfulnessEvaluator

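The FaithfulnessEvaluator uses an LLM to evaluate whether a generated answer can be inferred from the provided contexts. It does not require ground truth labels. This metric is called faithfulness, and is sometimes also referred to as groundedness or hallucination.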


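Most common position in a pipeline	On its own or in an evaluation pipeline. To be used after a separate pipeline that has generated the inputs for the Evaluator.
Mandatory run variables	"questions": A list of questions. "contexts": A list of lists of contexts, which are the contents of documents. This amounts to one list of contexts for each question. "predicted_answers": A list of predicted answers, for example, the outputs of a generator in a RAG pipeline.
Output variables	A dictionary containing: - `score`: A number from 0.0 to 1.0 that represents the average faithfulness score across all questions. - `individual_scores`: A list of the individual faithfulness scores, each from 0.0 to 1.0, one for each input triple of a question, a list of contexts, and a predicted answer. - `results`: A list of dictionaries with the keys `statements` and `statement_scores`. They contain the statements the LLM extracted from each predicted answer and the faithfulness score of each statement, which is either 0 or 1.
API reference	Evaluators
GitHub link	https://github.com/deepset-ai/haystack/blob/main/haystack/components/evaluators/faithfulness.py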

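You can use the FaithfulnessEvaluator component to evaluate the answers generated by a Haystack pipeline, such as a RAG pipeline, without ground truth labels. The component splits each generated answer into statements and uses an LLM to check each statement against the provided contexts. A higher faithfulness score is better: it indicates that a larger share of the statements in the generated answer can be inferred from the contexts. The faithfulness score helps you understand how often and when the generator in a RAG pipeline hallucinates.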
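
To make the relationship between the statement-level and answer-level scores concrete, here is a minimal sketch in plain Python. It assumes that each individual score is the mean of that answer's statement scores, which is consistent with the example output on this page, and that the overall score is the mean of the individual scores. The function names and the handling of empty inputs are illustrative assumptions, not the library's implementation.

```python
def answer_score(statement_scores):
    """Fraction of an answer's statements judged as supported (each score is 0 or 1)."""
    # Assumption for this sketch: an answer with no extracted statements scores 0.0.
    if not statement_scores:
        return 0.0
    return sum(statement_scores) / len(statement_scores)


def overall_score(per_answer_statement_scores):
    """Return (mean score across answers, list of per-answer scores)."""
    individual = [answer_score(scores) for scores in per_answer_statement_scores]
    mean_score = sum(individual) / len(individual) if individual else 0.0
    return mean_score, individual


# Two answers: one with a single unsupported statement out of two, one fully supported.
score, individual_scores = overall_score([[1, 0], [1, 1, 1]])
print(individual_scores)  # [0.5, 1.0]
print(score)              # 0.75
```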

Parameters

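The default model for this evaluator is gpt-4o-mini. You can override the model with the chat_generator parameter during initialization. It must be a chat generator instance configured to return a JSON object. For example, when using the OpenAIChatGenerator, pass {"response_format": {"type": "json_object"}} in its generation_kwargs.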
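
The configuration described above might look like the following sketch. The helper name build_faithfulness_evaluator and the constant JSON_MODE_KWARGS are made up for illustration. The imports sit inside the function only so that the helper can be defined without Haystack installed or an API key set; in real code you would normally import at the top of the module.

```python
# JSON mode settings quoted in the text above.
JSON_MODE_KWARGS = {"response_format": {"type": "json_object"}}


def build_faithfulness_evaluator(model="gpt-4o-mini"):
    """Return a FaithfulnessEvaluator backed by an explicitly configured chat generator."""
    # Imported lazily (an illustrative choice) so defining this helper has no side effects.
    from haystack.components.evaluators import FaithfulnessEvaluator
    from haystack.components.generators.chat import OpenAIChatGenerator

    # OpenAIChatGenerator reads the OPENAI_API_KEY environment variable by default.
    chat_generator = OpenAIChatGenerator(model=model, generation_kwargs=JSON_MODE_KWARGS)
    return FaithfulnessEvaluator(chat_generator=chat_generator)
```

Calling build_faithfulness_evaluator() requires Haystack and a valid OpenAI API key. To use another provider, you would swap in that provider's chat generator and configure it to return JSON in whatever way it supports.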

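Unless you initialize the evaluator with a chat generator other than OpenAI's, a valid OpenAI API key must be set as the OPENAI_API_KEY environment variable. For details, see our documentation page on secret management.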
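
A quick way to catch a missing key before starting an evaluation run is to check the environment yourself. The sketch below uses only the standard library. The helper name openai_key_is_set is hypothetical, and the check only confirms that the variable is non-empty, not that the key is valid.

```python
import os


def openai_key_is_set(environ=None):
    """Return True if OPENAI_API_KEY is present and non-empty in the given mapping."""
    env = os.environ if environ is None else environ
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not openai_key_is_set():
    print("OPENAI_API_KEY is not set; the default FaithfulnessEvaluator needs it to call OpenAI.")
```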

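Two other optional initialization parameters are:

raise_on_failure: If True, raise an exception when an API call is unsuccessful.
progress_bar: Whether to show a progress bar during the evaluation.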

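FaithfulnessEvaluator has an optional examples parameter that you can use to pass few-shot examples conforming to the input and output format the FaithfulnessEvaluator expects. These examples are included in the prompt sent to the LLM, so they increase the number of tokens in the prompt and make each request more expensive. Adding examples is helpful if you want to improve the quality of the evaluation at the cost of more tokens.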

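Each example must be a dictionary with the keys inputs and outputs.
inputs must be a dictionary with the keys questions, contexts, and predicted_answers.
outputs must be a dictionary with the keys statements and statement_scores.
Here is the expected format: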

[{
    "inputs": {
        "questions": "What is the capital of Italy?",
        "contexts": ["Rome is the capital of Italy."],
        "predicted_answers": "Rome is the capital of Italy with more than 4 million inhabitants.",
    },
    "outputs": {
        "statements": ["Rome is the capital of Italy.", "Rome has more than 4 million inhabitants."],
        "statement_scores": [1, 0],
    },
}]
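
Because malformed examples end up in the prompt, it can be worth checking them before passing them to the examples parameter. The following is a minimal sketch of such a check based only on the format rules above. validate_examples is a hypothetical helper, not part of Haystack, and the evaluator may perform its own validation.

```python
def validate_examples(examples):
    """Raise ValueError if any example deviates from the format described above."""
    for i, example in enumerate(examples):
        if set(example) != {"inputs", "outputs"}:
            raise ValueError(f"example {i}: expected exactly the keys 'inputs' and 'outputs'")
        missing_in = {"questions", "contexts", "predicted_answers"} - set(example["inputs"])
        missing_out = {"statements", "statement_scores"} - set(example["outputs"])
        if missing_in or missing_out:
            raise ValueError(f"example {i}: missing keys {sorted(missing_in | missing_out)}")
        outputs = example["outputs"]
        # Each extracted statement needs exactly one score, and scores are 0 or 1.
        if len(outputs["statements"]) != len(outputs["statement_scores"]):
            raise ValueError(f"example {i}: one score is needed per statement")
        if any(score not in (0, 1) for score in outputs["statement_scores"]):
            raise ValueError(f"example {i}: statement scores must be 0 or 1")


examples = [{
    "inputs": {
        "questions": "What is the capital of Italy?",
        "contexts": ["Rome is the capital of Italy."],
        "predicted_answers": "Rome is the capital of Italy with more than 4 million inhabitants.",
    },
    "outputs": {
        "statements": ["Rome is the capital of Italy.", "Rome has more than 4 million inhabitants."],
        "statement_scores": [1, 0],
    },
}]
validate_examples(examples)  # passes silently for the example from this page
```

After validation, the list can be passed as FaithfulnessEvaluator(examples=examples).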

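Usage

On its own

Below is an example that uses the FaithfulnessEvaluator component to evaluate a predicted answer generated from a provided question and context. The evaluator detects two statements in the answer, and only one of them is supported by the context, so it returns a score of 0.5.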

from haystack.components.evaluators import FaithfulnessEvaluator

questions = ["Who created the Python language?"]
contexts = [
    [
        "Python, created by Guido van Rossum in the late 1980s, is a high-level general-purpose programming language. Its design philosophy emphasizes code readability, and its language constructs aim to help programmers write clear, logical code for both small and large-scale software projects."
    ],
]
predicted_answers = ["Python is a high-level general-purpose programming language that was created by George Lucas."]
evaluator = FaithfulnessEvaluator()
result = evaluator.run(questions=questions, contexts=contexts, predicted_answers=predicted_answers)

print(result["individual_scores"])
# [0.5]
print(result["score"])
# 0.5
print(result["results"])
# [{'statements': ['Python is a high-level general-purpose programming language.',
# 'Python was created by George Lucas.'], 'statement_scores': [1, 0], 'score': 0.5}]
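
To see which statements caused a low score, you can pair each statement with its score. The sketch below works on a results list shaped like the output printed above. The helper name unsupported_statements is an illustrative assumption, and the input is copied from the example output so that the snippet runs without calling an LLM.

```python
def unsupported_statements(results):
    """Return, for each answer, the statements that received a score of 0."""
    return [
        [
            statement
            for statement, score in zip(entry["statements"], entry["statement_scores"])
            if score == 0
        ]
        for entry in results
    ]


# Copied from the example output above.
results = [{
    "statements": [
        "Python is a high-level general-purpose programming language.",
        "Python was created by George Lucas.",
    ],
    "statement_scores": [1, 0],
    "score": 0.5,
}]
print(unsupported_statements(results))
# [['Python was created by George Lucas.']]
```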

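In a pipeline

Below is an example where we use a FaithfulnessEvaluator and a ContextRelevanceEvaluator in a pipeline to evaluate the predicted answers and contexts (the contents of documents) that a RAG pipeline produced for the given questions. Running a pipeline instead of the individual components simplifies calculating more than one metric.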

from haystack import Pipeline
from haystack.components.evaluators import ContextRelevanceEvaluator, FaithfulnessEvaluator

pipeline = Pipeline()
context_relevance_evaluator = ContextRelevanceEvaluator()
faithfulness_evaluator = FaithfulnessEvaluator()
pipeline.add_component("context_relevance_evaluator", context_relevance_evaluator)
pipeline.add_component("faithfulness_evaluator", faithfulness_evaluator)

questions = ["Who created the Python language?"]
contexts = [
    [
        "Python, created by Guido van Rossum in the late 1980s, is a high-level general-purpose programming language. Its design philosophy emphasizes code readability, and its language constructs aim to help programmers write clear, logical code for both small and large-scale software projects."
    ],
]
predicted_answers = ["Python is a high-level general-purpose programming language that was created by George Lucas."]

result = pipeline.run(
    {
        "context_relevance_evaluator": {"questions": questions, "contexts": contexts},
        "faithfulness_evaluator": {
            "questions": questions,
            "contexts": contexts,
            "predicted_answers": predicted_answers,
        },
    }
)

for evaluator in result:
    print(result[evaluator]["individual_scores"])
# ...
# [0.5]
for evaluator in result:
    print(result[evaluator]["score"])
# ...
# 0.5
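
When several evaluators run in one pipeline, it can be handy to collect their aggregate scores into a single mapping keyed by component name. The sketch below assumes the pipeline result is a dictionary keyed by component name, as in the loops above. collect_scores is a hypothetical helper, and sample_result is a hand-written stand-in in which the context relevance values are placeholders, not real output.

```python
def collect_scores(pipeline_result):
    """Map each evaluator's component name to its aggregate score."""
    return {name: output["score"] for name, output in pipeline_result.items()}


# Hand-written stand-in for a pipeline result (placeholder values for context relevance).
sample_result = {
    "context_relevance_evaluator": {"score": 1.0, "individual_scores": [1.0]},
    "faithfulness_evaluator": {"score": 0.5, "individual_scores": [0.5]},
}
for name, score in collect_scores(sample_result).items():
    print(f"{name}: {score}")
```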
