RagasEvaluator

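This component evaluates Haystack pipelines using LLM-based metrics. It supports metrics such as context relevance, factual accuracy, and response relevance.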


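Most common position in a pipeline	On its own or in an evaluation pipeline. To be used after a separate pipeline has generated the inputs for the evaluator.
Mandatory init variables	"metric": The Ragas metric to use for evaluation
Mandatory run variables	"inputs": A keyword arguments dictionary containing the expected inputs. The expected inputs change based on the metric you are evaluating. See below for more details.
Output variables	"results": A nested list of metric results. There can be one or more results, depending on the metric. Each result is a dictionary containing: - `name` - The name of the metric. - `score` - The score of the metric.
API reference	Ragas
GitHub link	https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/ragas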

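Ragas is an evaluation framework that provides a number of LLM-based evaluation metrics. You can use the RagasEvaluator component to evaluate a Haystack pipeline, such as a retrieval-augmented generation pipeline, against one of the metrics provided by Ragas.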
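
To make the output format from the overview table concrete, here is a minimal sketch that averages the scores in a `results` value. The sample data is made up, the helper `mean_scores` is hypothetical, and the sketch assumes that the outer list holds one entry per evaluated example.

```python
from statistics import mean
from typing import Dict, List, Union

Result = Dict[str, Union[str, float]]


def mean_scores(results: List[List[Result]]) -> Dict[str, float]:
    """Average the score of each metric name over a nested results list."""
    scores_by_name: Dict[str, List[float]] = {}
    for sample_results in results:      # assumed: one inner list per example
        for result in sample_results:   # one dict per metric result
            name = str(result["name"])
            scores_by_name.setdefault(name, []).append(float(result["score"]))
    return {name: mean(scores) for name, scores in scores_by_name.items()}


# Made-up output, shaped like the "results" value described in the table.
sample_output = {
    "results": [
        [{"name": "answer_relevancy", "score": 0.75}],
        [{"name": "answer_relevancy", "score": 0.25}],
    ]
}

averages = mean_scores(sample_output["results"])
print(averages)
```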

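Supported Metrics

Ragas supports a number of metrics, which we expose through the RagasMetric enumeration. Below is the list of metrics supported by the RagasEvaluator in Haystack, along with the metric_params expected when initializing the evaluator. Many metrics use OpenAI models and require the OPENAI_API_KEY environment variable to be set. For a complete guide to these metrics, see the Ragas documentation.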

Metric	Metric parameters	Expected inputs	Metric description
`ANSWER_CORRECTNESS`	`"weights": Tuple[float, float]`	`questions: List[str]`, `responses: List[str]`, `ground_truths: List[str]`	Grades the accuracy of the generated answer when compared to the ground truth.
`FAITHFULNESS`	None	`questions: List[str]`, `contexts: List[List[str]]`, `responses: List[str]`	Grades how factual the generated response was.
`ANSWER_SIMILARITY`	`"threshold": float`	`responses: List[str]`, `ground_truths: List[str]`	Grades how similar the generated answer is to the specified ground truth answer.
`CONTEXT_PRECISION`	None	`questions: List[str]`, `contexts: List[List[str]]`, `ground_truths: List[str]`	Grades whether the contexts that are relevant to the ground truth are ranked above the irrelevant ones.
`CONTEXT_UTILIZATION`	None	`questions: List[str]`, `contexts: List[List[str]]`, `responses: List[str]`	Grades to what extent the generated answer uses the provided contexts.
`CONTEXT_RECALL`	None	`questions: List[str]`, `contexts: List[List[str]]`, `ground_truths: List[str]`	Grades how well the provided contexts cover the information in the ground truth answer.
`ASPECT_CRITIQUE`	`"name": str`, `"definition": str`, `"strictness": int`	`questions: List[str]`, `contexts: List[List[str]]`, `responses: List[str]`	Grades generated answers on a custom aspect with a binary score.
`ANSWER_RELEVANCY`	`"strictness": int`	`questions: List[str]`, `contexts: List[List[str]]`, `responses: List[str]`	Grades how relevant the generated response is to the question.
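
Because each metric expects a different set of inputs, it can help to check them before calling the evaluator. The following is a minimal sketch that copies the "Expected inputs" column of the table into a plain dictionary. The names `EXPECTED_INPUTS` and `check_inputs` are hypothetical and not part of the integration, and the equal-length check is an assumption about how the lists line up.

```python
from typing import Any, Dict, Set

# Expected inputs per metric, copied from the table above.
EXPECTED_INPUTS: Dict[str, Set[str]] = {
    "ANSWER_CORRECTNESS": {"questions", "responses", "ground_truths"},
    "FAITHFULNESS": {"questions", "contexts", "responses"},
    "ANSWER_SIMILARITY": {"responses", "ground_truths"},
    "CONTEXT_PRECISION": {"questions", "contexts", "ground_truths"},
    "CONTEXT_UTILIZATION": {"questions", "contexts", "responses"},
    "CONTEXT_RECALL": {"questions", "contexts", "ground_truths"},
    "ASPECT_CRITIQUE": {"questions", "contexts", "responses"},
    "ANSWER_RELEVANCY": {"questions", "contexts", "responses"},
}


def check_inputs(metric_name: str, inputs: Dict[str, Any]) -> None:
    """Raise ValueError if the inputs do not match the table for this metric."""
    expected = EXPECTED_INPUTS[metric_name]
    missing = expected - set(inputs)
    unexpected = set(inputs) - expected
    if missing or unexpected:
        raise ValueError(
            f"{metric_name}: missing {sorted(missing)}, unexpected {sorted(unexpected)}"
        )
    # Assumption: every list holds one entry per evaluated example.
    if len({len(values) for values in inputs.values()}) > 1:
        raise ValueError(f"{metric_name}: all input lists must have the same length")


# Passes silently: FAITHFULNESS expects exactly these three inputs.
check_inputs(
    "FAITHFULNESS",
    {"questions": ["q1"], "contexts": [["c1"]], "responses": ["r1"]},
)
```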

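Parameters Overview

To initialize a RagasEvaluator, you need to provide the following parameters:

metric: A RagasMetric.
metric_params: Optional. If the metric requires additional parameters, provide them here.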

Usage

To use the RagasEvaluator, you first need to install the integration:

pip install ragas-haystack

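To use the RagasEvaluator, follow these steps:

Initialize the RagasEvaluator, providing the correct metric_params for the metric you are using.
Run the RagasEvaluator, either on its own or in a pipeline, by providing the expected inputs for the metric you are using.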
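
The two steps above can be sketched for the standalone case as follows. This is an illustration rather than a definitive recipe: the wrapper function `evaluate_context_precision` is hypothetical, and the sketch assumes that the component's `run` method accepts the expected inputs as keyword arguments and returns a dictionary with a `results` key, as the overview table describes.

```python
from typing import Any, Dict, List


def evaluate_context_precision(
    questions: List[str],
    contexts: List[List[str]],
    ground_truths: List[str],
) -> List[List[Dict[str, Any]]]:
    """Run a RagasEvaluator on its own, outside of a Pipeline."""
    # Imported inside the function so this sketch can be loaded even where
    # ragas-haystack is not installed; normally the import sits at the top.
    from haystack_integrations.components.evaluators.ragas import (
        RagasEvaluator,
        RagasMetric,
    )

    # Step 1: initialize. CONTEXT_PRECISION takes no metric_params.
    evaluator = RagasEvaluator(metric=RagasMetric.CONTEXT_PRECISION)

    # Step 2: run, passing the expected inputs as keyword arguments
    # (assumed calling convention, see the lead-in).
    output = evaluator.run(
        questions=questions,
        contexts=contexts,
        ground_truths=ground_truths,
    )
    return output["results"]
```

Calling the function requires the integration to be installed and, since many metrics use OpenAI models, most likely the OPENAI_API_KEY environment variable to be set.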

Examples

Evaluate Answer Relevancy

To create an answer relevancy evaluation pipeline:

from haystack import Pipeline
from haystack_integrations.components.evaluators.ragas import RagasEvaluator, RagasMetric

pipeline = Pipeline()
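evaluator = RagasEvaluator(
    metric=RagasMetric.ANSWER_RELEVANCY,
    # The metrics table lists "strictness" as the parameter for this metric; 3 is an example value
    metric_params={"strictness": 3},
)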
pipeline.add_component("evaluator", evaluator)

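To run the evaluation pipeline, you should have the *expected inputs* for the metric ready. This metric expects lists of questions, contexts, and responses, which should come from the results of the pipeline you want to evaluate.

results = pipeline.run({"evaluator": {"questions": ["When was the Rhodes Statue built?", "Where is the Pyramid of Giza?"],
                                      "contexts": [["Context for question 1"], ["Context for question 2"]],
                                      "responses": ["Response for question 1", "Response for question 2"]}})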

Evaluate Context Precision and Aspect Critique

To create a pipeline that evaluates context precision and aspect critique:

from haystack import Pipeline
from haystack_integrations.components.evaluators.ragas import RagasEvaluator, RagasMetric

pipeline = Pipeline()
evaluator_context = RagasEvaluator(
    metric=RagasMetric.CONTEXT_PRECISION,
)
evaluator_aspect = RagasEvaluator(
    metric=RagasMetric.ASPECT_CRITIQUE,
    metric_params={"name": "custom", "definition": "Is this answer problematic for children?", "strictness": 3},
)
pipeline.add_component("evaluator_context", evaluator_context)
pipeline.add_component("evaluator_aspect", evaluator_aspect)

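To run the evaluation pipeline, you should have the *expected inputs* for the metrics ready. These metrics expect lists of questions, contexts, responses, and ground_truths, which should come from the results of the pipeline you want to evaluate.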

QUESTIONS = ["Which is the most popular global sport?", "Who created the Python language?"]
CONTEXTS = [["The popularity of sports can be measured in various ways, including TV viewership, social media presence, number of participants, and economic impact. Football is undoubtedly the world's most popular sport with major events like the FIFA World Cup and sports personalities like Ronaldo and Messi, drawing a followership of more than 4 billion people."], 
                 ["Python, created by Guido van Rossum in the late 1980s, is a high-level general-purpose programming language. Its design philosophy emphasizes code readability, and its language constructs aim to help programmers write clear, logical code for both small and large-scale software projects."]]
RESPONSES = ["Football is the most popular sport with around 4 billion followers worldwide", "Python language was created by Guido van Rossum."]
GROUND_TRUTHS = ["Football is the most popular sport", "Python language was created by Guido van Rossum."]
results = pipeline.run({
        "evaluator_context": {"questions": QUESTIONS, "contexts": CONTEXTS, "ground_truths": GROUND_TRUTHS},
        "evaluator_aspect": {"questions": QUESTIONS, "contexts": CONTEXTS, "responses": RESPONSES},
})
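
The value returned by `pipeline.run` is keyed by component name, so the scores of both evaluators can be lined up per question. The sketch below uses made-up scores in place of a real run. The helper `scores_per_question` is hypothetical, and both the result names (`context_precision`, `custom`) and the one-inner-list-per-question layout are assumptions.

```python
from typing import Any, Dict, List


def scores_per_question(
    questions: List[str],
    pipeline_output: Dict[str, Dict[str, Any]],
) -> Dict[str, Dict[str, float]]:
    """Collect every metric score reported for each question."""
    table: Dict[str, Dict[str, float]] = {question: {} for question in questions}
    for component_output in pipeline_output.values():
        # Assumed: the i-th inner list belongs to the i-th question.
        for question, sample_results in zip(questions, component_output["results"]):
            for result in sample_results:
                table[question][result["name"]] = result["score"]
    return table


example_questions = [
    "Which is the most popular global sport?",
    "Who created the Python language?",
]

# Made-up output, shaped like the return value of pipeline.run for two evaluators.
made_up_output = {
    "evaluator_context": {
        "results": [
            [{"name": "context_precision", "score": 1.0}],
            [{"name": "context_precision", "score": 1.0}],
        ]
    },
    "evaluator_aspect": {
        "results": [
            [{"name": "custom", "score": 0}],
            [{"name": "custom", "score": 0}],
        ]
    },
}

score_table = scores_per_question(example_questions, made_up_output)
for question, scores in score_table.items():
    print(question, scores)
```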

Additional References

🧑‍🍳 Cookbook: Evaluate a RAG pipeline using Ragas integration
