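Most common position in a pipeline	In query pipelines, after a component that returns a list of documents, such as a Retriever
Mandatory init variables	"token": The Hugging Face API token. Can be set with the `HF_API_TOKEN` or `HF_TOKEN` environment variable.
Mandatory run variables	"documents": A list of documents "query": A query string
Output variables	"answers": A list of `ExtractedAnswer` objects
API reference	Readers
GitHub link	https://github.com/deepset-ai/haystack/blob/main/haystack/components/readers/extractive.py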

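Overview

`ExtractiveReader` locates and extracts answers to a given query from the document text. It is used in extractive QA systems, where you want to know exactly where the answer is located within the document. It is usually paired with a preceding Retriever, but you can also use it with other components that fetch documents.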

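Readers assign a *probability* to each answer. This score ranges from 0 to 1 and indicates how well the result the Reader returned matches the query. A probability close to 1 means the model has high confidence in the answer's relevance. The Reader sorts the answers by probability, with higher probabilities listed first. You can limit the number of answers the Reader returns with the optional `top_k` parameter.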

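You can use the probability to set quality expectations for your system. To do so, use the Reader's `confidence_threshold` parameter to set a minimum probability threshold for answers. For example, setting `confidence_threshold` to 0.7 means that only answers with a probability higher than 0.7 are returned.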
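The interplay of sorting, `top_k`, and `confidence_threshold` described above can be illustrated with a small plain-Python sketch. It does not call Haystack and is not the component's actual implementation: `select_answers` and the `(text, probability)` tuples are hypothetical stand-ins, the scores are made up, and the strict `>` comparison follows the wording "higher than 0.7" (how the real component treats a score exactly equal to the threshold is an assumption here).

```python
from typing import List, Optional, Tuple

# Hypothetical stand-in for an extracted answer: (answer text, probability).
Candidate = Tuple[str, float]


def select_answers(
    candidates: List[Candidate],
    top_k: int,
    confidence_threshold: Optional[float] = None,
) -> List[Candidate]:
    """Sort by probability, keep the best top_k, then apply the threshold."""
    # Highest probability first, as the Reader is described to do.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    top = ranked[:top_k]
    if confidence_threshold is not None:
        # Assumption: strict comparison, matching "higher than" in the text.
        top = [c for c in top if c[1] > confidence_threshold]
    return top


# Made-up scores, for illustration only.
candidates = [("Berlin", 0.12), ("Paris", 0.93), ("France", 0.71), ("capital", 0.40)]
selected = select_answers(candidates, top_k=3, confidence_threshold=0.7)
print(selected)
```

With these made-up scores, `top_k=3` first drops the lowest-ranked candidate, and the threshold then removes the remaining candidate that scores below 0.7.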

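By default, the Reader accounts for the scenario where no answer to the query is found in the document text (`no_answer=True`). In this case, it returns one additional `ExtractedAnswer` that has no text and carries the probability that none of the `top_k` answers is correct. For example, if `top_k=4`, the system returns four answers and one additional empty answer. Each answer has a probability assigned. If the empty answer has a probability of 0.5, there is a 0.5 probability that none of the returned answers is correct. To receive only the actual `top_k` answers, set the `no_answer` parameter to `False` when initializing the component.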
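One way to read "the probability that none of the `top_k` answers is correct" is as the product of each answer's probability of being wrong. The sketch below works under that assumption, which treats the answer probabilities as independent; the text above does not state how the component combines them, so `no_answer_probability` is a hypothetical helper and the scores are made up.

```python
import math
from typing import Sequence


def no_answer_probability(answer_scores: Sequence[float]) -> float:
    """Probability that every answer is wrong, assuming independent scores."""
    # Each answer is wrong with probability (1 - score); multiply them all.
    return math.prod(1.0 - score for score in answer_scores)


# Made-up probabilities for top_k=4 answers.
scores = [0.5, 0.2, 0.1, 0.05]
p_none = no_answer_probability(scores)
print(round(p_none, 4))
```

Under this reading, a single highly probable answer is enough to push the empty answer's probability close to 0, while several weak answers leave it high.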

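Models

Here are the models we recommend for use with `ExtractiveReader`:

Model URL	Description	Language
deepset/roberta-base-squad2-distilled (default)	A distilled model, relatively fast and with good performance.	English
deepset/roberta-large-squad2	A large model with good performance. Slower than the distilled one.	English
deepset/tinyroberta-squad2	A distilled version of the roberta-large-squad2 model, very fast.	English
deepset/xlm-roberta-base-squad2	A base multilingual model with good speed and performance.	Multilingual

You can also browse other question answering models on Hugging Face.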

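Usage

On its own

Below is an example of using `ExtractiveReader` outside of a pipeline. The Reader receives the query and the documents at runtime. It should return two answers plus an additional third answer that has no text and carries the probability that the `top_k` answers are incorrect.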

from haystack import Document
from haystack.components.readers import ExtractiveReader

docs = [Document(content="Paris is the capital of France."), Document(content="Berlin is the capital of Germany.")]

reader = ExtractiveReader()
reader.warm_up()

# Keep the result so the extracted answers can be inspected.
result = reader.run(query="What is the capital of France?", documents=docs, top_k=2)
print(result["answers"])

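In a pipeline

Below is an example of a pipeline that retrieves documents from an `InMemoryDocumentStore` using keyword search (with `InMemoryBM25Retriever`). It then uses `ExtractiveReader` to extract the answer to our query from the top retrieved documents.

With the ExtractiveReader's `top_k` set to 2, an additional third answer is also returned. It has no text and carries the probability that the other `top_k` answers are incorrect.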

from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.readers import ExtractiveReader

docs = [Document(content="Paris is the capital of France."),
        Document(content="Berlin is the capital of Germany."),
        Document(content="Rome is the capital of Italy."),
        Document(content="Madrid is the capital of Spain.")]
document_store = InMemoryDocumentStore()
document_store.write_documents(docs)

retriever = InMemoryBM25Retriever(document_store = document_store)
reader = ExtractiveReader()
reader.warm_up()

extractive_qa_pipeline = Pipeline()
extractive_qa_pipeline.add_component(instance=retriever, name="retriever")
extractive_qa_pipeline.add_component(instance=reader, name="reader")

extractive_qa_pipeline.connect("retriever.documents", "reader.documents")

query = "What is the capital of France?"
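# The retriever fetches up to 3 documents; the reader extracts the 2 best answers.
result = extractive_qa_pipeline.run(data={"retriever": {"query": query, "top_k": 3},
                                          "reader": {"query": query, "top_k": 2}})
print(result["reader"]["answers"])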
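A pipeline run returns a dictionary keyed by component name, so the Reader's answers sit under `"reader"` and then `"answers"`. The sketch below shows one way to turn that output into readable lines. To stay self-contained it does not run a model: `FakeAnswer` is a hypothetical stand-in that mimics only the `data` and `score` attributes of `ExtractedAnswer`, `summarize_answers` is an illustrative helper, and the scores are made up rather than real model output.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class FakeAnswer:
    """Hypothetical stand-in for ExtractedAnswer (only `data` and `score`)."""

    data: Optional[str]  # answer text; None for the "no answer" entry
    score: float  # probability assigned by the Reader


def summarize_answers(result: Dict[str, Any], component: str = "reader") -> List[str]:
    """Format each answer as '<probability>  <text>'."""
    lines = []
    for answer in result[component]["answers"]:
        # The extra "no answer" entry has no text, so label it explicitly.
        text = answer.data if answer.data is not None else "<no answer>"
        lines.append(f"{answer.score:.2f}  {text}")
    return lines


# Made-up values shaped like a pipeline result with top_k=2 and no_answer=True.
fake_result = {
    "reader": {
        "answers": [
            FakeAnswer(data="Paris", score=0.95),
            FakeAnswer(data="France", score=0.30),
            FakeAnswer(data=None, score=0.04),
        ]
    }
}
summary = summarize_answers(fake_result)
for line in summary:
    print(line)
```

The same helper should work on the real pipeline output as long as the answer objects expose `data` and `score`, since it only reads those two attributes.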