SentenceWindowRetriever

Use this component to retrieve the neighboring sentences around a relevant sentence and get its full context.


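Most common position in a pipeline	Used after the main Retriever component, such as the `InMemoryEmbeddingRetriever` or any other Retriever.
Mandatory init variables	"document_store": An instance of a Document Store
Mandatory run variables	"retrieved_documents": A list of already retrieved documents for which you want to get a context window
Output variables	"context_windows": A list of strings; "context_documents": A list of documents ordered by `split_idx_start`
API reference	Retrievers
GitHub link	https://github.com/deepset-ai/haystack/blob/main/haystack/components/retrievers/sentence_window_retriever.py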

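Overview

"Sentence window" is a retrieval technique that lets you retrieve the context around relevant sentences.

During indexing, documents are split into smaller chunks, or sentences, and indexed. During retrieval, the sentences most relevant to a given query are retrieved according to a similarity metric.

Once the relevant sentences are found, their neighboring sentences can be retrieved to provide the full context. The number of neighboring sentences to retrieve is defined by a fixed number of sentences before and after the relevant sentence.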
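
The windowing step described above can be sketched in plain Python. This is a minimal illustration, not the component's actual implementation: `sentences`, `hit_index`, and the helper `window_around` are hypothetical names, and the sketch assumes the window is simply clamped at the start and end of the document.

```python
def window_around(sentences, hit_index, window_size):
    """Return the hit sentence plus up to `window_size` neighbors on each side."""
    # Clamp the window so it never runs past either end of the list.
    start = max(0, hit_index - window_size)
    end = min(len(sentences), hit_index + window_size + 1)
    return sentences[start:end]


# Hypothetical indexed sentences; "S2" plays the role of the relevant hit.
sentences = ["S0", "S1", "S2", "S3", "S4", "S5", "S6"]
context = window_around(sentences, hit_index=2, window_size=1)
print(context)  # ['S1', 'S2', 'S3']
```

With `window_size=1`, the hit is returned together with one sentence before and one after it. A hit at the very start or end of the document yields a shorter window.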

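This component is intended to be used together with other Retrievers, such as the `InMemoryEmbeddingRetriever`. Those Retrievers find relevant sentences by comparing a query against the indexed sentences with a similarity metric. The `SentenceWindowRetriever` then retrieves the neighboring sentences around the relevant ones by using the metadata stored in the `Document` objects.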
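
As a rough sketch of how metadata can drive this lookup, the snippet below uses plain dictionaries instead of Haystack `Document` objects. It assumes that each chunk carries `source_id`, `split_id`, and `split_idx_start` in its metadata, as written by a splitter such as `DocumentSplitter`. The helper `neighbors` is hypothetical and only illustrates the idea; it is not the library's code.

```python
def neighbors(chunks, hit, window_size):
    """Select chunks from the same source whose split_id lies within the window."""
    low = hit["meta"]["split_id"] - window_size
    high = hit["meta"]["split_id"] + window_size
    selected = [
        c for c in chunks
        if c["meta"]["source_id"] == hit["meta"]["source_id"]
        and low <= c["meta"]["split_id"] <= high
    ]
    # Order the window by where each chunk starts in the source text.
    return sorted(selected, key=lambda c: c["meta"]["split_idx_start"])


# Hypothetical chunks: five from "doc-a" and one from an unrelated "doc-b".
chunks = [
    {"content": f"chunk {i}",
     "meta": {"source_id": "doc-a", "split_id": i, "split_idx_start": i * 20}}
    for i in range(5)
]
chunks.append(
    {"content": "other",
     "meta": {"source_id": "doc-b", "split_id": 2, "split_idx_start": 40}}
)

hit = chunks[2]  # pretend the main Retriever returned this chunk
context_documents = neighbors(chunks, hit, window_size=1)
print([c["content"] for c in context_documents])  # ['chunk 1', 'chunk 2', 'chunk 3']
```

Filtering on the source identifier keeps chunks from unrelated documents out of the window, even when they share the same `split_id`.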

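Usage

On its own

from haystack import Document
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.retrievers import SentenceWindowRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Split the text into overlapping 10-word chunks and write them to the store
splitter = DocumentSplitter(split_length=10, split_overlap=5, split_by="word")
text = ("This is a text with some words. There is a second sentence. And there is also a third sentence. "
        "It also contains a fourth sentence. And a fifth sentence. And a sixth sentence. And a seventh sentence")
doc = Document(content=text)

docs = splitter.run([doc])
doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs["documents"])

# window_size is the number of chunks fetched before and after each retrieved chunk
retriever = SentenceWindowRetriever(document_store=doc_store, window_size=3)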

In a pipeline

from haystack import Document, Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.retrievers import SentenceWindowRetriever
from haystack.components.preprocessors import DocumentSplitter
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Split the text into overlapping 10-word chunks and write them to the store
splitter = DocumentSplitter(split_length=10, split_overlap=5, split_by="word")
text = (
    "This is a text with some words. There is a second sentence. And there is also a third sentence. "
    "It also contains a fourth sentence. And a fifth sentence. And a sixth sentence. And a seventh sentence"
)
doc = Document(content=text)
docs = splitter.run([doc])
doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs["documents"])

# The BM25 Retriever finds the best-matching chunk;
# the SentenceWindowRetriever then expands it with its neighbors
rag = Pipeline()
rag.add_component("bm25_retriever", InMemoryBM25Retriever(doc_store, top_k=1))
rag.add_component("sentence_window_retriever", SentenceWindowRetriever(document_store=doc_store, window_size=3))
rag.connect("bm25_retriever", "sentence_window_retriever")

rag.run({"bm25_retriever": {"query": "third"}})

Additional References

📓 Tutorial: Retrieving a Context Window Around a Sentence
