AutoMergingRetriever

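Use AutoMergingRetriever to improve search results: when multiple related chunks match a query, it returns the complete parent document instead of scattered fragments.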


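Most common position in a pipeline	Used after the main Retriever component that returns hierarchical documents.
Mandatory init variables	"document_store": Document Store from which to retrieve the parent documents
Mandatory run variables	"documents": A list of leaf documents matched by a Retriever
Output variables	"documents": A list of resulting documents
API reference	Retrievers
GitHub link	https://github.com/deepset-ai/haystack/blob/dae8c7babaf28d2ffab4f2a8dedecd63e2394fb4/haystack/components/retrievers/auto_merging_retriever.py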

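Overview

The AutoMergingRetriever is a component that works with a hierarchical document structure. It returns parent documents instead of individual leaf documents when a certain threshold is met.

This is particularly useful when working with paragraphs that have been split into multiple chunks. When several chunks from the same paragraph match your query, the complete paragraph usually provides more context and value than the individual fragments.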

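Here is how this Retriever works:

It requires documents to be organized in a tree structure, with the leaf nodes stored in a document index. See the HierarchicalDocumentSplitter documentation.
At search time, it counts how many leaf documents under the same parent match your query.
If the share of matched leaves exceeds the threshold you define, it returns the parent document instead of the individual leaves.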
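
The merging rule described above can be sketched in plain Python without Haystack. This is a minimal illustration, not the library's implementation: the dictionary layout, the `auto_merge` helper, and the strict `>` comparison at the threshold boundary are all assumptions made for the example.

```python
from collections import defaultdict


def auto_merge(matched_leaves, parents, threshold=0.5):
    """Replace matched leaves with their parent when enough siblings matched.

    Hypothetical helper: `matched_leaves` is a list of dicts with "id" and
    "parent_id"; `parents` maps a parent id to a dict with "children_ids".
    """
    # Group the matched leaves by the parent they belong to.
    by_parent = defaultdict(list)
    for leaf in matched_leaves:
        by_parent[leaf["parent_id"]].append(leaf)

    merged = []
    for parent_id, leaves in by_parent.items():
        parent = parents[parent_id]
        # Share of this parent's children that matched the query.
        score = len(leaves) / len(parent["children_ids"])
        if score > threshold:  # assumption: strictly above the threshold
            merged.append(parent)
        else:
            merged.extend(leaves)
    return merged


parents = {
    "p1": {"id": "p1", "children_ids": ["a", "b", "c"]},
    "p2": {"id": "p2", "children_ids": ["d", "e", "f", "g"]},
}
matched = [
    {"id": "a", "parent_id": "p1"},
    {"id": "b", "parent_id": "p1"},
    {"id": "d", "parent_id": "p2"},
]

# p1: 2 of 3 children matched (above 0.5), so the parent is returned.
# p2: 1 of 4 children matched (below 0.5), so the leaf is kept.
print([doc["id"] for doc in auto_merge(matched, parents, threshold=0.5)])
# ['p1', 'd']
```

Raising the threshold makes merging rarer: with `threshold=0.7` the same input keeps all three leaves, because 2/3 no longer clears the bar.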

The AutoMergingRetriever can currently be used with the following Document Stores:

Usage

On its own

from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter
from haystack.components.retrievers.auto_merging_retriever import AutoMergingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# create a hierarchical document structure with 3 levels, where the parent document has 3 children
text = "The sun rose early in the morning. It cast a warm glow over the trees. Birds began to sing."
original_document = Document(content=text)
builder = HierarchicalDocumentSplitter(block_sizes={10, 3}, split_overlap=0, split_by="word")
docs = builder.run([original_document])["documents"]

# store level-1 parent documents and initialize the retriever
doc_store_parents = InMemoryDocumentStore()
for doc in docs:
    if doc.meta["__children_ids"] and doc.meta["__level"] == 1:
        doc_store_parents.write_documents([doc])
retriever = AutoMergingRetriever(doc_store_parents, threshold=0.5)

# assume we retrieved 2 leaf docs from the same parent; the parent document should be returned,
# since it has 3 children and 2 of them were retrieved (2/3 ≈ 0.67, which is above threshold=0.5)
leaf_docs = [doc for doc in docs if not doc.meta["__children_ids"]]
result = retriever.run(leaf_docs[4:6])
print(result)
# {'documents': [Document(id=538..),
# content: 'warm glow over the trees. Birds began to sing.',
# meta: {'block_size': 10, 'parent_id': '835..', 'children_ids': ['c17...', '3ff...', '352...'], 'level': 1, 'source_id': '835...',
# 'page_number': 1, 'split_id': 1, 'split_idx_start': 45})]}
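
To see why the parent in this example has three children, the following plain-Python sketch rebuilds the same tree by splitting on whitespace with block sizes 10 and 3. It is an approximation for illustration only: `split_words` and `build_tree` are hypothetical helpers, and the real HierarchicalDocumentSplitter may differ in details such as whitespace handling and metadata.

```python
def split_words(text, block_size):
    """Split text into consecutive chunks of at most `block_size` words."""
    words = text.split()
    return [" ".join(words[i:i + block_size]) for i in range(0, len(words), block_size)]


def build_tree(text, block_sizes):
    """Build a nested dict tree, splitting by the largest block size first."""

    def build(node_text, level, sizes):
        node = {"content": node_text, "level": level, "children": []}
        if sizes:
            # Split this node with the current block size, then recurse
            # into each chunk with the remaining (smaller) block sizes.
            for chunk in split_words(node_text, sizes[0]):
                node["children"].append(build(chunk, level + 1, sizes[1:]))
        return node

    return build(text, 0, sorted(block_sizes, reverse=True))


text = "The sun rose early in the morning. It cast a warm glow over the trees. Birds began to sing."
root = build_tree(text, {10, 3})

for parent in root["children"]:
    print(parent["level"], repr(parent["content"]), len(parent["children"]))
# 1 'The sun rose early in the morning. It cast a' 4
# 1 'warm glow over the trees. Birds began to sing.' 3
```

The second level-1 block holds nine words, so it splits into exactly three 3-word leaves. Retrieving two of them gives a ratio of 2/3, which is above the 0.5 threshold used in the example.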

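In a pipeline

This is an example of a RAG Haystack pipeline. It first retrieves leaf-level document chunks using BM25, merges them into higher-level parent documents with AutoMergingRetriever, builds a prompt, and generates an answer using OpenAI's chat model.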

from typing import List, Tuple
from haystack import Document, Pipeline
from haystack.components.preprocessors import HierarchicalDocumentSplitter
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.chat_prompt_builder import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.retrievers import AutoMergingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack.dataclasses import ChatMessage

def indexing(documents: List[Document]) -> Tuple[InMemoryDocumentStore, InMemoryDocumentStore]:
    splitter = HierarchicalDocumentSplitter(block_sizes={10, 3}, split_overlap=0, split_by="word")
    docs = splitter.run(documents)

    leaf_documents = [doc for doc in docs["documents"] if doc.meta["__level"] == 1]
    leaf_doc_store = InMemoryDocumentStore()
    leaf_doc_store.write_documents(leaf_documents, policy=DuplicatePolicy.OVERWRITE)

    parent_documents = [doc for doc in docs["documents"] if doc.meta["__level"] == 0]
    parent_doc_store = InMemoryDocumentStore()
    parent_doc_store.write_documents(parent_documents, policy=DuplicatePolicy.OVERWRITE)

    return leaf_doc_store, parent_doc_store

# Add documents
docs = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.")
]

leaf_docs, parent_docs = indexing(docs)

prompt_template = [
    ChatMessage.from_system("You are a helpful assistant."),
    ChatMessage.from_user(
        "Given these documents, answer the question.\nDocuments:\n"
        "{% for doc in documents %}{{ doc.content }}{% endfor %}\n"
        "Question: {{question}}\nAnswer:"
    )
]

rag_pipeline = Pipeline()
rag_pipeline.add_component(instance=InMemoryBM25Retriever(document_store=leaf_docs), name="bm25_retriever")
rag_pipeline.add_component(instance=AutoMergingRetriever(parent_docs, threshold=0.6), name="retriever")
rag_pipeline.add_component(instance=ChatPromptBuilder(template=prompt_template, required_variables={"question", "documents"}), name="prompt_builder")
rag_pipeline.add_component(instance=OpenAIChatGenerator(), name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")

rag_pipeline.connect("bm25_retriever.documents", "retriever.documents")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.messages", "llm.messages")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")

question = "How many languages are there?"
result = rag_pipeline.run({
    "bm25_retriever": {"query": question},
    "prompt_builder": {"question": question},
    "answer_builder": {"query": question}
})
