AutoMergingRetriever
Use the AutoMergingRetriever to improve search results: when multiple related chunks match a query, it returns the complete parent document instead of scattered fragments.
| Most common position in a pipeline | After a primary Retriever component that returns hierarchical documents |
| Mandatory init variables | "document_store": A Document Store from which to retrieve the parent documents |
| Mandatory run variables | "documents": A list of leaf documents matched by a Retriever |
| Output variables | "documents": A list of resulting documents |
| API reference | Retrievers |
| GitHub link | https://github.com/deepset-ai/haystack/blob/dae8c7babaf28d2ffab4f2a8dedecd63e2394fb4/haystack/components/retrievers/auto_merging_retriever.py |
Overview
The AutoMergingRetriever is a component for working with hierarchical document structures. When a given threshold is met, it returns the parent document instead of the individual leaf documents.
This is especially useful when working with paragraphs that have been split into multiple chunks. When several chunks of the same paragraph match your query, the complete paragraph usually provides more context and value than the individual fragments.
Here is how this Retriever works:
- It requires documents to be organized in a tree structure, with the leaf nodes stored in a document index - see the HierarchicalDocumentSplitter documentation.
- At search time, it counts how many leaf documents under the same parent match your query.
- If this count exceeds the threshold you define, it returns the parent document instead of the individual leaf nodes (see the sketch below).
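At its core, the merging decision is a per-parent ratio check. The snippet below is only a rough sketch of that logic, not the component's actual implementation; it assumes each matched leaf carries a parent_id in its meta and each stored parent lists its children_ids, as produced by the HierarchicalDocumentSplitter.
from collections import defaultdict
from typing import Dict, List
from haystack import Document

def merge_sketch(matched_leaves: List[Document], parents_by_id: Dict[str, Document], threshold: float) -> List[Document]:
    # group the matched leaf documents by the id of their parent
    leaves_per_parent: Dict[str, List[Document]] = defaultdict(list)
    for leaf in matched_leaves:
        leaves_per_parent[leaf.meta["parent_id"]].append(leaf)
    results = []
    for parent_id, leaves in leaves_per_parent.items():
        parent = parents_by_id[parent_id]
        # enough of the parent's children matched: return the parent once instead of its leaves
        if len(leaves) / len(parent.meta["children_ids"]) >= threshold:
            results.append(parent)
        # otherwise keep the matched leaves as they are
        else:
            results.extend(leaves)
    return results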
The AutoMergingRetriever can currently be used with the following Document Stores (a minimal initialization sketch follows the list):
- AstraDocumentStore
- ElasticsearchDocumentStore
- OpenSearchDocumentStore
- PgvectorDocumentStore
- QdrantDocumentStore
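For example, initializing the retriever with an OpenSearchDocumentStore could look roughly like this. The host URL and index name are placeholders, the index is assumed to already hold the parent documents written during indexing, and the opensearch-haystack integration must be installed.
from haystack.components.retrievers import AutoMergingRetriever
from haystack_integrations.document_stores.opensearch import OpenSearchDocumentStore

# the store passed to the retriever must contain the parent documents
parent_store = OpenSearchDocumentStore(hosts="http://localhost:9200", index="parent_documents")
retriever = AutoMergingRetriever(document_store=parent_store, threshold=0.5)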
Usage
On its own
from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter
from haystack.components.retrievers.auto_merging_retriever import AutoMergingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
# create a hierarchical document structure with 3 levels, where the parent document has 3 children
text = "The sun rose early in the morning. It cast a warm glow over the trees. Birds began to sing."
original_document = Document(content=text)
builder = HierarchicalDocumentSplitter(block_sizes={10, 3}, split_overlap=0, split_by="word")
docs = builder.run([original_document])
# store level-1 parent documents and initialize the retriever
doc_store_parents = InMemoryDocumentStore()
for doc in docs["documents"]:
    if doc.meta["children_ids"] and doc.meta["level"] == 1:
        doc_store_parents.write_documents([doc])
retriever = AutoMergingRetriever(doc_store_parents, threshold=0.5)
# assume we retrieved 2 leaf docs from the same parent; the parent document should be returned,
# since it has 3 children, the threshold is 0.5, and we retrieved 2 of them (2/3 ≈ 0.67 > 0.5)
leaf_docs = [doc for doc in docs["documents"] if not doc.meta["children_ids"]]
docs = retriever.run(leaf_docs[4:6])
>> {'documents': [Document(id=538...,
>> content: 'warm glow over the trees. Birds began to sing.',
>> meta: {'block_size': 10, 'parent_id': '835...', 'children_ids': ['c17...', '3ff...', '352...'], 'level': 1, 'source_id': '835...',
>> 'page_number': 1, 'split_id': 1, 'split_idx_start': 45})]}
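Conversely, when too few of a parent's children match, the retriever returns the matched leaf documents unchanged. Continuing the example above:
# only 1 of the parent's 3 children is retrieved (1/3 < 0.5), so the single leaf is returned as-is
retriever.run(leaf_docs[4:5])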
In a pipeline
Below is an example of a Haystack RAG pipeline. It first retrieves leaf-level document chunks with BM25, then uses the AutoMergingRetriever to merge them into higher-level parent documents, builds a prompt, and generates an answer with OpenAI's chat model.
from typing import List, Tuple
from haystack import Document, Pipeline
from haystack.components.preprocessors import HierarchicalDocumentSplitter
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.chat_prompt_builder import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.retrievers import AutoMergingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack.dataclasses import ChatMessage
def indexing(documents: List[Document]) -> Tuple[InMemoryDocumentStore, InMemoryDocumentStore]:
    splitter = HierarchicalDocumentSplitter(block_sizes={10, 3}, split_overlap=0, split_by="word")
    docs = splitter.run(documents)
    # store the level-1 chunks, which the BM25 retriever will search over
    leaf_documents = [doc for doc in docs["documents"] if doc.meta["level"] == 1]
    leaf_doc_store = InMemoryDocumentStore()
    leaf_doc_store.write_documents(leaf_documents, policy=DuplicatePolicy.OVERWRITE)
    # store the level-0 parent documents, which the AutoMergingRetriever merges into
    parent_documents = [doc for doc in docs["documents"] if doc.meta["level"] == 0]
    parent_doc_store = InMemoryDocumentStore()
    parent_doc_store.write_documents(parent_documents, policy=DuplicatePolicy.OVERWRITE)
    return leaf_doc_store, parent_doc_store
# Add documents
docs = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.")
]
leaf_docs, parent_docs = indexing(docs)
prompt_template = [
    ChatMessage.from_system("You are a helpful assistant."),
    ChatMessage.from_user(
        "Given these documents, answer the question.\nDocuments:\n"
        "{% for doc in documents %}{{ doc.content }}{% endfor %}\n"
        "Question: {{question}}\nAnswer:"
    )
]
rag_pipeline = Pipeline()
rag_pipeline.add_component(instance=InMemoryBM25Retriever(document_store=leaf_docs), name="bm25_retriever")
rag_pipeline.add_component(instance=AutoMergingRetriever(parent_docs, threshold=0.6), name="retriever")
rag_pipeline.add_component(instance=ChatPromptBuilder(template=prompt_template, required_variables=["question", "documents"]), name="prompt_builder")
rag_pipeline.add_component(instance=OpenAIChatGenerator(), name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
rag_pipeline.connect("bm25_retriever.documents", "retriever.documents")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.messages", "llm.messages")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")
question = "How many languages are there?"
result = rag_pipeline.run({
    "bm25_retriever": {"query": question},
    "prompt_builder": {"question": question},
    "answer_builder": {"query": question}
})
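The AnswerBuilder outputs GeneratedAnswer objects, so the generated text and the (merged) documents that grounded it can be read back roughly like this:
answer = result["answer_builder"]["answers"][0]
print(answer.data)                                # the generated answer text
print([doc.content for doc in answer.documents])  # the documents returned by the AutoMergingRetriever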
