DocumentLengthRouter
Routes documents to different output connections based on the length of their content field.
| Most common position in a pipeline | Flexible |
| Mandatory run variables | "documents": a list of documents |
| Output variables | "short_documents": a list of documents whose content is None or whose content length is less than or equal to the threshold. "long_documents": a list of documents whose content length is greater than the threshold. |
| API reference | Routers |
| GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/routers/document_length_router.py |
Overview
DocumentLengthRouter routes documents to different output connections based on the length of their content field.
It accepts a threshold init parameter. If a document's content is None, or its length is less than or equal to the threshold, the document is routed to "short_documents". Otherwise, it is routed to "long_documents".
A common use case is using DocumentLengthRouter to process documents obtained from PDF files that contain non-textual content, such as scanned pages or images. The component can detect empty or low-content documents and route them to components that perform OCR, generate captions, or compute image embeddings.
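The routing rule can be sketched in plain Python. This is a simplified illustration of the threshold logic described above, not the component's actual implementation:

```python
# Simplified sketch of the rule DocumentLengthRouter applies to each
# document's content (here plain strings stand in for Document objects).
def route_by_length(contents, threshold):
    short, long_ = [], []
    for content in contents:
        # None content, or content up to the threshold, counts as "short"
        if content is None or len(content) <= threshold:
            short.append(content)
        else:
            long_.append(content)
    return {"short_documents": short, "long_documents": long_}

result = route_by_length([None, "Short", "Long document " * 20], threshold=10)
print(len(result["short_documents"]))  # 2 (the None content and the 5-character string)
print(len(result["long_documents"]))   # 1
```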
Usage
On its own
```python
from haystack.components.routers import DocumentLengthRouter
from haystack.dataclasses import Document

docs = [
    Document(content="Short"),
    Document(content="Long document " * 20),
]

router = DocumentLengthRouter(threshold=10)

result = router.run(documents=docs)
print(result)

# {
#   "short_documents": [Document(content="Short", ...)],
#   "long_documents": [Document(content="Long document ...", ...)],
# }
```
In a pipeline
In the following indexing pipeline, the PyPDFToDocument converter extracts text from PDF files. The documents are then split by page using DocumentSplitter. Next, DocumentLengthRouter routes short documents to LLMDocumentContentExtractor to extract text, which is particularly useful for non-textual, image-based pages. Finally, all documents are collected with DocumentJoiner and written to the Document Store.
```python
from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.extractors.image import LLMDocumentContentExtractor
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.joiners import DocumentJoiner
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.routers import DocumentLengthRouter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

indexing_pipe = Pipeline()

indexing_pipe.add_component("pdf_converter", PyPDFToDocument(store_full_path=True))

# Setting skip_empty_documents=False is important here because the
# LLMDocumentContentExtractor can extract text from non-textual documents
# that would otherwise be skipped.
indexing_pipe.add_component(
    "pdf_splitter",
    DocumentSplitter(split_by="page", split_length=1, skip_empty_documents=False),
)
indexing_pipe.add_component("doc_length_router", DocumentLengthRouter(threshold=10))
indexing_pipe.add_component(
    "content_extractor",
    LLMDocumentContentExtractor(chat_generator=OpenAIChatGenerator(model="gpt-4.1-mini")),
)
indexing_pipe.add_component("doc_joiner", DocumentJoiner(sort_by_score=False))
indexing_pipe.add_component("document_writer", DocumentWriter(document_store=document_store))

indexing_pipe.connect("pdf_converter.documents", "pdf_splitter.documents")
indexing_pipe.connect("pdf_splitter.documents", "doc_length_router.documents")
# The short PDF pages will be enriched/captioned
indexing_pipe.connect("doc_length_router.short_documents", "content_extractor.documents")
indexing_pipe.connect("doc_length_router.long_documents", "doc_joiner.documents")
indexing_pipe.connect("content_extractor.documents", "doc_joiner.documents")
indexing_pipe.connect("doc_joiner.documents", "document_writer.documents")

# Run the indexing pipeline with sources
indexing_result = indexing_pipe.run(
    data={"sources": ["textual_pdf.pdf", "non_textual_pdf.pdf"]}
)

# Inspect the documents
indexed_documents = document_store.filter_documents()
print(f"Indexed {len(indexed_documents)} documents:\n")
for doc in indexed_documents:
    print("file_path: ", doc.meta["file_path"])
    print("page_number: ", doc.meta["page_number"])
    print("content: ", doc.content)
    print("-" * 100 + "\n")

# Indexed 3 documents:
#
# file_path: textual_pdf.pdf
# page_number: 1
# content: A sample PDF file...
# ----------------------------------------------------------------------------------------------------
#
# file_path: textual_pdf.pdf
# page_number: 2
# content: Page 2 of Sample PDF...
# ----------------------------------------------------------------------------------------------------
#
# file_path: non_textual_pdf.pdf
# page_number: 1
# content: Content extracted from non-textual PDF using a LLM...
# ----------------------------------------------------------------------------------------------------
```
