FileTypeRouter

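Use this Router in indexing pipelines to route file paths or byte streams to different outputs based on their file type, so that each type can be processed further by the appropriate component.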


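Most common position in a pipeline	As the first component that preprocesses data, followed by converters.
Mandatory init variables	"mime_types": a list of MIME types or regex patterns used for classification.
Mandatory run variables	"sources": a list of file paths or byte streams to classify.
Output variables	"unclassified": a list of file paths or byte streams that matched none of the configured types. One output per configured MIME type (for example "text/plain", "text/html", "application/pdf", "text/markdown", "audio/x-wav", "image/jpeg"): a list of the file paths or byte streams classified as that type.
API reference	Routers
GitHub link	https://github.com/deepset-ai/haystack/blob/main/haystack/components/routers/file_type_router.py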

Overview

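FileTypeRouter routes file paths or byte streams based on their type (for example, plain text, JPEG image, or WAVE audio). For file paths, it infers the MIME type from the file extension; for byte streams, it determines the MIME type from the metadata provided with the stream.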
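To make the extension-based inference concrete, here is a minimal sketch built on Python's standard `mimetypes` module. It illustrates the idea only and is not FileTypeRouter's actual implementation; the helper name `guess_mime_type` and the file names are hypothetical.

```python
import mimetypes
from pathlib import Path
from typing import Optional


def guess_mime_type(path: str) -> Optional[str]:
    """Return the MIME type implied by the file extension, or None if unknown.

    Hypothetical helper for illustration; not part of Haystack.
    """
    # guess_type only inspects the file name, so the file does not need to exist.
    mime_type, _encoding = mimetypes.guess_type(Path(path).name)
    return mime_type


if __name__ == "__main__":
    for name in ["notes.txt", "report.pdf", "data.unknownext"]:
        print(name, "->", guess_mime_type(name))
```

An extension that the lookup table does not know yields `None`, which corresponds to a source that cannot be assigned to any configured type.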

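When initializing the component, you specify the set of MIME types to route to separate outputs. To do so, set the mime_types parameter to a list of types, for example ["text/plain", "audio/x-wav", "image/jpeg"]. Sources whose type is not listed are routed to an output named "unclassified".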
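The routing rule described above can be sketched with the standard library alone. This is an assumption-laden illustration, not the library's code: the function `route_by_mime_type` is hypothetical, it handles file paths only, and it treats every entry of `mime_types` as a regular expression that must match the whole inferred MIME type.

```python
import mimetypes
import re
from collections import defaultdict
from typing import Dict, List


def route_by_mime_type(sources: List[str], mime_types: List[str]) -> Dict[str, List[str]]:
    """Group file paths under the first configured MIME pattern they match.

    Hypothetical sketch; sources that match nothing land in "unclassified".
    """
    # Assumption: each configured entry is usable as a regex pattern.
    patterns = [(entry, re.compile(entry)) for entry in mime_types]
    routed: Dict[str, List[str]] = defaultdict(list)
    for source in sources:
        guessed, _encoding = mimetypes.guess_type(source)
        target = "unclassified"
        if guessed is not None:
            for entry, pattern in patterns:
                if pattern.fullmatch(guessed):
                    target = entry
                    break
        routed[target].append(source)
    return dict(routed)


if __name__ == "__main__":
    result = route_by_mime_type(
        ["notes.txt", "report.pdf", "photo.jpg"],
        ["text/plain", r"image/.*"],
    )
    print(result)
```

In this sketch each output is keyed by the configured entry itself, so a pattern such as `image/.*` collects every image type under a single key.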

Usage

On its own

Below is an example that uses FileTypeRouter on its own to route two file paths by type.

from haystack.components.routers import FileTypeRouter

# Route only plain-text files; every other type goes to "unclassified".
router = FileTypeRouter(mime_types=["text/plain"])
result = router.run(sources=["text-file-will-be-added.txt", "pdf-will-not-be-added.pdf"])
print(result)

In a pipeline

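Below is an example of a pipeline that uses FileTypeRouter to forward plain-text files to TextFileToDocument, then to DocumentSplitter, and finally to DocumentWriter. Only the contents of plain-text files are added to the InMemoryDocumentStore; the contents of files of any other type are not. As an alternative, you could add a PDF converter such as PyPDFToDocument to the pipeline and use FileTypeRouter to route PDFs to it so that they are converted into documents.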

from haystack import Pipeline
from haystack.components.routers import FileTypeRouter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()
p = Pipeline()
# Register the components: router, converter, splitter, and writer.
p.add_component(instance=FileTypeRouter(mime_types=["text/plain"]), name="file_type_router")
p.add_component(instance=TextFileToDocument(), name="text_file_converter")
p.add_component(instance=DocumentSplitter(), name="splitter")
p.add_component(instance=DocumentWriter(document_store=document_store), name="writer")
# The router exposes one output per configured MIME type; only "text/plain" is connected.
p.connect("file_type_router.text/plain", "text_file_converter.sources")
p.connect("text_file_converter.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")
# The PDF path is routed to "unclassified", which is not connected to anything.
p.run({"file_type_router": {"sources":["text-file-will-be-added.txt", "pdf-will-not-be-added.pdf"]}})
