MultiFileConverter

Converts CSV, DOCX, HTML, JSON, MD, PPTX, PDF, TXT, and XLSX files to documents.


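Most common position in a pipeline	Before PreProcessors, or at the beginning of an indexing pipeline
Mandatory run variables	"sources": A list of file paths or ByteStream objects
Output variables	"documents": A list of converted documents; "unclassified": A list of unclassified file paths or byte streams
API reference	Converters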
GitHub link	https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/multi_file_converter.py

Overview

MultiFileConverter converts input files of various types into documents.

It is a SuperComponent that combines a FileTypeRouter, nine converters, and a DocumentJoiner into a single component.
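
The route, convert, and join flow described above can be pictured with a small plain-Python sketch. This is not Haystack's implementation: `CONVERTERS`, `convert_all`, and the two toy converter functions are hypothetical names, documents are plain dicts, and MIME types are guessed from file extensions with the standard library's `mimetypes` module. It only illustrates how inputs with no matching converter could end up in a separate "unclassified" list.

```python
import mimetypes

# Hypothetical stand-ins for real converters: each takes a path and returns
# a list of "documents" (plain dicts here, not Haystack Document objects).
def convert_text(path):
    return [{"content": f"<text from {path}>", "meta": {"file_path": path}}]

def convert_pdf(path):
    return [{"content": f"<pdf text from {path}>", "meta": {"file_path": path}}]

# Router table: MIME type -> converter (an assumed mapping for illustration).
CONVERTERS = {
    "text/plain": convert_text,
    "application/pdf": convert_pdf,
}

def convert_all(sources):
    """Route each source by MIME type, convert it, and join the results."""
    documents, unclassified = [], []
    for path in sources:
        mime_type, _ = mimetypes.guess_type(path)
        converter = CONVERTERS.get(mime_type)
        if converter is None:
            unclassified.append(path)  # no converter registered for this type
        else:
            documents.extend(converter(path))  # the "join" step
    return {"documents": documents, "unclassified": unclassified}

result = convert_all(["test.txt", "test.pdf", "archive.unknownext"])
```

Keeping the routing table separate from the converters means a new file type only needs one more table entry, which is roughly the convenience a combined component offers.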

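Parameters

MultiFileConverter has no mandatory initialization parameters. Optionally, you can provide the encoding and json_content_key parameters.

The json_content_key parameter lets you specify which key in the extracted data of a JSON file becomes the document's content. It is passed to the underlying JSONConverter component.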
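
As a rough illustration of what a content key does, the following standard-library sketch picks one key of a parsed JSON object as the document text. The function name `json_to_document` and the choice to keep the remaining keys as metadata are assumptions made for this example; they do not describe JSONConverter's actual behavior.

```python
import json

def json_to_document(raw_json, content_key):
    """Hypothetical helper: use data[content_key] as the document content."""
    data = json.loads(raw_json)
    content = data[content_key]
    # Assumption for this sketch: every other key is kept as metadata.
    meta = {key: value for key, value in data.items() if key != content_key}
    return {"content": content, "meta": meta}

raw = '{"title": "Report", "text": "Quarterly numbers are up."}'
doc = json_to_document(raw, content_key="text")
```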

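The encoding parameter lets you specify the default encoding of TXT, CSV, and MD files. If you don't provide a value, the component uses utf-8 by default. Note that an encoding specified in the metadata of an input ByteStream overrides this parameter. The parameter is passed to the underlying TextFileToDocument and CSVToDocument components.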
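
The precedence rule for encodings can be sketched in a few lines of plain Python. The helpers `resolve_encoding` and `decode_stream` are hypothetical, and a plain dict stands in for the ByteStream metadata; the sketch only shows the described order, where an encoding in the stream's metadata wins over the component-level default.

```python
def resolve_encoding(stream_meta, default_encoding="utf-8"):
    """Prefer the encoding stored in the stream's metadata, if any."""
    return stream_meta.get("encoding", default_encoding)

def decode_stream(data, stream_meta, default_encoding="utf-8"):
    """Decode raw bytes using the resolved encoding."""
    return data.decode(resolve_encoding(stream_meta, default_encoding))

# The metadata names latin-1, so it overrides the utf-8 default.
text_with_meta = decode_stream("café".encode("latin-1"), {"encoding": "latin-1"})

# No encoding in the metadata, so the default applies.
text_default = decode_stream("café".encode("utf-8"), {})
```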

Usage

Install the dependencies for all supported file types to use MultiFileConverter in your indexing pipeline:

pip install pypdf markdown-it-py mdit_plain trafilatura python-pptx python-docx jq openpyxl tabulate pandas

On its own

from haystack.components.converters import MultiFileConverter

converter = MultiFileConverter()
converter.run(sources=["test.txt", "test.pdf"], meta={})

In a pipeline

You can also use MultiFileConverter in your pipeline.

from haystack import Pipeline
from haystack.components.converters import MultiFileConverter
from haystack.components.preprocessors import DocumentPreprocessor
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", MultiFileConverter())
pipeline.add_component("preprocessor", DocumentPreprocessor())
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "preprocessor")
pipeline.connect("preprocessor", "writer")

result = pipeline.run(data={"sources": ["test.txt", "test.pdf"]})

print(result)
# {'writer': {'documents_written': 3}}
