AzureOCRDocumentConverter

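AzureOCRDocumentConverter converts files to documents using Azure's Document Intelligence service. It supports the following file formats: PDF (both searchable and image-only), JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, and HTML.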


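Most common position in a pipeline	Before preprocessors, or right at the beginning of an indexing pipeline
Mandatory init variables	"endpoint": The endpoint of your Azure resource "api_key": The API key of your Azure resource. Can be set with the `AZURE_AI_API_KEY` environment variable.
Mandatory run variables	"sources": A list of file paths or ByteStream objects
Output variables	"documents": A list of documents "raw_azure_response": A list of raw responses from Azure
API reference	Converters
GitHub link	https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/azure.py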

Overview

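AzureOCRDocumentConverter takes a list of file paths or ByteStream objects as input and uses Azure services to convert the files to a list of documents. Optionally, you can attach metadata to the documents through the meta input parameter. To use this integration, you need an active Azure account and a Document Intelligence or Cognitive Services resource. Follow the steps described in the Azure documentation to set up your resource.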
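As a minimal sketch of preparing the `sources` and `meta` inputs, the helper below gathers supported files from a folder and builds one metadata dictionary per file. The helper name `collect_sources`, the `file_name` metadata key, and the suffix set (derived from the format list on this page) are illustrative assumptions, not part of Haystack.

```python
from pathlib import Path

# Assumption: suffixes derived from the supported formats listed on this page.
SUPPORTED_SUFFIXES = {
    ".pdf", ".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff",
    ".docx", ".xlsx", ".pptx", ".html",
}


def collect_sources(folder):
    """Return the supported files in `folder` (sorted) and one metadata dict per file."""
    sources = sorted(
        path
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )
    # Hypothetical metadata: record the original file name for each source.
    meta = [{"file_name": path.name} for path in sources]
    return sources, meta
```

The two lists could then be passed as `converter.run(sources=sources, meta=meta)`; this assumes `meta` accepts one dictionary per source, in the same order as `sources`.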

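By default, the component reads the API key from the AZURE_AI_API_KEY environment variable. Otherwise, you can pass an api_key at initialization; see the code examples below.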

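When you initialize the component, you can optionally set model_id, which refers to the model you want to use. See the Azure documentation for a list of available models. The default model is "prebuilt-read".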
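The following sketch shows one way to combine the environment variable with a non-default model. The helper name `build_converter` and the up-front check are illustrative assumptions; `"prebuilt-layout"` is used only as an example of another Azure model ID, so check the Azure documentation for the models available to your resource.

```python
import os


def build_converter(endpoint, model_id="prebuilt-read"):
    """Create the converter, reading the API key from AZURE_AI_API_KEY."""
    # Fail early with a clear message if the key is not configured.
    if not os.environ.get("AZURE_AI_API_KEY"):
        raise RuntimeError("Set the AZURE_AI_API_KEY environment variable first.")

    # Imported lazily so the check above works even before Haystack is installed.
    from haystack.components.converters import AzureOCRDocumentConverter
    from haystack.utils import Secret

    return AzureOCRDocumentConverter(
        endpoint=endpoint,
        api_key=Secret.from_env_var("AZURE_AI_API_KEY"),
        model_id=model_id,
    )
```

For example, `build_converter("azure_resource_url", model_id="prebuilt-layout")` would request a different model than the default, assuming that model is enabled for your resource.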

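AzureOCRDocumentConverter does not extract tables as plain text. Instead, it generates separate Document objects of type table that preserve the two-dimensional structure of each table.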

Usage

To use AzureOCRDocumentConverter, first install the azure-ai-formrecognizer package:

pip install "azure-ai-formrecognizer>=3.2.0b2"

On its own

from pathlib import Path

from haystack.components.converters import AzureOCRDocumentConverter
from haystack.utils import Secret

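converter = AzureOCRDocumentConverter(
    endpoint="azure_resource_url",  # placeholder: replace with your Azure resource endpoint
    api_key=Secret.from_token("<your-api-key>"),  # placeholder: replace with your API key
)

# Returns a dictionary with the keys "documents" and "raw_azure_response"
result = converter.run(sources=[Path("my_file.pdf")])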
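To inspect what the converter returns, a small helper like the one below can summarize the output dictionary. This is a sketch: the function name `summarize_result` is hypothetical, and it assumes only that the dictionary has the `documents` and `raw_azure_response` keys listed above and that each document exposes a `content` attribute, which may be empty.

```python
def summarize_result(result):
    """Summarize the dictionary returned by the converter's run method."""
    documents = result.get("documents", [])
    raw_responses = result.get("raw_azure_response", [])
    # Use the first 80 characters of each document as a preview;
    # fall back to an empty string when a document has no text content.
    previews = [(getattr(doc, "content", None) or "")[:80] for doc in documents]
    return {
        "n_documents": len(documents),
        "n_raw_responses": len(raw_responses),
        "previews": previews,
    }
```

Calling `summarize_result(converter.run(sources=[Path("my_file.pdf")]))` would give a quick overview before the documents are passed further down a pipeline.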

In a pipeline

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import AzureOCRDocumentConverter
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.utils import Secret

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", AzureOCRDocumentConverter(endpoint="azure_resource_url", api_key=Secret.from_token("<your-api-key>")))
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=5))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "writer")

file_names = ["my_file.pdf"]
pipeline.run({"converter": {"sources": file_names}})
