PDFToImageContent

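`PDFToImageContent` reads local PDF files and converts them into `ImageContent` objects. These objects are ready for use in multimodal AI pipelines, including tasks such as image captioning, visual QA, or prompt-based generation.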


Most common position in a pipeline	Before a `ChatPromptBuilder` in a query pipeline
Mandatory run variables	"sources": A list of PDF file paths or ByteStreams
Output variables	"image_contents": A list of ImageContent objects
API reference	Image Converters
GitHub link	https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/image/pdf_to_image.py

Overview

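`PDFToImageContent` processes a list of PDF sources and converts them into `ImageContent` objects, one for each page of each PDF. These objects can be used in multimodal pipelines that require base64-encoded image input.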
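To make the one-object-per-page output concrete, here is a minimal stdlib-only sketch. `PageImage` and `pages_to_images` are hypothetical stand-ins, not Haystack classes, and the page bytes are placeholders rather than real rendered images. The `file_path` and `page_number` metadata keys follow the sample output in the usage section.

```python
import base64
from dataclasses import dataclass, field
from typing import Any


@dataclass
class PageImage:
    """Hypothetical stand-in for ImageContent; not the Haystack class."""
    base64_image: str
    meta: dict[str, Any] = field(default_factory=dict)


def pages_to_images(file_path: str, rendered_pages: list[bytes]) -> list[PageImage]:
    """Wrap each rendered page in one PageImage; page numbers start at 1."""
    return [
        PageImage(
            # Binary image data is carried as a base64-encoded string.
            base64_image=base64.b64encode(page_bytes).decode("utf-8"),
            meta={"file_path": file_path, "page_number": page_number},
        )
        for page_number, page_bytes in enumerate(rendered_pages, start=1)
    ]


# Placeholder bytes stand in for two rendered pages of one PDF.
images = pages_to_images("file.pdf", [b"page-one", b"page-two"])
print(len(images), [img.meta["page_number"] for img in images])
```

A two-page PDF therefore yields two entries, each carrying its own page number in `meta`.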

Each source can be:

A file path (string or `Path`), or
A `ByteStream` object.

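Optionally, you can provide metadata with the `meta` parameter. This can be a single dictionary (applied to all images) or a list whose length matches that of `sources`.

Use the `size` parameter to resize images while preserving their aspect ratio. This reduces memory usage and transfer size, which helps when working with remote models or in resource-constrained environments.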
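The two behaviors described here can be sketched in plain Python. This is an illustration, not the component's implementation: `broadcast_meta` and `fit_within` are hypothetical helpers, and the sketch assumes that `size` is a `(width, height)` bounding box and that images are never upscaled.

```python
from typing import Any, Optional, Union

Meta = dict[str, Any]


def broadcast_meta(meta: Optional[Union[Meta, list[Meta]]], n_sources: int) -> list[Meta]:
    """Return one metadata dict per source (illustrative, not Haystack code)."""
    if meta is None:
        return [{} for _ in range(n_sources)]
    if isinstance(meta, dict):
        # A single dict is copied so every source gets its own instance.
        return [dict(meta) for _ in range(n_sources)]
    if len(meta) != n_sources:
        raise ValueError("meta list must have the same length as sources")
    return [dict(m) for m in meta]


def fit_within(width: int, height: int, size: tuple[int, int]) -> tuple[int, int]:
    """Largest (w, h) that fits inside `size` at the original aspect ratio.

    Assumption: images smaller than `size` are left unchanged (no upscaling).
    """
    max_w, max_h = size
    scale = min(max_w / width, max_h / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(broadcast_meta({"project": "demo"}, 2))
print(fit_within(1700, 2200, (800, 800)))
```

For a portrait page rendered at 1700 x 2200 pixels, an 800 x 800 bound is limited by the height, so the width shrinks by the same factor.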

This component is typically used in a query pipeline, immediately before a `ChatPromptBuilder`.

Usage

On its own

from haystack.components.converters.image import PDFToImageContent

converter = PDFToImageContent()

sources = ["file.pdf", "another_file.pdf"]

image_contents = converter.run(sources=sources)["image_contents"]
print(image_contents)

# [ImageContent(base64_image='...',
#               mime_type='application/pdf',
#               detail=None,
#               meta={'file_path': 'file.pdf', 'page_number': 1}),
#  ...]
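Because the result is a flat list with one entry per page, it can be handy to group pages by source file. The sketch below is one possible way to do that, assuming every result carries `file_path` and `page_number` in its `meta` as in the output above (a `ByteStream` source may not carry a `file_path`). `pages_by_file` is a hypothetical helper, and `SimpleNamespace` objects stand in for real `ImageContent` results.

```python
from collections import defaultdict
from types import SimpleNamespace


def pages_by_file(image_contents):
    """Map each source file path to its sorted list of page numbers."""
    grouped = defaultdict(list)
    for image in image_contents:
        grouped[image.meta["file_path"]].append(image.meta["page_number"])
    return {path: sorted(pages) for path, pages in grouped.items()}


# SimpleNamespace objects stand in for ImageContent results here.
fake_results = [
    SimpleNamespace(meta={"file_path": "file.pdf", "page_number": 1}),
    SimpleNamespace(meta={"file_path": "file.pdf", "page_number": 2}),
    SimpleNamespace(meta={"file_path": "another_file.pdf", "page_number": 1}),
]
print(pages_by_file(fake_results))
# {'file.pdf': [1, 2], 'another_file.pdf': [1]}
```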

In a pipeline

Use `PDFToImageContent` to supply image data to a `ChatPromptBuilder` for multimodal QA or image captioning with an LLM.

from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.converters.image import PDFToImageContent

# Query pipeline
pipeline = Pipeline()
pipeline.add_component("image_converter", PDFToImageContent(detail="auto"))
pipeline.add_component(
    "chat_prompt_builder",
    ChatPromptBuilder(
        required_variables=["question"],
        template="""{% message role="system" %}
You are a helpful assistant that answers questions using the provided images.
{% endmessage %}

{% message role="user" %}
Question: {{ question }}

{% for img in image_contents %}
{{ img | templatize_part }}
{% endfor %}
{% endmessage %}
"""
    )
)
pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini"))

pipeline.connect("image_converter", "chat_prompt_builder.image_contents")
pipeline.connect("chat_prompt_builder", "llm")

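sources = ["flan_paper.pdf"]

result = pipeline.run(
    data={
        # page_range takes a list of 1-based page numbers; only page 9 is converted here
        "image_converter": {"sources": sources, "page_range": [9]},
        "chat_prompt_builder": {"question": "What is the main takeaway of Figure 6?"}
    }
)
# Pipeline outputs are keyed by component name
print(result["llm"]["replies"][0].text)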

# ('The main takeaway of Figure 6 is that Flan-PaLM demonstrates improved '
# 'performance in zero-shot reasoning tasks when utilizing chain-of-thought '
# '(CoT) reasoning, as indicated by higher accuracy across different model '
# 'sizes compared to PaLM without finetuning. This highlights the importance of '
# 'instruction finetuning combined with CoT for enhancing reasoning '
# 'capabilities in models.')
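The `page_range` input in the run above limits conversion to a single page, so only that page is sent to the LLM. As a rough illustration of how such a selection can be expanded, here is a stdlib-only sketch. It assumes `page_range` is a list of 1-based page numbers and range strings such as `"1-3"`; `expand_pages` is a hypothetical helper, not the library's own parser.

```python
from typing import Union


def expand_pages(page_range: list[Union[int, str]]) -> list[int]:
    """Expand entries such as 9, "5" or "1-3" into sorted, unique page numbers."""
    pages: set[int] = set()
    for entry in page_range:
        if isinstance(entry, int):
            pages.add(entry)
            continue
        # "1-3" splits into ("1", "-", "3"); "5" splits into ("5", "", "").
        start, _, end = entry.partition("-")
        first = int(start)
        last = int(end) if end else first
        if first > last:
            raise ValueError(f"Invalid page range: {entry!r}")
        pages.update(range(first, last + 1))
    if any(page < 1 for page in pages):
        raise ValueError("Page numbers start at 1")
    return sorted(pages)


print(expand_pages([9]))
print(expand_pages(["1-3", 5, "8"]))
```

Selecting only the pages that matter keeps the prompt small, since every converted page becomes one base64-encoded image in the request.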

Additional References

🧑‍🍳 Cookbook: Introduction to Multimodality
