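Most common position in a pipeline	After converters in an indexing pipeline, to extract text from image-based documents
Mandatory init variables	"chat_generator": A ChatGenerator instance that supports vision input "prompt": Instructional text telling the LLM how to extract content (Jinja variables are not allowed)
Mandatory run variables	"documents": A list of documents whose file paths are stored in their metadata
Output variables	"documents": Documents that were processed successfully and now hold the extracted content "failed_documents": Documents that failed processing, with error metadata
API reference	Extractors
GitHub link	https://github.com/deepset-ai/haystack/blob/main/haystack/components/extractors/image/llm_document_content_extractor.py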

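Overview

LLMDocumentContentExtractor uses a vision-enabled large language model (LLM) to extract textual content from image-based documents. This component is particularly useful for processing scanned documents, images that contain text, or PDF pages that need to be converted to searchable text.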

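The component works by:

Converting each input document into an image using the DocumentToImageContent component,
Using a predefined prompt to instruct the LLM on how to extract content,
Processing the image through a vision-capable ChatGenerator to extract structured textual content.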
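
The steps above can be pictured with a small conceptual sketch. This is not Haystack's implementation: `extract_contents`, `fake_to_image`, and `fake_ask_llm` are hypothetical stand-ins for the image conversion and the vision-capable ChatGenerator, and `SimpleNamespace` replaces the real Document class so the sketch runs without any dependencies. It only illustrates how successful and failed documents end up in separate lists.

```python
from types import SimpleNamespace


def extract_contents(documents, prompt, to_image, ask_llm, raise_on_failure=False):
    """Conceptual flow only: convert, prompt the model, separate failures."""
    succeeded, failed = [], []
    for doc in documents:
        try:
            image = to_image(doc)          # step 1: document -> image
            text = ask_llm(prompt, image)  # steps 2 and 3: prompt + image -> text
        except Exception as error:
            if raise_on_failure:
                raise
            # Record the error in the metadata, as the failed_documents output does.
            meta = {**doc.meta, "content_extraction_error": str(error)}
            failed.append(SimpleNamespace(content=doc.content, meta=meta))
            continue
        succeeded.append(SimpleNamespace(content=text, meta=dict(doc.meta)))
    return {"documents": succeeded, "failed_documents": failed}


# Hypothetical stand-ins for the converter and the LLM call.
def fake_to_image(doc):
    if doc.meta["file_path"].endswith(".txt"):
        raise ValueError("unsupported file type")
    return b"image-bytes"


def fake_ask_llm(prompt, image):
    return "extracted text"


docs = [
    SimpleNamespace(content="", meta={"file_path": "scan.jpg"}),
    SimpleNamespace(content="", meta={"file_path": "notes.txt"}),
]
result = extract_contents(docs, "Extract all text.", fake_to_image, fake_ask_llm)
print(len(result["documents"]), len(result["failed_documents"]))
```

Passing the converter and the model call as plain callables keeps the sketch testable without network access; in real use the component handles both steps internally.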

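The prompt must not contain Jinja variables; it should contain only instructions for the LLM. The image data and the prompt are passed to the LLM together as a Chat Message.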
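
Because the prompt must be free of Jinja variables, it can help to check a prompt before handing it to the extractor. The following is a minimal sketch using only the standard library; `find_jinja_markers` is a hypothetical helper, and the regular expression is a simplification that looks for `{{ ... }}` and `{% ... %}` markers rather than the validation the component itself performs.

```python
import re

# Matches Jinja expressions ({{ ... }}) and statements ({% ... %}).
_JINJA_PATTERN = re.compile(r"\{\{.*?\}\}|\{%.*?%\}", re.DOTALL)


def find_jinja_markers(prompt: str) -> list[str]:
    """Return every Jinja expression or statement found in the prompt."""
    return _JINJA_PATTERN.findall(prompt)


plain_prompt = "Extract all text from this document. Format tables as markdown."
templated_prompt = "Extract all text from {{ document.meta.file_path }}."

print(find_jinja_markers(plain_prompt))      # no markers expected
print(find_jinja_markers(templated_prompt))  # one marker expected
```

An empty list means the prompt contains instructions only and is safe to pass as the prompt argument.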

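Documents for which the LLM fails to extract content are returned in a separate failed_documents list. Each of them carries a content_extraction_error entry in its metadata, which you can use for debugging or reprocessing.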
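
When many documents fail, grouping them by error message makes it easier to decide what to retry. The sketch below assumes only that each failed document exposes a `meta` dictionary containing `content_extraction_error`; the helper name `group_failures_by_error`, the `file_path` key, and the sample error strings are illustrative assumptions, and `SimpleNamespace` objects stand in for real failed documents.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_failures_by_error(failed_documents):
    """Map each error message to the file paths that produced it."""
    grouped = defaultdict(list)
    for doc in failed_documents:
        error = doc.meta.get("content_extraction_error", "unknown error")
        grouped[error].append(doc.meta.get("file_path"))
    return dict(grouped)


# Stand-ins for failed documents; the error strings are made up for illustration.
failed = [
    SimpleNamespace(meta={"file_path": "a.jpg", "content_extraction_error": "timeout"}),
    SimpleNamespace(meta={"file_path": "b.jpg", "content_extraction_error": "empty response"}),
    SimpleNamespace(meta={"file_path": "c.pdf", "content_extraction_error": "timeout"}),
]

for error, paths in group_failures_by_error(failed).items():
    print(f"{error}: {paths}")
```

The same function should work on the failed_documents output of the extractor, since it relies only on the `meta` attribute.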

Usage

On its own

Below is an example that uses LLMDocumentContentExtractor to extract text from image-based documents:

from haystack import Document
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.extractors.image import LLMDocumentContentExtractor

# Initialize the chat generator with vision capabilities
chat_generator = OpenAIChatGenerator(
    model="gpt-4o-mini",
    generation_kwargs={"temperature": 0.0}
)

# Create the extractor
extractor = LLMDocumentContentExtractor(
    chat_generator=chat_generator,
    file_path_meta_field="file_path",
    raise_on_failure=False
)

# Create documents with image file paths
documents = [
    Document(content="", meta={"file_path": "image.jpg"}),
    Document(content="", meta={"file_path": "document.pdf", "page_number": 1}),
]

# Run the extractor
result = extractor.run(documents=documents)

# Check results
print(f"Successfully processed: {len(result['documents'])}")
print(f"Failed documents: {len(result['failed_documents'])}")

# Access extracted content
for doc in result["documents"]:
    print(f"File: {doc.meta['file_path']}")
    print(f"Extracted content: {doc.content[:100]}...")

Using a custom prompt

You can provide a custom prompt to instruct the LLM on how to extract content:

from haystack import Document  # needed for the Document objects created below
from haystack.components.extractors.image import LLMDocumentContentExtractor
from haystack.components.generators.chat import OpenAIChatGenerator

custom_prompt = """
Extract all text content from this image-based document.

Instructions:
- Extract text exactly as it appears
- Preserve the reading order
- Format tables as markdown
- Describe any images or diagrams briefly
- Maintain document structure

Document:"""

chat_generator = OpenAIChatGenerator(model="gpt-4o-mini")
extractor = LLMDocumentContentExtractor(
    chat_generator=chat_generator,
    prompt=custom_prompt,
    file_path_meta_field="file_path"
)

documents = [Document(content="", meta={"file_path": "scanned_document.pdf"})]
result = extractor.run(documents=documents)

Handling failed documents

The component provides detailed error information for documents that fail:

from haystack import Document  # needed for the Document objects created below
from haystack.components.extractors.image import LLMDocumentContentExtractor
from haystack.components.generators.chat import OpenAIChatGenerator

chat_generator = OpenAIChatGenerator(model="gpt-4o-mini")
extractor = LLMDocumentContentExtractor(
    chat_generator=chat_generator,
    raise_on_failure=False  # Don't raise exceptions, return failed documents
)

documents = [Document(content="", meta={"file_path": "problematic_image.jpg"})]
result = extractor.run(documents=documents)

# Check for failed documents
for failed_doc in result["failed_documents"]:
    print(f"Failed to process: {failed_doc.meta['file_path']}")
    print(f"Error: {failed_doc.meta['content_extraction_error']}")

In a pipeline

Below is an example of a pipeline that uses LLMDocumentContentExtractor to process image-based documents and store the extracted text:

from haystack import Pipeline
from haystack.components.extractors.image import LLMDocumentContentExtractor
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.dataclasses import Document

# Create document store
document_store = InMemoryDocumentStore()

# Create pipeline
p = Pipeline()
p.add_component(instance=LLMDocumentContentExtractor(
    chat_generator=OpenAIChatGenerator(model="gpt-4o-mini"),
    file_path_meta_field="file_path"
), name="content_extractor")
p.add_component(instance=DocumentSplitter(), name="splitter")
p.add_component(instance=DocumentWriter(document_store=document_store), name="writer")

# Connect components
p.connect("content_extractor.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")

# Create test documents
docs = [
    Document(content="", meta={"file_path": "scanned_document.pdf"}),
    Document(content="", meta={"file_path": "image_with_text.jpg"}),
]

# Run pipeline
result = p.run({"content_extractor": {"documents": docs}})

# Check results
print(f"Successfully processed: {len(result['content_extractor']['documents'])}")
print(f"Failed documents: {len(result['content_extractor']['failed_documents'])}")

# Access documents in the store
stored_docs = document_store.filter_documents()
print(f"Documents in store: {len(stored_docs)}")