Module haystack_experimental.components.extractors.llm_document_content_extractor

LLMDocumentContentExtractor

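Extracts textual content from image-based documents using a vision-enabled LLM (Large Language Model).

This component converts each input document into an image with the DocumentToImageContent component, uses a prompt to instruct the LLM on how to extract content, and uses a ChatGenerator to extract structured textual content based on the provided prompt.

The prompt must not contain variables; it should only contain instructions for the LLM. Image data and the prompt are passed together to the LLM as a chat message.

Documents for which the LLM fails to extract content are returned in a separate failed_documents list. These failed documents carry a content_extraction_error entry in their metadata, which can be used for debugging or for reprocessing the documents later.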
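
As a minimal sketch of how the failure metadata might be inspected, the snippet below walks a result shaped like the component's output and collects the error recorded for each failed document. FakeDocument is a hypothetical stand-in for haystack's Document (only content and meta are modelled), summarize_failures is an illustrative helper rather than part of the library, and the sample result, including its error text, is invented for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FakeDocument:
    """Hypothetical stand-in for haystack's Document (content and meta only)."""
    content: str = ""
    meta: Dict[str, Any] = field(default_factory=dict)


def summarize_failures(result: Dict[str, List[FakeDocument]]) -> Dict[str, str]:
    """Map each failed document's file path to its recorded error message."""
    summary: Dict[str, str] = {}
    for doc in result.get("failed_documents", []):
        path = doc.meta.get("file_path", "<unknown>")
        summary[path] = doc.meta.get("content_extraction_error", "<no error recorded>")
    return summary


# Illustrative result; the error text is made up for this example.
result = {
    "documents": [FakeDocument("Extracted text", {"file_path": "image.jpg"})],
    "failed_documents": [
        FakeDocument(
            "",
            {"file_path": "document.pdf",
             "content_extraction_error": "LLM request timed out"},
        ),
    ],
}
failures = summarize_failures(result)
print(failures)
```

A summary like this could feed a retry queue or a log line; the original metadata on each failed document is left untouched.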

Usage example

from haystack import Document
from haystack_experimental.components.generators.chat import OpenAIChatGenerator
from haystack_experimental.components.extractors import LLMDocumentContentExtractor
chat_generator = OpenAIChatGenerator()
extractor = LLMDocumentContentExtractor(chat_generator=chat_generator)
documents = [
    Document(content="", meta={"file_path": "image.jpg"}),
    Document(content="", meta={"file_path": "document.pdf", "page_number": 1}),
]
updated_documents = extractor.run(documents=documents)["documents"]
print(updated_documents)
# [Document(content='Extracted text from image.jpg',
#           meta={'file_path': 'image.jpg'}),
#  ...]

LLMDocumentContentExtractor.__init__

def __init__(*,
             chat_generator: ChatGenerator,
             prompt: str = DEFAULT_PROMPT_TEMPLATE,
             file_path_meta_field: str = "file_path",
             root_path: Optional[str] = None,
             detail: Optional[Literal["auto", "high", "low"]] = None,
             size: Optional[Tuple[int, int]] = None,
             raise_on_failure: bool = False,
             max_workers: int = 3)

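Initializes the LLMDocumentContentExtractor component.

Parameters:

chat_generator: A ChatGenerator instance representing the LLM used to extract text. This generator must support vision-based input and return a plain-text response. Currently, the experimental versions of OpenAIChatGenerator and AmazonBedrockChatGenerator are supported.
prompt: Instructional text provided to the LLM. It must not contain Jinja variables. The prompt should only contain instructions on how to extract the content of the image-based document.
file_path_meta_field: The metadata field in the Document that contains the file path to the image or PDF.
root_path: The root directory path where document files are stored. If provided, file paths in document metadata are resolved relative to this path. If None, file paths are treated as absolute paths.
detail: Optional detail level of the image (only supported by OpenAI). Can be "auto", "high", or "low". This parameter is passed to the chat_generator when processing the images.
size: If provided, resizes the image to fit within the specified dimensions (width, height) while maintaining the aspect ratio. This reduces file size, memory usage, and processing time, which is useful when working with models that have resolution constraints or when transmitting images to remote services.
raise_on_failure: If True, exceptions from the LLM are raised. If False, failed documents are logged and returned.
max_workers: Maximum number of threads used to parallelize LLM calls across documents using a ThreadPoolExecutor.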
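
The root_path and size descriptions above can be illustrated with a small sketch. The helpers resolve_file_path and fit_within are hypothetical names written for this example and are not the component's actual implementation; in particular, leaving images that already fit inside the bounds unchanged is an assumption, since the description only says the image is resized to fit.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_file_path(file_path: str, root_path: Optional[str] = None) -> Path:
    """Join file_path onto root_path when one is given; otherwise use it as-is."""
    return Path(root_path) / file_path if root_path else Path(file_path)


def fit_within(original: Tuple[int, int], bounds: Tuple[int, int]) -> Tuple[int, int]:
    """Scale (width, height) to fit inside bounds while keeping the aspect ratio.

    Assumption: images already smaller than the bounds are left unchanged.
    """
    width, height = original
    max_width, max_height = bounds
    # The smaller of the two ratios guarantees both sides fit; cap at 1.0 to avoid upscaling.
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(resolve_file_path("scans/page1.png", root_path="/data"))
print(fit_within((2000, 1000), (800, 800)))
```

Using the smaller of the two scale factors is what preserves the aspect ratio: both sides shrink by the same factor, and the tighter constraint decides that factor.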

LLMDocumentContentExtractor.warm_up

def warm_up()

Warms up the ChatGenerator if it has a warm_up method.
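
The conditional warm-up described here could look roughly like the sketch below. Both generator classes and the warm_up_if_supported helper are hypothetical and exist only to show the check; the component's own code may differ.

```python
class WarmableGenerator:
    """Hypothetical generator that needs a warm-up step."""

    def __init__(self) -> None:
        self.warmed_up = False

    def warm_up(self) -> None:
        self.warmed_up = True


class PlainGenerator:
    """Hypothetical generator without a warm_up method."""


def warm_up_if_supported(generator: object) -> bool:
    """Call generator.warm_up() if it exists; report whether it was called."""
    warm_up = getattr(generator, "warm_up", None)
    if callable(warm_up):
        warm_up()
        return True
    return False


warmable = WarmableGenerator()
called_for_warmable = warm_up_if_supported(warmable)
called_for_plain = warm_up_if_supported(PlainGenerator())
print(called_for_warmable, called_for_plain)
```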

LLMDocumentContentExtractor.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

LLMDocumentContentExtractor.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "LLMDocumentContentExtractor"

Deserializes the component from a dictionary.

Parameters:

data: Dictionary with serialized data.

Returns:

An instance of the component.

LLMDocumentContentExtractor.run

@component.output_types(documents=List[Document],
                        failed_documents=List[Document])
def run(documents: List[Document]) -> Dict[str, List[Document]]

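Runs content extraction on a list of image-based documents using a vision-enabled LLM.

Each document is passed to the LLM together with the predefined prompt. The response is used to update the document's content. If extraction fails, the document is returned in the failed_documents list with metadata describing the failure.

Parameters:

documents: A list of image-based documents to process. Each document must have a valid file path in its metadata.

Returns:

A dictionary with:

"documents": Successfully processed documents, updated with the extracted content.
"failed_documents": Documents that failed processing, annotated with failure metadata.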

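The run behavior described above (parallel calls across documents, with failures separated and annotated) can be sketched as follows. This is an illustration under stated assumptions, not the component's implementation: documents are modelled as plain dictionaries instead of haystack Document objects, and extract_all and fake_extract are hypothetical names. Only ThreadPoolExecutor, max_workers, and the content_extraction_error key come from the description.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Optional, Tuple

Doc = Dict[str, Any]  # stand-in for a Document: {"content": str, "meta": dict}


def extract_all(
    documents: List[Doc],
    extract: Callable[[Doc], str],
    max_workers: int = 3,
) -> Dict[str, List[Doc]]:
    """Run extract on every document in parallel and split successes from failures."""

    def process(doc: Doc) -> Tuple[Doc, Optional[str]]:
        try:
            return {**doc, "content": extract(doc)}, None
        except Exception as exc:  # record the error instead of raising
            return doc, str(exc)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outcomes = list(pool.map(process, documents))  # map preserves input order

    succeeded: List[Doc] = []
    failed: List[Doc] = []
    for doc, error in outcomes:
        if error is None:
            succeeded.append(doc)
        else:
            meta = {**doc["meta"], "content_extraction_error": error}
            failed.append({**doc, "meta": meta})
    return {"documents": succeeded, "failed_documents": failed}


def fake_extract(doc: Doc) -> str:
    """Hypothetical extractor that fails for PDFs to exercise the failure path."""
    path = doc["meta"]["file_path"]
    if path.endswith(".pdf"):
        raise ValueError(f"could not read {path}")
    return f"text from {path}"


sample_docs = [
    {"content": "", "meta": {"file_path": "image.jpg"}},
    {"content": "", "meta": {"file_path": "document.pdf"}},
]
results = extract_all(sample_docs, fake_extract)
print(results)
```

Catching the exception inside the worker keeps one bad document from aborting the whole batch, which mirrors the raise_on_failure=False behavior; re-raising in process would correspond to raise_on_failure=True.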