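Module azure

AzureOCRDocumentConverter

Converts files to Documents using Azure's Document Intelligence service.

Supported file formats are: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, and HTML.

To use this component, you need an active Azure account and a Document Intelligence or Cognitive Services resource. For help with setting up your resource, see the Azure documentation.

Usage example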

from datetime import datetime

from haystack.components.converters import AzureOCRDocumentConverter
from haystack.utils import Secret

converter = AzureOCRDocumentConverter(endpoint="<url>", api_key=Secret.from_token("<your-api-key>"))
results = converter.run(sources=["path/to/doc_with_images.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'

AzureOCRDocumentConverter.__init__

def __init__(endpoint: str,
             api_key: Secret = Secret.from_env_var("AZURE_AI_API_KEY"),
             model_id: str = "prebuilt-read",
             preceding_context_len: int = 3,
             following_context_len: int = 3,
             merge_multiple_column_headers: bool = True,
             page_layout: Literal["natural", "single_column"] = "natural",
             threshold_y: Optional[float] = 0.05,
             store_full_path: bool = False)

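Creates an AzureOCRDocumentConverter component.

Parameters:

endpoint: The endpoint of your Azure resource.
api_key: The API key of your Azure resource.
model_id: The ID of the model you want to use. For a list of available models, see the Azure documentation (https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature).
preceding_context_len: Number of lines before a table to include as preceding context (added to the metadata).
following_context_len: Number of lines after a table to include as following context (added to the metadata).
merge_multiple_column_headers: If True, merges multiple column header rows into a single row.
page_layout: The type of reading order to follow. Possible options:
natural: Uses the natural reading order determined by Azure.
single_column: Groups all lines with the same height on the page, based on the threshold set in threshold_y.
threshold_y: Only relevant if page_layout is set to single_column. The threshold, in inches, used to decide whether two recognized PDF elements are grouped into a single line. This matters for section headers or numbers that may be separated from the remaining text on the horizontal axis.
store_full_path: If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.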
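The effect of threshold_y can be pictured with a small sketch. It is not the component's actual implementation: the function name, the (x, y, text) tuple layout, and the rule of comparing each element with the first element of the current row are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical element layout: (x, y, text), with coordinates in inches.
Element = Tuple[float, float, str]


def group_lines_by_height(elements: List[Element], threshold_y: float = 0.05) -> List[str]:
    """Merge elements into one line when their y positions differ by at most threshold_y."""
    # Read top to bottom, then left to right.
    ordered = sorted(elements, key=lambda e: (e[1], e[0]))
    rows: List[List[Element]] = []
    for element in ordered:
        # Compare against the first element of the current row.
        if rows and abs(element[1] - rows[-1][0][1]) <= threshold_y:
            rows[-1].append(element)
        else:
            rows.append([element])
    # Inside a row, order the pieces from left to right.
    return [" ".join(text for _, _, text in sorted(row, key=lambda e: e[0])) for row in rows]


elements = [
    (1.0, 2.00, "Introduction"),
    (0.5, 2.03, "1."),
    (0.5, 2.40, "Body text starts here."),
]
lines = group_lines_by_height(elements, threshold_y=0.05)
print(lines)
# ['1. Introduction', 'Body text starts here.']
```

The section number "1." sits 0.03 inches lower than its heading, so a threshold of 0.05 joins them into one line, while the body text further down stays separate.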

AzureOCRDocumentConverter.run

@component.output_types(documents=list[Document],
                        raw_azure_response=list[dict])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[list[dict[str, Any]]] = None)

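Converts a list of files to Documents using Azure's Document Intelligence service.

Parameters:

sources: List of file paths or ByteStream objects.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, its length must match the number of sources, because the two lists are zipped. If sources contains ByteStream objects, their meta is added to the output Documents.

Returns:

A dictionary with the following keys:

documents: List of created Documents.
raw_azure_response: List of raw Azure responses used to create the Documents.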
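The meta rule (one dictionary shared by all sources, or one dictionary per source) can be sketched as below. The helper name normalize_meta and the error message are hypothetical and written only for this illustration; the library's own helper may differ.

```python
from typing import Any, Dict, List, Optional, Union

MetaInput = Optional[Union[Dict[str, Any], List[Dict[str, Any]]]]


def normalize_meta(meta: MetaInput, sources_count: int) -> List[Dict[str, Any]]:
    """Return one metadata dictionary per source (hypothetical helper)."""
    if meta is None:
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # A single dictionary is copied for every source.
        return [dict(meta) for _ in range(sources_count)]
    if len(meta) != sources_count:
        raise ValueError("The length of the metadata list must match the number of sources.")
    return [dict(item) for item in meta]


sources = ["a.pdf", "b.pdf"]
shared = normalize_meta({"project": "demo"}, len(sources))
# Zipping pairs each source with its own metadata dictionary.
paired = list(zip(sources, normalize_meta([{"page": 1}, {"page": 2}], len(sources))))
print(shared)
print(paired)
```

Zipping is the reason the list length has to match: a shorter list would silently leave some sources without metadata.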

AzureOCRDocumentConverter.to_dict

def to_dict() -> dict[str, Any]

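Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

AzureOCRDocumentConverter.from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "AzureOCRDocumentConverter"

Deserializes the component from a dictionary.

Parameters:

data: The dictionary to deserialize from.

Returns:

The deserialized component.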
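A to_dict/from_dict pair is usually exercised as a round trip. The sketch below uses a hypothetical ToyConverter class, and the dictionary layout with "type" and "init_parameters" keys is an assumption for illustration rather than a statement about the exact format this component emits.

```python
from typing import Any, Dict


class ToyConverter:
    """Hypothetical stand-in for a converter component."""

    def __init__(self, encoding: str = "utf-8", store_full_path: bool = False) -> None:
        self.encoding = encoding
        self.store_full_path = store_full_path

    def to_dict(self) -> Dict[str, Any]:
        # Assumed layout: a type identifier plus the constructor arguments.
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {
                "encoding": self.encoding,
                "store_full_path": self.store_full_path,
            },
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ToyConverter":
        # Rebuild the instance from the stored constructor arguments.
        return cls(**data["init_parameters"])


original = ToyConverter(encoding="latin-1", store_full_path=True)
restored = ToyConverter.from_dict(original.to_dict())
print(restored.encoding, restored.store_full_path)
# latin-1 True
```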

Module csv

CSVToDocument

Converts CSV files to Documents.

By default, it uses UTF-8 encoding when converting files, but you can also set a custom encoding. It can attach metadata to the resulting Documents.

Usage example

from datetime import datetime

from haystack.components.converters.csv import CSVToDocument

converter = CSVToDocument()
results = converter.run(sources=["sample.csv"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'col1,col2\nrow1,row1\nrow2,row2\n'

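CSVToDocument.__init__

def __init__(encoding: str = "utf-8", store_full_path: bool = False)

Creates a CSVToDocument component.

Parameters:

encoding: The encoding of the CSV files to convert. If an encoding is specified in the metadata of a source ByteStream, it overrides this value.
store_full_path: If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.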
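As a rough illustration of store_full_path, the sketch below trims a path down to its file name. The helper name and the "file_path" metadata key are assumptions made for this example.

```python
import os
from typing import Any, Dict


def apply_store_full_path(meta: Dict[str, Any], store_full_path: bool) -> Dict[str, Any]:
    """Keep the full path or reduce it to the file name (hypothetical helper)."""
    result = dict(meta)
    file_path = result.get("file_path")  # assumed metadata key
    if file_path is not None and not store_full_path:
        result["file_path"] = os.path.basename(file_path)
    return result


meta = {"file_path": "data/reports/sample.csv"}
print(apply_store_full_path(meta, store_full_path=False))
# {'file_path': 'sample.csv'}
print(apply_store_full_path(meta, store_full_path=True))
# {'file_path': 'data/reports/sample.csv'}
```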

CSVToDocument.run

@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)

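Converts CSV files to Documents.

Parameters:

sources: List of file paths or ByteStream objects.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, its length must match the number of sources, because the two lists are zipped. If sources contains ByteStream objects, their meta is added to the output Documents.

Returns:

A dictionary with the following keys:

documents: Created Documents.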

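Module docx

DOCXMetadata

Describes the metadata of a DOCX file.

Parameters:

author: The author.
category: The category.
comments: The comments.
content_status: The content status.
created: The creation date (ISO-formatted string).
identifier: The identifier.
keywords: Available keywords.
language: The language of the document.
last_modified_by: The user who last modified the document.
last_printed: The date the document was last printed (ISO-formatted string).
modified: The date of the last modification (ISO-formatted string).
revision: The revision number.
subject: The subject.
title: The title.
version: The version.

DOCXTableFormat

Supported formats for storing DOCX tabular data in a Document.

DOCXTableFormat.from_str

@staticmethod
def from_str(string: str) -> "DOCXTableFormat"

Converts a string to a DOCXTableFormat enum.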
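The from_str pattern can be sketched with a standalone enum. TableFormat below is a hypothetical stand-in, and the error message and member values are assumptions; it only shows one common way such a string-to-enum lookup is written.

```python
from enum import Enum


class TableFormat(Enum):
    """Hypothetical enum mirroring the string/enum duality described above."""

    MARKDOWN = "markdown"
    CSV = "csv"

    def __str__(self) -> str:
        return self.value

    @staticmethod
    def from_str(string: str) -> "TableFormat":
        # Map each member's string value back to the member.
        lookup = {member.value: member for member in TableFormat}
        member = lookup.get(string)
        if member is None:
            raise ValueError(f"Unknown table format '{string}'. Supported formats: {list(lookup)}")
        return member


print(TableFormat.from_str("csv"))
# csv
print(TableFormat.from_str("markdown") is TableFormat.MARKDOWN)
# True
```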

DOCXLinkFormat

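Supported formats for storing DOCX link information in a Document.

DOCXLinkFormat.from_str

@staticmethod
def from_str(string: str) -> "DOCXLinkFormat"

Converts a string to a DOCXLinkFormat enum.

DOCXToDocument

Converts DOCX files to Documents.

Uses the python-docx library to convert DOCX files to Documents. This component does not preserve page breaks from the original document.

Usage example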

from datetime import datetime

from haystack.components.converters.docx import DOCXToDocument, DOCXTableFormat, DOCXLinkFormat

converter = DOCXToDocument(table_format=DOCXTableFormat.CSV, link_format=DOCXLinkFormat.MARKDOWN)
results = converter.run(sources=["sample.docx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the DOCX file.'

DOCXToDocument.__init__

def __init__(table_format: Union[str, DOCXTableFormat] = DOCXTableFormat.CSV,
             link_format: Union[str, DOCXLinkFormat] = DOCXLinkFormat.NONE,
             store_full_path: bool = False)

Creates a DOCXToDocument component.

Parameters:

table_format: The format for table output. Can be DOCXTableFormat.MARKDOWN, DOCXTableFormat.CSV, "markdown", or "csv".
link_format: The format for link output. Can be DOCXLinkFormat.MARKDOWN or "markdown" to get [text](address), DOCXLinkFormat.PLAIN or "plain" to get text (address), or DOCXLinkFormat.NONE or "none" to get the text without links.
store_full_path: If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.
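The three link formats can be pictured with a small sketch. The function format_link is hypothetical, and the exact output strings are assumptions: "markdown" is taken to mean the usual [text](address) form and "plain" the form text (address).

```python
def format_link(text: str, address: str, link_format: str = "none") -> str:
    """Render a hyperlink in one of the three styles (illustrative only)."""
    if link_format == "markdown":
        return f"[{text}]({address})"
    if link_format == "plain":
        return f"{text} ({address})"
    if link_format == "none":
        # Keep the visible text and drop the address.
        return text
    raise ValueError(f"Unknown link format: {link_format}")


for style in ("markdown", "plain", "none"):
    print(format_link("Haystack", "https://haystack.deepset.ai", style))
# [Haystack](https://haystack.deepset.ai)
# Haystack (https://haystack.deepset.ai)
# Haystack
```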

DOCXToDocument.to_dict

def to_dict() -> dict[str, Any]

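Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

DOCXToDocument.from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DOCXToDocument"

Deserializes the component from a dictionary.

Parameters:

data: The dictionary to deserialize from.

Returns:

The deserialized component.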

DOCXToDocument.run

@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)

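Converts DOCX files to Documents.

Parameters:

sources: List of file paths or ByteStream objects.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, its length must match the number of sources, because the two lists are zipped. If sources contains ByteStream objects, their meta is added to the output Documents.

Returns:

A dictionary with the following keys:

documents: Created Documents.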

Module html

HTMLToDocument

Converts HTML files to Documents.

Usage example

from haystack.components.converters import HTMLToDocument

converter = HTMLToDocument()
results = converter.run(sources=["path/to/sample.html"])
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the HTML file.'

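HTMLToDocument.__init__

def __init__(extraction_kwargs: Optional[dict[str, Any]] = None,
             store_full_path: bool = False)

Creates an HTMLToDocument component.

Parameters:

extraction_kwargs: A dictionary containing keyword arguments to customize the extraction process. These are passed to the underlying Trafilatura extract function. For the full list of available arguments, see the Trafilatura documentation.
store_full_path: If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.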

HTMLToDocument.to_dict

def to_dict() -> dict[str, Any]

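Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

HTMLToDocument.from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "HTMLToDocument"

Deserializes the component from a dictionary.

Parameters:

data: The dictionary to deserialize from.

Returns:

The deserialized component.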

HTMLToDocument.run

@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None,
        extraction_kwargs: Optional[dict[str, Any]] = None)

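Converts a list of HTML files to Documents.

Parameters:

sources: List of HTML file paths or ByteStream objects.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, its length must match the number of sources, because the two lists are zipped. If sources contains ByteStream objects, their meta is added to the output Documents.
extraction_kwargs: Additional keyword arguments to customize the extraction process.

Returns:

A dictionary with the following keys:

documents: Created Documents.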

Module json

JSONConverter

Converts one or more JSON files into text Documents.

Usage example

import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

source = ByteStream.from_string(json.dumps({"text": "This is the content of my document"}))

converter = JSONConverter(content_key="text")
results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'This is the content of my document'

Optionally, you can also provide a jq_schema string to filter the JSON source files, and extra_meta_fields to extract from the filtered data:

import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

data = {
    "laureates": [
        {
            "firstname": "Enrico",
            "surname": "Fermi",
            "motivation": "for his demonstrations of the existence of new radioactive elements produced "
            "by neutron irradiation, and for his related discovery of nuclear reactions brought about by"
            " slow neutrons",
        },
        {
            "firstname": "Rita",
            "surname": "Levi-Montalcini",
            "motivation": "for their discoveries of growth factors",
        },
    ],
}
source = ByteStream.from_string(json.dumps(data))
converter = JSONConverter(
    jq_schema=".laureates[]", content_key="motivation", extra_meta_fields={"firstname", "surname"}
)

results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'for his demonstrations of the existence of new radioactive elements produced by
# neutron irradiation, and for his related discovery of nuclear reactions brought
# about by slow neutrons'

print(documents[0].meta)
# {'firstname': 'Enrico', 'surname': 'Fermi'}

print(documents[1].content)
# 'for their discoveries of growth factors'

print(documents[1].meta)
# {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}

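JSONConverter.__init__

def __init__(jq_schema: Optional[str] = None,
             content_key: Optional[str] = None,
             extra_meta_fields: Optional[Union[set[str], Literal["*"]]] = None,
             store_full_path: bool = False)

Creates a JSONConverter component.

An optional jq_schema can be provided to extract nested data from the JSON source files. For more information on the filter syntax, see the official jq documentation. If jq_schema is not set, the whole JSON source file is used to extract content.

Optionally, you can provide a content_key to specify which key in the extracted object is set as the Document's content.

If both jq_schema and content_key are set, the component searches for content_key in the JSON object extracted by jq_schema. If the extracted data is not a JSON object, it is skipped.

If only jq_schema is set, the extracted data must be a scalar value. If it's a JSON object or array, it is skipped.

If only content_key is set, the source JSON file must be a JSON object; otherwise it is skipped.

extra_meta_fields can be set either to a set of strings or to the literal string "*". If it's a set of strings, it must name the fields in the extracted objects that are to be set in the resulting Documents. If a field is not found, its meta value is None. If set to "*", all fields found in the filtered JSON object that are not content_key are saved as metadata.

Initialization fails if neither jq_schema nor content_key is set.

Parameters:

jq_schema: Optional jq filter string to extract content. If not specified, the whole JSON object is used to extract information.
content_key: Optional key to extract the Document content. If jq_schema is specified, content_key is extracted from that object.
extra_meta_fields: An optional set of meta keys to extract from the content. If jq_schema is specified, all keys are extracted from that object.
store_full_path: If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.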
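The interplay of content_key and extra_meta_fields can be sketched in plain Python, without jq. The function below is hypothetical and operates on objects that are assumed to be already filtered; skipping objects that lack content_key is an assumption of this sketch.

```python
from typing import Any, Dict, List, Optional, Set, Tuple, Union


def extract_content_and_meta(
    objects: List[Any],
    content_key: str,
    extra_meta_fields: Optional[Union[Set[str], str]] = None,
) -> List[Tuple[Any, Dict[str, Any]]]:
    """Return (content, meta) pairs for every usable JSON object (illustrative only)."""
    extracted = []
    for obj in objects:
        # Anything that is not a JSON object is skipped; skipping objects
        # without content_key is an assumption of this sketch.
        if not isinstance(obj, dict) or content_key not in obj:
            continue
        if extra_meta_fields == "*":
            # Keep every field except the one used as content.
            meta = {key: value for key, value in obj.items() if key != content_key}
        elif extra_meta_fields:
            # Missing fields end up as None.
            meta = {field: obj.get(field) for field in extra_meta_fields}
        else:
            meta = {}
        extracted.append((obj[content_key], meta))
    return extracted


laureates = [
    {"firstname": "Enrico", "surname": "Fermi", "motivation": "slow neutrons"},
    "not a JSON object",
    {"firstname": "Rita", "surname": "Levi-Montalcini", "motivation": "growth factors"},
]
selected = extract_content_and_meta(laureates, "motivation", {"firstname", "birthplace"})
everything = extract_content_and_meta(laureates, "motivation", "*")
print(selected[0])
print(everything[1])
```

Here "birthplace" is absent from the data, so it shows up in the metadata with the value None, while the bare string in the list is skipped.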

JSONConverter.to_dict

def to_dict() -> dict[str, Any]

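Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

JSONConverter.from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "JSONConverter"

Deserializes the component from a dictionary.

Parameters:

data: The dictionary to deserialize from.

Returns:

The deserialized component.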

JSONConverter.run

@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)

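Converts a list of JSON files to Documents.

Parameters:

sources: List of file paths or ByteStream objects.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, its length must match the number of sources. If sources contains ByteStream objects, their meta is added to the output Documents.

Returns:

A dictionary with the following keys:

documents: List of created Documents.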

Module markdown

MarkdownToDocument

Converts Markdown files to text Documents.

Usage example

from haystack.components.converters import MarkdownToDocument
from datetime import datetime

converter = MarkdownToDocument()
results = converter.run(sources=["path/to/sample.md"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the markdown file.'

MarkdownToDocument.__init__

def __init__(table_to_single_line: bool = False,
             progress_bar: bool = True,
             store_full_path: bool = False)

Creates a MarkdownToDocument component.

Parameters:

table_to_single_line: If True, converts table contents into a single line.
progress_bar: If True, shows a progress bar when running.
store_full_path: If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.

MarkdownToDocument.run

@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)

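Converts a list of Markdown files to Documents.

Parameters:

sources: List of file paths or ByteStream objects.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, its length must match the number of sources, because the two lists are zipped. If sources contains ByteStream objects, their meta is added to the output Documents.

Returns:

A dictionary with the following keys:

documents: List of created Documents.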

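Module msg

MSGToDocument

Converts Microsoft Outlook .msg files into Haystack Documents.

This component extracts email metadata (such as sender, recipients, CC, BCC, and subject) and body content from .msg files and converts them into structured Haystack Documents. Additionally, any file attachments in the .msg file are extracted as ByteStream objects.

Usage example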

from haystack.components.converters.msg import MSGToDocument
from datetime import datetime

converter = MSGToDocument()
results = converter.run(sources=["sample.msg"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
attachments = results["attachments"]
print(documents[0].content)

MSGToDocument.__init__

def __init__(store_full_path: bool = False) -> None

Creates an MSGToDocument component.

Parameters:

store_full_path: If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.

MSGToDocument.run

@component.output_types(documents=list[Document], attachments=list[ByteStream])
def run(
    sources: list[Union[str, Path, ByteStream]],
    meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None
) -> dict[str, Union[list[Document], list[ByteStream]]]

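Converts MSG files to Documents.

Parameters:

sources: List of file paths or ByteStream objects.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, its length must match the number of sources, because the two lists are zipped. If sources contains ByteStream objects, their meta is added to the output Documents.

Returns:

A dictionary with the following keys:

documents: Created Documents.
attachments: ByteStream objects created from the file attachments.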

Module multi_file_converter

MultiFileConverter

A file converter that handles conversion of multiple file types.

MultiFileConverter handles the following file types:

CSV
DOCX
HTML
JSON
MD
TEXT
PDF (no OCR)
PPTX
XLSX
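One way to picture handling several file types in a single component is to route each source to a per-type converter first. The sketch below routes by file suffix; this is an assumption made for illustration (the real component may classify sources differently, for example by MIME type), and the mapping and function name are hypothetical.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List

# Hypothetical mapping from file suffix to a converter label.
SUFFIX_TO_CONVERTER = {
    ".csv": "csv",
    ".docx": "docx",
    ".html": "html",
    ".json": "json",
    ".md": "markdown",
    ".txt": "text",
    ".pdf": "pdf",
    ".pptx": "pptx",
    ".xlsx": "xlsx",
}


def route_by_suffix(sources: List[str]) -> Dict[str, List[str]]:
    """Group sources by the converter that would handle them (illustrative only)."""
    routed: Dict[str, List[str]] = defaultdict(list)
    for source in sources:
        label = SUFFIX_TO_CONVERTER.get(Path(source).suffix.lower(), "unclassified")
        routed[label].append(source)
    return dict(routed)


routed = route_by_suffix(["test.txt", "test.PDF", "notes.xyz"])
print(routed)
# {'text': ['test.txt'], 'pdf': ['test.PDF'], 'unclassified': ['notes.xyz']}
```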

Usage example

from haystack.super_components.converters import MultiFileConverter

converter = MultiFileConverter()
converter.run(sources=["test.txt", "test.pdf"], meta={})

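MultiFileConverter.__init__

def __init__(encoding: str = "utf-8",
             json_content_key: str = "content") -> None

Initializes the MultiFileConverter.

Parameters:

encoding: The encoding to use when reading files.
json_content_key: The key to use for the Document content field when converting JSON files.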

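Module openapi_functions

OpenAPIServiceToFunctions

Converts OpenAPI service definitions to a format suitable for OpenAI function calling.

The definition must follow the OpenAPI specification, version 3.0.0 or higher. It can be in JSON or YAML format. Each function must have:

a unique operationId
a description
requestBody and/or parameters
a schema for the requestBody and/or parameters

For more details on the OpenAPI specification, see the official documentation. For more details on OpenAI function calling, see the official documentation.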
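The per-function requirements can be checked before a definition is handed to the converter. The sketch below is a hypothetical pre-check on one operation object taken from an OpenAPI document; it covers the operationId, description, and requestBody/parameters requirements and does not validate schemas or uniqueness across operations.

```python
from typing import Any, Dict, List


def missing_requirements(operation: Dict[str, Any]) -> List[str]:
    """List which of the required pieces an operation lacks (illustrative only)."""
    missing = []
    if not operation.get("operationId"):
        missing.append("operationId")
    if not operation.get("description"):
        missing.append("description")
    # At least one of requestBody and parameters has to be present.
    if not operation.get("requestBody") and not operation.get("parameters"):
        missing.append("requestBody or parameters")
    return missing


complete = {
    "operationId": "search",
    "description": "Search the web",
    "parameters": [{"name": "q", "in": "query", "schema": {"type": "string"}}],
}
incomplete = {"operationId": "ping"}

print(missing_requirements(complete))
# []
print(missing_requirements(incomplete))
# ['description', 'requestBody or parameters']
```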

Usage example

from haystack.components.converters import OpenAPIServiceToFunctions

converter = OpenAPIServiceToFunctions()
result = converter.run(sources=["path/to/openapi_definition.yaml"])
assert result["functions"]

OpenAPIServiceToFunctions.__init__

def __init__()

Creates an OpenAPIServiceToFunctions component.

OpenAPIServiceToFunctions.run

@component.output_types(functions=list[dict[str, Any]],
                        openapi_specs=list[dict[str, Any]])
def run(sources: list[Union[str, Path, ByteStream]]) -> dict[str, Any]

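Converts OpenAPI definitions into the OpenAI function calling format.

Parameters:

sources: File paths or ByteStream objects of OpenAPI definitions (in JSON or YAML format).

Raises:

RuntimeError: If the OpenAPI definitions cannot be downloaded or processed.
ValueError: If the source type is not recognized or no functions are found in the OpenAPI definitions.

Returns:

A dictionary with the following keys:

functions: Function definitions in JSON object format.
openapi_specs: OpenAPI specs in JSON/YAML object format with resolved references.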

Module output_adapter

OutputAdaptationException

Exception raised when an error occurs during output adaptation.

OutputAdapter

Adapts the output of a Component using Jinja templates.

Usage example

from haystack import Document
from haystack.components.converters import OutputAdapter

adapter = OutputAdapter(template="{{ documents[0].content }}", output_type=str)
documents = [Document(content="Test content")]
result = adapter.run(documents=documents)

assert result["output"] == "Test content"

OutputAdapter.__init__

def __init__(template: str,
             output_type: TypeAlias,
             custom_filters: Optional[dict[str, Callable]] = None,
             unsafe: bool = False)

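Creates an OutputAdapter component.

Parameters:

template: A Jinja template that defines how to adapt the input data. The variables in the template define the inputs of this instance. For example, with this template:

{{ documents[0].content }}

the Component input is documents.

output_type: The type of output this instance returns.
custom_filters: A dictionary of custom Jinja filters used in the template.
unsafe: Enables execution of arbitrary code in the Jinja template. Use this only if you trust the source of the template, because it can lead to remote code execution.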

OutputAdapter.run

def run(**kwargs)

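Renders the Jinja template with the provided inputs.

Parameters:

kwargs: Must contain all variables used in the template string.

Raises:

OutputAdaptationException: If template rendering fails.

Returns:

A dictionary with the following keys:

output: The rendered Jinja template.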

OutputAdapter.to_dict

def to_dict() -> dict[str, Any]

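Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

OutputAdapter.from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "OutputAdapter"

Deserializes the component from a dictionary.

Parameters:

data: The dictionary to deserialize from.

Returns:

The deserialized component.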

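Module pdfminer

CID_PATTERN

Regular expression pattern to detect CID characters.

PDFMinerToDocument

Converts PDF files to Documents.

Uses a pdfminer-compatible converter to convert PDF files to Documents. https://pdfminersix.readthedocs.io/en/latest/

Usage example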

from datetime import datetime

from haystack.components.converters.pdfminer import PDFMinerToDocument

converter = PDFMinerToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'

PDFMinerToDocument.__init__

def __init__(line_overlap: float = 0.5,
             char_margin: float = 2.0,
             line_margin: float = 0.5,
             word_margin: float = 0.1,
             boxes_flow: Optional[float] = 0.5,
             detect_vertical: bool = True,
             all_texts: bool = False,
             store_full_path: bool = False) -> None

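Creates a PDFMinerToDocument component.

Parameters:

line_overlap: Determines whether two characters are considered to be on the same line, based on the amount of overlap between them. The overlap is calculated relative to the minimum height of both characters.
char_margin: Determines whether two characters belong to the same line, based on the distance between them. If the distance is less than the specified margin, the characters are considered to be on the same line. The margin is relative to the width of the character.
word_margin: Determines whether two characters on the same line belong to the same word, based on the distance between them. If the distance is greater than the specified margin, an intermediate space is added between them to make the text more readable. The margin is relative to the width of the character.
line_margin: Determines whether two lines are considered part of the same paragraph, based on the distance between them. If the distance is less than the specified margin, the lines are considered part of the same paragraph. The margin is relative to the height of a line.
boxes_flow: Determines the importance of horizontal and vertical position when ordering text boxes. You can set a value between -1.0 and +1.0, where -1.0 means only the horizontal position matters and +1.0 means only the vertical position matters. Setting the value to None disables advanced layout analysis, and text boxes are ordered by the position of their bottom-left corner.
detect_vertical: Determines whether vertical text is considered during layout analysis.
all_texts: Whether layout analysis is performed on text in figures.
store_full_path: If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.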

PDFMinerToDocument.detect_undecoded_cid_characters

def detect_undecoded_cid_characters(text: str) -> dict[str, Any]

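Looks for CID character sequences, that is, characters that haven't been properly decoded from their CID format.

This helps detect cases where the text extractor cannot extract the text correctly, for example when the PDF uses non-standard fonts.

A PDF font can include a ToUnicode map (mapping character codes to Unicode) to support operations such as searching for strings or copy and paste in a PDF viewer. This map directly provides the mapping the text extractor needs. If the map is not available, the text extractor cannot decode the CID characters and returns them as is.

See: https://pdfminersix.readthedocs.io/en/latest/faq.html#why-are-there-cid-x-values-in-the-textual-output

Parameters:

text: The text to check for undecoded CID characters.

Returns:

A dictionary containing the detection results.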
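A minimal sketch of such a check follows. The regular expression, the function name cid_report, and the keys of the returned dictionary are assumptions for illustration; they are not taken from the component's source.

```python
import re
from typing import Any, Dict

# Assumed pattern: undecoded characters appear in the text as "(cid:<number>)".
CID_REGEX = re.compile(r"\(cid:\d+\)")


def cid_report(text: str) -> Dict[str, Any]:
    """Measure how much of the text consists of undecoded CID sequences."""
    matches = CID_REGEX.findall(text)
    total_chars = len(text)
    cid_chars = sum(len(match) for match in matches)
    percentage = (cid_chars / total_chars * 100) if total_chars else 0.0
    return {
        "total_chars": total_chars,
        "cid_chars": cid_chars,
        "percentage": round(percentage, 2),
    }


report = cid_report("(cid:12)(cid:7) ok")
print(report)
# {'total_chars': 18, 'cid_chars': 15, 'percentage': 83.33}
```

A high percentage suggests that the font lacks a usable ToUnicode map and that the extracted text is unlikely to be useful as is.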

PDFMinerToDocument.run

@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)

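Converts PDF files to Documents.

Parameters:

sources: List of PDF file paths or ByteStream objects.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, its length must match the number of sources, because the two lists are zipped. If sources contains ByteStream objects, their meta is added to the output Documents.

Returns:

A dictionary with the following keys:

documents: Created Documents.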

Module pptx

PPTXToDocument

Converts PPTX files to Documents.

Usage example

from datetime import datetime

from haystack.components.converters.pptx import PPTXToDocument

converter = PPTXToDocument()
results = converter.run(sources=["sample.pptx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is the text from the PPTX file.'

PPTXToDocument.__init__

def __init__(store_full_path: bool = False)

Creates a PPTXToDocument component.

Parameters:

store_full_path: If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.

PPTXToDocument.run

@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)

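Converts PPTX files to Documents.

Parameters:

sources: List of file paths or ByteStream objects.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, its length must match the number of sources, because the two lists are zipped. If sources contains ByteStream objects, their meta is added to the output Documents.

Returns:

A dictionary with the following keys:

documents: Created Documents.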

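Module pypdf

PyPDFExtractionMode

The mode to use for extracting text from a PDF.

PyPDFExtractionMode.__str__

def __str__() -> str

Converts a PyPDFExtractionMode enum to a string.

PyPDFExtractionMode.from_str

@staticmethod
def from_str(string: str) -> "PyPDFExtractionMode"

Converts a string to a PyPDFExtractionMode enum.

PyPDFToDocument

Converts PDF files to Documents that your pipeline can query.

This component uses the PyPDF library. You can attach metadata to the resulting Documents.

Usage example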

from datetime import datetime

from haystack.components.converters.pypdf import PyPDFToDocument

converter = PyPDFToDocument()
results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the PDF file.'

PyPDFToDocument.__init__

def __init__(*,
             extraction_mode: Union[
                 str, PyPDFExtractionMode] = PyPDFExtractionMode.PLAIN,
             plain_mode_orientations: tuple = (0, 90, 180, 270),
             plain_mode_space_width: float = 200.0,
             layout_mode_space_vertically: bool = True,
             layout_mode_scale_weight: float = 1.25,
             layout_mode_strip_rotated: bool = True,
             layout_mode_font_height_weight: float = 1.0,
             store_full_path: bool = False)

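Creates a PyPDFToDocument component.

Parameters:

extraction_mode: The mode to use for extracting text from a PDF. Layout mode is an experimental mode that follows the rendered layout of the PDF.
plain_mode_orientations: Tuple of orientations to look for when extracting text in plain mode. Ignored if extraction_mode is PyPDFExtractionMode.LAYOUT.
plain_mode_space_width: Forces a default space width if it is not extracted from the font. Ignored if extraction_mode is PyPDFExtractionMode.LAYOUT.
layout_mode_space_vertically: Whether to include blank lines inferred from y distance + font height. Ignored if extraction_mode is PyPDFExtractionMode.PLAIN.
layout_mode_scale_weight: Multiplier for the string length when calculating the weighted average character width. Ignored if extraction_mode is PyPDFExtractionMode.PLAIN.
layout_mode_strip_rotated: Layout mode does not support rotated text. Set to False to include rotated text anyway. If rotated text is found, the layout is degraded and a warning is logged. Ignored if extraction_mode is PyPDFExtractionMode.PLAIN.
layout_mode_font_height_weight: Multiplier for the font height when calculating the blank line height. Ignored if extraction_mode is PyPDFExtractionMode.PLAIN.
store_full_path: If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.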

PyPDFToDocument.to_dict

def to_dict()

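Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

PyPDFToDocument.from_dict

@classmethod
def from_dict(cls, data)

Deserializes the component from a dictionary.

Parameters:

data: Dictionary with serialized data.

Returns:

The deserialized component.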

PyPDFToDocument.run

@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)

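Converts PDF files to Documents.

Parameters:

sources: List of PDF file paths or ByteStream objects to convert.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, its length must match the number of sources, because they are zipped together. For ByteStream objects, their meta is added to the output Documents.

Returns:

A dictionary with the following keys:

documents: A list of converted Documents.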

Module tika

XHTMLParser

Custom parser to extract pages from Tika XHTML content.

XHTMLParser.handle_starttag

def handle_starttag(tag: str, attrs: list[tuple])

Identifies the start of a page div.

XHTMLParser.handle_endtag

def handle_endtag(tag: str)

Identifies the end of a page div.

XHTMLParser.handle_data

def handle_data(data: str)

Populates the page content.
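How the three handlers cooperate can be sketched with the standard library's HTMLParser. PageParser below is a hypothetical, simplified class written for this illustration: it assumes each page is wrapped in a div with class "page" and that page divs contain no nested div elements.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class PageParser(HTMLParser):
    """Collect the text of every <div class="page"> block (illustrative only)."""

    def __init__(self) -> None:
        super().__init__()
        self.in_page = False
        self.current = ""
        self.pages: List[str] = []

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]) -> None:
        # A page starts at a div whose class attribute is "page".
        if tag == "div" and ("class", "page") in attrs:
            self.in_page = True

    def handle_endtag(self, tag: str) -> None:
        # The closing div ends the page and stores its text.
        if self.in_page and tag == "div":
            self.in_page = False
            self.pages.append(self.current.strip())
            self.current = ""

    def handle_data(self, data: str) -> None:
        # Text is only collected while inside a page.
        if self.in_page:
            self.current += data


parser = PageParser()
parser.feed(
    '<html><body><div class="page"><p>First page</p></div>'
    '<div class="page"><p>Second page</p></div></body></html>'
)
pages = parser.pages
print(pages)
# ['First page', 'Second page']
```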

TikaDocumentConverter

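Converts files of different types to Documents.

This component uses Apache Tika to parse the files and therefore requires a running Tika server. For more options on running Tika, see the official documentation.

Usage example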

from datetime import datetime

from haystack.components.converters.tika import TikaDocumentConverter

converter = TikaDocumentConverter()
results = converter.run(
    sources=["sample.docx", "my_document.rtf", "archive.zip"],
    meta={"date_added": datetime.now().isoformat()}
)
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the docx file.'

TikaDocumentConverter.__init__

def __init__(tika_url: str = "http://localhost:9998/tika",
             store_full_path: bool = False)

Creates a TikaDocumentConverter component.

Parameters:

tika_url: Tika server URL.
store_full_path: If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.

TikaDocumentConverter.run

@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)

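Converts files to Documents.

Parameters:

sources: List of file paths or ByteStream objects.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, its length must match the number of sources, because the two lists are zipped. If sources contains ByteStream objects, their meta is added to the output Documents.

Returns:

A dictionary with the following keys:

documents: Created Documents.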

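Module txt

TextFileToDocument

Converts text files to Documents that your pipeline can query.

By default, it uses UTF-8 encoding when converting files, but you can also set a custom encoding. It can attach metadata to the resulting Documents.

Usage example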

from haystack.components.converters.txt import TextFileToDocument

converter = TextFileToDocument()
results = converter.run(sources=["sample.txt"])
documents = results["documents"]
print(documents[0].content)
# 'This is the content from the txt file.'

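TextFileToDocument.__init__

def __init__(encoding: str = "utf-8", store_full_path: bool = False)

Creates a TextFileToDocument component.

Parameters:

encoding: The encoding of the text files to convert. If an encoding is specified in the metadata of a source ByteStream, it overrides this value.
store_full_path: If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.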
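The precedence between the component's encoding and an encoding stored in a source's metadata can be sketched as follows. The function decode_source and the "encoding" metadata key are assumptions made for this example.

```python
from typing import Any, Dict


def decode_source(data: bytes, meta: Dict[str, Any], default_encoding: str = "utf-8") -> str:
    """Decode bytes, letting an encoding in the metadata win over the default."""
    encoding = meta.get("encoding", default_encoding)  # assumed metadata key
    return data.decode(encoding)


latin_bytes = "café".encode("latin-1")
utf8_bytes = "café".encode("utf-8")

print(decode_source(latin_bytes, {"encoding": "latin-1"}))
# café
print(decode_source(utf8_bytes, {}))
# café
```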

TextFileToDocument.run

@component.output_types(documents=list[Document])
def run(sources: list[Union[str, Path, ByteStream]],
        meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None)

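Converts text files to Documents.

Parameters:

sources: List of text file paths or ByteStream objects to convert.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, its length must match the number of sources, because they are zipped together. For ByteStream objects, their meta is added to the output Documents.

Returns:

A dictionary with the following keys:

documents: A list of converted Documents.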

Module xlsx

XLSXToDocument

Converts XLSX (Excel) files to Documents.

Supports reading data from specific sheets or all sheets in the Excel file. If all sheets are read, a Document is created for each sheet. The content of the Document is the table, which can be saved in CSV or Markdown format.

Usage example

from datetime import datetime

from haystack.components.converters.xlsx import XLSXToDocument

converter = XLSXToDocument()
results = converter.run(sources=["sample.xlsx"], meta={"date_added": datetime.now().isoformat()})
documents = results["documents"]
print(documents[0].content)
# ",A,B\n1,col_a,col_b\n2,1.5,test\n"

XLSXToDocument.__init__

def __init__(table_format: Literal["csv", "markdown"] = "csv",
             sheet_name: Union[str, int, list[Union[str, int]], None] = None,
             read_excel_kwargs: Optional[dict[str, Any]] = None,
             table_format_kwargs: Optional[dict[str, Any]] = None,
             *,
             store_full_path: bool = False)

Creates an XLSXToDocument component.

Parameters:

table_format: The format to convert the Excel file to.
sheet_name: The name of the sheet to read. If None, all sheets are read.
read_excel_kwargs: Additional arguments to pass to pandas.read_excel. See https://pandas.ac.cn/docs/reference/api/pandas.read_excel.html#pandas-read-excel
table_format_kwargs: Additional keyword arguments to pass to the table format function.
If table_format is "csv", these arguments are passed to pandas.DataFrame.to_csv. See https://pandas.ac.cn/docs/reference/api/pandas.DataFrame.to_csv.html#pandas-dataframe-to-csv
If table_format is "markdown", these arguments are passed to pandas.DataFrame.to_markdown. See https://pandas.ac.cn/docs/reference/api/pandas.DataFrame.to_markdown.html#pandas-dataframe-to-markdown
store_full_path: If True, the full path of the file is stored in the metadata of the document. If False, only the file name is stored.

XLSXToDocument.run

@component.output_types(documents=list[Document])
def run(
    sources: list[Union[str, Path, ByteStream]],
    meta: Optional[Union[dict[str, Any], list[dict[str, Any]]]] = None
) -> dict[str, list[Document]]

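Converts XLSX files to Documents.

Parameters:

sources: List of file paths or ByteStream objects.
meta: Optional metadata to attach to the Documents. This value can be either a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced Documents. If it's a list, its length must match the number of sources, because the two lists are zipped. If sources contains ByteStream objects, their meta is added to the output Documents.

Returns:

A dictionary with the following keys:

documents: Created Documents.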
