Module csv_document_cleaner

CSVDocumentCleaner

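A component for cleaning CSV documents by removing empty rows and columns.

This component processes CSV content stored in Documents. It can ignore a specified number of rows and columns before cleaning, lets you keep the original document ID, and controls whether empty rows and columns are removed.

CSVDocumentCleaner.__init__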

def __init__(*,
             ignore_rows: int = 0,
             ignore_columns: int = 0,
             remove_empty_rows: bool = True,
             remove_empty_columns: bool = True,
             keep_id: bool = False) -> None

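Initializes the CSVDocumentCleaner component.

Parameters:

ignore_rows: Number of rows to ignore from the top of the CSV table before processing.
ignore_columns: Number of columns to ignore from the left of the CSV table before processing.
remove_empty_rows: Whether to remove rows that are entirely empty.
remove_empty_columns: Whether to remove columns that are entirely empty.
keep_id: Whether to retain the original document ID in the output document.

Rows and columns ignored using these parameters are preserved in the final output, meaning they are not considered when removing empty rows and columns.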

CSVDocumentCleaner.run

@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]

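Cleans CSV documents by removing empty rows and columns while preserving the specified ignored rows and columns.

Parameters:

documents: List of Documents containing CSV-formatted content.

Returns:

A dictionary with a list of cleaned Documents under the key "documents".

Processing steps:

Reads each document's content as a CSV table.
Retains the specified number of ignore_rows from the top and ignore_columns from the left.
Drops any rows and columns that are entirely empty (if enabled by remove_empty_rows and remove_empty_columns).
Reattaches the ignored rows and columns to maintain their original positions.
Returns the cleaned CSV content as a new Document object, with an option to retain the original document ID.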
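
The processing steps above can be sketched with the standard library alone. This is only an illustration of the idea, not the component's implementation: the helper name clean_csv is hypothetical, and treating whitespace-only cells as empty is an assumption made here.

```python
import csv
import io


def clean_csv(content: str, ignore_rows: int = 0, ignore_columns: int = 0) -> str:
    """Drop fully empty rows and columns while leaving the ignored area untouched."""
    rows = list(csv.reader(io.StringIO(content)))
    if not rows:
        return content
    width = max(len(r) for r in rows)
    rows = [r + [""] * (width - len(r)) for r in rows]  # pad ragged rows

    # Keep a row if it is ignored or has a non-empty cell outside the ignored columns.
    keep_rows = [
        i for i, r in enumerate(rows)
        if i < ignore_rows or any(c.strip() for c in r[ignore_columns:])
    ]
    # Keep a column if it is ignored or has a non-empty cell outside the ignored rows.
    keep_cols = [
        j for j in range(width)
        if j < ignore_columns
        or any(rows[i][j].strip() for i in range(ignore_rows, len(rows)))
    ]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for i in keep_rows:
        writer.writerow([rows[i][j] for j in keep_cols])
    return out.getvalue()


raw = "title,,\n,,\na,,b\n,,\nc,,d\n"
cleaned = clean_csv(raw, ignore_rows=1)
print(cleaned)
```

The first row survives because it is ignored, even though most of its cells are empty; the empty middle column and the two blank rows are dropped.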

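Module csv_document_splitter

CSVDocumentSplitter

A component for splitting CSV documents into sub-tables based on split arguments.

The splitter supports two modes of operation:

Identify consecutive empty rows or columns that exceed a given threshold and use them as delimiters to segment the document into smaller tables.
Split each row into a separate sub-table, represented as a Document.

CSVDocumentSplitter.__init__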

def __init__(row_split_threshold: Optional[int] = 2,
             column_split_threshold: Optional[int] = 2,
             read_csv_kwargs: Optional[dict[str, Any]] = None,
             split_mode: SplitMode = "threshold") -> None

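Initializes the CSVDocumentSplitter component.

Parameters:

row_split_threshold: The minimum number of consecutive empty rows required to trigger a split.
column_split_threshold: The minimum number of consecutive empty columns required to trigger a split.
read_csv_kwargs: Additional keyword arguments to pass to pandas.read_csv. By default, the component uses these options:
header=None
skip_blank_lines=False to preserve blank lines
dtype=object to prevent type inference (for example, converting numbers to floats).
For more information, see https://pandas.ac.cn/docs/reference/api/pandas.read_csv.html.
split_mode: If set to threshold, the component splits the document based on the number of consecutive empty rows or columns that exceed row_split_threshold or column_split_threshold. If set to row-wise, the component splits each row into a separate sub-table.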

CSVDocumentSplitter.run

@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]

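Processes and splits a list of CSV documents into multiple sub-tables.

Splitting process:

Applies a row-based split if row_split_threshold is provided.
Applies a column-based split if column_split_threshold is provided.
If both thresholds are specified, performs a recursive split by rows first, then by columns, ensuring further fragmentation of any sub-tables that still contain empty sections.
Sorts the resulting sub-tables based on their original positions within the document.

Parameters:

documents: A list of Documents containing CSV-formatted content. Each document is assumed to contain one or more tables separated by empty rows or columns.

Returns:

A dictionary with the key "documents", mapping to a list of new Document objects, each representing a sub-table extracted from the original CSV. The metadata of each document includes:

A field source_id to track the original document.
A field row_idx_start to indicate the starting row index of the sub-table in the original table.
A field col_idx_start to indicate the starting column index of the sub-table in the original table.
A field split_id to indicate the order of the split in the original document.
All other metadata copied from the original document.

If a document cannot be processed, it is returned unchanged. The meta field from the original document is preserved in the split documents.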
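
To make the threshold mode concrete, here is a minimal row-only sketch using the standard library. It is not the component's code: the function name split_rows_on_empty_runs and the returned dictionary shape are hypothetical, and column splitting is left out for brevity. A run of empty rows triggers a split when it is at least row_split_threshold long, following the parameter description.

```python
import csv
import io


def split_rows_on_empty_runs(content: str, row_split_threshold: int = 2) -> list[dict]:
    """Split a CSV table wherever enough consecutive empty rows appear."""
    rows = list(csv.reader(io.StringIO(content)))
    tables: list[dict] = []
    current: list[list[str]] = []   # rows of the sub-table being built
    pending: list[list[str]] = []   # empty rows seen since the last non-empty row
    start = 0                       # original row index where `current` begins

    for idx, row in enumerate(rows):
        if not any(cell.strip() for cell in row):
            pending.append(row)
            continue
        if current and len(pending) >= row_split_threshold:
            # The run of empty rows is long enough: close the current sub-table.
            tables.append({"row_idx_start": start, "rows": current})
            current = []
        elif current:
            # Short run: the empty rows stay inside the current sub-table.
            current.extend(pending)
        if not current:
            start = idx
        current.append(row)
        pending = []

    if current:
        tables.append({"row_idx_start": start, "rows": current})
    return tables


demo = "a,b\nc,d\n,\n,\ne,f\n,\ng,h\n"
tables = split_rows_on_empty_runs(demo)
for table in tables:
    print(table["row_idx_start"], table["rows"])
```

The two empty rows after "c,d" reach the threshold and start a new sub-table at row 4, while the single empty row before "g,h" stays inside that sub-table.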
Module document_cleaner

DocumentCleaner

Cleans up text documents.

It removes extra whitespace, empty lines, specified substrings, regex matches, and page headers and footers, in this order.

Usage example

from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

doc = Document(content="This   is  a  document  to  clean\n\n\nsubstring to remove")

cleaner = DocumentCleaner(remove_substrings=["substring to remove"])
result = cleaner.run(documents=[doc])

assert result["documents"][0].content == "This is a document to clean "

DocumentCleaner.__init__

Initializes DocumentCleaner.

def __init__(remove_empty_lines: bool = True,
             remove_extra_whitespaces: bool = True,
             remove_repeated_substrings: bool = False,
             keep_id: bool = False,
             remove_substrings: Optional[list[str]] = None,
             remove_regex: Optional[str] = None,
             unicode_normalization: Optional[Literal["NFC", "NFKC", "NFD",
                                                     "NFKD"]] = None,
             ascii_only: bool = False)

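Parameters:

remove_empty_lines: If True, removes empty lines.
remove_extra_whitespaces: If True, removes extra whitespace.
remove_repeated_substrings: If True, removes repeated substrings (headers and footers) from pages. Pages must be separated by a form feed character "\f", which is supported by TextFileToDocument and AzureOCRDocumentConverter.
remove_substrings: List of substrings to remove from the text.
remove_regex: Regex to match and replace substrings by "".
keep_id: If True, keeps the IDs of the original documents.
unicode_normalization: Unicode normalization form to apply to the text. Note: This runs before any other step.
ascii_only: Whether to convert the text to ASCII only. Removes accents from characters and replaces them with ASCII characters. Other non-ASCII characters are removed. Note: This runs before any pattern matching or removal.

DocumentCleaner.run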

Cleans up the documents.

@component.output_types(documents=list[Document])
def run(documents: list[Document])

Parameters:

documents: List of Documents to clean.

Raises:

TypeError: If documents is not a list of Documents.

Returns:

A dictionary with the following key:

documents: List of cleaned Documents.
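
The unicode_normalization and ascii_only options can be approximated with Python's unicodedata module. The sketch below is an assumption about one reasonable way to get the described behavior, not a copy of the component's code; the helper name to_ascii_only is hypothetical.

```python
import unicodedata


def to_ascii_only(text: str) -> str:
    """Strip accents and drop any remaining non-ASCII characters."""
    # NFKD decomposes accented characters into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" drops the combining marks and other non-ASCII characters.
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


# NFC composes "e" followed by a combining acute accent into a single character.
composed = unicodedata.normalize("NFC", "e\u0301")

print(to_ascii_only("Crème brûlée"))
print(len(composed))
```

Running normalization before pattern matching, as the parameter notes describe, means that remove_substrings and remove_regex see one consistent form of each character.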

Module document_preprocessor

DocumentPreprocessor

A SuperComponent that first splits and then cleans documents.

This component consists of a DocumentSplitter followed by a DocumentCleaner in a single pipeline. It takes a list of documents as input and returns a processed list of documents.

Usage example

from haystack import Document
from haystack.components.preprocessors import DocumentPreprocessor

doc = Document(content="I love pizza!")
preprocessor = DocumentPreprocessor()
result = preprocessor.run(documents=[doc])
print(result["documents"])

DocumentPreprocessor.__init__

Initializes a DocumentPreprocessor that first splits and then cleans documents.

def __init__(*,
             split_by: Literal["function", "page", "passage", "period", "word",
                               "line", "sentence"] = "word",
             split_length: int = 250,
             split_overlap: int = 0,
             split_threshold: int = 0,
             splitting_function: Optional[Callable[[str], list[str]]] = None,
             respect_sentence_boundary: bool = False,
             language: Language = "en",
             use_split_rules: bool = True,
             extend_abbreviations: bool = True,
             remove_empty_lines: bool = True,
             remove_extra_whitespaces: bool = True,
             remove_repeated_substrings: bool = False,
             keep_id: bool = False,
             remove_substrings: Optional[list[str]] = None,
             remove_regex: Optional[str] = None,
             unicode_normalization: Optional[Literal["NFC", "NFKC", "NFD",
                                                     "NFKD"]] = None,
             ascii_only: bool = False) -> None

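Parameters:

Splitter parameters

split_by: The unit of splitting: "function", "page", "passage", "period", "word", "line", or "sentence".
split_length: The maximum number of units (words, lines, pages, and so on) in each split.
split_overlap: The number of overlapping units between consecutive splits.
split_threshold: The minimum number of units per split. If a split is smaller than this, it is merged with the previous split.
splitting_function: A custom function for splitting, used if split_by="function".
respect_sentence_boundary: If True, splits by words but tries not to break inside a sentence.
language: Language used by the sentence tokenizer if split_by="sentence" or respect_sentence_boundary=True.
use_split_rules: Whether to apply additional splitting heuristics for the sentence tokenizer.
extend_abbreviations: Whether to extend the sentence tokenizer with curated abbreviations for certain languages.

Cleaner parameters

remove_empty_lines: If True, removes empty lines.
remove_extra_whitespaces: If True, removes extra whitespace.
remove_repeated_substrings: If True, removes repeated substrings, such as headers and footers, across pages.
keep_id: If True, keeps the original document IDs.
remove_substrings: A list of strings to remove from the document content.
remove_regex: A regex pattern whose matches are removed from the document content.
unicode_normalization: Unicode normalization form to apply to the text, for example "NFC".
ascii_only: If True, converts text to ASCII only.

DocumentPreprocessor.to_dict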

Serializes the SuperComponent to a dictionary.

def to_dict() -> dict[str, Any]

Returns:

Dictionary with serialized data.

DocumentPreprocessor.from_dict

Deserializes the SuperComponent from a dictionary.

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DocumentPreprocessor"

Parameters:

data: Dictionary to deserialize from.

Returns:

Deserialized SuperComponent.

Module document_splitter

DocumentSplitter

Splits long documents into smaller chunks.

This is a common preprocessing step during indexing. It helps Embedders create meaningful semantic representations and prevents exceeding language model context limits.

The DocumentSplitter is compatible with the following DocumentStores:

Astra
Chroma (limited support; overlapping information is not stored)
Elasticsearch
OpenSearch
Pgvector
Pinecone (limited support; overlapping information is not stored)
Qdrant
Weaviate

Usage example

from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

doc = Document(content="Moonlight shimmered softly, wolves howled nearby, night enveloped everything.")

splitter = DocumentSplitter(split_by="word", split_length=3, split_overlap=0)
result = splitter.run(documents=[doc])

DocumentSplitter.__init__

Initializes DocumentSplitter.

def __init__(split_by: Literal["function", "page", "passage", "period", "word",
                               "line", "sentence"] = "word",
             split_length: int = 200,
             split_overlap: int = 0,
             split_threshold: int = 0,
             splitting_function: Optional[Callable[[str], list[str]]] = None,
             respect_sentence_boundary: bool = False,
             language: Language = "en",
             use_split_rules: bool = True,
             extend_abbreviations: bool = True,
             *,
             skip_empty_documents: bool = True)

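Parameters:

split_by: The unit for splitting your documents. Choose from:
word for splitting by spaces (" ")
period for splitting by periods (".")
page for splitting by form feed ("\f")
passage for splitting by double line breaks ("\n\n")
line for splitting each line ("\n")
sentence for splitting by the NLTK sentence tokenizer
split_length: The maximum number of units in each split.
split_overlap: The number of overlapping units for each split.
split_threshold: The minimum number of units per split. If a split has fewer units than the threshold, it is attached to the previous split.
splitting_function: Necessary when split_by is set to "function". This is a function that must accept a single str as input and return a list of str as output, representing the chunks after splitting.
respect_sentence_boundary: Choose whether to respect sentence boundaries when splitting by "word". If True, uses NLTK to detect sentence boundaries, ensuring splits occur only between sentences.
language: Choose the language for the NLTK tokenizer. The default is English ("en").
use_split_rules: Choose whether to use additional split rules when splitting by sentence.
extend_abbreviations: Choose whether to extend NLTK's PunktTokenizer abbreviations with a list of curated abbreviations, if available. This is currently supported for English ("en") and German ("de").
skip_empty_documents: Choose whether to skip documents with empty content. The default is True. Set to False when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text from non-textual documents.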

DocumentSplitter.warm_up

def warm_up()

Warms up the DocumentSplitter by loading the sentence tokenizer.

DocumentSplitter.run

@component.output_types(documents=list[Document])
def run(documents: list[Document])

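Splits documents into smaller parts.

Splits documents by the unit expressed in split_by, with a length of split_length and an overlap of split_overlap.

Parameters:

documents: The documents to split.

Raises:

TypeError: If the input is not a list of Documents.
ValueError: If the content of a document is None.

Returns:

A dictionary with the following key:

documents: List of Documents with the split texts. Each document includes:
A metadata field source_id to track the original document.
A metadata field page_number to track the original page number.
All other metadata copied from the original document.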
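
The interaction of split_length, split_overlap, and split_threshold can be sketched in plain Python for the "word" unit. The function split_words is a hypothetical helper, not part of the library, and it simplifies one detail: it joins words with single spaces instead of preserving the original whitespace.

```python
def split_words(text: str, split_length: int, split_overlap: int = 0,
                split_threshold: int = 0) -> list[str]:
    """Group words into windows of split_length that share split_overlap words."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap
    chunks = [words[i:i + split_length] for i in range(0, len(words), step)]
    # Attach a trailing chunk that is below the threshold to the previous chunk.
    if len(chunks) > 1 and len(chunks[-1]) < split_threshold:
        tail = chunks.pop()
        # The first split_overlap words of the tail are already in the previous chunk.
        chunks[-1].extend(tail[split_overlap:])
    return [" ".join(chunk) for chunk in chunks]


sentence = "Moonlight shimmered softly, wolves howled nearby, night enveloped everything."
parts = split_words(sentence, split_length=3)
print(parts)
```

With split_length=3 and no overlap, the nine words of the example sentence fall into three chunks of three words each.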

DocumentSplitter.to_dict

def to_dict() -> dict[str, Any]

Serializes the component to a dictionary.

DocumentSplitter.from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DocumentSplitter"

Deserializes the component from a dictionary.

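Module hierarchical_document_splitter

HierarchicalDocumentSplitter

Splits a document into blocks of different sizes, building a hierarchical tree structure of blocks.

The root node of the tree is the original document, and the leaf nodes are the smallest blocks. The blocks in between are connected such that each smaller block is a child of its larger parent block.

Usage example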

from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter

doc = Document(content="This is a simple test document")
splitter = HierarchicalDocumentSplitter(block_sizes={3, 2}, split_overlap=0, split_by="word")
splitter.run([doc])
>> {'documents': [Document(id=3f7..., content: 'This is a simple test document', meta: {'block_size': 0, 'parent_id': None, 'children_ids': ['5ff..', '8dc..'], 'level': 0}),
>> Document(id=5ff.., content: 'This is a ', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['f19..', '52c..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=8dc.., content: 'simple test document', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['39d..', 'e23..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 10}),
>> Document(id=f19.., content: 'This is ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=52c.., content: 'a ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 8}),
>> Document(id=39d.., content: 'simple test ', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=e23.., content: 'document', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 12})]}

HierarchicalDocumentSplitter.__init__

def __init__(block_sizes: set[int],
             split_overlap: int = 0,
             split_by: Literal["word", "sentence", "page",
                               "passage"] = "word")

Initializes HierarchicalDocumentSplitter.

Parameters:

block_sizes: Set of block sizes to split the document into. The blocks are split in descending order of size.
split_overlap: The number of overlapping units for each split.
split_by: The unit for splitting your documents.

HierarchicalDocumentSplitter.run

@component.output_types(documents=list[Document])
def run(documents: list[Document])

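Builds a hierarchical document structure for each document in a list of documents.

Parameters:

documents: List of Documents to split into hierarchical blocks.

Returns:

List of HierarchicalDocument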

HierarchicalDocumentSplitter.build_hierarchy_from_doc

def build_hierarchy_from_doc(document: Document) -> list[Document]

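Builds a hierarchical tree document structure from a single document.

Given a document, this function splits the document into hierarchical blocks of different sizes, represented as HierarchicalDocument objects.

Parameters:

document: Document to split into hierarchical blocks.

Returns:

List of HierarchicalDocument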
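
The tree-building idea can be illustrated without the library. In this sketch, Node is a hypothetical stand-in for a Document carrying hierarchy metadata, build_hierarchy is a hypothetical helper, and splitting is by whitespace-separated words with no overlap; the real component stores parent and child IDs in meta instead of object references.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    # Hypothetical stand-in for a Document with hierarchy metadata.
    content: str
    level: int
    block_size: int
    parent: "Node | None" = None
    children: list["Node"] = field(default_factory=list)


def build_hierarchy(text: str, block_sizes: set[int]) -> list[Node]:
    """Split by words into progressively smaller blocks and link parents to children."""
    root = Node(content=text, level=0, block_size=0)
    nodes = [root]
    frontier = [root]
    # Larger block sizes are applied first, so each level refines the one above it.
    for level, size in enumerate(sorted(block_sizes, reverse=True), start=1):
        next_frontier = []
        for parent in frontier:
            words = parent.content.split()
            for i in range(0, len(words), size):
                child = Node(" ".join(words[i:i + size]), level, size, parent)
                parent.children.append(child)
                next_frontier.append(child)
        nodes.extend(next_frontier)
        frontier = next_frontier
    return nodes


nodes = build_hierarchy("This is a simple test document", {3, 2})
for node in nodes:
    print(node.level, node.block_size, repr(node.content))
```

For the six-word example this yields seven nodes: the root, two blocks of three words, and four leaf blocks of at most two words, which mirrors the shape of the usage example output above.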

HierarchicalDocumentSplitter.to_dict

def to_dict() -> dict[str, Any]

Returns a dictionary representation of the component.

Returns:

Serialized dictionary representation of the component.

HierarchicalDocumentSplitter.from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "HierarchicalDocumentSplitter"

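Deserializes this component from a dictionary.

Parameters:

data: The dictionary to deserialize and create the component from.

Returns:

The deserialized component.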

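Module recursive_splitter

RecursiveDocumentSplitter

Recursively chunks text into smaller chunks.

This component splits text into smaller chunks by recursively applying a list of separators to the text.

The separators are applied in the order they are provided; the last separator in the list is the most specific one.

Each separator is applied to the text, and the resulting chunks are checked. Chunks smaller than split_length are kept; for chunks larger than split_length, the next separator is applied to the remaining text.

This continues until all chunks are smaller than the split_length parameter.

Example: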

from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

chunker = RecursiveDocumentSplitter(split_length=260, split_overlap=0, separators=["\n\n", "\n", ".", " "])
text = ('''Artificial intelligence (AI) - Introduction

AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.
AI technology is widely used throughout industry, government, and science. Some high-profile applications include advanced web search engines; recommendation systems; interacting via human speech; autonomous vehicles; generative and creative tools; and superhuman play and analysis in strategy games.''')
chunker.warm_up()
doc = Document(content=text)
doc_chunks = chunker.run([doc])
print(doc_chunks["documents"])
>[
>Document(id=..., content: 'Artificial intelligence (AI) - Introduction\n\n', meta: {'original_id': '...', 'split_id': 0, 'split_idx_start': 0, '_split_overlap': []})
>Document(id=..., content: 'AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.\n', meta: {'original_id': '...', 'split_id': 1, 'split_idx_start': 45, '_split_overlap': []})
>Document(id=..., content: 'AI technology is widely used throughout industry, government, and science.', meta: {'original_id': '...', 'split_id': 2, 'split_idx_start': 142, '_split_overlap': []})
>Document(id=..., content: ' Some high-profile applications include advanced web search engines; recommendation systems; interac...', meta: {'original_id': '...', 'split_id': 3, 'split_idx_start': 216, '_split_overlap': []})
>]

RecursiveDocumentSplitter.__init__

def __init__(*,
             split_length: int = 200,
             split_overlap: int = 0,
             split_unit: Literal["word", "char", "token"] = "word",
             separators: Optional[list[str]] = None,
             sentence_splitter_params: Optional[dict[str, Any]] = None)

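Initializes a RecursiveDocumentSplitter.

Parameters:

split_length: The maximum length of each chunk, in words by default, but it can also be in characters or tokens. See the split_unit parameter.
split_overlap: The number of characters to overlap between consecutive chunks.
split_unit: The unit of the split_length parameter. It can be "word", "char", or "token". If "token" is selected, the text is split into tokens using the tiktoken tokenizer (o200k_base).
separators: An optional list of separator strings to use for splitting the text. The string separators are treated as regular expressions, unless the separator is "sentence", in which case the text is split into sentences using a custom sentence tokenizer based on NLTK. See haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter. If no separators are provided, the default separators ["\n\n", "sentence", "\n", " "] are used.
sentence_splitter_params: Optional parameters to pass to the sentence tokenizer. See haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter for more information.

Raises:

ValueError: If the overlap is greater than or equal to the chunk size, if the overlap is negative, or if any separator is not a string.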

RecursiveDocumentSplitter.warm_up

def warm_up() -> None

Warms up the sentence tokenizer and the tiktoken tokenizer, if needed.

RecursiveDocumentSplitter.run

@component.output_types(documents=list[Document])
def run(documents: list[Document]) -> dict[str, list[Document]]

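Splits a list of documents into documents with smaller chunks of text.

Parameters:

documents: List of Documents to split.

Raises:

RuntimeError: If the component was not warmed up but sentence splitting or tokenization is required.

Returns:

A dictionary with the key "documents" containing a list of Documents with smaller chunks of text corresponding to the input documents.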
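
The recursive strategy described for this component can be sketched as follows. This is a simplified illustration under stated assumptions: recursive_split is a hypothetical helper, lengths are measured in characters, separators are literal non-empty strings rather than regular expressions, and neither overlap nor the "sentence" separator is handled.

```python
def recursive_split(text: str, split_length: int, separators: list[str]) -> list[str]:
    """Split text so that each chunk has at most split_length characters."""
    if len(text) <= split_length:
        return [text]
    if not separators:
        # No separator left: fall back to fixed-size slices.
        return [text[i:i + split_length] for i in range(0, len(text), split_length)]

    sep, rest = separators[0], separators[1:]
    # Keep each separator attached to the piece it terminates.
    pieces = [p + sep for p in text.split(sep)]
    pieces[-1] = pieces[-1][: len(pieces[-1]) - len(sep)]

    chunks: list[str] = []
    for piece in pieces:
        if not piece:
            continue
        if len(piece) <= split_length:
            chunks.append(piece)
        else:
            # Too long: retry this piece with the next, more specific separator.
            chunks.extend(recursive_split(piece, split_length, rest))
    return chunks


text = "Intro\n\nAlpha beta gamma. Delta epsilon."
chunks = recursive_split(text, 18, ["\n\n", "\n", ".", " "])
print(chunks)
```

Because separators stay attached to the piece they terminate, concatenating the chunks reproduces the original text.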

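Module text_cleaner

TextCleaner

Cleans text strings.

It can remove substrings matching a list of regular expressions, convert text to lowercase, remove punctuation, and remove numbers. Use it to clean up text data before evaluation.

Usage example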

from haystack.components.preprocessors import TextCleaner

text_to_clean = "1Moonlight shimmered softly, 300 Wolves howled nearby, Night enveloped everything."

cleaner = TextCleaner(convert_to_lowercase=True, remove_punctuation=False, remove_numbers=True)
result = cleaner.run(texts=[text_to_clean])

TextCleaner.__init__

def __init__(remove_regexps: Optional[list[str]] = None,
             convert_to_lowercase: bool = False,
             remove_punctuation: bool = False,
             remove_numbers: bool = False)

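Initializes the TextCleaner component.

Parameters:

remove_regexps: A list of regex patterns to remove matching substrings from the text.
convert_to_lowercase: If True, converts all characters to lowercase.
remove_punctuation: If True, removes punctuation from the text.
remove_numbers: If True, removes numerical digits from the text.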

TextCleaner.run

@component.output_types(texts=list[str])
def run(texts: list[str]) -> dict[str, Any]

Cleans up the given list of strings.

Parameters:

texts: List of strings to clean.

Returns:

A dictionary with the following key:

texts: The cleaned list of strings.
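
A standard-library sketch of the four cleaning options is shown below. The helper clean_text is hypothetical, and the order in which the steps run here (regex removal, lowercasing, punctuation, digits) is an assumption rather than something the reference states.

```python
import re
import string


def clean_text(text: str, remove_regexps=None, convert_to_lowercase: bool = False,
               remove_punctuation: bool = False, remove_numbers: bool = False) -> str:
    """Apply the selected cleaning steps to a single string."""
    for pattern in remove_regexps or []:
        text = re.sub(pattern, "", text)
    if convert_to_lowercase:
        text = text.lower()
    if remove_punctuation:
        # Delete every ASCII punctuation character.
        text = text.translate(str.maketrans("", "", string.punctuation))
    if remove_numbers:
        # Delete every ASCII digit.
        text = text.translate(str.maketrans("", "", string.digits))
    return text


text_to_clean = "1Moonlight shimmered softly, 300 Wolves howled nearby, Night enveloped everything."
cleaned_text = clean_text(text_to_clean, convert_to_lowercase=True, remove_numbers=True)
print(cleaned_text)
```

Removing "300" leaves the spaces on both sides of it in place, so the result contains a double space after "softly,".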
