RecursiveDocumentSplitter
This component recursively splits text into smaller chunks by applying a given list of separators to the text.
| Most common position in a pipeline | In indexing pipelines, after Converters and DocumentCleaner, before Classifiers |
| Mandatory run variables | "documents": A list of documents |
| Output variables | "documents": A list of documents |
| API reference | PreProcessors |
| GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/recursive_splitter.py |
Overview
The RecursiveDocumentSplitter expects a list of documents as input and returns a list of documents with the split texts. You can set the following parameters when initializing the component:
- `split_length`: The maximum length of each chunk, in words by default. See the `split_unit` parameter to change the unit.
- `split_overlap`: The number of characters or words that overlap between consecutive chunks.
- `split_unit`: The unit of the `split_length` parameter. Can be `"word"`, `"char"`, or `"token"`.
- `separators`: An optional list of separator strings to use for splitting the text. If you don't provide any separators, the defaults are used: `["\n\n", "sentence", "\n", " "]`. String separators are treated as regular expressions. If a separator is `"sentence"`, the text is split using a custom sentence tokenizer based on NLTK; see the SentenceSplitter code for more information.
- `sentence_splitter_params`: Optional parameters to pass to the SentenceSplitter.
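For example, a minimal configuration (hypothetical values, not taken from the official example) that counts in characters and relies on the default separators could look like this:

```python
from haystack.components.preprocessors import RecursiveDocumentSplitter

# Hypothetical settings: chunks of at most 150 characters with a 20-character overlap,
# using the default separators ["\n\n", "sentence", "\n", " "]
splitter = RecursiveDocumentSplitter(split_length=150, split_overlap=20, split_unit="char")
splitter.warm_up()  # initializes internal resources such as the sentence tokenizer
```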
The separators are applied in the order in which they are defined in the list. The first separator is applied to the text; any resulting chunks that are within the specified chunk_size are kept. The next separator in the list is then applied to the chunks that exceed the specified chunk_size. If all separators have been used and some chunks still exceed the chunk_size, a hard split is performed based on the chunk_size, taking into account whether words or characters are used as the counting unit. This process repeats until all chunks fit within the specified chunk_size. The sketch below illustrates the recursion.
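To make the recursion concrete, here is a minimal, hypothetical sketch of the logic described above, written against plain strings rather than Haystack Documents. It is not the component's actual implementation: it ignores overlap, the "sentence" separator, and regex handling, and it counts length in words only.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Illustrative sketch: split with the first separator, keep chunks that fit,
    and recurse with the remaining separators on chunks that are still too long."""
    if len(text.split()) <= chunk_size:  # already small enough (counting words)
        return [text]
    if not separators:  # all separators exhausted: hard split into fixed-size word groups
        words = text.split()
        return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    first, *remaining = separators
    chunks: list[str] = []
    for piece in text.split(first):
        if len(piece.split()) <= chunk_size:
            chunks.append(piece)  # within chunk_size, keep as-is
        else:
            chunks.extend(recursive_split(piece, remaining, chunk_size))  # try the next separator
    return chunks
```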
Usage
```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

chunker = RecursiveDocumentSplitter(split_length=260, split_overlap=0, separators=["\n\n", "\n", ".", " "])

text = ('''Artificial intelligence (AI) - Introduction

AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.
AI technology is widely used throughout industry, government, and science. Some high-profile applications include advanced web search engines; recommendation systems; interacting via human speech; autonomous vehicles; generative and creative tools; and superhuman play and analysis in strategy games.''')

chunker.warm_up()
doc = Document(content=text)
doc_chunks = chunker.run([doc])
print(doc_chunks["documents"])
```

```
>[
>Document(id=..., content: 'Artificial intelligence (AI) - Introduction\n\n', meta: {'original_id': '...', 'split_id': 0, 'split_idx_start': 0, '_split_overlap': []})
>Document(id=..., content: 'AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.\n', meta: {'original_id': '...', 'split_id': 1, 'split_idx_start': 45, '_split_overlap': []})
>Document(id=..., content: 'AI technology is widely used throughout industry, government, and science.', meta: {'original_id': '...', 'split_id': 2, 'split_idx_start': 142, '_split_overlap': []})
>Document(id=..., content: ' Some high-profile applications include advanced web search engines; recommendation systems; interac...', meta: {'original_id': '...', 'split_id': 3, 'split_idx_start': 216, '_split_overlap': []})
>]
```
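Each returned Document carries split metadata in `meta`, such as `split_id` and `split_idx_start`, which you can use to trace a chunk back to its position in the source text. A small follow-up to the example above (reusing the `doc_chunks` variable) could look like this:

```python
# Print each chunk's id, its start offset in the original text, and a preview of its content
for chunk in doc_chunks["documents"]:
    print(chunk.meta["split_id"], chunk.meta["split_idx_start"], repr(chunk.content[:40]))
```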
In a pipeline
Here is how you can use the RecursiveDocumentSplitter in an indexing pipeline:
```python
from pathlib import Path

from haystack import Document
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import RecursiveDocumentSplitter
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()

p = Pipeline()
p.add_component(instance=TextFileToDocument(), name="text_file_converter")
p.add_component(instance=DocumentCleaner(), name="cleaner")
p.add_component(
    instance=RecursiveDocumentSplitter(
        split_length=400,
        split_overlap=0,
        split_unit="char",
        separators=["\n\n", "\n", "sentence", " "],
        sentence_splitter_params={
            "language": "en",
            "use_split_rules": True,
            "keep_white_spaces": False
        }
    ),
    name="recursive_splitter"
)
p.add_component(instance=DocumentWriter(document_store=document_store), name="writer")

p.connect("text_file_converter.documents", "cleaner.documents")
p.connect("cleaner.documents", "recursive_splitter.documents")
p.connect("recursive_splitter.documents", "writer.documents")

path = "path/to/your/files"
files = list(Path(path).glob("*.md"))
p.run({"text_file_converter": {"sources": files}})
```
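After the pipeline runs, the chunked documents are stored in the document store. As a quick sanity check (assuming the in-memory store defined above), you can count the written documents and inspect one of the chunks:

```python
# Count the chunks written by the pipeline
print(document_store.count_documents())

# Peek at the first stored chunk and its split metadata
first_chunk = document_store.filter_documents()[0]
print(first_chunk.meta.get("split_id"), repr(first_chunk.content[:80]))
```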
