RecursiveDocumentSplitter

This component recursively breaks text down into smaller chunks by applying a given list of separators to it.

Most common position in a pipeline: In indexing pipelines, after Converters and DocumentCleaner, before Classifiers
Mandatory run variables: "documents": a list of documents
Output variables: "documents": a list of documents
API reference: PreProcessors
GitHub link: https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/recursive_splitter.py

Overview

RecursiveDocumentSplitter expects a list of documents as input and returns a list of documents with the text split. You can set the following parameters when initializing the component:

  • split_length: The maximum length of each chunk, in words by default. See the split_unit parameter to change the unit.
  • split_overlap: The number of overlapping characters or words between consecutive chunks.
  • split_unit: The unit for the split_length parameter. Can be "word", "char", or "token".
  • separators: An optional list of separator strings to use for splitting the text. If you don't provide any separators, the default ["\n\n", "sentence", "\n", " "] is used. String separators are treated as regular expressions. If a separator is "sentence", the text is split with a custom sentence tokenizer based on NLTK. See the SentenceSplitter code for more information.
  • sentence_splitter_params: Optional parameters to pass to the SentenceSplitter (a configuration sketch follows this list).
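
For example, the snippet below sketches a sentence-aware, character-based configuration; the parameter values are illustrative, not recommendations:

from haystack.components.preprocessors import RecursiveDocumentSplitter

# Illustrative configuration: character-based chunks, with the NLTK-backed
# "sentence" separator as a fallback between paragraph and whitespace splits.
splitter = RecursiveDocumentSplitter(
    split_length=300,
    split_overlap=30,
    split_unit="char",
    separators=["\n\n", "sentence", "\n", " "],
    sentence_splitter_params={"language": "en"},
)
splitter.warm_up()  # prepare the component (for example, the sentence tokenizer) before the first run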

The separators are applied in the order they are defined in the list. The first separator is applied to the text; any resulting chunks that are within the specified split_length are kept. For chunks that exceed the specified split_length, the next separator in the list is applied. If all separators have been used and a chunk still exceeds the split_length, a hard split is performed at split_length, taking into account whether words or characters are used as the counting unit. This process repeats until all chunks fit within the specified split_length.
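
The plain-Python sketch below illustrates this fallback logic with word-based counting. It is a simplified illustration of the idea, not the component's actual implementation: it does not treat separators as regular expressions, keep them in the chunks, or handle the special "sentence" separator.

def recursive_split(text: str, separators: list[str], max_words: int) -> list[str]:
    # No separators left: hard-split into windows of max_words words.
    if not separators:
        words = text.split()
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    current, remaining = separators[0], separators[1:]
    chunks: list[str] = []
    for piece in text.split(current):
        if len(piece.split()) <= max_words:
            chunks.append(piece)  # within the limit: keep as-is
        else:
            chunks.extend(recursive_split(piece, remaining, max_words))  # try the next separator
    return chunks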

Usage

from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

chunker = RecursiveDocumentSplitter(split_length=260, split_overlap=0, separators=["\n\n", "\n", ".", " "])
text = ('''Artificial intelligence (AI) - Introduction

AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.
AI technology is widely used throughout industry, government, and science. Some high-profile applications include advanced web search engines; recommendation systems; interacting via human speech; autonomous vehicles; generative and creative tools; and superhuman play and analysis in strategy games.''')
chunker.warm_up()
doc = Document(content=text)
doc_chunks = chunker.run([doc])
print(doc_chunks["documents"])
>[
>Document(id=..., content: 'Artificial intelligence (AI) - Introduction\n\n', meta: {'original_id': '...', 'split_id': 0, 'split_idx_start': 0, '_split_overlap': []})
>Document(id=..., content: 'AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.\n', meta: {'original_id': '...', 'split_id': 1, 'split_idx_start': 45, '_split_overlap': []})
>Document(id=..., content: 'AI technology is widely used throughout industry, government, and science.', meta: {'original_id': '...', 'split_id': 2, 'split_idx_start': 142, '_split_overlap': []})
>Document(id=..., content: ' Some high-profile applications include advanced web search engines; recommendation systems; interac...', meta: {'original_id': '...', 'split_id': 3, 'split_idx_start': 216, '_split_overlap': []})
>]
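
If you need overlapping chunks, set split_overlap to a positive value. A minimal sketch, assuming character-based counting and illustrative values:

overlapping_chunker = RecursiveDocumentSplitter(
    split_length=100,
    split_overlap=20,
    split_unit="char",
    separators=["\n\n", "\n", ".", " "],
)
overlapping_chunker.warm_up()
chunks = overlapping_chunker.run([Document(content=text)])["documents"]
# Consecutive chunks now share content; the "_split_overlap" meta field
# (empty in the output above) is where the overlap information is recorded.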

In a pipeline

Here is how to use RecursiveDocumentSplitter in an indexing pipeline:

from pathlib import Path

from haystack import Document
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import RecursiveDocumentSplitter
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()
p = Pipeline()
p.add_component(instance=TextFileToDocument(), name="text_file_converter")
p.add_component(instance=DocumentCleaner(), name="cleaner")
p.add_component(instance=RecursiveDocumentSplitter(
        split_length=400,
        split_overlap=0,
        split_unit="char",
        separators=["\n\n", "\n", "sentence", " "],
        sentence_splitter_params={
            "language": "en",
            "use_split_rules": True,
            "keep_white_spaces": False
        }
    ),
    name="splitter")
p.add_component(instance=DocumentWriter(document_store=document_store), name="writer")
p.connect("text_file_converter.documents", "cleaner.documents")
p.connect("cleaner.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")

path = "path/to/your/files"
files = list(Path(path).glob("*.md"))
p.run({"text_file_converter": {"sources": files}})
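
After the pipeline runs, the resulting chunks are written to the document store. A quick way to verify the result, using the InMemoryDocumentStore from the example above:

print(document_store.count_documents())  # number of chunks written
for chunk in document_store.filter_documents()[:3]:
    print(repr(chunk.content[:80]))  # peek at the first few stored chunks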