RecursiveDocumentSplitter
This component recursively splits text into smaller chunks by applying a given list of separators to the text.
| Most common position in a pipeline | In indexing pipelines, after Converters and DocumentCleaner, before Classifiers |
| Mandatory run variables | "documents": A list of documents |
| Output variables | "documents": A list of documents |
| API reference | PreProcessors |
| GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/recursive_splitter.py |
Overview
The RecursiveDocumentSplitter expects a list of documents as input and returns a list of documents with the split texts. You can set the following parameters when initializing the component:
- `split_length`: The maximum length of each chunk, in words by default. See the `split_unit` parameter to change the unit.
- `split_overlap`: The number of characters or words that overlap between consecutive chunks.
- `split_unit`: The unit of the `split_length` parameter. Can be `"word"`, `"char"`, or `"token"`.
- `separators`: An optional list of separator strings to use for splitting the text. If you don't provide any separators, the defaults are used: `["\n\n", "sentence", "\n", " "]`. String separators are treated as regular expressions. If a separator is `"sentence"`, the text is split using a custom sentence tokenizer based on NLTK; see the SentenceSplitter code for more information.
- `sentence_splitter_params`: Optional parameters to pass to the SentenceSplitter.
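For example, a minimal configuration (hypothetical values, not taken from the official example) that counts in characters and relies on the default separators could look like this:

```python
from haystack.components.preprocessors import RecursiveDocumentSplitter

# Hypothetical settings: chunks of at most 150 characters with a 20-character overlap,
# using the default separators ["\n\n", "sentence", "\n", " "]
splitter = RecursiveDocumentSplitter(split_length=150, split_overlap=20, split_unit="char")
splitter.warm_up()  # initializes internal resources such as the sentence tokenizer
```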
The separators are applied in the order in which they are defined in the list. The first separator is applied to the text; any resulting chunks that are within the specified chunk_size are kept. The next separator in the list is then applied to the chunks that exceed the specified chunk_size. If all separators have been used and some chunks still exceed the chunk_size, a hard split is performed based on the chunk_size, taking into account whether words or characters are used as the counting unit. This process repeats until all chunks fit within the specified chunk_size. The sketch below illustrates the recursion.
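To make the recursion concrete, here is a minimal, hypothetical sketch of the logic described above, written against plain strings rather than Haystack Documents. It is not the component's actual implementation: it ignores overlap, the "sentence" separator, and regex handling, and it counts length in words only.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Illustrative sketch: split with the first separator, keep chunks that fit,
    and recurse with the remaining separators on chunks that are still too long."""
    if len(text.split()) <= chunk_size:  # already small enough (counting words)
        return [text]
    if not separators:  # all separators exhausted: hard split into fixed-size word groups
        words = text.split()
        return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    first, *remaining = separators
    chunks: list[str] = []
    for piece in text.split(first):
        if len(piece.split()) <= chunk_size:
            chunks.append(piece)  # within chunk_size, keep as-is
        else:
            chunks.extend(recursive_split(piece, remaining, chunk_size))  # try the next separator
    return chunks
```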
Usage
```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

chunker = RecursiveDocumentSplitter(split_length=260, split_overlap=0, separators=["\n\n", "\n", ".", " "])

text = ('''Artificial intelligence (AI) - Introduction

AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.
AI technology is widely used throughout industry, government, and science. Some high-profile applications include advanced web search engines; recommendation systems; interacting via human speech; autonomous vehicles; generative and creative tools; and superhuman play and analysis in strategy games.''')

chunker.warm_up()
doc = Document(content=text)
doc_chunks = chunker.run([doc])
print(doc_chunks["documents"])
```

```
>[
>Document(id=..., content: 'Artificial intelligence (AI) - Introduction\n\n', meta: {'original_id': '...', 'split_id': 0, 'split_idx_start': 0, '_split_overlap': []})
>Document(id=..., content: 'AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.\n', meta: {'original_id': '...', 'split_id': 1, 'split_idx_start': 45, '_split_overlap': []})
>Document(id=..., content: 'AI technology is widely used throughout industry, government, and science.', meta: {'original_id': '...', 'split_id': 2, 'split_idx_start': 142, '_split_overlap': []})
>Document(id=..., content: ' Some high-profile applications include advanced web search engines; recommendation systems; interac...', meta: {'original_id': '...', 'split_id': 3, 'split_idx_start': 216, '_split_overlap': []})
>]
```
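Each returned Document carries split metadata in `meta`, such as `split_id` and `split_idx_start`, which you can use to trace a chunk back to its position in the source text. A small follow-up to the example above (reusing the `doc_chunks` variable) could look like this:

```python
# Print each chunk's id, its start offset in the original text, and a preview of its content
for chunk in doc_chunks["documents"]:
    print(chunk.meta["split_id"], chunk.meta["split_idx_start"], repr(chunk.content[:40]))
```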
In a pipeline
Here is how you can use the RecursiveDocumentSplitter in an indexing pipeline:
```python
from pathlib import Path

from haystack import Document
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import RecursiveDocumentSplitter
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()

p = Pipeline()
p.add_component(instance=TextFileToDocument(), name="text_file_converter")
p.add_component(instance=DocumentCleaner(), name="cleaner")
p.add_component(
    instance=RecursiveDocumentSplitter(
        split_length=400,
        split_overlap=0,
        split_unit="char",
        separators=["\n\n", "\n", "sentence", " "],
        sentence_splitter_params={
            "language": "en",
            "use_split_rules": True,
            "keep_white_spaces": False
        }
    ),
    name="recursive_splitter"
)
p.add_component(instance=DocumentWriter(document_store=document_store), name="writer")

p.connect("text_file_converter.documents", "cleaner.documents")
p.connect("cleaner.documents", "recursive_splitter.documents")
p.connect("recursive_splitter.documents", "writer.documents")

path = "path/to/your/files"
files = list(Path(path).glob("*.md"))
p.run({"text_file_converter": {"sources": files}})
```
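After the pipeline runs, the chunked documents are stored in the document store. As a quick sanity check (assuming the in-memory store defined above), you can count the written documents and inspect one of the chunks:

```python
# Count the chunks written by the pipeline
print(document_store.count_documents())

# Peek at the first stored chunk and its split metadata
first_chunk = document_store.filter_documents()[0]
print(first_chunk.meta.get("split_id"), repr(first_chunk.content[:80]))
```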
