CSVDocumentCleaner

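Use CSVDocumentCleaner to clean CSV documents by removing empty rows and columns while preserving the rows and columns you choose to ignore. It processes CSV content stored in Documents and helps standardize the data for further analysis.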


Most common position in a pipeline	In indexing pipelines, after Converters and before Embedders or Writers.
Mandatory run variables	"documents": A list of documents containing CSV content.
Output variables	"documents": A list of cleaned CSV documents.
API reference	PreProcessors
GitHub link	https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/csv_document_cleaner.py

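Overview

CSVDocumentCleaner expects a list of Document objects, each containing CSV-formatted content as text. It cleans the data by removing rows and columns that are entirely empty, and lets you specify how many leading rows and columns to set aside before cleaning so that they are preserved.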

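Parameters

ignore_rows: Number of rows to ignore from the top of the CSV table before processing. If any columns are removed, the same columns are also dropped from the ignored rows.
ignore_columns: Number of columns to ignore from the left of the CSV table before processing. If any rows are removed, the same rows are also dropped from the ignored columns.
remove_empty_rows: Whether to remove rows that are entirely empty.
remove_empty_columns: Whether to remove columns that are entirely empty.
keep_id: Whether to retain the original document ID in the output document.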

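Cleaning Process

The CSVDocumentCleaner algorithm follows these steps:

Reads each document's content as a CSV table using pandas.
Sets aside the specified number of ignore_rows from the top and ignore_columns from the left.
Drops all rows and columns that are entirely empty (contain only NaN values).
If columns are dropped, they are removed from the ignored rows as well.
If rows are dropped, they are removed from the ignored columns as well.
Reattaches the remaining ignored rows and columns so that they keep their original positions.
Returns the cleaned CSV content as a new Document object.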
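The steps above can be approximated with a small standard-library sketch. This is not the component's implementation: the component uses pandas, while this version swaps in Python's csv module and treats blank strings as the equivalent of NaN. The function name clean_csv and its handling of ragged rows are assumptions made for illustration only.

```python
import csv
import io


def clean_csv(content, ignore_rows=0, ignore_columns=0,
              remove_empty_rows=True, remove_empty_columns=True):
    """Hypothetical stdlib stand-in for the cleaning steps (not Haystack's code)."""
    rows = list(csv.reader(io.StringIO(content)))
    if not rows:
        return content

    # Pad ragged rows so every row has the same number of cells.
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]

    def is_empty(cell):
        # Assumption: a blank string plays the role of pandas' NaN.
        return cell.strip() == ""

    keep_row = [True] * len(rows)
    keep_col = [True] * width

    # Only rows below the ignored top rows are candidates for removal,
    # and only their non-ignored cells are inspected.
    if remove_empty_rows:
        for i in range(ignore_rows, len(rows)):
            keep_row[i] = not all(is_empty(c) for c in rows[i][ignore_columns:])

    # Likewise, only columns to the right of the ignored ones are candidates.
    if remove_empty_columns:
        for j in range(ignore_columns, width):
            keep_col[j] = not all(
                is_empty(rows[i][j]) for i in range(ignore_rows, len(rows))
            )

    # Applying both masks to the whole table drops removed columns from the
    # ignored rows (and removed rows from the ignored columns) in one pass.
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for i, row in enumerate(rows):
        if keep_row[i]:
            writer.writerow([cell for j, cell in enumerate(row) if keep_col[j]])
    return out.getvalue()


print(clean_csv("col1,col2,col3\n,,\na,b,c\n,,", ignore_rows=1))
```

Using boolean keep masks over the full table, rather than slicing and reattaching as the steps describe, is a simplification chosen for the sketch: the ignored rows and columns never leave their positions, so no reattachment step is needed.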

Usage

On its own

You can use CSVDocumentCleaner on its own to clean up CSV documents.

from haystack import Document
from haystack.components.preprocessors import CSVDocumentCleaner

cleaner = CSVDocumentCleaner(ignore_rows=1, ignore_columns=0)

documents = [Document(content="""col1,col2,col3\n,,\na,b,c\n,,""")]
cleaned_docs = cleaner.run(documents=documents)

In a pipeline

from pathlib import Path
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import XLSXToDocument
from haystack.components.preprocessors import CSVDocumentCleaner
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()
p = Pipeline()
p.add_component(instance=XLSXToDocument(), name="xlsx_file_converter")
p.add_component(instance=CSVDocumentCleaner(ignore_rows=1, ignore_columns=1), name="csv_cleaner")
p.add_component(instance=DocumentWriter(document_store=document_store), name="writer")

p.connect("xlsx_file_converter.documents", "csv_cleaner.documents")
p.connect("csv_cleaner.documents", "writer.documents")

p.run({"xlsx_file_converter": {"sources": [Path("your_xlsx_file.xlsx")]}})

This ensures that CSV documents are properly cleaned before further processing or storage.
