TextCleaner
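Use TextCleaner to make text data more readable. It removes regex matches, punctuation, and numbers, and converts text to lowercase. This is especially useful for cleaning up text data before evaluation.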
| Most common position in a pipeline | Between a Generator and an Evaluator |
| Mandatory run variables | "texts": A list of strings to be cleaned |
| Output variables | "texts": A list of cleaned strings |
| API reference | PreProcessors |
| GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/text_cleaner.py |
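Overview
TextCleaner expects a list of strings as input and returns a list of cleaned strings. The optional cleaning steps are convert_to_lowercase, remove_punctuation, and remove_numbers. All three are boolean parameters that you set when initializing the component.
convert_to_lowercase converts all characters in the text to lowercase. remove_punctuation removes all punctuation from the text. remove_numbers removes all numerical digits from the text.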
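To make the effect of each flag concrete, here is a minimal standard-library sketch of the three steps. The function clean_texts is a hypothetical stand-in, not the component's actual code, and it assumes "punctuation" and "digits" mean the ASCII sets in Python's string module.

```python
import string
from typing import List


def clean_texts(
    texts: List[str],
    convert_to_lowercase: bool = False,
    remove_punctuation: bool = False,
    remove_numbers: bool = False,
) -> List[str]:
    """Hypothetical stand-in for the three boolean cleaning steps."""
    # Collect the characters to delete (assumption: ASCII punctuation and digits only).
    to_remove = ""
    if remove_punctuation:
        to_remove += string.punctuation
    if remove_numbers:
        to_remove += string.digits
    table = str.maketrans("", "", to_remove)

    cleaned = []
    for text in texts:
        if convert_to_lowercase:
            text = text.lower()
        # With an empty table, translate() returns the text unchanged.
        cleaned.append(text.translate(table))
    return cleaned


sample = ["Hello, World! 123"]
lowered = clean_texts(sample, convert_to_lowercase=True)
no_punct = clean_texts(sample, remove_punctuation=True)
no_digits = clean_texts(sample, remove_numbers=True)
print(lowered, no_punct, no_digits)
```

Note that deleting characters does not collapse the whitespace around them, so removing digits from "Hello, World! 123" leaves a trailing space.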
In addition, you can pass a list of regular expressions with the remove_regexps parameter; any matches are removed from the text.
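Regex-based removal can be pictured with the sketch below. The helper remove_matches and the sample patterns are hypothetical; the sketch assumes the patterns are combined by alternation and every match is replaced with an empty string, which may differ in detail from what the component does.

```python
import re
from typing import List


def remove_matches(texts: List[str], patterns: List[str]) -> List[str]:
    """Hypothetical helper: delete every match of any pattern from each text."""
    if not patterns:
        return list(texts)
    # Wrap each pattern in a non-capturing group so the alternation stays well-scoped.
    combined = re.compile("|".join(f"(?:{p})" for p in patterns))
    return [combined.sub("", text) for text in texts]


texts = ["Call 555-1234 now [ad]", "No match here"]
cleaned = remove_matches(texts, [r"\d{3}-\d{4}", r"\[\w+\]"])
print(cleaned)
```

Only the matched characters are deleted, so surrounding spaces remain. If that matters for your evaluation, include the whitespace in the pattern itself.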
Usage
On its own
You can use it outside of a pipeline to clean up any text:
from haystack.components.preprocessors import TextCleaner

text_to_clean = "1Moonlight shimmered softly, 300 Wolves howled nearby, Night enveloped everything."

# Lowercase the text and strip digits; punctuation is left in place.
cleaner = TextCleaner(convert_to_lowercase=True, remove_punctuation=False, remove_numbers=True)
result = cleaner.run(texts=[text_to_clean])
print(result["texts"])
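In a pipeline
In this example, we use TextCleaner after an ExtractiveReader and an OutputAdapter to remove punctuation from the texts. Our custom ExactMatchEvaluator component then compares the retrieved answers to the ground truth answer.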
from typing import List

from haystack import component, Document, Pipeline
from haystack.components.converters import OutputAdapter
from haystack.components.preprocessors import TextCleaner
from haystack.components.readers import ExtractiveReader
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]
document_store.write_documents(documents=documents)


@component
class ExactMatchEvaluator:
    @component.output_types(score=int)
    def run(self, expected: str, provided: List[str]):
        # Score 1 if the expected answer appears verbatim among the provided answers, else 0.
        return {"score": int(expected in provided)}


# Turn the reader's answer objects into a plain list of non-empty answer strings.
adapter = OutputAdapter(
    template="{{answers | extract_data}}",
    output_type=List[str],
    custom_filters={"extract_data": lambda data: [answer.data for answer in data if answer.data]},
)

p = Pipeline()
p.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
p.add_component("reader", ExtractiveReader())
p.add_component("adapter", adapter)
p.add_component("cleaner", TextCleaner(remove_punctuation=True))
p.add_component("evaluator", ExactMatchEvaluator())

p.connect("retriever", "reader")
p.connect("reader", "adapter")
p.connect("adapter", "cleaner.texts")
p.connect("cleaner", "evaluator.provided")

question = "What behavior indicates a high level of self-awareness of elephants?"
ground_truth_answer = "recognizing themselves in mirrors"

result = p.run({"retriever": {"query": question}, "reader": {"query": question}, "evaluator": {"expected": ground_truth_answer}})
print(result)
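The following standard-library sketch illustrates why the cleaning step matters for an exact-match comparison. The answer string with a trailing period is made up for illustration, and strip_punctuation is a hypothetical stand-in for TextCleaner(remove_punctuation=True), assuming ASCII punctuation only.

```python
import string
from typing import List


def exact_match(expected: str, provided: List[str]) -> int:
    # Same list-membership test as ExactMatchEvaluator.run in the example above.
    return int(expected in provided)


def strip_punctuation(texts: List[str]) -> List[str]:
    # Hypothetical stand-in for TextCleaner(remove_punctuation=True).
    table = str.maketrans("", "", string.punctuation)
    return [t.translate(table) for t in texts]


expected = "recognizing themselves in mirrors"
raw_answers = ["recognizing themselves in mirrors."]  # made-up reader output

score_raw = exact_match(expected, raw_answers)
score_clean = exact_match(expected, strip_punctuation(raw_answers))
print(score_raw, score_clean)
```

Because the evaluator checks list membership rather than substring overlap, a single stray period is enough to turn a correct answer into a score of 0; removing punctuation first makes the comparison less brittle.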
