
CacheChecker

This component checks whether documents are present in a Document Store, based on a specified cache field.

Most common position in a pipeline: Flexible
Mandatory init variables: "document_store": an instance of a Document Store; "cache_field": the name of the Document metadata field to check
Mandatory run variables: "items": a list of values associated with the cache_field of Documents
Output variables: "hits": a list of Documents found in the cache with the specified values; "misses": a list of values that could not be found
API reference: Caching
GitHub link: https://github.com/deepset-ai/haystack/blob/main/haystack/components/caching/cache_checker.py

Overview

CacheChecker checks whether the Document Store contains any documents whose cache_field value matches any of the values in the provided items input variable. It returns a dictionary with two keys: "hits" and "misses". The values are, respectively, the list of documents found in the cache and the list of items that were not found.
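Conceptually, each item is looked up with a metadata filter against the Document Store. Below is a rough sketch of the equivalent per-item lookup, not the component's literal source; the "url" field name is illustrative:

from haystack.document_stores.in_memory import InMemoryDocumentStore

store = InMemoryDocumentStore()
item = "https://example.com/resource"

# Roughly what CacheChecker does for each item: filter the store on the cache field
found = store.filter_documents(filters={"field": "url", "operator": "==", "value": item})
# A non-empty result adds the documents to "hits"; otherwise the item goes into "misses"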

Usage

On its own

from haystack.components.caching import CacheChecker
from haystack.document_stores.in_memory import InMemoryDocumentStore

my_doc_store = InMemoryDocumentStore()

# For URL-based caching
cache_checker = CacheChecker(document_store=my_doc_store, cache_field="url")
cache_check_results = cache_checker.run(items=["https://example.com/resource", "https://another_example.com/other_resources"])
print(cache_check_results["hits"])    # List of Documents that were found in the cache: all of these have 'url': <one of the above> in the metadata
print(cache_check_results["misses"])  # URLs that were not found in the cache, like ["https://example.com/resource"]

# For caching based on a custom identifier
cache_checker = CacheChecker(document_store=my_doc_store, cache_field="metadata_field")
cache_check_results = cache_checker.run(items=["12345", "ABCDE"])
print(cache_check_results["hits"])    # Documents that were found in the cache: all of these have 'metadata_field': <one of the above> in the metadata
print(cache_check_results["misses"])  # Values that were not found in the cache, like: ["ABCDE"]
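Both calls above run against an empty store, so every item comes back as a miss. Here is a minimal sketch of producing a hit by writing a document first; the document content and metadata are made up for illustration:

from haystack import Document
from haystack.components.caching import CacheChecker
from haystack.document_stores.in_memory import InMemoryDocumentStore

doc_store = InMemoryDocumentStore()
# Write a document whose metadata carries the cache field (illustrative values)
doc_store.write_documents([Document(content="cached page", meta={"url": "https://example.com/resource"})])

cache_checker = CacheChecker(document_store=doc_store, cache_field="url")
results = cache_checker.run(items=["https://example.com/resource", "https://example.com/new"])
print(results["hits"])    # [Document(..., meta={'url': 'https://example.com/resource'})]
print(results["misses"])  # ["https://example.com/new"]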

In a pipeline

from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.components.caching import CacheChecker
from haystack.document_stores.in_memory import InMemoryDocumentStore

pipeline = Pipeline()
document_store = InMemoryDocumentStore()
pipeline.add_component(instance=CacheChecker(document_store, cache_field="meta.file_path"), name="cache_checker")
pipeline.add_component(instance=TextFileToDocument(), name="text_file_converter")
pipeline.add_component(instance=DocumentCleaner(), name="cleaner")
pipeline.add_component(instance=DocumentSplitter(split_by="sentence", split_length=250, split_overlap=30), name="splitter")
pipeline.add_component(instance=DocumentWriter(document_store=document_store), name="writer")
pipeline.connect("cache_checker.misses", "text_file_converter.sources")
pipeline.connect("text_file_converter.documents", "cleaner.documents")
pipeline.connect("cleaner.documents", "splitter.documents")
pipeline.connect("splitter.documents", "writer.documents")

pipeline.draw("pipeline.png")

# Pass a file path to check and run the pipeline
result = pipeline.run({"cache_checker": {"items": ["code_of_conduct_1.txt"]}})
print(result)

# The second execution skips the files that were already processed
result = pipeline.run({"cache_checker": {"items": ["code_of_conduct_1.txt"]}})
print(result)
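To check a whole directory rather than a hard-coded file name, you can build the items list yourself. A short sketch, assuming the text files live in the current directory:

from pathlib import Path

# Collect every .txt file in the current directory and check them against the cache
items = [str(path) for path in Path(".").glob("*.txt")]
result = pipeline.run({"cache_checker": {"items": items}})
print(result)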

Related links

See the parameter details in our API reference.