LinkContentFetcher

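With LinkContentFetcher, you can use the contents of multiple URLs as data for your pipeline. It works in both indexing and query pipelines, where it fetches the contents of the URLs you provide.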


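Most common position in a pipeline	In indexing or query pipelines, as the data fetching step
Mandatory run variables	"urls": A list of URLs (strings)
Output variables	"streams": A list of `ByteStream` objects
API reference	Fetchers
GitHub link	https://github.com/deepset-ai/haystack/blob/main/haystack/components/fetchers/link_content.py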

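Overview

LinkContentFetcher fetches the contents of the URLs you pass in the urls parameter and returns a list of content streams. Each item in this list is the content of a single link it fetched successfully, in the form of a ByteStream object. Each object in the returned list carries metadata that contains its content type (under the content_type key) and its URL (under the url key).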

For example, if you pass ten URLs to LinkContentFetcher and it fetches six of them successfully, the output is a list of six ByteStream objects, each with information about its content type and URL.

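Sometimes a website may block LinkContentFetcher from fetching its content. In that case, it logs an error and returns only the ByteStream objects it fetched successfully.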
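Because failed URLs are only logged, it can be useful to work out which ones are missing from the output. The sketch below compares the requested URLs against the url metadata of the returned streams. It is a minimal illustration, not part of the component: `StreamStub` and `missing_urls` are hypothetical names, the stub stands in for ByteStream so the snippet runs without network access, and it assumes the url metadata value matches the requested URL string exactly.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List


@dataclass
class StreamStub:
    """Hypothetical stand-in for ByteStream, with only the fields used here."""

    data: bytes
    meta: Dict[str, Any] = field(default_factory=dict)


def missing_urls(requested: Iterable[str], streams: Iterable[Any]) -> List[str]:
    """Return the requested URLs that have no matching stream, in request order."""
    fetched = {stream.meta.get("url") for stream in streams}
    return [url for url in requested if url not in fetched]


# Illustrative data: three URLs were requested, but only two streams came back.
requested = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
]
streams = [
    StreamStub(b"<html>a</html>", {"url": "https://example.com/a", "content_type": "text/html"}),
    StreamStub(b"<html>c</html>", {"url": "https://example.com/c", "content_type": "text/html"}),
]

failed = missing_urls(requested, streams)
print(failed)
```

With the real component, `streams` would be the list stored under the "streams" key of the fetcher's output.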

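In most cases, to use this component in a pipeline, you must convert the returned list of ByteStream objects into a list of Document objects. You can use the HTMLToDocument component for this.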
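Since HTMLToDocument expects HTML input, one option when the URLs point to mixed content is to keep only the HTML streams before conversion, using the content_type metadata. The following is a sketch under stated assumptions: `html_only` is a hypothetical helper, and `SimpleNamespace` objects stand in for ByteStream so the snippet runs offline.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def html_only(streams: Iterable[Any]) -> List[Any]:
    """Keep the streams whose content_type metadata indicates HTML."""
    return [
        stream
        for stream in streams
        if str(stream.meta.get("content_type", "")).lower().startswith("text/html")
    ]


# Illustrative stand-ins for fetched streams (hypothetical URLs and content).
streams = [
    SimpleNamespace(data=b"<html></html>", meta={"url": "https://example.com/page", "content_type": "text/html"}),
    SimpleNamespace(data=b"%PDF-1.7", meta={"url": "https://example.com/file.pdf", "content_type": "application/pdf"}),
    SimpleNamespace(data=b"<html></html>", meta={"url": "https://example.com/other", "content_type": "text/html"}),
]

html_streams = html_only(streams)
print([stream.meta["url"] for stream in html_streams])
```

Streams with other content types would need a converter suited to their format instead.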

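You can place LinkContentFetcher at the start of an indexing pipeline to index the contents of URLs into a Document Store. You can also use it directly in a query pipeline, such as a retrieval-augmented generation (RAG) pipeline, to use the contents of URLs as the data source.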

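Usage

On its own

Below is an example where LinkContentFetcher fetches the contents of a URL. It initializes the component with the default settings. To change the defaults, such as retry_attempts, see the API reference.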

from haystack.components.fetchers import LinkContentFetcher

fetcher = LinkContentFetcher()

fetcher.run(urls=["https://haystack.com.cn"])

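In a pipeline

Below is an example of an indexing pipeline that uses LinkContentFetcher to index the contents of the specified URL into an InMemoryDocumentStore. Note how it uses the HTMLToDocument component to convert the list of ByteStream objects into Document objects.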

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.converters import HTMLToDocument
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()
fetcher = LinkContentFetcher()
converter = HTMLToDocument()
writer = DocumentWriter(document_store=document_store)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=fetcher, name="fetcher")
indexing_pipeline.add_component(instance=converter, name="converter")
indexing_pipeline.add_component(instance=writer, name="writer")

indexing_pipeline.connect("fetcher.streams", "converter.sources")
indexing_pipeline.connect("converter.documents", "writer.documents")

indexing_pipeline.run(data={"fetcher": {"urls": ["https://haystack.com.cn/blog/guide-to-using-zephyr-with-haystack2"]}})
