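Most common position in a pipeline	1. Before a `PromptBuilder` in a RAG pipeline 2. The last component in a semantic search pipeline 3. Before an `ExtractiveReader` in an extractive QA pipeline
Mandatory init variables	"document_store": An instance of OpenSearchDocumentStore
Mandatory run variables	"query": A query string
Output variables	"documents": A list of documents matching the query
API reference	OpenSearch
GitHub link	https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/opensearch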

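Overview

OpenSearchBM25Retriever is a keyword-based Retriever that fetches Documents matching a query from an OpenSearchDocumentStore. It determines the similarity between Documents and the query with the BM25 algorithm, which computes a weighted word overlap between the two strings.

Because OpenSearchBM25Retriever matches strings based on word overlap, it is often used to find exact matches for names of people or products, IDs, or well-defined error messages. The BM25 algorithm is lightweight and simple. Nevertheless, more complex embedding-based approaches can find it hard to beat on out-of-domain data.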
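To make "weighted word overlap" concrete, here is a minimal pure-Python sketch of a basic Okapi BM25 scorer. It is an illustration only, not the code OpenSearch runs: the function name `bm25_scores`, the whitespace tokenizer, and the parameter values `k1=1.2` and `b=0.75` (commonly used defaults) are assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a basic Okapi BM25 formula.

    Illustrative helper only; tokenization is a naive lowercase whitespace split.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: the number of documents each term appears in
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms receive a higher weight (inverse document frequency)
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates and is normalized by document length
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "error code E1234 disk full",
    "the disk is almost full",
    "user login succeeded",
]
scores = bm25_scores("disk E1234", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

The first document contains both query terms and ranks highest, the second matches only the more common term "disk", and the third shares no terms with the query, so its score is zero. This is why an exact identifier such as an error code is a good fit for keyword retrieval.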

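In addition to the query, OpenSearchBM25Retriever accepts other optional parameters, including top_k (the maximum number of Documents to retrieve) and filters (to narrow down the search space).
You can use the fuzziness parameter to adjust how inexact fuzzy matching is performed.
You can also use the all_terms_must_match parameter to specify whether all terms in the query must match. It defaults to False.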
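As a rough illustration of what these options control, the sketch below builds an OpenSearch-style `multi_match` request body from them. The helper `build_bm25_query` is hypothetical and not part of the integration, the exact body the Retriever sends may differ, and the defaults shown for `top_k` and `fuzziness` are assumptions made for this example.

```python
import json


def build_bm25_query(query, top_k=10, fuzziness="AUTO", all_terms_must_match=False):
    """Hypothetical helper: build an OpenSearch-style request body for a keyword query."""
    return {
        "size": top_k,  # maximum number of hits to return
        "query": {
            "multi_match": {
                "query": query,
                # Allowed edit distance for inexact matches, e.g. "AUTO", 0, 1, or 2
                "fuzziness": fuzziness,
                # "AND" requires every query term to match; "OR" requires at least one
                "operator": "AND" if all_terms_must_match else "OR",
            }
        },
    }


# Lenient search: any term may match, typos tolerated
loose = build_bm25_query("bioluminescent waves")

# Strict search: all terms must match exactly, at most three hits
strict = build_bm25_query(
    "bioluminescent waves", top_k=3, fuzziness=0, all_terms_must_match=True
)

print(json.dumps(strict, indent=2))
```

In practice you would pass the same options to the Retriever itself rather than build a request body by hand. The sketch only shows how a stricter setting narrows what counts as a match.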

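If you want more flexible matching of a query to Documents, you can use the OpenSearchEmbeddingRetriever, which retrieves relevant information using vectors created by embedding models.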

Setup and installation

Install and run an OpenSearch instance.

If you have Docker set up, we recommend pulling the Docker image and running it.

docker pull opensearchproject/opensearch:2.11.0
docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" -e "OPENSEARCH_JAVA_OPTS=-Xms1024m -Xmx1024m" opensearchproject/opensearch:2.11.0

Alternatively, you can go to the OpenSearch integration on GitHub and start a Docker container running OpenSearch using the provided docker-compose.yml file:

docker compose up

Once you have a running OpenSearch instance, install the opensearch-haystack integration:

pip install opensearch-haystack

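Usage

On its own

This Retriever needs an OpenSearchDocumentStore and indexed Documents to run. You can't use it on its own.

In a RAG pipeline

Set your OPENAI_API_KEY as an environment variable, then run the following code: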

from haystack_integrations.components.retrievers.opensearch import OpenSearchBM25Retriever
from haystack_integrations.document_stores.opensearch import OpenSearchDocumentStore

from haystack import Document
from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.document_stores.types import DuplicatePolicy

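from haystack.utils import Secret

# OpenAIGenerator expects a Secret rather than a plain string;
# this one is resolved from the OPENAI_API_KEY environment variable
api_key = Secret.from_env_var("OPENAI_API_KEY")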

# Create a RAG query pipeline
prompt_template = """
    Given these documents, answer the question.\nDocuments:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}

    \nQuestion: {{question}}
    \nAnswer:
    """

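# Assumption: OpenSearch runs locally (for example, in the Docker container started above),
# so the host is "localhost"; adjust the URL and credentials to your setup
document_store = OpenSearchDocumentStore(
    hosts="https://localhost:9200",
    use_ssl=True,
    verify_certs=False,
    http_auth=("admin", "admin"),
)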

# Add Documents
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]

# DuplicatePolicy.SKIP param is optional, but useful to run the script multiple times without throwing errors
document_store.write_documents(documents=documents, policy=DuplicatePolicy.SKIP)

retriever = OpenSearchBM25Retriever(document_store=document_store)
rag_pipeline = Pipeline()
rag_pipeline.add_component(name="retriever", instance=retriever)
rag_pipeline.add_component(instance=PromptBuilder(template=prompt_template), name="prompt_builder")
rag_pipeline.add_component(instance=OpenAIGenerator(api_key=api_key), name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

question = "How many languages are spoken around the world today?"
result = rag_pipeline.run(
    {
        "retriever": {"query": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    }
)
print(result['answer_builder']['answers'][0])

Here is an example output:

GeneratedAnswer(
  data='Over 7,000 languages are spoken around the world today.',
  query='How many languages are spoken around the world today?',
  documents=[
    Document(id=cfe93bc1c274908801e6670440bf2bbba54fad792770d57421f85ffa2a4fcc94, content: 'There are over 7,000 languages spoken around the world today.', score: 7.179112),
    Document(id=7f225626ad1019b273326fbaf11308edfca6d663308a4a3533ec7787367d59a2, content: 'In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the ph...', score: 1.1426818)],
  meta={'model': 'gpt-3.5-turbo-0613', 'index': 0, 'finish_reason': 'stop', 'usage': {'prompt_tokens': 86, 'completion_tokens': 13, 'total_tokens': 99}})

Additional references

🧑‍🍳 Cookbook: PDF-Based Question Answering with Amazon Bedrock and Haystack