A keyword-based Retriever that fetches documents matching a query from the Azure AI Search Document Store.


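Most common position in a pipeline	1. Before a `PromptBuilder` in a RAG pipeline 2. The last component in a semantic search pipeline 3. Before an `ExtractiveReader` in an extractive QA pipeline
Mandatory init variables	"document_store": An instance of `AzureAISearchDocumentStore`
Mandatory run variables	"query": A string
Output variables	"documents": A list of documents (matching the query)
API reference	Azure AI Search
GitHub link	https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/azure_ai_search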

Overview

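The `AzureAISearchBM25Retriever` is a keyword-based Retriever that fetches documents matching a query from an `AzureAISearchDocumentStore`. It uses the BM25 algorithm, which computes a weighted word overlap between the query and each document to determine their similarity. The Retriever accepts plain-text queries, but you can also provide a combination of terms with boolean operators. Some examples of valid queries are `"pool"`, `"pool spa"`, and `"pool spa +airport"`.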
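To make the "weighted word overlap" idea concrete, here is a minimal, self-contained sketch of a textbook BM25 scorer. It is an illustration only and not the scoring code that Azure AI Search runs. The function name `bm25_scores`, the whitespace tokenizer, the sample documents, and the parameter values `k1=1.2` and `b=0.75` (commonly used defaults) are all assumptions made for this example. Boolean operators such as `+airport` are not handled here.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each document against the query with a textbook BM25 formula."""
    if not docs:
        return []
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Terms that occur in fewer documents get a higher weight.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates with k1; b normalizes by document length.
            norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores


docs = [
    "hotel with pool and spa",
    "hotel near the airport",
    "pool spa and airport shuttle",
]
scores = bm25_scores("pool spa airport", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best])  # the document that contains all three query terms
```

The third document ranks first because it matches all three query terms, while the other two match only a subset.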

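In addition to the `query`, the `AzureAISearchBM25Retriever` accepts other optional parameters, including `top_k` (the maximum number of documents to retrieve) and `filters` (to narrow down the search space).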
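As a rough sketch of how these optional parameters might be passed at run time, the helper below forwards `filters` and `top_k` to a retriever's `run` method. The helper names `language_filter` and `retrieve_filtered` and the `language` metadata field are hypothetical and exist only for this illustration. The filter dictionary follows Haystack's general filter syntax, and the sketch assumes that the field is filterable in your search index.

```python
from typing import Any, Dict, List


def language_filter(language: str) -> Dict[str, Any]:
    """Build a Haystack-style filter on a hypothetical `language` metadata field."""
    return {"field": "meta.language", "operator": "==", "value": language}


def retrieve_filtered(retriever: Any, query: str, language: str, top_k: int = 3) -> List[Any]:
    """Run the retriever with `filters` and `top_k` and return the matching documents.

    `retriever` is assumed to be an already initialized AzureAISearchBM25Retriever.
    """
    result = retriever.run(query=query, filters=language_filter(language), top_k=top_k)
    return result["documents"]
```

With a configured retriever, a call such as `retrieve_filtered(retriever, "pool spa", language="en", top_k=2)` would return at most two documents whose `language` metadata equals `"en"`.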

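If your search index includes a semantic configuration, you can enable semantic ranking and apply it to the Retriever's results. For more details, refer to the Azure AI documentation.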

If you want a combination of BM25 and vector retrieval, use the `AzureAISearchHybridRetriever`, which uses both vector search and BM25 search to match documents and queries.

Usage

Installation

This integration requires an active Azure subscription with a deployed Azure AI Search service.

To start using Azure AI Search with Haystack, install the package with:

pip install azure-ai-search-haystack

On its own

This Retriever needs an `AzureAISearchDocumentStore` and indexed documents to run.

from haystack import Document
from haystack_integrations.components.retrievers.azure_ai_search import AzureAISearchBM25Retriever
from haystack_integrations.document_stores.azure_ai_search import AzureAISearchDocumentStore

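# Azure AI Search index names allow lowercase letters, digits, and dashes, so use a dash instead of an underscore
document_store = AzureAISearchDocumentStore(index_name="haystack-docs")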
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]
document_store.write_documents(documents=documents)

retriever = AzureAISearchBM25Retriever(document_store=document_store)
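# run() returns a dictionary with the matching documents under the "documents" key
result = retriever.run(query="How many languages are spoken around the world today?")
print(result["documents"])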

In a RAG pipeline

The example below shows how to use the `AzureAISearchBM25Retriever` in a RAG pipeline. Set your `OPENAI_API_KEY` as an environment variable, then run the following code:


from haystack_integrations.components.retrievers.azure_ai_search import AzureAISearchBM25Retriever
from haystack_integrations.document_stores.azure_ai_search import AzureAISearchDocumentStore

from haystack import Document
from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.document_stores.types import DuplicatePolicy

import os
api_key = os.environ['OPENAI_API_KEY']

# Create a RAG query pipeline
prompt_template = """
    Given these documents, answer the question.\nDocuments:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}

    \nQuestion: {{question}}
    \nAnswer:
    """

document_store = AzureAISearchDocumentStore(index_name="haystack-docs")

# Add Documents
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]

# policy param is optional, as AzureAISearchDocumentStore has a default policy of DuplicatePolicy.OVERWRITE
document_store.write_documents(documents=documents, policy=DuplicatePolicy.OVERWRITE)

retriever = AzureAISearchBM25Retriever(document_store=document_store)
rag_pipeline = Pipeline()
rag_pipeline.add_component(name="retriever", instance=retriever)
rag_pipeline.add_component(instance=PromptBuilder(template=prompt_template), name="prompt_builder")
rag_pipeline.add_component(instance=OpenAIGenerator(), name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

question = "Tell me something about languages?"
result = rag_pipeline.run(
    {
        "retriever": {"query": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    }
)
print(result['answer_builder']['answers'][0])