This Retriever combines embedding-based retrieval and BM25 text search to find matching documents in the search index and return more relevant results.


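Most common position in a pipeline	1. After a TextEmbedder and before a PromptBuilder in a RAG pipeline 2. The last component in a hybrid search pipeline 3. After a TextEmbedder and before an ExtractiveReader in an extractive QA pipeline
Mandatory init variables	"document_store": An instance of AzureAISearchDocumentStore
Mandatory run variables	"query": A string "query_embedding": A list of floats
Output variables	"documents": A list of documents (matching the query)
API reference	Azure AI Search
GitHub link	https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/azure_ai_search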

Overview

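The AzureAISearchHybridRetriever combines vector retrieval and BM25 text search to fetch relevant documents from the AzureAISearchDocumentStore. It handles the text (keyword) query and the query embedding in a single request and executes all subqueries in parallel. The results are merged and reordered using Reciprocal Rank Fusion (RRF) to create a unified result set.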
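
To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion in plain Python. It is not the Azure AI Search implementation: the function name, the document IDs, and the constant k=60 (a commonly used default for RRF) are assumptions made for illustration only. Each document receives 1 / (k + rank) from every ranked list it appears in, and the contributions are summed.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list.

    `rankings` is a list of lists, each ordered from best to worst match.
    `k` is a smoothing constant (assumed to be 60 here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the best match in a list has rank 1.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results of the keyword (BM25) subquery and the vector subquery.
keyword_ranking = ["doc_languages", "doc_elephants", "doc_waves"]
vector_ranking = ["doc_languages", "doc_waves"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

In this sketch, a document that appears in both lists (such as doc_waves) ends up ahead of a document that ranks higher in only one list (doc_elephants), which is the behavior that makes RRF useful for hybrid search.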

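In addition to query and query_embedding, the AzureAISearchHybridRetriever accepts optional parameters such as top_k (the maximum number of documents to retrieve) and filters to narrow down the search. You can also pass additional keyword arguments at initialization for further customization.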
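
As a rough illustration of how the optional parameters might be combined with the mandatory ones, the sketch below assembles the arguments for a run call as a plain dictionary. The helper build_run_args is hypothetical and not part of the integration. The filter uses Haystack's general filter dictionary format, and the field name meta.language is an assumption: a real filter would need to reference a filterable field that exists in your index.

```python
from typing import Any, Dict, List, Optional


def build_run_args(
    query: str,
    query_embedding: List[float],
    top_k: Optional[int] = None,
    filters: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: collect the mandatory and optional run arguments."""
    if not query.strip():
        raise ValueError("query must be a non-empty string")
    args: Dict[str, Any] = {
        "query": query,
        "query_embedding": [float(x) for x in query_embedding],
    }
    if top_k is not None:
        if top_k <= 0:
            raise ValueError("top_k must be a positive integer")
        args["top_k"] = top_k  # maximum number of documents to retrieve
    if filters is not None:
        args["filters"] = filters  # narrows the search to matching documents
    return args


# Assumed metadata field; replace with a filterable field from your own index.
language_filter = {"field": "meta.language", "operator": "==", "value": "en"}

run_args = build_run_args(
    query="How many languages are spoken around the world today?",
    query_embedding=[0.1] * 384,  # fake embedding, for illustration only
    top_k=3,
    filters=language_filter,
)
print(sorted(run_args))
```

With a configured retriever, these arguments could then be passed along as retriever.run(**run_args).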

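If your search index includes a semantic configuration, you can enable semantic ranking to apply it to the Retriever's results. For more details, refer to the Azure AI documentation.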

For purely keyword-based retrieval, you can use the AzureAISearchBM25Retriever. For embedding-based retrieval, the AzureAISearchEmbeddingRetriever is available.

Usage

Installation

This integration requires an active Azure subscription with a deployed Azure AI Search service.

To start using Azure AI Search with Haystack, install the package with the following command:

pip install azure-ai-search-haystack

On its own

This Retriever needs an AzureAISearchDocumentStore and indexed documents to run.

from haystack import Document
from haystack_integrations.components.retrievers.azure_ai_search import AzureAISearchHybridRetriever
from haystack_integrations.document_stores.azure_ai_search import AzureAISearchDocumentStore

document_store = AzureAISearchDocumentStore(index_name="haystack_docs")
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]
document_store.write_documents(documents=documents)

retriever = AzureAISearchHybridRetriever(document_store=document_store)
# fake embeddings to keep the example simple
retriever.run(query="How many languages are spoken around the world today?", query_embedding=[0.1]*384)

In a RAG pipeline

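The following example demonstrates using the AzureAISearchHybridRetriever in a pipeline. The indexing pipeline embeds the documents and stores them in the AzureAISearchDocumentStore, while the query pipeline uses hybrid retrieval to fetch relevant documents for a given query.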

from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack.components.writers import DocumentWriter

from haystack_integrations.components.retrievers.azure_ai_search import AzureAISearchHybridRetriever
from haystack_integrations.document_stores.azure_ai_search import AzureAISearchDocumentStore

document_store = AzureAISearchDocumentStore(index_name="hybrid-retrieval-example")

model = "sentence-transformers/all-mpnet-base-v2"

documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="""Elephants have been observed to behave in a way that indicates a
         high level of self-awareness, such as recognizing themselves in mirrors."""
    ),
    Document(
        content="""In certain parts of the world, like the Maldives, Puerto Rico, and
          San Diego, you can witness the phenomenon of bioluminescent waves."""
    ),
]

document_embedder = SentenceTransformersDocumentEmbedder(model=model)
document_embedder.warm_up()

# Indexing Pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=document_embedder, name="doc_embedder")
indexing_pipeline.add_component(instance=DocumentWriter(document_store=document_store), name="doc_writer")
indexing_pipeline.connect("doc_embedder", "doc_writer")

indexing_pipeline.run({"doc_embedder": {"documents": documents}})

# Query Pipeline
query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder(model=model))
query_pipeline.add_component("retriever", AzureAISearchHybridRetriever(document_store=document_store))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query = "How many languages are there?"

result = query_pipeline.run({"text_embedder": {"text": query}, "retriever": {"query": query}})

print(result["retriever"]["documents"][0])