Most common position in a pipeline	1. After a Text Embedder and before the `PromptBuilder` in a RAG pipeline 2. The last component in a semantic search pipeline 3. After a Text Embedder and before an `ExtractiveReader` in an extractive QA pipeline
Mandatory init variables	"document_store": An instance of MongoDBAtlasDocumentStore
Mandatory run variables	"query_embedding": A list of floats
Output variables	"documents": A list of documents
API reference	MongoDB Atlas
GitHub link	https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/mongodb_atlas

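The MongoDBAtlasEmbeddingRetriever is an embedding-based Retriever compatible with the MongoDBAtlasDocumentStore. It compares the embedding of the query with the embeddings of the Documents and, based on that comparison, fetches the Documents in the Document Store that are most relevant to the query.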

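Parameters

When using the MongoDBAtlasEmbeddingRetriever in your NLP system, make sure that both the query embedding and the Document embeddings are available. You can do so by adding a Document Embedder to your indexing Pipeline and a Text Embedder to your query Pipeline.

In addition to query_embedding, the MongoDBAtlasEmbeddingRetriever accepts other optional parameters, including top_k (the maximum number of Documents to retrieve) and filters (to narrow down the search space).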
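
To make the roles of query_embedding, top_k, and filters concrete, here is a minimal, library-free sketch of embedding-based retrieval. It is not how the MongoDBAtlasEmbeddingRetriever is implemented (the real component delegates the search to MongoDB Atlas, and Haystack has its own filter syntax). The names `toy_retrieve` and `cosine_similarity`, the two-dimensional embeddings, and the exact-match filter format are all assumptions made for illustration.

```python
import math
from typing import Any, Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_retrieve(
    query_embedding: List[float],
    docs: List[Dict[str, Any]],
    top_k: int = 10,
    filters: Optional[Dict[str, Any]] = None,
) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for a retriever: filter, score, sort, truncate."""
    # 1. filters narrow the search space (simplified here to exact metadata matches).
    wanted = filters or {}
    candidates = [
        d for d in docs
        if all(d["meta"].get(key) == value for key, value in wanted.items())
    ]
    # 2. Compare the query embedding with each remaining document embedding.
    ranked = sorted(
        candidates,
        key=lambda d: cosine_similarity(query_embedding, d["embedding"]),
        reverse=True,
    )
    # 3. top_k caps the number of documents returned.
    return ranked[:top_k]


# Made-up documents with made-up 2-dimensional embeddings.
docs = [
    {"content": "My name is Jean and I live in Paris.", "embedding": [0.9, 0.1], "meta": {"lang": "en"}},
    {"content": "My name is Mark and I live in Berlin.", "embedding": [0.1, 0.9], "meta": {"lang": "en"}},
    {"content": "Ich heiße Mark und wohne in Berlin.", "embedding": [0.0, 1.0], "meta": {"lang": "de"}},
]

results = toy_retrieve([0.0, 1.0], docs, top_k=1, filters={"lang": "en"})
print(results[0]["content"])
```

With the filter in place, the German document is excluded even though its embedding is closest to the query, and top_k=1 keeps only the best-scoring English document.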

Usage

Installation

To start using MongoDB Atlas with Haystack, install the package with:

pip install mongodb-atlas-haystack

On its own

The Retriever needs an instance of MongoDBAtlasDocumentStore and indexed Documents to run.

from haystack_integrations.document_stores.mongodb_atlas import MongoDBAtlasDocumentStore
from haystack_integrations.components.retrievers.mongodb_atlas import MongoDBAtlasEmbeddingRetriever

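# Assumes the MONGO_CONNECTION_STRING environment variable is set. Depending on the
# integration version, the constructor may also require keyword arguments such as
# database_name, collection_name, and vector_search_index.
document_store = MongoDBAtlasDocumentStore()

retriever = MongoDBAtlasEmbeddingRetriever(document_store=document_store)

# Example run with a placeholder 384-dimensional query embedding
retriever.run(query_embedding=[0.1]*384)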

In a Pipeline

from haystack import Pipeline, Document
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.writers import DocumentWriter
from haystack.components.generators import OpenAIGenerator
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack_integrations.document_stores.mongodb_atlas import MongoDBAtlasDocumentStore
from haystack_integrations.components.retrievers.mongodb_atlas import MongoDBAtlasEmbeddingRetriever

# Create some example documents
documents = [
    Document(content="My name is Jean and I live in Paris."),
    Document(content="My name is Mark and I live in Berlin."),
    Document(content="My name is Giorgio and I live in Rome."),
]

# Connect to the MongoDB Atlas Document Store (connection details are omitted here).
document_store = MongoDBAtlasDocumentStore()

# Define some more components
doc_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)
doc_embedder = SentenceTransformersDocumentEmbedder(model="intfloat/e5-base-v2")
query_embedder = SentenceTransformersTextEmbedder(model="intfloat/e5-base-v2")

# Pipeline that ingests document for retrieval
ingestion_pipe = Pipeline()
ingestion_pipe.add_component(instance=doc_embedder, name="doc_embedder")
ingestion_pipe.add_component(instance=doc_writer, name="doc_writer")

ingestion_pipe.connect("doc_embedder.documents", "doc_writer.documents")
ingestion_pipe.run({"doc_embedder": {"documents": documents}})

# Build a RAG pipeline with a Retriever that fetches the documents relevant to
# the query and an OpenAIGenerator that calls the LLM with a custom prompt.
prompt_template = """
Given these documents, answer the question.\nDocuments:
{% for doc in documents %}
    {{ doc.content }}
{% endfor %}

\nQuestion: {{question}}
\nAnswer:
"""
rag_pipeline = Pipeline()
rag_pipeline.add_component(instance=query_embedder, name="query_embedder")
rag_pipeline.add_component(instance=MongoDBAtlasEmbeddingRetriever(document_store=document_store), name="retriever")
rag_pipeline.add_component(instance=PromptBuilder(template=prompt_template), name="prompt_builder")
rag_pipeline.add_component(instance=OpenAIGenerator(), name="llm")
rag_pipeline.connect("query_embedder", "retriever.query_embedding")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")

# Ask a question on the data you just added.
question = "Where does Mark live?"
result = rag_pipeline.run(
    {
        "query_embedder": {"text": question},
        "prompt_builder": {"question": question},
    }
)

# This pipeline has no answer builder, so the generated text is returned
# under the "llm" component's "replies" key.
print(result["llm"]["replies"])