
ElasticsearchBM25Retriever

A keyword-based Retriever that fetches Documents matching a query from the Elasticsearch Document Store.

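Most common position in a pipeline: 1. Before a PromptBuilder in a RAG pipeline 2. The last component in a semantic search pipeline 3. Before an ExtractiveReader in an extractive QA pipeline
Mandatory init variables: "document_store": an instance of ElasticsearchDocumentStore
Mandatory run variables: "query": a string
Output variables: "documents": a list of Documents (matching the query)
API reference: Elasticsearch
GitHub link: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/elasticsearch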

Overview

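ElasticsearchBM25Retriever is a keyword-based Retriever that fetches Documents matching a query from an ElasticsearchDocumentStore. It scores the similarity between each Document and the query with the BM25 algorithm, which computes a weighted word overlap between the two strings.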
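To make "weighted word overlap" more concrete, here is a minimal pure-Python sketch of the classic BM25 scoring formula. It is an illustration only, not the code Elasticsearch runs: the regex tokenizer and the names `tokenize` and `bm25_scores` are assumptions made for this example, and k1 = 1.2 and b = 0.75 are the commonly used default parameters.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in term_freq:
                continue
            # Rare terms get a higher weight than terms found in most documents.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            tf = term_freq[term]
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "There are over 7,000 languages spoken around the world today.",
    "Elephants have been observed to recognize themselves in mirrors.",
    "In certain parts of the world you can witness bioluminescent waves.",
]
scores = bm25_scores("How many languages are spoken around the world today?", docs)
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```

A document that shares no term with the query scores zero, which is why purely keyword-based retrieval cannot match paraphrases.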

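Because ElasticsearchBM25Retriever matches strings by word overlap, it is often used to find exact matches for names of people or products, IDs, or well-defined error messages. The BM25 algorithm is lightweight and simple. Even so, more complex embedding-based approaches can find it hard to beat on out-of-domain data.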

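In addition to the query, ElasticsearchBM25Retriever accepts other optional parameters, including top_k (the maximum number of Documents to retrieve) and filters (to narrow down the search space).
When initializing the Retriever, you can also use the fuzziness parameter to adjust how inexact (fuzzy) matching is performed.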
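Fuzzy matching in Elasticsearch is based on edit distance between terms. As a rough illustration of what the AUTO setting means, the sketch below applies the rule documented by Elasticsearch (terms of up to 2 characters must match exactly, terms of 3 to 5 characters allow one edit, longer terms allow two). The helper names `edit_distance`, `auto_max_edits`, and `fuzzy_match` are hypothetical, and the code is a plain-Python approximation, not what Elasticsearch executes internally.

```python
def edit_distance(a, b):
    """Count the edits (insert, delete, substitute, swap of adjacent characters) between two strings."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # A swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[rows - 1][cols - 1]


def auto_max_edits(term):
    """Maximum number of edits allowed for a term under the AUTO rule."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_match(query_term, indexed_term):
    """Return True if the indexed term is within the allowed edit distance of the query term."""
    return edit_distance(query_term, indexed_term) <= auto_max_edits(query_term)


print(fuzzy_match("langauges", "languages"))  # typo with two swapped letters
print(fuzzy_match("id", "if"))                # short terms must match exactly
```

Longer query terms tolerate more typos, while short identifiers are matched strictly, which suits the exact-match use cases described above.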

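If you want a semantic match between a query and Documents, you can use ElasticsearchEmbeddingRetriever, which retrieves relevant information using vectors created by embedding models.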

Installation

Install Elasticsearch and then start an instance. Haystack supports Elasticsearch 8.

If you have Docker set up, we recommend pulling the Docker image and running it.

docker pull docker.elastic.co/elasticsearch/elasticsearch:8.11.1
docker run -p 9200:9200 -e "discovery.type=single-node" -e "ES_JAVA_OPTS=-Xms1024m -Xmx1024m" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.11.1

As an alternative, you can go to the Elasticsearch integration GitHub repository and start a Docker container running Elasticsearch using the provided docker-compose.yml file:

docker compose up
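
Before writing Documents, it can help to confirm that the instance is reachable. The following is a minimal sketch using only the Python standard library. It assumes the container started with the commands above is listening on http://localhost:9200 with security disabled; the helper name `elasticsearch_is_up` is hypothetical.

```python
import json
import urllib.error
import urllib.request


def elasticsearch_is_up(url="http://localhost:9200", timeout=2.0):
    """Return True if the URL answers with a JSON body that contains a "version" entry."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            info = json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a response that is not valid JSON.
        return False
    return isinstance(info, dict) and "version" in info
```

Calling `elasticsearch_is_up()` right after starting the container may return False for a few seconds while Elasticsearch is still booting, so retrying is reasonable.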

Once you have a running Elasticsearch instance, install the elasticsearch-haystack integration:

pip install elasticsearch-haystack

Usage

On its own

from haystack import Document
from haystack_integrations.components.retrievers.elasticsearch import ElasticsearchBM25Retriever
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore

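# Assumption: the local Docker instance from the installation step (port 9200, security disabled, so plain HTTP)
document_store = ElasticsearchDocumentStore(hosts="http://localhost:9200")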
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]
document_store.write_documents(documents=documents)

retriever = ElasticsearchBM25Retriever(document_store=document_store)
result = retriever.run(query="How many languages are spoken around the world today?")
# The matching Documents are returned under the "documents" key
print(result["documents"])
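
The retriever returns a dictionary with a documents list, and each Document carries its content and a relevance score. As a sketch of how results might be inspected, and how top_k and filters can be passed at run time, consider the helper below. The function name `top_hits` and the `meta.language` field are hypothetical (the sample Documents above carry no metadata); the filter uses Haystack's field/operator/value dictionary syntax.

```python
def top_hits(retriever, query, top_k=2, filters=None):
    """Run a retriever and return (score, content) pairs in the order it returns them."""
    result = retriever.run(query=query, top_k=top_k, filters=filters)
    return [(doc.score, doc.content) for doc in result["documents"]]


# Hypothetical filter: keep only Documents whose metadata marks them as English.
english_only = {"field": "meta.language", "operator": "==", "value": "en"}

# With the retriever defined above, this could be used as:
# for score, content in top_hits(retriever, "languages spoken today", filters=english_only):
#     print(f"{score:.3f}  {content}")
```

Values passed to run() apply to that call only, so one retriever instance can serve queries with different limits or filters.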

In a RAG pipeline

Set your OPENAI_API_KEY as an environment variable and then run the following code:


import os

from haystack import Document, Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.components.retrievers.elasticsearch import ElasticsearchBM25Retriever
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore

# Raises a KeyError early if OPENAI_API_KEY is not set
api_key = os.environ["OPENAI_API_KEY"]

# Create a RAG query pipeline
prompt_template = """
    Given these documents, answer the question.\nDocuments:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}

    \nQuestion: {{question}}
    \nAnswer:
    """

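# Assumption: the local Docker instance from the installation step (port 9200, security disabled, so plain HTTP)
document_store = ElasticsearchDocumentStore(hosts="http://localhost:9200")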

# Add Documents

documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]

# DuplicatePolicy.SKIP param is optional, but useful to run the script multiple times without throwing errors
document_store.write_documents(documents=documents, policy=DuplicatePolicy.SKIP)

retriever = ElasticsearchBM25Retriever(document_store=document_store)
rag_pipeline = Pipeline()
rag_pipeline.add_component(name="retriever", instance=retriever)
rag_pipeline.add_component(instance=PromptBuilder(template=prompt_template), name="prompt_builder")
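# OpenAIGenerator reads OPENAI_API_KEY from the environment by default; in Haystack 2.x its api_key parameter expects a Secret, not a plain string
rag_pipeline.add_component(instance=OpenAIGenerator(), name="llm")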
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

question = "How many languages are spoken around the world today?"
result = rag_pipeline.run(
    {
        "retriever": {"query": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    }
)
print(result["answer_builder"]["answers"][0].data)

Here is an example of the output you might get:

"Over 7,000 languages are spoken around the world today"

Related links

Check out the API reference in the GitHub repo or in our docs.