InMemoryBM25Retriever

A keyword-based Retriever compatible with InMemoryDocumentStore.

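Most common position in a pipeline: in query pipelines:
- In a RAG pipeline, before a PromptBuilder
- In a semantic search pipeline, as the last component
- In an extractive QA pipeline, before an ExtractiveReader
Mandatory init variables: "document_store": an instance of InMemoryDocumentStore
Mandatory run variables: "query": a query string
Output variables: "documents": a list of documents (matching the query)
API reference: Retrievers
GitHub link: https://github.com/deepset-ai/haystack/blob/main/haystack/components/retrievers/in_memory/bm25_retriever.py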

Overview

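InMemoryBM25Retriever is a keyword-based Retriever that fetches documents matching a query from a temporary in-memory database. It determines the similarity between documents and the query using the BM25 algorithm, which computes a weighted word overlap between the two strings.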
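
To make the idea of weighted word overlap concrete, here is a minimal sketch of textbook Okapi BM25 scoring in plain Python. It is not Haystack's implementation: the tokenize and bm25_scores helpers, the tokenization regex, and the k1 and b defaults are assumptions chosen for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption for this sketch: lowercase, keep word tokens of 2+ characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook Okapi BM25 formula."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            # Rare terms get a larger weight than common ones.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates (k1) and is normalized by document length (b).
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    "There are over 7,000 languages spoken around the world today.",
    "Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors.",
    "In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.",
]
scores = bm25_scores("How many languages are spoken around the world today?", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

In this sketch the first document ranks highest because it shares the most query terms, while the second document shares none and scores zero.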

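Because InMemoryBM25Retriever matches strings based on word overlap, it is often used to find exact matches for names of people or products, IDs, or well-defined error messages. The BM25 algorithm is very lightweight and simple. Even so, on out-of-domain data it can be hard to beat with more complex embedding-based approaches.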

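In addition to the query, InMemoryBM25Retriever accepts other optional parameters, including top_k (the maximum number of documents to retrieve) and filters (to narrow down the search space).
Some parameters that affect BM25 retrieval must be defined when the corresponding InMemoryDocumentStore is initialized: these include the specific BM25 algorithm and its parameters.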

Usage

On its own

from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]
document_store.write_documents(documents=documents)

retriever = InMemoryBM25Retriever(document_store=document_store)
# Keep the result so the retrieved documents can be inspected
result = retriever.run(query="How many languages are spoken around the world today?")
print(result["documents"])

In a pipeline

In a RAG pipeline

Here is an example of using this Retriever in a retrieval-augmented generation pipeline:

import os
from haystack import Document
from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Create a RAG query pipeline
prompt_template = """
    Given these documents, answer the question.\nDocuments:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}

    \nQuestion: {{question}}
    \nAnswer:
    """

os.environ["OPENAI_API_KEY"] = "sk-XXXXXX"

rag_pipeline = Pipeline()
rag_pipeline.add_component(instance=InMemoryBM25Retriever(document_store=InMemoryDocumentStore()), name="retriever")
rag_pipeline.add_component(instance=PromptBuilder(template=prompt_template), name="prompt_builder")
rag_pipeline.add_component(instance=OpenAIGenerator(), name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
# OpenAIGenerator exposes its metadata as "meta", which AnswerBuilder accepts under the same name
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

# Draw the pipeline
rag_pipeline.draw("./rag_pipeline.png")

# Add Documents
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]
rag_pipeline.get_component("retriever").document_store.write_documents(documents)

# Run the pipeline
question = "How many languages are there?"
result = rag_pipeline.run(
    {
        "retriever": {"query": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    }
)
print(result['answer_builder']['answers'][0])

In a document search pipeline

You can also use this Retriever in a document search pipeline:

from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack import Pipeline

# Create components and a query pipeline
document_store = InMemoryDocumentStore()
retriever = InMemoryBM25Retriever(document_store=document_store)

pipeline = Pipeline()
pipeline.add_component(instance=retriever, name="retriever")

# Add Documents
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]
document_store.write_documents(documents)

# Run the pipeline
result = pipeline.run(data={"retriever": {"query":"How many languages are there?"}})

print(result['retriever']['documents'][0])

Related links