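Most common position in a pipeline	1. Before a `PromptBuilder` in a RAG pipeline 2. As the last component in a semantic search pipeline 3. Before an `ExtractiveReader` in an extractive QA pipeline
Mandatory init variables	"document_store": An instance of PgvectorDocumentStore
Mandatory run variables	"query": A string
Output variables	"documents": A list of documents (matching the query)
API reference	Pgvector
GitHub link	https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/pgvector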

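Overview

The PgvectorKeywordRetriever is a keyword-based Retriever compatible with the PgvectorDocumentStore.

The component uses PostgreSQL's ts_rank_cd function to rank documents.
It considers how often the query terms appear in a document, how close together the terms are in the document, and how important the part of the document is where they occur.
For more details, see the Postgres documentation.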
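
To make the ranking step more concrete, the sketch below builds the kind of SQL statement that ranks rows with `ts_rank_cd`. The table name `documents`, the `id` and `content` columns, the `build_keyword_query` helper, and the use of `plainto_tsquery` are assumptions for illustration; the statement the integration actually issues may differ.

```python
# Illustrative only: the table and column names and the use of plainto_tsquery
# are assumptions, not necessarily what PgvectorKeywordRetriever runs internally.
def build_keyword_query(table: str = "documents", language: str = "english") -> str:
    """Return a parameterized SQL statement that ranks rows with ts_rank_cd."""
    # Only plain identifiers are interpolated; user input goes through %s placeholders.
    if not table.isidentifier() or not language.isalpha():
        raise ValueError("table and language must be plain identifiers")
    return (
        f"SELECT id, content, "
        f"ts_rank_cd(to_tsvector('{language}', content), query) AS score "
        f"FROM {table}, plainto_tsquery('{language}', %s) AS query "
        f"WHERE to_tsvector('{language}', content) @@ query "
        f"ORDER BY score DESC LIMIT %s"
    )


sql = build_keyword_query()
print(sql)
```

With a psycopg cursor, such a statement could be executed as `cursor.execute(sql, ("languages spoken", 10))`, where the first parameter is the query text and the second is the row limit.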

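Keep in mind that, unlike similar components such as ElasticsearchBM25Retriever, this Retriever does not apply fuzzy search out of the box, so queries need to be formulated carefully to avoid getting zero results.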
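
One possible way to cope with empty result sets is to retry with fewer terms. The helper below is a minimal sketch, assuming that a shorter query is less restrictive (this holds for PostgreSQL's `plainto_tsquery`, which requires every term to match; whether the Retriever relies on it is an assumption here). `run_with_fallback` is a hypothetical helper, not part of the integration, and dropping trailing terms is a deliberately naive strategy.

```python
from typing import Any, Callable, Dict, List


def run_with_fallback(
    run: Callable[..., Dict[str, List[Any]]], query: str, min_terms: int = 1
) -> Dict[str, List[Any]]:
    """Call run(query=...) and drop trailing terms until documents are found."""
    terms = query.split()
    while terms and len(terms) >= min_terms:
        result = run(query=" ".join(terms))
        if result["documents"]:
            return result
        terms = terms[:-1]  # loosen the query by removing the last term
    return {"documents": []}
```

It could be called as `run_with_fallback(retriever.run, "languages spoken worldwide today")`, where `retriever` is an initialized PgvectorKeywordRetriever.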

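In addition to the query, PgvectorKeywordRetriever accepts other optional parameters, including top_k (the maximum number of documents to retrieve) and filters to narrow down the search space.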
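
As a rough illustration of these optional parameters, the sketch below passes `top_k` and `filters` at run time. The filter uses Haystack 2.x comparison syntax; the `meta.category` field, its value, and the `search` helper are hypothetical and assume the indexed documents carry such metadata.

```python
from typing import Any, Dict, List

# Haystack 2.x filter syntax; "meta.category" is a hypothetical metadata field.
filters: Dict[str, Any] = {
    "field": "meta.category",
    "operator": "==",
    "value": "nature",
}


def search(retriever: Any, query: str) -> List[Any]:
    """Return at most three documents that match both the query and the filter."""
    result = retriever.run(query=query, filters=filters, top_k=3)
    return result["documents"]
```

Here `retriever` is expected to be an initialized PgvectorKeywordRetriever, for example `search(retriever, "elephants")`.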

Installation

To quickly set up a PostgreSQL database with pgvector, you can use Docker:

docker run -d -p 5432:5432 -e POSTGRES_USER=postgres -e POSTGRES_PASSWORD=postgres -e POSTGRES_DB=postgres ankane/pgvector

For more information on how to install pgvector, visit the pgvector GitHub repository.

Install the pgvector-haystack integration:

pip install pgvector-haystack

Usage

On its own

This Retriever needs the PgvectorDocumentStore and indexed documents to run.

Set the environment variable PG_CONN_STR to the connection string of your PostgreSQL database.
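
If you prefer to set the variable from Python rather than from the shell, something like the following should work. The connection string is an assumption that matches the credentials, port, and database name from the Docker command in the Installation section; adjust it for your own setup.

```python
import os

# Assumed connection string, derived from the Docker command above
# (user, password, and database are all "postgres", port 5432 on localhost).
os.environ.setdefault(
    "PG_CONN_STR", "postgresql://postgres:postgres@localhost:5432/postgres"
)
```

`setdefault` leaves an already exported value untouched, so a shell-level setting takes precedence.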

from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.retrievers.pgvector import PgvectorKeywordRetriever

document_store = PgvectorDocumentStore()
retriever = PgvectorKeywordRetriever(document_store=document_store)

retriever.run(query="my nice query")
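
The call above returns a dictionary whose "documents" key holds the matching documents. A small sketch for inspecting that result follows; `top_hits` is a hypothetical helper, and it relies only on the `score` and `content` attributes of Haystack `Document` objects.

```python
from typing import Any, Dict, List, Tuple


def top_hits(result: Dict[str, List[Any]], limit: int = 3) -> List[Tuple[float, str]]:
    """Return (score, content) pairs for the first `limit` retrieved documents."""
    return [(doc.score, doc.content) for doc in result["documents"][:limit]]
```

For example, `top_hits(retriever.run(query="my nice query"))` gives a compact view of the best matches and their scores.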

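In a RAG pipeline

The prerequisites for running this code are:

Set the environment variable OPENAI_API_KEY to your OpenAI API key.
Set the environment variable PG_CONN_STR to the connection string of your PostgreSQL database.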

from haystack import Document
from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.document_stores.types import DuplicatePolicy

from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.retrievers.pgvector import (
    PgvectorKeywordRetriever,
)

# Create a RAG query pipeline
prompt_template = """
    Given these documents, answer the question.\nDocuments:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}

    \nQuestion: {{question}}
    \nAnswer:
    """

document_store = PgvectorDocumentStore(
    language="english",  # this parameter influences text parsing for keyword retrieval
    recreate_table=True,
)

documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."
    ),
]

# DuplicatePolicy.SKIP param is optional, but useful to run the script multiple times without throwing errors
document_store.write_documents(documents=documents, policy=DuplicatePolicy.SKIP)

retriever = PgvectorKeywordRetriever(document_store=document_store)
rag_pipeline = Pipeline()
rag_pipeline.add_component(name="retriever", instance=retriever)
rag_pipeline.add_component(
    instance=PromptBuilder(template=prompt_template), name="prompt_builder"
)
rag_pipeline.add_component(instance=OpenAIGenerator(), name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

question = "languages spoken around the world today"
result = rag_pipeline.run(
    {
        "retriever": {"query": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    }
)
print(result["answer_builder"])