API Reference

Retrievers

Sweeps through a Document Store and returns a set of candidate Documents that are relevant to the query.

Module haystack_experimental.components.retrievers.chat_message_retriever

ChatMessageRetriever

Retrieves chat messages from the underlying ChatMessageStore.

Usage example

from haystack.dataclasses import ChatMessage
from haystack_experimental.components.retrievers import ChatMessageRetriever
from haystack_experimental.chat_message_stores.in_memory import InMemoryChatMessageStore

messages = [
    ChatMessage.from_assistant("Hello, how can I help you?"),
    ChatMessage.from_user("Hi, I have a question about Python. What is a Protocol?"),
]

message_store = InMemoryChatMessageStore()
message_store.write_messages(messages)
retriever = ChatMessageRetriever(message_store)

result = retriever.run()

print(result["messages"])

ChatMessageRetriever.__init__

def __init__(message_store: ChatMessageStore, last_k: int = 10)

Creates the ChatMessageRetriever component.

Parameters:

  • message_store: An instance of a ChatMessageStore.
  • last_k: The number of most recent messages to retrieve. Defaults to 10 messages if not specified.

ChatMessageRetriever.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

ChatMessageRetriever.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "ChatMessageRetriever"

Deserializes the component from a dictionary.

Parameters:

  • data: The dictionary to deserialize from.

Returns:

The deserialized component.
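
The round trip between to_dict and from_dict can be illustrated with a minimal standalone sketch. It does not use the library: TinyRetriever is a hypothetical class, and the dictionary layout (the "type" and "init_parameters" keys) is an assumption chosen for illustration. Inspect the output of a real component's to_dict() for the actual layout.

```python
from typing import Any, Dict


class TinyRetriever:
    """Hypothetical component with one init parameter, used only to show the round trip."""

    def __init__(self, last_k: int = 10):
        self.last_k = last_k

    def to_dict(self) -> Dict[str, Any]:
        # Store the class name plus everything needed to call __init__ again.
        # This layout is an assumption made for the sketch.
        return {"type": type(self).__name__, "init_parameters": {"last_k": self.last_k}}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TinyRetriever":
        # Refuse dictionaries that were produced by a different class.
        if data.get("type") != cls.__name__:
            raise ValueError(f"Cannot build {cls.__name__} from type {data.get('type')!r}")
        return cls(**data["init_parameters"])


original = TinyRetriever(last_k=5)
restored = TinyRetriever.from_dict(original.to_dict())
print(restored.last_k)  # 5
```

Because the dictionary holds only plain values, it can be written to JSON or YAML and loaded again later to rebuild an equivalent component.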

ChatMessageRetriever.run

@component.output_types(messages=List[ChatMessage])
def run(last_k: Optional[int] = None) -> Dict[str, List[ChatMessage]]

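Runs the ChatMessageRetriever.

Parameters:

  • last_k: The number of most recent messages to retrieve. This parameter takes precedence over the last_k parameter passed to the ChatMessageRetriever constructor. If unspecified, the last_k parameter passed to the constructor is used.

Raises:

  • ValueError: If last_k is not None and is less than 1.

Returns:

  • messages - The retrieved chat messages.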
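
The precedence rule for last_k described above can be sketched as a small standalone function. This is an illustration only, not the component's source: resolve_last_k and take_last are hypothetical names, and plain strings stand in for ChatMessage objects.

```python
from typing import List, Optional


def resolve_last_k(run_last_k: Optional[int], init_last_k: int = 10) -> int:
    """Hypothetical helper: pick the effective last_k for one run() call."""
    if run_last_k is not None and run_last_k < 1:
        raise ValueError("last_k must be at least 1")
    # The run-time value wins; otherwise fall back to the constructor value.
    return run_last_k if run_last_k is not None else init_last_k


def take_last(messages: List[str], last_k: int) -> List[str]:
    """Return the last `last_k` items, keeping their original order."""
    return messages[-last_k:]


history = ["m1", "m2", "m3", "m4"]
print(take_last(history, resolve_last_k(None, init_last_k=3)))  # ['m2', 'm3', 'm4']
print(take_last(history, resolve_last_k(1, init_last_k=3)))     # ['m4']
```

With the real component, the same override would be requested by passing last_k to run(), for example retriever.run(last_k=1).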

Module haystack_experimental.components.retrievers.multi_query_embedding_retriever

MultiQueryEmbeddingRetriever

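A component that retrieves documents for multiple queries in parallel using an embedding-based retriever.

This component takes a list of text queries, converts them to embeddings using a query embedder, and then uses an embedding-based retriever to find relevant documents for each query in parallel. The results are combined and sorted by relevance score.

Usage example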

from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.retrievers import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack_experimental.components.retrievers import MultiQueryEmbeddingRetriever

documents = [
    Document(content="Renewable energy is energy that is collected from renewable resources."),
    Document(content="Solar energy is a type of green energy that is harnessed from the sun."),
    Document(content="Wind energy is another type of green energy that is generated by wind turbines."),
    Document(content="Geothermal energy is heat that comes from the sub-surface of the earth."),
    Document(content="Biomass energy is produced from organic materials, such as plant and animal waste."),
    Document(content="Fossil fuels, such as coal, oil, and natural gas, are non-renewable energy sources."),
]

# Populate the document store
doc_store = InMemoryDocumentStore()
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
doc_embedder.warm_up()
doc_writer = DocumentWriter(document_store=doc_store, policy=DuplicatePolicy.SKIP)
documents = doc_embedder.run(documents)["documents"]
doc_writer.run(documents=documents)

# Run the multi-query retriever
in_memory_retriever = InMemoryEmbeddingRetriever(document_store=doc_store, top_k=1)
query_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")

multi_query_retriever = MultiQueryEmbeddingRetriever(
    retriever=in_memory_retriever,
    query_embedder=query_embedder,
    max_workers=3
)

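# Load the query embedder's model (and warm up the retriever, if it supports it) before running
multi_query_retriever.warm_up()

queries = ["Geothermal energy", "natural gas", "turbines"]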
result = multi_query_retriever.run(queries=queries)
for doc in result["documents"]:
    print(f"Content: {doc.content}, Score: {doc.score}")
# >> Content: Geothermal energy is heat that comes from the sub-surface of the earth., Score: 0.8509603046266574
# >> Content: Renewable energy is energy that is collected from renewable resources., Score: 0.42763211298893034
# >> Content: Solar energy is a type of green energy that is harnessed from the sun., Score: 0.40077417016494354
# >> Content: Fossil fuels, such as coal, oil, and natural gas, are non-renewable energy sources., Score: 0.3774863680995796
# >> Content: Wind energy is another type of green energy that is generated by wind turbines., Score: 0.3091423972562246
# >> Content: Biomass energy is produced from organic materials, such as plant and animal waste., Score: 0.25173074243668087

MultiQueryEmbeddingRetriever.__init__

def __init__(*,
             retriever: EmbeddingRetriever,
             query_embedder: TextEmbedder,
             max_workers: int = 3)

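Initializes MultiQueryEmbeddingRetriever.

Parameters:

  • retriever: The embedding-based retriever to use for document retrieval.
  • query_embedder: The query embedder used to convert text queries to embeddings.
  • max_workers: Maximum number of worker threads for parallel processing.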

MultiQueryEmbeddingRetriever.warm_up

def warm_up() -> None

Warms up the query embedder and the retriever if they have a warm_up method.
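
The "if they have a warm_up method" condition is a duck-typing check. A minimal sketch of that pattern follows; it is not the component's source, and WarmableEmbedder, PlainRetriever, and warm_up_if_supported are hypothetical names used only for illustration.

```python
class WarmableEmbedder:
    """Hypothetical component that needs to load a model before use."""

    def __init__(self) -> None:
        self.ready = False

    def warm_up(self) -> None:
        # A real embedder would load its model here.
        self.ready = True


class PlainRetriever:
    """Hypothetical component with no warm_up method."""


def warm_up_if_supported(*components: object) -> int:
    """Call warm_up() on each component that defines it; return how many were warmed."""
    warmed = 0
    for component in components:
        warm_up = getattr(component, "warm_up", None)
        if callable(warm_up):
            warm_up()
            warmed += 1
    return warmed


embedder = WarmableEmbedder()
retriever = PlainRetriever()
print(warm_up_if_supported(embedder, retriever))  # 1
print(embedder.ready)                             # True
```

Components without a warm_up method are skipped silently, so the same call works for any embedder and retriever pair.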

MultiQueryEmbeddingRetriever.run

@component.output_types(documents=List[Document])
def run(queries: List[str],
        retriever_kwargs: Optional[dict[str, Any]] = None) -> dict[str, Any]

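Retrieves documents for the given queries in parallel.

Parameters:

  • queries: List of text queries to process.
  • retriever_kwargs: Optional dictionary of arguments to pass to the retriever's run method.

Returns:

A dictionary containing:

  • documents: List of retrieved documents sorted by relevance score.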
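
The embed, retrieve in parallel, then merge and sort flow can be sketched with the standard library alone. This is an illustration under stated assumptions, not the component's implementation: ScoredDoc, toy_embed, toy_retrieve, and multi_query_retrieve are hypothetical, the letter-count "embedding" is a toy, and handling of documents returned by more than one query is not shown.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ScoredDoc:
    """Hypothetical stand-in for a retrieved document with a relevance score."""
    content: str
    score: float


def toy_embed(text: str) -> List[float]:
    """Toy embedder: vowel counts. A real embedder returns a model vector."""
    lowered = text.lower()
    return [float(lowered.count(ch)) for ch in "aeiou"]


def toy_retrieve(query_embedding: List[float], corpus: List[str], top_k: int = 1) -> List[ScoredDoc]:
    """Toy retriever: rank corpus entries by dot product with the query embedding."""
    scored = [
        ScoredDoc(text, sum(q * d for q, d in zip(query_embedding, toy_embed(text))))
        for text in corpus
    ]
    return sorted(scored, key=lambda doc: doc.score, reverse=True)[:top_k]


def multi_query_retrieve(queries: List[str], corpus: List[str], max_workers: int = 3) -> Dict[str, List[ScoredDoc]]:
    # Step 1: convert every text query to an embedding.
    embeddings = [toy_embed(query) for query in queries]
    # Step 2: run one retrieval per embedding on a thread pool.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_query = list(pool.map(lambda emb: toy_retrieve(emb, corpus), embeddings))
    # Step 3: merge the per-query results and sort by score, highest first.
    merged = [doc for docs in per_query for doc in docs]
    merged.sort(key=lambda doc: doc.score, reverse=True)
    return {"documents": merged}


corpus = ["Solar energy comes from the sun.", "Wind turbines generate power.", "Geothermal heat rises from the earth."]
result = multi_query_retrieve(["sun", "wind", "earth"], corpus)
for doc in result["documents"]:
    print(doc.content, doc.score)
```

In this sketch each query contributes at most top_k documents, so the merged list has at most len(queries) * top_k entries.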

MultiQueryEmbeddingRetriever.to_dict

def to_dict() -> dict[str, Any]

Serializes the component to a dictionary.

Returns:

A dictionary representing the serialized component.

MultiQueryEmbeddingRetriever.from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "MultiQueryEmbeddingRetriever"

Deserializes the component from a dictionary.

Parameters:

  • data: The dictionary to deserialize from.

Returns:

The deserialized component.

Module haystack_experimental.components.retrievers.multi_query_text_retriever

MultiQueryTextRetriever

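A component that retrieves documents for multiple queries in parallel using a text-based retriever.

This component takes a list of text queries and uses a text-based retriever to find relevant documents for each query in parallel, using a thread pool to manage concurrent execution. The results are combined and sorted by relevance score.

You can use this component in combination with the QueryExpander component to enhance the retrieval process.

Usage example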

from haystack import Document
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.retrievers import InMemoryBM25Retriever
from haystack_experimental.components.query import QueryExpander
from haystack_experimental.components.retrievers.multi_query_text_retriever import MultiQueryTextRetriever

documents = [
    Document(content="Renewable energy is energy that is collected from renewable resources."),
    Document(content="Solar energy is a type of green energy that is harnessed from the sun."),
    Document(content="Wind energy is another type of green energy that is generated by wind turbines."),
    Document(content="Hydropower is a form of renewable energy using the flow of water to generate electricity."),
    Document(content="Geothermal energy is heat that comes from the sub-surface of the earth.")
]

document_store = InMemoryDocumentStore()
doc_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)
doc_writer.run(documents=documents)

in_memory_retriever = InMemoryBM25Retriever(document_store=document_store, top_k=1)
multiquery_retriever = MultiQueryTextRetriever(retriever=in_memory_retriever)
results = multiquery_retriever.run(queries=["renewable energy?", "Geothermal", "Hydropower"])
for doc in results["documents"]:
    print(f"Content: {doc.content}, Score: {doc.score}")
# >> Content: Geothermal energy is heat that comes from the sub-surface of the earth., Score: 1.6474448833731097
# >> Content: Hydropower is a form of renewable energy using the flow of water to generate electricity., Score: 1.6157822790079805
# >> Content: Renewable energy is energy that is collected from renewable resources., Score: 1.5255309812344944

MultiQueryTextRetriever.__init__

def __init__(retriever: TextRetriever, max_workers: int = 3)

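Initializes MultiQueryTextRetriever.

Parameters:

  • retriever: The text-based retriever to use for document retrieval.
  • max_workers: Maximum number of worker threads for parallel processing. Defaults to 3.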

MultiQueryTextRetriever.warm_up

def warm_up() -> None

Warms up the retriever if it has a warm_up method.

MultiQueryTextRetriever.run

@component.output_types(documents=list[Document])
def run(queries: List[str],
        retriever_kwargs: Optional[dict[str, Any]] = None) -> dict[str, Any]

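Retrieves documents for the given queries in parallel.

Parameters:

  • queries: List of text queries to process.
  • retriever_kwargs: Optional dictionary of arguments to pass to the retriever's run method.

Returns:

A dictionary containing:

  • documents: List of retrieved documents sorted by relevance score.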
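
How a thread pool can fan out the queries while forwarding retriever_kwargs to every retrieval call is sketched below. This is not the component's source: keyword_retrieve and run_queries are hypothetical, the word-overlap scoring is a toy stand-in for a real text retriever such as BM25, and (text, score) tuples stand in for Document objects.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Dict, List, Optional, Tuple


def keyword_retrieve(query: str, corpus: List[str], top_k: int = 1) -> List[Tuple[str, float]]:
    """Toy text retriever: score is the number of query words found in the text."""
    words = query.lower().split()
    scored = [(text, float(sum(word in text.lower() for word in words))) for text in corpus]
    matches = [item for item in scored if item[1] > 0]
    return sorted(matches, key=lambda item: item[1], reverse=True)[:top_k]


def run_queries(
    queries: List[str],
    corpus: List[str],
    retriever_kwargs: Optional[Dict[str, Any]] = None,
    max_workers: int = 3,
) -> Dict[str, List[Tuple[str, float]]]:
    kwargs = retriever_kwargs or {}
    results: List[Tuple[str, float]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one retrieval per query; the same extra kwargs go to every call.
        futures = [pool.submit(keyword_retrieve, query, corpus, **kwargs) for query in queries]
        for future in as_completed(futures):
            results.extend(future.result())
    # Completion order is arbitrary, so sort the merged list by score, highest first.
    results.sort(key=lambda item: item[1], reverse=True)
    return {"documents": results}


corpus = [
    "Solar energy is harnessed from the sun.",
    "Geothermal energy is heat from the earth.",
    "Hydropower uses the flow of water.",
]
output = run_queries(["geothermal heat", "hydropower"], corpus, retriever_kwargs={"top_k": 1})
for text, score in output["documents"]:
    print(text, score)
```

Sorting after the pool finishes keeps the final order independent of which worker thread completes first.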

MultiQueryTextRetriever.to_dict

def to_dict() -> dict[str, Any]

Serializes the component to a dictionary.

Returns:

The serialized component as a dictionary.

MultiQueryTextRetriever.from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "MultiQueryTextRetriever"

Deserializes the component from a dictionary.

Parameters:

  • data: The dictionary to deserialize from.

Returns:

The deserialized component.