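Sweeps through a Document Store and returns a set of candidate documents that are relevant to the query.
Module haystack_experimental.components.retrievers.chat_message_retriever
ChatMessageRetriever
Retrieves chat messages from the underlying ChatMessageStore.
Usage example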
from haystack.dataclasses import ChatMessage
from haystack_experimental.components.retrievers import ChatMessageRetriever
from haystack_experimental.chat_message_stores.in_memory import InMemoryChatMessageStore
messages = [
ChatMessage.from_assistant("Hello, how can I help you?"),
ChatMessage.from_user("Hi, I have a question about Python. What is a Protocol?"),
]
message_store = InMemoryChatMessageStore()
message_store.write_messages(messages)
retriever = ChatMessageRetriever(message_store)
result = retriever.run()
print(result["messages"])
ChatMessageRetriever.__init__
def __init__(message_store: ChatMessageStore, last_k: int = 10)
Creates the ChatMessageRetriever component.
Parameters:
message_store: An instance of a ChatMessageStore.
last_k: The number of most recent messages to retrieve. Defaults to 10 if not specified.
ChatMessageRetriever.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
A dictionary with the serialized data.
ChatMessageRetriever.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "ChatMessageRetriever"
Deserializes the component from a dictionary.
Parameters:
data: The dictionary to deserialize from.
Returns:
The deserialized component.
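The pairing of to_dict and from_dict can be illustrated with a toy class. This is a minimal sketch, not the library's implementation: ToyRetriever is hypothetical, and the dictionary layout (a "type" path plus "init_parameters") is an assumption about how serialized components are usually shaped.

```python
from typing import Any, Dict


class ToyRetriever:
    """Hypothetical component used only to illustrate the to_dict/from_dict pairing."""

    def __init__(self, last_k: int = 10):
        self.last_k = last_k

    def to_dict(self) -> Dict[str, Any]:
        # Assumed layout: the class path plus the constructor arguments.
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {"last_k": self.last_k},
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ToyRetriever":
        # Rebuild the object from the stored constructor arguments.
        return cls(**data["init_parameters"])


original = ToyRetriever(last_k=5)
restored = ToyRetriever.from_dict(original.to_dict())
print(restored.last_k)  # 5
```

A round trip through both methods should give back a component configured the same way as the original.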
ChatMessageRetriever.run
@component.output_types(messages=List[ChatMessage])
def run(last_k: Optional[int] = None) -> Dict[str, List[ChatMessage]]
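Runs the ChatMessageRetriever.
Parameters:
last_k: The number of most recent messages to retrieve. This value takes precedence over the last_k passed to the ChatMessageRetriever constructor. If not specified, the constructor's last_k is used.
Raises:
ValueError: If last_k is not None and is less than 1.
Returns:
messages: The retrieved chat messages.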
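The precedence rule for last_k can be sketched in plain Python. This is an illustration only, not the library's code: resolve_last_k and last_messages are hypothetical helpers, and plain strings stand in for ChatMessage objects.

```python
from typing import List, Optional


def resolve_last_k(run_last_k: Optional[int], init_last_k: int = 10) -> int:
    """Pick the effective last_k: the run-time value wins over the constructor value."""
    if run_last_k is not None and run_last_k < 1:
        raise ValueError("last_k must be at least 1")
    return init_last_k if run_last_k is None else run_last_k


def last_messages(
    messages: List[str],
    run_last_k: Optional[int] = None,
    init_last_k: int = 10,
) -> List[str]:
    """Return the last N messages in their original order (stand-in for the store lookup)."""
    k = resolve_last_k(run_last_k, init_last_k)
    return messages[-k:]


history = [f"message {i}" for i in range(1, 13)]  # 12 messages in total
print(len(last_messages(history)))           # 10, the constructor default applies
print(last_messages(history, run_last_k=2))  # ['message 11', 'message 12']
```

Passing last_k=0 or a negative value to resolve_last_k raises ValueError, mirroring the condition listed under "Raises".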
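Module haystack_experimental.components.retrievers.multi_query_embedding_retriever
MultiQueryEmbeddingRetriever
A component that retrieves documents for multiple queries in parallel using an embedding-based retriever.
This component takes a list of text queries, converts them to embeddings with a query embedder, and then uses an embedding-based retriever to find relevant documents for each query in parallel. The results are combined and sorted by relevance score.
Usage example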
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.retrievers import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack_experimental.components.retrievers import MultiQueryEmbeddingRetriever
documents = [
Document(content="Renewable energy is energy that is collected from renewable resources."),
Document(content="Solar energy is a type of green energy that is harnessed from the sun."),
Document(content="Wind energy is another type of green energy that is generated by wind turbines."),
Document(content="Geothermal energy is heat that comes from the sub-surface of the earth."),
Document(content="Biomass energy is produced from organic materials, such as plant and animal waste."),
Document(content="Fossil fuels, such as coal, oil, and natural gas, are non-renewable energy sources."),
]
# Populate the document store
doc_store = InMemoryDocumentStore()
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
doc_embedder.warm_up()
doc_writer = DocumentWriter(document_store=doc_store, policy=DuplicatePolicy.SKIP)
documents = doc_embedder.run(documents)["documents"]
doc_writer.run(documents=documents)
# Run the multi-query retriever
in_memory_retriever = InMemoryEmbeddingRetriever(document_store=doc_store, top_k=1)
query_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
multi_query_retriever = MultiQueryEmbeddingRetriever(
retriever=in_memory_retriever,
query_embedder=query_embedder,
max_workers=3
)
queries = ["Geothermal energy", "natural gas", "turbines"]
result = multi_query_retriever.run(queries=queries)
for doc in result["documents"]:
    print(f"Content: {doc.content}, Score: {doc.score}")
>> Content: Geothermal energy is heat that comes from the sub-surface of the earth., Score: 0.8509603046266574
>> Content: Renewable energy is energy that is collected from renewable resources., Score: 0.42763211298893034
>> Content: Solar energy is a type of green energy that is harnessed from the sun., Score: 0.40077417016494354
>> Content: Fossil fuels, such as coal, oil, and natural gas, are non-renewable energy sources., Score: 0.3774863680995796
>> Content: Wind energy is another type of green energy that is generated by wind turbines., Score: 0.3091423972562246
>> Content: Biomass energy is produced from organic materials, such as plant and animal waste., Score: 0.25173074243668087
MultiQueryEmbeddingRetriever.__init__
def __init__(*,
retriever: EmbeddingRetriever,
query_embedder: TextEmbedder,
max_workers: int = 3)
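Initializes the MultiQueryEmbeddingRetriever.
Parameters:
retriever: The embedding-based retriever to use for document retrieval.
query_embedder: The query embedder that converts text queries to embeddings.
max_workers: The maximum number of worker threads for parallel processing.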
MultiQueryEmbeddingRetriever.warm_up
def warm_up() -> None
Warms up the query embedder and the retriever if they have a warm_up method.
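The "only if a warm_up method exists" behavior can be sketched as a simple attribute check. This is an illustration under assumptions: ToyEmbedder, ToyRetriever, and warm_up_if_available are hypothetical names, not part of the library.

```python
class ToyEmbedder:
    """Hypothetical embedder that needs a warm-up step before use."""

    def __init__(self) -> None:
        self.ready = False

    def warm_up(self) -> None:
        self.ready = True


class ToyRetriever:
    """Hypothetical retriever that has no warm_up method."""


def warm_up_if_available(*components: object) -> None:
    """Call warm_up on every component that defines it and skip the rest."""
    for component in components:
        warm_up = getattr(component, "warm_up", None)
        if callable(warm_up):
            warm_up()


embedder, retriever = ToyEmbedder(), ToyRetriever()
warm_up_if_available(embedder, retriever)  # the retriever is skipped without error
print(embedder.ready)  # True
```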
MultiQueryEmbeddingRetriever.run
@component.output_types(documents=List[Document])
def run(queries: List[str],
retriever_kwargs: Optional[dict[str, Any]] = None) -> dict[str, Any]
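Retrieves documents in parallel.
Parameters:
queries: The list of text queries to process.
retriever_kwargs: An optional dictionary of arguments to pass to the retriever's run method.
Returns:
A dictionary containing:
documents: The list of retrieved documents, sorted by relevance score.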
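The embed, retrieve, merge, and sort flow described for this component can be sketched with a thread pool and toy data. This is not the library's implementation: ScoredDoc, embed, retrieve, and run_multi_query are hypothetical, the two-dimensional vectors are made up, and the sketch does not deduplicate documents returned for more than one query.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ScoredDoc:
    """Hypothetical stand-in for a document carrying a relevance score."""
    content: str
    score: float


# Toy two-dimensional "embeddings", invented for illustration.
DOC_VECTORS: Dict[str, List[float]] = {
    "solar panels": [1.0, 0.0],
    "wind turbines": [0.0, 1.0],
}
QUERY_VECTORS: Dict[str, List[float]] = {
    "sun": [0.9, 0.1],
    "wind": [0.2, 0.8],
}


def embed(query: str) -> List[float]:
    """Stand-in for the query embedder."""
    return QUERY_VECTORS[query]


def retrieve(embedding: List[float], top_k: int = 1) -> List[ScoredDoc]:
    """Stand-in for the embedding retriever: dot-product scoring, best top_k."""
    scored = [
        ScoredDoc(content, sum(a * b for a, b in zip(embedding, vector)))
        for content, vector in DOC_VECTORS.items()
    ]
    return sorted(scored, key=lambda d: d.score, reverse=True)[:top_k]


def run_multi_query(queries: List[str], max_workers: int = 3) -> List[ScoredDoc]:
    """Embed and retrieve each query on a worker thread, then merge and sort by score."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_query = list(pool.map(lambda q: retrieve(embed(q)), queries))
    merged = [doc for docs in per_query for doc in docs]
    return sorted(merged, key=lambda d: d.score, reverse=True)


for doc in run_multi_query(["wind", "sun"]):
    print(doc.content, doc.score)
```

Because the merged list is sorted by score, the output order does not depend on the order in which the queries were submitted.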
MultiQueryEmbeddingRetriever.to_dict
def to_dict() -> dict[str, Any]
Serializes the component to a dictionary.
Returns:
A dictionary representing the serialized component.
MultiQueryEmbeddingRetriever.from_dict
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "MultiQueryEmbeddingRetriever"
Deserializes the component from a dictionary.
Parameters:
data: The dictionary to deserialize from.
Returns:
The deserialized component.
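Module haystack_experimental.components.retrievers.multi_query_text_retriever
MultiQueryTextRetriever
A component that retrieves documents for multiple queries in parallel using a text-based retriever.
This component takes a list of text queries and uses a text-based retriever to find relevant documents for each query in parallel, with a thread pool managing the concurrent execution. The results are combined and sorted by relevance score.
You can use this component together with the QueryExpander component to enhance the retrieval process.
Usage example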
from haystack import Document
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.retrievers import InMemoryBM25Retriever
from haystack_experimental.components.query import QueryExpander
from haystack_experimental.components.retrievers.multi_query_text_retriever import MultiQueryTextRetriever
documents = [
Document(content="Renewable energy is energy that is collected from renewable resources."),
Document(content="Solar energy is a type of green energy that is harnessed from the sun."),
Document(content="Wind energy is another type of green energy that is generated by wind turbines."),
Document(content="Hydropower is a form of renewable energy using the flow of water to generate electricity."),
Document(content="Geothermal energy is heat that comes from the sub-surface of the earth.")
]
document_store = InMemoryDocumentStore()
doc_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)
doc_writer.run(documents=documents)
in_memory_retriever = InMemoryBM25Retriever(document_store=document_store, top_k=1)
multiquery_retriever = MultiQueryTextRetriever(retriever=in_memory_retriever)
results = multiquery_retriever.run(queries=["renewable energy?", "Geothermal", "Hydropower"])
for doc in results["documents"]:
    print(f"Content: {doc.content}, Score: {doc.score}")
>> Content: Geothermal energy is heat that comes from the sub-surface of the earth., Score: 1.6474448833731097
>> Content: Hydropower is a form of renewable energy using the flow of water to generate electricity., Score: 1.6157822790079805
>> Content: Renewable energy is energy that is collected from renewable resources., Score: 1.5255309812344944
MultiQueryTextRetriever.__init__
def __init__(retriever: TextRetriever, max_workers: int = 3)
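Initializes the MultiQueryTextRetriever.
Parameters:
retriever: The text-based retriever to use for document retrieval.
max_workers: The maximum number of worker threads for parallel processing. Defaults to 3.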
MultiQueryTextRetriever.warm_up
def warm_up() -> None
Warms up the retriever if it has a warm_up method.
MultiQueryTextRetriever.run
@component.output_types(documents=list[Document])
def run(queries: List[str],
retriever_kwargs: Optional[dict[str, Any]] = None) -> dict[str, Any]
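Retrieves documents in parallel.
Parameters:
queries: The list of text queries to process.
retriever_kwargs: An optional dictionary of arguments to pass to the retriever's run method.
Returns:
A dictionary containing:
documents: The list of retrieved documents, sorted by relevance score.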
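How retriever_kwargs might reach the wrapped retriever can be sketched as follows. This is a rough illustration under assumptions: toy_text_retriever and run_queries are hypothetical, the keyword scoring is invented, and forwarding the dictionary as keyword arguments is an assumed reading of "arguments to pass to the retriever's run method".

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List, Optional

CORPUS = [
    "Wind energy comes from wind turbines.",
    "Solar energy comes from the sun.",
    "Geothermal energy is heat from the earth.",
]


def toy_text_retriever(query: str, top_k: int = 1) -> List[Dict[str, Any]]:
    """Hypothetical keyword retriever: score = number of query words found in the text."""
    words = query.lower().split()
    scored = [
        {"content": text, "score": float(sum(w in text.lower() for w in words))}
        for text in CORPUS
    ]
    hits = [doc for doc in scored if doc["score"] > 0]
    return sorted(hits, key=lambda d: d["score"], reverse=True)[:top_k]


def run_queries(
    queries: List[str],
    retriever_kwargs: Optional[Dict[str, Any]] = None,
    max_workers: int = 3,
) -> List[Dict[str, Any]]:
    """Run every query on a worker thread, forwarding the extra keyword arguments."""
    kwargs = retriever_kwargs or {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(toy_text_retriever, q, **kwargs) for q in queries]
        merged = [doc for future in futures for doc in future.result()]
    return sorted(merged, key=lambda d: d["score"], reverse=True)


print(len(run_queries(["wind energy", "heat"])))                                # 2
print(len(run_queries(["wind energy", "heat"], retriever_kwargs={"top_k": 2})))  # 3
```

In this sketch, overriding top_k at run time changes how many documents each query contributes without rebuilding the retriever.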
MultiQueryTextRetriever.to_dict
def to_dict() -> dict[str, Any]
Serializes the component to a dictionary.
Returns:
The serialized component as a dictionary.
MultiQueryTextRetriever.from_dict
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "MultiQueryTextRetriever"
Deserializes the component from a dictionary.
Parameters:
data: The dictionary to deserialize from.
Returns:
The deserialized component.
