MetaFieldGroupingRanker
Reorders documents by grouping them according to metadata keys.
| Most common position in a pipeline | In a query pipeline, after a component that returns a list of documents, such as a Retriever |
| Mandatory init variables | `group_by`: The name of the meta field to group by |
| Mandatory run variables | `documents`: The list of documents to group |
| Output variables | `documents`: The grouped list of documents |
| API reference | Rankers |
| GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/rankers/meta_field_grouping_ranker.py |
Overview
The `MetaFieldGroupingRanker` component groups documents by a primary meta key, `group_by`, with optional subgrouping by a secondary key, `subgroup_by`.
Within each group or subgroup, the component can also sort documents by a meta key, `sort_docs_by`.
The output is a flat list of documents ordered by the `group_by` and `subgroup_by` values. Any documents without a group are placed at the end of the list.
This component can improve the efficiency and performance of subsequent LLM processing.
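The ordering described above (bucket by the primary key in order of first appearance, optionally sub-bucket, sort within each bucket, then flatten with ungrouped documents last) can be sketched as a simplified, hypothetical reimplementation. This is not the actual Haystack source, and plain dicts stand in for `Document` objects:

```python
from collections import defaultdict

def group_documents(docs, group_by, subgroup_by=None, sort_docs_by=None):
    """Simplified sketch of the grouping logic (not the real component).

    Documents are bucketed by their group_by meta value in order of first
    appearance, optionally sub-bucketed by subgroup_by, sorted within each
    bucket by sort_docs_by, then flattened. Documents missing the group_by
    key are appended at the end.
    """
    groups = defaultdict(list)
    ungrouped = []
    for doc in docs:
        key = doc["meta"].get(group_by)
        (groups[key] if key is not None else ungrouped).append(doc)

    ordered = []
    for members in groups.values():
        if subgroup_by:
            subgroups = defaultdict(list)
            for doc in members:
                subgroups[doc["meta"].get(subgroup_by)].append(doc)
            members = [
                d
                for sub in subgroups.values()
                for d in (sorted(sub, key=lambda d: d["meta"][sort_docs_by]) if sort_docs_by else sub)
            ]
        elif sort_docs_by:
            members = sorted(members, key=lambda d: d["meta"][sort_docs_by])
        ordered.extend(members)
    return ordered + ungrouped

# Plain dicts stand in for Document objects in this sketch
docs = [
    {"content": "B", "meta": {"group": "42", "split_id": 7}},
    {"content": "A", "meta": {"group": "42", "split_id": 3}},
    {"content": "C", "meta": {"group": "11", "split_id": 1}},
    {"content": "D", "meta": {}},
]
print([d["content"] for d in group_documents(docs, "group", sort_docs_by="split_id")])
# ['A', 'B', 'C', 'D']
```

Group "42" appears first, so its members come first (sorted by `split_id`), followed by group "11", with the ungrouped document last.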
Usage
On its own
```python
from haystack import Document
from haystack.components.rankers import MetaFieldGroupingRanker

docs = [
    Document(content="JavaScript is popular", meta={"group": "42", "split_id": 7, "subgroup": "subB"}),
    Document(content="Python is popular", meta={"group": "42", "split_id": 4, "subgroup": "subB"}),
    Document(content="A chromosome is DNA", meta={"group": "314", "split_id": 2, "subgroup": "subC"}),
    Document(content="An octopus has three hearts", meta={"group": "11", "split_id": 2, "subgroup": "subD"}),
    Document(content="Java is popular", meta={"group": "42", "split_id": 3, "subgroup": "subB"}),
]

# Group by "group", subgroup by "subgroup", and sort each (sub)group by "split_id"
ranker = MetaFieldGroupingRanker(group_by="group", subgroup_by="subgroup", sort_docs_by="split_id")
result = ranker.run(documents=docs)
print(result["documents"])
```
In a pipeline
The following pipeline uses `MetaFieldGroupingRanker` to organize documents by a specific meta field and sort them by page number. It then formats the organized documents into chat messages and passes them to `OpenAIChatGenerator` to create a structured content explanation.
```python
from haystack import Pipeline
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.rankers import MetaFieldGroupingRanker
from haystack.dataclasses import ChatMessage, Document

docs = [
    Document(
        content="Chapter 1: Introduction to Python",
        meta={"chapter": "1", "section": "intro", "page": 1}
    ),
    Document(
        content="Chapter 2: Basic Data Types",
        meta={"chapter": "2", "section": "basics", "page": 15}
    ),
    Document(
        content="Chapter 1: Python Installation",
        meta={"chapter": "1", "section": "setup", "page": 5}
    ),
]

ranker = MetaFieldGroupingRanker(
    group_by="chapter",
    subgroup_by="section",
    sort_docs_by="page"
)
chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "temperature": 0.7,
        "max_tokens": 500
    }
)

# First, run the ranker
ranked_result = ranker.run(documents=docs)
ranked_docs = ranked_result["documents"]

# Create chat messages with the ranked documents
messages = [
    ChatMessage.from_system("You are a helpful programming tutor."),
    ChatMessage.from_user(
        "Here are the course documents in order:\n" +
        "\n".join([f"- {doc.content}" for doc in ranked_docs]) +
        "\n\nBased on these documents, explain the structure of this Python course."
    )
]

# Create and run a pipeline for just the chat generator
pipeline = Pipeline()
pipeline.add_component("chat_generator", chat_generator)
result = pipeline.run(
    data={
        "chat_generator": {
            "messages": messages
        }
    }
)
print(result["chat_generator"]["replies"][0])
```
