MetaFieldGroupingRanker
Reorders documents by grouping them according to metadata keys.
| Most common position in a pipeline | In a query pipeline, after a component that returns a list of documents, such as a Retriever |
| Mandatory init variables | `group_by`: The name of the meta field to group by |
| Mandatory run variables | `documents`: The list of documents to group |
| Output variables | `documents`: The grouped list of documents |
| API reference | Rankers |
| GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/rankers/meta_field_grouping_ranker.py |
Overview
The `MetaFieldGroupingRanker` component groups documents by a primary meta key, `group_by`, with optional subgrouping by a secondary key, `subgroup_by`.
Within each group or subgroup, the component can also sort documents by a meta key, `sort_docs_by`.
The output is a flat list of documents ordered by the `group_by` and `subgroup_by` values. Any documents without a group are placed at the end of the list.
This component can improve the efficiency and performance of subsequent LLM processing.
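The ordering described above (bucket by the primary key in order of first appearance, optionally sub-bucket, sort within each bucket, then flatten with ungrouped documents last) can be sketched as a simplified, hypothetical reimplementation. This is not the actual Haystack source, and plain dicts stand in for `Document` objects:

```python
from collections import defaultdict

def group_documents(docs, group_by, subgroup_by=None, sort_docs_by=None):
    """Simplified sketch of the grouping logic (not the real component).

    Documents are bucketed by their group_by meta value in order of first
    appearance, optionally sub-bucketed by subgroup_by, sorted within each
    bucket by sort_docs_by, then flattened. Documents missing the group_by
    key are appended at the end.
    """
    groups = defaultdict(list)
    ungrouped = []
    for doc in docs:
        key = doc["meta"].get(group_by)
        (groups[key] if key is not None else ungrouped).append(doc)

    ordered = []
    for members in groups.values():
        if subgroup_by:
            subgroups = defaultdict(list)
            for doc in members:
                subgroups[doc["meta"].get(subgroup_by)].append(doc)
            members = [
                d
                for sub in subgroups.values()
                for d in (sorted(sub, key=lambda d: d["meta"][sort_docs_by]) if sort_docs_by else sub)
            ]
        elif sort_docs_by:
            members = sorted(members, key=lambda d: d["meta"][sort_docs_by])
        ordered.extend(members)
    return ordered + ungrouped

# Plain dicts stand in for Document objects in this sketch
docs = [
    {"content": "B", "meta": {"group": "42", "split_id": 7}},
    {"content": "A", "meta": {"group": "42", "split_id": 3}},
    {"content": "C", "meta": {"group": "11", "split_id": 1}},
    {"content": "D", "meta": {}},
]
print([d["content"] for d in group_documents(docs, "group", sort_docs_by="split_id")])
# ['A', 'B', 'C', 'D']
```

Group "42" appears first, so its members come first (sorted by `split_id`), followed by group "11", with the ungrouped document last.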
Usage
On its own
```python
from haystack import Document
from haystack.components.rankers import MetaFieldGroupingRanker

docs = [
    Document(content="JavaScript is popular", meta={"group": "42", "split_id": 7, "subgroup": "subB"}),
    Document(content="Python is popular", meta={"group": "42", "split_id": 4, "subgroup": "subB"}),
    Document(content="A chromosome is DNA", meta={"group": "314", "split_id": 2, "subgroup": "subC"}),
    Document(content="An octopus has three hearts", meta={"group": "11", "split_id": 2, "subgroup": "subD"}),
    Document(content="Java is popular", meta={"group": "42", "split_id": 3, "subgroup": "subB"}),
]

# Group by "group", subgroup by "subgroup", and sort each (sub)group by "split_id"
ranker = MetaFieldGroupingRanker(group_by="group", subgroup_by="subgroup", sort_docs_by="split_id")
result = ranker.run(documents=docs)
print(result["documents"])
```
In a pipeline
The following pipeline uses `MetaFieldGroupingRanker` to organize documents by a specific meta field and sort them by page number. It then formats the organized documents into chat messages and passes them to `OpenAIChatGenerator` to create a structured content explanation.
```python
from haystack import Pipeline
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.rankers import MetaFieldGroupingRanker
from haystack.dataclasses import ChatMessage, Document

docs = [
    Document(
        content="Chapter 1: Introduction to Python",
        meta={"chapter": "1", "section": "intro", "page": 1}
    ),
    Document(
        content="Chapter 2: Basic Data Types",
        meta={"chapter": "2", "section": "basics", "page": 15}
    ),
    Document(
        content="Chapter 1: Python Installation",
        meta={"chapter": "1", "section": "setup", "page": 5}
    ),
]

ranker = MetaFieldGroupingRanker(
    group_by="chapter",
    subgroup_by="section",
    sort_docs_by="page"
)
chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "temperature": 0.7,
        "max_tokens": 500
    }
)

# First, run the ranker
ranked_result = ranker.run(documents=docs)
ranked_docs = ranked_result["documents"]

# Create chat messages with the ranked documents
messages = [
    ChatMessage.from_system("You are a helpful programming tutor."),
    ChatMessage.from_user(
        "Here are the course documents in order:\n" +
        "\n".join([f"- {doc.content}" for doc in ranked_docs]) +
        "\n\nBased on these documents, explain the structure of this Python course."
    )
]

# Create and run a pipeline for just the chat generator
pipeline = Pipeline()
pipeline.add_component("chat_generator", chat_generator)
result = pipeline.run(
    data={
        "chat_generator": {
            "messages": messages
        }
    }
)
print(result["chat_generator"]["replies"][0])
```
