MetaFieldGroupingRanker

Reorders documents by grouping them based on metadata keys.

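Most common position in a pipeline: in a query pipeline, after a component that returns a list of documents, such as a Retriever
Mandatory init variables: "group_by" (the name of the meta field to group by)
Mandatory run variables: "documents" (a list of documents to group)
Output variables: "documents" (the grouped list of documents)
API reference: Rankers
GitHub link: https://github.com/deepset-ai/haystack/blob/main/haystack/components/rankers/meta_field_grouping_ranker.py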

Overview

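The MetaFieldGroupingRanker component groups documents by a primary metadata key, group_by, and sub-groups them by an optional secondary key, subgroup_by.
Within each group or subgroup, the component can also sort documents by a metadata key, sort_docs_by.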

The output is a flat list of documents ordered by the group_by and subgroup_by values. Any documents without a group are placed at the end of the list.
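
The ordering described above can be illustrated with a minimal plain-Python sketch. It is not the component's actual implementation: the helper group_and_flatten is hypothetical, documents are represented as plain dictionaries, and the sketch assumes that groups appear in the order they are first seen and that sorting is ascending. The "No group here" entry is made-up sample data.

```python
from collections import defaultdict


def group_and_flatten(docs, group_by, subgroup_by=None, sort_docs_by=None):
    """Return docs as a flat list ordered by group, then subgroup (illustrative only)."""
    groups = defaultdict(lambda: defaultdict(list))
    ungrouped = []

    for doc in docs:
        meta = doc.get("meta", {})
        if group_by not in meta:
            # Documents without the grouping key are collected separately.
            ungrouped.append(doc)
            continue
        subgroup = meta.get(subgroup_by) if subgroup_by else None
        groups[meta[group_by]][subgroup].append(doc)

    ordered = []
    # Assumption: groups and subgroups are emitted in first-seen order.
    for subgroups in groups.values():
        for members in subgroups.values():
            if sort_docs_by:
                # Assumption: ascending sort, documents missing the key go last.
                members = sorted(
                    members,
                    key=lambda d: d["meta"].get(sort_docs_by, float("inf")),
                )
            ordered.extend(members)

    # Ungrouped documents are appended at the end of the flat list.
    return ordered + ungrouped


docs = [
    {"content": "JavaScript is popular", "meta": {"group": "42", "split_id": 7, "subgroup": "subB"}},
    {"content": "A chromosome is DNA", "meta": {"group": "314", "split_id": 2, "subgroup": "subC"}},
    {"content": "No group here", "meta": {"split_id": 1}},
    {"content": "Python is popular", "meta": {"group": "42", "split_id": 4, "subgroup": "subB"}},
]

ordered = group_and_flatten(docs, "group", subgroup_by="subgroup", sort_docs_by="split_id")
print([d["content"] for d in ordered])
```

In this sketch, the two documents in group "42" end up next to each other and sorted by split_id, and the document without a group value comes last.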

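Because related documents end up next to each other in the list, the component can help improve the efficiency and performance of subsequent processing by an LLM.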

Usage

On its own

from haystack.components.rankers import MetaFieldGroupingRanker
from haystack import Document

docs = [
    Document(content="JavaScript is popular", meta={"group": "42", "split_id": 7, "subgroup": "subB"}),
    Document(content="Python is popular", meta={"group": "42", "split_id": 4, "subgroup": "subB"}),
    Document(content="A chromosome is DNA", meta={"group": "314", "split_id": 2, "subgroup": "subC"}),
    Document(content="An octopus has three hearts", meta={"group": "11", "split_id": 2, "subgroup": "subD"}),
    Document(content="Java is popular", meta={"group": "42", "split_id": 3, "subgroup": "subB"}),
]

# Group by "group", sub-group by "subgroup", and sort each subgroup by "split_id"
ranker = MetaFieldGroupingRanker(group_by="group", subgroup_by="subgroup", sort_docs_by="split_id")
result = ranker.run(documents=docs)
print(result["documents"])

In a pipeline

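The following example uses the MetaFieldGroupingRanker to organize documents by specific meta fields and sort them by page number. It then formats the organized documents into chat messages and passes them to the OpenAIChatGenerator to create a structured explanation of the content.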

from haystack import Pipeline
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.rankers import MetaFieldGroupingRanker
from haystack.dataclasses import Document, ChatMessage

docs = [
    Document(
        content="Chapter 1: Introduction to Python",
        meta={"chapter": "1", "section": "intro", "page": 1}
    ),
    Document(
        content="Chapter 2: Basic Data Types",
        meta={"chapter": "2", "section": "basics", "page": 15}
    ),
    Document(
        content="Chapter 1: Python Installation",
        meta={"chapter": "1", "section": "setup", "page": 5}
    ),
]

ranker = MetaFieldGroupingRanker(
    group_by="chapter",
    subgroup_by="section",
    sort_docs_by="page"
)

# By default, OpenAIChatGenerator reads the API key from the OPENAI_API_KEY environment variable
chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "temperature": 0.7,
        "max_tokens": 500
    }
)

# First run the ranker
ranked_result = ranker.run(documents=docs)
ranked_docs = ranked_result["documents"]

# Create chat messages with the ranked documents
messages = [
    ChatMessage.from_system("You are a helpful programming tutor."),
    ChatMessage.from_user(
        "Here are the course documents in order:\n" +
        "\n".join([f"- {doc.content}" for doc in ranked_docs]) +
        "\n\nBased on these documents, explain the structure of this Python course."
    )
]

# Create and run pipeline for just the chat generator
pipeline = Pipeline()
pipeline.add_component("chat_generator", chat_generator)

result = pipeline.run(
    data={
        "chat_generator": {
            "messages": messages
        }
    }
)

print(result["chat_generator"]["replies"][0])