
LLMMetadataExtractor

Extracts metadata from documents using a Large Language Model (LLM). The metadata is extracted by providing the LLM with a prompt that instructs it to generate the metadata.

Most common position in a pipeline: In an indexing pipeline, after a PreProcessor
Mandatory init variables: "prompt": A prompt that instructs the LLM how to extract metadata from documents

"chat_generator": A Chat Generator instance representing an LLM, configured to return a JSON object
Mandatory run variables: "documents": A list of documents
Output variables: "documents": A list of documents
API reference: Extractors
GitHub link: https://github.com/deepset-ai/haystack/blob/main/haystack/components/extractors/llm_metadata_extractor.py

Overview

LLMMetadataExtractor relies on an LLM and a prompt to perform metadata extraction. At initialization, it expects a Chat Generator instance representing the LLM and a prompt that describes the metadata extraction process.

The prompt should contain a variable called document, which will point to a single document in the list of documents. So, to access the content of a document, you can use {{ document.content }} in the prompt.
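The {{ document.content }} syntax is Jinja templating. As a minimal sketch of how the document variable gets filled in (using jinja2 directly purely for illustration; the component handles the rendering for you, and the dict here just stands in for a Haystack Document):

```python
from jinja2 import Template

# A minimal prompt with the required `document` variable.
prompt = "Extract named entities from this text: {{ document.content }}"

# The extractor renders the template once per document; a plain dict
# mimics the attributes accessed in the template.
rendered = Template(prompt).render(document={"content": "deepset was founded in 2018 in Berlin"})
print(rendered)
```

The rendered string, with the document content substituted in, is what is actually sent to the LLM.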

At runtime, it receives a list of documents and runs the LLM on each document in the list, extracting metadata from it. The metadata is added to the document's meta field.

If the LLM fails to extract metadata from a document, that document is added to a failed_documents list. The metadata of these failed documents will contain the keys metadata_extraction_error and metadata_extraction_response.

These documents can be re-run with another extractor that uses the metadata_extraction_response and metadata_extraction_error in its prompt to extract the metadata.
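A rough sketch of that retry step (the stand-in dicts below only mimic the meta of failed documents; the two key names are the ones the component sets, everything else is illustrative):

```python
# Stand-in for the `failed_documents` output of a previous extractor run.
failed_documents = [
    {
        "content": "Hugging Face is a company founded in New York, USA ...",
        "meta": {
            "metadata_extraction_error": "response is not valid JSON",
            "metadata_extraction_response": '{"entities": [',
        },
    }
]

# Select documents that carry an extraction error so a second extractor
# (e.g. one whose prompt includes the failed response) can be run on them.
retry_docs = [d for d in failed_documents if "metadata_extraction_error" in d["meta"]]
print(len(retry_docs))
```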

The current implementation works with Haystack Chat Generators; the example below uses OpenAIChatGenerator.

Usage

Here is an example of using LLMMetadataExtractor to extract named entities and add them to the documents' metadata.

First, the necessary imports:

from haystack import Document
from haystack.components.extractors.llm_metadata_extractor import LLMMetadataExtractor
from haystack.components.generators.chat import OpenAIChatGenerator

Then, define some documents:

docs = [
    Document(content="deepset was founded in 2018 in Berlin, and is known for its Haystack framework"),
    Document(content="Hugging Face is a company founded in New York, USA and is known for its Transformers library"),
]

Now, a prompt that extracts named entities from documents:

NER_PROMPT = '''
    -Goal-
    Given text and a list of entity types, identify all entities of those types from the text.

    -Steps-
    1. Identify all entities. For each identified entity, extract the following information:
    - entity_name: Name of the entity, capitalized
    - entity_type: One of the following types: [organization, product, service, industry]
    Format each entity as a JSON like: {"entity": <entity_name>, "entity_type": <entity_type>}

    2. Return output in a single list with all the entities identified in step 1.

    -Examples-
    ######################
    Example 1:
    entity_types: [organization, person, partnership, financial metric, product, service, industry, investment strategy, market trend]
    text: Another area of strength is our co-brand issuance. Visa is the primary network partner for eight of the top
    10 co-brand partnerships in the US today and we are pleased that Visa has finalized a multi-year extension of
    our successful credit co-branded partnership with Alaska Airlines, a portfolio that benefits from a loyal customer
    base and high cross-border usage.
    We have also had significant co-brand momentum in CEMEA. First, we launched a new co-brand card in partnership
    with Qatar Airways, British Airways and the National Bank of Kuwait. Second, we expanded our strong global
    Marriott relationship to launch Qatar's first hospitality co-branded card with Qatar Islamic Bank. Across the
    United Arab Emirates, we now have exclusive agreements with all the leading airlines marked by a recent
    agreement with Emirates Skywards.
    And we also signed an inaugural Airline co-brand agreement in Morocco with Royal Air Maroc. Now newer digital
    issuers are equally
    ------------------------
    output:
    {"entities": [{"entity": "Visa", "entity_type": "company"}, {"entity": "Alaska Airlines", "entity_type": "company"}, {"entity": "Qatar Airways", "entity_type": "company"}, {"entity": "British Airways", "entity_type": "company"}, {"entity": "National Bank of Kuwait", "entity_type": "company"}, {"entity": "Marriott", "entity_type": "company"}, {"entity": "Qatar Islamic Bank", "entity_type": "company"}, {"entity": "Emirates Skywards", "entity_type": "company"}, {"entity": "Royal Air Maroc", "entity_type": "company"}]}
    #############################
    -Real Data-
    ######################
    entity_types: [company, organization, person, country, product, service]
    text: {{ document.content }}
    ######################
    output:
    '''

Now, set up a Chat Generator and an LLMMetadataExtractor that uses it to extract named entities from the documents:

chat_generator = OpenAIChatGenerator(
  generation_kwargs={
    "max_tokens": 500,
    "temperature": 0.0,
    "seed": 0,
    "response_format": {"type": "json_object"},
  },
  max_retries=1,
  timeout=60.0,
)

extractor = LLMMetadataExtractor(
  prompt=NER_PROMPT,
  chat_generator=chat_generator,
  expected_keys=["entities"],
  raise_on_failure=False,
)

extractor.warm_up()
extractor.run(documents=docs)

>> {'documents': [
  Document(id=..., content: 'deepset was founded in 2018 in Berlin, and is known for its Haystack framework',
           meta: {'entities': [{'entity': 'deepset', 'entity_type': 'company'}, {'entity': 'Berlin', 'entity_type': 'city'},
                               {'entity': 'Haystack', 'entity_type': 'product'}]}),
  Document(id=..., content: 'Hugging Face is a company founded in New York, USA and is known for its Transformers library',
           meta: {'entities': [
             {'entity': 'Hugging Face', 'entity_type': 'company'}, {'entity': 'New York', 'entity_type': 'city'},
             {'entity': 'USA', 'entity_type': 'country'}, {'entity': 'Transformers', 'entity_type': 'product'}
           ]})
 ],
 'failed_documents': []
}
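The expected_keys check passed above works roughly like the following sketch (an illustration of the contract, not the component's actual code): the LLM reply must parse as JSON and contain every expected key before it is merged into the document's meta.

```python
import json

# Hypothetical LLM reply for the first document.
reply = '{"entities": [{"entity": "deepset", "entity_type": "company"}]}'
expected_keys = ["entities"]

parsed = json.loads(reply)  # a non-JSON reply would raise here -> failed document
missing = [k for k in expected_keys if k not in parsed]
if missing:
    # With raise_on_failure=False, the document would instead be routed
    # to failed_documents, with the error recorded in its meta.
    print("missing keys:", missing)
else:
    print("metadata to merge:", parsed)
```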