LLMMetadataExtractor
Extracts metadata from documents using a Large Language Model (LLM). The metadata is extracted by providing the LLM with a prompt that instructs it to generate the metadata.
| Most common position in a pipeline | In indexing pipelines, after PreProcessors |
| Mandatory init variables | "prompt": The prompt that instructs the LLM how to extract metadata from the document. "chat_generator": A Chat Generator instance that represents the LLM, configured to return a JSON object |
| Mandatory run variables | "documents": A list of documents |
| Output variables | "documents": A list of documents |
| API reference | Extractors |
| GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/extractors/llm_metadata_extractor.py |
Overview
LLMMetadataExtractor relies on an LLM and a prompt to perform metadata extraction. At initialization time, it expects the LLM in the form of a Haystack Chat Generator, together with a prompt that describes the metadata extraction task.
The prompt should contain a variable called document, which points to a single document from the document list. To access the document's content in the prompt, use {{ document.content }}.
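For example, a minimal prompt that asks for a JSON object with a single topics key could reference the document like this (a hypothetical prompt, shown only to illustrate the document variable; the Usage section below uses a more complete NER prompt):

```python
# Hypothetical minimal prompt; only the Jinja variable {{ document.content }} is required.
TOPIC_PROMPT = '''
Extract up to three topics from the text below.
Return only a JSON object of the form {"topics": ["<topic>", ...]}.

text: {{ document.content }}
output:
'''
```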
At runtime, the component receives a list of documents and runs the LLM on each document in the list, extracting metadata from it. The extracted metadata is added to the document's meta field.
If the LLM fails to extract metadata from a document, that document is added to a failed_documents list. The metadata of a failed document will contain the keys metadata_extraction_error and metadata_extraction_response. Such documents can be re-run with another extractor that uses the metadata_extraction_response and metadata_extraction_error in its prompt to extract the metadata.
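For instance, failed documents could be inspected and retried with a second extractor roughly like this (a sketch only; extractor, retry_extractor, and docs stand for LLMMetadataExtractor instances and documents configured as in the Usage section below):

```python
# Sketch: `extractor`, `retry_extractor`, and `docs` are placeholders for
# LLMMetadataExtractor instances and documents set up as in the Usage section.
result = extractor.run(documents=docs)
documents = result["documents"]
failed = result["failed_documents"]

for doc in failed:
    # Inspect why the extraction failed and what the LLM returned.
    print(doc.meta["metadata_extraction_error"])
    print(doc.meta["metadata_extraction_response"])

if failed:
    # Retry with a second extractor whose prompt can take the stored error
    # and previous LLM response into account.
    retry_result = retry_extractor.run(documents=failed)
    documents += retry_result["documents"]
```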
The current implementation works with Haystack Chat Generators; the example below uses OpenAIChatGenerator.
Usage
Here is an example of using LLMMetadataExtractor to extract named entities and add them to the documents' metadata.
First, the necessary imports:
from haystack import Document
from haystack.components.extractors.llm_metadata_extractor import LLMMetadataExtractor
from haystack.components.generators.chat import OpenAIChatGenerator
Then, define some documents:
docs = [
    Document(content="deepset was founded in 2018 in Berlin, and is known for its Haystack framework"),
    Document(content="Hugging Face is a company founded in New York, USA and is known for its Transformers library"),
]
Now, a prompt to extract named entities from the documents:
NER_PROMPT = '''
-Goal-
Given text and a list of entity types, identify all entities of those types from the text.
-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: One of the following types: [organization, product, service, industry]
Format each entity as a JSON like: {"entity": <entity_name>, "entity_type": <entity_type>}
2. Return output in a single list with all the entities identified in steps 1.
-Examples-
######################
Example 1:
entity_types: [organization, person, partnership, financial metric, product, service, industry, investment strategy, market trend]
text: Another area of strength is our co-brand issuance. Visa is the primary network partner for eight of the top
10 co-brand partnerships in the US today and we are pleased that Visa has finalized a multi-year extension of
our successful credit co-branded partnership with Alaska Airlines, a portfolio that benefits from a loyal customer
base and high cross-border usage.
We have also had significant co-brand momentum in CEMEA. First, we launched a new co-brand card in partnership
with Qatar Airways, British Airways and the National Bank of Kuwait. Second, we expanded our strong global
Marriott relationship to launch Qatar's first hospitality co-branded card with Qatar Islamic Bank. Across the
United Arab Emirates, we now have exclusive agreements with all the leading airlines marked by a recent
agreement with Emirates Skywards.
And we also signed an inaugural Airline co-brand agreement in Morocco with Royal Air Maroc. Now newer digital
issuers are equally
------------------------
output:
{"entities": [{"entity": "Visa", "entity_type": "company"}, {"entity": "Alaska Airlines", "entity_type": "company"}, {"entity": "Qatar Airways", "entity_type": "company"}, {"entity": "British Airways", "entity_type": "company"}, {"entity": "National Bank of Kuwait", "entity_type": "company"}, {"entity": "Marriott", "entity_type": "company"}, {"entity": "Qatar Islamic Bank", "entity_type": "company"}, {"entity": "Emirates Skywards", "entity_type": "company"}, {"entity": "Royal Air Maroc", "entity_type": "company"}]}
#############################
-Real Data-
######################
entity_types: [company, organization, person, country, product, service]
text: {{ document.content }}
######################
output:
'''
Now, set up the LLMMetadataExtractor with an OpenAIChatGenerator configured to return JSON objects, and run it over the documents to extract named entities:
chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "max_tokens": 500,
        "temperature": 0.0,
        "seed": 0,
        "response_format": {"type": "json_object"},
    },
    max_retries=1,
    timeout=60.0,
)

extractor = LLMMetadataExtractor(
    prompt=NER_PROMPT,
    chat_generator=chat_generator,
    expected_keys=["entities"],
    raise_on_failure=False,
)

extractor.warm_up()
extractor.run(documents=docs)
>> {'documents': [
    Document(id=.., content: 'deepset was founded in 2018 in Berlin, and is known for its Haystack framework',
        meta: {'entities': [{'entity': 'deepset', 'entity_type': 'company'}, {'entity': 'Berlin', 'entity_type': 'city'},
                            {'entity': 'Haystack', 'entity_type': 'product'}]}),
    Document(id=.., content: 'Hugging Face is a company founded in New York, USA and is known for its Transformers library',
        meta: {'entities': [
            {'entity': 'Hugging Face', 'entity_type': 'company'}, {'entity': 'New York', 'entity_type': 'city'},
            {'entity': 'USA', 'entity_type': 'country'}, {'entity': 'Transformers', 'entity_type': 'product'}
        ]})
    ],
 'failed_documents': []
}
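The extractor can also be placed inside an indexing pipeline, after any preprocessing steps and before a writer. Here is a minimal sketch, assuming an InMemoryDocumentStore and reusing the extractor and docs defined above (the component names in the pipeline are illustrative):

```python
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

indexing = Pipeline()
indexing.add_component("extractor", extractor)  # the LLMMetadataExtractor configured above
indexing.add_component("writer", DocumentWriter(document_store=document_store))
indexing.connect("extractor.documents", "writer.documents")

indexing.run(data={"extractor": {"documents": docs}})
# The documents written to the store now carry the extracted entities in their meta field.
```

Documents that fail extraction remain on the extractor's failed_documents output and are not passed on to the writer.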
