LLMMetadataExtractor
Extracts metadata from documents using a Large Language Model (LLM). The metadata is extracted by providing the LLM with a prompt that instructs it to generate the metadata.
| Most common position in a pipeline | In indexing pipelines, after PreProcessors |
| Mandatory init variables | "prompt": The prompt that instructs the LLM how to extract metadata from the document. "chat_generator": A Chat Generator instance that represents the LLM, configured to return a JSON object |
| Mandatory run variables | "documents": A list of documents |
| Output variables | "documents": A list of documents |
| API reference | Extractors |
| GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/extractors/llm_metadata_extractor.py |
Overview
LLMMetadataExtractor relies on an LLM and a prompt to perform metadata extraction. At initialization time, it expects the LLM in the form of a Haystack Chat Generator, together with a prompt that describes the metadata extraction task.
The prompt should contain a variable called document, which points to a single document from the document list. To access the document's content in the prompt, use {{ document.content }}.
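For example, a minimal prompt that asks for a JSON object with a single topics key could reference the document like this (a hypothetical prompt, shown only to illustrate the document variable; the Usage section below uses a more complete NER prompt):

```python
# Hypothetical minimal prompt; only the Jinja variable {{ document.content }} is required.
TOPIC_PROMPT = '''
Extract up to three topics from the text below.
Return only a JSON object of the form {"topics": ["<topic>", ...]}.

text: {{ document.content }}
output:
'''
```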
At runtime, the component receives a list of documents and runs the LLM on each document in the list, extracting metadata from it. The extracted metadata is added to the document's meta field.
If the LLM fails to extract metadata from a document, that document is added to a failed_documents list. The metadata of a failed document will contain the keys metadata_extraction_error and metadata_extraction_response. Such documents can be re-run with another extractor that uses the metadata_extraction_response and metadata_extraction_error in its prompt to extract the metadata.
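For instance, failed documents could be inspected and retried with a second extractor roughly like this (a sketch only; extractor, retry_extractor, and docs stand for LLMMetadataExtractor instances and documents configured as in the Usage section below):

```python
# Sketch: `extractor`, `retry_extractor`, and `docs` are placeholders for
# LLMMetadataExtractor instances and documents set up as in the Usage section.
result = extractor.run(documents=docs)
documents = result["documents"]
failed = result["failed_documents"]

for doc in failed:
    # Inspect why the extraction failed and what the LLM returned.
    print(doc.meta["metadata_extraction_error"])
    print(doc.meta["metadata_extraction_response"])

if failed:
    # Retry with a second extractor whose prompt can take the stored error
    # and previous LLM response into account.
    retry_result = retry_extractor.run(documents=failed)
    documents += retry_result["documents"]
```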
The current implementation works with Haystack Chat Generators; the example below uses OpenAIChatGenerator.
Usage
Here is an example of using LLMMetadataExtractor to extract named entities and add them to the documents' metadata.
First, the necessary imports:
from haystack import Document
from haystack.components.extractors.llm_metadata_extractor import LLMMetadataExtractor
from haystack.components.generators.chat import OpenAIChatGenerator
Then, define some documents:
docs = [
    Document(content="deepset was founded in 2018 in Berlin, and is known for its Haystack framework"),
    Document(content="Hugging Face is a company founded in New York, USA and is known for its Transformers library"),
]
Now, a prompt to extract named entities from the documents:
NER_PROMPT = '''
-Goal-
Given text and a list of entity types, identify all entities of those types from the text.
-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: One of the following types: [organization, product, service, industry]
Format each entity as a JSON like: {"entity": <entity_name>, "entity_type": <entity_type>}
2. Return output in a single list with all the entities identified in steps 1.
-Examples-
######################
Example 1:
entity_types: [organization, person, partnership, financial metric, product, service, industry, investment strategy, market trend]
text: Another area of strength is our co-brand issuance. Visa is the primary network partner for eight of the top
10 co-brand partnerships in the US today and we are pleased that Visa has finalized a multi-year extension of
our successful credit co-branded partnership with Alaska Airlines, a portfolio that benefits from a loyal customer
base and high cross-border usage.
We have also had significant co-brand momentum in CEMEA. First, we launched a new co-brand card in partnership
with Qatar Airways, British Airways and the National Bank of Kuwait. Second, we expanded our strong global
Marriott relationship to launch Qatar's first hospitality co-branded card with Qatar Islamic Bank. Across the
United Arab Emirates, we now have exclusive agreements with all the leading airlines marked by a recent
agreement with Emirates Skywards.
And we also signed an inaugural Airline co-brand agreement in Morocco with Royal Air Maroc. Now newer digital
issuers are equally
------------------------
output:
{"entities": [{"entity": "Visa", "entity_type": "company"}, {"entity": "Alaska Airlines", "entity_type": "company"}, {"entity": "Qatar Airways", "entity_type": "company"}, {"entity": "British Airways", "entity_type": "company"}, {"entity": "National Bank of Kuwait", "entity_type": "company"}, {"entity": "Marriott", "entity_type": "company"}, {"entity": "Qatar Islamic Bank", "entity_type": "company"}, {"entity": "Emirates Skywards", "entity_type": "company"}, {"entity": "Royal Air Maroc", "entity_type": "company"}]}
#############################
-Real Data-
######################
entity_types: [company, organization, person, country, product, service]
text: {{ document.content }}
######################
output:
'''
Now, set up the LLMMetadataExtractor with an OpenAIChatGenerator configured to return JSON objects, and run it over the documents to extract named entities:
chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "max_tokens": 500,
        "temperature": 0.0,
        "seed": 0,
        "response_format": {"type": "json_object"},
    },
    max_retries=1,
    timeout=60.0,
)

extractor = LLMMetadataExtractor(
    prompt=NER_PROMPT,
    chat_generator=chat_generator,
    expected_keys=["entities"],
    raise_on_failure=False,
)

extractor.warm_up()
extractor.run(documents=docs)
>> {'documents': [
    Document(id=.., content: 'deepset was founded in 2018 in Berlin, and is known for its Haystack framework',
        meta: {'entities': [{'entity': 'deepset', 'entity_type': 'company'}, {'entity': 'Berlin', 'entity_type': 'city'},
                            {'entity': 'Haystack', 'entity_type': 'product'}]}),
    Document(id=.., content: 'Hugging Face is a company founded in New York, USA and is known for its Transformers library',
        meta: {'entities': [
            {'entity': 'Hugging Face', 'entity_type': 'company'}, {'entity': 'New York', 'entity_type': 'city'},
            {'entity': 'USA', 'entity_type': 'country'}, {'entity': 'Transformers', 'entity_type': 'product'}
        ]})
    ],
 'failed_documents': []
}
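The extractor can also be placed inside an indexing pipeline, after any preprocessing steps and before a writer. Here is a minimal sketch, assuming an InMemoryDocumentStore and reusing the extractor and docs defined above (the component names in the pipeline are illustrative):

```python
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

indexing = Pipeline()
indexing.add_component("extractor", extractor)  # the LLMMetadataExtractor configured above
indexing.add_component("writer", DocumentWriter(document_store=document_store))
indexing.connect("extractor.documents", "writer.documents")

indexing.run(data={"extractor": {"documents": docs}})
# The documents written to the store now carry the extracted entities in their meta field.
```

Documents that fail extraction remain on the extractor's failed_documents output and are not passed on to the writer.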
