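Most common position in a pipeline	Before PreProcessors, or at the start of an indexing pipeline.
Mandatory init variables	One of the following, or both: "jq_schema": a jq filter string used to extract content; "content_key": a key string used to extract the document content
Mandatory run variables	"sources": a list of file paths or ByteStream objects
Output variables	"documents": a list of documents
API reference	Converters
GitHub link	https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/json.py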

Overview

JSONConverter converts one or more JSON files into text documents.

Parameters Overview

To initialize JSONConverter, you must provide the jq_schema parameter, the content_key parameter, or both.

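The jq_schema parameter is a jq filter used to extract nested data from the JSON files. For more information on the filter syntax, see the jq documentation. If it is not set, the whole JSON file is used.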
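As a rough illustration of what such a filter selects, the sketch below reproduces the jq filters `.laureates[]` and `.laureates[].surname` in plain Python. It does not call jq itself; the sample data and variable names are invented for this illustration.

```python
import json

# Hypothetical sample input, invented for this illustration.
raw = json.dumps(
    {
        "laureates": [
            {"firstname": "Enrico", "surname": "Fermi"},
            {"firstname": "Rita", "surname": "Levi-Montalcini"},
        ]
    }
)
data = json.loads(raw)

# jq filter ".laureates[]": emit each element of the "laureates" array in turn.
objects = list(data["laureates"])

# jq filter ".laureates[].surname": emit the "surname" value of each element.
surnames = [item["surname"] for item in data["laureates"]]

print(len(objects))
print(surnames)
```

The first filter yields JSON objects, while the second yields scalars (strings).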

The content_key parameter lets you specify which key in the extracted data is used as the document's content.

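If both jq_schema and content_key are set, content_key is searched for in the data extracted by jq_schema. Non-object data is skipped.
If only jq_schema is set, the extracted value must be a scalar; objects and arrays are skipped.
If only content_key is set, the source must be a JSON object; otherwise it is skipped.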
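
The three cases above can be sketched in plain Python. This is only an illustration of the rules as stated here, not the library's implementation: `select_contents` is a hypothetical helper, `extracted` stands in for the values produced by jq_schema (or the whole parsed file when it is not set), and both skipping objects that lack the key and converting values with `str` are assumptions.

```python
from typing import Any, List, Optional


def select_contents(extracted: List[Any], content_key: Optional[str] = None) -> List[str]:
    """Hypothetical helper mirroring the skipping rules described above."""
    contents = []
    for value in extracted:
        if content_key is not None:
            # content_key is set: only JSON objects (dicts) can be searched.
            if not isinstance(value, dict):
                continue
            # Assumption: an object that lacks the key is skipped as well.
            if content_key not in value:
                continue
            contents.append(str(value[content_key]))
        else:
            # Only jq_schema is set: the value must be a scalar.
            if isinstance(value, (dict, list)):
                continue
            contents.append(str(value))
    return contents


mixed = [{"text": "first"}, "plain string", ["a", "b"], {"other": 1}]
print(select_contents(mixed, content_key="text"))  # ['first']
print(select_contents(mixed))  # ['plain string']
```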

For the full list of parameters, see the API reference.

Usage

You need to install the jq package to use this converter.

pip install jq

Example

Here is an example of simple usage of the component:

import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

source = ByteStream.from_string(json.dumps({"text": "This is the content of my document"}))

converter = JSONConverter(content_key="text")
results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'This is the content of my document'

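In the more complex example below, we provide a jq_schema string to filter the JSON source files and use extra_meta_fields to extract additional information from the filtered data.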

import json

from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream

data = {
  "laureates": [
    {
      "firstname": "Enrico",
      "surname": "Fermi",
      "motivation": "for his demonstrations of the existence of new radioactive elements produced "
      "by neutron irradiation, and for his related discovery of nuclear reactions brought about by"
      " slow neutrons",
    },
    {
      "firstname": "Rita",
      "surname": "Levi-Montalcini",
      "motivation": "for their discoveries of growth factors",
    },
  ],
}
source = ByteStream.from_string(json.dumps(data))
converter = JSONConverter(
  jq_schema=".laureates[]", content_key="motivation", extra_meta_fields={"firstname", "surname"}
)

results = converter.run(sources=[source])
documents = results["documents"]
print(documents[0].content)
# 'for his demonstrations of the existence of new radioactive elements produced by
# neutron irradiation, and for his related discovery of nuclear reactions brought
# about by slow neutrons'

print(documents[0].meta)
# {'firstname': 'Enrico', 'surname': 'Fermi'}

print(documents[1].content)
# 'for their discoveries of growth factors'

print(documents[1].meta)
# {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}