【Question title】: How to get nested keys of a JSON stream using jq
【Posted】: 2018-04-08 23:11:30
【Question】:

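I am trying to design some relational tables to hold the parsed output of various JSON streams. The streams have a fairly complex structure, and to make the table design easier I need to know the nested keys at every level of each stream. I am lost as to how to get every nested key from a stream using jq. Below is a simplified, representative JSON stream.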

{
  "startAt": 0,
  "total": 5315,
  "issues": [
    {
      "id": "44269",
      "name": "someName",
      "fields": {
        "fixVersions": [
          {
            "id": "11401",
            "releaseDate": "2016-09-30"
          }
        ],
        "status": {
          "id": "10110",
          "statusCategory": {
            "id": 3,
            "name": "Done"
          }
        }
      }
    },
    {
      "id": "44270",
      "key": "LEAD-XXXX",
      "fields": {
        "assignee": {
          "id": "10111",
          "name": "Don"
        },
        "status": {
          "id": "10110",
          "statusCategory": {
            "id": 2,
            "name": "inProgress"
          }
        }
      }
    }
  ]
}

I am expecting the following output. If there is a better way to help with my table design, I would be very glad to hear it.

startAt
total
issues: []
issues:id
issues:name
issues:key
issues:fields
issues:fields:fixVersions: []
issues:fields:fixVersions:id
issues:fields:fixVersions:releaseDate
issues:fields:status
issues:fields:status:id
issues:fields:status:statusCategory
issues:fields:status:statusCategory:id
issues:fields:status:statusCategory:name
issues:fields:assignee
issues:fields:assignee:id
issues:fields:assignee:name

How can I get the nested keys of the above stream using jq? Any help is much appreciated.

【Discussion】:

    Tags: json schema jq


    【Solution 1】:

    "If there is a better way, I would be very glad ..."

    If I were you, I would start (and perhaps end) with the following:

    [paths(scalars) | map(if type == "number" then 0 else . end)]
    | unique
    | .[]
    

    With your example and the -cr command-line options, this produces:

    ["issues",0,"fields","assignee","id"]
    ["issues",0,"fields","assignee","name"]
    ["issues",0,"fields","fixVersions",0,"id"]
    ["issues",0,"fields","fixVersions",0,"releaseDate"]
    ["issues",0,"fields","status","id"]
    ["issues",0,"fields","status","statusCategory","id"]
    ["issues",0,"fields","status","statusCategory","name"]
    ["issues",0,"id"]
    ["issues",0,"key"]
    ["issues",0,"name"]
    ["startAt"]
    ["total"]
    

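    You can get closer to the format you asked for by mapping each numeric array index to a string instead of 0, but you then have to watch for collisions between that string and real key names. To illustrate: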

    [paths(scalars) | map(if type == "number" then "[]" else . end)]
    | unique
    | .[]
    | join(":")
    

    With the -r command-line option, this produces:

    issues:[]:fields:assignee:id
    issues:[]:fields:assignee:name
    issues:[]:fields:fixVersions:[]:id
    issues:[]:fields:fixVersions:[]:releaseDate
    issues:[]:fields:status:id
    issues:[]:fields:status:statusCategory:id
    issues:[]:fields:status:statusCategory:name
    issues:[]:id
    issues:[]:key
    issues:[]:name
    startAt
    total
    

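    Note that this approach yields essentially the same results as the approach based on schema inference. That is a good thing.

    Using INDEX/2

    Using unique/0 as above has two potential drawbacks: (1) because unique sorts its input, the ordering of the output does not reflect the ordering in the data; (2) efficiency (though in practice this is unlikely to be a real issue, except perhaps for JSON texts with a very large number of leaf paths).

    In any case, INDEX/2 can be used instead of unique. If your jq does not have INDEX/2, its definition is given here.

    In brief: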

    def INDEX(stream; idx_expr):
      reduce stream as $row ({};
        .[$row|idx_expr|
          if type != "string" then tojson
          else .
          end] |= $row);
    
    INDEX(paths(scalars)
          | map(if type == "number" then "[]" else . end); .)
    | .[]
    | join(":")
    

    yields:

    startAt
    total
    issues:[]:id
    issues:[]:name
    issues:[]:fields:fixVersions:[]:id
    issues:[]:fields:fixVersions:[]:releaseDate
    issues:[]:fields:status:id
    issues:[]:fields:status:statusCategory:id
    issues:[]:fields:status:statusCategory:name
    issues:[]:key
    issues:[]:fields:assignee:id
    issues:[]:fields:assignee:name
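
As a cross-check outside jq, the same order-preserving enumeration can be sketched in Python. This is only an illustrative sketch and not part of the original answer: the names `leaf_paths` and `unique_leaf_paths` are made up here, and the result may differ from jq's `paths(scalars)` for leaves whose value is null or false.

```python
import json


def leaf_paths(node, prefix=()):
    """Yield the path to every scalar leaf, writing "[]" in place of array indices."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from leaf_paths(value, prefix + (key,))
    elif isinstance(node, list):
        for item in node:
            yield from leaf_paths(item, prefix + ("[]",))
    else:
        # Scalar leaf: emit the accumulated path.
        yield prefix


def unique_leaf_paths(doc):
    """Return colon-joined leaf paths, de-duplicated in first-seen order."""
    # dict.fromkeys drops duplicates while keeping insertion order,
    # which plays the role that INDEX/2 plays in the jq filter.
    return [":".join(path) for path in dict.fromkeys(leaf_paths(doc))]


if __name__ == "__main__":
    # Hypothetical miniature input, shaped like the question's sample.
    sample = json.loads(
        '{"total": 2, "issues": [{"id": "1"}, {"id": "2", "key": "K"}]}'
    )
    for line in unique_leaf_paths(sample):
        print(line)
```

Running both the jq filter and this sketch over the same file and comparing the output is a cheap way to confirm that no key was missed before committing to a table design.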
    

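    Paths to empty arrays

    If you also want the paths to empty arrays to be reported, you can (for example) simply change "paths(scalars)" to "(paths(scalars),paths(arrays))". Note that paths(arrays) matches every array, not only empty ones; to restrict the addition to empty arrays, use paths(. == []) instead.

    【Discussion】:

    • You clearly have a solid understanding of jq, to say the least. I will learn a bit about jq programming just by studying your code. Handling empty arrays is a nice touch. For my purposes, your first solution is good enough. Thanks a lot.
    【Solution 2】: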

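    If you want a schematic representation of your data, you might want to consider an approach based on schema inference.

    For example, using the schema function defined at https://gist.github.com/pkoppstein/a5abb4ebef3b0f72a6ed, your input yields the following inferred schema: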

    {
      "startAt": "number",
      "total": "number",
      "issues": [
        {
          "fields": {
            "assignee": {
              "id": "string",
              "name": "string"
            },
            "fixVersions": [
              {
                "id": "string",
                "releaseDate": "string"
              }
            ],
            "status": {
              "id": "string",
              "statusCategory": {
                "id": "number",
                "name": "string"
              }
            }
          },
          "id": "string",
          "key": "string",
          "name": "string"
        }
      ]
    }
    

    If you pipe this through paths(scalars), you get:

    ["startAt"]
    ["total"]
    ["issues",0,"fields","assignee","id"]
    ["issues",0,"fields","assignee","name"]
    ["issues",0,"fields","fixVersions",0,"id"]
    ["issues",0,"fields","fixVersions",0,"releaseDate"]
    ["issues",0,"fields","status","id"]
    ["issues",0,"fields","status","statusCategory","id"]
    ["issues",0,"fields","status","statusCategory","name"]
    ["issues",0,"id"]
    ["issues",0,"key"]
    ["issues",0,"name"]
    

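    Apart from the ordering, these results are the same as those obtained using the more direct approach, which, I would suggest, validates both approaches.

    【Discussion】:

      【Solution 3】: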
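
To make the idea of schema inference concrete, here is a deliberately simplified Python sketch. It is not the schema function from the gist and makes no claim to match its conventions: the names `infer` and `merge`, and the "mixed" label for conflicting types, are assumptions made purely for illustration.

```python
def merge(a, b):
    """Combine two inferred schemas into one that covers both."""
    if a is None:
        return b
    if isinstance(a, dict) and isinstance(b, dict):
        out = dict(a)
        for key, value in b.items():
            # Keys seen in only one object are kept; shared keys are merged.
            out[key] = merge(out[key], value) if key in out else value
        return out
    if isinstance(a, list) and isinstance(b, list):
        if not a:
            return b
        if not b:
            return a
        return [merge(a[0], b[0])]
    # Assumption of this sketch: conflicting types collapse to "mixed".
    return a if a == b else "mixed"


def infer(node):
    """Replace every scalar by its JSON type name; fold array items together."""
    if isinstance(node, dict):
        return {key: infer(value) for key, value in node.items()}
    if isinstance(node, list):
        merged = None
        for item in node:
            merged = merge(merged, infer(item))
        return [] if merged is None else [merged]
    if isinstance(node, bool):  # must be tested before int: bool is a subclass of int
        return "boolean"
    if node is None:
        return "null"
    if isinstance(node, (int, float)):
        return "number"
    return "string"


if __name__ == "__main__":
    # Hypothetical miniature input: two issues with different key sets.
    print(infer({"issues": [{"id": "1", "n": 3}, {"id": "2", "key": "K"}]}))
```

The point of the design is that all elements of an array are folded into a single representative element, so keys that appear in only some elements (such as key or assignee in the question's data) still show up in the result.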

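      paths is definitely the way to go, but getting the exact output requested is a little tricky. Here is a filter that does so, apart from the exact ordering: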

      def normalize:    # convert paths to requested structure
          if .[-1]|type=="number" then .[-1]="[]" else . end
        | map(select(type!="number"));
      
      def collect:      # collect unique normalized paths into an object
        reduce (paths|normalize) as $p (
           {}
         ; if getpath($p)==null then setpath($p;null) else . end
        );
      
      def colonize($p): # convert object back into : separated paths
          keys_unsorted[] as $k
        | (if $p=="" then $k else "\($p):\($k)" end) as $n
        | $n, (.[$k] | if type=="object" then colonize($n) else empty end);
      
      def summary:      # final output without redundant foo: if foo:[] is present 
          [ collect | colonize("") ]
        | map(select(endswith(":[]"))|.[:-3]) as $remove
        | map(select($remove[[.]]==[]));
      
      summary[]
      

      Sample run (assuming the filter is in filter.jq and the data in data.json):

      $ jq -Mcr -f filter.jq data.json
      startAt
      total
      issues:[]
      issues:id
      issues:name
      issues:fields
      issues:fields:fixVersions:[]
      issues:fields:fixVersions:id
      issues:fields:fixVersions:releaseDate
      issues:fields:status
      issues:fields:status:id
      issues:fields:status:statusCategory
      issues:fields:status:statusCategory:id
      issues:fields:status:statusCategory:name
      issues:fields:assignee
      issues:fields:assignee:id
      issues:fields:assignee:name
      issues:key
      


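      Note that there is an issue with empty arrays here. If your data contains empty arrays, this filter will report them as ordinary fields, because the corresponding paths returned by paths do not end in a number. The simplest way to compensate is to first map empty arrays to non-empty ones, e.g. [{}]. For example: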

      def walk(f):  # defined here in case your jq doesn't have it
          . as $in
        | if type == "object" then reduce keys_unsorted[] as $key (
              {}; . + { ($key):  ($in[$key] | walk(f)) } ) | f
          elif type == "array" then map( walk(f) ) | f
          else f
          end;
      
        walk(if .==[] then [{}] else . end)
      | summary[]
      

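      【Discussion】:

        【Solution 4】:

        Just for the record: it is easy to write a jq filter that produces output in the format originally envisioned, though that format is unlikely to be of general use.

        The following approach avoids the need for walk/1 to handle the special case of empty arrays. It uses unique only because INDEX/2 is not included in jq version 1.5 (*).

        With the sample input and the -r command-line option, the following: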

         [paths as $p
          | if (getpath($p)|type) == "array" then $p + [" []"]
            elif ($p[-1]|type) == "number" then empty
            else $p
            end
            | map(select(type != "number"))]
         | unique
         | .[]
         | join(":")
        

        produces:

        issues: []
        issues:fields
        issues:fields:assignee
        issues:fields:assignee:id
        issues:fields:assignee:name
        issues:fields:fixVersions: []
        issues:fields:fixVersions:id
        issues:fields:fixVersions:releaseDate
        issues:fields:status
        issues:fields:status:id
        issues:fields:status:statusCategory
        issues:fields:status:statusCategory:id
        issues:fields:status:statusCategory:name
        issues:id
        issues:key
        issues:name
        startAt
        total
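
For readers who want to check the filter's logic against another implementation, the same rule can be sketched in Python: report every object key, mark array-valued keys with a trailing " []" component, and drop array indices. This is an illustrative sketch only; the names `key_outline` and `outline_lines` are hypothetical and do not come from the answer.

```python
def key_outline(node, prefix=()):
    """Yield key paths for every object member, skipping array indices."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = prefix + (key,)
            # Array-valued keys get a " []" marker, as in the jq filter.
            yield path + (" []",) if isinstance(value, list) else path
            yield from key_outline(value, path)
    elif isinstance(node, list):
        # Elements of an array share the path of the array itself.
        for item in node:
            yield from key_outline(item, prefix)


def outline_lines(doc):
    """Return the de-duplicated, sorted, colon-joined outline."""
    return [":".join(path) for path in sorted(set(key_outline(doc)))]


if __name__ == "__main__":
    # Hypothetical miniature input containing an empty array.
    sample = {"total": 1, "issues": [{"id": "1", "fields": {"tags": []}}]}
    for line in outline_lines(sample):
        print(line)
```

Because the marker is chosen by looking at the value's type rather than at whether a path ends in an index, an empty array such as tags above is still reported as an array.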
        

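        (*) unique can easily be avoided here by using INDEX/2, as described elsewhere on this page.

        【Discussion】:

        • Designing an RDBMS schema/table structure to hold JSON output from known sources has its own problems, because of the arrays and/or objects embedded in the streams. Knowing the exhaustive object and array structure helps in defining the table structure, normalization, and related design questions. That requirement is what prompted this question. The output format was just a way of expressing my thinking about the input; it need not be a general-purpose one. The solutions provided here are a good starting point for me to dig into the world of jq.
        • @kishore - It seems to me that you need a schema inference engine, such as the one defined in schema.jq, mentioned elsewhere on this page. It can be used to infer a "common schema" covering multiple sources, or to infer multiple schemas, one per source.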