【Question Title】: Jq: Gather JSON data for each object
【Posted】: 2018-05-22 15:04:09
【Question】:

I realize my title is not very clear, but I do not know how to phrase it better, so feel free to edit it!


Data
I have the following (simplified) JSON:

[
  {
    "genes_id": "eco:b0002",
    "entry_id": "b0002",
    "division": "CDS",
    "organism": "Escherichia coli K-12 MG1655",
    "organism_code": "eco",
    "organism_id": "T00007",
    "name": "thrA",
    "names": [
      "thrA"
    ],
    "definition": "(RefSeq) Bifunctional aspartokinase/homoserine dehydrogenase 1",
    "eclinks": [

    ],
    "orthologs": {
      "K12524": "bifunctional aspartokinase / homoserine dehydrogenase 1 [EC:2.7.2.4 1.1.1.3]"
    },
    "pathways": {
      "eco00260": "Glycine, serine and threonine metabolism",
      "eco00261": "Monobactam biosynthesis",
      "eco00270": "Cysteine and methionine metabolism",
      "eco00300": "Lysine biosynthesis",
      "eco01100": "Metabolic pathways",
      "eco01110": "Biosynthesis of secondary metabolites",
      "eco01120": "Microbial metabolism in diverse environments",
      "eco01130": "Biosynthesis of antibiotics",
      "eco01230": "Biosynthesis of amino acids"
    },
    "modules": {
      "eco_M00016": "Lysine biosynthesis, succinyl-DAP pathway, aspartate => lysine",
      "eco_M00017": "Methionine biosynthesis, apartate => homoserine => methionine",
      "eco_M00018": "Threonine biosynthesis, aspartate => homoserine => threonine"
    },
    "classes": [

    ],
    "position": "337..2799",
    "chromosome": null,
    "gbposition": "337..2799",
    "motifs": {
      "Pfam": [
        "Homoserine_dh",
        "AA_kinase",
        "NAD_binding_3",
        "ACT_7",
        "ACT",
        "Sacchrp_dh_NADP"
      ]
    },
    "dblinks": {
      "NCBI-GeneID": [
        "945803"
      ],
      "NCBI-ProteinID": [
        "NP_414543"
      ],
      "Pasteur": [
        "thrA"
      ],
      "RegulonDB": [
        "ECK120000987"
      ],
      "ECOCYC": [
        "EG10998"
      ],
      "ASAP": [
        "ABE-0000008"
      ],
      "UniProt": [
        "P00561"
      ]
    }
  },
  {
    "genes_id": "eco:b0003",
    "entry_id": "b0003",
    "division": "CDS",
    "organism": "Escherichia coli K-12 MG1655",
    "organism_code": "eco",
    "organism_id": "T00007",
    "name": "thrB",
    "names": [
      "thrB"
    ],
    "definition": "(RefSeq) homoserine kinase",
    "eclinks": [

    ],
    "orthologs": {
      "K00872": "homoserine kinase [EC:2.7.1.39]"
    },
    "pathways": {
      "eco00260": "Glycine, serine and threonine metabolism",
      "eco01100": "Metabolic pathways",
      "eco01110": "Biosynthesis of secondary metabolites",
      "eco01120": "Microbial metabolism in diverse environments",
      "eco01230": "Biosynthesis of amino acids"
    },
    "modules": {
      "eco_M00018": "Threonine biosynthesis, aspartate => homoserine => threonine"
    },
    "classes": [

    ],
    "position": "2801..3733",
    "chromosome": null,
    "gbposition": "2801..3733",
    "motifs": {
      "Pfam": [
        "GHMP_kinases_N",
        "GHMP_kinases_C"
      ]
    },
    "dblinks": {
      "NCBI-GeneID": [
        "947498"
      ],
      "NCBI-ProteinID": [
        "NP_414544"
      ],
      "Pasteur": [
        "thrB"
      ],
      "RegulonDB": [
        "ECK120000988"
      ],
      "ECOCYC": [
        "EG10999"
      ],
      "ASAP": [
        "ABE-0000010"
      ],
      "UniProt": [
        "P00547"
      ]
    }
  }
]

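Desired output
This is an array containing two objects. I am interested in the genes_id and pathways of each object, and would like a tab-separated file containing the following: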

eco:b0002   eco00260    Glycine, serine and threonine metabolism
eco:b0002   eco00261    Monobactam biosynthesis
eco:b0002   eco00270    Cysteine and methionine metabolism
eco:b0002   eco00300    Lysine biosynthesis
eco:b0002   eco01100    Metabolic pathways
eco:b0002   eco01110    Biosynthesis of secondary metabolites
eco:b0002   eco01120    Microbial metabolism in diverse environments
eco:b0002   eco01130    Biosynthesis of antibiotics
eco:b0002   eco01230    Biosynthesis of amino acids
eco:b0003   eco00260    Glycine, serine and threonine metabolism
eco:b0003   eco01100    Metabolic pathways
eco:b0003   eco01110    Biosynthesis of secondary metabolites
eco:b0003   eco01120    Microbial metabolism in diverse environments
eco:b0003   eco01230    Biosynthesis of amino acids

What I found
I know the data can be extracted in a format like this:

eco:b0002: list of pathways and ids
eco:b0003: list of pathways and ids

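However, I want to spread the pathways over separate lines, as in the example above. I could not find any information on how to do this with jq, so I doubt whether it is feasible at all. If it is possible, how can it be done with jq?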

【Comments】:

    Tags: json csv jq data-extraction


    【Solution 1】:

    Invocation: jq -rf totsv.jq input.json

    Program (totsv.jq):

    .[]
    | .genes_id as $id
    | .pathways
    | to_entries[]
    | [$id, .key, .value]
    | @tsv
    

    TSV is a good choice (and so is jq)!
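The program above can also be written inline, without a separate totsv.jq file. The sketch below is a defensive variant under one assumption that the sample data does not show: that some entries might lack a pathways object altogether. The file names sample.json and pathways.tsv and the id eco:b9999 are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical two-entry sample: the second entry has no "pathways" key.
cat > sample.json <<'EOF'
[
  {"genes_id": "eco:b0002",
   "pathways": {"eco00260": "Glycine, serine and threonine metabolism",
                "eco00300": "Lysine biosynthesis"}},
  {"genes_id": "eco:b9999"}
]
EOF

# -r prints raw strings, so the tabs are emitted as-is rather than JSON-quoted.
jq -r '
  .[]
  | .genes_id as $id     # remember the gene id before descending
  | (.pathways // {})    # treat a missing or null "pathways" as an empty object
  | to_entries[]         # one {key, value} pair per pathway
  | [$id, .key, .value]  # three columns: gene id, pathway id, pathway name
  | @tsv                 # join the columns with tabs
' sample.json > pathways.tsv

cat pathways.tsv
```

With this sample, the entry without pathways contributes no rows instead of raising an error, because `to_entries[]` on an empty object produces nothing.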

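    【Comments】:

    • Haha, you are a hero. I have been trying this all day, hahaha. Thanks a lot!
    • jq will reward you handsomely!
    • Hey peak, I tried this on my full dataset: togows.org/entry/kegg-genes/eco:b0002,eco:b0003.json but it does not seem to work. Of course, you answered the question correctly, since I was foolish enough to remove the (apparently) relevant keys. Could you take a quick look?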