【Question title】: How to extract a repeated nested field from a JSON string and join it with an existing repeated nested field in BigQuery
【Posted】: 2018-06-22 21:44:06
【Question】:

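I have a table with an article_id column, a nested repeated field named author_names, and a string field, extra_informations, that contains a JSON string.

Here is the schema of my table:

Here is a sample row from the table: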

[
  {
"article_id": "2732930586",
"author_names": [
  {
    "AuN": "h kanahashi",
    "AuId": "2591665239",
    "AfN": null,
    "AfId": null,
    "S": "1"
  },
  {
    "AuN": "t mukai",
    "AuId": "2607493793",
    "AfN": null,
    "AfId": null,
    "S": "2"
  },
  {
    "AuN": "y yamada",
    "AuId": "2606624579",
    "AfN": null,
    "AfId": null,
    "S": "3"
  },
  {
    "AuN": "k shimojima",
    "AuId": "2606600298",
    "AfN": null,
    "AfId": null,
    "S": "4"
  },
  {
    "AuN": "m mabuchi",
    "AuId": "2606138976",
    "AfN": null,
    "AfId": null,
    "S": "5"
  },
  {
    "AuN": "t aizawa",
    "AuId": "2723380540",
    "AfN": null,
    "AfId": null,
    "S": "6"
  },
  {
    "AuN": "k higashi",
    "AuId": "2725066679",
    "AfN": null,
    "AfId": null,
    "S": "7"
  }
],
"extra_informations": "{
\"DN\": \"Experimental study for improvement of crashworthiness in AZ91 magnesium foam controlling its microstructure.\",
\"S\":[{\"Ty\":1,\"U\":\"https://shibaura.pure.elsevier.com/en/publications/experimental-study-for-improvement-of-crashworthiness-in-az91-mag\"}],
 \"VFN\":\"Materials Science and Engineering\",
 \"FP\":283,
 \"LP\":287,
 \"RP\":[{\"Id\":2024275625,\"CoC\":5},{\"Id\":2035451257,\"CoC\":5},     {\"Id\":2141952446,\"CoC\":5},{\"Id\":2126566553,\"CoC\":6},  {\"Id\":2089573897,\"CoC\":5},{\"Id\":2069241702,\"CoC\":7},  {\"Id\":2000323790,\"CoC\":6},{\"Id\":1988924750,\"CoC\":16}],
\"ANF\":[
{\"FN\":\"H.\",\"LN\":\"Kanahashi\",\"S\":1},
{\"FN\":\"T.\",\"LN\":\"Mukai\",\"S\":2},    
{\"FN\":\"Y.\",\"LN\":\"Yamada\",\"S\":3},    
{\"FN\":\"K.\",\"LN\":\"Shimojima\",\"S\":4},    
{\"FN\":\"M.\",\"LN\":\"Mabuchi\",\"S\":5},    
{\"FN\":\"T.\",\"LN\":\"Aizawa\",\"S\":6},    
{\"FN\":\"K.\",\"LN\":\"Higashi\",\"S\":7}
],
\"BV\":\"Materials Science and Engineering\",\"BT\":\"a\"}"
  }
]

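Inside extra_informations.ANF there is a nested array that holds additional author name information.

The nested repeated author_names field has a subfield, author_names.S, that maps to extra_informations.ANF.S and can serve as the join key. Using this mapping, I am trying to produce the following table: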

| article_id | author_names.AuN | S | extra_informations.ANF.FN | extra_informations.ANF.LN |
| 2732930586 | h kanahashi      | 1 | H.                        | Kanahashi                 |
| 2732930586 | t mukai          | 2 | T.                        | Mukai                     |
| 2732930586 | y yamada         | 3 | Y.                        | Yamada                    |
| 2732930586 | k shimojima      | 4 | K.                        | Shimojima                 |
| 2732930586 | m mabuchi        | 5 | M.                        | Mabuchi                   |
| 2732930586 | t aizawa         | 6 | T.                        | Aizawa                    |
| 2732930586 | k higashi        | 7 | K.                        | Higashi                   |

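The main problem I face is that when I convert the JSON string with JSON_EXTRACT(extra_informations, "$.ANF"), I do not get an array. I get the nested repeated array in string form, and I cannot convert that string into an array.

Is it possible to produce such a table in BigQuery using standard SQL?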

【Comments on the question】:

    Tags: google-bigquery bigquery-standard-sql


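    【Solution 1】:

    Option 1

    This approach relies on REGEXP_REPLACE plus a few other functions (REPLACE, SPLIT, etc.) to process the result. Note: the extra manipulation is needed because BigQuery's JsonPath expressions do not support wildcards and filters.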

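    #standardSQL
    SELECT 
      article_id, author.AuN, author.S, 
      -- Each extra element is expected to look like "FN":"H.","LN":"Kanahashi","S":1
      REPLACE(SPLIT(extra, '","')[OFFSET(0)], '"FN":"', '') FirstName,
      REPLACE(SPLIT(extra, '","')[OFFSET(1)], 'LN":"', '') LastName
    FROM `table`, UNNEST(author_names) author
    -- Strip the outer [{ and }] from the ANF text, then split it into one string per object
    LEFT JOIN UNNEST(SPLIT(REGEXP_REPLACE(JSON_EXTRACT(extra_informations, '$.ANF'), r'\[{|}\]', ''), '},{')) extra
    -- Join each author to the ANF entry with the same sequence number S
    ON author.S = CAST(REPLACE(SPLIT(extra, '","')[OFFSET(2)], 'S":', '') AS INT64) 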
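
To make the string manipulation in Option 1 easier to follow, here is a minimal JavaScript sketch of the same steps applied to a shortened copy of the sample ANF value. It assumes JSON_EXTRACT returns the array as compact JSON text with no whitespace between tokens. The function name splitAnf and the variable names are made up for illustration and are not part of BigQuery.

```javascript
// Hypothetical helper mirroring Option 1: strip the outer "[{" and "}]",
// split into one fragment per object, then peel off the key prefixes.
function splitAnf(anfText) {
  return anfText
    .replace(/\[\{|\}\]/g, '')   // same pattern as r'\[{|}\]' in the query
    .split('},{')                // one fragment per ANF object
    .map(function (fragment) {
      // fragment looks like: "FN":"H.","LN":"Kanahashi","S":1
      var parts = fragment.split('","');
      return {
        FirstName: parts[0].replace('"FN":"', ''),
        LastName: parts[1].replace('LN":"', ''),
        S: Number(parts[2].replace('S":', ''))
      };
    });
}

// Shortened copy of the ANF array from the sample row (assumed compact form).
var sampleAnf = '[{"FN":"H.","LN":"Kanahashi","S":1},{"FN":"T.","LN":"Mukai","S":2}]';
var rows = splitAnf(sampleAnf);
console.log(rows);
```

The approach depends on the key order (FN, LN, S) and on values that contain none of the delimiters, so it fits the sample data but is fragile for arbitrary JSON.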
    

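    Option 2

    To overcome this BigQuery "limitation" for JsonPath, you can use a custom function (a JavaScript UDF), as in the example below.
    Note: it uses jsonpath-0.8.0.js, which can be downloaded from https://code.google.com/archive/p/jsonpath/downloads and is assumed to have been uploaded to Google Cloud Storage at gs://your_bucket/jsonpath-0.8.0.js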

    #standardSQL
    CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json STRING, json_path STRING)
    RETURNS STRING
    LANGUAGE js AS """
        // Parse the JSON text and let jsonpath-0.8.0.js evaluate the full JsonPath, filters included
        try { var parsed = JSON.parse(json);
            return jsonPath(parsed, json_path);
        } catch (e) { return null }
    """
    OPTIONS (
        library="gs://your_bucket/jsonpath-0.8.0.js"
    );
    SELECT 
      article_id, author.AuN, author.S,
      -- Build a filter such as $.ANF[?(@.S==1)].FN for each unnested author
      CUSTOM_JSON_EXTRACT(extra_informations, CONCAT('$.ANF[?(@.S==', CAST(author.S AS STRING), ')].FN')) FirstName,
      CUSTOM_JSON_EXTRACT(extra_informations, CONCAT('$.ANF[?(@.S==', CAST(author.S AS STRING), ')].LN')) LastName
    FROM `table`, UNNEST(author_names) author 
    

    As you can see, all the magic now happens in one simple JsonPath expression.
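
For readers unfamiliar with JsonPath filters, the sketch below shows in plain JavaScript what an expression like $.ANF[?(@.S==2)].FN selects. It does not use the jsonpath library. The helper name extractByS is hypothetical, and the input is a shortened copy of the sample extra_informations value.

```javascript
// Hypothetical plain-JavaScript equivalent of the JsonPath filter
// $.ANF[?(@.S==<s>)].<field>: keep ANF entries whose S matches, return the field.
function extractByS(json, s, field) {
  try {
    var parsed = JSON.parse(json);
    var matches = (parsed.ANF || [])
      .filter(function (entry) { return entry.S == s; })  // loose == as in the filter
      .map(function (entry) { return entry[field]; });
    return matches.length > 0 ? String(matches[0]) : null;
  } catch (e) {
    return null;  // malformed JSON yields null, like the UDF above
  }
}

// Shortened copy of the extra_informations JSON string from the sample row.
var sampleExtra = '{"FP":283,"ANF":[{"FN":"H.","LN":"Kanahashi","S":1},{"FN":"T.","LN":"Mukai","S":2}]}';
console.log(extractByS(sampleExtra, 2, 'FN'), extractByS(sampleExtra, 2, 'LN'));
```

The same logic could in principle serve as a UDF body that needs no external library, but that variant is an assumption and not part of the original answer.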

    【Comments】:

    • Both solutions worked for me, but I especially like the flexibility of the JSON path. Thanks!