【Question title】: How to convert an array extracted from a JSON string field to a BigQuery repeated field?
【Posted】: 2018-01-01 02:12:19
【Question description】:

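We have JSON blobs loaded into a STRING field of a BigQuery table. I need to create a view over the table (using standard SQL) that extracts the array field as a BigQuery array/repeated field of type RECORD (which itself contains a repeated field).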

Here is a sample record (json_blob):

{"order_id":"123456","customer_id":"2abcd", "items":[{"line":"1","ref_ids":["66b56e60","9e7ca2b7"],"sku":"1111","amount":40 },{"line":"2","ref_ids":["7777h0","8888j0"],"sku":"2222","amount":10 }]}

I would like to end up with a view that has the following schema:

[
{
    "name": "order_id",
    "type": "STRING",
    "mode": "NULLABLE"
},
{
    "mode": "NULLABLE",
    "name": "customer_id",
    "type": "STRING"
},
{
    "mode": "REPEATED",
    "name": "items",
    "type": "RECORD",
    "fields": [
        {
            "mode": "NULLABLE",
            "name": "line",
            "type": "STRING"
        },
        {
            "mode": "REPEATED",
            "name": "ref_ids",
            "type": "STRING"
        },
        {
            "mode": "NULLABLE",
            "name": "sku",
            "type": "STRING"
        },
        {
            "mode": "NULLABLE",
            "name": "amount",
            "type": "INTEGER"
        }
    ]
}
]

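JSON_EXTRACT(json_blob, '$.items') extracts the items part, but how do I convert it into a BigQuery array of type RECORD that can then be handled like a regular BigQuery array / repeated STRUCT?

Any help is appreciated.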

【Comments】:

    Tags: google-bigquery


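    【Solution 1】:

    At the time of writing, there is no way to do this with SQL functions in BigQuery unless you can impose a hard limit on the number of values in the JSON array; see the relevant issue tracker item. Your options are:

    • Process the data differently (for example, with Cloud Dataflow or another tool) so that you can load it into BigQuery from newline-delimited JSON.
    • Use a JavaScript UDF that takes the input JSON and returns the desired type; this is fairly simple but generally uses more CPU (and so may require a higher billing tier).
    • Use SQL functions, with the understanding that the solution breaks down if there are too many elements.

    Here is the approach using a JavaScript UDF: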

    #standardSQL
    CREATE TEMP FUNCTION JsonToItems(input STRING)
    RETURNS STRUCT<order_id INT64, customer_id STRING, items ARRAY<STRUCT<line STRING, ref_ids ARRAY<STRING>, sku STRING, amount INT64>>>
    LANGUAGE js AS """
    return JSON.parse(input);
    """;
    
    WITH Input AS (
      SELECT '{"order_id":"123456","customer_id":"2abcd", "items":[{"line":"1","ref_ids":["66b56e60","9e7ca2b7"],"sku":"1111","amount":40 },{"line":"2","ref_ids":["7777h0","8888j0"],"sku":"2222","amount":10 }]}' AS json
    )
    SELECT
      JsonToItems(json).*
    FROM Input;
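
The UDF body above simply returns whatever JSON.parse yields and relies on BigQuery to convert that object to the declared return type. As a rough sketch of what that object looks like outside BigQuery, here is plain JavaScript that can be run in Node.js. The name `jsonToItems`, the explicit field picking, and the `Number` coercion are assumptions added for illustration; they are not part of the answer's UDF.

```javascript
// Hypothetical stand-in for the body of the JsonToItems UDF.
// It keeps only the fields named in the RETURNS clause, so any
// unexpected keys in the blob are dropped explicitly.
function jsonToItems(input) {
  const parsed = JSON.parse(input);
  return {
    order_id: Number(parsed.order_id), // "123456" in the blob is a string
    customer_id: parsed.customer_id,
    items: (parsed.items || []).map((item) => ({
      line: item.line,
      ref_ids: item.ref_ids || [],
      sku: item.sku,
      amount: Number(item.amount),
    })),
  };
}

const sample = '{"order_id":"123456","customer_id":"2abcd", "items":[{"line":"1","ref_ids":["66b56e60","9e7ca2b7"],"sku":"1111","amount":40 },{"line":"2","ref_ids":["7777h0","8888j0"],"sku":"2222","amount":10 }]}';

const result = jsonToItems(sample);
console.log(JSON.stringify(result, null, 2));
```

Picking fields by name trades brevity for a return value whose shape no longer depends on what else happens to be in the blob.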
    

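    If you do want to try a SQL-based approach without JavaScript, then until the feature request above is resolved, here is a hack that works only when each array has no more than 10 elements: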

    #standardSQL
    CREATE TEMP FUNCTION JsonExtractRefIds(json STRING) AS (
      (SELECT ARRAY_AGG(v IGNORE NULLS)
       FROM UNNEST([
         JSON_EXTRACT_SCALAR(json, '$.ref_ids[0]'),
         JSON_EXTRACT_SCALAR(json, '$.ref_ids[1]'),
         JSON_EXTRACT_SCALAR(json, '$.ref_ids[2]'),
         JSON_EXTRACT_SCALAR(json, '$.ref_ids[3]'),
         JSON_EXTRACT_SCALAR(json, '$.ref_ids[4]'),
         JSON_EXTRACT_SCALAR(json, '$.ref_ids[5]'),
         JSON_EXTRACT_SCALAR(json, '$.ref_ids[6]'),
         JSON_EXTRACT_SCALAR(json, '$.ref_ids[7]'),
         JSON_EXTRACT_SCALAR(json, '$.ref_ids[8]'),
         JSON_EXTRACT_SCALAR(json, '$.ref_ids[9]')]) AS v)
    );
    
    CREATE TEMP FUNCTION JsonToItem(json STRING)
    RETURNS STRUCT<line STRING, ref_ids ARRAY<STRING>, sku STRING, amount INT64>
    AS (
      IF(json IS NULL, NULL,
        STRUCT(
          JSON_EXTRACT_SCALAR(json, '$.line'),
          JsonExtractRefIds(json),
          JSON_EXTRACT_SCALAR(json, '$.sku'),
          CAST(JSON_EXTRACT_SCALAR(json, '$.amount') AS INT64)
        )
      )
    );
    
    CREATE TEMP FUNCTION JsonToItems(json STRING) AS (
      (SELECT AS STRUCT
        CAST(JSON_EXTRACT_SCALAR(json, '$.order_id') AS INT64) AS order_id,
        JSON_EXTRACT_SCALAR(json, '$.customer_id') AS customer_id,
        (SELECT ARRAY_AGG(v IGNORE NULLS)
         FROM UNNEST([
           JsonToItem(JSON_EXTRACT(json, '$.items[0]')),
           JsonToItem(JSON_EXTRACT(json, '$.items[1]')),
           JsonToItem(JSON_EXTRACT(json, '$.items[2]')),
           JsonToItem(JSON_EXTRACT(json, '$.items[3]')),
           JsonToItem(JSON_EXTRACT(json, '$.items[4]')),
           JsonToItem(JSON_EXTRACT(json, '$.items[5]')),
           JsonToItem(JSON_EXTRACT(json, '$.items[6]')),
           JsonToItem(JSON_EXTRACT(json, '$.items[7]')),
           JsonToItem(JSON_EXTRACT(json, '$.items[8]')),
           JsonToItem(JSON_EXTRACT(json, '$.items[9]'))]) AS v) AS items
      )
    );
    
    WITH Input AS (
      SELECT '{"order_id":"123456","customer_id":"2abcd", "items":[{"line":"1","ref_ids":["66b56e60","9e7ca2b7"],"sku":"1111","amount":40 },{"line":"2","ref_ids":["7777h0","8888j0"],"sku":"2222","amount":10 }]}' AS json
    )
    SELECT
      JsonToItems(json).*
    FROM Input;
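
The limit of 10 comes from the fixed list of paths (`[0]` through `[9]`), not from anything in the data. The following JavaScript emulates that pattern; it is not BigQuery code, and `extractFirstTen` is a hypothetical name. It suggests that an eleventh element would be dropped silently rather than cause an error:

```javascript
// Emulates probing ten hard-coded indexes and discarding the missing ones,
// in the spirit of ARRAY_AGG(v IGNORE NULLS) over ten JSON_EXTRACT calls.
function extractFirstTen(jsonText, key) {
  const arr = JSON.parse(jsonText)[key] || [];
  const probed = [];
  for (let i = 0; i < 10; i++) {
    // A missing index plays the role of the NULL returned for an absent path.
    probed.push(i < arr.length ? arr[i] : null);
  }
  return probed.filter((v) => v !== null);
}

const elevenIds = Array.from({ length: 11 }, (_, i) => "id" + i);
const kept = extractFirstTen(JSON.stringify({ ref_ids: elevenIds }), "ref_ids");

console.log(kept.length);           // 10
console.log(kept.includes("id10")); // false: the eleventh element is gone
```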
    

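    【Comments】:

    • Very cool answer (and hack)! Just curious about the requirement that the array length be no greater than 10. Is it because of memory consumption? I tested here with JSON containing more than 10 elements and it ran fine, so I thought it might be memory-related.
    • It is because I hard-coded the paths used to extract the array elements (the paths must be constants). For example, even if the array has 11 items, you will only get 10 of them.
    【Solution 2】:

    A more brute-force version, which I think is easier to read and to modify or adjust if needed: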

    #standardSQL
    WITH `yourTable` AS (
      SELECT '{"order_id":"123456","customer_id":"2abcd", "items":[{"line":"1","ref_ids":["66b56e60","9e7ca2b7"],"sku":"1111","amount":40 },{"line":"2","ref_ids":["7777h0","8888j0"],"sku":"2222","amount":10 }]}' AS json_blob
    )
    SELECT 
       JSON_EXTRACT_SCALAR(json_blob, '$.order_id') AS order_id,
       JSON_EXTRACT_SCALAR(json_blob, '$.customer_id') AS customer_id,
       ARRAY(
        SELECT STRUCT(
            JSON_EXTRACT_SCALAR(split_items, '$.line') AS line,
            SPLIT(REGEXP_REPLACE(JSON_EXTRACT (split_items, '$.ref_ids'), r'[\[\]\"]', '')) AS ref_ids,
            JSON_EXTRACT_SCALAR(split_items, '$.sku') AS sku,
            JSON_EXTRACT_SCALAR(split_items, '$.amount') AS amount
          )
        FROM (
          SELECT CONCAT('{', REGEXP_REPLACE(split_items, r'^\[{|}\]$', ''), '}') AS split_items
          FROM UNNEST(SPLIT(JSON_EXTRACT(json_blob, '$.items'), '},{')) AS split_items
        )
       ) AS items
    FROM `yourTable` 
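
One caveat worth keeping in mind with this approach: splitting the serialized array on `},{` assumes that sequence never appears inside a single element. The JavaScript below emulates the split-and-rewrap step; it is not BigQuery code, `naiveSplitItems` is a hypothetical helper, and the `extras` field is invented for the example. It shows the step working for flat items and producing broken fragments once an item contains a nested array of objects:

```javascript
// Emulates SPLIT(..., '},{') followed by stripping the outer "[{" / "}]"
// and re-wrapping each piece in braces.
function naiveSplitItems(itemsJson) {
  return itemsJson
    .split("},{")
    .map((part) => "{" + part.replace(/^\[\{|\}\]$/g, "") + "}");
}

// Flat items: each fragment is valid JSON again.
const flat = JSON.stringify([
  { line: "1", sku: "1111" },
  { line: "2", sku: "2222" },
]);
const flatFragments = naiveSplitItems(flat);
console.log(flatFragments.length); // 2

// One item with a nested array of objects (hypothetical "extras" field):
// the separator also matches inside the item, so it is cut in two.
const nested = JSON.stringify([
  { line: "1", extras: [{ a: 1 }, { b: 2 }] },
]);
const nestedFragments = naiveSplitItems(nested);
console.log(nestedFragments.length); // 2, although there is only one item
```

The same kind of assumption applies to the ref_ids handling, which strips brackets and quotes and then splits on commas, so it expects that no ref_id contains a comma.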
    

    【Comments】:

      【Solution 3】:

      As of May 1, 2020, the JSON_EXTRACT_ARRAY function has been added and can be used to retrieve arrays from JSON.

      #standardSQL
      WITH `yourTable` AS (
        SELECT '{"order_id":"123456","customer_id":"2abcd", "items":[{"line":"1","ref_ids":["66b56e60","9e7ca2b7"],"sku":"1111","amount":40 },{"line":"2","ref_ids":["7777h0","8888j0"],"sku":"2222","amount":10 }]}' AS json_blob 
      )
      SELECT
        json_extract_scalar(json_blob,'$.order_id') AS order_id,
        json_extract_scalar(json_blob,'$.customer_id') AS customer_id,
        ARRAY(
        SELECT
          STRUCT(json_extract_scalar(split_items,'$.line') AS line,
                ARRAY(SELECT json_extract_scalar(ref_element,'$') FROM UNNEST(json_extract_array(split_items, '$.ref_ids')) ref_element) AS ref_ids,
                json_extract_scalar(split_items,'$.sku') AS sku,
                json_extract_scalar(split_items,'$.amount') AS amount 
            )
          FROM UNNEST(json_extract_array(json_blob,'$.items')) split_items 
        ) AS items
      FROM
        `yourTable`
      

      To get only the type values, the query would be:

      #standardSQL
      WITH `yourTable` AS (
        SELECT '{ "firstName": "John", "lastName" : "doe", "age"      : 26, "address"  : {     "streetAddress": "naist street",     "city"         : "Nara",     "postalCode"   : "630-0192" }, "phoneNumbers": [     {       "type"  : "iPhone",       "number": "0123-4567-8888"     },     {       "type"  : "home",       "number": "0123-4567-8910"     } ]}' AS json_blob 
      )
        SELECT
          json_extract_scalar(split_items,'$.type') AS type FROM `yourTable`, UNNEST(json_extract_array(json_blob,'$.phoneNumbers')) split_items
      

      【Comments】:
