【Question title】: BigQuery - transform generic JSON to STRUCT
【Posted】: 2021-08-09 09:13:02
【Question】:

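I have a column in BigQuery that contains a variety of different messages in a simple, single-depth JSON format, and I would like to extract them into a STRUCT. The input table looks like

and should be transformed into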

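I am aware of the BigQuery JSON functions such as JSON_EXTRACT, for example here. However, that approach is out of the question, because there are 100 different senders in production. I therefore need to be able to extract these JSONs dynamically, without specifying their keys by hand.

I have been playing around with regular expressions, as shown here: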

    WITH input_table AS (
    SELECT
        1 AS Row,
        20210101 AS Date,
        'Sender1' AS Sender,
        '{"param1": 123, "param2": 456, "param3": 78,  "value1": 42, "label1": "hello", "timestamp": 1234567890}' AS Message
    UNION ALL SELECT
        2 AS Row,
        20210101 AS Date,
        'Sender2' AS Sender,
        '{"value1": 4, "label1": "myLabel", "label2": "yourLabel"}' AS Message
    UNION ALL SELECT
        3 AS Row,
        20210102 AS Date,
        'Sender1' AS Sender,
        '{"param1": 12, "param2": 90, "param3": 55,  "value1": 11, "label1": "there", "timestamp": 1235555555}' AS Message
    )
    SELECT
    CONCAT("SELECT ", key, " AS key, JSON_EXTRACT_SCALAR(Message, '$.", key, "') AS ", key, " FROM input_table")
    FROM input_table, unnest(regexp_extract_all(regexp_replace(JSON_EXTRACT(Message, '$'), r':{.*?}+', ''), r'"(.*?)":')) key

But apart from my regex still being a bit off, I am struggling to turn these statements into a STRUCT.

【Comments】:

    Tags: json google-bigquery


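    【Answer 1】:

    This is a really tricky one. AFAIK it is not actually possible to populate dynamic parameters into a RECORD/STRUCT, because its sub-schema has to be predefined at the table/query level. To get you started, I have used a JavaScript function to produce something similar to the Firebase BigQuery event/user parameters schema (an array of key and value pairs).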

    CREATE TEMP FUNCTION processJson(input STRING)
    RETURNS ARRAY<STRUCT<key STRING, value STRING>>
    LANGUAGE js AS """
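    // Parse the flat JSON object and emit one {key, value} pair per top-level key.
    var obj = JSON.parse(input);
    var keys = Object.keys(obj);
    var arr = [];
    for (var i = 0; i < keys.length; i++) {
        // Values are JSON-encoded, so strings keep their surrounding quotes.
        arr.push({'key': keys[i], 'value': JSON.stringify(obj[keys[i]])});
    }
    return arr;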
    """;
    WITH
      input_table AS (
      SELECT
        1 AS ROW,
        20210101 AS Date,
        'Sender1' AS Sender,
        '{"param1": 123, "param2": 456, "param3": 78,  "value1": 42, "label1": "hello", "timestamp": 1234567890}' AS Message
      UNION ALL
      SELECT
        2 AS ROW,
        20210101 AS Date,
        'Sender2' AS Sender,
        '{"value1": 4, "label1": "myLabel", "label2": "yourLabel"}' AS Message
      UNION ALL
      SELECT
        3 AS ROW,
        20210102 AS Date,
        'Sender1' AS Sender,
        '{"param1": 12, "param2": 90, "param3": 55,  "value1": 11, "label1": "there", "timestamp": 1235555555}' AS Message )
    SELECT
      ROW,
      date,
      sender,
      processJson(Message) message
    FROM
      input_table
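
To see what the UDF produces without running a query, the same logic can be exercised in plain JavaScript (for example with Node.js). This is only a sketch of the function body above, run against the first sample message. Note that JSON.stringify leaves numbers as bare digits but keeps the quotes around string values, which matters when filtering on `value` later.

```javascript
// Same logic as the processJson UDF body, written as a plain function.
function processJson(input) {
  const obj = JSON.parse(input);
  // One {key, value} pair per top-level key; values are JSON-encoded strings.
  return Object.keys(obj).map((key) => ({
    key: key,
    value: JSON.stringify(obj[key]),
  }));
}

// First sample message from input_table.
const message =
  '{"param1": 123, "param2": 456, "param3": 78,  "value1": 42, "label1": "hello", "timestamp": 1234567890}';

const pairs = processJson(message);
console.log(pairs);
```

The pair for `param1` carries the value `123` as a string, while the pair for `label1` carries `"hello"` including the double quotes.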
    

    A simple query to filter:

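    -- final_view is presumably the query above saved as a view.
    -- String values are JSON-encoded by the UDF, e.g. ('label1', '"hello"').
    SELECT
      *
    FROM
      final_view, UNNEST(message) AS kv
    WHERE
      kv IN (('param1', '123'), ('param2', '90'), ('value1', '4'))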
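
Because the STRUCT sub-schema has to be known when the query is written, one possible workaround is to discover the keys per sender in a preprocessing step and generate the STRUCT expression from them. The sketch below assumes the rows are already available as plain objects in Node.js. `keysBySender` and `structExpression` are hypothetical helper names, not part of BigQuery, and every generated field would be a STRING because JSON_EXTRACT_SCALAR returns strings.

```javascript
// Sample rows mirroring input_table (only the columns needed here).
const rows = [
  { Sender: "Sender1", Message: '{"param1": 123, "param2": 456, "param3": 78, "value1": 42, "label1": "hello", "timestamp": 1234567890}' },
  { Sender: "Sender2", Message: '{"value1": 4, "label1": "myLabel", "label2": "yourLabel"}' },
  { Sender: "Sender1", Message: '{"param1": 12, "param2": 90, "param3": 55, "value1": 11, "label1": "there", "timestamp": 1235555555}' },
];

// Hypothetical helper: union of top-level JSON keys seen for each sender.
function keysBySender(rows) {
  const result = new Map();
  for (const row of rows) {
    if (!result.has(row.Sender)) result.set(row.Sender, new Set());
    for (const key of Object.keys(JSON.parse(row.Message))) {
      result.get(row.Sender).add(key);
    }
  }
  return result;
}

// Hypothetical helper: build a STRUCT(...) expression with one field per key.
// Assumes keys are plain identifiers (letters, digits, underscores).
function structExpression(keys) {
  const fields = [...keys].map(
    (key) => `JSON_EXTRACT_SCALAR(Message, '$.${key}') AS ${key}`
  );
  return `STRUCT(${fields.join(", ")})`;
}

const senderKeys = keysBySender(rows);
const sender2Struct = structExpression(senderKeys.get("Sender2"));
console.log(sender2Struct);
```

Each sender would still need its own query or output column, since different key sets yield different STRUCT types.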
    

    【Comments】:

    • I was afraid it would not be that simple. I think I will try to move this problem to Python in the ETL process. Thanks anyway! :)