BigQuery: flatten multiple repeated columns of different length

【Question title】: BigQuery: flatten multiple repeated columns of different length
【Posted】: 2021-08-28 14:34:54
【Question】:

I followed the answer to this question: BigQuery: flatten two repeated columns. It does not quite work for me, although it is the closest thing to what I am looking for.

I have data sent from an app through Google Analytics to Google BigQuery. The table has 10 RECORD columns, three of which are REPEATED:

event_params [RECORD REPEATED]
user_properties [RECORD REPEATED]
user_ltv [RECORD NULLABLE]
device [RECORD NULLABLE]
geo [RECORD NULLABLE]
app_info [RECORD NULLABLE]
traffic_source [RECORD NULLABLE]
event_dimensions [RECORD NULLABLE]
ecommerce [RECORD NULLABLE]
items [RECORD REPEATED]

Whenever there is a new event, each row also has:

event_date
event_timestamp
event_name

These repeated attributes can have a different length for each event, and they correspond to one another through the index of the event.

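Below is a snapshot of the first two repeated columns, event_params and user_properties, together with what I would like to produce from these two columns and the others (if needed):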

Here event_params has length 7 and user_properties has length 4. I ran the following code:

-- standardSQL
SELECT
    event_name, event_params,
    user_properties[OFFSET(off)] AS user_properties
FROM
    `yepic-2021.analytics_264796885.events_intraday_*`,
    UNNEST(event_params) AS event_params WITH OFFSET off
ORDER BY
    event_timestamp DESC
LIMIT 50

but it fails with an error:

Array index 4 is out of bounds (overflow)

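That makes sense, because the arrays have different lengths. So my idea is this: if anyone knows how to pad all the other columns with null until their length equals that of the longest column, that would give me the fully flattened output I want.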

Here is an example of what I do not want. Unnesting on top of an already unnested table makes the number of repeated rows explode:

-- standardSQL
SELECT
    event_name, event_params, user_properties
FROM
    `yepic-2021.analytics_264796885.events_intraday_*`,
    UNNEST(event_params) AS event_params,
    UNNEST(user_properties) AS user_properties
ORDER BY
    event_timestamp DESC
LIMIT 50

Result:

If anyone can help with this approach, or knows BigQuery better than I do and has a simple way to flatten the data coming from GA, I would really appreciate the help.

Temporary edit:

Here is the code I tried in BigQuery:

-- standardSQL
WITH data1 AS (
    SELECT GENERATE_UUID() AS row_id, event_params, user_properties
    FROM `yepic-2021.analytics_264796885.events_intraday_*`
),
data2 AS (
    SELECT *, GENERATE_ARRAY(1, GREATEST(ARRAY_LENGTH(event_params), ARRAY_LENGTH(user_properties))) ordinals
    FROM data1
)
SELECT row_id, event_params[SAFE_ORDINAL(o)] event_params, user_properties[SAFE_ORDINAL(o)] user_properties
FROM data2, UNNEST(ordinals) o

Result:

【Question comments】:

Tags: google-bigquery google-analytics-firebase

【Solution 1】:

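Your first approach almost works, except for two problems:

You need to use SAFE_OFFSET to avoid the error.
It only works when event_params has at least as many elements as user_properties; otherwise you will miss some user properties.

Let's fix it. To allow either column to be the longer one, we take the greatest array length and use it to generate the array indexes. Then we index both arrays with SAFE_ORDINAL (SAFE_OFFSET works too, as long as the generated indexes start at 0 instead of 1):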

-- sample input data: id and two repeated fields x and y.
with data as (
    select 1 id, [1,2,3] x, ['a', 'b'] y 
    union all
    select 2 id, [4,5] x, ['c', 'd', 'e', 'f'] y
), 
-- let's add the ordinals arrays, taking length of longer array.
data2 as (
    select *, 
      generate_array(1, greatest(array_length(x), array_length(y))) ordinals
    from data
)
select id, x[safe_ordinal(o)] x, y[safe_ordinal(o)] y
from data2, unnest(ordinals) o

Result

id  x       y
-------------
1   1       a
1   2       b
1   3       null
2   4       c
2   5       d
2   null    e
2   null    f
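
For comparison, here is a minimal sketch of the SAFE_OFFSET variant mentioned above. It assumes the same sample data; the only changes are that the generated indexes run from 0 to length - 1, and that an ORDER BY is added for a stable display order. It should return the same seven rows as the result table above.

```sql
-- Same sample data as above; indexes start at 0 to match SAFE_OFFSET.
with data as (
    select 1 id, [1,2,3] x, ['a', 'b'] y
    union all
    select 2 id, [4,5] x, ['c', 'd', 'e', 'f'] y
)
select id, x[safe_offset(o)] x, y[safe_offset(o)] y
from data,
    -- 0-based index list as long as the longer of the two arrays
    unnest(generate_array(0, greatest(array_length(x), array_length(y)) - 1)) o
order by id, o
```

With either variant, a row whose arrays are both empty produces no output rows, because the generated index array is empty and the comma join has nothing to pair it with.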

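Update

This works if the repeated fields are simple. If a repeated field is a RECORD, you get a NULL RECORD. You probably want a RECORD with NULL leaf values instead; for that, use an expression like: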

coalesce(x[safe_ordinal(o)], struct<a int64, b int64>(null, null)) x
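
As a sketch of how that expression fits into a complete query, assume a single sample row in which x is an array of two records and y is an array of three strings (made-up data in the style of the other samples). For the third index, x should come back as a record whose fields a and b are both NULL, rather than as a NULL record.

```sql
-- Illustrative sample: x has 2 records, y has 3 strings.
with data as (
    select 1 id, [struct<a int64, b int64>(1, 2), (3, 4)] x, ['c', 'd', 'e'] y
)
select id,
    -- replace a missing record with a record of NULL leaf values
    coalesce(x[safe_ordinal(o)], struct<a int64, b int64>(null, null)) x,
    y[safe_ordinal(o)] y
from data,
    unnest(generate_array(1, greatest(array_length(x), array_length(y)))) o
```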

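Or, to flatten the output table completely and get only the leaf fields, with no trace of any record fields, extract the leaf fields in one more step: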

-- sample input data: id and two repeated fields x and y.
with data as (
    select 1 id, [struct<a int64, b int64>(1, 2), (3, 4), (5, 6)] x, ['a', 'b'] y 
    union all
    select 2 id, [struct<a int64, b int64>(7, 8), (9, 10)] x, ['c', 'd', 'e', 'f'] y
), 
data2 as (
  select id, x[safe_ordinal(o)] x, y[safe_ordinal(o)] y
  from data, unnest( generate_array(1, greatest(array_length(x), array_length(y)))) o
)
select id, x.a a, x.b b, y 
from data2
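
The same two-step pattern can be sketched against data shaped like the question's columns. This is only an illustration under assumptions: the events CTE is an inline stand-in for the real export table, it models only key and value.string_value for each record, and the keys and values ('page', 'ref', 'plan', and so on) are hypothetical.

```sql
-- Inline stand-in for the GA export table (assumption: only key and
-- value.string_value are modelled; all keys and values are made up).
with events as (
    select 'screen_view' event_name, 1630161294000000 event_timestamp,
        [struct('page' as key, struct('home' as string_value) as value),
         struct('ref' as key, struct('email' as string_value) as value)] event_params,
        [struct('plan' as key, struct('free' as string_value) as value)] user_properties
),
-- step 1: pair the arrays by position, padding the shorter one with NULL
padded as (
    select event_name, event_timestamp,
        event_params[safe_ordinal(o)] ep,
        user_properties[safe_ordinal(o)] up
    from events,
        unnest(generate_array(1, greatest(array_length(event_params),
                                          array_length(user_properties)))) o
)
-- step 2: keep only leaf fields; fields of a NULL record come back as NULL
select event_name, event_timestamp,
    ep.key event_param_key, ep.value.string_value event_param_value,
    up.key user_property_key, up.value.string_value user_property_value
from padded
```

To run this against a real table, the events CTE would be replaced by a SELECT from that table, and any other leaf fields that are needed would be added to the final SELECT.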

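A final note

So we have a solution, but I must warn you that this kind of flattening is very unusual. In your case there is no connection between the i-th item in user_properties and the i-th item in event_params. Here the first user_properties item is paired with the first event_params item, but it is equally related to every one of them. They are just two independent lists, and flattening them this way is quite arbitrary.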
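
One commonly used alternative, which is not part of the solution above, is to leave the arrays independent and pull individual values out by key, so that each event stays on a single row. The sketch below assumes the same kind of inline stand-in data as the samples in this answer; the keys 'page' and 'plan' and the output column names are hypothetical.

```sql
-- Inline stand-in for the GA export table (all keys and values are made up).
with events as (
    select 'screen_view' event_name,
        [struct('page' as key, struct('home' as string_value) as value),
         struct('ref' as key, struct('email' as string_value) as value)] event_params,
        [struct('plan' as key, struct('free' as string_value) as value)] user_properties
)
select event_name,
    -- each scalar subquery looks up one key; it yields NULL if the key is absent
    (select value.string_value from unnest(event_params) where key = 'page') as page_param,
    (select value.string_value from unnest(user_properties) where key = 'plan') as plan_property
from events
```

This relies on each key appearing at most once per array; a scalar subquery that matches more than one element fails at run time.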

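【Comments】:

Thank you very much for your answer; trying your code in BigQuery works nicely. I understand that there is no connection between the i-th items in each column. They are all connected to a specific event at some timestamp. Creating an id for the events is great, since it helps link them to the correct event and timestamp. If you have a better way to structure the data, I would gladly take it. Unfortunately, I have tried the solution with my data and it does not produce the same output as yours. With my code, I still get the repeated BigQuery view instead of nulls.
I will edit my question just to show you what I get.
Yes, flattening a RECORD takes more work, because you get a NULL RECORD rather than a RECORD with NULL leaf fields. I added a couple of ideas on how to handle those.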