【Question Title】: How to update a column for all rows after each time one row is processed by a UDF in BigQuery?
【Posted】: 2017-05-11 07:49:52
【Question】:

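I am trying to update a column for all rows each time one row has been processed by a UDF.

The example has 3 rows and 6 columns. Column "A" has the same value in all 3 rows; columns "B" and "A" together form the joint identifier of each row; column "C" is an array containing any of the letters a, b, c, d, e; column "D" is the target array to be filled; column "E" is an integer; column "abcde" is an array of 5 integers giving the count for each of the letters a, b, c, d, e.

Each row is passed into a UDF that updates columns "D" and "abcde" based on columns "C" and "E". The rules are: select from "C" the number of items specified by "E" and put them into "D"; the selection is random; after the selection for a row is done, column "abcde" is updated for all rows.

For example, to process the first row we randomly select one item from ('a','b','c') and put it into "D". Suppose the system picks 'c' from column "C": "D" for that row becomes ['c'] and "abcde" is updated to [1,3,1,1,1] (previously [1,3,2,1,1]) in all three rows.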
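The single update step described above can be sketched in JavaScript, the language a BigQuery UDF body would use. This is only an illustration under assumptions: `applySelection` and `LETTERS` are hypothetical names, and the letters are assumed to map to positions 0 to 4 of "abcde" in alphabetical order.

```javascript
// Position of each letter inside the abcde count array (assumed ordering).
const LETTERS = ['a', 'b', 'c', 'd', 'e'];

// Return a new count array with one unit deducted per selected letter.
// `applySelection` is a hypothetical helper, not a BigQuery function.
function applySelection(abcde, selected) {
  const next = abcde.slice(); // copy so the caller's array is left unchanged
  for (const letter of selected) {
    const pos = LETTERS.indexOf(letter);
    if (pos === -1) throw new Error('unknown letter: ' + letter);
    next[pos] -= 1;
  }
  return next;
}

// First row of the example: 'c' is picked, so the shared counts change
// from [1,3,2,1,1] to [1,3,1,1,1]; every row of the group should see them.
const afterFirstRow = applySelection([1, 3, 2, 1, 1], ['c']);
console.log(afterFirstRow);
```

The difficulty raised in the question is not this step itself but the fact that the result has to be visible to the rows processed afterwards.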

Sample data:

#StandardSQL in BigQuery
#code to generate the example table
with sample as (
select 'y1' as A, 'x1' as B, ['a','b','c'] as C, [] as D, 1 as E, [1,3,2,1,1] as abcde union all
select 'y1','x2',['a','b'],[],2,[1,3,2,1,1] union all
select 'y1','x3',['c','d','e'],[],3,[1,3,2,1,1])
select * from sample order by B

After the first row is processed:

with sample as (
select 'y1' as A, 'x1' as B, ['a','b','c'] as C, ['c'] as D, 1 as E, [1,3,1,1,1] as abcde union all
select 'y1','x2',['a','b'],[],2,[1,3,1,1,1] union all
select 'y1','x3',['c','d','e'],[],3,[1,3,1,1,1])
select * from sample order by B

After the second row is processed:

with sample as (
    select 'y1' as A, 'x1' as B, ['a','b','c'] as C, ['c'] as D, 1 as E, [0,2,1,1,1] as abcde union all
    select 'y1','x2',['a','b'],['a','b'],2,[0,2,1,1,1] union all
    select 'y1','x3',['c','d','e'],[],3,[0,2,1,1,1])
    select * from sample order by B

After the third row is processed:

with sample as (
    select 'y1' as A, 'x1' as B, ['a','b','c'] as C, ['c'] as D, 1 as E, [0,2,0,0,0] as abcde union all
    select 'y1','x2',['a','b'],['a','b'],2,[0,2,0,0,0] union all
    select 'y1','x3',['c','d','e'],['c','d','e'],3,[0,2,0,0,0])
    select * from sample order by B

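Don't worry about how the UDF will do the random selection. I just want to know whether the task of updating column "abcde" the way I described can be done in BigQuery.

I have tried using a UDF, but I am struggling to make it work because, as I understand it, a UDF can only take one row as input and output multiple rows, so I cannot update the other rows. Is it possible with SQL alone?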

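Expected output: [figures not recovered: the table after the first row is processed, and the table after the third row is processed]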

Additional information:

create temporary function selection(A string, B string,  C ARRAY<STRING>, D ARRAY<STRING>, E INT64, abcde ARRAY<INT64>)
returns STRUCT< A stRING, B string, C array<string>, D array<string>, E int64, abcde array<int64>>
language js AS """
/*
for the row i in the data:
select i.E items at random from i.C, considering only items whose count in i.abcde is greater than 0 (i.e. only items with a positive count in abcde can be candidates for the random selection);
put the selected items in i.D and deduct the amount of selected items from the number for the corresponding item in the column 'abcde' FOR ALL ROWS;
proceed to the next row i+1 until every row is processed;
*/
return {A,B,C,D,E,abcde}
""";
with sample as (
select 'y1' as A, 'x1' as B, ['a','b','c'] as C, CAST([] AS ARRAY<STRING>) as D, 1 as E, [1,3,2,1,1] as abcde union all
select 'y1','x2',['a','b'],[],2,[1,3,2,1,1] union all
select 'y1','x3',['c','d','e'],[],2,[1,3,2,1,1])
select selection(A,B,C,D,E,abcde) from sample order by B

【Comments on the question】:

    Tags: google-bigquery


    【Solution 1】:

    Below is for BigQuery Standard SQL

    #StandardSQL
    WITH sample AS (
      SELECT 'y1' AS A, 'x1' AS B, ['a','b','c'] AS C, ['c'] AS D, 1 AS E, [1,3,2,1,1] AS abcde UNION ALL
      SELECT 'y1','x2',['a','b'],['a','b'],2,[1,3,2,1,1] UNION ALL
      SELECT 'y1','x3',['c','d','e'],['c','d','e'],3,[1,3,2,1,1] UNION ALL
    
      SELECT 'y2' AS A, 'x1' AS B, ['a','b','c'] AS C, ['a','b'] AS D, 2 AS E, [1,3,2,1,1] AS abcde UNION ALL
      SELECT 'y2','x2',['a','b'],['b'],1,[1,3,2,1,1] UNION ALL
      SELECT 'y2','x3',['c','d','e'],['d','e'],2,[1,3,2,1,1]  
    ),
    counts AS (
      SELECT A AS AA, dd, COUNT(1) AS cnt
      FROM sample, UNNEST(D) AS dd
      GROUP BY AA, dd
    ),
    processed AS (
      SELECT A, B, ARRAY_AGG(aa - IFNULL(cnt, 0) ORDER BY pos) AS abcde
      FROM sample, UNNEST(abcde) AS aa WITH OFFSET AS pos
      LEFT JOIN counts ON A = counts.AA 
      AND CASE dd 
            WHEN 'a' THEN 0 
            WHEN 'b' THEN 1 
            WHEN 'c' THEN 2 
            WHEN 'd' THEN 3 
            WHEN 'e' THEN 4 
          END = pos
      GROUP BY A, B
    )
    SELECT s.A, s.B, s.C, s.D, s.E, p.abcde
    FROM sample AS s
    JOIN processed AS p
    USING (A, B)
    -- ORDER BY A, B  
    

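    "Don't worry about how the UDF will do the random selection" (quoted from the question)

    So, as you can see, I simply put "random" values into D in the sample data to simulate the selection.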
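For readers who find the two CTEs hard to follow, the same set-based idea can be sketched in plain JavaScript: count how often each letter appears in D per key A, then subtract those counts from abcde. This is an illustration only, not part of the answer's query. `countSelections` and `subtractCounts` are hypothetical names, the rows carry only the fields the computation needs, and the letters are assumed to be limited to a through e.

```javascript
// Position of each letter inside the abcde count array (assumed ordering).
const LETTERS = ['a', 'b', 'c', 'd', 'e'];

// Mirrors the `counts` CTE: how often each letter appears in D, per key A.
function countSelections(rows) {
  const counts = new Map(); // A -> array of 5 counters
  for (const row of rows) {
    if (!counts.has(row.A)) counts.set(row.A, [0, 0, 0, 0, 0]);
    for (const letter of row.D) {
      counts.get(row.A)[LETTERS.indexOf(letter)] += 1;
    }
  }
  return counts;
}

// Mirrors the `processed` CTE: subtract the group's counts from abcde.
function subtractCounts(rows) {
  const counts = countSelections(rows);
  return rows.map((row) => ({
    ...row,
    abcde: row.abcde.map((n, pos) => n - counts.get(row.A)[pos]),
  }));
}

// The 'y1' group from the sample data, with D already filled in.
const sample = [
  { A: 'y1', B: 'x1', D: ['c'], abcde: [1, 3, 2, 1, 1] },
  { A: 'y1', B: 'x2', D: ['a', 'b'], abcde: [1, 3, 2, 1, 1] },
  { A: 'y1', B: 'x3', D: ['c', 'd', 'e'], abcde: [1, 3, 2, 1, 1] },
];
console.log(subtractCounts(sample).map((r) => r.abcde));
```

Because everything is computed from the final contents of D, this yields only the end state of abcde, not the intermediate states after each row.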

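    【Comments】:

    • Thank you very much for your answer! But I may not have made my problem clear. I have added some extra information to the post. Basically I want to use a UDF to do the random selection step I mentioned. I explained how the algorithm in the UDF should work in the additional information (between /* */). The main problem is that each time a row has been processed in the UDF, I need to update the column "abcde" for ALL ROWS (assuming the rows are processed one after another... if that is possible...). Sorry for any confusion I caused... Thanks again! :)
    • @Ran - 1) I think you were clear enough. You said not to worry about that step in the UDF, which is why I skipped it and pre-filled field D (I even noted this at the bottom of my answer). 2) When you say ALL ROWS, I understand, but I assumed you meant all rows with the same A field (otherwise field A does not make much sense). Please confirm whether it is really meant for all rows. 3) The solution above does what you want in a set-based way, which is how SQL usually works, not in a row-by-row (cursor) way, which is not a good fit for big data. 4) Once you clarify #2, I will update my answer :o)
    • Regarding #2, you understood correctly: it is indeed all rows with the same A field. Your SQL works at the right aggregation level (thanks again! ^_^). Regarding #3, I know this algorithm is not a good fit for big data, and SQL is usually not meant for this kind of algorithm. But I was hoping there might be some "unusual" way to implement it in BigQuery? ... Thank you very much again for your help! :)
    • Sure. If for some reason you really need to do this in a cursor-like way, it is technically possible to pass all rows with the same A key to a JS UDF and process them row by row inside it, but it is not efficient and can lead to quite a high billing tier! If you are still interested in this type of solution, let me know and I can post another answer for it (it will take some time though :)
    • I am interested! Whenever you have time!! And a thousand thanks!!! ^_^ ^_^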
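The follow-up answer is not part of this thread, so what follows is only a rough sketch of the cursor-style idea from the comments. It shows the JavaScript that could run once per group of rows sharing the same A, following the algorithm in the question's /* */ comment. `processGroup` is a hypothetical name, the injectable `random` parameter is an addition to make runs repeatable, and picking without replacement within a row is an assumption based on the example data. In BigQuery the rows of a group would presumably have to be collected into one array (for instance with ARRAY_AGG over a STRUCT) before being passed to such a function.

```javascript
// Position of each letter inside the abcde count array (assumed ordering).
const LETTERS = ['a', 'b', 'c', 'd', 'e'];

// Process the rows of ONE group (same A) in order, sharing one count array.
// `rows` is an array of {A, B, C, D, E, abcde}; `random` returns a number
// in [0, 1) and defaults to Math.random.
function processGroup(rows, random = Math.random) {
  if (rows.length === 0) return [];
  const counts = rows[0].abcde.slice(); // shared state for the whole group
  const picked = rows.map((row) => {
    // Only letters whose remaining count is above 0 are candidates.
    const candidates = row.C.filter((l) => counts[LETTERS.indexOf(l)] > 0);
    const selected = [];
    for (let i = 0; i < row.E && candidates.length > 0; i++) {
      const idx = Math.floor(random() * candidates.length);
      const letter = candidates.splice(idx, 1)[0]; // pick without replacement
      selected.push(letter);
      counts[LETTERS.indexOf(letter)] -= 1; // visible to all later rows
    }
    return selected;
  });
  // Every row gets its own D plus a copy of the final shared counts.
  return rows.map((row, i) => ({ ...row, D: picked[i], abcde: counts.slice() }));
}

const group = [
  { A: 'y1', B: 'x1', C: ['a', 'b', 'c'], D: [], E: 1, abcde: [1, 3, 2, 1, 1] },
  { A: 'y1', B: 'x2', C: ['a', 'b'], D: [], E: 2, abcde: [1, 3, 2, 1, 1] },
  { A: 'y1', B: 'x3', C: ['c', 'd', 'e'], D: [], E: 3, abcde: [1, 3, 2, 1, 1] },
];
// A fixed "random" source that always picks the last remaining candidate.
const result = processGroup(group, () => 0.99);
console.log(result.map((r) => r.D), result[0].abcde);
```

With that fixed source the first row picks 'c', the second picks 'b' then 'a', and the third picks 'e', 'd', 'c', so every row ends with abcde equal to [0,2,0,0,0], the same final state as in the question. A row can receive fewer than E items when too few candidates have a positive count left.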