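[Posted]: 2017-05-11 07:49:52
[Question]:
After a UDF processes each row, I want to update a column for all rows.
The example has 3 rows and 6 columns. Column 'A' has the same value in all 3 rows; columns 'A' and 'B' together form the composite identifier of each row; column 'C' is an array containing any of the letters a, b, c, d, e; column 'D' is the target array to be filled; column 'E' is an integer; column 'abcde' is an array of 5 integers giving the count for each of the letters a, b, c, d, e.
Each row is passed to a UDF that updates columns 'D' and 'abcde' based on columns 'C' and 'E'. The rule is: select from 'C' the number of items specified by 'E' and put them into 'D'; the selection is random; after the selection for a row is done, column 'abcde' is updated for all rows.
For example, to process the first row, we randomly select one item from ('a','b','c') and put it into 'D'. Suppose the system picks 'c' from column 'C'. The value of 'D' for that row becomes ['c'], and 'abcde' is updated to [1,3,1,1,1] (previously [1,3,2,1,1]) for all three rows.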
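To make the rule for a single row concrete, here is a minimal sketch in plain JavaScript, run outside BigQuery. The helper name `selectForRow` and the injectable `pick` function are hypothetical and only there for illustration. The sketch also assumes that a letter is chosen at most once per row and that selection stops early when no candidate is left; the question does not state either.

```javascript
// Letters in the fixed order used by the 'abcde' column.
const LETTERS = ['a', 'b', 'c', 'd', 'e'];

// Hypothetical helper: process one row against the shared counts array.
// `pick(n)` returns an index in [0, n); it is random by default but can be
// replaced to make the result reproducible.
function selectForRow(row, counts, pick = (n) => Math.floor(Math.random() * n)) {
  const remaining = counts.slice();       // do not mutate the caller's array
  const pool = row.C.slice();             // letters not yet chosen for this row
  const chosen = [];
  while (chosen.length < row.E) {
    // Only letters with a positive remaining count are candidates.
    const candidates = pool.filter((l) => remaining[LETTERS.indexOf(l)] > 0);
    if (candidates.length === 0) break;   // assumption: stop early if nothing is left
    const letter = candidates[pick(candidates.length)];
    chosen.push(letter);
    remaining[LETTERS.indexOf(letter)] -= 1;
    pool.splice(pool.indexOf(letter), 1); // assumption: each letter at most once per row
  }
  return { D: chosen, abcde: remaining };
}

// First row of the example, forcing the pick to land on the last candidate ('c').
const first = selectForRow({ C: ['a', 'b', 'c'], E: 1 }, [1, 3, 2, 1, 1], (n) => n - 1);
console.log(first); // D is ['c'], abcde is [1, 3, 1, 1, 1]
```

The returned `abcde` array is the value that would then have to be written to every row, which is the part the question is about.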
Sample data:
#StandardSQL in BigQuery
#code to generate the example table
with sample as (
select 'y1' as A, 'x1' as B, ['a','b','c'] as C, [] as D, 1 as E, [1,3,2,1,1] as abcde union all
select 'y1','x2',['a','b'],[],2,[1,3,2,1,1] union all
select 'y1','x3',['c','d','e'],[],3,[1,3,2,1,1])
select * from sample order by B
After the first row is processed:
with sample as (
select 'y1' as A, 'x1' as B, ['a','b','c'] as C, ['c'] as D, 1 as E, [1,3,1,1,1] as abcde union all
select 'y1','x2',['a','b'],[],2,[1,3,1,1,1] union all
select 'y1','x3',['c','d','e'],[],3,[1,3,1,1,1])
select * from sample order by B
After the second row is processed:
with sample as (
select 'y1' as A, 'x1' as B, ['a','b','c'] as C, ['c'] as D, 1 as E, [0,2,1,1,1] as abcde union all
select 'y1','x2',['a','b'],['a','b'],2,[0,2,1,1,1] union all
select 'y1','x3',['c','d','e'],[],3,[0,2,1,1,1])
select * from sample order by B
After the third row is processed:
with sample as (
select 'y1' as A, 'x1' as B, ['a','b','c'] as C, ['c'] as D, 1 as E, [0,2,0,0,0] as abcde union all
select 'y1','x2',['a','b'],['a','b'],2,[0,2,0,0,0] union all
select 'y1','x3',['c','d','e'],['c','d','e'],3,[0,2,0,0,0])
select * from sample order by B
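The three states above can be replayed with a small sketch in plain JavaScript, run outside BigQuery, that walks the rows in order of 'B' while sharing one counts array. The name `processGroup` and the scripted `pick` function are hypothetical. As an assumption not stated in the question, a letter is chosen at most once per row and selection stops when no candidate is left. How such logic would be wired into a BigQuery query is not shown here.

```javascript
const LETTERS = ['a', 'b', 'c', 'd', 'e'];

// Hypothetical helper: process all rows of one group in order, sharing one
// counts array, and return a snapshot of the whole table after each row.
function processGroup(rows, pick = (n) => Math.floor(Math.random() * n)) {
  const counts = rows[0].abcde.slice();           // all rows start with the same counts
  const state = rows.map((r) => ({ ...r, D: [] }));
  const snapshots = [];
  for (const row of state) {
    const pool = row.C.slice();
    while (row.D.length < row.E) {
      const candidates = pool.filter((l) => counts[LETTERS.indexOf(l)] > 0);
      if (candidates.length === 0) break;         // assumption: stop when nothing is left
      const letter = candidates[pick(candidates.length)];
      row.D.push(letter);
      counts[LETTERS.indexOf(letter)] -= 1;
      pool.splice(pool.indexOf(letter), 1);       // assumption: no repeats within a row
    }
    // Copy the current counts onto every row, processed or not.
    snapshots.push(state.map((r) => ({ B: r.B, D: r.D.slice(), abcde: counts.slice() })));
  }
  return snapshots;
}

const rows = [
  { A: 'y1', B: 'x1', C: ['a', 'b', 'c'], E: 1, abcde: [1, 3, 2, 1, 1] },
  { A: 'y1', B: 'x2', C: ['a', 'b'], E: 2, abcde: [1, 3, 2, 1, 1] },
  { A: 'y1', B: 'x3', C: ['c', 'd', 'e'], E: 3, abcde: [1, 3, 2, 1, 1] },
];

// Scripted picks that replay the walkthrough: 'c' first, then always the first candidate.
const script = [2, 0, 0, 0, 0, 0];
const snapshots = processGroup(rows, () => script.shift());
for (const snap of snapshots) console.log(JSON.stringify(snap));
```

Each printed snapshot corresponds to one of the three tables above: the counts go from [1,3,1,1,1] to [0,2,1,1,1] to [0,2,0,0,0], and every row carries the same 'abcde' value within a snapshot.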
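Don't worry about how the UDF will perform the random selection. I only want to know whether updating column 'abcde' in the way I described is possible in BigQuery.
I have tried using a UDF, but I am struggling to make it work, because my understanding is that a UDF can only take one row as input and output multiple rows. So I cannot update the other rows. Is it possible with SQL alone?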
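Expected output (the original screenshots are not included): the table after the first row is processed, and the table after the third row is processed.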
Additional information:
create temporary function selection(A STRING, B STRING, C ARRAY<STRING>, D ARRAY<STRING>, E INT64, abcde ARRAY<INT64>)
returns STRUCT<A STRING, B STRING, C ARRAY<STRING>, D ARRAY<STRING>, E INT64, abcde ARRAY<INT64>>
language js AS """
/*
for each row i in the data:
select i.E items (randomly) from i.C, where the count associated with the item in i.abcde is greater than 0 (i.e. only items with a count greater than 0 in abcde can be candidates for the random selection);
put the selected items in i.D and deduct the number of selected items from the count of the corresponding item in column 'abcde' FOR ALL ROWS;
proceed to the next row i+1 until every row is processed;
*/
return {A,B,C,D,E,abcde}
""";
with sample as (
select 'y1' as A, 'x1' as B, ['a','b','c'] as C, CAST([] AS ARRAY<STRING>) as D, 1 as E, [1,3,2,1,1] as abcde union all
select 'y1','x2',['a','b'],[],2,[1,3,2,1,1] union all
select 'y1','x3',['c','d','e'],[],2,[1,3,2,1,1])
select selection(A,B,C,D,E,abcde) from sample order by B
[Comments]:
Tags: google-bigquery