【Posted】: 2025-12-19 07:00:17
【Question】:
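When I use a CASE WHEN statement inside a UDF in a Google BigQuery query, I see some very strange behavior. The results are odd enough that either I am missing something obvious or something unexpected is happening during query execution.
(Side note: if there is a more efficient way to implement the query logic below, I am all ears; my query takes forever.)
I am processing log rows, each of which has a data: string field and a topics: array<string> field used for decoding. Each type of log row has a different topics length and needs different decoding logic. I use CASE WHEN in a UDF to switch between decoding methods. I initially hit an error about indexing too far into an array. That means either non-conforming data or the wrong decoder being called at some point. I verified that all the data conforms to the spec, so it must be the latter.
I have narrowed it down to the wrong/unrelated decoder being executed for the wrong event type in my CASE WHEN.
The strangest part is that when I plug in fixed values instead of the decoding functions, the value returned by CASE WHEN does not indicate a wrong match. Somehow, when I use functions, the first function gets called, yet when debugging I get the correct value from the second WHEN.
I pulled the logic out of the UDF and implemented it with if(..) instead of CASE WHEN, and everything decodes fine. I would like to know what is going on here: is it a bug in BigQuery, or is something odd happening when UDFs are involved?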
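For reference, the if(..) variant mentioned above might look roughly like the following. This is only a minimal sketch under stated assumptions: the `logs` CTE and its two rows are made-up sample data, the output field names `value_a` and `value_b` are hypothetical, and each branch is a simplified stand-in for the real decoders rather than the actual query.

```sql
-- Hypothetical sample rows standing in for the real log table.
with logs as (
  select 1 as id,
         '0x' || repeat('a', 64) as data,
         ['topic1', '0x1111', '0x2222', '0x3333'] as topics
  union all
  select 2 as id,
         '0x' || repeat('b', 192) as data,
         ['topic2'] as topics
)
select
  id,
  -- Nested if(..) selects the branch by event type. Each branch is a
  -- simplified stand-in for one decoder and only reads the offsets that
  -- its event type is expected to have.
  if(topics[offset(0)] = 'topic1',
     struct(substring(topics[offset(1)], 3) as value_a,
            substring(data, 3, 64) as value_b),
     if(topics[offset(0)] = 'topic2',
        struct(substring(data, 3, 64) as value_a,
               substring(data, 67, 64) as value_b),
        struct(cast(null as string) as value_a,
               cast(null as string) as value_b))) as decoded_payload
from logs;
```

All branches return the same struct type so that if(..) can resolve a single result type.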
Here is a condensed version of the query:
-- helper function to normalize different payloads into a flattened struct
create temporary function wrap_struct(payload array<struct<name string, value string>>) as (
(select as struct
decode_field_type1(field1) as field1,
decode_field_type1(field2) as field2,
decode_field_type2(field3) as field3,
-- a bunch more fields
from (select * from
(select p.name, p.value
from unnest(payload) as p) pivot(string_agg(value) for name in (
'field1', 'field2', 'field3', --a bunch more fields
)
)
))
);
-- this topic uses the data and topics in the decoding, and has a topics array of length 4
-- this gets called from the switch with a payload from topics2, which has a shorter topics array of length 1, causing a failure
create temporary function decode_topic1(data string, topics array<string>) as
(
wrap_struct([
struct("field1" as name, substring(topics[offset(1)], 3) as value),
struct("field2" as name, substring(topics[offset(2)], 3) as value),
struct("field3" as name, substring(topics[offset(3)], 3) as value),
struct("field4" as name, substring(data, 3, 64) as value)
])
);
--this uses only the data_payload, and has a topics array of length 1
create temporary function decode_topic2(data string, topics array<string>) as
(
wrap_struct([
struct("field1" as name, substring(data, 3, 64) as value),
struct("field5" as name, substring(data, 67, 64) as value),
struct("field6" as name, substring(data, 131, 64) as value)
])
);
create temporary function decode_event_data(data string, topics array<string>) as
(
-- first element of topics denotes the type of event
case
-- somehow the function decode_topic1 gets called when topics[0] == topic2
-- HOWEVER, when i replaced the functions with a fixed value to debug
-- i get the expected results, indicating a proper match.
-- this is not unique to these topics
-- it happens with other combinations also.
when topics[offset(0)] = 'topic1' then decode_topic1(data, topics)
when topics[offset(0)] = 'topic2' then decode_topic2(data, topics)
-- a bunch more topics
else wrap_struct([])
end
);
select
id, data, topics,
decode_event_data(data, topics) as decoded_payload
from (select * from mytable
where
topics[offset(0)] = 'topic1'
or topics[offset(0)] = 'topic2')
When I change the base query to:
select
id, data, topics, decode_topic2(data, topics)
from (select * from mytable
where
topics[offset(0)] = 'topic2')
it decodes fine.
What is going on with CASE WHEN?
Edit: here is a query against a public dataset that reproduces the problem:
concat('0x', substring(raw, 25, 40))
);
create temporary function decode_amount(raw string) as (
concat('0x', raw)
);
create temporary function wrap_struct(payload array<struct<name string, value string>>) as (
(select as struct
decode_address(sender) as reserve,
decode_address(`to`) as `to`,
decode_amount(amount1) as amount1,
decode_amount(amount2) as amount2,
from (select * from
(select p.name, p.value
from unnest(payload) as p) pivot(string_agg(value) for name in (
'sender', 'to', 'amount1', 'amount2'
)
)
))
);
create temporary function decode_mint(data_payload string, topics array<string>) as
(
wrap_struct([
struct("sender" as name, substring(topics[offset(1)], 3) as value),
struct("amount1" as name, substring(data_payload, 3, 64) as value),
struct("amount2" as name, substring(data_payload, 67, 64) as value)
])
);
create temporary function decode_burn(data_payload string, topics array<string>) as
(
wrap_struct([
struct("sender" as name, substring(topics[offset(1)], 3) as value),
struct("amount1" as name, substring(data_payload, 3, 64) as value),
struct("amount2" as name, substring(data_payload, 67, 64) as value),
struct("to" as name, substring(topics[offset(2)], 67, 64) as value)
])
);
select
*,
case
when topics[offset(0)] = '0x4c209b5fc8ad50758f13e2e1088ba56a560dff690a1c6fef26394f4c03821c4f' then decode_mint(data, topics)
when topics[offset(0)] = '0xd78ad95fa46c994b6551d0da85fc275fe613ce37657fb8d5e3d130840159d822' then decode_burn(data, topics)
end as decoded_payload
from `public-data-finance.crypto_ethereum_kovan.logs`
where
array_length(topics) > 0
and (
(array_length(topics) = 2 and topics[offset(0)] = '0x4c209b5fc8ad50758f13e2e1088ba56a560dff690a1c6fef26394f4c03821c4f')
or (array_length(topics) = 3 and topics[offset(0)] = '0xd78ad95fa46c994b6551d0da85fc275fe613ce37657fb8d5e3d130840159d822')
)
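One way to double-check that every event type really has the expected topics length is to count rows per (event type, array length) pair. The sketch below uses a made-up `logs` CTE with placeholder topic names so it runs on its own; in practice the CTE would be replaced by the real table, and the column aliases are hypothetical.

```sql
-- Hypothetical sample rows; swap this CTE for the real logs table.
with logs as (
  select ['topic1', 'a', 'b', 'c'] as topics
  union all select ['topic2']
  union all select ['topic2']
)
select
  -- safe_offset returns NULL instead of raising an error on empty arrays
  topics[safe_offset(0)] as event_type,
  array_length(topics) as topic_count,
  count(*) as row_count
from logs
group by event_type, topic_count
order by event_type, topic_count;
```

If any event type shows up with more than one topic_count, the data does not match the assumption the decoders rely on.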
【Comments】:
-
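I could not reproduce the problem on test data. I used the same level of nested temporary functions, so this seems to be data-related. Could you provide some sample data that produces the problem?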
-
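Hi, thanks for the reply. I edited my original post to include a query against a public dataset that reproduces the problem (at the bottom of the post).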
-
The code seems incomplete... I tried to fix the decode_address function, but it does not run properly and reports Array index 2 is out of bounds (overflow)...
-
Yes, I trimmed the query down a lot, so the details of my business logic are not in there.
-
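That array problem is exactly the issue I posted about. Something is odd about how UDFs are executed with a case statement. Maybe all branches are executed, and only the result matching the WHEN is picked. The where clause should guarantee that we never have a topics length that does not match what the decoder expects. Anyway, I now run multiple jobs, one per topic, and that works fine. When I try to run all topics at once, the array index goes out of bounds.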
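If branches really are evaluated for rows they do not match, one defensive option is to index with safe_offset, which yields NULL instead of an out-of-bounds error. This is a minimal sketch, not the original decoder: `decode_burn_safe`, the field name `recipient`, and the sample arguments are hypothetical, and this only masks the error rather than explaining the CASE WHEN behavior.

```sql
-- Hypothetical decoder variant: safe_offset returns NULL when the index
-- is past the end of the array, so a short topics array does not fail.
create temp function decode_burn_safe(data_payload string, topics array<string>) as (
  struct(
    substring(topics[safe_offset(1)], 3) as sender,
    substring(topics[safe_offset(2)], 3) as recipient,
    substring(data_payload, 3, 64) as amount1
  )
);

-- Made-up call with only two topics: offset 2 is missing here.
select
  decode_burn_safe(
    '0x' || repeat('0', 128),
    ['0xd78a', '0x' || repeat('1', 64)]
  ) as decoded;
```

With a two-element topics array the `recipient` field comes back as NULL, so rows routed to the wrong decoder would have to be filtered out or ignored downstream.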
Tags: sql google-bigquery