【Posted】: 2021-05-10 11:01:46
【Question】:
Problem statement:
Find the process_id values for which process_name is 'test.exe', registry_key is '\\REGISTRY\\test', and ip is '192.x.x.x'.
Schema:
process_name is in process table
registry_key is in registry table
ip is in network table
process_id is common across all tables
Each table is about 500 GB, and the data is stored on S3 in ORC format. I query it by creating Hive external tables and using Presto as the processing engine.
I can solve the problem above in either of the following ways:
- Using INTERSECT
  SELECT process_id FROM process_table WHERE process_name = 'test.exe'
  INTERSECT
  SELECT process_id FROM registry_table WHERE registry_key = '\\REGISTRY\\test'
  INTERSECT
  SELECT process_id FROM network_table WHERE ip = '192.x.x.x'
- Using JOIN
  SELECT process_table.process_id
  FROM process_table
  INNER JOIN registry_table ON process_table.process_id = registry_table.process_id
  INNER JOIN network_table ON process_table.process_id = network_table.process_id
  WHERE process_name = 'test.exe'
    AND registry_key = '\\REGISTRY\\test'
    AND ip = '192.x.x.x'
Both return the same result; I would like to know which is more efficient: JOIN or INTERSECT?
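As a sanity check on the two query shapes, here is a minimal sketch that runs both against toy data. It uses an in-memory SQLite database rather than Presto, and the rows are invented for illustration, so it says nothing about performance on the real 500 GB tables. It only compares what the two forms return.

```python
import sqlite3

# Toy stand-ins for the three tables; all rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE process_table  (process_id INTEGER, process_name TEXT);
CREATE TABLE registry_table (process_id INTEGER, registry_key TEXT);
CREATE TABLE network_table  (process_id INTEGER, ip TEXT);
""")

key = r"\\REGISTRY\\test"  # same literal as in the queries above
conn.executemany("INSERT INTO process_table VALUES (?, ?)",
                 [(1, "test.exe"), (2, "test.exe"), (3, "other.exe")])
# process_id 1 matches the registry key twice (assumed duplicate rows).
conn.executemany("INSERT INTO registry_table VALUES (?, ?)",
                 [(1, key), (1, key), (2, "other"), (3, key)])
conn.executemany("INSERT INTO network_table VALUES (?, ?)",
                 [(1, "192.x.x.x"), (2, "192.x.x.x"), (3, "192.x.x.x")])

params = {"name": "test.exe", "key": key, "ip": "192.x.x.x"}

intersect_sql = """
SELECT process_id FROM process_table WHERE process_name = :name
INTERSECT
SELECT process_id FROM registry_table WHERE registry_key = :key
INTERSECT
SELECT process_id FROM network_table WHERE ip = :ip
"""

join_sql = """
SELECT process_table.process_id
FROM process_table
INNER JOIN registry_table ON process_table.process_id = registry_table.process_id
INNER JOIN network_table ON process_table.process_id = network_table.process_id
WHERE process_name = :name AND registry_key = :key AND ip = :ip
"""

intersect_ids = [row[0] for row in conn.execute(intersect_sql, params)]
join_ids = [row[0] for row in conn.execute(join_sql, params)]

print("INTERSECT:", intersect_ids)
print("JOIN:     ", join_ids)
print("same set of ids:", set(intersect_ids) == set(join_ids))
```

On this toy data the two forms agree on the set of process_id values, but the JOIN repeats a process_id once per matching row combination, while INTERSECT returns each value once. If the real tables can hold several matching rows per process_id, the JOIN form would presumably need SELECT DISTINCT to give the same output.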
【Discussion】: