See if this helps.
Steps + pseudo-code
1 - Upload the combined data (300GB) to BigQuery, into a CombinedData table
2 - Split by year (cost: 1 table x 2 queries x 300GB scanned = 600GB)
SELECT * FROM CombinedData WHERE year = year1 -> write to DataY1 table
SELECT * FROM CombinedData WHERE year = year2 -> write to DataY2 table
3 - Split each year into half-years (cost 2x2x150GB = 600GB)
SELECT * FROM DataY1 WHERE month in (1,2,3,4,5,6) -> write to DataY1H1 table
SELECT * FROM DataY1 WHERE month in (7,8,9,10,11,12) -> write to DataY1H2 table
SELECT * FROM DataY2 WHERE month in (1,2,3,4,5,6) -> write to DataY2H1 table
SELECT * FROM DataY2 WHERE month in (7,8,9,10,11,12) -> write to DataY2H2 table
4 - Split each half-year into quarters (cost 4x2x75GB = 600GB)
SELECT * FROM DataY1H1 WHERE month in (1,2,3) -> write to DataY1Q1 table
SELECT * FROM DataY1H1 WHERE month in (4,5,6) -> write to DataY1Q2 table
SELECT * FROM DataY1H2 WHERE month in (7,8,9) -> write to DataY1Q3 table
SELECT * FROM DataY1H2 WHERE month in (10,11,12) -> write to DataY1Q4 table
SELECT * FROM DataY2H1 WHERE month in (1,2,3) -> write to DataY2Q1 table
SELECT * FROM DataY2H1 WHERE month in (4,5,6) -> write to DataY2Q2 table
SELECT * FROM DataY2H2 WHERE month in (7,8,9) -> write to DataY2Q3 table
SELECT * FROM DataY2H2 WHERE month in (10,11,12) -> write to DataY2Q4 table
5 - Split each quarter into a 1-month table and a 2-month table (cost 8x2x37.5GB = 600GB)
SELECT * FROM DataY1Q1 WHERE month = 1 -> write to DataY1M01 table
SELECT * FROM DataY1Q1 WHERE month in (2,3) -> write to DataY1M02-03 table
SELECT * FROM DataY1Q2 WHERE month = 4 -> write to DataY1M04 table
SELECT * FROM DataY1Q2 WHERE month in (5,6) -> write to DataY1M05-06 table
Same for the rest of the Y(1/2)Q(1-4) tables
6 - Split every 2-month table into separate monthly tables (cost 8x2x25GB = 400GB)
SELECT * FROM DataY1M02-03 WHERE month = 2 -> write to DataY1M02 table
SELECT * FROM DataY1M02-03 WHERE month = 3 -> write to DataY1M03 table
SELECT * FROM DataY1M05-06 WHERE month = 5 -> write to DataY1M05 table
SELECT * FROM DataY1M05-06 WHERE month = 6 -> write to DataY1M06 table
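Same for the rest of the Y(1/2)M(XX-YY) tables
7 - You end up with 24 monthly tables. Hopefully the limits you are hitting go away at this point, so you can proceed with your plan (the second approach, say) and split further into daily tables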
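Writing out all 46 statements by hand is error-prone, so here is a rough Python sketch that generates them in the same "SELECT ... -> write to ..." form as the steps above. The function names (`split_plan`, `months_in`) are made up for illustration, and `year1` / `year2` are the same placeholders as in the pseudo-code. It only builds strings; it does not talk to BigQuery.

```python
def months_in(months):
    """Build the month filter used in the pseudo-SQL above."""
    months = list(months)
    if len(months) == 1:
        return f"month = {months[0]}"
    return "month in (" + ",".join(map(str, months)) + ")"


def split_plan(source="CombinedData", years=("year1", "year2")):
    """Return (query, destination_table) pairs for steps 2-6.

    Pairs are ordered so that every table is written before it is read.
    """
    plan = []

    def add(src, condition, dest):
        plan.append((f"SELECT * FROM {src} WHERE {condition}", dest))

    for y, year in enumerate(years, start=1):
        year_table = f"DataY{y}"
        add(source, f"year = {year}", year_table)                  # step 2
        for h in (1, 2):
            half_table = f"{year_table}H{h}"
            first = 6 * (h - 1) + 1                                # month 1 or 7
            add(year_table, months_in(range(first, first + 6)), half_table)  # step 3
            for q in (2 * h - 1, 2 * h):
                quarter_table = f"{year_table}Q{q}"
                m = 3 * (q - 1) + 1                                # first month of the quarter
                add(half_table, months_in(range(m, m + 3)), quarter_table)   # step 4
                pair_table = f"{year_table}M{m + 1:02d}-{m + 2:02d}"
                add(quarter_table, months_in([m]), f"{year_table}M{m:02d}")  # step 5
                add(quarter_table, months_in([m + 1, m + 2]), pair_table)
                add(pair_table, months_in([m + 1]), f"{year_table}M{m + 1:02d}")  # step 6
                add(pair_table, months_in([m + 2]), f"{year_table}M{m + 2:02d}")
    return plan


for query, dest in split_plan():
    print(f"{query} -> write to {dest} table")
```

Each printed line still has to be run as a query with the named destination table, in whatever way you normally run queries.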
I think this is the most cost-effective approach. The total query cost is
(assuming billing tier 1)
4x600GB + 400GB = 2800GB = $14
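To double-check the arithmetic, here is a small Python sketch of the same calculation. It assumes every query is billed for the full size of the table it reads, that the 300GB is spread evenly over the 24 months, and a rate of $5 per 1000GB, which is simply what "2800GB = $14" works out to. Check the current price list before relying on the dollar figure.

```python
PRICE_PER_1000_GB = 5.0  # USD; assumed, implied by 2800GB = $14

# (step, source tables, queries per source table, GB in each source table)
steps = [
    ("2 - by year", 1, 2, 300),
    ("3 - by half-year", 2, 2, 150),
    ("4 - by quarter", 4, 2, 75),
    ("5 - 1 month + 2 months", 8, 2, 37.5),
    ("6 - single months", 8, 2, 25),
]

# Each query scans its whole source table, so the scanned volume per step is
# (source tables) x (queries per table) x (table size).
scanned_gb = {name: tables * queries * gb for name, tables, queries, gb in steps}
total_gb = sum(scanned_gb.values())
cost_usd = total_gb * PRICE_PER_1000_GB / 1000

for name, gb in scanned_gb.items():
    print(f"step {name}: {gb:g}GB")
print(f"total: {total_gb:g}GB, about ${cost_usd:.2f}")
```

Steps 2 to 5 each read the full 300GB twice, because the tables halve while the number of tables doubles. Step 6 is cheaper only because the single-month tables from step 5 are already done.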
And of course, don't forget to delete the intermediate tables
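For the cleanup, a short Python sketch that lists the intermediate tables produced by steps 2 to 5, using the naming from the pseudo-code above. `intermediate_tables` is a hypothetical helper; CombinedData and the 24 monthly tables are deliberately left out. How you delete them (web UI, `bq rm`, a `DROP TABLE` statement) is up to you.

```python
def intermediate_tables(num_years=2):
    """Names of the tables that only exist as stepping stones."""
    names = []
    for y in range(1, num_years + 1):
        year_table = f"DataY{y}"
        names.append(year_table)                               # from step 2
        names += [f"{year_table}H{h}" for h in (1, 2)]         # from step 3
        names += [f"{year_table}Q{q}" for q in range(1, 5)]    # from step 4
        for q in range(1, 5):
            m = 3 * (q - 1) + 1                                # first month of the quarter
            names.append(f"{year_table}M{m + 1:02d}-{m + 2:02d}")  # 2-month tables from step 5
    return names


for name in intermediate_tables():
    print(name)
```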
Note: I am not happy with this plan, but if splitting the original file into daily chunks outside of BigQuery is not an option, this can help