【发布时间】:2021-10-15 20:16:42
【问题描述】:
我想提取变量 B2 的 5-95 个分位数内的均值、最大值、最小值和 sd 提取值,
B3, B4, B8, NDVI, SAVI, SIPI, SR, RGI, TVI, MSR, PRI, GNDVI, @987654337 @、GCI 由 CMPC 表中的 AGE 和 ESPAC 变量聚合:
我的 CMPC SQL 表 ([PROJECT_ID].spectra_calibration.CMPC) 在 BigQuery 中创建:
Rows: 55.310
Columns: 27
Database: BigQueryConnection
$ x <dbl> -52.5502, -52.5501, -52.5501, -52.5501, -52.5501, -52.5500, -52.5500, -52.5500, -52.5500, -52.5500,~
$ y <dbl> -30.8295, -30.8297, -30.8296, -30.8295, -30.8294, -30.8298, -30.8297, -30.8296, -30.8295, -30.8294,~
$ stand <chr> "ABRANJO001A", "ABRANJO001A", "ABRANJO001A", "ABRANJO001A", "ABRANJO001A", "ABRANJO001A", "ABRANJO0~
$ date <chr> "2019-01-28", "2019-01-28", "2019-01-28", "2019-01-28", "2019-01-28", "2019-01-28", "2019-01-28", "~
$ B2 <dbl> 213, 205, 181, 207, 216, 205, 165, 161, 173, 182, 181, 259, 227, 190, 153, 147, 160, 164, 194, 210,~
$ B3 <dbl> 361.0, 362.0, 346.0, 352.0, 369.0, 330.0, 290.0, 326.0, 334.0, 332.0, 325.0, 375.0, 352.0, 307.0, 2~
$ B4 <dbl> 227.0, 233.0, 198.0, 207.0, 209.0, 227.0, 178.0, 164.0, 180.0, 207.0, 209.0, 267.0, 269.0, 194.0, 1~
$ B8 <dbl> 3033.0, 3307.0, 3322.0, 3232.0, 3241.0, 3065.0, 3306.0, 3422.0, 3427.0, 3392.0, 3165.0, 3206.0, 298~
$ NDVI <dbl> 0.86074, 0.86836, 0.88750, 0.87962, 0.87884, 0.86209, 0.89782, 0.90853, 0.90019, 0.88497, 0.87611, ~
$ SAVI <dbl> 4549.379, 4960.386, 4982.905, 4847.897, 4861.397, 4597.380, 4958.915, 5132.925, 5140.417, 5087.903,~
$ SIPI <dbl> 1.00499, 1.00911, 1.00544, 1.00000, 0.99769, 1.00775, 1.00416, 1.00092, 1.00216, 1.00785, 1.00947, ~
$ SR <dbl> 13.36123, 14.19313, 16.77778, 15.61353, 15.50718, 13.50220, 18.57303, 20.86585, 19.03889, 16.38647,~
$ RGI <dbl> 0.62881, 0.64365, 0.57225, 0.58807, 0.56640, 0.68788, 0.61379, 0.50307, 0.53892, 0.62349, 0.64308, ~
$ TVI <int> 173720, 189600, 193360, 187300, 188320, 174400, 192160, 201960, 200980, 196100, 182000, 180660, 166~
$ MSR <dbl> 3.65530, 3.76738, 4.09607, 3.95140, 3.93792, 3.67453, 4.30964, 4.56792, 4.36336, 4.04802, 3.89147, ~
$ PRI <dbl> -0.25784, -0.27690, -0.31309, -0.25939, -0.26154, -0.23364, -0.27473, -0.33881, -0.31755, -0.29183,~
$ GNDVI <dbl> 0.78727, 0.80267, 0.81134, 0.80357, 0.79557, 0.80560, 0.83871, 0.82604, 0.82239, 0.82170, 0.81375, ~
$ PSRI <dbl> -0.04418, -0.03901, -0.04455, -0.04486, -0.04937, -0.03361, -0.03388, -0.04734, -0.04494, -0.03685,~
$ GCI <dbl> 7.40166, 8.13536, 8.60116, 8.18182, 7.78320, 8.28788, 10.40000, 9.49693, 9.26048, 9.21687, 8.73846,~
$ ID_PROJETO <int> 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245,~
$ PROJETO <chr> "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", ~
$ CD_TALHAO <chr> "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", "00~
$ DATA_PLANT <chr> "2008-07-15", "2008-07-15", "2008-07-15", "2008-07-15", "2008-07-15", "2008-07-15", "2008-07-15", "~
$ ESPECIE <chr> "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", ~
$ ESPAC <chr> "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.~
$ AGE_1 <dbl> 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.~
$ AGE <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,~
引用 CSV 中的示例表:https://raw.githubusercontent.com/Leprechault/trash/main/my_ds_CSV.csv
我的文件架构是:
x FLOAT NULLABLE
y FLOAT NULLABLE
stand STRING NULLABLE
date STRING NULLABLE
B2 FLOAT NULLABLE
B3 FLOAT NULLABLE
B4 FLOAT NULLABLE
B8 FLOAT NULLABLE
NDVI FLOAT NULLABLE
SAVI FLOAT NULLABLE
SIPI FLOAT NULLABLE
SR FLOAT NULLABLE
RGI FLOAT NULLABLE
TVI INTEGER NULLABLE
MSR FLOAT NULLABLE
PRI FLOAT NULLABLE
GNDVI FLOAT NULLABLE
PSRI FLOAT NULLABLE
GCI FLOAT NULLABLE
ID_PROJETO INTEGER NULLABLE
PROJETO STRING NULLABLE
CD_TALHAO STRING NULLABLE
DATA_PLANT STRING NULLABLE
ESPECIE STRING NULLABLE
ESPAC STRING NULLABLE
AGE_1 FLOAT NULLABLE
AGE INTEGER NULLABLE
我尝试为测试只查询一个变量 (B2),而理想的查询类似于:
SELECT DISTINCT AGE, ESPAC
,PERCENTILE_DISC(B2,0.05) OVER(PARTITION BY AGE, ESPAC) AS P05_B2
,PERCENTILE_DISC(B2,0.95) OVER(PARTITION BY AGE, ESPAC) AS P95_B2
,MIN(B2 > P05_B2 & B2 < P95_B2) OVER (PARTITION BY AGE, ESPAC ORDER BY B2) AS B2_min
,AVG(B2 > P05_B2 & B2 < P95_B2) OVER (PARTITION BY AGE, ESPAC ORDER BY B2) AS B2_mean
,MAX(B2 > P05_B2 & B2 < P95_B2) OVER (PARTITION BY AGE, ESPAC ORDER BY B2) AS B2_max
,stddev(B2 > P05_B2 & B2 < P95_B2) OVER (PARTITION BY AGE, ESPAC ORDER BY B2) AS B2_sd
FROM `[PROJECT_ID].spectra_calibration.CMPC`
ORDER BY AGE, ESPAC
基本思想是仅在值 > P05_B2 和 P05_B2 & B2
# AGE ESPAC B2_mean B2_max B2_min B2_sd
# 1 -2 4X1.85 125. 175 75 14.2
# 2 -1 4X1.85 153. 300 67 34.0
# 3 0 4X1.85 419. 928. 71 274.
# 4 1 4X1.85 344. 683 129 83.4
# 5 11 3.5x2.14 137. 259 70 29.8
# 6 12 3.5x2.14 150. 298 67.5 23.6
# 7 13 3.5x2.14 130. 302 70 35.3
# ...
请帮忙看看这个查询结构?
【问题讨论】:
-
您能否重新发送更新的 CSV 文件,其中支持 BigQuery Schema and data types,因为您提供的 CSV 文件在加载到 BigQuery 时会抛出多个 errors,且架构为自动检测。
-
谢谢@Sandeep Mohanty,我根据请求更新了 csv 文件并显示了表架构。
标签: sql google-bigquery