【Question title】: Standard deviation, mean, max and min extraction just inside 5-95 quantiles values
【Posted】: 2021-10-15 20:16:42
【Question description】:

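I would like to extract the mean, max, min and sd of the values that fall inside the 5-95 quantile range for the variables B2, B3, B4, B8, NDVI, SAVI, SIPI, SR, RGI, TVI, MSR, PRI, GNDVI, PSRI and GCI, aggregated by the AGE and ESPAC variables in the CMPC table.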

My CMPC SQL table ([PROJECT_ID].spectra_calibration.CMPC), created in BigQuery:

Rows: 55.310
Columns: 27
Database: BigQueryConnection
$ x          <dbl> -52.5502, -52.5501, -52.5501, -52.5501, -52.5501, -52.5500, -52.5500, -52.5500, -52.5500, -52.5500,~
$ y          <dbl> -30.8295, -30.8297, -30.8296, -30.8295, -30.8294, -30.8298, -30.8297, -30.8296, -30.8295, -30.8294,~
$ stand      <chr> "ABRANJO001A", "ABRANJO001A", "ABRANJO001A", "ABRANJO001A", "ABRANJO001A", "ABRANJO001A", "ABRANJO0~
$ date       <chr> "2019-01-28", "2019-01-28", "2019-01-28", "2019-01-28", "2019-01-28", "2019-01-28", "2019-01-28", "~
$ B2         <dbl> 213, 205, 181, 207, 216, 205, 165, 161, 173, 182, 181, 259, 227, 190, 153, 147, 160, 164, 194, 210,~
$ B3         <dbl> 361.0, 362.0, 346.0, 352.0, 369.0, 330.0, 290.0, 326.0, 334.0, 332.0, 325.0, 375.0, 352.0, 307.0, 2~
$ B4         <dbl> 227.0, 233.0, 198.0, 207.0, 209.0, 227.0, 178.0, 164.0, 180.0, 207.0, 209.0, 267.0, 269.0, 194.0, 1~
$ B8         <dbl> 3033.0, 3307.0, 3322.0, 3232.0, 3241.0, 3065.0, 3306.0, 3422.0, 3427.0, 3392.0, 3165.0, 3206.0, 298~
$ NDVI       <dbl> 0.86074, 0.86836, 0.88750, 0.87962, 0.87884, 0.86209, 0.89782, 0.90853, 0.90019, 0.88497, 0.87611, ~
$ SAVI       <dbl> 4549.379, 4960.386, 4982.905, 4847.897, 4861.397, 4597.380, 4958.915, 5132.925, 5140.417, 5087.903,~
$ SIPI       <dbl> 1.00499, 1.00911, 1.00544, 1.00000, 0.99769, 1.00775, 1.00416, 1.00092, 1.00216, 1.00785, 1.00947, ~
$ SR         <dbl> 13.36123, 14.19313, 16.77778, 15.61353, 15.50718, 13.50220, 18.57303, 20.86585, 19.03889, 16.38647,~
$ RGI        <dbl> 0.62881, 0.64365, 0.57225, 0.58807, 0.56640, 0.68788, 0.61379, 0.50307, 0.53892, 0.62349, 0.64308, ~
$ TVI        <int> 173720, 189600, 193360, 187300, 188320, 174400, 192160, 201960, 200980, 196100, 182000, 180660, 166~
$ MSR        <dbl> 3.65530, 3.76738, 4.09607, 3.95140, 3.93792, 3.67453, 4.30964, 4.56792, 4.36336, 4.04802, 3.89147, ~
$ PRI        <dbl> -0.25784, -0.27690, -0.31309, -0.25939, -0.26154, -0.23364, -0.27473, -0.33881, -0.31755, -0.29183,~
$ GNDVI      <dbl> 0.78727, 0.80267, 0.81134, 0.80357, 0.79557, 0.80560, 0.83871, 0.82604, 0.82239, 0.82170, 0.81375, ~
$ PSRI       <dbl> -0.04418, -0.03901, -0.04455, -0.04486, -0.04937, -0.03361, -0.03388, -0.04734, -0.04494, -0.03685,~
$ GCI        <dbl> 7.40166, 8.13536, 8.60116, 8.18182, 7.78320, 8.28788, 10.40000, 9.49693, 9.26048, 9.21687, 8.73846,~
$ ID_PROJETO <int> 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245, 245,~
$ PROJETO    <chr> "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", "ABRANJO", ~
$ CD_TALHAO  <chr> "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", "00~
$ DATA_PLANT <chr> "2008-07-15", "2008-07-15", "2008-07-15", "2008-07-15", "2008-07-15", "2008-07-15", "2008-07-15", "~
$ ESPECIE    <chr> "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", "SALIGNA", ~
$ ESPAC      <chr> "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.5x2.14", "3.~
$ AGE_1      <dbl> 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.5, 10.~
$ AGE        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,~

Sample table in CSV for reference: https://raw.githubusercontent.com/Leprechault/trash/main/my_ds_CSV.csv

My file schema is:

x   FLOAT   NULLABLE    
y   FLOAT   NULLABLE    
stand   STRING  NULLABLE    
date    STRING  NULLABLE    
B2  FLOAT   NULLABLE    
B3  FLOAT   NULLABLE    
B4  FLOAT   NULLABLE    
B8  FLOAT   NULLABLE    
NDVI    FLOAT   NULLABLE    
SAVI    FLOAT   NULLABLE    
SIPI    FLOAT   NULLABLE    
SR  FLOAT   NULLABLE    
RGI FLOAT   NULLABLE    
TVI INTEGER NULLABLE    
MSR FLOAT   NULLABLE    
PRI FLOAT   NULLABLE    
GNDVI   FLOAT   NULLABLE    
PSRI    FLOAT   NULLABLE    
GCI FLOAT   NULLABLE    
ID_PROJETO  INTEGER NULLABLE    
PROJETO STRING  NULLABLE    
CD_TALHAO   STRING  NULLABLE    
DATA_PLANT  STRING  NULLABLE    
ESPECIE STRING  NULLABLE    
ESPAC   STRING  NULLABLE    
AGE_1   FLOAT   NULLABLE    
AGE INTEGER NULLABLE 

As a test I tried querying just one variable (B2); the ideal query would be something like:

SELECT DISTINCT AGE, ESPAC
,PERCENTILE_DISC(B2,0.05) OVER(PARTITION BY AGE, ESPAC) AS P05_B2
,PERCENTILE_DISC(B2,0.95) OVER(PARTITION BY AGE, ESPAC) AS P95_B2
,MIN(B2 > P05_B2 & B2 < P95_B2) OVER (PARTITION BY AGE, ESPAC ORDER BY B2) AS B2_min
,AVG(B2 > P05_B2 & B2 < P95_B2) OVER (PARTITION BY AGE, ESPAC ORDER BY B2) AS B2_mean
,MAX(B2 > P05_B2 & B2 < P95_B2) OVER (PARTITION BY AGE, ESPAC ORDER BY B2) AS B2_max
,stddev(B2 > P05_B2 & B2 < P95_B2) OVER (PARTITION BY AGE, ESPAC ORDER BY B2) AS B2_sd
FROM `[PROJECT_ID].spectra_calibration.CMPC`
ORDER BY AGE, ESPAC

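The basic idea is to compute the statistics (min, mean, max and sd) only over the values where B2 > P05_B2 and B2 < P95_B2. My desired output is: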

#     AGE ESPAC    B2_mean B2_max B2_min B2_sd 
# 1    -2 4X1.85      125.   175    75    14.2    
# 2    -1 4X1.85      153.   300    67    34.0   
# 3     0 4X1.85      419.   928.   71   274.     
# 4     1 4X1.85      344.   683   129    83.4    
# 5    11 3.5x2.14    137.   259    70    29.8    
# 6    12 3.5x2.14    150.   298    67.5  23.6    
# 7    13 3.5x2.14    130.   302    70    35.3    
# ...

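Could you please help me with this query structure?

【Comments】:

  • Could you resend an updated CSV file that conforms to the supported BigQuery schema and data types? The CSV file you provided throws multiple errors when loaded into BigQuery with schema auto-detection.
  • Thanks @Sandeep Mohanty, I updated the CSV file as requested and added the table schema.

Tags: sql google-bigquery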


【Solution 1】:

The expressions (

MIN(B2 > P05_B2 & B2 < P95_B2) ,

AVG(B2 > P05_B2 & B2 < P95_B2),
 
MAX(B2 > P05_B2 & B2 < P95_B2)

)

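use comparison operators such as > and <, so each one evaluates to a Boolean value (TRUE/FALSE) rather than to the value of B2, as described in the BigQuery documentation.

Functions such as MIN(), MAX(), AVG() and STDDEV() need to be given the column itself, so that the function can scan that column and return the result.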

For example:

SELECT MIN(AGE) AS minimum_age FROM `my-project.dataset2.tab1`;

Here MIN() scans the AGE column and returns the smallest value in it, i.e. -2.
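
One way to keep the condition inside the aggregate call is conditional aggregation: wrap the column in IF() so that rows outside the bounds become NULL, which MIN(), MAX(), AVG() and STDDEV() skip. The following is a minimal sketch, not a tested solution. It assumes the placeholder table name used in this answer and assumes the percentiles should be computed per (AGE, ESPAC) group.

```sql
-- Sketch only: `my-project.dataset2.tab1` is the placeholder table name from this answer.
SELECT
  AGE,
  ESPAC,
  -- IF() returns B2 when it lies strictly inside the bounds and NULL otherwise;
  -- the aggregate functions ignore NULL inputs.
  MIN(IF(B2 > P05_B2 AND B2 < P95_B2, B2, NULL))    AS B2_min,
  AVG(IF(B2 > P05_B2 AND B2 < P95_B2, B2, NULL))    AS B2_mean,
  MAX(IF(B2 > P05_B2 AND B2 < P95_B2, B2, NULL))    AS B2_max,
  STDDEV(IF(B2 > P05_B2 AND B2 < P95_B2, B2, NULL)) AS B2_sd
FROM (
  -- The percentile aliases must be defined in a subquery before they can be referenced.
  SELECT
    AGE,
    ESPAC,
    B2,
    PERCENTILE_DISC(B2, 0.05) OVER (PARTITION BY AGE, ESPAC) AS P05_B2,
    PERCENTILE_DISC(B2, 0.95) OVER (PARTITION BY AGE, ESPAC) AS P95_B2
  FROM `my-project.dataset2.tab1`
)
GROUP BY AGE, ESPAC
ORDER BY AGE, ESPAC;
```

Because each aggregate carries its own condition, the same pattern should extend to B3, B4 and the other variables in a single query, with each column trimmed by its own percentile bounds.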

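The sample query you provided raises an error when run in BigQuery.

Error: [screenshot not available]

For reference, you can also check the modified query and its output below.

Query: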

select DISTINCT AGE, ESPAC,P05_B2 ,
MIN(B2 > P05_B2 AND B2 < P95_B2) OVER (PARTITION BY AGE, ESPAC ORDER BY B2) AS MIN_B2,
MAX(B2 > P05_B2 AND B2 < P95_B2) OVER (PARTITION BY AGE, ESPAC ORDER BY B2) AS MAX_B2
FROM(
SELECT DISTINCT AGE, ESPAC,B2
,PERCENTILE_DISC(B2,0.05) OVER(PARTITION BY AGE, ESPAC) AS P05_B2
,PERCENTILE_DISC(B2,0.95) OVER(PARTITION BY AGE, ESPAC) AS P95_B2
FROM `my-project.dataset2.tab1`
ORDER BY AGE, ESPAC
)

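Output: [screenshot not available]

Following your requirement, I wrote a query against your dataset and got results similar to your expected output.

Could you try the same approach and let me know whether it works for you?

Query: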

SELECT AGE, ESPAC,
  AVG(B2) AS B2_mean,
  MAX(B2) AS B2_max,
  MIN(B2) AS B2_min,
  STDDEV(B2) AS B2_sd
FROM (
  -- OVER () computes the percentiles over the whole table
  SELECT *,
    PERCENTILE_DISC(B2, 0.05) OVER () AS P05_B2,
    PERCENTILE_DISC(B2, 0.95) OVER () AS P95_B2
  FROM `my-project.dataset2.tab1`
)
WHERE B2 > P05_B2 AND B2 < P95_B2
GROUP BY AGE, ESPAC
ORDER BY AGE, ESPAC

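Output: [screenshot not available]

Note that the output shown above is based on the sample CSV file provided. The aggregated values may differ depending on the dataset.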
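
The query above takes the 5th and 95th percentiles over the whole table (OVER ()). If the bounds are instead meant to be computed separately for each AGE and ESPAC combination, as the PARTITION BY in the question suggests, a variant could look like the sketch below. This is an assumption about the intended behaviour, not something confirmed in the thread; the table name is the same placeholder as above.

```sql
-- Sketch only: per-group percentile bounds, then trimmed statistics per group.
WITH bounds AS (
  SELECT
    AGE,
    ESPAC,
    B2,
    PERCENTILE_DISC(B2, 0.05) OVER (PARTITION BY AGE, ESPAC) AS P05_B2,
    PERCENTILE_DISC(B2, 0.95) OVER (PARTITION BY AGE, ESPAC) AS P95_B2
  FROM `my-project.dataset2.tab1`
)
SELECT
  AGE,
  ESPAC,
  AVG(B2)    AS B2_mean,
  MAX(B2)    AS B2_max,
  MIN(B2)    AS B2_min,
  STDDEV(B2) AS B2_sd
FROM bounds
-- Strict inequalities drop the boundary values themselves, as in the question.
WHERE B2 > P05_B2 AND B2 < P95_B2
GROUP BY AGE, ESPAC
ORDER BY AGE, ESPAC;
```

A WHERE filter like this suits a single variable. With several variables, filtering on all of their bounds at once would keep only the rows that are inside the range for every variable, so wrapping each column in IF(..., column, NULL) inside its aggregate is probably the safer choice there.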

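【Discussion】:

  • Thanks, @Sandepp, very nice solution!
  • Glad to help, @Leprechault.
  • Sorry @Sandepp Monhanty, but I need to reopen the question. I ran some careful tests today and the results are still not right. The code does restrict the mean/min/etc. to a reduced set that excludes values below the 5th and above the 95th percentile, but I don't know why I don't get one value for each AGE (-2 to 12) and ESPAC (4x1.85 and 3.5x2.14), as in the desired output. For example, it is clear that we need ORDER BY AGE, ESPAC. I believe the code returns all the values below 5 and above 95.
  • Hi @Leprechault, I have updated the answer.
  • Thank you very much, @Sandepp Monhanty. It's fine now; I tested the results using only GROUP BY AGE, ESPAC. Your solution helped me a lot!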