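[Posted]: 2010-05-10 00:43:35
[Question]:
Code
The following code computes the slope and intercept of a linear regression over a large amount of data. It then applies the equation y = mx + b to the same result set to compute the value of the regression line for each row.
How can the two queries be joined so that the data and its slope/intercept are calculated without executing the WHERE clause twice?
The general form of the problem is: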
SELECT a.group, func(a.group, avg_avg)
FROM a,
(SELECT AVG(field1_avg) as avg_avg
FROM (SELECT a.group, AVG(field1) as field1_avg
FROM a
WHERE (SOME_CONDITION)
GROUP BY a.group) as several_lines -- potentially
) as one_line -- always
WHERE (SOME_CONDITION)
GROUP BY a.group -- again, potentially several lines
I have SOME_CONDITION executing twice, as shown below (updated with the STRAIGHT_JOIN optimization):
SELECT STRAIGHT_JOIN
AVG(D.AMOUNT) as AMOUNT,
Y.YEAR * ymxb.SLOPE + ymxb.INTERCEPT as REGRESSION_LINE,
Y.YEAR as YEAR,
MAKEDATE(Y.YEAR,1) as AMOUNT_DATE,
ymxb.SLOPE,
ymxb.INTERCEPT,
ymxb.CORRELATION,
ymxb.MEASUREMENTS
FROM
CITY C,
STATION S,
STATION_DISTRICT SD,
YEAR_REF Y,
MONTH_REF M,
DAILY D,
(SELECT
SUM(MEASUREMENTS) as MEASUREMENTS,
((sum(t.YEAR) * sum(t.AMOUNT)) - (count(1) * sum(t.YEAR * t.AMOUNT))) /
(power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as SLOPE,
((sum( t.YEAR ) * sum( t.YEAR * t.AMOUNT )) -
(sum( t.AMOUNT ) * sum(power(t.YEAR, 2)))) /
(power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as INTERCEPT,
((avg(t.AMOUNT * t.YEAR)) - avg(t.AMOUNT) * avg(t.YEAR)) /
(stddev( t.AMOUNT ) * stddev( t.YEAR )) as CORRELATION
FROM (
SELECT STRAIGHT_JOIN
COUNT(1) as MEASUREMENTS,
AVG(D.AMOUNT) as AMOUNT,
Y.YEAR as YEAR
FROM
CITY C,
STATION S,
STATION_DISTRICT SD,
YEAR_REF Y,
MONTH_REF M,
DAILY D
WHERE
-- For a specific city ...
--
$X{ IN, C.ID, CityCode } AND
-- Find all the stations within a specific unit radius ...
--
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
SD.ID = S.STATION_DISTRICT_ID AND
-- Gather all known years for that station ...
--
Y.STATION_DISTRICT_ID = SD.ID AND
-- The data before 1900 is shaky; insufficient after 2009.
--
Y.YEAR BETWEEN 1900 AND 2009 AND
-- Filtered by all known months ...
--
M.YEAR_REF_ID = Y.ID AND
-- Whittled down by category ...
--
M.CATEGORY_ID = $P{CategoryCode} AND
-- Into the valid daily climate data.
--
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY
Y.YEAR
) t
) ymxb
WHERE
-- For a specific city ...
--
$X{ IN, C.ID, CityCode } AND
-- Find all the stations within a specific unit radius ...
--
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
SD.ID = S.STATION_DISTRICT_ID AND
-- Gather all known years for that station ...
--
Y.STATION_DISTRICT_ID = SD.ID AND
-- The data before 1900 is shaky; insufficient after 2009.
--
Y.YEAR BETWEEN 1900 AND 2009 AND
-- Filtered by all known months ...
--
M.YEAR_REF_ID = Y.ID AND
-- Whittled down by category ...
--
M.CATEGORY_ID = $P{CategoryCode} AND
-- Into the valid daily climate data.
--
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY
Y.YEAR
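For anyone checking the algebra: the SLOPE, INTERCEPT and CORRELATION expressions in the ymxb subquery are the usual closed-form least-squares sums, written with both numerator and denominator negated. Below is a minimal Python sketch of the same arithmetic. The function name fit_line and the sample values are made up for illustration, and statistics.pstdev is used on the assumption that MySQL's STDDEV is the population standard deviation.

```python
import statistics


def fit_line(xs, ys):
    """Return (slope, intercept, correlation) using the same sums as the SQL."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)

    # Same denominator as in the query: power(sum(x), 2) - count * sum(power(x, 2))
    denom = sum_x ** 2 - n * sum_xx

    slope = (sum_x * sum_y - n * sum_xy) / denom
    intercept = (sum_x * sum_xy - sum_y * sum_xx) / denom

    # avg(x * y) - avg(x) * avg(y), divided by the product of population std devs
    covariance = sum_xy / n - (sum_x / n) * (sum_y / n)
    correlation = covariance / (statistics.pstdev(xs) * statistics.pstdev(ys))
    return slope, intercept, correlation


# Hypothetical yearly averages that lie exactly on y = 0.5 * x + 3.
years = [1900, 1901, 1902, 1903]
amounts = [0.5 * year + 3 for year in years]
slope, intercept, correlation = fit_line(years, amounts)
```

With points that lie exactly on a line, the sketch should recover that line's slope and intercept, with a correlation of 1 up to rounding.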
Question
How can the repeated part be executed only once per query, rather than twice? The duplicated code:
$X{ IN, C.ID, CityCode } AND
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= $P{Radius} AND
SD.ID = S.STATION_DISTRICT_ID AND
Y.STATION_DISTRICT_ID = SD.ID AND
Y.YEAR BETWEEN 1900 AND 2009 AND
M.YEAR_REF_ID = Y.ID AND
M.CATEGORY_ID = $P{CategoryCode} AND
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY
Y.YEAR
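Update 1
Using variables and splitting the query seems to let the cache kick in, as it now runs in 3.5 seconds where it used to run in 7. Still, if there is any way to remove the duplicated code, I would appreciate the help.
Update 2
The code above does not run in JasperReports, and a VIEW, although a possible fix, would probably be extremely inefficient (because the WHERE clause is parameterized).
Update 3
Validated the distance using the Pythagorean formula with converging meridians, as suggested by Unreason: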
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) )
(This is unrelated to the question, but in case anyone else was wondering...)
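For comparison, here is a small Python sketch of the flat-Earth approximation with converging meridians as it is usually written, D = R * sqrt(dlat^2 + (cos(mean_lat) * dlon)^2). The function name approx_distance_km and the test coordinates are hypothetical; only the radius constant is taken from the query. In this form the cosine of the mean latitude multiplies the longitude difference before the product is squared, so it may be worth comparing against where the parentheses fall in the SQL expression.

```python
import math

EARTH_RADIUS_KM = 6371.009  # mean Earth radius, same constant as in the query


def approx_distance_km(lat1, lon1, lat2, lon2):
    """Approximate distance in km between two points given in decimal degrees."""
    d_lat = math.radians(lat1 - lat2)
    d_lon = math.radians(lon1 - lon2)
    mean_lat = math.radians((lat1 + lat2) / 2)
    # The longitude difference is scaled by cos(mean latitude), then squared.
    return EARTH_RADIUS_KM * math.sqrt(d_lat ** 2 + (math.cos(mean_lat) * d_lon) ** 2)


# One degree of longitude is about half as long at 60 degrees latitude as at the equator.
at_equator = approx_distance_km(0.0, 0.0, 0.0, 1.0)
at_sixty = approx_distance_km(60.0, 0.0, 60.0, 1.0)
```

This is only an approximation for short distances; it is not the Haversine formula.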
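Update 4
The code as shown runs in JasperReports against a MySQL database. JasperReports does not allow variables or multiple queries.
Update 5
I am looking for a solution that executes cleanly. ;-) I have written a few partially working solutions but, sadly, MySQL does not understand partially correct. See the discussion with Unreason for an answer that nearly works.
Update 6
I might be able to reuse the variables from the first WHERE clause and compare them in the second (thereby removing some of the duplication, namely the checks against the $P{} values), but I would really like to eliminate the duplication altogether.
Update 7
Comparing the YEAR clauses (as hypothesized in the previous update) to eliminate the duplicated BETWEEN does not work.
Related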
How to eliminate duplicate calculation in SQL?
Thank you!
[Comments]:
-
Have you asked the query planner how it intends to execute that query? Is it really duplicating the effort? By the way, is
Y.YEAR BETWEEN 1900 AND 2009 a mistake? -
Also,
SQRT( POW( C.LATITUDE - S.LATITUDE, 2 ) + POW( C.LONGITUDE - S.LONGITUDE, 2 ) ) < $P{Radius} defines an ellipse... if you really want a circle, use SQRT( POW( C.LATITUDE - S.LATITUDE, 2 ) + POW( C.LONGITUDE - S.LONGITUDE, 2 ) * COS( (C.LATITUDE + S.LATITUDE) / 2 ) ) < $P{Radius} -
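@Andrew: If the data is returned without the regression line calculation, the query executes in about 3.5 seconds. With the regression, it takes about 7 seconds. My guess is duplicated effort. ;-) The year condition is needed (the data before 1900 is shaky, and whole years after 2009 do not exist and will not; I do not have the source data, nor do I receive updates with new data).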
-
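OK, that does sound like duplication. Well, you could create a temporary table to cache the results of the where condition, or add a column to an existing table (inside a transaction that you then deliberately abandon).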
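A rough sketch of the temporary-table idea follows, using Python's sqlite3 as a stand-in for MySQL. Everything in it is assumed for illustration: the single daily table, its columns and the sample rows replace the real six-table join and parameters. It needs several statements, which the question says JasperReports does not allow, so it only shows the shape of the approach: filter once, then reuse the filtered rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy stand-in for the joined climate tables (hypothetical schema and data).
cur.execute("CREATE TABLE daily (year INTEGER, amount REAL, flag TEXT)")
cur.executemany(
    "INSERT INTO daily VALUES (?, ?, ?)",
    [(1899, 9.0, ""), (1900, 1.0, ""), (1900, 3.0, ""),
     (1901, 4.0, ""), (1901, 99.0, "M"), (1902, 6.0, "")],
)

# Step 1: run the expensive WHERE clause once and cache the yearly averages.
cur.execute("""
    CREATE TEMP TABLE yearly AS
    SELECT year, AVG(amount) AS amount, COUNT(*) AS measurements
    FROM daily
    WHERE year BETWEEN 1900 AND 2009 AND flag <> 'M'
    GROUP BY year
""")

# Step 2: slope and intercept from the cached rows (same sums as the question).
slope, intercept = cur.execute("""
    SELECT
      (SUM(year) * SUM(amount) - COUNT(*) * SUM(year * amount)) /
      (SUM(year) * SUM(year) - COUNT(*) * SUM(year * year)),
      (SUM(year) * SUM(year * amount) - SUM(amount) * SUM(year * year)) /
      (SUM(year) * SUM(year) - COUNT(*) * SUM(year * year))
    FROM yearly
""").fetchone()

# Step 3: apply y = mx + b to the cached rows without repeating the filter.
rows = cur.execute(
    "SELECT year, amount, year * ? + ? AS regression_line FROM yearly ORDER BY year",
    (slope, intercept),
).fetchall()
```

Each statement reads the cached table only once, which matters because MySQL documents restrictions on referring to a TEMPORARY table more than once in the same query.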
-
@Andrew: This is even more complicated than I originally thought; I have to use the Haversine formula to determine the distance. en.wikipedia.org/wiki/Great-circle_distance
Tags: sql mysql postgresql ireport code-duplication