【Posted】: 2017-06-29 10:13:24
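【Question】:
We are testing Apache Impala and have noticed that combining GROUP BY with LIKE is very slow, whereas a plain SELECT with the same LIKE filters runs much faster. Here are two examples, each preceded by its measured run times: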
# 1.37s 1.08s 1.35s
SELECT * FROM hive.default.pcopy1B where
(lower("by") like '%part%' and lower("by") like '%and%' and lower("by") like '%the%')
or (lower(title) like '%part%' and lower(title) like '%and%' and lower(title) like '%the%')
or (lower(url) like '%part%' and lower(url) like '%and%' and lower(url) like '%the%')
or (lower(text) like '%part%' and lower(text) like '%and%' and lower(text) like '%the%')
limit 100;
# 156.64s 155.63s
select "by", type, ranking, count(*) from pcopy where
(lower("by") like '%part%' and lower("by") like '%and%' and lower("by") like '%the%')
or (lower(title) like '%part%' and lower(title) like '%and%' and lower(title) like '%the%')
or (lower(url) like '%part%' and lower(url) like '%and%' and lower(url) like '%the%')
or (lower(text) like '%part%' and lower(text) like '%and%' and lower(text) like '%the%')
group by "by", type, ranking
order by 4 desc limit 10;
Why does this happen, and is there a workaround?
【Discussion】:
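- These two queries look very different to me. The first only selects records and needs nothing more than a cursor, so it can stop as soon as it has 100 matching rows. The second has to retrieve every matching record and then run the GROUP BY and the sort. If a very large number of records match, that could explain the difference in run time. Or am I missing something?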
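
The comment's point can be illustrated with a small toy model. This is only a sketch in plain Python, not how Impala actually executes queries: the helper names (`matches`, `CountingScan`, `select_limit`, `group_by_top`, `make_rows`) and the synthetic rows are assumptions made up for illustration. It counts how many rows each query shape has to read before it can return a result.

```python
from collections import Counter
from itertools import islice

KEYWORDS = ("part", "and", "the")
TEXT_COLUMNS = ("by", "title", "url", "text")


def matches(row):
    """True if any text column contains all keywords (case-insensitive)."""
    return any(
        all(k in (row.get(col) or "").lower() for k in KEYWORDS)
        for col in TEXT_COLUMNS
    )


class CountingScan:
    """Iterates over rows and records how many were actually read."""

    def __init__(self, rows):
        self.rows = rows
        self.rows_read = 0

    def __iter__(self):
        for row in self.rows:
            self.rows_read += 1
            yield row


def select_limit(rows, limit):
    """Models SELECT * ... WHERE ... LIMIT n: stops once n rows matched."""
    scan = CountingScan(rows)
    result = list(islice((r for r in scan if matches(r)), limit))
    return result, scan.rows_read


def group_by_top(rows, limit):
    """Models GROUP BY + ORDER BY count DESC LIMIT n: needs every row."""
    scan = CountingScan(rows)
    counts = Counter(
        (r["by"], r["type"], r["ranking"]) for r in scan if matches(r)
    )
    return counts.most_common(limit), scan.rows_read


def make_rows(n):
    """Hypothetical synthetic data in which every row matches the filter."""
    return [
        {"by": f"user{i % 7}", "type": "story", "ranking": i % 3,
         "title": "", "url": "", "text": "Part one and the rest"}
        for i in range(n)
    ]


if __name__ == "__main__":
    rows = make_rows(10_000)
    _, read_limit = select_limit(rows, 100)
    _, read_group = group_by_top(rows, 10)
    print("rows read with LIMIT:   ", read_limit)
    print("rows read with GROUP BY:", read_group)
```

In this model the LIMIT query stops after reading 100 of the 10,000 rows, while the grouped query must read all of them before it can rank the groups. On real data the gap depends on how selective the LIKE filters are. One way to check this on the actual table would be to time a plain `count(*)` with the same WHERE clause: if that alone takes about as long as the grouped query, the cost lies in scanning and evaluating the filters rather than in the GROUP BY itself.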
Tags: performance hadoop cloudera impala