Sampling Issue with Hive

【Question title】: Sampling Issue with Hive
【Posted】: 2015-09-30 16:09:21
【Question】:

"all_members" is a Hive table with 10 million rows and one column, "membership_nbr". I want to sample 3000 rows from it. This is what I did:

hive>create table sample_members as select * from all_members limit 1;
hive>insert overwrite table sample_members select membership_nbr from all_members tablesample(3000 rows);
hive>select count(*) from sample_members;

OK 45000

If I replace 3000 rows with 300 rows, the result does not change. Am I doing something wrong?

【Comments】:

Tags: hadoop hive sample sampling

【Answer 1】:

Table sampling with tablesample(3000 rows) does not take 3000 rows from the table as a whole; it takes 3000 rows from each input split.

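So your query probably ran with 15 mappers, and each mapper fetched 3000 rows, for a total of 3000 * 15 = 45000 rows. Likewise, if you change 3000 rows to 300 rows, you would get 4500 rows as output after sampling.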

So, to meet your requirement, you need to specify tablesample(200 rows). Each mapper will then fetch 200 rows, and the 15 mappers together will return 3000 sampled rows.
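As a rough sketch of the suggestion above: the first statement applies the per-split arithmetic, and it only yields 3000 rows if the job really does use 15 input splits (a number inferred from 45000 / 3000, which can change with file sizes and split settings). The second statement is a different technique that the answer does not mention: it orders by rand() and applies limit, so the row count does not depend on the number of splits, at the cost of a full sort of the table.

```sql
-- Option 1: per-split sampling, as suggested in the answer.
-- Assumption: the query runs with 15 input splits, so 15 * 200 = 3000 rows.
insert overwrite table sample_members
select membership_nbr
from all_members tablesample(200 rows);

-- Option 2 (alternative technique, not from the answer):
-- shuffle all rows randomly and keep exactly 3000 of them.
-- This sorts the whole table, so it is slower on large inputs.
insert overwrite table sample_members
select membership_nbr
from all_members
order by rand()
limit 3000;

-- Check how many rows ended up in the sample table.
select count(*) from sample_members;
```

Option 1 stays cheap because each mapper stops after reading a few rows, but the total changes whenever the number of splits changes. Option 2 is the safer choice when the sample must contain exactly 3000 rows.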

See the following link for the various types of sampling: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling

【Comments】: