[Posted]: 2017-05-23 12:59:49
[Question]:
I have a DataFrame and would like to split it into groups with an equal number of rows.
In other words, I want a list of DataFrames, each of which is a disjoint subset of the original DataFrame.
Suppose the input DataFrame is as follows:
+------------------+-----------+-----+--------------------+
| eventName|original_dt|count| features|
+------------------+-----------+-----+--------------------+
|15.509775004326936| 0| 100|[15.5097750043269...|
|15.509775004326936| 0| 101|[15.5097750043269...|
|15.509775004326936| 0| 102|[15.5097750043269...|
|15.509775004326936| 0| 103|[15.5097750043269...|
|15.509775004326936| 0| 104|[15.5097750043269...|
|15.509775004326936| 0| 105|[15.5097750043269...|
|15.509775004326936| 0| 106|[15.5097750043269...|
|15.509775004326936| 0| 107|[15.5097750043269...|
|15.509775004326936| 0| 108|[15.5097750043269...|
|15.509775004326936| 0| 109|[15.5097750043269...|
|15.509775004326936| 0| 110|[15.5097750043269...|
|15.509775004326936| 0| 111|[15.5097750043269...|
|15.509775004326936| 0| 112|[15.5097750043269...|
|15.509775004326936| 0| 113|[15.5097750043269...|
|15.509775004326936| 0| 114|[15.5097750043269...|
|15.509775004326936| 0| 115|[15.5097750043269...|
| 43.01955000865387| 0| 116|[43.0195500086538...|
+------------------+-----------+-----+--------------------+
I would like to split it into K DataFrames of equal size. If k = 4, one possible result would be:
+------------------+-----------+-----+--------------------+
| eventName|original_dt|count| features|
+------------------+-----------+-----+--------------------+
|15.509775004326936| 0| 106|[15.5097750043269...|
|15.509775004326936| 0| 107|[15.5097750043269...|
|15.509775004326936| 0| 110|[15.5097750043269...|
|15.509775004326936| 0| 111|[15.5097750043269...|
+------------------+-----------+-----+--------------------+
+------------------+-----------+-----+--------------------+
| eventName|original_dt|count| features|
+------------------+-----------+-----+--------------------+
|15.509775004326936| 0| 104|[15.5097750043269...|
|15.509775004326936| 0| 108|[15.5097750043269...|
|15.509775004326936| 0| 112|[15.5097750043269...|
|15.509775004326936| 0| 114|[15.5097750043269...|
+------------------+-----------+-----+--------------------+
+------------------+-----------+-----+--------------------+
| eventName|original_dt|count| features|
+------------------+-----------+-----+--------------------+
|15.509775004326936| 0| 100|[15.5097750043269...|
|15.509775004326936| 0| 105|[15.5097750043269...|
|15.509775004326936| 0| 109|[15.5097750043269...|
|15.509775004326936| 0| 115|[15.5097750043269...|
+------------------+-----------+-----+--------------------+
+------------------+-----------+-----+--------------------+
| eventName|original_dt|count| features|
+------------------+-----------+-----+--------------------+
|15.509775004326936| 0| 101|[15.5097750043269...|
|15.509775004326936| 0| 102|[15.5097750043269...|
|15.509775004326936| 0| 103|[15.5097750043269...|
|15.509775004326936| 0| 113|[15.5097750043269...|
| 43.01955000865387| 0| 116|[43.0195500086538...|
+------------------+-----------+-----+--------------------+
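One way to sketch this (not part of the original post, and assuming Spark's Scala DataFrame API): assign each row a bucket via `row_number` over some total ordering, then filter once per bucket. The ordering column `"count"` is taken from the example data; any deterministic ordering works. `randomSplit` with equal weights is simpler but, as noted in the comments below, only approximately balanced:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Sketch: split `df` into k disjoint DataFrames of (nearly) equal size.
// "count" as the ordering key is an assumption based on the sample data.
def splitEvenly(df: DataFrame, k: Int): Seq[DataFrame] = {
  // Global ordering: row_number over an un-partitioned window
  // (this funnels data through a single partition; fine for moderate sizes).
  val w = Window.orderBy(col("count"))
  val indexed = df.withColumn("_bucket", (row_number().over(w) - 1) % k)
  (0 until k).map(i => indexed.filter(col("_bucket") === i).drop("_bucket"))
}

// Simpler alternative, but sizes are only approximately equal:
// val parts = df.randomSplit(Array.fill(4)(1.0))
```

With k = 4 on the 17-row example above, this yields three DataFrames of 4 rows and one of 5, assigned round-robin by the ordering.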
[Comments]:
-
Similar to RDDs, there is no split operation in a single transformation. The simplest way would be to combine COUNT with LIMIT and OFFSET... however, since Spark has no OFFSET either, you can work around it with sliding; see stackoverflow.com/questions/31685714/… -
Input and output examples will get you an answer quickly, so update your question accordingly.
-
Is the .randomSplit() method what you are asking for here: stackoverflow.com/questions/43567164/… ?
-
@SteffenSchmitz randomSplit gives an unbalanced distribution.
-
@RameshMaharjan I have added the input and output as requested.
Tags: scala apache-spark dataframe