【Posted】: 2018-03-29 15:34:27
【Question】:
I have this Spark DataFrame:
+---+-----+------+----+------------+------------+
| ID|  ID2|Number|Name|Opening_Hour|Closing_Hour|
+---+-----+------+----+------------+------------+
|ALT|  QWA|     6|null|    08:59:00|    23:30:00|
|ALT|AUTRE|     2|null|    08:58:00|    23:29:00|
|TDR|  QWA|     3|null|    08:57:00|    23:28:00|
|ALT| TEST|     4|null|    08:56:00|    23:27:00|
|ALT|  QWA|     6|null|    08:55:00|    23:26:00|
|ALT|  QWA|     2|null|    08:54:00|    23:25:00|
|ALT|  QWA|     2|null|    08:53:00|    23:24:00|
+---+-----+------+----+------------+------------+
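I want to get a new DataFrame that contains only the rows whose combination of the three fields "ID", "ID2" and "Number" is not unique, with one row per duplicated combination.
That is, I want this DataFrame: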
+---+-----+------+----+------------+------------+
| ID|  ID2|Number|Name|Opening_Hour|Closing_Hour|
+---+-----+------+----+------------+------------+
|ALT|  QWA|     6|null|    08:59:00|    23:30:00|
|ALT|  QWA|     2|null|    08:53:00|    23:24:00|
+---+-----+------+----+------------+------------+
Or, alternatively, a DataFrame containing all of the duplicated rows:
+---+-----+------+----+------------+------------+
| ID|  ID2|Number|Name|Opening_Hour|Closing_Hour|
+---+-----+------+----+------------+------------+
|ALT|  QWA|     6|null|    08:59:00|    23:30:00|
|ALT|  QWA|     6|null|    08:55:00|    23:26:00|
|ALT|  QWA|     2|null|    08:54:00|    23:25:00|
|ALT|  QWA|     2|null|    08:53:00|    23:24:00|
+---+-----+------+----+------------+------------+
【Discussion】:
Tags: apache-spark pyspark spark-dataframe