【Posted】: 2020-01-08 04:06:48
【Question】:
I have a DataFrame with a single string column. It looks like this:
`+--------------------------------------------------------------------------------------------------------------------------------------+
|queue_sequence |
+--------------------------------------------------------------------------------------------------------------------------------------+
|In Queue,In-Progress,Internally,Development Done/ Eng testing,In-Progress,Development Done/ Eng testing,Complete |
|In Queue,In-Progress,Complete,In-Progress,Complete |
|In Queue,Development,Development Ready,In Queue,Development,In Queue,Complete |
|In Queue,Analyze,In-Progress,ISRM,Externally,ISRM,Complete |
|In Queue,Complete,In-Progress,Complete |
|In Queue,DSM/UCL,Complete |
|In Queue,In-Progress,Development Done/ Eng testing,Complete,In Queue,In-Progress,Development Done/ Eng testing,Complete |
|In Queue,In-Progress,Externally,Development Done/ Eng testing,Complete |
|In Queue,In-Progress,Development Done/ Eng testing,DSM/UCL,In-Progress,ISRM,In-Progress,Development Done/ Eng testing,Complete |
|In Queue,Development,Development Ready,In Queue,Development,Development Done/ Eng testing,Development,Complete |
|In Queue,In-Progress,In Queue,In-Progress,ISRM,Complete |
|In Queue,Development Ready,In-Progress,Done,Complete |`
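I want to get the unique values among the comma-separated entries in each row.
I tried the following code
`df.select("queue_sequence").collect().map(_.mkString)`
and stored the result in a variable of type Array[String], which looks like this: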
Array[String] = Array(In Queue,
In-Progress,
Internally,
Development Done/ Eng testing,
In-Progress,
Development Done/ Eng testing,
Complete,
In Queue,
In-Progress,
Complete,
In-Progress,
Complete,
In Queue,
Analyze,
In-Progress,
ISRM,
Externally,
ISRM,
Complete,
In Queue,
Development,
Development Ready,
In Queue,
Development,
In Queue,Complete
)
But this list is not unique. How can I reduce it to distinct values?
I tried the following:
.toSet.toList
.toList.Distinct
I cannot get the distinct words out of this array. I tried both of the approaches above and neither worked.
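For reference, here is a minimal sketch of one way this could be approached. It is not a confirmed fix. It assumes that each element of the collected array is still one whole comma-joined row, which would explain why `.toSet.toList` only removes duplicate rows rather than duplicate words. Separately, `.Distinct` with a capital D does not exist on Scala collections; the method is `distinct`. The `rows` sample and the names `DistinctStatuses`, `distinctPerRow` and `distinctOverall` are hypothetical and used only for illustration.

```scala
object DistinctStatuses {
  // Hypothetical stand-in for the result of
  // df.select("queue_sequence").collect().map(_.mkString):
  // each element is assumed to be one whole row, still comma-joined.
  val rows: Array[String] = Array(
    "In Queue,In-Progress,Complete,In-Progress,Complete",
    "In Queue,Complete,In-Progress,Complete",
    "In Queue,DSM/UCL,Complete"
  )

  // Distinct values within each row, keeping first-seen order.
  def distinctPerRow(rows: Array[String]): Array[Seq[String]] =
    rows.map(_.split(",").map(_.trim).distinct.toSeq)

  // Distinct values across all rows, keeping first-seen order.
  def distinctOverall(rows: Array[String]): Seq[String] =
    rows.flatMap(_.split(",")).map(_.trim).distinct.toSeq

  def main(args: Array[String]): Unit = {
    distinctPerRow(rows).foreach(r => println(r.mkString(",")))
    println(distinctOverall(rows).mkString(","))
  }
}
```

The key step is splitting each string on `,` before calling `distinct`, so that deduplication happens on the individual values instead of on entire rows. For a large DataFrame it may be preferable to do the split inside Spark rather than after `collect()`, but that is outside what this sketch shows.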
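【Comments】:
- How is Spark related to this question?
- Yes, it is related, because the list is collected from a Spark DataFrame column. See my comments.
- Edited the question to clarify. This is clearly about Spark.
Tags: arrays string scala apache-spark distinct-values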