【Title】: How to merge the array of lists into a single column and fit it to an already existing dataframe?
【Posted on】: 2017-11-20 23:41:35
【Question】:

I am new to Spark and Scala. Please help me with this.

I have the output below, and I need to generate a new dataframe in which all the features are merged into a single column rather than kept as separate lists. I also need to append this dataframe to another dataframe. How can I do this in Scala?

val tab = inter.map(_.groupBy().sum())
tab.map(_.show())

tab: Array[org.apache.spark.sql.DataFrame] = Array([sum(vec_0): double, sum(vec_1): double ... 2 more fields], [sum(vec_0): double, sum(vec_1): double ... 2 more fields])
+------------------+------------------+------------------+------------------+
|        sum(vec_0)|        sum(vec_1)|        sum(vec_2)|        sum(vec_3)|
+------------------+------------------+------------------+------------------+
|2.5046410000000003|2.1487149999999997|1.0884870000000002|3.5877090000000003|
+------------------+------------------+------------------+------------------+
+------------------+------------------+----------+------------------+
|        sum(vec_0)|        sum(vec_1)|sum(vec_2)|        sum(vec_3)|
+------------------+------------------+----------+------------------+
|0.9558040000000001|0.9843780000000002|  0.545025|0.9979860000000002|
+------------------+------------------+----------+------------------+
res325: Array[Unit] = Array((), ())

    val temp = tab.map(_.alias("t").select(array("t.*") as "List"))
    temp.map(_.toDF().show(false))

    temp: Array[org.apache.spark.sql.DataFrame] = Array([List: array<double>], [List: array<double>])
    +--------------------------------------------------------------------------------+
    |List                                                                            |
    +--------------------------------------------------------------------------------+
    |[2.5046410000000003, 2.1487149999999997, 1.0884870000000002, 3.5877090000000003]|
    +--------------------------------------------------------------------------------+
    +----------------------------------------------------------------------+
    |List                                                                  |
    +----------------------------------------------------------------------+
    |[0.9558040000000001, 0.9843780000000002, 0.545025, 0.9979860000000002]|
    +----------------------------------------------------------------------+
    res443: Array[Unit] = Array((), ())
val newtable = temp.map(_.toDF("features"))
newtable.map(_.show(false))

     newtable: Array[org.apache.spark.sql.DataFrame] = Array([features: array<double>], [features: array<double>])
    +--------------------------------------------------------------------------------+
    |features                                                                        |
    +--------------------------------------------------------------------------------+
    |[2.5046410000000003, 2.1487149999999997, 1.0884870000000002, 3.5877090000000003]|
    +--------------------------------------------------------------------------------+
    +----------------------------------------------------------------------+
    |features                                                              |
    +----------------------------------------------------------------------+
    |[0.9558040000000001, 0.9843780000000002, 0.545025, 0.9979860000000002]|
    +----------------------------------------------------------------------+
    res328: Array[Unit] = Array((), ())

Expected output:

+--------------------------------------------------------------------------------+
|features                                                                        |
+--------------------------------------------------------------------------------+
|[2.5046410000000003, 2.1487149999999997, 1.0884870000000002, 3.5877090000000003]|
|[0.9558040000000001, 0.9843780000000002, 0.545025, 0.9979860000000002]          |
+--------------------------------------------------------------------------------+

【Comments】:

  • Try flatMap instead of map; it should flatten the array, e.g. val newtable = temp.flatMap(_.toDF("features"))
  • If I try flatMap, I get the following error: found: org.apache.spark.sql.DataFrame (which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]; required: scala.collection.GenTraversableOnce[?] val newtable = temp.flatMap(_.toDF)
  • What does your input (the temp variable) look like? Can you add it to the question?
  • Yes, I agree with Shaido. A sample of temp should help you get to a solution quickly.
  • temp looks just like newtable... I created a list by merging the columns and converting them into a list. Now I am trying to put all the lists under one column so that it can be appended to a dataframe.
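
The type error from the comments is worth unpacking: flatMap requires the mapped function to return a collection (GenTraversableOnce), which a DataFrame is not, while map accepts any result type. A minimal plain-Scala sketch of the distinction (the Int values are just stand-ins, not Spark objects):

```scala
val xs = Array(1, 2, 3)

// map accepts any result type: one output element per input element.
val mapped: Array[Int] = xs.map(x => x * 10)

// flatMap requires each result to itself be traversable, then concatenates.
// `xs.flatMap(x => x * 10)` would not compile, for the same reason that
// `temp.flatMap(_.toDF)` fails: Int (like DataFrame) is not a collection.
val flat: Array[Int] = xs.flatMap(x => List(x, x))
```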

Tags: scala apache-spark


【Solution 1】:

This solves the problem:

val fList = newtable.reduce(_.union(_))
fList.show(false)



 fList: org.apache.spark.sql.DataFrame = [features: array<double>]
+--------------------------------------------------------------------------------+
|features                                                                        |
+--------------------------------------------------------------------------------+
|[2.5046410000000003, 2.1487149999999997, 1.0884870000000002, 3.5877090000000003]|
|[0.9558040000000001, 0.9843780000000002, 0.545025, 0.9979860000000002]          |
+--------------------------------------------------------------------------------+
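
For context on why this works: Array[DataFrame].reduce(_.union(_)) folds the array pairwise, unioning two DataFrames at a time until a single DataFrame with all the rows remains. The same fold is easy to see on plain Scala collections, where each single-row table is sketched as a list of rows (values taken from the output above):

```scala
// Each element stands in for one of the single-row DataFrames: a "table"
// is a list of rows, and each row is its list of feature values.
val tables: Array[List[List[Double]]] = Array(
  List(List(2.5046410000000003, 2.1487149999999997, 1.0884870000000002, 3.5877090000000003)),
  List(List(0.9558040000000001, 0.9843780000000002, 0.545025, 0.9979860000000002))
)

// The collection analogue of newtable.reduce(_.union(_)):
// concatenate two tables at a time until one table remains.
val fList: List[List[Double]] = tables.reduce(_ ++ _)

fList.foreach(println)
```

Note that with the real DataFrames this fold assumes every element of the array has the same schema; Spark's union resolves columns by position, not by name, so mismatched column orders would silently misalign the data.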

【Discussion】:
