【Question title】: Convert RDD Array[Any] = Array(List([String], ListBuffer([String]))) to RDD[(String, Seq[String])]
【Posted】: 2018-03-06 06:28:12
【Question】:

I have an RDD of type Any, for example:

Array(List(Mathematical Sciences, ListBuffer(applications, asymptotic, largest, enable, stochastic)))

I want to convert it to an RDD of type RDD[(String, Seq[String])].

I tried:

val rdd = sc.makeRDD(strList)
case class X(titleId: String, terms: List[String])

val df = rdd.map { case Array(s0, s1) => X(s0, s1) }.toDF()

I tried for a long time without success.

【Question discussion】:

  • I want to convert an rdd of type Array(List([String], ListBuffer([String]))) to RDD[(String, Seq[String])]. Example: Array(List(Mathematical Sciences, ListBuffer(applications, asymptotic, largest, enable, stochastic))). I want to convert it to the rdd Array(Mathematical Sciences, ListBuffer(applications, asymptotic, largest, enable, stochastic)).
  • There is a gray "edit" button at the lower left.

Tags: scala apache-spark rdd


【Solution 1】:

You can use:

val result: RDD[(String, Seq[String])] = 
  rdd.map { case List(s0: String, s1: ListBuffer[String]) =>  (s0, s1) }

But note that any record in the input RDD[Any] that does not match these types (which cannot be checked at compile time) will throw a scala.MatchError.
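One way to avoid the MatchError is to pass a partial function to collect instead of map, so that records with a different shape are skipped rather than raising an exception. The following is only a minimal sketch: the names toPair and records are hypothetical, a plain Scala collection stands in for the RDD so the snippet runs without Spark, and converting the terms with toString is an assumption about what is wanted.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical helper: defined only for records shaped like List(String, ListBuffer(...)).
// ListBuffer[_] is matched and the elements are converted with toString.
val toPair: PartialFunction[Any, (String, Seq[String])] = {
  case List(s0: String, s1: ListBuffer[_]) => (s0, s1.map(_.toString).toList)
}

// A plain Scala collection stands in for the RDD[Any] here.
val records: Seq[Any] = Seq(
  List("Mathematical Sciences", ListBuffer("applications", "asymptotic", "largest")),
  "not a list" // with map { case ... } this record would throw scala.MatchError
)

// collect keeps only the records for which the partial function is defined.
val pairs: Seq[(String, Seq[String])] = records.collect(toPair)

// With Spark, the same partial function can be passed to RDD.collect:
//   import org.apache.spark.rdd.RDD
//   val result: RDD[(String, Seq[String])] = rdd.collect(toPair)
```

Note that in the Spark shell the types RDD and ListBuffer have to be imported (org.apache.spark.rdd.RDD and scala.collection.mutable.ListBuffer) before they can be named in a type annotation or pattern.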

【Discussion】:

  • I ran your code and got errors: scala> val result: RDD[(String, Seq[String])] = | rdd.map { case List(s0: String, s1: ListBuffer[String]) => (s0, s1) } <console>:32: error: not found: type RDD val result: RDD[(String, Seq[String])] = ^ <console>:33: error: not found: type ListBuffer rdd.map { case List(s0: String, s1: ListBuffer[String]) => (s0, s1) }
  • I have a list: List(List(a9000038, ListBuffer(applications, asymptotic, maximum, enable, stochastic, stochastic)), List(a9000031, ListBuffer(loci, loci, formal, meet, size)), List(a9000006, ListBuffer(exploitation, exploitation, exploitation, characteristics)))
  • I converted it to an rdd: Array[Any] = Array(List(a9000038, ListBuffer(applications, asymptotic, maximum, enable, stochastic, stochastic)), List(a9000031, ListBuffer(loci, loci, formal, meet, size)), List(a9000006, ListBuffer(exploitation, exploitation, exploitation, characteristics)))
  • I want: Array((a9000038, ListBuffer(applications, asymptotic, maximum, enable, stochastic, stochastic)), (a9000031, ListBuffer(loci, loci, formal, meet, size)), (a9000006, ListBuffer(exploitation, exploitation, exploitation, characteristics))) of type RDD[(String, Seq[String])]
【Solution 2】:

As described in the question, if you have

import scala.collection.mutable.ListBuffer
val strList = Array(List("Mathematical Sciences", ListBuffer("applications", "asymptotic", "largest", "enable", "stochastic")))
val rdd = sc.makeRDD(strList)

which has the following data type

rdd: org.apache.spark.rdd.RDD[List[java.io.Serializable]]

you can convert it to the data type you need

res0: org.apache.spark.rdd.RDD[(String, Seq[String])]

simply by using map to convert the types:

rdd.map(x => (x(0).toString, x(1).asInstanceOf[ListBuffer[String]].toSeq))

I hope this answer helps.
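The map above relies on the element type being a List, so that x(0) and x(1) compile. If the RDD is typed as RDD[Any], as in the comments below, indexing fails with "Any does not take parameters". A rough sketch of one workaround, assuming every record really is a two-element sequence: cast each record to a Seq before indexing. The name toPairUnsafe is hypothetical, and a single plain value stands in for an RDD element so the snippet runs without Spark.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical helper for records typed as Any: cast to a Seq before indexing.
// asInstanceOf throws ClassCastException if a record (or its second field) is not a Seq.
def toPairUnsafe(x: Any): (String, Seq[String]) = {
  val fields = x.asInstanceOf[scala.collection.Seq[Any]]
  val terms = fields(1).asInstanceOf[scala.collection.Seq[Any]]
  (fields(0).toString, terms.map(_.toString).toList)
}

// One record from the comments, typed as Any like an element of RDD[Any].
val record: Any = List("a9000031", ListBuffer("loci", "loci", "formal", "meet", "size"))
val pair: (String, Seq[String]) = toPairUnsafe(record)

// With Spark, on an RDD[Any] named rdd:
//   rdd.map(x => toPairUnsafe(x))
```

Because the casts are unchecked at compile time, a malformed record fails at runtime inside the Spark job, so this is only reasonable when the input shape is known.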

【Discussion】:

  • I ran your code and got this error in the Spark shell: scala> rdd.map(x => (x(0).toString, x(1).asInstanceOf[ListBuffer[String]].toSeq)) <console>:33: error: Any does not take parameters rdd.map(x => (x(0).toString, x(1).asInstanceOf[ListBuffer[String]].toSeq)) ^ <console>:33: error: Any does not take parameters rdd.map(x => (x(0).toString, x(1).asInstanceOf[ListBuffer[String]].toSeq))
  • My list is List(List(a9000038, ListBuffer(applications, asymptotic, maximum, enable, stochastic, stochastic)), List(a9000031, ListBuffer(loci, loci, formal, meet, size)), List(a9000006, ListBuffer(exploitation, exploitation, exploitation, characteristics)))
  • After converting it to an rdd I have: Array[Any] = Array(List(a9000038, ListBuffer(applications, asymptotic, maximum, enable, stochastic, stochastic)), List(a9000031, ListBuffer(loci, loci, formal, meet, size)), List(a9000006, ListBuffer(exploitation, exploitation, exploitation, characteristics)))
  • I want: Array((a9000038, ListBuffer(applications, asymptotic, maximum, enable, stochastic, stochastic)), (a9000031, ListBuffer(loci, loci, formal, meet, size)), (a9000006, ListBuffer(exploitation, exploitation, exploitation, characteristics))) of type RDD[(String, Seq[String])]
  • Please, do you have any idea?
【Solution 3】:

It finally worked. I get a warning, but it works.

val rdd = sc.makeRDD(strList)

val result = rdd.map { case List(s0: String, s1: Seq[String]) => (s0, s1) }

<console>:32: warning: non-variable type argument String in type pattern Seq[String] (the underlying of Seq[String]) is unchecked since it is eliminated by erasure
       val result = rdd.map { case List(s0: String, s1: Seq[String]) => (s0, s1) }
       ^
result: org.apache.spark.rdd.RDD[(String, Seq[String])] = MapPartitionsRDD[1051] at map at <console>:32

Thanks.
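The warning appears because type arguments are erased on the JVM: the pattern s1: Seq[String] can only verify that the value is a Seq, not that its elements are Strings. A possible way to silence the warning and still check the elements is to match on Seq[_] and test the elements in a guard. This is a sketch only; toCheckedPair, good and bad are hypothetical names, and plain values stand in for RDD elements.

```scala
import scala.collection.mutable.ListBuffer

// Matching on Seq[_] avoids the unchecked warning; the guard then verifies
// at runtime that every element really is a String.
// scala.collection.Seq is spelled out so that a mutable ListBuffer also matches
// on Scala 2.13, where the default Seq alias is the immutable one.
val toCheckedPair: PartialFunction[Any, (String, Seq[String])] = {
  case List(s0: String, s1: scala.collection.Seq[_]) if s1.forall(_.isInstanceOf[String]) =>
    (s0, s1.collect { case s: String => s }.toList)
}

val good: Any = List("a9000006", ListBuffer("exploitation", "characteristics"))
val bad: Any = List("a9000006", ListBuffer("exploitation", 42)) // contains a non-String term

val isGoodDefined = toCheckedPair.isDefinedAt(good)
val isBadDefined = toCheckedPair.isDefinedAt(bad)

// With Spark, on an RDD[Any] named rdd:
//   rdd.collect(toCheckedPair)   // skips records that fail the check
//   rdd.map(toCheckedPair)       // throws scala.MatchError on such records
```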

【Discussion】:
