spark scala reducekey dataframe operation: answers

【Question title】: spark scala reducekey dataframe operation
【Posted】: 2018-04-17 08:13:05
【Question】:

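I am trying to do a count in Scala using a DataFrame. My data has 3 columns; I have already loaded it and split each line on tabs. So I want to do something like this: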

val file1 = file.map(line=>line.split("\t"))
val x = file1.map(line=>(line(0), line(2).toInt)).reduceByKey(_+_,1)

I want to put the data into a DataFrame, but I am having some trouble with the syntax:

val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
        .count()

Can someone help check whether this is correct?

【Question comments】:

Tags: scala apache-spark dataframe spark-dataframe word-count

【Solution 1】:

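Spark needs to know the schema of the DataFrame before toDF can build it; an RDD of Array[String] does not carry column names or types.
There are many ways to specify the schema; here is one option: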

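import spark.implicits._ // enables toDF; assumes a SparkSession named spark (already imported in spark-shell)

val df = file
   .map(line => line.split("\t"))
   .map(l => (l(0), l(1).toInt)) // at this point Spark knows the number of columns and their types
   .toDF("a", "b") // give the columns names for ease of use

df
 .groupBy("a") // groupBy, not groupby: method names are case-sensitive
 .count() // number of rows per distinct value of column a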
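
One thing worth noting: `reduceByKey(_+_)` in the question adds up the values for each key, whereas `groupBy(...).count()` returns the number of rows for each key. If the sum is what is actually wanted, a minimal sketch could look like the following. The object name `SumByKeySketch`, the helper `toKeyValueDf`, the sample lines, the output column name `total`, and the local session are all assumptions made for illustration; they are not from the original post. The value is taken from column index 2, as in the question's RDD code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum

object SumByKeySketch {

  // Hypothetical helper: turns tab-separated lines into a two-column DataFrame.
  def toKeyValueDf(spark: SparkSession, lines: Seq[String]): DataFrame = {
    import spark.implicits._ // enables toDF on an RDD of tuples
    spark.sparkContext
      .parallelize(lines)
      .map(_.split("\t"))
      .map(l => (l(0), l(2).toInt)) // key from column 0, value from column 2
      .toDF("a", "b")
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; spark-shell already provides one named `spark`.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sum-by-key-sketch")
      .getOrCreate()

    // Made-up sample input standing in for the real file.
    val df = toKeyValueDf(spark, Seq("x\tfoo\t1", "y\tbar\t2", "x\tbaz\t3"))

    // Rows per key: what groupBy(...).count() computes.
    df.groupBy("a").count().show()

    // Sum of values per key: the DataFrame counterpart of reduceByKey(_ + _).
    df.groupBy("a").agg(sum("b").as("total")).show()

    spark.stop()
  }
}
```

With this shape, switching between counting and summing is a one-line change in the aggregation, and the parsing step stays the same.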

【Comments】: