[Posted]: 2017-08-11 03:35:18
[Problem Description]:
I have the following dataframe:
+----------+------------+------------+------------+---+
| day |inout_amount|prev_balance|post_balance|row|
+----------+------------+------------+------------+---+
|2016-10-29| -17000| 17000| 0| 1|
|2016-10-30| -17000| 17000| 0| 2|
|2016-10-30| 5600| 0| 5600| 3|
|2016-10-30| 5600| 5600| 11200| 4|
|2016-10-30| 5800| 11200| 17000| 5|
+----------+------------+------------+------------+---+
The first row, for "2016-10-29", is correct, but the four rows below it (for "2016-10-30") are shuffled. Here is the code that produced the table above:
case class Transaction(
  day: String,
  inout_amount: Int,
  prev_balance: Int,
  post_balance: Int
)

val snippet = Seq(
  Transaction("2016-10-29", -17000, 17000, 0),
  Transaction("2016-10-30", -17000, 17000, 0),
  Transaction("2016-10-30", 5600, 0, 5600),
  Transaction("2016-10-30", 5600, 5600, 11200),
  Transaction("2016-10-30", 5800, 11200, 17000)
)

// below could be sparkContext if you're working in Zeppelin
val df = sqlContext.createDataFrame(snippet)

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val window = Window.orderBy("day")
df.withColumn("row", row_number().over(window)).show
I now need to rank the rows for "2016-10-30" using the logic that a row's "prev_balance" must equal the "post_balance" of the preceding transaction. That is, the desired dataframe should look like this:
+----------+------------+------------+------------+---+-------+
| day|inout_amount|prev_balance|post_balance|row|order-1|
+----------+------------+------------+------------+---+-------+
|2016-10-29| -17000| 17000| 0| 1| 0|
|2016-10-30| -17000| 17000| 0| 2| 4|
|2016-10-30| 5600| 0| 5600| 3| 1|
|2016-10-30| 5600| 5600| 11200| 4| 2|
|2016-10-30| 5800| 11200| 17000| 5| 3|
+----------+------------+------------+------------+---+-------+
I'm new to Spark, and my guess is that I need to create a "udf" and then apply it with "withColumn"... please help!
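For what it's worth, the core of the ordering I'm after is a chain-following step: starting from the balance the day opens with, repeatedly pick the transaction whose "prev_balance" equals the running balance. A minimal pure-Scala sketch of just that logic (the function name `orderByChain` and the tuple layout are my own invention, not anything from Spark):

```scala
// Sketch of the chain-ordering logic. Each transaction is represented as
// (inout_amount, prev_balance, post_balance). Starting from `start` (the
// balance the day opens with), repeatedly pick the transaction whose
// prev_balance equals the running balance, then advance the balance to
// that transaction's post_balance.
def orderByChain(start: Int, txs: Seq[(Int, Int, Int)]): Seq[(Int, Int, Int)] = {
  val buf = scala.collection.mutable.ListBuffer.empty[(Int, Int, Int)]
  var remaining = txs
  var balance = start
  while (remaining.nonEmpty) {
    // split off the transaction(s) whose prev_balance matches the running balance;
    // this assumes at least one transaction always matches (the chain is unbroken)
    val (next, rest) = remaining.partition(_._2 == balance)
    buf += next.head
    balance = next.head._3
    remaining = rest ++ next.tail
  }
  buf.toList
}

// The 2016-10-30 rows, starting from the previous day's closing balance of 0:
val day2 = Seq((-17000, 17000, 0), (5600, 0, 5600), (5600, 5600, 11200), (5800, 11200, 17000))
orderByChain(0, day2)
// the result puts (5600, 0, 5600) first and (-17000, 17000, 0) last,
// matching the desired order-1 values 1..4 above
```

Presumably this could be lifted into a DataFrame by grouping per day with `collect_list(struct(...))`, applying a UDF like this to each collected list, and exploding the result back out, but I don't know the idiomatic way to do that.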
[Question Discussion]:
Tags: scala apache-spark spark-dataframe