How to fetch the first-column value of the last row of a Spark Scala DataFrame inside a for and if loop: answers

【Question title】: How to fetch the first-column value of the last row of a Spark Scala DataFrame inside a for and if loop
【Posted】: 2018-10-12 14:02:28
【Question description】:

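s_n181n is a DataFrame. Here I go through the 3rd and 5th columns of the DataFrame row by row, and wherever the value of the nd column is <= 1.0, the code breaks out of the loop.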

The output columns are:

ts (timestamp) | nd (nearest distance)

But what I need is the timestamp value of the last row, i.e. 1529157727000.

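I want to break out of the loop and show the loop's last value at that point. How can I store the last row's timestamp value in a variable so that I can use it outside this loop?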
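One way to read this requirement is "stop at the first row whose nd is at or below the threshold and keep that row's ts". The sketch below illustrates that reading with plain Scala collections, so it does not need a SparkSession. The sample rows, the name `firstBreakTs`, and the choice to return the breaking row's own timestamp (rather than the previous row's) are assumptions for illustration, not part of the original question.

```scala
// Hypothetical (ts, nd) pairs standing in for the rows collected from the DataFrame.
val sampleRows: List[(Long, Double)] = List(
  (101L, 1.6),
  (102L, 1.2),
  (103L, 0.8),  // first row with nd <= 1.0: an explicit loop would break here
  (104L, 1.5)
)

// Returns the ts of the first row whose nd is <= threshold, or None if no row qualifies.
// `find` stops at the first match, so it plays the role of the `break`.
def firstBreakTs(rows: List[(Long, Double)], threshold: Double): Option[Long] =
  rows.find { case (_, nd) => nd <= threshold }.map { case (ts, _) => ts }

// The result is an ordinary value, so it stays available after the traversal.
val breakTs: Option[Long] = firstBreakTs(sampleRows, 1.0)  // Some(103)
```

Returning an `Option` rather than mutating a `var` inside the loop makes the "no row matched" case explicit. A later loop can then start from `breakTs` only when it is defined.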

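【Question comments】:

From the given example, it looks like the DataFrame is sorted and you are trying to get the maximum value of the ts column. Is that right?
You can sort in descending order and pick the timestamp, e.g. with df.first, right? Or see here: stackoverflow.com/questions/39544796/…
@jegan Yes, I want to fetch the maximum ts value for that particular data when the condition is met. If I use max(ts), it will fetch the maximum over the entire dataset's ts instead. I don't need that; I only need the maximum timestamp at the point where the loop breaks.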

Tags: scala apache-spark for-loop apache-spark-sql

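【Solution 1】:

Based on your question description and comments, here is my understanding of your requirement:

Loop through the collect-ed RDD row by row, and whenever nd in the current row is less than or equal to ndLimit, extract ts from the previous row and reset ndLimit to the value of nd from that same (previous) row.

If that is correct, I would suggest using foldLeft to assemble the list of timestamps, as shown below: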

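import org.apache.spark.sql.Row
import spark.implicits._  // needed for toDF outside spark-shell; assumes a SparkSession named `spark`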

val s_n181n = Seq(
  (1, "a1", 101L, "b1", 1.0),  // nd 1.0 is the initial limit
  (2, "a2", 102L, "b2", 1.6),
  (3, "a3", 103L, "b3", 1.2),
  (4, "a4", 104L, "b4", 0.8),  // 0.8 <= 1.0, hence ts 103 is saved and nd 1.2 is the new limit
  (5, "a5", 105L, "b5", 1.5),
  (6, "a6", 106L, "b6", 1.3),
  (7, "a7", 107L, "b7", 1.1),  // 1.1 <= 1.2, hence ts 106 is saved and nd 1.3 is the new limit
  (8, "a8", 108L, "b8", 1.2)   // 1.2 <= 1.3, hence ts 107 is saved and nd 1.1 is the new limit
).toDF("c1", "c2", "ts", "c4", "nd")

val s_rows = s_n181n.rdd.collect

val s_list = s_rows.map(r => (r.getAs[Long](2), r.getAs[Double](4))).toList
// List[(Long, Double)] = List(
//   (101,1.0), (102,1.6), (103,1.2), (104,0.8), (105,1.5), (106,1.3), (107,1.1), (108,1.2)
// )

val ndLimit = s_list.head._2  // 1.0

s_list.tail.foldLeft( (s_list.head._1, s_list.head._2, ndLimit, List.empty[Long]) ){
  (acc, x) =>
    if (x._2 <= acc._3)
      (x._1, x._2, acc._2, acc._1 :: acc._4)
    else
      (x._1, x._2, acc._3, acc._4)
}._4.reverse
// res1: List[Long] = List(103, 106, 107)

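Note that a tuple of (previous ts, previous nd, current ndLimit, list of timestamps) is used as the accumulator. It carries values over from the previous row so that the comparison logic for the current row has what it needs.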
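The positional fields of the tuple accumulator (`_1` to `_4`) can be hard to follow. As a minimal sketch, the same fold can be written with a named case class, again using plain Scala collections instead of collected Spark rows. The names `Acc`, `breakTimestamps`, `sample`, and `lastSaved` are illustrative and not part of the original answer.

```scala
// Named accumulator replacing the 4-tuple; all names here are illustrative.
final case class Acc(prevTs: Long, prevNd: Double, ndLimit: Double, saved: List[Long])

// Takes (ts, nd) pairs in row order and returns the saved timestamps in row order.
def breakTimestamps(rows: List[(Long, Double)]): List[Long] = rows match {
  case Nil => Nil
  case (ts0, nd0) :: rest =>
    // The first row's nd serves as the initial limit, as in the answer's code.
    rest.foldLeft(Acc(ts0, nd0, nd0, Nil)) { case (acc, (ts, nd)) =>
      if (nd <= acc.ndLimit)
        // Save the previous row's ts; the previous row's nd becomes the new limit.
        Acc(ts, nd, acc.prevNd, acc.prevTs :: acc.saved)
      else
        Acc(ts, nd, acc.ndLimit, acc.saved)
    }.saved.reverse
}

// Same (ts, nd) values as the sample DataFrame in the answer.
val sample: List[(Long, Double)] = List(
  (101L, 1.0), (102L, 1.6), (103L, 1.2), (104L, 0.8),
  (105L, 1.5), (106L, 1.3), (107L, 1.1), (108L, 1.2)
)

val saved: List[Long] = breakTimestamps(sample)   // List(103, 106, 107)
val lastSaved: Option[Long] = saved.lastOption    // Some(107)
```

If only the most recent saved timestamp is needed, `lastOption` yields it as a plain value that can be used after the fold. It is `None` when no row met the condition.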

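【Discussion】:

That is not the end of the dataset; it is only the data that meets the condition, and more data follows it. But I only need the last value, and I want to pick the last ts value at that point. For another loop, I want to start the loop based on that value.
@stackoverflow, please take a look at the revised answer.
Thanks mate, but this has been solved. This is exactly what I was looking for: stackoverflow.com/questions/52758059/…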