Title: Selecting with conditions from a dataframe using Spark Scala
Posted: 2019-07-06 05:16:37
Question:

I'm new to Scala and I'm having difficulty with a simple dataset in Spark. I'd like to be able to sort the dataset below by eventType and crow, but I can't get it to sort by descending values. I'd also like to read out only one event type at a time.

When I try

dataset.orderBy("eventType")

it works, but if I add a ".desc" it fails.

scala> setB.orderBy("eventType").desc
<console>:32: error: value desc is not a member of 
org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
   setB.orderBy("eventType").desc

scala> dataset.orderBy("eventType".desc)
<console>:32: error: value desc is not a member of String
   dataset.orderBy("eventType".desc)

I'm also trying to use a filter, but it doesn't like anything I've tried there either. Something like: dataset.filter("eventType"="agg%")

Sample dataset:

+----------------+------------------------------------------------------------------------------------+-----------------------------------+-------------+----------------+----+
|deadletterbucket|split                                                                               |eventType                          |clientVersion|dDeviceSurrogate|crow|
+----------------+------------------------------------------------------------------------------------+-----------------------------------+-------------+----------------+----+
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |4.3.0.108    |1               |3   |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |5.3.0.10     |1               |11  |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |5.9.1.10     |3               |11  |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |5.7.0.1      |3               |15  |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |5.5.0.5      |6               |16  |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |4.0.0.62     |7               |26  |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |4.6.4.6      |9               |31  |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_network_traffic|7.12.0.113   |1               |1   |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_network_traffic|6.3.2.15     |1               |2   |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_network_traffic|5.1.2.10     |1               |3   |

Ideally, I'm trying to make something like the following work:

dataset.orderBy("crow").desc.filter("eventType"="%app_launches").show(3,false)


|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |5.5.0.5      |6               |31  |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |4.0.0.62     |7               |26  |
|event_failure   |instance type (null) does not match any allowed primitive type (allowed: ["object"])|aggregate_event.app_launches       |4.6.4.6      |9               |16  |

Comments:

    Tags: scala apache-spark


    Solution 1:

    You almost have the right solution; you're just missing a syntax detail. The correct syntax for Spark (Scala) is as follows:

    
     import org.apache.spark.sql.functions._
    
     dataset.orderBy(desc("crow")).filter($"eventType".contains("app_launches")).show(3, false)
    
    

    You can access a column with the `$"col"` syntax; you can find more information on Column here (https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/Column.html)

    I can also recommend reading the SQL programming guide from the Spark homepage; it's very helpful! https://spark.apache.org/docs/2.1.0/sql-programming-guide.html
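    To see the shape of that transformation without a running Spark cluster, here is a plain-Scala analog over a few of the sample rows (the `Event` case class and the field subset are hypothetical, chosen just for illustration): collection `filter` plus `sortBy` with a negated key mirrors what `$"eventType".contains(...)` and `orderBy(desc("crow"))` do.

    ```scala
    // Plain-Scala sketch of:
    //   dataset.filter($"eventType".contains("app_launches"))
    //          .orderBy(desc("crow")).show(3, false)
    // Values mirror a subset of the sample dataset in the question.
    case class Event(eventType: String, clientVersion: String, crow: Int)

    val rows = Seq(
      Event("aggregate_event.app_launches", "4.3.0.108", 3),
      Event("aggregate_event.app_launches", "4.6.4.6", 31),
      Event("aggregate_event.app_network_traffic", "7.12.0.113", 1),
      Event("aggregate_event.app_launches", "4.0.0.62", 26)
    )

    // Keep only the matching event type, then sort descending on crow.
    val top = rows
      .filter(_.eventType.contains("app_launches"))
      .sortBy(-_.crow)
      .take(3)

    top.foreach(println)
    ```

    The negated sort key (`-_.crow`) plays the role of `.desc`; in Spark itself the direction lives on the `Column` object instead.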

    Comments:

      Solution 2:

      You're passing a String to identify the column you wish to order by. That's a convenient shorthand, but if you want more control you need to pass a Column argument. Spark provides several idiomatic ways to retrieve this object from a dataset:

      dataset.orderBy($"crow".desc)...

      dataset.orderBy(col("crow").desc)...

      dataset.orderBy('crow.desc)...

      dataset.orderBy(dataset("crow").desc)...

      https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@sort(sortExprs:org.apache.spark.sql.Column*):org.apache.spark.sql.Dataset[T]
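      The String-vs-Column distinction can be sketched in plain Scala (hypothetical `Rec` case class, illustration only): a bare string names a field but carries no direction, whereas a typed sort key, like a Spark `Column`, can encode "descending" in the object itself.

      ```scala
      // A bare String names a column but cannot carry a sort direction;
      // a typed key object can, the way col("crow").desc does in Spark.
      case class Rec(eventType: String, crow: Int)

      val data = Seq(Rec("a", 3), Rec("b", 11), Rec("c", 1))

      // "Column"-style: the Ordering itself encodes the descending direction.
      val byCrowDesc: Ordering[Rec] = Ordering.by[Rec, Int](_.crow).reverse
      val sorted = data.sorted(byCrowDesc)

      sorted.foreach(println)
      ```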

      Comments:
