【Question Title】: Split string column on Dataset&lt;Row&gt; and get new column on Dataset&lt;Row&gt;
【Posted】: 2018-04-15 22:54:08
【Question】:

I am working with Spark SQL on Spark 2.0, reading a CSV file using the Java API.

The CSV file has a double-quoted, /-separated column. For example: "Express Air/Delivery Truck"
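
For illustration, a hypothetical sample line matching that layout (the header and values are assumed from the tables below):

Year,State,Ship Mode
2012,"New York/California","Express Air/Delivery Truck"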

The code that reads the CSV and returns a Dataset&lt;Row&gt;:

// Read the CSV with header and schema inference enabled.
Dataset<Row> df = spark.read()
                .format("com.databricks.spark.csv")
                .option("inferSchema", "true")
                .option("header", "true")
                .load(filename);

Result:

+-----+-----------------------+--------------------------+
|Year |       State           |                Ship Mode |...
+-----+-----------------------+--------------------------+
|2012 |New York/California    |Express Air/Delivery Truck|...
|2013 |Nevada/Texas           |Delivery Truck            |...
|2014 |North Carolina/Kentucky|Regular Air/Delivery Truck|...
+-----+-----------------------+--------------------------+

However, I want to split the State and Ship Mode columns, pair their elements in order, and return them as a new Mode column in a Dataset, e.g. {New York,Express Air} {California,Delivery Truck}:

+-----+--------------------------+
|Year |      Mode                |   
+-----+--------------------------+
|2012 |New York,Express Air      |
|2012 |California,Delivery Truck |
|2013 |Nevada,Delivery Truck     |
|2013 |Texas,Delivery Truck      |
|2014 |North Carolina,Regular Air|
|2014 |Kentucky,Delivery Truck   |
+-----+--------------------------+

Is there any way I can do this with Spark in Java?

【Question Comments】:

    Tags: java sql apache-spark dataset apache-spark-sql


    【Solution 1】:

    Here is a Spark SQL approach:

    df.createOrReplaceTempView("tab")
    
    val q = """
    with m as (
      select year, explode(split(State, "/")) as State, row_number() over(order by year) as rn from tab
    ), s as (
      select year, explode(split(`Ship Mode`, "/")) as Mode, row_number() over(order by year) as rn from tab
    )
    select m.year, m.State, s.Mode
    from m
    join s
      on m.year = s.year and m.rn = s.rn
    """
    
    spark.sql(q).show
    

    Result:

    scala> spark.sql(q).show
    +----+--------------+--------------+
    |year|         State|          Mode|
    +----+--------------+--------------+
    |2012|      New York|   Express Air|
    |2012|    California|Delivery Truck|
    |2013|        Nevada|Delivery Truck|
    |2014|North Carolina|Delivery Truck|
    +----+--------------+--------------+
    

    You can easily concatenate the columns if needed:

    val q = """
    with m as (
      select year, explode(split(State, "/")) as State, row_number() over(order by year) as rn from tab
    ), s as (
      select year, explode(split(`Ship Mode`, "/")) as Mode, row_number() over(order by year) as rn from tab
    )
    select m.year, concat(m.State, ',', s.Mode) as Mode
    from m
    join s
      on m.year = s.year and m.rn = s.rn
    """
    

    Result:

    scala> spark.sql(q).show(false)
    +----+-----------------------------+
    |year|Mode                         |
    +----+-----------------------------+
    |2012|New York,Express Air         |
    |2012|California,Delivery Truck    |
    |2013|Nevada,Delivery Truck        |
    |2014|North Carolina,Delivery Truck|
    +----+-----------------------------+
    

    P.S. I used Scala, but Java should be much the same...
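
    For reference, a rough Java equivalent of the concatenating query above (a sketch; it assumes the df from the question):

    df.createOrReplaceTempView("tab");

    String q = "with m as ("
            + " select year, explode(split(State, '/')) as State,"
            + " row_number() over (order by year) as rn from tab"
            + "), s as ("
            + " select year, explode(split(`Ship Mode`, '/')) as Mode,"
            + " row_number() over (order by year) as rn from tab"
            + ") "
            + "select m.year, concat(m.State, ',', s.Mode) as Mode "
            + "from m join s on m.year = s.year and m.rn = s.rn";

    spark.sql(q).show(false);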

    【Comments】:

    • How can I handle it when there is one more additional column? (A Type column has been added, e.g. {2012, New York/California, Express Air/Delivery Truck, a/b}.) Expected result: {New York,Express Air,a} {California,Delivery Truck,b}
    • @yong, you would use the same technique: add one more table in the "with" clause, join this new table, and concatenate the new column... I'm on my phone right now, so I can't test it.
    • Thanks for the comment, but I have one more question. How can I turn the join query into a multi-join? For example, if one more table 'k' is added, how should the from and join parts of the query look?
    • @yong, “from m join s on ... join k on m.year = k.year and m.rn = k.rn” (see the sketch below)
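
    A sketch of the three-way join described in the last comment, assuming the hypothetical extra /-separated column is named Type:

    String q = "with m as ("
            + " select year, explode(split(State, '/')) as State,"
            + " row_number() over (order by year) as rn from tab"
            + "), s as ("
            + " select year, explode(split(`Ship Mode`, '/')) as Mode,"
            + " row_number() over (order by year) as rn from tab"
            + "), k as ("
            + " select year, explode(split(Type, '/')) as Type,"
            + " row_number() over (order by year) as rn from tab"
            + ") "
            + "select m.year, concat(m.State, ',', s.Mode, ',', k.Type) as Mode "
            + "from m join s on m.year = s.year and m.rn = s.rn "
            + "join k on m.year = k.year and m.rn = k.rn";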
    【Solution 2】:

    Yes, it can be achieved in a few steps: step 1 builds an intermediate ds1, step 2 builds ds2, and step 3 builds ds3, as in the sketch below.

    That should do the job.
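
    A minimal Java sketch of such a step-by-step approach, assuming Spark 2.1+ (for posexplode in the Java API); the input path orders.csv is hypothetical:

    import static org.apache.spark.sql.functions.*;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SplitColumns {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("split-columns")
                    .master("local[*]")
                    .getOrCreate();

            Dataset<Row> df = spark.read()
                    .option("inferSchema", "true")
                    .option("header", "true")
                    .csv("orders.csv"); // hypothetical path

            // Step 1, ds1: turn both /-separated strings into arrays.
            Dataset<Row> ds1 = df
                    .withColumn("states", split(col("State"), "/"))
                    .withColumn("modes", split(col("Ship Mode"), "/"));

            // Step 2, ds2: posexplode the states array, keeping each
            // element's position so the matching mode can be looked up.
            Dataset<Row> ds2 = ds1.select(
                    col("Year"),
                    posexplode(col("states")).as(new String[]{"pos", "state"}),
                    col("modes"));

            // Step 3, ds3: pair each state with the mode at the same
            // position and concatenate them into one Mode column.
            Dataset<Row> ds3 = ds2.select(
                    col("Year"),
                    concat_ws(",", col("state"), expr("modes[pos]")).as("Mode"));

            ds3.show(false);
        }
    }

    Note that when one array is shorter than the other, modes[pos] is null for the missing positions, and concat_ws simply skips the null piece.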

    【Comments】:
