【Question Title】: Split Contents of String column in PySpark Dataframe
【Posted】: 2017-05-08 02:22:26
【Question】:

I have a PySpark DataFrame with a column that contains strings. I want to split this column into words.

Code:

>>> sentenceData = sqlContext.read.load('file://sample1.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')
>>> sentenceData.show(truncate=False)
+---+---------------------------+
|key|desc                       |
+---+---------------------------+
|1  |Virat is good batsman      |
|2  |sachin was good            |
|3  |but modi sucks big big time|
|4  |I love the formulas        |
+---+---------------------------+


Expected Output
---------------

>>> sentenceData.show(truncate=False)
+---+-------------------------------------+
|key|desc                                 |
+---+-------------------------------------+
|1  |[Virat,is,good,batsman]              |
|2  |[sachin,was,good]                    |
|3  |....                                 |
|4  |...                                  |
+---+-------------------------------------+

How can I achieve this?

【Comments】:

    Tags: apache-spark pyspark spark-dataframe apache-spark-mllib


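    【Solution 1】:

    Use the split function:

    from pyspark.sql.functions import split

    # df is the DataFrame from the question (sentenceData there).
    # Split "desc" on runs of whitespace; the column becomes an array of strings.
    # The raw string keeps Python from treating "\s" as an escape sequence.
    df = df.withColumn("desc", split("desc", r"\s+"))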
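To check what the whitespace pattern does to the sample rows without starting a Spark session, here is a minimal sketch in plain Python. It is only an approximation: Spark evaluates the pattern with Java's regex engine, while this uses Python's `re` module. The names `rows`, `split_desc`, and `tokenized` are made up for illustration and are not part of PySpark.

```python
import re

# Sample rows copied from the "desc" column in the question.
rows = [
    (1, "Virat is good batsman"),
    (2, "sachin was good"),
    (3, "but modi sucks big big time"),
    (4, "I love the formulas"),
]

# Same pattern as in the answer: one or more whitespace characters.
WHITESPACE = re.compile(r"\s+")


def split_desc(text):
    """Split text on runs of whitespace (hypothetical helper, not a Spark API)."""
    return WHITESPACE.split(text)


# Build (key, words) pairs, analogous to the rows of the expected output.
tokenized = [(key, split_desc(desc)) for key, desc in rows]

for key, words in tokenized:
    print(key, words)
```

Because the pattern matches runs of whitespace, repeated spaces or tabs between words do not produce empty tokens. Leading or trailing whitespace would, so trimming the string first may be worth considering if the data is not clean.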
    

    【Discussion】:
