Title: Slice all values of column in PySpark DataFrame [duplicate]
Posted: 2020-03-09 16:01:57
Question:

I have a DataFrame and I want to slice all the values of one column, but I don't know how to do it.

My DataFrame:

+-------------+------+
|    studentID|gender|
+-------------+------+
|1901000200   |     M|
|1901000500   |     M|
|1901000500   |     M|
|1901000500   |     M|
|1901000500   |     M|
+-------------+------+

I have cast studentID to string, but I can't strip the leading "190" from it. I want the output below.

+-------------+------+
|    studentID|gender|
+-------------+------+
|   1000200   |     M|
|   1000500   |     M|
|   1000500   |     M|
|   1000500   |     M|
|   1000500   |     M|
+-------------+------+

I tried the following, but it gives me an error.

students_data = students_data.withColumn('studentID',F.lit(students_data["studentID"][2:]))

TypeError: startPos and length must be the same type. Got <class 'int'> and <class 'NoneType'>, respectively.

Discussion:

  • Yes, that's what I did too, but when I try to cast studentID back to int afterwards, it gives me some strange negative integer values.

Tags: python dataframe pyspark apache-spark-sql pyspark-sql


Solution 1:
from pyspark.sql import functions as F

# replicating the sample data from the OP.
students_data = sqlContext.createDataFrame(
[[1901000200,'M'],
[1901000500,'M'],
[1901000500,'M'],
[1901000500,'M'],
[1901000500,'M']],
["studentid", "gender"])

# Unlike slicing a plain Python string, Column slicing maps to
# substr(startPos, length), so an end bound must be given; if the length
# is unknown, any sufficiently large number (e.g. 10000) works.
students_data = students_data.withColumn(
  'studentID',
  students_data["studentID"][4:10000].cast("string"))

students_data.show()

Output:

+---------+------+
|studentID|gender|
+---------+------+
|  1000200|     M|
|  1000500|     M|
|  1000500|     M|
|  1000500|     M|
|  1000500|     M|
+---------+------+

Discussion:
