【Question Title】: PySpark create new column from existing column with a list of values
【Posted】: 2019-12-27 12:43:27
【Question Description】:

I have a DataFrame like this:

from pyspark.sql import SparkSession
from pyspark import Row

spark = SparkSession.builder \
    .appName('DataFrame') \
    .master('local[*]') \
    .getOrCreate()

df = spark.createDataFrame([Row(a=1, b='', c=['0', '1'], d='foo'),
                            Row(a=2, b='', c=['0', '1'], d='bar'),
                            Row(a=3, b='', c=['0', '1'], d='foo')])

+---+---+------+---+
|  a|  b|     c|  d|
+---+---+------+---+
|  1|   |[0, 1]|foo|
|  2|   |[0, 1]|bar|
|  3|   |[0, 1]|foo|
+---+---+------+---+

I want to create column "e" from the first element of column "c", and column "f" from its second element, like this:

|a  |b  |c     |d  |e  |f  |
+---+---+------+---+---+---+
|1  |   |[0, 1]|foo|0  |1  |
|2  |   |[0, 1]|bar|0  |1  |
|3  |   |[0, 1]|foo|0  |1  |
+---+---+------+---+---+---+

【Comments】:

Tags: python pyspark


【Solution 1】:
df = spark.createDataFrame([Row(a=1, b='', c=['0', '1'], d='foo'),
                            Row(a=2, b='', c=['0', '1'], d='bar'),
                            Row(a=3, b='', c=['0', '1'], d='foo')])

# Index into the array column to pull out individual elements
df2 = df.withColumn('e', df['c'][0]).withColumn('f', df['c'][1])
df2.show(truncate=False)

+---+---+------+---+---+---+
|a  |b  |c     |d  |e  |f  |
+---+---+------+---+---+---+
|1  |   |[0, 1]|foo|0  |1  |
|2  |   |[0, 1]|bar|0  |1  |
|3  |   |[0, 1]|foo|0  |1  |
+---+---+------+---+---+---+

【Discussion】:
