[Posted]: 2020-06-03 22:47:32
[Question]:
I have a DataFrame, and I am applying a function row by row to overwrite one column's value with another column's value, depending on whether that other value is present.
In Pandas it looks like this:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': ['one', 'two', 'three', 'five']})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': ['five', 'six', np.nan, np.nan]})
new_df = df1.merge(df2, how='left', left_on='lkey', right_on='rkey')
lkey value_x rkey value_y
0 foo one foo five
1 foo one foo NaN
2 bar two bar six
3 baz three baz NaN
4 foo five foo five
5 foo five foo NaN
def my_func(row):
    # overwrite value_x with value_y wherever value_y is not NaN
    # (pd.notna handles NaN correctly; `value in [nan]` relies on
    # object identity and is fragile)
    if pd.notna(row['value_y']):
        row['value_x'] = row['value_y']
    return row

applied_df = new_df.apply(my_func, axis=1)
lkey value_x rkey value_y
0 foo five foo five
1 foo one foo NaN
2 bar six bar six
3 baz three baz NaN
4 foo five foo five
5 foo five foo NaN
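As an aside, the row-wise `apply` above can usually be replaced by a vectorized `Series.combine_first`, which expresses "take `value_y` where present, else keep `value_x`" directly (a sketch, not part of the original question; the merged frame is reconstructed inline here):

```python
import pandas as pd
import numpy as np

# the merged frame from above, written out literally for a self-contained example
new_df = pd.DataFrame({
    'lkey':    ['foo', 'foo', 'bar', 'baz', 'foo', 'foo'],
    'value_x': ['one', 'one', 'two', 'three', 'five', 'five'],
    'rkey':    ['foo', 'foo', 'bar', 'baz', 'foo', 'foo'],
    'value_y': ['five', np.nan, 'six', np.nan, 'five', np.nan]})

# value_y where it is not NaN, otherwise the existing value_x
new_df['value_x'] = new_df['value_y'].combine_first(new_df['value_x'])
```

This avoids the per-row Python call overhead of `apply(axis=1)` on larger frames.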
How can I do something similar in PySpark?
[Discussion]:
Tags: python pyspark apache-spark-sql user-defined-functions pyspark-dataframes