【Posted】: 2020-05-31 10:49:48
【Question】:
The following code was written for a pandas df. Due to memory problems I had to move to PySpark, so I need to convert this code so it can run on a Spark df. I tried running it directly, but it raises an error. What is the PySpark alternative to the code below?
def units(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1

sets = df.applymap(units)
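For reference, this is what the pandas version does on a small frame (the sample data is hypothetical; note that values strictly between 0 and 1 fall through both branches of `units` and map to `None`):

```python
import pandas as pd

def units(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1

# Hypothetical sample data standing in for df
df = pd.DataFrame({"a": [-2, 0, 3], "b": [1, 5, -1]})

# applymap applies units element-wise to every cell
sets = df.applymap(units)
print(sets)
# → a: 0, 0, 1   b: 1, 1, 0
```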
This is the error I get:
AttributeErrorTraceback (most recent call last)
<ipython-input-20-7e54b4e7a7e7> in <module>()
----> 1 sets = pivoted.applymap(units)
/usr/lib/spark/python/pyspark/sql/dataframe.py in __getattr__(self, name)
1180 if name not in self.columns:
1181 raise AttributeError(
-> 1182 "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
1183 jc = self._jdf.apply(name)
1184 return Column(jc)
AttributeError: 'DataFrame' object has no attribute 'applymap'
【Comments】:
-
A PySpark DataFrame has no `applymap` attribute; look at `when` + `otherwise` instead:
`df.select(*[F.when(F.col(i)<=0,0).otherwise(1).alias(i) for i in df.columns]).show()`, after importing the SQL functions module as `F`, e.g. `import pyspark.sql.functions as F`.
标签: python pandas apache-spark pyspark google-cloud-dataproc