[Posted]: 2019-12-13 06:34:20
[Question]:
I'm trying to run some code, but it throws the error:
'DataFrame' object has no attribute '_get_object_id'
Code:
items = [(1, 12), (1, float('nan')), (1, 14), (1, 10), (2, 22), (2, 20),
         (2, float('nan')), (3, 300), (3, float('nan'))]
sc = spark.sparkContext
rdd = sc.parallelize(items)
df = rdd.toDF(["id", "col1"])
import pyspark.sql.functions as func
means = df.groupby("id").agg(func.mean("col1"))
# The error is thrown at this line
df = df.withColumn("col1", func.when((df["col1"].isNull()), means.where(func.col("id")==df["id"])).otherwise(func.col("col1")))
[Comments]:
- You can't use a second DataFrame inside a function like that; use a join instead.
Tags: python dataframe apache-spark pyspark