【问题标题】:AttributeError: 'DataFrame' object has no attribute '_data'AttributeError:“DataFrame”对象没有属性“_data”
【发布时间】:2021-04-05 00:23:31
【问题描述】:

在 Pandas 数据帧上并行化时出现 Azure Databricks 执行错误。该代码能够创建 RDD,但在执行 .collect() 时中断

设置:

import pandas as pd
# initialize list of lists 
data = [['tom', 10], ['nick', 15], ['juli', 14]] 
  
# Create the pandas DataFrame 
my_df = pd.DataFrame(data, columns = ['Name', 'Age']) 

def testfn(i):
  return my_df.iloc[i]
test_var=sc.parallelize([0,1,2],50).map(testfn).collect()
print (test_var)

错误:

Py4JJavaError                             Traceback (most recent call last)
<command-2941072546245585> in <module>
      1 def testfn(i):
      2   return my_df.iloc[i]
----> 3 test_var=sc.parallelize([0,1,2],50).map(testfn).collect()
      4 print (test_var)

/databricks/spark/python/pyspark/rdd.py in collect(self)
    901         # Default path used in OSS Spark / for non-credential passthrough clusters:
    902         with SCCallSiteSync(self.context) as css:
--> 903             sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    904         return list(_load_from_socket(sock_info, self._jrdd_deserializer))
    905 

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306 
   1307         for temp_arg in temp_args:

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    125     def deco(*a, **kw):
    126         try:
--> 127             return f(*a, **kw)
    128         except py4j.protocol.Py4JJavaError as e:
    129             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 16 in stage 3845.0 failed 4 times, most recent failure: Lost task 16.3 in stage 3845.0 : org.apache.spark.api.python.PythonException: 'AttributeError: 'DataFrame' object has no attribute '_data'', from <command-2941072546245585>, line 2. Full traceback below:
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 654, in main
    process()
  File "/databricks/spark/python/pyspark/worker.py", line 646, in process
    serializer.dump_stream(out_iter, outfile)
  File "/databricks/spark/python/pyspark/serializers.py", line 279, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/databricks/spark/python/pyspark/util.py", line 109, in wrapper
    return f(*args, **kwargs)
  File "<command-2941072546245585>", line 2, in testfn
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/indexing.py", line 1767, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/indexing.py", line 2137, in _getitem_axis
    self._validate_integer(key, axis)
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/indexing.py", line 2060, in _validate_integer
    len_axis = len(self.obj._get_axis(axis))
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/generic.py", line 424, in _get_axis
    return getattr(self, name)
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/generic.py", line 5270, in __getattr__
    return object.__getattribute__(self, name)
  File "pandas/_libs/properties.pyx", line 63, in pandas._libs.properties.AxisProperty.__get__
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/generic.py", line 5270, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute '_data'

版本详情:

火花:'3.0.0' python:3.7.6(默认,2020 年 1 月 8 日,19:59:22) [GCC 7.3.0]

【问题讨论】:

  • 我也面临同样的问题。关注这个问题。
  • 您是通过 databricks-connect 运行它吗?您使用的运行时版本是什么?如何安装 Pandas?

标签: python apache-spark pyspark databricks azure-databricks


【解决方案1】:

当驱动程序和执行程序安装了不同版本的 Pandas 时,我看到了这样的错误。在我的例子中,它是 Pandas 1.1.0 的驱动程序(通过 databricks-connect),而执行程序在 Databricks Runtime 7.3 和 Pandas 1.0.1 上。 Pandas 1.1.0 内部结构发生了很大变化,因此驱动程序发送给执行程序的代码被破坏了。您需要检查您的执行程序和驱动程序是否具有相同版本的 Pandas(您可以在 release notes 中找到 Databricks Runtimes 使用的 Pandas 版本)。您可以使用following script 来比较执行器和驱动程序上的 Python 库版本。

【讨论】:

    【解决方案2】:

    我遇到了同样的问题。

    我认为是因为熊猫版本不同。

    我通过将我的 pandas 版本从 1.0.1 更新到 1.0.5 解决了这个错误

    【讨论】:

    • 如果你检查另一个答案,那就是因为 Pandas 版本不同:-)
    猜你喜欢
    • 1970-01-01
    • 1970-01-01
    • 2013-10-23
    • 2017-01-24
    • 2018-10-10
    • 2019-08-18
    • 2021-01-20
    • 2020-05-16
    • 2018-10-04
    相关资源
    最近更新 更多