【Posted】: 2022-08-03 19:25:17
【Problem description】:
In a PySpark Notebook in Synapse, I run some computations of this shape (not reproducible here):
from pyspark.sql.functions import col, explode_outer

list_col = X.dtypes
X_array_col = spark.createDataFrame(list_col)\
    .withColumnRenamed("_1", "Colname")\
    .withColumnRenamed("_2", "TypeColumn")\
    .filter(col("TypeColumn").contains("array"))
tab_toPd = X_array_col.toPandas()
n_array = tab_toPd.count()["Colname"]
X2 = X
while n_array != 0:
    for index, row in tab_toPd.iterrows():
        colname = tab_toPd["Colname"][index]
        typename = tab_toPd["TypeColumn"][index]
        X2 = X2.withColumn(colname, explode_outer(col(colname)))
        if "struct" in typename:
            X2 = X2.select("*", colname + ".*").drop(colname)
    list_col = X2.dtypes
    X_array_col = spark.createDataFrame(list_col)\
        .withColumnRenamed("_1", "Colname")\
        .withColumnRenamed("_2", "TypeColumn")\
        .filter(col("TypeColumn").contains("array"))
    tab_toPd = X_array_col.toPandas()
    n_array = tab_toPd.count()["Colname"]
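For readers unfamiliar with the pattern above: each pass explodes every array column into one row per element, and splices struct columns into their parent with `select("*", colname + ".*")`, repeating until no array columns remain. The real code runs on a Spark DataFrame; the following is only a plain-Python sketch of that same loop logic on nested dicts (`flatten_record` and its dotted-key naming are hypothetical, not part of the original code):

```python
def flatten_record(record):
    """Flatten one dict: explode lists into multiple rows (like
    explode_outer) and splice nested dicts into their parent with
    dotted keys (like select("col.*")), until no nesting remains."""
    rows = [record]
    changed = True
    while changed:
        changed = False
        next_rows = []
        for row in rows:
            for key, value in row.items():
                if isinstance(value, list):      # explode: one row per element
                    changed = True
                    base = {k: v for k, v in row.items() if k != key}
                    items = value if value else [None]  # "outer": keep a null row
                    for item in items:
                        next_rows.append({**base, key: item})
                    break
                if isinstance(value, dict):      # struct: splice fields into parent
                    changed = True
                    base = {k: v for k, v in row.items() if k != key}
                    nested = {f"{key}.{k2}": v2 for k2, v2 in value.items()}
                    next_rows.append({**base, **nested})
                    break
            else:
                next_rows.append(row)            # already flat, keep as-is
        rows = next_rows
    return rows

print(flatten_record({"id": 1, "tags": [{"name": "a"}, {"name": "b"}]}))
# → [{'id': 1, 'tags.name': 'a'}, {'id': 1, 'tags.name': 'b'}]
```

Like the Spark version, this re-scans for remaining nested columns after each pass, so deeply nested arrays-of-structs are fully unrolled.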
Then I want to count the rows of the resulting table:
X2.count()
and I get the following error:
Py4JJavaError: An error occurred while calling o967.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 29.0 failed 4 times, most recent failure: Lost task 1.3 in stage 29.0 (TID 202) (vm-cd401785 executor 2): org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file wasbs://xxx-prd-we-eventstore@1933851374948631570.z40.blob.storage.azure.net/xxx-prd-we-eventstore/kdxTAA==/kdxTAOJwY-A=.Partitioned.Parquet/PartitionId=cf3494bc-c19f-4e70-85c3-5561c102b483/1933851374948631570_0_0.snappy.Parquet
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
I'm quite new to Spark. I tried to debug this error with Google, without success. Where does this error come from?
Tags: python apache-spark pyspark parallel-processing azure-synapse