[Posted]: 2019-08-22 18:31:23
[Question]:
I have a DataFrame df that looks like this:
+--------+--------------------+--------+------+
|      id|                path|somestff| hash1|
+--------+--------------------+--------+------+
|       1|/file/dirA/fileA.txt|      58| 65161|
|       2|/file/dirB/fileB.txt|      52| 65913|
|       3|/file/dirC/fileC.txt|      99|131073|
|       4|/file/dirF/fileD.txt|      46|196233|
+--------+--------------------+--------+------+
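Note: the /file/dir directories differ. Not all files are stored in the same directory; in fact, there are hundreds of files spread across different directories.
What I want to do is read each file named in the path column, count the records in it, and write that row count to a new column of the DataFrame.
I tried the following function and UDF: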
import org.apache.spark.sql.functions.{col, udf}

def executeRowCount(fileCount: String): Long = {
  val rowCount = spark.read.format("csv").option("header", "false").load(fileCount).count
  rowCount
}
val execUdf = udf(executeRowCount _)
df.withColumn("row_count", execUdf(col("path"))).show()
This results in the following error:
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (string) => bigint)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at $line39.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:28)
at $line39.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:25)
... 19 more
I also tried collecting the column and iterating over it:
val te = df.select("path").as[String].collect()
te.foreach(executeRowCount)
That works fine, but I want to store the results in df...
I have tried several approaches, but I have hit a dead end.
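For reference, the NullPointerException is consistent with `spark` not being usable inside a UDF: the UDF body runs on the executors, while the SparkSession only exists on the driver, which would also explain why the collect-based loop works. Below is a minimal sketch of one way to get the counts from that working loop back into df. It assumes the number of distinct paths is small enough to collect (hundreds, as described above). The helper name `withRowCount` is hypothetical, not part of Spark, and the snippet is meant to be pasted into spark-shell.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper (not a Spark API): counts the CSV records of every
// file listed in the "path" column and returns df with a "row_count" column.
def withRowCount(spark: SparkSession, df: DataFrame): DataFrame = {
  import spark.implicits._

  // Bring the distinct paths to the driver. Assumption: the list is small.
  val paths: Array[String] = df.select("path").as[String].distinct().collect()

  // Each count runs as its own Spark job, launched from the driver,
  // where the SparkSession is valid.
  val counts: Seq[(String, Long)] = paths.toSeq.map { p =>
    (p, spark.read.format("csv").option("header", "false").load(p).count())
  }

  // Turn the (path, count) pairs into a small DataFrame and join it back.
  val countsDf = counts.toDF("path", "row_count")
  df.join(countsDf, Seq("path"), "left")
}
```

Calling `withRowCount(spark, df).show()` should then print the original columns plus `row_count`. Because the join is on a column name, `path` moves to the first position in the result. The files are counted one after another, so this sketch trades speed for simplicity.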
[Discussion]:
Tags: scala apache-spark user-defined-functions