【Question Title】: How come PySpark can't find my SPARK_HOME
【Posted】: 2020-12-11 11:39:11
【Question】:

I'm trying to run a Jupyter notebook from Archives Unleashed locally on my machine. When the notebook sets up PySpark, it hits the following exception:

Exception: Unable to find py4j, your SPARK_HOME may not be configured correctly

Any idea how to configure SPARK_HOME correctly?

I tried running the notebook in a clean conda environment. Here is the full notebook up to the point where the error occurs:

%%capture

!wget "https://github.com/archivesunleashed/aut/releases/download/aut-0.50.0/aut-0.50.0.zip"
!wget "https://github.com/archivesunleashed/aut/releases/download/aut-0.50.0/aut-0.50.0-fatjar.jar"

!ls

%%capture

!apt-get update
!apt-get install -y openjdk-8-jdk-headless -qq 
!apt-get install maven -qq

!curl -L "https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz" > spark-2.4.5-bin-hadoop2.7.tgz
!tar -xvf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars aut-0.50.0-fatjar.jar --py-files aut-0.50.0.zip pyspark-shell'

import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext()
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

This is the traceback I get back:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
~/opt/miniconda3/envs/arc/lib/python3.8/site-packages/findspark.py in init(spark_home, python_path, edit_rc, edit_profile)
    142     try:
--> 143         py4j = glob(os.path.join(spark_python, "lib", "py4j-*.zip"))[0]
    144     except IndexError:

IndexError: list index out of range

During handling of the above exception, another exception occurred:

Exception                                 Traceback (most recent call last)
<ipython-input-2-03823ebc9ad8> in <module>
      1 import findspark
----> 2 findspark.init()
      3 import pyspark
      4 sc = pyspark.SparkContext()
      5 from pyspark.sql import SQLContext

~/opt/miniconda3/envs/arc/lib/python3.8/site-packages/findspark.py in init(spark_home, python_path, edit_rc, edit_profile)
    143         py4j = glob(os.path.join(spark_python, "lib", "py4j-*.zip"))[0]
    144     except IndexError:
--> 145         raise Exception(
    146             "Unable to find py4j, your SPARK_HOME may not be configured correctly"
    147         )

Exception: Unable to find py4j, your SPARK_HOME may not be configured correctly
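
For reference, the failing call in findspark (line 143 in the traceback) is just a glob under $SPARK_HOME/python/lib. A quick sanity check of what that glob sees, assuming the environment variables are set as in the notebook above:

import os
from glob import glob

# findspark looks for py4j under $SPARK_HOME/python/lib (see the traceback);
# an empty list here reproduces the IndexError.
spark_home = os.environ.get("SPARK_HOME", "")
print("SPARK_HOME:", spark_home, "exists:", os.path.isdir(spark_home))
print(glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip")))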

【Comments】:

    Tags: python pyspark jupyter-notebook


    【Solution 1】:

    "/content/spark-2.4.5-bin-hadoop2.7" 不是正确的SPARK_HOME。在curl -L 行中,您将 Spark 下载到某个地方,然后将其解压缩。找到下载并解压到的位置,并将SPARK_HOME 设置为该路径。

    【Discussion】:
