【Title】: Running pySpark in Jupyter notebooks - Windows
【Posted】: 2016-11-04 20:57:15
【Question】:

I want to run pySpark from a Jupyter notebook. I downloaded and installed Anaconda, which includes Jupyter. I wrote the following lines:

 from pyspark import SparkConf, SparkContext
 conf = SparkConf().setMaster("local").setAppName("My App")
 sc = SparkContext(conf = conf)

I get the following error:

ImportError                               Traceback (most recent call last)
<ipython-input-3-98c83f0bd5ff> in <module>()
  ----> 1 from pyspark import SparkConf, SparkContext
  2 conf = SparkConf().setMaster("local").setAppName("My App")
  3 sc = SparkContext(conf = conf)

 C:\software\spark\spark-1.6.2-bin-hadoop2.6\python\pyspark\__init__.py in   <module>()
 39 
 40 from pyspark.conf import SparkConf
  ---> 41 from pyspark.context import SparkContext
 42 from pyspark.rdd import RDD
 43 from pyspark.files import SparkFiles

 C:\software\spark\spark-1.6.2-bin-hadoop2.6\python\pyspark\context.py in <module>()
 26 from tempfile import NamedTemporaryFile
 27 
 ---> 28 from pyspark import accumulators
 29 from pyspark.accumulators import Accumulator
 30 from pyspark.broadcast import Broadcast

 ImportError: cannot import name accumulators

Following the answer to the Stack Overflow question importing pyspark in python shell, I tried adding a PYTHONPATH environment variable that points to the spark/python directory,

but that did not help.
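
Roughly, that attempt amounts to the following sketch (the exact PYTHONPATH value is not reproduced here; the Spark location is taken from the traceback above):

 import os
 import sys

 # Equivalent of pointing PYTHONPATH at the spark/python folder only
 # (Spark location taken from the traceback above).
 spark_home = r"C:\software\spark\spark-1.6.2-bin-hadoop2.6"
 sys.path.append(os.path.join(spark_home, "python"))

 from pyspark import SparkConf, SparkContext  # still raises the same ImportError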

【Question discussion】:

    Tags: python pyspark jupyter


    【Solution 1】:

    This worked for me:

    import os
    import sys

    # Point the Spark-related environment variables at the local install.
    # A raw string avoids backslash-escape surprises in Windows paths.
    spark_path = r"D:\spark"

    os.environ['SPARK_HOME'] = spark_path
    os.environ['HADOOP_HOME'] = spark_path

    # Make the bundled PySpark sources and the Py4J archive importable.
    sys.path.append(spark_path + "/bin")
    sys.path.append(spark_path + "/python")
    sys.path.append(spark_path + "/python/pyspark/")
    sys.path.append(spark_path + "/python/lib")
    sys.path.append(spark_path + "/python/lib/pyspark.zip")
    sys.path.append(spark_path + "/python/lib/py4j-0.9-src.zip")

    from pyspark import SparkContext
    from pyspark import SparkConf

    sc = SparkContext("local", "test")
    

    To verify:

    In [2]: sc
    Out[2]: <pyspark.context.SparkContext at 0x707ccf8>
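
    As an additional quick check, a trivial job confirms the new context can actually run tasks (a small sketch assuming `sc` was created as above):

    # Run a trivial job on the new SparkContext; the numbers 0..9 sum to 45.
    print(sc.parallelize(range(10)).sum())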
    

    【Discussion】:

    • No. I get the following error: ImportError ---> 41 from pyspark.context import SparkContext 42 from pyspark.rdd import RDD 43 from pyspark.files import SparkFiles C:\software\spark\spark-1.6.2-bin-hadoop2.6\python\pyspark\context.py in <module>() 26 from tempfile import NamedTemporaryFile 27 ---> 28 from pyspark import accumulators 29 from pyspark.accumulators import Accumulator 30 from pyspark.broadcast import Broadcast ImportError: cannot import name accumulators
    【Solution 2】:

    2018 version

    Installing PYSPARK on Windows 10 with a JUPYTER notebook and ANACONDA NAVIGATOR

    Step 1

    Download the packages:

    1) spark-2.2.0-bin-hadoop2.7.tgz Download

    2) Java JDK 8 Download

    3) Anaconda v 5.2 Download

    4) scala-2.12.6.msi Download

    5) hadoop v2.7.1 Download

    Step 2

    Create a SPARK folder on the C:/ drive and put everything into it. It will look like this

    Note: when installing SCALA, set its install path to a folder inside the Spark folder

    Step 3

    Now set the new WINDOWS environment variables (a notebook-local alternative is sketched after this list):

    1. HADOOP_HOME=C:\spark\hadoop

    2. JAVA_HOME=C:\Program Files\Java\jdk1.8.0_151

    3. SCALA_HOME=C:\spark\scala\bin

    4. SPARK_HOME=C:\spark\spark\bin

    5. PYSPARK_PYTHON=C:\Users\user\Anaconda3\python.exe

    6. PYSPARK_DRIVER_PYTHON=C:\Users\user\Anaconda3\Scripts\jupyter.exe

    7. PYSPARK_DRIVER_PYTHON_OPTS=notebook

    8. Now add the Spark path to the Path variable

      Click Edit and add a new entry

      Add "C:\spark\spark\bin" to the "Path" variable window
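
    If you prefer to keep these settings local to a single notebook rather than system-wide Windows variables, the same values can also be set from Python before importing pyspark. This is only a minimal sketch assuming the C:\spark layout and Anaconda paths from the list above; adjust them to your machine:

    import glob
    import os
    import sys

    # Mirror of the Step 3 variables, applied to this Python session only.
    # NOTE: SPARK_HOME points at the Spark root here (not the bin subfolder),
    # which is what the pyspark launcher scripts and findspark expect.
    os.environ['HADOOP_HOME'] = r"C:\spark\hadoop"
    os.environ['JAVA_HOME'] = r"C:\Program Files\Java\jdk1.8.0_151"
    os.environ['SPARK_HOME'] = r"C:\spark\spark"
    os.environ['PYSPARK_PYTHON'] = r"C:\Users\user\Anaconda3\python.exe"

    # Make the bundled PySpark sources and the Py4J archive importable;
    # the Py4J zip name varies with the Spark version, so glob for it.
    sys.path.append(r"C:\spark\spark\python")
    sys.path.extend(glob.glob(r"C:\spark\spark\python\lib\py4j-*-src.zip"))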

    Step 4

    • Create the folder where you want to store your Jupyter notebook output and files
    • Then open the Anaconda command prompt and cd into that folder
    • Then type pyspark

    That's it; your browser will pop up with Jupyter at localhost

    Step 5

    Check that pyspark is working!

    Type some simple code and run it:

    from pyspark.sql import Row
    a = Row(name = 'Vinay' , age=22 , height=165)
    print("a: ",a)
    
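    If the setup is correct, this prints something like a: Row(age=22, height=165, name='Vinay') (Spark 2.x sorts keyword fields alphabetically), which confirms that pyspark is importable from the notebook.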

    【Discussion】:

      【Solution 3】:

      Running pySpark in Jupyter notebooks - Windows

      Java 8: https://www.guru99.com/install-java.html

      Anaconda: https://www.anaconda.com/distribution/

      Pyspark in jupyter: https://changhsinlee.com/install-pyspark-windows-jupyter/

      import findspark

      # Locate the local Spark install (via SPARK_HOME) and add it to sys.path.
      findspark.init()

      from pyspark.sql import SparkSession
      from pyspark.sql.functions import *
      from pyspark.sql.types import *

      # Start (or reuse) a local Spark session.
      spark = SparkSession.builder.appName('test').getOrCreate()

      # Build a small DataFrame from in-memory rows and display it.
      data = [(1, "siva", 100), (2, "siva2", 200), (3, "siva3", 300), (4, "siva4", 400), (5, "siva5", 500)]
      schema = ['id', 'name', 'sallary']

      df = spark.createDataFrame(data, schema=schema)
      df.show()
      
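      If everything is wired up correctly, df.show() prints a five-row table with the columns id, name, and sallary (the column name is spelled exactly as in the schema list above).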

      【Discussion】:
