【Title】: Connecting to a local Docker Spark Cluster
【Posted】: 2022-06-15 15:32:19
【Description】:

I am trying to connect to a Spark cluster that I created locally from my laptop. The docker-compose file I am using is as follows:


services:
  spark-master:
    image: docker.io/bitnami/spark:3.2.1
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '7075:8080'
      - "7077:7077"
    volumes:
      - "./execution_scripts:/execution_scripts:rw"
  spark-worker:
    image: docker.io/bitnami/spark:3.2.1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
  spark-worker-2:
    image: docker.io/bitnami/spark:3.2.1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
  spark-worker-3:
    image: docker.io/bitnami/spark:3.2.1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no

The above brings up the bitnami image with 3 workers and 1 master. The code I use to try to connect from my Jupyter notebook is as follows:

import findspark
findspark.init()
findspark.find()
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Day1_1").master("spark://localhost:7077").getOrCreate()
df_NYTaxi = spark.read.csv(file)  # `file` holds the path to the CSV being read

The error I get after running the code above is:

: java.lang.NullPointerException
    at org.apache.spark.storage.BlockManagerMaster.registerBlockManager(BlockManagerMaster.scala:78)
    at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:518)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:596)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:238)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:829)
I have tried a lot of things, but every time I either can't seem to connect to that Docker image at all, or I can connect but the job times out.
My local Spark version is 3.2.1, and the image used has the same version.
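When the driver on the laptop can't reach the master at all, it helps to first confirm that the published RPC port is reachable from the host before debugging Spark itself. A minimal sketch using only the Python standard library (the host and port are whatever your compose file publishes, here `localhost:7077`):

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Sanity-check the master's published RPC port before blaming Spark:
# print(port_open("localhost", 7077))
```

If this returns False, the problem is the port mapping or the container, not the SparkSession configuration.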

【Comments】:

  • I know this doesn't answer your question, but I can see you are trying to start 1 master and 3 workers. The way the documentation suggests is via the command `docker-compose up --scale spark-worker=3` (hub.docker.com/r/bitnami/spark)
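With that approach, the three duplicated worker services collapse into a single definition that Compose scales at start-up. A sketch, assuming the master service is named `spark-master` as in the compose file above (note that `SPARK_MASTER_URL` must use the master's actual service name):

```yaml
services:
  spark-worker:
    image: docker.io/bitnami/spark:3.2.1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
```

Started with `docker-compose up --scale spark-worker=3`, Compose creates three worker containers from this single definition.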

Tags: python docker apache-spark docker-compose jupyter-notebook


【Solution 1】:

So the fix was to actually create a Docker image with multiple containers, attach to it from VS Code, and then run the script from inside the container.

Here is the modified docker-compose file:

version: '2'

services:
  spark:
    build : .
    container_name: spark_master
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '7075:8080'
      - "7077:7077"
    volumes:
      - "./execution_scripts:/execution_scripts:rw"
      - "./resources:/resources:rw"

    
  spark-worker:
    build : .
    container_name: spark_worker_1

    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - "./execution_scripts:/execution_scripts:rw"
      - "./resources:/resources:rw"
 
  spark-worker-2:
    build : .
    container_name: spark_worker_2
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - "./execution_scripts:/execution_scripts:rw"
      - "./resources:/resources:rw"

The Dockerfile used to build this image is as follows:

FROM bitnami/spark:3.2.1
USER root

# Install packages into Spark if needed:
# spark-shell --master local --packages "<package name>"
RUN pip install findspark

EXPOSE 8080
EXPOSE 7075
EXPOSE 7077

After building this image (you will, of course, need to create the two folders named execution_scripts and resources), you can attach to the running container in VS Code, or in any similar way from any other IDE.
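Once attached, the job runs from inside the Compose network, where the master is reachable by its service name instead of localhost. A sketch of launching it without an IDE (the container name `spark_master` comes from the compose file above; the script name is hypothetical):

```shell
docker exec -it spark_master \
  /opt/bitnami/spark/bin/spark-submit \
  --master spark://spark:7077 \
  /execution_scripts/Day1_1.py
```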

【Discussion】:
